Language: Cantonese (with English Slides)
Slides: Chinese NLP with Open Source Tools in Python (CC-BY)
Albert Au Yeung is a co-founder and CTO of Axon Labs Limited, a Hong Kong-based technology company focusing on mobile app development, data mining, machine learning and applied cognitive psychology. Albert has a PhD in Computer Science and has been conducting research on social network analysis, natural language processing, data mining and machine learning. He has been using Python for conducting experiments, processing data and developing applications for over 10 years.
In this talk, we will present an overview of the NLP pipeline for Chinese, covering segmentation and tokenization, named entity recognition, term weighting, clustering, topic modelling and training word2vec representations on a corpus. We will introduce several open source tools, including jieba, gensim and SnowNLP. In particular, we will discuss how similar texts or paragraphs can be retrieved from a corpus using latent semantic analysis and word2vec-based similarity. We will also discuss how to leverage existing open data to build dictionaries that assist NLP tasks.
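To give a rough feel for the first stages of such a pipeline, here is a minimal sketch using only the Python standard library. It is not the API of jieba, gensim or SnowNLP, and it is not taken from the talk: segmentation is done with simple forward maximum matching against a tiny hypothetical dictionary (`DICTIONARY`), and similarity is cosine similarity over smoothed TF-IDF weights rather than latent semantic analysis or word2vec. All function names and the sample sentences are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy dictionary; in practice it might be built from open data.
DICTIONARY = {"香港", "大學", "香港大學", "機器", "學習", "機器學習",
              "研究", "今日", "天氣", "很好"}


def segment(text, dictionary=DICTIONARY):
    """Forward maximum matching: at each position, take the longest
    dictionary word; fall back to a single character if none matches."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document,
    using raw term frequency and a smoothed inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up sample corpus, for illustration only.
corpus = ["香港大學研究機器學習", "機器學習研究", "今日天氣很好"]
tokenized = [segment(text) for text in corpus]
vectors = tfidf_vectors(tokenized)

for j in range(1, len(corpus)):
    print(corpus[0], "vs", corpus[j], "->", round(cosine(vectors[0], vectors[j]), 3))
```

In this toy corpus the first two sentences share terms and therefore score higher than the first and third, which share none. A dictionary-based segmenter like this depends heavily on dictionary coverage, which is one reason building dictionaries from open data can help.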