Some NLP Concepts

ref: SANER 2019, Chenkai Guo, Systematic Comprehension for Developer Reply in Mobile System Forum


    • Pre-processing
      1. Stop words
        1. The Chinese language uses stop words to connect different components within a sentence, such as “着” (an aspect particle) and “和” (meaning “and”). Such stop words occur at high frequency but are often meaningless for further analysis; therefore, we eliminate them at an early stage.
      2. Non-Chinese Words
        1. Some reviews contain non-Chinese words, e.g., English. We translate the majority of them into Chinese using the Python library langid together with Google Translate, and discard those that are too long to be expressed accurately as ordinary Chinese reviews. In addition, we convert traditional Chinese characters to simplified ones where necessary.
      3. Word Segmentation
        1. Word segmentation is a necessary step in Chinese language processing, since word boundaries within a Chinese sentence are not marked explicitly. In our work, the commonly used toolkit jieba is employed to perform word segmentation.
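The three pre-processing steps can be sketched as below. This is a self-contained toy: the stopword set and the Unicode-range language check are stand-ins I introduce for illustration; the paper's actual pipeline loads a full stopword lexicon, segments with jieba (e.g. `jieba.lcut(text)`), and identifies non-Chinese text with langid plus Google Translate.

```python
# Tiny stand-in stopword set; a real pipeline would load a full
# Chinese stopword lexicon.
STOP_WORDS = {"着", "和", "的", "了", "是"}

def is_chinese(token: str) -> bool:
    # Rough stand-in check via the CJK Unified Ideographs block;
    # the paper uses langid for proper language identification.
    return all("\u4e00" <= ch <= "\u9fff" for ch in token)

def preprocess(tokens):
    """Drop stop words and non-Chinese tokens from a segmented review.

    `tokens` is assumed to already be the output of a segmenter
    such as jieba.lcut(); segmentation is not reimplemented here.
    """
    return [t for t in tokens if t not in STOP_WORDS and is_chinese(t)]

# Example: a segmented review with one stop word and one English token
tokens = ["看", "着", "app", "崩溃", "和", "闪退"]
print(preprocess(tokens))  # → ['看', '崩溃', '闪退']
```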
    • Pre-training
      1. word embedding [19]
        1. such that the input data can be translated from a high-dimensional sparse space into a low-dimensional dense numeric representation.
        2. Word embedding has proven to be an effective family of representation methods that avoids dimension explosion while preserving the original word semantics and sequence information, which greatly benefits sentence comprehension, similarity evaluation, and sentiment analysis.
      2. Word2Vec[26]
        1. a popular unsupervised neural-network method for implementing word embedding.
        2. In detail, there are two typical algorithms for the Word2Vec [25] implementation.
          1. One predicts the current word based on its context, and is called CBOW;
          2. the other predicts the surrounding words of a target word, and is called continuous Skip-Gram.
          3. Generally, Skip-Gram outperforms CBOW in semantic related tasks.
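To make the CBOW/Skip-Gram distinction concrete, the sketch below enumerates the training pairs each variant derives from one segmented sentence with a window of 1. This is a toy illustration of the two prediction directions, not the paper's actual setup; in practice one would train with a library such as gensim, where `Word2Vec(..., sg=0)` selects CBOW and `sg=1` selects Skip-Gram.

```python
def cbow_pairs(tokens, window=1):
    """(context words -> target word) pairs: CBOW predicts the
    current word from its surrounding context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), target))
    return pairs

def skipgram_pairs(tokens, window=1):
    """(target word -> context word) pairs: Skip-Gram predicts each
    surrounding word from the current word."""
    pairs = []
    for i, target in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, ctx))
    return pairs

sent = ["应用", "经常", "闪退"]
print(cbow_pairs(sent))      # e.g. (('应用', '闪退'), '经常')
print(skipgram_pairs(sent))  # e.g. ('经常', '应用'), ('经常', '闪退')
```

The same sentence thus yields one training example per position under CBOW but one per (target, context) combination under Skip-Gram, which is part of why Skip-Gram tends to do better on rarer words.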
    • LDA
      1. latent Dirichlet allocation 
      2. a generative statistical model that allows sets of observations to be explained by unobserved groups which explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning field and, in a wider sense, to the artificial intelligence field.
    • topic model 
      1. a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
      2. Topic models are also referred to as probabilistic topic models, which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body. In the age of information, the amount of the written material we encounter each day is simply beyond our processing capacity. Topic models can help to organize and offer insights for us to understand large collections of unstructured text bodies. Originally developed as a text-mining tool, topic models have been used to detect instructive structures in data such as genetic information, images, and networks. 
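The "mixture of topics" intuition above is exactly LDA's generative process: for each word, draw a topic from the document's topic mixture, then draw a word from that topic's word distribution. The sketch below hard-codes two toy topics and the 10%/90% mixture from the cats-and-dogs example; a fitted model (e.g. gensim's `LdaModel`) would instead infer these distributions from a corpus.

```python
import random

random.seed(0)

# Toy per-topic word distributions (word -> probability)
topics = {
    "dogs": {"dog": 0.5, "bone": 0.5},
    "cats": {"cat": 0.5, "meow": 0.5},
}
# Document-level topic mixture: 90% about dogs, 10% about cats
mixture = {"dogs": 0.9, "cats": 0.1}

def generate_document(n_words):
    """Sample a document word by word via LDA's generative process."""
    words = []
    for _ in range(n_words):
        # 1. draw a topic from the document's topic mixture
        topic = random.choices(list(mixture), weights=mixture.values())[0]
        # 2. draw a word from that topic's word distribution
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=dist.values())[0])
    return words

doc = generate_document(1000)
dog_words = sum(w in topics["dogs"] for w in doc)
cat_words = sum(w in topics["cats"] for w in doc)
print(dog_words, cat_words)  # roughly a 9:1 ratio, as the text predicts
```

LDA inference runs this process in reverse: given only the documents, it recovers the per-topic word distributions and each document's topic mixture.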
    • Sentiment Score
      1. the popular sentiment analyzer SnowNLP
        1. (ref: GitHub - isnowfy/snownlp: Python library for processing Chinese text)
        2. SnowNLP accepts two datasets (a positive sample set and a negative one) as its training corpus.
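SnowNLP's sentiment module is a Naive Bayes classifier trained on those positive and negative corpora, and `SnowNLP(text).sentiments` returns the positive-class probability. The sketch below implements that idea from scratch, with Laplace smoothing, on two tiny hand-made corpora (hypothetical training data introduced for illustration), so it runs without the library itself.

```python
import math
from collections import Counter

# Hypothetical pre-segmented corpora standing in for the
# positive/negative sample sets that SnowNLP is trained on.
pos_corpus = [["很", "好用"], ["非常", "流畅"], ["很", "流畅"]]
neg_corpus = [["经常", "闪退"], ["非常", "卡"], ["经常", "卡"]]

def train(corpus):
    counts = Counter(w for doc in corpus for w in doc)
    return counts, sum(counts.values())

pos_counts, pos_total = train(pos_corpus)
neg_counts, neg_total = train(neg_corpus)
vocab = set(pos_counts) | set(neg_counts)

def sentiment(tokens):
    """Positive-class probability via Naive Bayes with Laplace smoothing."""
    log_pos = log_neg = math.log(0.5)  # uniform class prior
    for w in tokens:
        log_pos += math.log((pos_counts[w] + 1) / (pos_total + len(vocab)))
        log_neg += math.log((neg_counts[w] + 1) / (neg_total + len(vocab)))
    # convert the log-odds back into a probability in [0, 1]
    return 1 / (1 + math.exp(log_neg - log_pos))

print(sentiment(["非常", "流畅"]))  # > 0.5 (positive review)
print(sentiment(["经常", "闪退"]))  # < 0.5 (negative review)
```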