Some NLP Concepts

ref: SANER 2019, Chenkai Guo, Systematic Comprehension for Developer Reply in Mobile System Forum


    • Pre-processing
      1. Stop words
        1. The Chinese language uses stop words to connect different components within a sentence, such as “着” (an aspect particle) and “和” (meaning “and”). Such stop words occur at high frequency but are often meaningless for further analysis; therefore, we eliminate them at an early stage.
      2. Non-Chinese Words
        1. Some reviews contain non-Chinese words, e.g., English. We translate the majority of them into Chinese using the Python library langid together with Google Translate, and discard those that are too long to be expressed accurately as ordinary Chinese reviews. In addition, we convert traditional Chinese characters to simplified ones where necessary.
      3. Word Segmentation
        1. Word segmentation is a necessary step in Chinese language processing, since word boundaries within a Chinese sentence are not marked explicitly. In our work, the commonly used toolkit jieba is employed to perform word segmentation.
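The three pre-processing steps can be sketched as below. This is a self-contained toy: the stopword set and the Unicode-range language check are stand-ins I introduce for illustration; the paper's actual pipeline loads a full stopword lexicon, segments with jieba (e.g. `jieba.lcut(text)`), and identifies non-Chinese text with langid plus Google Translate.

```python
# Tiny stand-in stopword set; a real pipeline would load a full
# Chinese stopword lexicon.
STOP_WORDS = {"着", "和", "的", "了", "是"}

def is_chinese(token: str) -> bool:
    # Rough stand-in check via the CJK Unified Ideographs block;
    # the paper uses langid for proper language identification.
    return all("\u4e00" <= ch <= "\u9fff" for ch in token)

def preprocess(tokens):
    """Drop stop words and non-Chinese tokens from a segmented review.

    `tokens` is assumed to already be the output of a segmenter
    such as jieba.lcut(); segmentation is not reimplemented here.
    """
    return [t for t in tokens if t not in STOP_WORDS and is_chinese(t)]

# Example: a segmented review with one stop word and one English token
tokens = ["看", "着", "app", "崩溃", "和", "闪退"]
print(preprocess(tokens))  # → ['看', '崩溃', '闪退']
```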
    • Pre-training
      1. word embedding [19]
        1. such that the input data can be translated from a high-dimensional sparse space into a low-dimensional dense numeric representation.
        2. Word embedding has proven to be an effective family of representation methods that avoids dimension explosion while preserving the original word semantics and sequence information, which greatly benefits sentence comprehension, similarity evaluation, and sentiment analysis.
      2. Word2Vec[26]
        1. a popular unsupervised neural-network method for implementing word embedding.
        2. In detail, there are two typical algorithms for the Word2Vec [25] implementation.
          1. One predicts the current word based on its context, and is called CBOW;
          2. the other predicts the surrounding words of a target word, and is called continuous Skip-Gram.
          3. Generally, Skip-Gram outperforms CBOW in semantic related tasks.
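To make the CBOW/Skip-Gram distinction concrete, the sketch below enumerates the training pairs each variant derives from one segmented sentence with a window of 1. This is a toy illustration of the two prediction directions, not the paper's actual setup; in practice one would train with a library such as gensim, where `Word2Vec(..., sg=0)` selects CBOW and `sg=1` selects Skip-Gram.

```python
def cbow_pairs(tokens, window=1):
    """(context words -> target word) pairs: CBOW predicts the
    current word from its surrounding context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), target))
    return pairs

def skipgram_pairs(tokens, window=1):
    """(target word -> context word) pairs: Skip-Gram predicts each
    surrounding word from the current word."""
    pairs = []
    for i, target in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, ctx))
    return pairs

sent = ["应用", "经常", "闪退"]
print(cbow_pairs(sent))      # e.g. (('应用', '闪退'), '经常')
print(skipgram_pairs(sent))  # e.g. ('经常', '应用'), ('经常', '闪退')
```

The same sentence thus yields one training example per position under CBOW but one per (target, context) combination under Skip-Gram, which is part of why Skip-Gram tends to do better on rarer words.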
    • LDA
      1. latent Dirichlet allocation 
      2. a generative statistical model that allows sets of observations to be explained by unobserved groups which explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning field and, in a wider sense, to the artificial intelligence field.
    • topic model 
      1. a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
      2. Topic models are also referred to as probabilistic topic models, which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body. In the age of information, the amount of the written material we encounter each day is simply beyond our processing capacity. Topic models can help to organize and offer insights for us to understand large collections of unstructured text bodies. Originally developed as a text-mining tool, topic models have been used to detect instructive structures in data such as genetic information, images, and networks. 
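The "mixture of topics" intuition above is exactly LDA's generative process: for each word, draw a topic from the document's topic mixture, then draw a word from that topic's word distribution. The sketch below hard-codes two toy topics and the 10%/90% mixture from the cats-and-dogs example; a fitted model (e.g. gensim's `LdaModel`) would instead infer these distributions from a corpus.

```python
import random

random.seed(0)

# Toy per-topic word distributions (word -> probability)
topics = {
    "dogs": {"dog": 0.5, "bone": 0.5},
    "cats": {"cat": 0.5, "meow": 0.5},
}
# Document-level topic mixture: 90% about dogs, 10% about cats
mixture = {"dogs": 0.9, "cats": 0.1}

def generate_document(n_words):
    """Sample a document word by word via LDA's generative process."""
    words = []
    for _ in range(n_words):
        # 1. draw a topic from the document's topic mixture
        topic = random.choices(list(mixture), weights=mixture.values())[0]
        # 2. draw a word from that topic's word distribution
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=dist.values())[0])
    return words

doc = generate_document(1000)
dog_words = sum(w in topics["dogs"] for w in doc)
cat_words = sum(w in topics["cats"] for w in doc)
print(dog_words, cat_words)  # roughly a 9:1 ratio, as the text predicts
```

LDA inference runs this process in reverse: given only the documents, it recovers the per-topic word distributions and each document's topic mixture.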
    • Sentiment Score
      1. the popular sentiment analyzer SnowNLP
        1. (ref: GitHub - isnowfy/snownlp: Python library for processing Chinese text)
        2. SnowNLP accepts two datasets (a positive sample set and a negative one) as its training corpus.
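SnowNLP's sentiment module is a Naive Bayes classifier trained on those positive and negative corpora, and `SnowNLP(text).sentiments` returns the positive-class probability. The sketch below implements that idea from scratch, with Laplace smoothing, on two tiny hand-made corpora (hypothetical training data introduced for illustration), so it runs without the library itself.

```python
import math
from collections import Counter

# Hypothetical pre-segmented corpora standing in for the
# positive/negative sample sets that SnowNLP is trained on.
pos_corpus = [["很", "好用"], ["非常", "流畅"], ["很", "流畅"]]
neg_corpus = [["经常", "闪退"], ["非常", "卡"], ["经常", "卡"]]

def train(corpus):
    counts = Counter(w for doc in corpus for w in doc)
    return counts, sum(counts.values())

pos_counts, pos_total = train(pos_corpus)
neg_counts, neg_total = train(neg_corpus)
vocab = set(pos_counts) | set(neg_counts)

def sentiment(tokens):
    """Positive-class probability via Naive Bayes with Laplace smoothing."""
    log_pos = log_neg = math.log(0.5)  # uniform class prior
    for w in tokens:
        log_pos += math.log((pos_counts[w] + 1) / (pos_total + len(vocab)))
        log_neg += math.log((neg_counts[w] + 1) / (neg_total + len(vocab)))
    # convert the log-odds back into a probability in [0, 1]
    return 1 / (1 + math.exp(log_neg - log_pos))

print(sentiment(["非常", "流畅"]))  # > 0.5 (positive review)
print(sentiment(["经常", "闪退"]))  # < 0.5 (negative review)
```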