Basic Concepts of NLP
1 Pre-processing
1.1 Tokenization:
Larger chunks of text are tokenized into sentences, sentences are tokenized into words, and so on; certain characters, such as punctuation, can be thrown away at the same time.
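A minimal sketch of sentence and word tokenization with NLTK (the punkt resource, and on newer NLTK versions punkt_tab, must be downloaded once; the example text is illustrative):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models (one-time download)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Words are flowing out like endless rain. They slip away across the universe."
sentences = sent_tokenize(text)                  # chunk of text -> sentences
words = [word_tokenize(s) for s in sentences]    # each sentence -> words (punctuation kept as separate tokens)
print(sentences)
print(words)
```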
1.2 Normalization:
Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on.
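A minimal normalization sketch in plain Python (case folding, punctuation removal, whitespace cleanup; number-to-word conversion would need an extra library and is omitted here):

```python
import re
import string

def normalize(text):
    """Lowercase the text, strip punctuation, and collapse whitespace."""
    text = text.lower()                                                # same case
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()                          # tidy whitespace

print(normalize("Words are flowing OUT, like endless rain!"))
# -> "words are flowing out like endless rain"
```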
1.2.1 Stemming:
The process of slicing the end or the beginning of words with the intention of removing affixes.
Affixes that are attached at the beginning of the word are called prefixes (e.g. “astro” in the word “astrobiology”) and the ones attached at the end of the word are called suffixes (e.g. “ful” in the word “helpful”).
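A short sketch with NLTK's Porter stemmer; the word list is illustrative:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connection", "connections"]:
    print(word, "->", stemmer.stem(word))   # all five reduce to the stem "connect"
```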
1.2.2 Lemmatization:
Reducing a word to its base form and grouping together different forms of the same word.
Lemmatization resolves words to their dictionary form (known as the lemma), which requires detailed dictionaries that the algorithm can look into to link words to their corresponding lemmas.
Lemmatization also helps with disambiguation. By providing a part-of-speech parameter for a word (whether it is a noun, a verb, and so on), it is possible to define the role of that word in the sentence and resolve the ambiguity.
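A minimal sketch with NLTK's WordNet lemmatizer, which looks words up in the WordNet dictionary; the pos argument is the part-of-speech parameter mentioned above (the wordnet resource, and on some versions omw-1.4, must be downloaded once):

```python
import nltk
nltk.download("wordnet", quiet=True)   # the dictionary the lemmatizer looks words up in

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("feet"))              # foot  (noun is the default part of speech)
print(lemmatizer.lemmatize("running", pos="v"))  # run   (treated as a verb)
print(lemmatizer.lemmatize("better", pos="a"))   # good  (treated as an adjective)
```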
1.2.3 Lemmatization vs. stemming:
Lemmatization demands more computational power than setting up or adapting a stemming algorithm, since it requires more knowledge of the language's structure.
1.3 Stop-word removal
Getting rid of common words such as articles, pronouns, prepositions, and conjunctions, e.g. "and", "the", or "to" in English.
There is no universal list of stop words; the pre-defined list is chosen based on the text and the analysis task.
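One possible pre-defined list is NLTK's English stop-word list; a minimal sketch (the token list is illustrative):

```python
import nltk
nltk.download("stopwords", quiet=True)   # pre-defined stop-word lists for several languages

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["words", "are", "flowing", "out", "like", "endless", "rain", "into", "a", "paper", "cup"]
print([t for t in tokens if t not in stop_words])
# e.g. ['words', 'flowing', 'like', 'endless', 'rain', 'paper', 'cup']
```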
2 Bag of words
Representing a piece of text by its word frequencies or occurrences, ignoring grammar and word order.
2.1 For example:
Words are flowing out like endless rain into a paper cup,
They slither while they pass, they slip away across the universe.
The corresponding occurrence matrix, also called a term-document matrix, tracks the frequency of each term in each document, as the sketch below illustrates.
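A sketch that builds the occurrence matrix with scikit-learn's CountVectorizer, treating each line of the lyric as a separate document (that split is an assumption made for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Words are flowing out like endless rain into a paper cup",
    "They slither while they pass, they slip away across the universe",
]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)        # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())      # the vocabulary (column labels)
print(matrix.toarray())                        # occurrence counts per document
```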
2.2 The drawbacks:
• the absence of semantic meaning and context
• stop words (like “the” or “a”) add noise to the analysis
2.3 Solution:
Term Frequency-Inverse Document Frequency (TF-IDF)
The tf-idf value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.
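A minimal sketch with scikit-learn's TfidfVectorizer, reusing the two illustrative documents from above; note that scikit-learn's default idf is smoothed, so the exact weights depend on the library's settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Words are flowing out like endless rain into a paper cup",
    "They slither while they pass, they slip away across the universe",
]
# tf-idf(t, d) = tf(t, d) * idf(t); idf(t) grows as fewer documents contain term t,
# so words that appear in almost every document (e.g. stop words) get small weights.
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(weights.toarray().round(2))
```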
3 Topic modeling
Topic modeling clusters texts to discover latent topics based on their contents, processing individual words and assigning them values based on their distribution. It is an unsupervised approach for finding and observing groups of words (called "topics") in large collections of texts.
3.1 Algorithms
Latent Dirichlet Allocation (LDA) is a common algorithm for topic modeling. LDA converts the document-term matrix into two lower-dimensional matrices, M1 and M2: M1 is a document-topic matrix with dimensions (N, K) and M2 is a topic-term matrix with dimensions (K, M), where N is the number of documents, K the number of topics, and M the vocabulary size.
3.1.1 The process:
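A compact sketch of the iterative process, in the style of collapsed Gibbs sampling: each word's topic assignment is repeatedly resampled in proportion to p(topic | document) * p(word | topic), and the counts are then normalized into M1 and M2. The function and its hyperparameters are illustrative, not a production implementation:

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampling for LDA; `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(0)
    ndk = np.zeros((len(docs), n_topics))        # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))       # topic-term counts
    nk = np.zeros(n_topics)                      # total words assigned to each topic
    z = [rng.integers(n_topics, size=len(d)) for d in docs]   # random initial topics

    for d, doc in enumerate(docs):               # fill the count tables
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                      # remove the word's current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # p(topic | document) * p(word | topic), up to a normalizing constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                      # reassign and restore the counts
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    m1 = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)   # document-topic matrix (N x K)
    m2 = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)     # topic-term matrix (K x M)
    return m1, m2
```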
3.1.2 The characteristics of LDA:
Unlike k-means clustering, LDA assigns each document to a mixture of topics, which means that each document can be described by one or more topics (e.g. Document 1 is 70% topic A, 20% topic B, and 10% topic C), reflecting more realistic results.
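A minimal scikit-learn sketch showing that mixture: transform() returns, for each document, its proportions over the K topics (the toy documents and n_components=2 are assumptions made for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat next to another cat",
    "dogs and cats make friendly pets",
    "stock markets fell while bond prices rose",
    "investors watch the stock market every day",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
mixture = lda.transform(counts)     # each row sums to 1: a document's topic mixture
print(mixture.round(2))             # e.g. a row like [0.88, 0.12] means 88% topic 0, 12% topic 1
```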
4 Sentiment analysis
Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement. Subjectivity is also a float, in the range [0, 1]: subjective sentences generally refer to personal opinion, emotion, or judgment, whereas objective sentences refer to factual information.
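These two scores match the API of the TextBlob library; a minimal sketch (the example sentence is illustrative):

```python
from textblob import TextBlob

blob = TextBlob("The movie was absolutely wonderful, though the ending felt rushed.")
print(blob.sentiment.polarity)       # float in [-1, 1]: negative to positive
print(blob.sentiment.subjectivity)   # float in [0, 1]: objective to subjective
```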