Basic Concepts of NLP

1 Pre-processing


1.1 Tokenization:

Larger chunks of text can be tokenized into sentences, and sentences can be tokenized into words, and so on; certain characters, such as punctuation, are typically thrown away at the same time.
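A minimal sketch of both levels of tokenization using NLTK (assumes NLTK is installed; newer NLTK versions may name the tokenizer resource "punkt_tab" instead of "punkt"):

```python
# Sentence- and word-level tokenization with NLTK (illustrative sketch).
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models, fetched once

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Words are flowing out like endless rain. They slip away across the universe."
sentences = sent_tokenize(text)      # text -> sentences
words = word_tokenize(sentences[0])  # first sentence -> words
print(sentences)
print(words)  # punctuation still appears as separate tokens; filter it if unwanted
```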

1.2 Normalization:

Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on.
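A minimal sketch covering those three tasks (the third-party num2words package is an assumption here, used only for the number-to-word step):

```python
# Lowercasing, punctuation removal, and number-to-word conversion (sketch).
import re
import string
from num2words import num2words  # assumed helper for numbers -> words

def normalize(text: str) -> str:
    text = text.lower()                                               # same case
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)  # 2 -> "two"
    return text

print(normalize("I bought 2 Books!"))  # -> "i bought two books"
```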

1.2.1 Stemming:

The process of slicing off the end or the beginning of a word with the intention of removing affixes.

Affixes that are attached at the beginning of the word are called prefixes (e.g. “astro” in the word “astrobiology”) and the ones attached at the end of the word are called suffixes (e.g. “ful” in the word “helpful”).

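A minimal stemming sketch with NLTK's Porter stemmer (one of several available stemmers):

```python
# Suffix stripping with the classic Porter stemmer (illustrative sketch).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["flowing", "studies", "helpful", "universes"]:
    print(word, "->", stemmer.stem(word))
# note: stems such as "studi" need not be valid dictionary words
```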

1.2.2 Lemmatization:

The process of reducing a word to its base form and grouping together different inflected forms of the same word.

Lemmatization resolves words to their dictionary form (known as the lemma), which requires detailed dictionaries that the algorithm can look into to link word forms to their corresponding lemmas.

Lemmatization also helps with disambiguation. By providing a part-of-speech parameter for a word (whether it is a noun, a verb, and so on), it is possible to define the word's role in the sentence and resolve the ambiguity.
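A minimal lemmatization sketch with NLTK's WordNet lemmatizer, including the part-of-speech parameter just mentioned (assumes the WordNet corpus has been downloaded via nltk.download):

```python
# Dictionary-based lemmatization with WordNet (illustrative sketch).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # -> "study" (noun by default)
print(lemmatizer.lemmatize("meeting"))           # -> "meeting" (noun reading)
print(lemmatizer.lemmatize("meeting", pos="v"))  # -> "meet" (verb reading)
```

The pos parameter is exactly the disambiguation lever described above: the same surface form maps to different lemmas depending on its role in the sentence.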

1.2.3 Lemmatization vs. stemming:

Lemmatization demands more computational power than setting up or adapting a stemming algorithm, since it requires more knowledge about the structure of the language.

1.3 Stop-word removal

Getting rid of common articles, pronouns and prepositions such as “and”, “the” or “to” in English.
There is no universal list of stop words; the pre-defined list should be chosen based on the text and the analysis task.
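A minimal stop-word removal sketch using NLTK's pre-defined English list (assumes the "stopwords" corpus has been downloaded via nltk.download):

```python
# Filtering tokens against a pre-defined stop-word list (illustrative sketch).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("Words are flowing out like endless rain into a paper cup")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # "are", "out", "into" and "a" are dropped
```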

2 Bag of words

The bag-of-words model represents a piece of text by its word frequencies or occurrences, disregarding grammar and word order.

2.1 For example:

Words are flowing out like endless rain into a paper cup, They slither
while they pass, they slip away across the universe.

The occurrence matrix for this (single-document) example counts each term: “they” appears 3 times, while every other term (words, are, flowing, out, like, endless, rain, into, a, paper, cup, slither, while, pass, slip, away, across, the, universe) appears once.
A term-document matrix tracks the frequency of each term in each document of a corpus.
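A minimal bag-of-words sketch with scikit-learn, which builds this kind of document-term count matrix:

```python
# Building a document-term count matrix with CountVectorizer (sketch).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Words are flowing out like endless rain into a paper cup",
    "They slither while they pass, they slip away across the universe",
]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)    # sparse counts, one row per document
print(vectorizer.get_feature_names_out())  # the vocabulary (column labels)
print(matrix.toarray())                    # occurrence counts per document
```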

2.2 The drawbacks:

• the absence of semantic meaning and context
• stop words (like “the” or “a”) add noise to the analysis

2.3 Solution:

Term Frequency-Inverse Document Frequency (TF-IDF)
The tf-idf value increases proportionally with the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps adjust for the fact that some words appear more frequently in general.
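A common formulation (one of several variants) multiplies a term's frequency in a document by the logarithm of its inverse document frequency:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$$

where tf(t, d) is how often term t occurs in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. A minimal sketch with scikit-learn (whose implementation adds smoothing and normalization, so the numbers differ slightly from the formula above; the third document is an invented example to create shared terms):

```python
# Weighting the document-term matrix with TF-IDF (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Words are flowing out like endless rain into a paper cup",
    "They slither while they pass, they slip away across the universe",
    "Endless rain falls across the endless universe",
]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))  # terms shared across documents are down-weighted
```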

3 Topic modeling

Topic modeling clusters texts to discover latent topics based on their contents, processing individual words and assigning them values based on their distribution. It is an unsupervised approach for finding and observing groups of words (called “topics”) in large collections of texts.

3.1 Algorithms

Latent Dirichlet Allocation (LDA) is a widely used algorithm for topic modeling. LDA converts the document-term matrix into two lower-dimensional matrices, M1 and M2: M1 is a document-topics matrix with dimensions (N, K) and M2 is a topics-terms matrix with dimensions (K, M), where N is the number of documents, K the number of topics, and M the vocabulary size.

3.1.1 The process:

LDA starts from a random assignment of each word to a topic and iteratively refines it: for every word in every document, it re-estimates how likely the word is to belong to each topic, based on how prevalent that topic is in the document and how prevalent the word is in that topic, repeating until the assignments stabilize.

3.1.2 The characteristics of LDA:

Unlike k-means clustering, LDA assigns each document to a mixture of topics, which means that each document can be described by one or more topics (e.g. Document 1 is described by 70% of topic A, 20% of topic B and 10% of topic C), reflecting more realistic results.
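A minimal LDA sketch with scikit-learn on a toy corpus (the documents are invented for illustration), recovering the M1 and M2 matrices described above:

```python
# Fitting LDA and inspecting the document-topics mixture (illustrative sketch).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats make wonderful pets",
    "the stock market fell sharply today",
    "investors fear another market downturn",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # M1: documents x topics, shape (N, K)
print(doc_topics.round(2))              # each row is a topic mixture for one document
# lda.components_ plays the role of M2: topics x terms weights, shape (K, M)
```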

4 Sentiment analysis

Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 a negative one. Subjectivity is also a float, in the range [0, 1]: subjective sentences generally express personal opinion, emotion or judgment, whereas objective ones convey factual information.
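The ranges above match TextBlob's sentiment property, so a minimal sketch looks like this:

```python
# Polarity and subjectivity scores with TextBlob (illustrative sketch).
from textblob import TextBlob

print(TextBlob("I love this movie, it is wonderful!").sentiment)
# -> Sentiment(polarity=..., subjectivity=...): strongly positive, highly subjective
print(TextBlob("The film was released in 2010.").sentiment)
# -> close to neutral polarity and low subjectivity: a factual statement
```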

