Basic Concepts of NLP
1 Pre-processing
1.1 Tokenization:
Larger chunks of text are tokenized into sentences, sentences are tokenized into words, and so on; certain characters, such as punctuation, can be thrown away at the same time.
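A minimal sketch of sentence and word tokenization with NLTK (the punkt resource, and on newer NLTK versions punkt_tab, must be downloaded once; the example text is illustrative):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models (one-time download)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Words are flowing out like endless rain. They slip away across the universe."
sentences = sent_tokenize(text)                  # chunk of text -> sentences
words = [word_tokenize(s) for s in sentences]    # each sentence -> words (punctuation kept as separate tokens)
print(sentences)
print(words)
```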
1.2 Normalization:
Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on.
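A minimal normalization sketch in plain Python (case folding, punctuation removal, whitespace cleanup; number-to-word conversion would need an extra library and is omitted here):

```python
import re
import string

def normalize(text):
    """Lowercase the text, strip punctuation, and collapse whitespace."""
    text = text.lower()                                                # same case
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()                          # tidy whitespace

print(normalize("Words are flowing OUT, like endless rain!"))
# -> "words are flowing out like endless rain"
```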
1.2.1 Stemming:
The process of slicing the end or the beginning of words with the intention of removing affixes.
Affixes that are attached at the beginning of the word are called prefixes (e.g. “astro” in the word “astrobiology”) and the ones attached at the end of the word are called suffixes (e.g. “ful” in the word “helpful”).
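A short sketch with NLTK's Porter stemmer; the word list is illustrative:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connection", "connections"]:
    print(word, "->", stemmer.stem(word))   # all five reduce to the stem "connect"
```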
1.2.2 Lemmatization:
Reducing a word to its base form and grouping together different forms of the same word.
Lemmatization resolves words to their dictionary form (known as the lemma), which requires detailed dictionaries that the algorithm can look into to link words to their corresponding lemmas.
Lemmatization also helps with disambiguation. By providing a part-of-speech parameter for a word (whether it is a noun, a verb, and so on), it is possible to define the role of that word in the sentence and resolve the ambiguity.
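A minimal sketch with NLTK's WordNet lemmatizer, which looks words up in the WordNet dictionary; the pos argument is the part-of-speech parameter mentioned above (the wordnet resource, and on some versions omw-1.4, must be downloaded once):

```python
import nltk
nltk.download("wordnet", quiet=True)   # the dictionary the lemmatizer looks words up in

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("feet"))              # foot  (noun is the default part of speech)
print(lemmatizer.lemmatize("running", pos="v"))  # run   (treated as a verb)
print(lemmatizer.lemmatize("better", pos="a"))   # good  (treated as an adjective)
```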
1.2.3 Lemmatization vs. stemming:
Lemmatization demands more computational power than setting up or adapting a stemming algorithm, since it requires more knowledge of the language's structure.
1.3 Stop-word removal
Getting rid of common words such as articles, pronouns, prepositions, and conjunctions, e.g. "and", "the", or "to" in English.
There is no universal list of stop words; the pre-defined list is chosen based on the text and the analysis task.
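One possible pre-defined list is NLTK's English stop-word list; a minimal sketch (the token list is illustrative):

```python
import nltk
nltk.download("stopwords", quiet=True)   # pre-defined stop-word lists for several languages

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["words", "are", "flowing", "out", "like", "endless", "rain", "into", "a", "paper", "cup"]
print([t for t in tokens if t not in stop_words])
# e.g. ['words', 'flowing', 'like', 'endless', 'rain', 'paper', 'cup']
```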
2 Bag of words
Representing a piece of text by its word frequencies or occurrences, ignoring grammar and word order.
2.1 For example:
Words are flowing out like endless rain into a paper cup,
They slither while they pass, they slip away across the universe.
The corresponding occurrence matrix, also called a term-document matrix, tracks the frequency of each term in each document, as the sketch below illustrates.
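A sketch that builds the occurrence matrix with scikit-learn's CountVectorizer, treating each line of the lyric as a separate document (that split is an assumption made for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Words are flowing out like endless rain into a paper cup",
    "They slither while they pass, they slip away across the universe",
]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)        # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())      # the vocabulary (column labels)
print(matrix.toarray())                        # occurrence counts per document
```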
2.2 The drawbacks:
• the absence of semantic meaning and context
• stop words (like “the” or “a”) add noise to the analysis
2.3 Solution:
Term Frequency-Inverse Document Frequency (TF-IDF)
The tf-idf value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.
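A minimal sketch with scikit-learn's TfidfVectorizer, reusing the two illustrative documents from above; note that scikit-learn's default idf is smoothed, so the exact weights depend on the library's settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Words are flowing out like endless rain into a paper cup",
    "They slither while they pass, they slip away across the universe",
]
# tf-idf(t, d) = tf(t, d) * idf(t); idf(t) grows as fewer documents contain term t,
# so words that appear in almost every document (e.g. stop words) get small weights.
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(weights.toarray().round(2))
```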
3 Topic modeling
Topic modeling clusters texts to discover latent topics based on their contents, processing individual words and assigning them values based on their distribution. It is an unsupervised approach for finding and observing groups of words (called "topics") in large collections of texts.
3.1 Algorithms
Latent Dirichlet Allocation (LDA) is a common algorithm for topic modeling. LDA converts the document-term matrix into two lower-dimensional matrices, M1 and M2: M1 is a document-topic matrix with dimensions (N, K) and M2 is a topic-term matrix with dimensions (K, M), where N is the number of documents, K the number of topics, and M the vocabulary size.
3.1.1 The process:
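A compact sketch of the iterative process, in the style of collapsed Gibbs sampling: each word's topic assignment is repeatedly resampled in proportion to p(topic | document) * p(word | topic), and the counts are then normalized into M1 and M2. The function and its hyperparameters are illustrative, not a production implementation:

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampling for LDA; `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(0)
    ndk = np.zeros((len(docs), n_topics))        # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))       # topic-term counts
    nk = np.zeros(n_topics)                      # total words assigned to each topic
    z = [rng.integers(n_topics, size=len(d)) for d in docs]   # random initial topics

    for d, doc in enumerate(docs):               # fill the count tables
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                      # remove the word's current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # p(topic | document) * p(word | topic), up to a normalizing constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                      # reassign and restore the counts
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    m1 = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)   # document-topic matrix (N x K)
    m2 = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)     # topic-term matrix (K x M)
    return m1, m2
```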
3.1.2 The characteristics of LDA:
Unlike k-means clustering, LDA assigns each document to a mixture of topics, which means that each document can be described by one or more topics (e.g. Document 1 is 70% topic A, 20% topic B, and 10% topic C), reflecting more realistic results.
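A minimal scikit-learn sketch showing that mixture: transform() returns, for each document, its proportions over the K topics (the toy documents and n_components=2 are assumptions made for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat next to another cat",
    "dogs and cats make friendly pets",
    "stock markets fell while bond prices rose",
    "investors watch the stock market every day",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
mixture = lda.transform(counts)     # each row sums to 1: a document's topic mixture
print(mixture.round(2))             # e.g. a row like [0.88, 0.12] means 88% topic 0, 12% topic 1
```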
4 Sentiment analysis
Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement. Subjectivity is also a float, in the range [0, 1]: subjective sentences generally refer to personal opinion, emotion, or judgment, whereas objective sentences refer to factual information.
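These two scores match the API of the TextBlob library; a minimal sketch (the example sentence is illustrative):

```python
from textblob import TextBlob

blob = TextBlob("The movie was absolutely wonderful, though the ending felt rushed.")
print(blob.sentiment.polarity)       # float in [-1, 1]: negative to positive
print(blob.sentiment.subjectivity)   # float in [0, 1]: objective to subjective
```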