Basics of text mining
Process of text mining:
- Text Pre-processing
- Text Transformation
- Feature Selection
- Data Mining
- Evaluate
- Applications
- Text representation
- Set of Words
- Bag of Words
- Vector Space Model
- Topic Models
- Word Embedding
Text mining tasks
• Classification – Document categorization – Sentiment analysis
• Clustering Analysis – Text clustering
• Natural Language Processing Tasks
Applications of text mining
- Sentiment Analysis
- Financial Market Prediction
- Recommendation
Challenges in text mining
–Data is not well-organized
–Language is ambiguous
–Annotated training examples are expensive to acquire
Text preprocessing
- Tokenization: break a stream of text into meaningful units
- Normalization:
  – Convert all text to the same case (upper or lower)
  – Remove numbers
  – Remove punctuation
- Stemming/Lemmatization
- Inflected or derived words => the root form
- Plurals, adverbs, inflected word forms: ladies => lady, referring => refer, forgotten => forget
- Solutions (for English):
  – Porter Stemmer: patterns of vowel-consonant sequences
  – Krovetz Stemmer: morphological rules
- Risk: may lose the precise meaning of the word, e.g. ground => grind
- Stopword removal
  - Remove words that carry little content ("useless" words)
  - Risk: may break the original meaning and structure of the text
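The preprocessing steps above can be sketched in a few lines of plain Python. The stopword list and the suffix-stripping rules below are illustrative assumptions, not the Porter or Krovetz algorithms:

```python
import re

# Toy stopword list (assumption; real lists are much longer)
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of"}

def stem(token):
    # Crude suffix stripping; stands in for a real stemmer
    if token.endswith("ies"):
        return token[:-3] + "y"          # ladies -> lady
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]               # referring -> referr
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]           # referr -> refer
        return token
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]                # documents -> document
    return token

def preprocess(text):
    # Normalization: lowercase, drop numbers and punctuation
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # Tokenization: split on whitespace
    tokens = text.split()
    # Stopword removal, then stemming
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The ladies are referring to 2 documents."))
# -> ['lady', 'refer', 'document']
```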
Text representation
- Set of Words & Bag of Words
- Vector Space Model –Term Frequency – Inverse Document Frequency
- Topic Models – Latent Dirichlet Allocation, …
- Word Embedding – Word2vec, …
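The difference between Set of Words and Bag of Words is simply presence vs. counts, as a small sketch shows:

```python
from collections import Counter

tokens = ["text", "mining", "text", "data"]

# Set of Words: only records presence/absence of each term
set_of_words = set(tokens)

# Bag of Words: also keeps the count of each term
bag_of_words = Counter(tokens)

print(sorted(set_of_words))   # ['data', 'mining', 'text']
print(bag_of_words["text"])   # 2 (the count survives in the bag)
```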
Vector Space Model
Represent texts by vectors
- Each dimension corresponds to a meaningful unit
- Orthogonal dimensions – linearly independent basis vectors – no ambiguity
- Element of each vector is the weight (importance) of the unit
- Two basic heuristics to assign weights:
- TF (Term Frequency) = Within-doc-frequency.
- IDF (Inverse Document Frequency)
TF (Term Frequency)
Idea: a term is more important if it occurs more frequently in a document
- Raw TF: $tf(t,d) = c(t,d)$, the frequency count of term $t$ in document $d$. Not accurate: it can be inflated by document length.
- Normalize by the number of words in the document:
  $tf(t,d) = \frac{c(t,d)}{\sum_t c(t,d)}$
- Normalize by the most frequent word in the document:
  $tf(t,d) = \alpha + (1-\alpha)\frac{c(t,d)}{\max_t c(t,d)}, \quad \text{if } c(t,d) > 0$
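The three TF variants can be compared on a toy document (the token list below is an assumption for illustration):

```python
from collections import Counter

doc = ["text", "mining", "text", "data", "text"]
counts = Counter(doc)

def tf_raw(t):
    # Raw frequency count c(t, d)
    return counts[t]

def tf_norm(t):
    # Normalized by document length: c(t,d) / sum of all counts
    return counts[t] / sum(counts.values())

def tf_max(t, alpha=0.5):
    # Normalized by the most frequent term; alpha smooths the
    # weight of present terms into the range [alpha, 1]
    if counts[t] == 0:
        return 0.0
    return alpha + (1 - alpha) * counts[t] / max(counts.values())

print(tf_raw("text"))    # 3
print(tf_norm("text"))   # 3/5 = 0.6
print(tf_max("text"))    # 0.5 + 0.5 * 3/3 = 1.0
```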
IDF (Inverse Document Frequency)
Idea:a term is more discriminative if it occurs only in fewer documents
$IDF(t) = 1 + \log\left(\frac{N}{df(t)}\right)$
where $N$ is the total number of documents and $df(t)$ is the number of documents containing term $t$.
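A minimal sketch of the IDF formula on a toy corpus (the document lists are assumptions for illustration):

```python
import math

# Toy corpus: N = 4 documents, each a list of tokens
docs = [
    ["text", "mining"],
    ["text", "data"],
    ["data", "mining", "text"],
    ["deep", "learning"],
]
N = len(docs)

def idf(t):
    # df(t): number of documents containing term t
    df = sum(1 for d in docs if t in d)
    return 1 + math.log(N / df)

print(idf("text"))   # in 3 of 4 docs -> 1 + log(4/3), low
print(idf("deep"))   # in 1 of 4 docs -> 1 + log(4), high
```

Rare terms get larger IDF, matching the idea that a term occurring in fewer documents is more discriminative.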
TF-IDF
$w(t,d) = TF(t,d) \times IDF(t)$
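Combining the two heuristics gives a TF-IDF weight per term, i.e. one sparse vector per document. A minimal sketch, using length-normalized TF and the toy corpus below (an assumption for illustration):

```python
import math
from collections import Counter

docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["deep", "learning"],
]
N = len(docs)

def tfidf_vector(doc):
    # Map each term in the document to w(t,d) = TF(t,d) * IDF(t)
    counts = Counter(doc)
    vec = {}
    for t, c in counts.items():
        tf = c / len(doc)                        # length-normalized TF
        df = sum(1 for d in docs if t in d)      # document frequency
        idf = 1 + math.log(N / df)
        vec[t] = tf * idf
    return vec

print(tfidf_vector(docs[0]))
# "text" outweighs "mining": higher TF and it appears in fewer docs
```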
The vector space model produces sparse, high-dimensional matrices.
The latter two models (topic models and word embeddings) mainly produce low-dimensional, dense representations.