Lecture 7: Text Mining Notes

Basics of text mining

Process of text mining:

  1. Text Pre-processing
  2. Text Transformation
  3. Feature Selection
  4. Data Mining
  5. Evaluation
  6. Applications

  • Text representation
    • Set of Words
    • Bag of Words
    • Vector Space Model
    • Topic Models
    • Word Embedding
Text mining tasks

• Classification – document categorization, sentiment analysis
• Clustering Analysis – text clustering
• Natural Language Processing tasks

Applications of text mining
  • Sentiment Analysis
  • Financial Market Prediction
  • Recommendation
Challenges in text mining

– Data is not well-organized
– Ambiguity in natural language
– Annotated training examples are expensive to acquire

Text preprocessing


  1. Tokenization:
    Break a stream of text into meaningful units (tokens)
  2. Normalization:
    Convert all text to the same case (upper or lower), remove numbers, remove punctuation
  • Stemming/Lemmatization
    • Map inflected or derived words to their root form
    • Plurals, adverbs, inflected word forms: ladies => lady, referring => refer, forgotten => forget
    • Solutions (for English): Porter Stemmer (patterns of vowel-consonant sequences), Krovetz Stemmer (morphological rules)
    • Risk: may lose the precise meaning of a word, e.g., ground => grind
  • Normalization – Stopwords
    • Remove words that carry little meaning on their own
    • Risk: may break the original meaning and structure of the text
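A minimal sketch of this pipeline in Python, assuming NLTK for the Porter Stemmer (the notes name the stemmer but no library); the stopword list and example sentence are illustrative:

```python
import re

from nltk.stem import PorterStemmer  # assumed library; the notes name only the Porter Stemmer

STOPWORDS = {"the", "a", "an", "to", "of", "were", "was", "is"}  # tiny illustrative list

def preprocess(text):
    # Tokenization + normalization: lowercase everything and keep only
    # alphabetic tokens (this also drops numbers and punctuation)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword removal: drop words that carry little meaning on their own
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stemming: reduce inflected forms to a root with the Porter Stemmer
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The ladies were referring to 3 forgotten books."))
# -> ['ladi', 'refer', 'forgotten', 'book']
# Note the risk mentioned above: Porter returns 'ladi' rather than 'lady';
# a morphology-based stemmer like Krovetz would return the dictionary form.
```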

Text representation

  • Set of Words & Bag of Words
  • Vector Space Model – Term Frequency–Inverse Document Frequency (TF-IDF)
  • Topic Models – Latent Dirichlet Allocation, …
  • Word Embedding – Word2vec, …
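To make the first two representations concrete, a minimal sketch (the example tokens are illustrative): a set of words records only presence, while a bag of words keeps counts.

```python
from collections import Counter

tokens = ["to", "be", "or", "not", "to", "be"]

set_of_words = set(tokens)      # presence only: {'to', 'be', 'or', 'not'}
bag_of_words = Counter(tokens)  # counts: {'to': 2, 'be': 2, 'or': 1, 'not': 1}

print(set_of_words)
print(bag_of_words)
```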
VECTOR SPACE MODEL

Represent texts by vectors

  • Each dimension corresponds to a meaningful unit
    • Orthogonal: linearly independent basis vectors, no ambiguity between dimensions
  • The element of each vector is the weight (importance) of the unit
    • Two basic heuristics to assign weights:
      • TF (Term Frequency): within-document frequency
      • IDF (Inverse Document Frequency)

TF (Term Frequency)
Idea: a term is more important if it occurs more frequently in a document

  • Raw TF: $tf(t,d)=c(t,d)$, the frequency count of term $t$ in document $d$. Not accurate on its own, since it is affected by document length.
  • Normalize by the total number of words in the document:
    $tf(t,d)=\frac{c(t,d)}{\sum_t c(t,d)}$
  • Normalize by the count of the most frequent word in the document:
    $tf(t,d)=\alpha+(1-\alpha)\frac{c(t,d)}{\max_t c(t,d)}, \ \text{if } c(t,d)>0$
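A small worked sketch of the three TF variants (the document and the choice of $\alpha$ are illustrative):

```python
from collections import Counter

doc = "the cat sat on the mat the cat".split()
counts = Counter(doc)  # c(t, d) for every term t in document d

def tf_raw(t):
    return counts[t]

def tf_length_norm(t):
    # normalize by the total number of words in the document
    return counts[t] / sum(counts.values())

def tf_max_norm(t, alpha=0.4):
    # normalize by the most frequent word; alpha acts as a smoothing floor
    if counts[t] == 0:
        return 0.0
    return alpha + (1 - alpha) * counts[t] / max(counts.values())

for t in ["the", "cat", "mat"]:
    print(t, tf_raw(t), round(tf_length_norm(t), 3), round(tf_max_norm(t), 3))
# 'the' dominates raw TF; max-normalization keeps every occurring term's
# weight at or above alpha, damping the effect of document length
```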

IDF (Inverse Document Frequency)
Idea: a term is more discriminative if it occurs in fewer documents
$IDF(t)=1+\log\left(\frac{N}{df(t)}\right)$,
where $N$ is the total number of documents and $df(t)$ is the number of documents containing term $t$.
TF-IDF
$w(t,d)=TF(t,d)\times IDF(t)$
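Putting TF and IDF together, a minimal TF-IDF sketch over a toy corpus (the documents are illustrative, and raw counts stand in for TF):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices rose sharply".split(),
]
N = len(docs)

# df(t): the number of documents containing term t
df = Counter(t for doc in docs for t in set(doc))

def tf_idf(t, doc):
    tf = doc.count(t)              # raw within-document frequency
    idf = 1 + math.log(N / df[t])  # IDF(t) = 1 + log(N / df(t))
    return tf * idf

# 'the' occurs in two documents, 'stock' in only one, so 'stock'
# gets the larger IDF and is the more discriminative term
print(round(tf_idf("the", docs[0]), 2))    # high TF, low IDF
print(round(tf_idf("stock", docs[2]), 2))  # low TF, high IDF
```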

The vector space model produces sparse, high-dimensional matrices;
the next two families of models (topic models and word embeddings) mainly produce low-dimensional, dense representations.
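To illustrate the dense, low-dimensional side of that contrast, a minimal Word2vec sketch, assuming the gensim library (the notes name only Word2vec itself; the toy corpus and hyperparameters are illustrative):

```python
from gensim.models import Word2Vec  # assumed library; not named in the notes

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# Each word is mapped to a dense 50-dimensional vector, in contrast to
# the sparse vocabulary-sized vectors of the vector space model
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)

print(model.wv["cat"].shape)                 # (50,) -- low-dimensional, dense
print(model.wv.most_similar("cat", topn=2))  # nearest neighbours in the space
```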

TOPIC MODELS