Basics of text mining
Process of text mining:
- Text Pre-processing
- Text Transformation
- Feature Selection
- Data Mining
- Evaluate
- Applications
- Text representation
- Set of Words
- Bag of Words
- Vector Space Model
- Topic Models
- Word Embedding
Text mining tasks
• Classification – Document categorization – Sentiment analysis
• Clustering Analysis – Text clustering
• Natural Language Processing Tasks
Applications of text mining
- Sentiment Analysis
- Financial Market Prediction
- Recommendation
Challenges in text mining
–Data is not well-organized
–Language is ambiguous
–Annotated training examples are expensive to acquire
Text preprocessing
- Tokenization: break a stream of text into meaningful units
- Normalization:
  – Convert all text to the same case (upper or lower)
  – Remove numbers
  – Remove punctuation
- Stemming/Lemmatization
- Inflected or derived words => the root form
- Plurals, adverbs, inflected word forms: ladies => lady, referring => refer, forgotten => forget
- Solutions (for English):
  – Porter Stemmer: patterns of vowel-consonant sequences
  – Krovetz Stemmer: morphological rules
- Risk: may lose the precise meaning of the word, e.g. ground => grind
- Stopword removal
  - Remove words that carry little content ("useless" words)
  - Risk: may break the original meaning and structure of the text
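The preprocessing steps above can be sketched in a few lines of plain Python. The stopword list and the suffix-stripping rules below are illustrative assumptions, not the Porter or Krovetz algorithms:

```python
import re

# Toy stopword list (assumption; real lists are much longer)
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of"}

def stem(token):
    # Crude suffix stripping; stands in for a real stemmer
    if token.endswith("ies"):
        return token[:-3] + "y"          # ladies -> lady
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]               # referring -> referr
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]           # referr -> refer
        return token
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]                # documents -> document
    return token

def preprocess(text):
    # Normalization: lowercase, drop numbers and punctuation
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # Tokenization: split on whitespace
    tokens = text.split()
    # Stopword removal, then stemming
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The ladies are referring to 2 documents."))
# -> ['lady', 'refer', 'document']
```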
Text representation
- Set of Words & Bag of Words
- Vector Space Model –Term Frequency – Inverse Document Frequency
- Topic Models – Latent Dirichlet Allocation, …
- Word Embedding – Word2vec, …
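The difference between Set of Words and Bag of Words is simply presence vs. counts, as a small sketch shows:

```python
from collections import Counter

tokens = ["text", "mining", "text", "data"]

# Set of Words: only records presence/absence of each term
set_of_words = set(tokens)

# Bag of Words: also keeps the count of each term
bag_of_words = Counter(tokens)

print(sorted(set_of_words))   # ['data', 'mining', 'text']
print(bag_of_words["text"])   # 2 (the count survives in the bag)
```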
Vector Space Model
Represent texts by vectors
- Each dimension corresponds to a meaningful unit
- Orthogonal dimensions – linearly independent basis vectors – no ambiguity
- Element of each vector is the weight (importance) of the unit
- Two basic heuristics to assign weights:
- TF (Term Frequency) = Within-doc-frequency.
- IDF (Inverse Document Frequency)
TF (Term Frequency)
Idea: a term is more important if it occurs more frequently in a document
- Raw TF: $tf(t,d) = c(t,d)$, the frequency count of term $t$ in document $d$. Not accurate: it can be inflated by document length.
- Normalize by the number of words in the document:
  $tf(t,d) = \frac{c(t,d)}{\sum_t c(t,d)}$
- Normalize by the most frequent word in the document:
  $tf(t,d) = \alpha + (1-\alpha)\frac{c(t,d)}{\max_t c(t,d)}, \quad \text{if } c(t,d) > 0$
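The three TF variants can be compared on a toy document (the token list below is an assumption for illustration):

```python
from collections import Counter

doc = ["text", "mining", "text", "data", "text"]
counts = Counter(doc)

def tf_raw(t):
    # Raw frequency count c(t, d)
    return counts[t]

def tf_norm(t):
    # Normalized by document length: c(t,d) / sum of all counts
    return counts[t] / sum(counts.values())

def tf_max(t, alpha=0.5):
    # Normalized by the most frequent term; alpha smooths the
    # weight of present terms into the range [alpha, 1]
    if counts[t] == 0:
        return 0.0
    return alpha + (1 - alpha) * counts[t] / max(counts.values())

print(tf_raw("text"))    # 3
print(tf_norm("text"))   # 3/5 = 0.6
print(tf_max("text"))    # 0.5 + 0.5 * 3/3 = 1.0
```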
IDF (Inverse Document Frequency)
Idea:a term is more discriminative if it occurs only in fewer documents
$IDF(t) = 1 + \log\left(\frac{N}{df(t)}\right)$
where $N$ is the total number of documents and $df(t)$ is the number of documents containing term $t$.
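A minimal sketch of the IDF formula on a toy corpus (the document lists are assumptions for illustration):

```python
import math

# Toy corpus: N = 4 documents, each a list of tokens
docs = [
    ["text", "mining"],
    ["text", "data"],
    ["data", "mining", "text"],
    ["deep", "learning"],
]
N = len(docs)

def idf(t):
    # df(t): number of documents containing term t
    df = sum(1 for d in docs if t in d)
    return 1 + math.log(N / df)

print(idf("text"))   # in 3 of 4 docs -> 1 + log(4/3), low
print(idf("deep"))   # in 1 of 4 docs -> 1 + log(4), high
```

Rare terms get larger IDF, matching the idea that a term occurring in fewer documents is more discriminative.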
TF-IDF
$w(t,d) = TF(t,d) \times IDF(t)$
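Combining the two heuristics gives a TF-IDF weight per term, i.e. one sparse vector per document. A minimal sketch, using length-normalized TF and the toy corpus below (an assumption for illustration):

```python
import math
from collections import Counter

docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["deep", "learning"],
]
N = len(docs)

def tfidf_vector(doc):
    # Map each term in the document to w(t,d) = TF(t,d) * IDF(t)
    counts = Counter(doc)
    vec = {}
    for t, c in counts.items():
        tf = c / len(doc)                        # length-normalized TF
        df = sum(1 for d in docs if t in d)      # document frequency
        idf = 1 + math.log(N / df)
        vec[t] = tf * idf
    return vec

print(tfidf_vector(docs[0]))
# "text" outweighs "mining": higher TF and it appears in fewer docs
```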
The vector space model produces sparse, high-dimensional matrices.
The latter two models (topic models and word embeddings) mainly produce low-dimensional, dense representations.