[Machine Learning] Text Topic Modeling
1 TF-IDF
Commonly used for extracting keywords from text:
- TF (term frequency) = number of occurrences of the word in the document / total number of words in the document
- IDF (inverse document frequency) = log(total number of documents in the corpus / (number of documents containing the word + 1))
- For a given document, compute the TF-IDF value of every word, sort the values in descending order, and take the top few words as the document's keywords.
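The steps above can be sketched directly in plain Python. This is a minimal illustration on a toy pre-tokenized corpus (the variable names and toy documents are mine, not from the libraries used below):

```python
import math

# Toy corpus: each document is a list of tokens (assumed already segmented)
docs = [
    ["machine", "learning", "text"],
    ["text", "mining", "keywords"],
    ["machine", "translation"],
]

def tf(word, doc):
    # TF = occurrences of the word in this document / total words in the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # IDF = log(total documents / (documents containing the word + 1))
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (df + 1))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Keywords of docs[0]: score every word, sort descending
scores = {w: tf_idf(w, docs[0], docs) for w in set(docs[0])}
top = sorted(scores, key=scores.get, reverse=True)
```

Note that with the `+1` smoothing in the denominator, a word appearing in two of the three documents gets IDF `log(3/3) = 0`, so only rarer words like `"learning"` receive a positive score.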
```python
import jieba
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

texts = [
    '...',
    '...',
    '...',
    '...',
]

# sklearn
# Chinese text must be segmented with jieba first, then joined with
# spaces so CountVectorizer can tokenize it
x_train = [" ".join(jieba.cut(text)) for text in texts[:3]]
x_test = [" ".join(jieba.cut(text)) for text in texts[3:]]

vectorizer = CountVectorizer()
tf_idf_transformer = TfidfTransformer()
x_train_tf_idf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(x_train))
x_test_tf_idf = tf_idf_transformer.transform(vectorizer.transform(x_test))
x_train_weight = x_train_tf_idf.toarray()
x_test_weight = x_test_tf_idf.toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out()
for w in x_train_weight:
    # indices of the TF-IDF weights in descending order
    loc = np.argsort(-w)
    for i in range(5):
        print('{}: {} {}'.format(i + 1, words[loc[i]], w[loc[i]]))
    print('\n')
```
```python
# gensim
import jieba
from gensim import corpora
from gensim import models

x_train = [jieba.lcut(text) for text in texts[:3]]
x_test = [jieba.lcut(text) for text in texts[3:]]

# Build the vocabulary
dic = corpora.Dictionary(x_train)
# Map each document to a bag-of-words list of (word id, count) pairs
x_train_bow = [dic.doc2bow(sentence) for sentence in x_train]
tfidf = models.TfidfModel(x_train_bow)
tfidf_vec = []
for sentence in x_test:
    # sentence is already a token list, so doc2bow takes it directly
    word_bow = dic.doc2bow(sentence)
    word_tfidf = tfidf[word_bow]
    tfidf_vec.append(word_tfidf)
# Prints (word id, TF-IDF weight) pairs for each test document
print(tfidf_vec)
```
```python
# jieba
import jieba.analyse

# Uses jieba's built-in IDF table by default; a custom one can be
# supplied via jieba.analyse.set_idf_path()
keywords = jieba.analyse.extract_tags(texts[0], topK=5)
```
For large-scale corpora, TF-IDF can also be computed with Spark: Spark MLlib TF-IDF – Example
2 LDA
Commonly used for text topic analysis.
A topic is represented as a distribution over words, and a document as a distribution over topics.
LDA generates a document as follows:
- Sample from the Dirichlet distribution $\alpha$ to generate the topic distribution $\theta_i$ of document $i$