Key points in document preprocessing and vectorization:
- Keyword extraction based on the TF-IDF algorithm
import jieba.analyse
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())  # allowPOS=(): no POS filtering
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=('n', 'v', 'vn'))  # allowPOS: keep only these POS tags
sentence: the text to extract keywords from.
topK: how many keywords with the highest TF-IDF weights to return; default 20.
withWeight: whether to also return each keyword's weight; default False.
allowPOS: only include words with the given part-of-speech tags; default empty, i.e. no filtering.
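A minimal runnable sketch of the call above; the sample sentence below is made up purely for illustration:
import jieba.analyse

sentence = '自然语言处理是人工智能的一个重要方向,主题模型是其中的经典方法。'
# Top 5 keywords with their TF-IDF weights, restricted to nouns and verbal nouns
keywords = jieba.analyse.extract_tags(sentence, topK=5, withWeight=True, allowPOS=('n', 'vn'))
for word, weight in keywords:
    print(word, weight)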
- Print the number of documents and the first 500 characters of the first document:
print(len(docs))
print(docs[0][:500])
# Output
'''
1740
1
CONNECTIVITY VERSUS ENTROPY
Yaser S. Abu-Mostafa
California Institute of Technology
Pasadena, CA 91125
ABSTRACT
How does the connectivity of a neural network (number of synapses per
neuron) relate to the complexity of the problems it can handle (measured by
the entropy)? Switching theory would suggest no relation at all, since all Boolean
functions can be implemented using a circuit with very low connectivity (e.g.,
using two-input NAND gates). However, for a network that learns a pr
'''
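Note that Dictionary below expects each document to be a list of tokens rather than a raw string, so the raw texts have to be tokenized first. A minimal sketch, assuming gensim's simple_preprocess is used for tokenization (the note itself does not show this preprocessing step):
from gensim.utils import simple_preprocess

# Turn each raw document string into a list of lowercase tokens
docs = [simple_preprocess(doc) for doc in docs]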
- Remove words that appear in fewer than 20 documents, or in more than 50% of the documents:
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=20, no_above=0.5)
- Convert the documents to vector form. We only count the frequency of each token, including two-word phrases (bigrams; see the sketch after the output below). Then check the dictionary size and number of documents after filtering.
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
# Output
'''
Number of unique tokens: 8644
Number of documents: 1740
'''
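For the bigrams mentioned above, one common approach is gensim's Phrases model, which appends frequent two-word phrases as extra tokens (e.g. 'neural_network') to each tokenized document. A sketch, assuming docs is already a list of token lists and min_count=20 is a reasonable threshold; in the full pipeline this step would run before building the Dictionary:
from gensim.models import Phrases

# Detect frequent word pairs and append them to the documents as additional tokens
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            docs[idx].append(token)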
Key points in training the LDA model:
- Parameters
  - num_topics: the number of topics (K).
  - chunksize: controls how many documents are processed at a time during training. Increasing chunksize speeds up training, as long as the data still fits in memory. Chunksize does affect model quality, but the difference is not large. With chunksize = 2000, which is larger than the number of documents, all the data is processed in a single chunk.
  - passes: controls how many times we train the model over the entire corpus, i.e. the "epochs". With passes = 20 you will see the corresponding log line 20 times; make sure the documents have converged before the final pass.
  - iterations: controls the number of inner-loop iterations over each document; essentially it controls how often we repeat a particular loop over each document. It is important to set both passes and iterations high enough.
  - alpha and eta: just set them to 'auto'. Although these hyperparameters are hard to tune by hand, gensim can learn them automatically during training, so setting them to 'auto' is enough.
# Train LDA model.
from gensim.models import LdaModel
# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity; it is too expensive.
# Make a index to word dictionary.
temp = dictionary[0] # This is only to "load" the dictionary.
id2word = dictionary.id2token
model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)
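Once training finishes, the learned topics and per-document topic mixtures can be inspected directly; a short usage sketch:
# Show the top words of each learned topic
for topic_id, words in model.print_topics(num_topics=num_topics, num_words=5):
    print(topic_id, words)

# Topic distribution of the first document in the corpus
print(model.get_document_topics(corpus[0]))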
- Compute topic coherence
top_topics = model.top_topics(corpus) #, num_words=20)
# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)
from pprint import pprint
pprint(top_topics)
# Output:
'''
Average topic coherence: -1.1379.
[([(0.0081748655, 'bound'),
(0.007108706, 'let'),
(0.006066193, 'theorem'),
(0.005790631, 'optimal'),
(0.0051151128, 'approximation'),
(0.004763562, 'convergence'),
(0.0043320647, 'class'),
(0.00422147, 'generalization'),
(0.0037292794, 'proof'),
(0.0036608914, 'threshold'),
(0.0034258896, 'sample'
...
'''
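top_topics uses the u_mass coherence measure by default. If you also want the c_v measure, often reported to correlate better with human judgement, gensim's CoherenceModel can compute it; a sketch assuming docs still holds the tokenized documents:
from gensim.models import CoherenceModel

cm = CoherenceModel(model=model, texts=docs, dictionary=dictionary, coherence='c_v')
print('c_v coherence: %.4f' % cm.get_coherence())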
- Parameters worth tuning
- the no_above and no_below parameters of the filter_extremes method
- adding trigrams or higher-order n-grams (see the sketch below).
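For trigrams, one common trick is to run the Phrases model a second time on documents that already contain bigrams, so that frequent bigram + word combinations get merged; a sketch reusing the bigram code above (min_count=20 is an assumed threshold):
from gensim.models import Phrases

# First pass merges frequent word pairs; second pass merges pairs involving those bigrams into trigrams
bigram = Phrases(docs, min_count=20)
docs_bigrams = [bigram[doc] for doc in docs]
trigram = Phrases(docs_bigrams, min_count=20)
docs_trigrams = [trigram[doc] for doc in docs_bigrams]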