About Python: the LDA model generates different topics every time it is trained on the same corpus

Latest recommended article published on 2022-07-17 22:58:14

weixin_39889337


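I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of only 231 sentences. However, each time I repeat the process, it generates different topics.

Why do the same LDA parameters and corpus generate different topics every time?

And how do I stabilize the topic generation?

I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here is my code: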

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from collections import defaultdict
import codecs, os, glob, math

# Load the stopword list, skipping comment lines that start with "#"
stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i[0] != "#" and i != ""]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    # Note: this assumes show_topics() returns plain strings such as
    # "0.1*word + 0.05*other"; newer gensim versions may return
    # (topic_id, string) pairs instead.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" +"):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

#######################################################################

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)

texts = [[word for word in document.lower().split() if word not in stopwords]
         for document in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print(i)

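Why do the same LDA parameters and corpus generate different topics every time?

Because LDA uses randomness in both the training and inference steps.

And how do I stabilize the topic generation?

By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, with numpy.random.seed: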

import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
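A small sketch of what resetting the seed does, using only NumPy's global generator (no gensim involved). The helper name `draw_after_seeding` is made up for this illustration. It shows that the seed has to be reset before each run, not just once at the top of the script.

```python
import numpy as np

SOME_FIXED_SEED = 42


def draw_after_seeding(seed, size=5):
    """Reset NumPy's global RNG, then draw `size` uniform numbers from it."""
    np.random.seed(seed)
    return np.random.rand(size)


# Resetting to the same seed before each call reproduces the same numbers.
first = draw_after_seeding(SOME_FIXED_SEED)
second = draw_after_seeding(SOME_FIXED_SEED)

# Without a reset, the generator simply continues and yields new numbers.
third = np.random.rand(5)

print(np.array_equal(first, second))
print(np.array_equal(second, third))
```

Any code that draws from numpy.random after the reset, as the answer describes for gensim's training and inference, sees the same stream of numbers again.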

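If the training data is sufficient, the results should converge within a finite number of iterations, shouldn't they?

May I know how to set numpy.random to numpy.random.seed? Could you show me an example of how to call ldamodel with numpy.random.seed?

@2er0 You don't set np.random to np.random.seed; you set the seed using np.random.seed.

@larsmans, so is it not np.random.seed(x)? Could you give an example of how to set the seed with np.random.seed?

@2er0: That is exactly how it's done. Your edit to my answer was rejected by other users, but I have more or less reconstructed it.

Thanks for the clarification =). By the way, the fixed random seed actually improved my system's performance on topic inference. Since the number of documents is relatively small, I train the model on 50-200 documents with 20-50 passes and 10 topics, using random.seed(10).

@2er0: That's pure luck :) It sometimes happens with randomized training algorithms; trying many random seeds until you find the right model (tested on a validation set) is a pretty standard trick.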
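The "try many seeds" trick from the comment above can be sketched as a plain loop. This is a generic outline, not gensim code: `train_fn` and `score_fn` are hypothetical callables you would supply yourself, for instance one that trains an LDA model with a given seed and one that scores a model on a held-out validation set (higher being better).

```python
def pick_best_seed(seeds, train_fn, score_fn):
    """Train one model per seed and keep the one with the highest score.

    train_fn(seed) -> model   (hypothetical: trains with that random seed)
    score_fn(model) -> float  (hypothetical: validation score, higher is better)
    """
    best = None  # holds (score, seed, model) for the best run so far
    for seed in seeds:
        model = train_fn(seed)
        score = score_fn(model)
        # Strict comparison: on a tie, the earliest seed is kept.
        if best is None or score > best[0]:
            best = (score, seed, model)
    if best is None:
        raise ValueError("seeds must not be empty")
    score, seed, model = best
    return seed, model, score


# Toy stand-ins so the sketch runs on its own: the "model" is just a number.
def toy_train(seed):
    return (seed * 7) % 10


def toy_score(model):
    return -abs(model - 6)


print(pick_best_seed(range(10), toy_train, toy_score))
```

Scoring must happen on data the models were not trained on; otherwise the loop only selects the seed that overfits best.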

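I'm using LDA to find the topic range of documents. My solution is to generate many models, run the documents through them, and take the average of the results. If you set the random seed, you get only one version of the model; it is reproducible, but how do you know it is the right version?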

Set the random_state parameter when initializing LdaModel().

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=1,
                                            passes=num_passes,
                                            alpha='auto')

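I ran into the same problem, even with about 50,000 comments. But you can get much more consistent topics by increasing the number of iterations that LDA runs for. It was initially set to 50, and when I raised it to 300 it usually gave me the same results, probably because it is much closer to convergence.

Specifically, you just add the following option: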

ldamodel.LdaModel(corpus, ..., iterations = )

The question is about whether the results of different runs with the same number of iterations are random.
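One way to check whether two runs "give the same results" is to compare their topics as sets of top words, ignoring the order in which the topics come out. The sketch below is only an illustration with made-up word lists; the function names and the Jaccard-based matching are assumptions of this example, not something gensim provides under these names.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of words (1.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_agreement(topics_a, topics_b):
    """Average, over topics in run A, of the best Jaccard match found in run B.

    Each argument is a list of topics, each topic a list of its top words.
    Every topic in A is matched independently against all topics in B,
    so the order of topics within a run does not matter.
    """
    if not topics_a or not topics_b:
        raise ValueError("both runs need at least one topic")
    best_matches = [max(jaccard(t, u) for u in topics_b) for t in topics_a]
    return sum(best_matches) / len(best_matches)


# Made-up top words from two hypothetical runs; run_2 lists its topics in
# a different order and differs from run_1 in one word.
run_1 = [["coach", "team", "player"], ["bus", "travel", "road"]]
run_2 = [["bus", "road", "travel"], ["coach", "team", "train"]]

print(run_agreement(run_1, run_1))
print(run_agreement(run_1, run_2))
```

A value that stays near 1.0 across repeated runs suggests the topics have stabilized, whether through a fixed seed or through more iterations.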
