I spent a day reading about LDA:
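LDA is an unsupervised learning method. The theory behind it includes the multinomial distribution and the Dirichlet distribution. The model involves three distributions: the distribution of topics within a document, the distribution of words within a topic, and the distribution of words within a document.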
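To make the three distributions concrete, here is a minimal sketch of the generative story that LDA assumes, written with only the standard library. The helper names (sample_dirichlet, generate_document), the prior values, and the tiny vocabulary size are all made up for illustration; this is not how the lda package is implemented internally.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topic_word, alpha, rng):
    """Generate one document as a list of (topic, word_id) pairs."""
    n_topics = len(topic_word)
    # Topic distribution of this document, drawn from a Dirichlet prior.
    theta = sample_dirichlet([alpha] * n_topics, rng)
    doc = []
    for _ in range(n_words):
        # Pick a topic for this word position (multinomial over topics).
        z = rng.choices(range(n_topics), weights=theta)[0]
        # Pick a word from that topic's word distribution.
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]
        doc.append((z, w))
    return theta, doc

rng = random.Random(0)
n_topics, vocab_size = 3, 6  # illustrative sizes, not the Reuters data
# Word distribution of each topic, also drawn from a Dirichlet prior.
topic_word = [sample_dirichlet([0.5] * vocab_size, rng) for _ in range(n_topics)]
theta, doc = generate_document(10, topic_word, alpha=0.5, rng=rng)
print("document-topic distribution:", [round(p, 3) for p in theta])
print("(topic, word_id) pairs:", doc)
```

Fitting LDA goes in the opposite direction: given only the word counts, it estimates the document-topic and topic-word distributions.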
I am still using a package from my Anaconda installation.
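I have only just started with Python and there is a lot I don't understand yet, but one problem I ran into deepened my understanding of it: importing a package from inside a module of a Python project. I use Python on Windows, from the command line. Starting python and running import lda worked fine from outside the package directory, but once I had moved into the package directory the same import failed. A classmate suggested that Windows applies a priority order when reading environment variables: if the current directory contains a file with the name you are importing, that file is imported whether or not it is the one you want, and only when nothing matches does the lookup fall back to the environment variables. That does explain the error I got on import lda, with one correction: the ordering comes from Python, not from Windows. Python searches the entries of sys.path in order, and in an interactive session the first entry is the current directory, so a local directory or file named lda shadows the installed package.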
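The search order described above can be checked directly. This is a small sketch using only the standard library; module_origin is a helper name I made up, and json is just a stand-in for any importable module.

```python
import importlib.util
import sys

def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

# sys.path is searched front to back. The first entry is normally the
# script's directory, or '' (the current directory) in an interactive
# session, so it is consulted before site-packages.
print("first search path entry:", repr(sys.path[0]))
print("json would be loaded from:", module_origin("json"))
```

Calling module_origin("lda") once from outside the package directory and once from inside it should show which copy Python would pick up in each case.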
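The data is again the Reuters news dataset. I work through the data while reading the source code, and look up any Python syntax questions separately as they come up.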
>>> import numpy as np
>>> import lda
>>> import lda.datasets
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X) # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_ # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
<span style="color:#cc0000;">For each of the 20 topics, the 8 words with the highest probability in that topic:</span>
Topic 0: british churchill sale million major letters west britain
Topic 1: church government political country state people party against
Topic 2: elvis king fans presley life concert young death
Topic 3: yeltsin russian russia president kremlin moscow michael operation
Topic 4: pope vatican paul john surgery hospital pontiff rome
Topic 5: family funeral police miami versace cunanan city service
Topic 6: simpson former years court president wife south church
Topic 7: order mother successor election nuns church nirmala head
Topic 8: charles prince diana royal king queen parker bowles
Topic 9: film french france against bardot paris poster animal
Topic 10: germany german war nazi letter christian book jews
Topic 11: east peace prize award timor quebec belo leader
Topic 12: n't life show told very love television father
Topic 13: years year time last church world people say
Topic 14: mother teresa heart calcutta charity nun hospital missionaries
Topic 15: city salonika capital buddhist cultural vietnam byzantine show
Topic 16: music tour opera singer israel people film israeli
Topic 17: church catholic bernardin cardinal bishop wright death cancer
Topic 18: harriman clinton u.s ambassador paris president churchill france
Topic 19: city museum art exhibition century million churches set
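The slice [:-(n_top_words+1):-1] used in the loop is easy to misread, so here is a pure-Python sketch of the same idea on a toy topic. The vocabulary and probabilities are invented, and sorted(...) stands in for np.argsort; list slicing behaves the same way as the NumPy slicing in the session.

```python
vocab = ["church", "pope", "film", "music", "city"]  # made-up vocabulary
topic_dist = [0.10, 0.40, 0.05, 0.25, 0.20]          # made-up P(word | topic)

n_top_words = 3
# Word indices ordered from least to most probable (what argsort returns).
ascending = sorted(range(len(topic_dist)), key=lambda j: topic_dist[j])
# Walk backwards from the end and stop after n_top_words items,
# which yields the most probable words in descending order.
top_ids = ascending[:-(n_top_words + 1):-1]
top_words = [vocab[j] for j in top_ids]
print(top_words)  # ['pope', 'music', 'city']
```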
>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
<span style="color:#cc0000;">Assign each document, shown by its title, to its most probable topic among the 20 topics above:</span>
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)
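As I understand it, each row of doc_topic is a probability distribution over the topics for one document, and argmax simply picks the largest entry. A pure-Python sketch with invented numbers (the titles, the rows, and the top_topic helper are hypothetical, not taken from the Reuters data):

```python
titles = ["doc A", "doc B"]      # hypothetical titles
doc_topic = [
    [0.05, 0.80, 0.15],          # made-up P(topic | doc A)
    [0.60, 0.10, 0.30],          # made-up P(topic | doc B)
]

def top_topic(row):
    """Index of the largest probability in a row (same idea as argmax)."""
    return max(range(len(row)), key=lambda k: row[k])

assignments = [top_topic(row) for row in doc_topic]
for title, row, k in zip(titles, doc_topic, assignments):
    print("{} (top topic: {}, probability {:.2f})".format(title, k, row[k]))
```

Keeping only the top topic throws away the rest of the distribution; a document such as "doc B" above still has a sizeable share of a second topic.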