LDA usage notes: TREC and testing

赵小越


Article link: https://blog.csdn.net/angela2016/article/details/78220399


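What frustrates me most when working on something is being busy with no direction. With LDA I keep falling into a vicious circle: too many concepts, none of them fully digested, frustration, and then not wanting to read any more. Whenever I get discouraged, my inner programmer cheerleader shows up and tells me: come on, hang in there, you can win~ Today I chatted with a senior labmate who has worked with LDA. He was sympathetic to a beginner like me and mentioned a name that was new to me: TREC 2011. I'm following that lead to dig into the problem, and below are my study notes.
TREC is short for the Text REtrieval Conference, co-sponsored by the US National Institute of Standards and Technology (NIST) and the US Department of Defense. It aims to support research in information retrieval and provides a large number of test collections.
OK, the data has to be requested by email. So I'll send the email to get the data; it's for research, after all~
Sigh, stuck at a bottleneck again. Time to cheer myself up a bit~~
Then I remembered the data from my graduation project. Couldn't that be used as well? So I went back to the small corpus I had built by hand for it. The only catch is that it's quite a lot of work, which is a bit sad~
The more I test, the more I like LSI. Really, nothing hurts like a comparison.
The test set I built myself: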

White House issues cryptic warning against Syrian "potential preparations" for another chemical attack. 
White House warns Syria against chemical attack "preparations" 
White House identifies "potential preparations" for another chemical weapons attack by Syrian President Assad
US identifies #Syrian government potential preparations for chemical attack #SyriaCrisis 
This serves that in eyes of US, Syed Salahuddin is not merely terrorist active against India,but globally-Jonah Bla
US declares Salahuddin global terrorist
Alps avalanche kills two French children and Ukrainian
Avalanche swamps schoolchildren, skier in French Alps; 3 killed: Scores of rescuers
Three dead in Alps avalanche
Horrible Scenes in Turkey as ISIS Attacks Istanbul
Deadly #Istanbul suicide attack blamed on IS #Istanbul
Horrible Scenes in Turkey as ISIS Attacks Istanbul
Car bomb at Turkish police station kills 5
New post: "Car bomb at Turkish police station kills 5"
Turkey car bomb kills five, injures 39: Kurdish rebels detonated a car bomb at a police station in southeastern
Five dead in Kurdish rebels' car bomb blast at Turkey police station
The US sailors' release
Iran Frees US Sailors Swiftly As Diplomacy Smoothes Waters
Iran Releases Footage of US Sailor Apologizing After Capture
Iran humiliated US with the capture of our 10 sailors
14 dead in Japan ski tour bus crash
At least 14 killed, 27 injured in Japan ski tour bus crash
japan Ski bus runs off mountain road in Japan killing 14 people, injuring dozens
At least 20 killed in terrorist attack on hotel in capital of Burkina Faso
20 reported killed in Burkina Faso hotel attackhorrible

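Topic 1: chemical attack: 1, 2, 3, 4 (4 items)
Topic 2: terrorism: 5, 6 (2 items)
Topic 3: Alps avalanche: 7, 8, 9 (3 items)
Topic 4: ISIS attack in Istanbul: 10, 11, 12 (3 items)
Topic 5: car bombing: 13, 14, 15, 16 (4 items)
Topic 6: sailors captured: 17, 18, 19, 20 (4 items)
Topic 7: bus crash in Japan: 21, 22, 23 (3 items)
Topic 8: hotel attack: 24, 25 (2 items)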
A simple test on this content:
LSI is a real surprise! I've got something to show for this week again, great~
Test results:

[(0, u'0.375*"car" + 0.375*"bomb" + 0.319*"station" + 0.319*"police"'), 
 (1, u'-0.300*"chemical" + -0.300*"preparations" + -0.272*"attack" + -0.260*"house"'),
 (2, u'-0.387*"istanbul" + -0.359*"attacks" + -0.359*"scenes" + -0.359*"horrible"'), 
 (3, u'0.389*"japan" + 0.328*"bus" + 0.328*"ski" + 0.328*"14"'), 
 (4, u'-0.370*"sailors" + -0.334*"us" + -0.289*"iran" + -0.249*"capture"'), 
 (5, u'0.315*"20" + 0.315*"hotel" + 0.315*"faso" + 0.315*"burkina"'), 
 (6, u'0.452*"avalanche" + 0.452*"alps" + 0.309*"three" + 0.296*"french"'), 
 (7, u'0.413*"salahuddin" + 0.332*"global" + 0.332*"declares" + 0.330*"terrorist"')]

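Identified topics (ground-truth topic → LSI topic index): topic 1 → 1, topic 2 → 7, topic 3 → 6, topic 4 → 2, topic 5 → 0, topic 6 → 4, topic 7 → 3, topic 8 → 5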

topic:1,-0.677959212292
topic:1,-0.560466990808
topic:1,-0.707421664953
topic:1,-0.594192785244
topic:7,0.732717598253
topic:7,0.700108203255
topic:6,0.70080414284
topic:6,0.638341221701
topic:6,0.680105867685
topic:2,-0.847407394753
topic:2,-0.298636165417
topic:2,-0.847407394753
topic:0,0.799988863413
topic:0,0.739377482341
topic:0,0.733363965516
topic:0,0.691989332757
topic:4,-0.489855273868
topic:4,-0.477669263803
topic:4,-0.435191619025
topic:4,-0.606745557565
topic:3,0.831472885137
topic:3,0.82497123045
topic:3,0.571327530028
topic:5,0.690100226246
topic:5,0.699451684981

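Keep one sentence in mind: garbage input always produces garbage output; there are no miracles. Garbage in, garbage out!
I also tested the LDA part of gensim. Here I need to work out how to get the best LDA parameters, and for that I need to get some concepts straight.
First, what the parameters in the LDA implementation actually mean; second, how to tune them; and finally, how this connects to the theory. This is a long-term process that takes slowly accumulated practice. Once I've found the general direction, I can work towards it bit by bit.
The LDA method in gensim has many parameters. Before studying them I still need to fill in a lot of background concepts. I learned them before, but never applied them, so I need to learn them again.
What eigenvectors and eigenvalues actually mean:
Ax = b: the matrix A applies a transformation to the vector x, which may be a scaling or a rotation.
Ax = ax: A is still a transformation, but for an eigenvector x it is a pure scaling along the same line; the new vector makes no angle with the eigenvector (it is only reversed if a is negative), and a is a constant, the eigenvalue. So to understand a transformation A, finding its eigenvectors already describes it to a certain extent.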
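
To make the eigenvector picture concrete, here is a small check in plain Python. The 2x2 matrix and the helper names are made up for illustration: one vector is only stretched by A, the other is also turned.

```python
def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def is_parallel(u, v, tol=1e-12):
    """Two 2-D vectors are parallel when their cross product is zero."""
    return abs(u[0] * v[1] - u[1] * v[0]) < tol

A = [[2.0, 1.0],
     [1.0, 2.0]]

x = [1.0, 1.0]   # an eigenvector of A, with eigenvalue 3
y = [1.0, 0.0]   # not an eigenvector of A

Ax = matvec(A, x)   # [3.0, 3.0]: same direction as x, scaled by 3
Ay = matvec(A, y)   # [2.0, 1.0]: direction changed

print(Ax, is_parallel(Ax, x))   # [3.0, 3.0] True
print(Ay, is_parallel(Ay, y))   # [2.0, 1.0] False
```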
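LSI (Latent Semantic Indexing), also called LSA (Latent Semantic Analysis), is an indexing and retrieval method proposed in 1990. It also builds on the traditional vector space model; the difference is that it maps terms and documents into a latent semantic space, which removes some of the noise in the original vector space and improves retrieval precision.
The core technique is singular value decomposition (SVD): the original term-document matrix X (m*n) is reduced in dimension by factoring it into U (m*s), D (s*s) and V' (s*n).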
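
Written out with the same symbols, the truncated decomposition looks like this. The singular values σ and the folding-in formula for a new document vector d are the standard LSI formulation, added here as a reminder rather than taken from the notes above.

```latex
X_{m \times n} \;\approx\; U_{m \times s}\, D_{s \times s}\, V'_{s \times n},
\qquad
D = \operatorname{diag}(\sigma_1, \dots, \sigma_s), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_s \ge 0

% A new document d (an m-dimensional term vector) is mapped into the
% s-dimensional latent space by
\hat{d} \;=\; D^{-1}\, U^{\top} d
```

Keeping only the s largest singular values is what throws away the "noise" dimensions; the rows of U describe terms and the columns of V' describe documents in the latent space.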

[(0, u'0.031*"government" + 0.031*"syriacrisis" + 0.029*"releases" + 0.029*"apologizing"'),
(1, u'0.033*"release" + 0.026*"another" + 0.023*"sailors" + 0.023*"potential"'),     sailors captured
(2, u'0.030*"avalanche" + 0.030*"alps" + 0.029*"three" + 0.028*"sailors"'),         Alps avalanche
(3, u'0.041*"salahuddin" + 0.036*"terrorist" + 0.035*"declares" + 0.035*"global"'),  terrorism
(4, u'0.043*"car" + 0.043*"bomb" + 0.038*"5" + 0.038*"turkish"'),           car bombing
(5, u'0.038*"tour" + 0.038*"crash" + 0.035*"least" + 0.033*"14"'),
(6, u'0.039*"syria" + 0.039*"warns" + 0.028*"white" + 0.028*"house"'),        chemical attack
(7, u'0.046*"turkey" + 0.045*"horrible" + 0.045*"attacks" + 0.045*"isis"')]     ISIS attack

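As for its document-topic assignments, I still don't know exactly how to tune the parameters. Once I've figured that out, I'll come back and rewrite this part.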

Tests with the LDA library:
Training LDA with TF-IDF as the features:

TOPIC0: terrorist salahuddin global declares       terrorism
TOPIC1: iran capture 10 apologizing                sailors captured
TOPIC2: avalanche alps killed dead                 Alps avalanche
TOPIC3: chemical preparations white attack         chemical attack
TOPIC4: isis scenes horrible attacks               ISIS attack
TOPIC5: car bomb police station                    car bombing
TOPIC6: japan 14 ski bus                           bus crash in Japan
TOPIC7: sailors release istanbul blamed            sailors captured

The hotel-attack topic is missing here.

doc:0,topic:3
doc:1,topic:3
doc:2,topic:3
doc:3,topic:3
doc:4,topic:0
doc:5,topic:0
doc:6,topic:2
doc:7,topic:0
doc:8,topic:2
doc:9,topic:4
doc:10,topic:7
doc:11,topic:4
doc:12,topic:5
doc:13,topic:5
doc:14,topic:5
doc:15,topic:5
doc:16,topic:7
doc:17,topic:7
doc:18,topic:1
doc:19,topic:1
doc:20,topic:6
doc:21,topic:6
doc:22,topic:6
doc:23,topic:2
doc:24,topic:2

One topic was not identified, and some documents were assigned to the wrong topic. The precision is 76%.
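
One way to arrive at the 76% figure from the listings above. The calculation itself is not shown in these notes, so this sketch assumes that each ground-truth topic is matched to exactly one LDA topic (the sailors topic to TOPIC1 only) and that the hotel-attack topic has no match.

```python
# Ground-truth topic index (0..7) for the 25 documents, in order.
truth = [0] * 4 + [1] * 2 + [2] * 3 + [3] * 3 + [4] * 4 + [5] * 4 + [6] * 3 + [7] * 2

# LDA topic assigned to each document, copied from the doc/topic listing.
predicted = [3, 3, 3, 3, 0, 0, 2, 0, 2, 4, 7, 4, 5,
             5, 5, 5, 7, 7, 1, 1, 6, 6, 6, 2, 2]

# Assumed matching: ground-truth topic -> LDA topic.
# Ground-truth topic 7 (hotel attack) has no matching LDA topic.
match = {0: 3, 1: 0, 2: 2, 3: 4, 4: 5, 5: 1, 6: 6}

correct = sum(1 for t, p in zip(truth, predicted) if match.get(t) == p)
accuracy = correct / len(truth)

print("%d of %d correct: %.0f%%" % (correct, len(truth), 100 * accuracy))
# 19 of 25 correct: 76%
```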
Training LDA directly on raw term frequencies:

TOPIC0: istanbul and children is
TOPIC1: preparations chemical attack syrian   chemical attack
TOPIC2: car bomb at police                    car bombing
TOPIC3: in killed burkina 20
TOPIC4: japan in 14 bus                       bus crash in Japan
TOPIC5: salahuddin us terrorist india         terrorism
TOPIC6: as istanbul turkey in                 ISIS attack
TOPIC7: us iran sailors alps                  Alps avalanche

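When LDA is fitted directly on raw term frequencies, the counts are not reweighted at all, so many meaningless high-frequency words (in, at, as, and so on) end up among the topic words and the topics become rather muddled. The sailors-captured topic was not detected either.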
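
A small pure-Python illustration of why reweighting helps. The plain `tf * log(N / df)` weighting and the four headlines chosen from the test set are assumptions for this example, not necessarily what the library computes: a word like "in" occurs in most headlines, so its weight drops below that of a rarer, more informative word.

```python
import math
from collections import Counter

docs = [
    "Three dead in Alps avalanche",
    "14 dead in Japan ski tour bus crash",
    "At least 20 killed in terrorist attack on hotel in capital of Burkina Faso",
    "Car bomb at Turkish police station kills 5",
]
tokenized = [d.lower().split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many headlines each word appears.
df = Counter(w for tokens in tokenized for w in set(tokens))
idf = {w: math.log(n_docs / df[w]) for w in df}

def tfidf(tokens):
    """TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    return {w: tf[w] * idf[w] for w in tf}

weights = tfidf(tokenized[2])
print("raw count of 'in':   ", tokenized[2].count("in"))      # 2
print("raw count of 'hotel':", tokenized[2].count("hotel"))   # 1
print("tf-idf of 'in':    %.3f" % weights["in"])
print("tf-idf of 'hotel': %.3f" % weights["hotel"])
```

Although "in" occurs twice in the third headline and "hotel" only once, "hotel" ends up with the larger weight. Removing stop words before building the counts has a similar effect.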

doc:0,topic:1
doc:1,topic:1
doc:2,topic:1
doc:3,topic:1
doc:4,topic:5
doc:5,topic:5
doc:6,topic:0
doc:7,topic:7
doc:8,topic:7
doc:9,topic:6
doc:10,topic:0
doc:11,topic:6
doc:12,topic:2
doc:13,topic:2
doc:14,topic:2
doc:15,topic:2
doc:16,topic:7
doc:17,topic:7
doc:18,topic:3
doc:19,topic:7
doc:20,topic:4
doc:21,topic:4
doc:22,topic:4
doc:23,topic:3
doc:24,topic:3

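One topic was not identified, and some documents were assigned to the wrong topic. The precision is 72%, and the representative words of the topics are rather messy.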
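*************** Divider *****************
Topic 0: White House issues a warning to Syria over a chemical attack (500)
Topic 1: Syed Salahuddin designated a global terrorist (500)
Topic 2: a rough week for the Democrats; the Supreme Court ruling (350)
Topics identified by LDA (with np.random.seed set to 10, which gave the best result):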

[(0, u'0.041*"preparations" + 0.037*"attack" + 0.036*"warns" + 0.036*"white" + 0.035*"house" + 0.035*"chemical" + 0.032*"syria" + 0.027*"potential"'), 
(1, u'0.023*"syed" + 0.023*"terrorist" + 0.022*"global" + 0.022*"designates" + 0.022*"salahuddin" + 0.021*"hizbul" + 0.020*"mujahideen" + 0.015*"leading"'), 
(2, u'0.035*"pick" + 0.034*"dead" + 0.034*"scotus" + 0.034*"supreme" + 0.034*"russia" + 0.034*"travel" + 0.034*"ban" + 0.033*"coming"')]

With test data of 500, 500 and 350 items, the resulting precision is 76.59%:

the number of topic0 is:386
the number of topic1 is:58
the number of topic2 is:56
the number of topic0 is:54
the number of topic1 is:298
the number of topic2 is:148
the number of topic0 is:0
the number of topic1 is:0
the number of topic2 is:350
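
The 76.59% can be checked from these counts. This sketch assumes that each group of three lines belongs to one true topic (0, 1, 2 in order) and that LDA topic i corresponds to true topic i, which matches the topic words listed above.

```python
# Rows: true topic 0, 1, 2. Columns: number of documents LDA put in topic 0, 1, 2.
counts = [
    [386, 58, 56],
    [54, 298, 148],
    [0, 0, 350],
]

correct = sum(counts[i][i] for i in range(len(counts)))   # diagonal entries
total = sum(sum(row) for row in counts)
accuracy = correct / total

print("%d / %d = %.2f%%" % (correct, total, 100 * accuracy))
# 1034 / 1350 = 76.59%
```

Most of the errors come from true topic 1, where 148 of 500 documents end up in LDA topic 2.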

***************** New test ******************
Following the principle of incremental testing. Haha, I've become obsessed.
Topic 0: White House issues a warning to Syria over a chemical attack (500+450+500)
White House warns Syria’s Assad against chemical attack
Topic 1: Syed Salahuddin designated a global terrorist (500)
RT @timesofindia: US designates Syed Salahuddin ‘global terrorist’, sets tone for Trump-Modi meet
Topic 2: a rough week for the Democrats; the Supreme Court ruling (350)
RT @Hublife: Rough week for Democrats -Russia story dead -Another SCOTUS pick coming -Supreme Court Travel ban -Georgia Congres…
Topic 3: 24 migrants rescued (50)
24 migrants rescued in Niger, dozens more feared dead
Topic 4: Democrats hold a supermajority in America's largest state (400)
RT @davidsirota: Democrats have a supermajority in America’s largest state - and just used that power to kill a Medicare-for-all bill
Topic 5: kidnapped journalist found dead in Mexico (150)
Kidnapped journalist found dead in Mexico, sixth of 2017
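So how exactly are the parameters supposed to be tuned? I haven't read the source code carefully, sob, which is a bit sad.
alpha is the parameter of a symmetric Dirichlet distribution; the larger the value, the smoother the distribution (more regularized). According to the paper, the most important thing is to set a reasonable initial seed. Would Gibbs sampling work better? I need to test that.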
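
To see what "smoother" means for alpha, here is a small standard-library experiment. The helper names, the 8 topics and the two alpha values are made up for illustration. It draws topic mixtures from a symmetric Dirichlet and measures how much of the mass the single largest topic takes on average.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_max_share(alpha, k=8, n=2000, seed=10):
    """Average share of the largest topic over n sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

peaked = mean_max_share(0.1)    # small alpha: one topic dominates each mixture
smooth = mean_max_share(10.0)   # large alpha: mass is spread across topics

print("alpha=0.1  -> average largest share: %.2f" % peaked)
print("alpha=10.0 -> average largest share: %.2f" % smooth)
```

With a small alpha each document is pushed towards one or two topics, which suits short single-event headlines like the ones tested here; a large alpha spreads every document across many topics.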