Gensim Official Tutorial Translation (7): A Distributed Latent Semantic Analysis Example

Latest recommended article published 2023-11-12 09:54:46

小小小北漂



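This translation is for personal study only; corrections are welcome. See the original article.

Read the Distributed Computing tutorial first for an overview of distributed computing in gensim.

Setting up the cluster

We will show how to run distributed Latent Semantic Analysis by walking through an example. Let's assume we have five computers, all on the same network segment (reachable by network broadcast). To begin, install gensim and set up Pyro on each of them: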

$ sudo easy_install gensim[distributed]
$ export PYRO_SERIALIZERS_ACCEPTED=pickle
$ export PYRO_SERIALIZER=pickle
 
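Before starting any Pyro process on a machine, it can help to confirm that the two variables above are actually visible to it. The following is only a minimal sketch: `pyro_pickle_problems` is a hypothetical helper, not part of gensim or Pyro, and it assumes that `PYRO_SERIALIZERS_ACCEPTED` may hold a comma-separated list.

```python
import os


def pyro_pickle_problems(environ=None):
    """Return a list of problems; an empty list means the settings look right."""
    env = os.environ if environ is None else environ
    problems = []

    # PYRO_SERIALIZER names the serializer this process uses for its own calls.
    if env.get("PYRO_SERIALIZER") != "pickle":
        problems.append("PYRO_SERIALIZER should be 'pickle'")

    # Assumption: PYRO_SERIALIZERS_ACCEPTED may list several names separated by commas.
    accepted = {name.strip() for name in env.get("PYRO_SERIALIZERS_ACCEPTED", "").split(",")}
    if "pickle" not in accepted:
        problems.append("PYRO_SERIALIZERS_ACCEPTED should include 'pickle'")

    return problems


if __name__ == "__main__":
    for problem in pyro_pickle_problems():
        print("warning:", problem)
```

Running it in the same shell that will later start the worker or dispatcher prints one warning per missing setting and nothing when both are in place.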

Next, start Pyro's name server on one of the machines (it doesn't matter which):

$ python -m Pyro4.naming -n 0.0.0.0 &
 

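Let's say the computers in our cluster are dual-core machines with plenty of memory. We will therefore run two worker scripts on each of four machines, creating eight logical worker nodes: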

$ python -m gensim.models.lsi_worker &
 

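This runs gensim's lsi_worker.py script (twice on each of the four machines). It tells gensim that it can run two jobs in parallel on each of these four machines, so the computation finishes faster, at the cost of twice the memory on each machine.
Next, pick one computer to act as the job scheduler in charge of worker synchronization, and run the LSA dispatcher on it.
In our example, we will use the fifth computer as the dispatcher and run there: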

$ python -m gensim.models.lsi_dispatcher &
 
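The start-up steps so far can be summarized as a per-machine plan. This is a minimal sketch, not a deployment tool: the host names `node1` to `node5` are hypothetical, and placing the name server on `node1` is an assumption (the tutorial says any machine will do). The command strings are the ones shown above.

```python
def launch_plan(worker_hosts, workers_per_host, dispatcher_host, nameserver_host):
    """Map each host to the list of commands to start on it."""
    plan = {}
    # The name server goes first so that workers and the dispatcher can register with it.
    plan.setdefault(nameserver_host, []).append("python -m Pyro4.naming -n 0.0.0.0 &")
    # Each worker host runs the worker script once per logical worker.
    for host in worker_hosts:
        for _ in range(workers_per_host):
            plan.setdefault(host, []).append("python -m gensim.models.lsi_worker &")
    # Exactly one dispatcher coordinates all workers.
    plan.setdefault(dispatcher_host, []).append("python -m gensim.models.lsi_dispatcher &")
    return plan


hosts = ["node1", "node2", "node3", "node4"]  # hypothetical host names
plan = launch_plan(hosts, workers_per_host=2, dispatcher_host="node5", nameserver_host="node1")

total_workers = sum(
    command.endswith("lsi_worker &") for commands in plan.values() for command in commands
)
print(total_workers)  # 4 machines x 2 workers each = 8 logical workers
```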

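In general, the dispatcher can run on the same machine as one of the worker nodes, or on another, separate computer (within the same broadcast domain). The dispatcher uses little CPU most of the time, but pick a computer with ample memory.
And that's it! The cluster is set up and ready to accept jobs. To remove a worker later, simply terminate its lsi_worker process; to add one, start another lsi_worker (this does not affect a computation that is already running, because additions and removals are not picked up dynamically). If you terminate lsi_dispatcher, you won't be able to run computations until you start it again (surviving worker processes can be reused, though).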

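Running LSA

Let's test our setup by running a distributed LSA computation. Open a Python shell on any of the five machines (again, any computer in the same broadcast domain will do; our choice is arbitrary) and try: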

>>> from gensim import corpora, models, utils
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm') # load the nine-document corpus used in the tutorials
>>> id2word = corpora.Dictionary.load('/tmp/deerwester.dict')

>>> lsi = models.LsiModel(corpus, id2word=id2word, num_topics=200, chunksize=1, distributed=True) # run distributed LSA
 

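This uses the corpus and the feature-to-token mapping created in the Corpora and Vector Spaces tutorial. If you look at the log of your Python session, you should find a line similar to: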

2010-08-09 23:44:25,746 : INFO : using distributed version with 8 workers
 

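This means everything went well. You can also check the logs of the worker and dispatcher processes, which is especially helpful when something goes wrong. To check the LSA results, let's print the first two latent topics: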

>>> lsi.print_topics(num_topics=2, num_words=5)
topic #0(3.341): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"time" + 0.265*"response"
topic #1(2.542): 0.623*"graph" + 0.490*"trees" + 0.451*"minors" + 0.274*"survey" + -0.167*"system"
 

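Success! But a corpus this small is no challenge for our powerful cluster... In fact, we deliberately lowered the job size (the chunksize parameter) to one document at a time; otherwise a single worker would have processed all the documents in one go.

So let's try LSA on one million documents instead: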
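
To see why the job size matters here, consider how a document stream is cut into jobs. This is only an illustration of the idea, not gensim's dispatcher code: `chunked` is a hypothetical helper, and the list of strings stands in for the nine-document corpus.

```python
from itertools import islice


def chunked(documents, chunksize):
    """Yield successive lists of at most `chunksize` documents from any iterable."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


nine_docs = [f"doc{i}" for i in range(9)]  # stand-in for the nine-document corpus

jobs_small = list(chunked(nine_docs, 1))      # one document per job: work for every worker
jobs_large = list(chunked(nine_docs, 20000))  # any chunksize >= 9 yields a single job

print(len(jobs_small), len(jobs_large))  # prints: 9 1
```

With a single job there is nothing to spread across eight workers, which is why the toy run uses chunksize=1.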

>>> # inflate the corpus to 1M documents by repeating its documents over and over
>>> corpus1m = utils.RepeatCorpus(corpus, 1000000)
>>> # run distributed LSA on the one million documents
>>> lsi1m = models.LsiModel(corpus1m, id2word=id2word, num_topics=200, chunksize=10000, distributed=True)

>>> lsi1m.print_topics(num_topics=2, num_words=5)
topic #0(1113.628): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"time" + 0.265*"response"
topic #1(847.233): 0.623*"graph" + 0.490*"trees" + 0.451*"minors" + 0.274*"survey" + -0.167*"system"
 
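A quick sanity check on these numbers: the topic vectors are identical to those of the nine-document run, and only the values in parentheses (the singular values) differ. That is what linear algebra predicts. If a term-document matrix A is repeated k times side by side, then [A ... A][A ... A]^T = k A A^T, so the left singular vectors stay the same and every singular value grows by a factor of sqrt(k). The sketch below takes k to be roughly 1,000,000 / 9, an approximation since one million is not a multiple of nine.

```python
import math

small_run = [3.341, 2.542]       # values printed for the nine-document corpus
large_run = [1113.628, 847.233]  # values printed for the one-million-document corpus

repeats = 1_000_000 / 9          # approximate number of copies of the original corpus
scale = math.sqrt(repeats)       # predicted growth factor of each singular value

for small, large in zip(small_run, large_run):
    predicted = small * scale
    relative_error = abs(predicted - large) / large
    print(f"predicted {predicted:.1f}, reported {large}, relative error {relative_error:.1e}")
```

Both predictions land within a small fraction of a percent of the reported values; the remainder is consistent with rounding in the printed three-decimal figures and with the inexact repeat count.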

The log for this run should look similar to:

2010-08-10 02:46:35,087 : INFO : using distributed version with 8 workers
2010-08-10 02:46:35,087 : INFO : updating SVD with new documents
2010-08-10 02:46:35,202 : INFO : dispatched documents up to #10000
2010-08-10 02:46:35,296 : INFO : dispatched documents up to #20000
…
2010-08-10 02:46:46,524 : INFO : dispatched documents up to #990000
2010-08-10 02:46:46,694 : INFO : dispatched documents up to #1000000
2010-08-10 02:46:46,694 : INFO : reached the end of input; now waiting for all remaining jobs to finish
2010-08-10 02:46:47,195 : INFO : all jobs finished, downloading final projection
2010-08-10 02:46:47,200 : INFO : decomposition complete

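Because our "one-million-document corpus" has such a small vocabulary and such trivial structure, the LSA computation takes only about 12 seconds. For a real stress test, let's run LSA on the English Wikipedia corpus.

Distributed LSA on Wikipedia

First, download and prepare the Wikipedia corpus as described in the Experiments on the English Wikipedia tutorial, then load the corpus iterator: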
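
The 12-second figure can be read straight off the log timestamps. Here is a minimal sketch that parses the first and last lines of the log shown above; `log_timestamp` is a hypothetical helper that assumes the `%(asctime)s : %(levelname)s : %(message)s` format configured earlier.

```python
from datetime import datetime

first_line = "2010-08-10 02:46:35,087 : INFO : using distributed version with 8 workers"
last_line = "2010-08-10 02:46:47,200 : INFO : decomposition complete"


def log_timestamp(line):
    """Parse the leading asctime field of a log line into a datetime."""
    stamp = line.split(" : ", 1)[0]
    # asctime looks like '2010-08-10 02:46:35,087'; the part after the comma is milliseconds.
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


elapsed = (log_timestamp(last_line) - log_timestamp(first_line)).total_seconds()
print(f"{elapsed:.3f} s")  # prints: 12.113 s
```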

>>> import logging, gensim, bz2
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary)
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus(bz2.BZ2File('wiki_en_tfidf.mm.bz2')) # use this if you compressed the TFIDF output

>>> print(mm)
MmCorpus(3199665 documents, 100000 features, 495547400 non-zero entries)
 
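The summary line printed by `print(mm)` already says a lot about the workload. As a rough worked example using only the three numbers shown above:

```python
num_docs = 3_199_665        # documents, from the MmCorpus summary
num_features = 100_000      # features (vocabulary size)
num_nonzero = 495_547_400   # non-zero entries

avg_nonzero_per_doc = num_nonzero / num_docs
density = num_nonzero / (num_docs * num_features)

print(f"{avg_nonzero_per_doc:.1f} non-zero features per document on average")
print(f"{density:.4%} of the matrix cells are non-zero")
```

So each document touches roughly 155 of the 100,000 features, and well under 1% of the matrix is populated, which is why the corpus is kept in a sparse format.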

Now we are ready to run distributed LSA on the English Wikipedia:

>>> # extract 400 LSI topics, using the cluster
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400, chunksize=20000, distributed=True)

>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.print_topics(10)
2010-11-03 16:08:27,602 : INFO : topic #0(200.990): -0.475*"delete" + -0.383*"deletion" + -0.275*"debate" + -0.223*"comments" + -0.220*"edits" + -0.213*"modify" + -0.208*"appropriate" + -0.194*"subsequent" + -0.155*"wp" + -0.117*"notability"
2010-11-03 16:08:27,626 : INFO : topic #1(143.129): -0.320*"diff" + -0.305*"link" + -0.199*"image" + -0.171*"www" + -0.162*"user" + 0.149*"delete" + -0.147*"undo" + -0.144*"contribs" + -0.122*"album" + 0.113*"deletion"
2010-11-03 16:08:27,651 : INFO : topic #2(135.665): -0.437*"diff" + -0.400*"link" + -0.202*"undo" + -0.192*"user" + -0.182*"www" + -0.176*"contribs" + 0.168*"image" + -0.109*"added" + 0.106*"album" + 0.097*"copyright"
2010-11-03 16:08:27,677 : INFO : topic #3(125.027): -0.354*"image" + 0.239*"age" + 0.218*"median" + -0.213*"copyright" + 0.204*"population" + -0.195*"fair" + 0.195*"income" + 0.167*"census" + 0.165*"km" + 0.162*"households"
2010-11-03 16:08:27,701 : INFO : topic #4(116.927): -0.307*"image" + 0.195*"players" + -0.184*"median" + -0.184*"copyright" + -0.181*"age" + -0.167*"fair" + -0.162*"income" + -0.151*"population" + -0.136*"households" + -0.134*"census"
2010-11-03 16:08:27,728 : INFO : topic #5(100.326): 0.501*"players" + 0.318*"football" + 0.284*"league" + 0.193*"footballers" + 0.142*"image" + 0.133*"season" + 0.119*"cup" + 0.113*"club" + 0.110*"baseball" + 0.103*"f"
2010-11-03 16:08:27,754 : INFO : topic #6(92.298): -0.411*"album" + -0.275*"albums" + -0.217*"band" + -0.214*"song" + -0.184*"chart" + -0.163*"songs" + -0.160*"singles" + -0.149*"vocals" + -0.139*"guitar" + -0.129*"track"
2010-11-03 16:08:27,780 : INFO : topic #7(83.811): -0.248*"wikipedia" + -0.182*"keep" + 0.180*"delete" + -0.167*"articles" + -0.152*"your" + -0.150*"my" + 0.144*"film" + -0.130*"we" + -0.123*"think" + -0.120*"user"
2010-11-03 16:08:27,807 : INFO : topic #8(78.981): 0.588*"film" + 0.460*"films" + -0.130*"album" + -0.127*"station" + 0.121*"television" + 0.115*"poster" + 0.112*"directed" + 0.110*"actors" + -0.096*"railway" + 0.086*"movie"
2010-11-03 16:08:27,834 : INFO : topic #9(78.620): 0.502*"kategori" + 0.282*"categoria" + 0.248*"kategorija" + 0.234*"kategorie" + 0.172*"категория" + 0.165*"categoría" + 0.161*"kategoria" + 0.148*"categorie" + 0.126*"kategória" + 0.121*"catégorie"
 
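When reading topics like these, it is often convenient to split each one into its positively and negatively weighted words. The sketch below parses the printed text form only; `parse_topic` is a hypothetical helper, and the sample string is the first five terms of topic #7 above. Keep in mind that the overall sign of an SVD topic is arbitrary, so "positive" and "negative" are only meaningful relative to each other within one topic.

```python
import re

# Matches one term such as -0.248*"wikipedia" and captures the weight and the word.
TERM = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')


def parse_topic(topic):
    """Turn '0.644*"system" + 0.404*"user"' into [(0.644, 'system'), (0.404, 'user')]."""
    return [(float(weight), word) for weight, word in TERM.findall(topic)]


topic_7 = ('-0.248*"wikipedia" + -0.182*"keep" + 0.180*"delete" + '
           '-0.167*"articles" + -0.152*"your"')

terms = parse_topic(topic_7)
positive = [word for weight, word in terms if weight > 0]
negative = [word for weight, word in terms if weight < 0]

print(positive)  # ['delete']
print(negative)  # ['wikipedia', 'keep', 'articles', 'your']
```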

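In serial mode, creating the LSI model of Wikipedia with the one-pass algorithm takes about 5.25 hours on my laptop (OS X, C2D 2.53GHz, 4GB RAM with libVec). In distributed mode with four workers (Linux, dual-core Xeons of 2Ghz, 4GB RAM with ATLAS), the wall-clock time drops to 1 hour and 41 minutes. You can read more about the internal settings and experiments in my research paper.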
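
For a rough sense of what these two timings mean, here is the arithmetic as a short sketch. Note that the serial and distributed runs used different hardware and BLAS libraries, so the result is indicative only, not a clean scaling measurement.

```python
serial_minutes = 5.25 * 60         # 5.25 hours in serial mode
distributed_minutes = 1 * 60 + 41  # 1 hour 41 minutes with four workers
workers = 4

speedup = serial_minutes / distributed_minutes
efficiency = speedup / workers     # fraction of the ideal 4x speedup

print(f"speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
# prints: speedup 3.12x, parallel efficiency 78%
```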