A Python Implementation of Dynamic Topic Models

An Introduction to Dynamic Topic Models (DTM)

Dynamic Topic Models come from Blei and Lafferty's paper Dynamic Topic Models, published at the 23rd International Conference on Machine Learning (ICML 2006). Unlike the earlier Latent Dirichlet Allocation (LDA) model, DTM introduces time into the model, so that it can capture how the topics of a corpus evolve dynamically over time.
In LDA, the documents of a corpus carry no notion of temporal order; much as the bag-of-words model ignores the order of words, the K topics of the corpus are taken to be fixed throughout the modeling process. In DTM, each document has a time attribute, and documents are ordered. DTM assumes that topics evolve dynamically from one period to the next. For example, if the corpus contains a "music" topic, what that topic covered in the 1980s is surely different from what it covers today.

The probabilistic graphical model of DTM is as follows:
[Figure: probabilistic graphical model of DTM]
The generative process of DTM is as follows:

In DTM, the K topics of the corpus evolve continuously across time slices. The first two steps of the generative process show that the doc-topic distribution and the topic-word distributions at stage t are evolved from those at stage t-1. Because the Dirichlet distribution widely used in other topic models (such as LDA) is not amenable to this kind of sequential modeling, the paper chains the distributions through Gaussian noise on their natural parameters instead.
[Figure: the generative process of DTM]
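Concretely, in the notation of Blei and Lafferty (2006), the topic-word distributions \beta and the per-slice topic proportions \alpha are chained through their natural parameters:

\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k},\, \sigma^2 I), \qquad \alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1},\, \delta^2 I)

Each document in slice t is then generated by drawing \eta \sim \mathcal{N}(\alpha_t, a^2 I), and for each word a topic z_n \sim \mathrm{Mult}(\pi(\eta)) and a word w_n \sim \mathrm{Mult}(\pi(\beta_{t, z_n})), where \pi(\beta)_w = \exp(\beta_w) / \sum_{w'} \exp(\beta_{w'}) maps natural parameters back onto the simplex.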
Readers who want the full details should consult Blei's paper, Dynamic Topic Models. The authors released the original implementation, dtm, but it is fairly hard to work with; if you are interested, download the source and study it. The focus of this post is what follows: implementing DTM in Python via the corresponding module of the NLP toolkit Gensim.

Implementing Dynamic Topic Models

Data and Preprocessing

The dataset is the 1,324 documents provided on GitHub, divided into three months.
First, the documents need to be merged: the three months of news documents go into a single txt file, one document per line. I first did the merging and punctuation stripping in Java; the code is as follows:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class newsToDoc {
	public static void main(String[] args) {
		File file = new File("newsData\\sample");
		File newsOut = new File("newsData\\newOut.txt");
		File[] files = file.listFiles();
		String line;
		try {
			BufferedWriter bfw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(newsOut), "UTF-8"));
			for (int i = 0; i < files.length; i++) {
				BufferedReader bfr = new BufferedReader(new InputStreamReader(new FileInputStream(files[i]), "UTF-8"));
				while ((line = bfr.readLine()) != null) {
					// strip ASCII and full-width punctuation
					line = line.replaceAll("[`~!@#$%^&*()+=|{}':;',\\[\\].<>/?~!\"\"?@#¥%……&;*()——+|{}《》【】‘;:’。,、|-]", "");
					bfw.append(line);
				}
				bfr.close();   // close each reader, not just the last one
				bfw.newLine(); // one merged document per output line
				bfw.flush();
			}
			bfw.close();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

You can also merge the documents and strip punctuation by other means yourself. After this processing, the 1,324 news articles, originally split across three months, have been merged into myCorpus.txt.
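If you would rather stay in Python, here is a minimal sketch of the same merge-and-strip step. The input directory newsData/sample mirrors the Java example above, and the set of full-width punctuation marks is illustrative rather than exhaustive:

import os
import re
import string

# punctuation to strip: ASCII marks plus some common full-width ones
punct = re.compile('[' + re.escape(string.punctuation + '“”‘’。,、;:?!《》【】……') + ']')

with open('datasets/myCorpus.txt', 'w', encoding='utf-8') as out:
    # process files in a deterministic order; DTM's time_slice argument
    # later assumes the corpus is sorted by time (month 1 first)
    for name in sorted(os.listdir('newsData/sample')):
        with open(os.path.join('newsData/sample', name), encoding='utf-8') as f:
            text = f.read().replace('\n', ' ')  # collapse each file to one line
        out.write(punct.sub('', text) + '\n')   # one document per output line

Whichever route you take, keep the documents in chronological order, because the time_slice parameter used later simply counts documents per period.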

Python Implementation

With the documents preprocessed, we call Gensim from Python to implement DTM.
First, import the relevant modules:

import logging
from gensim import corpora
from six import iteritems
from gensim.models import ldaseqmodel
from gensim.corpora import Dictionary, bleicorpus

Next, we need to turn myCorpus.txt into the corpus format the DTM model expects, and build the dictionary:

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # log progress so we can see how the run is going
stoplist = set('a able about above according i accordingly across actually after afterwards again against ain’t all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren’t around as a’s aside ask asking associated at available away awfully be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by came can cannot cant can’t cause causes certain certainly changes clearly c’mon co com come comes concerning consequently consider considering contain containing contains corresponding could couldn’t course c’s currently definitely described despite did didn’t different do does doesn’t doing done don’t down downwards during each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except far few fifth first five followed following follows for former formerly forth four from further furthermore get gets getting given gives go goes going gone got gotten greetings had hadn’t happens hardly has hasn’t have haven’t having he hello help hence her here hereafter hereby herein here’s hereupon hers herself he’s hi him himself his hither hopefully how howbeit however i’d ie if ignored i’ll i’m immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn’t it it’d it’ll its it’s itself i’ve just keep keeps kept know known knows last lately later latter latterly least less lest let let’s like liked likely little look looking looks ltd mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere obviously of off often oh ok okay old on once one ones only onto or other others otherwise ought our ours ourselves out outside over overall own particular particularly per perhaps placed please plus possible presumably probably provides que quite qv rather rd re really reasonably regarding regardless regards relatively respectively right said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn’t since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure take taken tell tends th than thank thanks thanx that thats that’s the their theirs them themselves then thence there thereafter thereby therefore therein theres there’s thereupon these they they’d they’ll they’re they’ve think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying t’s twice two un under unfortunately unless unlikely until unto up upon us use used useful uses using usually value various very via viz vs want wants was wasn’t way we we’d welcome well we’ll went were we’re weren’t we’ve what whatever what’s when whence whenever where whereafter whereas whereby wherein where’s whereupon wherever whether which while whither who whoever whole whom who’s whose why will willing wish with within without wonder won’t would wouldn’t yes yet you you’d you’ll your you’re yours yourself yourselves you’ve zero zt ZT zz ZZ'.split())
# build the dictionary, removing stopwords and words that occur only once in the documents
dictionary = corpora.Dictionary(line.lower().split() for line in open('datasets/myCorpus.txt'))
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # drop stopwords and words that occur only once
dictionary.compactify()  # reassign ids to close the gaps left by the removed tokens
dictionary.save('datasets/news_dictionary')  # save the dictionary
# stream the documents in to build the corpus
class MyCorpus(object):
    def __iter__(self):
        for line in open('datasets/myCorpus.txt'):
            yield dictionary.doc2bow(line.lower().split())
corpus_memory_friendly = MyCorpus()
corpus = [vector for vector in corpus_memory_friendly]  # materialize the streamed documents as a corpus
corpora.BleiCorpus.serialize('datasets/news_corpus', corpus)  # store the corpus in Blei's lda-c format

With that done, we have the dictionary and corpus in the form DTM needs; next, load them into the model:

try:
    dictionary = Dictionary.load('datasets/news_dictionary')
except FileNotFoundError:
    raise ValueError("SKIP: Please download the Corpus/news_dictionary dataset.")
corpus = bleicorpus.BleiCorpus('datasets/news_corpus')
# time slices of the corpus: three periods here, with 438 news items in the first,
# 430 in the second, and 456 in the third; the counts must sum to the corpus size,
# and the documents must already be ordered by time
time_slice = [438, 430, 456]
num_topics = 5  # number of topics, 5 here
# load the corpus, dictionary, and parameters into the model and train
ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=num_topics)
corpusTopic = ldaseq.print_topics(time=0)  # topic distributions in a given period, here the first
print(corpusTopic)
topicEvolution = ldaseq.print_topic_times(topic=0)  # evolution of a given topic across periods, here the first topic
print(topicEvolution)
doc = ldaseq.doc_topics(0)  # topic distribution of a given document, here the first
print(doc)

That wraps up my introduction to the Python implementation of DTM. Gensim's DTM model exposes further methods as well; to go deeper, see the Dynamic Topic Models Tutorial.
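For example, the tutorial shows how to export one time slice for visualization with pyLDAvis via dtm_vis, and how to score topics with Gensim's CoherenceModel via dtm_coherence. A minimal sketch, assuming pyLDAvis is installed (method availability may vary with your Gensim version):

import pyLDAvis
from gensim.models.coherencemodel import CoherenceModel

# visualize the topics of the first time slice with pyLDAvis
doc_topic, topic_term, doc_lengths, term_frequency, vocab = ldaseq.dtm_vis(time=0, corpus=corpus)
vis = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic,
                       doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)
pyLDAvis.save_html(vis, 'dtm_slice0.html')

# u_mass coherence of the first slice's topics
cm = CoherenceModel(topics=ldaseq.dtm_coherence(time=0), corpus=corpus,
                    dictionary=dictionary, coherence='u_mass')
print(cm.get_coherence())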
