Using the Gensim Library: the Doc2Vec Model (Part 1), Introduction and Usage

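The Doc2Vec Model

This post uses the Lee corpus to introduce the Doc2Vec model in Gensim.

Doc2Vec is a model that converts each document into a vector. Note that it is the whole document that becomes a single vector!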

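Paragraph Vector Model

Le and Mikolov introduced the Doc2Vec algorithm in 2014. It usually outperforms the simpler approach of averaging the Word2Vec vectors of a document's words.

Gensim's Doc2Vec class implements this algorithm.

There are two implementations:

1. Paragraph Vector - Distributed Memory (PV-DM)

2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)

How the two differ:

PV-DM is analogous to Word2Vec's CBOW. The document vectors are obtained by training a neural network on the task of predicting a center word from the average of the context word vectors and the vector of the whole document.

PV-DBOW is analogous to Word2Vec's skip-gram (SG). The document vectors are obtained by training a neural network on the task of predicting a target word from the document's vector (doc-vector) alone.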
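To make the difference concrete, here is a toy sketch in plain Python of what each variant feeds into its prediction step. The vectors and the names (doc_vector, context_word_vectors, mean_vectors) are made up for illustration; this is not how Gensim implements training internally.

```python
# Toy vectors, invented for illustration; a real model learns these during training.
doc_vector = [0.2, -0.4, 0.6]
context_word_vectors = [
    [0.1, 0.3, -0.5],  # e.g. the vector of a context word such as "forest"
    [0.5, -0.1, 0.1],  # e.g. the vector of a context word such as "fires"
]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# PV-DM: the doc-vector is averaged together with the context word vectors,
# and the result is used to predict the center word (CBOW-like).
pv_dm_input = mean_vectors(context_word_vectors + [doc_vector])

# PV-DBOW: the doc-vector alone is used to predict a target word (skip-gram-like).
pv_dbow_input = doc_vector

print(pv_dm_input)
print(pv_dbow_input)
```

In Gensim the variant is chosen with the dm parameter of Doc2Vec: dm=1 (the default) selects PV-DM and dm=0 selects PV-DBOW.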

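Preparing the Training and Test Data

We train the model on the Lee Background Corpus that ships with Gensim. It is described as containing 314 documents selected from the Australian Broadcasting Corporation's news mail service, which provides text e-mails of headline stories and covers a wide range of topics. (The file loaded in this post yields 300 documents, as the corpus_count output further down shows.)

We test the model on the Lee Corpus, which contains only 50 documents.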

import os
import gensim
lee_train_file = './lee_background.cor'
lee_test_file = './lee.cor'

 

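Define a function to read and preprocess the text

Next, we define a function that:

1. Opens the training/test file (with latin encoding)

2. Reads the file line by line

3. Preprocesses each line (tokenizes the text into individual words, removes punctuation, lowercases, and so on)

Note: the whole file is the corpus, and each line of the file is one document.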

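import smart_open

def read_corpus(file_name, tokens_only=False):
    # The corpus files are Latin-1 (ISO-8859-1) encoded; each line is one document
    with smart_open.open(file_name, encoding='iso-8859-1') as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                # Test data: plain token lists
                yield tokens
            else:
                # Training data: tag each document with its line number
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

 

We can print the first two documents of the training and test corpora (only the words of each training document are shown):

print([doc.words for doc in train_corpus[:2]])
print(test_corpus[:2])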
[['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 'little', 'to', 'ease', 
'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], ['indian', 'security', 'forces', 'have', 'shot', 'dead', 'eight', 'suspected', 'militants', 'in', 'night', 'long', 'encounter', 'in', 'southern', 'kashmir', 'the', 'shootout', 'took', 'place', 'at', 'dora', 'village', 'some', 'kilometers', 'south', 'of', 'the', 'kashmiri', 'summer', 'capital', 'srinagar', 'the', 'deaths', 'came', 'as', 'pakistani', 'police', 'arrested', 'more', 'than', 'two', 'dozen', 'militants', 'from', 'extremist', 'groups', 'accused', 'of', 'staging', 'an', 'attack', 'on', 'india', 'parliament', 'india', 'has', 'accused', 'pakistan', 'based', 'lashkar', 'taiba', 'and', 'jaish', 'mohammad', 'of', 'carrying', 'out', 'the', 'attack', 'on', 'december', 'at', 'the', 'behest', 'of', 'pakistani', 'military', 'intelligence', 'military', 'tensions', 'have', 'soared', 'since', 'the', 'raid', 'with', 'both', 'sides', 'massing', 'troops', 'along', 'their', 'border', 'and', 'trading', 'tit', 'for', 'tat', 'diplomatic', 'sanctions', 'yesterday', 'pakistan', 'announced', 'it', 'had', 'arrested', 'lashkar', 'taiba', 'chief', 'hafiz', 'mohammed', 'saeed', 'police', 'in', 'karachi', 'say', 'it', 'is', 'likely', 'more', 'raids', 'will', 'be', 'launched', 'against', 'the', 'two', 'groups', 'as', 'well', 'as', 'other', 'militant', 'organisations', 'accused', 'of', 'targetting', 'india', 
'military', 'tensions', 'between', 'india', 'and', 'pakistan', 'have', 'escalated', 'to', 'level', 'not', 'seen', 'since', 'their', 'war']]
[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to', 'june', 'chief', 'executive', 'paul', 'batchelor', 'said', 'the', 'result', 'was', 'solid', 'in', 'what', 'he', 'described', 'as', 'the', 'worst', 'conditions', 'for', 'stock', 'markets', 'in', 'years', 'amp', 'half', 'year', 'profit', 'sank', 'per', 'cent', 'to', 'million', 'or', 'share', 'as', 'australia', 'largest', 'investor', 'and', 'fund', 'manager', 'failed', 'to', 'hit', 'projected', 'per', 'cent', 'earnings', 'growth', 'targets', 'and', 'was', 'battered', 'by', 'falling', 'returns', 'on', 'share', 'markets']]
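The printed tokens contain no numbers and no one-letter words, which is a side effect of the preprocessing step. As a rough illustration, the behaviour of gensim.utils.simple_preprocess can be approximated with the standard library. The function below (approx_simple_preprocess, a hypothetical name) is my own stand-in under the assumption of the default length limits of 2 and 15 characters, not Gensim's actual implementation, and the sample sentence is made up.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (not the real implementation)."""
    # Lowercase, then keep runs of letters/underscores; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens that are too short, too long, or start with an underscore.
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]


sample = "A new blaze near Goulburn closed the Hume Highway at about 2 pm AEDT."
print(approx_simple_preprocess(sample))
```

The single-letter word "A" and the number "2" are both dropped, while everything else is lowercased and kept.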

 

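Training the Model

Next, we instantiate a Doc2Vec model with 50-dimensional vectors and 40 training epochs. We set the minimum word count to 2 to discard words that occur only once.

More epochs take more time, and eventually reach a point of diminishing returns.

Because this dataset is small (only 300 documents) and the documents are fairly short (a few hundred words each), adding training passes can help on a dataset like this.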

model = gensim.models.doc2vec.Doc2Vec(vector_size=50,min_count = 2,epochs= 40)
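The effect of min_count can be sketched with a plain word count. This is only an illustration on a made-up toy corpus (toy_corpus is a hypothetical name); Gensim does its own counting when the vocabulary is built.

```python
from collections import Counter

# Made-up toy corpus: each inner list is one tokenized document.
toy_corpus = [
    ["fire", "crews", "battle", "fire"],
    ["fire", "ban", "in", "force"],
    ["crews", "on", "standby"],
]
min_count = 2

# Count every token across all documents.
counts = Counter(token for doc in toy_corpus for token in doc)

# Keep only the words that occur at least min_count times.
vocabulary = sorted(word for word, count in counts.items() if count >= min_count)
print(vocabulary)  # ['crews', 'fire']
```

Words that occur only once in the whole corpus carry little signal for training, which is why dropping them is a common choice.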

Build the vocabulary

model.build_vocab(train_corpus)

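Essentially, the vocabulary is a list (accessible via model.wv.index_to_key) of all the unique words extracted from the training corpus.

Additional attributes for each word are available through the model.wv.get_vecattr() method.

For example, to see how many times the word penalty appeared in the training corpus:

# This call raised an error when I ran it

print(f"word 'penalty' appeared {model.wv.get_vecattr('penalty','count')} times in the training corpus")

My guess is that this is a version difference: get_vecattr() was added in Gensim 4.0, and older releases exposed the same count as model.wv.vocab['penalty'].count.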

 

Next, we train the model on the corpus. Training needs the total number of examples (300) and the number of epochs:

print(model.corpus_count)
print(model.epochs)
300
40

Now run the training:

model.train(train_corpus,total_examples=model.corpus_count,epochs=model.epochs)

Now we can pass a piece of text to the trained model with model.infer_vector() to obtain the vector for that text:

document=['only', 'you', 'can', 'prevent', 'forest', 'fires']
vector = model.infer_vector(document)
print(vector)
[ 0.01459495 -0.12987736 -0.00955497  0.28156355  0.08839367 -0.01502135
  0.10272299 -0.23317005  0.09937829 -0.08008339 -0.24118152 -0.02030749
 -0.03038951 -0.13531376 -0.02082395  0.01122938 -0.01497335 -0.02043401
 -0.06758485 -0.25644922  0.00335516  0.16214134 -0.04422463  0.00353207
 -0.09650277  0.14667782 -0.0168256  -0.26699612 -0.01597765 -0.30470508
  0.14583154 -0.06073249 -0.09766442  0.00282319 -0.0465058  -0.15031087
  0.11314052 -0.11855821 -0.01786134  0.2063379  -0.08481903  0.13462573
 -0.05411615  0.06104388  0.1578157  -0.00915051 -0.1533627   0.12898561
  0.13229387  0.07068578]

Note: infer_vector() does not take a string. Pass it a list of tokens, tokenized the same way as the training documents.
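A small plain-Python illustration of why the distinction matters: a string is itself a sequence, but a sequence of characters, not of words. Depending on the Gensim version, a raw string passed to infer_vector() may be rejected outright or treated as a list of one-character "words"; either way it is not what we want. The variable names here are only for illustration.

```python
text = "only you can prevent forest fires"

# Iterating over a string yields single characters, not words.
as_characters = list(text)[:4]
print(as_characters)  # ['o', 'n', 'l', 'y']

# Splitting on whitespace first yields the word tokens a model expects.
as_tokens = text.split()
print(as_tokens)  # ['only', 'you', 'can', 'prevent', 'forest', 'fires']
```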

 

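Evaluating the Model

To evaluate our new model, we first infer a new vector for every document in the training corpus, compare each inferred vector against the training corpus, and then return the rank of the document based on self-similarity.

Basically, we pretend that the training corpus is a batch of new, unseen data and then see how it compares with the trained model.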

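ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    # Re-infer a vector for each training document as if it were unseen
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # Rank all training documents by similarity (model.docvecs is named model.dv in Gensim 4.x)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    # Position of the document itself in that ranking (0 = most similar)
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    second_ranks.append(sims[1])
print(ranks)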
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
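What the similarity search and the rank lookup are doing can be sketched without Gensim. The vectors below are invented two-dimensional stand-ins (trained and inferred are hypothetical names), and the similarity measure assumed here is plain cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented "trained" doc-vectors, keyed by document id.
trained = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
# Invented re-inferred vector for document 0: close to trained[0], but not identical.
inferred = [0.9, 0.1]

# Sort all documents by similarity to the inferred vector, most similar first.
sims = sorted(
    ((doc_id, cosine(inferred, vec)) for doc_id, vec in trained.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

# Rank of document 0 in its own similarity list (0 means it found itself first).
rank = [doc_id for doc_id, _ in sims].index(0)
print(sims)
print(rank)
```

A rank of 0 means the re-inferred vector is closest to the document's own trained vector, which is the behaviour we hope to see for most documents.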

 

import collections
counter = collections.Counter(ranks)
print(counter)
Counter({0: 291, 1: 9})

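As shown, for 291 of the 300 documents the most similar document is the document itself (rank 0); only 9 documents find themselves in second place (rank 1).

That is a top-1 accuracy of 97% (291/300), above 95%!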
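The percentage can be computed directly from the counter. This small sketch simply reuses the two counts printed above.

```python
from collections import Counter

# Rank counts taken from the printed Counter output.
counter = Counter({0: 291, 1: 9})

total = sum(counter.values())
top1_accuracy = counter[0] / total

print(total)                   # 300
print(f"{top1_accuracy:.1%}")  # 97.0%
```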

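Here is one more example.

For the first document (doc_id 0), we look at the most similar, second most similar, median and least similar documents among its top 10 matches (topn=10):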

doc_id = 0
inferred_vector = model.infer_vector(train_corpus[doc_id].words)
sims = model.docvecs.most_similar([inferred_vector],topn=10)
print('Document({}):《{}》\n'.format(doc_id,' '.join(train_corpus[doc_id].words)))
for label,index in [('MOST',0),('SECOND-MOST',1),('MEDIAM',len(sims)//2),('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
Document(0):《hundreds of people have been forced to vacate their homes in the southern highlands of new south wales as strong winds today pushed huge bushfire towards the town of hill top new blaze near goulburn south west of sydney has forced the closure of the hume highway at about pm aedt marked deterioration in the weather as storm cell moved east across the blue mountains forced authorities to make decision to evacuate people from homes in outlying streets at hill top in the new south wales southern highlands an estimated residents have left their homes for nearby mittagong the new south wales rural fire service says the weather conditions which caused the fire to burn in finger formation have now eased and about fire units in and around hill top are optimistic of defending all properties as more than blazes burn on new year eve in new south wales fire crews have been called to new fire at gunning south of goulburn while few details are available at this stage fire authorities says it has closed the hume highway in both directions meanwhile new fire in sydney west is no longer threatening properties in the cranebrook area rain has fallen in some parts of the illawarra sydney the hunter valley and the north coast but the bureau of meteorology claire richards says the rain has done little to ease any of the hundred fires still burning across the state the falls have been quite isolated in those areas and generally the falls have been less than about five millimetres she said in some places really not significant at all less than millimetre so there hasn been much relief as far as rain is concerned in fact they ve probably hampered the efforts of the firefighters more because of the wind gusts that are associated with those thunderstorms》

MOST (0, 0.9582430124282837): «hundreds of people have been forced to vacate their homes in the southern highlands of new south wales as strong winds today pushed huge bushfire towards the town of hill top new blaze near goulburn south west of sydney has forced the closure of the hume highway at about pm aedt marked deterioration in the weather as storm cell moved east across the blue mountains forced authorities to make decision to evacuate people from homes in outlying streets at hill top in the new south wales southern highlands an estimated residents have left their homes for nearby mittagong the new south wales rural fire service says the weather conditions which caused the fire to burn in finger formation have now eased and about fire units in and around hill top are optimistic of defending all properties as more than blazes burn on new year eve in new south wales fire crews have been called to new fire at gunning south of goulburn while few details are available at this stage fire authorities says it has closed the hume highway in both directions meanwhile new fire in sydney west is no longer threatening properties in the cranebrook area rain has fallen in some parts of the illawarra sydney the hunter valley and the north coast but the bureau of meteorology claire richards says the rain has done little to ease any of the hundred fires still burning across the state the falls have been quite isolated in those areas and generally the falls have been less than about five millimetres she said in some places really not significant at all less than millimetre so there hasn been much relief as far as rain is concerned in fact they ve probably hampered the efforts of the firefighters more because of the wind gusts that are associated with those thunderstorms»

SECOND-MOST (48, 0.8999849557876587): «thousands of firefighters remain on the ground across new south wales this morning as they assess the extent of fires burning around sydney and on the state south coast firefighters are battling fire band stretching from around campbelltown south west of sydney to the royal national park hundreds of people have been evacuated from small villages to the south and south west of sydney authorities estimate more than properties have been destroyed in the greater sydney area fourteen homes have been destroyed in the hawkesbury area north of sydney and properties have been ruined at jervis bay john winter from the new south wales rural fire service says firefighters main concern is the fire band from campbelltown through to the coast that is going to be very difficult area today we do expect that the royal national park is likely to be impacted by fire later in the morning he said certainly in terms of population risk and threat to property that band is going to be our area of greatest concern in the act it appears the worst of the fire danger may have passed though strong winds are expected to keep firefighters busy today the fires have burned more than hectares over the past two days yesterday winds of up to kilometres an hour fanned blazes in dozen areas including queanbeyan connor mount wanniassa red hill and black mountain strong winds are again predicted for today but fire authorities are confident they have the resources to contain any further blazes total fire ban is in force in the act today and tomorrow emergency services minister ted quinlan has paid tribute to the efforts of firefighters there has just been whole body of people that have been magnificent in sacrificing their christmas for the benefit of the community he said»

MEDIAM (255, 0.8344497680664062): «the new south wales state emergency service ses says it has now received calls for help in the wake of monday fierce storms natural disaster areas have been declared throughout sydney and surrounding areas and parts of the state north west in sydney more than homes mainly in the northern suburbs remain without power ses spokeswoman laura goodin says several hundred volunteers will be back in the field this morning we ve had about calls for help of which we ve completed about two thirds we ve had about volunteers in the field being helped out by the royal fire service and the new south wales fire brigades and we re expecting to have most jobs completed by about friday ms goodin said the extensive storm damage has prompted warning about people falsely claiming to work for the ses the warning from fair trading minister john aquilina follows reports from the suburb of hornsby that people claiming to work for the ses are asking for payment from the storm victims mr aquilina has reminded householders that the ses is volunteer organisation and does not charge for its work or employ sub contractors he has suggested residents contact the police if they are approached by such people the government is also warning householders against dealing with unlicensed tradespeople»

LEAST (105, 0.7385852932929993): «fresh palls of smoke are billowing from the woomera detention centre in south australia far north trouble at the centre has entered day three with plume of smoke metres high into the air and up to metres across the compound this morning thirteen buildings were either destroyed or damaged by fire on monday night overnight fires and rioting appeared to have abated just after midnight local time three fire crews one ambulance and several police have attended the scene water cannon and three tear gas canisters were used to subdue detainees who throughout the night were thought to be chanting visa it is not known whether anyone has been injured or arrested overnight the acting immigration minister daryl williams says the government is not losing control of woomera he has told channel nine vandalism is not going to get visas for the detainees the detainees who have been provided with very good facilities and who to our knowledge have absolutely no complaint about the facilities there are engaging in this campaign of damaging and destroying buildings in order to put pressure on the australian authorities to grant them visa he said there is plea for so called high risk detainees to be separated from the rest of the population ath the woomera detention centre in the wake of continued disturbances there south australian labor mp lyn breuer whose electorate covers woomera says higher risk detainees must be separated from women and children at the centre think that will probably have to be the ultimate solution we will have to send high risk detainees to other areas she said we can keep them in an environment where there are young children there it all very nasty situation and have particular concerns for the people that are guarding them as well because one of them is going to get hurt very badly very soon»

 
