Doc2Vec Model (Part 2)

Tutorial of Doc2Vec Model

Doc2Vec is a model that represents each document as a vector. This tutorial introduces the model and demonstrates how to train and evaluate it.

The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector.

There are two implementations:

  • Paragraph Vector - Distributed Memory (PV-DM). PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document's doc-vector.
  • Paragraph Vector - Distributed Bag of Words (PV-DBOW). PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)
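
The difference between the two variants lies in what is fed to the prediction step. The following is a minimal sketch of that input construction only, with no training involved. The 3-dimensional vectors and the helper `mean_vector` are hypothetical values and names chosen for illustration; they are not part of gensim.

```python
# Toy illustration only: hypothetical 3-dimensional vectors, no training.
doc_vector = [0.2, -0.1, 0.4]
context_word_vectors = [
    [0.0, 0.3, 0.1],   # e.g. the word before the target
    [0.4, 0.1, -0.2],  # e.g. the word after the target
]

def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# PV-DM: average the doc-vector together with the context word-vectors,
# then use the result to predict the center word.
pv_dm_input = mean_vector([doc_vector] + context_word_vectors)

# PV-DBOW: the doc-vector alone is used to predict a target word.
pv_dbow_input = list(doc_vector)

print(pv_dm_input)
print(pv_dbow_input)
```

In both cases the doc-vector receives gradient updates just like a word-vector would, which is what lets it end up summarizing the whole document.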

Prepare the training and test data

In this tutorial, we train the model on the Lee Background Corpus included in gensim. This corpus contains 300 documents taken from the Australian Broadcasting Corporation's news mail service, which provides text e-mails of headline stories and covers a number of broad topics.
We test the model on the Lee Corpus, which contains 50 documents.

import os
import gensim

# Set the file names for the training and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

Define a function to read and preprocess text

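We define a function that:

  • opens the training or test file (with Latin-1, i.e. ISO-8859-1, encoding)
  • reads the file line by line
  • preprocesses each line (tokenizes the text into individual words, removes punctuation, converts to lowercase, etc.)

The file we read is the corpus, and each line of the file is a document.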

To train the model, we need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.

import smart_open

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            # gensim.utils.simple_preprocess converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long
            if tokens_only:
                yield tokens
            else:
                # For training data, add a tag
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))
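
To see what the reading function produces without needing the corpus files, here is a rough standard-library sketch of the same idea. `toy_preprocess`, `toy_read_corpus` and `ToyTaggedDocument` are hypothetical stand-ins, not gensim APIs; the regular expression and the length limits of 2 and 15 characters are assumptions for this sketch and only approximate what `simple_preprocess` does.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument, for illustration only.
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])

def toy_preprocess(line, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs, drop very short or very long tokens."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def toy_read_corpus(lines, tokens_only=False):
    """Yield one document per line; tag it with its line number when training."""
    for i, line in enumerate(lines):
        tokens = toy_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            yield ToyTaggedDocument(tokens, [i])

sample_lines = [
    "Hundreds of people have been forced to vacate!",
    "A new blaze, near Goulburn.",
]
toy_train = list(toy_read_corpus(sample_lines))
toy_test = list(toy_read_corpus(sample_lines, tokens_only=True))

print(toy_train[0])  # tokens plus the tag [0]
print(toy_test[1])   # tokens only; the one-letter word "a" is dropped
```

The training variant carries a tag per document, while the test variant is a plain list of tokens, which matches the two shapes printed in the next two sections.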

Look at the training corpus

print(train_corpus[:1])
>> [TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 
'little', 'to', 'ease', 'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], tags=[0])]

Look at the test corpus

print(test_corpus[:1])

>>[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist']]

Note: the test corpus contains no tags.

Train the model

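Now we instantiate a Doc2Vec model with a vector size of 50 dimensions and iterate over the training corpus 40 times. We set the minimum word count to 2 in order to discard words with very few occurrences. (Without a variety of representative examples, retaining such infrequent words can often make a model worse!) In the published Paragraph Vector paper results, which use tens of thousands to millions of documents, typical iteration counts are 10-20. More iterations take more time and eventually reach a point of diminishing returns.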

However, this is a very small dataset (300 documents) and the documents are short (a few hundred words each). Adding training passes can sometimes help with such small datasets.

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

Build a vocabulary

model.build_vocab(train_corpus)

Essentially, the vocabulary is a dictionary (accessible via model.wv.vocab) of all the unique words extracted from the training corpus, along with their counts (e.g., model.wv.vocab['penalty'].count gives the number of occurrences of the word penalty).

print(model.wv.vocab['penalty'].count)
>> 4

print(model.wv.vocab['the'].count)
>> 4135

# Iterate over the built vocabulary
for word in model.wv.vocab:
    print(word, model.wv.vocab[word].count)

Next, train the model on the corpus. If the BLAS library is being used, this should take no more than 3 seconds. If BLAS is not being used, it should take no more than 2 minutes, so use BLAS if you value your time.

model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Now we can use the trained model to infer a vector for any piece of text by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.

vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)

>> [ 0.07382482 -0.16637728  0.08065781  0.10710249 -0.08022405  0.10134723
 -0.12217197 -0.09071762  0.01531634  0.06908066  0.07702638  0.1595575
 -0.37061992 -0.08847439  0.29466426 -0.22069848 -0.18415774  0.02725518
  0.05061761 -0.0402746   0.11257     0.06543995  0.05444453  0.13411525
  0.16597901 -0.12336837  0.12333424 -0.09284582  0.21192113  0.2717053
  0.0357861  -0.06665415  0.08196177  0.0742147   0.20340991 -0.17522307
  0.044316   -0.03971378 -0.00918944  0.12423325  0.04040724 -0.29882511
 -0.07876693 -0.24262914 -0.17023526 -0.15497623 -0.17637901  0.24168456
  0.12749192 -0.03374054]
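
Cosine similarity itself is simple to compute by hand. The sketch below uses only the standard library and hypothetical 3-dimensional vectors rather than real inferred doc-vectors; `cosine_similarity` is a helper written for this illustration, not a gensim function.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]     # same direction as v1
v3 = [-1.0, -2.0, -3.0]  # opposite direction
v4 = [3.0, 0.0, -1.0]    # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # close to -1: opposite direction
print(cosine_similarity(v1, v4))  # close to 0: unrelated direction
```

Because only the direction of the vectors matters, two documents of very different length can still score close to 1.0 if their vectors point the same way.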

Note: infer_vector() does not take a string, but rather a list of string tokens, which should already have been tokenized in the same way as the words property of the original training documents.

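Also, because the underlying training/inference algorithms are an iterative approximation problem that makes use of internal randomization, repeated inferences of the same text will return slightly different vectors.
For example, in the original tutorial the vector for this test word list is: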

>> [-0.0014455  -0.03838259  0.03199863  0.01624313  0.04130909  0.20024535
 -0.09749083  0.00597675 -0.0498415  -0.04540551  0.01723257 -0.20151177
  0.08523481 -0.08950453  0.00380471  0.10519169 -0.11385646 -0.12259311
  0.05124485  0.13983724  0.12103602 -0.2321382  -0.07852937 -0.24980102
  0.08878644 -0.1038101   0.22263823 -0.21950239 -0.31584352  0.11648487
  0.18644053 -0.08014616 -0.11723718 -0.22560167 -0.04025911  0.05705469
  0.20113727  0.12674493  0.07401953 -0.01472244  0.13031979 -0.19944443
  0.16314563 -0.05472009  0.01138415  0.09830751 -0.11751664  0.00259685
  0.11373404  0.03917272]

Evaluate the model

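To assess our new model, we first infer new vectors for each document of the training corpus, compare the inferred vectors with the training corpus, and then return the rank of each document based on self-similarity. Basically, we pretend that the training corpus is some new, unseen data and see how it compares with the trained model. The expectation is that we have likely overfit our model (i.e., all of the ranks will be less than 2), so we should be able to find similar documents very easily. Additionally, we keep track of the second-ranked document for a comparison with less similar documents.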

ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
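    # model.docvecs holds the paragraph vectors learned from the training data; there is one such vector for each unique document tag seen during training.
    # model.docvecs.most_similar() finds the top-N most similar docvecs from the training set.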
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])
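
The rank computation in the loop above is just a list lookup. A small sketch with made-up numbers shows what it does; `toy_sims` and `rank_of` are hypothetical names used only for this illustration.

```python
# Hypothetical (tag, similarity) pairs, sorted from most to least similar,
# in the same shape that most_similar() returns in the loop above.
toy_sims = [(7, 0.91), (3, 0.88), (0, 0.40), (5, -0.12)]

def rank_of(doc_id, sims):
    """Position of doc_id in the similarity-sorted list (0 = most similar)."""
    return [docid for docid, sim in sims].index(doc_id)

print(rank_of(7, toy_sims))  # document 7 is its own best match
print(rank_of(3, toy_sims))  # document 3 is only second-best
print(toy_sims[1])           # the kind of entry stored in second_ranks
```

A rank of 0 means the inferred vector is closest to the vector that was learned for the same document during training.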

Count how each document ranks with respect to the training corpus.
Note: results vary between runs due to random seeding and the very small corpus.

import collections

counter = collections.Counter(ranks)
print(counter)

>> Counter({0: 292, 1: 8})
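
The counter can be turned into a share of self-matching documents. This short sketch reuses the counts printed above; the variable names are chosen for illustration.

```python
import collections

# Counts copied from the output above: rank -> number of documents.
counter = collections.Counter({0: 292, 1: 8})

total = sum(counter.values())
self_match_share = counter[0] / total

print(total)                      # 300
print(f"{self_match_share:.1%}")  # 97.3%
```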

Basically, more than 95% of the inferred documents are found to be most similar to themselves, and the remainder (under 5%) are mistakenly most similar to another document.
Checking an inferred vector against a training vector is a sanity check of whether the model is behaving in a usefully consistent manner, though it is not a real "accuracy" value.
This is great and not entirely surprising. We can look at an example:

print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))  # print the tag and text of the current document
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
    # Most similar, second-most similar, median, and least similar

>> Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (299, 0.9439473152160645): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SECOND-MOST (104, 0.8274812698364258): «australian cricket captain steve waugh has supported fast bowler brett lee after criticism of his intimidatory bowling to the south african tailenders in the first test in adelaide earlier this month lee was fined for giving new zealand tailender shane bond an unsportsmanlike send off during the third test in perth waugh says tailenders should not be protected from short pitched bowling these days you re earning big money you ve got responsibility to learn how to bat he said mean there no times like years ago when it was not professional and sort of bowlers code these days you re professional our batsmen work very hard at their batting and expect other tailenders to do likewise meanwhile waugh says his side will need to guard against complacency after convincingly winning the first test by runs waugh says despite the dominance of his side in the first test south africa can never be taken lightly it only one test match out of three or six whichever way you want to look at it so there lot of work to go he said but it nice to win the first battle definitely it gives us lot of confidence going into melbourne you know the big crowd there we love playing in front of the boxing day crowd so that will be to our advantage as well south africa begins four day match against new south wales in sydney on thursday in the lead up to the boxing day test veteran fast bowler allan donald will play in the warm up match and is likely to take his place in the team for the second test south african captain shaun pollock expects much better performance from his side in the melbourne test we still believe that we didn play to our full potential so if we can improve on our aspects the output we put out on the field will be lot better and we still believe we have side that is good enough to beat australia on our day he said»

MEDIAN (170, 0.2509412467479706): «the united states federal reserve has cut key interest rate by quarter point to year low of per cent and left the door open to further easing to help bring the us economy out of recession it was the th cut this year to the federal funds target rate and the fourth since the september suicide attacks in new york and washington the key rate which determines overnight borrowing costs between banks is at its lowest level since july policy makers also cut the discount rate at which commercial banks can borrow from the federal reserve by the same quarter point margin to per cent economic activity remains soft with underlying inflation likely to edge lower from relatively modest levels the federal open market committee said in written statement the us economy officially slid into recession in march ending an unprecedented year expansion period the terrorist shockwave has escalated the task of rebuilding growth experts said»

LEAST (261, -0.1336527168750763): «afghan opposition leaders meeting in germany have reached an agreement after seven days of talks on the structure of an interim post taliban government for afghanistan the agreement calls for the immediate assembly of temporary group of multi national peacekeepers in kabul and possibly other areas the four afghan factions have approved plan for member ruling council composed of chairman five deputy chairmen and other members the council would govern afghanistan for six months at which time traditional afghan assembly called loya jirga would be convened to decide on more permanent structure the agreement calls for elections within two years»

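Note that the most similar document (usually the same text) has a similarity score approaching 1.0. However, the similarity score for the second-ranked document should be significantly lower (assuming the documents are in fact different), and the reasoning becomes obvious when we examine the text itself.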

We can run the next block of code repeatedly to see other target documents:

# Pick a random document from the training corpus
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print its second-most similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

>> Train Document (153): «at least two helicopters have landed near tora bora mountain in eastern afghanistan in what could be the start of raid against al qaeda fighters an afp journalist said the helicopters landed around pm local time am aedt few hours after al qaeda fighters rejected deadline set by afghan militia leaders for them to surrender or face death us warplanes have been bombing the network of caves and tunnels for eight days as part of the hunt for al qaeda leader osama bin laden several witnesses have spoken in recent days of seeing members of us or british special forces near the frontline between the local afghan militia and the followers of bin laden they could not be seen but could be clearly heard as they came into land and strong lights were seen in the same district us bombers and other warplanes staged series of attacks on the al qaeda positions in the white mountains after bin laden fighters failed to surrender all four crew members of us bomber that has crashed in the indian ocean near diego garcia have been rescued us military officials said pentagon spokesman navy captain timothy taylor said initial reports said that all four were aboard the destroyer uss russell which was rushed to the scene after the crash the bomber which usually carries crew of four and is armed with bombs and cruise missiles was engaged in the air war over afghanistan pentagon officials said they had heard about the crash just after am aedt and were unable to say whether the plane was headed to diego garcia or flying from the indian ocean island it is thought the australian arrested in afghanistan for fighting alongside the taliban is from adelaide northern suburbs but the salisbury park family of year old david hicks is remaining silent the president of adelaide islamic society walli hanifi says mr hicks approached him in having just returned from kosovo where he had developed an interest in islam he says mr hicks wanted to know more about the faith but left after 
few weeks late yesterday afternoon mr hicks salisbury park family told media the australian federal police had told them not to comment local residents confirmed member of the family called mr hicks had travelled to kosovo in recent years and has not been seen for around three years but most including karen white agree they cannot imagine mr hicks fighting for terrorist regime not unless he changed now but when he left here no he wasn he just normal teenage adult boy she said but man known as nick told channel ten he is sure the man detained in afghanistan is his friend david he says in david told him about training in the kosovo liberation army he gone through six weeks basic training how he been in the trenches you know killed few people you know confirmed kills and had few of his mates killed as well the man said»

Similar Document (50, 0.7697946429252625): «afghan security forces have arrested wounded arab al qaeda fighter but seven others with weapons and explosives remain barricaded in hospital in the southern city of kandahar spokesman for provincial governor gul agha akbar jan says the man was arrested when he left his ward one arab believed to be yemeni was taken into custody when he came out of his ward for bandage mr jan said he says the other seven are carrying weapons including pistols grenades and explosives we are trying to persuade them not to detonate their explosives and to surrender their weapons mr jan said we are concerned about the safety of other patients the arabs wounded in earlier us bombing of kandahar airport were admitted to mirwais hospital before the departure of the taliban militia earlier this month before fleeing the taliban had handed over some weapons including grenades and explosives so the arabs could protect themselves they have been threatening to blow up their hospital room if any attempt is made to arrest them»

Test the model

Using the same approach as above, we infer the vector for a randomly chosen test document and compare the document with our model by eye.

# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

>>Test Document (9): «british air raid in southern iraq left eight civilians dead and nine wounded the iraqi military said sunday the military told the official iraqi news agency that the warplanes bombed areas in basra province miles south of baghdad the central command in florida said coalition aircraft used precision guided weapons to strike two air defense radar systems near basra in response to recent iraqi hostile acts against coalition aircraft monitoring the southern no fly zone»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (141, 0.816508948802948): «united states air strikes on al qaeda fighters have intensified following the collapse of surrender talks with the northern alliance the battle for tora bora appears to be heading towards bloody climax northern alliance commanders have now abandoned all attempts to secure peaceful surrender of al qaeda militants trapped in the mountainous area of tora bora truckloads of armed men have been seen heading toward the area suggesting full scale ground attack is imminent us aircraft have been bombarding the militants position since first light effectively blocking any possible retreat around pakistani troops have fanned across the border in bid to prevent any al qaeda fighters escaping»

MEDIAN (48, 0.2500033676624298): «thousands of firefighters remain on the ground across new south wales this morning as they assess the extent of fires burning around sydney and on the state south coast firefighters are battling fire band stretching from around campbelltown south west of sydney to the royal national park hundreds of people have been evacuated from small villages to the south and south west of sydney authorities estimate more than properties have been destroyed in the greater sydney area fourteen homes have been destroyed in the hawkesbury area north of sydney and properties have been ruined at jervis bay john winter from the new south wales rural fire service says firefighters main concern is the fire band from campbelltown through to the coast that is going to be very difficult area today we do expect that the royal national park is likely to be impacted by fire later in the morning he said certainly in terms of population risk and threat to property that band is going to be our area of greatest concern in the act it appears the worst of the fire danger may have passed though strong winds are expected to keep firefighters busy today the fires have burned more than hectares over the past two days yesterday winds of up to kilometres an hour fanned blazes in dozen areas including queanbeyan connor mount wanniassa red hill and black mountain strong winds are again predicted for today but fire authorities are confident they have the resources to contain any further blazes total fire ban is in force in the act today and tomorrow emergency services minister ted quinlan has paid tribute to the efforts of firefighters there has just been whole body of people that have been magnificent in sacrificing their christmas for the benefit of the community he said»

LEAST (17, -0.3036884665489197): «spain has begun its hopman cup campaign in perth with victory over argentina arantxa sanchez vicario and tommy robredoboth won their singles matches and then teamed to win the mixed doubles sanchez vicario says she is hoping to win her second hopman cup title after winning the tournament with her brother emilio in it would be very nice to start the year off and as say it always tough but it very good start for me and looking forward with tommy to see if we can be the champions again she said today the united states will play france meanwhile world number one lleyton hewitt says he will not be putting pressure on himself to win next month australian tennis open in melbourne hewitt yesterday teamed with fellow australian alicia molik to beat switzerland in their opening tie at the hopman cup in perth hewitt says his first objective will be to reach the second week of the grand slam event think if play my best tennis and give per cent no matter who play think in with good chance of getting through to the second week and if that happens then most times in grand slam it sort of anyone tournament from there he said»