[Patent Exercise 2] Doc2vec in Practice

窗边的小七酱

Tags: deep learning, natural language processing

Article link: https://blog.csdn.net/weixin_40064136/article/details/104123845

Doc2vec Training and Practice

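The previous post trained a Word2Vec model. This post trains one vector per patent (that is, Doc2Vec), so that the patents can then be used in downstream tasks such as classification and clustering.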

1 How Doc2Vec works

Reposted from: https://blog.csdn.net/fendouaini/article/details/80327250
with some simplification and editing.

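Doc2Vec is similar to word2vec, so the details are not repeated here. There are two training schemes:
1) PV-DM (Distributed Memory Model of Paragraph Vectors), which is analogous to the CBOW model in word2vec.
[Figure 1: PV-DM model]
2) PV-DBOW (Distributed Bag of Words version of Paragraph Vector), which is analogous to the skip-gram model in word2vec.
[Figure 2: PV-DBOW model]
Training:
Doc2vec differs from word2vec in that the input layer gains one extra vector, the paragraph vector. The paragraph vector can be treated as one more word vector, and it acts as a memory of the sentence as a whole.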

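Bag-of-words model	Doc2vec
Each training step takes only a small window of words from the sentence and ignores the sentence's other words.	Each training step also slides a small window over the sentence, but the paragraph vector is trained many times within the same sentence, because every training step includes it in the input.
Ignores word order. Only per-word vectors are learned, and a sentence is represented merely as the sum of its word vectors.	The paragraph vector can be regarded as the gist of the sentence.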
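
To make the two training schemes concrete, here is a minimal sketch of which inputs and targets each scheme pairs up. It is only an illustration: the function names `pv_dm_examples` and `pv_dbow_examples` are hypothetical, the window is assumed to be symmetric as in CBOW, and no actual vectors are trained.

```python
def pv_dm_examples(doc_id, words, window=2):
    """PV-DM: the paragraph id plus the context words predict the center word."""
    examples = []
    for pos, target in enumerate(words):
        left = words[max(0, pos - window):pos]
        right = words[pos + 1:pos + 1 + window]
        # The same doc_id appears in every example drawn from this document.
        examples.append((doc_id, left + right, target))
    return examples


def pv_dbow_examples(doc_id, words):
    """PV-DBOW: the paragraph id alone predicts each word of the document."""
    return [(doc_id, target) for target in words]


doc = ["a", "method", "for", "battery", "charging"]
dm = pv_dm_examples(0, doc, window=1)
dbow = pv_dbow_examples(0, doc)
print(dm[1])    # (0, ['a', 'for'], 'method')
print(dbow[1])  # (0, 'method')
```

Because the document id is part of every example, the paragraph vector receives an update at every window position, which matches the right-hand column of the table above.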

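Prediction:
1. To get the vector of a new sentence, randomly initialize its paragraph vector.
2. Feed it into the model and iterate with stochastic gradient descent until the sentence vector stabilizes.

During prediction, the model's word vectors, projection layer, and output-layer softmax weights stay fixed. Only the paragraph vector is updated at each iteration, so the paragraph vector to be predicted can be computed in very little time.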
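
The two prediction steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how gensim implements inference: it uses a PV-DBOW-style objective with a full softmax over a four-word vocabulary, and the names `infer_doc_vector`, `doc_loss`, and `out_weights` are hypothetical. The point is that the output weights are never modified, while the document vector is.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_probs(doc_vec, out_weights):
    # Score each vocabulary word as the dot product of its frozen output row and doc_vec.
    scores = [sum(w * d for w, d in zip(row, doc_vec)) for row in out_weights]
    return softmax(scores)


def doc_loss(doc_vec, out_weights, word_ids):
    """Mean negative log-likelihood of the document's words."""
    probs = word_probs(doc_vec, out_weights)
    return -sum(math.log(probs[i]) for i in word_ids) / len(word_ids)


def infer_doc_vector(out_weights, word_ids, dim, epochs=50, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: random initialization of the new paragraph vector.
    doc_vec = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
    # Step 2: stochastic gradient descent that updates doc_vec only.
    for _ in range(epochs):
        for target in word_ids:
            probs = word_probs(doc_vec, out_weights)
            grad = [0.0] * dim
            for j, row in enumerate(out_weights):
                err = probs[j] - (1.0 if j == target else 0.0)
                for k in range(dim):
                    grad[k] += err * row[k]
            doc_vec = [d - lr * g for d, g in zip(doc_vec, grad)]
    return doc_vec


# Frozen output weights (assumed values): 4 vocabulary words, 3 dimensions.
out_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]]
frozen_copy = [row[:] for row in out_weights]
word_ids = [0, 0, 1]  # the new document, as vocabulary indices

start_vec = infer_doc_vector(out_weights, word_ids, dim=3, epochs=0)
final_vec = infer_doc_vector(out_weights, word_ids, dim=3, epochs=50)
loss_before = doc_loss(start_vec, out_weights, word_ids)
loss_after = doc_loss(final_vec, out_weights, word_ids)
print(loss_before > loss_after, out_weights == frozen_copy)
```

Since only one small vector is optimized, each pass costs far less than a training pass that also updates word vectors and output weights.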

2 Project practice

The title, abstract, and first claim of each patent are concatenated and used as that patent's text.

1. Data preprocessing

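import jieba
import pandas as pd

# Assumes df_files (the patent DataFrame) and stopwordset (a set of stop words)
# were created earlier.

def DropNan(sentence):
    # Treat a missing field as an empty string.
    if pd.isna(sentence):
        return ''
    return sentence

def cutWords(sentence):
    # Segment with jieba, then drop stop words and whitespace-only tokens.
    words = [w for w in jieba.cut(sentence) if w.strip() and w not in stopwordset]
    return ' '.join(words)

# Columns: 标题 (title), 摘要 (abstract), 首项权利要求 (first claim).
# Clean each field before concatenating: adding NaN to a string yields NaN,
# which would discard the whole text of a patent with one missing field.
df_files['all_sentence'] = (df_files['标题'].apply(DropNan)
                            + df_files['摘要'].apply(DropNan)
                            + df_files['首项权利要求'].apply(DropNan))
df_files['all_sentence'] = df_files['all_sentence'].apply(lambda x: x.replace('\n', ' '))
df_files['all_sentence'] = df_files['all_sentence'].apply(cutWords)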

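gensim's Doc2vec model expects its input in a fixed format: each sample is [words, tag], that is, a tokenized sentence plus its index. Each input sentence therefore needs to be wrapped in TaggedDocument from gensim.models.doc2vec.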

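from gensim.models.doc2vec import TaggedDocument

def X_train(sentence):
    # Wrap each space-separated text in a TaggedDocument tagged with its position.
    x_train = []
    for i, text in enumerate(sentence):
        word_list = text.split()
        document = TaggedDocument(word_list, tags=[i])
        x_train.append(document)
    return x_train

# The input is assumed to be the preprocessed column built in the previous step.
result = X_train(df_files['all_sentence'])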

2. Model training

The official Doc2Vec documentation lists the parameters and what each one means:
https://radimrehurek.com/gensim/models/doc2vec.html

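from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Train the model (dm=1 selects PV-DM). gensim 4.x calls the dimensionality
# parameter vector_size; the older name size was removed.
model = Doc2Vec(result, dm=1, vector_size=300, window=8, min_count=5, workers=4, epochs=20)
# Save the model (the models/ directory must already exist)
model.save('models/ko_d2v.model')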
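
Of the parameters above, `min_count` is easy to overlook: gensim ignores every word whose total frequency in the corpus is below it. The sketch below imitates that pruning with a plain `Counter`, so the effect of a threshold can be checked before training. The helper name `build_vocab` and the toy documents are made up for illustration.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=5):
    """Keep only the words whose corpus-wide count reaches min_count."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, count in counts.items() if count >= min_count}


docs = [
    ["battery", "charging", "method"],
    ["battery", "device"],
    ["battery", "charging"],
]
vocab = build_vocab(docs, min_count=2)
print(sorted(vocab))  # ['battery', 'charging']
```

On a small patent corpus, a threshold of 5 can remove a large share of rare technical terms, so it is worth comparing the vocabulary size at a few values.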

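3. Downstream tasks

1) Output the vector representation of each document
These vectors can be used for similarity comparison, classification, clustering, and similar tasks.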

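# Load from the same path that was used when saving
model = Doc2Vec.load('models/ko_d2v.model')
# Documents most similar to the one tagged 0
# (model.dv is the gensim 4.x name; gensim 3.x uses model.docvecs)
print(model.dv.most_similar(0))
# Similarity between the documents tagged 0 and 1
print(model.dv.similarity(0, 1))
# Vector of the document tagged 10
print(model.dv[10])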
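
The similarity scores printed above are cosine similarities between document vectors. A minimal pure-Python sketch of the same computation may help when reading the numbers. The helper names and the two-dimensional toy vectors are assumptions for illustration, and this `most_similar` is a brute-force stand-in rather than gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query_id, doc_vectors, topn=2):
    """Rank all other documents by cosine similarity to the query document."""
    scores = [
        (i, cosine_similarity(doc_vectors[query_id], vec))
        for i, vec in enumerate(doc_vectors)
        if i != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy document vectors (made up): documents 0 and 1 point in nearly the same direction.
doc_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ranking = most_similar(0, doc_vectors, topn=2)
print(ranking[0][0])  # 1
```

Cosine similarity depends only on direction, so two patents of very different length can still score close to 1.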

2) Infer a sentence vector (for text not in the corpus)

# A vector can also be inferred for a sentence that is not in the corpus
words = u"喵 喵 喵"
print(model.infer_vector(words.split()))

Reference blogs:
https://blog.csdn.net/fendouaini/article/details/80327250
https://blog.csdn.net/u010417185/article/details/80654558
