Comparing document similarity: a first try at text similarity with Doc2Vec and Word2Vec

Latest recommended article published on 2022-04-01 17:49:26

weixin_39605997


References:

https://radimrehurek.com/gensim/models/word2vec.html

Continuing from the previous post:

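import jieba

# xl is the DataFrame loaded in the previous post; '工作内容' is its job-description column.
# jieba.cut expects a single string, not a Series slice, so the first six rows
# are joined before segmenting (this assumes every row is a string).
all_list = jieba.lcut(''.join(xl['工作内容'][0:6]), cut_all=True)
print(all_list)

# Segment every row in the default (precise) mode. lcut returns a list,
# whereas cut returns a generator that can only be consumed once.
every_one = xl['工作内容'].apply(jieba.lcut)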

import traceback

def filtered_punctuations(token_list):
    """Return the tokens with whitespace and punctuation tokens removed."""
    try:
        punctuations = [' ', '\n', '\t', ',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%', '：',
                        '/', '\xa0', '。', '；', '、']
        token_list_without_punctuations = [word for word in token_list
                                           if word not in punctuations]
        return token_list_without_punctuations
    except Exception:
        # print_exc() writes the traceback itself and returns None,
        # so wrapping it in print() would only print an extra "None".
        traceback.print_exc()
        return []

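from gensim.models import Doc2Vec, Word2Vec

def list_crea(everyone):
    """Filter each segmented document and collect the results in one list."""
    list_word = []
    for k in everyone:
        fenci = filtered_punctuations(k)  # fenci: the segmented tokens of one document
        list_word.append(fenci)
    return list_word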

aa_word = list_crea(every_one)
print(type(aa_word))
# aa_word is a nested list with one inner list of tokens per document,
# shaped like [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
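Word2Vec takes its training corpus as one list of tokens per document. The following is a minimal sketch of that shape using only the standard library. The sample tokens and the names `PUNCTUATION`, `drop_punctuation`, `segmented_docs` and `corpus` are hypothetical stand-ins for jieba's output and the filter above, not the data from the post.

```python
# Hypothetical punctuation set, a shortened version of the list used above.
PUNCTUATION = {' ', '\n', '\t', ',', '.', '：', '。', '；', '、'}

def drop_punctuation(tokens):
    """Keep only the tokens that are not whitespace or punctuation."""
    return [t for t in tokens if t not in PUNCTUATION]

# Hand-written, already segmented documents (made up for illustration).
segmented_docs = [
    ['负责', 'java', '开发', '。'],
    ['熟悉', '计算机', '、', '网络', '、', '数据库'],
]

# One inner list per document: the nested shape passed to Word2Vec.
corpus = [drop_punctuation(doc) for doc in segmented_docs]
print(corpus)
```

Note that a token is dropped only if it matches an entry exactly, so any punctuation mark missing from the list (a full-width comma, for instance) would stay in the corpus.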

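model = Word2Vec(aa_word, min_count=1)  # train the model; see the official gensim documentation linked above
# Vectors live on model.wv; the older model['java'] form is deprecated and fails in gensim 4.
say_vector = model.wv['java']  # get the vector for a word
# Cosine similarity between two words ('计算' = "compute", '计算机' = "computer").
model.wv.similarity('计算', '计算机')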
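The similarity score returned for two words is the cosine of the angle between their vectors. As a rough illustration of that calculation, here is a sketch in plain Python. The three-dimensional vectors are made up for the example (a trained model uses 100 dimensions by default), and `cosine_similarity` is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
vec_a = [1.0, 2.0, 2.0]
vec_b = [2.0, 4.0, 4.0]    # same direction as vec_a
vec_c = [2.0, -1.0, 0.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # parallel vectors score 1
print(cosine_similarity(vec_a, vec_c))  # orthogonal vectors score 0
```

Scores close to 1 mean the two words appear in similar contexts in the training corpus. With a corpus as small as a handful of job descriptions, the scores are noisy and should be read with caution.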

Reposted from: https://blog.51cto.com/13000661/2121671
