word2vec

1. Corpus preprocessing

# read in the raw novel text (GB18030-encoded)
novel=open('C:\\Users\\CDAer\\Desktop\\西游记.txt',mode='r',encoding='gb18030')
content=novel.read()
# segment the text with jieba (exact mode, HMM enabled for unknown words)
import jieba
cutword=jieba.lcut(content,cut_all=False,HMM=True)
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\CDAer\AppData\Local\Temp\jieba.cache
Loading model cost 1.233 seconds.
Prefix dict has been built succesfully.
cutword[:5]
['第一回', '\n', '灵根育孕', '源流', '出']
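A side note on segmentation modes: cut_all=False (exact mode, used above) returns one best segmentation, while cut_all=True (full mode) lists every word jieba can find, with overlaps. A minimal comparison on a short illustrative phrase:
# exact mode vs. full mode on a short example phrase
print(jieba.lcut('花果山福地,水帘洞洞天',cut_all=False))   # exact mode: one segmentation
print(jieba.lcut('花果山福地,水帘洞洞天',cut_all=True))    # full mode: all candidate words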
# drop the newline tokens produced by segmentation
cutwords=[]
for word in cutword:
    if word!='\n':
        cutwords.append(word)
# load the stopword list (one stopword per line)
import pandas as pd
pd.read_csv?
stopwords=pd.read_csv(r'C:\Users\CDAer\Desktop\data\stopwords.txt',sep='\t',
                     quoting=3,names=['stopword'])
stopwords.head(3)
  stopword
0        !
1        "
2        #
stopwords_list=stopwords['stopword'].values.tolist()
# open the output file for the segmented text
novel_seg=open('C:\\Users\\CDAer\\Desktop\\data\\西游记_seg.txt','w',encoding='utf-8')
# remove stopwords and write the result as one space-separated line
words=''
for word in cutwords:
    if word in stopwords_list:
        continue
    else:
        words=words+' '+word
print(words,file=novel_seg)
novel.close()
novel_seg.close()
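A note on performance: checking word in stopwords_list scans a Python list every time, and repeated string concatenation copies the growing string, so the loop above can be slow on a whole novel. A minimal alternative sketch using a set for lookups and a single str.join (same file path as above):
# faster variant of the cleaning step above: set membership + one join
stopwords_set=set(stopwords_list)
clean_words=[word for word in cutwords if word not in stopwords_set]
with open('C:\\Users\\CDAer\\Desktop\\data\\西游记_seg.txt','w',encoding='utf-8') as f:
    f.write(' '.join(clean_words))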

2. word2vec word vectors

from gensim.models import word2vec
# inspect the Word2Vec API before feeding it the corpus
word2vec.Word2Vec?
# helpers for reading the preprocessed data back in
word2vec.LineSentence?
word2vec.Text8Corpus?
# load the training corpus (the segmented file is one long space-separated line)
text=word2vec.Text8Corpus('C:\\Users\\CDAer\\Desktop\\data\\西游记_seg.txt')
# train a skip-gram model (sg=1): 200-dimensional vectors, window of 5,
# words appearing fewer than 10 times are ignored
model=word2vec.Word2Vec(text,size=200,min_count=10,sg=1,window=5)
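For comparison, sg=0 (gensim's default) trains a CBOW model instead of skip-gram. A minimal sketch with the same hyperparameters, assuming the gensim 3.x API used here (in gensim 4.x the size argument is named vector_size):
# CBOW counterpart to the skip-gram model above
model_cbow=word2vec.Word2Vec(text,size=200,min_count=10,sg=0,window=5)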
# look up the feature vector of a single word
model.wv['大圣']
array([-0.1533685 , -0.08057445,  0.0176796 , -0.10280587, -0.12772673,
       -0.03973517, -0.02112414, -0.18869366,  0.14949718,  0.09318699,
        0.25933027,  0.257369  , -0.15355076,  0.10930856, -0.18674324,
       -0.05865503, -0.15077035,  0.01238145,  0.01634889, -0.0246722 ,
        0.05444198, -0.08243156,  0.03422857, -0.04964158,  0.01489304,
       -0.10621809,  0.03272914,  0.02671562,  0.07172433, -0.03783002,
        0.05618566, -0.02103374, -0.01358261,  0.1813264 ,  0.08818752,
       -0.14293073, -0.13122404,  0.05792145,  0.08821993, -0.14294365,
       -0.21792363, -0.08709765,  0.1138996 , -0.05943882,  0.09594493,
       -0.17882566, -0.08118752, -0.1526768 ,  0.0168893 , -0.07024644,
        0.12232823, -0.2549806 ,  0.194912  ,  0.00911774,  0.2672487 ,
        0.08056606,  0.00614514,  0.10046203, -0.03566631,  0.11721888,
        0.01127368,  0.00401328, -0.1458041 ,  0.30821475,  0.09249471,
        0.12005535, -0.14878592,  0.09928823,  0.01618707,  0.04846606,
        0.05638336,  0.02283391, -0.05701989, -0.05773084,  0.1594044 ,
       -0.06287961,  0.17134733,  0.05117307, -0.18263257, -0.083422  ,
       -0.03465077, -0.04503918,  0.10726914, -0.09143738,  0.08167376,
        0.06061165,  0.06074144,  0.19309185,  0.07915749, -0.11904246,
        0.14842874,  0.09935912,  0.10409604, -0.15229185,  0.00162017,
        0.05878293,  0.08219051,  0.21717533, -0.02163329, -0.10784964,
        0.10610149, -0.04759702, -0.01785776, -0.07483187,  0.1626905 ,
        0.33602083, -0.16017778, -0.05621226,  0.11591039, -0.12509346,
       -0.12255644, -0.06394254,  0.16320434,  0.13513343, -0.00990414,
        0.03073754, -0.00041756, -0.00945251,  0.11489146, -0.04028097,
       -0.15269004,  0.11972128, -0.10128278, -0.02790532,  0.09995367,
       -0.15508983,  0.12511507,  0.08891103, -0.10152003,  0.00764787,
       -0.10944611, -0.05375282,  0.06812549,  0.05812013, -0.03937917,
        0.22765504,  0.00372892,  0.06387893,  0.09559281,  0.02007664,
        0.24314202,  0.18200296, -0.08239567, -0.02860177, -0.00207823,
       -0.1734508 ,  0.03046796,  0.05045646, -0.00278576, -0.23061818,
        0.00690519,  0.00626373,  0.04695902, -0.08392623, -0.01150203,
       -0.04777888, -0.25073016,  0.01703038, -0.02620552,  0.07595298,
        0.07270079, -0.07288843, -0.00517831,  0.03834118,  0.02118174,
       -0.16969006, -0.00073896,  0.16086763, -0.06513947,  0.00683115,
       -0.01509206,  0.02197671, -0.03536109, -0.01040418,  0.06069956,
       -0.14009444, -0.09763421,  0.08548868,  0.00808922, -0.10547569,
        0.15066835,  0.10981354, -0.08242812,  0.10089713, -0.12059139,
       -0.1292853 , -0.03677294, -0.06604069,  0.09970321,  0.03814212,
        0.04470449,  0.21596256, -0.0100713 ,  0.12398451, -0.10095138,
       -0.13962515, -0.02092131, -0.10098378,  0.2194601 ,  0.03088953],
      dtype=float32)
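The vector has one entry per dimension requested at training time, so its shape matches size=200:
model.wv['大圣'].shape   # (200,)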
# similarity between two words
model.wv.similarity('行者','八戒')
0.8457024
# top-10 most similar words to a given word
model.wv.most_similar('八戒',topn=10)
[('沙和尚', 0.9313569664955139),
 ('扯住', 0.9298043847084045),
 ('师兄', 0.9273734092712402),
 ('呆子', 0.9261776804924011),
 ('哥哥', 0.9243035316467285),
 ('挑着', 0.9239184856414795),
 ('挑', 0.9199938178062439),
 ('老猪', 0.9184861183166504),
 ('沙僧', 0.9174877405166626),
 ('哥', 0.9134185910224915)]
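most_similar also accepts positive and negative word lists for analogy-style queries. The query below is only an illustration; its output depends heavily on this single-novel corpus and is not shown here:
# words close to 行者 and 八戒 but far from 唐僧 (illustrative query)
model.wv.most_similar(positive=['行者','八戒'],negative=['唐僧'],topn=5)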
# save the model and load it back
model.save('C:\\Users\\CDAer\\Desktop\\data\\xyj.model')
model_1=word2vec.Word2Vec.load('C:\\Users\\CDAer\\Desktop\\data\\xyj.model')
model_1.train?
model_1.train([['大师兄','猴哥']],total_examples=1,epochs=1)
(0, 2)
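train returns a pair (effective words trained, raw words seen), so (0, 2) means the two raw words contributed no training at all, usually because they are missing from the vocabulary or were dropped by subsampling. To genuinely continue training on new sentences, the vocabulary has to be updated first; a minimal sketch using gensim 3.x's build_vocab(update=True) (note that min_count still applies, so rare new words may still be skipped):
# online vocabulary update followed by a few extra training passes
new_sentences=[['大师兄','猴哥']]
model_1.build_vocab(new_sentences,update=True)
model_1.train(new_sentences,total_examples=len(new_sentences),epochs=5)   # 5 extra epochs, chosen arbitrarily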
