Reference article: https://blog.csdn.net/qq_30189255/article/details/103049569
1. Corpus
This article uses the text8 corpus (about 100 MB), available at http://mattmahoney.net/dc/text8.zip.
text8 is already tokenized on whitespace and stripped of punctuation, so no further preprocessing is needed.
2. Model training
We use Python's gensim package to implement word2vec.
Input:
from gensim.models import word2vec
# Gensim is an open-source third-party Python toolkit for learning latent
# semantic representations from raw, unstructured text in an unsupervised way;
# it supports TF-IDF, LSA, LDA, and word2vec.
# Load the text8 corpus (about 100 MB) into sentences;
# download: http://mattmahoney.net/dc/text8.zip
sentences = word2vec.Text8Corpus('text8')
# Train the word vector model:
#   sg=1         skip-gram (sg=0 would be CBOW)
#   size=100     dimensionality of the word vectors (renamed vector_size in gensim >= 4.0)
#   window=5     context window size
#   min_count=5  ignore words occurring fewer than 5 times
#   negative=3   number of negative samples
#   sample=0.001 threshold for down-sampling frequent words
#   hs=1         use hierarchical softmax
#   workers=4    number of training threads
model = word2vec.Word2Vec(sentences, sg=1, size=100, window=5, min_count=5, negative=3, sample=0.001, hs=1, workers=4)
# Save the model under a name, so it can be reloaded later without retraining
model.save('text8_word2vec_model')
3. Model loading
# Load the saved model
model = word2vec.Word2Vec.load('text8_word2vec_model')
4. Computing the similarity of two words with similarity()
Input:
print('--- similarity between two words ---')
word1 = 'man'
word2 = 'woman'
result1 = model.similarity(word1, word2)
print("similarity of " + word1 + " and " + word2 + ":", result1)
Output:
--- similarity between two words ---
similarity of man and woman: 0.6944872
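Under the hood, similarity() returns the cosine similarity of the two word vectors. A minimal sketch of that computation with numpy, using made-up toy vectors rather than the trained model:

```python
import numpy as np

def cosine_similarity(v1, v2):
    # cosine similarity = dot product divided by the product of the norms
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # parallel to a
print(cosine_similarity(a, b))  # parallel vectors give 1.0
```

Values close to 1 mean the vectors point in nearly the same direction; values near 0 mean they are unrelated.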
5. Finding a word's most related words with most_similar()
Input:
print('\n--- words most related to a given word ---')
word = 'cat'
result2 = model.most_similar(word, topn=10)  # the 10 most similar words
print("the 10 words most related to " + word + ":")
for item in result2:
    print(item[0], item[1])
Output:
--- words most related to a given word ---
the 10 words most related to cat:
prionailurus 0.7491977214813232
cats 0.7341662049293518
dog 0.7332097887992859
dogs 0.7025191783905029
kitten 0.6987137794494629
rat 0.6867721676826477
eared 0.6866066455841064
felis 0.6811522245407104
pug 0.678561806678772
tortoiseshell 0.6764862537384033
Looking up the less familiar results: prionailurus (Prionailurus) is a genus of small wild cats roughly the size of a domestic cat; a kitten is a young cat; eared means having ears.
Trying another word, the related words for beijing:
--- words most related to a given word ---
the 10 words most related to beijing:
guangzhou 0.7843025326728821
shanghai 0.7154852151870728
peking 0.6975410580635071
taipei 0.6882435083389282
hangzhou 0.6816953420639038
wuhan 0.6814815998077393
kaohsiung 0.6703094244003296
ribao 0.664854109287262
guangdong 0.6647670269012451
hong 0.6628706455230713
From top to bottom: Guangzhou, Shanghai, Peking (Beijing), Taipei, Hangzhou, Wuhan, Kaohsiung, ribao ("daily", as in newspaper names), Guangdong, and hong.
The last word, hong, presumably comes from "hong kong".
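most_similar() simply ranks every other vocabulary word by cosine similarity to the query word and keeps the topn best. A minimal sketch with numpy and a made-up four-word vocabulary (hypothetical vectors, not taken from the trained model):

```python
import numpy as np

# toy embeddings (hypothetical values, for illustration only)
vecs = {
    "cat":    np.array([1.0, 0.2]),
    "kitten": np.array([0.9, 0.3]),
    "dog":    np.array([0.8, 0.5]),
    "car":    np.array([0.0, 1.0]),
}

def most_similar(word, topn=2):
    # normalize, then rank every other word by cosine similarity to `word`
    unit = {w: v / np.linalg.norm(v) for w, v in vecs.items()}
    scores = [(w, float(np.dot(unit[word], unit[w]))) for w in vecs if w != word]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]

print(most_similar("cat"))  # "kitten" ranks first, then "dog"
```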
6. Finding word analogies, again with most_similar()
Input:
print('\n--- word analogies ---')
print('"boy" is to "father" as "girl" is to ?')
result3 = model.most_similar(['girl', 'father'], ['boy'], topn=2)  # the 2 best matches
for item in result3:
    print(item[0], item[1])
Output:
--- word analogies ---
"boy" is to "father" as "girl" is to ?
mother 0.7658053040504456
wife 0.7323337197303772
more_examples = ["she her he", "small smaller bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.most_similar([x, b], [a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
'she' is to 'her' as 'he' is to 'his'
'small' is to 'smaller' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
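These analogies are vector arithmetic: most_similar(positive=[x, b], negative=[a]) ranks the vocabulary by cosine similarity to (roughly) x + b - a. A minimal sketch with numpy and a made-up four-word vocabulary (hypothetical vectors, for illustration only):

```python
import numpy as np

# toy embeddings (hypothetical values, for illustration only)
vecs = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
}

def analogy(a, b, x):
    # "a is to b as x is to ?": score words by cosine similarity to x + b - a,
    # excluding the three query words themselves
    target = vecs[x] + vecs[b] - vecs[a]
    target = target / np.linalg.norm(target)
    candidates = [w for w in vecs if w not in (a, b, x)]
    return max(candidates,
               key=lambda w: float(np.dot(target, vecs[w] / np.linalg.norm(vecs[w]))))

print(analogy("man", "king", "woman"))  # man is to king as woman is to queen
```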
7. Finding the odd one out with doesnt_match()
Input:
print('\n--- odd one out ---')
words = "apple cat banana peach"  # find the animal hidden among the fruit
result4 = model.doesnt_match(words.split())
print("the odd one out in '" + words + "' is:", result4)
Output:
--- odd one out ---
the odd one out in 'apple cat banana peach' is: cat
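doesnt_match() picks the word whose vector is least similar to the mean of the group. A minimal sketch with numpy and made-up toy vectors (hypothetical values, for illustration only):

```python
import numpy as np

# toy embeddings: three "fruit-like" directions and one outlier (hypothetical)
vecs = {
    "apple":  np.array([1.0, 0.1]),
    "banana": np.array([0.9, 0.2]),
    "peach":  np.array([1.0, 0.0]),
    "cat":    np.array([0.0, 1.0]),
}

def doesnt_match(words):
    # normalize the vectors, average them, and return the word
    # least similar (by cosine) to that mean direction
    unit = {w: vecs[w] / np.linalg.norm(vecs[w]) for w in words}
    mean = np.mean(list(unit.values()), axis=0)
    mean = mean / np.linalg.norm(mean)
    return min(words, key=lambda w: float(np.dot(unit[w], mean)))

print(doesnt_match(["apple", "cat", "banana", "peach"]))  # "cat"
```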
8. Inspecting a word vector
Input:
word = 'boy'
print(word, "\n", model[word])  # in gensim >= 4.0, use model.wv[word] instead
Output:
boy
[-0.05925538 0.11277281 0.11228959 0.00941157 -0.29323277 0.3983824
0.10022594 -0.27772436 -0.0637489 0.21361585 -0.1111148 -0.07992619
0.19348109 -0.3863782 -0.39820215 -0.5309777 0.3023594 0.09559165
0.26342046 0.07928758 0.181699 0.69354516 0.06837065 -0.18296044
0.02820505 -0.2478618 0.02427425 0.05263022 0.4571287 -0.11103037
0.00101246 -0.27764824 -0.24569483 0.44549158 -0.21713312 0.5335748
0.14214468 0.11317527 0.19602373 0.2653484 -0.32859662 -0.38938046
0.25495887 -0.45625678 0.14457951 0.32262853 0.15038528 0.32194614
-0.08338999 -0.01091572 0.20316067 -0.74805576 -0.08273557 -0.59173554
-0.12938951 -0.2492775 0.16524307 0.14128453 -0.42496806 0.2531642
0.01175205 0.24926914 -0.20511891 -0.32925373 0.64965665 -0.2722091
0.7198772 -0.45331827 0.02247382 -0.44499233 0.46038678 0.099677
-0.03841541 0.22986875 0.24340023 -0.2364937 -0.22875474 -0.08419312
0.47897708 -0.2800826 0.36107522 -0.41507873 0.13201733 -0.61776733
0.08101977 -0.14693528 0.15443248 0.08642672 0.21798083 -0.30605313
0.09893245 -0.15973178 0.07892659 0.31995687 -0.07135762 0.46047646
-0.53847355 -0.00333725 -0.03253252 0.20049895]
The word boy, shown as a 100-dimensional vector.
9. Incremental training
If out-of-vocabulary words turn up when using the word vectors, the model can be updated with incremental training.