word2vec Details Explained, Part 1

Latest recommended article published 2020-05-25 22:27:19

weixin_30679823

Original link: http://www.cnblogs.com/fpzs/p/10333877.html

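count.extend(collections.Counter(list1).most_common(2))
This means: use collections.Counter to count how often each word occurs in the list list1,
then use the most_common method to take the 2 most frequent words, and append them to count.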
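Why `extend` rather than `append` matters here can be seen in a minimal sketch. The sample list and the names `top2`, `extended` and `appended` are made up for illustration and are not part of the original post.

```python
import collections

# Hypothetical sample data, used only for this illustration
list1 = ['x', 'y', 'x', 'z', 'x', 'y']

# The two most frequent words as (word, count) pairs
top2 = collections.Counter(list1).most_common(2)

extended = [['UNK', -1]]
extended.extend(top2)   # each (word, count) pair becomes its own element

appended = [['UNK', -1]]
appended.append(top2)   # the whole list is added as one nested element

print(extended)
print(appended)
```

With `extend`, every entry of `count` stays a two-item (word, count) pair, which is what a loop such as `for word, _ in count` relies on.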
# -*- coding:utf-8 -*-
import collections

words = ['physics','physics', 'chemistry', 'the','the','the','the','a','b','c']

# Count how often each word occurs in the word list
tt = collections.Counter(words)
print(type(tt))  # <class 'collections.Counter'>

# Print each word together with its count (the order of tied words may vary by Python version)
print(tt)  # Counter({'the': 4, 'physics': 2, 'chemistry': 1, 'a': 1, 'b': 1, 'c': 1})

print(tt['the'])  # number of times 'the' occurs
# 4

# From the counts above, take the 2 most frequent words
t = collections.Counter(words).most_common(2)
print(t)  # [('the', 4), ('physics', 2)]


count = [['UNK', -1]]  # at this point len(count) == 1: it holds a single entry

# Append the 2 most frequent words of the words list to the end of count
count.extend(collections.Counter(words).most_common(2))

print(count)
# [['UNK', -1], ('the', 4), ('physics', 2)]


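dictionary = dict()  # create an empty dictionary
# Map every word to an id (ids follow frequency order). Words outside the top 50000
# (top 2 in this toy example) are treated as unknown (UNK) with id 0, and their total is counted.

for word, _ in count:
    dictionary[word] = len(dictionary)

print(dictionary)
# {'UNK': 0, 'the': 1, 'physics': 2}  (key order may differ on older Python versions)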

data = list()
unk_count = 0
for word in words:  # walk through the word list
    # for each word, first check whether it is in dictionary
    if word in dictionary:
        # if it is, convert it to its id
        index = dictionary[word]
    else:  # otherwise map it to id 0 and count it as unknown
        index = 0
        unk_count += 1
    data.append(index)

print(data)
# Encoded: [2, 2, 0, 1, 1, 1, 1, 0, 0, 0]

count[0][1] = unk_count  # store the number of unknown words in the UNK entry

print(count)
# [['UNK', 4], ('the', 4), ('physics', 2)]
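The steps above can be gathered into one reusable function. The following is only a sketch under assumptions: the name `build_dataset`, the `vocabulary_size` parameter and the `reverse_dictionary` return value are introduced here for illustration and do not appear in the original listing.

```python
import collections


def build_dataset(words, vocabulary_size):
    """Hypothetical helper: encode words as integer ids.

    Keeps the (vocabulary_size - 1) most frequent words; every other
    word is mapped to id 0, the 'UNK' entry.
    """
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))

    # Ids follow the order of count: UNK is 0, the most frequent word is 1, ...
    dictionary = {}
    for word, _ in count:
        dictionary[word] = len(dictionary)

    data = []
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)

    count[0][1] = unk_count
    # Assumed extra: an id -> word mapping for decoding ids back to words
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return data, count, dictionary, reverse_dictionary


words = ['physics', 'physics', 'chemistry', 'the', 'the', 'the', 'the', 'a', 'b', 'c']
data, count, dictionary, reverse_dictionary = build_dataset(words, vocabulary_size=3)
print(data)
print(count)
print([reverse_dictionary[i] for i in data])
```

Here `vocabulary_size=3` corresponds to `most_common(2)` in the listing: two real words plus the UNK slot.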

Reposted from: https://www.cnblogs.com/fpzs/p/10333877.html
