word2vec Code Walkthrough (2): Building the Dictionary and Generating Training Samples

cy冲鸭

Category: Natural Language Processing

Article link: https://blog.csdn.net/weixin_41841797/article/details/84193687

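This article walks through how the word2vec model builds its dictionary: the vocabulary is limited to 50,000 entries made up of the most frequent words, and every other word is mapped to 'UNK'. It then explains how training samples are generated in Skip-Gram mode, using the batch_size, num_skips and skip_window parameters and a sliding window to predict the context of each target word.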

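Step 2: build the dictionary. The most frequent words are stored in the dictionary in descending order of frequency. Index 0 is reserved for the unknown token 'UNK', so with vocabulary_size = 50000 the code keeps the 49,999 most frequent words and numbers them 1 to 49,999. Every other word is treated as unknown and mapped to index 0.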

# Step 2: Build the dictionary and replace rare words with UNK token.
import collections

vocabulary_size = 50000

def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]  # placeholder count for UNK, filled in below
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)  # 'UNK' gets 0, then ids follow frequency order
    data = list()
    unk_count = 0
    for word in words:
        index = dictionary.get(word, 0)  # look up the word's id; unknown words map to 0
        if index == 0:  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count  # replace the placeholder with the real UNK count
    # Inverse mapping: id -> word
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary
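
To make the four return values concrete, here is a small sketch that runs the same logic on a toy sentence. The toy corpus and the vocabulary size of 3 are assumptions chosen for illustration, not values from the article, and build_dataset is repeated in compact form only so that the snippet runs on its own.

```python
import collections

def build_dataset(words, n_words):
    """Compact copy of the function above so this snippet is self-contained."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    # 'UNK' gets id 0, the remaining words get ids in frequency order
    dictionary = {word: idx for idx, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]
    count[0][1] = data.count(0)  # number of words that fell back to UNK
    reversed_dictionary = {idx: word for word, idx in dictionary.items()}
    return data, count, dictionary, reversed_dictionary

# Hypothetical toy corpus: 'the' appears 3 times, 'cat' twice, the rest once.
toy_words = "the cat sat on the mat the cat ran".split()

# A vocabulary of 3 entries means UNK plus the 2 most frequent words.
data, count, dictionary, reversed_dictionary = build_dataset(toy_words, 3)

print(dictionary)           # {'UNK': 0, 'the': 1, 'cat': 2}
print(data)                 # [1, 2, 0, 0, 1, 0, 1, 2, 0]
print(count)                # [['UNK', 4], ('the', 3), ('cat', 2)]
print(reversed_dictionary)  # {0: 'UNK', 1: 'the', 2: 'cat'}
```

Note that the vocabulary size counts the 'UNK' entry, which is why only n_words - 1 real words are kept.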

# Filling 4 global variables:
# data
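
The article's own code stops here, so the sample-generation step described in the summary at the top is not shown. As a rough illustration only, the sliding-window idea could be sketched as follows. This is not the author's implementation: the function name skipgram_pairs, the seed argument and the toy id list are all hypothetical, and only the parameter names batch_size, num_skips and skip_window come from the summary.

```python
import random

def skipgram_pairs(data, num_skips, skip_window, seed=0):
    """Return (target_id, context_id) pairs from a list of word ids.

    Hypothetical helper for illustration; not taken from the article.
    """
    # Each target has at most 2 * skip_window distinct context positions.
    assert num_skips <= 2 * skip_window
    rng = random.Random(seed)
    span = 2 * skip_window + 1  # [skip_window words] target [skip_window words]
    pairs = []
    for start in range(len(data) - span + 1):
        window = data[start:start + span]
        target = window[skip_window]  # the word in the middle of the window
        context_positions = [i for i in range(span) if i != skip_window]
        # Pick num_skips context words for this target, without repeats.
        for pos in rng.sample(context_positions, num_skips):
            pairs.append((target, window[pos]))
    return pairs

# Assumed toy input: word ids such as build_dataset would produce.
toy_data = [1, 2, 0, 0, 1, 0, 1, 2, 0]
batch_size = 8   # number of (target, context) pairs per batch
num_skips = 2    # pairs generated per target word
skip_window = 1  # words considered on each side of the target

pairs = skipgram_pairs(toy_data, num_skips, skip_window)

# Split the first batch into model inputs (targets) and labels (contexts).
first_batch = pairs[:batch_size]
batch_inputs = [target for target, _ in first_batch]
batch_labels = [context for _, context in first_batch]

# The first window [1, 2, 0] has target 2 and yields the pairs
# (2, 1) and (2, 0); their order depends on the random sampling.
print(first_batch)
```

With skip_window = 1 and num_skips = 2, every context word of a target is used, so the sampling only affects the order of the pairs. Choosing batch_size as a multiple of num_skips keeps all pairs for a given target in the same batch.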
