[Syntax fix] Updated syntax for accessing word vectors in gensim 4.0 and later

sucka136

Last modified 2022-03-28 16:03:14


Tags: python, natural language processing, word2vec

First published 2022-03-28 15:12:30

Article link: https://blog.csdn.net/sucka136/article/details/123795551


While recently training word vectors with gensim's word2vec model, I was using this code from someone else's older article:


ngram_model_counter = Counter()
for key in ngram_model.wv.vocab.keys():         # iterate over the vocabulary keys
    if key not in stoplist:
        if len(key.split("_")) > N:
            ngram_model_counter[key] += ngram_model.wv.vocab[key].count  # accumulate the count

Execution fails at the wv.vocab.keys() line with:

AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0

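I went through the word2vec source code and found the replacements:

To get the list of keys: wv.index_to_key

To get the dictionary mapping each key to its index: wv.key_to_index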
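The attribute renames can be bridged with a small helper. The following is a minimal sketch, not part of gensim: the helper names `vocab_keys` and `key_index` are hypothetical, and the pre-4.0 branches assume the older `index2word` and `vocab[key].index` attributes of Gensim 3.x.

```python
def vocab_keys(wv):
    """Return the vocabulary keys of a KeyedVectors-like object, in index order."""
    if hasattr(wv, "index_to_key"):      # Gensim >= 4.0
        return list(wv.index_to_key)
    return list(wv.index2word)           # assumed Gensim 3.x name for the same list


def key_index(wv, key):
    """Return the integer index of `key` in the vocabulary."""
    if hasattr(wv, "key_to_index"):      # Gensim >= 4.0: plain dict, key -> int
        return wv.key_to_index[key]
    return wv.vocab[key].index           # assumed Gensim 3.x Vocab object with .index
```

With a trained model, `vocab_keys(ngram_model.wv)` would take the place of `ngram_model.wv.vocab.keys()`, and `key_index` is the inverse lookup into that list.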

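There is one more problem: getting the word frequencies. After the update, the vocabulary holds each key only once, so there is nothing to count in key_to_index. Digging further into the library source, I found that in the new version the count (frequency) is used to sort the vocabulary in descending order of frequency and is stored in the NumPy array self.expandos['count'], which is indexed in the same order as index_to_key. The word frequencies can therefore simply be read from that array.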
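Besides reading `expandos` directly, Gensim 4 exposes per-key attributes through `KeyedVectors.get_vecattr(key, attr)`. The sketch below wraps that call. The function name `key_count` is hypothetical, the fallback branch assumes the Gensim 3.x `vocab[key].count` field used in the old code above, and counts are only available when the vectors actually carry them (for example, after training a model).

```python
def key_count(wv, key):
    """Return the corpus frequency recorded for `key`."""
    if hasattr(wv, "get_vecattr"):            # Gensim >= 4.0 accessor
        return int(wv.get_vecattr(key, "count"))
    return int(wv.vocab[key].count)           # assumed Gensim 3.x Vocab object
```

For a trained model this would be called as `key_count(ngram_model.wv, key)`, which avoids depending on the internal layout of `expandos`.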

Figure 1: the self.expandos array

The corrected code:

from collections import Counter
from gensim.models import Word2Vec

# ngram, abstrct, stoplist and N are defined earlier in the script
ngram_model = Word2Vec(ngram[abstrct], vector_size=100)
ngram_model_counter = Counter()
# Counts are stored in the NumPy array wv.expandos['count'], aligned with
# wv.index_to_key, so the array must not be re-sorted on its own
counts = ngram_model.wv.expandos['count']
keylist = ngram_model.wv.index_to_key
for key, count in zip(keylist, counts):
    if key not in stoplist:
        if len(key.split("_")) > N:
            ngram_model_counter[key] += int(count)
