Gensim for NLP: memory requirements of the word2vec model and how to evaluate it

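This article is part of a series. The earlier installments are:
Gensim for NLP: basic features of the gensim library and how to install it
Gensim for NLP: a worked example of the Word2Vec model
Gensim for NLP: training your own word2vec word-vector model
Gensim for NLP: parameter settings for training a word2vec model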

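1. Memory requirements

The parameters of a word2vec model are stored as NumPy arrays.

Each array has the shape (vocabulary size, vector dimensionality):

  • The vocabulary size is controlled by min_count.
  • The vector dimensionality is controlled by size (renamed vector_size in Gensim 4.0).

So each array holds len(vocab) * size parameters.

Each parameter is a single-precision float, i.e. 32 bits, which takes 4 bytes in memory.

Three such matrices are held in RAM at the same time.

So with a vocabulary of 100,000 words and 200-dimensional vectors, the memory we need is:

100,000 * 200 * 4 * 3 = 240,000,000 bytes, or roughly 229 MB

Some extra memory is also needed to store the vocabulary itself, but that part is basically negligible.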
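The arithmetic above can be wrapped in a small helper to try other vocabulary sizes and dimensionalities. This is only a back-of-the-envelope sketch: the function name `estimate_word2vec_memory` is made up for illustration, and the defaults (4 bytes per parameter, 3 matrices) are simply the assumptions stated in the text.

```python
def estimate_word2vec_memory(vocab_size, vector_size, bytes_per_param=4, num_matrices=3):
    """Rough number of bytes taken by the weight matrices (vocabulary storage ignored)."""
    return vocab_size * vector_size * bytes_per_param * num_matrices


total_bytes = estimate_word2vec_memory(100_000, 200)
print(total_bytes)                        # 240000000
print(round(total_bytes / 1024 ** 2, 1))  # 228.9 (bytes divided by 1024 * 1024)
```

Doubling either the vocabulary size or the vector dimensionality doubles the estimate, so raising min_count is one of the cheaper ways to shrink a model.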

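2. Model evaluation

Training a Word2Vec model is an unsupervised learning process, so there is no objective standard for measuring its accuracy.

Evaluation depends on the final application.

Google has released a test set of roughly 20,000 syntactic and semantic examples that checks tasks of the form "A is to B as C is to D".

For example, a syntactic analogy of the comparative type:

bad : worse ; good : ?

The dataset contains 9 types of syntactic analogy, including noun plurals, opposites and so on.

The semantic questions cover 5 types of semantic analogy, for example:

Capital cities (Paris : France ; Tokyo : ?)

Family members (brother : sister ; dad : ?)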
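To make the "A is to B as C is to D" task concrete, here is a minimal sketch of one common way to answer such a question with word vectors: compute B - A + C and pick the remaining word whose vector is closest by cosine similarity. The 3-dimensional vectors are invented for illustration (real word2vec vectors are learned and have far more dimensions), and `solve_analogy` is a hypothetical helper, not a Gensim function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a : b ; c : ?' by finding the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words themselves are not allowed as answers.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Made-up toy vectors: (France-ness, Japan-ness, capital-ness).
toy_vectors = {
    'france': (1.0, 0.0, 0.0),
    'paris': (1.0, 0.0, 1.0),
    'japan': (0.0, 1.0, 0.0),
    'tokyo': (0.0, 1.0, 1.0),
    'germany': (0.5, 0.5, 0.0),
}

print(solve_analogy(toy_vectors, 'paris', 'france', 'tokyo'))  # japan
```

A question counts as correct only when the top-ranked word is exactly the expected D, which is why a model trained on a small corpus can easily score close to zero.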

Gensim supports the same evaluation set, in exactly the same format:

model.wv.accuracy('./datasets/questions-words.txt')

Test results:

[{'section': 'capital-common-countries',
  'correct': [],
  'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN')]},
 {'section': 'capital-world',
  'correct': [],
  'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE')]},
 {'section': 'currency', 'correct': [], 'incorrect': []},
 {'section': 'city-in-state', 'correct': [], 'incorrect': []},
 {'section': 'family',
  'correct': [],
  'incorrect': [('HE', 'SHE', 'HIS', 'HER'),
   ('HE', 'SHE', 'MAN', 'WOMAN'),
   ('HIS', 'HER', 'MAN', 'WOMAN'),
   ('HIS', 'HER', 'HE', 'SHE'),
   ('MAN', 'WOMAN', 'HE', 'SHE'),
   ('MAN', 'WOMAN', 'HIS', 'HER')]},
 {'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []},
 {'section': 'gram2-opposite', 'correct': [], 'incorrect': []},
 {'section': 'gram3-comparative',
  'correct': [],
  'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
   ('GOOD', 'BETTER', 'LONG', 'LONGER'),
   ('GOOD', 'BETTER', 'LOW', 'LOWER'),
   ('GOOD', 'BETTER', 'SMALL', 'SMALLER'),
   ('GREAT', 'GREATER', 'LONG', 'LONGER'),
   ('GREAT', 'GREATER', 'LOW', 'LOWER'),
   ('GREAT', 'GREATER', 'SMALL', 'SMALLER'),
   ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'LOW', 'LOWER'),
   ('LONG', 'LONGER', 'SMALL', 'SMALLER'),
   ('LONG', 'LONGER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'SMALL', 'SMALLER'),
   ('LOW', 'LOWER', 'GOOD', 'BETTER'),
   ('LOW', 'LOWER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'LONG', 'LONGER'),
   ('SMALL', 'SMALLER', 'GOOD', 'BETTER'),
   ('SMALL', 'SMALLER', 'GREAT', 'GREATER'),
   ('SMALL', 'SMALLER', 'LONG', 'LONGER'),
   ('SMALL', 'SMALLER', 'LOW', 'LOWER')]},
 {'section': 'gram4-superlative',
  'correct': [],
  'incorrect': [('BIG', 'BIGGEST', 'GOOD', 'BEST'),
   ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
   ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
   ('GOOD', 'BEST', 'GREAT', 'GREATEST'),
   ('GOOD', 'BEST', 'LARGE', 'LARGEST'),
   ('GOOD', 'BEST', 'BIG', 'BIGGEST'),
   ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
   ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
   ('GREAT', 'GREATEST', 'GOOD', 'BEST'),
   ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
   ('LARGE', 'LARGEST', 'GOOD', 'BEST'),
   ('LARGE', 'LARGEST', 'GREAT', 'GREATEST')]},
 {'section': 'gram5-present-participle',
  'correct': [],
  'incorrect': [('GO', 'GOING', 'LOOK', 'LOOKING'),
   ('GO', 'GOING', 'PLAY', 'PLAYING'),
   ('GO', 'GOING', 'RUN', 'RUNNING'),
   ('GO', 'GOING', 'SAY', 'SAYING'),
   ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
   ('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
   ('LOOK', 'LOOKING', 'SAY', 'SAYING'),
   ('LOOK', 'LOOKING', 'GO', 'GOING'),
   ('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
   ('PLAY', 'PLAYING', 'SAY', 'SAYING'),
   ('PLAY', 'PLAYING', 'GO', 'GOING'),
   ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
   ('RUN', 'RUNNING', 'SAY', 'SAYING'),
   ('RUN', 'RUNNING', 'GO', 'GOING'),
   ('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
   ('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
   ('SAY', 'SAYING', 'GO', 'GOING'),
   ('SAY', 'SAYING', 'LOOK', 'LOOKING'),
   ('SAY', 'SAYING', 'PLAY', 'PLAYING'),
   ('SAY', 'SAYING', 'RUN', 'RUNNING')]},
 {'section': 'gram6-nationality-adjective',
  'correct': [('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN')],
  'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
   ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
   ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
   ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'),
   ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
   ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
   ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
   ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'),
   ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
   ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
   ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
   ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'),
   ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
   ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
   ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'),
   ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
   ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
   ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
   ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
   ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'),
   ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'),
   ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'),
   ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'),
   ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'),
   ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
   ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
   ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
   ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),
   ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE')]},
 {'section': 'gram7-past-tense',
  'correct': [],
  'incorrect': [('GOING', 'WENT', 'PAYING', 'PAID'),
   ('GOING', 'WENT', 'PLAYING', 'PLAYED'),
   ('GOING', 'WENT', 'SAYING', 'SAID'),
   ('GOING', 'WENT', 'TAKING', 'TOOK'),
   ('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
   ('PAYING', 'PAID', 'SAYING', 'SAID'),
   ('PAYING', 'PAID', 'TAKING', 'TOOK'),
   ('PAYING', 'PAID', 'GOING', 'WENT'),
   ('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
   ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
   ('PLAYING', 'PLAYED', 'GOING', 'WENT'),
   ('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
   ('SAYING', 'SAID', 'TAKING', 'TOOK'),
   ('SAYING', 'SAID', 'GOING', 'WENT'),
   ('SAYING', 'SAID', 'PAYING', 'PAID'),
   ('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
   ('TAKING', 'TOOK', 'GOING', 'WENT'),
   ('TAKING', 'TOOK', 'PAYING', 'PAID'),
   ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
   ('TAKING', 'TOOK', 'SAYING', 'SAID')]},
 {'section': 'gram8-plural',
  'correct': [],
  'incorrect': [('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
   ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
   ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
   ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'),
   ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'),
   ('CAR', 'CARS', 'CHILD', 'CHILDREN'),
   ('CAR', 'CARS', 'MAN', 'MEN'),
   ('CAR', 'CARS', 'ROAD', 'ROADS'),
   ('CAR', 'CARS', 'WOMAN', 'WOMEN'),
   ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
   ('CHILD', 'CHILDREN', 'MAN', 'MEN'),
   ('CHILD', 'CHILDREN', 'ROAD', 'ROADS'),
   ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'),
   ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
   ('CHILD', 'CHILDREN', 'CAR', 'CARS'),
   ('MAN', 'MEN', 'ROAD', 'ROADS'),
   ('MAN', 'MEN', 'WOMAN', 'WOMEN'),
   ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
   ('MAN', 'MEN', 'CAR', 'CARS'),
   ('MAN', 'MEN', 'CHILD', 'CHILDREN'),
   ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'),
   ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'),
   ('ROAD', 'ROADS', 'CAR', 'CARS'),
   ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'),
   ('ROAD', 'ROADS', 'MAN', 'MEN'),
   ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'),
   ('WOMAN', 'WOMEN', 'CAR', 'CARS'),
   ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'),
   ('WOMAN', 'WOMEN', 'MAN', 'MEN'),
   ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]},
 {'section': 'gram9-plural-verbs', 'correct': [], 'incorrect': []},
 {'section': 'total',
  'correct': [('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN')],
  'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN'),
   ('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
   ('HE', 'SHE', 'HIS', 'HER'),
   ('HE', 'SHE', 'MAN', 'WOMAN'),
   ('HIS', 'HER', 'MAN', 'WOMAN'),
   ('HIS', 'HER', 'HE', 'SHE'),
   ('MAN', 'WOMAN', 'HE', 'SHE'),
   ('MAN', 'WOMAN', 'HIS', 'HER'),
   ('GOOD', 'BETTER', 'GREAT', 'GREATER'),
   ('GOOD', 'BETTER', 'LONG', 'LONGER'),
   ('GOOD', 'BETTER', 'LOW', 'LOWER'),
   ('GOOD', 'BETTER', 'SMALL', 'SMALLER'),
   ('GREAT', 'GREATER', 'LONG', 'LONGER'),
   ('GREAT', 'GREATER', 'LOW', 'LOWER'),
   ('GREAT', 'GREATER', 'SMALL', 'SMALLER'),
   ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'LOW', 'LOWER'),
   ('LONG', 'LONGER', 'SMALL', 'SMALLER'),
   ('LONG', 'LONGER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'SMALL', 'SMALLER'),
   ('LOW', 'LOWER', 'GOOD', 'BETTER'),
   ('LOW', 'LOWER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'LONG', 'LONGER'),
   ('SMALL', 'SMALLER', 'GOOD', 'BETTER'),
   ('SMALL', 'SMALLER', 'GREAT', 'GREATER'),
   ('SMALL', 'SMALLER', 'LONG', 'LONGER'),
   ('SMALL', 'SMALLER', 'LOW', 'LOWER'),
   ('BIG', 'BIGGEST', 'GOOD', 'BEST'),
   ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
   ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
   ('GOOD', 'BEST', 'GREAT', 'GREATEST'),
   ('GOOD', 'BEST', 'LARGE', 'LARGEST'),
   ('GOOD', 'BEST', 'BIG', 'BIGGEST'),
   ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
   ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
   ('GREAT', 'GREATEST', 'GOOD', 'BEST'),
   ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
   ('LARGE', 'LARGEST', 'GOOD', 'BEST'),
   ('LARGE', 'LARGEST', 'GREAT', 'GREATEST'),
   ('GO', 'GOING', 'LOOK', 'LOOKING'),
   ('GO', 'GOING', 'PLAY', 'PLAYING'),
   ('GO', 'GOING', 'RUN', 'RUNNING'),
   ('GO', 'GOING', 'SAY', 'SAYING'),
   ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
   ('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
   ('LOOK', 'LOOKING', 'SAY', 'SAYING'),
   ('LOOK', 'LOOKING', 'GO', 'GOING'),
   ('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
   ('PLAY', 'PLAYING', 'SAY', 'SAYING'),
   ('PLAY', 'PLAYING', 'GO', 'GOING'),
   ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
   ('RUN', 'RUNNING', 'SAY', 'SAYING'),
   ('RUN', 'RUNNING', 'GO', 'GOING'),
   ('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
   ('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
   ('SAY', 'SAYING', 'GO', 'GOING'),
   ('SAY', 'SAYING', 'LOOK', 'LOOKING'),
   ('SAY', 'SAYING', 'PLAY', 'PLAYING'),
   ('SAY', 'SAYING', 'RUN', 'RUNNING'),
   ('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
   ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
   ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
   ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'),
   ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
   ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
   ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
   ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'),
   ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
   ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
   ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
   ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'),
   ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
   ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
   ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'),
   ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
   ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
   ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
   ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
   ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'),
   ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'),
   ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'),
   ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'),
   ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'),
   ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
   ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
   ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
   ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),
   ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE'),
   ('GOING', 'WENT', 'PAYING', 'PAID'),
   ('GOING', 'WENT', 'PLAYING', 'PLAYED'),
   ('GOING', 'WENT', 'SAYING', 'SAID'),
   ('GOING', 'WENT', 'TAKING', 'TOOK'),
   ('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
   ('PAYING', 'PAID', 'SAYING', 'SAID'),
   ('PAYING', 'PAID', 'TAKING', 'TOOK'),
   ('PAYING', 'PAID', 'GOING', 'WENT'),
   ('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
   ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
   ('PLAYING', 'PLAYED', 'GOING', 'WENT'),
   ('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
   ('SAYING', 'SAID', 'TAKING', 'TOOK'),
   ('SAYING', 'SAID', 'GOING', 'WENT'),
   ('SAYING', 'SAID', 'PAYING', 'PAID'),
   ('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
   ('TAKING', 'TOOK', 'GOING', 'WENT'),
   ('TAKING', 'TOOK', 'PAYING', 'PAID'),
   ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
   ('TAKING', 'TOOK', 'SAYING', 'SAID'),
   ('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
   ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
   ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
   ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'),
   ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'),
   ('CAR', 'CARS', 'CHILD', 'CHILDREN'),
   ('CAR', 'CARS', 'MAN', 'MEN'),
   ('CAR', 'CARS', 'ROAD', 'ROADS'),
   ('CAR', 'CARS', 'WOMAN', 'WOMEN'),
   ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
   ('CHILD', 'CHILDREN', 'MAN', 'MEN'),
   ('CHILD', 'CHILDREN', 'ROAD', 'ROADS'),
   ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'),
   ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
   ('CHILD', 'CHILDREN', 'CAR', 'CARS'),
   ('MAN', 'MEN', 'ROAD', 'ROADS'),
   ('MAN', 'MEN', 'WOMAN', 'WOMEN'),
   ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
   ('MAN', 'MEN', 'CAR', 'CARS'),
   ('MAN', 'MEN', 'CHILD', 'CHILDREN'),
   ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'),
   ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'),
   ('ROAD', 'ROADS', 'CAR', 'CARS'),
   ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'),
   ('ROAD', 'ROADS', 'MAN', 'MEN'),
   ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'),
   ('WOMAN', 'WOMEN', 'CAR', 'CARS'),
   ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'),
   ('WOMAN', 'WOMEN', 'MAN', 'MEN'),
   ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]}]

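As you can see, the results are far from ideal: only one analogy is answered correctly, and several sections are empty altogether. The most likely reason is that the training corpus we used earlier is quite small.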
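A long dump like the one above is easier to read as per-section counts. The sketch below assumes only the structure visible in the output (a list of dicts with 'section', 'correct' and 'incorrect' keys); `summarize_sections` is a hypothetical helper, and `sample_results` is a hand-trimmed excerpt rather than the full result. As far as I know, on Gensim 4.x the comparable call is `evaluate_word_analogies`, which returns an overall score together with a similar list of sections.

```python
def summarize_sections(results):
    """Map each section name to (correct, total, accuracy); accuracy is None for empty sections."""
    summary = {}
    for entry in results:
        n_correct = len(entry['correct'])
        n_total = n_correct + len(entry['incorrect'])
        accuracy = n_correct / n_total if n_total else None
        summary[entry['section']] = (n_correct, n_total, accuracy)
    return summary


# Trimmed excerpt in the same shape as the output above (not the complete result).
sample_results = [
    {'section': 'currency', 'correct': [], 'incorrect': []},
    {'section': 'gram6-nationality-adjective',
     'correct': [('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN')],
     'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
                   ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
                   ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI')]},
]

for name, (n_correct, n_total, accuracy) in summarize_sections(sample_results).items():
    shown = 'n/a' if accuracy is None else f'{accuracy:.0%}'
    print(f'{name}: {n_correct}/{n_total} ({shown})')
# currency: 0/0 (n/a)
# gram6-nationality-adjective: 1/4 (25%)
```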

This accuracy measurement takes an optional parameter, restrict_vocab, which limits which test examples are taken into account.
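My understanding of restrict_vocab is that only the first N words of the vocabulary (normally the most frequent ones) are kept, and any question containing a word outside that set is skipped. The sketch below imitates that filtering on a toy vocabulary; `filter_questions` and the word list are hypothetical, not Gensim code.

```python
def filter_questions(questions, vocab_by_frequency, restrict_vocab):
    """Keep only the questions whose four words all fall within the first restrict_vocab words."""
    allowed = set(vocab_by_frequency[:restrict_vocab])
    return [q for q in questions if all(word in allowed for word in q)]


# Toy vocabulary, assumed to be sorted from most to least frequent.
vocab_by_frequency = ['good', 'better', 'great', 'greater', 'long', 'longer']
questions = [
    ('good', 'better', 'great', 'greater'),
    ('good', 'better', 'long', 'longer'),
]

print(filter_questions(questions, vocab_by_frequency, restrict_vocab=4))
# [('good', 'better', 'great', 'greater')]
```

A smaller limit therefore evaluates fewer questions, but only on words the model has seen often.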

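In a 2016 release, Gensim added a better way to evaluate semantic similarity.

By default it uses an academic dataset, WS-353, but you can also build a dataset focused on your own domain by following the same format.

The dataset contains word pairs together with human-assigned similarity judgments, which measure how related the two words are or how likely they are to occur together.

For example, coast and shore are very similar: the two words often appear in the same passage of text.

Meanwhile, clothes and closet are less similar: the two words are related, but they are not interchangeable.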

from gensim.test.utils import datapath
model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

Test results:

((0.1952515342533469, 0.13490728041580877),
 SpearmanrResult(correlation=0.19127414318530173, pvalue=0.14319638687965558),
 83.0028328611898)

Return values:

  • pearson (tuple of (float, float)): Pearson correlation coefficient with its 2-tailed p-value.
  • spearman (tuple of (float, float)): Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with its 2-tailed p-value.
  • oov_ratio (float): the ratio of pairs with unknown words (reported as a percentage, 83.0 in the output above).

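So the results above show that our model does not score well either: both correlations are only about 0.19 with p-values above 0.1, and roughly 83% of the word pairs contain words the model has never seen. Again, the small training corpus is probably to blame!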
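To show what these three numbers measure, here is a pure-Python sketch on made-up scores. It is not Gensim's implementation: the `SpearmanrResult` in the output above indicates that the correlations come from SciPy, and p-values are left out here. All names and data in the block are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def oov_percentage(pairs, vocab):
    """Percentage of word pairs in which at least one word is missing from vocab."""
    missing = sum(1 for first, second in pairs if first not in vocab or second not in vocab)
    return 100.0 * missing / len(pairs)


# Made-up human ratings and model similarities for four word pairs.
human_scores = [9.1, 7.3, 5.0, 1.2]
model_scores = [0.8, 0.5, 0.6, 0.1]
print(round(spearman(human_scores, model_scores), 3))  # 0.8

# Made-up pairs and vocabulary: 'closet' is unknown, so one pair in three is affected.
pairs = [('coast', 'shore'), ('clothes', 'closet'), ('tiger', 'cat')]
vocab = {'coast', 'shore', 'clothes', 'tiger', 'cat'}
print(round(oov_percentage(pairs, vocab), 1))  # 33.3
```

Spearman only looks at the ordering of the scores, so it does not matter that the human ratings and the model similarities are on different scales.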

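!!! Note:

  • A good score on the Google test set or on WS-353 does not mean the model will also perform well in your application.
  • The reverse holds too.
  • It is best to test directly on the task you actually care about! For example, if we are building a classifier, we can simply look at the classification results.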