Gensim Official Tutorial (4.0.0beta, Latest Version): FastText Model

Ace Cheney

Original article: https://radimrehurek.com/gensim_3.8.3/auto_examples/tutorials/run_fasttext.html

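Translation series contents

Gensim Official Documentation (4.0.0beta, Latest Version): Core Tutorials for Beginners

Gensim Official Tutorial (4.0.0beta, Latest Version): Word2Vec Word Vector Model

Gensim Official Tutorial (4.0.0beta, Latest Version): FastText Model

Gensim Official Tutorial (4.0.0beta, Latest Version): LDA Model

Gensim Official Tutorial (4.0.0beta, Latest Version): LDA Model Evaluation and Visualization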

I am online all day, so feel free to comment or message me about NLP questions.

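This post shows how to use Gensim's fastText implementation to train word embeddings, save and load the model, and run similarity queries in the same way as with Word2Vec.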

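When to use fastText

Traditional word embedding models such as Word2Vec train one independent embedding per word and ignore the word's internal structure. This is a problem for morphologically rich languages such as German and Turkish, where a single word can take many different forms, each of which may occur only rarely, so it is hard to learn a good embedding for every form.

fastText addresses this by treating each word as a collection of subwords. For simplicity and language independence, the subwords are the word's character n-grams, and the vector for a word is simply the sum of the vectors of its character n-grams.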
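To make the idea of character n-grams concrete, here is a minimal sketch in plain Python. It assumes the usual fastText convention of wrapping the word in `<` and `>` boundary markers before slicing; the function name `char_ngrams` is made up for illustration and is not part of Gensim's API.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in boundary markers.

    Illustrative helper only; the '<' and '>' markers are an assumption
    based on the usual fastText convention.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# With min_n=3 and max_n=4, "night" yields 3-grams and 4-grams only.
print(char_ngrams("night", min_n=3, max_n=4))
```

The word vector would then be built from the vectors attached to these n-grams, which is why rare word forms can still borrow information from more frequent words that share the same pieces.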

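According to the comparison cited in the original tutorial, fastText does significantly better than the original Word2Vec on syntactic tasks, especially when the training corpus is small. Word2Vec does slightly better on semantic tasks, but the gap narrows as the corpus grows.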

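fastText can even produce vectors for out-of-vocabulary (OOV) words, that is, words missing from the training vocabulary, by summing the vectors of their character n-grams, provided that at least one of those n-grams appeared in the training data.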
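The OOV lookup described above can be sketched with a toy n-gram table. Everything here is hypothetical: the 2-dimensional vectors are invented, only 3-grams are used, and the helper names are not Gensim functions. Gensim's real implementation also hashes n-grams into buckets and may normalize the result, so treat this as a rough illustration of the summation idea only.

```python
def char_ngrams(word, n=3):
    """Character 3-grams of `word` with assumed '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


# Hypothetical n-gram vectors, as if learned from a corpus containing "night".
ngram_vectors = {
    "<ni": [0.5, 0.0],
    "nig": [0.25, 0.5],
    "igh": [0.0, 0.25],
    "ght": [0.25, 0.25],
    "ht>": [0.5, 0.0],
}


def oov_vector(word, table, dim=2):
    """Sum the vectors of every n-gram of `word` that exists in `table`."""
    known = [table[g] for g in char_ngrams(word) if g in table]
    if not known:
        # No n-gram of this word was seen in training, so no vector exists.
        raise KeyError(f"no known n-grams for {word!r}")
    return [sum(vec[i] for vec in known) for i in range(dim)]


# "nights" was never seen as a whole word, but it shares four 3-grams with "night".
print(oov_vector("nights", ngram_vectors))
```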

Training the model

Below we train a fastText model on the Lee Corpus.

from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

# Set file names for train and test data
corpus_file = datapath('lee_background.cor')

model = FastText()

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)

print(model)

#[OUT]:<gensim.models.fasttext.FastText object at 0x7f844a11f040>

Parameters

class gensim.models.fasttext.FastText(sentences=None, corpus_file=None, sg=0, hs=0, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=10000, callbacks=(), max_final_vocab=None)

Gensim's fastText supports all the parameters of Word2Vec plus a few of its own. The values shown above are the defaults.

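sentences: the training corpus, given as a list of lists of tokens, e.g. [['I', 'am', 'good'], ['you', 'are', 'great']].
corpus_file: the training corpus, given as the path to a file in LineSentence format. Pass either this parameter or sentences, not both.
min_count: the minimum frequency threshold; words that occur fewer times than this are dropped from the corpus.
sg: the training algorithm, 0 or 1. 0 means CBOW, 1 means skip-gram.
hs: the optimization strategy, 0 or 1. 0 means negative sampling, 1 means hierarchical softmax.
window: the size of the context window around each word.
min_n: the minimum length of the character n-grams.
max_n: the maximum length of the character n-grams. If max_n is set to 0 or to a value smaller than min_n, no character n-grams are used and fastText effectively reduces to Word2Vec.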

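To save memory, Gensim uses a hash function (an FNV-1a variant) to map character n-grams to integers in the range 1 to K, where K is the number of buckets set by the bucket parameter.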
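The bucketing step can be sketched as follows. This uses the textbook 32-bit FNV-1a hash, not Gensim's exact variant (which differs in low-level details such as how bytes are treated), and it returns 0-based bucket indices rather than 1 to K. The function names are made up for illustration.

```python
FNV_OFFSET_BASIS_32 = 2166136261
FNV_PRIME_32 = 16777619


def fnv1a_32(text):
    """Standard 32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = FNV_OFFSET_BASIS_32
    for byte in text.encode("utf-8"):
        h ^= byte                              # mix in the next byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF    # multiply, keep the low 32 bits
    return h


def ngram_bucket(ngram, num_buckets=2000000):
    """Map an n-gram to a bucket index in [0, num_buckets)."""
    return fnv1a_32(ngram) % num_buckets


for gram in ["<ni", "nig", "igh", "ght", "ht>"]:
    print(gram, ngram_bucket(gram))
```

Because the number of buckets is fixed, the n-gram matrix has a bounded size no matter how many distinct n-grams the corpus contains. The price is that unrelated n-grams can collide and share one vector.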

Note that a model built with Gensim's native fastText implementation can continue training as new corpus data arrives.

Saving and loading models

Use the save and load functions, as shown below:

# Save a model trained via Gensim's fastText implementation to temp.
import tempfile
import os
with tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:
    model.save(tmp.name, separately=[])

# Load back the same model.
loaded_model = FastText.load(tmp.name)
print(loaded_model)

os.unlink(tmp.name)  # demonstration complete, don't need the temp file anymore

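fastText models can also be exported with save_word2vec_format, but this discards all n-gram vectors. A model loaded from that format therefore behaves like a regular Word2Vec model, that is, like a fastText model with no n-gram information.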

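Word vector lookup

Everything needed to look up word vectors in a fastText model is stored in model.wv. If you do not need to continue training, you can export the wv object (for example with pickle) and discard the model to save memory.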

wv = model.wv
print(wv)

#
# FastText models support vector lookups for out-of-vocabulary words by summing up character ngrams belonging to the word.
#
print('night' in wv.key_to_index)

#[OUT]:
#	<gensim.models.fasttext.FastTextKeyedVectors object at 0x20ce0d828>
#	True

Look up the vector for a word:

print(wv['night'])  # in Gensim 4.x, index the KeyedVectors (wv), not the model
'''
[OUT]:
array([ 0.09290078,  0.00179044, -0.5732425 ,  0.47277036,  0.59876233,
       -0.31260246, -0.18675974, -0.03937651,  0.42742983,  0.3419642 ,
       -0.6347907 , -0.01129783, -0.6731092 ,  0.40949872,  0.27855358,
       -0.0675667 , -0.19392972,  0.17853093,  0.24443033, -0.37596267,
       -0.23575999,  0.27301458, -0.36870447,  0.02350322, -0.8377813 ,
        0.7330566 ,  0.11465224,  0.17489424,  0.4105659 ,  0.00782498,
       -0.6537432 ,  0.23468146,  0.0849599 , -0.4827836 ,  0.46601945,
        0.10883024, -0.16093193, -0.0672544 ,  0.4203116 ,  0.21155815,
       -0.00366337, -0.0748013 ,  0.3834724 , -0.06503348,  0.12586932,
        0.1853084 , -0.1237317 ,  0.20932904, -0.01647663, -0.3908304 ,
       -0.5708807 , -0.5556746 ,  0.06411647,  0.0105149 ,  0.3988393 ,
       -0.8015626 , -0.1093765 , -0.18021879,  0.01527423, -0.03230731,
        0.21715961, -0.12600328, -0.48359045, -0.10510948, -0.5346136 ,
        0.34130558,  0.00175925,  0.15395461,  0.03269634,  0.4691867 ,
       -0.5634196 , -0.51715475, -0.01452069, -0.11632308, -0.33402348,
        0.03678156,  0.2714943 ,  0.11561721, -0.13655168,  0.18497233,
        0.44912726,  0.05588026, -0.16958544,  0.4569073 , -0.38961336,
       -0.25632814,  0.11925202,  0.29190361,  0.3145572 ,  0.28840527,
       -0.1761603 ,  0.11538666, -0.03718378, -0.19138913, -0.2689859 ,
        0.55656165,  0.28513685,  0.44856617,  0.5552184 ,  0.46507034],
      dtype=float32)
'''

Similarity queries

print(wv.similarity('night','nights'))
'''
[OUT]:0.9999927
'''

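Syntactically similar words usually have high similarity in a fastText model because most of their character n-grams are the same. This is why fastText tends to do better than Word2Vec on syntactic tasks.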
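To see how much two word forms share, here is a rough sketch that measures the overlap between their n-gram sets with the Jaccard index. It is not how Gensim computes similarity (that uses the trained vectors). The helper names and the `<`/`>` boundary markers are assumptions for illustration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Set of character n-grams of `word`, with assumed boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    }


def ngram_overlap(word_a, word_b):
    """Jaccard overlap of the two words' n-gram sets (0.0 = disjoint, 1.0 = identical)."""
    a, b = char_ngrams(word_a), char_ngrams(word_b)
    return len(a & b) / len(a | b)


print(ngram_overlap("night", "nights"))  # many shared n-grams
print(ngram_overlap("night", "study"))   # no shared n-grams
```

Since the vectors of "night" and "nights" are built largely from the same n-gram vectors, they end up close together regardless of how often each form appears in the corpus.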

print(wv.most_similar("nights"))
'''
[OUT]:
[('Arafat', 0.9982752203941345),
 ('study', 0.9982697367668152),
 ('"That', 0.9982694983482361),
 ('boat', 0.9982693791389465),
 ('Arafat,', 0.9982683062553406),
 ('Endeavour', 0.9982543587684631),
 ('often', 0.9982521533966064),
 ("Arafat's", 0.9982460737228394),
 ('details', 0.9982452392578125),
 ('north.', 0.9982450008392334)]
'''

Because the corpus is tiny, poor results like these are to be expected.

print(wv.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant']))
'''
[OUT]:
0.99995166
'''

n_similarity compares the similarity of two lists of tokens.
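One common way to compare two token lists is to average each list's word vectors and take the cosine of the two averages. The sketch below assumes that approach and uses invented 2-dimensional vectors. It illustrates the idea and is not a copy of Gensim's code.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def n_similarity_sketch(tokens_a, tokens_b, vectors):
    """Cosine similarity between the mean vectors of two token lists."""
    mean_a = mean_vector([vectors[t] for t in tokens_a])
    mean_b = mean_vector([vectors[t] for t in tokens_b])
    return cosine(mean_a, mean_b)


# Hypothetical toy vectors: first axis ~ "Japanese food", second axis ~ "place to eat".
toy_vectors = {
    "sushi": [1.0, 0.0],
    "shop": [0.0, 1.0],
    "japanese": [1.0, 0.0],
    "restaurant": [0.0, 1.0],
    "boat": [-1.0, 0.0],
}

print(n_similarity_sketch(["sushi", "shop"], ["japanese", "restaurant"], toy_vectors))
print(n_similarity_sketch(["sushi", "shop"], ["boat"], toy_vectors))
```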

print(wv.doesnt_match("breakfast cereal dinner lunch".split()))  # query via wv in Gensim 4.x
'''
[OUT]:cereal
'''

The odd one out is cereal.

print(wv.most_similar(positive=['baghdad', 'england'], negative=['london']))  # query via wv in Gensim 4.x
'''
[OUT]:
[('1', 0.2434064894914627),
 ('40', 0.23903147876262665),
 ('2', 0.2356666624546051),
 ('20', 0.2340335100889206),
 ('26', 0.23390895128250122),
 ('blaze', 0.23327460885047913),
 ('UN', 0.2332388311624527),
 ('keep', 0.23248346149921417),
 ('As', 0.2321406602859497),
 ('...', 0.23206500709056854)]
'''

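This finds the words most similar to the positive list and least similar to the negative list, which is the usual way to pose an analogy query.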
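The sketch below shows one way such an analogy query can work: add the unit vectors of the positive words, subtract those of the negative words, and rank all remaining words by cosine similarity to the result. The 3-dimensional vectors are invented so that the analogy works out, and the function is an illustration under those assumptions, not Gensim's implementation.

```python
import math


def unit(vec):
    """Scale `vec` to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar_sketch(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    exclude = set(positive) | set(negative)  # never return the query words themselves
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors with axes (is-a-capital, relates-to-England, relates-to-Iraq).
toy_vectors = {
    "london": [1.0, 1.0, 0.0],
    "england": [0.0, 1.0, 0.0],
    "baghdad": [1.0, 0.0, 1.0],
    "iraq": [0.0, 0.0, 1.0],
    "study": [1.0, 0.0, 0.0],
    "boat": [0.0, 1.0, 0.0],
}

print(most_similar_sketch(toy_vectors, positive=["baghdad", "england"], negative=["london"]))
```

With a well-trained model the top result for this query would ideally be a word like "iraq". On the tiny Lee Corpus the real output above is mostly noise.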

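Compare two sentences using Word Mover's Distance:

from gensim.parsing.preprocessing import STOPWORDS

# Example sentences from the original Gensim tutorial, lowercased, tokenized, stopwords removed.
sentence_obama = [w for w in 'Obama speaks to the media in Illinois'.lower().split() if w not in STOPWORDS]
sentence_president = [w for w in 'The president greets the press in Chicago'.lower().split() if w not in STOPWORDS]

# wmdistance needs an optional EMD backend package; which one depends on the Gensim version.
distance = wv.wmdistance(sentence_obama, sentence_president)
print(distance)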
'''
[OUT]:
1.3929935492649077

'''
