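This post is a brief translation of the official KeyedVectors documentation. Corrections are welcome.
Training word vectors and using them can be kept separate. In gensim, KeyedVectors implements the mapping between entities (words, documents, images, and so on) and their vectors. Every entity is identified by a string id.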
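The idea of a string-id-to-vector mapping can be illustrated with a small toy class. This is only a sketch in plain Python: `TinyKeyedVectors`, its method names, and the sample keys are hypothetical and are not gensim's implementation.

```python
class TinyKeyedVectors:
    """Toy key-to-vector store (hypothetical; not gensim's KeyedVectors)."""

    def __init__(self, vector_size):
        self.vector_size = vector_size
        self.index_to_key = []   # position -> string id
        self.key_to_index = {}   # string id -> position
        self.vectors = []        # position -> list of floats

    def add(self, key, vector):
        """Store a vector under a string id, overwriting any earlier entry."""
        if len(vector) != self.vector_size:
            raise ValueError("expected %d dimensions" % self.vector_size)
        if key in self.key_to_index:
            self.vectors[self.key_to_index[key]] = list(vector)
        else:
            self.key_to_index[key] = len(self.index_to_key)
            self.index_to_key.append(key)
            self.vectors.append(list(vector))

    def __getitem__(self, key):
        return self.vectors[self.key_to_index[key]]

    def __contains__(self, key):
        return key in self.key_to_index


# Any entity works as long as it has a string id (sample ids are made up).
kv = TinyKeyedVectors(vector_size=3)
kv.add("word:cat", [0.1, 0.2, 0.3])
kv.add("doc:42", [0.0, 1.0, 0.0])
print(kv["doc:42"], "word:cat" in kv, "img:7" in kv)
```

Because the store only needs keys and vectors, it can be used without the model that produced the vectors, which is the separation described above.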
A Word2Vec model is trained in gensim as follows:
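>>> from gensim.test.utils import common_texts  # common_texts is a small pre-tokenized corpus
>>> from gensim.models import Word2Vec
>>>
>>> # `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`
>>> model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
>>> word_vectors = model.wv  # the KeyedVectors part of the trained model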
To load existing word vectors, use:
>>> from gensim.models import KeyedVectors
>>> from gensim.test.utils import datapath
>>>
>>> wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)  # C text format
>>> wv_from_bin = KeyedVectors.load_word2vec_format(datapath("euclidean_vectors.bin"), binary=True)  # C binary format
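For readers unfamiliar with the C text format, the sketch below parses it by hand: a header line with the vocabulary size and vector size, then one line per word with its components. The function name `parse_word2vec_text` and the sample data are made up for illustration; in practice `load_word2vec_format` does this work.

```python
def parse_word2vec_text(lines):
    """Parse word2vec C text format lines into a {word: [floats]} dict."""
    it = iter(lines)
    header = next(it).split()
    vocab_size, vector_size = int(header[0]), int(header[1])

    vectors = {}
    for line in it:
        parts = line.split()          # tolerates trailing spaces and newlines
        if not parts:
            continue                  # skip blank lines
        if len(parts) != vector_size + 1:
            raise ValueError("bad line for %r" % parts[0])
        vectors[parts[0]] = [float(x) for x in parts[1:]]

    if len(vectors) != vocab_size:
        raise ValueError("header promised %d words, found %d"
                         % (vocab_size, len(vectors)))
    return vectors


# Made-up sample: 2 words, 3 dimensions each.
sample = ["2 3", "king 0.5 0.1 -0.2", "queen 0.4 0.2 -0.1"]
parsed = parse_word2vec_text(sample)
print(sorted(parsed))
```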
Once the vectors are available, you can compute the similarity between words and run related queries:
>>> import gensim.downloader as api
>>>
>>> word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data
>>>
>>> result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.7699
>>>
>>> result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.8965
>>>
>>> print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
cereal
>>>
>>> similarity = word_vectors.similarity('woman', 'man')
>>> similarity > 0.8
True
>>>
>>> result = word_vectors.similar_by_word("cat")
>>> print("{}: {:.4f}".format(*result[0]))
dog: 0.8798
>>>
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>>
>>> similarity = word_vectors.wmdistance(sentence_obama, sentence_president)
>>> print("{:.4f}".format(similarity))
3.4893
>>>
>>> distance = word_vectors.distance("media", "media")
>>> print("{:.1f}".format(distance))
0.0
>>>
>>> sim = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
>>> print("{:.4f}".format(sim))
0.7067
>>>
>>> vector = word_vectors['computer']  # numpy vector of a word
>>> vector.shape
(100,)
>>>
>>> # word_vectors is already a KeyedVectors object, so no extra `.wv` is needed
>>> vector = word_vectors.word_vec('office', use_norm=True)  # unit-normalized vector
>>> vector.shape
(100,)
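To make the queries above less opaque, here is a rough sketch of cosine similarity and of the "king - man + woman" style lookup. It assumes the query is built by adding the unit-normalized positive vectors and subtracting the negative ones, which is my understanding of what `most_similar` does. The three-dimensional toy vectors and the helper names `unit`, `cosine`, and `analogy` are invented for this example.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity: dot product of the two unit vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    skip = set(positive) | set(negative)   # input words are not candidates
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in skip]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy axes (assumed for illustration): [royalty, male, female]
toy = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

best = analogy(toy, positive=["woman", "king"], negative=["man"])
print(best[0][0])

# Cosine distance is 1 - similarity, so a word is at distance 0 from itself.
self_distance = 1.0 - cosine(toy["king"], toy["king"])
```

The real `most_similar_cosmul` uses a multiplicative combination instead of this additive one, which is why it reports a different score for the same query.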