Since you mentioned that spaCy is an option as the NLP library, let's consider a simple benchmark. We'll use the Brown news corpus, split in half, to create some arbitrary word pairs:

```python
import pandas as pd
from nltk.corpus import brown

brown_corpus = list(brown.words(categories='news'))
brown_df = pd.DataFrame({
    'word_1': brown_corpus[:len(brown_corpus)//2],
    'word_2': brown_corpus[len(brown_corpus)//2:]
})
len(brown_df)
# 50277
```
The cosine similarity of two tokens/documents can be computed with spaCy's `similarity()` method.
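Under the hood, that similarity score is plain cosine similarity between word vectors; the measure itself can be sketched with NumPy:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: u.v / (|u| |v|).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # parallel vectors, so the result is ~1.0
```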
Finally, apply both methods to the DataFrame:

```python
nltk_similarity = %timeit -o brown_df.apply(nltk_max_similarity, axis=1)
# 1 loop, best of 3: 59 s per loop

spacy_similarity = %timeit -o brown_df.apply(spacy_max_similarity, axis=1)
# 1 loop, best of 3: 8.88 s per loop
```
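As a side note, `%timeit -o` is an IPython magic; in a plain Python script the stdlib `timeit` module gives equivalent measurements (the workload below is a toy stand-in, not the actual similarity call):

```python
import timeit

def dummy_workload():
    # Stand-in for a per-row similarity computation (illustration only).
    return sum(i * i for i in range(1000))

# Total wall-clock time for 100 calls, analogous to %timeit's measurement.
elapsed = timeit.timeit(dummy_workload, number=100)
print(f'{elapsed:.4f} s for 100 calls')
```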
Note that NLTK and spaCy use different techniques to measure similarity: spaCy relies on pretrained word vectors. From the docs, under "Using word vectors and semantic similarities":
> [...]
>
> The default English model installs vectors for one million vocabulary
> entries, using the 300-dimensional vectors trained on the Common Crawl
> corpus using the GloVe algorithm. The GloVe common crawl vectors have
> become a de facto standard for practical NLP.