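The usual way to do this is to transform the documents into tf-idf vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this; see in particular Introduction to Information Retrieval, which is free and available online.

Tf-idf (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as

    from sklearn.feature_extraction.text import TfidfVectorizer

    # text_files is a list of file paths; the vectorizer expects strings,
    # so read each file's contents rather than passing the file objects
    documents = [open(f).read() for f in text_files]
    tfidf = TfidfVectorizer().fit_transform(documents)
    # no need to normalize, since Vectorizer will return normalized tf-idf
    pairwise_similarity = tfidf * tfidf.T

or, if the documents are plain strings,

    >>> vect = TfidfVectorizer(min_df=1)
    >>> tfidf = vect.fit_transform(["I'd like an apple",
    ...                             "An apple a day keeps the doctor away",
    ...                             "Never compare an apple to an orange",
    ...                             "I prefer scikit-learn to Orange"])
    >>> (tfidf * tfidf.T).toarray()
    array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
           [ 0.25082859,  1.        ,  0.22057609,  0.        ],
           [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
           [ 0.        ,  0.        ,  0.26264139,  1.        ]])

though Gensim may have more options for this kind of task.

See also this question.

[Disclaimer: I was involved in the scikit-learn tf-idf implementation.]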
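To see why multiplying the tf-idf matrix by its transpose gives cosine similarities, here is a minimal standard-library sketch of the same computation. It is not scikit-learn's code. It assumes what I understand to be the `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names `tokenize`, `tfidf_vectors` and `pairwise_similarity` are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (assumed to mirror the vectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(docs):
    # Term counts per document.
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Raw term frequency times smoothed idf.
        weights = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, tf in c.items()}
        # L2-normalize so that a plain dot product is a cosine similarity.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def pairwise_similarity(docs):
    vecs = tfidf_vectors(docs)
    # Dot product of every pair of unit-length sparse vectors.
    return [[sum(w * b.get(t, 0.0) for t, w in a.items()) for b in vecs]
            for a in vecs]


docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
for row in pairwise_similarity(docs):
    print(["%.4f" % value for value in row])
```

Because every vector has unit length, the diagonal is 1 and each off-diagonal entry is the cosine of the angle between two documents. Under the assumptions above, the values agree with the scikit-learn matrix in the answer (for example, about 0.2508 between the first two sentences). For real work the library version is preferable, since it uses sparse matrices and scales to large collections.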