Continued from the previous post, <sklearn text feature preprocessing 1: WordPunctTokenizer, CountVectorizer, TF-IDF>
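5. Similarity features
# Cosine similarity between every pair of documents
# (tv_matrix and pd, i.e. pandas, are assumed to be defined in the previous post)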
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
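
The snippet above depends on variables from the previous post. As a self-contained sketch of the same step, the following builds a TF-IDF matrix from a small hypothetical corpus (the three sentences are illustrative, not the data used in this series) and computes the pairwise cosine similarity. The name tv_matrix is reused here on the assumption that it holds TF-IDF vectors.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration
corpus = [
    "the sky is blue",
    "the sky is blue and beautiful",
    "the quick brown fox jumps over the lazy dog",
]

# One TF-IDF row per document (rows are L2-normalized by default)
tv_matrix = TfidfVectorizer().fit_transform(corpus)

# Entry (i, j) is the cosine similarity between document i and document j
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
print(similarity_df.round(3))
```

The result is a square, symmetric matrix with ones on the diagonal. Each row can be read as a feature vector describing how close one document is to every other document.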

6. Cluster features
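from sklearn.cluster import KMeans
# Group the documents into 2 clusters, treating each row of the similarity matrix as a feature vector
km = KMeans(n_clusters = 2)
# fit_transform fits the model; its return value (distances to the cluster centers) is not used here
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])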
pd.concat(
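
The original pd.concat call is cut off, so its arguments are unknown. The following is only a sketch of one plausible way to attach cluster labels to a table of documents. The corpus, the corpus_df frame, its Document column, and the n_init and random_state settings are all assumptions made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus with two obvious topics
corpus = [
    "the sky is blue",
    "the sky is blue and beautiful",
    "the fox jumps over the lazy dog",
    "the quick brown fox jumps over the dog",
]
corpus_df = pd.DataFrame({"Document": corpus})  # assumed layout

# Document-by-document cosine similarity on TF-IDF vectors
tv_matrix = TfidfVectorizer().fit_transform(corpus)
similarity_df = pd.DataFrame(cosine_similarity(tv_matrix))

# random_state and n_init are fixed here only to make the sketch reproducible
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(similarity_df)
cluster_labels = pd.DataFrame(km.labels_, columns=["ClusterLabel"])

# Place the labels next to the documents, column-wise
result = pd.concat([corpus_df, cluster_labels], axis=1)
print(result)
```

Here fit is used instead of fit_transform because only the labels are needed. The cluster ids themselves are arbitrary: which group is numbered 0 and which is numbered 1 can change with the seed.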




