word2vec in Python: clustering similar words with word2vec

This might be a naive question. I have a tokenized corpus on which I have trained Gensim's Word2Vec model. The code is below:

from newspaper import Article
import string
import gensim
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
snowball = SnowballStemmer('english')

site = Article("http://www.datasciencecentral.com/profiles/blogs/blockchain-and-artificial-intelligence-1")
site.download()
site.parse()

def clean(doc):
    stop_free = " ".join(i for i in word_tokenize(doc.lower()) if i not in stop)
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    snowed = " ".join(snowball.stem(word) for word in normalized.split())
    return snowed

b = clean(site.text)
# Word2Vec expects a list of token lists, so split the cleaned string back into tokens
model = gensim.models.Word2Vec([b.split()], min_count=1, size=32)
print(model)  # Prints: Word2Vec(vocab=643, size=32, alpha=0.025)

To cluster similar words, I am using PCA to visualize the clusters of similar words. But the problem is that it forms only one big cluster, as seen in the image.

PCA & scatter plot Code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

vocab = list(model.wv.vocab)
X = model.wv[vocab]  # model[vocab] is deprecated; index the KeyedVectors directly

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

df = pd.concat([pd.DataFrame(X_pca), pd.Series(vocab)], axis=1)
df.columns = ['x', 'y', 'word']

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(df['x'], df['y'])
plt.show()

So, I have two questions here:

1) Is just one article enough to get a clear segregation of the clusters?

2) If I have a model trained on a huge corpus and I want to find the similar words in a new article and visualize them (i.e. the words in the article I'm predicting on) as clusters, is there a way to do that?

I highly appreciate your suggestions. Thank you.

Solution: No, not really. For reference, common word2vec models trained on Wikipedia (in English) are trained on around 3 billion words.

You can use KNN (or something similar). Gensim has the most_similar function to get the words closest to a query word. Combined with dimensionality reduction (like PCA or t-SNE), this gives you a nice cluster plot. (Gensim does not have a t-SNE module, but sklearn does, so you can use that.)

By the way, you're referring to an image, but it's not available.
