How to map words in a data frame to integer IDs using python-pandas and gensim?

Given a data frame like this, containing item IDs and the corresponding review texts:

item_id review_text

B2JLCNJF16 i was attracted to this...

B0009VEM4U great snippers...

I want to map the 5000 most frequent words in review_text to integer IDs, so the resulting data frame should look like:

item_id review_text

B2JLCNJF16 1 2 3 4 5...

B0009VEM4U 6... #as the word "snippers" is not among the 5000 most frequent words

Or, better still, a bag-of-words vector:

item_id review_text

B2JLCNJF16 [1,1,1,1,1....]

B0009VEM4U [0,0,0,0,0,1....]

How can I do that? Thanks a lot!

EDIT:

I have tried @ayhan's answer and have successfully converted the review text to doc2bow form:

item_id review_text

B2JLCNJF16 [(123,2),(130,3),(159,1)...]

B0009VEM4U [(3,2),(110,2),(121,5)...]

This means that the word with ID 123 occurs 2 times in that document. Now I'd like to convert it to a vector like:

[0,0,0,.....,2,0,0,0,....,3,0,0,0,......1...]

#123rd 130th 159th

Do you know how to do that? Thank you in advance!

Solution

First, to get a list of words in every row:

# split() without arguments also collapses repeated whitespace, so no empty tokens are produced
df["review_text"] = df["review_text"].map(lambda x: x.split())

Now you can pass df["review_text"] to gensim's Dictionary:

from gensim import corpora

dictionary = corpora.Dictionary(df["review_text"])

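To keep the 5000 most frequent words, use the filter_extremes method. With no_below=1 and no_above=1 the document-frequency thresholds are effectively disabled, so only keep_n takes effect: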

dictionary.filter_extremes(no_below=1, no_above=1, keep_n=5000)

The doc2bow method gives you the bag-of-words representation as (word_id, frequency) pairs; words that are no longer in the dictionary are ignored:

df["bow"] = df["review_text"].map(dictionary.doc2bow)

0 [(1, 2), (3, 1), (5, 1), (11, 1), (12, 3), (18...

1 [(0, 3), (24, 1), (28, 1), (30, 1), (56, 1), (...

2 [(8, 1), (15, 1), (18, 2), (29, 1), (36, 2), (...

3 [(69, 1), (94, 1), (115, 1), (123, 1), (128, 1...

4 [(2, 1), (18, 4), (26, 1), (32, 1), (55, 1), (...

5 [(6, 1), (18, 1), (30, 1), (61, 1), (71, 1), (...

6 [(0, 5), (13, 1), (18, 6), (31, 1), (42, 1), (...

7 [(0, 10), (5, 1), (18, 1), (35, 1), (43, 1), (...

8 [(0, 24), (1, 4), (4, 2), (7, 1), (10, 1), (14...

9 [(0, 7), (18, 3), (30, 1), (32, 1), (34, 1), (...

10 [(0, 5), (9, 1), (18, 3), (19, 1), (21, 1), (2...
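The question's first desired output, a sequence of integer IDs per review with out-of-vocabulary words dropped, is not what doc2bow returns, since doc2bow aggregates counts per word. A minimal, dependency-free sketch of that mapping follows. The helper names build_vocab and to_ids and the toy documents are assumptions made for illustration, and the ranking here uses total token counts, which may differ from how gensim's Dictionary ranks and numbers words.

```python
from collections import Counter

def build_vocab(token_lists, keep_n):
    """Map the keep_n most frequent tokens to integer IDs (hypothetical helper)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # most_common() sorts by descending count; the rank becomes the ID
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(keep_n))}

def to_ids(tokens, vocab):
    """Replace each known token with its ID and drop out-of-vocabulary tokens."""
    return [vocab[tok] for tok in tokens if tok in vocab]

# Toy tokenized reviews standing in for df["review_text"]
docs = [
    "great snippers great price".split(),
    "great price".split(),
]
vocab = build_vocab(docs, keep_n=2)
id_docs = [to_ids(doc, vocab) for doc in docs]
```

On a data frame, the same function could be applied row by row, for example with df["review_text"].map(lambda tokens: to_ids(tokens, vocab)).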

After getting the bag-of-words representation, you can concatenate the per-row frames into a single data frame with one column per word ID (probably not very efficient):

df2 = pd.concat([pd.DataFrame(s).set_index(0) for s in df["bow"]], axis=1).fillna(0).T.set_index(df.index)

0 1 2 3 4 5 6 7 8 9 ... 728 729 730 731 732 733 734 735 736 737

0 0 2 0 1 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

1 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 1 1 0 0 0

3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

4 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 1 0

5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 0 0

6 5 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0

7 10 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

8 24 4 0 0 2 0 0 1 0 0 ... 1 1 2 0 1 3 1 0 1 0

9 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

10 5 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
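If a plain fixed-length vector per document is enough, the (word_id, frequency) pairs can also be expanded directly, without building intermediate data frames. The sketch below is one possible way to do it in pure Python; bow_to_dense is a hypothetical helper name, and num_terms is assumed to be the vocabulary size (for a gensim Dictionary, len(dictionary)).

```python
def bow_to_dense(bow, num_terms):
    """Expand (word_id, count) pairs into a dense list of length num_terms."""
    vec = [0] * num_terms
    for word_id, count in bow:
        vec[word_id] = count  # position word_id holds that word's frequency
    return vec

# Toy doc2bow-style input: word 1 occurs twice, words 3 and 5 once each
bow = [(1, 2), (3, 1), (5, 1)]
dense = bow_to_dense(bow, num_terms=8)
```

Applied to the data frame, this might look like df["bow"].map(lambda b: bow_to_dense(b, len(dictionary))). gensim also ships gensim.matutils.sparse2full, which performs a similar expansion into a NumPy array and may be worth checking before writing your own helper.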
