[NLP] Gensim Study Notes (Part 1)

I won't introduce the library itself here. I don't recommend starting from the official documentation: it is poorly organized and it's hard to pick out the key points. There are plenty of blog posts about this library online, but most of them are shallow; they either copy each other or skim over the APIs, which makes the library genuinely frustrating for beginners to learn.
1. doc2bow: the bag-of-words model
It took me a whole morning to figure this API out. The source code explains it like this:

Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples

That is, doc2bow converts a document into the bag-of-words (BoW) format: a list of (token_id, token_count) tuples. If we draw the return value of doc2bow as a tree, it looks like the figure below (this shape applies to a document made up of several sentences; if there is only one sentence, only the lower two levels of the tree are returned):
[Figure: tree structure of the doc2bow return value]

So what exactly are the token_id and token_count in these bag-of-words tuples? Unfortunately, few blog posts explain them clearly. Let's start with an example:

from gensim import corpora


texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dct = corpora.Dictionary(texts)
print(dct.token2id)

c = [dct.doc2bow(text) for text in texts]
print(c)

corpora.Dictionary() essentially turns texts into a set-like structure: duplicate words are removed, and every word gets an integer id:

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

Note that the ids do not follow the order in which the words appear within a sentence. The values of the dict printed by dct.token2id are exactly the token_id fields of the bag-of-words tuples.
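Judging from the output above, new tokens in each document appear to receive consecutive ids in sorted order within that document. This can be sketched in plain Python (a sketch that mimics the observed behavior, not gensim's actual implementation):

```python
def build_token2id(texts):
    """Sketch of how Dictionary appears to assign ids: tokens not seen
    before get consecutive ids, in sorted order within each document."""
    token2id = {}
    for doc in texts:
        for tok in sorted(set(doc)):     # unseen tokens, sorted per document
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

print(build_token2id(texts))
# reproduces the mapping printed by dct.token2id above
```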

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]

The output c is shown above. The second element of each tuple, token_count, is the number of times the word occurs in its sentence. A word that occurs more than once is still collapsed into a single tuple: for example, 'system' in the fourth document is recorded as (5, 2), not as (5, 1), (5, 1).
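The counting itself can be sketched with collections.Counter (a sketch of the semantics, not gensim's implementation; note that by default doc2bow simply ignores tokens that are not in the dictionary):

```python
from collections import Counter

# the mapping printed by dct.token2id above
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}

def doc2bow_sketch(doc, token2id):
    """Count each known token once per document; unknown tokens are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

print(doc2bow_sketch(['system', 'human', 'system', 'eps'], token2id))
# [(1, 1), (5, 2), (8, 1)] -- matches the fourth document above
```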
The texts above is already a list of tokenized sentences, but in practice the data usually lives in a text file. Converting a text file into bags of words looks like this:

from gensim import corpora

with open("text.txt", encoding='utf-8') as f:
    # one sentence per line, tokens separated by spaces
    voc_list = [line.strip().split(' ') for line in f]

dic = corpora.Dictionary(voc_list)
print(dic.token2id)
bow = [dic.doc2bow(text) for text in voc_list]
print(bow)
{'Pepperleigh': 0, 'Virginia': 1, 'Zena': 2, 'by': 3, 'creepers.': 4, 'half': 5, 'hidden': 6, 'house,': 7, "judge's": 8, 'novels': 9, 'of': 10, 'on': 11, 'piazza': 12, 'reading': 13, 'sit': 14, 'the': 15, 'to': 16, 'used': 17, 'At': 18, 'a': 19, 'and': 20, 'book': 21, 'did': 22, 'eyes': 23, 'fall': 24, 'her': 25, 'in': 26, 'it': 27, 'lap': 28, 'look': 29, 'such': 30, 'that': 31, 'there': 32, 'times': 33, 'unstilled': 34, 'upon': 35, 'violet': 36, 'was': 37, 'would': 38, 'yearning': 39, 'another': 40, 'apple': 41, 'beside': 42, 'bite': 43, 'disappear': 44, 'entirely': 45, 'even': 46, 'it.': 47, 'lay': 48, 'not': 49, 'out': 50, 'picked': 51, 'she': 52, 'took': 53, 'up': 54, 'when': 55, 'When': 56, 'With': 57, 'all': 58, 'beautiful': 59, 'clasped': 60, 'day-dreams': 61, 'dreaming': 62, 'faraway': 63, 'girlhood.': 64, 'hands': 65, 'saw': 66, 'you': 67, 'armoured': 68, 'embattled': 69, 'eyes,': 70, 'from': 71, 'knight': 72, 'meant': 73, 'plumed': 74, 'rescuing': 75, 'Algerian': 76, 'Danube.': 77, 'an': 78, 'away': 79, 'being': 80, 'borne': 81, 'castle': 82, 'corsair': 83, 'keep': 84, 'other': 85, 'France': 86, 'Mediterranean': 87, 'arms': 88, 'blue': 89, 'farewell': 90, 'over': 91, 'reaching': 92, 'say': 93, 'towards': 94, 'waters': 95}
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 3), (16, 1), (17, 1)], [(10, 1), (15, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)], [(10, 1), (15, 1), (20, 1), (25, 1), (31, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1)], [(10, 1), (14, 1), (15, 1), (31, 1), (32, 1), (38, 1), (52, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1)], [(15, 1), (19, 1), (20, 1), (25, 2), (26, 1), (27, 1), (29, 1), (31, 2), (37, 2), (52, 1), (62, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1)], [(3, 1), (10, 1), (15, 1), (18, 1), (19, 1), (33, 1), (37, 1), (42, 1), (52, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1)], [(10, 1), (15, 2), (16, 2), (20, 1), (25, 1), (37, 1), (47, 1), (50, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1)]]
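To turn a bag-of-words vector back into readable (word, count) pairs, you can invert the token2id mapping (using the small example's mapping below; gensim's Dictionary also supports looking up a token by id directly, e.g. dic[token_id], if memory serves):

```python
# the mapping from the first example
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}
id2token = {i: t for t, i in token2id.items()}  # invert id -> token

bow = [(1, 1), (5, 2), (8, 1)]   # fourth document of the first example
print([(id2token[i], n) for i, n in bow])
# [('human', 1), ('system', 2), ('eps', 1)]
```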

Original text:
[Figure: contents of text.txt]
