[NLP] Gensim Study Notes (Part 1)

I won't introduce the library itself here. I don't recommend starting from the official documentation: it is poorly organized and it's hard to pick out the key points. There are plenty of blog posts about this library online, but most of them are shallow; they either copy each other or skim over the APIs, which makes the library genuinely frustrating for beginners to learn.
1. doc2bow: the bag-of-words model
It took me a whole morning to figure this API out. The source code explains it like this:

Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples

That is, doc2bow converts a document into the bag-of-words (BoW) format: a list of (token_id, token_count) tuples. If we draw the return value of doc2bow as a tree, it looks like the figure below (this shape applies to a document made up of several sentences; if there is only one sentence, only the lower two levels of the tree are returned):
[Figure: tree structure of the doc2bow return value]

So what exactly are the token_id and token_count in these bag-of-words tuples? Unfortunately, few blog posts explain them clearly. Let's start with an example:

from gensim import corpora


texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dct = corpora.Dictionary(texts)
print(dct.token2id)

c = [dct.doc2bow(text) for text in texts]
print(c)

corpora.Dictionary() essentially turns texts into a set-like structure: duplicate words are removed, and every word gets an integer id:

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

Note that the ids do not follow the order in which the words appear within a sentence. The values of the dict printed by dct.token2id are exactly the token_id fields of the bag-of-words tuples.
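Judging from the output above, new tokens in each document appear to receive consecutive ids in sorted order within that document. This can be sketched in plain Python (a sketch that mimics the observed behavior, not gensim's actual implementation):

```python
def build_token2id(texts):
    """Sketch of how Dictionary appears to assign ids: tokens not seen
    before get consecutive ids, in sorted order within each document."""
    token2id = {}
    for doc in texts:
        for tok in sorted(set(doc)):     # unseen tokens, sorted per document
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

print(build_token2id(texts))
# reproduces the mapping printed by dct.token2id above
```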

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]

The output c is shown above. The second element of each tuple, token_count, is the number of times the word occurs in its sentence. A word that occurs more than once is still collapsed into a single tuple: for example, 'system' in the fourth document is recorded as (5, 2), not as (5, 1), (5, 1).
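The counting itself can be sketched with collections.Counter (a sketch of the semantics, not gensim's implementation; note that by default doc2bow simply ignores tokens that are not in the dictionary):

```python
from collections import Counter

# the mapping printed by dct.token2id above
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}

def doc2bow_sketch(doc, token2id):
    """Count each known token once per document; unknown tokens are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

print(doc2bow_sketch(['system', 'human', 'system', 'eps'], token2id))
# [(1, 1), (5, 2), (8, 1)] -- matches the fourth document above
```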
The texts above is already a list of tokenized sentences, but in practice the data usually lives in a text file. Converting a text file into bags of words looks like this:

from gensim import corpora

with open("text.txt", encoding='utf-8') as f:
    # one sentence per line, tokens separated by spaces
    voc_list = [line.strip().split(' ') for line in f]

dic = corpora.Dictionary(voc_list)
print(dic.token2id)
bow = [dic.doc2bow(text) for text in voc_list]
print(bow)
{'Pepperleigh': 0, 'Virginia': 1, 'Zena': 2, 'by': 3, 'creepers.': 4, 'half': 5, 'hidden': 6, 'house,': 7, "judge's": 8, 'novels': 9, 'of': 10, 'on': 11, 'piazza': 12, 'reading': 13, 'sit': 14, 'the': 15, 'to': 16, 'used': 17, 'At': 18, 'a': 19, 'and': 20, 'book': 21, 'did': 22, 'eyes': 23, 'fall': 24, 'her': 25, 'in': 26, 'it': 27, 'lap': 28, 'look': 29, 'such': 30, 'that': 31, 'there': 32, 'times': 33, 'unstilled': 34, 'upon': 35, 'violet': 36, 'was': 37, 'would': 38, 'yearning': 39, 'another': 40, 'apple': 41, 'beside': 42, 'bite': 43, 'disappear': 44, 'entirely': 45, 'even': 46, 'it.': 47, 'lay': 48, 'not': 49, 'out': 50, 'picked': 51, 'she': 52, 'took': 53, 'up': 54, 'when': 55, 'When': 56, 'With': 57, 'all': 58, 'beautiful': 59, 'clasped': 60, 'day-dreams': 61, 'dreaming': 62, 'faraway': 63, 'girlhood.': 64, 'hands': 65, 'saw': 66, 'you': 67, 'armoured': 68, 'embattled': 69, 'eyes,': 70, 'from': 71, 'knight': 72, 'meant': 73, 'plumed': 74, 'rescuing': 75, 'Algerian': 76, 'Danube.': 77, 'an': 78, 'away': 79, 'being': 80, 'borne': 81, 'castle': 82, 'corsair': 83, 'keep': 84, 'other': 85, 'France': 86, 'Mediterranean': 87, 'arms': 88, 'blue': 89, 'farewell': 90, 'over': 91, 'reaching': 92, 'say': 93, 'towards': 94, 'waters': 95}
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 3), (16, 1), (17, 1)], [(10, 1), (15, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)], [(10, 1), (15, 1), (20, 1), (25, 1), (31, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1)], [(10, 1), (14, 1), (15, 1), (31, 1), (32, 1), (38, 1), (52, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1)], [(15, 1), (19, 1), (20, 1), (25, 2), (26, 1), (27, 1), (29, 1), (31, 2), (37, 2), (52, 1), (62, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1)], [(3, 1), (10, 1), (15, 1), (18, 1), (19, 1), (33, 1), (37, 1), (42, 1), (52, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1)], [(10, 1), (15, 2), (16, 2), (20, 1), (25, 1), (37, 1), (47, 1), (50, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1)]]
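To turn a bag-of-words vector back into readable (word, count) pairs, you can invert the token2id mapping (using the small example's mapping below; gensim's Dictionary also supports looking up a token by id directly, e.g. dic[token_id], if memory serves):

```python
# the mapping from the first example
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}
id2token = {i: t for t, i in token2id.items()}  # invert id -> token

bow = [(1, 1), (5, 2), (8, 1)]   # fourth document of the first example
print([(id2token[i], n) for i, n in bow])
# [('human', 1), ('system', 2), ('eps', 1)]
```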

Original text:
[Figure: contents of text.txt]
