Gensim Official Documentation: Practice Notes

lagoon_lala

Original link: https://radimrehurek.com/gensim/auto_examples/index.html#documentation

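The Chinese translation of the docs reads worse than machine translation, so I am keeping my own notes here, with some annotations added.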

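The tutorials print everything with pprint:

import pprint

data = ("test", [1, 2, 3, 'test', 4, 5], "This is a string!",
        {'age': 23, 'gender': 'F'})
print(data)
pprint.pprint(data)

print puts everything on one line, while pprint lays the output out according to the data structure, which is easier to read.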

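Core Concepts

Documents, Corpora, Vectors and Models

Document

In Gensim, a document is an object of the text sequence type, i.e. str.

document = "Human machine interface for lab abc computer applications"

Corpus

A corpus (plural: corpora) is a collection of document objects.

What a corpus is used for

1. As training input for a model

Gensim focuses on unsupervised models, so no labeled data is needed. The training corpus is used directly to look for common themes and to initialize the model parameters, which yields a topic model.

2. Organizing new documents

After training, topics can be extracted from new documents, and an index can be built for similarity queries. The index supports semantic similarity queries and clustering.

An example corpus (9 documents):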

text_corpus = [

"Human machine interface for lab abc computer applications",

"A survey of user opinion of computer system response time",

"The EPS user interface management system",

"System and human system engineering testing of EPS",

"Relation of user perceived response time to error measurement",

"The generation of random binary unordered trees",

"The intersection graph of paths in trees",

"Graph minors IV Widths of trees and well quasi ordering",

"Graph minors A survey",

]

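A very large corpus cannot be loaded into memory all at once; use corpus streaming instead (one document at a time).

Corpus preprocessing

Typical steps include:

Removing common words such as 'the', and words that appear only once.

Tokenization: splitting documents into words (using whitespace as the delimiter).

Lower-casing.

Beyond a toy example, the gensim.utils.simple_preprocess() function is a better choice than lower-casing and splitting by hand.

Example: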

# Build the set of frequent words (stop words)
stoplist = set('for a of the and to in'.split(' '))

# Lower-case each document, split on whitespace, drop stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

# Count word frequencies
from collections import defaultdict

frequency = defaultdict(int)  # missing keys start at 0
for text in texts:
    for token in text:  # a token is one unit of the split result (a word, possibly with punctuation)
        frequency[token] += 1

# Keep only the words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)

[['human', 'interface', 'computer'],

['survey', 'user', 'computer', 'system', 'response', 'time'],

['eps', 'user', 'interface', 'system'],

['system', 'human', 'system', 'eps'],

['user', 'response', 'time'],

['trees'],

['graph', 'trees'],

['graph', 'minors', 'trees'],

['graph', 'minors', 'survey']]

Both texts and processed_corpus are built with list comprehensions; see:

https://blog.csdn.net/weixin_43790276/article/details/90247423

The following two forms are equivalent:

# for loop
list_a = list()
for a in range(5):
    list_a.append(a)
print(list_a)

# list comprehension
list_a = [a for a in range(5)]
print(list_a)

Common kinds of list comprehension:

# An expression before "for": iterate over the iterable, evaluate the expression for each item, and collect the results in a list
list_c = [7 * c for c in "python"]
print(list_c)

['ppppppp', 'yyyyyyy', 'ttttttt', 'hhhhhhh', 'ooooooo', 'nnnnnnn']

# With an if condition: the condition is checked for each item right after it is produced
list_d = [d for d in range(6) if d % 2 != 0]
print(list_d)

[1, 3, 5]

# Nested list comprehension: the outer "for" yields g = 3, 6, 9, ..., 21; the inner comprehension then builds [g - 3, g - 2, g - 1] for each g
list_g = [[x for x in range(g - 3, g)] for g in range(22) if g % 3 == 0 and g != 0]
print(list_g)

[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14], [15, 16, 17], [18, 19, 20]]

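defaultdict(default_factory) builds a dict-like object.

frequency is defined as defaultdict(int), so a missing key is created with the value int() == 0 the first time it is accessed. This has the same effect as using dict.setdefault(), but it is simpler and faster.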
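To make the comparison concrete, here is a minimal sketch that counts the same tokens both ways. The token list is made up for illustration; only the standard library is used.

```python
from collections import defaultdict

tokens = ["system", "human", "system", "eps"]

# Variant 1: defaultdict(int) creates a missing key with int() == 0 on first access.
freq_default = defaultdict(int)
for token in tokens:
    freq_default[token] += 1

# Variant 2: a plain dict, with setdefault() supplying the initial value.
freq_plain = {}
for token in tokens:
    freq_plain[token] = freq_plain.setdefault(token, 0) + 1

print(dict(freq_default))
print(freq_plain)
```

Both variants end up with the same counts; the defaultdict version simply avoids the explicit initialization step.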

Numbering the vocabulary

Build a dictionary that assigns an integer ID to every token in the corpus just obtained.

from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

The dictionary contains 12 distinct tokens. To see the ID assigned to each token:

pprint.pprint(dictionary.token2id)

{'computer': 0,

'eps': 8,

'graph': 10,

'human': 1,

'interface': 2,

'minors': 11,

'response': 3,

'survey': 4,

'system': 5,

'time': 6,

'trees': 9,

'user': 7}

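Vector

Each document is represented as a vector.

Method 1: vector of features

Each dimension stores one feature as a number (Gensim uses floating point numbers).

In practice such vectors contain many zeros, so they are sparse vectors (also called bag-of-words vectors). To save space only the non-zero entries are stored, as (feature ID, value) pairs. For example, with features numbered from 1:

(0, 2, 5) -> [(2, 2.0), (3, 5.0)], and every feature that is not listed is implicitly 0.0.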
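The dense-to-sparse idea can be sketched in a few lines of plain Python. The helper names and the first_id parameter are hypothetical, chosen only so that features are numbered from 1 as in the example above; this is not Gensim code.

```python
def dense_to_sparse(dense, first_id=1):
    """Keep only the non-zero entries as (feature_id, value) pairs."""
    return [(i, float(v)) for i, v in enumerate(dense, start=first_id) if v != 0]

def sparse_to_dense(sparse, length, first_id=1):
    """Rebuild the dense vector; features that are not listed become 0.0."""
    dense = [0.0] * length
    for feature_id, value in sparse:
        dense[feature_id - first_id] = value
    return dense

sparse = dense_to_sparse([0, 2, 5])
print(sparse)
print(sparse_to_dense(sparse, length=3))
```

The round trip loses nothing as long as the vector length is known, which is why a sparse corpus is normally paired with a dictionary or a feature count.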

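Method 2: doc2bow

Bag-of-words model: the vector holds the frequency count of each dictionary word in the document, so its length equals the number of entries in the dictionary.

A property of bag-of-words: word order is ignored.

The doc2bow method takes an already tokenized document and uses the dictionary to represent it as a bag of words (a sparse vector of word counts in the 12-dimensional space).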

new_doc = "Human computer interaction"

new_vec = dictionary.doc2bow(new_doc.lower().split())

print(new_vec)

[(0, 1), (1, 1)]

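Each tuple is (token ID, token count).

"interaction" does not occur in the corpus, so it is not in the dictionary and is left out of the vector.

Dictionary words that do not occur in the document are implicit zeros, which saves space.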
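What doc2bow does can be imitated with a Counter. This is only a sketch of the behaviour described above, not Gensim's implementation; simple_doc2bow is a hypothetical helper, and the token-to-ID mapping is copied from the token2id output shown earlier.

```python
from collections import Counter

# token -> id mapping, copied from the dictionary.token2id output
token2id = {
    'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
    'survey': 4, 'system': 5, 'time': 6, 'user': 7,
    'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11,
}

def simple_doc2bow(tokens, token2id):
    """Count the known tokens and return sorted (id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(simple_doc2bow("Human computer interaction".lower().split(), token2id))
print(simple_doc2bow(['system', 'human', 'system', 'eps'], token2id))
```

The unknown word "interaction" disappears from the first result, and the repeated word "system" shows up with a count of 2 in the second.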

Convert the whole original corpus to a list of vectors:

bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

pprint.pprint(bow_corpus)

[[(0, 1), (1, 1), (2, 1)],

[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],

[(2, 1), (5, 1), (7, 1), (8, 1)],

[(1, 1), (5, 2), (8, 1)],

[(3, 1), (6, 1), (7, 1)],

[(9, 1)],

[(9, 1), (10, 1)],

[(9, 1), (10, 1), (11, 1)],

[(4, 1), (10, 1), (11, 1)]]

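When the input corpus is too large to fit in memory, Gensim accepts any iterable that returns one document vector at a time.

Model

A model is an algorithm that transforms vectors: it converts one vector representation of a document into another.

How to transform is learned by reading a training corpus.

Training a tf-idf model

The input vectors are bag-of-words counts; after the transformation each count is weighted by how rare the word is in the corpus.

Train a tf-idf model on the vectorized corpus, then transform one document.

from gensim import models

# Train the model on the vectorized corpus
tfidf = models.TfidfModel(bow_corpus)

# Transform a document
words = "system minors".lower().split()  # preprocess the query document into tokens
print(tfidf[dictionary.doc2bow(words)])  # convert to bag-of-words, then apply tfidf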

[(5, 0.5898341626740045), (11, 0.8075244024440723)]

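Each returned tuple is (token ID, tf-idf weight).

"system" occurs in more documents of the training corpus (3 of 9) than "minors" (2 of 9), so it gets the lower weight.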
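The two weights can be reproduced by hand. The sketch below assumes the weighting scheme is term frequency times log2(N / document frequency), followed by normalization to unit length; that assumption is mine, and it happens to reproduce the numbers printed above. tfidf_weights is a hypothetical helper, not a Gensim function.

```python
import math

# The preprocessed training corpus (9 documents), as printed earlier.
processed_corpus = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

def tfidf_weights(query_tokens, corpus):
    """Assumed scheme: tf * log2(N / df), then L2-normalize the vector."""
    n_docs = len(corpus)
    weights = {}
    for token in set(query_tokens):
        df = sum(1 for doc in corpus if token in doc)  # document frequency
        if df == 0:
            continue  # word unknown to the training corpus: skipped
        tf = query_tokens.count(token)
        weights[token] = tf * math.log2(n_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

weights = tfidf_weights(['system', 'minors'], processed_corpus)
print(weights)
```

With df("system") = 3 and df("minors") = 2, the raw weights are log2(3) and log2(4.5); after normalization they come out at roughly 0.59 and 0.81.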

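Once the tf-idf model is trained, it can transform the whole corpus, which can then be indexed for similarity queries.

Building the index

from gensim import similarities

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

Here tfidf is the trained model and bow_corpus is the vectorized corpus; tfidf[bow_corpus] is a TransformedCorpus object. num_features=12 is the number of tokens in the dictionary.

Querying similarity

Given a new document, we can query its similarity to every document in the corpus.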

query_document = 'system engineering'.split()

query_bow = dictionary.doc2bow(query_document)

sims = index[tfidf[query_bow]]

print(list(enumerate(sims)))

[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]

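Note that a similarity query uses the same syntax as a model transformation: the document or corpus goes inside the square brackets.

Each output tuple is (document number, similarity score). The score is a cosine similarity, not a percentage; for these non-negative tf-idf vectors it lies between 0 and 1.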
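As a rough illustration of what the index computes, cosine similarity between sparse vectors can be written directly. The tf-idf vectors for documents 3 and 4 below are assumed values (rounded tf-idf weights for this corpus), and cosine is a hypothetical helper rather than part of Gensim.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) lists."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Query 'system engineering': only 'system' (id 5) is in the dictionary,
# so its normalized tf-idf vector has a single weight of 1.0.
query = [(5, 1.0)]
# Assumed (rounded) tf-idf vectors of documents 3 and 4.
doc3 = [(1, 0.4918256), (5, 0.7184812), (8, 0.4918256)]
doc4 = [(3, 0.6282580), (6, 0.6282580), (7, 0.4588939)]

print(round(cosine(query, doc3), 4))
print(round(cosine(query, doc4), 4))
```

Document 3 shares the word "system" with the query and scores about 0.72, in line with the value reported for it above; document 4 shares nothing and scores 0.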

Sorting by similarity

for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)

3 0.7184812

2 0.41707572

1 0.32448703

0 0.0

4 0.0

5 0.0

6 0.0

7 0.0

8 0.0

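About the sorted function:

sorted(iterable, key=None, reverse=False)

iterable -- the iterable to sort.

key -- a function called on each element; the value it returns is what gets compared.

reverse -- True for descending order, False for ascending order (the default).

The lambda passed as key is equivalent to the expanded function f:

lambda x: x[1]

def f(x):
    return x[1]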

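Summary

After training a tf-idf model and building a similarity index over the corpus, the index can be used to compute the similarity of any input document.

Corpus streaming (covered in the next part):

Define a class that implements __iter__ and returns documents one at a time with yield; a for loop can then iterate over the object.

Without a class, a for loop over "line in open(...)" also works.

In the streaming example the preprocessing is not done before building the dictionary: the dictionary is built first, then the stop-word IDs and the IDs whose document frequency (dfs) is 1 are removed, and compactify() reassigns the IDs.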

Corpora and Vector Spaces

Converting text to a vector space representation, corpus streaming, and persistence to disk.

From strings to vectors

Create a corpus of 9 documents.

documents = [

"Human machine interface for lab abc computer applications",

"A survey of user opinion of computer system response time",

"The EPS user interface management system",

"System and human system engineering testing of EPS",

"Relation of user perceived response time to error measurement",

"The generation of random binary unordered trees",

"The intersection graph of paths in trees",

"Graph minors IV Widths of trees and well quasi ordering",

"Graph minors A survey",

]

Preprocess the corpus (tokenize, remove stop words)

from pprint import pprint  # pretty-printer
from collections import defaultdict

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

pprint(texts)

[['human', 'interface', 'computer'],

['survey', 'user', 'computer', 'system', 'response', 'time'],

['eps', 'user', 'interface', 'system'],

['system', 'human', 'system', 'eps'],

['user', 'response', 'time'],

['trees'],

['graph', 'trees'],

['graph', 'minors', 'trees'],

['graph', 'minors', 'survey']]

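Use bag-of-words (which records how many times each word occurs) to turn the document strings into vectors.

Build the dictionary (a mapping between IDs and words) and save it

from gensim import corpora

dictionary = corpora.Dictionary(texts)  # build the dictionary: mapping between IDs and words
dictionary.save('/tmp/deerwester.dict')  # save the dictionary
print(dictionary)
print(dictionary.token2id)  # show the mapping from word to ID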

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

Tokenize a new document and the original corpus, then convert them to bag-of-words vectors with this dictionary

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # "interaction" is not in the dictionary and is ignored

corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # save the bag-of-words corpus
print(corpus)

[(0, 1), (1, 1)]

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]

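doc2bow() counts the occurrences of each word and represents each word by its ID.

Corpus Streaming

A corpus can live in memory as a list, or it can be kept in a file.

When Gensim reads a streamed corpus, it only requires that documents arrive one at a time.

Define an iterable that opens the file, reads one document, preprocesses it and yields it.

Because __iter__ is user-defined, the source and the input format are up to you. The source can be disk, network, a database, dataframes and so on; the format could be XML that you parse, and there is no need for one document per line. As long as each document is parsed into a list of tokens, the dictionary can turn it into a sparse vector.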

from smart_open import open  # transparently opens remote files

class MyCorpus:
    def __iter__(self):
        for line in open('https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/docs/notebooks/datasets/mycorpus.txt'):
            # one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

The original link 'https://radimrehurek.com/gensim/mycorpus.txt' could not be opened, so I replaced it with an identical file found on GitHub.

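Create the corpus object

Printing the object does not show its contents, but the document vectors in it can be iterated over

corpus_memory_friendly = MyCorpus()  # the documents are not all loaded into memory at once
# print(corpus_memory_friendly)  # printing the object does not show the documents

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)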

[(0, 1), (1, 1), (2, 1)]

[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]

[(2, 1), (5, 1), (7, 1), (8, 1)]

[(1, 1), (5, 2), (8, 1)]

[(3, 1), (6, 1), (7, 1)]

[(9, 1)]

[(9, 1), (10, 1)]

[(9, 1), (10, 1), (11, 1)]

[(4, 1), (10, 1), (11, 1)]
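One reason to wrap the stream in a class with __iter__, rather than pass around a bare generator, is that the class can be iterated more than once. The sketch below uses a temporary local file instead of the remote URL; LineCorpus and the two sample lines are made up for illustration.

```python
import os
import tempfile

class LineCorpus:
    """Hypothetical streaming corpus: yields one tokenized line per document."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every call, so each for-loop starts from the top.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Human machine interface\nGraph minors A survey\n")

corpus = LineCorpus(tmp.name)
first_pass = list(corpus)
second_pass = list(corpus)           # works again: __iter__ restarts the stream

gen = (line for line in first_pass)  # a plain generator, by contrast
gen_first = list(gen)
gen_second = list(gen)               # empty: the generator is already exhausted

os.unlink(tmp.name)
print(first_pass, second_pass == first_pass, gen_second)
```

This matters as soon as a corpus has to be read several times, for example once to build a dictionary and again to train a model.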

Building the dictionary from a stream

# Tokenize each line and feed the tokens to the dictionary
dictionary = corpora.Dictionary(line.lower().split() for line in open('https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/docs/notebooks/datasets/mycorpus.txt'))

Removing stop words

# Collect the IDs of the stop words
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
# dictionary.token2id holds token: id pairs
stop_ids

[33, 8, 3, 18, 9, 26, 19]

# Collect the IDs of words that appear in only one document
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove the tokens with these IDs
dictionary.compactify()  # reassign IDs so that the gaps are closed
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

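dictionary.dfs holds document frequencies, {token ID: number of documents that contain the token}, which is not the same as the total number of occurrences.

dict.items() returns a view of (key, value) pairs, here (token ID, document frequency), that can be iterated over like a list.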
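The difference between term frequency and document frequency is easy to see on a tiny example. The three documents below are a made-up subset used only for illustration; the counting is plain Python, not Gensim.

```python
from collections import Counter

texts = [
    ['human', 'interface', 'computer'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
]

# Term frequency: every occurrence counts.
term_freq = Counter(token for text in texts for token in text)
# Document frequency: a token counts at most once per document.
doc_freq = Counter(token for text in texts for token in set(text))

print(term_freq['system'], doc_freq['system'])
once_in_docs = sorted(t for t, df in doc_freq.items() if df == 1)
print(once_in_docs)
```

"system" occurs three times but in only two documents, so a filter on document frequency and a filter on term frequency can remove different words.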

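The bag-of-words representation is now complete; after transforming these vectors (for example with tf-idf), similarities can be computed.

Saving a corpus

Corpus persistency

A vector space corpus (i.e. a sequence of vectors) can be saved in many formats, among which the Matrix Market format (.mm) is one of the more important.

Gensim reads and writes them through the streaming corpus interface, one document at a time.

Create a sample corpus and save it in Matrix Market format

corpus = [[(1, 0.5)], []]  # two documents, one of them empty, just for fun
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)  # save in Matrix Market format
corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)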

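Loading a corpus iteratively from a Matrix Market file

corpus = corpora.MmCorpus('/tmp/corpus.mm')

A corpus object is a stream, so printing it directly does not show its contents.

Two ways to see the contents: 1. load everything into memory; 2. use the streaming interface, one document at a time

# load everything into memory
print(list(corpus))  # list() turns any sequence into a plain list

[[(1, 0.5)], []]

# streaming interface
for doc in corpus:
    print(doc)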

[(1, 0.5)]

[]

Converting to and from NumPy/SciPy

Conversion with dense NumPy matrices

import gensim
import numpy as np

numpy_matrix = np.random.randint(10, size=[5, 2])  # a random 5x2 integer matrix
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)  # numpy -> corpus

print(numpy_matrix)
for doc in corpus:
    print(doc)

# corpus -> numpy:
# numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)

[[5 0]

[2 1]

[3 7]

[7 9]

[0 9]]

[(0, 5.0), (1, 2.0), (2, 3.0), (3, 7.0)]

[(1, 1.0), (2, 7.0), (3, 9.0), (4, 9.0)]

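Each column of the matrix is treated as one document: the tuples in the first list are (row index, value) for the first column, and the tuples in the second list are (row index, value) for the second column. Zero entries are left out.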
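The column-as-document reading can be checked with a short plain-Python sketch. dense_columns_to_corpus is a hypothetical helper that only imitates the output format shown above; it is not how Gensim implements the conversion.

```python
def dense_columns_to_corpus(matrix):
    """One sparse document per column; zero entries are dropped."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [
        [(row, float(matrix[row][col])) for row in range(n_rows) if matrix[row][col] != 0]
        for col in range(n_cols)
    ]

matrix = [[5, 0], [2, 1], [3, 7], [7, 9], [0, 9]]  # the 5x2 matrix printed above
docs = dense_columns_to_corpus(matrix)
for doc in docs:
    print(doc)
```

The 5x2 matrix therefore becomes a corpus of 2 documents over 5 features, which is why num_terms has to be supplied when converting back.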

Conversion with sparse SciPy matrices

import scipy.sparse

scipy_sparse_matrix = scipy.sparse.random(5, 2, 0.5)  # random sparse matrix with density 0.5
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)  # scipy -> corpus
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)  # corpus -> scipy

print(scipy_sparse_matrix)
for doc in corpus:
    print(doc)
print(scipy_csc_matrix)

(2, 1) 0.38879739650662737

(1, 0) 0.16282125152146287

(4, 1) 0.3451535436418379

(3, 0) 0.2699858725364166

(2, 0) 0.6473813531322135

[(1, 0.16282125152146287), (2, 0.6473813531322135), (3, 0.2699858725364166)]

[(2, 0.38879739650662737), (4, 0.3451535436418379)]

(1, 0) 0.16282125152146287

(2, 0) 0.6473813531322135

(3, 0) 0.2699858725364166

(2, 1) 0.38879739650662737

(4, 1) 0.3451535436418379

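Note that the official tutorial may be imprecise here: scipy.sparse.random needs an explicit density argument (0.5 above); with the default density a matrix this small will most likely come out empty.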

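Topics and Transformations

Transform the vector representation of documents; the transformation is learned by training a model.

Goals:

1. Bring out hidden structure in the corpus and describe documents in a more semantic way.

2. Make the representation more compact, which improves both efficiency (fewer resources) and efficacy (marginal data trends are ignored, noise is reduced).

Creating the corpus

Same as in the previous part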

from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove stop words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

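Creating a transformation

A transformation is a model object, usually initialized from a training corpus.

from gensim import models

tfidf = models.TfidfModel(corpus)  # initialize the model

Different transformations need different initialization:

For TfIdf, training means one pass over the supplied corpus to compute the document frequencies of all features. Latent Semantic Analysis and Latent Dirichlet Allocation take considerably more work.

Training and later vector conversion must use the same feature IDs:

A transformation always maps between two specific vector spaces. The string preprocessing, the feature IDs and the bag-of-words input to TfIdf must all be the same as in training; otherwise the result is garbage output or a runtime exception.

Transforming vectors

Once trained, tfidf is treated as a read-only object that converts vectors from the old representation (bag-of-words integer counts) to the new one (TfIdf real-valued weights).

Bag-of-words to tfidf

# a new bag-of-words vector
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])  # transform the vector with the model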

[(0, 0.7071067811865476), (1, 0.7071067811865476)]

# transform the whole corpus
corpus_tfidf = tfidf[corpus]  # the input corpus must be in the vector space the model was trained on
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]

[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]

[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]

[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]

[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]

[(9, 1.0)]

[(9, 0.7071067811865475), (10, 0.7071067811865475)]

[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]

[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]

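A note on corpus_tfidf = tfidf[corpus]: this only wraps the corpus, and the conversion happens on the fly each time it is iterated. If the corpus comes from a file stream, the result is also just a stream; if it will be used many times, it is best to serialize the transformed corpus to disk and reuse it.

Transformations can be chained; here Latent Semantic Indexing is applied on top:

lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi_model[corpus_tfidf]  # a second wrapper around the original corpus: bow -> tfidf -> fold-in-lsi

num_topics=2 sets the dimensionality after the transformation to 2.

Inspect the two topics:

lsi_model.print_topics(2)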

[(0,

'0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),

(1,

'-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

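According to LSI, "trees", "graph" and "minors" are related words (and contribute most to the first topic), while the second topic is mostly concerned with the remaining words.

Fetch the document vectors (both transformations are actually executed only now) and print each next to its original text.

for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)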

[(0, 0.0660078339609013), (1, -0.5200703306361858)] Human machine interface for lab abc computer applications

[(0, 0.19667592859142163), (1, -0.760956316770006)] A survey of user opinion of computer system response time

[(0, 0.08992639972446137), (1, -0.724186062675252)] The EPS user interface management system

[(0, 0.07585847652177911), (1, -0.6320551586003438)] System and human system engineering testing of EPS

[(0, 0.10150299184979847), (1, -0.5737308483002961)] Relation of user perceived response time to error measurement

[(0, 0.7032108939378321), (1, 0.16115180214025498)] The generation of random binary unordered trees

[(0, 0.8774787673119842), (1, 0.16758906864659026)] The intersection graph of paths in trees

[(0, 0.9098624686818586), (1, 0.14086553628718607)] Graph minors IV Widths of trees and well quasi ordering

[(0, 0.6165825350569281), (1, -0.05392907566389665)] Graph minors A survey

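The first five documents are more strongly related to the second topic, and the last four to the first topic.

Saving a model

All of these models support save() and load()

import os
import tempfile

with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
    lsi_model.save(tmp.name)

loaded_lsi_model = models.LsiModel.load(tmp.name)

os.unlink(tmp.name)

With the model in hand, input documents can be queried for similarity and ranked.

Summary

Train the transformation model first. model[corpus] does not compute anything yet; the conversion runs when the result is iterated.

The output of one transformation can be fed into another transformation model (LSI and others).

Some of these models can be trained online: LSI, for example, accepts more training documents through add_documents.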
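The lazy behaviour of model[corpus] and the chaining of transformations can be imitated with a small wrapper class. LazyTransformed and double_weights are hypothetical names; the sketch shows the general pattern only and makes no claim about Gensim's internals.

```python
class LazyTransformed:
    """Hypothetical wrapper: applies `func` to each document only on iteration."""
    def __init__(self, func, corpus):
        self.func = func
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.func(doc)

calls = []

def double_weights(doc):
    calls.append(doc)                      # record that real work happened
    return [(i, 2 * w) for i, w in doc]

corpus = [[(0, 1), (1, 1)], [(2, 3)]]
wrapped = LazyTransformed(double_weights, corpus)   # nothing is computed yet
calls_before = len(calls)

chained = LazyTransformed(double_weights, wrapped)  # wrappers can be stacked
result = list(chained)                              # the work happens here

print(calls_before, len(calls), result)
```

Because the work is redone on every pass, iterating such a wrapper many times repeats the computation, which is the reason for serializing a transformed corpus that will be reused.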

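Available transformations

Several popular vector space transformation models.

Tf-Idf

Term Frequency * Inverse Document Frequency.

Initialization expects a bag-of-words training corpus.

During transformation, the output is a real-valued vector with the same dimensionality as the input. Features that are rare in the training corpus get their weight increased.

model = models.TfidfModel(corpus, normalize=True)

normalize=True normalizes the resulting vectors to unit length.

LSI/LSA

Latent Semantic Indexing transforms documents from bag-of-words or (preferably) TfIdf-weighted space into a latent space of lower dimensionality. A target dimensionality of 200-500 is reported to work best.

model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

What is special about LSI training: it can continue at any time (online training), with the underlying model updated incrementally, so the input document stream can be infinite. The model stays read-only while it is used to transform vectors.

model.add_documents(another_tfidf_corpus)  # now trained on tfidf_corpus + another_tfidf_corpus
lsi_vec = model[tfidf_vec]  # further training does not get in the way of using the model

LDA

Latent Dirichlet Allocation is a probabilistic extension of LSA (also called multinomial PCA). LDA topics can be interpreted as probability distributions over words. As with LSA, a document is interpreted as a mixture of topics.

model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

HDP

Hierarchical Dirichlet Process, a non-parametric Bayesian method. It does not always work well; use with care.

model = models.HdpModel(corpus, id2word=dictionary)

RP

Random Projections aim to reduce the dimensionality of the vector space. They approximate the TfIdf distances between documents by throwing in a little randomness. The recommended target dimensionality is in the hundreds to thousands (100 to 10000).

model = models.RpModel(tfidf_corpus, num_topics=500)

Similarity Queries

Creating the corpus

Same as in the previous part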

from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove stop words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

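Similarity interface

After creating the corpus and transforming its vector space, similarities between documents can be computed.

Initialize the model with the corpus to define a 2-dimensional LSI space:

from gensim import models

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

LSI is a transformation.
Its strength is that it can identify relationships between words and topics. The space here is 2-dimensional (num_topics = 2), i.e. there are two topics.

Convert the query document to a vector in the same (LSI) space

doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())  # doc -> bag-of-words vector
vec_lsi = lsi[vec_bow]  # bag-of-words -> LSI space
print(vec_lsi)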

print(vec_lsi)

[(0, 0.46182100453271546), (1, -0.07002766527899967)]

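Similarity measure: cosine similarity is the standard choice in vector space modeling, though a different similarity measure may fit better when the vectors represent probability distributions.

Index the corpus so that the query doc can be compared against it.

from gensim import similarities

index = similarities.MatrixSimilarity(lsi[corpus])

The MatrixSimilarity class keeps the whole index in memory. If the corpus is too large for that, use the Similarity class, which works from disk.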

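Saving and loading the similarity index

index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

The similarities.Similarity class is used the same way, and it also allows adding documents to the index later.

Performing queries

Query how semantically related the vectorized doc is to each document in the corpus.

sims = index[vec_lsi]  # similarity query of vec_lsi against the corpus
print(list(enumerate(sims)))  # (document number, similarity) 2-tuples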

[(0, 0.998093), (1, 0.93748635), (2, 0.9984453), (3, 0.98658866), (4, 0.90755945), (5, -0.12416792), (6, -0.1063926), (7, -0.09879464), (8, 0.05004177)]

Similarity is measured by cosine (range [-1, 1]); the larger the value, the more similar.

Sort by similarity in descending order:

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])

0.9984453 The EPS user interface management system

0.998093 Human machine interface for lab abc computer applications

0.98658866 System and human system engineering testing of EPS

0.93748635 A survey of user opinion of computer system response time

0.90755945 Relation of user perceived response time to error measurement

0.05004177 Graph minors A survey

-0.09879464 Graph minors IV Widths of trees and well quasi ordering

-0.1063926 The intersection graph of paths in trees

-0.12416792 The generation of random binary unordered trees

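Documents no. 2 (The EPS user interface management system) and no. 4 (Relation of user perceived response time to error measurement) share no words with the query (Human computer interaction), yet both get high similarity scores, because LSI places them in the same topic as the query.

Summary

Train an LSI transformation model on the bow corpus;

use the trained model to transform both the bow vector of the query doc and the bow vectors of the corpus being queried;

build the similarity index over the transformed corpus;

feed the LSI-space vector of the query doc into this index to get the similarities.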

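Word2Vec Model

word2vec learns relationships between words from large amounts of un-annotated plain text. The output is one vector per word, and these vectors show linear relationships that allow vector arithmetic such as: vec("king") - vec("man") + vec("woman") =~ vec("queen").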
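The arithmetic can be tried on a toy example. The 2-dimensional vectors below are hand-made, not trained; they are an assumption chosen so that one axis loosely stands for royalty and the other for gender.

```python
import math

# Hand-made toy vectors (assumed, not trained): axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.1),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# vec(king) - vec(man) + vec(woman)
target = tuple(k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"]))

# Look for the closest remaining word, leaving out the three input words.
candidates = {word: vec for word, vec in toy.items() if word not in ("king", "man", "woman")}
best = max(candidates, key=lambda word: cosine(candidates[word], target))
print(target, best)
```

With real embeddings the match is only approximate, which is why the relation is written with =~ rather than =.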

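1. Word2Vec as an improvement over bag-of-words

2. A Word2Vec demo using a pre-trained model

3. Training your own model

4. Saving and loading models

5. Training parameters

6. Memory requirements

7. Visualizing Word2Vec embeddings by reducing dimensionality

Bag-of-words

The bag-of-words model turns each document into a fixed-length vector of integers, with one entry per word in the dictionary.

Weaknesses of bag-of-words

1. It loses all information about word order

A partial fix: bag of n-grams, which uses word phrases of length n to capture local word order, but suffers from data sparsity and high dimensionality.

2. It does not learn the meaning of words

Distances between vectors therefore do not reflect differences in meaning. The fix: Word2Vec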
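The first weakness and its n-gram fix can be shown on two sentences that use the same words in a different order. The sentences and the helper bag_of_ngrams are made up for illustration.

```python
from collections import Counter

def bag_of_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "john likes mary".split()
b = "mary likes john".split()

same_unigrams = bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1)   # word order is lost
same_bigrams = bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2)    # local order is kept
print(same_unigrams, same_bigrams)
print(sorted(bag_of_ngrams(a, 2)))
```

The price is visible even here: the number of possible bigrams grows much faster than the number of words, which is the sparsity and dimensionality problem mentioned above.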

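Word2Vec

A shallow neural network embeds words in a lower-dimensional vector space. In the resulting vectors, words that appear in similar contexts end up close to each other, and such words tend to have similar meanings.

The two versions of Word2Vec:

1. Skip-grams (SG)

A window slides over the text data to produce pairs (word1, word2).

A neural network with a single hidden layer is trained on a synthetic task: given an input word, predict the probability distribution of nearby words (which words are likely to appear near the input).

The one-hot encoding of a word goes through a projection layer to the hidden layer, and the projection weights are the word embedding.

If the hidden layer has 300 neurons, the word embeddings are 300-dimensional.

2. Continuous bag-of-words (CBOW)

Similar to SG, also a network with one hidden layer. The difference is the task: CBOW takes several context words (averaged) as input and predicts the center word, while SG takes the center word as input and predicts the context words.

Again the projection weights are the word embedding.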
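The difference between the two training tasks is easiest to see in the training examples each one generates. This sketch only builds the examples with a sliding window; it does not train anything, and the helper names and the sample sentence are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: the center word is the input, a neighbour the target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window):
    """(context words, center) examples: the neighbours are the input, the center the target."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = ["the", "cat", "sat", "down"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
print(sg)
print(cbow)
```

Skip-gram produces one example per (center, neighbour) pair, while CBOW produces one example per position with all neighbours grouped together as the input.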

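Word2Vec Demo

1. Download a pre-trained model (2 GB) and use it

Skipped, because pre-trained models were not allowed for this assignment.

2. Train your own model

The training data is the Lee Evaluation Corpus (it ships with Gensim, or can be downloaded from https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor)

Implement an iterable that reads the corpus line by line.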

from gensim.test.utils import datapath
from gensim import utils

class MyCorpus:
    """An iterable that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)

Any custom preprocessing can be done inside the MyCorpus iterator. word2vec only needs to receive one sentence at a time.

Train a model on the corpus

import gensim.models

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

The main part of the model is model.wv, where wv stands for "word vectors"

Using the model:

vec_king = model.wv['king']

Listing the vocabulary (index, word)

for index, word in enumerate(model.wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(model.wv.index_to_key)} is {word}")

word #0/1750 is the

word #1/1750 is to

word #2/1750 is of

word #3/1750 is in

word #4/1750 is and

word #5/1750 is he

word #6/1750 is is

word #7/1750 is for

word #8/1750 is on

word #9/1750 is said

(The tutorial abbreviates model.wv to wv here)

Storing and loading models

import tempfile

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    # store
    model.save(temporary_filepath)
    # load
    new_model = gensim.models.Word2Vec.load(temporary_filepath)

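Training hyperparameters

Parameters that affect training speed and quality.

min_count (default = 5)

Prunes the internal dictionary: words that appear only a few times do not have enough data for meaningful training, so they are dropped.

model = gensim.models.Word2Vec(sentences, min_count=10)

vector_size (default = 100)

The number of dimensions N of the space that Word2Vec maps words onto.

Larger values need more training data, but can produce more accurate models.

model = gensim.models.Word2Vec(sentences, vector_size=200)

workers (default = 3)

Parallelizes training to speed it up (without Cython installed only one core can be used).

model = gensim.models.Word2Vec(sentences, workers=4)

Memory

word2vec model parameters are stored as matrices (NumPy arrays).

The size of each array is the vocabulary size times the size of the word embedding vectors.

min_count controls the vocabulary size and vector_size controls the vector size.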
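A back-of-the-envelope estimate follows directly from that product. The sketch assumes 4-byte floats and three such matrices held in RAM; both numbers are assumptions on my part and may differ between Gensim versions and training modes. estimate_word2vec_bytes is a hypothetical helper.

```python
def estimate_word2vec_bytes(vocab_size, vector_size, bytes_per_float=4, n_matrices=3):
    """Rough estimate; bytes_per_float=4 and n_matrices=3 are assumptions."""
    return vocab_size * vector_size * bytes_per_float * n_matrices

total = estimate_word2vec_bytes(vocab_size=100_000, vector_size=200)
print(total, round(total / 2**20, 1), "MiB")
```

Under these assumptions, 100,000 words with 200-dimensional vectors need a bit under 230 MiB, so raising min_count or lowering vector_size are the two direct ways to shrink the model.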

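Evaluation

Word2Vec training is an unsupervised task, so there is no good way to evaluate the result objectively.

Syntactic and semantic test sets can be used.

1. Using an analogy test set

model.wv.evaluate_word_analogies(datapath('questions-words.txt'))

2. Evaluating the semantic similarity of word pairs, i.e. how related two words are or how often they co-occur (the default academic dataset is WS-353)

model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))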

Online training

# load the model
model = gensim.models.Word2Vec.load(temporary_filepath)

more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences'],
]

# scan the new sentences and update the vocabulary
model.build_vocab(more_sentences, update=True)

# continue training
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)

# remove the temporary file
import os
os.remove(temporary_filepath)

The total_words parameter of train() controls the learning rate decay.

Computing the training loss

The compute_loss parameter controls whether the loss is computed. The computed loss is stored in the model attribute running_training_loss.

# instantiate and train the model
model_with_loss = gensim.models.Word2Vec(
    sentences,
    min_count=1,
    compute_loss=True,
    hs=0,
    sg=1,
    seed=42,
)

# get the training loss
training_loss = model_with_loss.get_latest_training_loss()
print(training_loss)

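Visualizing Word Embeddings

Use tSNE to reduce the word vectors to 2 dimensions

Semantic: words like cat, dog, cow tend to lie close together

Syntactic: words like run, running or cut, cutting lie close together

Word2Vec API

The word2vec algorithms include the skip-gram and CBOW models, trained with either hierarchical softmax or negative sampling.

Ways to train word vectors (embeddings) in Gensim: Word2Vec, Doc2Vec, FastText.

Example

Initialize a model and save it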

from gensim.test.utils import common_texts

from gensim.models import Word2Vec

model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)

model.save("word2vec.model")

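Training is streamed, so sentences can be an iterable that reads data on the fly from disk or network. Ready-made iterables: BrownCorpus, Text8Corpus, LineSentence.

Load the model and train it further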

model = Word2Vec.load("word2vec.model")

model.train([["hello", "world"]], total_examples=1, epochs=1)

(0, 2)

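The trained word vectors are stored in model.wv. Keep the full model as long as training needs to continue.

Once no further training is needed, the full model can be dropped in favour of the KeyedVectors instance, a much smaller object that keeps only the vectors and their keys (no model state).

vector = model.wv['computer']  # look up a word to get its numpy vector
sims = model.wv.most_similar('computer', topn=10)  # get the most similar words

Saving and loading wv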

from gensim.models import KeyedVectors

# store the words and their embeddings
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors")

# load the word vectors back; mmap='r' memory-maps the file read-only
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')
vector = wv['computer']  # get the numpy vector of a word

Switching to a KeyedVectors instance

word_vectors = model.wv
del model

Embeddings for multi-word n-grams

gensim.models.phrases can be used to detect multi-word phrases

from gensim.models import Phrases

# train a bigram detector
bigram_transformer = Phrases(common_texts)

# apply the trained detector to the corpus and train a Word2vec model on the result
model = Word2Vec(bigram_transformer[common_texts], min_count=1)
