Batch-Processing Files to Build a Corpus and Train a Doc2Vec Model (gensim Implementation)


Author: AllureLove


Category: Natural Language Processing. Tags: nlp, natural language processing

Article link: https://blog.csdn.net/weixin_36488653/article/details/118343237


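The two prediction schemes of the Word2Vec model (CBOW and skip-gram) were covered in the article on NLP text preprocessing. That model learns word vectors from the words that occur in a corpus.
Doc2Vec is the analogous model for sentences, paragraphs, and whole documents. You first train the model on the corresponding corpus, then use the trained model to produce vector representations of documents. The workflow looks like this: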

"""
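Dataset directory layout:
-data--category1--sample1
               --sample2
               ...
     --category2--sample1
               --sample2
               ...
     --category3--sample1
               --sample2
               ...
"""
import os
import re

import gensim
from gensim.models.doc2vec import Doc2Vec

parent_path = "path to the data folder"
# NOTE: `pattern` is never defined in the original post. The regex below
# (runs of characters that are neither digits nor whitespace) is only a
# placeholder assumption; substitute your own tokenization pattern.
pattern = re.compile(r"[^\d\s]+")

# Collect across all folders, so initialize once, outside the loop.
labels = []      # one label per non-empty document
sentences = []   # one token list per non-empty document
for folder in os.listdir(parent_path):
    label = folder
    source_path = os.path.join(parent_path, label)
    if not os.path.isdir(source_path):
        continue
    for file in os.listdir(source_path):
        source_file = os.path.join(source_path, file)
        # UTF-8 is assumed here; the original post does not state the encoding.
        with open(source_file, "r", encoding="utf-8") as f:
            strs_list = []
            for line in f:
                line = line.replace("\t", "").replace("\n", "")
                # line = re.sub(r"\d", "", line)
                line = re.findall(pattern, line)
                # line = re.sub("[0-9]", "", line)
                strs_list.extend(line)
        if strs_list:
            sentences.append(strs_list)
            if label == "category1":
                labels.append(1)
            # elif ......  (the branches for the other categories are elided in the original post)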
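
The folder walk and the `if`/`elif` label chain above can also be expressed as a small function with a dictionary lookup. The following is only a minimal sketch under stated assumptions: `load_corpus`, `label_map`, and `TOKEN_PATTERN` are hypothetical names that do not appear in the original post, the files are assumed to be UTF-8, and the token regex is a placeholder. Unlike the code above, it lets whitespace (including tabs) separate tokens instead of deleting tabs first.

```python
import os
import re

# Hypothetical token pattern: runs of characters that are neither digits nor whitespace.
TOKEN_PATTERN = re.compile(r"[^\d\s]+")


def load_corpus(parent_path, label_map, pattern=TOKEN_PATTERN):
    """Return (sentences, labels) with one entry per non-empty file."""
    sentences, labels = [], []
    for folder in sorted(os.listdir(parent_path)):
        source_path = os.path.join(parent_path, folder)
        # Skip stray files and folders that have no label assigned.
        if not os.path.isdir(source_path) or folder not in label_map:
            continue
        for name in sorted(os.listdir(source_path)):
            tokens = []
            # UTF-8 is an assumption; adjust to the encoding of your files.
            with open(os.path.join(source_path, name), "r", encoding="utf-8") as f:
                for line in f:
                    tokens.extend(pattern.findall(line))
            if tokens:
                sentences.append(tokens)
                labels.append(label_map[folder])
    return sentences, labels
```

A call such as `load_corpus(parent_path, {"category1": 1, "category2": 2})` would return the token lists and their integer labels in matching order. Folders missing from the mapping are skipped rather than guessed.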

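corpus = []
# Put the samples into the form gensim's Doc2Vec expects: each input sample is
# [list of words, list of tags], wrapped in gensim's TaggedDocument.
# Here the tag is simply the index of the document.
TaggedDocument = gensim.models.doc2vec.TaggedDocument
for i, sen in enumerate(sentences):
    # `sen` is already a list of tokens, so it must not be split() again.
    document = TaggedDocument(sen, tags=[i])
    corpus.append(document)

# Train Doc2Vec on the corpus.
vector_size = 200
# min_count=500 discards every word that occurs fewer than 500 times;
# lower it for small corpora.
# The corpus is not passed to the constructor: doing so already trains the
# model, and the train() call below would then train it a second time.
w2v_model = Doc2Vec(min_count=500, window=5, vector_size=vector_size)
w2v_model.build_vocab(corpus)
w2v_model.train(corpus, total_examples=w2v_model.corpus_count, epochs=10)

# Use the trained model to obtain document vectors.
# The samples to predict must be put into the same TaggedDocument form as
# `corpus` above (not repeated here); `data` below stands for such samples.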
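import numpy as np

# Method 1: look up the vectors learned during training by tag.
# This only works for documents that were part of the training corpus.
# In gensim 4.x the same vectors are exposed as `w2v_model.dv`.
data_vec = np.concatenate([np.array(w2v_model.docvecs[e.tags[0]]).reshape(1, vector_size) for e in data])

# Method 2: infer vectors for new documents.
# `valX` is likewise in the same TaggedDocument form; e[0] is the word list.
# The original post passed steps=500, the name older gensim releases used
# for this argument; recent releases call it `epochs`.
valX_vec = []
for e in valX:
    vec = np.array(w2v_model.infer_vector(e[0], alpha=0.025, epochs=500)).reshape(1, vector_size)
    valX_vec.append(vec)
valX_vec = np.concatenate(valX_vec)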

For many more details, see the official gensim documentation for Doc2Vec.
