How to Train Word Vectors from a Chinese Document Corpus

Latest recommended article published on 2023-04-01 21:42:16

中科小白

Categories: Tools, NLP, Python. Tags: natural language processing, word vectors, word2vec

Article link: https://blog.csdn.net/qq_42278138/article/details/111522370

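Preparing the raw corpus for training

Here, we use docx documents as the raw corpus:

As shown in the figure, these are the documents I used as the corpus.

Converting the corpus to txt format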

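This step uses the python-docx package (imported as docx), which may need to be installed first: pip install python-docx (not the older docx package on PyPI, which fails to import on Python 3).

The code is as follows:

import os

from docx import Document    # provided by the python-docx package


def docx_to_txt(corpus_dir='./corpus', out_path='corpus.txt'):
    """Merge the text of every .docx file in corpus_dir into one UTF-8 txt file."""
    # corpus_dir is the path where you store your corpus
    with open(out_path, 'w', encoding='utf-8') as f:
        for file_name in sorted(os.listdir(corpus_dir)):
            # Skip non-docx files and Word lock files such as "~$report.docx"
            if not file_name.endswith('.docx') or file_name.startswith('~$'):
                continue
            print(file_name)
            # Open the document
            document = Document(os.path.join(corpus_dir, file_name))
            # Read the text and write one paragraph per line, so that
            # paragraphs are not glued together in the output
            for paragraph in document.paragraphs:
                if paragraph.text.strip():
                    f.write(paragraph.text + '\n')


docx_to_txt()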
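
Once corpus.txt exists, the remaining step suggested by the title is to segment the text into words and train word2vec on it. The following is only a minimal sketch of how that could look, not the author's own procedure: it assumes jieba for Chinese word segmentation and gensim 4.x for Word2Vec (neither package is named in the text above), and the function names, the output file name word2vec.model, and all hyperparameter values are illustrative.

```python
from typing import Callable, Iterator, List


def iter_sentences(corpus_path: str,
                   tokenize: Callable[[str], List[str]]) -> Iterator[List[str]]:
    """Yield one token list per non-empty line of the corpus file."""
    with open(corpus_path, encoding='utf-8') as f:
        for line in f:
            # Drop whitespace-only tokens that the tokenizer may emit
            tokens = [tok for tok in tokenize(line.strip()) if tok.strip()]
            if tokens:
                yield tokens


def train_word_vectors(corpus_path: str = 'corpus.txt',
                       model_path: str = 'word2vec.model'):
    """Segment the corpus with jieba and train a Word2Vec model on it."""
    # Third-party imports are kept local so iter_sentences has no dependencies.
    import jieba                          # assumption: pip install jieba
    from gensim.models import Word2Vec    # assumption: pip install gensim (4.x)

    # Materialize the sentences: Word2Vec makes several passes over the data,
    # and a generator would be exhausted after the first pass.
    sentences = list(iter_sentences(corpus_path, jieba.lcut))
    model = Word2Vec(
        sentences=sentences,
        vector_size=100,   # dimensionality of each word vector (illustrative)
        window=5,          # context window size (illustrative)
        min_count=2,       # ignore words seen fewer than 2 times (illustrative)
        workers=4,         # number of training threads (illustrative)
    )
    model.save(model_path)
    return model
```

Calling train_word_vectors() would write the model to disk, after which model.wv.most_similar(word) can be used to inspect the nearest neighbours of a word that occurs in the corpus. In gensim versions before 4.0 the vector_size parameter was named size.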
