Preface:
For natural language processing you sometimes need to build your own corpus and train a model on it. This article takes data that has already been collected and performs word segmentation and noisy-character removal on it. Segmentation is done with the jieba tokenizer, loading a custom stopword list (stopword list = Chinese Academy of Sciences list + custom entries).
If it's not your thing, no flames please ^-^
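Since the stopword list is a merge of the Chinese Academy of Sciences list and custom entries, it helps to build the merged file once up front. A minimal sketch, assuming the two source files are named data/stopwords_cas.txt and data/stopwords_custom.txt (hypothetical names); the output data/stopWord.txt is the file the main script below reads:

def merge_stopwords(cas_path, custom_path, out_path):
    # Union of both lists, one word per line, duplicates removed
    words = set()
    for path in (cas_path, custom_path):
        with open(path, 'r', encoding='utf-8') as f:
            words.update(line.strip() for line in f if line.strip())
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sorted(words)))

merge_stopwords('data/stopwords_cas.txt', 'data/stopwords_custom.txt', 'data/stopWord.txt')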
The data is stored in a TXT file, as shown below:
Segmentation result:
Code:
# coding:utf8
import jieba

# 1. Read the input file, segment each line, and write the result to the output file
def readCutRemovewrite(readfile_path, writefile_path):
    inputs = open(readfile_path, 'r', encoding='utf-8')
    outputs = open(writefile_path, 'w', encoding='utf-8')
    for line in inputs:
        line_seg = seg_sentence(line)  # the return value is a space-separated string
        outputs.write(line_seg + '\n')
    outputs.close()
    inputs.close()

# 2. Segment a sentence and remove stopwords
def seg_sentence(sentence):
    # Build the stopword set (CAS list + custom entries); it is re-read on every
    # call here, so for a large corpus load it once outside the loop and reuse it
    stopWords = {line.strip() for line in open('data/stopWord.txt', 'r', encoding='utf-8')}
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopWords and word != '\t':
            outstr += word
            outstr += " "
    return outstr

if __name__ == '__main__':
    readfile_path = r'F:\data\test1.txt'
    writefile_path = r'F:\data\test1_seg.txt'  # output path; pick any location you like
    # Utility flow: read -> segment -> write
    readCutRemovewrite(readfile_path, writefile_path)
    print('Preprocessing finished')
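The noisy-character cleanup mentioned in the preface is not shown in the code above; one common way to do it is a regular-expression pass before segmentation. A minimal sketch, assuming everything outside CJK ideographs, letters, and digits counts as noise:

import re

def clean_line(line):
    # Collapse every run of non-CJK, non-alphanumeric characters into a single space
    return re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]+', ' ', line)

Calling seg_sentence(clean_line(line)) inside readCutRemovewrite plugs this step into the pipeline.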