NLP Basics: A Small Demo of Chinese Word Segmentation and Stop-Word Removal

Author: 你搁这儿写bug呢？

Original link: https://www.cnblogs.com/pinard/p/6744056.html

1. Segmenting Chinese text and removing stop words with jieba

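The ChnSentiCorp_htl_all dataset was downloaded from: https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ChnSentiCorp_htl_all/intro.ipynb
The dataset contains more than 7,000 hotel reviews: over 5,000 positive and over 2,000 negative. Each row has two columns. The first is the label, 0 for a negative review and 1 for a positive one; the second is the review text. For this small demo, 12 positive and 12 negative reviews were copied at random to serve as the data.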
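The post does not show how the sample was drawn. A minimal sketch of one way to do it, assuming the downloaded file is a CSV with a header row and columns named `label` and `review` (the column names, the helper name `sample_reviews`, and the demo rows are all assumptions for illustration, not taken from the post or the dataset):

```python
import csv
import io
import random


def sample_reviews(csv_file, per_class=12, seed=0):
    """Return `per_class` negative and `per_class` positive review texts.

    Assumes a header row with columns `label` ('0' or '1') and `review`.
    """
    by_label = {'0': [], '1': []}
    for row in csv.DictReader(csv_file):
        label = row['label'].strip()
        review = row['review'].strip()
        if label in by_label and review:
            by_label[label].append(review)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return (rng.sample(by_label['0'], per_class),
            rng.sample(by_label['1'], per_class))


# Tiny in-memory stand-in for the real CSV (made-up rows), just to show the call
demo_csv = io.StringIO(
    "label,review\n"
    "1,房间很干净\n"
    "0,服务态度很差\n"
    "1,位置方便\n"
    "0,隔音不好\n"
)
negative, positive = sample_reviews(demo_csv, per_class=1)
print(negative, positive)
```

With the real data, the same function would be called on the opened CSV file (opened with encoding='utf-8' and newline=''), keeping the default per_class=12.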
Now for the hands-on part.
Read the file directly and segment it:

# -*- coding: utf-8 -*-
import jieba

# Read the whole file as one string
with open('data-preprocess.txt', encoding='utf-8') as f:
    document = f.read()

document_cut = jieba.cut(document)  # accurate mode is the default
res = ' '.join(document_cut)

print(res)

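Result:
From the second review onward, every line starts with an extra space. This happens because the code above treats all the reviews as a single piece of data, so the newline between two reviews becomes a token that ' '.join surrounds with spaces. Each review therefore needs to be handled as its own piece of data, which means reading the document so that every review is a separate line.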
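The stray space can be reproduced without jieba. The sketch below uses a hand-written token list (an assumption standing in for what `jieba.cut` yields when the input contains a newline) to show that joining with `' '` puts a space on both sides of the newline token, while joining each line separately does not:

```python
# Hypothetical tokens for two one-word reviews separated by a newline
tokens = ['不错', '\n', '很差']

joined = ' '.join(tokens)
print(repr(joined))  # the newline token gets a space on each side

# Handling each line separately keeps the start of every line clean
lines = [['不错'], ['很差']]
per_line = '\n'.join(' '.join(line) for line in lines)
print(repr(per_line))
```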
Read the data line by line and segment it:

import jieba

# Put each review in the document on its own line
file_line = []
count = 0  # line counter
with open('data-preprocess.txt', encoding='utf-8') as f:
    for line in f:
        file_line.append(line)
        count += 1
print("Total lines: %d" % count)

# Segment each review separately
res = []
for i in range(len(file_line)):
    sentence_seged = jieba.cut(file_line[i].strip())
    res.append(' '.join(sentence_seged))
print('Segmentation finished:')
for i in range(len(res)):
    print(res[i])

Result:
Each review is now processed as a separate piece of data.
In addition, jieba supports custom dictionaries, which can be used in two ways:

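# Tune a word's frequency at runtime so that jieba keeps it as one word
jieba.suggest_freq('川沙公路', True)  # this call must run somewhere before jieba.cut
# Alternatively, prepare a dictionary file first, e.g. mydict.txt
# Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path of the custom dictionary
# Dictionary format: one word per line; each line has three space-separated parts in a fixed order: word, frequency (optional), part-of-speech tag (optional).
# If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
# add_word(word, freq=None, tag=None) and del_word(word) modify the dictionary at runtime.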
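To make the dictionary format concrete, here is a minimal sketch that writes a file in that layout. The helper names `format_userdict_line` and `write_userdict`, the second word, and the frequency and tag values are made up for illustration; only the line format follows the description above.

```python
import os
import tempfile


def format_userdict_line(word, freq=None, tag=None):
    """Build one dictionary line: word, optional frequency, optional POS tag."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return ' '.join(parts)


def write_userdict(entries, path):
    """Write (word, freq, tag) tuples to `path` as UTF-8, one word per line."""
    with open(path, 'w', encoding='utf-8') as f:
        for word, freq, tag in entries:
            f.write(format_userdict_line(word, freq, tag) + '\n')


# Illustrative entries; the frequency and tag values are made up
entries = [('川沙公路', 10, 'n'), ('大床房', None, None)]
dict_path = os.path.join(tempfile.mkdtemp(), 'mydict.txt')
write_userdict(entries, dict_path)

with open(dict_path, encoding='utf-8') as f:
    dict_lines = f.read().splitlines()
print(dict_lines)

# With jieba installed, the file could then be loaded before segmenting:
# import jieba
# jieba.load_userdict(dict_path)
```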

Removing stop words:

Stop-word lists can be downloaded from https://github.com/goto456/stopwords; pick whichever one suits your needs.

# Load the stop-word list
# Use your own Chinese stop-word list; add extra stop words as your data requires
with open('cn_stopwords.txt', encoding='utf-8') as f_stop:
    stopwords = [line.strip() for line in f_stop]

word_list_seg = []
for i in range(len(res)):
    outstr = ''
    for word in res[i].split():
        if word not in stopwords:
            if word != '\t':  # split() already drops tabs; kept as a safeguard
                outstr += word
                outstr += " "
    word_list_seg.append(outstr)
print('_______________________')
print('Stop-word removal finished:')
print(len(word_list_seg))

for i in range(len(word_list_seg)):
    print(word_list_seg[i])

Result:
The reviews are printed again with the stop words removed.
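The filtering step can also be wrapped in two small helpers. This is a sketch rather than the post's own code: the names `load_stopwords` and `remove_stopwords` and the sample inputs are assumptions. It stores the stop words in a set, which makes each membership test cheaper than scanning a list.

```python
def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(segmented, stopwords):
    """Drop stop words from one space-separated, already segmented review."""
    return ' '.join(w for w in segmented.split() if w not in stopwords)


# Illustrative inputs; a real run would pass the opened stop-word file instead
stopwords = load_stopwords(['的', '了', '', '，'])
cleaned = remove_stopwords('酒店 的 位置 不错 ， 房间 小 了 点', stopwords)
print(cleaned)
```

Because an open text file is itself an iterable of lines, `load_stopwords` can be given the file object for cn_stopwords.txt directly.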
Complete code:

# -*- coding: utf-8 -*-
import jieba

# Put each review in the document on its own line
file_line = []
count = 0  # line counter
with open('data-preprocess.txt', encoding='utf-8') as f:
    for line in f:
        file_line.append(line)
        count += 1
print("Total lines: %d" % count)

# Tune a word's frequency at runtime so that jieba keeps it as one word
jieba.suggest_freq('川沙公路', True)
# Alternatively, prepare a dictionary file first, e.g. mydict.txt
# Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path of the custom dictionary
# Dictionary format: one word per line; each line has three space-separated parts in a fixed order: word, frequency (optional), part-of-speech tag (optional).
# If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
# add_word(word, freq=None, tag=None) and del_word(word) modify the dictionary at runtime.

# Segment with jieba
# file_userDict = 'dict.txt'  # custom dictionary, not created yet
# jieba.load_userdict(file_userDict)
res = []
for i in range(len(file_line)):
    sentence_seged = jieba.cut(file_line[i].strip())
    res.append(' '.join(sentence_seged))
print('Segmentation finished:')
for i in range(len(res)):
    print(res[i])

# Load the stop-word list (your own Chinese stop-word file)
with open('cn_stopwords.txt', encoding='utf-8') as f_stop:
    stopwords = [line.strip() for line in f_stop]

word_list_seg = []
for i in range(len(res)):
    outstr = ''
    for word in res[i].split():
        if word not in stopwords:
            if word != '\t':  # split() already drops tabs; kept as a safeguard
                outstr += word
                outstr += " "
    print('outstr:', outstr)
    word_list_seg.append(outstr)
print('_______________________')
print('Stop-word removal finished:')
print(len(word_list_seg))

for i in range(len(word_list_seg)):
    print(word_list_seg[i])

References

https://www.cnblogs.com/pinard/p/6744056.html
https://blog.csdn.net/qq_42491242/article/details/105006651
https://github.com/fxsjy/jieba
