Python 中文文本分词（包含标点的移除）

最新推荐文章于 2024-06-11 11:51:15 发布

汀桦坞

最新推荐文章于 2024-06-11 11:51:15 发布

阅读量1.9w

点赞数 14

分类专栏：机器学习

本文链接：https://blog.csdn.net/wiborgite/article/details/79886947

版权

机器学习专栏收录该内容

57 篇文章 7 订阅

订阅专栏

背景信息

本文为构建中文词向量的前期准备，主要实现中文文本的分词工作，并且在分词过程中移除了标点符号、英文字符、数字等干扰项，从而可以得到较为纯净的分词后的中文语料。

详细代码

import jieba
import jieba.analyse
import jieba.posseg as pseg
import codecs,sys
from string import punctuation
if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')
# 定义要删除的标点等字符
add_punc='，。、【 】 “”：；（）《》‘’{}？！⑦()、%^>℃：.”“^-——=&#@￥'
all_punc=punctuation+add_punc

#def cut_words(sentence):
    #print sentence
#    return " ".join(jieba.cut(sentence)).encode('utf-8')
# 指定要分词的文本
f=codecs.open('/book/04.word2vec/simple-examples/data/zhuxianshort.txt','r',encoding="utf8")
#指定分词结果的保存文本
target = codecs.open("/book/04.word2vec/simple-examples/data/zxout.txt", 'w',encoding="utf8")
print ('open files')
line_num=1
line = f.readline()
while line:
    print('---- processing ', line_num, ' article----------------')
    # 第一次分词，用于移除标点等符号
	#line=re.sub(r'[A-Za-z0-9]|/d+','',line)   #用于移除英文和数字
	line_seg = " ".join(jieba.cut(line))
    # 移除标点等需要删除的符号
    testline=line_seg.split(' ')
    te2=[]
    for i in testline:
        te2.append(i)
        if i in all_punc:
            te2.remove(i)
    # 返回的te2是个list，转换为string后少了空格，因此需要再次分词
	# 第二次在仅汉字的基础上再次进行分词
    line_seg2 = " ".join(jieba.cut(''.join(te2)))
    target.writelines(line_seg2)
    line_num = line_num + 1
    line = f.readline()
f.close()
target.close()
exit()

说明：上述代码中指定的中文文本为小说诛仙，如下读取了分词前和分词后各10行文本，可以看出标点符号已被删除，但分词效果主要取决于jieba分词器：

汀桦坞

关注

14
点赞
踩
73

收藏

觉得还不错? 一键收藏
打赏
12
评论
Python 中文文本分词（包含标点的移除）

背景信息本文为构建中文词向量的前期准备，主要实现中文文本的分词工作，并且在分词过程中移除了标点符号、英文字符、数字等干扰项，从而可以得到较为纯净的分词后的中文语料。详细代码import jiebaimport jieba.analyseimport jieba.posseg as psegimport codecs,sysfrom string import punctuationif ...
复制链接

扫一扫