Datawhale_day5

Latest recommended article published 2022-04-20 10:08:13

越来越胖的GuanRunwei

Categories: Datawhale, Machine Learning, NLP

Article link: https://blog.csdn.net/qq_38890412/article/details/107666424


Word segmentation:

import jieba


def cut_words(sentence):
    # Segment the text with jieba and join the tokens with single spaces.
    # Returns str rather than bytes so it can be written to a text-mode file.
    return " ".join(jieba.cut(sentence))


# Read the simplified-Chinese Wikipedia text line by line and write the
# space-separated segmentation of each line to the target file.
with open('wiki.zh.jian.text', 'r', encoding='utf8') as f, \
        open('zh.jian.wiki.seg-1.3g.txt', 'w', encoding='utf8') as target:
    print('open files')
    for line_num, line in enumerate(f, start=1):
        print('---- processing ', line_num, ' article----------------')
        # The line still carries its trailing newline, as in the original loop.
        target.write(cut_words(line))
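The read, segment, write loop above can also be sketched as a small function that works on any text streams, which makes it easy to try on a couple of lines before running it over the whole Wikipedia dump. This is only a minimal sketch under assumptions: `segment_stream` and `char_tokenize` are hypothetical names that do not appear in the original script, and `char_tokenize` is a stand-in that splits text into single characters so the example runs without jieba installed.

```python
import io
from typing import Callable, Iterable, List, TextIO


def segment_stream(src: TextIO, dst: TextIO,
                   tokenize: Callable[[str], Iterable[str]]) -> int:
    """Write each line of src to dst as space-separated tokens.

    Returns the number of lines processed.
    """
    count = 0
    for line in src:
        text = line.rstrip("\r\n")  # drop the line ending before tokenizing
        # Skip whitespace-only tokens so the output has single spaces only.
        tokens = [tok for tok in tokenize(text) if tok.strip()]
        dst.write(" ".join(tokens) + "\n")
        count += 1
    return count


def char_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in tokenizer: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


# Try it on two in-memory lines instead of the real corpus files.
src = io.StringIO("我爱北京\n自然 语言\n")
dst = io.StringIO()
n = segment_stream(src, dst, char_tokenize)
result = dst.getvalue()
print(n)
print(result)
```

To use it on the real corpus, one would presumably pass `jieba.cut` as `tokenize` and replace the `StringIO` objects with the two files opened with `encoding='utf8'`.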