Chinese/English string processing (removing irrelevant characters, removing stopwords); word segmentation (jieba); word and character frequency statistics
Word segmentation
jieba.cut() takes three parameters: the string to be segmented; cut_all, which selects full mode; and HMM, which selects whether the HMM model is used.
jieba.cut_for_search() takes two parameters: the string to be segmented and whether to use the HMM model. The input string may be unicode, UTF-8 or GBK encoded, but GBK is not recommended, since it can be wrongly decoded as UTF-8.
Both of the above return an iterable generator; a for loop yields each segmented word (unicode). Alternatively,
jieba.lcut and jieba.lcut_for_search return a list directly.
jieba.Tokenizer() creates a new custom tokenizer, which can be used with different dictionaries. A short usage sketch follows.
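A minimal sketch of these calls, assuming jieba is installed (the sample sentence and the dictionary path are arbitrary choices, not from the original notes):

import jieba

sentence = '我来到北京清华大学'  # any Chinese string works here

# precise mode (the default): cut_all=False, HMM=True
print('/'.join(jieba.cut(sentence, cut_all=False, HMM=True)))

# full mode: emits every possible word found in the dictionary
print('/'.join(jieba.cut(sentence, cut_all=True)))

# search-engine mode: long words are additionally split into shorter ones
print('/'.join(jieba.cut_for_search(sentence)))

# list variants return a list instead of a generator
words = jieba.lcut(sentence)

# a separate tokenizer instance whose main dictionary is a custom file
# (hypothetical path; the file must follow jieba's dictionary format)
# tokenizer = jieba.Tokenizer(dictionary='userdict.txt')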
Removing stopwords
import jieba
# jieba.load_userdict('userdict.txt')  # optionally load a user dictionary

# build the stopword list from a file, one stopword per line
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

# segment a sentence and drop stopwords
def seg_sentence(sentence):
    sentence_seg = jieba.cut(sentence.strip())
    stopwords = stopwordslist('./test/stopwords.txt')
    outstr = ''
    for word in sentence_seg:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr

inputs = open('./test/input.txt', 'r', encoding='utf-8')
outputs = open('./test/output.txt', 'w', encoding='utf-8')
for line in inputs:
    line_seg = seg_sentence(line)  # returns a space-separated string
    outputs.write(line_seg + '\n')
outputs.close()
inputs.close()
However, stopword lists collected from the internet may contain duplicates; the following code removes them.
def stop_reduction(infilepath, outfilepath):
    infile = open(infilepath, 'r', encoding='utf-8')
    outfile = open(outfilepath, 'w', encoding='utf-8')
    stopwordslist = []
    for word in infile.read().split('\n'):
        if word not in stopwordslist:  # keep only the first occurrence
            stopwordslist.append(word)
            outfile.write(word + '\n')
    infile.close()
    outfile.close()
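The title also lists word and character frequency statistics, which the notes above do not implement. A minimal sketch with collections.Counter, assuming the segmented, stopword-free output file from the previous step ('./test/output.txt') is available:

from collections import Counter

# read the segmented text produced above (path is an assumption)
with open('./test/output.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# word frequencies: seg_sentence() joined words with spaces
word_freq = Counter(text.split())

# character frequencies: count every non-whitespace character
char_freq = Counter(ch for ch in text if not ch.isspace())

print(word_freq.most_common(10))
print(char_freq.most_common(10))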