Chinese/English string processing (removing irrelevant characters, removing stopwords); word segmentation (jieba); word and character frequency statistics
Word segmentation
jieba.cut() takes three parameters: the string to be segmented; cut_all, which selects full mode; and HMM, which selects whether the HMM model is used.
jieba.cut_for_search() takes two parameters: the string to be segmented and whether to use the HMM model. The input string may be unicode, UTF-8 or GBK encoded, but GBK is not recommended, since it can be wrongly decoded as UTF-8.
Both of the above return an iterable generator; a for loop yields each segmented word (unicode). Alternatively,
jieba.lcut and jieba.lcut_for_search return a list directly.
jieba.Tokenizer() creates a new custom tokenizer, which can be used with different dictionaries. A short usage sketch follows.
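A minimal sketch of these calls, assuming jieba is installed (the sample sentence and the dictionary path are arbitrary choices, not from the original notes):

import jieba

sentence = '我来到北京清华大学'  # any Chinese string works here

# precise mode (the default): cut_all=False, HMM=True
print('/'.join(jieba.cut(sentence, cut_all=False, HMM=True)))

# full mode: emits every possible word found in the dictionary
print('/'.join(jieba.cut(sentence, cut_all=True)))

# search-engine mode: long words are additionally split into shorter ones
print('/'.join(jieba.cut_for_search(sentence)))

# list variants return a list instead of a generator
words = jieba.lcut(sentence)

# a separate tokenizer instance whose main dictionary is a custom file
# (hypothetical path; the file must follow jieba's dictionary format)
# tokenizer = jieba.Tokenizer(dictionary='userdict.txt')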
Removing stopwords
import jieba
# jieba.load_userdict('userdict.txt')  # optionally load a user dictionary

# build the stopword list from a file, one stopword per line
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

# segment a sentence and drop stopwords
def seg_sentence(sentence):
    sentence_seg = jieba.cut(sentence.strip())
    stopwords = stopwordslist('./test/stopwords.txt')
    outstr = ''
    for word in sentence_seg:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr

inputs = open('./test/input.txt', 'r', encoding='utf-8')
outputs = open('./test/output.txt', 'w', encoding='utf-8')
for line in inputs:
    line_seg = seg_sentence(line)  # returns a space-separated string
    outputs.write(line_seg + '\n')
outputs.close()
inputs.close()
However, stopword lists collected from the internet may contain duplicates; the following code removes them.
def stop_reduction(infilepath, outfilepath):
    infile = open(infilepath, 'r', encoding='utf-8')
    outfile = open(outfilepath, 'w', encoding='utf-8')
    stopwordslist = []
    for word in infile.read().split('\n'):
        if word not in stopwordslist:  # keep only the first occurrence
            stopwordslist.append(word)
            outfile.write(word + '\n')
    infile.close()
    outfile.close()
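The title also lists word and character frequency statistics, which the notes above do not implement. A minimal sketch with collections.Counter, assuming the segmented, stopword-free output file from the previous step ('./test/output.txt') is available:

from collections import Counter

# read the segmented text produced above (path is an assumption)
with open('./test/output.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# word frequencies: seg_sentence() joined words with spaces
word_freq = Counter(text.split())

# character frequencies: count every non-whitespace character
char_freq = Counter(ch for ch in text if not ch.isspace())

print(word_freq.most_common(10))
print(char_freq.most_common(10))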