Python notes: word segmentation with jieba
Use jieba to segment Romance of the Three Kingdoms (三国演义) and find the ten characters who appear most often:
First, load the file:
txt = open('三国演义.txt', 'r', encoding='utf-8').read()
If reading the file raises a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 0 exception, there are two fixes: either change the encoding argument from utf-8 to gb18030, or open the file in an editor, use "Save As", and set its encoding to UTF-8.
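If the file's encoding is uncertain, a small helper can try both encodings in turn instead of editing the call by hand. A minimal sketch, assuming the file is in one of the two encodings mentioned above (the read_novel name is our own, not from the original notes):

def read_novel(path):
    # Try UTF-8 first, then fall back to gb18030 (a superset of GBK).
    for enc in ('utf-8', 'gb18030'):
        try:
            with open(path, 'r', encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            pass
    raise ValueError('could not decode ' + path)

txt = read_novel('三国演义.txt')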
import jieba  # jieba must be imported before use

words = jieba.lcut(txt)  # precise mode (jieba's default)
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single-character tokens (punctuation, particles, etc.)
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
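The intro promises the ten most frequent characters, so the script still needs a printing step. A minimal sketch of that final step, using the items list built above:

for word, count in items[:10]:
    print(word, count)

Note that this ranks every token of two or more characters, not only person names: terms such as 曹操 will appear alongside ordinary words, and a faithful character ranking would additionally need a name list and the merging of aliases (e.g. counting 孔明 together with 诸葛亮), which these notes do not cover.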
We create an empty dictionary counts and look up each word from words in it with get, tallying how many times each term appears; single-character tokens are skipped, since person names are at least two characters long. Converting counts to a list of (word, count) pairs and sorting by count in descending order then ranks the terms by frequency.
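As an aside, the comment on jieba.lcut above refers to jieba's precise mode, which is the default and segments text into a non-overlapping sequence of words. Full mode (cut_all=True) instead returns every word it can recognize, overlaps included; a quick comparison using the example sentence from jieba's own README:

import jieba

s = '我来到北京清华大学'
print(jieba.lcut(s))                # precise mode: ['我', '来到', '北京', '清华大学']
print(jieba.lcut(s, cut_all=True))  # full mode: ['我', '来到', '北京', '清华', '清华大学', '华大', '大学']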