【2018-7-10】
- UnicodeDecodeError
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte
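Counting character appearances in Romance of the Three Kingdoms (example taken from Song Tian's 《Python语言程序设计基础》). The file threekingdoms.txt contains the full text of the novel.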
#CalThreeKingdomsV1.py
import jieba

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)            # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:             # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)   # most frequent first
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Running the program with F5 produces this error:
Traceback (most recent call last):
  File "C:\Users\Vicki\Desktop\MOOC\python\WEEK6\CalThreeKingdomsV1.py", line 4, in <module>
    txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
  File "E:\Python36\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte
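【Cause】: An encoding problem. When the full text of the novel was pasted into threekingdoms.txt, the file was saved with the default "ANSI" encoding (on a Simplified Chinese Windows system this normally means GBK), but the program opens it as UTF-8.
【Fix】: Re-save threekingdoms.txt as a txt file with the encoding set to "UTF-8", then press F5 to run again.
【Output】: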
== RESTART: C:\Users\Vicki\Desktop\MOOC\python\WEEK6\CalThreeKingdomsV1.py ==
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Vicki\AppData\Local\Temp\jieba.cache
Loading model cost 1.498 seconds.
Prefix dict has been built succesfully.
曹操          953
孔明          836
将军          772
却说          656
玄德          585
关公          510
丞相          491
二人          469
不可          440
荆州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
张飞          358
>>>
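As an alternative to re-saving the file by hand, the same repair can be done from Python. The following is only a minimal sketch, not part of the textbook example: the helper names read_text_with_fallback and convert_to_utf8 are made up for illustration, and it assumes the "ANSI" file is really GBK. UTF-8 is tried first because GBK will accept many byte sequences that are not actually GBK text.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "gbk")):
    """Return (text, encoding) for the first encoding that decodes the file.

    Hypothetical helper; the candidate list is an assumption about the input.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError("none of {} could decode {}".format(encodings, path))


def convert_to_utf8(path, encodings=("utf-8", "gbk")):
    """Rewrite the file as UTF-8 if needed; return the encoding it had."""
    text, enc = read_text_with_fallback(path, encodings)
    if enc != "utf-8":
        # newline="" keeps the original line endings untouched
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
    return enc


def demo():
    """Create a small GBK file in a temp directory and convert it."""
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, "sample.txt")
    with open(path, "wb") as f:
        f.write("三国演义".encode("gbk"))
    print("was:", convert_to_utf8(path))
    print("now:", read_text_with_fallback(path)[1])


if __name__ == "__main__":
    demo()
```

After convert_to_utf8("threekingdoms.txt") has run once, the original open(..., encoding='utf-8') call should work unchanged. Passing encoding='gbk' to open() would be another way to read the file without converting it.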
- SyntaxError: Non-UTF-8 code starting with...
SyntaxError: Non-UTF-8 code starting with '\xbd' in file
Counting character appearances in Romance of the Three Kingdoms, improved version (example taken from Song Tian's 《Python语言程序设计基础》). The file threekingdoms.txt contains the full text of the novel.
#CalThreeKingdomsV2.py
import jieba

# frequent words that are not character names
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # merge different names for the same character
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Running the program with F5 produces this error:
File "CalThreeKingdomsV2.py", line 3
SyntaxError: Non-UTF-8 code starting with '\xbd' in file CalThreeKingdomsV2.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
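【Cause】: Line 3 of the .py file contains Chinese characters, and the file is not saved as UTF-8 and declares no encoding. Python 3 assumes UTF-8 source by default, so it fails when it reaches those bytes.
【Fix】: Add this line at the top of the .py file: # coding=gbk
【Output】: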
== RESTART: C:\Users\Vicki\Desktop\MOOC\python\WEEK6\CalThreeKingdomsV2.py ==
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Vicki\AppData\Local\Temp\jieba.cache
Loading model cost 1.011 seconds.
Prefix dict has been built succesfully.
曹操         1451
孔明         1383
刘备         1252
关羽          784
张飞          358
商议          344
如何          338
主公          331
军士          317
吕布          300
>>>
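To see what the "# coding=gbk" line changes, the standard library's tokenize.detect_encoding can report which encoding Python will assume for a source file. This is a small illustrative sketch: the helper name declared_encoding and the two sample sources are made up here, and the function only inspects the first two lines, as PEP 263 specifies.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python would assume for these bytes.

    Hypothetical helper wrapping tokenize.detect_encoding, which looks for
    a BOM or a PEP 263 coding comment in the first two lines.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Sample sources, invented for illustration
with_declaration = b"# coding=gbk\nimport jieba\n"
without_declaration = b"import jieba\n"

print(declared_encoding(with_declaration))
print(declared_encoding(without_declaration))
```

Without the declaration the result falls back to UTF-8, which is why GBK-encoded Chinese characters on line 3 trigger the SyntaxError. Re-saving the .py file itself as UTF-8 should also avoid the error, with no declaration needed.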