Problem: count how many times each word appears in a text file
First attempt at the code:
import re

with open("test.txt",'r') as text:
    words = text.read().split()
    for word in words:
        if words.count(word)>1:
            print('{},{}times'.format(word,words.count(word)))
Result: the run fails with an encoding error:
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-19-d8c8e28c5803> in <module>
      2 
      3 with open("test.txt",'r') as text:
----> 4     words = text.read().split()
      5     for word in words:
      6         if words.count(word)>1:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x9d in position 220: illegal multibyte sequence
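The 'gbk' in the message is the encoding that open() chose by default in text mode on this machine, and some bytes in the file are not valid in it. The following is a minimal sketch that reproduces the same kind of failure. It assumes, which the original does not state, that the file is UTF-8 and contains a character such as a right curly quote; the sample bytes are made up for illustration.

```python
# Made-up sample (an assumption, not the real test.txt): UTF-8 bytes of a
# sentence with a right curly quote, whose UTF-8 encoding ends in byte 0x9d.
raw = "it works\u201d he said".encode("utf-8")

# Decoding these bytes as gbk fails, like text-mode open() did in the traceback.
try:
    raw.decode("gbk")
    gbk_error = None
except UnicodeDecodeError as exc:
    gbk_error = exc

# Decoding with the encoding the bytes were written in succeeds.
decoded = raw.decode("utf-8")

print("gbk failed:", gbk_error is not None)
print("utf-8 text:", decoded)
```

If the real file turns out to be UTF-8, passing encoding="utf-8" to open() in text mode would likely avoid the error without switching to binary mode.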
Second attempt: read the file in 'rb' mode. 'rb' opens the file in binary format, so no decoding takes place while reading.
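# 'rb' reads raw bytes, so the default text encoding is never applied
with open("test.txt",'rb') as text:
    words = text.read().split()   # list of bytes objects, split on whitespace
    for word in words:
        if words.count(word)>1:   # only report words that occur more than once
            print('{},{}times'.format(word,words.count(word)))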
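The result is shown below. The code now runs, but every word in the output is prefixed with b. The b marks a bytes object, which is what reading in binary mode produces; the word itself is not wrong. To print the words as ordinary text, the bytes have to be decoded.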
b'a',4times
b'virtual',6times
b'in',6times
b'the',6times
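The b prefix comes from formatting a bytes object directly: str.format falls back to its repr. A small sketch of the difference, using a made-up word and count rather than data from test.txt:

```python
# Hypothetical sample values, chosen only to mirror the output format above.
word = b"virtual"
count = 6

# Formatting the bytes object directly shows its repr, including the b prefix.
as_bytes = "{},{}times".format(word, count)

# Decoding first gives a str; str(word, enc) and word.decode(enc) are equivalent.
as_text = "{},{}times".format(word.decode("ascii"), count)

print(as_bytes)
print(as_text)
```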
Third attempt: add a statement that decodes each word when printing it.
with open("test.txt",'rb') as text:
    words = text.read().split()   # list of bytes objects
    for word in words:
        if words.count(word)>1:
            # str(word, "cp936") decodes the bytes into text before printing
            print('{}-{}times'.format(str(word,"cp936"),words.count(word)))
Result:
a-4times
virtual-6times
in-6times
the-6times
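Two things about this last version are worth noting. cp936 is Python's alias for gbk, the codec that failed in the first attempt, so decoding word by word only works as long as no printed word contains the undecodable bytes. Also, as written, the loop calls words.count(word) for every word and prints a repeated word once per occurrence. The following sketch shows one possible alternative using collections.Counter from the standard library. The helper name count_repeated_words, the demo file, and the utf-8 default are assumptions for illustration, not part of the original code.

```python
import os
import tempfile
from collections import Counter


def count_repeated_words(path, encoding="utf-8"):
    """Return {word: count} for words that appear more than once.

    The utf-8 default is an assumption; pass the file's real encoding.
    """
    with open(path, "r", encoding=encoding) as text:
        counts = Counter(text.read().split())  # one pass over the words
    return {word: n for word, n in counts.items() if n > 1}


# Demo on a temporary file with made-up content (not the original test.txt).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("the cat and the dog and the bird")
    repeated = count_repeated_words(demo_path)

# Each repeated word is printed exactly once.
for word, n in repeated.items():
    print("{}-{}times".format(word, n))
```

Counter tallies every word in a single pass, so each repeated word is reported once and the list is not rescanned for every word.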