data.txt is a Chinese-language text file. The code that reads it is:
with open('data.txt', 'rt') as f:
    corpus_chars = f.read()
print(corpus_chars[0:49])
It fails with the following error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 4: illegal multibyte sequence
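Cause: no encoding was passed to open(), so Python fell back to the system locale encoding (GBK here), while the file itself is not GBK-encoded.
Solution: read the file with an explicit encoding, i.e. with open(fname, encoding='utf-8') as data_file, so that it is decoded as UTF-8.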
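To see which default applies on a given machine, a minimal check like the one below may help. It only uses the standard-library locale module; the variable name default_encoding is just for illustration. On a Chinese-locale Windows system the result is typically cp936, which Python treats as an alias of GBK, but the value depends on the environment.

```python
import locale

# Encoding that open() falls back to in text mode when no
# encoding= argument is given (depends on OS and locale settings).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints anything other than the encoding the file was saved in, passing encoding= explicitly is the safer choice.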
So the original code becomes:
with open('data.txt', 'rt', encoding='utf-8') as f:
    corpus_chars = f.read()
print(corpus_chars[0:49])
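When it is not known in advance whether a file is UTF-8 or GBK, one possible approach is to try UTF-8 first and fall back to GBK. The sketch below is only an illustration under that assumption: read_text_with_fallback is a hypothetical helper, not a standard-library function, and the sample file is created in a temporary directory so the snippet runs on its own.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=('utf-8', 'gbk')):
    """Hypothetical helper: try each encoding in order.

    Returns (text, encoding_used); raises ValueError if none works.
    """
    for enc in encodings:
        try:
            with open(path, 'rt', encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file, try the next
    raise ValueError(f'none of {encodings} could decode {path}')


# Demo: write a small UTF-8 file, then read it back.
sample = '这是中文文档'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.txt')
    with open(path, 'wt', encoding='utf-8') as f:
        f.write(sample)
    corpus_chars, used_encoding = read_text_with_fallback(path)

print(used_encoding, corpus_chars[0:49])
```

UTF-8 is tried first because its decoder is strict and rejects most non-UTF-8 input, whereas GBK can sometimes decode unrelated bytes into wrong characters without raising an error. If the encoding is known, passing it directly to open() as above remains simpler.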