1. Problem
The page I was crawling is encoded as UTF-8, but when I saved its content with the following code:
from urllib.request import urlopen

def get_url():
    url = 'https://www.hao123.com/'
    resp = urlopen(url)
    with open('baidu.html', mode='w') as file:
        content = resp.read()
        # print(f)
        # file.write(f)
        file.write(content.decode("UTF-8"))
    print('file is done!!')

if __name__ == '__main__':
    get_url()
I got the following error:
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 252532: illegal multibyte sequence
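The same error can be reproduced without downloading anything, which suggests it comes from encoding the text when it is written rather than from decoding the response. A minimal sketch, assuming the codec named in the message ('gbk') is applied to the character '\u2022' from the traceback; the variable names are only for illustration:

```python
# Minimal offline reproduction: the character from the traceback
# cannot be encoded with the gbk codec, but it can with UTF-8.
bullet = '\u2022'

try:
    bullet.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError as exc:
    gbk_ok = False
    print('gbk failed:', exc.reason)

# The same character encodes to three bytes in UTF-8.
utf8_bytes = bullet.encode('utf-8')
print(utf8_bytes)
```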
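2. Cause and solution
The cause is that open() on Windows falls back to the system locale's encoding when none is given, which is 'gbk' on a Chinese-locale system. The decoded text contains characters such as '\u2022' that gbk cannot represent, so the write fails. Opening the file with the encoding set to 'UTF-8' fixes it: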
with open('baidu.html', mode='w', encoding="utf-8") as file:
The only change is the encoding="utf-8" argument added to the open() call.
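Putting the fix back into the whole script, a possible version looks like the sketch below. The helper write_html and the parameters of get_url are hypothetical names introduced here for illustration, not part of the original code; the URL and file name are the ones from the script above.

```python
import locale
from urllib.request import urlopen


def write_html(content: bytes, path: str) -> None:
    """Decode UTF-8 bytes and save them as a UTF-8 text file (hypothetical helper)."""
    text = content.decode('utf-8')
    # Explicit encoding: do not rely on the platform default.
    with open(path, mode='w', encoding='utf-8') as file:
        file.write(text)


def get_url(url: str = 'https://www.hao123.com/', path: str = 'baidu.html') -> None:
    # Download the page and hand the raw bytes to the helper.
    with urlopen(url) as resp:
        write_html(resp.read(), path)
    print('file is done!!')


if __name__ == '__main__':
    # Prints the encoding open() would use if encoding= were omitted.
    print(locale.getpreferredencoding(False))
    get_url()
```

If the page only needs to be saved and not processed as text, opening the file with mode='wb' and writing resp.read() directly also avoids the encoding step altogether.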