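Today, while extracting Chinese text from a website's pages, I got garbled output. This post records how I fixed it.
1. The code initially looked like this: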
import requests
from bs4 import BeautifulSoup

# html_list and headers are defined elsewhere in the script (not shown)
for html in html_list:
    requests_html = requests.get(html, headers=headers)
    requests_html.encoding = 'utf-8'  # decode the response body as UTF-8
    bs_html = BeautifulSoup(requests_html.text, "lxml")
    for link in bs_html.find_all("a", {"class": "ulink"}):
        print(link)
The line in question is:
    requests_html.encoding = 'utf-8'
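Setting the encoding this way usually extracts Chinese text without any garbling, but today I happened to hit a page where it fails: running the code printed every Chinese character as garbage. Switching to gbk did not help, and neither did leaving the encoding unset.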
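A small sketch of why a wrong encoding garbles the text. The sample string is made up for illustration and is not taken from the site. It encodes Chinese text as gb2312 and then decodes the bytes as UTF-8 with `errors="replace"`, which, as far as I know, is the lenient mode requests uses when building `.text`.

```python
# Illustration only: the sample string is invented, not scraped from the site.
text = "中文网页"

# A server that serves gb2312 puts these bytes on the wire.
raw = text.encode("gb2312")

# In strict mode, decoding those bytes as UTF-8 raises an error.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# In lenient mode the invalid bytes become U+FFFD replacement characters,
# so the result is garbage rather than the original text.
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were written in restores the text.
restored = raw.decode("gb2312")

print("strict UTF-8 decode succeeded:", strict_ok)
print("lenient UTF-8 decode matches original:", garbled == text)
print("gb2312 decode matches original:", restored == text)
```

The bytes themselves are intact in all three cases. Only the label used to interpret them differs, which is why changing `encoding` on the response is enough to fix the output.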
I opened the site's URL in Firefox and found that the page is encoded as gb2312.
for html in html_list:
    requests_html = requests.get(html, headers=headers)
    requests_html.encoding = 'gb2312'  # match the encoding the page declares
    bs_html = BeautifulSoup(requests_html.text, "lxml")
    for link in bs_html.find_all("a", {"class": "ulink"}):
        print(link)
After this change the Chinese text displays correctly.
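Instead of checking the encoding by hand in a browser each time, the declared charset can be read from the page's raw bytes. The following is a minimal sketch using only the standard library. `sniff_charset` and `decode_page` are hypothetical helper names, and the sample HTML is invented. It assumes the page declares its charset in a `<meta>` tag near the top, which may not hold for every site.

```python
import re

# Matches charset=... inside a <meta> tag, e.g.
#   <meta charset="utf-8">
#   <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([\w-]+)""", re.IGNORECASE
)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    match = _META_CHARSET.search(raw[:2048])  # only look near the top of the page
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def decode_page(raw: bytes) -> str:
    """Decode raw page bytes using the declared charset (hypothetical helper)."""
    charset = sniff_charset(raw)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect label: fall back to a lossy UTF-8 decode.
        return raw.decode("utf-8", errors="replace")


# Invented sample page, encoded the way a gb2312 site would send it.
sample = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=gb2312"></head>'
    '<body><a class="ulink">电影</a></body></html>'
).encode("gb2312")

print(sniff_charset(sample))
print("电影" in decode_page(sample))
```

With requests, the raw bytes are available as `requests_html.content`, so `decode_page(requests_html.content)` could stand in for `requests_html.text`. requests also exposes `apparent_encoding`, a guess made from the bytes themselves, which can be assigned to `requests_html.encoding` as an alternative.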