The page I scraped is encoded as "gb2312", so my first instinct was to use that same encoding for both encoding and decoding.
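import re
from lxml import etree
import html

# Read the saved page, decoding it as gbk
with open('test.html', 'r', encoding='gbk') as f:
    c = f.read()
s = re.sub(r'\n', ' ', c)  # newline-stripped copy (not used below)
tree = etree.HTML(c)
rows = tree.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
for row in rows:
    boards = {}
    # tostring() returns ASCII bytes by default, with non-ASCII characters
    # written as numeric character references
    s1 = etree.tostring(row).decode('gbk')
    # Turn the character references back into readable characters
    s1 = html.unescape(s1)
    print(s1)
    break  # only inspect the first row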
Since "gbk" is a superset of "gb2312", I used "gbk" here; the result is the same either way.
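To make the superset relationship concrete, here is a small sketch that uses only Python's built-in codecs. The helper name `can_encode` and the sample characters are my own choices for illustration and do not come from the scraped page.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

common = '中文网页'  # sample characters covered by both gb2312 and gbk
rare = '镕'          # sample character that gbk covers but gb2312 does not

# For characters that gb2312 covers, both codecs produce identical bytes
print(common.encode('gb2312') == common.encode('gbk'))
# A gbk-only character fails under gb2312 but succeeds under gbk
print(can_encode(rare, 'gb2312'), can_encode(rare, 'gbk'))
```

This is why choosing "gbk" over "gb2312" is the safer default: it decodes everything gb2312 can, plus the extra characters.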
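After reading through quite a few blog posts, I found one recurring piece of advice:
Whatever encoding a scraped page uses, convert it to utf-8 before storing it.
I'm not yet clear on the exact reason myself, so I'll fill that in later.
What I can say is that for gbk or gb2312 pages, converting to utf-8 at storage time does no harm.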
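A minimal sketch of that conversion, assuming the page was first saved to disk as gbk. The function name `convert_to_utf8` and the temporary file names are hypothetical, chosen only for this example.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding='gbk'):
    """Re-save a text file as utf-8, decoding it from src_encoding first."""
    with open(src_path, 'r', encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(text)
    return text

# Demo with a throwaway gbk file standing in for a scraped page
tmp_dir = tempfile.mkdtemp()
gbk_path = os.path.join(tmp_dir, 'page_gbk.html')
utf8_path = os.path.join(tmp_dir, 'page_utf8.html')
with open(gbk_path, 'w', encoding='gbk') as f:
    f.write('<li>图书排行榜</li>')

convert_to_utf8(gbk_path, utf8_path)
with open(utf8_path, 'r', encoding='utf-8') as f:
    converted = f.read()
print(converted)
```

Once the file is stored as utf-8, it can be opened with `encoding='utf-8'` from then on.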
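# Same parsing code, this time using utf-8 for both reading and decoding
import re
from lxml import etree
import html

with open('test.html', 'r', encoding='utf-8') as f:
    c = f.read()
s = re.sub(r'\n', ' ', c)  # newline-stripped copy (not used below)
tree = etree.HTML(c)
rows = tree.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
for row in rows:
    boards = {}
    # tostring() returns ASCII bytes by default, so decoding as utf-8 is safe
    s1 = etree.tostring(row).decode('utf-8')
    # Turn the character references back into readable characters
    s1 = html.unescape(s1)
    print(s1)
    break  # only inspect the first row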
The output displays correctly.