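I am writing a crawler with XPath. As a first step I used etree.tostring to dump the structure of the page, but the output was garbled. The code is as follows: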
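# -*- coding:UTF-8 -*-
import requests
from lxml import etree

url = 'http://www.j342c.net/base.php?wer'
raw_html = requests.get(url)           # raw_html.content holds the undecoded response bytes
ahtml = etree.HTML(raw_html.content)   # parse the bytes into an element tree
aresult = etree.tostring(ahtml)        # default serialization; Chinese text comes out garbled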
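The page is encoded in gb18030, and tostring printed the Chinese characters as garbage. By default etree.tostring serializes to ASCII, so every non-ASCII character is written as a numeric character reference (for example &#20013;) instead of readable text.
Change the last statement to: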
aresult = etree.tostring(ahtml,encoding='utf-8',pretty_print=True,method='html')
Solved!
To print the page text to the log, decode the raw response bytes with the following code:
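The difference between the two tostring calls can be reproduced offline. The following is a minimal sketch, not the original crawler: `page_bytes` is a hypothetical stand-in for the downloaded page, and it is decoded in Python before parsing so that the example does not depend on how the parser detects the encoding.

```python
from lxml import etree

# Hypothetical sample standing in for raw_html.content (gb18030-encoded bytes).
page_bytes = '<html><body><p>中文</p></body></html>'.encode('gb18030')

# Decode explicitly, then parse the resulting str.
tree = etree.HTML(page_bytes.decode('gb18030'))

# Default call: ASCII output, non-ASCII characters become &#NNNN; references.
default_out = etree.tostring(tree)

# With an explicit encoding the characters are written as real UTF-8 bytes.
utf8_out = etree.tostring(tree, encoding='utf-8', pretty_print=True, method='html')

# encoding='unicode' returns a str instead of bytes, which is handy for print().
text_out = etree.tostring(tree, encoding='unicode', method='html')

print(default_out)
print(utf8_out.decode('utf-8'))
print(text_out)
```

Note that tostring returns bytes unless encoding='unicode' is passed, so the UTF-8 result still has to be decoded before it is printed as text.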
# decode() takes the source encoding; its second argument is an error handler, not a target encoding
html_text = raw_html.content.decode('gb18030')
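For reference, here is a small sketch of how bytes.decode behaves with gb18030 data. It assumes a hypothetical sample string in place of the real response body. The second parameter of decode is `errors`, an error-handling policy such as 'strict', 'replace' or 'ignore', rather than a target encoding; the result of decode is already a Python str.

```python
# Hypothetical sample standing in for raw_html.content.
raw = '中文网页'.encode('gb18030')

# Well-formed input: only the source encoding is needed.
text = raw.decode('gb18030')

# Simulate a damaged response by cutting off the last byte.
broken = raw[:-1]

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
safe = broken.decode('gb18030', errors='replace')

print(text)
print(safe)
```

Using errors='replace' is a design choice for logging: a damaged page still produces readable output, at the cost of silently hiding the bad bytes.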