Passing a page's source to lxml's etree.HTML, to build an object that can be queried with XPath, produced the following error:
response = etree.HTML(response.text)
File "lxml.etree.pyx", line 2953, in lxml.etree.HTML (src\lxml\lxml.etree.c:66734)
File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:101591)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
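Judging from the error message, the likely cause is that lxml does not accept a Unicode string (str) that carries an encoding declaration, such as a leading <?xml version="1.0" encoding="utf-8"?>. A Google search shows that this was already reported to the author back in 2012 and has still not been fixed. The report is at https://gist.github.com/karlcow/3258330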
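The error message itself names a second way out: "XML fragments without declaration". A minimal sketch of that idea, using only the standard library, is shown here. The regular expression and the helper names has_encoding_declaration and strip_xml_declaration are assumptions made for illustration. They approximate the condition the message describes and are not lxml's own check.

```python
import re

# Assumed approximation: an XML declaration at the very start of the text
# that contains an encoding attribute. lxml's internal check may differ.
XML_ENCODING_DECL = re.compile(
    r'^\s*<\?xml[^>]*\bencoding\s*=\s*["\'][^"\']*["\'][^>]*\?>',
    re.IGNORECASE,
)


def has_encoding_declaration(text):
    """Return True if the str starts with an XML declaration naming an encoding."""
    return XML_ENCODING_DECL.match(text) is not None


def strip_xml_declaration(text):
    """Drop a leading XML declaration so only the markup itself remains."""
    return XML_ENCODING_DECL.sub('', text, count=1).lstrip()


# Hypothetical page source, already decoded to str (like response.text).
sample = '<?xml version="1.0" encoding="utf-8"?>\n<html><body><p>hi</p></body></html>'
cleaned = strip_xml_declaration(sample)
```

Because the text is already decoded, the declaration no longer carries information the parser needs, so the cleaned str should be acceptable input for etree.HTML.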
Commenters on that gist, however, also posted a workaround:
response = bytes(bytearray(response.text, encoding='utf-8'))
response = etree.HTML(response)
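This first converts the source text into a bytearray and then converts that bytearray into a bytes object. Passing bytes instead of a str sidesteps the problem, and XPath can then be used to extract data.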
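The two-step bytearray conversion yields the same bytes as a plain str.encode call, so the workaround can be written more directly. The sketch below checks that equivalence and wraps the parse in a small helper. The names to_parser_bytes and extract_texts and the sample markup are hypothetical, and lxml is imported inside the helper only so that the rest of the snippet runs without it.

```python
def to_parser_bytes(text, encoding='utf-8'):
    """Encode decoded page text to bytes, the input type the error message asks for."""
    return text.encode(encoding)


# Hypothetical page source, already decoded to str (like response.text).
sample = '<?xml version="1.0" encoding="utf-8"?>\n<html><body><p id="msg">hello</p></body></html>'

# The workaround from the gist and the direct call produce identical bytes.
via_bytearray = bytes(bytearray(sample, encoding='utf-8'))
via_encode = to_parser_bytes(sample)
same = via_bytearray == via_encode


def extract_texts(text, xpath_expr):
    """Parse the page as bytes with lxml and return the XPath matches."""
    from lxml import etree  # third-party; imported lazily for this sketch
    tree = etree.HTML(to_parser_bytes(text))
    return tree.xpath(xpath_expr)
```

A call such as extract_texts(response.text, '//p[@id="msg"]/text()') would then replace the two lines of the workaround. If response is a requests.Response (an assumption, since the post does not say which HTTP library is used), response.content already holds the undecoded bytes and can be passed to etree.HTML without re-encoding.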
Reference: https://www.cnblogs.com/xieqiankun/p/lxmloldbug.html