Get the HTML page:
from lxml import etree
html = etree.HTML(content)  # content is the page's HTML source text
Write the XPath rule:
html = etree.HTML(content)
divs = html.xpath('//div[@class="rank"]//span[@class="span"]')
print(type(divs))
print(divs)
divs is a list of Element objects, so printing it does not show the data itself:
<class 'list'>
[<Element span at 0x16d2edb2848>]
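Each item in that list is an lxml Element, and its content can be read through attributes such as .tag, .attrib and .text. A minimal sketch, assuming a made-up sample string in place of the real page (the markup and the value 9.7 are illustrative, not taken from the original site):

```python
from lxml import etree

# Hypothetical sample markup standing in for the real page content.
sample = '<div class="rank"><span class="span">9.7</span></div>'

html = etree.HTML(sample)
spans = html.xpath('//div[@class="rank"]//span[@class="span"]')

for span in spans:
    # .tag is the element name, .attrib its attributes, .text its leading text
    print(span.tag, dict(span.attrib), span.text)
```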
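etree.HTML(): parses the HTML text into an element tree that can be queried with XPath, automatically repairing malformed markup along the way.
etree.tostring(): serializes the repaired result; the return type is bytes.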
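To see both functions at work on a single element, here is a small sketch. The broken fragment is a hypothetical input chosen only to show the automatic repair:

```python
from lxml import etree

# Hypothetical fragment with unclosed <p> and <div> tags.
broken = '<div><p>hello'

root = etree.HTML(broken)      # parsing repairs the markup
fixed = etree.tostring(root)   # serializes one element; returns bytes

print(type(fixed))
print(fixed.decode('utf-8'))   # decode the bytes to str for display
```

The parser wraps the fragment in html and body elements and closes the open tags, which is why the serialized output is longer than the input.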
html = etree.HTML(content)
divs = html.xpath('//div[@class="rank"]//span[@class="span"]')
d = etree.tostring(divs)  # divs is a list, not an Element, so this raises TypeError
print(d)
This fails with: TypeError: Type 'list' cannot be serialized.
Traceback (most recent call last):
  File "E:/pycharm2019/Test/test.py", line 14, in <module>
    d = etree.tostring(divs)
  File "src/lxml/etree.pyx", line 3443, in lxml.etree.tostring
TypeError: Type 'list' cannot be serialized.
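The error arises because etree.tostring() expects a single element or tree, while xpath() returns a list. If the markup itself is wanted, one option is to serialize the items one at a time. A sketch on hypothetical sample markup (the sample string is an assumption, not the original page):

```python
from lxml import etree

# Hypothetical sample markup standing in for the real page content.
sample = '<div class="rank"><span class="span">9.7</span></div>'
html = etree.HTML(sample)
divs = html.xpath('//div[@class="rank"]//span[@class="span"]')

try:
    etree.tostring(divs)   # passing the whole list reproduces the error
    raised = False
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# Serialize each Element separately; encoding='unicode' returns str, not bytes.
serialized = [etree.tostring(el, encoding='unicode') for el in divs]
print(serialized)
```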
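I searched a lot and could not find a solution to this exact problem, then remembered that /text() can be appended to the end of the rule: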
html = etree.HTML(content)  # parse the HTML page
divs = html.xpath('//div[@class="rank"]//span[@class="span"]/text()')  # extract the data with XPath
print(divs)  # print the data
This returns the target data directly (the etree.tostring call was never needed; the video tutorial I followed was misleading).
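One caveat: /text() selects only the text nodes that are direct children of the matched element, and the results may carry surrounding whitespace. The sketch below, again on hypothetical markup with a nested tag, compares it with the XPath string() function and strips the results:

```python
from lxml import etree

# Hypothetical markup: the span contains direct text plus a nested <b> tag.
sample = '<div class="rank"><span class="span"> Top <b>1</b></span></div>'
html = etree.HTML(sample)

# /text() returns only the direct child text nodes of the span
direct = html.xpath('//div[@class="rank"]//span[@class="span"]/text()')

# string() concatenates all descendant text of the first matching span
full = html.xpath('string(//div[@class="rank"]//span[@class="span"])')

# Drop whitespace-only entries and trim the rest
cleaned = [t.strip() for t in direct if t.strip()]
print(cleaned)
print(full.strip())
```

When the target span holds plain text only, as in the rule above, /text() is enough; string() is worth considering when the text is split across nested tags.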