XPath in detail
Selecting nodes
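nodename  Selects the child elements of the current node that are named nodename; for example, bookstore selects the bookstore children of the current node
/  Selects from the root node; for example, /bookstore selects the root element bookstore
//  Selects matching nodes anywhere in the document, regardless of position; for example, //bookstore selects every bookstore node
@  Selects attributes; for example, //book[@price] selects every book node that has a price attribute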
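The four expressions above can be tried against a small document. This is a minimal sketch: the bookstore XML is a made-up sample (not taken from these notes), and the variable names are only illustrative.

```python
from lxml import etree

# Hypothetical sample document used only for illustration.
xml = (
    "<bookstore>"
    '<book price="10"><title>A</title></book>'
    "<book><title>B</title></book>"
    "<magazine><title>C</title></magazine>"
    "</bookstore>"
)
root = etree.fromstring(xml)

# nodename: child elements of the context node (here the root) named "book"
child_books = root.xpath("book")

# /: absolute path that starts at the document root
root_matches = root.xpath("/bookstore")

# //: matches at any depth in the document
all_books = root.xpath("//book")

# @: keep only the book elements that carry a price attribute
priced = root.xpath("//book[@price]")

print(len(child_books), len(root_matches), len(all_books), len(priced))
```

Note that xpath() always returns a list, even when only one node can match.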
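Predicates
/bookstore/book[1]  Selects the first book child element of bookstore (XPath positions start at 1, not 0)
/bookstore/book[last()]  Selects the last book child element of bookstore; use book[last()-1] for the second to last
/bookstore/book[position() < 3]  Selects the first two book child elements of bookstore
//book[@price=10]  Selects every book element whose price attribute equals 10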
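The predicates above can be checked with a short script. This is a sketch under the assumption of a three-book sample document; the XML and the titles() helper are hypothetical and exist only for this example.

```python
from lxml import etree

# Hypothetical sample document with three books.
xml = (
    "<bookstore>"
    '<book price="10"><title>A</title></book>'
    '<book price="20"><title>B</title></book>'
    '<book price="10"><title>C</title></book>'
    "</bookstore>"
)
root = etree.fromstring(xml)


def titles(expr):
    """Return the title text of every book matched by the XPath expression."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")                 # first book
last = titles("/bookstore/book[last()]")             # last book
second_last = titles("/bookstore/book[last()-1]")    # second to last book
first_two = titles("/bookstore/book[position() < 3]")
price_10 = titles("//book[@price=10]")               # attribute compared as a number

print(first, last, second_last, first_two, price_10)
```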
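Parsing HTML strings and files with lxml
Parsing an HTML string
from lxml import etree
# text holds the HTML source to parse
text = """ """
# etree.HTML parses the string with the HTML parser and returns the root element
html = etree.HTML(text)
# tostring returns bytes, so decode them before printing
print(etree.tostring(html, encoding='utf-8').decode("utf-8"))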
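As a small illustration of what etree.HTML does with an incomplete fragment, the sketch below parses a made-up snippet that has no html or body tags and leaves its p tags unclosed. The fragment is an assumption chosen for the example.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body>, and the <p> tags are never closed.
text = "<div><p>hello<p>world</div>"

html = etree.HTML(text)

# The HTML parser wraps the fragment in <html><body> and closes the <p> tags.
root_tag = html.tag
has_body = html.find("body") is not None
paragraphs = html.xpath("//p/text()")

print(etree.tostring(html, encoding="utf-8").decode("utf-8"))
```

Printing the serialized tree is a quick way to see how the parser repaired the input before writing XPath expressions against it.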
Parsing an HTML file
from lxml import etree
# etree.parse uses the XML parser by default, so the file must be well-formed
html = etree.parse(" .html")
print(etree.tostring(html, encoding='utf-8').decode("utf-8"))
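Handling malformed HTML
from lxml import etree
# Pass an HTMLParser so that unclosed or missing tags are tolerated
parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse(" .html", parser=parser)
print(etree.tostring(html, encoding='utf-8').decode("utf-8"))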
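The difference between the default parser and HTMLParser can be seen with a throwaway file. This is a sketch: the broken markup and the temporary file are assumptions made for the example, not part of the notes.

```python
import os
import tempfile

from lxml import etree

# Hypothetical malformed page: the <li> and <br> tags are never closed.
broken = "<html><body><ul><li>one<li>two</ul><br></body></html>"

# Write the markup to a temporary .html file so etree.parse can read it by path.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write(broken)
    path = f.name

try:
    # The default XML parser rejects markup that is not well-formed.
    try:
        etree.parse(path)
        xml_ok = True
    except etree.XMLSyntaxError:
        xml_ok = False

    # The HTML parser recovers and builds a usable tree.
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.parse(path, parser=parser)
    items = tree.xpath("//li/text()")
finally:
    os.remove(path)

print(xml_ok, items)
```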
Using lxml together with XPath
from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse(" .html", parser=parser)
Get all tr tags
# xpath() always returns a list of matching elements
trs = html.xpath("//tr")
for tr in trs:
    print(etree.tostring(tr, encoding='utf-8').decode("utf-8"))
Get the second tr tag
# //tr[2] matches every tr that is the second tr child of its parent; [0] takes the first match
tr = html.xpath("//tr[2]")[0]
print(etree.tostring(tr, encoding='utf-8').decode("utf-8"))
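One detail of the positional predicate is worth checking on a page with more than one table: //tr[2] is evaluated per parent, while (//tr)[2] indexes the whole result set. The sketch below assumes a made-up two-table snippet, and the cells() helper is hypothetical.

```python
from lxml import etree

# Hypothetical page with two tables of two rows each.
text = (
    "<table><tr><td>a1</td></tr><tr><td>a2</td></tr></table>"
    "<table><tr><td>b1</td></tr><tr><td>b2</td></tr></table>"
)
html = etree.HTML(text)


def cells(rows):
    """Return the text of the first td in each row."""
    return [row.findtext("td") for row in rows]


all_rows = html.xpath("//tr")

# Second tr child of each parent: one row per table.
second_in_each = cells(html.xpath("//tr[2]"))

# Third tr in document order: the first row of the second table.
third_overall = cells(html.xpath("(//tr)[3]"))

# No parent has a third tr child, so this list is empty.
third_in_each = html.xpath("//tr[3]")

print(len(all_rows), second_in_each, third_overall, third_in_each)
```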
Get the href attribute of all a tags
hrefs = html.xpath("//a/@href")
print(hrefs)
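Selecting an attribute with @ returns the attribute values as strings rather than elements. The sketch below assumes a made-up snippet with three links, one of which has no href, and also shows reading the attribute from the element with get().

```python
from lxml import etree

# Hypothetical snippet: the second <a> has no href attribute.
text = (
    '<p><a href="/a.html">A</a> <a>no link</a> '
    '<a href="https://example.com/b">B</a></p>'
)
html = etree.HTML(text)

# //a/@href yields one string per href attribute; links without href are skipped.
hrefs = html.xpath("//a/@href")

# Selecting the elements instead keeps every <a>; get() returns None when href is missing.
pairs = [(a.text, a.get("href")) for a in html.xpath("//a")]

print(hrefs)
print(pairs)
```

The first form is convenient when only the URLs are needed; the second keeps the link text and the URL together.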