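1. Introduction to XPath
XPath is a language for locating content in XML documents.
XML documents store and transport data; JSON is another data format of the same kind.
1.1 Differences between JSON and XML
JSON is the more machine-friendly format.
XML is the more human-friendly format.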
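To make the comparison concrete, here is a minimal sketch that stores the same record in both formats and reads it back with Python's standard library. The record and its field names are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record, once as JSON and once as XML.
json_text = '{"name": "bread", "price": 5, "count": 15}'
xml_text = '<goods><name>bread</name><price>5</price><count>15</count></goods>'

# JSON maps directly onto Python dicts, numbers, and strings.
record_from_json = json.loads(json_text)

# XML is parsed into a tree of elements; every value arrives as text
# and has to be converted explicitly.
goods = ET.fromstring(xml_text)
record_from_xml = {
    "name": goods.findtext("name"),
    "price": int(goods.findtext("price")),
    "count": int(goods.findtext("count")),
}

print(record_from_json)
print(record_from_xml)
print(record_from_json == record_from_xml)  # True
```

The JSON text becomes a usable dict in one call, while the XML text keeps named tags around every value, which is what makes it easier to read by eye.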
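1.2 Nodes in an XML document
An XML document contains many kinds of nodes, for example the root node (document node) and element nodes.
An XML document has a tree structure.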
RIGHT Example:
xml_str = """
<supermarket> <!-- root element of the document -->
    <name>永辉超市</name> <!-- element node -->
    <address>中国</address>
    <address name="one">四川成都</address>
    <address name="two">肖家河大厦</address>
    <goodsList>
        <goods name="泡面" price="3.5" count="20"></goods>
        <goods name="矿泉水" price="2" count="50"></goods>
        <goods name="面包" price="5" count="15"></goods>
    </goodsList>
    <worker_list>
        <cashier name="张三" pay="4000"></cashier>
        <shoppingGuide name="李四" pay="3500"></shoppingGuide>
    </worker_list>
    <goods price="50" count="15">
        <name>烟</name>
    </goods>
</supermarket>
"""
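2. XPath syntax
etree.HTML(HTML page source as a string): parses HTML source into an _Element object
etree.XML(XML source as a string): parses XML source into an _Element object
RIGHT Example:
from lxml import etree
# Parse the XML source into an _Element.
# Use etree.XML here: etree.HTML would wrap the content in <html><body> and
# lowercase the tag names, so paths such as /supermarket/goodsList would match nothing.
root = etree.XML(xml_str)
print(root, type(root))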
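The difference between the two parsers is easy to miss, so here is a small sketch that parses the same string both ways. The `source` string is a made-up, shortened document, not the `xml_str` above.

```python
from lxml import etree

# A small hypothetical document with a mixed-case tag name.
source = '<supermarket><goodsList><goods name="tea"/></goodsList></supermarket>'

xml_root = etree.XML(source)    # parsed as XML: structure kept as written
html_root = etree.HTML(source)  # parsed as HTML: wrapped in <html><body>

print(xml_root.tag)   # supermarket
print(html_root.tag)  # html

# The absolute path only matches in the XML tree, where supermarket is the root element.
xml_hits = xml_root.xpath('/supermarket/goodsList/goods/@name')
html_hits = html_root.xpath('/supermarket/goodsList/goods/@name')
print(xml_hits)   # ['tea']
print(html_hits)  # []

# The HTML parser lowercases tag names, so goodsList has to be written as goodslist.
lowercase_hits = html_root.xpath('//goodslist/goods/@name')
print(lowercase_hits)  # ['tea']
```

For XML data, `etree.XML` keeps the paths in the rest of this section valid. `etree.HTML` is the better fit for real web pages, which are often not well-formed XML.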
2.1 XPath path selectors
(1) . selects the current node
RIGHT Example:
goods_list = root.xpath('/supermarket/goodsList/goods')
print(goods_list)
for i in goods_list:
    # ./@name: the name attribute of the current goods node
    print(i.xpath('./@name'))
"""
['泡面']
['矿泉水']
['面包']
"""
(2) .. selects the parent of the current node
RIGHT Example:
# //cashier/..: the parent of every cashier node, wherever it is in the document
print(root.xpath('//cashier/..')) # [<Element worker_list at 0x180e7d12340>]
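(3) / selects from the root of the document
RIGHT Example:
# supermarket: child nodes of the current node named supermarket
# (root is the supermarket node itself, so nothing matches)
print(root.xpath('supermarket')) # []
# /supermarket: the root element supermarket
print(root.xpath('/supermarket')) # [<Element supermarket at 0x1f7892b2d80>]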
(4) // selects matching nodes at any depth, not only direct children
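Item (4) has no example in these notes, so here is a minimal sketch of how `//` compares with a plain `/` step. The document and its attribute values are invented for illustration.

```python
from lxml import etree

doc = etree.XML(
    '<supermarket>'
    '<name>shop</name>'
    '<goodsList><goods name="a"/><goods name="b"/></goodsList>'
    '<goods name="c"><name>item</name></goods>'
    '</supermarket>'
)

# //goods: every goods element in the document, at any depth.
all_goods = doc.xpath('//goods/@name')

# /supermarket/goods: only goods that are direct children of the root element.
top_level_goods = doc.xpath('/supermarket/goods/@name')

# .//goods: goods elements below the current node (here, goodsList).
goods_list = doc.xpath('/supermarket/goodsList')[0]
nested_goods = goods_list.xpath('.//goods/@name')

# A path starting with // always searches the whole document,
# even when it is called on a sub-element.
from_sub_element = goods_list.xpath('//goods/@name')

print(all_goods)         # ['a', 'b', 'c']
print(top_level_goods)   # ['c']
print(nested_goods)      # ['a', 'b']
print(from_sub_element)  # ['a', 'b', 'c']
```

When looping over sub-elements, `.//` is usually the form you want, because a leading `//` ignores the element the query was called on.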
(5) text() returns the text content of a node
Note: the result is a list
RIGHT Example:
print(root.xpath('//name/text()')) # ['永辉超市', '烟']
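Because the result is a list, indexing it with `[0]` fails when nothing matches. The sketch below shows one way to guard against that. `first_text` is a hypothetical helper written for this example, not part of lxml, and the document is invented.

```python
from lxml import etree

doc = etree.XML('<shop><name>corner store</name><address/></shop>')

names = doc.xpath('/shop/name/text()')
addresses = doc.xpath('/shop/address/text()')
print(names)      # ['corner store']
print(addresses)  # [] because the address element has no text


def first_text(node, path, default=None):
    """Return the first string matched by path, or default when nothing matches."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default


print(first_text(doc, '/shop/name/text()'))             # corner store
print(first_text(doc, '/shop/address/text()', 'n/a'))   # n/a
```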
(6) @ returns the value of a node's attribute
RIGHT Example:
print(root.xpath('/supermarket/goodsList/goods/@name'))
# ['泡面', '矿泉水', '面包']
(7) Predicate usage 1: narrow the path to the n-th matching node (positions start at 1)
RIGHT Example:
print(root.xpath('/supermarket/goodsList/goods[1]/@name')) # ['泡面']
(8) Predicate usage 2: count from the end (last() is the last node, last()-1 the second to last)
RIGHT Example:
print(root.xpath('/supermarket/goodsList/goods[last()-1]/@name')) # ['矿泉水']
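The positional predicates from (7) and (8) can be compared side by side on a small list. This is a sketch on invented data; `position()` is a standard XPath 1.0 function that the notes above do not cover.

```python
from lxml import etree

doc = etree.XML(
    '<goodsList>'
    '<goods name="a"/><goods name="b"/><goods name="c"/><goods name="d"/>'
    '</goodsList>'
)

first = doc.xpath('/goodsList/goods[1]/@name')
last = doc.xpath('/goodsList/goods[last()]/@name')
second_to_last = doc.xpath('/goodsList/goods[last()-1]/@name')
first_two = doc.xpath('/goodsList/goods[position()<3]/@name')
# Positions start at 1, so index 0 never matches.
zeroth = doc.xpath('/goodsList/goods[0]/@name')

print(first)           # ['a']
print(last)            # ['d']
print(second_to_last)  # ['c']
print(first_two)       # ['a', 'b']
print(zeroth)          # []
```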
(9) Predicate usage 3: //goods[@name] selects every goods node in the document that has a name attribute
RIGHT Example:
print(root.xpath('//goods[@name]'))
# [<Element goods at 0x222fc00bf00>, <Element goods at 0x222fc00bfc0>, <Element goods at 0x222fc00bf80>]
(10) Predicate usage 4: //goods[@name="矿泉水"] selects every goods node in the document whose name attribute is 矿泉水
RIGHT Example:
print(root.xpath('//goods[@name="矿泉水"]'))
# [<Element goods at 0x222fc00bfc0>]
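Attribute predicates are not limited to existence and equality tests. The sketch below, on invented data, shows a few other standard XPath 1.0 forms: numeric comparison, `and`, `not()`, and `contains()`.

```python
from lxml import etree

doc = etree.XML(
    '<goodsList>'
    '<goods name="noodles" price="3.5" count="20"/>'
    '<goods name="water" price="2" count="50"/>'
    '<goods name="bread" price="5" count="15"/>'
    '<goods price="50" count="15"/>'
    '</goodsList>'
)

# Numeric comparison: the attribute value is converted to a number.
# The last goods node also matches, but it has no name attribute to return.
pricey = doc.xpath('//goods[@price>3]/@name')

# Combine conditions with "and".
named_and_stocked = doc.xpath('//goods[@name and @count>=20]/@name')

# not(@name): nodes that lack the attribute.
unnamed_prices = doc.xpath('//goods[not(@name)]/@price')

# contains(): substring match on the attribute value.
with_ea = doc.xpath('//goods[contains(@name, "ea")]/@name')

print(pricey)             # ['noodles', 'bread']
print(named_and_stocked)  # ['noodles', 'water']
print(unnamed_prices)     # ['50']
print(with_ea)            # ['bread']
```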
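APPLICATION Get the goods information inside goodsList:
# Get the three goods elements inside goodsList:
# the goods children of the goodsList child of the root element supermarket
print(root.xpath('/supermarket/goodsList/goods'))
# Get the name attribute of each goods node
print(root.xpath('/supermarket/goodsList/goods/@name'))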
2.2 Case study
APPLICATION Extract the information stored in the li tags:
import requests
from lxml import etree


def requests_get(href, proxy=None):
    """Return the response for href, or None when the status code is not 200."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36'
    }
    # timeout: stop the request if no response arrives within the given number of seconds
    resp = requests.get(url=href, headers=headers, proxies=proxy, timeout=3)
    if resp.status_code == 200:
        return resp
    print(resp.status_code)
    return None


if __name__ == '__main__':
    URL = 'https://movie.douban.com/top250?start=100&filter='
    response = requests_get(URL)
    # requests_get returns None on a non-200 status, so check before parsing
    if response is not None:
        root = etree.HTML(response.text)
        # Each movie's li tag contains a div with class "hd"; select those divs
        hd_list = root.xpath('/html/body/div[@id="wrapper"]/div[@id="content"]//div[@class="hd"]')
        for i in hd_list:
            # Movie link
            href = i.xpath('a/@href')[0]
            # Movie title: join the text of every span inside the link
            name = i.xpath('a/span/text()')
            name = "".join(name)
            print(name, href)
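The extraction step of the case study can be tried without a network connection. The sketch below runs the same kind of relative XPath queries against a hand-written HTML fragment. The fragment, its URLs, and its titles are hypothetical and only loosely follow the structure assumed above; the real page layout may differ.

```python
from lxml import etree

# Hypothetical HTML fragment; the real page layout may differ.
html_source = """
<html><body>
<div id="content">
  <ol>
    <li><div class="hd"><a href="https://example.com/movie/1"><span>Movie A</span><span> / Alias A</span></a></div></li>
    <li><div class="hd"><a href="https://example.com/movie/2"><span>Movie B</span></a></div></li>
  </ol>
</div>
</body></html>
"""

root = etree.HTML(html_source)

movies = []
for block in root.xpath('//div[@id="content"]//div[@class="hd"]'):
    # Relative paths (starting with ./) are evaluated against the current div.
    links = block.xpath('./a/@href')
    title = "".join(block.xpath('./a/span/text()')).strip()
    # Guard against entries without a link instead of indexing [0] blindly.
    movies.append({"title": title, "href": links[0] if links else None})

for movie in movies:
    print(movie["title"], movie["href"])
```

Collecting the results in a list of dicts keeps parsing separate from printing, which makes it easier to save the data later.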