This walkthrough uses Python's lxml library to demonstrate XPath.
Install lxml
pip install lxml
Common XPath usage
Sample HTML
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
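Finding all xx elements under xxx
from lxml import etree  # If PyCharm underlines etree in red and reports that it cannot be found, the code still runs; reloading the interpreter under File → Settings → Project → Project Interpreter clears the warning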
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
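selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//div/ul/li')  # // matches nodes at any depth in the document; this selects every li that is a child of ul
for i in all_li:
    print(i)
Output:
<Element li at 0x1a7955a2808>  # 0x1a7955a2808 is a memory address. The result is a list of element objects; to see their content, extend the path, e.g. /a/text() reads the text of the a tag (demonstrated further down)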
<Element li at 0x1a7955a27c8>
<Element li at 0x1a7955a28c8>
<Element li at 0x1a7955a2908>
<Element li at 0x1a7955a2948>
<Element li at 0x1a7955a29c8>
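Finding the first xx element under xxx
from lxml import etree  # If PyCharm underlines etree in red and reports that it cannot be found, the code still runs; reloading the interpreter under File → Settings → Project → Project Interpreter clears the warning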
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//div/ul/li[1]')  # select the first li; note that XPath indices start at 1, not 0
print(all_li)
Output:
[<Element li at 0x1d0e2612608>]
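Note:
In lxml, xpath() always returns a list. Without an index in the path, the list contains every matching element, not only the first. When nothing matches, the result is an empty list rather than an exception; only a later [0] on that empty list raises IndexError.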
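The behaviour of xpath() with and without an index can be checked with a short sketch. It assumes a tiny two-item list instead of the full sample HTML, and first_or_none is a hypothetical helper introduced only for illustration, not part of lxml.

```python
from lxml import etree

# Tiny stand-in document (assumption: not the article's full sample)
selector = etree.HTML('<ul><li>a</li><li>b</li></ul>')

every_li = selector.xpath('//ul/li')     # no index: every matching li
first_li = selector.xpath('//ul/li[1]')  # with an index: still a list, one element
missing = selector.xpath('//ul/li[5]')   # no match: an empty list, no exception


def first_or_none(nodes):
    """Hypothetical helper: return the first result, or None for an empty list."""
    return nodes[0] if nodes else None


print(len(every_li), len(first_li), missing)
print(first_or_none(missing))
```

Guarding the [0] access this way avoids an IndexError when a page does not contain the expected element.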
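Getting the text of an xx element
from lxml import etree  # If PyCharm underlines etree in red and reports that it cannot be found, the code still runs; reloading the interpreter under File → Settings → Project → Project Interpreter clears the warning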
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
# all_li = selector.xpath('//div/ul/li[1]/a/text()')[0]  # written this way, the first text result is returned directly
all_li = selector.xpath('//div/ul/li[1]/a/text()')  # text() extracts the text inside the a tag
print(all_li)  # an index also works: all_li[0] gives first item
Output:
['first item']
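Tip
When an element can be identified unambiguously in the page, the path does not have to spell out every ancestor from the root. A few simple examples:
all_li = selector.xpath('//ul/li[1]/a/text()')[0]  # omitting div works just as well
all_li = selector.xpath('//*[@class="item-inactive"]/a/text()')[0]  # find the text of the third li directly by class
all_li = selector.xpath('//a[@href="link2.html"]/text()')[0]  # find the text of the second li directly by href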
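To try the three shortcuts side by side, here is a small runnable sketch. It assumes a three-item version of the sample HTML, and the variable names by_path, by_class and by_href are only illustrative.

```python
from lxml import etree

# Three-item version of the sample HTML (assumption: enough to show each shortcut)
htm = """
<html><div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul></div></html>
"""

selector = etree.HTML(htm)

by_path = selector.xpath('//ul/li[1]/a/text()')[0]                    # shorter path, div omitted
by_class = selector.xpath('//*[@class="item-inactive"]/a/text()')[0]  # matched by class value
by_href = selector.xpath('//a[@href="link2.html"]/text()')[0]         # matched by href value

print(by_path)
print(by_class)
print(by_href)
```

The attribute-based forms tend to survive layout changes better than long positional paths, as long as the attribute value stays unique.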
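Getting attributes of elements under xxx
Getting a single attribute
from lxml import etree  # If PyCharm underlines etree in red and reports that it cannot be found, the code still runs; reloading the interpreter under File → Settings → Project → Project Interpreter clears the warning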
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//li[3]/a/@href')[0]  # get the value of the href attribute
print(all_li)
Output:
link3.html
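Getting all values of an attribute under xxx
from lxml import etree  # If PyCharm underlines etree in red and reports that it cannot be found, the code still runs; reloading the interpreter under File → Settings → Project → Project Interpreter clears the warning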
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//li/@class')  # get the class attribute value of every li
print(all_li)
Output:
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0', 'else-1']
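The example above collects one attribute across many elements. The reverse case, all attributes of a single element, can be sketched as follows. The id and title attributes are hypothetical additions to the sample markup, used only so that the a tag has more than one attribute.

```python
from lxml import etree

# Hypothetical markup: id and title are not in the article's sample HTML
selector = etree.HTML(
    '<ul><li class="item-0" id="first">'
    '<a href="link1.html" title="one">first item</a></li></ul>'
)

# @* selects every attribute of the matched element; the result is a list of values
values = selector.xpath('//li[1]/a/@*')

# .attrib exposes the same attributes as a dict-like mapping of name to value
a = selector.xpath('//li[1]/a')[0]
attrs = dict(a.attrib)

print(values)
print(attrs)
```

The @* form is convenient inside a longer path, while .attrib is handier once the element object is already in hand because it keeps the attribute names.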
Advanced XPath usage
Finding xxx elements whose attribute starts with xx
The same HTML is used again for the demonstration:
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
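Using starts-with()
Sample code:
from lxml import etree  # If PyCharm underlines etree in red and reports that it cannot be found, the code still runs; reloading the interpreter under File → Settings → Project → Project Interpreter clears the warning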
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
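selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath("//li[starts-with(@class, 'item-')]")  # select every li whose class value starts with 'item-'
all_a = []
for i in all_li:
    all_a.append(i.xpath('a/text()')[0])  # run a relative XPath on each matched li to read the text of its a child
print(all_a)
Output: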
['first item', 'second item', 'third item', 'fourth item', 'fifth item']
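It can also be written like this:
from lxml import etree  # If PyCharm underlines etree in red and reports that it cannot be found, the code still runs; reloading the interpreter under File → Settings → Project → Project Interpreter clears the warning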
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath("//li[starts-with(@class, 'item-')]/a/text()")  # text of the a tag inside every li whose class starts with 'item-'
print(all_li)
Output:
['first item', 'second item', 'third item', 'fourth item', 'fifth item']
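starts-with() is one of several XPath 1.0 string tests that lxml supports. As a rough comparison, the sketch below places it next to contains() and an exact match, assuming a four-item version of the sample HTML; the variable names are only illustrative.

```python
from lxml import etree

# Four-item version of the sample HTML (assumption: one li per distinct class pattern)
htm = """
<html><div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="else-1">something else</li>
</ul></div></html>
"""

selector = etree.HTML(htm)

# Prefix test: the class value begins with 'item-'
starts = selector.xpath("//li[starts-with(@class, 'item-')]/@class")
# Substring test: the class value contains '-1' anywhere
contains = selector.xpath("//li[contains(@class, '-1')]/@class")
# Exact test: the class value is exactly 'item-1'
exact = selector.xpath("//li[@class='item-1']/@class")

print(starts)
print(contains)
print(exact)
```

Note that these functions compare the whole attribute string, so an element with several space-separated classes is matched on the full string, not on individual class names.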
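Getting all text
Using string()
Sample code:
from lxml import etree  # If PyCharm underlines etree in red and reports that it cannot be found, the code still runs; reloading the interpreter under File → Settings → Project → Project Interpreter clears the warning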
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath("string(//ul)")  # get all text under ul as a single string
print(all_li)
Output:
first item
second item
third item
fourth item
fifth item
something else
this is ul item
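string() returns one string with the original line breaks left in. When separate pieces or tidier whitespace are wanted, //text() and normalize-space() are two alternatives. The sketch below assumes a shortened version of the sample HTML, and the names whole, pieces, clean and squeezed are only illustrative.

```python
from lxml import etree

# Shortened version of the sample HTML (assumption: two li plus the loose text)
htm = """<html><div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul></div></html>"""

selector = etree.HTML(htm)

# string(): all descendant text of the first ul, joined into one string
whole = selector.xpath('string(//ul)')

# //text(): every text node under ul as a separate list entry, whitespace included
pieces = selector.xpath('//ul//text()')
clean = [t.strip() for t in pieces if t.strip()]  # drop whitespace-only nodes

# normalize-space(): like string(), but trims the ends and collapses whitespace runs
squeezed = selector.xpath('normalize-space(//ul)')

print(repr(whole))
print(clean)
print(squeezed)
```

The list form is usually easier to post-process, while normalize-space() is handy when a single clean line is enough.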
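A small example
Fetch the text and link of the 豆瓣读书 (Douban Books) entry on the Douban home page, then take one image from the home page and save it to local disk.
import requests
from lxml import etree  # If PyCharm underlines etree in red and reports that it cannot be found, the code still runs; reloading the interpreter under File → Settings → Project → Project Interpreter clears the warning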
r = requests.get('https://www.douban.com/')
r.encoding = 'utf-8'
html = etree.HTML(r.text)
text = html.xpath('//*[@id="anony-nav"]/div[1]/ul/li[1]/a/@href')[0]
h1 = html.xpath('//*[@id="anony-nav"]/div[1]/ul/li[1]/a/text()')[0]
logs = html.xpath('//*[@id="anony-sns"]/div/div[3]/div/div[1]/ul/li[3]/div/a/img/@src')[0]
print(text)
print(h1)
print(logs)
log = requests.get(logs)
with open('d:/a.gif', 'wb') as file:  # 'wb' opens the file for writing in binary mode
    file.write(log.content)  # save the image
Output:
https://book.douban.com
豆瓣读书
https://img3.doubanio.com/f/shire/a1fdee122b95748d81cee426d717c05b5174fe96/pics/blank.gif