Chapter 3.3: XPath Notes

lee's work

Category: Learning Scrapy. Tags: xml, python

Article link: https://blog.csdn.net/qq_27608761/article/details/121010712

Contents

3.3 XPath
- 3.3.1　Basic syntax
- 3.3.2　Common functions

3.3 XPath

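XPath (XML Path Language) is a language for addressing parts of an XML document.
An XML document is a tree of nodes. HTML is not strictly XML, but parsers turn it into the same kind of tree, so XPath applies to it as well. For example: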

<html>
	<body>
		<div>
			<p>Hello world</p>
			<a href="/home">Click here</a>
		</div>
	</body>
</html>

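An XML document has several types of nodes. The most commonly used are:

●　Root node　The root of the whole document tree.
●　Element nodes　html, body, div, p, a.
●　Attribute node　href.
●　Text nodes　Hello world, Click here.

Nodes are related to each other in the following ways:
●　Parent/child　body is a child of html; p and a are children of div. Conversely, div is the parent of p and a.
●　Siblings　p and a are sibling nodes.
●　Ancestor/descendant　body, div, p, and a are all descendants of html; conversely, html is an ancestor of body, div, p, and a.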
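The relationships listed above can be checked in code. The following is a minimal sketch that uses only the standard library's xml.etree.ElementTree as a stand-in for Scrapy; the variable names and the parent_of helper map are my own illustration, not part of the book's code.

```python
import xml.etree.ElementTree as ET

# The sample document from above, written as well-formed XML.
DOC = """<html>
  <body>
    <div>
      <p>Hello world</p>
      <a href="/home">Click here</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(DOC)  # the <html> element

# ElementTree elements keep no pointer to their parent,
# so build a child -> parent map by walking the tree once.
parent_of = {child: parent for parent in root.iter() for child in parent}

div = root.find("body/div")
p = div.find("p")

children = [child.tag for child in div]                      # parent/child
siblings = [e.tag for e in parent_of[p] if e is not p]       # siblings of p
descendants = [e.tag for e in root.iter() if e is not root]  # descendants of html

print(children, siblings, descendants)
print(p.text, div.find("a").get("href"))  # a text node and an attribute node
```

The element, text, and attribute node types show up here as tags, the .text value, and the .get() lookup respectively.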

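3.3.1　Basic syntax

Table 3-1 lists the basic XPath syntax in common use.

Table 3-1　Common basic XPath syntax

Expression	Description
/	Selects the document root
.	Selects the current node
..	Selects the parent of the current node
ELEMENT	Selects all ELEMENT element nodes among the children
//ELEMENT	Selects all ELEMENT element nodes among the descendants
*	Selects all element children
text()	Selects all text children
@ATTR	Selects the attribute node named ATTR
@*	Selects all attribute nodes
[predicate]	Finds a specific node, or nodes that contain a specific value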
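Several of the expressions in Table 3-1 can be tried without Scrapy. The sketch below assumes a small well-formed sample document of my own and uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text(), no attribute selection, and paths must be relative to an element), so treat it as an approximation of what Scrapy's selectors do.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (well-formed XML, written for this sketch).
DOC = """<html>
  <head><title>Example website</title></head>
  <body>
    <a href="image0.html"><img src="image0.jpg"/></a>
    <div id="images">
      <a href="image1.html"><img src="image1.jpg"/></a>
      <a href="image2.html"><img src="image2.jpg"/></a>
    </div>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# ELEMENT/ELEMENT: a chain of child steps, relative to <html>
div_links = [a.get("href") for a in root.findall("body/div/a")]

# //ELEMENT: all descendants (ElementTree needs the leading ".")
all_links = [a.get("href") for a in root.findall(".//a")]

# *: all element children of the current node
top_level = [e.tag for e in root.findall("*")]

# ..: the parent of each matched node
img_parents = [e.tag for e in root.findall(".//img/..")]

# [predicate]: keep only nodes that satisfy a condition
divs_with_id = [e.get("id") for e in root.findall(".//div[@id]")]

print(div_links, all_links, top_level, img_parents, divs_with_id)
```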

Next, a few examples show how XPath is used.
First create an HTML document for the demonstration and build an HtmlResponse object from it:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> body = '''
    <html>
     <head>
         <base href='http://example.com/' />
         <title>Example website</title>
     </head>
     <body>
         <a href='image0.html'>Name: Image 0 <br/><img src='image0.jpg'></a>
         <div id='images'>
             <a href='image1.html'>Name: Image 1 <br/><img src='image1.jpg'></a>
             <a href='image2.html'>Name: Image 2 <br/><img src='image2.jpg'></a>
             <a href='image3.html'>Name: Image 3 <br/><img src='image3.jpg'></a>
             <a href='image4.html'>Name: Image 4 <br/><img src='image4.jpg'></a>
             <a href='image5.html'>Name: Image 5 <br/><img src='image5.jpg'></a>
         </div>
     </body>
     </html>
'''
>>> response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf-8')
>>> response
<200 http://www.example.com>

●　/: describes an absolute path starting from the root.

>>> response.xpath('/html')
[<Selector xpath='/html' data='<html>\n     <head>\n         <base hre...'>]
>>> response.xpath('/html/head')
[<Selector xpath='/html/head' data='<head>\n         <base href="http://ex...'>]

●　E1/E2: selects all E2 among the children of E1.

# Select all a elements among the children of div
>>> response.xpath('/html/body/div/a')
[<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name: Image 1 <...'>,
 <Selector xpath='/html/body/div/a' data='<a href="image2.html">Name: Image 2 <...'>,
 <Selector xpath='/html/body/div/a' data='<a href="image3.html">Name: Image 3 <...'>,
 <Selector xpath='/html/body/div/a' data='<a href="image4.html">Name: Image 4 <...'>,
 <Selector xpath='/html/body/div/a' data='<a href="image5.html">Name: Image 5 <...'>]

●　//E: selects all E in the document, wherever they are.

# Select all a elements in the document
>>> response.xpath('//a')
[<Selector xpath='//a' data='<a href="image0.html">Name: Image 0 <...'>,  # the a tag outside the div
 <Selector xpath='//a' data='<a href="image1.html">Name: Image 1 <...'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: Image 2 <...'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: Image 3 <...'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: Image 4 <...'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: Image 5 <...'>]

●　E1//E2: selects all E2 among the descendants of E1, at any depth.

# Select all img elements among the descendants of body
>>> response.xpath('/html/body//img')
[<Selector xpath='/html/body//img' data='<img src="image0.jpg">'>,
 <Selector xpath='/html/body//img' data='<img src="image1.jpg">'>,
 <Selector xpath='/html/body//img' data='<img src="image2.jpg">'>,
 <Selector xpath='/html/body//img' data='<img src="image3.jpg">'>,
 <Selector xpath='/html/body//img' data='<img src="image4.jpg">'>,
 <Selector xpath='/html/body//img' data='<img src="image5.jpg">'>]

●　E/text(): selects the text children of E.

# Select the text of all a elements
>>> sel = response.xpath('//a/text()')
>>> sel
[<Selector xpath='//a/text()' data='Name: Image 0 '>,
 <Selector xpath='//a/text()' data='Name: Image 1 '>,
 <Selector xpath='//a/text()' data='Name: Image 2 '>,
 <Selector xpath='//a/text()' data='Name: Image 3 '>,
 <Selector xpath='//a/text()' data='Name: Image 4 '>,
 <Selector xpath='//a/text()' data='Name: Image 5 '>]
 
>>> sel.extract()
['Name: Image 0 ',
 'Name: Image 1 ',
 'Name: Image 2 ',
 'Name: Image 3 ',
 'Name: Image 4 ',
 'Name: Image 5 ']

●　E/*: selects all element children of E.

# Select all element children of html
>>> response.xpath('/html/*')
[<Selector xpath='/html/*' data='<head>\n         <base href="http://ex...'>,
 <Selector xpath='/html/*' data='<body>\n         <a href="image0.html"...'>]

# Select all descendant element nodes of div
>>> response.xpath('/html/body/div//*')
[<Selector xpath='/html/body/div//*' data='<a href="image1.html">Name: Image 1 <...'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image1.jpg">'>,
 <Selector xpath='/html/body/div//*' data='<a href="image2.html">Name: Image 2 <...'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image2.jpg">'>,
 <Selector xpath='/html/body/div//*' data='<a href="image3.html">Name: Image 3 <...'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image3.jpg">'>,
 <Selector xpath='/html/body/div//*' data='<a href="image4.html">Name: Image 4 <...'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image4.jpg">'>,
 <Selector xpath='/html/body/div//*' data='<a href="image5.html">Name: Image 5 <...'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image5.jpg">'>]

●　*/E: selects all E among the grandchildren.

# Select all img elements among the grandchildren of div
>>> response.xpath('//div/*/img')
[<Selector xpath='//div/*/img' data='<img src="image1.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image2.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image3.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image4.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image5.jpg">'>]
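The difference between children (E/*), descendants (E//*), and grandchildren (E/*/X) is easy to mix up. Here is a minimal sketch of the three on a shortened, well-formed copy of the div, again using the standard library's ElementTree rather than Scrapy; the sample markup is my own reduction of the demo page.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the <div id="images"> block (two links only).
DOC = """<div id="images">
  <a href="image1.html">Name: Image 1 <br/><img src="image1.jpg"/></a>
  <a href="image2.html">Name: Image 2 <br/><img src="image2.jpg"/></a>
</div>"""

div = ET.fromstring(DOC)

# E/*: element children only
children = [e.tag for e in div.findall("*")]

# E//*: every descendant element, in document order
descendants = [e.tag for e in div.findall(".//*")]

# E/*/img: img elements exactly two levels below E
grandchild_imgs = [e.get("src") for e in div.findall("*/img")]

print(children)
print(descendants)
print(grandchild_imgs)
```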

●　E/@ATTR: selects the ATTR attribute of E.

# Select the src attribute of all img elements
>>> response.xpath('//img/@src')
[<Selector xpath='//img/@src' data='image0.jpg'>,
 <Selector xpath='//img/@src' data='image1.jpg'>,
 <Selector xpath='//img/@src' data='image2.jpg'>,
 <Selector xpath='//img/@src' data='image3.jpg'>,
 <Selector xpath='//img/@src' data='image4.jpg'>,
 <Selector xpath='//img/@src' data='image5.jpg'>]

●　//@ATTR: selects all ATTR attributes in the document.

# Select all href attributes
>>> response.xpath('//@href')
[<Selector xpath='//@href' data='http://example.com/'>,
 <Selector xpath='//@href' data='image0.html'>,
 <Selector xpath='//@href' data='image1.html'>,
 <Selector xpath='//@href' data='image2.html'>,
 <Selector xpath='//@href' data='image3.html'>,
 <Selector xpath='//@href' data='image4.html'>,
 <Selector xpath='//@href' data='image5.html'>]

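●　E/@*: selects all attributes of E.

# Get all attributes of the img under the first a (src is the only attribute here).
# //a[1] matches every a that is the first a child of its parent, so both
# the a outside the div and the first a inside it are selected.
>>> response.xpath('//a[1]/img/@*')
[<Selector xpath='//a[1]/img/@*' data='image0.jpg'>,
 <Selector xpath='//a[1]/img/@*' data='image1.jpg'>]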
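For comparison, here is roughly what the three attribute expressions compute, sketched with the standard library's ElementTree. ElementTree cannot select attribute nodes through a path, so the attributes are read from the matched elements instead. The sample markup and the alt attribute are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample; the alt attribute is added so that
# the "all attributes" case has more than one entry.
DOC = """<body>
  <a href="image0.html"><img src="image0.jpg" alt="zero"/></a>
  <div id="images">
    <a href="image1.html"><img src="image1.jpg"/></a>
  </div>
</body>"""

body = ET.fromstring(DOC)

# Counterpart of //img/@src: one attribute from every img
srcs = [img.get("src") for img in body.iter("img")]

# Counterpart of //@href: the href of every element that has one
hrefs = [e.get("href") for e in body.iter() if "href" in e.attrib]

# Counterpart of (//img)[1]/@*: all attributes of the first img in the document
first_img_attrs = dict(body.find(".//img").attrib)

print(srcs, hrefs, first_img_attrs)
```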

●　.: selects the current node; used in relative paths.

# Get the selector object for the first a
>>> sel = response.xpath('//a')[0]
>>> sel
<Selector xpath='//a' data='<a href="image0.html">Name: Image 0 <...'>

# Suppose we want all img elements among the descendants of this a.
# The following is wrong: it finds every img in the document,
# because //img is an absolute path that searches from the document root, not from the current a
>>> sel.xpath('//img')
[<Selector xpath='//img' data='<img src="image0.jpg">'>,
 <Selector xpath='//img' data='<img src="image1.jpg">'>,
 <Selector xpath='//img' data='<img src="image2.jpg">'>,
 <Selector xpath='//img' data='<img src="image3.jpg">'>,
 <Selector xpath='//img' data='<img src="image4.jpg">'>,
 <Selector xpath='//img' data='<img src="image5.jpg">'>]
# Use .//img to describe all img elements among the descendants of the current node
>>> sel.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image0.jpg">'>]

●　..: selects the parent of the current node; used in relative paths.

# Select the parent of every img
>>> response.xpath('//img/..')
[<Selector xpath='//img/..' data='<a href="image0.html">Name: Image 0 <...'>,
 <Selector xpath='//img/..' data='<a href="image1.html">Name: Image 1 <...'>,
 <Selector xpath='//img/..' data='<a href="image2.html">Name: Image 2 <...'>,
 <Selector xpath='//img/..' data='<a href="image3.html">Name: Image 3 <...'>,
 <Selector xpath='//img/..' data='<a href="image4.html">Name: Image 4 <...'>,
 <Selector xpath='//img/..' data='<a href="image5.html">Name: Image 5 <...'>]

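●　node[predicate]: a predicate finds a specific node, or nodes that contain a specific value.

# Select every a that is the 3rd a child of its parent (here only the one inside the div)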
>>> response.xpath('//a[3]')
[<Selector xpath='//a[3]' data='<a href="image3.html">Name: Image 3 <...'>]

# Use the last() function to select the last one
>>> response.xpath('//a[last()]')  # the a outside the div (Image 0) is both first and last under its parent
[<Selector xpath='//a[last()]' data='<a href="image0.html">Name: Image 0 <...'>,
 <Selector xpath='//a[last()]' data='<a href="image5.html">Name: Image 5 <...'>]

# Use the position() function to select the first 3 a elements under each parent
>>> response.xpath('//a[position()<=3]')
[<Selector xpath='//a[position()<=3]' data='<a href="image0.html">Name: Image 0 <...'>,
 <Selector xpath='//a[position()<=3]' data='<a href="image1.html">Name: Image 1 <...'>,
 <Selector xpath='//a[position()<=3]' data='<a href="image2.html">Name: Image 2 <...'>,
 <Selector xpath='//a[position()<=3]' data='<a href="image3.html">Name: Image 3 <...'>]

# Select all div elements that have an id attribute
>>> response.xpath('//div[@id]')
[<Selector xpath='//div[@id]' data='<div id="images">\n             <a hre...'>]

# Select all div elements whose id attribute equals "images"
>>> response.xpath('//div[@id="images"]')
[<Selector xpath='//div[@id="images"]' data='<div id="images">\n             <a hre...'>]
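The outputs above show that a positional predicate counts among siblings under the same parent, not across the whole document. The sketch below reproduces that with the standard library's ElementTree on a well-formed copy of the page; ElementTree handles [n] and [last()] but not position() comparisons. In full XPath, (//a)[3] with parentheses selects the third a in document order; here a plain Python index stands in for it.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the demo page structure: one a in body, five in the div.
links = "".join(
    '<a href="image%d.html"><img src="image%d.jpg"/></a>' % (i, i)
    for i in range(1, 6)
)
DOC = (
    '<html><body><a href="image0.html"><img src="image0.jpg"/></a>'
    '<div id="images">%s</div></body></html>' % links
)
root = ET.fromstring(DOC)

def hrefs(path):
    """Return the href of every a matched by an ElementTree path."""
    return [a.get("href") for a in root.findall(path)]

first_per_parent = hrefs(".//a[1]")      # first a under each parent
third_per_parent = hrefs(".//a[3]")      # only the div has a third a
last_per_parent = hrefs(".//a[last()]")  # last a under each parent

# Third a in document order, the counterpart of (//a)[3] in full XPath
third_in_document = root.findall(".//a")[2].get("href")

# Attribute predicates: presence and exact value
has_id = [d.get("id") for d in root.findall(".//div[@id]")]
id_is_images = len(root.findall(".//div[@id='images']"))

print(first_per_parent, third_per_parent, last_per_parent, third_in_document)
```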

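3.3.2　Common functions

XPath also provides many functions, covering numbers, strings, dates and times, aggregation, and more. Note that Scrapy's selectors are built on lxml, which implements XPath 1.0, so the date and time functions of later XPath versions are not available there.
The examples above already used position() and last(). For brevity, only two very commonly used string functions are covered below.
●　string(arg): returns the string value of the argument.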

>>> from scrapy.selector import Selector
>>> text = '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
>>> sel = Selector(text=text)
>>> sel
<Selector xpath=None data='<html><body><a href="#">Click here to...'>

# The following gives the same result as sel.xpath('/html/body/a/strong/text()')
>>> sel.xpath('string(/html/body/a/strong)').extract()
['Next Page']

# To get the whole string 'Click here to go to the Next Page' inside a,
# text() is not enough, because 'Click here to go to the ' and 'Next Page' sit under different elements.
# The following returns two substrings
>>> sel.xpath('/html/body/a//text()').extract()
['Click here to go to the ', 'Next Page']

# In this case, use the string() function
>>> sel.xpath('string(/html/body/a)').extract()
['Click here to go to the Next Page']
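The string value of an element is the concatenation of all text nodes beneath it, in document order. A minimal sketch of that definition with the standard library's ElementTree, where itertext() plays the role of //text() and joining its pieces plays the role of string():

```python
import xml.etree.ElementTree as ET

a = ET.fromstring(
    '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
)

# Text directly inside <a>, before the first child element
own_text = a.text

# Every text node under <a>, in document order (like /html/body/a//text())
pieces = list(a.itertext())

# Concatenation of those pieces (like string(/html/body/a))
string_value = "".join(pieces)

print(repr(own_text))
print(pieces)
print(string_value)
```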

●　contains(str1, str2): tests whether str1 contains str2 and returns a boolean.

>>> text = '''
... <div>
... <p class="small info">hello world</p>
... <p class="normal info">hello scrapy</p>
... </div>
... '''
>>> sel = Selector(text=text)
>>> sel.xpath('//p[contains(@class, "small")]') # select p whose class attribute contains "small"
[<Selector xpath='//p[contains(@class, "small")]' data='<p class="small info">hello world</p>'>]
>>> sel.xpath('//p[contains(@class,"info")]') # select p whose class attribute contains "info"
[<Selector xpath='//p[contains(@class,"info")]' data='<p class="small info">hello world</p>'>,
 <Selector xpath='//p[contains(@class,"info")]' data='<p class="normal info">hello scrapy</p>'>]
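One caveat: contains() is a plain substring test, so contains(@class, "small") would also match a class such as "smaller". The sketch below illustrates the difference in plain Python on a hypothetical document (the "smaller" class is my addition); in XPath itself, a common idiom for matching a whole class token is contains(concat(' ', normalize-space(@class), ' '), ' small ').

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second paragraph's class merely starts with "small".
DOC = """<div>
  <p class="small info">hello world</p>
  <p class="smaller info">hello scrapy</p>
</div>"""

div = ET.fromstring(DOC)

# Substring test, the behaviour of contains(@class, "small")
substring_matches = [
    p.text for p in div.iter("p") if "small" in p.get("class", "")
]

# Whole-token test, the behaviour of the concat/normalize-space idiom
token_matches = [
    p.text for p in div.iter("p") if "small" in p.get("class", "").split()
]

print(substring_matches)
print(token_matches)
```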

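That concludes this introduction to XPath. For more detail, see the XPath specification: https://www.w3.org/TR/xpath/.

These notes follow the PDF of 《精通Scrapy网络爬虫》 (by 刘硕). I ran the code myself, with minor modifications, and wrote up XPath usage with my own notes and explanations. They are intended only for reference and review.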
