Advanced XPath usage

FOAF-lambda

Last modified 2024-05-13 09:58:58

Tags: html python

First published 2022-09-17 11:37:58

Article link: https://blog.csdn.net/lwdfzr/article/details/126903260

Scrapy reports that a Selector has no attribute xx
# Only call extract() when the object actually provides it
if hasattr(contact_urls, 'extract'):
    contact_urls = contact_urls.extract()

https://blog.csdn.net/weixin_43639743/article/details/106585710
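Case-insensitive contains, combined with translate
//*/text()[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'test')]
Note:
translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')
converts every uppercase letter in the string being matched to lowercase, so the search term passed to contains must itself be lowercase.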
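The case-insensitive match above can be tried with lxml. This is a minimal sketch: the sample markup is invented for illustration, and the two alphabets are passed in as XPath variables ($upper, $lower) only to keep the expression readable.

```python
from lxml import etree

# Hypothetical sample markup, invented for this illustration.
doc = etree.HTML(
    "<div><p>Unit TEST passed</p><p>Nothing here</p><p>Testing again</p></div>"
)

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# translate() lowercases each text node before contains() compares it,
# so the lowercase search term 'test' matches TEST and Testing as well.
query = "//*/text()[contains(translate(., $upper, $lower), 'test')]"
matches = doc.xpath(query, upper=UPPER, lower=LOWER)
print(matches)
```

The matched text nodes keep their original casing; translate only affects the comparison.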

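starts-with, as the name suggests, matches a keyword at the start of an attribute value
contains matches a substring anywhere in an attribute value
text() matches the displayed text of a node
//input[starts-with(@name,'name1')] finds elements whose name attribute starts with 'name1'
//input[contains(@name,'na')] finds elements whose name attribute contains 'na'
For <a href="http://www.baidu.com">Baidu Search</a> the XPath is //a[text()='Baidu Search'] or //a[contains(text(),"Baidu Search")]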
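A short lxml sketch of the three matchers just described. The form markup and the attribute values are made up for the example.

```python
from lxml import etree

# Hypothetical form markup, invented for this illustration.
doc = etree.HTML("""
<form>
  <input name="name1_first">
  <input name="other_name1">
  <a href="http://www.baidu.com">Baidu Search</a>
</form>
""")

# starts-with: the attribute value must begin with the keyword.
starts = doc.xpath("//input[starts-with(@name, 'name1')]/@name")
# contains: the keyword may appear anywhere in the attribute value.
contains = doc.xpath("//input[contains(@name, 'na')]/@name")
# text(): compare against the displayed text of the element.
link = doc.xpath("//a[text()='Baidu Search']/@href")

print(starts, contains, link)
```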

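ancestor selects all ancestors of the current node (parent, grandparent, and so on).
ancestor-or-self selects all ancestors of the current node plus the current node itself.
attribute selects all attributes of the current node.
child selects all children of the current node.
descendant selects all descendants of the current node (children, grandchildren, and so on).
descendant-or-self selects all descendants of the current node plus the current node itself.
following selects everything in the document after the closing tag of the current node.
following-sibling selects all siblings after the current node.
namespace selects all namespace nodes of the current node.
parent selects the parent of the current node.
preceding selects all nodes that end before the opening tag of the current node (ancestors are not included).
preceding-sibling selects all siblings before the current node.
self selects the current node.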

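# Exclude script tags
response.xpath('normalize-space(string(/html/body//*[not(self::script)]))')[0]
To exclude other tags as well:
response.xpath('normalize-space(string(/html/body//*[not(self::script or self::img or self::div)]))')[0]
# Note: string() applied to a node-set converts only its first node in document order,
# so these expressions return the normalized text of the first matching element.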

//div[@itemprop="content"]/descendant-or-self::*[not(self::script)]/text()[normalize-space()]

Exclude tr tags whose class contains "expand-child"
//table[@id="sites_tbl"]/tbody/tr[not(contains(@class,"expand-child"))]

Get the text while excluding the strong tag

<li>
<strong>Overall Dimensions (LxWxH):</strong>
4-5/16"x3-1/8"x2-3/8" (11x8x6cm)
</li>

li.xpath('./descendant-or-self::*[not(self::strong)]/text()')
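The li example above can be run with lxml roughly as follows. This is a sketch that assumes the li sits inside a ul (added here only to make the fragment complete) and that the final value should be whitespace-trimmed.

```python
from lxml import etree

# The <ul> wrapper is an assumption added to make the fragment complete.
doc = etree.HTML(
    '<ul><li><strong>Overall Dimensions (LxWxH):</strong> '
    '4-5/16"x3-1/8"x2-3/8" (11x8x6cm)</li></ul>'
)
li = doc.xpath("//li")[0]

# Text nodes of li and its descendants, skipping anything inside <strong>.
texts = li.xpath("./descendant-or-self::*[not(self::strong)]/text()")

# Trim whitespace and drop empty fragments.
value = " ".join(t.strip() for t in texts if t.strip())
print(value)
```

Because the label lives entirely inside strong, only the measurement text that follows it is returned.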

https://stackoverflow.com/questions/38199646/xpath-flatten-text-excluding-certain-nodes XPath: flatten text, excluding certain nodes
https://www.w3school.com.cn/xpath/xpath_functions.asp XPath functions
https://www.guojingyi.cn/760.html Excluding script tags with XPath
https://cloud.tencent.com/developer/ask/168300

html.xpath("//*[name(.)!='style']")
html.xpath("//*[not(name()='style')]")

# Get the text under all tags, excluding style and script tags and whitespace-only content
//div/descendant-or-self::*[not(self::style) and not(self::script)]/text()[normalize-space()]
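As a sketch of how that expression behaves, the snippet below runs it with lxml against a small page invented for the example.

```python
from lxml import etree

# Hypothetical markup, invented for this illustration.
doc = etree.HTML("""
<div>
  <style>p { color: red; }</style>
  <p>First paragraph</p>
  <script>var x = 1;</script>
  <span>Second <b>part</b></span>
</div>
""")

query = (
    "//div/descendant-or-self::*[not(self::style) and not(self::script)]"
    "/text()[normalize-space()]"
)
# The [normalize-space()] predicate drops whitespace-only text nodes;
# strip() then trims the surviving ones.
texts = [t.strip() for t in doc.xpath(query)]
print(texts)
```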

<a href="http://yahoo.com"><img src="http://yahoo.com/logo.png"></a>
Find all a tags that do not contain an img tag
//a[not(img)]
<ul>
<li><a href="#">some link</a></li>
<li><a href="#">some link 2</a></li>
<li><a href="#">link i want to find</a></li>
</ul>
To skip the first element:
//li[position()>1]
//a[preceding::a] finds all a elements except the first
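The three expressions above can be checked together with lxml. In this sketch the image link is placed in a fourth li, which is an arrangement chosen only for the example.

```python
from lxml import etree

# The list from the text, with the image link added as a fourth item
# (that placement is an assumption made for this illustration).
doc = etree.HTML("""
<ul>
  <li><a href="#">some link</a></li>
  <li><a href="#">some link 2</a></li>
  <li><a href="#">link i want to find</a></li>
  <li><a href="http://yahoo.com"><img src="http://yahoo.com/logo.png"></a></li>
</ul>
""")

# a elements that have no img child.
no_img = doc.xpath("//a[not(img)]/text()")
# Every li except the first one in its parent.
after_first_li = doc.xpath("//li[position()>1]/a/text()")
# Every a that has some other a earlier in the document.
after_first_a_count = len(doc.xpath("//a[preceding::a]"))

print(no_img, after_first_li, after_first_a_count)
```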

https://www.jb51.net/article/24884.htm

//div[@class='content']//text()[not(ancestor-or-self::div[@class='content']//h3)]

//p[contains(text(),"重复采购率")]/../following-sibling::div[2]/p[1]/text()
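This goes up to the parent of the p tag and then selects that parent's second following div sibling (the literal 重复采购率 means "repeat purchase rate").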

| : selects the union of several paths
//div[contains(text(),"Please enter Captcha")] |//*[@class="tableFloatingHeaderOriginal"]
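A minimal lxml sketch of the union operator, using a page invented for the example that contains both of the targets named in the expression.

```python
from lxml import etree

# Hypothetical page, invented for this illustration.
doc = etree.HTML("""
<div>Please enter Captcha</div>
<table class="tableFloatingHeaderOriginal"><tr><td>x</td></tr></table>
<p>unrelated</p>
""")

# The union returns nodes matched by either path.
nodes = doc.xpath(
    '//div[contains(text(),"Please enter Captcha")]'
    ' | //*[@class="tableFloatingHeaderOriginal"]'
)
tags = [n.tag for n in nodes]
print(tags)
```

A non-empty result means at least one of the two conditions holds, which is handy for detecting either of two page states with a single query.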

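# Assumes lxml is installed; HTML() here is taken to be lxml.etree.HTML
from lxml.etree import HTML

test="""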
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Storm</title>
</head>
<body>
<h1 id="h1" name="hname" class="cname">This is an h1 tag</h1>
<h1 id="h2" name="hname" class="cname2">This is a second h1 tag</h1>
<form class="logo">
Text field 1: <input type="text" name="first_name">
<br>
Text field 2: <input type="text" name="last_name">
</form>
<div>
<div itemprop="content">
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
<br>
Donec fringilla est eu euismod varius.
</div>
<div itemprop="content">
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>Donec fringilla est eu euismod varius.</p>
<p class="quote">
<span>Quote</span>
<a href="#">Exclude me</a>
<ul>
<li>Exclude me</li>
<li>Exclude me</li>
</ul>
</p>
<blockquote>Cras facilisis suscipit euismod.</blockquote>
</div>
</div>

</body>
</html>
"""

response=HTML(test)
# Both queries skip the form among body's children and return the text of everything else
tt=response.xpath('/html/body/*[not(name()="form")]//text()')
tt1=response.xpath('/html/body/*[name(.)!="form"]//text()')
print(tt)
print(tt1)

test1="""
<div class="a">
some text i want to see
<div class="b">
other text i want to see
</div>
<div class="c">
some text i DON'T WANT to see
</div>
some more text i wish to see..
</div>"""
response2=HTML(test1)
# Select all text nodes whose parent is not a div with @class="c":
tt2=response2.xpath('//div[@class="a"]//text()[not(parent::div[@class="c"])]')
print('tt2:',tt2)
# To also exclude whitespace-only text nodes, use this XPath:
tt3=response2.xpath('//div[@class="a"]//text()[not(parent::div[@class="c"]) and normalize-space()]')
print('tt3:',tt3)

<a href="/lt/apie/" class="button primary is-large" style="border-radius:99px;">
<span>Contact Us</span>
</a>

//*[contains(text(),"Contact Us")]/parent::a/@href
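The lookup above, sketched with lxml against the same a/span fragment. The second query is an alternative added for comparison, not something taken from the original notes.

```python
from lxml import etree

doc = etree.HTML("""
<a href="/lt/apie/" class="button primary is-large" style="border-radius:99px;">
<span>Contact Us</span>
</a>
""")

# Find the element whose own text contains the label, then step up to the a.
hrefs = doc.xpath('//*[contains(text(),"Contact Us")]/parent::a/@href')

# Alternative (an addition for comparison): test the a's full string value.
alt = doc.xpath('//a[contains(., "Contact Us")]/@href')

print(hrefs, alt)
```

The first form depends on the label sitting in a direct child of the a; the second matches at any nesting depth.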

test2="""
<td class="postbody">
hi this is a response
<div class="bbc-block">
<blockquote>
blah blah blah here's a quote
<br>
</blockquote>
</div>
<br>
what I quoted and now I'm responding to
</td>
<td class="postbody">
<div class="bbc-block">
<blockquote>
and now I'm responding to what I quoted
<br>
</blockquote>
</div>
<br>
wow what a great response
</td>
"""

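response3=HTML(test2)
# [not(@class="bbc-block")] tests the td itself, so it is always true here:
# both queries return the td's direct text nodes, which already leaves out the quoted block
text3=response3.xpath('//td[@class="postbody"][not(@class="bbc-block")]/text()')
text31=response3.xpath('//td[@class="postbody"]/text()')

print('text3::',text3)
print('text31::',text31)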

test3="""
<div class="content">
<div class="Titel">Title</div> #difference
<p>content</p>
<p>content</p>
<ul>
<li>List</li>
<li>List</li>
</ul>
</div>
"""
response4=HTML(test3)
text4=response4.xpath("//div[@class='content']//text()[not(ancestor-or-self::div[@class='content']//h3)]")
print('text4::',text4)
text41=response4.xpath("//div[@class='content']/child::*//text()")
print('text41::',text41)

test4="""
<book category="CLASSICS">
<title lang="it">Purgatorio</title>
<author>Dante Alighieri</author>
<year>1308</year>
<price>30.00</price>
</book>

<book category="CLASSICS">
<title lang="it">Inferno</title>
<author>Dante Alighieri</author>
<year>1308</year>
<price>30.00</price>
</book>

<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>

<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>

"""
response5=HTML(test4)
text5=response5.xpath("//book[title/@lang='it']//text()")
print('text5:',text5)

https://baijiahao.baidu.com/s?id=1645607565824624330&wfr=spider&for=pc
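1. nodename: selects the child elements of the current node that have this name
2. / : selects from the root node
3. // : selects matching nodes anywhere in the document, regardless of their position
4. . : selects the current node
5. .. : selects the parent of the current node
6. @ : selects attributes
7. * : matches any element node
8. @* : matches any attribute node
9. node(): matches a node of any type
10. | : selects the union of several paths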
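The syntax elements listed above, one per line, in a small lxml sketch. The markup is invented for the example; note that etree.HTML wraps the fragment in html and body, which is why the absolute path starts there.

```python
from lxml import etree

# Hypothetical markup, invented for this illustration.
doc = etree.HTML('<div id="kw"><p class="x">one</p><p>two</p></div>')

absolute = doc.xpath("/html/body/div/p/text()")      # / : path from the root
anywhere = doc.xpath("//p/text()")                   # // : any depth
first_p = doc.xpath("//p")[0]
current_tag = first_p.xpath(".")[0].tag              # . : current node
parent_tag = first_p.xpath("..")[0].tag              # .. : parent node
div_id = doc.xpath("//div/@id")                      # @ : attribute
child_tags = [e.tag for e in doc.xpath("//div/*")]   # * : any element
all_attrs = doc.xpath("//p/@*")                      # @* : any attribute
nodes = doc.xpath("//p/node()")                      # node() : any node type
both = [e.tag for e in doc.xpath("//div | //p")]     # | : union of two paths

print(absolute, current_tag, parent_tag, div_id, child_tags, all_attrs, nodes, both)
```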

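Using XPath predicates:
A predicate finds a specific node, or a node that contains a specified value.
Predicates are written inside square brackets.

/div[1]: selects the first div child of the root node

/div[last()]: selects the last div child of the root node

/div[last()-1]: selects the second-to-last div child of the root node

/div[position()<3]: selects the first two div children of the root node

//div[@id]: selects all div elements that have an id attribute

//div[@id='kw']: selects all div elements whose id attribute equals kw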
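A sketch of these predicates with lxml. The four divs are invented for the example, and because etree.HTML wraps the fragment in html and body, the paths are rooted at /html/body rather than at / directly.

```python
from lxml import etree

# Hypothetical markup: four sibling divs, invented for this illustration.
doc = etree.HTML(
    '<div id="a">1</div><div>2</div><div id="kw">3</div><div>4</div>'
)

first = doc.xpath("/html/body/div[1]/text()")
last = doc.xpath("/html/body/div[last()]/text()")
second_last = doc.xpath("/html/body/div[last()-1]/text()")
first_two = doc.xpath("/html/body/div[position()<3]/text()")
has_id = doc.xpath("//div[@id]/text()")
id_is_kw = doc.xpath("//div[@id='kw']/text()")

print(first, last, second_last, first_two, has_id, id_is_kw)
```

XPath positions start at 1, unlike Python list indexes.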

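XPath axes

XPath has seven kinds of nodes: element, attribute, text, namespace, processing instruction, comment, and document (root) nodes.

An axis defines a node set relative to the current node.

ancestor: selects all ancestors of the current node (parent, grandparent, great-grandparent, and so on)

ancestor-or-self: selects all ancestors of the current node plus the current node itself

attribute: selects all attributes of the current node

child: selects all children of the current node

descendant: selects all descendants of the current node (children, grandchildren, and so on)

descendant-or-self: selects all descendants of the current node plus the current node itself

following: selects everything in the document after the closing tag of the current node

following-sibling: selects all siblings after the current node

namespace: selects all namespace nodes of the current node

parent: selects the parent of the current node

preceding: selects all nodes that end before the opening tag of the current node (ancestors are not included)

preceding-sibling: selects all siblings before the current node

self: selects the current node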

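Usage:

child::div: selects all div elements that are children of the current node

attribute::id: selects the id attribute of the current node

child::*: selects all child elements of the current node

attribute::*: selects all attributes of the current node

child::text(): selects all text node children of the current node

child::node(): selects all children of the current node

descendant::div: selects all div descendants of the current node

ancestor::div: selects all div ancestors of the current node

ancestor-or-self::div: selects all div ancestors of the current node, plus the current node itself if it is a div

child::*/child::div: selects all div grandchildren of the current node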
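The axis expressions above can be exercised with lxml as in the sketch below. The nested markup and the id values are invented for the example; results for the ancestor axes are collected into sets so the example does not rely on result ordering.

```python
from lxml import etree

# Hypothetical nested markup, invented for this illustration.
doc = etree.HTML(
    '<div id="outer"><div id="inner"><p id="p1">text<b>bold</b></p></div></div>'
)
p = doc.xpath("//p")[0]
inner = doc.xpath("//div[@id='inner']")[0]

child_elements = [e.tag for e in p.xpath("child::*")]   # element children only
own_id = p.xpath("attribute::id")                       # the id attribute
text_children = p.xpath("child::text()")                # text node children
all_children = p.xpath("child::node()")                 # text node plus <b>
div_ancestors = {d.get("id") for d in p.xpath("ancestor::div")}
with_self = {d.get("id") for d in inner.xpath("ancestor-or-self::div")}
grandchildren = doc.xpath("//div[@id='outer']/child::*/child::p/@id")
bold = doc.xpath("//div[@id='outer']/descendant::b/text()")

print(child_elements, own_id, text_children, len(all_children))
print(div_ancestors, with_self, grandchildren, bold)
```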

<div>
<a id="1" href="www.baidu.com">I am the 1st a tag</a>
<p>I am a p tag</p>
<a id="2" href="www.baidu.com">I am the 2nd a tag</a>
<a id="3" href="www.baidu.com">I am the 3rd a tag</a>
<a id="4" href="www.baidu.com">I am the 4th a tag</a>
<p>I am a p tag</p>
<a id="5" href="www.baidu.com">I am the 5th a tag</a>
</div>

Get the next a tag after the third a tag: "//a[@id='3']/following-sibling::a[1]"

Get the previous a tag before the third a tag: "//a[@id='3']/preceding-sibling::a[1]"
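A small lxml sketch of the two sibling lookups, using a shortened copy of the markup above (the link texts are abbreviated for the example).

```python
from lxml import etree

# Same structure as the sample above, with shortened link texts.
doc = etree.HTML("""
<div>
<a id="1" href="www.baidu.com">a1</a>
<p>p</p>
<a id="2" href="www.baidu.com">a2</a>
<a id="3" href="www.baidu.com">a3</a>
<a id="4" href="www.baidu.com">a4</a>
<p>p</p>
<a id="5" href="www.baidu.com">a5</a>
</div>
""")

# [1] on a sibling axis means the nearest sibling in that direction.
next_a = doc.xpath("//a[@id='3']/following-sibling::a[1]/@id")
prev_a = doc.xpath("//a[@id='3']/preceding-sibling::a[1]/@id")
# The name test skips the <p> that sits between a1 and a2.
prev_of_second = doc.xpath("//a[@id='2']/preceding-sibling::a[1]/@id")

print(next_a, prev_a, prev_of_second)
```

On preceding-sibling the position counts backwards from the current node, so [1] is the closest earlier sibling, not the first one in the document.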