Reference documentation
https://www.w3.org/TR/xpath/
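Basic syntax
Expression | Description |
---|---|
/ | Selects the document root |
. (dot) | Selects the current node |
.. (two dots) | Selects the parent of the current node |
ELEMENT | Selects all child element nodes named ELEMENT |
//ELEMENT | Selects all descendant element nodes named ELEMENT |
* | Selects all child element nodes |
text() | Selects all child text nodes |
@ATTR | Selects the attribute node named ATTR |
@* | Selects all attribute nodes |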
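To try a few of the expressions in the table without any setup, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. That module is an assumption of this example, not something the rest of this document uses, and it implements only a subset of XPath: paths are relative to the element you call `findall` on, and `text()` and `@ATTR` are not available as path steps, so attributes are read with `.get()` instead. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree needs valid XML).
doc = ET.fromstring(
    "<html><body><div id='images'>"
    "<a href='image1.html'>Name:Image 1 <img src='image1.jpg'/></a>"
    "<a href='image2.html'>Name:Image 2 <img src='image2.jpg'/></a>"
    "</div></body></html>"
)

# ELEMENT: direct children of the context node (html) that are named body.
children = [e.tag for e in doc.findall("body")]
# //ELEMENT: every img descendant, at any depth.
imgs = [e.get("src") for e in doc.findall(".//img")]
# *: every element child of the div.
div_children = [e.tag for e in doc.findall("body/div/*")]
# ..: the parent of each img, here the enclosing a elements.
parents = [e.get("href") for e in doc.findall(".//img/..")]

print(children)      # ['body']
print(imgs)          # ['image1.jpg', 'image2.jpg']
print(div_children)  # ['a', 'a']
print(parents)       # ['image1.html', 'image2.html']
```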
Creating the HTML document
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
body='''
<html>
<head>
<base href='http://example.com/'>
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name:Image 1 <br/><img src='image1.jpg' /></a>
<a href='image2.html'>Name:Image 2 <br/><img src='image2.jpg' /></a>
<a href='image3.html'>Name:Image 3 <br/><img src='image3.jpg' /></a>
<a href='image4.html'>Name:Image 4 <br/><img src='image4.jpg' /></a>
<a href='image5.html'>Name:Image 5 <br/><img src='image5.jpg' /></a>
</div>
</body>
</html>
'''
response = HtmlResponse(url='http://www.example.com',body=body,encoding='utf8')
Selecting the root element
print(response.xpath('/html'))
[<Selector xpath='/html' data='<html>\n\t<head>\n\t\t<base href="http://e...'>]
print(response.xpath('/html/head'))
[<Selector xpath='/html/head' data='<head>\n\t\t<base href="http://example.c...'>]
Selecting all a elements under the div
print(response.xpath('/html/body/div/a'))
[<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name:Image 1 <b...'>,
<Selector xpath='/html/body/div/a' data='<a href="image2.html">Name:Image 2 <b...'>,
<Selector xpath='/html/body/div/a' data='<a href="image3.html">Name:Image 3 <b...'>,
<Selector xpath='/html/body/div/a' data='<a href="image4.html">Name:Image 4 <b...'>,
<Selector xpath='/html/body/div/a' data='<a href="image5.html">Name:Image 5 <b...'>]
Selecting all a elements in the document
print(response.xpath('//a'))
[<Selector xpath='//a' data='<a href="image1.html">Name:Image 1 <b...'>,
<Selector xpath='//a' data='<a href="image2.html">Name:Image 2 <b...'>,
<Selector xpath='//a' data='<a href="image3.html">Name:Image 3 <b...'>,
<Selector xpath='//a' data='<a href="image4.html">Name:Image 4 <b...'>,
<Selector xpath='//a' data='<a href="image5.html">Name:Image 5 <b...'>]
Selecting all img elements among the descendants of body
print(response.xpath('/html/body//img'))
[<Selector xpath='/html/body//img' data='<img src="image1.jpg">'>,
<Selector xpath='/html/body//img' data='<img src="image2.jpg">'>,
<Selector xpath='/html/body//img' data='<img src="image3.jpg">'>,
<Selector xpath='/html/body//img' data='<img src="image4.jpg">'>,
<Selector xpath='/html/body//img' data='<img src="image5.jpg">'>]
Extracting the text of all a elements
print(response.xpath('//a/text()').extract())
['Name:Image 1 ', 'Name:Image 2 ', 'Name:Image 3 ', 'Name:Image 4 ', 'Name:Image 5 ']
Selecting all child element nodes of html
print(response.xpath('/html/*'))
[<Selector xpath='/html/*' data='<head>\n\t\t<base href="http://example.c...'>,
<Selector xpath='/html/*' data='<body>\n\t\t<div id="images">\n\t\t\t<a href...'>]
Selecting all descendant element nodes of the div
print(response.xpath('/html/body/div//*'))
[<Selector xpath='/html/body/div//*' data='<a href="image1.html">Name:Image 1 <b...'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image1.jpg">'>,
<Selector xpath='/html/body/div//*' data='<a href="image2.html">Name:Image 2 <b...'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image2.jpg">'>,
<Selector xpath='/html/body/div//*' data='<a href="image3.html">Name:Image 3 <b...'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image3.jpg">'>,
<Selector xpath='/html/body/div//*' data='<a href="image4.html">Name:Image 4 <b...'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image4.jpg">'>,
<Selector xpath='/html/body/div//*' data='<a href="image5.html">Name:Image 5 <b...'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image5.jpg">'>]
Selecting all img elements among the grandchildren of a div
print(response.xpath('//div/*/img'))
[<Selector xpath='//div/*/img' data='<img src="image1.jpg">'>,
<Selector xpath='//div/*/img' data='<img src="image2.jpg">'>,
<Selector xpath='//div/*/img' data='<img src="image3.jpg">'>,
<Selector xpath='//div/*/img' data='<img src="image4.jpg">'>,
<Selector xpath='//div/*/img' data='<img src="image5.jpg">'>]
Selecting the src attribute of all img elements
print(response.xpath('//img/@src'))
[<Selector xpath='//img/@src' data='image1.jpg'>,
<Selector xpath='//img/@src' data='image2.jpg'>,
<Selector xpath='//img/@src' data='image3.jpg'>,
<Selector xpath='//img/@src' data='image4.jpg'>,
<Selector xpath='//img/@src' data='image5.jpg'>]
Selecting all href attributes
print(response.xpath('//@href'))
[<Selector xpath='//@href' data='http://example.com/'>,
<Selector xpath='//@href' data='image1.html'>,
<Selector xpath='//@href' data='image2.html'>,
<Selector xpath='//@href' data='image3.html'>,
<Selector xpath='//@href' data='image4.html'>,
<Selector xpath='//@href' data='image5.html'>]
Selecting all attributes of the img under the first a element
print(response.xpath('//a[1]/img/@*'))
[<Selector xpath='//a[1]/img/@*' data='image1.jpg'>]
Getting the Selector of the first a element and querying relative to it
print(response.xpath('//a')[0].xpath('.//img'))
[<Selector xpath='.//img' data='<img src="image1.jpg">'>]
print(response.xpath('//a[1]').xpath('.//img'))
[<Selector xpath='.//img' data='<img src="image1.jpg">'>]
Selecting the parent node of every img
print(response.xpath('//img/..'))
[<Selector xpath='//img/..' data='<a href="image1.html">Name:Image 1 <b...'>,
<Selector xpath='//img/..' data='<a href="image2.html">Name:Image 2 <b...'>,
<Selector xpath='//img/..' data='<a href="image3.html">Name:Image 3 <b...'>,
<Selector xpath='//img/..' data='<a href="image4.html">Name:Image 4 <b...'>,
<Selector xpath='//img/..' data='<a href="image5.html">Name:Image 5 <b...'>]
Selecting the third a element (the position is counted among the a children of the same parent)
print(response.xpath('//a[3]'))
[<Selector xpath='//a[3]' data='<a href="image3.html">Name:Image 3 <b...'>]
Using the last() function to select the last a element
print(response.xpath('//a[last()]'))
[<Selector xpath='//a[last()]' data='<a href="image5.html">Name:Image 5 <b...'>]
Using the position() function to select the first three a elements
print(response.xpath('//a[position()<=3]'))
[<Selector xpath='//a[position()<=3]' data='<a href="image1.html">Name:Image 1 <b...'>,
<Selector xpath='//a[position()<=3]' data='<a href="image2.html">Name:Image 2 <b...'>,
<Selector xpath='//a[position()<=3]' data='<a href="image3.html">Name:Image 3 <b...'>]
Selecting all div elements that have an id attribute
print(response.xpath('//div[@id]'))
[<Selector xpath='//div[@id]' data='<div id="images">\n\t\t\t<a href="image1....'>]
Selecting all div elements whose id attribute equals images
print(response.xpath('//div[@id="images"]'))
[<Selector xpath='//div[@id="images"]' data='<div id="images">\n\t\t\t<a href="image1....'>]
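One detail worth checking is that a positional predicate counts among siblings: `//a[3]` selects every a that is the third a child of its own parent, not the third a in the whole document. In the sample page all five links share one parent, so the two readings coincide. The sketch below uses made-up markup with two div elements and the standard-library `xml.etree.ElementTree` (an assumption of this example; it supports `[N]` and `[last()]` but not `position()` comparisons).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two parents, each with its own run of a elements.
doc = ET.fromstring(
    "<body>"
    "<div id='first'><a>1a</a><a>1b</a><a>1c</a></div>"
    "<div id='second'><a>2a</a><a>2b</a></div>"
    "</body>"
)

# [2] is evaluated per parent, so each div contributes its own second a.
second = [a.text for a in doc.findall(".//a[2]")]
# [last()] likewise picks the last a inside each div.
last = [a.text for a in doc.findall(".//a[last()]")]
# Indexing the full result list instead gives a document-wide position.
third_overall = doc.findall(".//a")[2].text

print(second)         # ['1b', '2b']
print(last)           # ['1c', '2b']
print(third_overall)  # 1c
```

In XPath itself, the document-wide form is written with parentheses, as in `(//a)[3]`, which applies the position to the whole node-set.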
Common XPath functions
string(arg): returns the string value of the argument
Getting the string value of strong
from scrapy.selector import Selector
text = '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
sel = Selector(text=text)
res = sel.xpath('string(/html/body/a/strong)').extract()
print(res)
res1 = sel.xpath('/html/body/a/strong/text()')
print(res1)
Result:
['Next Page']
[<Selector xpath='/html/body/a/strong/text()' data='Next Page'>]
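Getting both pieces of text inside the a element. When string() is given a node-set it converts only the first node, so string(/html/body/a//text()) returns just the first text node, while string(/html/body/a) concatenates all descendant text.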
from scrapy.selector import Selector
text = '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
sel = Selector(text=text)
res = sel.xpath('string(/html/body/a//text())').extract()
print(res)
res1 = sel.xpath('string(/html/body/a)').extract()
print(res1)
Result:
['Click here to go to the ']
['Click here to go to the Next Page']
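The difference between the two results follows from how string() treats its argument: given a node-set it converts only the first node in document order, and the string value of an element is the concatenation of all its descendant text nodes. As a rough standard-library analogue (an assumption of this sketch, not Scrapy's API), `xml.etree.ElementTree` exposes the same two quantities as `.text` and `itertext()`.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring(
    '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
)

# First text node only: the value string(/html/body/a//text()) produced above.
first_text = a.text
# All descendant text joined: the value string(/html/body/a) produced above.
full_text = "".join(a.itertext())

print(repr(first_text))  # 'Click here to go to the '
print(repr(full_text))   # 'Click here to go to the Next Page'
```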
contains(str1, str2): returns a boolean indicating whether str1 contains str2
from scrapy.selector import Selector
text = '''
<div>
<p class="small info">hello world</p>
<p class="normal info">hello scrapy</p>
</div>
'''
sel = Selector(text=text)
print(sel.xpath('//p[contains(@class,"small")]'))
print(sel.xpath('//p[contains(@class,"info")]'))
Result:
[<Selector xpath='//p[contains(@class,"small")]' data='<p class="small info">hello world</p>'>]
[<Selector xpath='//p[contains(@class,"info")]' data='<p class="small info">hello world</p>'>,
<Selector xpath='//p[contains(@class,"info")]' data='<p class="normal info">hello scrapy</p>'>]
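Because contains() is a plain substring test, `contains(@class, "info")` would also match a class such as `information`. A common XPath 1.0 idiom for matching a whole class token is `contains(concat(" ", normalize-space(@class), " "), " info ")`. The sketch below mirrors both tests in plain Python on made-up markup, using the standard-library `xml.etree.ElementTree` only to parse it (ElementTree itself does not implement contains(), and the helper names are hypothetical).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second class merely starts with "info".
div = ET.fromstring(
    "<div>"
    '<p class="small info">hello world</p>'
    '<p class="information">hello scrapy</p>'
    "</div>"
)

def contains_substring(elem, needle):
    # Mirrors contains(@class, needle): a plain substring test.
    return needle in elem.get("class", "")

def has_class_token(elem, token):
    # Mirrors the concat/normalize-space idiom: match a whole class token.
    return token in elem.get("class", "").split()

loose = [p.text for p in div.iter("p") if contains_substring(p, "info")]
strict = [p.text for p in div.iter("p") if has_class_token(p, "info")]

print(loose)   # ['hello world', 'hello scrapy']
print(strict)  # ['hello world']
```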
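XPath axes
child: selects all children of the current node
parent: selects the parent of the current node
ancestor: selects all ancestors of the current node (parent, grandparent, and so on)
ancestor-or-self: selects all ancestors of the current node (parent, grandparent, and so on) and the current node itself
descendant: selects all descendants of the current node (children, grandchildren, and so on)
descendant-or-self: selects all descendants of the current node (children, grandchildren, and so on) and the current node itself
preceding: selects all nodes that come before the current node in document order, excluding its ancestors
following: selects all nodes that come after the current node in document order, excluding its descendants
preceding-sibling: selects all siblings before the current node
following-sibling: selects all siblings after the current node
self: selects the current node
attribute: selects all attributes of the current node
namespace: selects all namespace nodes of the current node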
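In an expression an axis is written as `axis::node-test`, for example `//p[@id="mid"]/following-sibling::*` or `//img/ancestor::div`. To make the definitions concrete, the sketch below emulates three axes as plain tree walks over made-up markup with the standard-library `xml.etree.ElementTree`, which does not support axis syntax itself. The parent map and helper names are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
root = ET.fromstring(
    "<html><body><div>"
    "<p>one</p><p id='mid'>two</p><span>three</span><p>four</p>"
    "</div></body></html>"
)

# ElementTree nodes do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    # ancestor axis: parent, grandparent, ... up to the root element.
    found = []
    while node in parent_of:
        node = parent_of[node]
        found.append(node)
    return found

def siblings_before_and_after(node):
    # preceding-sibling and following-sibling axes (element nodes only).
    siblings = list(parent_of[node])
    i = siblings.index(node)
    return siblings[:i], siblings[i + 1:]

mid = root.find(".//p[@id='mid']")
before, after = siblings_before_and_after(mid)

ancestor_tags = [e.tag for e in ancestors(mid)]
before_text = [e.text for e in before]
after_text = [e.text for e in after]

print(ancestor_tags)  # ['div', 'body', 'html']
print(before_text)    # ['one']
print(after_text)     # ['three', 'four']
```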
XPath operators