Reference documentation
https://www.w3.org/TR/xpath/
Basic syntax
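Expression | Description |
---|---|
/ | Selects the document root |
. (dot) | Selects the current node |
.. (two dots) | Selects the parent of the current node |
ELEMENT | Selects all ELEMENT element nodes among the child nodes |
//ELEMENT | Selects all ELEMENT element nodes among the descendant nodes |
* | Selects all element child nodes |
text() | Selects all text child nodes |
@ATTR | Selects the attribute node named ATTR |
@* | Selects all attribute nodes |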
Creating an HTML document
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
body='''
<html>
<head>
<base href='http://example.com/'>
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name:Image 1 <br/><img src='image1.jpg' /></a>
<a href='image2.html'>Name:Image 2 <br/><img src='image2.jpg' /></a>
<a href='image3.html'>Name:Image 3 <br/><img src='image3.jpg' /></a>
<a href='image4.html'>Name:Image 4 <br/><img src='image4.jpg' /></a>
<a href='image5.html'>Name:Image 5 <br/><img src='image5.jpg' /></a>
</div>
</body>
</html>
'''
response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')
Selecting from the root path
print(response.xpath('/html'))
[<Selector xpath='/html' data='<html>\n\t<head>\n\t\t<base href="http://e...'>]
print(response.xpath('/html/head'))
[<Selector xpath='/html/head' data='<head>\n\t\t<base href="http://example.c...'>]
Selecting all a tags under the div
print(response.xpath('/html/body/div/a'))
[<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name:Image 1 <b...'>,
<Selector xpath='/html/body/div/a' data='<a href="image2.html">Name:Image 2 <b...'>,
<Selector xpath='/html/body/div/a' data='<a href="image3.html">Name:Image 3 <b...'>,
<Selector xpath