Scrapy: XPath Selectors

Latest recommended article published on 2024-02-20 06:30:00

苦涩2020

Category: Python. Tags: XPath, Scrapy, Python

Article link: https://blog.csdn.net/UserPython/article/details/83863286

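XPath Selectors

XPath (XML Path Language) is a language for addressing parts of an XML document. Scrapy applies it to HTML documents as well.

Basic Syntax

The HTML document below is used to illustrate each piece of syntax.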

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

body = """
<html>
    <head>
        <base href = "http://www.baidu.com" />
        <title>User23333</title>
    </head>
    <body>
        <h1>Hello World</h1>
        <h1>Hello Scrapy</h1>
        <b>Hello Python</b>
        <ul>
            <li>Python学习手册<b>价格：99.00元</b></li>
            <li>Python核心编程<b>价格：88.00元</b></li>
            <li>Python基础教程<b>价格：80.00元</b></li>
        </ul> 
        <div id = "images">
            <a href = "image1.html">Name: Image 1<br/><img src = "image1.jpg"/></a>
            <a href = "image2.html">Name: Image 2<br/><img src = "image2.jpg"/></a>
            <a href = "image3.html">Name: Image 3<br/><img src = "image3.jpg"/></a>
            <a href = "image4.html">Name: Image 4<br/><img src = "image4.jpg"/></a>
            <a href = "image5.html">Name: Image 5<br/><img src = "image5.jpg"/></a>
        </div> 
        <ul class="pager">          
            <li class="current">Page 1 of 50</li>            
            <li class="next"><a href="catalogue/page-2.html" class="">next</a></li>         
        </ul>     
    </body>
</html>
"""

response = HtmlResponse(url="http://www.example.com", body=body, encoding='utf-8')

/ : describes an absolute path that starts at the document root

print(response.xpath('/html'))
print(response.xpath('/html/head'))

E1/E2: selects all E2 nodes among the children of E1

print(response.xpath('/html/body/div/a')) # select all a nodes that are children of the div

//E: selects all E nodes in the document, wherever they appear

print(response.xpath('//a')) # select every a in the document

E1//E2: selects all E2 nodes among the descendants of E1, at any depth

print(response.xpath('/html/body//img')) # select every img that is a descendant of body

E/text(): selects the text nodes that are direct children of E

sel = response.xpath('//a/text()') # select the text of every a
print(sel)
print(sel.extract())

E/*: selects all element children of E

print(response.xpath('/html/*')) # select all element children of html
print(response.xpath('/html/body/div//*')) # select all descendant elements of the div

*/E: selects all E nodes among the grandchildren (E nodes whose parent is any child element)

print(response.xpath('//div/*/img'))

E/@ATTR: selects the ATTR attribute of E

print(response.xpath('//img/@src')) # select the src attribute of every img

//@ATTR: selects every ATTR attribute in the document

print(response.xpath('//@href')) # select the value of every href attribute

E/@*: selects all attributes of E

print(response.xpath('//a[1]/img/@*')) # all attributes of the img under the first a (here only src)

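. : selects the current node; used to write relative paths

sel2 = response.xpath('//a')[0] # get the selector object for the first a
print(sel2)

# Suppose we want every img that is a descendant of this a. The call below is
# wrong: it finds every img in the document, because //img is an absolute path
# that searches from the document root rather than from the current a.
print(sel2.xpath('//img')) # wrong

# Use .//img to describe all img descendants of the current node
print(sel2.xpath('.//img'))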

.. : selects the parent of the current node; used to write relative paths

print(response.xpath('//img/..')) # select the parent of every img

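node[predicate]: a predicate filters nodes by position or by a condition such as an attribute value

# Select each a that is the 3rd a among its siblings (here, the third link in the div).
# To get the 3rd a of the whole document, write (//a)[3].
print(response.xpath('//a[3]'))
# Select each a that is the last a among its siblings
print(response.xpath('//a[last()]'))
# Select each a that is among the first three a siblings
print(response.xpath('//a[position()<=3]'))


# Select every div that has an id attribute
print(response.xpath('//div[@id]'))

# Select every div whose id attribute equals "images"
print(response.xpath('//div[@id="images"]'))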

Common Functions

string(arg): returns the string value of the argument, including the text of all its descendant nodes

text = '<a href = "#">Click here to go to the <strong>Next Page</strong></a>'
selector = Selector(text=text)
# print(selector)

# The two calls below give the same result
print(selector.xpath('/html/body/a/strong/text()').extract())
print(selector.xpath('string(/html/body/a/strong)').extract())

# To get the whole string inside a, 'Click here to go to the Next Page',
# text() is not enough, because 'Click here to go to the ' and 'Next Page'
# belong to different elements.
# The call below returns two separate substrings
print(selector.xpath('/html/body/a//text()').extract())
# In this case, use the string() function
print(selector.xpath('string(/html/body/a)').extract())

contains(str1, str2): tests whether str1 contains str2 and returns a boolean

text2 = '''
    <div>
        <p class="small info">hello world</p>
        <p class="normal info">hello scrapy</p>
    </div>
'''

selector2 = Selector(text = text2)
print(selector2.xpath('//p[contains(@class, "small")]')) # select p elements whose class attribute contains "small"
print(selector2.xpath('//p[contains(@class, "info")]')) # select p elements whose class attribute contains "info"
