Using re (regular expressions) with XPath in Scrapy

Method 1: call .re() on the result of xpath()

 

Example: here I crawl the site "http://www.simple-style.com/page/1".

$ scrapy shell http://www.simple-style.com/page/1

Once inside the interactive shell, I want to find every src attribute on the current page:

>>> response.xpath('//@src').extract()
['http://www.simple-style.com/wp-includes/js/jquery/jquery.js?ver=1.12.4',
 'http://www.simple-style.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.4.1',
 'http://www.simple-style.com/wp-content/plugins/to-top/public/js/to-top-public.js?ver=1.0',
 'http://www.simple-style.com/wp-content/uploads/2017/03/simple-logo.gif',
 '//v.qq.com/iframe/player.html?vid=e0386mjreck&tiny=0&auto=0',
 'http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/02/ahndraya_parlato_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/inner_self_04.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/Yuanghua-Chen-01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/01-alicephoebelou.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/06/02-Tim_Gao_Photography_Invisible_Theatre_17.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/05/4.png',
 'http://www.simple-style.com/wp-content/uploads/2016/05/01-Remona.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/05/Nbr-h000-1.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/04/0501.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/04/01.jpg',
 'http://www.simple-style.com/wp-content/plugins/smartideo/static/smartideo.js?ver=2.2.5',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/skip-link-focus-fix.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/navigation.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/global.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/jquery.scrollTo.js?ver=2.1.2',
 'http://www.simple-style.com/wp-includes/js/wp-embed.min.js?ver=4.7.3']

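Having collected all these src values, I only want the src of the .jpg files uploaded under "/2017/03". A regular expression can do that.

Call .re() directly on the object returned by xpath(), without calling extract() first. .re() itself returns a list of strings. Calling it after extract() raises an error, because extract() returns a plain Python list, which has no .re() method.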

>>> response.xpath('//@src').re(r'.*/2017/03/.*\.jpg')
['http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg']

 

Method 2: use re:test() inside the XPath expression

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <li class="item-"><a href="link.html">first item</a></li>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</body>
</html>
"""
# Build a response from the string so no network request is needed.
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# re:test(@class, pattern) keeps the <li> elements whose class matches the regex;
# the raw string stops Python from interpreting \d as a string escape.
ret = Selector(response=response).xpath(r'//li[re:test(@class, "item-\d*")]//@href').extract()
print(ret)

Code listing: regex selector

 

Reposted from: https://www.cnblogs.com/Garvey/p/6697162.html
