Using Scrapy selectors
Official test page: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
HTML source:
<html><head>
<base href="http://example.com/">
<title>Example website</title>
<style type="text/css" abt="234"></style></head>
<body>
<div id="images">
<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>
<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>
<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>
<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>
<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>
</div>
</body></html>
At the command prompt:
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Lookups
selector: Scrapy's built-in selector class, used to extract data from the response
response.selector
Output: <Selector xpath=None data='<html>\n <head>\n <base href="http://e...'>
e.g. use an XPath selector to get the content of title; the result is a list of selectors (a SelectorList)
response.selector.xpath('//title/text()')
Output: [<Selector xpath='//title/text()' data='Example website'>]
Get the title text:
response.selector.xpath('//title/text()').extract_first()
Output: 'Example website'
e.g. 2: use a CSS selector
response.selector.css('title::text').extract_first()
Output: 'Example website'
Chained lookups
e.g. find all images on the page
response.xpath('//div[@id="images"]').css('img')
Output:
[<Selector xpath='descendant-or-self::img' data='<img src="image1_thumb.jpg">'>,
<Selector xpath='descendant-or-self::img' data='<img src="image2_thumb.jpg">'>,
<Selector xpath='descendant-or-self::img' data='<img src="image3_thumb.jpg">'>,
<Selector xpath='descendant-or-self::img' data='<img src="image4_thumb.jpg">'>,
<Selector xpath='descendant-or-self::img' data='<img src="image5_thumb.jpg">'>]
Get the image links (the src attribute):
response.xpath('//div[@id="images"]').css('img::attr(src)').extract()
Output:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
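extract_first(): returns the first match
- Parameter: default=""
The value to return when nothing matches (None if omitted), so later code does not fail on a missing result
extract(): returns every match as a list of strings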
e.g. 2: get all hyperlinks on the page
XPath selector:
response.xpath('//a/@href').extract()
CSS selector:
response.css('a::attr(href)').extract()
Output: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
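The extracted href values are relative. A minimal sketch of turning them into absolute URLs with the standard library, assuming the base href from the sample page (http://example.com/) is the right base to resolve against:

```python
from urllib.parse import urljoin

# Base taken from the <base href="..."> tag of the sample page.
BASE_URL = "http://example.com/"

hrefs = ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

# Resolve every relative link against the base URL.
absolute_links = [urljoin(BASE_URL, href) for href in hrefs]

for link in absolute_links:
    print(link)
```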
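e.g. 3: get the hyperlinks whose href value contains "image"
In contains(), the first argument is the attribute and the second is the substring its value must contain
XPath selector: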
response.xpath('//a[contains(@href,"image")]/@href').extract()
CSS selector:
response.css('a[href*=image]::attr(href)').extract()
Output:
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
e.g. 4: select the src attribute of the img inside every a tag
XPath selector:
response.xpath('//a[contains(@href,"image")]/img/@src').extract()
CSS selector:
response.css('a[href*=image] img::attr(src)').extract()
Output:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
Filtering with regular expressions
e.g.
response.css('a::text').extract()
['Name: My image 1 ',
'Name: My image 2 ',
'Name: My image 3 ',
'Name: My image 4 ',
'Name: My image 5 ']
Keep only the text after the colon:
response.css('a::text').re(r'Name:(.*)')
[' My image 1 ',
' My image 2 ',
' My image 3 ',
' My image 4 ',
' My image 5 ']
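The captured strings still carry the surrounding spaces. As a rough illustration of what the pattern does to each text node, here is the same capture with the standard re module, followed by a strip; the two-item list stands in for the extracted link texts.

```python
import re

# Stand-in for the strings returned by response.css('a::text').extract()
texts = ['Name: My image 1 ', 'Name: My image 2 ']

pattern = re.compile(r'Name:(.*)')

# The capture group keeps everything after the colon, spaces included.
raw = [match for text in texts for match in pattern.findall(text)]

# Trim the leading and trailing whitespace afterwards.
clean = [match.strip() for match in raw]

print(raw)
print(clean)
```

An alternative is to absorb the leading space in the pattern itself, for example with r'Name:\s*(.*)'.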
Example: scraping the Bilibili ranking page
Start the shell
scrapy shell https://www.bilibili.com/ranking
1. Get the title
XPath selector
response.selector.xpath('//title/text()').extract_first()
CSS selector
response.selector.css('title::text').extract_first()
'热门视频排行榜 - 哔哩哔哩 (゜-゜)つロ 干杯~-bilibili'
2. Find all links on the page
XPath selector:
response.xpath('//div[@class="info"]/a/@href').extract()
CSS selector:
response.css('.title::attr(href)').extract()
3. Find the video titles on the page
XPath selector:
response.xpath('//div[@class="info"]/a/text()').extract()
CSS selector:
response.css('.info a[href]::text').extract()