Scrapy, Part 2
1, Scrapy shell
The Scrapy shell runs on top of IPython whenever it is available; IPython is an enhanced Python interpreter. The preferred shell can also be configured in scrapy.cfg (e.g. shell = ipython under [settings]).
-
How to enter the shell
-
Directly from the project root directory:
scrapy shell [url]
Here url is the address you want to crawl; it is optional. -
Adding request headers on the command line
(Spiderenv) pyvip@VIP:~/code/爬虫/myspider$ scrapy shell -s USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' https://www.jianshu.com
-
settings.py
Request headers can also be added in settings.py; when the shell is launched from the project root, it loads the project's settings.py
file automatically.
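As a sketch, the header settings in settings.py might look like the following (the exact header values are illustrative, not taken from the original):

```python
# settings.py -- these headers are merged into every request the project sends
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# The User-Agent can also be set on its own:
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/71.0.3578.98 Safari/537.36')
```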
-
-
The fetch method inside the shell
1. Fetch a URL directly: fetch('http://www.baidu.com')
2. Construct a Request first:
   req = scrapy.Request('http://www.quanshuwang.com')
   fetch(req)
3. Add request-header information when building the Request
-
The shelp() method
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fd20ff30da0>
[s]   item       {}
[s]   req        <GET http://www.quanshuwang.com>
[s]   request    <GET http://www.quanshuwang.com>
[s]   response   <200 http://www.quanshuwang.com>
[s]   settings   <scrapy.settings.Settings object at 0x7fd20e7ff588>
[s]   spider     <DefaultSpider 'default' at 0x7fd2045e2898>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser  # opens the fetched page in your browser
2, Scrapy selectors
Scrapy provides a parsing mechanism built on the lxml library; these objects are called selectors.
-
xpath: takes an XPath expression and returns a list of matching selectors
-
css: takes a CSS expression and returns a list of matching selectors
-
re: applies a regular expression and returns a list of matched strings
-
extract: serializes the selector objects into a list of strings
from scrapy import Selector
from scrapy.http import HtmlResponse
# Construct from text
body = '<html><body><span>good</span></body></html>'
select = Selector(text=body)
select.xpath('//span/text()').extract()
--> ['good']
select.xpath('//span/text()').extract_first()
--> 'good'
# Construct from a response
response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf-8')
select1 = Selector(response=response)
response.selector.xpath('//span/text()').extract()
--> ['good']
response.xpath('//span/text()').extract_first()
--> 'good'
More examples, using https://doc.scrapy.org/en/latest/_static/selectors-sample1.html:
In [23]: fetch('https://doc.scrapy.org/en/latest/_static/selectors-sample1.html')
In [24]: response.body
Out[24]: b"<html>\n <head>\n <base href='http://example.com/' />\n <title>Example website</title>\n </head>\n <body>\n <div id='images'>\n <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n </div>\n </body>\n</html>\n\n"
In [25]: response.xpath('//title/text()')
Out[25]: [<Selector xpath='//title/text()' data='Example website'>]
In [26]: response.xpath('//title/text()').extract()
Out[26]: ['Example website']
In [27]: response.css('title::text')
Out[27]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]
In [28]: response.css('title::text').extract()
Out[28]: ['Example website']
In [29]: response.xpath('//a[contains(@href, "image")]')
Out[29]:
[<Selector xpath='//a[contains(@href, "image")]' data='<a href="image1.html">Name: My image 1 <'>,
<Selector xpath='//a[contains(@href, "image")]' data='<a href="image2.html">Name: My image 2 <'>,
<Selector xpath='//a[contains(@href, "image")]' data='<a href="image3.html">Name: My image 3 <'>,
<Selector xpath='//a[contains(@href, "image")]' data='<a href="image4.html">Name: My image 4 <'>,
<Selector xpath='//a[contains(@href, "image")]' data='<a href="image5.html">Name: My image 5 <'>]
In [30]: links = response.xpath('//a[contains(@href, "image")]')
In [37]: for index, link in enumerate(links):
...: print(index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
Out[37]:
0 ['image1.html'] ['image1_thumb.jpg']
1 ['image2.html'] ['image2_thumb.jpg']
2 ['image3.html'] ['image3_thumb.jpg']
3 ['image4.html'] ['image4_thumb.jpg']
4 ['image5.html'] ['image5_thumb.jpg']
In [38]: response.css('img').xpath('@src').extract()
Out[38]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
# Combining with regular expressions
In [39]: response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s(.*)')
Out[39]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']
In [40]: response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s(.*)')
Out[40]: 'My image 1 '
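The .re() / .re_first() calls above apply an ordinary Python regular expression to each extracted string; the same pattern behaves identically with the stdlib re module:

```python
import re

# The same strings the selector would extract from the sample page
texts = ['Name: My image 1 ', 'Name: My image 2 ']
pattern = re.compile(r'Name:\s(.*)')

names = [pattern.search(t).group(1) for t in texts]
print(names)     # ['My image 1 ', 'My image 2 ']
print(names[0])  # equivalent of re_first: 'My image 1 '
```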
3, scrapy.Spider
The spider base class
-
This is where you define the custom crawling behaviour for one particular site or a group of sites.
Required pieces:
name: the spider's name
start_urls: the starting URLs
parse: the default callback; it returns scrapy.Request objects, scrapy.Item objects, or dicts
start_requests: builds the initial requests from start_urls
-
Looking at the source:
name: the Spider's name; required, and must be unique
start_urls: the starting URLs; this is a list, so several URLs may be given
parse: the default callback; it returns scrapy.Request objects, scrapy.Item objects, or dicts
start_requests: builds the initial requests from start_urls