Big Crawler Frameworks: Scrapy (Part 2)


1. Scrapy shell

The Scrapy shell uses IPython when it is available; IPython is an enhanced interactive Python interpreter. Per the original note, this can be configured in settings.py.


  1. Ways to enter the shell

    1. Directly from the project root: scrapy shell [url], where url is the address you want to crawl; it may be omitted

    2. Add request headers on the command line

      (Spiderenv) pyvip@VIP:~/code/爬虫/myspider$ scrapy shell -s USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' https://www.jianshu.com
      
    3. Add request headers in settings.py; when the shell is run from the project root, settings.py is loaded by default

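The screenshot for the settings.py approach did not survive; a minimal sketch of the relevant settings (the values shown are illustrative, not from the original):

```python
# settings.py -- picked up automatically when the shell starts from the project root
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# headers merged into every request the project sends
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en',
}
```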

  2. The fetch method inside the shell

    1. Call fetch('url') directly
    	fetch('http://www.baidu.com')
    
    2. Construct a Request object
    	req = scrapy.Request('http://www.quanshuwang.com')
    	fetch(req)
        
    3. Add request header information (the headers dict below is illustrative)
    	req = scrapy.Request('http://www.quanshuwang.com', headers={'User-Agent': 'Mozilla/5.0'})
    	fetch(req)
    
  3. The shelp() method

    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fd20ff30da0>
    [s]   item       {}
    [s]   req        <GET http://www.quanshuwang.com>
    [s]   request    <GET http://www.quanshuwang.com>
    [s]   response   <200 http://www.quanshuwang.com>
    [s]   settings   <scrapy.settings.Settings object at 0x7fd20e7ff588>
    [s]   spider     <DefaultSpider 'default' at 0x7fd2045e2898>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser	# open the response in your browser
    
2. Scrapy selectors

Scrapy provides a parsing mechanism built on the lxml library; the resulting objects are called selectors.

  1. xpath: pass in a path expression; returns a list of matching selectors


  2. css


  3. re


  4. extract: serializes Selector objects into strings


from scrapy import Selector 
from scrapy.http import HtmlResponse 

# construct from text
body = '<html><body><span>good</span></body></html>' 
select = Selector(text=body)
select.xpath('//span/text()').extract()
	-->  ['good']
select.xpath('//span/text()').extract_first() 
	-->  'good'
    
# construct from a response
response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf-8')
select1 = Selector(response=response)
response.selector.xpath('//span/text()').extract()
	-->  ['good']
response.xpath('//span/text()').extract_first()
	-->  'good'
    
More examples, using the sample page at https://doc.scrapy.org/en/latest/_static/selectors-sample1.html:

In [23]: fetch('https://doc.scrapy.org/en/latest/_static/selectors-sample1.html') 

In [24]: response.body                 
Out[24]: b"<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>\n\n"

In [25]: response.xpath('//title/text()')    
Out[25]: [<Selector xpath='//title/text()' data='Example website'>]

In [26]: response.xpath('//title/text()').extract()
Out[26]: ['Example website']

In [27]: response.css('title::text')
Out[27]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

In [28]: response.css('title::text').extract()
Out[28]: ['Example website']

In [29]: response.xpath('//a[contains(@href, "image")]')
Out[29]: 
[<Selector xpath='//a[contains(@href, "image")]' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//a[contains(@href, "image")]' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//a[contains(@href, "image")]' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='//a[contains(@href, "image")]' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='//a[contains(@href, "image")]' data='<a href="image5.html">Name: My image 5 <'>]
 
In [30]: links = response.xpath('//a[contains(@href, "image")]')

In [37]: for index, link in enumerate(links): 
    ...:     print(index, link.xpath('@href').extract(), link.xpath('img/@src').extract()) 
0 ['image1.html'] ['image1_thumb.jpg']
1 ['image2.html'] ['image2_thumb.jpg']
2 ['image3.html'] ['image3_thumb.jpg']
3 ['image4.html'] ['image4_thumb.jpg']
4 ['image5.html'] ['image5_thumb.jpg']

In [38]: response.css('img').xpath('@src').extract()
Out[38]: 
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

# combined with regular expressions
In [39]: response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s(.*)')
Out[39]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

In [40]: response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s(.*)') 
Out[40]: 'My image 1 '
3. scrapy.Spider

The spider base class.

  1. The place to define the specific crawling behaviour for one site or a group of sites

    Required pieces:

    name: the spider's name

    start_urls: the starting URLs

    parse: the default parsing callback; returns scrapy.Request, scrapy.Item, or dict objects

    start_requests: builds the initial requests from start_urls

  2. Reading the source

    1. name: the Spider's name; required and unique
    2. start_urls: the starting URLs; a list, so it can hold several URLs