Scrapy Quickstart

Scrapy

Scrapy Project Structure

  • Installation

    pip install scrapy  # or: conda install scrapy
    
  • scrapy startproject PROJECT_NAME  # create a Scrapy project named PROJECT_NAME
    

    For example, scrapy startproject myspider creates a project named myspider with the following file structure:

    myspider
    ├── myspider
    │   ├── __init__.py
    │   ├── __pycache__
    │   ├── items.py        # defines the fields to be scraped
    │   ├── middlewares.py  # custom middlewares
    │   ├── pipelines.py    # pipelines, used to save the data
    │   ├── settings.py     # project settings
    │   └── spiders         # holds the spiders; one project can define multiple spiders
    │       ├── __init__.py
    │       └── __pycache__
    └── scrapy.cfg          # project configuration file

  • Create a spider inside the project

    cd <PROJECT_NAME>  # enter the project directory
    scrapy genspider <SPIDER_NAME> <DOMAIN>  # SPIDER_NAME is the spider's name, DOMAIN the domain to crawl
    

    For example:

    cd myspider
    scrapy genspider baidu baidu.com
    

    A baidu.py file is then generated inside the spiders folder:

    import scrapy
    
    class BaiduSpider(scrapy.Spider):
        name = 'baidu'  # the spider's name (SPIDER_NAME)
        allowed_domains = ['baidu.com']  # domains allowed for crawling (DOMAIN); links to other domains are filtered out
        start_urls = ['http://baidu.com/']  # the initial URLs to crawl

        # parsing logic for the responses goes here
        def parse(self, response):
            pass
    

Items

  • Declare the fields to be scraped in items.py, e.g. (a short usage sketch follows the snippet):

    import scrapy
    
    # an item class describing a person's information
    class PersonItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
        sex = scrapy.Field()
        age = scrapy.Field()
    
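    Scrapy Item objects behave like dictionaries, but only the declared fields may be assigned. A small sketch of how the PersonItem defined above behaves (the values are made up purely for illustration):

    # assuming the PersonItem defined above lives in myspider/items.py
    from myspider.items import PersonItem

    item = PersonItem(name='Tom')
    item['age'] = 18
    print(item['name'], item.get('sex'))  # Tom None  -- dict-style access; get() returns None for unset fields
    print(dict(item))                     # {'name': 'Tom', 'age': 18}
    item['height'] = 180                  # raises KeyError: 'height' is not a declared field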

Usage

  • Simply create an instance of that class (PersonItem) inside the spider (an illustration of extract() vs extract_first() follows the snippet)

    import scrapy
    from myspider.items import PersonItem  # the item class must be imported (adjust the package path to your project)
    
    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ['example.com']
        start_urls = ['http://example.com']
        
        def parse(self, response):
            item = PersonItem()
            # extract() returns a list of all matches; extract_first() returns the first match or None.
            # Fill in the XPath expressions below for the actual page.
            item["name"] = response.xpath("").extract_first()
            item["sex"] = response.xpath("").extract_first()
            item["age"] = response.xpath("").extract_first()
            
            yield item  # handed over to the pipelines for processing
    
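    A small self-contained illustration of the difference; the HtmlResponse below is built by hand purely for the demo:

    from scrapy.http import HtmlResponse

    body = b'<html><body><a>first</a><a>second</a></body></html>'
    response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')

    print(response.xpath('//a/text()').extract())        # ['first', 'second'] -- all matches, as a list
    print(response.xpath('//a/text()').extract_first())  # 'first'             -- only the first match
    print(response.xpath('//p/text()').extract_first())  # None                -- no match, no exception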

Pipeline

  • Scenarios where multiple pipelines are useful (a sketch follows this list)
    1. There are several spiders, and different pipelines handle the items of different spiders
    2. The items of one spider need several different treatments, e.g. being stored into different databases
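
    A minimal sketch of scenario 1 -- one pipeline routing items by the spider's name (the spider names and the storage steps are placeholders):

    class RouteBySpiderPipeline:
        def process_item(self, item, spider):
            # spider.name identifies which spider produced the item
            if spider.name == 'baidu':
                pass  # e.g. store into database A (placeholder)
            elif spider.name == 'example':
                pass  # e.g. store into database B (placeholder)
            return item  # always pass the item on to lower-priority pipelines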

Using pipelines

  1. Define the pipeline class in pipelines.py

    class MyspiderPipeline:
        # The method name process_item is fixed. item is the data received from the spider;
        # spider is the spider instance itself: spider.name gives its name, and
        # spider.settings.get("NAME") reads a value defined in settings. See the docs for the other methods.
        def process_item(self, item, spider):
            # process the item here
            return item  # if there are other pipelines, the item is passed on to those with lower priority
    
  2. Declare it in the settings file

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'myspider.pipelines.MyspiderPipeline': 300,
        # 'myspider.pipelines.MyspiderPipeline2': 301,
    }
    

    The smaller a pipeline's weight, the higher its priority.

Common methods of a pipeline class

  • Example: writing the received items into a JSON file (the matching settings entries are shown after the snippet)
import json

class JsonWriterPipeline:
    def open_spider(self, spider):  # runs once, when the spider is opened
        self.file = open(spider.settings.get("SAVE_FILE"), 'w')

    def close_spider(self, spider):  # runs once, when the spider is closed
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item  # without this return, lower-priority pipelines never receive the item
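
For this pipeline to take effect, the settings file needs the custom SAVE_FILE key read above and the pipeline has to be registered in ITEM_PIPELINES; a sketch (SAVE_FILE is our own setting name, not a built-in Scrapy one):

# settings.py
SAVE_FILE = 'items.jsonl'

ITEM_PIPELINES = {
    'myspider.pipelines.JsonWriterPipeline': 300,
}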

CrawlSpider

Overview

  • CrawlSpider is a subclass of Spider. Spider is designed to crawl only the pages listed in start_urls, while CrawlSpider adds rules that provide a convenient mechanism for following links, making it better suited to extracting links from crawled pages and continuing to crawl them.

Usage

  1. Create the project

    scrapy startproject <PROJECT_NAME>
    
  2. Create the spider

    cd <PROJECT_NAME>
    scrapy genspider -t crawl <SPIDER_NAME> <DOMAIN>  # spider name, domain
    
  3. Set start_urls

  4. Fill in the rules

The spider file in detail

The biggest difference between CrawlSpider and Spider is that CrawlSpider has an extra rules attribute, which defines the "extraction actions". rules can contain one or more Rule objects, and each Rule object contains a LinkExtractor object.

# -*- coding: utf-8 -*-
import scrapy
# import the CrawlSpider-related modules
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# this spider is based on the CrawlSpider class
class CrawldemoSpider(CrawlSpider):
    name = 'crawlDemo'    # the spider's name
    start_urls = ['http://www.example.com/']
    
    # link extractor: extracts the matching URLs from the pages fetched from the start URLs
    link = LinkExtractor(allow=r'')
    # the rules tuple holds the rule parsers (each wrapping a parsing rule). If several Rules match a URL, the first matching Rule in rules is used
    rules = (
        # rule parser: every page behind a link found by the link extractor is parsed with the given rule (callback)
        Rule(link, callback='parse_item', follow=True),
    )
    # parse method
    def parse_item(self, response):
        #print(response.url)
        pass

Parameters

  • LinkExtractor: the link extractor; it extracts the links in the response that match the given rules

    LinkExtractor(
        allow=r'Items/',      # links matching this regular expression are extracted; an empty pattern matches everything
        deny=...,             # links matching this regular expression are NOT extracted
        restrict_xpaths=...,  # only links inside the regions matched by this XPath are extracted
        restrict_css=...,     # only links inside the regions matched by this CSS selector are extracted
        deny_domains=...,     # domains whose links are never extracted
    )

  • Rule: the rule parser; it parses the pages behind the links produced by the link extractor according to the specified rule

    Rule(
        LinkExtractor(allow=r'Items/'),  # the link extractor to use
        callback='parse_item',           # name of the callback that parses the extracted pages
        follow=True,                     # whether to apply the link extractor again to the pages it extracted;
                                         # defaults to True when callback is None, otherwise to False
    )

  • A CrawlSpider must not define a method named parse; that method is used internally to implement the basic URL extraction.

Qiushibaike example

  1. The spider file

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from qiubaiBycrawl.items import QiubaibycrawlItem
    import re
    class QiubaitestSpider(CrawlSpider):
        name = 'qiubaiTest'
        # start URLs
        start_urls = ['http://www.qiushibaike.com/']
    
        # define the link extractor and its extraction rule
        page_link = LinkExtractor(allow=r'/8hr/page/\d+/')
        
        rules = (
            # define the rule parser; the parsing rule is given via the callback
            Rule(page_link, callback='parse_item', follow=True),
        )
    
        # the custom parsing function used by the rule parser
        def parse_item(self, response):
            div_list = response.xpath('//div[@id="content-left"]/div')
            
            for div in div_list:
                # create the item
                item = QiubaibycrawlItem()
                # extract the joke's author with an XPath expression
                item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first().strip('\n')
                # extract the joke's content with an XPath expression
                item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip('\n')

                yield item  # submit the item to the pipelines
    
  2. The items file

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class QiubaibycrawlItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        author = scrapy.Field()   # author
        content = scrapy.Field()  # content
    
  3. The pipeline file

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    class QiubaibycrawlPipeline(object):
        
        def __init__(self):
            self.fp = None
            
        def open_spider(self, spider):
            print('spider started')
            self.fp = open('./data.txt', 'w')

        def process_item(self, item, spider):
            # write the item submitted by the spider to a file for persistent storage
            self.fp.write(item['author'] + ':' + item['content'] + '\n')
            return item

        def close_spider(self, spider):
            print('spider finished')
            self.fp.close()
    

Simulated login with Scrapy

Ways to simulate a login

  • requests:

    1. Request the page with the cookies attached directly
    2. Find the login endpoint, send a POST request to it, and store the returned cookies
  • selenium:

    Locate the corresponding input elements, type the credentials, and click the login button

  • scrapy

    1. Carry the cookies directly
    2. Find the URL the login POST request goes to, attach the credentials, and send the request

How start_urls are handled

Part of the scrapy.Spider source code:

from scrapy.http import Request

    def start_requests(self):
        cls = self.__class__
        if not self.start_urls and hasattr(self, 'start_url'):
            raise AttributeError(
                "Crawling could not start: 'start_urls' not found "
                "or empty (but found 'start_url' attribute instead, "
                "did you miss an 's'?)")
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)
                
    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated: "
            "it will be removed and not be called by the default "
            "Spider.start_requests method in future Scrapy releases. "
            "Please override Spider.start_requests method instead."
        )
        return Request(url, dont_filter=True)    

The start_urls list defined in a spider is handled by start_requests; override this method when necessary.

Example: overriding start_requests to log in by carrying cookies

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/user/profile']  # placeholder URL; replace it with the real profile page
    
    # override the start_requests method
    def start_requests(self):
        # this cookie string came from my Bilibili account (the values after '=' were altered); replace it with your own cookie
        cookies = "_uuid=151A8F2AD9460689infoc; buvid3=3D303838-7355-4B505FA155825infoc; sid=12r1g; DedeUserID=7544; DedeUserID__ckMd5=26d8ad82ab; SESSDATA=a13de69%2C21379*61; bili_jct=09d7836772af9; CURRENT_FNVAL=16; rpdid=|(J~R~lmJ|l)Y)k; LIVE_BUVID=AUT8415; Hm_lvt_8a6e55dbd2870f0f5bc9194cddf32a02=1590,1505; bp_video_offset_72290544=41303; bp_t_offset_72290544=413503; CURRENT_QUALITY=64; PVID=1"
        # turn the cookie string into a dict (split on '=' only once, and strip the spaces around the keys)
        cookies = {i.split('=', 1)[0].strip(): i.split('=', 1)[1] for i in cookies.split(';')}
        yield scrapy.Request(
        	self.start_urls[0],
            callback = self.parse,
            cookies = cookies
        )
        
    def parse(self, response):
        # the actual page-parsing logic
        pass

Passing cookies between parse callbacks

  • Cookies are enabled by default in settings.py; this is the prerequisite for cookies being carried over between parse callbacks

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
  • Adding COOKIES_DEBUG = True to settings prints the cookie information in the terminal:

    [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to:

    <GET http://www.example.com>

    Cookie: _uuid=151A8F2AD9460689infoc; buvid3=3D303838-7355-4B505FA155825infoc; …

Sending POST requests

  • scrapy.Request sends GET requests; to send a POST request, use scrapy.FormRequest and carry the data to be posted in formdata.

    class ExampleSpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['https://example.com/login']
        headers = {
            "Accept": "*/*",
            "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh,q=0.4"
        }
        
        def parse(self, response):
            # other fields required by the form can be obtained via XPath, e.g.
            authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
            return scrapy.FormRequest(
                "https://example.com/session",  # depends on where the actual login request is sent
                formdata=dict(
                	login="user",
                    password="123456",
                    authenticity_token=authenticity_token
                ),
                callback=self.after_login
            )
        
        def after_login(self, response):
            # what to do after logging in
            pass
    

Automatic login with FormRequest.from_response

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/login']
    
    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,  # automatically locate the form in the response and submit to it
            formdata={"email":"user_name","password":"password"},
            callback=self.after_login
        )
        
    def after_login(self, response):
        # what to do after logging in
        pass
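
When the page contains more than one form, from_response can be told which one to submit via parameters such as formname, formid or formxpath; a sketch (the form id below is an assumption, not taken from a real site):

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,
            formxpath='//form[@id="login-form"]',  # assumed id -- pick the login form explicitly
            formdata={"email": "user_name", "password": "password"},
            callback=self.after_login
        )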

Middlewares

How to use:

Write a class, then enable it in settings.

Example:

The MyspiderDownloaderMiddleware class (generated by default when the project is created):

from scrapy import signals


class MyspiderDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

settings.py

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
}

Applications

(1) Random User-Agent
  • Method 1: add your own user agents and assign one to request.headers["User-Agent"]
  1. Declare a USER_AGENT_LIST in settings

    USER_AGENT_LIST = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"]
    
  2. Define a RandomUserAgentMiddleware class in middlewares.py

    import random
    
    class RandomUserAgentMiddleware:

        def process_request(self, request, spider):
            user_agent = random.choice(spider.settings.get("USER_AGENT_LIST"))
            request.headers['User-Agent'] = user_agent

    # to check that the random UA works, print the UA of every request in the terminal
    class CheckUserAgentMiddleware:
        def process_response(self, request, response, spider):
            print(request.headers["User-Agent"])
            return response
    
  3. Declare the middlewares in settings

    DOWNLOADER_MIDDLEWARES = {
       'myspider.middlewares.RandomUserAgentMiddleware': 543,
       'myspider.middlewares.CheckUserAgentMiddleware': 544,
    }
    
  • Method 2: use the fake_useragent package. https://fake-useragent.herokuapp.com/browsers/0.1.11

    pip install fake_useragent
    

    Usage:

    from fake_useragent import UserAgent
    
    # random User-Agent
    class RandomUserAgentMiddleware(object):
        def process_request(self, request, spider):
            request.headers.setdefault(b'User-Agent', UserAgent().random)
    
(2) Setting a proxy
  • To add a proxy, set the proxy key in the request's meta. A proxy has the form scheme://ip:port.

    class ProxyMiddleware:
        def process_request(self, request, spider):
            request.meta["proxy"] = "http://1.2.3.4:8888"
    
  • Checking that a proxy is still valid

(to be completed)

  • Proxies that require authentication

(to be completed; a rough sketch follows below)
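
A rough sketch of how an authenticated proxy could be attached, assuming the proxy uses HTTP Basic auth (the address and credentials are placeholders); basic_auth_header comes from w3lib, which is installed together with Scrapy:

    from w3lib.http import basic_auth_header

    class AuthProxyMiddleware:
        def process_request(self, request, spider):
            # placeholder proxy address and credentials -- replace with real ones
            request.meta["proxy"] = "http://1.2.3.4:8888"
            request.headers["Proxy-Authorization"] = basic_auth_header("user", "password")

Like any downloader middleware, it still has to be enabled in DOWNLOADER_MIDDLEWARES.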

A look at the settings file

  • Using the myspider project created above as an example:
# Scrapy settings for myspider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'myspider'

SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'

# log level filter for console output
LOG_LEVEL = "DEBUG"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'myspider (+http://www.yourdomain.com)'

# whether to obey the robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# default request headers
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# enable custom SPIDER_MIDDLEWARES
#SPIDER_MIDDLEWARES = {
#    'myspider.middlewares.MyspiderSpiderMiddleware': 543,
#}

# enable custom DOWNLOADER_MIDDLEWARES
#DOWNLOADER_MIDDLEWARES = {
#    'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# enable custom pipelines
#ITEM_PIPELINES = {
#    'myspider.pipelines.MyspiderPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  • Settings specific to a single spider can be declared in that spider via custom_settings, e.g.:

    class FbdataaccessSpider(scrapy.Spider):
        name = 'FbDataAccess'
        custom_settings = {
            'PROTO_FIELD_NAME': 'Es_FbPost',
        }
    
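    Such values can then be read back through the spider's settings object (or via spider.settings in pipelines and middlewares); a small sketch reusing the PROTO_FIELD_NAME key from above:

        def parse(self, response):
            # custom_settings entries are merged into self.settings
            field_name = self.settings.get('PROTO_FIELD_NAME')  # -> 'Es_FbPost'
            self.logger.info('proto field: %s', field_name)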

scrapy shell

  • scrapy shell is an interactive console that lets you try out and debug code without starting a spider; it is also useful for testing XPath expressions (see the example at the end of this section)

  • Usage

    scrapy shell https://www.baidu.com/
    
    # the terminal output after running the command
    PS D:\personal\wangzg\OSDP\src\MediaDataAccess> scrapy shell https://www.baidu.com/
    2020-07-21 11:53:48 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: MediaDataAccess)
    2020-07-21 11:53:48 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:20:19) [MSC v.1925 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
    2020-07-21 11:53:48 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2020-07-21 11:53:48 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'MediaDataAccess',
     'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',      
     'LOGSTATS_INTERVAL': 0,
     'NEWSPIDER_MODULE': 'MediaDataAccess.spiders',
     'SPIDER_MODULES': ['MediaDataAccess.spiders']}
    2020-07-21 11:53:48 [scrapy.extensions.telnet] INFO: Telnet Password: 02420bd4fbd6de7f
    2020-07-21 11:53:48 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole']
    2020-07-21 11:53:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',   
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',     
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',   
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2020-07-21 11:53:49 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'MediaDataAccess.middlewares.RandomUserAgentMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2020-07-21 11:53:49 [scrapy.middleware] INFO: Enabled item pipelines:
    ['scrapy.pipelines.files.FilesPipeline',
     'scrapy.pipelines.images.ImagesPipeline',
     'MediaDataAccess.pipelines.ProtobufSavePipeline']
    2020-07-21 11:53:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2020-07-21 11:53:49 [scrapy.core.engine] INFO: Spider opened
    2020-07-21 11:53:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x047F72B0>
    [s]   item       {}
    [s]   request    <GET https://www.baidu.com/>
    [s]   response   <200 https://www.baidu.com/>
    [s]   settings   <scrapy.settings.Settings object at 0x047F73D0>
    [s]   spider     <DefaultSpider 'default' at 0x4d61d00>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    >>> 
    

    response.url: the URL of the current response

    response.request.url: the URL of the request that produced this response

    response.headers: the response headers

    response.body: the response body

    response.request.headers: the request headers of the current response
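
    For example, XPath and CSS expressions can be tried out directly against the fetched response (only the shell shortcuts listed above are used here):

    >>> response.xpath('//title/text()').extract_first()   # first matching title text, or None
    >>> response.css('title::text').extract_first()        # the same query as a CSS selector
    >>> fetch('http://example.com')                        # fetch another URL; updates request/response
    >>> view(response)                                     # open the current response in a browser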

scrapy-redis
