Scrapy Downloader Middlewares and the Settings File

Downloader Middlewares

The downloader middleware sits between the engine and the downloader. Inside it we can set proxies, swap request headers, and so on, to get around anti-scraping measures. To write a downloader middleware, implement one or both of two methods: process_request(self,request,spider), which runs before a request is sent to the downloader, and process_response(self,request,response,spider), which runs before the downloaded response is handed back to the engine.

process_request(self,request,spider):

This method is executed by the downloader before the request is sent. It is typically where a random proxy IP, a random User-Agent, and similar settings are applied.

  1. Parameters:
    request: the Request object being sent. spider: the Spider that issued the request.
  2. Return value (a short sketch follows this list):
    Return None: Scrapy keeps processing this request, running the remaining middlewares' methods until the appropriate downloader handler is called.
    Return a Response object: Scrapy will not call any other process_request method and returns that response directly; the process_response() method of every activated middleware is still called for it.
    Return a Request object: the original request is no longer used to download data; the newly returned request is scheduled instead.
    If this method raises an exception, process_exception is called.
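A minimal sketch of these return paths (the class name and the in-memory cache dict below are hypothetical, only for illustration):

from scrapy.http import HtmlResponse

class CachedPageDownloadMiddleware(object):
    cache = {}  # hypothetical in-memory cache: url -> page text

    def process_request(self, request, spider):
        cached = self.cache.get(request.url)
        if cached:
            # Returning a Response skips the downloader; every activated
            # middleware's process_response() still runs on it.
            return HtmlResponse(url=request.url, body=cached,
                                encoding='utf-8', request=request)
        # Returning None lets Scrapy keep processing this request normally.
        return None
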
process_response(self,request,response,spider):

This method is executed while the downloaded response is on its way from the downloader back to the engine.

  1. Parameters:
    request: the Request object. response: the Response being processed. spider: the Spider object.
  2. Return value (a retry sketch follows this list):
    Return a Response object: the new response is passed on to the other middlewares and eventually to the spider.
    Return a Request object: the downloader chain is cut short and the returned request is rescheduled for download.
    If an exception is raised, the request's errback is called; if no errback is defined, the exception propagates.
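A sketch of the retry path in process_response (the class name and the status codes are only illustrative):

class RetryOnBlockDownloadMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status in (403, 503):
            # Returning a Request cuts the downloader chain short and
            # reschedules this request for another download attempt.
            return request
        # Otherwise pass the response on to the next middleware / the spider.
        return response
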
Random request header middleware:

Randomly changing the request headers can be implemented in a downloader middleware: before each request is sent to the server, pick one User-Agent at random so the crawler does not always use the same header.
User-Agent list: http://www.useragentstring.com/pages/useragentstring.php?typ=Browser

import random

class UserAgentDownloadMiddleware(object):
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
        "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
        "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
        "Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
        "Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
        "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0",
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request.
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
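For the middleware to take effect it still has to be registered in settings.py. A minimal fragment is shown below; 'myproject' is a placeholder for your own project package:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.UserAgentDownloadMiddleware': 543,
}
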
IP proxy pool middleware

芝麻代理:http://http.zhimaruanjian.com/
太阳代理:http://http.taiyangruanjian.com/
快代理:http://www.kuaidaili.com/
讯代理:http://www.xdaili.cn/
蚂蚁代理:http://www.mayidaili.com/

  1. Open proxy pool:
import random

class IPProxyDownloadMiddleware(object):
    PROXIES = [
        "ip:port",
        "ip:port",
        "ip:port",
    ]

    def process_request(self, request, spider):
        proxy = random.choice(self.PROXIES)
        print('Selected proxy: %s' % proxy)
        request.meta['proxy'] = "http://" + proxy
  2. Dedicated (authenticated) proxy:
import base64

class IPProxyDownloadMiddleware(object):
    def process_request(self, request, spider):
        proxy = 'ip:port'
        user_password = "xxxx:xxxx"
        request.meta['proxy'] = "http://" + proxy
        # Proxy-Authorization expects base64-encoded bytes of "user:password"
        b64_user_password = base64.b64encode(user_password.encode('utf-8'))
        request.headers['Proxy-Authorization'] = 'Basic ' + b64_user_password.decode('utf-8')
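A quick way to confirm the proxy is actually applied is a tiny spider that requests a page echoing the client IP (the spider below is a hypothetical sketch; httpbin.org/ip simply returns the caller's IP):

import scrapy

class IpCheckSpider(scrapy.Spider):
    name = 'ipcheck'
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        # With the proxy middleware enabled this should log the proxy's
        # IP address rather than your own.
        self.logger.info(response.text)
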
Settings configuration (settings.py)
  1. BOT_NAME: the project name.
  2. ROBOTSTXT_OBEY: whether to obey robots.txt. The framework default is False, although projects generated by scrapy startproject set it to True.
  3. CONCURRENT_ITEMS: maximum number of items processed concurrently in the pipelines. Defaults to 100.
  4. CONCURRENT_REQUESTS: maximum number of concurrent requests performed by the downloader. Defaults to 16.
  5. DEFAULT_REQUEST_HEADERS: default request headers; put headers that rarely change here.
  6. DEPTH_LIMIT: maximum crawl depth allowed. Defaults to 0, which means no limit.
  7. DOWNLOAD_DELAY: how long the downloader waits before downloading the next page. Used to throttle the crawl and reduce pressure on the server; decimal values are allowed.
  8. DOWNLOAD_TIMEOUT: the downloader's timeout.
  9. ITEM_PIPELINES: the pipelines that process items. A dict whose keys are the dotted import paths of the pipeline classes and whose values are integer priorities; lower values run first.
  10. LOG_ENABLED: whether logging is enabled. Defaults to True.
  11. LOG_ENCODING: the encoding used for the log.
  12. LOG_LEVEL: the log level. Defaults to DEBUG. Available levels: CRITICAL, ERROR, WARNING, INFO, DEBUG.
  13. USER_AGENT: the default User-Agent header. Defaults to Scrapy/VERSION (+http://scrapy.org).
  14. PROXIES: proxy list. This is a custom setting used by proxy middlewares, not one built into Scrapy.
  15. COOKIES_ENABLED: whether cookies are enabled. Usually turned off so the crawler is harder to track, but it can be enabled when the site requires it. A fragment with example values follows this list.
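A compact settings.py fragment covering several of the options above (the values and the pipeline path are only illustrative):

ROBOTSTXT_OBEY = False        # do not obey robots.txt
CONCURRENT_REQUESTS = 16      # downloader concurrency
DEPTH_LIMIT = 3               # 0 would mean unlimited depth
DOWNLOAD_DELAY = 1.5          # seconds between requests, floats allowed
DOWNLOAD_TIMEOUT = 30         # downloader timeout in seconds
COOKIES_ENABLED = False       # disable cookies to be harder to track
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,   # lower value = runs earlier
}
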
Crawling job listings from BOSS Zhipin

Bosspider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from BOS.items import BosItem
class BosspiderSpider(CrawlSpider):
    name = 'Bosspider'
    allowed_domains = ['zhipin.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python&city=100010000&industry=&position=']

    rules = (
        Rule(LinkExtractor(allow=r'.+\?query=python&page=\d+.+'), follow=True),
        Rule(LinkExtractor(allow=r'.+job_detail.+html.+'), callback='parse_job', follow=True),
    )

    def parse_job(self, response):
        info_primary = response.xpath("//div[@class='info-primary']")
        name = info_primary.xpath(".//div[@class='name']/h1/text()").get()
        salary = info_primary.xpath(".//span[@class='salary']/text()").get()
        job_info = info_primary.xpath(".//p/text()").getall()
        city = job_info[0].strip()
        work_years = job_info[1].strip()
        education = job_info[2].strip()
        company = response.xpath("//div[@class='company-info']/div[@class='info']/text()").get().strip()
        yield BosItem(name=name,salary=salary,city=city,
                      work_years=work_years,education=education,company=company)

items.py

import scrapy
class BosItem(scrapy.Item):
    name = scrapy.Field()
    salary = scrapy.Field()
    city = scrapy.Field()
    work_years = scrapy.Field()
    education = scrapy.Field()
    company = scrapy.Field()

middlewares.py

import json
import random

import requests
from scrapy import signals
from twisted.internet.defer import DeferredLock

from BOS.models import ProxyModel
class UserAgentDownloadMiddleware(object):
    USER_AGENTS=[
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
        "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
        "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
        "Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
        "Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
        "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A",
        "Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.13+ (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10",
        "Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko ) Version/5.1 Mobile/9B176 Safari/7534.48.3",
    ]

    def process_request(self,request,spider):
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent']=user_agent

class IPProxyDownloadMiddleware(object):
    PROXY_URL = ""

    def __init__(self):
        super(IPProxyDownloadMiddleware,self).__init__()
        self.current_proxy = None
        self.lock = DeferredLock()

    def process_request(self,request,spider):
        # Fetch a new proxy for the first request, or when the current one is about to expire.
        if 'proxy' not in request.meta or self.current_proxy.is_expiring:
            self.update_proxy()
        request.meta['proxy'] = self.current_proxy.proxy

    def process_response(self,request,response,spider):
        # A non-200 status or a redirect to a captcha page means the current proxy is blocked.
        if response.status != 200 or "captcha" in response.url:
            if not self.current_proxy.blacked:
                self.current_proxy.blacked = True
            print("Proxy %s has been blacklisted" % self.current_proxy.ip)
            self.update_proxy()
            # Returning the request reschedules it, so it is downloaded again with the new proxy.
            return request
        return response

    def update_proxy(self):
        self.lock.acquire()
        try:
            if not self.current_proxy or self.current_proxy.is_expiring or self.current_proxy.blacked:
                response = requests.get(self.PROXY_URL)
                text = response.text
                print('Fetched a new proxy:', text)
                result = json.loads(text)
                if len(result['data']) > 0:
                    data = result['data'][0]
                    proxy_model = ProxyModel(data)
                    self.current_proxy = proxy_model
        finally:
            # Release the lock so later calls do not deadlock.
            self.lock.release()

models.py

from datetime import datetime,timedelta

class ProxyModel(object):

    def __init__(self,data):
        self.ip = data['ip']
        self.port = data['port']
        self.expire_str = data['expire_time']
        self.blacked = False

        # expire_time is assumed to look like "YYYY-MM-DD HH:MM:SS"
        date_str, time_str = self.expire_str.split(" ")
        year, month, day = date_str.split("-")
        hour, minute, second = time_str.split(":")
        self.expire_time = datetime(year=int(year),month=int(month),day=int(day),
                                    hour=int(hour),minute=int(minute),second=int(second))
        self.proxy = "https://{}:{}".format(self.ip,self.port)

    @property
    def is_expiring(self):
        now = datetime.now()
        if (self.expire_time-now) < timedelta(seconds=5):
            return True
        else:
            return False
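For reference, a small usage sketch of ProxyModel; the sample dict and the "YYYY-MM-DD HH:MM:SS" expire_time format are assumptions based on the parsing code above:

sample = {'ip': '1.2.3.4', 'port': 8888, 'expire_time': '2019-01-01 12:00:00'}
proxy = ProxyModel(sample)
print(proxy.proxy)        # -> https://1.2.3.4:8888
print(proxy.is_expiring)  # True once fewer than 5 seconds remain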

pipelines.py

from scrapy.exporters import JsonLinesItemExporter


class BosPipeline(object):
    def __init__(self):
        self.fp = open('jobs.json','wb')
        self.exporter = JsonLinesItemExporter(self.fp,ensure_ascii=False)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self,spider):
        self.fp.close()

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for BOS project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'BOS'

SPIDER_MODULES = ['BOS.spiders']
NEWSPIDER_MODULE = 'BOS.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'BOS (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                 '(KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'BOS.middlewares.BosSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   # 'BOS.middlewares.BosDownloaderMiddleware': 543,
    'BOS.middlewares.UserAgentDownloadMiddleware':100,
    'BOS.middlewares.IPProxyDownloadMiddleware':200,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'BOS.pipelines.BosPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

start.py

from scrapy import cmdline
cmdline.execute("scrapy crawl Bosspider".split())
Selenium + Scrapy: crawling Jianshu's Ajax-loaded data

jianshuspider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from jianshu.items import JianshuItem


class JianshuspiderSpider(CrawlSpider):
    name = 'jianshuspider'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']

    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[a-z0-9]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='title']/text()").get()
        avatar = response.xpath("//a[@class='avatar']/img/@src").get()
        author = response.xpath("//span[@class='name']/a/text()").get()
        pub_time = response.xpath("//span[@class='publish-time']/text()").get().replace("*", "")
        url = response.url
        url1 = url.split("?")[0]
        article_id = url1.split('/')[-1]
        content = response.xpath("//div[@class='show-content']").get()

        word_count = response.xpath("//span[@class='wordage']/text()").get()
        comment_count = response.xpath("//span[@class='comments-count']/text()").get()
        read_count = response.xpath("//span[@class='views-count']/text()").get()
        like_count = response.xpath("//span[@class='likes-count']/text()").get()
        subjects = ",".join(response.xpath("//div[@class='include-collection']/a/div/text()").getall())

        item = JianshuItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=response.url,
            article_id=article_id,
            content=content,
            subjects=subjects,
            word_count=word_count,
            comment_count=comment_count,
            read_count=read_count,
            like_count=like_count
        )
        yield item

items.py

import scrapy


class JianshuItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()
    read_count = scrapy.Field()
    like_count = scrapy.Field()
    word_count = scrapy.Field()
    comment_count = scrapy.Field()
    subjects = scrapy.Field()

middlewares.py
If random proxy IPs are also needed, refer to Selenium's own way of adding a proxy.

import time
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
class SeleniumDownloadMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r'/home/kiosk/Desktop/chromedriver')

    def process_request(self,request,spider):
        self.driver.get(request.url)
        time.sleep(1)
        try:
            # Keep clicking the "show more" button to load the Ajax content;
            # the loop ends when the button can no longer be found and
            # find_element raises an exception.
            while True:
                showMore = self.driver.find_element_by_class_name('show-more')
                showMore.click()
                time.sleep(0.3)
                if not showMore:
                    break
        except:
            pass
        source = self.driver.page_source
        # Returning an HtmlResponse from process_request skips the downloader
        # and hands the rendered page straight back to the engine.
        response = HtmlResponse(url=self.driver.current_url,body=source,request=request,encoding='utf-8')
        return response
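The middleware above never quits the browser. A common pattern (sketched below, not part of the original code) is to build the middleware via from_crawler and hook the spider_closed signal:

from scrapy import signals
from selenium import webdriver


class SeleniumDownloadMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r'/home/kiosk/Desktop/chromedriver')

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Call spider_closed when the crawl finishes.
        crawler.signals.connect(middleware.spider_closed,
                                signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.driver.quit()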

pipelines.py

import pymysql
from pymysql import cursors
from twisted.enterprise import adbapi
class JianshuPipeline(object):
    def __init__(self):
        dbparms = {
            'host': '172.25.254.46',
            'port': 3306,
            'user': 'cooffee',
            'password': 'cooffee',
            'database': 'jianshu',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparms)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        self.cursor.execute(self.sql,(item['title'],item['content'],item['author'],item['avatar'],item['pub_time'],item['origin_url'],item['article_id']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = '''
            insert into article(title,content,author,avatar,pub_time,origin_url,article_id) values(%s,%s,%s,%s,%s,%s,%s)
            '''
        return self._sql

## Store data asynchronously with a twisted adbapi ConnectionPool.
class JianshuTwistedPipeline(object):
    def __init__(self):
        dbparms = {
            'host': '172.25.254.46',
            'port': 3306,
            'user': 'cooffee',
            'password': 'cooffee',
            'database': 'jianshu',
            'charset': 'utf8',
            'cursorclass': cursors.DictCursor
        }
        self.dbpool = adbapi.ConnectionPool('pymysql',**dbparms)
        self._sql = None

    @property
    def sql(self):
        if not self._sql:
            self._sql = '''
                insert into article(title,content,author,avatar,pub_time,origin_url,article_id) values(%s,%s,%s,%s,%s,%s,%s)
                '''
        return self._sql

    def process_item(self,item,spider):
        defer = self.dbpool.runInteraction(self.insert_item,item)
        defer.addErrback(self.handle_error,item,spider)
        return item

    def insert_item(self,cursor,item):
        cursor.execute(self.sql, (item['title'], item['content'], item['author'], item['avatar'], item['pub_time'], item['origin_url'],item['article_id']))

    def handle_error(self,error,item,spider):
        print('='*10+"error"+'='*10)
        print(error)
        print('='*10+'error'+'='*10)

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
               'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Mobile Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'jianshu.middlewares.JianshuSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   # 'jianshu.middlewares.JianshuDownloaderMiddleware': 543,
    'jianshu.middlewares.SeleniumDownloadMiddleware':200,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'jianshu.pipelines.JianshuPipeline': 300,
    'jianshu.pipelines.JianshuTwistedPipeline':300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

start.py

from scrapy import cmdline
cmdline.execute("scrapy crawl jianshuspider".split())