Setting a random request header with Scrapy middleware

路漫漫`

Article link: https://blog.csdn.net/qq_39579087/article/details/106201758

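First run scrapy startproject [project name],
then cd into the project and run scrapy genspider [spider name] "http://httpbin.org/"

This URL is used because httpbin.org's /user-agent endpoint returns only the User-Agent of the request it received, which makes the result easy to verify.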
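To make "easy to verify" concrete, here is a minimal sketch of reading that reply. The sample body is copied from the run output later in this post; extract_user_agent is a hypothetical helper name, not something Scrapy or httpbin provides.

```python
import json

# Sample body copied from the run output shown later in this post.
sample_body = '''{
  "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"
}'''


def extract_user_agent(body: str) -> str:
    """Return the User-Agent string echoed back in the JSON body (hypothetical helper)."""
    return json.loads(body)["user-agent"]


print(extract_user_agent(sample_body))
```

Because the body has a single key, comparing it with the header you tried to send is a one-line check.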

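First, look at two methods:
process_request
process_response
[Two figures of the downloader middleware chain appeared here in the original post and were not preserved (source: the web; will be removed on request).]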

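process_request

Runs before the downloader sends the request. This is usually where request headers or a proxy IP are set.
It takes two parameters: request and spider.
Return values:

None: reading the figure above from left to right, if middleware 1 returns None, the request is passed on to middleware 2.
Response: if middleware 1 returns a Response object, the request is not passed to middleware 2 and is not downloaded; the Response goes through process_response and then on to the engine.
Request: if middleware 1 returns a Request object, the remaining process_request methods are skipped and the new Request is rescheduled; it is not handed to middleware 2 in place of the old one.
Exception: if the method raises an exception, the process_exception methods of the middlewares are called.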
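The return-value rules above can be sketched with a toy model in plain Python. This is not Scrapy's implementation: FakeRequest, FakeResponse, run_process_request and the three example middlewares are all hypothetical stand-ins, written only to show how each return value steers the chain.

```python
class FakeRequest:
    """Stand-in for scrapy.Request (hypothetical, for illustration only)."""
    def __init__(self, url):
        self.url = url
        self.headers = {}


class FakeResponse:
    """Stand-in for a Scrapy Response (hypothetical)."""
    def __init__(self, url, body=""):
        self.url = url
        self.body = body


def run_process_request(middlewares, request):
    """Toy dispatcher: returns (outcome, object) for one pass through the chain."""
    for mw in middlewares:
        result = mw.process_request(request, spider=None)
        if result is None:
            continue  # hand the same request to the next middleware
        if isinstance(result, FakeResponse):
            return "response", result  # skip the rest of the chain and the download
        if isinstance(result, FakeRequest):
            return "reschedule", result  # stop the chain, queue the new request
        raise TypeError("process_request must return None, a response or a request")
    return "download", request  # every middleware returned None


class SetUserAgent:
    def process_request(self, request, spider):
        request.headers["User-Agent"] = "example-agent"
        return None


class ServeFromCache:
    def process_request(self, request, spider):
        return FakeResponse(request.url, body="cached")


class Rewrite:
    def process_request(self, request, spider):
        return FakeRequest(request.url + "?v=2")


outcome, _ = run_process_request([SetUserAgent()], FakeRequest("http://httpbin.org/user-agent"))
print(outcome)
```

Putting ServeFromCache in front of SetUserAgent shows the short-circuit: the second middleware never sees the request.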

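process_response

Runs once the data has been downloaded and is about to be handed to the engine.
It takes three parameters: request, response and spider.
Return values:

Response: if middleware 3 returns a Response object, that object (which may be a new one) is passed on to middleware 2 instead of the old Response.
Request: if middleware 3 returns a Request object, the chain stops and the Request is rescheduled to be downloaded.
Exception: if the method raises IgnoreRequest, the errback of the Request is called; if nothing handles the exception, it is ignored and not logged.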
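The response path can be modelled the same way. Again this is only a toy sketch under my own assumptions: FakeRequest, FakeResponse, run_process_response, Recorder and RetryOn503 are hypothetical names, and the real dispatching is done inside Scrapy. The point is the direction: middlewares are listed in process_request order, and the response travels back through them in reverse.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Stand-in for scrapy.Request (hypothetical)."""
    url: str


@dataclass
class FakeResponse:
    """Stand-in for a Scrapy Response (hypothetical)."""
    url: str
    status: int = 200


def run_process_response(middlewares, request, response):
    """Toy dispatcher: walk the chain backwards, from the downloader to the engine."""
    for mw in reversed(middlewares):
        result = mw.process_response(request, response, spider=None)
        if isinstance(result, FakeRequest):
            return "reschedule", result  # chain stops, request is downloaded again
        response = result  # the (possibly new) response goes to the next middleware
    return "to_engine", response


class Recorder:
    """Passes the response through and records that it was called."""
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_response(self, request, response, spider):
        self.log.append(self.name)
        return response


class RetryOn503:
    """Returns a new request when the server answered 503."""
    def process_response(self, request, response, spider):
        if response.status == 503:
            return FakeRequest(request.url)
        return response


calls = []
chain = [Recorder("mw1", calls), Recorder("mw2", calls), Recorder("mw3", calls)]
outcome, _ = run_process_response(chain, FakeRequest("http://a"), FakeResponse("http://a"))
print(outcome, calls)
```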

Code

Spider

# -*- coding: utf-8 -*-
import json

import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin replies with JSON of the form {"user-agent": "..."}
        user_agent = json.loads(response.text)['user-agent']
        print(response.text)
        # Request the same page again; dont_filter=True bypasses the duplicate filter
        yield scrapy.Request(self.start_urls[0], dont_filter=True)

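Only start_urls needs to be changed. Yielding a Request makes the spider keep requesting the same page (it loops until you stop it), and dont_filter=True keeps Scrapy from dropping the repeated request as a duplicate.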

middlewares.py

In middlewares.py, add a class and implement the process_request method described above:

import random


class HttprequsetheaderDownloaderMiddleware:
    # Add the candidate User-Agent strings here
    header = [
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        user_agent = random.choice(self.header)
        request.headers['User-Agent'] = user_agent

Since the only goal of this post is to set a request header, this one method is all that needs to be implemented.
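If you want to check the selection logic without touching the network, a sketch like the following may help. It is a variation on the class above, not the same code: the class name RandomUserAgentMiddleware, the rng parameter and the SimpleNamespace stand-in for a request are my own assumptions. A real Scrapy request carries a dedicated headers object rather than a plain dict, so this only exercises the random choice.

```python
import random
from types import SimpleNamespace

# A few of the User-Agent strings from the list above.
USER_AGENTS = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
]


class RandomUserAgentMiddleware:
    """Hypothetical variant that takes its list and random source as arguments."""

    def __init__(self, user_agents, rng=None):
        self.user_agents = list(user_agents)
        # Injecting the random source makes the choice reproducible in tests.
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.rng.choice(self.user_agents)
        return None  # let the request continue through the chain


# Stand-in for a request: only the headers mapping is needed here.
fake_request = SimpleNamespace(headers={})
middleware = RandomUserAgentMiddleware(USER_AGENTS, rng=random.Random(0))
middleware.process_request(fake_request, spider=None)
print(fake_request.headers['User-Agent'])
```

Passing in a seeded random.Random is purely a testing convenience; the class-attribute version above is enough for the spider in this post.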

settings.py

Add the following here:

DOWNLOADER_MIDDLEWARES = {
   'HttpRequsetHeader.middlewares.HttprequsetheaderDownloaderMiddleware': 543,
}

Note that the last part of the path is the name of the class written above.
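The number 543 is an ordering value: process_request methods run from the lowest number to the highest. The sketch below models that ordering in plain Python. The helper process_request_order is hypothetical, and the value 500 for the built-in UserAgentMiddleware is what I understand Scrapy's default settings to use, so check it against your installed version.

```python
# Built-in priority assumed from Scrapy's defaults; the second entry is from settings.py above.
middlewares = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'HttpRequsetHeader.middlewares.HttprequsetheaderDownloaderMiddleware': 543,
}


def process_request_order(mapping):
    """Return enabled middleware paths sorted by ascending priority (hypothetical helper).

    A value of None is treated as "disabled", mirroring how Scrapy settings
    switch a middleware off.
    """
    enabled = {path: prio for path, prio in mapping.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


for path in process_request_order(middlewares):
    print(path)
```

Under that assumption the custom class runs after the built-in one, so the header it writes is the one that reaches the downloader.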

Output

{
  "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"
}

2020-05-18 21:22:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
}

2020-05-18 21:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
}

2020-05-18 21:22:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
}

2020-05-18 21:22:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
}

2020-05-18 21:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
}

2020-05-18 21:22:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"
}

As the output shows, the User-Agent is chosen at random for each request, so the middleware works as intended.
