Setting headers and cookies in Scrapy

0. Set a global User-Agent in settings.py

The simplest option is a project-wide USER_AGENT in settings.py; it applies to every request unless overridden:

# settings.py


USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
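
Two related built-in settings are worth knowing here: COOKIES_ENABLED and COOKIES_DEBUG (the values shown are just for illustration):

COOKIES_ENABLED = True   # default True; set False to disable cookie handling entirely
COOKIES_DEBUG = True     # default False; logs the Cookie/Set-Cookie headers exchanged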

 

1. Set headers and cookies on the Request object

You can pass the headers and cookies parameters directly when creating a Request object:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    
    def start_requests(self):
        url = 'http://example.com'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        }
        cookies = {
            'name': 'value',
        }
        yield scrapy.Request(
            url,
            headers=headers,
            cookies=cookies,
            callback=self.parse,
        )
    
    def parse(self, response):
        # Your parsing code here
        pass
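
Scrapy also accepts cookies as a list of dicts, which lets you scope each cookie to a domain and path; a minimal sketch (names and values are placeholders):

# inside start_requests
cookies = [
    {'name': 'session', 'value': 'abc123',
     'domain': 'example.com', 'path': '/'},
]
yield scrapy.Request(url, cookies=cookies, callback=self.parse)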

2. Set default headers and cookies in settings.py

DEFAULT_REQUEST_HEADERS in settings.py sets default request headers that every request will use. Note that Scrapy has no built-in setting for default cookies, so the cookie entry below is a custom name that only takes effect if your own code reads it:

# settings.py

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Accept-Language': 'en',
}

# NOTE: Scrapy has no built-in COOKIES setting; this is a custom
# setting name consumed by the sketch below
DEFAULT_COOKIES = {
    'name': 'value',
}
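
A minimal sketch of one way to consume that custom setting (the DEFAULT_COOKIES name is the assumption from above, not a Scrapy built-in):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # read the custom setting defined in settings.py
        default_cookies = self.settings.getdict('DEFAULT_COOKIES')
        yield scrapy.Request('http://example.com',
                             cookies=default_cookies,
                             callback=self.parse)

    def parse(self, response):
        pass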

3. Use a downloader middleware

You can write or configure a downloader middleware to set or modify headers and cookies dynamically. A middleware gives you one place to apply more complex logic to every request and response:

# middlewares.py

class MyCustomDownloaderMiddleware:

    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
        request.cookies['name'] = 'value'
        # returning None lets the request continue through the remaining middlewares
        return None

# settings.py
# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyCustomDownloaderMiddleware': 543,
}
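
Because this middleware assigns the User-Agent header unconditionally, it overrides the value set earlier by Scrapy's built-in UserAgentMiddleware (priority 500); if you prefer, you can also disable the built-in one explicitly:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.MyCustomDownloaderMiddleware': 543,
}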

4. Random User-Agent via middleware

# middlewares.py

import random

class MyCustomDownloaderMiddleware:
    # pool of User-Agent strings to rotate through
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6"
    ]


    def process_request(self, request, spider):
        # pick a random User-Agent for each request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        request.cookies['name'] = 'value'
        return None

# settings.py
# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyCustomDownloaderMiddleware': 543,
}

Alternatively, use the third-party fake-useragent library:

# middlewares.py

"""
安装: pip install fake_useragent
"""

from fake_useragent import UserAgent


class MyCustomDownloaderMiddleware:

    def __init__(self):
        # build the UserAgent object once; constructing it per request is slow
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # a random real-world User-Agent from the library's pool
        request.headers['User-Agent'] = self.ua.random
        request.cookies['name'] = 'value'
        return None

# settings.py
# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyCustomDownloaderMiddleware': 543,
}

5. Set a proxy

# middlewares.py

class MyCustomDownloaderMiddleware:

    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
        request.cookies['name'] = 'value'
        
        # route the request through a proxy (the address is a placeholder; use a working one)
        proxy = 'http://1.71.188.37:3128'
        request.meta['proxy'] = proxy

        return None

# settings.py
# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyCustomDownloaderMiddleware': 543,
}
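
If the proxy requires authentication, Scrapy's built-in HttpProxyMiddleware accepts credentials embedded in the proxy URL (user, password and host below are placeholders):

# inside process_request
request.meta['proxy'] = 'http://user:password@proxy.example.com:8080'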

6. Extend the built-in CookiesMiddleware

You can subclass Scrapy's built-in CookiesMiddleware to inject global cookies:

# middlewares.py

from scrapy.downloadermiddlewares.cookies import CookiesMiddleware

class CustomCookiesMiddleware(CookiesMiddleware):
    def process_request(self, request, spider):
        # NOTE: this replaces any cookies already set on the request
        request.cookies = {
            'name': 'value',
        }
        return super().process_request(request, spider)

# settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'myproject.middlewares.CustomCookiesMiddleware': 700,
}
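
A variant that injects the defaults only when a request carries no cookies of its own, so per-request cookies are preserved (registered the same way as above):

class CustomCookiesMiddleware(CookiesMiddleware):
    def process_request(self, request, spider):
        # fall back to the defaults only when the request has no cookies
        if not request.cookies:
            request.cookies = {'name': 'value'}
        return super().process_request(request, spider)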

7. Turning a raw cookie string into a dict

cookies = "t=e78029a1905a443ea3d54e5a95beab80; r=566; Hm_lvt_f5329ae3e00629a7bb8ad78d0efb7273=1718169339; Hm_lpvt_f5329ae3e00629a7bb8ad78d0efb7273=1718169533"

cookie = {}
for e in cookies.split('; '):
    # split on the first '=' only, in case a value itself contains '='
    k, v = e.split('=', 1)
    cookie[k] = v
print(cookie)
print("---------------------------------------------------")

# the same conversion as a dict comprehension
cookie2 = {e.split('=', 1)[0]: e.split('=', 1)[1] for e in cookies.split('; ')}
print("dict comprehension ----->", cookie2)

Expected output (both prints show the same dict):

{'t': 'e78029a1905a443ea3d54e5a95beab80', 'r': '566', 'Hm_lvt_f5329ae3e00629a7bb8ad78d0efb7273': '1718169339', 'Hm_lpvt_f5329ae3e00629a7bb8ad78d0efb7273': '1718169533'}
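
The standard library can do the same parsing; a minimal sketch using http.cookies.SimpleCookie:

from http.cookies import SimpleCookie

sc = SimpleCookie()
sc.load(cookies)  # parses the raw Cookie header string into morsels
cookie3 = {name: morsel.value for name, morsel in sc.items()}
print(cookie3)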

8. Setting up a cookie pool

A cookie pool can be set up with the following steps:

1. In settings.py, keep COOKIES_ENABLED = True so that cookie handling is on.
2. Define the pool and its maximum size. (COOKIES_POOL and COOKIES_POOL_SIZE below are custom settings for this example, not Scrapy built-ins; they only take effect because the middleware reads them.)
3. Write a custom downloader middleware that (a) attaches a randomly chosen cookie set from the pool to each outgoing request and (b) collects cookies from responses back into the pool.
4. Register the middleware in DOWNLOADER_MIDDLEWARES.

Example:

# settings.py

COOKIES_ENABLED = True
COOKIES_POOL = []        # custom setting: initial pool of cookie dicts
COOKIES_POOL_SIZE = 10   # custom setting: maximum number of cookie sets kept

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CookiePoolMiddleware': 543,
}

# middlewares.py

import random


class CookiePoolMiddleware:

    def __init__(self, cookies_pool, pool_size):
        self.cookies_pool = cookies_pool
        self.pool_size = pool_size

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            cookies_pool=list(crawler.settings.getlist('COOKIES_POOL')),
            pool_size=crawler.settings.getint('COOKIES_POOL_SIZE', 10),
        )

    def process_request(self, request, spider):
        # attach a randomly chosen cookie set, if the pool has any
        if self.cookies_pool:
            request.cookies = random.choice(self.cookies_pool)

    def process_response(self, request, response, spider):
        # harvest cookies set by the server, up to the configured pool size
        for header in response.headers.getlist('Set-Cookie'):
            # keep only the leading name=value pair, dropping Path/Expires/etc.
            name, sep, value = header.decode('latin-1').split(';', 1)[0].partition('=')
            if sep and len(self.cookies_pool) < self.pool_size:
                self.cookies_pool.append({name.strip(): value.strip()})
        return response

The spider itself needs nothing special:

# myspider.py

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = ['http://www.example.com']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Your spider code goes here
        pass

With this in place, each outgoing request carries a randomly chosen cookie set from the pool, and cookies returned by the server are fed back into it, so successive requests do not all present the same cookies and are less likely to be identified as a bot.