How to set a random User-Agent in Scrapy


Has your crawler been identified and blocked by a server because it uses the default user-agent or a generic one (for example, from the fake_useragent package), getting responses with HTTP status codes 403, 504, or 429? This happens because default and generic user-agents have been used so widely that target servers blacklist them; once such a user-agent shows up again, the corresponding crawler requests are recognized and refused, and in severe cases the crawler server's IP is banned outright (for how to implement random proxy IPs in Scrapy, see my earlier posts on crawler proxies).
The way around this is to build your own user-agent pool and set a random user-agent on each request so the target server cannot fingerprint the crawler. Below we implement a random_useragent module that assigns a random user-agent to every request, covering both the implementation and the installation and usage instructions.
(1) First, implement random_useragent.py

#!/usr/bin/python
# -*-coding:utf-8-*-
"""Scrapy Middleware to set a random User-Agent for every Request.

Downloader Middleware which uses a file containing a list of
user-agents and sets a random one for each request.
"""

import random
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

__author__ = "Srinivasan Rangarajan"
__copyright__ = "Copyright 2016, Srinivasan Rangarajan"
__credits__ = ["Srinivasan Rangarajan"]
__license__ = "MIT"
__version__ = "0.2"
__maintainer__ = "Srinivasan Rangarajan"
__email__ = "srinivasanr@gmail.com"
__status__ = "Development"


class RandomUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self, settings, user_agent='Scrapy'):
        super(RandomUserAgentMiddleware, self).__init__()
        self.user_agent = user_agent
        user_agent_list_file = settings.get('USER_AGENT_LIST')
        if not user_agent_list_file:
            # If the USER_AGENT_LIST setting is not set,
            # use the default USER_AGENT or whatever was
            # passed to the middleware.
            ua = settings.get('USER_AGENT', user_agent)
            self.user_agent_list = [ua]
        else:
            with open(user_agent_list_file, 'r') as f:
                # One user-agent per line; skip blank lines.
                self.user_agent_list = [line.strip() for line in f if line.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        obj = cls(crawler.settings)
        crawler.signals.connect(obj.spider_opened,
                                signal=signals.spider_opened)
        return obj

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agent_list)
        if user_agent:
            request.headers.setdefault('User-Agent', user_agent)
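The per-request behavior of the middleware boils down to loading a list of user-agent strings from a file and calling `random.choice` on it. The sketch below exercises that logic in isolation, without Scrapy; the user-agent strings are hypothetical placeholders, not a curated list:

```python
import random
import tempfile

# Write a small user-agent list file (one UA per line), as the
# middleware expects for the USER_AGENT_LIST setting.
ua_lines = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(ua_lines))
    path = f.name

# Same loading logic as the middleware's __init__.
with open(path) as f:
    user_agent_list = [line.strip() for line in f if line.strip()]

# Same per-request logic as process_request.
chosen = random.choice(user_agent_list)
print(chosen in ua_lines)  # True
```

In the real middleware, `chosen` would be set via `request.headers.setdefault('User-Agent', ...)`, so an explicit User-Agent already present on a request is left untouched.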

(2) Next, implement the setup script (setup.py)

"""
Setup script for PyPI
"""
import codecs
import re
from setuptools import setup


# Get the long description from the relevant file
with codecs.open('README.rst', encoding='utf-8') as f:
    long_description = f.read()


# Open the package file so we can read the meta data.
with codecs.open('random_useragent.py', encoding='utf-8') as f:
    package_file = f.read()


def get_package_meta(meta_name):
    """Return value of variable set in the package where said variable is
    named in the Python meta format `__<meta_name>__`.
    """
    regex = "__{0}__ = ['\"]([^'\"]+)['\"]".format(meta_name)
    return re.search(regex, package_file).group(1)


version = get_package_meta('version')
author = get_package_meta('author')
email = get_package_meta('email')
license = get_package_meta('license')


setup(
    name='scrapy-random-useragent',
    version=version,

    description='Scrapy Middleware to set a random User-Agent for every Request.',
    long_description=long_description,

    author=author,
    author_email=email,
    url='https://github.com/cnu/scrapy-random-useragent',

    license=license,

    py_modules=['random_useragent'],
    platforms=['Any'],

    keywords="scrapy random user-agent ",
    classifiers=[
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'Environment :: Console',
        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Framework :: Scrapy',
    ]
)
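The `get_package_meta` helper in the setup script pulls values out of `__<name>__ = "..."` assignments with a regular expression. It can be verified in isolation against a small sample string that mirrors the metadata header of random_useragent.py:

```python
import re

# Hypothetical sample source mirroring the dunder metadata header
# of random_useragent.py.
package_file = '__version__ = "0.2"\n__license__ = "MIT"\n'


def get_package_meta(meta_name):
    """Return the value assigned to `__<meta_name>__` in the source."""
    regex = "__{0}__ = ['\"]([^'\"]+)['\"]".format(meta_name)
    return re.search(regex, package_file).group(1)


print(get_package_meta("version"))  # 0.2
print(get_package_meta("license"))  # MIT
```

This keeps the version number defined in exactly one place (the module itself) instead of being duplicated in setup.py.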

(3) Install the package

pip install scrapy-random-useragent

(4) Edit your project's settings.py and update the DOWNLOADER_MIDDLEWARES setting

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400,
}

This disables the default UserAgentMiddleware and enables RandomUserAgentMiddleware. Then add a new setting, USER_AGENT_LIST, pointing to a text file that contains all your user-agents (one user-agent per line).

USER_AGENT_LIST = "/path/to/useragents.txt"
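The file referenced by USER_AGENT_LIST is plain text with one user-agent string per line. The entries below are illustrative examples only, not a curated or up-to-date list:

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0
```

The more varied and realistic the entries in this file, the harder it is for a target server to fingerprint the crawler by user-agent alone.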

Once installed and configured, every request your spider makes will use a user-agent chosen at random from the text file.
