Is your crawler being identified and blocked by servers because it uses the default user-agent or a generic one (for example, from the fake_useragent package), with responses refused under HTTP status codes such as 403, 504, or 429? This happens because default and generic user-agents have already been used at scale, so target servers blacklist them; once one of these user-agents appears again, the corresponding crawler requests are identified and refused, and in severe cases the crawler server's IP is banned outright (for how to implement random proxy IPs in Scrapy, see the material on crawler proxies I published earlier).
The only way around this is to build your own user-agent library and set a random user-agent on each request so the target server cannot recognize the crawler. Below we implement such a random_useragent module, which assigns a random user-agent to every request; the walkthrough covers the implementation as well as installation and usage.
(1) First, implement random_useragent.py
#!/usr/bin/python
# -*-coding:utf-8-*-
"""Scrapy Middleware to set a random User-Agent for every Request.
Downloader Middleware which uses a file containing a list of
user-agents and sets a random one for each request.
"""
import random
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
__author__ = "Srinivasan Rangarajan"
__copyright__ = "Copyright 2016, Srinivasan Rangarajan"
__credits__ = ["Srinivasan Rangarajan"]
__license__ = "MIT"
__version__ = "0.2"
__maintainer__ = "Srinivasan Rangarajan"
__email__ = "srinivasanr@gmail.com"
__status__ = "Development"
class RandomUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self, settings, user_agent='Scrapy'):
        super(RandomUserAgentMiddleware, self).__init__()
        self.user_agent = user_agent
        user_agent_list_file = settings.get('USER_AGENT_LIST')
        if not user_agent_list_file:
            # If the USER_AGENT_LIST setting is not set,
            # fall back to the default USER_AGENT or whatever
            # was passed to the middleware.
            ua = settings.get('USER_AGENT', user_agent)
            self.user_agent_list = [ua]
        else:
            # One user-agent string per line in the file.
            with open(user_agent_list_file, 'r') as f:
                self.user_agent_list = [line.strip() for line in f.readlines()]

    @classmethod
    def from_crawler(cls, crawler):
        obj = cls(crawler.settings)
        crawler.signals.connect(obj.spider_opened,
                                signal=signals.spider_opened)
        return obj

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agent_list)
        if user_agent:
            request.headers.setdefault('User-Agent', user_agent)
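Before wiring the middleware into a project, you can sanity-check it in isolation. Below is a minimal sketch, assuming Scrapy is installed and that a useragents.txt file (a name chosen here for illustration) with one user-agent string per line sits in the working directory:

# Quick standalone check of the middleware, outside any Scrapy project.
# Assumes useragents.txt exists with one user-agent string per line.
from scrapy.http import Request
from scrapy.settings import Settings

from random_useragent import RandomUserAgentMiddleware

settings = Settings({'USER_AGENT_LIST': 'useragents.txt'})
middleware = RandomUserAgentMiddleware(settings)

# Each new request gets a (possibly different) randomly chosen User-Agent.
for _ in range(3):
    request = Request('https://example.com')
    middleware.process_request(request, spider=None)
    print(request.headers['User-Agent'])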
(2) Next, implement the setup script, setup.py
"""
Setup script for PyPi
"""
import codecs
import re
from setuptools import setup
# Get the long description from the relevant file
with codecs.open('README.rst', encoding='utf-8') as f:
long_description = f.read()
# Open the package file so we can read the meta data.
with codecs.open('random_useragent.py', encoding='utf-8') as f:
package_file = f.read()
def get_package_meta(meta_name):
"""Return value of variable set in the package where said variable is
named in the Python meta format `__<meta_name>__`.
"""
regex = "__{0}__ = ['\"]([^'\"]+)['\"]".format(meta_name)
return re.search(regex, package_file).group(1)
version = get_package_meta('version')
author = get_package_meta('author')
email = get_package_meta('email')
license = get_package_meta('license')
setup(
name='scrapy-random-useragent',
version=version,
description='Scrapy Middleware to set a random User-Agent for every Request.',
long_description=long_description,
author=author,
author_email=email,
url='https://github.com/cnu/scrapy-random-useragent',
license=license,
py_modules=['random_useragent'],
platforms=['Any'],
keywords="scrapy random user-agent ",
classifiers=[
'Development Status :: 4 - Beta',
'Intended Audience :: Developers',
'Environment :: Console',
'License :: OSI Approved :: MIT License',
'Operating System :: OS Independent',
'Programming Language :: Python',
'Framework :: Scrapy',
]
)
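If you are working from the source tree rather than PyPI (for example, after extending the middleware), you can install it locally instead; this assumes setup.py, random_useragent.py, and README.rst sit in the current directory:
pip install .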
(3) Installation
pip install scrapy-random-useragent
(4) Modify the settings.py configuration file, updating the DOWNLOADER_MIDDLEWARES setting
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400,
}
This disables the default UserAgentMiddleware and enables RandomUserAgentMiddleware. Then add a new setting, USER_AGENT_LIST, whose value is the path to a text file containing all your user-agents (one User-Agent per line).
USER_AGENT_LIST = "/path/to/useragents.txt"
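For illustration, useragents.txt holds plain browser identification strings, one per line; the entries below show the expected format and are examples only, not a curated list:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0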
Once installation and configuration are complete, every request the crawler sends will use a user-agent selected at random from the text file.