http://doc.scrapy.org/en/1.0/topics/practices.html#bans
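1. Rotate the User-Agent
2. Disable cookies
3. Set DOWNLOAD_DELAY to more than 2 s
4. Use Google Cache (I don't understand this one)
5. Use rotating IPs (I don't know how to do this yet)
6. Use a distributed downloader (not sure whether scrapy-redis counts)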
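Items 2 and 3 are plain settings, so they can go straight into the project's settings.py. A minimal sketch, assuming a standard Scrapy project layout; the value 2.5 is only an illustration of "more than 2 s", not a recommended figure:

```python
# settings.py (fragment)

# Item 2: do not send or store cookies, since some sites use them
# to recognise crawler behaviour.
COOKIES_ENABLED = False

# Item 3: wait between consecutive requests to the same site.
# 2.5 is an assumed example value; anything above 2 fits the note above.
DOWNLOAD_DELAY = 2.5
```

Both names are standard Scrapy settings. The delay applies per site, so crawling several domains at once is not slowed down as a whole.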
User-Agent rotation example
1) Create a new file middlewares.py with the following content, and place it in the same folder as items.py and settings.py.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random User-Agent for each outgoing request.
        ua = random.choice(self.user_agent_list)
        if ua:
            # Log the chosen value for debugging.
            spider.logger.debug('Using User-Agent: %s', ua)
            # setdefault leaves an already-set User-Agent header untouched.
            request.headers.setdefault('User-Agent', ua)

    #the default user_agent_list composes c