How to Use Downloader Middleware
Downloader Middleware sits between Scrapy's Request and Response handling: every request passes through it on the way out and every response on the way back. On the way out you can modify the request, for example adding headers; on the way back you can check whether the response came back normally (for sites such as Anjuke it often does not) and fix it directly in the middleware. Several downloader middlewares can be enabled at once; the smaller the priority number, the closer the middleware is to the engine, so its process_request() runs earlier and its process_response() runs later.
DOWNLOADER_MIDDLEWARES = {
"xiaoshuo1.middlewares.Xiaoshuo1DownloaderMiddleware": 543,
}
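For example, a minimal sketch of how the numbers control ordering (the two middleware names here are hypothetical, used only to illustrate the point made above):
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "xiaoshuo1.middlewares.HeadersMiddleware": 400,  # closer to the engine
    "xiaoshuo1.middlewares.ProxyMiddleware": 543,    # closer to the Downloader
}
# process_request() runs in increasing order: HeadersMiddleware first, then ProxyMiddleware.
# process_response() runs in decreasing order: ProxyMiddleware first, then HeadersMiddleware.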
Downloader Middleware is very powerful: changing the User-Agent, handling redirects, setting proxies, retrying failures, setting Cookies and so on are all implemented with it. Below is a detailed look at how to use it.
Core Methods
process_request(request, spider)
process_request() can be used to process a Request at any point after it is scheduled out of the queue and before the Downloader executes it. Its return value must be one of None, a Response object, or a Request object, or the method may raise an IgnoreRequest exception.
When None is returned, Scrapy continues processing the Request: the process_request() methods of the remaining Downloader Middlewares run in turn, until the Downloader executes the Request and produces a Response. This is essentially the request-modification phase: each Downloader Middleware modifies the Request in priority order before it is finally pushed to the Downloader. (In other words nothing blocks the chain, and any changes a middleware makes to the Request are kept.)
When a Response object is returned, the process_request() and process_exception() methods of lower-priority Downloader Middlewares are no longer called; instead, every Downloader Middleware's process_response() method is called in turn, after which the Response is sent straight to the Spider. (That is, returning a Response short-circuits the chain: it goes back to the engine, the engine hands it to the spider, and the remaining downloader middlewares' process_request() methods are skipped.)
When a Request object is returned, the process_request() methods of lower-priority Downloader Middlewares stop executing. The returned Request is put back into the scheduling queue as a brand-new Request and waits to be scheduled. Once the Scheduler picks it up, every Downloader Middleware's process_request() method runs again in order. (In other words the request is sent back, for example after discovering that a proxy is dead and needs replacing.)
If an IgnoreRequest exception is raised, the process_exception() methods of all Downloader Middlewares run in turn. If none of them handles the exception, the Request's errback is called. If the exception is still unhandled, it is ignored.
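A minimal sketch of a process_request() that exercises some of these return options (the meta flags proxy_dead and blocked are hypothetical, used only for illustration):
from scrapy.exceptions import IgnoreRequest

class SketchRequestMiddleware:
    # Illustrative only: shows several possible outcomes of process_request().
    def process_request(self, request, spider):
        # 1) Return None (fall through): just modify the request and let it continue.
        request.headers.setdefault('Referer', 'http://httpbin.org/')

        # 2) Return a Request to send it back to the scheduler,
        #    e.g. after a proxy turned out to be dead.
        if request.meta.pop('proxy_dead', False):
            new_request = request.copy()
            new_request.dont_filter = True
            return new_request

        # 3) Raise IgnoreRequest to drop the request entirely.
        if request.meta.get('blocked'):
            raise IgnoreRequest('request dropped by middleware')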
process_response(request, response, spider)
After the Downloader executes a Request it produces the corresponding Response, which the Scrapy engine then sends to the Spider for parsing. Before that happens, process_response() can be used to process the Response. Its return value must be a Request object or a Response object, or the method may raise an IgnoreRequest exception.
When a Request object is returned, the process_response() methods of lower-priority Downloader Middlewares are not called. The Request is put back into the scheduling queue to wait for scheduling, effectively as a brand-new Request, and then passes through every process_request() method in order again.
When a Response object is returned, the process_response() methods of lower-priority Downloader Middlewares continue to be called and keep processing that Response.
If an IgnoreRequest exception is raised, the Request's errback is called. If the exception is still unhandled, it is ignored.
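A sketch of a process_response() that retries a bad status code by returning a fresh Request (the 403 check is only an example):
class SketchResponseMiddleware:
    # Illustrative only: retry on a 403 by returning a Request, otherwise pass the Response on.
    def process_response(self, request, response, spider):
        if response.status == 403:
            retry_request = request.copy()
            retry_request.dont_filter = True  # bypass the duplicate filter
            return retry_request              # goes back to the scheduler
        return response                       # continue down the middleware chain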
process_exception(request, exception, spider)
When the Downloader or a process_request() method raises an exception, for example an IgnoreRequest exception, process_exception() is called. Its return value must be None, a Response object, or a Request object.
When None is returned, the process_exception() methods of lower-priority Downloader Middlewares continue to be called in turn until they have all been invoked.
When a Response object is returned, the process_exception() methods of lower-priority Downloader Middlewares are no longer called; instead, every Downloader Middleware's process_response() method is called in turn.
When a Request object is returned, the process_exception() methods of lower-priority Downloader Middlewares are likewise no longer called. The Request is put back into the scheduling queue to wait for scheduling, effectively as a brand-new Request, and then passes through process_request() in order again.
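A sketch of a process_exception() that reschedules the request on network errors (the exception tuple is trimmed for brevity):
from twisted.internet.error import TimeoutError, ConnectionRefusedError

class SketchExceptionMiddleware:
    # Illustrative only: on a network error, send the request back to the scheduler.
    def process_exception(self, request, exception, spider):
        if isinstance(exception, (TimeoutError, ConnectionRefusedError)):
            spider.logger.warning('retrying %s after %r', request.url, exception)
            retry_request = request.copy()
            retry_request.dont_filter = True
            return retry_request
        # returning None lets lower-priority middlewares handle the exception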
Three Ways to Set Request Headers
Create a new Scrapy project named scrapydownloadertest, enter the project, and create a Spider named httpbin:
scrapy startproject scrapydownloadertest
scrapy genspider httpbin httpbin.org
The generated httpbin spider source:
import scrapy

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['httpbin.org']

    def parse(self, response):
        pass
Change start_urls to ['http://httpbin.org/get'], then add one line of log output to parse() that prints the text attribute of the response, so we can see the Request information Scrapy sent. The modified Spider looks like this:
import scrapy

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        self.logger.debug(response.text)
Run it, and the Request information that was sent is shown:
{"args": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip,deflate,br",
"Accept-Language": "en",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "Scrapy/1.4.0 (+http://scrapy.org)"
},
"origin": "60.207.237.85",
"url": "http://httpbin.org/get"
}
The Request Scrapy sent uses the User-Agent Scrapy/1.4.0 (+http://scrapy.org), which is set by Scrapy's built-in UserAgentMiddleware.
There are two ways to modify the User-Agent of a request: change the USER_AGENT variable in settings, or change it in a Downloader Middleware's process_request() method.
The first way is very simple: just add a USER_AGENT definition to settings.py:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
The second way is more flexible and lets you set a random User-Agent. Add a RandomUserAgentMiddleware class to middlewares.py:
import random

class RandomUserAgentMiddleware():
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1',
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
To make it take effect, uncomment DOWNLOADER_MIDDLEWARES in settings.py and set it to the following:
DOWNLOADER_MIDDLEWARES = {'scrapydownloadertest.middlewares.RandomUserAgentMiddleware': 543,}
To change headers globally, set them directly in settings.py.
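For example, a minimal sketch in settings.py (the header values shown are illustrative):
# settings.py -- applies to every request made by the project
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'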
To modify headers for an individual request only:
import scrapy

class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["httpbin.org"]
    # start_urls = ["http://httpbin.org/get"]
    #
    # def parse(self, response):
    #     print(response.text)

    def start_requests(self):
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "Accept-Language": "en,zh-CN;q=0.9,zh;q=0.8",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Pragma": "no-cache",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "same-origin",
            "Sec-Fetch-User": "?1",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
            "sec-ch-ua": "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\"",
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": "\"Windows\""
        }
        # pass the headers explicitly so only this request uses them
        yield scrapy.Request('http://httpbin.org/get', callback=self.demo, headers=headers)

    def demo(self, response):
        print(response.text)
Adding a proxy to the framework:
Buy a plan on a proxy-IP site, then find the Scrapy-specific instructions in its documentation:
proxyUser, proxyPass, proxyHost and proxyPort are the account name, password, site URL and port you receive after placing the order.
Write a new class named ProxyMiddleware below (see the tunnel-proxy example near the end of this document), and change the class name registered in settings.py to this new class.
Once the proxy has been added to the framework, check that it really took effect by requesting a site that echoes the requesting IP.
In the testip spider, override start_urls with such a site and query it in a loop to see which proxy IPs are being used. You may only get one result because duplicate requests are filtered by default; to see every IP, add dont_filter=True to the Request you yield, as in the sketch below.
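A minimal sketch of such a check (http://httpbin.org/ip is used as the IP-echo site here; the spider name and loop count are arbitrary):
import scrapy

class TestIpSpider(scrapy.Spider):
    name = 'testip'

    def start_requests(self):
        # request the same echo URL several times; dont_filter=True bypasses the duplicate filter
        for _ in range(5):
            yield scrapy.Request('http://httpbin.org/ip', callback=self.parse, dont_filter=True)

    def parse(self, response):
        # each response shows the IP the request actually came from
        self.logger.info(response.text)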
Adding a proxy from the IP pool
The main change is adding the code below to the middlewares module (see the aiohttp-based ProxyMiddleware later in this section); the underlying idea is to call the pool's endpoint from a coroutine and use the proxy IP it returns.
To use the proxy-IP pool, start the pool first and then start the Redis database. After the proxy is added, verify it with the IP-echo check described above.
Adding cookies in the framework
A cookie is an identity-tracking token; whether you need one depends on the site. When you do, pass a cookies argument to the Request you yield, with the raw cookie converted into key-value pairs (the conversion website mentioned earlier has this feature), and the request will then carry the cookie, as in the sketch below.
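A minimal sketch (the cookie names and values are placeholders):
import scrapy

class CookieDemoSpider(scrapy.Spider):
    name = 'cookiedemo'

    def start_requests(self):
        # cookies is an ordinary dict of key-value pairs converted from the raw Cookie string
        cookies = {
            'sessionid': 'xxxx',
            'csrftoken': 'yyyy',
        }
        yield scrapy.Request('http://example.com/profile', callback=self.parse, cookies=cookies)

    def parse(self, response):
        self.logger.info(response.status)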
Other Notes
Creating a Scrapy project:
scrapy startproject <project name>
cd into the project directory
scrapy genspider <spider name> www.baidu.com (the site's URL)
scrapy genspider -l lists the available templates
scrapy genspider -t crawl taobao2 taobao.com
Creates a spider from the crawl template to scrape Taobao.
Creating a launcher file
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'quotes'])
quotes is the spider name; this file is created in the Scrapy project root.
CSS selectors:
response.css('.text::text').extract()
This extracts the text of every element with the class 'text'; it returns a list.
response.css('.text::text').extract_first()
This takes the first match and returns a str.
print(response.css("div span::attr(class)").extract())
This extracts an attribute value (here the class attribute of span elements inside div).
url = response.url + response.xpath('/html/body/div/div[2]/div[1]/div[1]/div/a[1]/@href').extract_first()
Basically the same usage as before; here an href is extracted and then joined onto the site's main URL.
print(response.xpath("//a[@class='tag']/text()").extract())
Takes the text inside every <a> tag whose class attribute is 'tag'.
print(response.url)
print(response.status)
Print the URL of the request and its status code.
Saving the output as JSON:
scrapy crawl quotes -o quotes.json
JSON lines storage, and other output formats:
scrapy crawl quotes -o quotes.jl
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv
Operations in pipelines.py
import pymysql
from scrapy.exceptions import DropItem
# raise DropItem('Missing Text') to discard an item

class DingdianPipeline:
    def __init__(self, username, password, db):
        self.username = username
        self.password = password
        self.db = db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            username=crawler.settings.get('MYSQL_USERNAME'),
            password=crawler.settings.get('MYSQL_PWD'),
            db=crawler.settings.get('MYSQL_DB'),
        )

    def open_spider(self, spider):
        self.client = pymysql.connect(user=self.username, password=self.password, db=self.db)
        self.cursor = self.client.cursor()

    def process_item(self, item, spider):
        book_name = item['book_name']
        book_author = item['book_author']
        # parameterized query avoids quoting and injection problems
        sql = 'insert into tests(book_name, book_author) values (%s, %s)'
        self.cursor.execute(sql, (book_name, book_author))
        self.client.commit()
        return item

    def close_spider(self, spider):
        self.client.close()
Remember to enable it in settings.py:
ITEM_PIPELINES = { 'dingdian.pipelines.DingdianPipeline': 300}
Downloader middleware recap
DownloaderMiddleware
Core methods:
process_request(self, request, spider)
Return None: keep processing this request until a Response is returned; usually used to modify the request.
Return a Response: that Response is returned directly.
Return a Request: the returned Request is put back into the scheduling queue and treated as a brand-new request.
Raise IgnoreRequest: process_exception() of each middleware is called in turn.
process_response(self, request, response, spider)
Return a Request: the returned Request is put back into the scheduling queue and treated as a brand-new request.
Return a Response: the Response keeps being processed until the end.
process_exception(request, exception, spider)
Raise IgnoreRequest: process_exception() of the remaining middlewares is called in turn.
Override the middleware to add a User-Agent to requests and change every returned status code to 201.
In settings:
DOWNLOADER_MIDDLEWARES = {'dingdian.middlewares.AgantMiddleware': 543,}
Add the header and change the status code in middlewares.py:
import random

class AgantMiddleware(object):
    def __init__(self):
        self.user_agent = ['Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0']

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agent)
        print(request.headers)

    def process_response(self, request, response, spider):
        response.status = 201
        return response
How to add a proxy IP in middlewares.py:
class UserAgentMiddleware(object):
    def __init__(self):
        self.user_agent = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0',
        ]

    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://' + '175.42.123.111:33995'
Combined with the simple version of the proxy-IP pool, the middleware is written as follows:
import logging
import redis

# Redis connection backing the simple proxy pool; the same connection settings
# are used in the retry-middleware section later in this document.
r = redis.Redis(host='127.0.0.1', port=6379, db=1, decode_responses=True)

class ProxyMiddleware(object):
    # proxypool_url = 'http://127.0.0.1:5555/random'
    logger = logging.getLogger('middlewares.proxy')

    def process_request(self, request, spider):
        if 'proxy' not in request.meta:
            request.meta['url_num'] = 0
            # each pool entry looks like "127.0.0.1:7777|0", i.e. ip|failure count
            data = r.rpop("list_name").split('|')
            ip = data[0]
            num = data[1]
            print('adding proxy IP: ' + ip)
            request.meta['proxy'] = 'http://' + ip.strip('/')
            request.meta['proxy_num'] = num
            request.meta['download_timeout'] = 8
            # request.meta.get('dont_retry', False)

    def process_response(self, request, response, spider):
        # recover the bare ip:port before pushing it back into the pool
        ip = request.meta['proxy'].replace('http://', '')
        # n = int(request.meta['proxy_num'])
        print(response.status, 'returning proxy IP to the pool', "list_name", ip + '|0')
        r.lpush("list_name", ip + '|0')
        return response
        # if response.status < 400:
        #     r.lpush("list_name", ip + '|' + '0')
        #     return response
        # else:
        #     n = n + 1
        #     if n < 3:
        #         r.lpush("list_name", ip + '|' + str(n))
        #         print('request failed,', ip + '|' + str(n))
        #     else:
        #         print('discarding ' + ip)
        #     if request.meta['url_num'] > 3:
        #         r.lpush("error_url", response.url)
        #         return response
        #     else:
        #         request.meta['url_num'] += 1
        #         return request

    def process_exception(self, request, exception, spider):
        # if isinstance(exception, self.EXCEPTIONS_TO_RETRY) and not request.meta.get('dont_retry', False):
        ip = request.meta['proxy'].replace('http://', '')
        n = int(request.meta['proxy_num'])
        n = n + 1
        if n < 10:
            r.lpush("list_name", ip + '|' + str(n))
            print('request failed, returning to pool', ip + '|' + str(n))
        else:
            print(request.meta)
            print('discarding ' + ip)
        if request.meta['url_num'] > 3:
            r.lpush("error_url", request.url)
            # give up on this URL; process_exception may only return None, a Response or a Request
            return None
        else:
            request.meta['url_num'] += 1
            return request
Combined with the earlier free proxy-IP pool, the code is as follows:
import aiohttp
import logging

class ProxyMiddleware(object):
    proxypool_url = 'http://127.0.0.1:5555/random'
    logger = logging.getLogger('middlewares.proxy')

    async def process_request(self, request, spider):
        async with aiohttp.ClientSession() as client:
            response = await client.get(self.proxypool_url)
            if not response.status == 200:
                return
            proxy = await response.text()
            self.logger.debug(f'set proxy {proxy}')
            request.meta['proxy'] = f'http://{proxy}'
Combined with the earlier cookie/account pool, the implementation is as follows:
class AuthorizationMiddleware(object):
    accountpool_url = 'http://127.0.0.1:6789/antispider7/random'
    logger = logging.getLogger('middlewares.authorization')

    async def process_request(self, request, spider):
        async with aiohttp.ClientSession() as client:
            response = await client.get(self.accountpool_url)
            if not response.status == 200:
                return
            credential = await response.text()
            authorization = f'jwt {credential}'
            self.logger.debug(f'set authorization {authorization}')
            request.headers['authorization'] = authorization
If you use the two middlewares above in Scrapy and run it on Windows:
from scrapy.cmdline import execute
import asyncio

# On Python 3.8, asyncio defaults to ProactorEventLoop on Windows, which breaks here;
# switch back to SelectorEventLoop.
# (see the CSDN post "Python3.8安装 jupyter报错 NotImplementedError_jupyter tornado报错" for background)
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

# execute(['scrapy', 'crawl', 'book'])
execute(['scrapy', 'crawl', 'app1'])
The settings:
DOWNLOADER_MIDDLEWARES = {
    # 'scrapycompositedemo.middlewares.AuthorizationMiddleware': 543,
    # 'scrapycompositedemo.middlewares.ProxyMiddleware': 544,
}
Overriding an existing middleware
In settings:
# Retry settings
RETRY_ENABLED = False
# RETRY_TIMES = 5  # set this to however many retries you want
# the following line is optional
# RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
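Note that the custom class below also checks RETRY_ENABLED and raises NotConfigured when it is False, so a common alternative (sketched here under that assumption, not taken from the original notes) is to keep RETRY_ENABLED True and instead disable the built-in middleware explicitly while registering the custom one:
# settings.py -- one possible way to swap in a custom retry middleware
RETRY_ENABLED = True
RETRY_TIMES = 5

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # disable the built-in one
    'dingdian.middlewares.MyRetryMiddleware': 550,               # hypothetical project/module path
}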
The corresponding code in middlewares.py:
import redis
from scrapy.downloadermiddlewares.retry import RetryMiddleware

r = redis.Redis(host='127.0.0.1', port=6379, db=1, decode_responses=True)

from scrapy.downloadermiddlewares.retry import RetryMiddleware, response_status_message, get_retry_request
from scrapy.exceptions import NotConfigured
import logging
from twisted.internet import defer
from twisted.internet.error import TimeoutError, DNSLookupError, \
    ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError
from urllib3.exceptions import ProtocolError, ProxyError, ProxySchemeUnknown
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError
# from versace import settings
import requests

class RetryMiddleware(object):
    # retry when one of the following exceptions occurs
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError, ConnectionRefusedError,
                           ConnectionDone, ConnectError, ConnectionLost, TCPTimedOutError,
                           ResponseFailed, IOError, TunnelError)

    def __init__(self, settings):
        '''
        Several values from settings.py are involved here:
        RETRY_ENABLED: enables the middleware, default True
        RETRY_TIMES: number of retries, default 2
        RETRY_HTTP_CODES: which returned status codes trigger a retry, a list, default [500, 503, 504, 400, 408]
        RETRY_PRIORITY_ADJUST: priority of the retried request relative to the original, default -1
        '''
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured
        self.max_retry_times = settings.getint('RETRY_TIMES')
        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')

    def process_response(self, request, response, spider):
        # a request can carry dont_retry in its meta to opt out of retrying
        if request.meta.get('dont_retry', False):
            return response
        # if the status code is in the list, call _retry() to retry
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # do your own work here, such as removing a dead proxy or logging
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        # if one of the exceptions in the list occurred, retry
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            # do your own work here, such as removing a dead proxy or logging
            return self._retry(request, exception, spider)
The get_retry_request function
# _retry() reads the maximum retry count and the retry priority adjustment, then calls get_retry_request()
def _retry(self, request, reason, spider):
    max_retry_times = request.meta.get('max_retry_times', self.max_retry_times)
    priority_adjust = request.meta.get('priority_adjust', self.priority_adjust)
    return get_retry_request(
        request,
        reason=reason,
        spider=spider,
        max_retry_times=max_retry_times,
        priority_adjust=priority_adjust,
    )
"""
读取当前重试次数和最大重试次数进行比较,
如果小于等于最大重试次数:
利用copy方法在原来的request上复制一个新request,并更新其retry_times,
并将dont_filter设为True来防止因url重复而被过滤。
如果超出最大重试次数:
记录重试失败请求量,并放弃该请求记录到logger日志中,logger级别为:error
"""
defget_retry_request(
request: Request,
*,
spider: Spider,
reason: Union[str, Exception] = 'unspecified',
max_retry_times: Optional[int] = None,
priority_adjust: Optional[int] = None,
logger: Logger= retry_logger,
stats_base_key: str= 'retry',
):
settings= spider.crawler.settings
stats= spider.crawler.stats
retry_times= request.meta.get('retry_times', 0) +1
ifmax_retry_timesisNone:
max_retry_times= request.meta.get('max_retry_times')
ifmax_retry_timesisNone:
max_retry_times= settings.getint('RETRY_TIMES')
ifretry_times<= max_retry_times:
logger.debug(
"Retrying %(request)s (failed %(retry_times)d times): %(reason)s",
{'request': request, 'retry_times': retry_times, 'reason': reason},
extra={'spider': spider}
)
new_request: Request= request.copy()
new_request.meta['retry_times'] = retry_times
new_request.dont_filter= True
ifpriority_adjustisNone:
priority_adjust= settings.getint('RETRY_PRIORITY_ADJUST')
new_request.priority= request.priority+priority_adjust
ifcallable(reason):
reason= reason()
ifisinstance(reason, Exception):
reason= global_object_name(reason.__class__)
stats.inc_value(f'{stats_base_key}/count')
stats.inc_value(f'{stats_base_key}/reason_count/{reason}')
returnnew_request
else:
stats.inc_value(f'{stats_base_key}/max_reached')
logger.error(
"Gave up retrying %(request)s (failed %(retry_times)d times): "
"%(reason)s",
{'request': request, 'retry_times': retry_times, 'reason': reason},
extra={'spider': spider},
)
returnNone
The modified version:
import time
import random

class MyRetryMiddleware(RetryMiddleware):
    logger = logging.getLogger(__name__)

    def delete_proxy(self, proxy):
        if proxy:
            # delete the proxy from the proxies pool
            pass

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # remove this proxy
            self.delete_proxy(request.meta.get('proxy', False))
            time.sleep(random.randint(3, 5))
            self.logger.warning('abnormal response, retrying...')
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            # remove this proxy
            self.delete_proxy(request.meta.get('proxy', False))
            time.sleep(random.randint(3, 5))
            self.logger.warning('connection error, retrying...')
            return self._retry(request, exception, spider)
Or:
class RetryMiddleware:
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, ResponseFailed,
                           IOError, TunnelError)

    def __init__(self, settings):
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured
        self.max_retry_times = settings.getint('RETRY_TIMES')
        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:  # the retry status codes can be customized
            reason = response_status_message(response.status)
            response.last_content = request.meta
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if (
            isinstance(exception, self.EXCEPTIONS_TO_RETRY)
            and not request.meta.get('dont_retry', False)
        ):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        max_retry_times = request.meta.get('max_retry_times', self.max_retry_times)
        priority_adjust = request.meta.get('priority_adjust', self.priority_adjust)
        request.meta['proxy'] = "xxx:xxxx"
        request.headers['Proxy-Authorization'] = "proxyauth"
        return get_retry_request(
            request,
            reason=reason,
            spider=spider,
            max_retry_times=max_retry_times,
            priority_adjust=priority_adjust,
        )
Using the Xiaoxiang tunnel proxy
import base64

proxyUser = "963053782840004608"
proxyPass = "sdwjPycR"
proxyHost = "http-short.xiaoxiangdaili.com"
proxyPort = 10010

proxyServer = "http://%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort
}
proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")

class ProxyMiddleware(object):
    # # for Python2
    # proxyAuth = "Basic " + base64.b64encode(proxyUser + ":" + proxyPass)

    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
        request.headers["Proxy-Switch-Ip"] = True
Two ways to issue a request in Scrapy
First:
import scrapy
yield scrapy.Request(begin_url, self.first)
Second:
from scrapy.http import Request
yield Request(url, self.first, meta={'thename': pic_name[0]})
Making a POST request:
from scrapy import FormRequest  # the class Scrapy provides for login-style POST requests

formdata = {'username': 'wangshang', 'password': 'a706486'}
yield scrapy.FormRequest(
    url='http://172.16.10.119:8080/bwie/login.do',
    formdata=formdata,
    callback=self.after_login,
)
Scrapy per-spider custom settings (one set of settings per spider)
In settings.py:
custom_settings_for_centoschina_cn = {
    'DOWNLOADER_MIDDLEWARES': {
        'questions.middlewares.QuestionsDownloaderMiddleware': 543,
    },
    'ITEM_PIPELINES': {
        'questions.pipelines.QuestionsPipeline': 300,
    },
    'MYSQL_URI': '124.221.206.17',
    # 'MYSQL_URI': '43.143.155.25',
    'MYSQL_DB': 'mydb',
    'MYSQL_USER': 'root',
    'MYSQL_PASSWORD': '123456',
}
The spider part:
import scrapy
from questions.settings import custom_settings_for_centoschina_cn
from questions.items import QuestionsItem
from lxml import etree

class CentoschinaCnSpider(scrapy.Spider):
    name = 'centoschina.cn'
    # allowed_domains = ['centoschina.cn']
    custom_settings = custom_settings_for_centoschina_cn