一、Spider middleware and downloader middleware
1. Downloader middleware
1 Write a class in middlewares.py
2 Define methods in the class
process_request(self, request, spider):
-return None: processing continues on to the next middleware
-return a Request object: it goes back to the engine, which puts it into the scheduler to wait to be scheduled again
-return a Response object: the engine delivers it to the spider, where the data is parsed (see the sketch after this item)
-What can you do here?
-modify request headers
-modify cookies
-add a proxy
-integrate selenium
process_response(self, request, response, spider):
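The sketch below is illustrative only (the class name and the meta flags are made up, not part of the original notes); it shows the three return paths of process_request and a pass-through process_response:

# Hypothetical downloader middleware illustrating the three possible
# return values of process_request; the meta flags are invented.
from scrapy.http import HtmlResponse

class DemoDownloaderMiddleware:
    def process_request(self, request, spider):
        # 1) return None: processing continues with the next middleware
        #    and eventually the downloader fetches the page
        if request.meta.get('passthrough'):
            return None
        # 2) return a Request: the engine puts it back into the scheduler
        #    and it waits to be scheduled again (the download is skipped)
        if request.meta.pop('reschedule', False):
            return request.replace(dont_filter=True)
        # 3) return a Response: the downloader is skipped and the engine
        #    delivers this response straight to the spider for parsing
        if request.meta.get('fake_response'):
            return HtmlResponse(url=request.url, body=b'<html></html>', request=request)
        return None

    def process_response(self, request, response, spider):
        # must return a Response, a Request, or raise IgnoreRequest
        return response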
3 To use spider or downloader middleware, it must be enabled in the settings.py config file:
SPIDER_MIDDLEWARES = {
    'crawl_cnblogs.middlewares.CrawlCnblogsSpiderMiddleware': 5,
}
DOWNLOADER_MIDDLEWARES = {
    'crawl_cnblogs.middlewares.CrawlCnblogsDownloaderMiddleware': 5,
}
2. Adding a proxy, adding headers, integrating selenium
0 In the downloader middleware's process_request method
1 Add cookies
# request.cookies['name']='lqz'
# request.cookies= {}
2 Modify headers
# request.headers['Auth']='asdfasdfasdfasdf'
# request.headers['USER-AGENT']='ssss'
3 Add a proxy (a combined sketch of items 1-3 follows below)
request.meta['proxy']='http://103.130.172.34:8080'
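Putting items 1-3 together, a minimal sketch of a process_request that sets a cookie, headers, and a proxy (the middleware class name is made up; the concrete values are the placeholders used above):

# Hypothetical middleware combining the cookie / header / proxy tweaks above
class RequestTweakDownloaderMiddleware:
    def process_request(self, request, spider):
        # add or overwrite a cookie
        request.cookies['name'] = 'lqz'
        # add or overwrite request headers
        request.headers['Auth'] = 'asdfasdfasdfasdf'
        request.headers['User-Agent'] = 'ssss'
        # route the request through a proxy (placeholder address)
        request.meta['proxy'] = 'http://103.130.172.34:8080'
        return None  # continue on to the next middleware / the downloader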
4 The fake_useragent module can generate random user-agent request headers
from fake_useragent import UserAgent
ua = UserAgent()
print(ua.ie)       # random user-agent string for some IE version
print(ua.firefox)  # random user-agent string for some Firefox version
print(ua.chrome)   # random user-agent string for some Chrome version
print(ua.random)   # random user-agent string from any browser vendor
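One way to wire fake_useragent into the downloader middleware is to assign a random User-Agent to every request. A sketch, assuming fake_useragent is installed; the class name is made up:

from fake_useragent import UserAgent

class RandomUserAgentDownloaderMiddleware:
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # pick a random browser user-agent string for this request
        request.headers['User-Agent'] = self.ua.random
        return None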
5 If process_request returns a Request object
-it is handed to the engine, which puts it into the scheduler to wait to be scheduled again
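For example (illustrative only, the meta flag is made up), a request marked for retry can be sent back to the scheduler with a different proxy:

def process_request(self, request, spider):
    if request.meta.get('retry_with_proxy'):  # hypothetical flag
        new_request = request.replace(dont_filter=True)
        new_request.meta['proxy'] = 'http://103.130.172.34:8080'  # placeholder proxy
        new_request.meta.pop('retry_with_proxy', None)  # avoid rescheduling forever
        return new_request  # handed to the engine, back into the scheduler
    return None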
6 Integrating selenium (some pages render data with JS that Scrapy will not execute; with selenium the JS runs, so the scraped data is more complete)
-as a class attribute in the spider class
driver = webdriver.Chrome(executable_path='')
-as a method in the spider class:
def close(spider, reason):
    spider.driver.close()
-in the middleware's process_request method
from scrapy.http import HtmlResponse
spider.driver.get(url=request.url)
response = HtmlResponse(url=request.url,
                        body=spider.driver.page_source.encode('utf-8'),
                        request=request)
return response
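A consolidated sketch of item 6 below; the spider name, URL, and middleware name are illustrative, and the driver is built without executable_path for simplicity (pass your own path or Service as your selenium version requires):

# spider side, e.g. in spiders/cnblogs.py (names are placeholders)
import scrapy
from selenium import webdriver

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    start_urls = ['https://www.cnblogs.com/']
    # class attribute: one shared browser instance for the whole spider
    driver = webdriver.Chrome()

    def close(spider, reason):
        # called when the spider finishes; shut the browser down
        spider.driver.close()

    def parse(self, response):
        pass

# middleware side, in middlewares.py
from scrapy.http import HtmlResponse

class SeleniumDownloaderMiddleware:
    def process_request(self, request, spider):
        # let the real browser load the page so the JS executes
        spider.driver.get(request.url)
        # wrap the rendered page in a response; returning it skips the downloader
        return HtmlResponse(url=request.url,
                            body=spider.driver.page_source.encode('utf-8'),
                            request=request)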
Example:
middlewares.py
from scrapy import signals

# Spider middleware
class MyfirstCrawlSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
# 下载中间件
class MyfirstCrawlDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s