Using CrawlSpider
scrapy genspider -t crawl spider_name allowed_domain
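For example, the CrawlSpider shown below could be generated inside an existing Scrapy project (the name gt and domain guokr.com match the code that follows):

scrapy genspider -t crawl gt guokr.com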
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GtSpider(CrawlSpider):
    name = 'gt'
    allowed_domains = ['guokr.com']
    start_urls = ['https://www.guokr.com/ask/highlight']

    rules = (
        # Follow pagination links; their responses are fed back through the rules
        Rule(LinkExtractor(allow=r'/ask/highlight/\?page=\d+'), follow=True),
        # Extract detail-page URLs; their responses go to parse_item
        Rule(LinkExtractor(allow=r'question/\d+/'), callback='parse_item'),
    )
- The parse() method is reserved: CrawlSpider uses parse() internally to apply the rules, so the developer must not override it.
- Rule(LinkExtractor(allow=r'/ask/highlight/\?page=\d+'), follow=True, callback='parse_item')
- LinkExtractor: defines the pattern (here a regex passed to allow=) used to extract URLs from each response
- callback: URLs matching the rule are requested automatically, and the resulting responses are handed to this callback for parsing (see the parse_item sketch after this list)
- follow: URLs matching the rule are requested automatically, and the resulting responses are matched against the rules again to extract further links
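The class above names parse_item as its callback. A minimal sketch of that method, assuming hypothetical CSS selectors and item fields (the real ones depend on the page markup):

    # Inside the GtSpider class above
    def parse_item(self, response):
        # Hypothetical selectors -- adjust to the actual page structure
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }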
Using downloader middleware
- Define a class in middlewares.py that implements these two methods:
- process_request(request, spider)
- process_response(request, response, spider)
class DemoMiddleware:
    def process_request(self, request, spider):
        # Route every outgoing request through a local proxy
        request.meta["proxy"] = "http://127.0.0.1:8000"

    def process_response(self, request, response, spider):
        # Log the User-Agent that was actually sent with the request
        print(request.headers["User-Agent"], "*" * 100)
        return response
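A downloader middleware only takes effect after it is registered in settings.py. A minimal sketch, assuming the project package is named myproject (the dotted path and the priority number are placeholders):

# settings.py -- dotted path and priority are placeholders for your own project
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DemoMiddleware": 543,
}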
Return values of the two methods
process_request(request, spider)
- return None: processing continues; the request is passed on to the remaining middlewares and then downloaded
- return a Request object: the current request is intercepted; the returned request is handed to the scheduler
- return a Response object: the current request is intercepted; the returned response is sent to the spider for parsing
process_response(request, response, spider)
- return a Request object: the request is handed back to the scheduler and placed in the request queue
- return a Response object: the response is passed to the remaining middlewares and then on to the spider for parsing
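Both sets of return paths can be illustrated in one small middleware. The sketch below is illustrative only: the blocked domain, the 503 retry rule and the class name are hypothetical, chosen just to show each return value in action.

from scrapy.http import HtmlResponse

class ReturnValueDemoMiddleware:
    def process_request(self, request, spider):
        # Returning a Response skips the download; the response heads to the spider
        if "blocked.example.com" in request.url:  # hypothetical domain
            return HtmlResponse(url=request.url, body=b"", encoding="utf-8", request=request)
        # Returning None lets the request continue to the next middleware
        return None

    def process_response(self, request, response, spider):
        # Returning a Request hands it back to the scheduler to be re-queued
        if response.status == 503:
            return request.replace(dont_filter=True)
        # Returning the Response passes it on towards the spider
        return response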
Applications (a combined sketch covering the first three items follows after this list):
- Setting the User-Agent
  def process_request(self, request, spider):
      request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)  # USER_AGENT_LIST: a prepared list of UA strings
- Setting a proxy
  def process_request(self, request, spider):
      request.meta['proxy'] = 'http://192.168.1.1:80'
- Setting cookies (mainly to get around anti-crawling measures)
  def process_request(self, request, spider):
      request.cookies = cookies  # e.g. picked at random from a cookie pool
- Integrating Selenium into Scrapy
from selenium import webdriver
from scrapy.http import HtmlResponse

# The pipeline manages the browser's lifecycle for the target spider
class DemoPipeline:
    def open_spider(self, spider):
        if spider.name == "itcast":
            spider.driver = webdriver.Chrome()

    def close_spider(self, spider):
        if spider.name == "itcast":
            spider.driver.quit()

# The middleware renders pages with the browser and short-circuits the download
class DemoMiddleware:
    def process_request(self, request, spider):
        if spider.name == "itcast":
            spider.driver.get(request.url)
            return HtmlResponse(body=spider.driver.page_source,
                                request=request,
                                encoding="utf-8",
                                url=spider.driver.current_url)
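The first three applications are often combined into one middleware that draws from pre-built pools. The sketch below is not from the original notes; USER_AGENT_LIST, PROXY_LIST and COOKIE_POOL are hypothetical names for lists you would maintain yourself (e.g. in settings.py):

import random

# Hypothetical pools -- in practice these would live in settings.py or a
# separate module and be kept up to date
USER_AGENT_LIST = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...", "Mozilla/5.0 (X11; Linux x86_64) ..."]
PROXY_LIST = ["http://192.168.1.1:80", "http://192.168.1.2:80"]
COOKIE_POOL = [{"session": "abc"}, {"session": "def"}]

class RandomRequestMiddleware:
    def process_request(self, request, spider):
        # Rotate UA, proxy and cookies on every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
        request.meta['proxy'] = random.choice(PROXY_LIST)
        request.cookies = random.choice(COOKIE_POOL)
        return None

Like any downloader middleware, this only runs once it is registered in DOWNLOADER_MIDDLEWARES, as shown earlier.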