I. Introduction to Scrapy
What is Scrapy?
Scrapy is an application framework written for crawling websites and extracting structured data. We only need to implement a small amount of code to start crawling quickly. Scrapy is built on the Twisted asynchronous networking framework, which greatly speeds up downloads.
Official Scrapy documentation
Scrapy at a glance — Scrapy 1.0.5 documentation: http://scrapy-chs.readthedocs.io/zh_CN/1.0/intro/overview.html
The difference between asynchronous and non-blocking
Asynchronous: after a call is issued, it returns immediately, whether or not a result is available yet
Non-blocking: describes the state of the program while it waits for the result of a call; until the result can be obtained, the call does not block the current thread
II. Scrapy Workflow
III. Getting Started with Scrapy
1. Create a Scrapy project
scrapy startproject mySpider
cd mySpider
2. Generate a spider
scrapy genspider demo "demo.cn"
3. Extract data
Flesh out the spider, using XPath and similar selectors to pull out the fields you need (see the spider sketch at the end of this section)
4. Save the data
Save the data in a pipeline
Run the spider from the command line
scrapy crawl qb  # qb is the spider's name
Run the spider from PyCharm
from scrapy import cmdline
cmdline.execute("scrapy crawl qb".split())
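Putting steps 2–4 together, here is a minimal sketch of what the generated demo spider might look like after filling in step 3; the page structure and XPath expressions below are placeholder assumptions, not a real site:
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['demo.cn']
    start_urls = ['http://demo.cn/']

    def parse(self, response):
        # extract data with XPath; the selectors are placeholders
        for li in response.xpath('//ul[@class="list"]/li'):
            item = {}
            item['title'] = li.xpath('./a/text()').get()
            item['link'] = li.xpath('./a/@href').get()
            yield item  # yielded items are handed to the pipelines for saving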
IV. Using Pipelines
From the dict form of the ITEM_PIPELINES setting you can see that there can be more than one pipeline, and indeed multiple pipelines can be defined (a minimal sketch follows the notes below).
Why multiple pipelines may be needed:
1. There may be multiple spiders, and different pipelines handle the items of different spiders
2. The items of one spider may need different operations, such as being stored in different databases
Note:
1. The smaller a pipeline's weight value, the higher its priority
2. The process_item method of a pipeline must not be renamed to anything else
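A minimal sketch of a pipelines.py with two pipelines and how they could be enabled in settings.py; the class names, the project name mySpider, and the cleaning logic are illustrative assumptions:
# pipelines.py
class CleanPipeline:
    def process_item(self, item, spider):
        # runs first because it has the lower weight below
        item['title'] = (item.get('title') or '').strip()
        return item  # must return the item so the next pipeline receives it

class SavePipeline:
    def process_item(self, item, spider):
        if spider.name == 'demo':  # different spiders can be handled differently
            print('saving:', item)
        return item

# settings.py
ITEM_PIPELINES = {
    'mySpider.pipelines.CleanPipeline': 300,
    'mySpider.pipelines.SavePipeline': 400,
}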
How to paginate
Key points of scrapy.Request
scrapy.Request(url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None, flags=None)
Commonly used parameters (a pagination sketch follows this list):
callback: specifies which parse function the response for this URL is handed to
meta: passes data between different parse functions; by default meta also carries some information of its own, such as the download delay and request depth
dont_filter: tells Scrapy's deduplication not to filter this URL; Scrapy deduplicates URLs by default, so this matters for URLs that need to be requested repeatedly
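A small sketch of pagination inside a spider's parse method; the list XPath and the "next page" link XPath are assumptions about the target page:
def parse(self, response):
    for li in response.xpath('//ul[@class="list"]/li'):
        yield {'title': li.xpath('./a/text()').get()}
    # find the "next page" link and request it with the same callback
    next_url = response.xpath('//a[contains(text(), "下一页")]/@href').get()
    if next_url:
        yield scrapy.Request(
            url=response.urljoin(next_url),
            callback=self.parse,                               # which parse function handles the next page
            meta={'page': response.meta.get('page', 1) + 1},   # pass data to the next callback
        )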
V. Using Items
1. First define the fields you want to use in the items module
items.py
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    position = scrapy.Field()
    date = scrapy.Field()
2. Import the class from the items module in the spider file
from scrapy.http.response.html import HtmlResponse
# the name '21' makes this package path invalid -- first option: absolute import
# from 21.MySpider.MySpider.items import MyspiderItem
# second option: import relative to the project root you configured
# from MySpider.items import MyspiderItem
3. Instantiate the item; once the object is created, the fields defined on the item can be used
item = MyspiderItem()
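For example, using the TencentItem defined in items.py above, a spider callback might fill the fields and yield the item like this; the table XPath expressions are placeholder assumptions:
def parse(self, response):
    for tr in response.xpath('//table[@class="tablelist"]//tr')[1:]:
        item = TencentItem()
        item['title'] = tr.xpath('./td[1]/a/text()').get()
        item['position'] = tr.xpath('./td[2]/text()').get()
        item['date'] = tr.xpath('./td[5]/text()').get()
        yield item  # delivered to the pipelines, just like a plain dict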
VI. Scrapy settings: Explanation and Configuration
Why a configuration file is needed:
The configuration file holds shared variables (such as the database address, account, and password)
It makes them easy for you and others to modify
Variable names are conventionally written in all uppercase, e.g. SQL_HOST = '192.168.0.1'
Detailed notes on the settings file: Scrapy学习篇(八)之settings - cnkai - 博客园 https://www.cnblogs.com/cnkai/p/7399573.html
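A small sketch of defining a shared variable in settings.py and reading it back inside a spider; the variable names here are illustrative assumptions:
# settings.py
SQL_HOST = '192.168.0.1'
SQL_USER = 'root'

# inside a spider (a pipeline can read the same values via spider.settings)
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def parse(self, response):
        host = self.settings.get('SQL_HOST')  # read the shared variable
        self.logger.info('connecting to %s', host)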
VII. Scrapy CrawlSpider
In the code so far, we spend a large part of our time hunting for the URL of the next page or the URLs of the detail pages. Can this process be made simpler?
Approach:
1. Extract the URLs of all the matching link tags from the response
2. Automatically build the requests and send them to the engine
Goal: learn how to use CrawlSpider through a spider example
Command to generate a CrawlSpider: scrapy genspider -t crawl <spider_name> <domain>
The LinkExtractor link extractor
With LinkExtractor the programmer no longer has to extract the desired URLs and send the requests manually. That work is handed to the LinkExtractor, which finds the URLs matching its rules on every crawled page, so they are crawled automatically.
class scrapy.linkextractors.LinkExtractor(
allow = (),
deny = (),
allow_domains = (),
deny_domains = (),
deny_extensions = None,
restrict_xpaths = (),
tags = ('a','area'),
attrs = ('href',),
canonicalize = True,
unique = True,
process_value = None
)
Main parameters:
- allow: allowed URLs. Every URL matching this regular expression is extracted.
- deny: forbidden URLs. Any URL matching this regular expression is not extracted.
- allow_domains: allowed domains. Only URLs on the domains listed here are extracted.
- deny_domains: forbidden domains. URLs on the domains listed here are never extracted.
- restrict_xpaths: restrict extraction to these XPaths; used together with allow to filter links.
The Rule class
The class that defines the spider's crawling rules
class scrapy.spiders.Rule(
link_extractor,
callback = None,
cb_kwargs = None,
follow = None,
process_links = None,
process_request = None
)
Main parameters:
- link_extractor: a LinkExtractor object that defines the crawling rule.
- callback: the callback function to run for URLs that match this rule. Because CrawlSpider itself uses parse internally, do not override parse or use it as your own callback.
- follow: whether links extracted from responses matched by this rule should themselves be followed.
- process_links: the links obtained from link_extractor are passed to this function, which can be used to filter out links you do not want to crawl.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class YgSpider(CrawlSpider):
    name = 'yg'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=0']
    rules = (
        Rule(LinkExtractor(allow=r'wz.sun0769.com/html/question/201811/\d+\.shtml'), callback='parse_item'),
        Rule(LinkExtractor(allow=r'http:\/\/wz.sun0769.com/index.php/question/questionType\?type=4&page=\d+'), follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['content'] = response.xpath('//div[@class="c1 text14_2"]//text()').extract()
        print(item)
VIII. Simulating Login with Scrapy
1. Send the request to the target URL directly, carrying cookies
import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['qq.com']
    start_urls = ['https://user.qzone.qq.com/1097566154']

    # carry the cookies with the first request
    def start_requests(self):
        cookies = '_ga=GA1.2.1725491264.1617184478; pgv_pvid=5477592008; RK=364ATneaHc; ptcz=0a313f2bdd6331665eddc0e722991f9859878a5008817e4a960d5b4ab99357f7; tvfe_boss_uuid=18d15aea2302cf32; iip=0; pac_uid=0_fc2f75bbcf77a; o_cookie=1097566154; eas_sid=e1o6J339o5u7k4V9d3E72580y9; luin=o1097566154; lskey=00010000c14c93171ecc8364633b0f79398985d1df1cac66eafe92bc2524780c1333df21007c48468c0bdc77; LW_sid=s1x6k4X6S2W2j8G0D9v3T8d8H1; LW_uid=Y1V6N426D2W2r8j0v9U3z8h806; _qpsvr_localtk=0.6613989595256926; pgv_info=ssid=s139336820; ptui_loginuin=1097566154; uin=o1097566154; skey=@EOsiO4BQa; p_uin=o1097566154; pt4_token=p5oqxzXFq4WDD6xe7oVRqdnBMBYtaPXs*dOapmDmY7c_; p_skey=IlLf3cuGBcwnXTrY4g7B0eupYMsFK97XUU*Fk8IDKhg_; Loading=Yes; qz_screen=1920x1080; 1097566154_todaycount=0; 1097566154_totalcount=13974; QZ_FE_WEBP_SUPPORT=1; cpu_performance_v8=0; __Q_w_s__QZN_TodoMsgCnt=1'
        # build the dict scrapy.Request expects; split only on the first '=' so values containing '=' stay intact
        cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies.split('; ')}
        # headers = {
        #     'cookies': cookies
        # }
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse,
            # headers=headers,
            cookies=cookies
        )

    def parse(self, response):
        with open('qzone.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
2. Send a POST request to the target URL, carrying data (account and password)
import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # the login form carries hidden fields that must be posted back
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').get()
        login = "1097566154@qq.com"
        password = "wq15290884759."
        timestamp = response.xpath('//input[@name="timestamp"]/@value').get()
        timestamp_secret = response.xpath('//input[@name="timestamp_secret"]/@value').get()
        data = {
            "commit": "Sign in",
            "authenticity_token": authenticity_token,
            "login": login,
            "password": password,
            "webauthn-support": "supported",
            "webauthn-iuvpaa-support": "unsupported",
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret,
        }
        # send the POST request carrying the form data
        yield scrapy.FormRequest(
            url='https://github.com/session',
            formdata=data,
            callback=self.after_login
        )

    def after_login(self, response):
        with open('github.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
3. Simulate login with Selenium (locate the input fields and the login button). In the downloader middleware:
import time
from selenium import webdriver
from scrapy.http import HtmlResponse

class SeleniumMiddleware:
    def process_request(self, request, spider):
        url = request.url
        print(url)
        driver = webdriver.Chrome()
        driver.get(url)
        time.sleep(2)
        # fill in the username and password, then click the login button
        driver.find_element_by_css_selector('#user_login').send_keys('1097566154@qq.com')
        driver.find_element_by_css_selector('#user_password').send_keys('wq15290884759.')
        driver.find_element_by_css_selector('#new_user > div > div > div > div:nth-child(4) > input').click()
        html = driver.page_source
        # returning a Response here skips the real download for this request
        return HtmlResponse(url=request.url,
                            body=html,
                            request=request,
                            encoding='utf-8',
                            status=200)
IX. Saving Images and Files with Scrapy's Built-in Pipelines
Scrapy provides reusable item pipelines for downloading the files referenced in an item. These pipelines share some methods and structure; in general you will use either the Files Pipeline or the Images Pipeline.
The Files Pipeline for downloading files
To download files with the Files Pipeline, follow these steps:
- Define an Item with two fields, file_urls and files. file_urls stores the URLs of the files to download and must be a list.
- When the downloads finish, information about them is stored in the item's files field, such as the download path, the source URL, and the file checksum.
- In settings.py configure FILES_STORE, which sets the download directory.
- Enable the pipeline: add scrapy.pipelines.files.FilesPipeline: 1 to ITEM_PIPELINES.
The Images Pipeline for downloading images
To download images with the Images Pipeline (see the sketch after this list):
- Define an Item with two fields, image_urls and images. image_urls stores the URLs of the images to download and must be a list.
- When the downloads finish, information about them is stored in the item's images field, such as the download path, the source URL, and the image checksum.
- In settings.py configure IMAGES_STORE, which sets the download directory.
- Enable the pipeline: add scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.
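A minimal sketch of those steps for the Images Pipeline; the item class name is an assumption, and the Files Pipeline works the same way with file_urls/files and FILES_STORE. Note that ImagesPipeline requires the Pillow library to be installed:
# items.py
import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()   # list of image URLs to download
    images = scrapy.Field()       # filled in by the pipeline after downloading

# settings.py
IMAGES_STORE = './images'         # download directory
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# in a spider callback, yield an item whose image_urls is a list, e.g.
# yield ImageItem(image_urls=[img_src])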
X. Scrapy Downloader Middleware
Downloader middleware is provided by Scrapy so that Requests and Responses can be modified during the crawl; it is used to extend Scrapy's functionality.
Usage:
Writing a downloader middleware is like writing a pipeline: define a class, then enable it in settings;
A downloader middleware has two default methods, one for processing requests and one for processing responses:
process_request(self, request, spider):
called for every request that passes through the downloader middleware
process_response(self, request, response, spider):
called when the downloader has finished the HTTP request and is passing the response to the engine
process_request(request, spider)
Called for every Request object that passes through the downloader middleware; the higher a middleware's priority, the earlier it is called. The method should return one of: None, a Response object, a Request object, or raise IgnoreRequest.
- Return None: Scrapy continues running the corresponding methods of the other middlewares;
- Return a Response object: Scrapy does not call the process_request methods of the other middlewares and does not start the download, but returns that Response directly
- Return a Request object: Scrapy does not call the process_request() methods of the other middlewares, and instead puts the Request into the scheduler to be downloaded later
- If this method raises an exception, process_exception is called
process_response(request, response, spider)
Called for every Response that passes through the downloader middleware; the higher a middleware's priority, the later it is called, the opposite of process_request(). The method returns one of: a Response object, a Request object, or raises IgnoreRequest (a small retry sketch follows this list).
- Return a Response object: Scrapy continues calling the process_response methods of the other middlewares;
- Return a Request object: the middleware chain stops and the Request is put into the scheduler to be downloaded later;
- Raise IgnoreRequest: Request.errback is called to handle it; if nothing handles it, it is ignored and not even logged.
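A small sketch of these return values in practice, assuming a middleware that re-schedules requests whose responses look blocked; the status codes chosen are only an illustration:
class RetryBlockedMiddleware:
    def process_response(self, request, response, spider):
        if response.status in (403, 503):
            # returning a Request stops the chain and puts the URL back into the scheduler
            return request.replace(dont_filter=True)
        # returning the Response lets the remaining middlewares and the spider see it
        return response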
Setting a random request header
When a crawler visits a page frequently with a request that never changes, the server can easily notice it and ban that request header. So before visiting the page we randomly change the request header, which helps keep the crawler from being caught. Randomly changing the request header can be done in a downloader middleware: before the request is sent to the server, a request header is picked at random, so the same one is not used every time.
In the middlewares.py file:
import random

class RandomUserAgent(object):
    def process_request(self, request, spider):
        # pick a random User-Agent from the USER_AGENTS list in settings
        useragent = random.choice(spider.settings['USER_AGENTS'])
        request.headers['User-Agent'] = useragent

class CheckUserAgent(object):
    def process_response(self, request, response, spider):
        print(request.headers['User-Agent'])
        return response
USER_AGENTS = [ "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)", "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)", "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)", "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0", "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5" ]
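The USER_AGENTS list above belongs in settings.py, and the two middlewares still have to be enabled there; a sketch, assuming the project is named mySpider:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'mySpider.middlewares.RandomUserAgent': 543,
    'mySpider.middlewares.CheckUserAgent': 600,
}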
XI. Scrapy Downloader Middleware + Selenium
An example of crawling NetEase News (163.com) with Scrapy + Selenium:
Spider file:
import scrapy
from copy import deepcopy
from selenium import webdriver

class WySpider(scrapy.Spider):
    name = 'wy'
    allowed_domains = ['163.com']
    start_urls = ['https://news.163.com/']
    model_urls = []
    # load the browser driver once for the whole spider
    driver = webdriver.Chrome()

    def parse(self, response):
        li_list = response.xpath('//div[@class="ns_area list"]/ul/li')
        # pick out only the sections we want
        li_index = [2, 3, 5, 6]
        for index in li_index:
            li = li_list[index]
            item = {}
            item['大分类'] = li.xpath('./a/text()').get()
            url = li.xpath('./a/@href').get()
            # print(item, url)
            self.model_urls.append(url)
            # print(self.model_urls)
            yield scrapy.Request(
                url=url,
                callback=self.parse_html,
                meta={'item': deepcopy(item)}
            )

    def parse_html(self, response):
        item = response.meta['item']
        # print(item)
        """List page"""
        # this response was intercepted and rebuilt by the downloader middleware
        # print(response.text)
        # match elements whose class attribute contains a given string
        div_list = response.xpath('//div[contains(@class, "data_row")]')
        for div in div_list:
            detail_title = div.xpath('.//h3/a/text()').extract_first()
            detail_url = div.xpath('.//h3/a/@href').extract_first()
            # item = {}
            item['title'] = detail_title
            item['url'] = detail_url
            print(item)
            yield scrapy.Request(
                url=detail_url,
                callback=self.parse_detail,
                meta={'item': deepcopy(item)}  # copy so each detail request carries its own data
            )

    def parse_detail(self, response):
        """Detail page"""
        print(response.text)

    @staticmethod
    def close(spider, reason):
        # shut down the shared browser when the spider closes
        spider.driver.quit()
Middleware file:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
import time
from scrapy.http.response.html import HtmlResponse
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class SeSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class SeDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class WangYiDownloaderMiddleware:
    def process_request(self, request, spider):
        '''Intercept the requests for the four news sections'''
        # print(spider.model_urls)
        # print(request.url, "request.url")
        if request.url in spider.model_urls:
            # these requests should be intercepted and handled with selenium
            driver = spider.driver
            driver.get(request.url)
            time.sleep(3)
            # record the current height of the page
            current_height = driver.execute_script("return document.body.scrollHeight;")
            # keep scrolling down
            while True:
                # scroll to the bottom
                driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
                time.sleep(3)
                new_height = driver.execute_script("return document.body.scrollHeight;")
                if new_height == current_height:
                    break
                current_height = new_height
            # the page has now been scrolled all the way to the bottom
            try:
                driver.find_element_by_xpath('//div[@class="post_addmore"]/span').click()
                time.sleep(2)
            except Exception:
                pass
            # return the page obtained so far as the response
            return HtmlResponse(url=driver.current_url, body=driver.page_source, request=request, encoding='utf-8')
This makes it convenient to handle data that is loaded dynamically via AJAX.
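The custom middleware also has to be enabled in settings.py for the interception to happen; a sketch, where the project/module name se is only an assumption taken from the class names above:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'se.middlewares.WangYiDownloaderMiddleware': 543,
}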