Crawler-06: Scrapy Framework
Scrapy framework basics
1. Introduction to the Scrapy framework
- Scrapy is an application framework written for crawling websites and extracting structured data; with only a small amount of our own code it can crawl pages quickly.
- Scrapy is built on the Twisted asynchronous networking framework, which speeds up downloads.
- Asynchronous: once a call is issued it returns immediately, whether or not a result is available yet.
- Non-blocking: describes the state of the program while it waits for a call's result; the call does not block the current thread before the result is ready.
2. How the Scrapy framework works
Component | Role |
---|---|
Engine | The commander-in-chief: passes data and signals between the other components |
Scheduler | A queue that stores the requests sent over by the engine |
Downloader | Downloads the requests handed over by the engine and then sends the responses back to the engine |
Spider | Processes the responses handed over by the engine, extracts data and URLs, and hands them back to the engine |
Pipeline | Processes the data handed over by the engine, e.g. stores it |
Downloader middleware | Customizable download extensions, e.g. setting a proxy |
Spider middleware | Can customize requests and filter requests |
3. Getting started with Scrapy
- Create a Scrapy project
scrapy startproject project_name
- Create a spider
scrapy genspider spider_name demo.com
- Run the spider
scrapy crawl spider_name
Notes:
1. If you use the commands above in a terminal, first cd into the folder where the Scrapy project should live, then run the steps above.
2. If you work inside an IDE, run the creation commands in the IDE's integrated terminal (Terminal). To run the spider from the IDE, create a .py file at the top level of the Scrapy project.
Put the following into that .py file; running the file runs the spider:
from scrapy import cmdline
cmdline.execute(['scrapy', 'crawl', 'spider_name'])
or, equivalently:
cmdline.execute("scrapy crawl spider_name".split())
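For reference, scrapy startproject project_name generates a layout roughly like the sketch below (__init__.py files omitted; project_name is a placeholder); the next section walks through each file:
project_name/
├── scrapy.cfg            # project configuration / entry point
└── project_name/
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spiders created with scrapy genspider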
4. Scrapy project structure
- pipelines: the item pipelines
- 1. Process the data handed over by the engine
- 2. There can be multiple pipelines; the smaller the number assigned to a pipeline, the higher its priority
- 3. The pipeline method name
process_item()
must not be renamed to anything else
- items
- Defines the item objects that the spider's data is packed into (see the items.py sketch after this list)
- settings
- Stores shared configuration variables
- spiders
- The crawler code, which parses the response data returned via the engine
- middlewares
- 1. Define a custom middleware
- 2. Override the
process_request(self, request, spider)
method
- 3. e.g. implement a random User-Agent
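For example, the ChapterspiderItem used by the novel spider in section 9 would be declared in items.py roughly as in this minimal sketch (only the two fields that spider actually uses are shown):
# items.py -- declares the fields an item may carry
import scrapy

class ChapterspiderItem(scrapy.Item):
    chapter_name = scrapy.Field()  # chapter title
    chapter_text = scrapy.Field()  # chapter body text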
5. Other Scrapy settings
Before writing the spider, make a few changes in settings.py:
- Change 1:
add
LOG_LEVEL = 'WARNING'
so that only warnings and above are logged
- Change 2:
ROBOTSTXT_OBEY = False
change True to False
- Change 3:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
add default request headers
- Change 4:
ITEM_PIPELINES = {
    'Poetry.pipelines.PoetryPipeline': 300,
}
enable the pipeline
6. Middleware
- Downloader middleware
- Spider middleware
- Custom middleware
Requirement: add a random User-Agent.
In the downloader middleware, override
process_request(self, request, spider):
Parameters
request: the Request object being sent
spider: the Spider that issued the request
By default this method returns None (return None)
None: the normal case; the request keeps being processed as usual
Response: returning a Response object skips the downloader and process_response is executed directly
Request: returning a Request object stops the current chain; the engine schedules the returned request instead
process_response(self, request, response, spider):
7. More on middleware
Summary
1. Set a random User-Agent in the middlewares file
2. Override the process_request method
def process_request(self, request, spider):
    user_agent = random.choice(self.USER_AGENTS)
    request.headers['User-Agent'] = user_agent
3. Don't forget to enable the middleware in the settings file
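Putting the pieces together, a random User-Agent downloader middleware could look like the sketch below. The class name RandomUserAgentMiddleware and the contents of the USER_AGENTS list are illustrative assumptions, not code from the project:
# middlewares.py -- minimal sketch of a random User-Agent middleware (assumed class name)
import random

class RandomUserAgentMiddleware:
    # sample UA strings; in practice keep a longer list or use fake-useragent
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # attach a randomly chosen User-Agent to every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # let the request continue through the normal flow

To enable it, add the class path to DOWNLOADER_MIDDLEWARES in settings.py (project name assumed to be Poetry, as in the settings file shown later):
DOWNLOADER_MIDDLEWARES = {
    'Poetry.middlewares.RandomUserAgentMiddleware': 543,
}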
Extension
If you do not want the request to be filtered out by the de-duplication check, use:
# dont_filter=True -- defaults to False, which enables de-duplication
yield scrapy.Request(self.start_urls[0], dont_filter=True)
fake-useragent can generate random User-Agents
1. Install: pip3 install fake-useragent
2. Usage
1. get a random UA
2. ua.random
3. generate a UA for a specified browser
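A minimal usage sketch (the attribute names come from the fake-useragent library; exact behavior can vary between versions):
from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)   # a random User-Agent string
print(ua.chrome)   # a User-Agent for a specific browser, e.g. Chrome
print(ua.firefox)  # ... or Firefox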
8. Simulated login with Scrapy
How to simulate a login
- 1. Carry cookies in the request to simulate a logged-in session
- 2. Find the login API endpoint and send a POST request (submitting the account and password)
- 3. Log in through Selenium (see the sketch below)
- load the driver
- open the login page
- find the corresponding input tags and type in the text
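A minimal Selenium sketch of those three steps; the URL, element locators, and credentials below are placeholders for illustration, not the real Renren login form:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                          # load the driver
driver.get('https://example.com/login')              # open the login page (placeholder URL)
driver.find_element(By.NAME, 'email').send_keys('user@example.com')    # fill the input tags
driver.find_element(By.NAME, 'password').send_keys('secret')
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()  # submit the form
cookies = driver.get_cookies()                       # the cookies can then be handed to Scrapy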
Simulated login to Renren
http://www.renren.com/975937712/profile -- the Renren personal profile page used in this example
Find the cookies
1. This can be done through middlewares
2. Analyzing the source code, we find the start_requests() method
# save the page to a file
with open('renren.html', 'w', encoding='utf-8') as f:
    f.write(response.body.decode())
Summary
1. The parse function already receives a response, so we need start_requests() to send the request to the start URL and carry the cookies along
the start_requests() method
Inside this method, do not pass the cookies through headers; use the cookies= argument:
# issue the request
yield scrapy.Request(
    url=self.start_urls[0],
    # handle the result
    callback=self.parse,
    # headers=headers
    cookies=cookies
)
2. Converting the cookie string into the right format
Use a dict comprehension:
cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
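As a quick sanity check, here is the same comprehension applied to a short, made-up cookie string:
raw = 'anonymid=abc123; depovince=GW; ver=7.0'   # made-up example values
cookies = {i.split('=')[0]: i.split('=')[1] for i in raw.split('; ')}
# -> {'anonymid': 'abc123', 'depovince': 'GW', 'ver': '7.0'}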
renren.py
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://www.renren.com/975937712/profile']  # profile page

    # override the start_requests() method
    def start_requests(self):
        # cookie string copied from the browser (the leading "Cookie: " header name must be stripped)
        cookies = 'anonymid=kn8jumirb7ee2k; depovince=GW; _r01_=1; taihe_bi_sdk_uid=422fb307325599aa438d408d4bb06a37; taihe_bi_sdk_session=60c3421a289deb3afd889f15a9ec0aad; JSESSIONID=abcF3UTXoHDWr75pvwWIx; t=88d15e116307bd1cbad7ed5b0b9aa5d12; societyguester=88d15e116307bd1cbad7ed5b0b9aa5d12; id=975937712; xnsid=4fa3ce39; jebecookies=f2fe4658-7b93-40b5-8ce2-f50fe03ed175|||||; ver=7.0; loginfrom=null; XNESSESSIONID=4400bd141850; wp_fold=0'
        # turn the cookie string into a dict
        cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
        # header-based alternative (not used; pass cookies= instead)
        # headers = {'Cookie': cookies}
        # issue the request
        yield scrapy.Request(
            url=self.start_urls[0],
            # handle the result
            callback=self.parse,
            # headers=headers
            cookies=cookies
        )

    def parse(self, response):
        # print(response.body.decode())
        # save the page to a file
        with open('renren.html', 'w', encoding='utf-8') as f:
            f.write(response.body.decode())
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import json
from itemadapter import ItemAdapter


class PoetryPipeline:
    def open_spider(self, spider):
        # called once when the spider opens; open the output file here
        self.gushiwen = open('古诗文.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters readable instead of escaped
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.gushiwen.write(item_json + '\n')
        # print(item)
        return item

    def close_spider(self, spider):
        # called once when the spider closes; close the file
        self.gushiwen.close()
settings.py
# Scrapy settings for Poetry project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Poetry'
SPIDER_MODULES = ['Poetry.spiders']
NEWSPIDER_MODULE = 'Poetry.spiders'
LOG_LEVEL = 'WARNING'  # set the log level
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Poetry (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# Spider middleware
#SPIDER_MIDDLEWARES = {
# 'Poetry.middlewares.PoetrySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# Downloader middleware
#DOWNLOADER_MIDDLEWARES = {
# 'Poetry.middlewares.PoetryDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Item pipeline (enabled)
ITEM_PIPELINES = {
'Poetry.pipelines.PoetryPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
9. Pagination and detail pages with Scrapy
Approach 1:
- In the starting page, find the URL of the next page and extract it with response.xpath(),
rebuild the absolute URL with response.urljoin(),
and then re-request the new URL with yield scrapy.Request(url=xx, callback=self.parse)
so the new page goes through the same parse method.
Spider source code:
import scrapy
from ChapterSpider.items import ChapterspiderItem


class YinnegzheSpider(scrapy.Spider):
    name = 'yinnegzhe'
    allowed_domains = ['bxwxorg.com']
    start_urls = ['https://www.bxwxorg.com/read/121200/639119.html']

    def parse(self, response):
        # chapter title
        chapter_name = response.xpath('//div[@class="content_read"]/div[@class="box_con"]/div[@class="bookname"]/h1/text()').extract_first()
        # print(chapter_name)
        # chapter body paragraphs
        chapter_contents = response.xpath('//div[@class="content_read"]/div[@class="box_con"]/div[@id="content"]/p/text()').extract()
        # print(chapter_contents)
        chapter_text = '\n'.join(chapter_contents)
        # print(chapter_text)
        item = ChapterspiderItem()
        item['chapter_name'] = chapter_name
        item['chapter_text'] = chapter_text
        yield item
        # link to the next chapter
        chapter_href = response.xpath('//div[@class="content_read"]/div[@class="box_con"]/div[@class="bottem2"]/a/@href').getall()[3]
        print(chapter_href)
        chapter_url = response.urljoin(chapter_href)
        # print(chapter_url)
        if chapter_href == '/read/121200/':
            # the "next" link points back to the table of contents: last chapter reached
            print('Novel crawler finished!')
        else:
            yield scrapy.Request(
                url=chapter_url,
                callback=self.parse
            )
Note: this is the source of a novel spider. The check if chapter_href == '/read/121200/':
detects, after repeated page turning, that the last page has been reached, so the spider ends cleanly instead of raising an error.
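As a side note, Scrapy also provides response.follow(), which accepts a relative URL directly, so the urljoin() step can be skipped. A minimal sketch (the XPath selector here is hypothetical, not from the spider above):
def parse(self, response):
    # ... extract and yield the item as above ...
    next_href = response.xpath('//a[@rel="next"]/@href').get()  # hypothetical selector
    if next_href:
        # response.follow() resolves the relative URL against the current page
        yield response.follow(next_href, callback=self.parse)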
- scrapy.Request knowledge points
scrapy.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=