1. Scrapy Overview
1. Why learn the Scrapy framework?
- It is an essential crawling skill, and interview questions frequently touch on it.
- It makes our crawlers faster and more powerful. (It supports asynchronous crawling.)
2. What is Scrapy?
- An asynchronous crawling framework: Scrapy is a Python-based framework for crawling websites and extracting structured data from their pages. It is currently the most popular crawling framework in the Python ecosystem; its architecture is clean and highly extensible, so it can handle all kinds of crawling needs flexibly and efficiently.
Program state-transition (architecture) diagram:
3. How to learn Scrapy?
- Official site: https://scrapy.org/
- Official documentation 1 (Chinese): https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
- Official documentation 2 (English): https://docs.scrapy.org/en/latest/
4. Scrapy workflow
Division of responsibilities:
Component | Description | Who implements it |
---|---|---|
Scrapy Engine | The conductor: passes data and signals between the other components | Provided by Scrapy |
Scheduler | A queue that stores the requests handed over by the engine | Provided by Scrapy |
Downloader | Downloads the requests handed over by the engine (producing a response) and returns the response to the engine | Provided by Scrapy |
Spider | Processes the responses handed over by the engine, extracts data and URLs, and hands them back to the engine | You write this |
Item Pipeline | Processes the data handed over by the engine, e.g. stores it | You write this |
Downloader Middlewares | Customizable download extensions, e.g. setting a proxy | Usually not written by hand |
Spider Middlewares | Customizable hooks for requests and for filtering responses | Usually not written by hand |
2. Scrapy Quick Start (a small example)
1. Installation
pip install scrapy
pip install scrapy==2.5.1  # install version 2.5.1 of Scrapy specifically
Type the "scrapy" command in a terminal to check that the installation worked:
If output like the above appears, the installation succeeded.
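If you want a more explicit check than the help screen, the following commands also confirm the installation (a small sketch; the exact output depends on your environment):
scrapy version   # prints the installed Scrapy version
pip show scrapy  # shows the installed package metadata, including the version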
2. Create a project
- Open a terminal (cmd) in the directory where the project should be saved.
# scrapy startproject <project name>
scrapy startproject my_Scrapy
3. Project structure
- my_Scrapy
  - my_Scrapy
    - spiders
      - __init__.py
    - __init__.py
    - items.py
    - middlewares.py
    - pipelines.py
    - settings.py
  - scrapy.cfg
What each file does:
- scrapy.cfg: the Scrapy project configuration file; it records the path of the project's settings module and deployment information. (Usually not modified.)
- items.py: defines the Item data structures; all Item definitions can live here. (It declares which fields will be scraped.)
- pipelines.py: defines the Item Pipeline implementations.
- settings.py: defines the project-wide settings.
- middlewares.py: defines the Spider Middlewares and Downloader Middlewares.
- spiders: contains the individual spiders; each spider lives in its own .py file.
4. Create a Spider
# first cd into the project directory:
cd my_Scrapy
# scrapy genspider <spider name> <domain to crawl>
scrapy genspider spider1 www.baidu.com
- Edit spider1.py as follows:
import scrapy

class Spider1Spider(scrapy.Spider):
    # spider name; remember it, since the spider is started by this name:
    name = 'spider1'
    # domains the spider is allowed to crawl; keeps it from wandering onto other sites
    # (this should really be a bare domain such as 'quotes.toscrape.com'; a full URL triggers the URLWarning seen in the run log below)
    allowed_domains = ['http://quotes.toscrape.com/']
    # initial request(s):
    start_urls = ['http://quotes.toscrape.com/']

    # parse callback:
    def parse(self, response):
        print(response.text)
The example site used by the official tutorial: http://quotes.toscrape.com/
5. Create the Item
- An Item is the container that holds the scraped data; it defines the structure of the data to be scraped.
Edit the project's items.py as follows:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class MyScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # target fields: the quote text, the author, and the tags
    # quote text:
    text = scrapy.Field()
    # author:
    author = scrapy.Field()
    # tags:
    tags = scrapy.Field()
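An Item behaves like a dict with a fixed set of allowed keys. A minimal sketch (not part of the project files) of how the fields declared above are used:
from my_Scrapy.items import MyScrapyItem

item = MyScrapyItem()
item['text'] = '“A day without sunshine is like, you know, night.”'
item['author'] = 'Steve Martin'
item['tags'] = ['humor', 'obvious', 'simile']
print(item['author'])  # fields are read back like dict keys
print(dict(item))      # convert to a plain dict when needed
# item['rating'] = 5   # would raise KeyError: only declared Fields may be assigned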
6. Parse the Response
1. Scraping only the first page
- Modify the parse() method in spider1.py; this method parses the target content out of the page source.
import scrapy
from lxml import etree

class Spider1Spider(scrapy.Spider):
    # spider name; remember it, since the spider is started by this name:
    name = 'spider1'
    # domains the spider is allowed to crawl (keeps it from wandering onto other sites)
    allowed_domains = ['http://quotes.toscrape.com/']
    # initial request(s):
    start_urls = ['http://quotes.toscrape.com/']

    # parse callback:
    def parse(self, response):
        # print(response.text)
        """
        Parse the returned response: extract data or generate further requests to process.
        :param response:
        :return:
        """
        # Approach 1: parse with CSS selectors
        """
        quotes = response.css('.quote')  # list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')  # ::text selects the text inside the tag (note: this is still a selector object)
            # Old-style API:
            # extract_first() returns the first result (a string)
            # extract() returns all results (a list of strings)
            text = quote.css('span.text::text').extract_first()  # extract_first() pulls the text out of the selector object
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (the text itself)
            # tags = quote.css('div.tags a.tag::text')  # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()  # all results
            # print(tags)
            # print(text, ' ——————', author, tags)
            # New-style API:
            # get() returns one result
            # getall() returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, ' ——————', author)
        """
        # Approach 2: parse with XPath
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            print(text, tags, ' ------', author)
Output of running the start.py launcher (start.py is created in section 8 below):
D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 19:24:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 19:24:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet Password: e2250e171a87ebd6
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 19:24:47 [scrapy.core.engine] INFO: Spider opened
2022-04-03 19:24:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 19:24:47 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
warnings.warn(message, URLWarning)
2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 19:24:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2582,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.264597,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 3, 11, 24, 48, 658256),
'httpcompression/response_bytes': 11053,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 4, 3, 11, 24, 47, 393659)}
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Spider closed (finished)
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” ['change', 'deep-thoughts', 'thinking', 'world'] ------ Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.” ['abilities', 'choices'] ------ J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” ['inspirational', 'life', 'live', 'miracle', 'miracles'] ------ Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” ['aliteracy', 'books', 'classic', 'humor'] ------ Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” ['be-yourself', 'inspirational'] ------ Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.” ['adulthood', 'success', 'value'] ------ Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.” ['life', 'love'] ------ André Gide
“I have not failed. I've just found 10,000 ways that won't work.” ['edison', 'failure', 'inspirational', 'paraphrased'] ------ Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.” ['misattributed-eleanor-roosevelt'] ------ Eleanor Roosevelt
“A day without sunshine is like, you know, night.” ['humor', 'obvious', 'simile'] ------ Steve Martin
Process finished with exit code 0
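As a side note, the parse() above drops down to lxml by hand, but the response object already has XPath support built in (through parsel), so the same extraction can be written without the extra import. A minimal sketch of the equivalent method body:
def parse(self, response):
    # response.xpath() returns SelectorList objects; get()/getall() pull out the strings
    for quote in response.xpath('//div[@class="quote"]'):
        text = quote.xpath('./span[1]/text()').get()
        author = quote.xpath('./span[2]/small/text()').get()
        tags = quote.xpath('./div[@class="tags"]/a/text()').getall()
        print(text, tags, ' ------', author)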
2. Scraping data across pages
- The changes are mainly a few statements in spider1.py.
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem  # import the MyScrapyItem class from items.py

class Spider1Spider(scrapy.Spider):
    # spider name; remember it, since the spider is started by this name:
    name = 'spider1'
    # # domains the spider is allowed to crawl (keeps it from wandering onto other sites)
    # allowed_domains = ['http://quotes.toscrape.com/']  # with no restriction the spider can keep following "next page" links
    # initial request(s):
    start_urls = ['http://quotes.toscrape.com/']

    # parse callback:
    def parse(self, response):
        # print(response.text)
        """
        Parse the returned response: extract data or generate further requests to process.
        :param response:
        :return:
        """
        # Approach 1: parse with CSS selectors
        """
        quotes = response.css('.quote')  # list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')  # ::text selects the text inside the tag (note: this is still a selector object)
            # Old-style API:
            # extract_first() returns the first result (a string)
            # extract() returns all results (a list of strings)
            text = quote.css('span.text::text').extract_first()  # extract_first() pulls the text out of the selector object
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (the text itself)
            # tags = quote.css('div.tags a.tag::text')  # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()  # all results
            # print(tags)
            # print(text, ' ——————', author, tags)
            # New-style API:
            # get() returns one result
            # getall() returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, ' ——————', author)
        """
        # Approach 2: parse with XPath
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, ' ------', author)
            # put the data into the item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # yield each item so it is handed to the pipeline
            yield item
        # pagination
        next = response.css('ul.pager li.next a::attr("href")').get()  # href attribute of the "Next" link
        print(next)  # /page/2/
        url = self.start_urls[0]  # the URL currently being crawled (way 1)
        # print(url)
        url = response.url  # the URL currently being crawled (way 2)
        # print(url)
        # join it with the relative href to build the next page's URL
        url = response.urljoin(next)
        print(url)
        # hand a new request for the next page back to the scheduler
        yield scrapy.Request(url, callback=self.parse)  # the new response is parsed by the same parse() method
Partial screenshot of the output:
7. Saving the data
1. Saving via the scrapy command
1. Way 1: run the command in a terminal
# scrapy crawl <spider name> -o <output file>
scrapy crawl spider1 -o demo.csv
2. Way 2: change the command line in the start.py launcher
# A Scrapy crawler is a project: you cannot right-click and run the spider file itself; it has to be started from a terminal with "scrapy crawl <spider name>".
# If you do not want to type the command in a terminal, create this start.py file instead.
from scrapy import cmdline
# cmdline.execute('scrapy crawl spider1'.split())  # invoke the terminal command
cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())
# The red text is not an error; it is Scrapy's own start-up logging. The white text is what print() outputs.
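The -o option infers the export format from the file extension, so formats other than CSV work the same way (the file names here are just examples):
scrapy crawl spider1 -o demo.json   # a single JSON array
scrapy crawl spider1 -o demo.jl     # JSON lines, one item per line
scrapy crawl spider1 -o demo.xml    # XML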
2. Saving with a custom pipeline (edit pipelines.py)
- Edit pipelines.py as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class MyScrapyPipeline:
    def process_item(self, item, spider):
        with open('demo.txt', 'a', encoding="utf-8") as f:
            f.write(item['text'] + ' ——' + item['author'] + "\n")
        return item
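Opening the file again for every single item works, but a common variant (a sketch, not what this project uses) opens the file once when the spider starts and closes it when the spider finishes, using the pipeline's open_spider/close_spider hooks:
class MyScrapyPipeline:
    def open_spider(self, spider):
        # called once when the spider is opened
        self.f = open('demo.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.f.write(item['text'] + ' ——' + item['author'] + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider is closed
        self.f.close()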
- Uncomment the ITEM_PIPELINES block in settings.py (otherwise the data will not be written to the txt file):
# Scrapy settings for my_Scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'my_Scrapy'
SPIDER_MODULES = ['my_Scrapy.spiders']
NEWSPIDER_MODULE = 'my_Scrapy.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'my_Scrapy (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'my_Scrapy.middlewares.MyScrapySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'my_Scrapy.middlewares.MyScrapyDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'my_Scrapy.pipelines.MyScrapyPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
- Screenshot of the result:
8. Running the project
1. Run it in a terminal
# scrapy crawl <spider name>
scrapy crawl spider1
The first part of the output is the spider's start-up logging:
The middle part is the page source:
The last part is the shutdown logging:
2. Run it from PyCharm
Create a launcher file start.py in the project folder:
# A Scrapy crawler is a project: you cannot right-click and run the spider file itself; it has to be started from a terminal with "scrapy crawl <spider name>".
# If you do not want to type the command in a terminal, create this start.py file instead.
from scrapy import cmdline
cmdline.execute('scrapy crawl spider1'.split())  # invoke the terminal command
# The red text is not an error; it is the crawler's start-up logging. The white text is what print() outputs.
Output:
D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 16:34:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 16:34:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet Password: b9d4a8fccbb5b978
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 16:34:35 [scrapy.core.engine] INFO: Spider opened
2022-04-03 16:34:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 16:34:35 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
warnings.warn(message, URLWarning)
2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" / >
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K. Rowling</small>
<a href="/author/J-K-Rowling">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="abilities,choices" / >
<a class="tag" href="/tag/abilities/page/1/">abilities</a>
<a class="tag" href="/tag/choices/page/1/">choices</a>
</div>
</div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" / >
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/live/page/1/">live</a>
<a class="tag" href="/tag/miracle/page/1/">miracle</a>
<a class="tag" href="/tag/miracles/page/1/">miracles</a>
</div>
</div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
<span>by <small class="author" itemprop="author">Jane Austen</small>
<a href="/author/Jane-Austen">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" / >
<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
<a class="tag" href="/tag/books/page/1/">books</a>
<a class="tag" href="/tag/classic/page/1/">classic</a>
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
<span>by <small class="author" itemprop="author">Marilyn Monroe</small>
<a href="/author/Marilyn-Monroe">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" / >
<a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
</div>
</div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="adulthood,success,value" / >
<a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
<a class="tag" href="/tag/success/page/1/">success</a>
<a class="tag" href="/tag/value/page/1/">value</a>
</div>
</div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
<span>by <small class="author" itemprop="author">André Gide</small>
<a href="/author/Andre-Gide">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="life,love" / >
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/love/page/1/">love</a>
</div>
</div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span>by <small class="author" itemprop="author">Thomas A. Edison</small>
<a href="/author/Thomas-A-Edison">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" / >
<a class="tag" href="/tag/edison/page/1/">edison</a>
<a class="tag" href="/tag/failure/page/1/">failure</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
</div>
</div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
<span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
<a href="/author/Eleanor-Roosevelt">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" / >
<a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
</div>
</div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author" itemprop="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="humor,obvious,simile" / >
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/obvious/page/1/">obvious</a>
<a class="tag" href="/tag/simile/page/1/">simile</a>
</div>
</div>
<nav>
<ul class="pager">
<li class="next">
<a href="/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
</nav>
</div>
<div class="col-md-4 tags-box">
<h2>Top Ten tags</h2>
<span class="tag-item">
<a class="tag" style="font-size: 28px" href="/tag/love/">love</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 26px" href="/tag/life/">life</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 22px" href="/tag/books/">books</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
</span>
</div>
</div>
</div>
<footer class="footer">
<div class="container">
<p class="text-muted">
Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
Made with <span class='sh-red'>❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a>
</p>
</div>
</footer>
</body>
</html>
2022-04-03 16:34:36 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 16:34:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2578,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.29309,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 3, 8, 34, 36, 608493),
'httpcompression/response_bytes': 11053,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 4, 3, 8, 34, 35, 315403)}
2022-04-03 16:34:36 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
3. Using the scrapy shell
1. Use the scrapy shell command in a terminal to test extraction on a single request
Target page: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Enter the command in a terminal:
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Microsoft Windows [Version 10.0.19042.1586]
(c) Microsoft Corporation. All rights reserved.
(base) C:\Users\吕成鑫\Desktop\scrapy框架的学习\my_Scrapy>scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-04 20:04:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-04 20:04:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: Telnet Password: 358ca5f9dee7f2d7
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled item pipelines:
['my_Scrapy.pipelines.MyScrapyPipeline']
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-04 20:04:11 [scrapy.core.engine] INFO: Spider opened
2022-04-04 20:04:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/robots.txt> (referer: None)
2022-04-04 20:04:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000229978B6E80>
[s]   item       {}
[s]   request    <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x00000229978B6A20>
[s]   spider     <DefaultSpider 'default' at 0x22997d98898>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser

In [1]: response
Out[1]: <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>

In [2]: response.text
Out[2]: "<html>\n <head>\n <base href='http://example.com/' />\n <title>Example website</title>\n </head>\n <body>\n <div id='images'>\n <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n </div>\n </body>\n</html>\n\n"

In [3]: response.xpath('//a')
Out[3]:
[<Selector xpath='//a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: My image ...'>]

In [4]: response.xpath('//a').xpath('./img')
Out[4]:
[<Selector xpath='./img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image5_thumb.jpg">'>]

In [5]: response.xpath('//a').xpath('./img')[0]
Out[5]: <Selector xpath='./img' data='<img src="image1_thumb.jpg">'>

In [6]: response.xpath('//a').xpath('./img').getall()
Out[6]:
['<img src="image1_thumb.jpg">',
 '<img src="image2_thumb.jpg">',
 '<img src="image3_thumb.jpg">',
 '<img src="image4_thumb.jpg">',
 '<img src="image5_thumb.jpg">']

In [7]: response.xpath('//a').xpath('./img').get()
Out[7]: '<img src="image1_thumb.jpg">'

In [8]: result = response.xpath('//a')

In [9]: result
Out[9]:
[<Selector xpath='//a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: My image ...'>]

In [10]: result.xpath('./img').getall()
Out[10]:
['<img src="image1_thumb.jpg">',
 '<img src="image2_thumb.jpg">',
 '<img src="image3_thumb.jpg">',
 '<img src="image4_thumb.jpg">',
 '<img src="image5_thumb.jpg">']

In [11]: response.xpath("//img")
Out[11]:
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

In [12]: response.css('a')
Out[12]:
[<Selector xpath='descendant-or-self::a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image5.html">Name: My image ...'>]

In [13]: response.css('div#images')
Out[13]: [<Selector xpath="descendant-or-self::div[@id = 'images']" data='<div id="images">\n <a href="image1....'>]

In [14]: response.css('div#images').get()
Out[14]: '<div id="images">\n <a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>\n <a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>\n <a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>\n <a h
html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>\n </div>'

In [15]: response.xpath('//a/text()').re('Name:\s(.*)')
Out[15]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

In [16]: response.re('.*')  # re() cannot be called on the response itself; it has to be chained after a selector
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-a22dedc07090> in <module>()
----> 1 response.re('.*')

AttributeError: 'HtmlResponse' object has no attribute 're'

In [17]:
4. Implementing pagination
How do we move to the next page?
Recall how pagination was done with the requests module:
- 1. find the URL of the next page
- 2. then call requests.get(url)
The idea in Scrapy is the same:
- 1. find the URL of the next page
- 2. build a Request for that URL and hand it to the scheduler
1. Paginate by joining the next-page URL at the end of parse() and registering a callback
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem  # import the MyScrapyItem class from items.py

class Spider2Spider(scrapy.Spider):
    # spider name; remember it, since the spider is started by this name:
    name = 'spider2'
    # # domains the spider is allowed to crawl (keeps it from wandering onto other sites)
    # allowed_domains = ['quotes.toscrape.com/']  # with no restriction the spider can keep following "next page" links
    # initial request:
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # parse callback:
    def parse(self, response):
        # print(response.text)
        """
        Parse the returned response: extract data or generate further requests to process.
        :param response:
        :return:
        """
        # Approach 1: parse with CSS selectors
        """
        quotes = response.css('.quote')  # list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')  # ::text selects the text inside the tag (note: this is still a selector object)
            # Old-style API:
            # extract_first() returns the first result (a string)
            # extract() returns all results (a list of strings)
            text = quote.css('span.text::text').extract_first()  # extract_first() pulls the text out of the selector object
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (the text itself)
            # tags = quote.css('div.tags a.tag::text')  # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()  # all results
            # print(tags)
            # print(text, ' ——————', author, tags)
            # New-style API:
            # get() returns one result
            # getall() returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, ' ——————', author)
        """
        # Approach 2: parse with XPath
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, ' ------', author)
            # put the data into the item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # yield each item so it is handed to the pipeline
            yield item

        self.page += 1
        # note: the pagination has to stop somewhere
        if self.page < 11:
            # build the next request (way 1):
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)
            # build the next request (way 2):
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        # the pagination logic written earlier:
        next = response.css('ul.pager li.next a::attr("href")').get()  # href attribute of the "Next" link
        print(next)  # /page/2/
        url = self.start_urls[0]  # the URL currently being crawled (way 1)
        # print(url)
        url = response.url  # the URL currently being crawled (way 2)
        # print(url)
        # join it with the relative href to build the next page's URL
        url = response.urljoin(next)
        print(url)
        # hand a new request for the next page back to the scheduler
        yield scrapy.Request(url, callback=self.parse)  # the new response is parsed by the same parse() method
        """
2. Paginate by overriding the start_requests() method
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem  # import the MyScrapyItem class from items.py

class Spider3Spider(scrapy.Spider):
    # spider name; remember it, since the spider is started by this name:
    name = 'spider3'
    # # domains the spider is allowed to crawl (keeps it from wandering onto other sites)
    # allowed_domains = ['quotes.toscrape.com/']  # with no restriction the spider can keep following "next page" links
    # initial request:
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # pagination built by overriding this method:
    def start_requests(self):  # runs when the spider starts issuing requests
        for page in range(1, 11):
            url = self.base_url.format(page)
            yield scrapy.Request(url, callback=self.parse)

    # parse callback:
    def parse(self, response):
        # print(response.text)
        """
        Parse the returned response: extract data or generate further requests to process.
        :param response:
        :return:
        """
        # Approach 1: parse with CSS selectors
        """
        quotes = response.css('.quote')  # list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')  # ::text selects the text inside the tag (note: this is still a selector object)
            # Old-style API:
            # extract_first() returns the first result (a string)
            # extract() returns all results (a list of strings)
            text = quote.css('span.text::text').extract_first()  # extract_first() pulls the text out of the selector object
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (the text itself)
            # tags = quote.css('div.tags a.tag::text')  # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()  # all results
            # print(tags)
            # print(text, ' ——————', author, tags)
            # New-style API:
            # get() returns one result
            # getall() returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, ' ——————', author)
        """
        # Approach 2: parse with XPath
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, ' ------', author)
            # put the data into the item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # yield each item so it is handed to the pipeline
            yield item
        """
        self.page += 1
        # note: the pagination has to stop somewhere
        if self.page < 11:
            # build the next request (way 1):
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)
            # build the next request (way 2):
            # this method appeared in Scrapy 2.0: it joins the URLs and registers the callback
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        """
        # the pagination logic written earlier:
        next = response.css('ul.pager li.next a::attr("href")').get()  # href attribute of the "Next" link
        print(next)  # /page/2/
        url = self.start_urls[0]  # the URL currently being crawled (way 1)
        # print(url)
        url = response.url  # the URL currently being crawled (way 2)
        # print(url)
        # join it with the relative href to build the next page's URL
        url = response.urljoin(next)
        print(url)
        # hand a new request for the next page back to the scheduler
        yield scrapy.Request(url, callback=self.parse)  # the new response is parsed by the same parse() method
        """
3. Modify start.py and save the data
# A Scrapy crawler is a project: you cannot right-click and run the spider file itself; it has to be started from a terminal with "scrapy crawl <spider name>".
# If you do not want to type the command in a terminal, create this start.py file instead.
from scrapy import cmdline
# cmdline.execute('scrapy crawl spider1'.split())  # invoke the terminal command
# cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())
# cmdline.execute('scrapy crawl spider2'.split())  # invoke the terminal command
cmdline.execute('scrapy crawl spider3'.split())  # invoke the terminal command
# The red text is not an error; it is the crawler's start-up logging. The white text is the printed output.
5. Scrapy framework: case study 2
1. Analyze the site
- Target site: the Tencent recruitment site
- Goals:
- scrape the job-posting information
- handle pagination
The "fake" URL (the address shown in the browser, which does not itself return the data): https://talent.antgroup.com/off-campus
- How the data is loaded: a mix of static and dynamic content
The data URLs captured from the network traffic:
Page 1:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
Page 2:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn
Detail page:
url: https://careers.tencent.com/jobdesc.html?postId=1310124481703845888
data-url: https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1649156817199&postId=1310124481703845888&language=zh-cn
- Crawling plan:
- request the first list-page URL
- parse the postId of every job posting on that page
- build the detail-page URLs from those ids
2. Implementation steps
- Create the project
scrapy startproject tencent
- Create the spider
cd tencent
scrapy genspider spider1 tencent.com
Terminal output:
C:\Users\lv\Desktop\scrapy框架的学习>scrapy startproject tencent
New Scrapy project 'tencent', using template directory 'd:\anaconda\lib\site-packages\scrapy\templates\project', created in: C:\Users\lv\Desktop\scrapy框架的学习\tencent
You can start your first spider with:
cd tencent
scrapy genspider example example.com
C:\Users\lv\Desktop\scrapy框架的学习>cd tencent
C:\Users\lv\Desktop\scrapy框架的学习\tencent>scrapy genspider spider1 tencent.com
Created spider 'spider1' using template 'basic' in module:
tencent.spiders.spider1
C:\Users\lv\Desktop\scrapy框架的学习\tencent>
- Open the tencent project in PyCharm:
- The spider1.py file was generated by the genspider command used above:
scrapy genspider spider1 tencent.com
- Edit spider1.py as follows:
import scrapy
import json
from tencent.items import TencentItem

class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['tencent.com']
    # URL of one page (10 postings) of list data; change pageIndex to turn the page
    one_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # URL of a single posting's detail data; change postId to fetch a different posting
    two_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1649156817199&postId={}&language=zh-cn"
    start_urls = [one_url.format(1)]

    # parse a list page
    def parse(self, response):
        # the response is not an HTML page but a JSON payload, so load it into a dict
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()
            post_id = job['PostId']
            # print(post_id)
            item['job_name'] = job['RecruitPostName']
            # build the detail-page URL
            detail_url = self.two_url.format(post_id)
            print(detail_url)
            # build the request for the detail page:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})
        # pagination
        for page in range(2, 5):
            url = self.one_url.format(page)
            yield scrapy.Request(url, callback=self.parse)  # the next page is a list page, so parse() is the callback again

    # parse the detail-page data
    def parse_detail(self, response):
        item = response.meta.get('item')
        # print(item)
        data = json.loads(response.text)
        item['job_duty'] = data['Data']['Requirement']
        yield item
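The spider above imports TencentItem from tencent/items.py. That file is not shown in these notes; based on the two fields the spider fills in, it would look roughly like this sketch (the field names job_name and job_duty are taken from the spider code):
import scrapy

class TencentItem(scrapy.Item):
    # job title, taken from the list-page JSON (RecruitPostName)
    job_name = scrapy.Field()
    # job requirements, taken from the detail-page JSON (Requirement)
    job_duty = scrapy.Field()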
- Uncomment the ITEM_PIPELINES block in settings.py:
- Run the start.py launcher:
from scrapy import cmdline
# cmdline.execute("scrapy crawl spider1".split())
cmdline.execute("scrapy crawl spider1 -o demo.csv".split())
# cmdline.execute("scrapy crawl spider2".split())
The result (a demo.csv file is generated):
Supplement 1: Using the Spider class
1. What a Spider is responsible for
- defining the logic for crawling a site
- parsing the pages that are crawled
2. Attributes and methods of the Spider class
- name: the spider's name.
- allowed_domains: the domains the spider may visit; keeps it from crawling other sites.
- start_urls: the list of URLs to request first.
- custom_settings: a dict of settings specific to this spider; it overrides the project-wide settings and must be defined as a class attribute, because it is read before the spider instance is created.
- crawler: set by the from_crawler() method; the crawler object this spider is bound to. It can be used to access the project settings, among other things.
- closed(): called when the spider is closed; a place to release resources. (See the sketch below.)
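A small sketch of how custom_settings and closed() are typically written (the spider name, URL and setting values below are illustrative, not taken from the project above):
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://quotes.toscrape.com/']

    # per-spider settings; must be a class attribute because Scrapy reads it
    # before the spider instance is created
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'ROBOTSTXT_OBEY': False,
    }

    def parse(self, response):
        pass

    def closed(self, reason):
        # called when the spider is closed; release resources here
        self.logger.info('spider closed: %s', reason)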
Supplement 2: the Request object
1. Introduction
- The Request object is the Scrapy object used whenever a new request is constructed.
For example:
yield scrapy.Request(url=detail_url, callback=self.parse_detail)
2. Parameters
- url: the URL of the new request; it is placed in the scheduler's queue.
- callback: the function that will parse the response.
- priority: the request's priority (controls which queued URL is requested first). Defaults to 0; the scheduler uses it when ordering requests, and a larger value is scheduled earlier.
- method: the HTTP method, "GET" by default.
- dont_filter: whether to skip the duplicate-request filter for this request; defaults to False (duplicate requests are filtered out).
- errback: a method to call if the request fails; defaults to None. (Rarely used.)
For example:
def parse(self, response):
    ...
    yield scrapy.Request(url=detail_url, callback=self.parse_detail, errback=self.func)

def func(self, failure):
    print("method that runs after the request fails")
- body: the request body.
- headers: the request headers.
- cookies
- meta: extra data attached to the request and carried over to the callback via response.meta.
For example:
def parse(self, response):
    # the response is a JSON payload rather than an HTML page, so load it into a dict
    data = json.loads(response.text)
    for job in data['Data']['Posts']:
        item = TencentItem()
        post_id = job['PostId']
        # print(post_id)
        item['job_name'] = job['RecruitPostName']
        # build the detail-page URL
        detail_url = self.two_url.format(post_id)
        print(detail_url)
        # build the request, attaching the item via meta:
        yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

# parse the detail-page data
def parse_detail(self, response):
    item = response.meta.get('item')
    print(item)
- encoding: the encoding, "utf-8" by default.
- cb_kwargs: extra keyword arguments to pass to the callback, given as a dict.
For example:
def parse(self, response):
    ...
    # build the request, passing an extra keyword argument to the callback:
    yield scrapy.Request(url=detail_url, callback=self.parse_detail, cb_kwargs={"num": 1})

# parse the detail-page data
def parse_detail(self, response, num):
    print(num)
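A sketch that combines several of the parameters above in a single request (the header and cookie values are placeholders, and detail_url, item, self.parse_detail and self.func refer to the earlier examples):
yield scrapy.Request(
    url=detail_url,
    method='GET',                  # the default; use 'POST' together with body for form-style requests
    headers={'Referer': 'https://careers.tencent.com/'},
    cookies={'sessionid': 'xxx'},  # placeholder value
    meta={'item': item},           # carried over to the callback via response.meta
    priority=10,                   # larger value = scheduled earlier
    dont_filter=True,              # skip the duplicate-request filter for this URL
    callback=self.parse_detail,
    errback=self.func,
)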
Supplement 3: CSS selectors
"""
Parsing tools:
1. regular expressions                    fastest       hardest syntax to remember
2. xpath                                  medium speed  medium-difficulty syntax
3. BS4 (bs4 syntax and CSS selectors)     slowest       simplest syntax
"""
from bs4 import BeautifulSoup
# a third-party library worth recommending: parsel
import parsel  # bundles all three selector styles: regex, xpath and css

html = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/titllie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Part 1: using CSS selectors through BeautifulSoup
# parsing
# lxml is a third-party parser and is much faster than the default html.parser
soup = BeautifulSoup(html, features="lxml")  # BeautifulSoup fills in the missing parts of the HTML (e.g. adds <body>, </html>)
# print(soup)
# 1. look up by tag name
a_tags = soup.select('a')
print(a_tags)
# 2. look up by class name
sister_class = soup.select('.sister')
print(sister_class)
# 3. look up by id
link1_id = soup.select("#link1")
print(link1_id)
# 4. combined look-ups
a_link2 = soup.select("p #link2")
print(a_link2)
a_link2 = soup.select("p > #link2")  # > means a direct child
print(a_link2)
p_sister_class = soup.select("p > .sister")
print(p_sister_class)
# using the id and the class of the same tag together did not work here
# p_sister_class_id = soup.select("p > .sister#link1")
# print(p_sister_class_id)
# 5. look up by attribute
a_href = soup.select('a[href="http://example.com/elsie"]')
print(a_href)
# 6. get the text inside a tag
text1 = soup.select('title')[0].get_text()
print(text1)
# 7. get the value of an attribute (e.g. the href value)
href = soup.select('a#link1')[0]['href']
print(href)
print("---" * 20)

# Part 2: using CSS selectors through parsel
selector = parsel.Selector(html)  # create the selector object
# selector.re()
# selector.xpath()
# selector.css()
# 1. look up by tag name
object_list = selector.css("a")
print(object_list.getall())  # getall() returns every match
# for item in object_list:
#     print(item.get())
# 2. look up by class name
print(selector.css('.sister').get())  # get() returns the first match
print(selector.css('.sister').getall())
# 3. look up by id
print(selector.css('#link1').getall())
# 4. combined look-ups
print(selector.css('p.story a#link2').getall())
# 5. look up by attribute
print(selector.css('.story').get())
# 6. get the text inside a tag
print(selector.css('p > #link1::text').get())
# 7. get the value of an attribute (e.g. the href value)
print(selector.css('p > #link1::attr(href)').get())
# 8. pseudo-class selectors
print(selector.css('a').getall()[1])
print(selector.css('a:nth-child(1)').getall())  # select the n-th child