Python Web Scraping | Crawling Paginated Data with the Scrapy Framework: A Worked Example

程序员晓晓

Last modified: 2024-01-18 10:36:33

Tags: python, web scraping, scrapy, learning Python, Python basics, Python web scraping, Python data analysis

First published: 2023-10-06 11:00:59

Article link: https://blog.csdn.net/cxyxx12/article/details/133607123

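Contents

1. Scrapy framework crawler

1.1 Worked example: crawling paginated data with Scrapy

Step 1: Parse the data

Step 2: Define the fields in the Item class

Step 3: Store the parsed data in an item and handle pagination

Step 4: Submit the item object to the Pipeline for persistent storage

Step 5: Persist the data stored in the received item object in the pipeline class's process_item

Step 6: Modify the configuration file settings.py

Step 7: Run the project

Summary

Get the complete code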

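1. Scrapy framework crawler

An earlier article covered using the Pipeline in the popular Scrapy framework to persist scraped data to a txt file. This article shows how to crawl paginated data and persist it to a txt file.

Case: use Scrapy to scrape the poet name and poet biography for Su Shi from the classical poetry site shici.com.cn, together with the title, author, and text of the second poem on each of pages 1 to 5, and persist the parsed data to a specified local txt file.

1.1 Worked example: crawling paginated data with Scrapy

The project is created the same way as in the previous articles, so the steps are not repeated here. For details, see the earlier post "Python Web Scraping (16) | Using the Pipeline in the Scrapy framework to persist data as a txt file" (original title: 【Python爬虫(16) | 使用明星框架scrapy中的Pipeline将数据持久化存储成txt格式的致胜法宝】).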

Step 1: Parse the data

**Method:** Double-click SpiderSuShiDemo_01.py to open the spider file, then modify or add code.

Sample code:

```python
"""Scrapy example: crawling paginated data."""
# Case: scrape the title, author and text of the 2nd poem on pages 1-5 of
# Su Shi's poems and persist the parsed data to a specified local txt file.
import scrapy
from ScrapySuShiDemo.items import ScrapysushidemoItem


class Spidersushidemo01Spider(scrapy.Spider):
    name = "SpiderSuShiDemo_01"
    # allowed_domains = ["shici.com.cn"]
    start_urls = ["https://shici.com.cn/poet/0x2bf504c9?"]

    # Generic URL template for the paginated listing
    url = 'https://shici.com.cn/poet/0x2bf504c9?page=%d'
    page_num = 2

    def parse(self, response):
        # Parse the response: poem title, author name and poem text
        div_list = response.xpath('//*[@id="wrapper"]/div/div[2]/div[1]')

        for div in div_list:
            item = ScrapysushidemoItem()
            # extract() pulls out the strings stored in each Selector's data.
            # Poet name (Su Dongpo); query the current div, not the whole list
            poem = div.xpath('./div[1]/div[1]/div/div/h1/text()').extract()
            poem = ''.join(poem)  # join the list into a single string
            # print("[Poet]:", poem)

            # Poet biography
            intro = div.xpath('./div[1]/div[1]/div/div/p[2]//text()').extract()
            intro = ''.join(intro)
            # print("[Poet bio]:", intro, '\n')

            title = div.xpath('./div[3]/h3/a/text()').extract_first()
            # print(title)

            author = div.xpath('./div[3]/div[2]/div[1]/a/text()')[0].extract()
            # print(author)

            poetry = div.xpath('./div[3]/div[2]/div[2]/p//text()').extract()
            poetry = ''.join(poetry)  # join the list into a single string
            # print(poetry, '\n')
```


Step 2: Define the fields in the Item class

**Method:** Double-click items.py to open the items file, then modify the code in it to define the fields.

Sample code:

```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ScrapysushidemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Fields filled in by the spider
    poem = scrapy.Field()
    intro = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    poetry = scrapy.Field()
```


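Step 3: Store the parsed data in an item object and handle pagination

**Method:** Double-click SpiderSuShiDemo_01.py to open the spider file and modify the code from Step 1.

Sample code:

(1) Import the ScrapysushidemoItem class

```python
# Import ScrapysushidemoItem from items in the current project ScrapySuShiDemo
from ScrapySuShiDemo.items import ScrapysushidemoItem
```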


(2) Store the parsed data in an item object

```python
# Store the parsed data in an item object
item = ScrapysushidemoItem()
item['poem'] = poem
item['intro'] = intro
item['title'] = title
item['author'] = author
item['poetry'] = poetry

# Submit the item to the pipeline
yield item
```


(3) Pagination

```python
# Pagination
if self.page_num <= 5:
    new_url = self.url % self.page_num
    self.page_num += 1
    # Send the request manually; the callback is the function that parses the response
    yield scrapy.Request(url=new_url, callback=self.parse)
```

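The counter-based pagination can be traced without Scrapy or network access. The sketch below is a simulation under stated assumptions: `PaginationDemo` and `crawl` are hypothetical stand-ins for the spider and for Scrapy's scheduler, and the pagination check is assumed to run once per parsed page, after that page's items have been yielded.

```python
class PaginationDemo:
    # Same URL template and starting counter as the spider above
    url = 'https://shici.com.cn/poet/0x2bf504c9?page=%d'
    page_num = 2

    def parse(self, page_url):
        # Stand-in for yielding the items parsed from page_url
        yield {'parsed_from': page_url}
        # Pagination: build the next URL while the counter is within range
        if self.page_num <= 5:
            new_url = self.url % self.page_num
            self.page_num += 1
            # The real spider yields scrapy.Request(url=new_url, callback=self.parse)
            yield new_url


def crawl(spider, start_url):
    """Tiny hypothetical scheduler: follows every URL that parse yields."""
    pending, visited = [start_url], []
    while pending:
        current = pending.pop(0)
        visited.append(current)
        for result in spider.parse(current):
            if isinstance(result, str):
                pending.append(result)
    return visited


visited = crawl(PaginationDemo(), 'https://shici.com.cn/poet/0x2bf504c9?')
```

Starting the counter at 2 works because the entry in `start_urls` already covers page 1, so the start URL plus pages 2 through 5 give five parsed pages in total.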

Step 4: Submit the item object to the Pipeline for persistent storage

**Method:** Double-click pipelines.py to open the Pipeline file and modify the code in it.

Sample code:

```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapysushidemoPipeline:
    fp = None  # file handle; the file itself is opened in Step 5

    # Hook method: called only once, when the spider starts
    def open_spider(self, spider):
        print("Get ready, everyone, the crawl is starting......")

    # Handles item objects submitted by the spider file;
    # called once for every item received
    def process_item(self, item, spider):
        poem = item['poem']
        intro = item['intro']
        title = item['title']
        author = item['author']
        poetry = item['poetry']

        # Write the field values to the output file
        self.fp.write(poem+'\n'+intro+'\n'+title+'\n'+author+'\n'+poetry+'\n')

        return item

    # Hook method: called only once, when the spider finishes
    def close_spider(self, spider):
        print("Well done, the crawl is finished!")
        self.fp.close()
```


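Step 5: In the pipeline class, persist the data stored in the item objects that process_item receives.

The line below opens the output file. Because mode "w" truncates the file each time it is opened, the line most plausibly belongs in open_spider, where it runs once per crawl, rather than in process_item, which runs once per item.

Sample code:

```python
# Persistent storage
self.fp = open("./SuShi.txt", "w", encoding="utf-8")
```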

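Steps 4 and 5 can be combined and exercised without running a crawl, since a Scrapy pipeline is a plain Python class. The sketch below is one possible arrangement, not the article's exact file: `TxtPipelineSketch` and its `path` attribute are hypothetical names, the file is assumed to be opened in `open_spider`, and the hooks are called by hand with `spider=None` and a plain dict in place of an item.

```python
import os
import tempfile


class TxtPipelineSketch:
    """Hypothetical stand-in for ScrapysushidemoPipeline; needs no Scrapy import."""
    path = "./SuShi.txt"
    fp = None

    def open_spider(self, spider):
        # Open once per crawl; "w" truncates, so this must not run per item
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One field per line, in the same order as the article's pipeline
        fields = ('poem', 'intro', 'title', 'author', 'poetry')
        self.fp.write('\n'.join(item[name] for name in fields) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()


# Drive the hooks by hand, in the order Scrapy calls them
pipeline = TxtPipelineSketch()
pipeline.path = os.path.join(tempfile.mkdtemp(), "SuShi.txt")
pipeline.open_spider(spider=None)
pipeline.process_item({'poem': 'Su Shi', 'intro': 'bio', 'title': 'T',
                       'author': 'A', 'poetry': 'text'}, spider=None)
pipeline.close_spider(spider=None)

with open(pipeline.path, encoding="utf-8") as f:
    written = f.read()
```

Returning the item from `process_item` keeps it available to any later pipeline listed in `ITEM_PIPELINES`.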

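**Step 6:** Modify the configuration file settings.py

**(1) Change the ROBOTSTXT_OBEY setting:** Double-click the configuration file settings.py under the project directory to open it. The ROBOTSTXT_OBEY setting defaults to True; change its value to False, as shown in the figure below:

**(2) User-Agent spoofing:** In settings.py, find the USER_AGENT setting and set it to the User-Agent string that a browser sends when visiting the target site. To find that value, press F12 to open the browser's developer tools, click Network, select the relevant request, and read the User-Agent entry under Request Headers, as shown in the figure below: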
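As a rough sketch, the relevant part of settings.py might look like the fragment below. The User-Agent string is only a placeholder to be replaced with the value copied from your own browser, and the `ITEM_PIPELINES` path is an assumption derived from the project and class names used in the code above; the priority 300 is the value Scrapy's project template suggests.

```python
# settings.py (fragment)

# Do not consult the site's robots.txt before crawling
ROBOTSTXT_OBEY = False

# Placeholder: paste the User-Agent copied from the browser's request headers
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

# Assumed module path; enables the pipeline so process_item is actually called
ITEM_PIPELINES = {
    "ScrapySuShiDemo.pipelines.ScrapysushidemoPipeline": 300,
}
```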