Python Web Scraping Self-Study Guide: How to Crawl the Next Page

Latest recommended article published on 2023-12-27 22:05:15

良木66

Categories: scrapy, python

Article link: https://blog.csdn.net/qq_44503987/article/details/105051951

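For a recap, see the previous article in this series.
Crawling the next page is simple: it comes down to obtaining the URL of the next page.
The idea is that once the spider has extracted everything it needs from the response, it uses XPath to find the link that represents "next page" and then sends a new request to that link with Request.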
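
The lookup step described above can be tried outside Scrapy. The following is a minimal sketch, assuming a small, well-formed stand-in page (SAMPLE_PAGE and find_next_href are hypothetical names, and the markup is not the real CSDN page). It uses the limited XPath subset in Python's standard library; real pages are often not well-formed XML, which is one reason to rely on Scrapy's own selectors in a spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a listing page (not the real CSDN markup).
SAMPLE_PAGE = """
<html>
  <body>
    <div class="list">
      <a class="post" href="/post/1">First post</a>
      <a class="post" href="/post/2">Second post</a>
    </div>
    <a class="show_more btn-erweima" href="//blog.csdn.net/qq_44503987">More</a>
  </body>
</html>
"""


def find_next_href(page_source):
    """Return the href of the "next page" anchor, or None if there is none."""
    root = ET.fromstring(page_source.strip())
    # ElementTree supports a limited XPath subset; an attribute predicate is within it.
    link = root.find(".//a[@class='show_more btn-erweima']")
    return link.get("href") if link is not None else None


next_href = find_next_href(SAMPLE_PAGE)
print(next_href)
```

When the anchor is missing, find_next_href returns None, which is the signal that the last page has been reached.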
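First, open my blog homepage in a browser, press F12 to open the developer tools, and locate the "next page" link in the HTML, as shown in the figure below:
[Figure: the "next page" link highlighted in the browser developer tools]
As the figure shows, the XPath for the next-page link is '//a[@class="show_more btn-erweima"]'. Now go to the project's spiders directory, open the spider file you created, and add the following code: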

# -*- coding: utf-8 -*-
import scrapy

from demo.items import DemoItem


class DemoSpiderSpider(scrapy.Spider):
    name = 'demo_spider'
    allowed_domains = ['csdn.net']
    start_urls = ['https://me.csdn.net/qq_44503987']

    def parse(self, response):
        for info in response.xpath('//div[@class="my_tab_page_con"]/dl[@class="tab_page_list"]'):
            item = DemoItem()
            # Blog post title (default='' avoids calling strip() on None)
            item['name'] = info.xpath('./dt/h3/a[@class="sub_title"]/text()').extract_first(default='').strip()
            # Number of reads
            item['red_number'] = info.xpath('./dd[@class="tab_page_con_b clearfix"]/div[@class="tab_page_b_l fl"]/label/em/text()').extract_first(default='').strip()
            # Publish date
            item['publish_date'] = info.xpath('./dd[@class="tab_page_con_b clearfix"]/div[@class="tab_page_b_r fr"]/text()').extract_first(default='').strip()
            yield item

        # Once every item on this page has been yielded, look for the "next page" link.
        new_links = response.xpath('//a[@class="show_more btn-erweima"]/@href').extract()
        if new_links:
            next_page = new_links[0]
            # The href here starts with "//", so the scheme is prepended by hand.
            yield scrapy.Request('https:' + next_page, callback=self.parse)
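
The control flow of this spider (yield every item on a page, then follow one next-page link, and stop when there is none) can be sketched without Scrapy. This is an illustration under stated assumptions: FAKE_SITE, its URLs, and crawl are all hypothetical, and a plain loop stands in for Scrapy scheduling the same callback again.

```python
# Hypothetical site: each URL maps to (items on that page, href of the next page or None).
FAKE_SITE = {
    "https://example.com/list?page=1": (["post A", "post B"], "https://example.com/list?page=2"),
    "https://example.com/list?page=2": (["post C"], "https://example.com/list?page=3"),
    "https://example.com/list?page=3": (["post D"], None),
}


def crawl(start_url, site):
    """Yield every item, following next-page links until a page has none."""
    url = start_url
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)
        items, next_href = site[url]
        # First emit everything found on the current page ...
        for item in items:
            yield item
        # ... then move to the next page, once, outside the item loop.
        url = next_href


all_items = list(crawl("https://example.com/list?page=1", FAKE_SITE))
print(all_items)
```

The seen set guards against a page that links back to an earlier one; Scrapy's default duplicate-request filter plays a similar role in a real spider.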

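Pay close attention to where the if new_links check sits in the spider: it comes after the for loop, not inside it, so the next page is requested once, after all items on the current page have been yielded. Pay even closer attention to how the 'https:' part is written in "yield scrapy.Request('https:' + next_page, callback=self.parse)". In my case the extracted value has the form next_page="//blog.csdn.net/qq_44503987". This is a protocol-relative URL: it names the host but leaves out the scheme, and it is not a path relative to the current page. That is why the request URL is written as 'https:' + next_page.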
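How do you decide what goes in the 'https:' position of "yield scrapy.Request('https:' + next_page, callback=self.parse)"? A simple check: open the page in a browser, inspect the source, and see whether the next-page href looks like "https://www.bai.com", "//www.bai.com", or "/ssss/sss". If it is a full URL such as "https://www.bai.com", pass it through unchanged: scrapy.Request(next_page, callback=self.parse). If it starts with "//", as "//www.bai.com" does, keep the 'https:' prefix, because scrapy.Request rejects a URL that has no scheme.
If it is a root-relative path such as "/ssss/sss", prepend the site root, for example scrapy.Request(self.start_urls[0] + next_page, callback=self.parse). This only gives the right URL when start_urls[0] is the bare site root, with no path and no trailing slash. Scrapy's response.urljoin(next_page) resolves all three forms against the current page URL, so scrapy.Request(response.urljoin(next_page), callback=self.parse) works without checking the form by hand.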
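
To see how the three href forms resolve, here is a short sketch using urljoin from Python's standard library, which follows the same resolution rules a browser applies to links. PAGE_URL is a hypothetical page address chosen for illustration, and the hrefs are the example forms from the text with a made-up path added to the first one.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page on which the links were found.
PAGE_URL = "https://blog.csdn.net/qq_44503987/article/list/1"

hrefs = [
    "https://www.bai.com/page/2",   # absolute URL: returned unchanged
    "//blog.csdn.net/qq_44503987",  # protocol-relative: inherits the base scheme
    "/ssss/sss",                    # root-relative: inherits scheme and host
]

# Resolve each href against the page it appeared on.
resolved = [urljoin(PAGE_URL, href) for href in hrefs]
for href, url in zip(hrefs, resolved):
    print(href, "->", url)
```

Every resolved value carries a scheme and a host, so it can be handed to a request as-is regardless of which form the page used.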