Scrapy Tutorial (5): Pagination Strategy

Latest recommended article published on 2022-02-13 21:08:46

weixin_30823227


Original link: http://www.cnblogs.com/yanshw/p/10844768.html


A strategy for crawling paginated sites with Scrapy

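1. Check whether the current page has a "next page" link.

2. If it does, hand the "next page" link to the same method or to another method.

3. If it does not, stop.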
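
The three steps above can be sketched without Scrapy at all. The snippet below is only a minimal illustration, not the article's spider: `FAKE_SITE`, `crawl`, and the example.com URLs are hypothetical stand-ins, with a dictionary playing the role of the network.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> (items on the page, href of the next-page link or None).
FAKE_SITE = {
    "http://example.com/page/1/": (["quote A", "quote B"], "/page/2/"),
    "http://example.com/page/2/": (["quote C"], "/page/3/"),
    "http://example.com/page/3/": (["quote D"], None),
}


def crawl(start_url):
    """Collect items page by page, following the next-page link until it is missing."""
    items = []
    url = start_url
    while url is not None:
        page_items, next_href = FAKE_SITE[url]
        items.extend(page_items)
        # Step 1: check whether the current page has a next-page link.
        if next_href is None:
            # Step 3: no link, so stop.
            break
        # Step 2: hand the (absolute) link back to the same routine.
        url = urljoin(url, next_href)
    return items


print(crawl("http://example.com/page/1/"))
```

In a real spider the loop is replaced by yielding a new request from the callback, but the decision made on each page is the same.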

Diagram

Sample code

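# Method of a scrapy.Spider subclass; the module needs `import scrapy` at the top.
def parse(self, response):
    # Each quote on the page sits in its own div.quote block.
    mingyan = response.css('div.quote')
    for v in mingyan:
        # default='' avoids writing None when a quote has no text node.
        text = v.css('.text::text').extract_first(default='')
        tags = v.css('.tags .tag::text').extract()
        tags = ','.join(tags)
        fileName = '%s-quotes.txt' % tags
        # Append to one file per tag combination; the with block closes the file.
        with open(fileName, 'a+', encoding='utf-8') as f:
            f.write(text)
            f.write('\n')
            f.write('Tags: ' + tags)
            f.write('\n-------\n')

    # Follow the "next page" link if the page has one; otherwise the crawl ends here.
    next_page = response.css('li.next a::attr(href)').extract_first()
    if next_page is not None:
        # The href may be relative, so resolve it against the current page URL.
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)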

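The parser checks for a next-page link. If one exists, it yields a new request whose callback is the same parser, so crawling continues from that page. This is a recursive way of crawling a paginated site.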
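
The `response.urljoin` call in the parser matters because the `href` of a next-page link is often relative. As a rough illustration, Python's standard `urllib.parse.urljoin` resolves links the same way in the simple case of a page without a `<base>` tag. The example.com URLs below are made up.

```python
from urllib.parse import urljoin

page_url = "http://example.com/page/1/"

# A root-relative href replaces the whole path of the current page.
absolute_from_root = urljoin(page_url, "/page/2/")

# A relative href is resolved against the directory of the current page.
absolute_from_relative = urljoin(page_url, "../2/")

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "http://example.com/page/3/")

print(absolute_from_root)
print(absolute_from_relative)
print(already_absolute)
```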

Of course, other approaches work too.
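
One such alternative, assuming the site exposes predictable numbered page URLs (an assumption, the article does not say so), is to generate the page URLs directly and stop at the first empty page. `FAKE_PAGES`, `fetch_items`, and the URL pattern below are hypothetical and stand in for real requests.

```python
from itertools import count

# Hypothetical data standing in for a site whose pages are /page/1/, /page/2/, ...
FAKE_PAGES = {1: ["quote A", "quote B"], 2: ["quote C"]}


def fetch_items(page_number):
    """Hypothetical fetcher: returns the items on a page, or an empty list past the end."""
    return FAKE_PAGES.get(page_number, [])


def crawl_by_page_number(url_pattern="http://example.com/page/{}/"):
    """Visit numbered pages in order and stop at the first page with no items."""
    visited = []
    for n in count(1):
        items = fetch_items(n)
        if not items:
            break
        visited.append((url_pattern.format(n), items))
    return visited


print(crawl_by_page_number())
```

This avoids depending on the next-page link at all, at the cost of relying on the URL pattern staying stable.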

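Note that this is only one approach, not a universally correct one. On some sites the last page still carries a next-page href even though there is no next page, for example one that points back to the first page, so the strategy has to be tailored to the actual site.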
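
One way to guard against a next-page link that loops back is to remember which URLs have already been visited. The sketch below is a hypothetical, Scrapy-free illustration: `LOOPING_SITE` and `crawl_once` are invented names, and the dictionary simulates a site whose last page links back to page 1.

```python
from urllib.parse import urljoin

# Hypothetical site whose last page still has a "next" link, pointing back to page 1.
LOOPING_SITE = {
    "http://example.com/page/1/": "/page/2/",
    "http://example.com/page/2/": "/page/1/",
}


def crawl_once(start_url):
    """Follow next-page links, stopping when a link leads to a page already seen."""
    seen = set()
    order = []
    url = start_url
    while url is not None and url not in seen:
        seen.add(url)
        order.append(url)
        next_href = LOOPING_SITE.get(url)
        url = urljoin(url, next_href) if next_href is not None else None
    return order


print(crawl_once("http://example.com/page/1/"))
```

Scrapy's scheduler filters duplicate requests by default (unless a request sets `dont_filter=True`), so a spider often gets this protection without extra code. An explicit check like the one above is mainly useful when that filter is bypassed or when the loop shows up under differing URLs.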

Reposted from: https://www.cnblogs.com/yanshw/p/10844768.html
