Scrapy: a summary of ways to follow links

Latest recommended article published on 2021-10-07 11:33:47

Yohohaha

Category: scrapy. Tags: scrapy

Article link: https://blog.csdn.net/Yohohaha/article/details/75195284

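When writing the parse method of a Scrapy spider, you often need to extract some links from the page and keep crawling them. Scrapy provides several convenient ways to do this, summarized below.

Assume the <a> tag we want to follow has been selected as target_a.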

Method 1:

next_page = target_a.css('::attr(href)').extract_first()
if next_page is not None:
    # turn the (possibly relative) href into an absolute URL
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

Method 2:

next_page = target_a.css('::attr(href)').extract_first()
if next_page is not None:
    # follow() accepts a relative URL and resolves it against the response
    yield response.follow(next_page, callback=self.parse)

Method 2, variant 1:

next_page = target_a.css('::attr(href)')
# css() returns a SelectorList, which is never None, so test that it is non-empty
if next_page:
    yield response.follow(next_page[0], callback=self.parse)

Method 2, variant 2:

# target_a must be a single Selector for an <a> element, not a SelectorList
if target_a is not None:
    yield response.follow(target_a, callback=self.parse)

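Explanation

Method 1: build the absolute URL of the next page yourself, then yield a new Request object.
Method 2: no need to build the absolute URL; the follow method resolves relative URLs for us.
Method 2, variant 1: no need to extract the URL string; just pass the selector for the href attribute.
Method 2, variant 2: no need to select the href attribute at all; pass the selector for the <a> element and follow extracts the href automatically.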
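The difference between Method 1 and Method 2 is only who resolves the relative href. As a rough illustration, the sketch below uses the standard library's `urljoin`, which is the same kind of resolution that `response.urljoin` and `response.follow` perform against the page URL. The page URL and the `resolve` helper are made up for this example and are not part of Scrapy.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed (stands in for response.url)
PAGE_URL = "https://example.com/catalogue/page-1.html"


def resolve(href: str) -> str:
    """Resolve an href found on the page against the page URL."""
    return urljoin(PAGE_URL, href)


# A relative href is resolved against the directory of the current page
print(resolve("page-2.html"))
# A root-relative href keeps only the scheme and host of the current page
print(resolve("/about"))
# An absolute href is returned unchanged
print(resolve("https://other.example/x"))
```

With Method 1 you call this resolution step explicitly before building the Request; with Method 2 and its variants it happens inside follow.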

Note that the object passed to follow must be a str or a single Selector (a Link object is also accepted); a SelectorList is not allowed.