Understanding response.urljoin for building URLs in Scrapy


Andre_zeng


Article link: https://blog.csdn.net/weixin_47420595/article/details/106933543


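A crawler usually has to page through results, so each new request needs a URL with a new parameter or value. There are several ways to build that URL. If the pagination is not loaded dynamically via AJAX and the page URLs follow a regular, static pattern, urljoin is very convenient.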

Starting point (url)

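    # .get() returns the first match as a string, or None if nothing matched
    # (.extract() would return a list, which is never None and cannot be joined)
    next_page_url = response.xpath('...').get()  # the part of the URL that changes
    if next_page_url is not None:
        yield scrapy.Request(response.urljoin(next_page_url))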

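Let's analyze this line: yield scrapy.Request(response.urljoin(next_page_url))

response.urljoin(next_page_url) as a whole is a single argument: it is the URL passed to scrapy.Request. Yielding that request tells Scrapy to crawl the next page.

How does response.urljoin(next_page_url) itself work? next_page_url is passed to the response.urljoin method, which combines it with the URL of the current page and returns the full link to the next page.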
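To make the "find link, join, request, repeat" loop concrete, here is a minimal sketch that uses only the standard library. It is not Scrapy code: FAKE_SITE and follow_pages are hypothetical names invented for this illustration, and the in-memory dictionary stands in for real responses.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": page URL -> relative link to the next page.
# None marks the last page. This stands in for real HTTP responses.
FAKE_SITE = {
    "http://quotes.toscrape.com/": "/page/2/",
    "http://quotes.toscrape.com/page/2/": "/page/3/",
    "http://quotes.toscrape.com/page/3/": None,
}


def follow_pages(start_url):
    """Yield each page URL, following next-page links until none is left."""
    url = start_url
    while url is not None:
        yield url
        next_page_url = FAKE_SITE.get(url)
        # Same shape as the spider code: join only when a link was found.
        url = urljoin(url, next_page_url) if next_page_url is not None else None


visited = list(follow_pages("http://quotes.toscrape.com/"))
print(visited)
```

The None check is what stops the loop on the last page, which is the same job the `if next_page_url is not None` test does in the spider.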

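Note that response.urljoin takes only one argument. The two-argument form is the standard library's urllib.parse.urljoin(base, url); response.urljoin fills in the base for you with the URL of the current response (response.url). For the first page, that is the URL you set in start_urls = ['…'] at the top of the spider. (For HTML responses, Scrapy uses the page's <base> tag as the base when one is present.)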
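The relationship between the one-argument method and the two-argument standard-library function can be sketched as follows. FakeResponse is a hypothetical stand-in written for this example, not Scrapy's own class; it models only the url attribute and the urljoin method, and it ignores any <base> tag handling.

```python
from urllib.parse import urljoin


class FakeResponse:
    """Hypothetical stand-in for a Scrapy response (only url and urljoin)."""

    def __init__(self, url):
        self.url = url

    def urljoin(self, url):
        # The response's own URL is supplied as the base automatically.
        return urljoin(self.url, url)


response = FakeResponse("http://quotes.toscrape.com/")
next_page_url = "/page/2"

joined_by_method = response.urljoin(next_page_url)
joined_by_function = urljoin(response.url, next_page_url)

print(joined_by_method)
print(joined_by_method == joined_by_function)
```

Both calls give the same result, which is why the base URL never has to be written out when calling the method on a response.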

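Example:
Say the spider's start URL is 'http://quotes.toscrape.com/'.
Then response.url for that first page is 'http://quotes.toscrape.com/'.
When you call response.urljoin, you don't pass this base URL yourself; you only pass the part that needs to be joined.
Suppose next_page_url is set to some value (it must be a string).
The results then look like this: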

    next_page_url = '/page/2'
    response.urljoin(next_page_url)
    'http://quotes.toscrape.com/page/2'

    next_page_url = 'wherearethefathers?'
    response.urljoin(next_page_url)
    'http://quotes.toscrape.com/wherearethefathers'
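The result depends on the form of the string being joined. The sketch below uses the standard library's urllib.parse.urljoin directly, with a base of 'http://quotes.toscrape.com/page/1/' chosen for illustration (it is not from the examples above), to show how a few common forms resolve.

```python
from urllib.parse import urljoin

# Assumed base for illustration: the URL of the page currently being parsed.
base = "http://quotes.toscrape.com/page/1/"

references = [
    "/page/2/",              # starts with "/": resolved from the site root
    "page/2/",               # no leading "/": resolved relative to the base path
    "../2/",                 # ".." climbs one directory before appending
    "?page=3",               # query only: keeps the base path
    "http://example.com/x",  # already absolute: replaces the base entirely
]

results = {ref: urljoin(base, ref) for ref in references}

for ref, result in results.items():
    print(f"{ref!r:26} -> {result}")
```

A missing leading slash is a common source of surprises: 'page/2/' joined to a base ending in '/page/1/' gives '/page/1/page/2/', not '/page/2/'.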
