Why does urllib.robotparser return False for some sites' robots.txt files when it should return True?

Latest recommended article published 2024-09-14 08:25:16

Elzzach

Article tags: python, web crawler

Article link: https://blog.csdn.net/qumingguicai/article/details/117381946

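This article looks at why urllib.robotparser returns False when parsing a site's robots file. The analysis shows that a 403 error causes the disallow_all attribute to become True. The fix is to send the request yourself with a User-Agent header so that the 403 error is avoided and the robots file is parsed correctly.

Summary generated by CSDN using intelligent technology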

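(Python web crawling) Why does the urllib.robotparser module return False for some sites' robots files when the result should be True?

This is my first post as a newcomer, hahaha

Problem description

I tried to analyze the robots file of the Jianshu website with the following code.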

from urllib.robotparser import RobotFileParser
rp = RobotFileParser('https://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.jianshu.com/mobile/campaign/day_by_day/join?from=home'))

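Judging from Jianshu's robots file, this code should print True, but when run it actually prints False.

Investigating the cause

First we need to understand the logic by which can_fetch() returns True or False. Viewed in PyCharm, the first part of the can_fetch() source looks like this: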

    def can_fetch(self, useragent, url):
        """using the parsed robots.txt decide if useragent can fetch url"""
        if self.disallow_all:
            return False
        if self.allow_all:
            return True
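
The short-circuit in these first lines can be checked offline, without touching any real site. The following is only a minimal sketch: the rules and the example.com URL are made up for illustration and are not from Jianshu's robots file, and setting disallow_all by hand simply simulates the flag having been switched on.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Feed hypothetical rules directly so that no network request is made.
rp.parse(["User-agent: *", "Disallow: /private/"])

# With normal parsing, a path outside /private/ is allowed.
allowed_before = rp.can_fetch("*", "https://example.com/public/page")

# Simulate the flag being turned on; can_fetch() now returns False
# before it even looks at the parsed rules.
rp.disallow_all = True
allowed_after = rp.can_fetch("*", "https://example.com/public/page")

print(allowed_before, allowed_after)
```

If the second value is False while the rules themselves allow the URL, the flag, not the rules, is what decided the result.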

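So the method first checks the disallow_all attribute and returns False if it is True. This attribute defaults to False in the RobotFileParser class, which means its value was changed to True at some point during execution. The only call made before can_fetch() that could have changed it is read(), so let's look at the source of read():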

    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            f = urllib.request.urlopen(self
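
The summary at the top of this article describes the fix as sending the request yourself with a User-Agent header, so that the 403 error never occurs, and then parsing the result. One possible way to do that is sketched below. The helper names fetch_robots_text and parser_from_text, the "Mozilla/5.0" User-Agent string, and the 10-second timeout are all assumptions made for illustration; this is not necessarily the author's exact code, and whether a given site accepts this User-Agent is not guaranteed.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def fetch_robots_text(robots_url, user_agent):
    """Download robots.txt, sending an explicit User-Agent header."""
    request = urllib.request.Request(robots_url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def parser_from_text(robots_url, robots_text):
    """Build a RobotFileParser from text that was already downloaded."""
    rp = RobotFileParser(robots_url)
    # parse() takes an iterable of lines, so read() is bypassed entirely.
    rp.parse(robots_text.splitlines())
    return rp


# Usage (performs a network request, so it is left commented out):
# url = "https://www.jianshu.com/robots.txt"
# rp = parser_from_text(url, fetch_robots_text(url, "Mozilla/5.0"))
# print(rp.can_fetch("*", "https://www.jianshu.com/mobile/campaign/day_by_day/join?from=home"))
```

Splitting the download from the parsing keeps read() out of the picture, so an HTTP error surfaces as an exception you can inspect instead of silently flipping disallow_all.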
