[Web Crawlers] Using a site's robots.txt file to decide whether a crawler is allowed to fetch a page

1024码字猿

Last modified 2022-02-04 22:14:00

Tags: web crawler, python

First published 2022-02-04 22:06:58

Article link: https://blog.csdn.net/weixin_40458518/article/details/122785687

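This article shows how to use Python's robotparser module (urllib.robotparser) to parse a robots.txt file, and how the methods of the RobotFileParser class determine whether a crawler may fetch a page. It covers three approaches: passing the robots.txt URL when creating the object, calling set_url(), and calling parse() to read and analyse the file, each with a worked example.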

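Use the urllib.robotparser module to parse robots.txt files. The module provides the RobotFileParser class, which decides from a site's robots.txt file whether a given crawler is allowed to fetch a given page.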

Syntax:

urllib.robotparser.RobotFileParser(url='')
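
As a minimal sketch of what the url argument does: passing the URL to the constructor should be equivalent to creating an empty parser and calling set_url() afterwards. Neither step touches the network. The name ROBOTS_URL is introduced here for illustration, and the url attribute used for the comparison is an implementation detail of CPython's RobotFileParser rather than documented API, so treat the check as an assumption.

```python
from urllib.robotparser import RobotFileParser

# Illustrative constant; the address is the one used throughout this article.
ROBOTS_URL = 'https://www.baidu.com/robots.txt'

# Variant A: pass the URL to the constructor.
rp_ctor = RobotFileParser(ROBOTS_URL)

# Variant B: create an empty parser, then set the URL.
rp_setter = RobotFileParser()
rp_setter.set_url(ROBOTS_URL)

# Both parsers now point at the same robots.txt file.
# (`url` is an undocumented attribute in CPython's implementation.)
same_target = rp_ctor.url == rp_setter.url
print(same_target)
```

In both variants nothing has been downloaded yet; the rules are only loaded once read() or parse() is called.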

The content of https://www.baidu.com/robots.txt is as follows (excerpt):

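User-agent: Baiduspider          # Baidu's crawler
Disallow: /baidu                 # disallow paths starting with /baidu
Disallow: /s?                    # disallow paths starting with /s?
Disallow: /ulink?                # disallow paths starting with /ulink?
Disallow: /link?                 # disallow paths starting with /link?
Disallow: /home/news/data/       # disallow everything under /home/news/data/
Disallow: /bh                    # disallow paths starting with /bh

User-agent: Googlebot            # Google's crawler
Disallow: /baidu                 # disallow paths starting with /baidu
Disallow: /s?                    # disallow paths starting with /s?
Disallow: /shifen/               # disallow everything under /shifen/
Disallow: /homepage/             # disallow everything under /homepage/
Disallow: /cpro                  # disallow paths starting with /cpro
Disallow: /ulink?                # disallow paths starting with /ulink?
Disallow: /link?                 # disallow paths starting with /link?
Disallow: /home/news/data/       # disallow everything under /home/news/data/
Disallow: /bh                    # disallow paths starting with /bh

User-agent: *                    # every other crawler, i.e. any crawler not named elsewhere in robots.txt
Disallow: /                      # disallow the entire site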
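
To see how these rules are applied without depending on the live site, the excerpt can be fed to the parser directly with parse(). This is a minimal sketch, not the author's listing: ROBOTS_EXCERPT is a trimmed copy of the rules above (the entries containing "?" are left out to keep the example simple), and MyCrawler is a hypothetical user agent standing in for "any other crawler".

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the excerpt above (assumption: rules with "?" omitted).
ROBOTS_EXCERPT = """\
User-agent: Baiduspider
Disallow: /baidu
Disallow: /home/news/data/
Disallow: /bh

User-agent: Googlebot
Disallow: /baidu
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /home/news/data/
Disallow: /bh

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed.
rp.parse(ROBOTS_EXCERPT.splitlines())

# Baiduspider has no rule covering the root or /homepage/.
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com'))            # True
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/homepage/'))  # True
# Googlebot is explicitly barred from /homepage/.
print(rp.can_fetch('Googlebot', 'https://www.baidu.com/homepage/'))    # False
# An unnamed crawler (hypothetical) falls under "User-agent: *".
print(rp.can_fetch('MyCrawler', 'https://www.baidu.com/'))             # False
```

A crawler that is named in the file is judged only by its own group; the "User-agent: *" group applies only when no named group matches.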

1. Method 1: pass the robots.txt URL directly when creating the object

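# Python version: 3.6
# -*- coding:utf-8 -*-

from urllib.robotparser import RobotFileParser

# Pass the robots.txt URL directly when creating the object
rp = RobotFileParser('https://www.baidu.com/robots.txt')
# Required: read() downloads robots.txt and feeds it to the parser.
# Without this call, can_fetch() returns False for every URL.
rp.read()

# Baidu's crawler is allowed to access the root of the domain
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com'))            # True
# Baidu's crawler is allowed to access the /homepage/ directory
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/homepage/'))  # True
# Google's crawler is forbidden from accessing the /homepage/ directory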
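
The note about read() in the listing above can be checked offline. The sketch below is an illustration under stated assumptions: the example.com URL is hypothetical, and parse([]) with an empty rule list stands in for a successful read() so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical page URL, used only for illustration.
PAGE_URL = 'https://www.example.com/some/page'

rp = RobotFileParser()

# Nothing has been loaded yet: mtime() reports 0 and
# can_fetch() refuses every request.
checked_before = rp.mtime()
allowed_before = rp.can_fetch('Baiduspider', PAGE_URL)

# Load an empty rule set (stand-in for read() fetching a file with no rules).
rp.parse([])

# Rules are now considered loaded; with no Disallow lines,
# every URL is allowed.
checked_after = rp.mtime()
allowed_after = rp.can_fetch('Baiduspider', PAGE_URL)

print(allowed_before, allowed_after)
```

A False result from can_fetch() therefore does not always mean the site forbids the page; it can also mean the rules were never loaded, which mtime() makes easy to tell apart.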
