查看网站的爬虫协议，简单介绍爬虫协议robots.txt，避免爬虫爬的好，牢饭吃得早(保姆级图文)

本文链接：https://blog.csdn.net/u011027547/article/details/122513958

本文介绍了爬虫协议Robots协议的基本概念及其查询方法，并通过实例详细解读了协议内容，包括爬虫引擎限制、禁止爬取的内容及请求速率等关键信息。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

什么是爬虫协议

正经正规网站一般都会有设置爬虫协议，规定哪些能够让你爬，哪些不能让你爬。
网页的爬虫协议就是Robots协议也叫robots.txt。

只要是在网站允许的范围内爬取数据，合法的使用数据，就可以避免避免爬虫爬的好，牢饭吃得早。

查询方法

打开一个网站的首页（必须是首页）

这里以简书为例子，简书的官网首页是

https://www.jianshu.com

在这里插入图片描述
在原来的首页网站后面加入/robots.txt

https://www.jianshu.com/robots.txt

得到了协议内容
在这里插入图片描述

解读协议内容

爬虫引擎限制

User-agent: *

*是通配符，表示可以被所有爬虫搜索引擎找到（一般网站都是这样，可以使得网站被更多引擎搜索到，增加曝光率）

User-agent: Crawler

限制只有Crawler搜索引擎爬取

禁止爬取内容

Disallow: /search

不允许爬取网站的search目录内容

请求速率

Request-rate: 1/2 # load 1 page per 2 seconds

请求速率：1/2#每2秒加载1页

爬网延迟

Crawl-delay: 10

爬网延迟：10

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-agent: *
Disallow: /search
Disallow: /convos/
Disallow: /notes/
Disallow: /admin/
Disallow: /adm/
Disallow: /p/0826cf4692f9
Disallow: /p/d8b31d20a867
Disallow: /collections/*/recommended_authors
Disallow: /trial/*
Disallow: /keyword_notes
Disallow: /stats-2017/*

User-agent: trendkite-akashic-crawler
Request-rate: 1/2 # load 1 page per 2 seconds
Crawl-delay: 60

User-agent: YisouSpider
Request-rate: 1/10 # load 1 page per 10 seconds
Crawl-delay: 60

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 2/1 # load 2 page per 1 seconds
Crawl-delay: 10
Allow: /

User-agent: Mediapartners-Google
Allow: /

#看http://www.robotstxt.org/wc/norobots.html有关如何使用机器人的文档。txt文件

#

#要禁止整个站点中的所有spider，请取消注释下面两行：

用户代理：*

不允许：/search

不允许：/convers/

不允许：/notes/

不允许：/admin/

不允许：/adm/

不允许：/p/0826cf4692f9

不允许：/p/d8b31d20a867

不允许：/collections/*/推荐作者

不允许/审判/*

不允许：/keyword\u注释

不允许：/stats-2017/*



用户代理：trendkite akashic爬虫

请求速率：1/2#每2秒加载1页

爬网延迟：60



用户代理：YisouSpider

请求速率：1/10#每10秒加载1页

爬网延迟：60



用户代理：Cliqzbot

禁止：/



用户代理：谷歌机器人

请求速率：2/1#每1秒加载2页

爬网延迟：10

允许：/



用户代理：Mediapartners谷歌

允许：/