Python Web Crawlers and the Robots Protocol

Latest recommended article published 2024-03-14 11:45:00

sdu@xy

Category: python    Tags: python, networking

Article link: https://blog.csdn.net/qq_44787993/article/details/105847420

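Requests library: small scale, small data volumes, crawl speed is not critical.

Scrapy: medium scale, larger data volumes, crawl speed matters.

Custom-built crawlers (e.g. Google): large scale, search engines, crawl the whole web, crawl speed is critical.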

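Robots: the Robots Exclusion Standard. A website uses it to tell crawlers which pages may be crawled and which may not. Form: a robots.txt file in the site's root directory.

e.g. http://www.jd.com/robots.txt

http://www.moe.edu.cn/robots.txt # this site has no robots protocol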
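Because robots.txt always sits in the site's root directory, its address can be derived from any page URL on that site. The sketch below shows one way to do this with the standard library. The function name `robots_url` and the sample page path are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper: the name is not part of any library.
    """
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the page path, query and fragment,
    # because robots.txt lives at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is an invented example, not a real JD page.
print(robots_url("http://www.jd.com/pop/list.html?page=2"))
```

A crawler would typically call such a helper once per host and cache the result, rather than recomputing it for every page.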

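User-agent: * # applies to any crawler, whatever its origin
Disallow: /?* # no access to paths beginning with /?
Disallow: /pop/*.html
Disallow: /pinpai/*.html?* # anything matching these wildcard patterns is off limits
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider # the four named crawlers are treated as malicious and denied all of JD's content
Disallow: / # all directories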
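To see how a crawler might apply these rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules are pasted in as a string so that nothing is fetched over the network. Two things here are assumptions rather than part of the original post: the user-agent name `MyCrawler` is invented, and `wildcard_match` is a hypothetical helper. As far as I know the standard-library parser compares paths by plain prefix and does not expand `*` inside a path, so the helper illustrates the wildcard reading separately.

```python
import re
from urllib.robotparser import RobotFileParser

# Rules copied from the listing above, with the comments removed.
RULES = """\
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def wildcard_match(pattern: str, path: str) -> bool:
    """Hypothetical helper: does a robots.txt path pattern match path?

    '*' stands for any run of characters and a trailing '$' anchors the
    end of the path; otherwise the pattern only has to match a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


parser = build_parser(RULES)

# A named crawler with "Disallow: /" is refused everywhere.
print(parser.can_fetch("EtaoSpider", "http://www.jd.com/"))

# "MyCrawler" (an invented name) falls under "User-agent: *".
print(parser.can_fetch("MyCrawler", "http://www.jd.com/index.html"))

# Wildcard reading of one of the rules for all crawlers.
print(wildcard_match("/pop/*.html", "/pop/abc.html"))
```

The parser answers the per-crawler question (which `User-agent` group applies), while the helper shows why a path such as `/pop/abc.html` falls under `Disallow: /pop/*.html` when `*` is read as a wildcard.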
