Robots协议是互联网爬虫的一项公认的道德规范,它的全称是“网络爬虫排除标准”(Robots exclusion protocol),这个协议用来告诉爬虫,哪些页面是可以抓取的,哪些不可以。
如何查看网站的robots协议呢,很简单,在网站的域名后加上/robots.txt就可以了。
如百度https://www.baidu.com/robots.txt
User-agent: Baiduspider # 百度爬虫 Disallow: /baidu #disallow禁止访问,allow允许访问 Disallow: /s? Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Googlebot Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: MSNBot Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Baiduspider-image Disallow: /baidu Disallow: /s? Disallow: /shifen/