Parsing the robots.txt protocol and meta tags


  •   The robots.txt protocol


 

# The file name must be robots.txt, not Robots.TXT

# For all robots: allow access to everything
# The original text explains this as "To allow all robots complete access"

User-agent: *
Disallow:

# Alternatively (from the docs): or just create an empty "/robots.txt" file, or don't use one at all

 

# For all robots: disallow access to everything

User-agent: *
Disallow: /

 

# For all robots: disallow access to certain directories

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

# For all robots: disallow access to one specific file

User-agent: *
Disallow: /directory/file.html

 

# For two robots: disallow access to a specific directory

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

 

# Allow one specific robot full access, and block all others

User-agent: Google
Disallow:

User-agent: *
Disallow: /

 

# Rules for specific robots

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few sites, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

  •   Example demonstrating multiple user-agents:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
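
To make these groups concrete, here is a minimal sketch of how a crawler could check them, using Python's standard urllib.robotparser (the paths and the bot name AnyOtherBot are made up). One caveat: Python matches groups top-down with substring matching, so the more specific googlebot-news group is listed first here; placed after googlebot, it would be shadowed by it.

import urllib.robotparser

rules = """\
User-agent: googlebot-news
Disallow: /

User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /something/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())   # parse() accepts the rules as in-memory lines

print(rp.can_fetch("googlebot", "/private/page.html"))      # False: blocked by the googlebot group
print(rp.can_fetch("googlebot", "/public/page.html"))       # True
print(rp.can_fetch("googlebot-news", "/public/page.html"))  # False: the news service is blocked entirely
print(rp.can_fetch("AnyOtherBot", "/something/x.html"))     # False: falls through to the * group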

 

Besides these, there are also some nonstandard extensions:

Visit http://en.wikipedia.org/wiki/Robots_exclusion_standard
and use Ctrl+F to search for "Nonstandard extensions".

 

(The following is unverified) {

# Robot version number
Robot-version: Version 2.0

# Allow visits only between 00:00 and 07:00 (local server time?)
Visit-time: 0000-0700

# Limit the URL fetch rate: between 02:00 and 06:00, at most 10 fetches per minute
Request-rate: 10/1m 0200-0600

}
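
Incidentally, Python's standard urllib.robotparser does understand two nonstandard extensions, Crawl-delay and Request-rate (since Python 3.6), though only in the plain requests/seconds form; the timed form above, the Visit-time line, and Robot-version are ignored. A minimal sketch (MyBot is a made-up user-agent):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Crawl-delay: 5
Request-rate: 10/60
Disallow: /tmp/
""".splitlines())

print(rp.crawl_delay("MyBot"))      # 5 (seconds to wait between requests)
rate = rp.request_rate("MyBot")     # a named tuple: RequestRate(requests=10, seconds=60)
print(rate.requests, rate.seconds)  # 10 60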


  •   Meta tags


Format (from the docs), placed inside the <head>...</head> section of the page:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

 

 

Purpose (from the docs): You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.

 

# The NAME attribute must be ROBOTS
# The candidate values for CONTENT="" are: index, noindex, follow, nofollow
# index tells the crawler to index this page; follow allows the crawler to keep crawling along the links on this page

 

# The default value of CONTENT is "INDEX,FOLLOW". Original text: the default is "INDEX,FOLLOW", so there's no need to spell that out. That leaves:

# The other three combinations are:

# Combination 1
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

# Combination 2
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">

# Combination 3
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
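
As a sketch of how a crawler might read this tag, here is a small example using Python's standard html.parser; the RobotsMetaParser class and the HTML snippet are made up for illustration:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)

print("noindex" not in parser.directives)   # False: this page must not be indexed
print("nofollow" not in parser.directives)  # True: its links may still be followed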

 

There are two important considerations when using the robots tag:

  •   robots can ignore your tag. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.

  •   the NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot might find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrives at your undesired page.

 

(The following is unverified) {

# Google's search engine fully supports the meta tags above, and adds an "archive" value indicating whether a cached snapshot of the page may be kept
# For example, the following would allow all search engines to keep a cached snapshot of this page:
<META NAME="ROBOTS" CONTENT="ARCHIVE">

}

 

 

The FAQ links to a page with more about robots, but that page now returns a 404.

http://www.robotstxt.org/faq/info.html mentions:

There is a Web robots home page on:

http://www.robotstxt.org/wc/robots.html

 

On the difference and relationship between robots.txt and the meta tag

the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Also, the robots meta tag only works on HTML pages, not images, text files, PDF documents, etc. Finally, if the pages/resources have already been excluded by a robots.txt file, then they will not be crawled and the meta tags and headers will have no effect. This can have the counterintuitive effect that a web address is indexed by a search engine such as Google if it honors the site's robots.txt, stops crawling and never receives the advice not to index the site.
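
The ordering described in the quote can be sketched as follows; the URL and the MyBot user-agent are hypothetical, and error handling is omitted. robots.txt is consulted before the page is ever requested, whereas a meta tag can only be honored after the page has been fetched:

import urllib.robotparser
import urllib.request

url = "http://example.com/some/page.html"    # hypothetical
rp = urllib.robotparser.RobotFileParser("http://example.com/robots.txt")
rp.read()                                    # step 1: fetch and parse robots.txt

if rp.can_fetch("MyBot", url):               # step 2: may we request the page at all?
    html = urllib.request.urlopen(url).read()
    # step 3: only now can a <META NAME="ROBOTS"> tag in the page be seen and honored
else:
    # the page is never fetched, so its meta tags are never seen; the bare URL
    # can still end up indexed if other sites link to it
    pass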

Reposted from ITPUB blog: http://blog.itpub.net/29773961/viewspace-1377335/
