Parsing the robots.txt protocol and meta tags


  •   The robots.txt protocol


 

# The file name must be robots.txt, not Robots.TXT

# For all robots: allow access to everything
# The original text explains this as "To allow all robots complete access"

User-agent: *
Disallow:

# Alternatively (from the docs): or just create an empty "/robots.txt" file, or don't use one at all

 

# For all robots: disallow access to everything

User-agent: *
Disallow: /

 

# For all robots: disallow access to certain directories

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

# For all robots: disallow access to one specific file

User-agent: *
Disallow: /directory/file.html

 

# For two robots: disallow access to a specific directory

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

 

# Allow one specific robot full access, and block all others

User-agent: Google
Disallow:

User-agent: *
Disallow: /

 

# Rules for specific robots

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few sites, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

  •   Example demonstrating multiple user-agents:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
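
To make these groups concrete, here is a minimal sketch of how a crawler could check them, using Python's standard urllib.robotparser (the paths and the bot name AnyOtherBot are made up). One caveat: Python matches groups top-down with substring matching, so the more specific googlebot-news group is listed first here; placed after googlebot, it would be shadowed by it.

import urllib.robotparser

rules = """\
User-agent: googlebot-news
Disallow: /

User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /something/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())   # parse() accepts the rules as in-memory lines

print(rp.can_fetch("googlebot", "/private/page.html"))      # False: blocked by the googlebot group
print(rp.can_fetch("googlebot", "/public/page.html"))       # True
print(rp.can_fetch("googlebot-news", "/public/page.html"))  # False: the news service is blocked entirely
print(rp.can_fetch("AnyOtherBot", "/something/x.html"))     # False: falls through to the * group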

 

Besides these, there are also some nonstandard extensions:

Visit http://en.wikipedia.org/wiki/Robots_exclusion_standard
and use Ctrl+F to search for "Nonstandard extensions".

 

(The following is unverified) {

# Robot version number
Robot-version: Version 2.0

# Allow visits only between 00:00 and 07:00 (local server time?)
Visit-time: 0000-0700

# Limit the URL fetch rate: between 02:00 and 06:00, at most 10 fetches per minute
Request-rate: 10/1m 0200-0600

}
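
Incidentally, Python's standard urllib.robotparser does understand two nonstandard extensions, Crawl-delay and Request-rate (since Python 3.6), though only in the plain requests/seconds form; the timed form above, the Visit-time line, and Robot-version are ignored. A minimal sketch (MyBot is a made-up user-agent):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Crawl-delay: 5
Request-rate: 10/60
Disallow: /tmp/
""".splitlines())

print(rp.crawl_delay("MyBot"))      # 5 (seconds to wait between requests)
rate = rp.request_rate("MyBot")     # a named tuple: RequestRate(requests=10, seconds=60)
print(rate.requests, rate.seconds)  # 10 60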


  •   Meta tags


Format (from the docs), placed inside the <head>...</head> section of the page:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

 

 

Purpose (from the docs): You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.

 

# The NAME attribute must be ROBOTS
# The candidate values for CONTENT="" are: index, noindex, follow, nofollow
# index tells the crawler to index this page; follow allows the crawler to keep crawling along the links on this page

 

# The default value of CONTENT is "INDEX,FOLLOW". Original text: the default is "INDEX,FOLLOW", so there's no need to spell that out. That leaves:

# The other three combinations are:

# Combination 1
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

# Combination 2
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">

# Combination 3
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
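
As a sketch of how a crawler might read this tag, here is a small example using Python's standard html.parser; the RobotsMetaParser class and the HTML snippet are made up for illustration:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)

print("noindex" not in parser.directives)   # False: this page must not be indexed
print("nofollow" not in parser.directives)  # True: its links may still be followed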

 

There are two important considerations when using the robots tag:

  •   robots can ignore your tag. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.

  •   the NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot might find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrives at your undesired page.

 

(The following is unverified) {

# Google's search engine fully supports the meta tags above, and adds an "archive" value indicating whether a cached snapshot of the page may be kept
# For example, the following would allow all search engines to keep a cached snapshot of this page:
<META NAME="ROBOTS" CONTENT="ARCHIVE">

}

 

 

The FAQ links to a page with more about robots, but that page now returns a 404.

http://www.robotstxt.org/faq/info.html mentions:

There is a Web robots home page on:

http://www.robotstxt.org/wc/robots.html

 

On the difference and relationship between robots.txt and the meta tag

the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Also, the robots meta tag only works on HTML pages, not images, text files, PDF documents, etc. Finally, if the pages/resources have already been excluded by a robots.txt file, then they will not be crawled and the meta tags and headers will have no effect. This can have the counterintuitive effect that a web address is indexed by a search engine such as Google if it honors the site's robots.txt, stops crawling and never receives the advice not to index the site.
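
The ordering described in the quote can be sketched as follows; the URL and the MyBot user-agent are hypothetical, and error handling is omitted. robots.txt is consulted before the page is ever requested, whereas a meta tag can only be honored after the page has been fetched:

import urllib.robotparser
import urllib.request

url = "http://example.com/some/page.html"    # hypothetical
rp = urllib.robotparser.RobotFileParser("http://example.com/robots.txt")
rp.read()                                    # step 1: fetch and parse robots.txt

if rp.can_fetch("MyBot", url):               # step 2: may we request the page at all?
    html = urllib.request.urlopen(url).read()
    # step 3: only now can a <META NAME="ROBOTS"> tag in the page be seen and honored
else:
    # the page is never fetched, so its meta tags are never seen; the bare URL
    # can still end up indexed if other sites link to it
    pass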

Reposted from ITPUB blog: http://blog.itpub.net/29773961/viewspace-1377335/
