Using robots.txt To Control Search Engine Spiders

 

Reposted from http://www.activewebhosting.com/faq/web-robots.html

 

What are robots and spiders?

Search engines such as Google and Yahoo! use programs called 'robots' or 'spiders' to visit pages on the internet and automatically add them to their search databases. Many people even submit their sites manually rather than wait for a robot or spider to visit. When you put a web page on your web server, it can take some time for the page to show up in a search engine; once the page is in the database, however, it can also take a long time for it to be removed should the page move or be taken off the server.

However, there may be times when you have information that you do not want to share with everyone or have the search engines put in their databases. You may even have a whole directory you wish to keep private.

One way to keep a search engine from adding your pages to its database is to put a file called robots.txt in the root directory of your web server, listing the files and directories you wish to protect. While this is not a fool-proof way to protect your pages, it helps keep them from showing up in most search engine databases, at the very least.

How do I create a robots.txt file?

You can create a robots.txt file with any Linux text editor, or any text editor that saves in Unix format. Unix-style line breaks are the safest choice, although most robots also accept Windows-style line endings. Please see Text Editors You Can Use To Create CGI Scripts for more information. Note that the robots.txt file must be in the root directory (not in a subdirectory) of your CGI or web server; robots ignore it anywhere else. You can have one on each server if you want: a robots.txt file in the root directory of your CGI server controls spidering of files on that server only, and one in the root directory of your web server controls spidering of your web server only. Your robots.txt file affects only your own server(s), not anyone else's.
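For example (yourdomain.com is just a placeholder here), robots only ever request the file from the top of the server, so only the first location below does anything:

http://yourdomain.com/robots.txt          (correct: root of the web server)
http://yourdomain.com/images/robots.txt   (ignored: robots never ask for this)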

The robots.txt file usually needs only two fields: User-agent and Disallow. Here are a few examples you can put in your robots.txt file. You can add more than one User-agent or Disallow field to your robots.txt file.

Allow all robots:

User-agent: *
Disallow:

This allows all robots to visit all pages on the site. Note that nothing was entered for Disallow even though the field was included in the robots.txt file.

Specify rules for a certain search engine:

User-agent: googlebot
Disallow:

This specifies Disallow rules to be followed only by Google's robot (Googlebot) when it visits your site. Note that nothing was entered for Disallow even though the field was included, which means all files can be added to Google's search database.

Keep all robots off the entire site:

User-agent: *
Disallow: /

This keeps all robots from adding any of the pages on the server where the robots.txt file is placed. Note that the slash (/) in the Disallow field matches every file and directory.

Ban a certain search engine from all directories:

User-agent: googlebot
Disallow: /

This would keep Google from adding any pages on the site to its search engine database, while leaving other robots unaffected.

Protecting only certain files:

User-agent: *
Disallow: /images/
Disallow: /email.html

This keeps all robots from adding any file in the images directory, or the email.html file in the root directory, to their search databases. Note that a Disallow path must start with a slash, and that Disallow: /images/ covers subdirectories as well, so there is no need for a separate rule for each subdirectory: spiders will not go into the images directory at all, nor visit any of the directories or files inside it.
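You can also combine rules for different robots in one file. As a sketch (the directory name is just an example), the following keeps Googlebot off the site entirely while allowing all other robots everywhere except the images directory; a blank line separates each User-agent group:

User-agent: googlebot
Disallow: /

User-agent: *
Disallow: /images/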

We recommend you take a look at an example robots.txt file from PimpSoft. You may want to copy this file and adjust it to your site's needs. This file helps keep certain harmful robots (spiders) off your site and controls how other robots spider your site, so your pages can be indexed most efficiently.

Once you have constructed and saved your robots.txt file, upload it to the root directory of your web server using your FTP program.

Checking robots.txt Validity

Once you've uploaded the robots.txt file, it's usually a good idea to check the validity of the file and be sure there are no problems. You can do this using one of the robots.txt validators available online. Be sure your robots.txt file is uploaded to your web site, and provide the validator with the proper URL to the file, such as http://yourdomain.com/robots.txt.
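If you have Python installed, you can also sanity-check the file yourself: Python's standard library ships a robots.txt parser, urllib.robotparser. Here is a minimal sketch (the domain and paths are placeholders for your own):

import urllib.robotparser

# Load the live robots.txt file from your server.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://yourdomain.com/robots.txt")  # placeholder domain
rp.read()

# Ask whether a given robot may fetch a given URL.
print(rp.can_fetch("googlebot", "http://yourdomain.com/email.html"))  # False if disallowed
print(rp.can_fetch("*", "http://yourdomain.com/index.html"))          # True if allowed

If a URL you meant to block comes back as fetchable, recheck the Disallow paths in your file.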

Specifying Robot Rules in HTML Meta Tags

Alternatively (or even additionally), you can specify the rules in the HTML file itself, within a meta tag. This tag goes inside the head tag. Here is an example:

<head>
<meta name="robots" content="noindex,nofollow">
<title>My Page</title>
</head>


In the content= area within the quotes you have two comma-separated choices to make. For the first word, before the comma, you can use either index, meaning the robot will add the page to the search engine database, or noindex, meaning the robot will not add the page to the search engine database.

For the second word, after the comma, you also have two choices. You can use follow, meaning the robot will also visit all other links you have on that page and catalog them (provided no robots meta tag prevents it, in which case it skips those), or nofollow, meaning the robot will act on only that page and not follow the links on it.
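The two values can be mixed. For example, to keep a page itself out of the index while still letting robots follow the links on it (the title here is just a placeholder):

<head>
<meta name="robots" content="noindex,follow">
<title>My Links Page</title>
</head>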

Which do I use, robots.txt or the meta tag?

The robots.txt method is best if you want to keep robots out of a whole directory or away from certain files. It also lets you change things in one file rather than in every .html file you have. This makes it the better choice for keeping whole sections of your site out of search engines.

The robots meta tag is best if you want to control indexing page by page, for example when you want search engines to add most of your pages but skip a few individual ones.

Do remember, though, that spiders only find content on your pages and on pages that are linked to. If any of your pages aren't linked from anywhere, spiders may never find and index them.
