From the W3.Org:
This section provides some simple suggestions that will make your documents more accessible to search engines.
<LINK rel="alternate" type="text/html" href="mydoc-fr.html" hreflang="fr" lang="fr" title="La vie souterraine">
<LINK rel="alternate" type="text/html" href="mydoc-de.html" hreflang="de" lang="de" title="Das Leben im Untergrund">
<META name="keywords" content="vacation,Greece,sunshine">
<META name="description" content="Idyllic European vacations">
<LINK rel="begin" type="text/html" href="page1.html" title="General Theory of Relativity">
When a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyze its contents to see if it is allowed to retrieve the document. You can customize the robots.txt file to apply only to specific robots, and to disallow access to specific directories or files.
Here is a sample robots.txt file that prevents all robots from visiting the entire site:
User-agent: *    # applies to all robots
Disallow: /      # disallow indexing of all pages
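As a rough illustration of how a robot might apply the sample file above, the following sketch uses Python's standard `urllib.robotparser` module. The helper name `is_allowed` and the robot name `ExampleBot` are made up for this example; a real robot would fetch the file over HTTP rather than read it from a string.

```python
from urllib.robotparser import RobotFileParser

# The sample file from the text, with one directive per line.
SAMPLE = """\
User-agent: *    # applies to all robots
Disallow: /      # disallow indexing of all pages
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to retrieve url.

    is_allowed is a hypothetical helper for this sketch, not part of
    any standard API.
    """
    parser = RobotFileParser()
    # parse() takes the file content as a sequence of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is an invented robot name.
    print(is_allowed(SAMPLE, "ExampleBot", "http://www.foobar.com/index.html"))
```

Because the sample file disallows `/` for every robot, the check should come back negative for any URL on the site.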
The Robot will simply look for a "/robots.txt" URL on your site, where a site is defined as an HTTP server running on a particular host and port number. Here are some sample locations for robots.txt:
Site URL | URL for robots.txt |
---|---|
http://www.w3.org/ | http://www.w3.org/robots.txt |
http://www.w3.org:80/ | http://www.w3.org:80/robots.txt |
http://www.w3.org:1234/ | http://www.w3.org:1234/robots.txt |
http://w3.org/ | http://w3.org/robots.txt |
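The mapping in the table can be sketched in a few lines of Python with the standard `urllib.parse` module. The function name `robots_url` is an assumption for this example; the idea is simply to keep the scheme, host, and port and replace everything else with `/robots.txt`.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    robots_url is a hypothetical helper name used for illustration.
    """
    parts = urlsplit(page_url)
    # netloc holds the host and, if present, the port, so each
    # host:port pair maps to its own robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for site in ("http://www.w3.org/", "http://www.w3.org:1234/", "http://w3.org/"):
        print(site, "|", robots_url(site))
```

Note that the sketch treats `www.w3.org` and `w3.org` as different sites, which matches the table: the host name is compared as written, not resolved.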
There can only be a single "/robots.txt" on a site. Specifically, you should not put "robots.txt" files in user directories, because a robot will never look at them. If you want your users to be able to create their own "robots.txt", you will need to merge them all into a single "/robots.txt". If you don't want to do this, your users might want to use the Robots META Tag instead.
Some tips: URLs are case-sensitive, and the "/robots.txt" string must be all lower-case. Blank lines are not permitted within a single record in the "robots.txt" file.
There must be exactly one "User-agent" field per record. The robot should be liberal in interpreting this field. A case-insensitive substring match of the name without version information is recommended.
If the value is "*", the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
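One possible reading of these matching rules is sketched below in Python. It assumes that version information is whatever follows the first `/` in the robot's name, and that the field value is looked for inside the robot's name (the text does not say which way the substring test runs). The function name `select_record` and the robot names are invented for the example.

```python
from typing import Optional, Sequence


def select_record(agent_fields: Sequence[str], robot_name: str) -> Optional[int]:
    """Pick the index of the record that applies to robot_name.

    agent_fields holds the "User-agent" value of each record, in file
    order. A specific match wins; the "*" record is only a fallback.
    select_record is a hypothetical helper for this sketch.
    """
    # Assumption: version information follows the first "/".
    name = robot_name.split("/", 1)[0].strip().lower()
    default = None
    for index, field in enumerate(agent_fields):
        value = field.strip().lower()
        if value == "*":
            default = index  # remember the default policy
        elif value and value in name:
            return index  # case-insensitive substring match
    return default


if __name__ == "__main__":
    # "ExampleBot" and "OtherBot" are made-up robot names.
    print(select_record(["OtherBot", "ExampleBot", "*"], "examplebot/2.1"))
```

Returning `None` when nothing matches and there is no `*` record leaves the caller free to treat the site as unrestricted.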
The "Disallow" field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example,
Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
An empty value for "Disallow" indicates that all URLs can be retrieved. At least one "Disallow" field must be present in the robots.txt file.
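The prefix rule for "Disallow" can be checked with a short Python sketch. The function name `is_disallowed` is an assumption for this example, and the code treats an empty value as matching nothing, in line with the statement that an empty "Disallow" permits every URL.

```python
from typing import Iterable


def is_disallowed(path: str, disallow_values: Iterable[str]) -> bool:
    """Return True if path starts with any non-empty Disallow value.

    is_disallowed is a hypothetical helper for this sketch.
    """
    for value in disallow_values:
        # An empty value places no restriction, so skip it.
        if value and path.startswith(value):
            return True
    return False


if __name__ == "__main__":
    for rule in ("/help", "/help/"):
        for path in ("/help.html", "/help/index.html"):
            print(rule, path, is_disallowed(path, [rule]))
```

Running the loop reproduces the /help versus /help/ distinction described in the text: only the rule without the trailing slash covers /help.html.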
The META element allows HTML authors to tell visiting robots whether a document may be indexed, or used to harvest more links. No server administrator action is required.
In the following example a robot should neither index this document, nor analyze it for links.
<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
The list of terms in the content is ALL, INDEX, NOFOLLOW, NOINDEX. The name and the content attribute values are case-insensitive.
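For illustration, here is a minimal Python sketch of how a robot might read the Robots META tag, using the standard `html.parser` module. The class and function names (`RobotsMetaParser`, `robots_directives`, `may_index`, `may_follow`) are made up for this example, and the sketch assumes the terms in `content` are separated by commas, as in the example above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the terms of every <META name="ROBOTS"> element."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The name value is case-insensitive.
        if (attributes.get("name") or "").lower() != "robots":
            return
        for term in (attributes.get("content") or "").split(","):
            term = term.strip().upper()  # content values are case-insensitive
            if term:
                self.directives.add(term)


def robots_directives(html: str) -> set:
    """Return the upper-cased robots terms found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def may_index(directives: set) -> bool:
    return "NOINDEX" not in directives


def may_follow(directives: set) -> bool:
    return "NOFOLLOW" not in directives


if __name__ == "__main__":
    found = robots_directives('<META name="ROBOTS" content="NOINDEX, NOFOLLOW">')
    print(sorted(found), may_index(found), may_follow(found))
```

A document without the tag yields an empty set, which this sketch treats as permission to both index the page and follow its links.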
Note: As of early 1997, only a few robots implemented this, but this was expected to change as more public attention was given to controlling indexing robots.