MIT Search

 

Excluding Pages


There are a few ways to exclude your pages from the MIT search engine and other search engines. Search engines and other automated web page visitors are referred to as "robots", and that term persists in the mechanisms used to exclude pages from search engines. MIT web publishers can restrict content at several levels: an individual web page, a directory, or an entire web server.

To exclude on a page-by-page basis, add a robots meta tag to the document's head. This will prevent all robots from indexing your page or following its links:

<head>
<meta name="robots" content="noindex,nofollow">
</head>

To exclude a file or directory on web.mit.edu, put the string 'dontindex' somewhere in the URL; as the examples below show, capitalization does not matter. This is especially useful if you are developing a page or site and don't want it indexed yet. This works with MIT's search engine, but not necessarily with public search engines such as go.com or AltaVista. For example, the following would not be indexed by the MIT installation of Ultraseek:

http://web.mit.edu/cwis/foo/dontindex/
http://web.mit.edu/somelocker/DontIndex/foo/bar/anyoldfile.html
http://web.mit.edu/somelocker/foo/bar/dontindexme.html

To exclude an entire web server from both the MIT search engine and other robots, create a robots.txt file at the top level of the server. Documentation is available from a number of sources, including W3.org and The Web Robots Pages. A robots.txt file that keeps all robots from indexing your site is usually just two lines.
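For example, the conventional two-line robots.txt that asks all robots to stay out of the entire server looks like this:

User-agent: *
Disallow: /

The first line applies the rule to every robot, and the second disallows everything under the server root. Well-behaved robots, including the MIT search engine, check this file before crawling a site.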

Web pages that restrict access via certificates are not included in the index. These restricted pages are served over https, as in https://mit.edu (the "s" stands for "secure").



Comments to web-help@mit.edu
$Date: 2002/01/03 04:25:48 $