This page documents the Inktomi search engine software running on
search.mit.edu. Generic information about the search engine software
is available from Inktomi; this page adds documentation specific to how
the Inktomi search engine is configured for use at MIT.
search.mit.edu is an I/S-supported
search engine which indexes world-readable webpages served through webservers
at MIT. It finds pages to index by "spidering", that is, by following links from pages
it has already indexed, and it sets its own schedule for revisiting pages based on
how frequently they change.
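As a rough illustration of spidering, the sketch below walks a small in-memory link graph breadth-first, starting from one page and following links to pages it has not yet seen. The `spider` function, the `links` dictionary, and the breadth-first order are assumptions made for illustration; the actual traversal strategy used by the Inktomi software is not documented here.

```python
from collections import deque


def spider(start, links):
    """Return pages in the order a breadth-first spider would visit them.

    `links` is a hypothetical mapping from a page to the pages it links to.
    """
    seen = {start}           # pages already discovered
    queue = deque([start])   # pages waiting to be visited
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:   # visit each page only once
                seen.add(target)
                queue.append(target)
    return order
```

A page that no indexed page links to is never reached by this kind of traversal, which is why unlinked pages do not appear in a spidered index.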
What is the Inktomi Search Engine?
The Inktomi search engine (formerly known as the "Infoseek" or
"Ultraseek" search engine) is a product of the Inktomi
Corporation.
Generic Information about Inktomi
General documentation about the Inktomi search products can be found
on Inktomi's Enterprise Search webpages; more technical documentation is
in the Support webpages. General online help is available from the search
engine itself at http://search.mit.edu/help/.
How Inktomi is Configured at MIT
The Inktomi search engine software is configured and customized to
run at MIT. In outline, the configuration and rules are:
- The Inktomi search software runs on search.mit.edu,
an I/S-supported system dedicated to providing
search engine service around the clock.
- Index world-readable webpages that are served through webservers whose hostnames
end in "MIT.EDU"
- Don't index any page whose URL contains the string "dontindex" (case-insensitive)
- Don't index any CVS, RCS or OldFiles directories
- Only index each document once (Note: This means that if there are multiple
URLs which ultimately lead to the same webpage, only one URL will be associated
with the webpage in a collection. In general, it is the first one the search engine
encounters. For website maintainers who use multiple webservers, virtual hosts
in the same webserver, different webservers serving the same pages, etc., the
default behavior of the indexer may be confusing.)
- Minimum time between page revisits is one day; maximum time is 32 days (Note:
The indexer adjusts its schedule based on how frequently a page changes. Pages
that change often are visited more frequently than those that change less often.)
- In calculating the "relevance" of the webpages in its indexes to a query,
the search engine uses "relevance weights". The order of weighting among the
parts of a webpage (from most to least) is:
title > keywords > description > alt
(Note: Website administrators should put appropriate information in these metatags
to improve the relevance scores of their pages.)
- The indexer uses "word spam thresholds" (Note: Website administrators should
use appropriate information in metatags; overusing a word will decrease
a page's relevance score.)
- There are two main "collections", both of which are searchable by default.
They are:
  - webmit, the collection of world-readable pages served via 'web.mit.edu'
  - websrvrs, the collection of world-readable pages served via other webservers
  at MIT (not via 'web.mit.edu')
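For website maintainers who want to predict how these rules treat a given URL, a minimal Python sketch follows. It is not the Inktomi configuration itself: the function names are hypothetical, "ends in MIT.EDU" is assumed to mean mit.edu or one of its subdomains, and the CVS, RCS and OldFiles rule is assumed to match any path segment with exactly that name. World-readability and duplicate detection are not modeled.

```python
from urllib.parse import urlsplit

# Directory names excluded from indexing (exact-match is an assumption).
SKIPPED_DIRECTORIES = {"CVS", "RCS", "OldFiles"}
MIN_REVISIT_DAYS = 1
MAX_REVISIT_DAYS = 32


def should_index(url: str) -> bool:
    """Apply the hostname, "dontindex", and directory rules to a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit returns the hostname lowercased
    if host != "mit.edu" and not host.endswith(".mit.edu"):
        return False
    if "dontindex" in url.lower():  # case-insensitive match anywhere in the URL
        return False
    if any(segment in SKIPPED_DIRECTORIES for segment in parts.path.split("/")):
        return False
    return True


def collection_for(url: str) -> str:
    """Pick the collection: webmit for web.mit.edu, websrvrs otherwise."""
    host = urlsplit(url).hostname or ""
    return "webmit" if host == "web.mit.edu" else "websrvrs"


def clamp_revisit_days(estimated_days: float) -> float:
    """Keep a hypothetical revisit estimate within the 1 to 32 day limits."""
    return max(MIN_REVISIT_DAYS, min(MAX_REVISIT_DAYS, estimated_days))
```

For example, `should_index("http://web.mit.edu/src/CVS/Entries")` is false under these assumptions because of the CVS path segment, and a page whose estimated change interval is 90 days would still be revisited within 32 days.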
Comments to search@mit.edu