Options for making MIT's web pages searchable
Starting from the MIT home page, there are more than 20,000 HTML pages
available through web.mit.edu. Someone looking for information on a
specific topic may have an awfully hard time finding it. For example,
someone looking for pages about Kerberos would have a hard time
navigating from the MIT home page to any pages on that topic.
What's needed is a facility for searching MIT web pages. Being able to
type "kerberos" and find pages that mention it is much easier than
wandering around the web.
Searching is a much-needed service for users of MIT's CWIS. It's
important to look at options for providing this service and choose one
that will best meet the need.
There are hundreds of possible solutions, but they fall into four basic categories:
- Point users at search services outside MIT
The simplest solution from a development standpoint is to do nothing.
Tell people to use InfoSeek or
any of the multitude of searching
services out there. If they want to find MIT's reengineering pages,
they can search for "MIT reengineering". If they want to find the CP's
pages, they can search for "MIT police".
Advantage: requires no development effort.
Disadvantage: creates a dependence on services outside MIT.
- Use a robot like webcrawler
There's a World
Wide Web Robots, Wanderers, and Spiders site in England with a long
list of them.
Advantage: there are plenty to choose from.
Disadvantage: possible detriment to server performance.
The detriment comes from the web server's reliance on the AFS cache: many
clients request the same pages, which web.mit.edu can then serve from local
disk rather than fetching from an AFS server. A robot tends to flush that
cache by requesting every page in the tree, one after another.
- Use a local-disk indexer like WWWWAIS
There's a page
on Yahoo that lists many of these. They too have their advantages
and disadvantages:
- faster, more efficient than web robots
- require that you
  - index all files under one directory (e.g. /afs/athena/)
  - OR enumerate all filenames to index
  - then map filenames back to URLs
- no mechanism for combining indexes from multiple servers, which
is important if we want to cooperate with other MIT servers,
e.g. Mech E.
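The enumerate-and-map step above can be sketched briefly. This is only an illustration: the document root and URL prefix below are hypothetical examples, not the actual web.mit.edu configuration.

```python
import os

# Hypothetical values for illustration only -- not the real server config.
DOC_ROOT = "/afs/athena/docroot"
URL_PREFIX = "http://web.mit.edu"

def filenames_to_index(root):
    """Enumerate all HTML files under one directory."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".html"):
                yield os.path.join(dirpath, name)

def filename_to_url(path):
    """Map a local filename back to the URL a client would use."""
    relative = os.path.relpath(path, DOC_ROOT)
    return URL_PREFIX + "/" + relative.replace(os.sep, "/")
```

The awkward part is exactly what the list above points out: the indexer sees filenames, so every result has to be translated back into a URL before it's useful to a web user.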
- Use Harvest
In looking at the previous two options, it's clear that we want a
product that will walk pages from the MIT home page, not traverse all
files and directories in MIT's AFS cells. Yet we want it to get pages
out of AFS directly, not go through the server. The only product that
can do this is Harvest.
Who wrote Harvest? See the press release.
The central parts of the Harvest system are the gatherer, which
compiles a summary of information in web pages, and the broker,
which collects that information from the gatherer and builds a
searchable index. One uses the broker when doing a search. Read Distributing
the Gathering and Brokering Processes to learn how Harvest
can be configured to gather either across the network or from the local
filesystem.
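The gatherer/broker division of labor can be sketched as a toy example. This is only a conceptual illustration; Harvest's actual summary format (SOIF) and its protocols are not reproduced here, and the page data is made up.

```python
def gather(pages):
    """Gatherer: compile a per-page summary.

    Here the 'summary' is just the set of words on each page;
    pages maps URL -> page text.
    """
    return {url: set(text.lower().split()) for url, text in pages.items()}

def broker(summaries_list):
    """Broker: merge summaries from one or more gatherers into a
    searchable inverted index mapping word -> set of URLs."""
    index = {}
    for summaries in summaries_list:
        for url, words in summaries.items():
            for word in words:
                index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """Answer a query against the broker's index."""
    return sorted(index.get(word.lower(), set()))
```

Note that the broker accepts summaries from any number of gatherers, which is the property that would let an index for web.mit.edu be combined with indexes built by other MIT servers.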
There's a Technical
Discussion of the Harvest System that explains other design
decisions in Harvest that make it scalable. For example, there's a
system for efficiently replicating
indexing information, so that searches aren't bottlenecked by one
server.
You can try my experimental broker through this query page.
Bruce Lewis <firstname.lastname@example.org>
Last modified: Wed Nov 20 10:56:10 EST 1996