Options for making MIT's web pages searchable

Starting from the MIT home page, there are more than 20,000 HTML pages available through web.mit.edu. Someone looking for information on a specific topic may have an awfully hard time finding it; for example, there is no obvious way to navigate from the MIT home page to any of the pages about Kerberos.

What's needed is a facility for searching MIT web pages. Being able to type "kerberos" and find pages that mention it is much easier than wandering around the web.

Searching is a much-needed service for users of MIT's CWIS. It's important to look at options for providing this service and choose one that will best meet the need.

There are hundreds of possible solutions, but they fall into four basic categories.

  1. Point users at search services outside MIT

    The simplest solution from a development standpoint is to do nothing. Tell people to use InfoSeek or any of the multitude of searching services out there. If they want to find MIT's reengineering pages, they can search for "MIT reengineering". If they want to find the CP's pages, they can search for "MIT police".

    Advantage: requires no development effort.

    Disadvantage: adds a dependence on services outside MIT.

  2. Use a robot like WebCrawler

    There's a World Wide Web Robots, Wanderers, and Spiders site in England with a long list of robots.

    Advantage: there are plenty to choose from.

    Disadvantage: possible detriment to server performance.

    The detriment to server performance comes from the fact that normal operation of the web server depends on the AFS cache: many clients request the same pages, so web.mit.edu can serve them from local disk rather than fetching them from an AFS server each time. A robot defeats this by requesting every page in the tree, one after another, flushing the cache. (A rough sketch of such a robot appears after this list.)

  3. Use a local-disk indexer like WWWWAIS

    There's a page on Yahoo that lists many of these. They too have their advantages and disadvantages.

    Advantage: indexes pages directly from local disk (AFS), so indexing doesn't go through the web server.

    Disadvantage: indexes whatever files and directories it finds on disk, rather than only the pages reachable from the MIT home page.

  4. Use Harvest

    In looking at the previous two options, it's clear that we want a product that will walk pages from the MIT home page, not traverse all files and directories in MIT's AFS cells. Yet we want it to get pages out of AFS directly, not go through the server. The only product that can do both is Harvest.

    Who wrote Harvest? See the press release.

    The central parts of the Harvest system are the gatherer, which compiles a summary of the information in web pages, and the broker, which collects those summaries from the gatherer and builds a searchable index. Queries are answered by the broker. Read Distributing the Gathering and Brokering Processes to learn how Harvest can be configured to gather either across the network or from local disk. (A rough sketch of this gatherer/broker division appears after this list.)

    There's a Technical Discussion of the Harvest System that explains other design decisions in Harvest that make it scalable. For example, there's a system for efficiently replicating indexing information, so that searches aren't bottlenecked by one server.

    You can try my experimental broker through this query page.
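
To make the robot problem concrete, here is a minimal sketch in Python of the kind of tree walk a robot performs. This is hypothetical code, not taken from WebCrawler or any other real robot; the starting URL and the names are illustrative only. The point is that every page reachable from the home page gets fetched through the HTTP server, each exactly once, which is just the access pattern that flushes the AFS cache.

    # Hypothetical sketch of a web robot: walk every page reachable from the
    # MIT home page, one HTTP request after another.  Every request goes
    # through the web server, which is why a robot flushes its cache.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urldefrag, urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect the href targets of <a> tags on one page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start="http://web.mit.edu/"):
        seen = {start}
        queue = deque([start])
        while queue:
            url = queue.popleft()
            try:
                page = urlopen(url).read().decode("latin-1", errors="replace")
            except OSError:
                continue                      # skip pages that fail to load
            parser = LinkParser()
            parser.feed(page)
            for href in parser.links:
                absolute, _fragment = urldefrag(urljoin(url, href))
                # stay inside the web.mit.edu tree and visit each page only once
                if absolute.startswith("http://web.mit.edu/") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            yield url                         # hand the page off for indexing

    if __name__ == "__main__":
        for url in crawl():
            print(url)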
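
The gatherer/broker division of labor can be sketched the same way. This is not Harvest code and does not use Harvest's actual summary format (SOIF records); the directory path, function, and class names below are made up for illustration. It only shows the shape of the system: a gatherer that reads pages straight off local disk and boils each one down to a small summary, and a broker that builds a searchable index from those summaries and answers queries.

    # Hypothetical sketch of Harvest's gatherer/broker division of labor,
    # not actual Harvest code.  The gatherer reads pages straight from local
    # disk (e.g. an AFS directory) and reduces each one to a small summary;
    # the broker builds a searchable index from the summaries and answers
    # queries.
    import os
    import re
    from collections import defaultdict

    def gatherer(root):
        """Yield one summary record per HTML file found under root."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if not name.endswith(".html"):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, encoding="latin-1", errors="replace") as f:
                    text = f.read()
                keywords = set(re.findall(r"[a-z0-9]+", text.lower()))
                yield {"url": path, "keywords": keywords}

    class Broker:
        """Collect summaries from a gatherer and answer keyword queries."""
        def __init__(self):
            self.index = defaultdict(set)          # keyword -> set of URLs

        def collect(self, records):
            for record in records:
                for word in record["keywords"]:
                    self.index[word].add(record["url"])

        def search(self, word):
            return sorted(self.index.get(word.lower(), set()))

    if __name__ == "__main__":
        broker = Broker()
        broker.collect(gatherer("/path/to/afs/web/pages"))   # hypothetical path
        for url in broker.search("kerberos"):
            print(url)

Because the gatherer reads from disk, the web server and its AFS cache are never involved; because the broker is a separate piece, it could just as easily collect summaries from gatherers across the network, which is the configuration choice described above.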


Bruce Lewis <brlewis@mit.edu>
Last modified: Wed Nov 20 10:56:10 EST 1996