Options for making MIT's web pages searchable

Starting from the MIT home page, there are more than 20,000 HTML pages available through web.mit.edu. Someone looking for information on a specific topic may have an awfully hard time finding it; for example, there is no obvious way to navigate from the MIT home page to any of the pages about Kerberos.

What's needed is a facility for searching MIT web pages. Being able to type "kerberos" and find pages that mention it is much easier than wandering around the web.

Searching is a much-needed service for users of MIT's CWIS. It's important to look at options for providing this service and choose one that will best meet the need.

There are hundreds of possible solutions, but they fall into four basic categories.

  1. Point users at search services outside MIT

    The simplest solution from a development standpoint is to do nothing. Tell people to use InfoSeek or any of the multitude of searching services out there. If they want to find MIT's reengineering pages, they can search for "MIT reengineering". If they want to find the CP's pages, they can search for "MIT police".

    Advantage: requires no development effort.

    Disadvantage: adds a dependence on services outside MIT.

  2. Use a robot like WebCrawler

    There's a World Wide Web Robots, Wanderers, and Spiders site in England with a long list of robots.

    Advantage: there are plenty to choose from.

    Disadvantage: possible detriment to server performance.

    The detriment to server performance comes from the fact that normal operation of the web server depends on the AFS cache: many clients request the same pages, so web.mit.edu can serve them from local disk rather than fetching them from an AFS server each time. A robot defeats this by requesting every page in the tree, one after another, flushing the cache. (A rough sketch of such a robot appears after this list.)

  3. Use a local-disk indexer like WWWWAIS

    There's a page on Yahoo that lists many of these. They too have their advantages and disadvantages.

    Advantage: indexes pages directly from local disk (AFS), so indexing doesn't go through the web server.

    Disadvantage: indexes whatever files and directories it finds on disk, rather than only the pages reachable from the MIT home page.

  4. Use Harvest

    In looking at the previous two options, it's clear that we want a product that will walk pages from the MIT home page, not traverse all files and directories in MIT's AFS cells. Yet we want it to get pages out of AFS directly, not go through the server. The only product that can do both is Harvest.

    Who wrote Harvest? See the press release.

    The central parts of the Harvest system are the gatherer, which compiles a summary of the information in web pages, and the broker, which collects those summaries from the gatherer and builds a searchable index. Queries are answered by the broker. Read Distributing the Gathering and Brokering Processes to learn how Harvest can be configured to gather either across the network or from local disk. (A rough sketch of this gatherer/broker division appears after this list.)

    There's a Technical Discussion of the Harvest System that explains other design decisions in Harvest that make it scalable. For example, there's a system for efficiently replicating indexing information, so that searches aren't bottlenecked by one server.

    You can try my experimental broker through this query page.
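
To make the robot problem concrete, here is a minimal sketch in Python of the kind of tree walk a robot performs. This is hypothetical code, not taken from WebCrawler or any other real robot; the starting URL and the names are illustrative only. The point is that every page reachable from the home page gets fetched through the HTTP server, each exactly once, which is just the access pattern that flushes the AFS cache.

    # Hypothetical sketch of a web robot: walk every page reachable from the
    # MIT home page, one HTTP request after another.  Every request goes
    # through the web server, which is why a robot flushes its cache.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urldefrag, urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect the href targets of <a> tags on one page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start="http://web.mit.edu/"):
        seen = {start}
        queue = deque([start])
        while queue:
            url = queue.popleft()
            try:
                page = urlopen(url).read().decode("latin-1", errors="replace")
            except OSError:
                continue                      # skip pages that fail to load
            parser = LinkParser()
            parser.feed(page)
            for href in parser.links:
                absolute, _fragment = urldefrag(urljoin(url, href))
                # stay inside the web.mit.edu tree and visit each page only once
                if absolute.startswith("http://web.mit.edu/") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            yield url                         # hand the page off for indexing

    if __name__ == "__main__":
        for url in crawl():
            print(url)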
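
The gatherer/broker division of labor can be sketched the same way. This is not Harvest code and does not use Harvest's actual summary format (SOIF records); the directory path, function, and class names below are made up for illustration. It only shows the shape of the system: a gatherer that reads pages straight off local disk and boils each one down to a small summary, and a broker that builds a searchable index from those summaries and answers queries.

    # Hypothetical sketch of Harvest's gatherer/broker division of labor,
    # not actual Harvest code.  The gatherer reads pages straight from local
    # disk (e.g. an AFS directory) and reduces each one to a small summary;
    # the broker builds a searchable index from the summaries and answers
    # queries.
    import os
    import re
    from collections import defaultdict

    def gatherer(root):
        """Yield one summary record per HTML file found under root."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if not name.endswith(".html"):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, encoding="latin-1", errors="replace") as f:
                    text = f.read()
                keywords = set(re.findall(r"[a-z0-9]+", text.lower()))
                yield {"url": path, "keywords": keywords}

    class Broker:
        """Collect summaries from a gatherer and answer keyword queries."""
        def __init__(self):
            self.index = defaultdict(set)          # keyword -> set of URLs

        def collect(self, records):
            for record in records:
                for word in record["keywords"]:
                    self.index[word].add(record["url"])

        def search(self, word):
            return sorted(self.index.get(word.lower(), set()))

    if __name__ == "__main__":
        broker = Broker()
        broker.collect(gatherer("/path/to/afs/web/pages"))   # hypothetical path
        for url in broker.search("kerberos"):
            print(url)

Because the gatherer reads from disk, the web server and its AFS cache are never involved; because the broker is a separate piece, it could just as easily collect summaries from gatherers across the network, which is the configuration choice described above.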


Bruce Lewis <brlewis@mit.edu>
Last modified: Wed Nov 20 10:56:10 EST 1996