Notes on Design Project 1

This note summarizes observations gathered from the first 6.033 design project. It is based on material from student reports, and was made by David Wetherall with the help of the 6.033 staff (especially Jerry Saltzer). Read it for insight into the components of a strong report, as well as to learn about the Web caching issues and their possible solutions.

Design criteria // Caching in context // Possible solutions // On references

Design criteria

Let's begin with a roughly prioritized list of design objectives. This provides a framework within which to explore the design alternatives and evaluate a chosen design systematically. Beyond those included with the assignment, the following goals emerged:

Cost-effectiveness. The ultimate goal is to save money; any solution must be cost effective in its development, implementation, and ongoing costs.
Scalability. Beyond extensibility, the exponential growth in Web traffic requires that the design scale with load.
Performance. We prefer to increase performance (with reduced latency on cache hits) rather than to degrade it (with excessive latency on cache misses or a throughput bottleneck).
Security. The solution should identify security concerns, e.g., privacy and integrity of cache contents (though your design need not solve these problems).
Reliability and fault-tolerance. The solution should identify reliability concerns, e.g. single points of failure (though your design need not solve these problems).

Caching in context

Caching is not new to the Web; browsers and the HTTP protocol support caching to some extent, and there is a body of previous work. Solutions should be designed with this context in mind.

Mechanisms

While it can be difficult to distinguish between behaviors supported by the Web protocols and common use, the following mechanisms are relevant:

Proxies. These intermediaries provide a natural point at which to cache Web pages. Available proxies, e.g. CERN, Netscape, already support caching, and caching schemes, e.g., Harvest, already leverage proxies. Widespread browser support means that proxy schemes can be used via client configuration, rather than modification. The role of proxies in providing access from behind a firewall also means that clients may typically distinguish internal requests from external ones.
Client-side daching. Browsers such as Netscape already support single user caches. Thus some of the client interface issues associated with caching have already been addressed.
Get with If-Modified-Since. Conditional GETs along with the "304 (Not Modified)" response are designed to allow efficient update of cached pages with a minimum of traffic. They are now in common use. In HTTP/1.1, the Unless header is designed to allow more complicated conditional updates, e.g. based on size and message digest. It is not widely used.
Expires and Last-Modified. The Expires response header can provide validity information, but it is typically not returned. The Last-Modified response header has been used to gauge validity, and it is often returned.
Pragma: no-cache and Cache-Control. Headers specific to caching are emerging. In HTTP/1.0, the no-cache value in the Pragma header is used to request an authoritative copy (e.g. a browser reload) or indicate a non-cacheable response. In HTTP/1.1, a Cache-Control header is available to carry this information and other cache information, e.g., lifetimes, though it is not yet widely used.

Broader Scope

Your design should use these mechanisms and improve on the current situation. Caching between many clients, for example, provides the opportunity to increase the overall hit rate. Similarly, mirror sites may be used to replicate heavily used sites, but they provide a static and coarse-grained alternative compared to caching.

To save bandwidth, caching may be only one component in a larger scheme that employs other mechanisms, such as compression. HTTP request headers allow gzipped versions of HTML pages to be requested, and the JPEG image format to be preferred over GIF. Traffic out of MIT may be reduced by other means, such as mirroring MIT pages elsewhere, or configuring servers to be more miserly.

Possible solutions

Feasibility // Alternatives: Caching proxy server // Distributed cache // Hierarchical cache // Filtering gateway // Cache details: Objects and key // Allocation and eviction // Consistency // Size and performance // Evaluation

Feasibility

Before diving into design, we try to assess whether a cache is feasible using back-of-the-envelope calculations:

A T3 connection is worth on the order of $10000 per month.
MIT traffic statistics show that incoming traffic (16Gb/day) is roughly half the volume of outgoing traffic (40Gb/day), and is growing exponentially.
While there are no direct statistics for external Web traffic, around a fifth of the internal backbone traffic is Web related.
Previous caching efforts have yielded hits around half of the time.

The savings will be highly dependent on the volume pricing scheme, but it would be difficult for them to exceed $1000 per month (i.e., a fifth of half of $10000). This figure is speculative, but gives us an idea of the magnitude of the savings - enough to support a modest scheme, but not the development of customized multi-processor hardware!

Alternatives

Here are several broad alternative schemes for organizing caches and interposing them between browser and server. The details of the cache design and implementation are common to all schemes, and are given afterwards.

Caching proxy server

A simple scheme is to use the proxy mechanism to introduce a caching proxy that targets Web pages external to MIT. There is no need to cache internal MIT pages, since they are already close-by, and avoiding them will reduce the load on the proxy server. Some browsers support a no-proxy domain list that could be set to ".mit.edu" for this purpose.

It is straightforward for users to configure their browsers to use a proxy, and a substantial amount of configuration may be possible through Athena configuration files or configurable distributions of browser software. However, such schemes ultimately rely on user co-operation, and partial use of the proxy server will not only reduce traffic savings, but reduce the effectiveness of caching. Enforcement, via a firewall for external non-proxy traffic on port 80, may be feasible, and would provide the needed incentive!

Several other issues need to be resolved before the proxy can be used. Its placement depends on network topology, and should be chosen so as to minimize overall traffic and latency. Placing the proxy on the subnet with the NEARNET gateways means that a query and its response will cross the MIT backbone exactly once, whether the cache hits or not. Placing it elsewhere means that queries that result in cache misses cross the MIT backbone twice. A proxy may be bought or built. Unless the design is unusual, proxies that are already available, e.g., Netscape, Harvest, CERN, are likely to be suitable. The cheaper and better engineered they are the better! The proxy should be set up to prevent parties outside of MIT from using its services (since this will increase MIT's traffic bill).

There are two main difficulties that we must consider before adopting a single proxy: load and extensibility, and reliability. It is not clear that a proxy running on a single machine can bear the necessary load without itself becoming the bottleneck. Again, using back-of-the-envelope calculations, if the average page size is on the order of 20Kb, then say 4 Gb of Web traffic/day becomes around 200000 requests/day. This is a reasonable load for a single machine web-site, e.g., SIPB, though note that proxies may be more constrained since a Web server is effectively a proxy with a 100% hit rate. In other terms, 200000 requests/day is on average 2 requests/second (though the burst rate will be much higher), which is easily within reach of well-engineered servers running on reasonable platforms, e.g., Harvest. And HENSA recommends one reasonable processor per 200000 requests. Thus in the short term a single proxy, running on say $10000 worth of souped-up PC, may suffice.

But scalability and reliability are not well provided for with a single proxy. If the load becomes too high, or to cater for other external gateways, multiple independent proxies may be maintained. This has the disadvantage that the lack of sharing between their caches will decrease the overall hit-rate, and hence traffic savings, somewhat. It may not be possible to combine these independent caches into a single larger cache if load is the problem. (But see below for schemes that combine caches.) Further, to protect against single points of failure, round-robin DNS could be used to resolve a single hostname, e.g., web-proxy.mit.edu, into the different proxy IP addresses. This would allow one proxy to fail without affecting all clients. Fault-tolerance is not provided though -- the DNS mapping is independent of machine failure, and DNS caching would prevent the host to IP address mapping from being dynamic for every Web request (even if the proxy hostname were translated to an address every time).

Distributed cache

An alternative scheme is to distribute the single cache into a number of components that are handled by different machines, either based on domain, or a more random hash on the URL or cache key. This scheme may be more scalable by allowing the cache to be divided into enough components such that a single machine can comfortably handle the load.

Several issues must be resolved by such a design. Again, the clients must reach the cache. This might be accomplished using local, e.g., per subnet, proxy-like agents that are reached via the proxy mechanism and decide on further routing to the correct component, i.e., a level of indirection to avoid client modifications. These local proxies may or may not cache themselves, and may avoid forwarding non-cacheable queries to the distributed cache. Care should be taken to ensure that an extra level of indirection does not add significantly to latency. Careful choice of the distribution function would be needed to minimize the unevenness of the component load. The number of components would presumable be either small (given that a single decent proxy might bear the load) or large (so that machines could be reused instead of dedicated, though this might lead to reliability problems). There may or may not be provision to change the number of components dynamically.

In all cases, both the proxy-like agents and distributed cache units must be designed and implemented. A sound basis for this may be public domain source code for a caching proxy server. It would be possible to design a custom protocol between proxy-like agent and cache component, e.g. for reliability, or use HTTP, e.g., for simplicity. Beyond scalability, these distributed schemes seem to offer little over the proxy scheme, except that dedicated equipment may not be required. If the proxy mechanism is used to contact the distributed cache, it may represent a single point of failure for a number of clients. The system is more complex than a simple proxy solution, and may fail in more complicated ways.

Hierarchical cache

A different solution to scalability involves a hierarchical cache organization, as exemplified by Harvest. These schemes may differ in their behavior on a cache miss: a simple scheme may query the next higher-level cache, while more complicated schemes such as Harvest multicasts to siblings and the parent node in order to distribute load and minimize latency.

Again there are several issues. Larger (and more powerful) cache nodes are needed higher up the tree, and the layout of the tree and number of levels must be chosen. Protocols between cache nodes, e.g., providing better latency and reliability than HTTP and passing other information, may be designed. It would be advantageous to provide each distinct group that has common interests, e.g., LCS, ILGs, with their own node. Care must be taken to ensure that latency does not increase significantly (either on a hit or miss); Harvest has shown that well engineered nodes can manage three levels without adverse effects.

Filtering gateway

All schemes so far have relied on the proxy mechanism to interpose caches between browser and server. (The DEC caching relay used SOCKS, an application-neutral means of directly contacting and navigating through a firewall, but proxies are probably more widespread and convenient.) A different approach is to place a packet filtering device at the external gateway. The filter then skims Web requests and replies to be handled by a cache system. This scheme has the advantage of requiring no browser configuration, but a number of disadvantages, including violation of the network abstraction, performance, scalability and reliability.

The packet filter must operate at the IP level (or below), yet it must identify Web cacheable requests from information in the HTTP level, above the TCP level. Breaking this abstraction barrier is a complicated catch-22 -- a TCP connection for port 80 may be established by the filter, only to find that it is not a Web request! Or it may be that one Web request may be fragmented into two IP datagrams that the filter must re-assemble to interpret! And care must be taken to allow packets from the cache system through without interception. The filter must process at least all outgoing packets, Web-related or not. At 40Gb per day -- thousands of packets per second on average and higher rate bursts -- custom hardware is required. It must also process all incoming packets unless source/destination addresses are changed and the back-end cache uses a directly addressed proxy-like mechanism. The filter may be a performance bottleneck for all traffic, as well as a single point of failure.

Cache details

In all the schemes, the design of cache components must answer a common set of questions. Much can be borrowed from the design decisions embodied in existing caching proxies, but the following choices are relevant.

Objects and key

What objects should be cached? Not all Web pages are cacheable, the prime examples being those requiring authorization, POSTs, queries, or dynamically generated, e.g., cgi-bin scripts. Unfortunately, there is no reliable means for discerning whether an object is cacheable or not, and existing proxies use heuristics. Clues such as no explicit Content-Length and no Last-Modified date often indicate dynamic, unending, or otherwise non-cacheable data. Aborted retrievals should not be cached, while unsuccessful retrievals may be cached for a short period in order to prevent repeated failures causing repeated cache misses, i.e., negative caching.

What protocols should be cached? As well as the HTTP protocol, browsers communicate using FTP, GOPHER, etc., and it would be possible to cache some of the results, though it may not be worthwhile. A common tactic is to avoid caching excessively large documents, because they occupy valuable cache space. And copyrighted documents present an interesting issue -- perhaps they could be pre-expired?

The cache key also presents issues. A Web server may return a different object depending not only on the URL, but any of the request headers, most typically the client identification and list of accepted MIME types. Using these headers as well as the URL, however, will lower the hit rate without either altering the common cases or providing true transparency -- because of access control, the returned document further depends on the IP address of the client! Consequently, most proxies use the URL as the key, ignoring the MIME headers. Nonetheless, the cache must still check that the MIME type of a hit is acceptable to the client.

Aside from headers, however, the URL does not uniquely name a document, since multiple URLs can be constructed to reach the same page using host nicknames. To save on cache space, such duplicates may be detected, e.g., by comparing IP addresses, and weeded out. Similarly, DNS mappings may cause one hostname to have multiple IP addresses. Further, different URLs may name "equivalent" documents due to automatic directory listings, redundancy in the underlying directory structure, or simply because the "blue dot" image has been widely copied. Comparison based on message digest may identify these duplicates to further save on cache space.

Allocation and eviction

We need to decide how to allocate cache space and reclaim it when the cache is full. Without detailed information about access patterns, copying a page into the cache when serving a miss and evicting the least-recently used (LRU) pages seems appropriate; it is a simple scheme that is based on locality of reference and is hard to beat, and it has been successfully used by Harvest. It may also prove useful to evict expired or nearly expired pages first. Other schemes (based on frequency and size) are possible, but may be more complex but no more effective. Prefetching is used in some systems, but given our traffic objectives, it is not likely to be useful.

Consistency

There is a tradeoff between performance and cache consistency. It is difficult to know whether a cached page is still valid, or whether it has changed at the server and the copy is stale. If absolute consistency is required, the page may be refreshed each time it is hit and still save on bandwidth. Other proxies use Time-To-Live (TTL) schemes, estimating validity by interpreting the Last-Modified timestamp as an indication of the rate of change of the document. A fudge factor is used to trade stale pages for hits, thus further saving on bandwidth. If there is no Last-Modified timestamp, a default scheme is needed. Validity estimates may depend on the type of the object, e.g., images have been found to typically remain valid for longer than HTML pages. Dynamic schemes are possible, e.g., exponential backoff on objects expired too early, coupled with sampling to detect objects expired too late. Implementation and measurement are needed to assess the value of these schemes.

To efficiently refresh a document (in terms of both bandwidth and latency), the get-if-modified mechanism can be used. This also implies that there is no need to immediately evict documents that are presumed not to be valid; they are still useful.

A means for allowing users to bypass the caching system to be sure the object is fresh is also useful. This can be implemented using the Pragma: no-cache or Cache-Control headers. It is already used by some browsers, e.g. Shift-Reload in Netscape.

Size and performance

For equipment, the HENSA cache provides some interesting rules of thumb: two 2Gb disks plus another for every 100000 requests/day; 64Mb of RAM plus 32Mb for every 100000 requests/day; and a powerful processor per 200000 requests/day. Disk bandwidth as well as capacity is important, which is why 2Gb rather than 4Gb disks are suggested. HENSA reports a hit rate (in terms of bytes delivered no less!) of around 60%.

Thus, for MIT, an initial configuration might comprise four 2Gb disks, 128Mb RAM, and one processor (being targeted at 200000 requests/day). At $250/Gb or less, it would seem inexpensive to increase the amount of disk space if it proves useful. If there are around 4Gb/day of incoming Web data, of which 50% is already in the cache, then new material will purged from the cache if it is not used within four days. Using four 4Gb disks doubles this to eight days.

A further concern in implementing the cache is latency. Cut-through, as opposed to store and forward, return of objects should be used so that the client does not have to wait while the cache retrieves the entire document. This is necessary in any case to support unending flows of data, e.g., server animation push. In multi-level schemes, the number of levels should be the minimum necessary (although we are prepared to trade extra levels to obtain increased bandwidth savings).

Evaluation

While there isn't enough information to evaluate a design without its implementation, it is still possible to suggest an evaluation process. To support it, the cache should maintain an extensive range of measurements (e.g. Web traffic, hit rate, stale page rate, cache latency) to allow us to assess its performance. Some of these measures may require mechanisms that would not otherwise exist, e.g., sampling to determine the stale page rate.

We can measure the hit rate in terms of traffic to relate it more directly to our savings. Throughput and internal latency can be used to gauge cache performance, including when more resources are needed as traffic grows (or if too many are already allocated). Eviction of many valid pages may mean that the cache is too small and should be expanded. Increasing the size of the cache will bring diminishing returns in terms of hit rate, and at some point is not economic.

Costs for development, equipment, and administration and maintenance are all relevant to the feasibility of the project, and depend on your design.

On references

A note on references: prefer documents that have been reliably placed in the public domain and subject to peer review, i.e., conference proceedings are better than random Web pointers, though sometimes you'll have to settle for the latter. When you do cite Web references:

use the traditional 3-block citation: author, title, and publication data.
don't let the publication data consist solely of a URL; include an organization name or comparable information.

The test of a Web citation is then whether -- when the page moves but still exists -- there is enough information to quickly discover its new location. Also, try to use references to lend support to your arguments on specific points, as well as in general.

1. Steven Glassman, A Caching Relay for the World Wide Web, Proc. of the First International WWW Conference, August 1994.

2. Anawat Chankhunthod et al., A Hierarchical Internet Object Cache, Proc. of USENIX '96, January 1996.

3. Neil Smith, UK Academic National Web Cache at HENSA Unix, http://www.hensa.ac.uk/wwwcache/, HENSA Unix, 1996.

Top // 6.033 home // 6.033 Handout 25, issued 4/11/96