Pre-Fetching Web Browser

6.033 Design Project: a Pre-fetching Web Browser

1. Abstract

2. Introduction

3. Focus of Report

4. Design Criteria

5. Possible Implementations

6. Design Choices

6.1 Multithreading

6.2 Pipelining

6.3 Caching

6.4 Pre-fetching images

6.5 Interface

6.6 Design overview

7. Conclusion

8. References

1. Abstract

Most web browsers spend a great deal of time idle. To take advantage of this unused time, a browser might be designed which would load into memory other pages which the user might want to visit after finishing with the current page. Through multi-threaded page fetches initiated in the user's computer, these pages could be cached without a significant reduction in performance of the browser's current function.

2. Introduction

Browsers today exhibit start-and-stop behavior: they have bursts of activity (and accompanying delays) when a new page is fetched, but these frenzies of processor and network traffic are interspersed between long idle stretches while the user reads the current page. During these idle periods, the browser could be fetching other pages from remote servers and holding them in anticipation of their being requested by the user. Since many users follow links from the current page on the browser window, those pages whose URLs are listed as links on the current page constitute a set of likely candidates for this sort of pre-emptive fetching. If the browser were to cache all linked pages, it would reduce significantly the delay encountered by a user while loading a new page.

In designing the browser, a few special cases would have to be addressed. Pages containing Java applets would have to be dealt with, but that problem is beyond the scope of this paper. The browser would have to deal with cases where a single page contains dozens of links. Additionally, those links might be spread out over several servers, or concentrated on one. Issues like this would have to be taken into account in the browser implementation, allowing it to operate efficiently under a variety of conditions.

3. Focus of Report

The first section of the report deals with the design criteria for the browser. The goals of the browser design are listed explicitly, with possible hazards pointed out. This is followed by a section discussing possible implementations of the browser. Several possibilities are explored, and a few advantages and disadvantages are listed for each.

The next section details the design choices that I made for the browser. This is the most detailed section, and it contains details on how my design would implement each of the features discussed in the previous section.

The final section of the paper summarizes the factors involved in the design of the browser and relates the design to current issues in world-wide-web development.

4. Design Criteria

The browser would have a number of constraints on its operation. Its first priority would be non-interference with current browser functions. It wouldn't want to interfere with other applications running on the computer, either. It would have to be able to run under different conditions, such as varying network speed and processing speed. It would have to be responsible in its use of shared resources, such as network bandwidth and server processing time. Finally, it would have to be compatible with current network protocols.

Each of these design constraints would have to be traded off against the others in the actual implementation. In order to be useful in its function, i.e., to fetch pages in a reasonable amount of time, the browser would have to impinge somewhat on the other requirements. The priorities of the designer would have to guide this give-and-take process, and the final design might differ significantly from the sample implementation I propose later in this paper.

5. Possible Implementations

The basis for the pre-fetching function would be standard. After reading the HTML code for the current page, the browser would initiate processes to pre-fetch each link listed in the code. This could be implemented a number of ways, each having its own peculiarities of efficiency and complexity. The links could be fetched serially, in the order that they appear in the code. They could be fetched in another order, based on some heuristic of "link importance". They could be fetched concurrently using threads, with one thread per link. My implementation combines concurrency and a limited sort of serial fetching, which is handled by a thread manager.

Another important implementation choice is how to handle the caching process. The same cache could be used to store visited pages and pre-fetched pages, or a separate cache could be built into the browser. This raises issues of space allocation and efficient use of disk space. The other issue is the question of what to overwrite with what. Would new pre-fetched pages always overwrite old ones? Would they ever overwrite visited pages? When would new visited pages overwrite pre-fetched pages? These factors are key in determining the effectiveness of the entire design. If visited pages were always to take precedence in the cache over pre-fetched pages, the browser would never pre-fetch anything once the cache were full. If new pre-fetches were to take precedence over old visited pages, however, one page's links could provide enough pre-fetched page data to overwrite the entire cache, including frequently-visited pages.

The order of execution of the various processes in this new browser would be one of the key issues in determining its efficiency. Would it pre-fetch only after the current page were completely loaded, or would it use some sort of concurrency to pre-fetch while receiving the current page's data? If it were to wait for the all of the code to be received for the current page, would it wait for this code to be parsed and translated onto the browser window before pre-fetching? These decisions might affect the speed of both pre-fetching the linked page and loading the current page.

The browser might be designed to pre-fetch images for the linked pages along with HTML code. This would introduce a huge variable into the pre-fetch time for a page, as many images are much larger than an average HTML file. The order of pre-fetching would be complicated by this extra operation, and the designer would have to decide whether to fetch each linked page's images directly after receiving its HTML code or only after pre-fetching all of the linked pages.

Finally, the designer would have to decide how to handle a user-initiated request while pages were being pre-fetched. One implementation would be to have the browser kill all pre-fetch threads in order to increase the efficiency of pre-fetching the next page's links. Another would use some heuristic to determine which threads should be killed, and which preserved. For example, a browser might decide to keep all threads from one page back.

6. Design Choices

In implementing the browser, I would use a multi-threaded pre-fetching algorithm that would send pipelined requests to targeted servers. User-initiated requests and the loading of the current page would be given top priority by the scheduler, which would force the pre-fetch threads to yield to either of those processes immediately. The thread manager would dynamically set a limit on the number of threads running at once, allowing the browser to cope with varying network speeds and processor loads. The cache manager would allocate space for each pre-fetched page and keep track of the status and location of each page in the cache. Each of these elements would work in concert to provide an efficient, effective pre-fetching browser.

6.1 Multithreading

The design would definitely make use of multithreading. After parsing the HTML code and identifying the links, a list of links would be built. The links would be grouped according to their home server. At least one thread would be created per server named in the list of links. Each of these threads would be responsible for building a URL request to be sent to the server, sending the message, waiting for a response, and caching the data received from the web server. In order to cope with thread creation and coordination, the design would need some sort of thread manager with a scheduler to control thread switching. In my design, I would also include in the thread manager a method of limiting the number of threads based on network and processor speeds. A browser operating on a slow computer, or on a computer with a slow network connection, would run into severe problems if it were to spawn a thread for every link on its current page. Even if the pre-fetch threads were given lowest priority by the scheduler, there would be massive delays in the loading of the current page. The thread creation and switching overhead would devour processor time, and the requests pouring out and in over the network connection would use up all of the bandwidth, leaving the connection glutted and incapable of handling user-initiated page requests. In addition, processor time on a slow computer might be consumed by thread handling operations, interfering with other applications.

My advice would be to allow the thread manager to dynamically set the maximum number of concurrent threads based on the current network speed and processor speed. It would use the following heuristic: (1) A given processor would have a maximum number of concurrent threads based on its processor speed. (2) This limit could be lowered even further if the number of packets per second across the network dropped below a certain rate. (3) If the system were using virtual memory, the browser might even want to limit the number of threads based on disk-writing speed, as a fast network connection might allow more data to flow in across the network than could be written to disk at once. RAM would fill up quickly, causing a system freeze unless packets were discarded. Note that this would only happen in cases of extremely fast network connections, as the speed of the connection is usually the bottleneck in a system.

As a thread finished caching its page to memory, it would be terminated and a new thread would be forked to deal with the next set of links in the list compiled from the HTML code. The design would therefore use a limited form of serial link fetching, as new threads were forked to deal with later links in the list. Depending on the actual implementation and the computer system in question, there might be capability to support more threads than are called for by the list of links. This would be an ideal case, and would take full advantage of multithreading instead of serial link fetching.

Another multithreading issue arises when the user requests a new page, either by clicking on a link or by visiting the browser's bookmarks, "guide" or "search" buttons, or address book. I advocate killing all pre-fetch threads, even those that are nearly finished writing their pages to the cache, and closing their server connections. This preserves bandwidth, which becomes absolutely essential as soon as the user chooses to view a new page. The browser would close connections to all servers but the one containing the new page requested. If the new page were already in the cache, it would be parsed as usual by the browser. If not, and it were not currently being received from the server, a connection would be established to the server and a fresh page request made. In an ideal case, the page would already have been received from the server, and the browser could start immediately pre-fetching the new linked pages. This implementation would have the advantage that it would preserve bandwidth, but it has the drawback that many of the linked pages from the previous page would have to be re-fetched if the user were to hit the "back" button.

6.2 Pipelining

Pipelining is a design choice that would cut down significantly on three things: message creation overhead in the client, network traffic, and server-side congestion. A single pipe could encapsulate several page requests, making it faster to build than several independent page request messages. It might even be small enough to fit into a single packet, cutting down significantly on the number of packets sent by the client machine and therefore on the chances of congestion. Fewer packets would also mean a lowered risk of lost packets due to transient errors, which would in turn result in fewer re-send delays. Finally, a single request sent to the server would prevent an influx of dozens of packets at once into the server from a single client, each requesting a separate URL.

There is one serious downside to pipelining: in HTTP/1.1 , page data is received from the server in the order requested, rather than concurrently [1]. This would mean that a click-through to the last link in a pipe would force the browser to either wait for all other pages in the pipe to be received, or to send a redundant request for the single URL. This is one of the most significant trade-offs that the implementation would have to resolve. In my opinion, however, the additional network congestion and message creation overhead from independent messages seem to point to pipelining as the more efficient implementation.

If pipelining were used, each thread would have to handle the responses to its requests. The thread could accomplish this by handling each page serially (since this is how they will be received from the server), or it could make a fork call as each page header were received, spawning a new thread to handle the incoming page data. The latter might be a more effective strategy, as it could handle packets received out of order. On the other hand, most of the data will be serial, and the extra threads might simply sit idle most of the time. A serial model would have to have a buffer to handle the premature reception of page headers.

6.3 Caching

Each page coming into the machine would have to be cached somewhere for convenient lookup. A cache manager would allow quick lookup and could also hold key data on each page in the cache. The manager would want to contain at least the following data for each page in its cache: last date cached, size of page, location of page in memory, name/home server of page, and whether the page has been visited or pre-fetched. It could also act as a mutex for the cache by pre-allocating enough space for each page based on its header information: information which the computer would receive before any other packets containing data for that page. The first action taken by a pre-fetch thread while receiving packets would be to store the header of the incoming page in a buffer, sending the information to the cache manager [2]. The cache manager would set aside in memory or virtual memory the amount of space indicated by the header, preventing other incoming pages to be written in that location. Thus, multiple threads receiving packets at the same time could write to the cache concurrently without fear of data races.

The cache for pre-fetched pages would be somewhat separate from the cache for visited pages. Each would have a maximum size, although a page of one type (visited versus pre-fetched) could be written in unused space in the other cache. Pages of a given type would always have priority in their own cache, however, writing over any "intruding" data and erasing that page from the cache. To ensure that low-priority pages were written over, the writing thread would search for the oldest page and mark it to erasure. If a pre-fetched page were visited, its flag in the cache manager would switch from "pre-fetched" to "visited", and its memory would be counted in the "visited" cache. This implementation would ensure that new pages could always be pre-fetched (writing over older pre-fetched pages), but they wouldn't overwrite frequently-visited pages in the cache.

Another vital function of the cache manager, of course, would be to check which pages were already in the cache. While the current page's HTML code were being parsed, each link which appeared would be looked up in the cache manager. If it were already cached, and the cached page hadn't passed its "expiration time" as specified by the user in the "Preferences" dialogue, the link wouldn't be added to the list of pages to pre-fetch. In this regard, the cache manager would duplicate the operation of cache managers in conventional browsers, which direct the browser to read the cached page instead of requesting the page from the server. Note that the "expiration time" for pre-fetched pages would have to take into account the possibility of time-sensitive data on pages, and it might therefore be quite short.

The importance of the header information to cache management suggests that the browser might want to make header requests before requesting pages from the server. HTTP/1.1 allows clients to request header information from the server separate from the HTML code for a page. According to the protocol, these requests can be pipelined just like URL requests [1]. The browser could wait for the header information before deciding which pages to fetch, and in what order. To increase efficiency, it might arrange the files in the pipelined URL request with the smaller pages first, thus eliminating the need to wait for a large page to be sent by the server before it has a chance to send the smaller pages. The main problem with pipelining could thereby be reduced, as small pages would be fetched very quickly. The drawback to using header requests, of course, would be the added time to build the request, send it across the network, wait for a response, and build the ordered list of URLs from the header information. In my implementation, I assume that this delay would outweigh any benefits from ordering the URL pipeline by page size. I would therefore not use header requests in my implementation.

6.4 Pre-fetching images

One important design consideration is whether to pre-fetch images from the linked pages. It would introduce a new level of system complexity, and it could lead to an explosion in the amount of data requested at one time. On the other hand, the images on a page might be the key content that the user wants. One way to handle this problem would be to make this an option that the user could set in a "Preferences" dialogue with the browser. This is the approach I would take in my implementation. The pre-fetch threads would wait until all of the linked pages had been read into the cache, at which point they would parse their pages' HTML code and compile pipelined image requests as though the linked pages were each the current page. Note that this could result in several fork commands being executed by each link-fetching thread, which could lead to an explosion of thread creation. The thread manager would limit this as usual, but it would mean that linked pages with many images would use up a huge amount of bandwidth, even though the user might never see them.

6.5 Interface

The interface for the browser would be almost identical to that for a conventional browser. It would incorporate only three notable changes:

The "Preferences" dialogue would have a field for "pre-fetch timeout", which would set the "expiration time" for a pre-fetched page. This time would be very short, to account for time-sensitive data on pages.
The "Preferences" dialogue would also present the option to pre-fetch images along with pages. It would warn against this, since it would be very inefficient on most network connections.
While pre-fetching pages, the text bar at the bottom of the browser window would give a progress bar indicating how many of the links had been pre-fetched.

The third interface change would be sufficient to keep the user informed of how long he could expect a click-through to take. The other two changes are vital to regulating the cache and the pre-fetch algorithm, respectively.

6.6 Design Overview

Upon loading a new page, the browser would initiate a series of pre-fetch procedures. A low-priority thread would be forked for each linked server, each of which would construct and send a pipe of URL requests. As each page were received, the thread would fork off a new thread to deal with the incoming data and write it to the cache. If the user were to request a new page while these threads were executing, the threads would be killed and their connections closed.

A thread manager would handle the threads, ensuring that the pre-fetching threads would always yield to other processes. The thread manager would also handle the creation and abortion of threads.

A cache manager would serve to eliminate data writing conflicts in the cache due to concurrency. It would also serve to coordinate the pre-fetch and visited caches. Finally, it would allow quick access to data about the pages in the cache

7. Conclusion

The browser design has the capacity to speed some web navigation tasks. Many users would find the browser useful for their web-browsing activities if it were implemented properly. However, as always, some trade-offs will have been made. The browser will almost always be slower to respond to fast click-through navigation, and the interface will be slower to react to any user input. Other applications will run more slowly on the user's computer while running the browser, and the network connection will often be clogged with incoming packets from several web pages.

The repercussions of the widespread use of this browser would also be significant. Each page loaded into the browser would spawn an additional number of page requests equal to the number of links on the current page. If the mean number of links per web page were ten, this would mean a tenfold increase in network traffic. Given the state of congestion on the Internet today, this would be a noticeable annoyance to all users.

Nonetheless, the design would be quite effective on a well-connected, fast computer. It would remove much of the wait-time from web browsing, resulting in a more pleasant and efficient human-machine interface.

8. References

Fielding, R. et al, Hypertext Transfer Protocol : HTTP/1.1. Network Working Group, January 1997. http://www.w3.org/Protocols/rfc2068/rfc2068

I developed many of the ideas for this design through conversation with 6.033 students Tim Cargol, Ravindra Sarma, Alon Mozes, Monica Bhattacharya, and Morgan McGuire. The idea of storing incoming file headers in buffers for use in the caching system was Tim Cargol's idea. The other ideas voiced in the paper are my own, although their final forms may have come about through dialogue with the other students.