6.033 - Computer System Engineering | Handout 26
6.033 Spring 1999 Design Project 1
Design space and design issues, version of 4/4/1999, 00:15.

The following notes compiled by Jerry Saltzer explore the design space for the first 6.033 design project. To a large extent they consist of the union of most of the ideas mentioned in the papers for the two sections that he graded, plus several other suggestions made by other members of the teaching staff. Note that not all of these ideas are equally good. Occasionally there is a comment about the value of an idea, but for the most part those comments don't attempt anything resembling a comprehensive evaluation.

1. Definition of transparency
   - User doesn't have to do anything special, but can detect that something is going on.
   - User can't tell he was redirected. (Seems unnecessary.)

2. How dynamic does redirection need to be?
   - once assigned a replica, may use it for multiple visits
   - each time the home URL is issued, replica choice is reviewed
   - for each use of any URL in the site, replica choice is reviewed
   This choice interacts with caching and with bookmark behavior, and also with whether or not the server maintains changeable state.

3. Design targets: What scale does your design intend to encompass? Explain the design center, the design minimum, and the design maximum, particularly as to number of servers.

4. Metric used to choose a server
   - randomization (distributes load, ignores network service quality)
   - geographical location (weak proxy for network proximity)
   - network topology from static maps compared with client IP addr
   - server properties
     - server up/down
     - server capacity
     - measured server response time to a small HTTP request
     - server port bandwidth
     - server port unused bandwidth
     - current server load
       - thread queue length (current or averaged)
       - CPU idle time (over how long?)
       - number of open TCP connections (current or averaged?)
     - current server disk channel utilization (how measured?)
     - current server remaining capacity (how measured?)
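Several of the server-load metrics above are qualified with "current or averaged". As a minimal illustration of the averaged variant (a sketch in Python with made-up sample values, not part of the handout), an exponentially weighted moving average turns a noisy instantaneous reading, such as thread queue length, into a steadier figure:

```python
def ewma(samples, alpha=0.2):
    """Exponentially weighted moving average of a series of load samples.

    alpha is the smoothing factor: a higher alpha weights recent samples
    more heavily; a lower alpha gives a steadier long-term average."""
    avg = None
    for s in samples:
        avg = s if avg is None else alpha * s + (1 - alpha) * avg
    return avg

# Hypothetical thread-queue-length samples, one per second; the spike
# of 30 is a transient that a raw "current" reading would overreact to.
queue_lengths = [3, 5, 30, 4, 3]
print(round(ewma(queue_lengths), 2))   # → 6.82
```

Averaging answers the "(current or averaged?)" question at the cost of reacting more slowly to genuine load shifts; the choice of alpha (or of the averaging window) is itself a design decision.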
   - network properties measured at the instant of the connection request
     - minimum latency between client and server
     - latency as a proxy for congestion
       - ICMP ping latency from server to client
       - TCP probe latency from server to client
       - average of multiple probes from server to client
     - server-to-client hop count (traceroute, may be ponderous)
       - from conversation with a nearby router (e.g., Cisco Distributed Director uses BGP info)
       - hop count (not a significant factor in performance)
     - congestion/quality-of-service info
       - download rate of the first HTTP connection; redirect only if below a threshold
       - first page from the central server, each graphic on the page from a different server; after the page is delivered the servers get together to decide who did best, and central redirects the *next* request from that client
   - history of observed network properties for the region near the client
     - region defined by network topology
     - region defined by domain name
     - region defined by IP address hierarchy (e.g., cache indexed by class-B or class-C subnet address)
     - observed properties
       - ICMP ping latency
       - TCP probe latency
       - packet loss rate
   - political (border-crossing rules, etc.)
   - cost (some ISPs charge servers by the gigabyte of network usage)

5. Level of centralization/decentralization
   - all requests go to a central site, which dispatches
   - initial request goes to any server (e.g., with DNS round-robin), which asks all the others if they are closer

6. Method of assessing network latency caused by congestion
   - ping by each replica server to the client. Requires a high-level message from the central server to each replica server, and a decision about how long to wait for responses. (The longest response time may be from the server closest to a distant client, but waiting for it delays everyone.)
   - Use topology information to ask only promising servers to ping the client. (Avoids flooding the client with pings.)
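The "TCP probe latency" and "average of multiple probes" measurements can be sketched as follows (a Python sketch, not part of the handout). The local listener here stands in for a reachable client endpoint; probing a real browser client would normally use ICMP echo instead, which requires raw sockets:

```python
import socket
import time

def tcp_probe_rtt(host, port, timeout=1.0):
    """One latency sample: the time to complete a TCP three-way
    handshake. Returns seconds, or None if the probe is refused
    or times out."""
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return None
    rtt = time.monotonic() - start
    sock.close()
    return rtt

def average_probe_rtt(host, port, probes=3, timeout=1.0):
    """Average several probes to smooth out jitter, as the notes suggest."""
    samples = [r for r in (tcp_probe_rtt(host, port, timeout)
                           for _ in range(probes)) if r is not None]
    return sum(samples) / len(samples) if samples else None

# Demo against a local listener standing in for the client.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))    # port 0: the OS picks a free port
listener.listen(5)
host, port = listener.getsockname()
print(average_probe_rtt(host, port) is not None)   # → True
listener.close()
```

The None return on failure matters for the dispatcher: a server whose probe times out must be dropped from consideration rather than treated as infinitely fast or infinitely slow.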
   - ping by the central site of each client, with a loose source route via a router near each server (also ping the server, to subtract its leg. Loose source routing is not widely implemented, so this probably doesn't work in practice.)
   - add a ping-forwarding protocol to Acme servers: the central site sends an echo-him request to a server, and the server sends an echo request to the client
   - ask a nearby router for RIP info

7. How are multiple metrics combined?
   - linear weighting (how do you choose the weights? How do you linearly combine RTT in milliseconds and server load as a fraction?)
   - assign the first server that responds with load below a chosen threshold (e.g., 50%) and RTT below a chosen threshold (e.g., 50 ms)
   - use metrics in priority order, going lower only to break ties, e.g.:
     1/ server not overloaded
     2/ RTT (tie if within, e.g., 20%)
     3/ server connection speed
     4/ random or round robin

8. When the assessment leading to the choice of server is made
   - after a request from the client (adds delays)
     - may post a temporary page saying "one moment please, while we identify the best site to serve you" and playing the Acme theme song
   - while the central site is handling the first several requests from the client (requires a cache of recently assigned clients)
   - in advance, using a proxy for the client (what proxy?)
   - load assessment can be done at a different time from network latency assessment (e.g., immediately assign the client to a lightly loaded server, then reassign if another server later proves it has significantly less congestion on its path to the client)

9. Forwarding mechanism
   - DNS
     - standard DNS: the client chooses from multiple name records (see client-side designs, below)
     - Acme DNS servers choose, setting TTL to zero to avoid caching (N.B., works well for randomization, round robin, or even server-load-based distribution, but it is very difficult to base the decision on client latency:
       - the DNS server doesn't know who the real client is, because many DNS requests are forwarded by intermediate recursive DNS servers
       - client timeouts may require fast action, and probably mean that there will be duplicate requests from the same client via different paths
       - most companies don't run their own DNS servers.)
   - HTTP redirect
     - via host name (requires another DNS lookup)
     - put the IP address in the URL (avoids a second DNS lookup)
     - permanent versus temporary redirect (affects bookmarks, partial URLs, and perhaps POST)
     - use DNS for a first-cut distribution not based on latency, then HTTP redirect for improvement if the initial server's latency is too high
   - BASE element redirect
     - the central site responds with the initial page, but inserts a BASE element to do the redirection; all internal links are partial URLs
   - link insertion redirect: the central site responds with the initial page, but has dynamically constructed it so that all links--even the graphics on the initial page--point to a better replica
   - IP tunnel: the central site maintains an open TCP connection to every replica, and immediately tunnels the initial request to the preferred replica
     - the replica thinks it is at the central IP address and tunnels its response back through the central site and thence to the client; all requests go through the central site
     - the replica tunnels the initial response back (or tries to IP-spoof its way back directly); the page returned by the replica contains links with absolute URLs that name the replica site, so future requests will go directly to the replica

10. Client-side designs
   - what the upgraded client does
     - DNS sends a list of servers; the client chooses the best
     - Web server sends a list of servers; the client chooses the best
       - client pings each server
       - client makes an end-to-end request (e.g., HTTP HEAD) to include server load in the RTT measurement
       - send all probes in parallel, use the first server that responds
   - method of upgrading the client
     - initial server downloads a Java applet
     - wait for the next release of Internet Explorer and Netscape

11.
In addition to the above fairly central issues, a good paper should give some answer to each of the more subtle questions in the design project handout:
   - how hard is it to add a replica?
   - what if everyone used your design?
   - where is the bottleneck in your design?
   - how do you decide when and where to add replicas?
   - can anyone (e.g., AOL) create a replica?

12. Response to failure, e.g., central site crashes, replica crashes.
   (Not evaluated, since we haven't come to this topic yet.)

13. How well does the proposed latency-measuring algorithm deal with the following case?

                                            x
          Acme Central----------------------x----------server 1
               \                            x                 /
                \---------------server 2----x--client--------/
                                            x
                                     bad congestion

   The challenging problem here is that server 1 will measure a short delay to the client, but its report may take a very long time to get back to Acme Central--maybe long enough to exceed Acme Central's timeout. Server 2 will measure a long delay to the client, but its report will get back to Acme Central very quickly. A short timeout may result in server 2 getting the job, while a long timeout will slow things down for the case where the client is on the near side of the congestion. It appears that a better way to identify this case is for Acme Central to also separately ping the client, and use some small multiple of that RTT as its timeout. (Note that client-based models handle this problem very nicely, as does the first-server-below-threshold method.)

14. Some naming issues that come up, even though this topic arises after the design project was due.
   - What is in mirror-site hyperlinks that point within the site?
     - HREF = "http://www.acme.com/filename"
     - HREF = "http://mirror1.acme.com/filename"
     - HREF = "filename"
   - How does the previous decision interact with the decision to use
     - 301 permanently moved
     - 302 temporarily moved
   - How do those two decisions interact with the requirement to get user confirmation on forwarded POST operations?
   - How do those two decisions interact with bookmarks and, more generally, the intended dynamics of redirection?

15. Subtleties
   - Include a note and a link in the entity-body field of the initial response, for old browsers that don't know about HTTP redirection.
   - The central site should occasionally ping the servers and keep a running average of the response time, to use in determining the timeout when making future inquiries about load, etc.
   - The decision of where more server capacity is needed is muddled by the efforts of the system to distribute the load to the least-loaded servers.

16. Abstract of one good design; there are many others that are equally good, depending on the desiderata chosen:

   Up to ten primary web servers have the same DNS name, with DNS round robin used to choose among them. When an HTTP request arrives at any server, it pings the client and sends a request to each of the other nine servers (each server must also have a distinct DNS name) asking them to ping the client and report back the RTT and their own processor utilization, averaged over the last 2 minutes. The first server that reports an RTT under 50 ms and an average load under 50% gets the job, with an HTTP 301 permanently-moved response using the server's name. If no server is that good, redirect to the best available server, with (RTT/100 ms) and (1/(1-load)) equally weighted. If more than ten servers are needed, additional servers are clustered around one of the primary servers and a local dispatcher distributes the load. Each primary web server maintains a cache of recently served clients, and if a request comes in that hits in the cache, it omits server polling and serves the client immediately, unless its own utilization is running above a high threshold, say 95%. To keep clients returning to the designated site, all within-site URLs are partial rather than absolute.
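The selection rule in the design abstract of item 16 can be sketched as follows (a Python illustration with hypothetical server reports, not part of the handout). Each report is a (name, RTT in ms, load as a fraction) tuple in arrival order; the first report under both thresholds wins, and otherwise the lowest equally weighted score wins:

```python
def choose_server(reports, rtt_ok=50.0, load_ok=0.5):
    """reports: list of (name, rtt_ms, load) tuples, in the order the
    replies arrived; load is a fraction and assumed to be below 1.0.

    Rule from the design abstract: take the first server reporting
    RTT under 50 ms and load under 50%; otherwise weight RTT (scaled
    by 100 ms) and 1/(1-load) equally and take the lowest score."""
    for name, rtt, load in reports:
        if rtt < rtt_ok and load < load_ok:
            return name            # first "good enough" reply wins

    # No server met both thresholds: score every report instead.
    def score(report):
        _, rtt, load = report
        return rtt / 100.0 + 1.0 / (1.0 - load)
    return min(reports, key=score)[0]

# Hypothetical replies, in arrival order.
reports = [("server2", 120.0, 0.30),   # near Acme Central, far from client
           ("server3", 45.0, 0.40),    # under both thresholds
           ("server1", 20.0, 0.60)]    # fast path, but heavily loaded
print(choose_server(reports))          # → server3
```

Because the first acceptable reply wins, the dispatcher never has to wait out the full timeout in the common case, which is exactly the property item 13 credits to the first-server-below-threshold method.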
Questions or comments regarding 6.033? Send e-mail to the TAs at
6.033-tas@mit.edu.
Last updated 1999/04/11 by fubob.