M.I.T. DEPARTMENT OF EECS

6.033 - Computer System Engineering  Handout 26 

6.033 Spring 1999 
Design project 1 
Design space and design issues, version of 4/4/1999, 00:15.

The following notes compiled by Jerry Saltzer explore the design space 
for the first 6.033 design project.  They consist largely of the union 
of the ideas mentioned in the papers for the two sections that he 
graded, plus several suggestions from other members of the teaching 
staff.  Note that not all of these ideas are equally good.  
Occasionally a comment remarks on the value of an idea, but those 
comments don't attempt anything resembling a comprehensive evaluation.

1.  Definition of transparency
    -  User doesn't have to do anything special, but can detect 
       that something is going on
    -  User can't tell that redirection occurred at all.  (This 
       stronger requirement seems unnecessary.)
    
2.  How dynamic does redirection need to be?
    -  once assigned a replica, may use it for multiple visits
    -  each time the home URL is issued, replica choice is reviewed
    -  for each use of any URL in the site, replica choice is reviewed
    This choice interacts with caching and with bookmark behavior and also
    with whether or not the server maintains changeable state.

3.  Design targets:  What scale does your design intend to encompass?  
    Explain the design center, the design minimum, and the design 
    maximum, particularly as to number of servers.

4.  Metric used to choose a server
    -  randomization (distributes load, ignores network service quality)
    -  geographical location (weak proxy for network proximity)
    -  network topology from static maps compared with client IP addr
    -  server properties
        - server up/down
        - server capacity
        - measured server response time to a small HTTP request
        - server port bandwidth
        - server port unused bandwidth
        - current server load 
           -  thread queue length (current or averaged)
           -  cpu idle time (over how long?)
           -  number of open TCP connections (current or averaged?)
        - current server disk channel utilization (how measured?)
        - current server remaining capacity (how measured?)
    -  network properties measured at instant of connection request
        - minimum latency between client and server
        - latency as a proxy for congestion
          - ICMP ping latency from server to client
          - TCP probe latency from server to client
        - average of multiple probes from server to client
        - server to client hop count (traceroute, may be ponderous)
        - from conversation with a nearby router
          (e.g., Cisco Distributed Director uses BGP info)
           -  hop count (not a significant factor in performance)
           -  congestion/quality of service info
        - download rate of first HTTP connection
           - redirect only if below threshold.
           - first page from central server, each graphic on the
             page from a different server, after page is delivered
             the servers get together to decide who did best and
             central redirects the *next* request from that client.
    -  history of observed network properties to the region near 
       the client (a sketch of such a cache follows this list)
         -  region defined by network topology
         -  region defined by domain name
         -  region defined by IP address hierarchy
            (e.g., cache indexed by class-B or class-C subnet address)
         -  observed properties
            -  ICMP ping latency
            -  TCP probe latency        
            -  packet loss rate
    -  political (border crossing rules, etc.)
    -  cost (some ISP's charge servers by gigabyte of network
       usage.)
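
To make the history-cache idea concrete, here is a minimal sketch in
Python.  The class name, field names, and one-hour expiry are
illustrative assumptions, not part of any particular design:

    # Illustrative history cache of observed network properties,
    # indexed by class-C subnet.
    import time

    class SubnetHistory:
        def __init__(self, max_age_s=3600):
            self.entries = {}           # subnet -> (timestamp, properties)
            self.max_age_s = max_age_s  # discard older observations

        def _subnet(self, ip):
            # Treat the first three octets as the class-C subnet.
            return '.'.join(ip.split('.')[:3])

        def record(self, client_ip, rtt_ms, loss_rate):
            self.entries[self._subnet(client_ip)] = (
                time.time(), {'rtt_ms': rtt_ms, 'loss_rate': loss_rate})

        def lookup(self, client_ip):
            entry = self.entries.get(self._subnet(client_ip))
            if entry is None:
                return None
            timestamp, properties = entry
            if time.time() - timestamp > self.max_age_s:
                return None             # too stale to trust
            return properties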
    
5.  Level of centralization/decentralization
    -  all requests go to central site, which dispatches
    -  initial request goes to any server (e.g., with DNS round-robin),
       which asks all the others if they are closer.
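
A sketch of the decentralized scheme, assuming some probe function
(any of the latency measurements of section 6 would do) that asks a
replica to measure its RTT to the client:

    # Sketch: the replica that receives the initial request asks all
    # the others how close they are to the client, then picks the
    # best.  probe(replica, client_ip) is an assumed inter-replica
    # call returning that replica's RTT to the client in ms.
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from concurrent.futures import TimeoutError as FutureTimeout

    def choose_replica(client_ip, replicas, probe, timeout_s=0.5):
        pool = ThreadPoolExecutor(max_workers=len(replicas))
        futures = {pool.submit(probe, r, client_ip): r for r in replicas}
        results = {}
        try:
            for future in as_completed(futures, timeout=timeout_s):
                try:
                    results[futures[future]] = future.result()
                except Exception:
                    pass           # probe failed; skip that replica
        except FutureTimeout:
            pass                   # slow replicas miss the vote
        pool.shutdown(wait=False)  # don't wait for stragglers
        return min(results, key=results.get) if results else None

Note the timeout parameter: as section 13 below points out, choosing
it well is itself a design problem.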

6.  Method of assessing network latency caused by congestion
    -  ping by each replica server to client.  Requires high-level 
       message from central server to each replica server, and 
       decision of how long to wait for response.  (longest response 
       time may be from the server closest to a distant client, but 
       waiting for it delays everyone.)
    -  Use topology information to ask only promising servers to ping
       the client.  (avoids flooding client with pings.)
    -  ping by central site of each client, with loose source route
       via a router near each server (also ping the server itself, so
       its leg of the path can be subtracted out).  Loose source 
       routing is not widely implemented, so this probably doesn't 
       work in practice.
    -  add a ping-forwarding protocol to Acme servers: central site 
       sends an echo-him request to a server, and the server sends an 
       echo request to the client.
    -  ask a nearby router for RIP info
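
Several of the items above need a way to measure an RTT.  ICMP echo
normally requires raw-socket privilege, so one portable stand-in is to
time a TCP connection attempt, since the three-way handshake takes one
round trip.  A sketch:

    # Estimate RTT by timing a TCP connect; a refused connection
    # still measures the round trip, since the RST comes right back.
    import socket, time

    def tcp_probe_rtt_ms(host, port=80, timeout_s=2.0):
        start = time.monotonic()
        try:
            socket.create_connection((host, port),
                                     timeout=timeout_s).close()
        except ConnectionRefusedError:
            pass                   # RST received: one round trip
        except OSError:
            return None            # unreachable or timed out
        return (time.monotonic() - start) * 1000.0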

7.  How are multiple metrics combined
    -  linear weighting (how do you choose weights?  How do you linearly
       combine RTT in milliseconds and server load as a fraction?)
    -  assign first server that responds with load below a chosen
       threshold (e.g. 50%) and RTT below a chosen threshold (e.g. 50 ms.)
    -  use metrics in priority order, going lower only to break ties.
       e.g., 
        1/  server not overloaded
        2/  RTT (tie if within, e.g., 20%)
        3/  server connection speed
        4/  random or round robin
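
A sketch of that priority-order rule, using the example thresholds
above; the Server record and its field names are invented for the
illustration:

    # Pick a server by metrics in priority order, moving to the next
    # metric only to break ties among the survivors.
    from collections import namedtuple
    import random

    Server = namedtuple('Server', 'name load rtt_ms link_mbps')

    def pick_server(candidates, max_load=0.5, rtt_tie=0.20):
        # 1/ prefer servers that are not overloaded
        ok = ([s for s in candidates if s.load <= max_load]
              or list(candidates))
        # 2/ RTTs within rtt_tie (e.g., 20%) of the best are a tie
        best = min(s.rtt_ms for s in ok)
        ok = [s for s in ok if s.rtt_ms <= best * (1 + rtt_tie)]
        # 3/ among those, keep the fastest connections
        fastest = max(s.link_mbps for s in ok)
        ok = [s for s in ok if s.link_mbps == fastest]
        # 4/ break any remaining tie at random
        return random.choice(ok)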

8.  When assessment leading to choice of server is made
    -  after request from client (adds delays)
       -  may post a temporary page saying, "one moment please, while
          we identify the best site to serve you" and playing the 
          Acme theme song.
    -  while central site is handling the first several requests from
       the client.  (requires a cache of recently-assigned clients;
       see the sketch after this list.)
    -  in advance, using a proxy for the client (what proxy?)
    -  load assessment can be done at a different time from network 
       latency assessment.  (e.g., immediately assign client to a
       lightly-loaded server, then reassign if another server later 
       proves it has significantly less congestion on its path to 
       the client.)
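
A sketch of the cache of recently-assigned clients mentioned above,
assuming a simple time-to-live after which the assessment is repeated:

    # Remember which replica a client was recently sent to, so the
    # central site can skip re-assessment on its next few requests.
    import time

    class AssignmentCache:
        def __init__(self, ttl_s=600):
            self.ttl_s = ttl_s
            self.assignments = {}  # client IP -> (replica, expiry)

        def assign(self, client_ip, replica):
            self.assignments[client_ip] = (replica,
                                           time.time() + self.ttl_s)

        def lookup(self, client_ip):
            entry = self.assignments.get(client_ip)
            if entry is None:
                return None
            replica, expiry = entry
            if time.time() > expiry:
                del self.assignments[client_ip]
                return None        # stale; assess this client again
            return replica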

9.  Forwarding mechanism
    -  DNS
         -  standard DNS, client chooses from multiple name records
            (see client-side designs, below)
         -  Acme DNS servers choose, set TTL to zero to avoid 
            caching
            (N.B., works well for randomization, round-robin, or
            even server load-based distribution, but it is very 
            difficult to base the decision on client latency...
            -  the DNS server doesn't know who the real client is,
               because many DNS requests are forwarded by intermediate 
               recursive DNS servers.
            -  client timeouts may require fast action, and
               probably mean that there will be duplicate requests 
               from the same client via different paths.
            -  most companies don't run their own DNS servers.) 
    -  HTTP redirect (a minimal redirector sketch follows this list)
         -  via host name (requires another DNS lookup)
         -  put IP address in URL (avoids second DNS lookup)
         -  permanent versus temporary redirect 
            (affects bookmarks, partial URL's and perhaps POST)
    -  Use DNS for a first-cut distribution that is not based on 
       latency; then use HTTP redirect for improvement if the initial 
       server's latency is too high.
    -  Base element redirect
         -  Central site responds with initial page, but inserts a 
            BASE element to do the redirection
         -  all internal links are partial URL's
    -  link insertion redirect:  Central site responds with the 
       initial page, but has dynamically constructed it so that all 
       links--even the graphics on the initial page--are to a better 
       replica.
    -  IP Tunnel: Central site maintains an open TCP connection to 
       every replica, and immediately tunnels the initial request to 
       the preferred replica
          -  replica thinks it is at the central IP address and
             tunnels its response back through the central
             site and thence to the client.  All requests go
             through central site
          -  replica tunnels initial response back (or tries to 
             IP-spoof its way back directly.)  The page returned by 
             the replica contains links with absolute URL's that name 
             the replica site, so future requests will go directly to 
             the replica.
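
Here is a minimal sketch of the HTTP-redirect mechanism, using
Python's standard library.  choose_replica() stands in for whichever
metric of sections 4-7 the design adopts, and its return value is a
placeholder.  Putting the replica's IP address in the Location URL
avoids the second DNS lookup noted above; 302 rather than 301 keeps
bookmarks pointing at the central site:

    # Minimal central-site redirector (sketch).
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def choose_replica(client_ip):
        return '10.0.0.1'          # placeholder for the real choice

    class Redirector(BaseHTTPRequestHandler):
        def do_GET(self):
            replica_ip = choose_replica(self.client_address[0])
            self.send_response(302)    # temporarily moved
            self.send_header('Location',
                             'http://%s%s' % (replica_ip, self.path))
            self.end_headers()

    if __name__ == '__main__':
        HTTPServer(('', 8080), Redirector).serve_forever()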
       
10.  Client-side designs
    -  what the upgraded client does
       -  DNS sends list of servers, client chooses best.
       -  Web server sends list of servers, client chooses best
       -  client pings each server
       -  client makes end-to-end request (e.g., HTTP HEAD) to include
          server load in RTT measurement.
       -  send all probes in parallel, use first server that responds
          (see the sketch after this list).
    -  method of upgrading the client
       -  initial server downloads a Java applet
       -  wait for the next releases of Internet Explorer and Netscape
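
A sketch of the parallel-probe idea referenced above: the client opens
a TCP connection to every replica at once and uses whichever answers
first, an end-to-end measure that folds server load into the race:

    # Race all replicas; the first to complete a TCP handshake wins.
    import socket, threading, queue

    def first_responder(replicas, port=80, timeout_s=2.0):
        winners = queue.Queue()
        def try_one(host):
            try:
                socket.create_connection((host, port),
                                         timeout=timeout_s).close()
                winners.put(host)
            except OSError:
                pass               # this replica loses the race
        for host in replicas:
            threading.Thread(target=try_one, args=(host,),
                             daemon=True).start()
        try:
            return winners.get(timeout=timeout_s)
        except queue.Empty:
            return None            # nobody answered in time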
    
11.  In addition to the above fairly central issues, a good
paper should give some answer to each of the more subtle
questions in the design project handout:
    -  how hard is it to add a replica?
    -  what if everyone used your design?
    -  where is the bottleneck in your design?
    -  how do you decide when and where to add replicas?
    -  can anyone (e.g., AOL) create a replica?

12.  Response to failure, e.g., central site crashes, replica crashes.
    -  (not evaluated, since we haven't come to this topic yet.)

13.  How well does the proposed latency-measuring algorithm deal with
the following case:

                                                x
         Acme central---------------------------x----------server 1
                   \                            x           /
                    \---------------server 2----x--client--/
                                                x
                                               bad 
                                             congestion
                                                  
The challenging problem here is that server 1 will measure a short
delay to the client, but its report may take a very long time to get
back to Acme Central--maybe long enough to exceed Acme Central's
timeout.  Server 2 will measure a long delay to the client, but its
report will get back to Acme Central very quickly.  A short timeout
may result in server 2 getting the job, while a long timeout will slow
things down for the case where the client is on the near side of the
congestion.  It appears that a better way to identify this case is for
Acme Central to also separately ping the client, and use some small
multiple of that RTT as its timeout.  (Note that client-based models
handle this problem very nicely, as does the
first-server-below-threshold method.)
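
A one-line version of that timeout rule, with the multiple and the
floor chosen arbitrarily for illustration:

    # Wait for replica reports only a small multiple of the RTT that
    # Acme Central itself measured to the client.
    def report_timeout_s(central_rtt_ms, multiple=3, floor_ms=50):
        return max(central_rtt_ms * multiple, floor_ms) / 1000.0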

14.  Some naming issues that come up, even though this topic does not
arise in the course until after the design project was due.

    -  What goes in mirror-site hyperlinks that point within the site?
       (see the resolution sketch after this list)
         -  HREF = "http://www.acme.com/filename"
         -  HREF = "http://mirror1.acme.com/filename"
         -  HREF = "filename"
    -  How does the previous decision interact with the decision to use 
         -  301 permanently moved
         -  302 temporarily moved
    -  How do those two decisions interact with requirement to 
       get user confirmation on forwarded POST operations?
    -  How do those two decisions interact with bookmarks and, more
       generally, the intended dynamics of redirection?
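
The interaction between the first two decisions can be seen by
resolving each HREF form against the page the client actually
received.  A short illustration, using the URLs from the list:

    # A relative HREF resolves against whichever host served the
    # page, so it keeps the client on the mirror; an absolute HREF
    # sends the client back to the named host.
    from urllib.parse import urljoin

    page = 'http://mirror1.acme.com/index.html'  # after redirection
    print(urljoin(page, 'filename'))
    # -> http://mirror1.acme.com/filename   (stays on the mirror)
    print(urljoin(page, 'http://www.acme.com/filename'))
    # -> http://www.acme.com/filename       (goes back to central)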
       
15.  Subtleties
    -  Include a note and a link in the entity-body field of the 
       initial response, for old browsers that don't know about
       HTTP redirection.
    -  central site should occasionally ping the servers and keep a
       running average of the response time, to use in determining
       the timeout when making future inquiries about load, etc.
       (see the sketch after this list)
    -  Decision of where more server capacity is needed is muddled by
       the efforts of the system to distribute the load to
       the least-loaded servers.
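
A sketch of that running average, kept as an exponentially weighted
moving average; the weight 0.2 is an arbitrary illustrative choice:

    # Exponentially weighted moving average of a server's response
    # time, for setting the timeout on future inquiries.
    class RunningAverage:
        def __init__(self, alpha=0.2):
            self.alpha = alpha     # weight given to the newest sample
            self.value = None

        def update(self, sample_ms):
            if self.value is None:
                self.value = sample_ms
            else:
                self.value = (self.alpha * sample_ms +
                              (1 - self.alpha) * self.value)
            return self.value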
       
16.  Abstract of one good design; there are many others that are 
equally good, depending on the desiderata chosen:

Up to ten primary web servers have the same DNS name, with DNS
round robin used to choose among them.  When an HTTP request
arrives at any server it pings the client and it sends a request
to each of the other nine servers (each server must also have a
distinct DNS name) asking them to ping the client and report back
the RTT and their own processor utilization, averaged over the
last 2 minutes.   The first server that reports an RTT under 50
ms and an average load under 50% gets the job, with an HTTP 301
permanently moved response, using the server name.  If no server
is that good, redirect to the best available server, with
(RTT/100 ms.) and (1/(1-load)) equally weighted.  If more than
ten servers are needed, additional servers are clustered around
one of the primary servers and a local dispatcher distributes the
load.  Each primary web server maintains a cache of recently
served clients, and if a request comes in that hits in the cache,
it omits server polling and serves the client immediately unless
its own utilization is running above a high threshold, say 95%.
To keep clients returning to the designated site, all within-site
URL's are partial rather than absolute.
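
The selection rule in this abstract might be coded roughly as follows
(a sketch; the report ordering and the saturation guard are
assumptions):

    # First server reporting RTT under 50 ms and load under 50% wins;
    # otherwise take the lowest equally weighted combined score.
    def score(rtt_ms, load):
        # (RTT / 100 ms) + 1/(1 - load); the guard keeps a server
        # reporting 100% load from dividing by zero.
        return rtt_ms / 100.0 + 1.0 / max(1.0 - load, 0.01)

    def choose(reports):
        # reports: (name, rtt_ms, load) tuples in order of arrival;
        # assumes at least one report arrived before the timeout.
        for name, rtt_ms, load in reports:
            if rtt_ms < 50.0 and load < 0.5:
                return name        # first good-enough server wins
        return min(reports, key=lambda r: score(r[1], r[2]))[0]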

