6.02 - Reliable Transport
Lecture #22
Katrina LaCurts, lacurts@mit.edu

======================================================================
Introduction
======================================================================

This is now our current model of a network:

   [ Reliable transport         ]
   [ Packets/Addressing/Routing ]
   [ Bits                       ]
   [ Physical layer: signals    ]

So far, we've filled in the bottom three layers. Today we're going to
look at the top one. We'll answer the question of how to send data
reliably in a best-effort packet-switched network where packets can be
delayed, re-ordered, or dropped (and even duplicated!).

But first, let's think about why we need reliable transport at all.
After all, the routing protocols we discussed last week can eventually
converge even when routing advertisements are lost. What part of the
network actually *needs* to communicate reliably?

In reality, networks aren't used just by switches. They're used by the
end points that are initiating the communications. Presumably, the
communication is part of some application running on the machines at
the endpoints. For instance, you might be watching Netflix on your
laptop. The Netflix application (website) on your machine is receiving
packets from one of Netflix's servers; these packets contain the
underlying data for whatever movie you're watching. If you were to
"view" the packets out of order, the movie would not make much sense!
So, we need to add another layer to our model:

   [ Application                ]
   [ Reliable transport         ]
   [ Packets/Addressing/Routing ]
   [ Bits                       ]
   [ Physical layer: signals    ]

Using this layering, we can make the following model. We have a
sending and receiving application at two endpoints:

   -----------------------                 -------------------------
   | sending application |  ===network===> | receiving application |
   -----------------------                 -------------------------

To allow them to reliably send data across the network, we will add a
reliable sender/receiver "underneath" the application:

   -----------------------                 -------------------------
   | sending application |                 | receiving application |
   |          |          |                 |          ^            |
   |          v          |                 |          |            |
   |   reliable sender   |  ===network===> |   reliable receiver   |
   -----------------------                 -------------------------

The sending application will send its data packets to the reliable
sender (both of these entities exist on the same machine; we don't
need to worry about how packets get between the application and
sender). That reliable sender will then send them across the network,
reliably, to the reliable receiver. The reliable receiver will take
care of transferring the data packets *in order* to the receiving
application.

We'll say that the reliable sender and receiver implement a "reliable
transport protocol". In such a protocol, the receiving application
receives all data bytes in exactly the same order that the sending
application sent them, and each data byte is delivered exactly once.

The problems we'll need to solve to get this to work are:
- Packets may be lost
- Packets may be re-ordered in the network
- Packet delays are variable (due to queueing)

Aside: If you are taking 6.02 late in your college career
(particularly if you're taking it after 6.033, even though you
shouldn't be!), or if you have some prior networking experience, you
may already be familiar with a reliable transport protocol called TCP.
The protocols that we study in 6.02 use the same underlying concepts
as TCP, but differ in a few details. Be careful not to mix the two up!

======================================================================
Stop-and-wait
======================================================================

----------------------------------------------------------------------
Sequence numbers and acknowledgments
----------------------------------------------------------------------

At a high level, any reliable transport protocol uses the notions of
sequence numbers and acknowledgments.

Sequence numbers: Each packet in a connection includes a sequence
number, stored in its header. These sequence numbers start at 1, and
increase incrementally for each packet in the connection (i.e., the
first packet has sequence number 1, the second has sequence number 2,
...).

Acknowledgments: When the reliable receiver receives a packet, it will
acknowledge the receipt of the packet by sending an acknowledgment, or
ACK, back to the sender. ACKs are just a special type of packet. An
ACK packet contains the sequence number of the packet it is
acknowledging, so the sender will be able to differentiate, e.g., an
ACK for packet 1 from an ACK for packet 2.

Already, you can see that our model needs to be slightly different:
the reliable sender and receiver will be communicating in both
directions, with data packets flowing from the sender to the receiver
and ACK packets from the receiver to the sender.

   -----------------------                 -------------------------
   | sending application |                 | receiving application |
   |          |          |                 |          ^            |
   |          v          |                 |          |            |
   |   reliable sender   | <===network===> |   reliable receiver   |
   -----------------------                 -------------------------

Aside: In a network, there is nothing preventing an end point from
being both a sender and receiver, transmitting both data packets and
ACKs. We are just starting out with a simpler model where data flows
in one direction.
Here's our first attempt at a reliable transport protocol:

At the sender:
- Send one packet, and keep track of its sequence number.
- Once an ACK is received for that packet, delete the stored sequence
  number, and send a new packet (using the same strategy of saving its
  sequence number and waiting for an ACK).

At the receiver:
- Upon receipt of packet k, send an ACK for packet k and deliver the
  packet to the application.

Under normal operation, we'll see the following sequence of events
(remember that the receiver and receiving application exist on the
same machine):

   Sender                   Receiver       Receiving application
     | ------ seq 1 -------> |                     |
     | <----- ack 1 -------- | ---- seq 1 --->     |
     | ------ seq 2 -------> |                     |
     | <----- ack 2 -------- | ---- seq 2 --->     |
     | ------ seq 3 -------> |                     |
     | <----- ack 3 -------- | ---- seq 3 --->     |
   etc.

This is looking good! All of the packets are being delivered in order
to the receiving application.

----------------------------------------------------------------------
Problem 1: Handling loss
----------------------------------------------------------------------

However, let's imagine that a packet is lost:

   Sender                   Receiver       Receiving application
     | ------ seq 1 -------> |                     |
     | <----- ack 1 -------- | ---- seq 1 --->     |
     | ------ seq 2 ---x     |                     |
     |                       |                     |

We can see immediately that we have a problem. The receiver didn't
receive packet #2, and so didn't send an ACK for it. The sender hasn't
yet received an ACK for packet #2, and so can't proceed to packet #3.
This situation is an example of deadlock: both sides are waiting for
something to happen, and neither can proceed.

What's a solution? One idea is to have the sender retransmit the
packet after some fixed timeout.
Our protocol now becomes:

At the sender:
- Send one packet, and keep track of its sequence number and
  transmission time.
- Once an ACK is received for that packet, delete the stored sequence
  number, and send a new packet (using the same strategy of saving its
  sequence number and waiting for an ACK).
- If an ACK hasn't been received after timeout seconds since the
  packet's transmission time, retransmit it to the receiver.

At the receiver:
- Upon receipt of packet k, send an ACK for packet k and deliver the
  packet to the application.

Note now that the sender is keeping track of the outstanding packet's
sequence number *and* transmission time. There is also the addition of
a timeout value; for now, assume it's fixed.

This protocol does in fact solve our earlier problem:

            Sender                   Receiver       Receiving app
              | ------ seq 1 -------> |                   |
              | <----- ack 1 -------- | ---- seq 1 --->   |
          -   | ------ seq 2 ---x     |                   |
 timeout  |   |                       |                   |
 seconds  |   |                       |                   |
          -   | ------ seq 2 -------> |                   |
              | <----- ack 2 -------- | ---- seq 2 --->   |
              | ------ seq 3 -------> |                   |
              | <----- ack 3 -------- | ---- seq 3 --->   |

But now we've introduced an additional problem. Let's imagine that
instead of a packet being lost, an ACK is lost:

            Sender                   Receiver       Receiving app
              | ------ seq 1 -------> |                   |
              | <----- ack 1 -------- | ---- seq 1 --->   |
          -   | ------ seq 2 -------> | ---- seq 2 --->   |
 timeout  |   |     x-- ack 2 ------- |                   |
 seconds  |   |                       |                   |
          -   | ------ seq 2 -------> |                   |
              | <----- ack 2 -------- | ---- seq 2 --->   |
              | ------ seq 3 -------> |                   |
              | <----- ack 3 -------- | ---- seq 3 --->   |

Since the sender never received an ACK for packet 2, it retransmitted
it. The receiving application now receives two copies of the same
packet! We wanted the receiving application to receive exactly one
copy of each packet; currently it's getting *at least* one.
----------------------------------------------------------------------
Problem 2: Exactly-once delivery
----------------------------------------------------------------------

To solve this problem, we need to change our protocol again, this time
at the receiver:

At the sender:
- Send one packet, and keep track of its sequence number and
  transmission time.
- Once an ACK is received for that packet, delete the stored sequence
  number, and send a new packet (using the same strategy of saving its
  sequence number and waiting for an ACK).
- If an ACK hasn't been received after timeout seconds since the
  packet's transmission time, retransmit it to the receiver.

At the receiver:
- Upon receipt of packet k, send an ACK for packet k.
- If k is greater than the last sequence number we received (or if we
  haven't received any packets yet), then deliver the packet to the
  application and keep track of k.

By keeping track of the last sequence number received, it's easy to
recognize duplicates. Our previous problem has been solved:

            Sender                   Receiver       Receiving app
              | ------ seq 1 -------> |                   |
              | <----- ack 1 -------- | ---- seq 1 --->   |
          -   | ------ seq 2 -------> | ---- seq 2 --->   |
 timeout  |   |     x-- ack 2 ------- |                   |
 seconds  |   |                       |                   |
          -   | ------ seq 2 -------> |                   |
              | <----- ack 2 -------- |  <- no delivery   |
              | ------ seq 3 -------> |                   |
              | <----- ack 3 -------- | ---- seq 3 --->   |

This final form of the protocol we've been developing is known as the
stop-and-wait protocol. It provides exactly-once semantics in a
best-effort network.

======================================================================
Problem 3: Choosing timeouts in stop-and-wait
======================================================================

There is one scenario we haven't studied.
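The full stop-and-wait protocol above can be sketched as a small
simulation. This is not the 6.02 course code; it's a minimal sketch
where the "network" is simulated by sets that name which transmissions
get lost (so the sender's timeout/retransmit loop becomes a plain
`while` loop), and the function and parameter names are my own.

```python
def stop_and_wait(data, drop_data=frozenset(), drop_acks=frozenset()):
    """Deliver `data` reliably over a simulated lossy channel.

    drop_data / drop_acks contain (seq, attempt) pairs naming which
    data packets and ACKs the simulated network loses.
    """
    delivered = []   # what the receiving application sees
    last_seq = 0     # receiver state: last sequence number delivered
    for seq, payload in enumerate(data, start=1):
        attempt = 0
        while True:
            attempt += 1
            # --- sender transmits packet (seq, payload) ---
            if (seq, attempt) in drop_data:
                continue        # packet lost: timeout fires, retransmit
            # --- receiver gets the packet ---
            if seq > last_seq:  # new packet: deliver exactly once
                delivered.append(payload)
                last_seq = seq
            # the receiver ACKs every copy, even a duplicate
            if (seq, attempt) in drop_acks:
                continue        # ACK lost: sender times out, retransmits
            break               # ACK received: move on to the next packet
    return delivered
```

Losing the first copy of packet 2 and the first ACK for packet 3, for
example, still yields exactly-once, in-order delivery: the duplicate
copy of packet 3 is re-ACKed but not re-delivered.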
Imagine that no packets in our networks are lost, but instead there
are very long delays:

         S                          R                  A
       - | --- seq 1 ---\           |                  |
 timeout |               \          |  (packet still   |
 seconds |                \         |   in flight)     |
       - | --- seq 1 ---\  \------> | ---- seq 1 --->  |
         |               \          |                  |
         | <--- ack 1 ----\-------- |                  |
         |                 \------> |                  |
         | <--- ack 1 ------------- |                  |

Here, nothing went wrong, per se -- the application is receiving
packets exactly once and in order -- but packet 1 was retransmitted
when it didn't need to be. This is known as a spurious retransmission,
and it adds unnecessary congestion to the network, which is bad. Our
timeout was too short; let's make it longer.

         S                          R                  A
       - | ------ seq 1 ----------> | ---- seq 1 --->  |
       | |                          |                  |
 timeout | <----- ack 1 ----------- |                  |
 seconds |                          |                  |
       - |                          |                  |
         | ------ seq 2 ----------> | ---- seq 2 --->  |
         | <----- ack 2 ----------- |                  |

Seems okay. Now what happens when a packet is lost?

         S                          R                  A
       - | ------ seq 1 ---x        |                  |
       | |                          |                  |
 timeout |    (a very long wait,    |                  |
 seconds |     doing nothing)       |                  |
       | |                          |                  |
       - | ------ seq 1 ----------> | ---- seq 1 --->  |
         |  etc.                    |                  |

Boy. We sure had to wait a long time to retransmit that packet. Much
longer than was necessary, even.

What you should see from this is that our timeout value should be
close to the delay of the network. If it's too short, we retransmit
unnecessarily, but if it's too long, performance suffers:

         S                          R                  A
       - | ------ seq 1 ----------> | ---- seq 1 --->  |   timeout should
       | |                          |                  |   be at least this
       | | <----- ack 1 ----------- |                  |   long, and not
       - |                          |                  |   much longer

(In general, we'd prefer a timeout that was slightly too long to one
that was slightly too short. Too short means spurious retransmissions,
which add congestion; too long means a slight performance hit.)

This problem seems easy enough to solve; we'll just need to measure
the delay in the network before we set our timeout value, and make the
value slightly larger.
Another term for this end-to-end delay is the round-trip time, or RTT:
the time it takes a packet to travel one "round trip" in the network.

How might we measure the RTT? We can measure the time it takes from
sending a packet to receiving an ACK for it by storing the time at
which we sent the packet, and comparing that to the time at which we
received the ACK. A more elegant way is to store the current time in
the packet header, and have this time echoed back to the sender in the
ACK.

----------------------------------------------------------------------
Handling variable RTTs
----------------------------------------------------------------------

Is setting the timeout really so easy? Of course not! You should know
from our study of queues that queueing delay is *variable*. It depends
on the number of other packets in the queue, which depends on the
number of other connections in the network, which is constantly
changing. That means the RTT is going to vary. For instance:

         S                          R
       - | ------ packet ---------> |
 RTT 1 | |                          |
       - | <------ ack ------------ |
       - | ------ packet ---------> |
 RTT 2 | |                          |
       - | <------ ack ------------ |

If we keep making these measurements, we're going to end up with a
distribution of RTTs. So now our goal is to estimate a timeout from
this distribution of RTTs. We do not know what model the distribution
fits (e.g., we don't know that it's Gaussian).

To start, let's use the mean of our RTT distribution. I.e., when we
receive the nth RTT sample, do the following:

   n_rtts = n_rtts + 1
   srtt[n] = (rtt_sample + (srtt[n-1] * (n_rtts - 1))) / n_rtts
   timeout = srtt[n]

(Here, I'm using the variable n_rtts to keep track of the number of
RTT samples and srtt to keep track of the average RTT value; the s
stands for "smoothed".)

Though this was a reasonable idea, there are two problems. One deals
with RTTs varying over long timescales, the other with RTTs varying
over short timescales.
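The running-mean update above can be sketched in a few lines of
Python. The function name `update_mean` is mine; the variable names
follow the notes.

```python
def update_mean(srtt, n_rtts, rtt_sample):
    """Fold one new RTT sample into the running average."""
    n_rtts += 1
    srtt = (rtt_sample + srtt * (n_rtts - 1)) / n_rtts
    return srtt, n_rtts

# After processing samples of 1, 2, and 3 seconds, srtt is their
# plain mean: 2.0 seconds.
srtt, n_rtts = 0.0, 0
for sample in [1.0, 2.0, 3.0]:
    srtt, n_rtts = update_mean(srtt, n_rtts, sample)
```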
----------------------------------------------------------------------
Problem 1: RTTs can change over time
----------------------------------------------------------------------

First, consider a network where there has been very little delay for
the past hour, and the RTT samples have been nearly constant at (say)
one second. Thus:

   srtt[n]    = 1 sec
   timeout[n] = 1 sec
   n_rtts     = large

Now, imagine that one node in the network starts sending a lot more
traffic, so that the RTT samples are now consistently around two
seconds. Since we've been collecting data for so long, it will take
our srtt estimate a very long time to reach the RTT that the network
is currently experiencing. In this particular case, the timeout is too
short; we will be spuriously retransmitting packets for a long time,
which will add to the congestion that is already happening. So in
fact, we'd like to react fairly quickly to these changes.

One solution might be to throw out some of the data and average over
just the last x seconds of data. In practice, we do something a bit
more sophisticated: we use an exponentially weighted moving average
(EWMA):

   srtt[n] = alpha * rtt_sample + (1 - alpha) * srtt[n-1]

Notice, though, that even though we want to react quickly to
*persistent* changes, such as the one in the example above, we don't
want to react quickly when there is an outlier in the data. An EWMA
lets us control both. As alpha varies, the influence of the most
recent rtt_sample changes; the higher alpha, the more influential it
is. In networks prone to congestion, we typically use
.1 <= alpha <= .25. This allows our estimate to react relatively
quickly to changes in the RTT without being influenced very heavily by
a single spurious RTT sample.

Aside: You should recognize this EWMA as an LTI filter whose input is
the rtt_sample stream and whose output is the smoothed (i.e., lowpass
filtered) srtt stream.
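The two behaviors described above (reacting quickly to persistent
shifts, shrugging off single outliers) can be seen directly by running
the EWMA update. This sketch assumes alpha = 0.125, a value inside the
.1 to .25 range the notes suggest; the variable names are mine.

```python
ALPHA = 0.125  # assumed weight within the range suggested in the notes

def ewma(srtt, rtt_sample, alpha=ALPHA):
    """One step of srtt[n] = alpha*rtt_sample + (1-alpha)*srtt[n-1]."""
    return alpha * rtt_sample + (1 - alpha) * srtt

# A persistent shift from 1s to 2s pulls the estimate close to 2s
# within a couple dozen samples...
srtt = 1.0
for _ in range(20):
    srtt = ewma(srtt, 2.0)
after_shift = srtt            # roughly 1.93

# ...while a single 10s outlier only nudges the estimate upward.
after_outlier = ewma(after_shift, 10.0)   # roughly 2.94, not 10
```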
The unit sample response of this filter is

   h[n] = (1 - alpha)^(n-1) * alpha * u[n-1],

hence the name exponentially weighted moving average. (You can compute
the frequency response yourself to verify that this is a lowpass
filter whose approximate cutoff decreases as alpha decreases.)

----------------------------------------------------------------------
Problem 2: Non-zero standard deviation
----------------------------------------------------------------------

Now that we've solved the problem of reacting to changes in the RTTs,
our algorithm has become:

   srtt[n] = alpha * rtt_sample + (1 - alpha) * srtt[n-1]
   timeout = srtt[n]

The second problem we want to deal with is the fact that it's very
rare for packets to experience an RTT that is exactly equal to the
mean, mu. For instance, remember our Gaussian distributions: half of
the values lie above the mean! In our context, since our timeout is
set as (roughly) the mean of the distribution, any packet with an RTT
greater than the mean will cause a spurious retransmission. If the
RTTs are distributed in a Gaussian, that would be roughly half of the
samples!

Mathematical aside: We don't know the underlying RTT distribution.
Despite that, there is actually a theorem that can help us estimate
how likely it is that an individual sample will be far away from the
mean: Chebyshev's Inequality. Chebyshev's Inequality tells us this: if
X is a random variable with a finite mean mu and a finite non-zero
variance sigma^2, then for any real number k > 0:

   P(|X - mu| >= k*sigma) <= 1/k^2

We can use this theorem to make statements about how many values might
be outside of a particular range around the mean regardless of the
underlying distribution (as long as the mean is finite and the
variance is finite and non-zero). This theorem can be very
conservative: in the case of a Gaussian, for example, the actual
probability is *much* smaller than 1/k^2.
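A quick numeric check makes the conservatism concrete. For k = 4 the
Chebyshev bound is 1/16 = 6.25%, while for a Gaussian the exact
two-sided tail P(|X - mu| >= k*sigma) = erfc(k / sqrt(2)) is several
orders of magnitude smaller.

```python
import math

k = 4
chebyshev_bound = 1 / k**2                   # holds for any finite-variance X
gaussian_tail = math.erfc(k / math.sqrt(2))  # exact for a Gaussian: ~6e-5
```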
The virtue of Chebyshev's Inequality is that it holds for all
distributions, which is relevant for us, since our RTT distribution is
unknown.

Naturally, this means that we need an estimate of how the RTTs are
varying. Just like our estimate of the mean, we're going to use an
EWMA for this estimate. In practice, we estimate the mean absolute
deviation because it's easy to compute; estimating the standard
deviation would also serve our purpose.

   srtt[n] = alpha * rtt_sample + (1 - alpha) * srtt[n-1]
   dev_sample = |rtt_sample - srtt[n]|
   srttdev[n] = beta * dev_sample + (1 - beta) * srttdev[n-1]
   timeout = srtt[n] + k*srttdev[n]

(The absolute deviation of a sample is just the absolute value of its
difference from the mean.)

So in the end we've set the timeout to be the estimated mean plus some
constant times the estimated deviation. In practice, k = 4;
Chebyshev's Inequality tells us that no more than 1/16 = 6.25% of RTT
samples should be above this timeout value (and possibly far fewer).

======================================================================
Analysis of stop-and-wait
======================================================================

All right. So finally, here we are, with a correctly working
stop-and-wait protocol and a very fancy way to estimate an appropriate
timeout value. What kind of performance are we getting with this
protocol?

Let T = the expected time (in seconds) between successful deliveries
of packets. If N data packets are sent (N large), then the time to
send them will be N*T, and so the throughput will be
N packets / (N*T seconds) = 1/T packets per second. We need to find T.

We can't just assume that T is the round-trip time, because packets
get lost. In fact, based on our previous discussion, we can't even
assume the RTT is constant. But just hold off on that problem for now,
and let "RTT" represent the round-trip time.

We'll add a new variable, L, which represents the probability that a
packet *or its ACK* is lost.
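The four update lines above can be collected into one small estimator.
This is a sketch, not the course's code: the class name is mine, and
the constants alpha = 0.125 and beta = 0.25 are assumed values
consistent with the ranges discussed in the notes (k = 4 is from the
notes).

```python
ALPHA = 0.125   # assumed EWMA weight for the mean
BETA = 0.25     # assumed EWMA weight for the deviation
K = 4           # from the notes: timeout = srtt + 4*srttdev

class TimeoutEstimator:
    """Track smoothed RTT and mean absolute deviation; derive a timeout."""

    def __init__(self, first_sample):
        self.srtt = first_sample
        self.srttdev = 0.0

    def update(self, rtt_sample):
        # srtt[n] = alpha*rtt_sample + (1-alpha)*srtt[n-1]
        self.srtt = ALPHA * rtt_sample + (1 - ALPHA) * self.srtt
        # dev_sample = |rtt_sample - srtt[n]|  (deviation from new mean)
        dev_sample = abs(rtt_sample - self.srtt)
        # srttdev[n] = beta*dev_sample + (1-beta)*srttdev[n-1]
        self.srttdev = BETA * dev_sample + (1 - BETA) * self.srttdev
        return self.timeout()

    def timeout(self):
        return self.srtt + K * self.srttdev

# With perfectly constant 100 ms samples the deviation stays zero and
# the timeout sits at the RTT; a jittery sample immediately widens it.
est = TimeoutEstimator(0.1)
for sample in [0.1, 0.1, 0.1]:
    est.update(sample)
steady = est.timeout()     # still 0.1: no variation observed
est.update(0.3)            # one jittery sample
widened = est.timeout()    # now well above srtt
```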
Aside: We can calculate L if we're given a per-link loss probability
p. If the packet and its ACK travel a total of k hops in the round
trip, the probability that neither the packet nor its ACK is lost is
(1-p)^k, and so the probability that either is lost is

   L = 1 - (1-p)^k

Now:

   T = P(packet is transmitted successfully)
         * (delay of packet when it's transmitted successfully)
     + P(packet is not transmitted successfully)
         * (expected delay of packet when it's not transmitted
            successfully)

     = (1 - L) * RTT + L * (timeout + T)

Why is there a T in the second term? Because once a timeout has
happened, it will take an expected time of T more seconds for the
retransmission to succeed. Solving for T, we get:

   T = RTT + (L/(1-L)) * timeout

Remember that 1/T represents throughput. We'd like to figure out how
much throughput we could possibly get in this protocol. The best case
happens when the RTT is constant, and so our timeout is close to this
value. (Because we're just using this model to estimate throughput,
it's okay to sidestep issues like variable RTTs.)

   T = RTT + (L/(1-L)) * RTT
     = RTT/(1-L)

And so the throughput, 1/T, is (1-L)/RTT.

Well, so what? Let's say we have a bottleneck link that can support
100 packets per second, and the RTT is 100 milliseconds. Using
stop-and-wait, the maximum throughput (which occurs when L = 0) is 10
packets per second. That's 10% utilization. This is terrible! We
solved the problem of reliable transport, but we completely destroyed
performance. Next week we'll figure out how to have both.

Additionally, you should get comfortable with the type of modeling I
just did to figure out the maximum throughput of stop-and-wait.
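The best-case model above is a one-liner to check numerically. This
sketch (the function name is mine) reproduces the lecture's example:
RTT = 100 ms and L = 0 gives 10 packets per second, a tenth of the 100
packet-per-second bottleneck.

```python
def stop_and_wait_throughput(rtt, loss):
    """Best-case packets/second when timeout ~= RTT.

    T = RTT + (L/(1-L))*RTT = RTT/(1-L), and throughput = 1/T.
    """
    expected_time = rtt / (1 - loss)   # T, seconds per delivered packet
    return 1 / expected_time           # = (1-L)/RTT

rate = stop_and_wait_throughput(rtt=0.1, loss=0.0)   # the lecture's example
```

Note how gently loss degrades this model (halving throughput only at
L = 0.5): the dominant cost of stop-and-wait is the idle round trip
per packet, not retransmission.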