6.02 - Congestion Control
Lecture #24
Katrina LaCurts, lacurts@mit.edu

======================================================================
The Theme of this Lecture
======================================================================

Today's lecture is about a fairly advanced networking topic known as
congestion control. My goal in this lecture is *not* to turn you into
congestion-control experts. I'd like you to learn something about
congestion control, but more importantly, I'd like you to start
thinking about networking from a distributed-systems perspective.
This means that I'd like you to start considering how the protocols
we've discussed so far work (or don't work) when there are millions
of users, each with different demands and requirements, who are
joining and leaving the network constantly.

To get at this broader perspective, we will focus on one single
question: How can a single reliable sender, using a sliding-window
protocol, set its window size to maximize utilization, given that
there could be many other (unknown) connections using the network?

======================================================================
Introduction
======================================================================

Let's back up. Last time, we discussed the sliding-window protocol.
The throughput that this protocol achieves depends on the window
size, W. Through some analysis, we found that we want W to be equal
to (or slightly larger than) the bandwidth-delay product:
W = B * RTT_min. And in the previous lecture, we assumed that W was
fixed and known at the sender.

----------------------------------------------------------------------
A fixed, unknown window size
----------------------------------------------------------------------

Let's first imagine that the sender *doesn't* know the
bandwidth-delay product of the path it's using. Its goal is to
determine what the bandwidth-delay product is and then set its window
size accordingly. For this section of the notes, I'm going to refer
to the bandwidth-delay product as the "correct" window size; our
sender's goal is to determine the correct window size for this path.

One thing the sender might do is simply try different window sizes.
Different choices for W will result in different throughputs: if we
plot throughput against W, throughput rises as W increases, and then
levels off at B once W reaches the bandwidth-delay product. Is there
a way to determine where we are on this graph? Yes. I'll break it
down into two cases:

1. W < B*RTT_min. When W < B*RTT_min, the throughput will be less
   than B packets per second. So if W < B*RTT_min and we increase W
   slightly (for now, let's say by 1), we should see an increase in
   throughput.

2. W > B*RTT_min. When W > B*RTT_min, the throughput will be B
   packets per second. Increasing W by 1 will still result in B
   packets per second, so throughput won't change. In fact,
   decreasing W by 1 will (likely) not change the throughput either,
   though if we decrease W enough, we will fall back on the "left
   side" of the graph.

So an algorithm for setting W might be:

- Set W to an initial value of 1. Calculate throughput with this W
  value (call it tput_1).
- Increase W slightly. Calculate throughput with this W (call it
  tput_2).
- Keep increasing W slightly until the throughput stops increasing
  (tput_2 = tput_1), which means we've hit the correct window size.

This algorithm has some issues, which I'll come to later, but let's
stick with it for now.
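
To make this concrete, here is a minimal sketch of the probing
algorithm in Python. This is my own illustration, not code from the
course; the helper measure_throughput(w) is hypothetical -- assume it
runs the sliding-window protocol with window size w for a while and
returns the measured throughput in packets per second.

    def find_window_size(measure_throughput, w=1, step=1):
        """Increase W until throughput stops increasing."""
        tput_1 = measure_throughput(w)
        while True:
            w += step                    # increase W slightly
            tput_2 = measure_throughput(w)
            if tput_2 <= tput_1:         # throughput leveled off...
                return w - step          # ...previous W ~= B*RTT_min
            tput_1 = tput_2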
----------------------------------------------------------------------
Multiple Senders
----------------------------------------------------------------------

Now imagine multiple senders using the same network path (or, really,
the same bottleneck link). Let's assume they've all got their windows
set correctly. For instance:

  S1 -- 2 Mb/s -- A ---- 2 Mb/s ---- B -- 2 Mb/s -- D1
                  |                  |
  S2 -- 2 Mb/s ---                    --- 2 Mb/s -- D2

What will happen? The queue at A will build up, which will lead to
large delays in the network and eventually packet drops. This
scenario is what we mean when we say congestion.

Note that both S1 and S2 were sending at the "correct" rate: the
bottleneck link (A-B) could handle 2 Mb/s, and each of them was
sending at 2 Mb/s. The problem here is obvious: combined, S1 and S2
were sending at 4 Mb/s, twice the bottleneck rate.

What *should* have happened? Well, that's actually debatable (you'll
see in a second), but one reasonable alternative is to have S1 and S2
send at half the bottleneck link rate:

  S1 -- 1 Mb/s -- A ---- 2 Mb/s ---- B -- 1 Mb/s -- D1
                  |                  |
  S2 -- 1 Mb/s ---                    --- 1 Mb/s -- D2

But how would S1 and S2 know to do this? Remember that S1 only knows
about itself; it doesn't know that there is a second sender in the
network. Moreover, a third sender could come in, or S2 could
disappear, or S2 could start sending less than 1 Mb/s, etc. How is S1
going to be able to achieve the "correct" rate, assuming we can even
decide on what rate is correct?

======================================================================
Congestion Control
======================================================================

The mechanism that we use to solve this problem is known as
congestion control. Congestion control means controlling the source
rate to achieve high performance (among other things, this means we'd
like to avoid packet drops).

Fact: The Internet was first built without any congestion control in
place. It eventually experienced "congestion collapse": so many
packets were sent by endpoints that no usable data got through (all
of the packets were retransmits of lost packets). So in a very real
sense, congestion control is (one of) the reason(s) the Internet
works at all.

----------------------------------------------------------------------
Goals
----------------------------------------------------------------------

There are two objectives of any congestion control mechanism:
efficiency and fairness.

1. Use the network efficiently
   - Minimize drops, minimize delay
   - Maximize bottleneck utilization

2. Share the network bandwidth fairly among all connections that are
   using it.

Before we get to an actual congestion control protocol, let me
describe what we mean by sharing bandwidth fairly. There are many
notions of fairness in networking. The one we use for congestion
control is very intuitive; it's known as "max-min" fairness. It's
best described with examples. And it is *NOT* the same as the
fairness we studied when we studied MAC protocols.

Example 1: Suppose my bottleneck link rate is 2 Mb/s, and both S1 and
S2 have a lot of data to send. We'll say they have "infinite demand".
Then the bottleneck link should be split equally: S1 should get
1 Mb/s, S2 should get 1 Mb/s. If all sources have infinite demand,
then max-min fairness just boils down to splitting the bottleneck
link equally.

The interesting parts come when some sources don't have infinite
demand:

Example 2: Suppose the same bottleneck link rate (2 Mb/s), but now
suppose S1 only has .5 Mb/s of data to send (S2 still has infinite
demand). Then S1 should get .5 Mb/s, and S2 should get 1.5 Mb/s.

Example 3: Now imagine a third source enters the network, also with
infinite demand (S1 still has .5 Mb/s to send, S2 still has infinite
demand). Then S1 should get .5 Mb/s, and S2 and S3 should each get
.75 Mb/s.

Basically: If there are N connections, any connection that requires
less than 1/N^th of the bandwidth will get whatever bandwidth it
needs. We then divide the remaining bandwidth among the remaining M
connections in the same manner; once every remaining connection
demands more than a 1/M^th share of what's left, we divide that
bandwidth equally.

Aside: This is called "max-min" fairness because we're maximizing the
minimum amount of resources allocated to any connection. We will not
prove this in 6.02.

Max-min fairness is nice because it's intuitive. But just for
context, other definitions of fairness might bring the cost/price
paid for a service into the calculation, or they might optimize for
equalizing resources over the sum of all the links, not just the
bottleneck links. Again, we won't worry about those alternative
definitions of fairness; you should just know that other definitions
exist.
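
To make the allocation rule precise, here is a short Python sketch of
this "waterfilling" procedure for a single bottleneck link. It is my
own illustration, not part of the lecture; the function name and
interface are made up for the example.

    def max_min_allocation(link_rate, demands):
        """Max-min fair split of one bottleneck link. demands maps
        connection -> demand; use float('inf') for infinite demand."""
        allocation = {}
        remaining = link_rate
        unsatisfied = dict(demands)
        while unsatisfied:
            share = remaining / len(unsatisfied)
            # Connections demanding no more than an equal share get
            # their full demand; their leftovers are redistributed.
            small = {c: d for c, d in unsatisfied.items() if d <= share}
            if not small:
                # Everyone wants more than an equal share: split it.
                for c in unsatisfied:
                    allocation[c] = share
                return allocation
            for c, d in small.items():
                allocation[c] = d
                remaining -= d
                del unsatisfied[c]
        return allocation

    # Reproducing Example 3:
    print(max_min_allocation(2.0, {"S1": 0.5,
                                   "S2": float("inf"),
                                   "S3": float("inf")}))
    # {'S1': 0.5, 'S2': 0.75, 'S3': 0.75}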
----------------------------------------------------------------------
Types of congestion control
----------------------------------------------------------------------

The particular type of congestion control we're going to study today
is an end-to-end congestion control protocol, sometimes known as a
window-based congestion control protocol. The basic idea behind these
protocols is that the switches in the middle of the network are dumb;
all they can do is drop packets. The sources, however, are smart;
they can adjust their rate to converge to the optimal rate.

An alternative approach is to allow the switches in the middle to
help compute the optimal rate. We will briefly compare these
approaches at the end of class.

======================================================================
Window-based Congestion Control
======================================================================

Let's go back to the algorithm we developed at the beginning of
class. The basic idea of this algorithm was to increase W until
throughput leveled off. This idea of increasing W is going to stick
around. But our algorithm has a few problems:

1. It's slow: we increased W slowly.

2. Once we found the correct value for W, we never re-adjusted.

3. It requires measuring throughput, which takes time (say, seconds),
   and our measurements might not be perfectly accurate. Moreover, if
   there are multiple connections using the network, we don't know
   what the "correct" throughput should be -- and the correct
   throughput is constantly changing.

Our biggest problem right now is the third one. We need a better
signal to tell us what to do with our window.

----------------------------------------------------------------------
Remember MAC protocols?
----------------------------------------------------------------------

Remember Stabilized ALOHA? In Stabilized ALOHA, nodes increase or
decrease their sending probability depending on whether they
experience a collision. In effect, a collision signals that there is
congestion in the network. If a node experiences a collision, it
interprets that as congestion and decreases its sending probability
(which, in turn, decreases the overall congestion).

We'd like a similar signal for a reliable-transport protocol. What
reliably signals to an endpoint that there is congestion in the
network? Packet drops!

----------------------------------------------------------------------
AIMD
----------------------------------------------------------------------

The protocol I'm going to describe here is the congestion control
protocol used by TCP. TCP stands for Transmission Control Protocol;
it is the reliable transport protocol that the Internet uses. It is a
sliding-window protocol that behaves just like the one we have been
studying, except with a few additional optimizations.

The basic idea of TCP congestion control: keep increasing your window
until you see packet drops, then back off and try again.

Aside: The paper that originally described this congestion control
mechanism says of packet drops that "this [congestion] signal is
delivered automatically by all existing networks, without special
modification". This was in contrast to some other proposals, which
required modifications to packets and/or the switches in the network.

The specific way that TCP adjusts its window is:

Every RTT:
  - If there is no loss, W = W + 1
  - If there is loss, W = W/2

This technique is known, for obvious reasons, as "additive increase,
multiplicative decrease", or AIMD. Essentially, senders are
conservative about increasing their windows, but scale back
dramatically when they experience congestion.

Aside: There are also more technical reasons for doing additive
increase, multiplicative decrease rather than, say, multiplicative
increase, multiplicative decrease. Essentially, the network remains
most stable when AIMD is used.

This protocol is in fact both efficient and fair. (See the diagram I
drew on the board in lecture.)
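
The per-RTT update above is easy to express in code. Here is a small
Python sketch -- my own simplification, not TCP's actual
implementation (real TCP counts bytes rather than packets and has
many more details):

    def aimd_update(w, loss_this_rtt):
        """Apply AIMD to the window once per RTT."""
        if loss_this_rtt:
            return max(w / 2, 1)   # multiplicative decrease (floor at 1)
        else:
            return w + 1           # additive increase

    # Tracing a few RTTs shows the familiar sawtooth: the window
    # climbs additively, halves on a loss, and climbs again.
    w = 8
    for rtt in range(10):
        w = aimd_update(w, loss_this_rtt=(w >= 12))
        print(rtt, w)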
======================================================================
Slow Start
======================================================================

We have already solved two of the problems with our original
algorithm. TCP's congestion control doesn't use throughput as a
congestion signal; instead it uses packet drops. It is also
constantly adjusting the window, and so will adapt to changes in the
network. But what about its slowness? After all, we're still
increasing the window size additively.

First of all, it's not quite as slow as it might seem. Before, we
were starting with a window size of 1 and increasing additively. Now,
except for the very beginning, we're starting from a window size of
V/2, where V is the correct size. And, as I mentioned in the aside
above, if we increase more quickly than additively, we risk an
unstable network.

However, additively increasing at the beginning of the connection is
still a concern from a performance standpoint. To handle that, TCP
does something different at the beginning of a connection: it
increases its window exponentially, doubling it every round-trip time
until it sees loss. This is done with the following algorithm
(sketched in code below):

- Start with W = 1
- On every ACK, W = W + 1

(Why does this double the window? With a window of W packets, one
round trip brings back W ACKs, each of which adds 1 to the window.)

This algorithm is, somewhat ironically, known as "slow start". Don't
let that fool you. When TCP was first developed, connections started
by sending as much data as they could, and then backed off. Compared
to that, an exponential window increase is indeed slow.
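
Here is a tiny Python simulation of that per-ACK rule, just to show
the doubling -- my own illustration, idealized into lossless RTT
"rounds" (real TCP is event-driven, not round-based):

    w = 1
    for rtt in range(5):
        acks = w              # a full window's worth of ACKs arrives
        for _ in range(acks):
            w += 1            # W = W + 1 on every ACK
        print("after RTT", rtt + 1, "W =", w)
    # after RTT 1 W = 2
    # after RTT 2 W = 4
    # ...the window doubles every round-trip time.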
======================================================================
Additional Details
======================================================================

There are a few additional details of TCP that are worth mentioning:

1. It uses cumulative ACKs. Some of you may have seen a problem on
   cumulative ACKs in recitation. With cumulative ACKs, when a
   receiver sends an ACK with sequence number k, it means "I have
   received all packets up to and including k", rather than just "I
   have received packet k".

   Aside: TCP's ACKs actually reflect the next packet the receiver
   *expects* to receive. This is not important for us.

   What you should have seen from that problem -- and what you should
   be able to figure out yourself -- is that cumulative ACKs have
   less overhead when there is packet loss.

2. It has a heuristic for distinguishing between lost packets and
   re-ordered packets, known as "fast retransmit". In most cases, TCP
   is able to quickly retransmit a lost packet instead of waiting for
   a full timeout.

3. Despite fast retransmit, in the case of severe congestion, TCP's
   window will drop back down to 1. This is why it is so important to
   avoid congestion in a network: your TCP connections become
   "throttled" (they get very low throughput) in the face of
   congestion.

4. TCP also delays ACKs, sending a cumulative ACK for (roughly) every
   other packet rather than for every packet. So, for instance, in
   the case of no loss, we will see a sequence of ACKs numbered
   "2, 4, 6, 8, ..." rather than "1, 2, 3, 4, 5, 6, 7, 8". This makes
   the statistic I showed last week about Netflix's ACK traffic that
   much more astounding.

======================================================================
An Alternate Approach
======================================================================

There is an alternative to end-to-end congestion control. End-to-end
congestion control is (relatively) simple and requires no changes to
the infrastructure. However, it's not as efficient nor as fair as it
could be (in part because endpoints are constantly readjusting their
windows).

The alternative is to involve the switches in the middle of the
network, and have them "help" endpoints compute the optimal rate.
These approaches are more powerful, more efficient, and more fair,
but they are complex and require changes to the infrastructure.

For instance, here is one idea. TCP reacts to congestion only when
there are packet drops, which means it reacts only when the queue is
already full. That *also* means that TCP doesn't react until there
are already large delays in the network. Imagine that the router
started dropping an occasional packet when the queue was, say, 50%
full. Some TCP senders would back off, and perhaps the queue could be
prevented from ever getting completely full. This would keep delay
low and utilization high (remember: we can utilize a network fully
without the queues being full; that's what we get from the W vs.
throughput graph). Mechanisms such as RED, DCTCP, and ECN work this
way.
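
As a rough illustration of that early-drop idea, here is a toy Python
sketch in the spirit of RED, but much simplified. The particular
drop-probability ramp (zero below half-full, rising to 10% as the
queue approaches capacity) is my own illustrative choice, not what
any real router does:

    import random

    QUEUE_CAPACITY = 100        # packets
    MAX_EARLY_DROP_PROB = 0.1

    def should_drop(queue_len):
        """Decide whether to drop an arriving packet."""
        if queue_len >= QUEUE_CAPACITY:
            return True         # queue full: no choice but to drop
        if queue_len < QUEUE_CAPACITY / 2:
            return False        # plenty of room: never drop early
        # Between 50% and 100% full, drop with increasing probability,
        # nudging senders to back off before the queue fills.
        fraction = (queue_len - QUEUE_CAPACITY / 2) / (QUEUE_CAPACITY / 2)
        return random.random() < fraction * MAX_EARLY_DROP_PROB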
======================================================================
Why This is All Cool
======================================================================

Like I said at the beginning, you aren't supposed to come away from
this lecture a congestion control expert. I didn't even give you all
of the details of TCP's congestion control algorithm (though I did
give you the most important ones). So what was the point?

The point was to get you to think about all of the problems we have
to deal with when multiple connections are sharing a network (e.g.,
the Internet). Assumptions that we often make in class -- a single
sender and receiver -- don't hold. The approaches we developed --
setting W to a fixed bandwidth-delay product -- are still relevant,
but aren't enough.

Moreover, whenever we design protocols for the Internet, we can't
rely on a centralized authority. Congestion control would be simple
if there were an oracle that could tell every sender exactly what
rate to send at, but we don't have that. The senders need to figure
out the missing information all by themselves.

Basically: the Internet is an extraordinarily complex environment to
design for, and you all should be amazed that it works at all.