6.02 - Reliable Transport
Lecture #22
Katrina LaCurts, lacurts@mit.edu

======================================================================
Introduction
======================================================================

This is now our current model of a network:

   [ Reliable transport         ]
   [ Packets/Addressing/Routing ]
   [ Bits                       ]
   [ Physical layer: signals    ]

So far, we've filled in the bottom three layers. Today we're going to
look at the top one. We'll answer the question of how to send data
reliably in a best-effort packet-switched network where packets can be
delayed, re-ordered, or dropped (and even duplicated!).

But first, let's think about why we need reliable transport at all.
After all, the routing protocols we discussed last week can eventually
converge even when routing advertisements are lost. What part of the
network actually *needs* to communicate reliably?

In reality, networks aren't used just by switches. They're used by the
end points that are initiating the communications. Presumably, the
communication is part of some application running on the machines at
the endpoints. For instance, you might be watching Netflix on your
laptop. The Netflix application (website) on your machine is receiving
packets from one of Netflix's servers; these packets contain the
underlying data for whatever movie you're watching. If you were to
"view" the packets out of order, the movie would not make much sense!
So, we need to add another layer to our model:

   [ Application                ]
   [ Reliable transport         ]
   [ Packets/Addressing/Routing ]
   [ Bits                       ]
   [ Physical layer: signals    ]

Using this layering, we can make the following model. We have a
sending and receiving application at two endpoints:

   -----------------------                 -------------------------
   | sending application |  ===network===> | receiving application |
   -----------------------                 -------------------------

To allow them to reliably send data across the network, we will add a
reliable sender/receiver "underneath" the application:

   -----------------------                 -------------------------
   | sending application |                 | receiving application |
   |          |          |                 |          ^            |
   |          v          |                 |          |            |
   |   reliable sender   |  ===network===> |   reliable receiver   |
   -----------------------                 -------------------------

The sending application will send its data packets to the reliable
sender (both of these entities exist on the same machine; we don't
need to worry about how packets get between the application and
sender). That reliable sender will then send them across the network,
reliably, to the reliable receiver. The reliable receiver will take
care of transferring the data packets *in order* to the receiving
application.

We'll say that the reliable sender and receiver implement a "reliable
transport protocol". In such a protocol, the receiving application
receives all data bytes in exactly the same order that the sending
application sent them, and each data byte is delivered exactly once.

The problems we'll need to solve to get this to work are:
- Packets may be lost
- Packets may be re-ordered in the network
- Packet delays are variable (due to queueing)

Aside: If you are taking 6.02 late in your college career
(particularly if you're taking it after 6.033, even though you
shouldn't be!), or if you have some prior networking experience, you
may already be familiar with a reliable transport protocol called TCP.
The protocols that we study in 6.02 use the same underlying concepts
as TCP, but differ in a few details. Be careful not to mix the two up!

======================================================================
Stop-and-wait
======================================================================

----------------------------------------------------------------------
Sequence numbers and acknowledgments
----------------------------------------------------------------------

At a high level, any reliable transport protocol uses the notions of
sequence numbers and acknowledgments.

Sequence numbers: Each packet in a connection includes a sequence
number, stored in its header. These sequence numbers start at 1, and
increase incrementally for each packet in the connection (i.e., the
first packet has sequence number 1, the second has sequence number 2,
...).

Acknowledgments: When the reliable receiver receives a packet, it will
acknowledge the receipt of the packet by sending an acknowledgment, or
ACK, back to the sender. ACKs are just a special type of packet. An
ACK packet contains the sequence number of the packet it is
acknowledging, so the sender will be able to differentiate, e.g., an
ACK for packet 1 from an ACK for packet 2.

Already, you can see that our model needs to be slightly different:
the reliable sender and receiver will be communicating in both
directions, with data packets flowing from the sender to the receiver
and ACK packets from the receiver to the sender.

   -----------------------                 -------------------------
   | sending application |                 | receiving application |
   |          |          |                 |          ^            |
   |          v          |                 |          |            |
   |   reliable sender   | <===network===> |   reliable receiver   |
   -----------------------                 -------------------------

Aside: In a network, there is nothing preventing an end point from
being both a sender and receiver, transmitting both data packets and
ACKs. We are just starting out with a simpler model where data flows
in one direction.
Here's our first attempt at a reliable transport protocol:

At the sender:
- Send one packet, and keep track of its sequence number.
- Once an ACK is received for that packet, delete the stored sequence
  number, and send a new packet (using the same strategy of saving its
  sequence number and waiting for an ACK).

At the receiver:
- Upon receipt of packet k, send an ACK for packet k and deliver the
  packet to the application.

Under normal operation, we'll see the following sequence of events
(remember that the receiver and receiving application exist on the
same machine):

   Sender                   Receiver       Receiving application
     | ------ seq 1 -------> |                     |
     | <----- ack 1 -------- | ---- seq 1 --->     |
     | ------ seq 2 -------> |                     |
     | <----- ack 2 -------- | ---- seq 2 --->     |
     | ------ seq 3 -------> |                     |
     | <----- ack 3 -------- | ---- seq 3 --->     |
   etc.

This is looking good! All of the packets are being delivered in order
to the receiving application.

----------------------------------------------------------------------
Problem 1: Handling loss
----------------------------------------------------------------------

However, let's imagine that a packet is lost:

   Sender                   Receiver       Receiving application
     | ------ seq 1 -------> |                     |
     | <----- ack 1 -------- | ---- seq 1 --->     |
     | ------ seq 2 ---x     |                     |
     |                       |                     |

We can see immediately that we have a problem. The receiver didn't
receive packet #2, and so didn't send an ACK for it. The sender hasn't
yet received an ACK for packet #2, and so can't proceed to packet #3.
This situation is an example of deadlock: both sides are waiting for
something to happen, and neither can proceed.

What's a solution? One idea is to have the sender retransmit the
packet after some fixed timeout.
Our protocol now becomes:

At the sender:
- Send one packet, and keep track of its sequence number and
  transmission time.
- Once an ACK is received for that packet, delete the stored sequence
  number, and send a new packet (using the same strategy of saving its
  sequence number and waiting for an ACK).
- If an ACK hasn't been received after timeout seconds since the
  packet's transmission time, retransmit it to the receiver.

At the receiver:
- Upon receipt of packet k, send an ACK for packet k and deliver the
  packet to the application.

Note now that the sender is keeping track of the outstanding packet's
sequence number *and* transmission time. There is also the addition of
a timeout value; for now, assume it's fixed.

This protocol does in fact solve our earlier problem:

            Sender                   Receiver       Receiving app
              | ------ seq 1 -------> |                   |
              | <----- ack 1 -------- | ---- seq 1 --->   |
          -   | ------ seq 2 ---x     |                   |
 timeout  |   |                       |                   |
 seconds  |   |                       |                   |
          -   | ------ seq 2 -------> |                   |
              | <----- ack 2 -------- | ---- seq 2 --->   |
              | ------ seq 3 -------> |                   |
              | <----- ack 3 -------- | ---- seq 3 --->   |

But now we've introduced an additional problem. Let's imagine that
instead of a packet being lost, an ACK is lost:

            Sender                   Receiver       Receiving app
              | ------ seq 1 -------> |                   |
              | <----- ack 1 -------- | ---- seq 1 --->   |
          -   | ------ seq 2 -------> | ---- seq 2 --->   |
 timeout  |   |     x-- ack 2 ------- |                   |
 seconds  |   |                       |                   |
          -   | ------ seq 2 -------> |                   |
              | <----- ack 2 -------- | ---- seq 2 --->   |
              | ------ seq 3 -------> |                   |
              | <----- ack 3 -------- | ---- seq 3 --->   |

Since the sender never received an ACK for packet 2, it retransmitted
it. The receiving application now receives two copies of the same
packet! We wanted the receiving application to receive exactly one
copy of each packet; currently it's getting *at least* one.
----------------------------------------------------------------------
Problem 2: Exactly-once delivery
----------------------------------------------------------------------

To solve this problem, we need to change our protocol again, this time
at the receiver:

At the sender:
- Send one packet, and keep track of its sequence number and
  transmission time.
- Once an ACK is received for that packet, delete the stored sequence
  number, and send a new packet (using the same strategy of saving its
  sequence number and waiting for an ACK).
- If an ACK hasn't been received after timeout seconds since the
  packet's transmission time, retransmit it to the receiver.

At the receiver:
- Upon receipt of packet k, send an ACK for packet k.
- If k is greater than the last sequence number we received (or if we
  haven't received any packets yet), then deliver the packet to the
  application and keep track of k.

By keeping track of the last sequence number received, it's easy to
recognize duplicates. Our previous problem has been solved:

            Sender                   Receiver       Receiving app
              | ------ seq 1 -------> |                   |
              | <----- ack 1 -------- | ---- seq 1 --->   |
          -   | ------ seq 2 -------> | ---- seq 2 --->   |
 timeout  |   |     x-- ack 2 ------- |                   |
 seconds  |   |                       |                   |
          -   | ------ seq 2 -------> |                   |
              | <----- ack 2 -------- |  <- no delivery   |
              | ------ seq 3 -------> |                   |
              | <----- ack 3 -------- | ---- seq 3 --->   |

This final form of the protocol we've been developing is known as the
stop-and-wait protocol. It provides exactly-once semantics in a
best-effort network.

======================================================================
Problem 3: Choosing timeouts in stop-and-wait
======================================================================

There is one scenario we haven't studied.
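The full stop-and-wait protocol above can be sketched as a small
simulation. This is not the 6.02 course code; it's a minimal sketch
where the "network" is simulated by sets that name which transmissions
get lost (so the sender's timeout/retransmit loop becomes a plain
`while` loop), and the function and parameter names are my own.

```python
def stop_and_wait(data, drop_data=frozenset(), drop_acks=frozenset()):
    """Deliver `data` reliably over a simulated lossy channel.

    drop_data / drop_acks contain (seq, attempt) pairs naming which
    data packets and ACKs the simulated network loses.
    """
    delivered = []   # what the receiving application sees
    last_seq = 0     # receiver state: last sequence number delivered
    for seq, payload in enumerate(data, start=1):
        attempt = 0
        while True:
            attempt += 1
            # --- sender transmits packet (seq, payload) ---
            if (seq, attempt) in drop_data:
                continue        # packet lost: timeout fires, retransmit
            # --- receiver gets the packet ---
            if seq > last_seq:  # new packet: deliver exactly once
                delivered.append(payload)
                last_seq = seq
            # the receiver ACKs every copy, even a duplicate
            if (seq, attempt) in drop_acks:
                continue        # ACK lost: sender times out, retransmits
            break               # ACK received: move on to the next packet
    return delivered
```

Losing the first copy of packet 2 and the first ACK for packet 3, for
example, still yields exactly-once, in-order delivery: the duplicate
copy of packet 3 is re-ACKed but not re-delivered.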
Imagine that no packets in our networks are lost, but instead there
are very long delays:

         S                          R                  A
       - | --- seq 1 ---\           |                  |
 timeout |               \          |  (packet still   |
 seconds |                \         |   in flight)     |
       - | --- seq 1 ---\  \------> | ---- seq 1 --->  |
         |               \          |                  |
         | <--- ack 1 ----\-------- |                  |
         |                 \------> |                  |
         | <--- ack 1 ------------- |                  |

Here, nothing went wrong, per se -- the application is receiving
packets exactly once and in order -- but packet 1 was retransmitted
when it didn't need to be. This is known as a spurious retransmission,
and it adds unnecessary congestion to the network, which is bad. Our
timeout was too short; let's make it longer.

         S                          R                  A
       - | ------ seq 1 ----------> | ---- seq 1 --->  |
       | |                          |                  |
 timeout | <----- ack 1 ----------- |                  |
 seconds |                          |                  |
       - |                          |                  |
         | ------ seq 2 ----------> | ---- seq 2 --->  |
         | <----- ack 2 ----------- |                  |

Seems okay. Now what happens when a packet is lost?

         S                          R                  A
       - | ------ seq 1 ---x        |                  |
       | |                          |                  |
 timeout |    (a very long wait,    |                  |
 seconds |     doing nothing)       |                  |
       | |                          |                  |
       - | ------ seq 1 ----------> | ---- seq 1 --->  |
         |  etc.                    |                  |

Boy. We sure had to wait a long time to retransmit that packet. Much
longer than was necessary, even.

What you should see from this is that our timeout value should be
close to the delay of the network. If it's too short, we retransmit
unnecessarily, but if it's too long, performance suffers:

         S                          R                  A
       - | ------ seq 1 ----------> | ---- seq 1 --->  |   timeout should
       | |                          |                  |   be at least this
       | | <----- ack 1 ----------- |                  |   long, and not
       - |                          |                  |   much longer

(In general, we'd prefer a timeout that was slightly too long to one
that was slightly too short. Too short means spurious retransmissions,
which add congestion; too long means a slight performance hit.)

This problem seems easy enough to solve; we'll just need to measure
the delay in the network before we set our timeout value, and make the
value slightly larger.
Another term for this end-to-end delay is the round-trip time, or RTT:
the time it takes a packet to travel one "round trip" in the network.

How might we measure the RTT? We can measure the time it takes from
sending a packet to receiving an ACK for it by storing the time at
which we sent the packet, and comparing that to the time at which we
received the ACK. A more elegant way is to store the current time in
the packet header, and have this time echoed back to the sender in the
ACK.

----------------------------------------------------------------------
Handling variable RTTs
----------------------------------------------------------------------

Is setting the timeout really so easy? Of course not! You should know
from our study of queues that queueing delay is *variable*. It depends
on the number of other packets in the queue, which depends on the
number of other connections in the network, which is constantly
changing. That means the RTT is going to vary. For instance:

         S                          R
       - | ------ packet ---------> |
 RTT 1 | |                          |
       - | <------ ack ------------ |
       - | ------ packet ---------> |
 RTT 2 | |                          |
       - | <------ ack ------------ |

If we keep making these measurements, we're going to end up with a
distribution of RTTs. So now our goal is to estimate a timeout from
this distribution of RTTs. We do not know what model the distribution
fits (e.g., we don't know that it's Gaussian).

To start, let's use the mean of our RTT distribution. I.e., when we
receive the nth RTT sample, do the following:

   n_rtts = n_rtts + 1
   srtt[n] = (rtt_sample + (srtt[n-1] * (n_rtts - 1))) / n_rtts
   timeout = srtt[n]

(Here, I'm using the variable n_rtts to keep track of the number of
RTT samples and srtt to keep track of the average RTT value; the s
stands for "smoothed".)

Though this was a reasonable idea, there are two problems. One deals
with RTTs varying over long timescales, the other with RTTs varying
over short timescales.
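The running-mean update above can be sketched in a few lines of
Python. The function name `update_mean` is mine; the variable names
follow the notes.

```python
def update_mean(srtt, n_rtts, rtt_sample):
    """Fold one new RTT sample into the running average."""
    n_rtts += 1
    srtt = (rtt_sample + srtt * (n_rtts - 1)) / n_rtts
    return srtt, n_rtts

# After processing samples of 1, 2, and 3 seconds, srtt is their
# plain mean: 2.0 seconds.
srtt, n_rtts = 0.0, 0
for sample in [1.0, 2.0, 3.0]:
    srtt, n_rtts = update_mean(srtt, n_rtts, sample)
```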
----------------------------------------------------------------------
Problem 1: RTTs can change over time
----------------------------------------------------------------------

First, consider a network where there has been very little delay for
the past hour, and the RTT samples have been nearly constant at (say)
one second. Thus:

   srtt[n]    = 1 sec
   timeout[n] = 1 sec
   n_rtts     = large

Now, imagine that one node in the network starts sending a lot more
traffic, so that the RTT samples are now consistently around two
seconds. Since we've been collecting data for so long, it will take
our srtt estimate a very long time to reach the RTT that the network
is currently experiencing. In this particular case, the timeout is too
short; we will be spuriously retransmitting packets for a long time,
which will add to the congestion that is already happening. So in
fact, we'd like to react fairly quickly to these changes.

One solution might be to throw out some of the data and average over
just the last x seconds of data. In practice, we do something a bit
more sophisticated: we use an exponentially weighted moving average
(EWMA):

   srtt[n] = alpha * rtt_sample + (1 - alpha) * srtt[n-1]

Notice, though, that even though we want to react quickly to
*persistent* changes, such as the one in the example above, we don't
want to react quickly when there is an outlier in the data. An EWMA
lets us control both. As alpha varies, the influence of the most
recent rtt_sample changes; the higher alpha, the more influential it
is. In networks prone to congestion, we typically use
.1 <= alpha <= .25. This allows our estimate to react relatively
quickly to changes in the RTT without being influenced very heavily by
a single spurious RTT sample.

Aside: You should recognize this EWMA as an LTI filter whose input is
the rtt_sample stream and whose output is the smoothed (i.e., lowpass
filtered) srtt stream.
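The two behaviors described above (reacting quickly to persistent
shifts, shrugging off single outliers) can be seen directly by running
the EWMA update. This sketch assumes alpha = 0.125, a value inside the
.1 to .25 range the notes suggest; the variable names are mine.

```python
ALPHA = 0.125  # assumed weight within the range suggested in the notes

def ewma(srtt, rtt_sample, alpha=ALPHA):
    """One step of srtt[n] = alpha*rtt_sample + (1-alpha)*srtt[n-1]."""
    return alpha * rtt_sample + (1 - alpha) * srtt

# A persistent shift from 1s to 2s pulls the estimate close to 2s
# within a couple dozen samples...
srtt = 1.0
for _ in range(20):
    srtt = ewma(srtt, 2.0)
after_shift = srtt            # roughly 1.93

# ...while a single 10s outlier only nudges the estimate upward.
after_outlier = ewma(after_shift, 10.0)   # roughly 2.94, not 10
```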
The unit sample response of this filter is

   h[n] = (1 - alpha)^(n-1) * alpha * u[n-1],

hence the name exponentially weighted moving average. (You can compute
the frequency response yourself to verify that this is a lowpass
filter whose approximate cutoff decreases as alpha decreases.)

----------------------------------------------------------------------
Problem 2: Non-zero standard deviation
----------------------------------------------------------------------

Now that we've solved the problem of reacting to changes in the RTTs,
our algorithm has become:

   srtt[n] = alpha * rtt_sample + (1 - alpha) * srtt[n-1]
   timeout = srtt[n]

The second problem we want to deal with is the fact that it's very
rare for packets to experience an RTT that is exactly equal to the
mean, mu. For instance, remember our Gaussian distributions: half of
the values lie above the mean! In our context, since our timeout is
set as (roughly) the mean of the distribution, any packet with an RTT
greater than the mean will cause a spurious retransmission. If the
RTTs are distributed in a Gaussian, that would be roughly half of the
samples!

Mathematical aside: We don't know the underlying RTT distribution.
Despite that, there is actually a theorem that can help us estimate
how likely it is that an individual sample will be far away from the
mean: Chebyshev's Inequality. Chebyshev's Inequality tells us this: if
X is a random variable with a finite mean mu and a finite non-zero
variance sigma^2, then for any real number k > 0:

   P(|X - mu| >= k*sigma) <= 1/k^2

We can use this theorem to make statements about how many values might
be outside of a particular range around the mean regardless of the
underlying distribution (as long as the mean is finite and the
variance is finite and non-zero). This theorem can be very
conservative: in the case of a Gaussian, for example, the actual
probability is *much* smaller than 1/k^2.
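A quick numeric check makes the conservatism concrete. For k = 4 the
Chebyshev bound is 1/16 = 6.25%, while for a Gaussian the exact
two-sided tail P(|X - mu| >= k*sigma) = erfc(k / sqrt(2)) is several
orders of magnitude smaller.

```python
import math

k = 4
chebyshev_bound = 1 / k**2                   # holds for any finite-variance X
gaussian_tail = math.erfc(k / math.sqrt(2))  # exact for a Gaussian: ~6e-5
```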
The virtue of Chebyshev's Inequality is that it holds for all
distributions, which is relevant for us, since our RTT distribution is
unknown.

Naturally, this means that we need an estimate of how the RTTs are
varying. Just like our estimate of the mean, we're going to use an
EWMA for this estimate. In practice, we estimate the mean absolute
deviation because it's easy to compute; estimating the standard
deviation would also serve our purpose.

   srtt[n] = alpha * rtt_sample + (1 - alpha) * srtt[n-1]
   dev_sample = |rtt_sample - srtt[n]|
   srttdev[n] = beta * dev_sample + (1 - beta) * srttdev[n-1]
   timeout = srtt[n] + k*srttdev[n]

(The absolute deviation of a sample is just the absolute value of its
difference from the mean.)

So in the end we've set the timeout to be the estimated mean plus some
constant times the estimated deviation. In practice, k = 4;
Chebyshev's Inequality tells us that no more than 1/16 = 6.25% of RTT
samples should be above this timeout value (and possibly far fewer).

======================================================================
Analysis of stop-and-wait
======================================================================

All right. So finally, here we are, with a correctly working
stop-and-wait protocol and a very fancy way to estimate an appropriate
timeout value. What kind of performance are we getting with this
protocol?

Let T = the expected time (in seconds) between successful deliveries
of packets. If N data packets are sent (N large), then the time to
send them will be N*T, and so the throughput will be
N packets / (N*T seconds) = 1/T packets per second. We need to find T.

We can't just assume that T is the round-trip time, because packets
get lost. In fact, based on our previous discussion, we can't even
assume the RTT is constant. But just hold off on that problem for now,
and let "RTT" represent the round-trip time.

We'll add a new variable, L, which represents the probability that a
packet *or its ACK* is lost.
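The four update lines above can be collected into one small estimator.
This is a sketch, not the course's code: the class name is mine, and
the constants alpha = 0.125 and beta = 0.25 are assumed values
consistent with the ranges discussed in the notes (k = 4 is from the
notes).

```python
ALPHA = 0.125   # assumed EWMA weight for the mean
BETA = 0.25     # assumed EWMA weight for the deviation
K = 4           # from the notes: timeout = srtt + 4*srttdev

class TimeoutEstimator:
    """Track smoothed RTT and mean absolute deviation; derive a timeout."""

    def __init__(self, first_sample):
        self.srtt = first_sample
        self.srttdev = 0.0

    def update(self, rtt_sample):
        # srtt[n] = alpha*rtt_sample + (1-alpha)*srtt[n-1]
        self.srtt = ALPHA * rtt_sample + (1 - ALPHA) * self.srtt
        # dev_sample = |rtt_sample - srtt[n]|  (deviation from new mean)
        dev_sample = abs(rtt_sample - self.srtt)
        # srttdev[n] = beta*dev_sample + (1-beta)*srttdev[n-1]
        self.srttdev = BETA * dev_sample + (1 - BETA) * self.srttdev
        return self.timeout()

    def timeout(self):
        return self.srtt + K * self.srttdev

# With perfectly constant 100 ms samples the deviation stays zero and
# the timeout sits at the RTT; a jittery sample immediately widens it.
est = TimeoutEstimator(0.1)
for sample in [0.1, 0.1, 0.1]:
    est.update(sample)
steady = est.timeout()     # still 0.1: no variation observed
est.update(0.3)            # one jittery sample
widened = est.timeout()    # now well above srtt
```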
Aside: We can calculate L if we're given a per-link loss probability
p. If the packet and its ACK travel a total of k hops in the round
trip, the probability that neither the packet nor its ACK is lost is
(1-p)^k, and so the probability that either is lost is

   L = 1 - (1-p)^k

Now:

   T = P(packet is transmitted successfully)
         * (delay of packet when it's transmitted successfully)
     + P(packet is not transmitted successfully)
         * (expected delay of packet when it's not transmitted
            successfully)

     = (1 - L) * RTT + L * (timeout + T)

Why is there a T in the second term? Because once a timeout has
happened, it will take an expected time of T more seconds for the
retransmission to succeed. Solving for T, we get:

   T = RTT + (L/(1-L)) * timeout

Remember that 1/T represents throughput. We'd like to figure out how
much throughput we could possibly get in this protocol. The best case
happens when the RTT is constant, and so our timeout is close to this
value. (Because we're just using this model to estimate throughput,
it's okay to sidestep issues like variable RTTs.)

   T = RTT + (L/(1-L)) * RTT
     = RTT/(1-L)

And so the throughput, 1/T, is (1-L)/RTT.

Well, so what? Let's say we have a bottleneck link that can support
100 packets per second, and the RTT is 100 milliseconds. Using
stop-and-wait, the maximum throughput (which occurs when L = 0) is 10
packets per second. That's 10% utilization. This is terrible! We
solved the problem of reliable transport, but we completely destroyed
performance. Next week we'll figure out how to have both.

Additionally, you should get comfortable with the type of modeling I
just did to figure out the maximum throughput of stop-and-wait.
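The best-case model above is a one-liner to check numerically. This
sketch (the function name is mine) reproduces the lecture's example:
RTT = 100 ms and L = 0 gives 10 packets per second, a tenth of the 100
packet-per-second bottleneck.

```python
def stop_and_wait_throughput(rtt, loss):
    """Best-case packets/second when timeout ~= RTT.

    T = RTT + (L/(1-L))*RTT = RTT/(1-L), and throughput = 1/T.
    """
    expected_time = rtt / (1 - loss)   # T, seconds per delivered packet
    return 1 / expected_time           # = (1-L)/RTT

rate = stop_and_wait_throughput(rtt=0.1, loss=0.0)   # the lecture's example
```

Note how gently loss degrades this model (halving throughput only at
L = 0.5): the dominant cost of stop-and-wait is the idle round trip
per packet, not retransmission.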