6.02 - Viterbi Decoding
Lecture #7
Katrina LaCurts, lacurts@mit.edu

======================================================================
Recap
======================================================================

On Monday, we started convolutional codes. These are codes that
operate over a stream of message bits, and transmit only parity bits
(not parity and also data). We discussed the encoding process, but
left decoding for today.

Recall the three different views for convolutional codes:

1. Shift-register view, which reflects how the encoding process is
   implemented in hardware.

2. State-machine view, which took the possible states that the
   shift-register could be in, and visualized how we move from one
   state to another, and what parity bits are outputted.

3. Trellis view, which expanded the state machine out over time.

Today, we will concentrate entirely on the trellis view, because it
is most useful in the decoding process.

======================================================================
Decoding
======================================================================

Aside: Every time I use the word "decoding" in the first part of this
lecture, I am really talking about "hard-decision decoding". You will
understand this difference when I get to the second part of the
lecture and explain to you what the contrasting "soft-decision
decoding" is.

In the trellis view, the encoding process creates a path through the
trellis corresponding to the codeword, and we send the corresponding
parity bits. At the other end, the receiver receives a sequence of
parity bits; its job is to figure out what the corresponding codeword
is, i.e., what path the transmitter took through the trellis.

Really, we are finding the *most likely* path that the transmitter
took through the trellis, given that some of the parity bits may be
in error. On Monday we noted that finding the most likely path is
equivalent to determining the codeword with minimal Hamming distance
between itself and the received sequence of bits.

So imagine that we received the following sequence of parity bits:

  00 01 01 10 ...

We know we start at the [00] state. The first two parity bits we
received were 00. So that's easy; we know we took the transition from
[00] -> [00], and thus the first message bit is a zero.

The next two received parity bits are 01. This is problematic! First,
the only transitions out of [00] correspond to parity bits 00 and 11;
our received parity bits don't match up here. Moreover, at this
point, it seems like both of those transitions are equally likely;
the Hamming distance between 01 and 00 is the same as the Hamming
distance between 01 and 11.

----------------------------------------------------------------------
Naive Decoding
----------------------------------------------------------------------

Here is one way we could do the decoding:

- Enumerate all paths that start at the 00 state.
- Find the one that corresponds to a parity bit sequence that's
  closest in Hamming distance to the received parity bits.

With L message bits, there are 2^L possible paths. And we're dealing
with a fairly large L here (on the order of hundreds/thousands); 2^L
is not feasible for us.
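Aside: here is a minimal Python sketch of this naive decoder. It
assumes the parity equations from Monday are p0[n] = x[n] + x[n-1] +
x[n-2] and p1[n] = x[n] + x[n-2] (mod 2) -- a choice consistent with
the trellis transitions used in this lecture -- and the helper names
are my own.

  from itertools import product

  def encode(msg_bits):
      """Rate-1/2, K=3 convolutional encoder (assumed equations:
      p0[n] = x[n] ^ x[n-1] ^ x[n-2], p1[n] = x[n] ^ x[n-2])."""
      x = [0, 0] + list(msg_bits)   # shift register starts in state [00]
      parity = []
      for n in range(2, len(x)):
          parity.append(x[n] ^ x[n-1] ^ x[n-2])   # p0[n]
          parity.append(x[n] ^ x[n-2])            # p1[n]
      return parity

  def naive_decode(received, L):
      """Try all 2^L messages; return the one whose codeword is
      closest in Hamming distance to the received parity bits."""
      best_msg, best_dist = None, None
      for msg in product([0, 1], repeat=L):       # 2^L candidate paths
          dist = sum(a != b for a, b in zip(encode(msg), received))
          if best_dist is None or dist < best_dist:
              best_msg, best_dist = list(msg), dist
      return best_msg, best_dist

  # Received parity bits 00 01 01 10 from the example above:
  print(naive_decode([0,0, 0,1, 0,1, 1,0], L=4))  # ([0, 1, 1, 1], 1)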
----------------------------------------------------------------------
Viterbi Decoding: Intuition
----------------------------------------------------------------------

Let's go back to our example. We couldn't figure out what the second
bit of the message should be, which left us with two states to
consider: either [00] or [10].

We also know that the next two parity bits are (again) 01. So imagine
we're in state [00]: parity bits 01 don't correspond to either
transition out of that state. Now imagine we're in state [10]. Ah-ha!
01 does correspond to a transition out of that state.

So with this new information, it seems more likely that our path
through the states was [00] -> [00] -> [10] (and then -> [11]) rather
than [00] -> [00] -> [00] (and then either -> [00] or -> [10]).

This gives you some intuition for the Viterbi algorithm. At various
points, we may not be sure exactly what transition was used. But as
we move through the message, we can use additional parity bits to
make a pretty good guess as to what previous transitions -- that we
were unsure about -- should have been.

======================================================================
Viterbi Decoding
======================================================================

----------------------------------------------------------------------
Background
----------------------------------------------------------------------

The algorithm we will use to decode this is called the Viterbi
algorithm. It actually solves a larger problem than just
convolutional codes. It is a "recursive optimal solution to the
problem of estimating the state sequence of a discrete-time
finite-state Markov process observed in memoryless noise." (Forney,
Proc. IEEE, 1973)

In the context of convolutional codes, it's used in CDMA and GSM
cellular, dial-up modems, satellite and deep-space communications,
and 802.11 wireless LANs. More generally, it is used in speech
recognition, speech synthesis, keyword spotting, computational
linguistics, and bioinformatics.

Aside: In a more general context, we would talk about the Viterbi
algorithm as doing the following: given a set of observations about a
system, it finds the most likely sequence of "hidden states" that led
to these observations. For us, those observations are parity bits,
and the hidden states are the message bits (which are hidden from the
receiver). The algorithm becomes more general by assigning different
probabilities to the transitions. This is not something we'll cover
in 6.02, so for now it's fine to only understand Viterbi in the
context of convolutional codes. But it's good to know that it's
actually much more powerful.

The Viterbi algorithm is a dynamic programming algorithm, which means
it's going to solve a smaller sub-problem at each step, and build
these solutions up into our final solution.

Aside: dynamic programming is also a much more general concept.

We will compute the most likely message sequence leading up to every
intermediate state, as well as the associated cost. At the end, we'll
figure out the allowed final state that has the lowest cost, and
"trace back" from there. (We say "allowed" because we may have
additional constraints, e.g., the final state has to be 00.)

----------------------------------------------------------------------
Introduction
----------------------------------------------------------------------

As discussed in class on Monday, every path through the trellis from
the possible initial states to the possible final states corresponds
to a valid codeword. Thus, every path through the trellis also
corresponds to a transmitted message. For a received codeword r, I
can calculate the Hamming distance between r and any valid codeword.
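Aside: as a quick Python sketch (bit sequences as lists; the name is
my own), that distance calculation is just:

  def hamming(a, b):
      """Number of positions where two equal-length bit sequences differ."""
      return sum(x != y for x, y in zip(a, b))

  # e.g., received bits 00 01 vs. a codeword that begins 00 11:
  print(hamming([0, 0, 0, 1], [0, 0, 1, 1]))   # 1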
Thus, we have a notion of a distance for a particular path through
the trellis: the distance of any path p, which corresponds to valid
codeword c, is the Hamming distance between c and r. I'll call this
distance the error metric for a path.

Now, suppose I show you two possible paths through the trellis. Path
1, corresponding to valid codeword c1, has error metric 5; path 2,
corresponding to valid codeword c2, has error metric 3. Which
codeword is more likely to be correct? c2. Thus, to figure out the
transmitted message, we can just take a look at the message bits that
correspond to the transitions of path 2.

The Viterbi algorithm will work in part by exploring many possible
paths through the trellis. When we get to the end of the trellis, we
will determine which has the minimum error metric, and that path will
correspond to the most-likely codeword; from the path, we will be
able to determine the (most-likely) message.

Part of what makes this algorithm efficient is that instead of
exploring all possible paths through the trellis, it is able to
eliminate certain paths as it moves through the trellis. In fact, it
only keeps track of the optimal path to each state at each time as it
moves through the trellis, thereby eliminating from consideration all
but 2^(K-1) candidate paths at each time.

----------------------------------------------------------------------
Example: The beginning
----------------------------------------------------------------------

We're going to deal with the following received message:

  00 01 01 10 11 00

And the same parity equations from Monday.

We know that the transmitter started at state [00]. Our first two
received parity bits are 00. If the transmitter had taken transition
[00] -> [00], the Hamming distance would be 0. If the transmitter had
taken transition [00] -> [10], the Hamming distance would be 2. So we
are starting off on two possible paths.

Notice that I am putting an error metric in the states. This
indicates the error metric for the path that ends at that particular
state in this iteration. You should immediately ask yourself: what
happens if multiple paths end at the same state? This is a question I
will answer shortly.

Now let's continue. My next two parity bits are 01. From what I did
in the last step, it seems that I could potentially be in one of two
states right now: [00] or [10].

From [00], I can make one of two transitions:

  [00] -> [00]; parity bits 00 => Hamming distance = 1
  [00] -> [10]; parity bits 11 => Hamming distance = 1

From [10], I can make one of two transitions:

  [10] -> [01]; parity bits 10 => Hamming distance = 2
  [10] -> [11]; parity bits 01 => Hamming distance = 0

I want to figure out the error metric for this iteration, so I'm
going to add the new Hamming distances to the error metrics that
already exist.

----------------------------------------------------------------------
Branch Metric
----------------------------------------------------------------------

Before we get too far into this example, I want to start attaching
some more formal terms. At each iteration, we have been computing the
Hamming distance between the received parity bits at that iteration
and the parity bits associated with different transitions (sometimes
we call the transitions "arcs").

Formally, we are computing the "branch metric": the distance between
the received parity bits and the expected parity bits for each
transition. The branch metric is not restricted to being the Hamming
distance, and we will see an alternative choice later in the lecture.
In general, the branch metric is proportional to the negative log
likelihood, i.e., the negative log probability that we receive r
given that x was sent. As you have seen, the larger the branch
metric, the less probable it was that we made that transition.
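Aside: as a small Python sketch (the transition/parity-bit pairs are
the ones listed above; the function name is my own), the
hard-decision branch metric is just this per-transition Hamming
distance:

  def branch_metric(expected, received):
      """Hamming distance between a transition's expected parity bits
      and the parity bits we actually received."""
      return sum(e != r for e, r in zip(expected, received))

  # Received parity bits 01, against the transitions out of [00], [10]:
  for arc, expected in [("[00] -> [00]", "00"), ("[00] -> [10]", "11"),
                        ("[10] -> [01]", "10"), ("[10] -> [11]", "01")]:
      print(arc, branch_metric(expected, "01"))
  # prints BM = 1, 1, 2, 0 respectively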
----------------------------------------------------------------------
Example: Continuing on
----------------------------------------------------------------------

Okay. So we have processed the received parity bits 00 01, and we
have error metrics calculated through four different paths through
the trellis. At this point, it is unclear what makes the Viterbi
algorithm so efficient, as we are essentially considering all
possible paths.

Our next two received parity bits are 01. We start by calculating all
of the appropriate branch metrics:

  [00] -> [00]: BM = 1     [10] -> [01]: BM = 2
  [00] -> [10]: BM = 1     [10] -> [11]: BM = 0
  [01] -> [00]: BM = 1     [11] -> [01]: BM = 0
  [01] -> [10]: BM = 1     [11] -> [11]: BM = 2

Now, if we took the branch from [00] -> [00], that would make our
total error metric 2 (1 + 1 = 2). But wait! There are actually two
paths that end at state [00]; the one from [00] -> [00], and the one
from [01] -> [00]. If we took the other path, that would give us an
error metric of 4 + 1 = 5. So at this stage, which path is more
likely? The first.

We'll do the same at the other states (I'll abbreviate "error metric"
as "EM"):

  At [01]: Taking [10] -> [01] yields EM = 3 (1 + 2); taking
           [11] -> [01] yields EM = 2 (2 + 0). So we'll use [11] -> [01].
  At [10]: Taking [00] -> [10] yields EM = 2; taking [01] -> [10]
           yields EM = 5.
  At [11]: Taking [10] -> [11] yields EM = 1; taking [11] -> [11]
           yields EM = 4.

Now you can see where the efficiency of the Viterbi algorithm comes
in. There are, in theory, eight potential paths through the trellis
now (we're on the 3rd iteration), but we just eliminated some of
them.

Why did we do that? Well, suppose that a wizard appears, and lets you
know that at iteration 3, you are supposed to be at state [00].
Originally, there were going to be two paths from the starting state
to [00]: [00] -> [00] -> [00] -> [00] (EM = 2), and
[00] -> [10] -> [01] -> [00] (EM = 5). It is more likely that we took
the first path; it has the smaller error metric. And any paths that
continue on from state [00] at iteration 3 will just increase that
error metric. So we don't even consider the second path anymore.

----------------------------------------------------------------------
Path Metric
----------------------------------------------------------------------

The final thing to formalize before we continue this discussion
regards the path metric, which is related to the error metric I've
been discussing, but is subtly different.

We keep track of the path metric for each state at each iteration,
i.e., we have PM[s,i] for each state s of the 2^{K-1} transmitter
states, and for each iteration i (0 <= i < L-1).

The path metric is this: PM[s,i] = the smallest sum of branch
metrics, minimized over all message sequences that place the
transmitter in state s at time i. This definition is equivalent to
the minimum error distance over all message sequences that place the
transmitter in state s at time i.

Aside: In a more perfect world, this quantity would be called the
"optimal path metric", not the "path metric".

Up until this iteration, we only had one possible path to each state
in each iteration; the error distance corresponded to the path
metric, even though we hadn't introduced a notion of the best path.
We compute PM[s,i+1] using PM[s,i] and the BM for the incoming
branches. You can see at the beginning, we had some states without
error distances filled in. Formally, those are infinite; there was no
path that placed the transmitter in those states at those times.

----------------------------------------------------------------------
The Algorithm, Formally
----------------------------------------------------------------------

At time i+1, the following is known: PM[s,i] for all states s. For
each state s, at time i+1:

- Determine a and b, the two states that the transmitter could've
  been in at time i to get to s. These are the "predecessor states"
  of s.

  Ex: for state [00], the predecessors are a=[00] and b=[01].

  Any message sequence that leaves the transmitter in state s at time
  i+1 must have left the transmitter in state a or b at time i.

- Calculate the branch metrics for these two transitions; this gives
  us BM_a, BM_b.

  Ex: if the received parity bits are 01 and we're considering
  s=[00], then BM_a = 1 and BM_b = 1.

- Determine PM[s,i+1] = min(PM[a,i] + BM_a, PM[b,i] + BM_b).

  Ex: (from the diagram)
      PM[[00],3] = min(PM[[00],2] + 1, PM[[01],2] + 1)
                 = min(1 + 1, 4 + 1) = min(2, 5) = 2.

  (If there were a tie, we'd break it arbitrarily.)

- Keep track of the predecessor state (equivalently, the arc) used to
  get to s in iteration i.

  Ex: this was [00] -> [00] for our example.

Aside: In implementation, you could do this with a variety of data
structures. In class, I'm just keeping track on the board by coloring
over the appropriate arc.

----------------------------------------------------------------------
Example: The end
----------------------------------------------------------------------

Having completed four iterations, let's look at the states. Which one
was the transmitter most likely in at this point? State [11]: it has
the smallest path metric.

What does the path metric represent? It represents the smallest
number of bit errors obtainable on any path to that state at that
time. So the path metric of 1 for state [11] at this end time
indicates that there is a path with a 1-bit error between the
codeword corresponding to that path and the received codeword. The
other states indicate that the codewords corresponding to their
respective paths are a Hamming distance of either 2 or 3 away from
the received codeword.

So the most likely "ending state" is [11]. How can we recover the
message? We just need to trace back through the trellis, and figure
out what message bit corresponds to each transition:

  [11] -> [11] -> [10] -> [00] -> [00]
       ^       ^       ^       ^
       1       1       1       0

So it looks like the transmitted message was 0111. In this case we
were able to correct the bit error.

Without calculating the parity bits for that message, can you tell
where the bit error came about in our received parity bits? It
happened between step 1 and 2. This is where the path metric for our
chosen path increased from 0 to 1. (Note: this process worked because
in this case we were able to decode the message correctly.)

We can check this: If the message had been transmitted correctly,
that message would've resulted in the parity bits 00 11 01 10. Our
parity bits (00 01 01 10) do indeed reflect a 1-bit error in this
message, on the second set of parity bits (which corresponds to
"between step 1 and 2" in our algorithm).
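Aside: putting the formal algorithm together, here is a minimal
hard-decision Viterbi decoder sketched in Python. It makes the same
assumption as the earlier sketch about Monday's parity equations
(p0[n] = x[n] + x[n-1] + x[n-2], p1[n] = x[n] + x[n-2], mod 2), and
the names are my own.

  def step(state, bit):
      """One trellis transition. state = (x[n-1], x[n-2])."""
      x1, x2 = state
      p0 = bit ^ x1 ^ x2              # p0[n] = x[n] ^ x[n-1] ^ x[n-2]
      p1 = bit ^ x2                   # p1[n] = x[n] ^ x[n-2]
      return (bit, x1), (p0, p1)      # next state, expected parity bits

  def viterbi_decode(received_pairs):
      """Hard-decision Viterbi decoding of a list of (p0, p1) pairs."""
      states = [(0, 0), (0, 1), (1, 0), (1, 1)]
      INF = float("inf")
      pm = {s: (0 if s == (0, 0) else INF) for s in states}  # PM[s, 0]
      trellis = []          # surviving (predecessor, bit) for each state
      for r in received_pairs:
          new_pm = {s: INF for s in states}
          pred = {}
          for s in states:
              if pm[s] == INF:
                  continue            # no path reaches s at this time
              for bit in (0, 1):
                  nxt, expected = step(s, bit)
                  bm = sum(e != v for e, v in zip(expected, r))  # BM
                  if pm[s] + bm < new_pm[nxt]:
                      new_pm[nxt] = pm[s] + bm
                      pred[nxt] = (s, bit)
          pm = new_pm
          trellis.append(pred)
      # Trace back from the lowest-path-metric final state.
      state = min(states, key=lambda s: pm[s])
      bits = []
      for pred in reversed(trellis):
          state, bit = pred[state]
          bits.append(bit)
      return list(reversed(bits)), min(pm.values())

  # The received parity bits 00 01 01 10 from the example:
  print(viterbi_decode([(0, 0), (0, 1), (0, 1), (1, 0)]))
  # ([0, 1, 1, 1], 1)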
======================================================================
Viterbi Algorithm: Complexity
======================================================================

What is the complexity of this algorithm? At any given time there are
2^{K-1} paths that we're tracking, i.e., 2^{K-1} most-likely messages
that we're tracking. An alternative way of thinking: at each step i,
we're choosing between the 2 ways to get into each of the 2^{K-1}
states.

Thus, we get L*2^{K-1} complexity, i.e., the time complexity of the
algorithm grows exponentially with constraint length K, but only
linearly with message length L (as opposed to exponentially in L, for
simple enumeration). This is fine; K is often small, whereas L is
often large.

======================================================================
Soft-decision Decoding
======================================================================

Recall our binary-symmetric channel:

  [transmitter] ----- BSC -----> [receiver]

We have actually been eliding one layer in this diagram:

  [encode data][send as voltages] --->
      [receive voltages][convert voltages to bits][decode data]

To transmit the bits, the transmitter takes the encoded data and
sends it across the channel as voltages. For now, you can just assume
that the transmitter sends 0 volts to represent a 0 bit, and 1 volt
to represent a 1 bit (another popular choice is -1 for 0, +1 for 1).
How we get these voltages across the channel is a topic for the
second part of 6.02.

Let's talk about the "convert voltages to bits" part of the
receiver's process. Without noise, the task on the receiver's end
would be easy: for each voltage received, if it's 0 volts, interpret
that as a 0 bit; if it's 1 volt, interpret that as a 1 bit.

But the voltages don't arrive as a sequence of 0 and 1 volts. Because
the channel introduces noise, the received voltages can be corrupted.
For instance, you might receive 0.02 instead of 0, or 0.96 instead of
1. The receiver thus needs to "digitize" the voltage samples into
bits.

Aside: I am still eliding a host of details here. For instance, the
receiver actually receives multiple voltage samples per bit, not just
one. But this is a challenge for another time.

For now, we'll use the following rule: if the received voltage is
>= .5, interpret it as a 1; else, a 0.

You should be wondering why I'm bringing this up at all. Consider the
following sequence of voltages:

  .9999 .9997 .9999 .5001 .9998

By my rule, all of these will be interpreted as 1's. But that 4th
voltage sample is awfully close to being interpreted as a 0 bit. In
some sense, it's not a very "good" 1 bit. When we immediately
digitize voltage samples, we lose information about how "good" the
bit is, and throwing away information is almost never a good idea
when making decisions.

If the Viterbi algorithm had access to these voltages, perhaps it
could redefine its branch metric to indicate, e.g., that the distance
between .999V .999V and 11 is very small, but the distance between
.999V .501V and 11 is greater, despite the fact that both of those
voltage pairs would get digitized to 11.

Not surprisingly, this wasn't a hypothetical exercise. Everything I
have taught you up to this point is really "hard-decision decoding":
we digitize bits first, and then decode. We can also do
"soft-decision decoding", where the Viterbi algorithm is sent the
received voltages instead of the digitized bits.

To redefine our branch metric, we often use the square of the
Euclidean distance between received voltages and expected voltages.

  Ex: for .999,.999 and 11: (.999-1)^2 + (.999-1)^2 = .000002
      for .999,.501 and 11: (.999-1)^2 + (.501-1)^2 = .249002

Aside: there are other metrics we could use. The reason we choose
this one is that it works best for the type of noise we see on most
channels. You'll learn more about this in the coming weeks.

The rest of the Viterbi algorithm is exactly the same.
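Aside: as a Python sketch of that redefined branch metric (the
function name is my own):

  def soft_branch_metric(volts, bits):
      """Squared Euclidean distance between the received voltages and
      the expected voltages (0 volts for a 0 bit, 1 volt for a 1)."""
      return sum((v - b) ** 2 for v, b in zip(volts, bits))

  print(soft_branch_metric([0.999, 0.999], [1, 1]))   # ~0.000002
  print(soft_branch_metric([0.999, 0.501], [1, 1]))   # ~0.249002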
======================================================================
Free Distance (again)
======================================================================

From Monday, recall the free distance: d_free = the smallest weight
among nonzero codewords. In our example, d_free = 5, corresponding to
the path [00] -> [10] -> [01] -> [00]. This means we expect to be
able to correct floor((5-1)/2) = 2 bit errors, as long as they are
"far enough apart".

Looking at the diagram, you can get a better sense of how far "far
enough apart" is for this code. It takes our decoder six output bits
to "settle" back to the [00] state. If we were receiving the all-zero
codeword, we'd be back in the correct state at that point, and ready
to correct more errors.

======================================================================
Puncturing
======================================================================

The rates of convolutional codes are 1/r, where r is the number of
parity bits per message bit. The code we've been working with has
rate 1/2. What happens if we want to have a rate higher than 1/2? Or
a rate between 1/r and 1/(r+1)?

To achieve these types of rates, we can use a technique called
puncturing. In puncturing, the transmitter only sends a subset of the
parity bits, not all of them. This subset is agreed upon ahead of
time by the transmitter and receiver.

For instance, take our example with two parity equations. These
equations result in two streams of bits, which we interleave when we
transmit:

  p0[0]p1[0] p0[1]p1[1] p0[2]p1[2] p0[3]p1[3] p0[4]p1[4] ...

Let's puncture the first stream by skipping every third bit in the
stream:

  p0[0]p1[0] p0[1]p1[1] - p1[2] p0[3]p1[3] p0[4]p1[4] - p1[5]

(I've placed a "-" where no parity bit from p0 is sent). This process
could be described by the puncturing vector 110.

We can puncture the second stream at the same time. Let's use the
vector 101. We get:

  p0[0]p1[0] p0[1] - - p1[2] p0[3]p1[3] p0[4] - - p1[5]

(As a stream, that's just p0[0]p1[0] p0[1]p1[2] p0[3]p1[3]
p0[4]p1[5].)

Three questions to answer:

1. What is the rate now?
2. How does the decoder deal with a punctured stream?
3. How are the error correction properties affected?

Before, our rate was 1/2, or equivalently, six parity bits for every
3 message bits (2 parity bits for every bit). Now we send four parity
bits for every 3 message bits; our rate is 3/4.

On the decoder's end, since it knows which parity bits are missing,
it just skips them; they don't factor into the calculation of branch
metrics. Other than that, the decoding procedure is exactly the same.

Regarding the error correction properties: intuitively, we will
likely be able to correct fewer errors with the punctured code.

One very nice thing about puncturing is that it allows us to use a
single convolutional encoder/decoder regardless of the rate we want.
So if we have already built a convolutional encoder/decoder, but we
want a rate 3/4 code, we don't have to build a different device; we
can use the same implementation and puncture the stream.
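Aside: here is a minimal Python sketch of that puncturing step, using
the vectors 110 and 101 from above (the names and the symbolic bit
labels are my own):

  def puncture(stream, pattern):
      """Keep stream[i] where pattern[i % len(pattern)] is 1; mark the
      skipped positions with None."""
      return [bit if pattern[i % len(pattern)] else None
              for i, bit in enumerate(stream)]

  p0 = ["p0[%d]" % i for i in range(6)]     # first parity stream
  p1 = ["p1[%d]" % i for i in range(6)]     # second parity stream

  p0_punc = puncture(p0, [1, 1, 0])         # puncturing vector 110
  p1_punc = puncture(p1, [1, 0, 1])         # puncturing vector 101

  # Interleave the two streams and drop the punctured positions:
  sent = [b for pair in zip(p0_punc, p1_punc)
            for b in pair if b is not None]
  print(sent)
  # ['p0[0]', 'p1[0]', 'p0[1]', 'p1[2]', 'p0[3]', 'p1[3]',
  #  'p0[4]', 'p1[5]']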
======================================================================
The end of coding
======================================================================

This lecture ends the first part of 6.02! To recap, we discussed:

1. Encoding data efficiently via Huffman codes and LZW compression.
   Huffman codes assume we know the symbol probabilities beforehand,
   and that they are independent; LZW makes no such assumptions.

2. Adding error correcting abilities to our codes via linear block
   codes (e.g., Hamming codes) and convolutional codes.

3. Decoding these two types of codes efficiently, via syndrome
   decoding and Viterbi decoding, respectively.

When we choose how to encode data, we care about multiple things:

- How many bits we're using (we'd like to use fewer)
- How the rate of the code compares to the channel capacity (are we
  getting as much information through the channel as possible?)
- How efficient the encoding and decoding processes are
- The constraints of the system itself (remember with the space probe
  examples, those transmitters could only handle certain block sizes)

What you will learn about in the second part of this course is how we
actually get those bits across a channel, and how we deal with the
types of errors that can manifest there.