6.02 - Convolutional Coding
Lecture #6
Katrina LaCurts, lacurts@mit.edu

======================================================================
Motivation: Space
======================================================================

1. 1962-1967 Early Mariner Probes (Mars, Venus): No ECC. Very slow
   communication (~8 bits/second)

2. 1969-1976 Later Mariner and Viking Probes (Mars, Venus): Linear
   block codes (Hadamard codes; see below)

3. 1971 Mariner 9 (Mars orbit): Also used a bi-orthogonal code

   Mariner 9 needed a code to correct picture transmission errors.
   Error correction was necessary because of a low "signal-to-noise
   ratio" (SNR), which means that it was difficult to discern the
   signal from the noise (you will get lots more practice with the
   notion of an SNR in coming weeks).

   The design and use of the craft put some constraints on the type of
   codes that could be used. Here's the type of system we're working
   with:

   - The central computer and sequencer had an onboard memory of 512
     words (words!)
   - The command system was programmed with fewer than 100 commands
   - Data was stored on a reel-to-reel tape recorder, which could
     store 22.5 MB recorded at 132 kbit/s. Playback could be done at
     up to 16 kbit/s.

   And to communicate:

   - Telecommunications were via dual S-band -- the S-band is part of
     the microwave band of the electromagnetic spectrum -- 10 W/20 W
     transmitters and a single receiver, through the high-gain
     parabolic antenna, the medium-gain horn antenna, or the low-gain
     omni-directional antenna.
     (http://nssdc.gsfc.nasa.gov/nmc/spacecraftDisplay.do?id=1971-051A)

   Coding-wise: data blocks were 6 bits long (i.e., k=6), to hold 64
   grayscale values. The available block length was around 30, due to
   the design of the transmitter. With n=30, they could've done a 5x
   repetition code, but a comparable rate with better error correction
   comes from the (32, 6, 16) Hadamard code: this code consists of the
   all-zeros codeword, the all-ones codeword, and then all other
   codewords have 16 zeros and 16 ones; the complement of each
   codeword is a codeword. In general, these types of codes give us
   (2^{k-1}, k, 2^{k-2}) codes.

   Another factor in using this code -- which was actually used
   through the 1980s -- was the efficient decoding algorithm (it's
   essentially syndrome decoding, but with a few tricks added. See:
   http://www.mcs.csueastbay.edu/~malek/TeX/Hadamard.pdf)

4. (1977) Voyager 1

   Voyager 1 has currently *left the solar system*. So let's think
   more about that transmitter. The 20 W transmitter on Voyager 1 has
   to cover tens of billions of kilometers. We can do some fancy
   things to the hardware itself:

   - Direct the antenna beam to produce the same received intensity as
     an omni-directional antenna radiating ~megawatts.
   - "cryogenically-cooled low-noise amplifiers"

   But additionally: "..data coding and error-correction schemes.
   These systems can collect, detect, lock onto, and amplify a
   vanishingly small signal that reaches Earth from the spacecraft,
   and can extract data from the signal virtually without errors."

   The types of codes that Voyager (and subsequent space probes such
   as Cassini, Mars Exploration Rover, ..) uses are called
   "convolutional codes", and the decoding mechanism they use is
   called "Viterbi decoding". This is what we'll be covering today and
   Wednesday.

5. (1997) Cassini (Saturn)

   Cassini uses a convolutional code with r=1/2 or r=1/6. Using these
   codes alone on the channel between Cassini and the Earth typically
   results in a BER of 1 in 200.
..Which actually isn't so great; this high BER is indicative of fairly
high noise on the channel. However, Cassini's convolutional code is
used in conjunction with a particular type of linear block code called
a Reed-Solomon code. This code is (255, 223, 33). (Aside: now you can
see why, in some cases, the naive method of decoding is going to be
really expensive.)

Cassini first does a Reed-Solomon encoding of the data (also using
interleaving, which you saw on Wednesday), and then convolutionally
encodes the Reed-Solomon symbols. We call this a concatenated code.
The BER of this concatenated code is 1 per million or better.

Even though we won't talk about concatenated codes explicitly, know
this: much of coding is focused on designing codes that approach the
channel capacity (i.e., get as much information across the channel as
they can), but are also efficient to encode/decode. Neither linear
block codes nor convolutional codes alone satisfy both of these
constraints. By concatenating these two types of codes in smart ways,
we can start to achieve both properties. (We're not going to prove any
of that in 6.02. If it whets your appetite for coding, there are
further courses you can take!)

Convolutional codes/Viterbi decoding are used in many terrestrial
applications as well. For instance, most of your cellphones. Also,
digital video broadcasting; Qualcomm decodes about 10^15 bits every
second of every day with Viterbi decoders in digital TVs.

======================================================================
Convolutional Codes
======================================================================

The most obvious differences between convolutional codes and linear
block codes are:

1. Convolutional codes act on message bits streaming into the encoder.
2. We transmit *only* parity bits, not data bits and parity bits. This
   means that convolutional codes are non-systematic codes
   (non-systematic = only parity bits).

The way we will be able to get error-correction properties out of
convolutional codes is by having each message bit affect many
different parity bits (in some sense: a particular message bit will be
involved in many different parity equations).

The code rate of these types of codes tells us how many parity bits
are sent for each message bit. We will be talking mostly about rate
1/r codes (i.e., r parity bits per message bit).

----------------------------------------------------------------------
Easy Example
----------------------------------------------------------------------

Like linear block codes, we describe convolutional codes by giving the
equations for the parity bits. Here is a simple convolutional code:

p_0[n] = x[n] + x[n-1] + x[n-2]
p_1[n] = x[n] + x[n-2]

Imagine the following message:

11001101

Here, x[0] = 1, x[1] = 1, x[2] = 0, etc. This is how we are thinking
of the message as a stream. We're going to assume that there is an
initial part of the stream that is all zeros (I will make this more
formal shortly), and then look at what happens when we get our first
nonzero data bit. So we'll actually imagine the message as

0011001101

where the stream is written oldest bit first, i.e., the first two
positions hold x[-2] and x[-1].

Now, at time n=0, which bits will we use? The window covers
x[-2] x[-1] x[0] = 001. So:

0011001101        p_0[0] = 1 + 0 + 0 = 1
^^^               p_1[0] = 1 + 0     = 1

At n=1, the window covers x[-1] x[0] x[1] = 011:

0011001101        p_0[1] = 1 + 1 + 0 = 0
 ^^^              p_1[1] = 1 + 0     = 1

At n=2, the window covers x[0] x[1] x[2] = 110:

0011001101        p_0[2] = 0 + 1 + 1 = 0
  ^^^             p_1[2] = 0 + 1     = 1

Transmitted sequence: 110101.. (the parity bits)

Etc. You can see that we moved through the message bits with a
"sliding window" of data.
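Here is a minimal Python sketch of that sliding-window computation
(the function name and the exact way it pads with zeros are mine, not
part of the notes); it reproduces the 110101.. sequence above:

    # Sliding-window encoder for this example code:
    #   p_0[n] = x[n] + x[n-1] + x[n-2],  p_1[n] = x[n] + x[n-2]
    def encode_example(message_bits):
        # Prepend K-1 = 2 zeros so x[-1] and x[-2] exist, and append
        # K-1 = 2 zeros to flush the encoder at the end of the message.
        x = [0, 0] + list(message_bits) + [0, 0]
        transmitted = []
        for n in range(2, len(x)):
            p0 = (x[n] + x[n-1] + x[n-2]) % 2
            p1 = (x[n] + x[n-2]) % 2
            transmitted += [p0, p1]   # interleave: p_0[n] p_1[n] p_0[n+1] ...
        return transmitted

    print(encode_example([1, 1, 0, 0, 1, 1, 0, 1]))  # begins 1, 1, 0, 1, 0, 1, ...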
For instance, for this example, when n=2, we used x[0] through x[2].
When n=3, we used x[1] through x[3]. The width of this window (in
bits) is called the code's constraint length K. For this code, K=3.

For transmitting, we send the parity sequences, not the message
itself. When we have multiple parity equations, we interleave the bits
of the parity sequences. E.g., for our example, we send:

p_0[0] p_1[0] p_0[1] p_1[1] p_0[2] p_1[2] ...

Each message bit is "spread across" K elements of each parity
sequence, so the parity sequences give better protection against bit
errors than the message sequence itself. Finally, to finish
transmitting the message, we assume that it ends with all zeroes.

----------------------------------------------------------------------
Decoding
----------------------------------------------------------------------

Let's think about how we would decode a message, assuming that there
was no noise introduced.

p_0[n] = x[n] + x[n-1] + x[n-2]
p_1[n] = x[n] + x[n-2]

Adding these equations, we see that p_0[n] + p_1[n] = x[n-1]. So we
can simply sum the parity bits and decode the message (we'd be one bit
behind in each step).

In the presence of errors in the parity stream, however, the
calculated message bits will end up corrupted at about the same rate
as the parity bits were. So this is not going to work very well. We
will come back to this notion at the end of class, and discuss it at
length on Wednesday.

----------------------------------------------------------------------
Shift-register View
----------------------------------------------------------------------

There are a lot of different ways to visualize/describe convolutional
codes. The first is the "shift-register" view, which actually reflects
how this coding process might be implemented in hardware. For the
example I did above, here is the shift-register view:

          ------------------------------> + ------> p_1[n]
          |                               ^
          |                               |
          |       -------------------     |
 x[n] ----+-----> | x[n-1] | x[n-2] | ----+
          |       -------------------     |
          |            |                  |
          |            v                  |
          +----------> + <----------------+
                       |
                       +-----------------> p_0[n]

x[n-1] and x[n-2] are held in two registers. A new bit, x[n], arrives
at the left. The box computes the parity bits from the incoming bit
x[n] and the K-1 previous message bits. At the end of the iteration,
the saved message bits are shifted right by one, and the incoming bit
moves into the left position.

So, with our message 11001101, we would have the following (each line
shows the incoming bit x[n], the register contents x[n-1]x[n-2] just
before the shift, and the resulting parity bits):

1 --> 00    p0 = 1; p1 = 1
1 --> 10    p0 = 0; p1 = 1
0 --> 11    p0 = 0; p1 = 1
0 --> 01    p0 = 1; p1 = 1
1 --> 00    p0 = 1; p1 = 1

Etc.

At the start of encoding, we assume x[n-1] and x[n-2] contain zero.
When we finish, we add K-1 zeroes to the end of the message, to allow
the registers to "settle", again at zero.

----------------------------------------------------------------------
Generators
----------------------------------------------------------------------

Not surprisingly, we have a notion of a generator for convolutional
codes, just as we had a generator matrix for linear block codes. We
typically speak of a set of generators, where g_i is the generator for
parity bit p_i. In our example:

p_0[n] = x[n] + x[n-1] + x[n-2]  ==>  g_0 = 111
p_1[n] = x[n] + x[n-2]           ==>  g_1 = 101

If we had a third:

p_2[n] = x[n] + x[n-1]           ==>  g_2 = 110

So each element in g_i is 0 or 1. There is a 1 in the j^th position if
x[n-j] is used (we denote the first position as index 0). This relates
directly to our shift-register view, as you can see on the diagram.
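To connect the generator view to the shift-register table above, here
is a small Python sketch (the function name, defaults, and printed
format are mine); it drives the same encoder from the generators
g_0 = 111 and g_1 = 101 and prints the register contents at each step:

    G = [[1, 1, 1],    # g_0:  p_0[n] = x[n] + x[n-1] + x[n-2]
         [1, 0, 1]]    # g_1:  p_1[n] = x[n]          + x[n-2]

    def shift_register_encode(message_bits, generators=G):
        K = len(generators[0])            # constraint length
        window = [0] * K                  # window[j] holds x[n-j]; starts all zero
        out = []
        for bit in list(message_bits) + [0] * (K - 1):   # K-1 zeros to settle
            window = [bit] + window[:-1]  # shift right; new bit enters on the left
            p = [sum(g[j] * window[j] for j in range(K)) % 2 for g in generators]
            print(bit, "-->", "".join(str(b) for b in window[1:]),
                  "  p0 =", p[0], "; p1 =", p[1])
            out.extend(p)                 # interleave p_0[n] p_1[n] ...
        return out

    shift_register_encode([1, 1, 0, 0, 1, 1, 0, 1])   # reproduces the table above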
----------------------------------------------------------------------
Convolution
----------------------------------------------------------------------

With these generators, we can express the parity bits in a different
way. A convolutional code generates sequences of parity bits from
sequences of message bits by a convolution operation:

p_i[n] = sum_{j=0}^{K-1} g_i[j] x[n-j]   (mod 2)

K here is, again, the constraint length. g_i is the K-element
generator for parity bit p_i. For our example:

p_0[n] = sum_{j=0}^{K-1} g_0[j] x[n-j]   (mod 2)
       = sum_{j=0}^{2} g_0[j] x[n-j]     (mod 2)
       = g_0[0]*x[n] + g_0[1]*x[n-1] + g_0[2]*x[n-2]
       = 1*x[n] + 1*x[n-1] + 1*x[n-2]

Note that the input bitstream, as a function of j, is reversed in
time.

An aside: The convolution operation is going to become very important
to us later on in this class, so you should start to get familiar with
it now.

Regarding the constraint length K, you can see that the larger K is,
the more times a particular message bit is used when calculating
parity bits. This leads to greater redundancy, and usually to better
error-correcting capability.

======================================================================
Additional Views
======================================================================

In addition to the shift-register view, there are two additional ways
to view convolutional codes.

----------------------------------------------------------------------
State-machine View
----------------------------------------------------------------------

First we'll start with the state-machine view. Recall the
shift-register view, where we said the values in the registers tell us
the "states" the encoder can be in, i.e., there is a state for each
possible combination of x[n-1] x[n-2]. That means for K=3, there are
2^{K-1} states:

00, 01, 10, 11

Depending on the next bit, we will be able to transition from one
state to another. So for example, if we start with 00, and the next
bit is a 0, we will return to 00. If the next bit is a 1, we will
transition to 10. There is no transition from 00 to 01, or to 11;
those would be impossible.

(Note to students: Drawing the state-machine view was beyond my
ASCII-diagram skills, so please consult the lecture slides for an
illustration of this process)

Now, notice that we have said nothing about what parity equations
we're using yet. We haven't needed them! Every state-machine diagram
for K=3 will have these states and these transitions. The only thing
that will change is what I'm going to add next.

On each transition, we're going to add to the label. Instead of just
x[n], we will also indicate the parity bits that are output.

Ex: On the transition from 00 to 10, we have x[n] = 1,
x[n-1] = x[n-2] = 0. With our example, p_0[n] = 1 + 0 + 0 = 1 and
p_1[n] = 1 + 0 = 1. So we label this edge as "1/11"
(x[n]/p_0[n]p_1[n]).
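Since the slides carry the actual diagram, here is a tiny sketch (my
own illustration, not from the slides) that just enumerates the K=3
state machine for our example code: for each state x[n-1]x[n-2] and
each incoming bit, it prints the next state and the edge label
x[n]/p_0[n]p_1[n]:

    # Enumerate the state machine for
    # p_0[n] = x[n] + x[n-1] + x[n-2] and p_1[n] = x[n] + x[n-2].
    for state in ["00", "01", "10", "11"]:
        x1, x2 = int(state[0]), int(state[1])    # x[n-1], x[n-2]
        for xn in (0, 1):
            p0 = (xn + x1 + x2) % 2
            p1 = (xn + x2) % 2
            next_state = str(xn) + state[0]      # new x[n-1] = x[n]; new x[n-2] = old x[n-1]
            print(state, "--", str(xn) + "/" + str(p0) + str(p1), "-->", next_state)

Reading off the rows for state 00 reproduces the 0/00 self-loop and
the 1/11 edge to state 10 described above.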
In the trellis view, we will only label the transitions with the
parity bits, instead of always including x[n]. We do this because, of
the two arrows exiting each state, the upper arrow always corresponds
to x[n]=0, and the lower arrow always corresponds to x[n]=1.
(Students, again, please see the slides for this illustration)

Our encoding procedure is now this: Beginning at the starting state,
for each message bit, make the appropriate state transition, and send
p_0p_1. Then end the message with K-1 zeros, to ensure return to the
starting state.

Any path through the trellis corresponds to a codeword. We could also
consider the subset of paths that start in the zero state and end at
time step N in the zero state; these would generate codewords of
length N (though a convolutional code is not limited to this, or to a
particular choice of N). The message bits can be read off by noting
whether the path at each stage takes the upper branch (bit 0) or lower
branch (bit 1). And the codeword bits can be read off from the labels
on the traversed path.

Aside: Because the states are binary strings, you will sometimes see
them interpreted as decimal numbers, i.e., "0, 1, 2, 3" instead of
"00, 01, 10, 11". The former can be particularly useful in
implementations.

----------------------------------------------------------------------
Free Distance
----------------------------------------------------------------------

One last thing before we start to talk about the decoding process. The
parity stream forms a linear code (the parity bits are linear
functions of the data bits).

Aside: The fact that parity bits are linear functions of the data bits
is essentially saying that c=dG, which establishes the linearity of
the code. If you want to instead invoke our other characterization of
linearity, note that the sum of two *data* words is a data word,
because all data words are allowed. In view of the convolution
expression for the parity bits, it follows that the sum of two parity
streams is a parity stream (the stream corresponding to the sum of the
two data words). Thanks to George Verghese for the above explanation,
which is much more succinct than what I had originally.

The smallest-weight nonzero codeword has a weight that, locally in
time, plays a role analogous to d (the minimum Hamming distance). It's
called the free distance of the convolutional code; we'll denote it
d_free.

With linear block codes, d told us that we could correct all patterns
of up to floor( (d-1)/2 ) errors in a block. In convolutional codes,
d_free tells us something similar: we can correct
floor( (d_free - 1)/2 ) errors as long as they are "far enough apart".

Why this notion of "far enough apart"? If we carried the
minimum-Hamming-distance interpretation over to convolutional codes
directly, we might conclude that we could correct only
floor( (d_free - 1)/2 ) errors in the entire stream. That is false;
the stream is potentially infinite, and we can certainly correct more
than floor( (d_free - 1)/2 ) errors in total. This is where the notion
of "far enough apart" comes in; what we cannot do is correct more than
floor( (d_free - 1)/2 ) errors in a single burst.

======================================================================
Decoding in the presence of errors
======================================================================

Earlier we discussed a naive way of decoding (in our example, we can
just sum the parity bits to recover the stream), and alluded to the
fact that it won't work well in the presence of errors. What is really
going on at the receiver?
The receiver doesn't have direct knowledge of the transmitter's state
transitions; it only knows the (possibly corrupted) parity bits p_i.
The goal is to find the most likely sequence of transmitter states
that could have generated the received parity bits.

Since we're assuming a BSC with p < 0.5, fewer errors are always more
probable than more errors (e.g., if we're back in our repetition code,
and I receive 00010, it's a lot more likely that the transmitter sent
00000 than 11111). Because of this, the maximum-likelihood message
sequence is the one that generated the codeword (here, the sequence of
parity bits) with the smallest Hamming distance from the received
codeword (again, here, the sequence of parity bits).

So the goal: find the valid codeword closest to the received codeword.
This is maximum-likelihood decoding, and we will learn how to do it
for convolutional codes on Wednesday.
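To make that criterion concrete, here is a brute-force sketch (my own
illustration; this is NOT the Viterbi algorithm we'll see on
Wednesday). It simply tries every possible message of a given length,
encodes each with our example code, and keeps the one whose parity
stream is closest in Hamming distance to what was received. The
message used below is a hypothetical 4-bit example:

    from itertools import product

    def encode(msg):
        # Our example code, g_0 = 111 and g_1 = 101; x[-1] = x[-2] = 0.
        # (For brevity, this sketch does not append the K-1 flushing zeros.)
        x = [0, 0] + list(msg)
        out = []
        for n in range(2, len(x)):
            out += [(x[n] + x[n-1] + x[n-2]) % 2, (x[n] + x[n-2]) % 2]
        return out

    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))

    def brute_force_decode(received, msg_len):
        # Exhaustive maximum-likelihood decoding: tries all 2^msg_len candidates.
        best = min(product([0, 1], repeat=msg_len),
                   key=lambda m: hamming(encode(m), received))
        return list(best)

    sent = encode([1, 1, 0, 1])              # a hypothetical 4-bit message
    received = sent[:]
    received[2] ^= 1                         # flip one parity bit in transit
    print(brute_force_decode(received, 4))   # recovers [1, 1, 0, 1]

The exponential cost of this exhaustive search is exactly what Viterbi
decoding avoids: it finds the same minimum-distance path through the
trellis with work that grows only linearly in the message length.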