Highly Fault-tolerant Parallel Computation
Professor Dan Spielman(MIT)
When the function of a processor is critical, it is common to use
three in place of the one so that even if one fails, its instructions
will be overridden by the other two. Similarly, one could protect
against the failure of k processors by following the instructions of a
majority of 2k+1. But, what if these processors are components of a
large parallel machine? Is it necessary to replicate each processor
2k+1 times to insure against the loss of any k? We show that, in many
cases, one can do better.
The k-wise repetition scheme was studied by Von Neumann in his classic
paper, "Probabilistic Logics and the Synthesis of Reliable
Organisms from Unreliable Components". Elias pointed out
that Von Neumann's results, while a "triumph", had "the character of
pre-information theory results on the reliable transmission of
information over unreliable channels." We use techniques recently
developed to study complexity theory and parallel computation
to construct fault-tolerant computations with the character of
modern information theory--at each time step, the states of our
parallel machines are close to codewords in Reed-Solomon Codes.