Highly Fault-tolerant Parallel Computation

Professor Dan Spielman(MIT)

When the function of a processor is critical, it is common to use three in place of the one so that even if one fails, its instructions will be overridden by the other two. Similarly, one could protect against the failure of k processors by following the instructions of a majority of 2k+1. But, what if these processors are components of a large parallel machine? Is it necessary to replicate each processor 2k+1 times to insure against the loss of any k? We show that, in many cases, one can do better.

The k-wise repetition scheme was studied by Von Neumann in his classic paper, "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components". Elias pointed out that Von Neumann's results, while a "triumph", had "the character of pre-information theory results on the reliable transmission of information over unreliable channels." We use techniques recently developed to study complexity theory and parallel computation to construct fault-tolerant computations with the character of modern information theory--at each time step, the states of our parallel machines are close to codewords in Reed-Solomon Codes.