6.033 - Computer System Engineering Replicated State Machine Assignment

Hands-on 9: Replicated State Machine

Complete the following hands-on assignment. Do the activities described, and submit your solutions using the online submission site by 11:59p.

This assignment asks you to create a simple replicated state machine, focusing on ordering and determinism.

This assignment must be completed on a Unix based OS running Python version 2.6 or later. Athena is a suitable environment.

This assignment involves programming.

I. Warmup

Download and decompress rsm.tgz, a tar file, which contains 3 files: clnt.py, srv.py, and coord.py, and an empty directory: tmp/, and then change into the decompressed directory. You might run the following command to do this:

  $ wget http://web.mit.edu/6.033/www/assignments/rsm.tgz; tar -zxvf rsm.tgz; cd rsm

Read through the Python code in srv.py and clnt.py before continuing.

Open two terminal windows, ensure you are in the rsm folder, and run the following commands in the window indicated:

  top% ./srv.py ./tmp/a
  Listening on port xxxxx ...

  bottom% ./clnt.py xxxxx
  (after a few seconds) ^C
  bottom% cat ./tmp/a

Replace xxxxx with the numbers displayed on your console. ^C means to press control-c.

The server listens on the port specified for transfer requests and stores the balances of the bank accounts in the file ./tmp/a. The client repeatedly chooses two random accounts out of 10 accounts and sends a request to the server to transfer money between the two accounts.

Note: You ran the client and server on the same machine for convenience, but it is simple to run them on different machines: just replace 'localhost' with the DNS name of the remote server.

Now run with two servers and one client. Make sure to pass the port numbers displayed on your console of both servers to the client so that it contacts both servers:

  top% ./srv.py ./tmp/a &
  Listening on port xxxxx ...
  (you may need to press enter to get the next prompt)

  top% ./srv.py ./tmp/b &
  Listening on port yyyyy ...

  bottom% ./clnt.py xxxxx yyyyy
  (after a few seconds) ^C
  bottom% cat ./tmp/a
  bottom% cat ./tmp/b

Question 1: Are the balances in ./tmp/a and ./tmp/b equal? (Make sure to start new servers between exercises.)

Note: You ran both servers on the same machine, but to achieve fault-tolerance you would run them on different machines so that if one machine fails, the other server could keep serving transfer requests. It is easy to run the servers on different machine by again replacing 'localhost' with the DNS name of the remote server.

Now run with two clients:

  bottom% ./clnt.py xxxxx yyyyy &
  bottom% ./clnt.py xxxxx yyyyy
  ^S when first error appears
  ^Q to resume

  bottom% fg

Question 2: The clients are printing error messages. What does the error message mean? Why is it printing an error message? (You need to study clnt.py to answer this question.)

II. Ordering with a primary

To ensure that client requests are ordered in the same order at both servers, we use a third program: coord.py. The idea is to use the coordinator to implement Primary-Backup replication. We run one machine, the primary, with a coordinator and server, and one machine, the backup, with just a server.

Question 3: We have given you the skeleton of the coordinator, and you need to complete it so that clients don't print error messages. A correct run can complete the following script without printing any error messages:

  top% ./srv.py ./tmp/a &
  Listening on port xxxxx ...

  top% ./srv.py ./tmp/b &
  Listening on port yyyyy ...

  top% ./coord.py xxxxx yyyyy
  Listening on port zzzzz ...

  bottom% ./clnt.py zzzzz &
  bottom% ./clnt.py zzzzz &
  bottom% jobs

  (let it run for a while)

  bottom% kill %1
  bottom% kill %2

You might find the following page on how to program with sockets in Python useful. clnt.py shows a good example of how to use client sockets in Python.

III. Handling non-determinism

An important correctness property of the servers is that their state is identical, so that one can always replace a primary server with the backup server. By starting in the same state and applying all operations in the same order, the servers end up in the same state, as long as operations are deterministic. With deterministic we mean that they produce the same result on both servers. Real servers often perform non-deterministic operations, such asking for the current time. If the one server executes the operation a bit later than the one other (e.g., one receives the request earlier than the other), then the time stamps are different, and their final states are different.

Question 4: Modify the coordinator (and srv.py) so that the time stamps on the top of the files ./tmp/a and ./tmp/b are guaranteed to be identical.

When done, submit your answers to the questions and your source code for both srv.py and coord.py to the online submission site.

IV. Handling failures

Optional: Your implementation doesn't handle failures correctly. If a backup server fails, a new server should be started and it should acquire the state from the primary. If a primary fails, the backup should become primary, and a new backup server should be added. Finally, a real system would have a plan to replicate the coordinator. If you have spare time, you can think about how to extend the implementation to handle some of these failures. It turns out doing it correctly is difficult, and if you want to know how to do it right, take 6.824.

Go to 6.033 Home Page