H. Attiya, A. Bar-Noy, and D. Dolev. "Sharing memory robustly in message-passing systems."
Journal of the ACM, 42(1):124-142, Jan. 1995
There are N replicas in a group. A replica group stores a collection of items; for example it might store files or file pages. Each item has an identifier and a state.
Each replica stores the state for each item, plus an extra piece of
information: a version number. The version number is a pair
The idea is that if different replicas store different version numbers for an item, the state associated with a larger version number is more recent than the state associated with a smaller version number.
Reads go to a read quorum of size R and writes go to a write quorum of size W. Furthermore, we require that R+W N, i.e., read quorums always intersect with write quorums. This will ensure that read results always reflect the result of the most recent write (because the read quorum will include at least one replica that was involved in the most recent write).
For example, consider a group of 3 replicas. Then we have the following possibilities:
write (x, s)
s <- read (x)
Here x is an identifier that indicates the item of interest (e.g., the name of a file). The write operation takes the new state that is intended to be stored for that item, and the read operation returns the current state of the item.
The first phase can be avoided if the client knows what version number to provide without having to read it. This will be true if there is just one writer, i.e., one client modifies the item and all the other clients just read it. It can also be true if there is some way that writers coordinate to find out the proper version number without reading it.
Note that if the first phase is needed, it doesn't do any good to choose a small write quorum and a larger read quorum since completing a write will involve communicating with a read quorum.
Usually, the write-back phase will not be needed. The reason is that write requests (in phase 2 of the write operation) are sent to all replicas and in the absence of failures all replicas will record the results of that write. Therefore all the responses in phase 1 of the read operation will return the same version number.
In response to a write request, if the new version number is greater than what the replica has stored, the replica overwrites with the new version number and new state. Otherwise it doesn't overwrite but instead just keeps what it has. In either case it returns an acknowledgement.
Some care is required when a replica recovers: for the protocol to work properly it must be the case that a replica responds to requests only after it knows all modifications that it acknowledged before the failure.
One way to satisfy this requirement is to write the new state for a modified item to disk before sending the acknowledgement. Then on recovery the replica can continue to provide service immediately, provided its disk has not been trashed.
However, if there is a problem with the disk, or if the replica sent acknowledgements before writing to disk, then the replica must obtain all modifications it may have lost. It can do this by reading all items from the other replicas. This read doesn't require a write-back, since the next read done by a client will do the write-back.
The reason the replica must recover its state before responding to client requests is that otherwise it would count in a quorum but the information it is sending might be stale.
For example consider a three-replica system in which both read and write quorums consist of two replicas.
Now consider the following scenario:
Note that if the read had happened while R1 was failed, there wouldn't be a problem because in this case the read would require responses from R2 and R3, and therefore it would return the value s1. Note also that the problem will not occur if R1 recovers its state before responding to client requests.
The exact mechanisms whereby a replica detects that its state is bad, and the way it determines exactly which items it needs to read from other replicas, are beyond the scope of this note.