H. Attiya, A. Bar-Noy, and D. Dolev. "Sharing memory robustly in message-passing systems."
Journal of the ACM, 42(1):124-142, Jan. 1995
There are N replicas in a group. A replica group stores a collection of items; for example it might store files or file pages. Each item has an identifier and a state.
Each replica stores the state for each item, plus an extra piece of
information: a version number. The version number is a pair
The idea is that if different replicas store different version numbers
for an item, the state associated with a larger version number is more
recent than the state associated with a smaller version number.
Reads go to a read quorum of size R and writes go to a write quorum of
size W. Furthermore, we require that R+W N, i.e., read quorums
always intersect with write quorums. This will ensure that read
results always reflect the result of the most recent write (because
the read quorum will include at least one replica that was involved in
the most recent write).
For example, consider a group of 3 replicas. Then we have the
following possibilities:
write (x, s)
Here x is an identifier that indicates the item of interest (e.g., the
name of a file). The write operation takes the new state that is
intended to be stored for that item, and the read operation returns
the current state of the item.
The first phase can be avoided if the client knows what version number
to provide without having to read it. This will be true if there is
just one writer, i.e., one client modifies the item and all the other
clients just read it. It can also be true if there is some way that
writers coordinate to find out the proper version number without
reading it.
Note that if the first phase is needed, it doesn't do any good to
choose a small write quorum and a larger read quorum since completing
a write will involve communicating with a read quorum.
Usually, the write-back phase will not be needed. The reason is that
write requests (in phase 2 of the write operation) are sent to
all replicas and in the absence of failures all replicas will record
the results of that write. Therefore all the responses in phase
1 of the read operation will return the same version number.
In response to a write request, if the new version number is greater
than what the replica has stored, the replica overwrites with the new
version number and new state. Otherwise it doesn't overwrite but instead
just keeps what it has. In either case it returns an acknowledgement.
Some care is required when a replica recovers: for the protocol to
work properly it must be the case that a replica responds to requests
only after it knows all modifications that it acknowledged before the
failure.
One way to satisfy this requirement is to write the new state for a
modified item to disk before sending the acknowledgement. Then on
recovery the replica can continue to provide service immediately,
provided its disk has not been trashed.
However, if there is a problem with the disk, or if the replica sent
acknowledgements before writing to disk, then the replica must obtain
all modifications it may have lost. It can do this by reading all
items from the other replicas. This read doesn't require a
write-back, since the next read done by a client will do the
write-back.
The reason the replica must recover its state before responding to
client requests is that otherwise it would count in a quorum but the
information it is sending might be stale.
For example consider a three-replica system in which both read and
write quorums consist of two replicas.
Now consider the following scenario:
Note that if the read had happened while R1 was failed, there
wouldn't be a problem because in this case the read would require
responses from R2 and R3, and therefore it would return the value
s1. Note also that the problem will not occur if R1 recovers
its state before responding to client requests.
The exact mechanisms whereby a replica detects that its state is bad,
and the way it determines exactly which items it needs to read from
other replicas, are beyond the scope of this note.
How it works
Each client machine runs a client-side proxy that carries out the
replication protocols. The proxy provides two operations that can be
called by user code on the client machine:
s <- read (x)The write operation
A write requires either one or two phases. These phases require
communication with a quorum, and the phase is complete when a quorum
of responses has arrived at the client side.
If there are concurrent writers, the one choosing the largest version
number will prevail. The protocol doesn't provide any
synchronization; instead this would be done outside, e.g., by using
locks.The Read Operation
To perform a read requires either one or two phases.
The second phase of read is called the "write-back" phase. It is
needed when the read is concurrent with a write, in order to ensure
condition (b) below. It is also useful to handle the case of a client
that fails in the middle of a write, since it effectively completes
that write.Protocol at the replicas
In response to a read request for some item, the replica returns the
value with its version number (or just the version number if this is
what the client wants).Discussion
This protocol provides atomicity.
Another name for this condition is linearizability.Recovery
This protocol has the property that when a replica fails (or is
inaccessible because of a network problem), operations continue to
work properly. If the needed quorum of replicas is available,
operations complete; otherwise their execution will be delayed until
the failure or inaccessibility is over.
At this point a read might complete with a quorum consisting of
replicas R1 and R3, providing the user with the state s0, which is
wrong.