Notes on keeping version histories of files
-------------------------------------------

One of the lowest-level tasks of a version control system (such as
RCS, SCCS, or CVS) is to maintain an archive of the changing contents
of an individual file.  For systems like RCS, this is most of the
system's job, while for project-oriented version control systems the
task is dwarfed by the larger bookkeeping problems of versioning
directories full of files.

This document will analyze three methods of keeping such an archive.

METHOD #1: Plain texts of each revision

This is the baseline method; it is not commonly used.  Whenever the
contents of a file change, a complete new plain text is stored.  The
time it takes to reconstruct any revision is proportional to the size
of that revision's contents, assuming the archive is indexed
efficiently.  The space taken by the archive is proportional to the
sum of the sizes of all revisions of the file.
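
To make the baseline concrete, here is a minimal Python sketch of such
an archive; an in-memory list stands in for whatever indexed on-disk
layout a real system would use:

  class PlainTextArchive:
      """Method #1: store the complete text of every revision."""

      def __init__(self):
          self.revisions = []    # revisions[i] is the full text of revision i

      def commit(self, text):
          """Store a complete new plain text; space grows by its full size."""
          self.revisions.append(text)
          return len(self.revisions) - 1

      def checkout(self, rev):
          """Retrieval cost is proportional to the size of that one revision."""
          return self.revisions[rev]

  archive = PlainTextArchive()
  archive.commit("foo\nbar\n")
  archive.commit("bar\nbaz\n")
  assert archive.checkout(0) == "foo\nbar\n"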

It may be that, as disk space grows cheaper, this method will become
popular due to its simplicity.

METHOD #2: The weave

The weave method is used by SCCS and Bitkeeper.  It was designed with
line-based text files in mind, although nothing prevents the method
from being applied to binary files.  The archive contains all
the lines of text present in any revision of the file, all "woven"
together, with interspersed control instructions indicating which
lines are included in which revisions of the file.  For instance, in
SCCS, the following weave represents the contents of a file containing
the two lines "foo" and "bar" in the first revision and the two lines
"bar" and "baz" in the second revision ("^A" denotes a control-A
character):

  I 1
  D 2
  foo
  E 2
  bar
  I 2
  baz
  E 2
  E 1
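
To show how retrieval works, here is a rough Python sketch of extracting
one revision from a weave like the one above.  The weave is pre-parsed
into control tuples and text lines (standing in for the ^A control
lines), and the sketch assumes a linear history with properly nested
blocks, ignoring the branching and include/exclude machinery real SCCS
has; the point is that a single pass over the whole archive yields any
revision:

  def extract(weave, rev):
      """Extract the text of `rev` from a simplified SCCS-style weave.

      `weave` is a list whose items are either plain text lines or control
      tuples ("I", n), ("D", n), ("E", n).  This is a simplification of
      real SCCS, which also handles branches and include/exclude lists.
      """
      out = []
      blocks = []                      # stack of open ("I"/"D", revision) blocks
      for item in weave:
          if isinstance(item, tuple):
              op, n = item
              if op == "E":
                  blocks.pop()         # close the innermost open block
              else:
                  blocks.append((op, n))
          else:
              # A line belongs to `rev` if every enclosing insert block was
              # created at or before `rev`, and no enclosing delete block was.
              inserted = all(n <= rev for op, n in blocks if op == "I")
              deleted = any(n <= rev for op, n in blocks if op == "D")
              if inserted and not deleted:
                  out.append(item)
      return out

  # The weave shown above, with ("I", 1) standing for "^AI 1" and so on:
  weave = [("I", 1), ("D", 2), "foo", ("E", 2), "bar",
           ("I", 2), "baz", ("E", 2), ("E", 1)]
  assert extract(weave, 1) == ["foo", "bar"]
  assert extract(weave, 2) == ["bar", "baz"]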

The time it takes to reconstruct any revision is proportional to the
size of the archive.  The archive size is difficult to characterize,
but in general it grows with each new revision proportionally to the
size of the changed lines in that revision relative to previous
revisions.

The weave method has the advantage of uniform retrieval time for all
revisions of a file.  Unfortunately, it does not appear to be the
subject of much research, it is difficult to implement, and there are
few, if any, open-source implementations of the weave to refer to.  As
such, it is not widely used.

METHOD #3: Deltas

Deltas are used by RCS, CVS, PRCS, arch, Subversion, and other version
control systems.  There are many variations on this method, but in
general, the plaintext of the most recent revision is stored, and
prior revisions are expressed in terms of their difference relative to
the succeeding revision.  For instance, in RCS, the following
represents the contents of a file (as above) containing the two lines
"foo" and "bar" in the first revision and the two lines "bar" and
"baz" in the second revision:

  1.2
  text
  @bar
  baz
  @

  1.1
  text
  @a0 1
  foo
  d2 1
  @

The most recent revision is expressed in plain text, while the
previous revision is expressed as a delta (in this case, a text-format
delta which uses append and delete commands to express changed lines).
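
For illustration, here is a rough Python sketch that applies such an
edit script, taking just the bare append/delete commands and leaving
aside the @-quoting and the rest of the ,v file syntax:

  def apply_rcs_delta(source_lines, script_text):
      """Apply an RCS-style edit script and return the resulting lines.

      The script uses the two RCS edit commands:
        aN K  --  append the next K script lines after line N of the source
        dN K  --  delete K lines starting at line N of the source
      Line numbers are 1-based and refer to the unmodified source, so an
      offset tracks how earlier edits have shifted positions in the output.
      """
      out = list(source_lines)
      script = script_text.splitlines()
      offset = 0
      i = 0
      while i < len(script):
          cmd = script[i][0]
          pos, count = map(int, script[i][1:].split())
          i += 1
          if cmd == "a":
              out[pos + offset:pos + offset] = script[i:i + count]
              offset += count
              i += count
          elif cmd == "d":
              del out[pos - 1 + offset:pos - 1 + offset + count]
              offset -= count
      return out

  # Rebuild revision 1.1 from the plain text of revision 1.2, as above:
  assert apply_rcs_delta(["bar", "baz"], "a0 1\nfoo\nd2 1") == ["foo", "bar"]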

Similarly to the weave, the archive size grows with each new revision
in proportion to the size of the changed lines relative to the
previous revision.  (In contrast to the weave, however, if a line
disappears in one revision and reappears in the next, it will
typically be expressed twice in the archive.)

To reconstruct an old revision, we must (at least, naively) take the
most recent revision and apply a series of deltas to it.  The time it
takes to apply a delta is proportional to the size of the source plus
the size of the delta; applying N deltas, then, takes time
proportional to the size of the deltas plus N times the average size
of the file across all those revisions.
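
In code (reusing the sketch above), naive reconstruction is just a loop,
and the file contents are reworked once per delta applied:

  def reconstruct(head_lines, delta_chain):
      """Rebuild an old revision by applying each delta in the chain in turn.

      `delta_chain` holds the edit scripts leading from the newest revision
      back to the desired one; the cost is the total size of the deltas plus
      one pass over the (shifting) file contents per delta applied.
      """
      lines = list(head_lines)
      for script in delta_chain:
          lines = apply_rcs_delta(lines, script)    # from the sketch above
      return lines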

The second term of the above expression may grow unreasonably large
for an old revision of a big file.  For instance, if a change log file
grows from zero to one million bytes over the course of one thousand
revisions, then roughly five hundred million bytes will have to be
processed in order to reconstruct the oldest version of the file.

In spite of this performance weakness, the delta approach enjoys more
popularity than the weave approach, partly because there has been
extensive published research into delta algorithms and there are many
open source delta implementations to use or refer to.  Also,
reconstructing the most recent revision is faster in the delta
approach, and that is usually a much more common case than examining
an old revision.

Fortunately, there are two ways to improve on the performance of the
delta approach without abandoning it entirely: delta combination and
skip-deltas.

Delta combination seeks to reduce the time it takes to apply a series
of deltas, by combining all of the deltas together into one big delta
before looking at the source data.  If a million-byte file has
undergone a hundred small changes, then the million bytes will only
need to be processed once, instead of being copied over a hundred
times with minor changes each time.  Delta combination alone brings
the performance of the delta method in line with the weave method,
while retaining quick reconstruction of the most recent revisions.
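
Here is a rough Python sketch of the idea.  It uses a hypothetical
copy/insert delta representation rather than RCS's append and delete
commands, and it composes deltas line by line where a production
implementation would work on ranges, but it shows the essential
property: combining never reads the head text at all, so the full file
contents are processed only once, by the final apply:

  def apply_delta(source_lines, delta):
      """A delta is a list of ("copy", start, length) and ("insert", lines)
      instructions that build an older text out of a newer one."""
      out = []
      for op in delta:
          if op[0] == "copy":
              out.extend(source_lines[op[1]:op[1] + op[2]])
          else:
              out.extend(op[1])
      return out

  def combine(outer, inner):
      """Combine two deltas into one: `inner` builds MID from HEAD, `outer`
      builds OLD from MID; the result builds OLD directly from HEAD.

      This sketch expands `inner` into a per-line origin map, which is
      simpler but cruder than the range arithmetic a real system would use.
      """
      # Where does each line of MID come from?  Either an index into HEAD,
      # or a literal line carried inside `inner`.
      origin = []
      for op in inner:
          if op[0] == "copy":
              origin.extend(("copy", i) for i in range(op[1], op[1] + op[2]))
          else:
              origin.extend(("lit", line) for line in op[1])

      combined = []

      def emit(kind, value):
          if kind == "copy":
              last = combined[-1] if combined else None
              if last and last[0] == "copy" and last[1] + last[2] == value:
                  combined[-1] = ("copy", last[1], last[2] + 1)   # extend run
              else:
                  combined.append(("copy", value, 1))
          else:
              if combined and combined[-1][0] == "insert":
                  combined[-1][1].append(value)
              else:
                  combined.append(("insert", [value]))

      for op in outer:
          if op[0] == "copy":
              for kind, value in origin[op[1]:op[1] + op[2]]:
                  emit(kind, value)
          else:
              for line in op[1]:
                  emit("lit", line)
      return combined

  # HEAD is revision 3; delta_2 rebuilds revision 2 from it, and delta_1
  # rebuilds revision 1 from revision 2.  Combining the deltas first gives
  # the same result as applying them one after the other.
  head = ["alpha", "beta", "gamma"]
  delta_2 = [("insert", ["intro"]), ("copy", 0, 2)]
  delta_1 = [("copy", 1, 1), ("insert", ["omega"])]
  assert apply_delta(head, combine(delta_1, delta_2)) == \
         apply_delta(apply_delta(head, delta_2), delta_1)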

Skip-deltas seek to drastically limit the number of delta applications
required to reconstruct an old revision of the file.  Instead of
expressing each old revision as a delta against the succeeding
revision, some are expressed against revisions farther ahead.  A file
with ten revisions (numbered 0-9) would be represented with:

  Revision 0 expressed as a delta against revision 8
  Revision 1 expressed as a delta against revision 2
  Revision 2 expressed as a delta against revision 4
  Revision 3 expressed as a delta against revision 4
  Revision 4 expressed as a delta against revision 8
  Revision 5 expressed as a delta against revision 6
  Revision 6 expressed as a delta against revision 8
  Revision 7 expressed as a delta against revision 8
  Revision 8 expressed as a delta against revision 9
  Revision 9 expressed in plain text

Or, graphically:

  0 --------------------------------------------> 8 --> 9
                          4 --------------------> 8
              2 --------> 4           6 --------> 8
        1 --> 2     3 --> 4     5 --> 6     7 --> 8
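
The table above does not state a general rule for choosing these
targets, but one rule consistent with it is: express revision R against
R plus the value of R's lowest set bit, capped at the newest revision,
with revision 0 expressed against the largest power of two not
exceeding the newest revision.  A rough Python sketch of that assumed
rule:

  def skip_delta_target(rev, head):
      """Return the revision that `rev` should be expressed against, or None
      if `rev` is the head and is stored as plain text."""
      if rev == head:
          return None
      if rev == 0:
          target = 1
          while target * 2 <= head:
              target *= 2                      # largest power of two <= head
          return target
      return min(rev + (rev & -rev), head)     # rev plus its lowest set bit

  # Reproduces the ten-revision example above:
  assert [skip_delta_target(r, 9) for r in range(10)] == \
         [8, 2, 4, 4, 8, 6, 8, 8, 9, None]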

With skip-deltas, the number of deltas required to reconstruct any
revision of the file is proportional to the base-2 logarithm of the
number of revisions in the file.  If a file has a thousand revisions,
reconstructing even the oldest revisions requires only about ten delta
applications, starting from the plain text at revision 999, rather than
the hundreds that a plain chain of single-revision deltas would need.

Skip-deltas come with a space penalty; the average delta runs across
lg(N) revisions instead of one.  In practice, the space cost is
generally much less than a factor of lg(N).

Skip-deltas also come with a penalty in creation time.  Whenever a new
revision is added, an average of two previous revisions must be
re-expressed against the new plain text, instead of just one.  In the
worst case, the base-2 logarithm of the current number of revisions
must be re-expressed.

The space and time penalty can be reduced by not applying the
skip-delta technique to the first few revisions, by skipping the top
row (which only helps revision 0), or by skipping some number of rows
close to the bottom.  Any of these variations can produce notable
savings for practical data sets, while only increasing the number of
deltas needed for reconstruction by a constant number.

REFERENCES

RCS		http://www.gnu.org/software/rcs/rcs.html
SCCS		http://www.cvshome.org/cyclic/cyclic-pages/sccs.html
PRCS		http://prcs.sourceforge.net
Bitkeeper	http://www.bitmover.com
arch		http://www.regexps.com/arch.html
Subversion	http://subversion.tigris.org

The technique of delta combination is currently implemented in
Subversion, and is either planned for or implemented in PRCS.  The
author does not know of papers or other references describing the
technique.

The technique of skip-deltas is implemented in Subversion.

AUTHOR

This document was written by Greg Hudson <ghudson@mit.edu> on
2002-06-24.  It was last updated on 2002-10-03.