A response to Tom Lord's "Diagnosing svn"
by Greg Hudson, 2004-01-28
-----------------------------------------

In February of 2003, Tom Lord wrote a missive entitled "Diagnosing
svn" to the arch-users mailing list, in which he attempted to explain
why Subversion is--from the perspective of the target audience--a
failure.  Since the mailing list archives from that period are not
currently available, I have placed a reference copy at:

  http://web.mit.edu/ghudson/thoughts/diagnosing

For some readers with only a passing familiarity with Subversion, the
missive struck a chord and convinced them that Subversion is a failure
when they had not been convinced before.  But much of its content is
inaccurate, or is combined with a controversial axiom in order to
arrive at the conclusion that Subversion went about things the wrong
way.

The claims I want to address are:

  1. Meta-CVS and DCVS are better at shoring up CVS than svn is.
  2. SVN developers had some big hammers and were looking for nails.
  3. SVN has an underdeveloped notion of version control.
  4. SVN erroneously used APIs to hide implementation issues.
  5. SVN developers were incapable of taking a step back.
  6. SVN unwisely layered onto Apache and DAV.

I'll add in one point of agreement:

  7. SVN should not have used Berkeley DB for its back end.

1. Meta-CVS and DCVS

  "People have recently said things here along the lines of 'svn
  fails to significantly improve upon CVS and, to the degree it
  does, meta-CVS and dcvs do the same job in a better way' (I pretty
  much agree)"

Meta-CVS implements directory structure versioning using CVS as an
underlying layer, much as CVS originally used (and in some sense,
still uses) RCS as an underlying layer.  DCVS is a reimplementation of
a CVS server.  From a practical perspective, neither appears to have a
significant user or developer community, so let's evaluate this claim
from the theoretical perspective.

Meta-CVS's layering strategy reduces initial development time, but it
does not achieve any kind of meaningful compatibility with the
installed base, it does not migrate away from the CVS code base (which
is widely acknowledged to be awful), and it cannot achieve whole-tree
versioning or failure-atomic commits.  Moreover, the layering strategy
almost certainly introduces puzzling failure modes where a CVS error
message is presented in response to a Meta-CVS command.

By using the CVS client, DCVS achieves compatibility with the CVS
installed base, but is hamstrung both by the CVS network protocol
(which is also widely acknowledged to be awful) and by the client's
operational model, which assumes that all versioning is performed by
file pathname and there are no directory or tree versions.  So, while
it can accomplish atomic commits and other improvements over CVS, it
cannot accomplish proper directory structure versioning.

Does Subversion significantly improve upon CVS?  That depends on what
you most dislike about CVS.  If it's the lack of directory structure
versioning, the inability to see what has changed about a whole
directory tree, the confusing and inefficient branch support, or a
variety of implementation flaws, then Subversion may seem exciting and
fresh.  If it's a lack of support for highly distributed development
or history-sensitive merging, or a fundamental dislike of the
operational model, then Subversion in its current form may appear
humdrum and derivative.  Regardless, Meta-CVS and DCVS are not better
paths to the same goal.

2. Big hammers

Tom claims that Subversion developers had two big hammers in
search of nails: the transactional filesystem and the associative
lookup table.

  "So here's the first mistake: the idea of a transactional FS is like
  a shiny new hammer.  It's pretty natural to let it possess you and
  start running around looking for nails."

The problem with this claim is that a transactional FS, with a few
annotations, is exactly the right hammer for version control as
conceived of by Subversion ("taking snapshots of trees").  As much as
one might fervently believe that this is the wrong conception of
version control, it's a workable and very intuitive conception.  See
the next section for details.
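
To make that intuition concrete, here is a minimal sketch of "taking
snapshots of trees" in plain Python, with a list of dicts standing in
for the transactional filesystem; the names here are illustrative and
do not come from Subversion's code:

  # A toy model of version control as whole-tree snapshots.  Each
  # revision maps every path to its content; a commit atomically
  # publishes a complete new snapshot.

  class ToyRepos:
      def __init__(self):
          self.revisions = [{}]        # revision 0: the empty tree

      def begin_txn(self):
          # Start from a copy of the youngest tree; unchanged entries
          # are shared by reference, much as a transactional FS shares
          # unchanged nodes between revisions.
          return dict(self.revisions[-1])

      def commit(self, txn):
          # The append is the atomic step: readers see either the old
          # youngest revision or the new one, never a mixture.
          self.revisions.append(txn)
          return len(self.revisions) - 1   # the new revision number

  repos = ToyRepos()
  txn = repos.begin_txn()
  txn["trunk/README"] = "hello"
  rev1 = repos.commit(txn)             # failure-atomic, whole-tree

  txn = repos.begin_txn()
  txn["trunk/README"] = "hello, world"
  rev2 = repos.commit(txn)
  assert repos.revisions[rev1]["trunk/README"] == "hello"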

  "Application of patterns like property lists in a design bull
  session all too easily gives rise to the feeling that 'all the
  problems we're thinking about have natural solutions in this design'
  even though all you're really saying is 'the problems we need to solve
  can be expressed in terms of associative lookup'."

There may be some truth to this claim as it applies to Subversion's
original design, but the resulting Subversion implementation makes
very limited use of property lists, so it has limited applicability to
the question of whether Subversion goes about version control in the
right way.

Tom's general assertion that Subversion "underwent fuzzy design
conceptualization" appears to be a combination of two factors: (1) an
unfair comparison between what has gone on inside his head in the
development of Arch (most of which is not written down) and what was
publicly written down during Subversion's development and read by
him, and (2) his belief that ambitious merge support is a fundamental
and central feature of modern version control.  Since Subversion's
design was (like most software design) not written down with
mathematical precision, and since Subversion made an early decision to
put off history-sensitive merging until the future, he concludes that
the entire design process was "fuzzy."

3. An under-developed notion of version control

  "Suppose you have the same intuition that Walter expressed a while
  back, which I'll paraphrase as:  'The first and most fundamental
  task of a revision control system is to take snapshots of working
  directories.'

  If you don't believe that that's a seductive (even though wrong)
  intuition, go back and look at how I replied.  It took many, quite
  abstract paragraphs.  What revision control is really about
  (archival, access, and manipulation of changesets) is subtle and
  _non_-intuitive."

This is a subtle and important point, one which divides the
centralized or tree-oriented version control systems (Perforce,
Clearcase, CVS, Subversion) from the changeset-oriented ones
(Bitkeeper, Arch).  A full treatment of this issue could fill multiple
journal articles, but one should recognize that it is an issue with
two sides:

  * Changeset-oriented version control is more powerful, but it is
    power which is largely unnecessary in all but the most chaotic of
    development projects.

  * Changeset-oriented version control is harder to learn.  In many
    environments, a shallow learning curve is the most important
    feature of a version control system.

  * Changeset-oriented version control is hard to get right.  Perhaps
    the best support for this statement can be found in a March 2003
    note from Larry McVoy to the linux-kernel list:

      http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/0130.html

  * Changeset-oriented version control can be built on top of a
    tree-oriented foundation, although it will have all the
    disadvantages listed above.  As Tom himself notes, tree-oriented
    storage is a dual to changeset-oriented storage.  svk
    (http://svk.elixus.org/) serves as a working prototype of
    changeset-oriented version control implemented on top of
    Subversion; a toy sketch of the duality follows this list.
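
As a toy illustration of that duality (plain Python; this is not
svk's actual implementation), a changeset is just the difference
between two tree snapshots, and applying it to the older snapshot
reproduces the newer one:

  # Diffing two snapshots yields a changeset; applying the changeset
  # to the older snapshot reconstructs the newer one.

  def diff(old, new):
      """Derive a changeset (adds/modifies/deletes) from two trees."""
      changes = {}
      for path in set(old) | set(new):
          if old.get(path) != new.get(path):
              changes[path] = new.get(path)   # None means "deleted"
      return changes

  def apply_changeset(tree, changes):
      result = dict(tree)
      for path, content in changes.items():
          if content is None:
              result.pop(path, None)
          else:
              result[path] = content
      return result

  rev1 = {"trunk/README": "hello", "trunk/Makefile": "all:"}
  rev2 = {"trunk/README": "hello, world"}
  cs = diff(rev1, rev2)            # README modified, Makefile deleted
  assert apply_changeset(rev1, cs) == rev2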

4. Use of APIs

  "When you lack confidence about your intended way to implement
  something, a common pattern is to decide to hide the implementation
  under an API.   That way you can always change the implementation
  later, right?"

  "Sixth mistake: assuming that defining APIs all over the place 
  would improve chances for success."

This claim is particularly puzzling to me, because from my
perspective, Subversion's use of APIs is largely about allowing use of
Subversion's functionality by third-party applications, and has very
little to do with plastering over implementation flaws.  In the few
cases where Subversion envisioned multiple implementations of an
API--the editor, reporter, repository access, and filesystem
APIs--there are at least two dramatically different implementations of
all but the filesystem API.
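
For the flavor of what such an API buys: a single producer drives
tree changes through a fixed set of callbacks, and dramatically
different consumers can implement those callbacks.  The sketch below
is a loose Python analogue of the editor idea; the names are invented
here and are not the actual C interface:

  # One producer, many consumers: the essence of an editor-style API.

  class TreeEditor:
      def add_file(self, path, content): ...
      def delete_entry(self, path): ...
      def close_edit(self): ...

  class ApplyToDict(TreeEditor):
      """Consumer #1: materialize the changes into a dict 'tree'."""
      def __init__(self, tree):
          self.tree = tree
      def add_file(self, path, content):
          self.tree[path] = content
      def delete_entry(self, path):
          self.tree.pop(path, None)
      def close_edit(self):
          pass

  class Summarize(TreeEditor):
      """Consumer #2: just describe the changes (a dry run)."""
      def add_file(self, path, content):
          print("A", path)
      def delete_entry(self, path):
          print("D", path)
      def close_edit(self):
          print("done")

  def drive(editor):
      # The producer is oblivious to which consumer it is driving.
      editor.add_file("trunk/NEWS", "initial")
      editor.delete_entry("trunk/TODO")
      editor.close_edit()

  drive(Summarize())
  tree = {"trunk/TODO": "old"}
  drive(ApplyToDict(tree))
  assert tree == {"trunk/NEWS": "initial"}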

5. Taking a step back

  "If a team went away for six months and came back with SVN as it
  works today, I think it'd be pretty easy to say: 'That's a great and
  even useful prototype.  It definately proves, at least
  schematically, the idea of using a transactional FS to back
  revision control.  There are clearly some bad choices in the
  implementation, and clearly some important neglected revision
  control issues that competing projects are starting to leapfrog you
  over.  And there's a heck of a lot of code here for what it does.
  Let's suspend development on it for a little while, and invest in
  a design effort to see how to get this thing _right_'."

Much of this claim dates to a series of Subversion mailing list
exchanges in 2002 and early 2003 where Tom said pretty much just this.
His comments failed to persuade, mostly for two reasons: they were
too abstract (in particular, they tended to assume a lot of
familiarity with Arch, at a time when it was not easy to rapidly gain
such familiarity), and they appealed to priorities
which were not shared, such as a belief in the high importance of
ambitious, history-sensitive merging.  One could make a similar
argument about Arch "failing to fail" by asserting that
Unix-specificity and a steep learning curve are fundamental and
insurmountable obstacles.  Based on the current size of their user and
developer communities, neither argument holds; both projects have been
able to achieve some acceptance in spite of their shortcomings.

More generally, though, the tendency of software projects to have a
"failure to fail" is not necessarily a result of "crappy
socio-economic circumstances" but simply an acceptance of the fact
that no large software project can be perfect, and only the ones which
make a drive for stability and acceptance, in spite of their flaws,
can achieve success.  Two essays on this topic:

  http://www.joelonsoftware.com/articles/fog0000000069.html
  http://www.neilgunton.com/rewrites_harmful/

6. Jumping on the W3C bandwagon

  "SVN came into being at a time when it looked to many like HTTP and
  Apache were the spec and best implementation of the new distributed
  OS for the world that would solve everything beautifully.  There
  was a kind of dog-pile onto everything W3C based on irrational
  exhuberence.  Well, they weren't that OS and they don't solve
  everything beautifully."

There's some truth here, but it's greatly exaggerated.  There was no
dog-pile onto "everything W3C"; Subversion uses basic XML in various
places, and uses the Apache httpd implementation and the HTTP, WebDAV,
and (to some extent) DeltaV standards in one of its three repository
access methods.  Although the use of these components has not been
without difficulty, it has also not been without benefit.  WebDAV
compatibility allows desktop users to mount Subversion repositories as
file shares and even modify them transparently with "auto-versioning".
The use of Apache and Neon has allowed for relatively easy access to a
limited but useful suite of authentication and authorization methods.

At any rate, with the exception of the use of XML in the working copy
library, it's easy to avoid all of that stuff by using svnserve, which
is also faster and simpler than the HTTP access method.

7. A point of agreement: Berkeley DB

  "Wanting to make progress simply and quickly, they spotted the
  Berekeley DB library.   After all: it provides transactions with
  ACID properties for our favorite handwavy design tool -- the
  associative lookup table."

  "Well, I think Berkeley DB is a lousy choice for this application.
  It creates administrative headaches, and it's optimized for simple
  associations, not hierarchical filessytems.  It doesn't natively
  provide any sort of delta-compression -- you'll have to layer that.
  Ultimately _all_ that it buys you is transactions and locks --
  every other aspect is a force-fit."

These statements are not completely true; in addition to transactions
and locks, Berkeley DB also provides a convenient way to have many
associations with relatively low overhead.  And as noted before,
Subversion developers have had no particular fascination with the
associative lookup table, so the "handwavy" comment is hardly
charitable.
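
For concreteness, "layering delta-compression" over an associative
table amounts to something like the following toy sketch (plain
Python, with a dict standing in for Berkeley DB and difflib standing
in for a real binary-delta algorithm):

  import difflib

  # The table maps (path, rev) keys to either a full text or a
  # delta against the previous revision of the same path.
  store = {}

  def put(path, rev, text):
      if rev == 0:
          store[(path, rev)] = ("full", text)
      else:
          prev = get(path, rev - 1)
          delta = list(difflib.ndiff(prev.splitlines(True),
                                     text.splitlines(True)))
          store[(path, rev)] = ("delta", delta)

  def get(path, rev):
      kind, data = store[(path, rev)]
      if kind == "full":
          return data
      # Reconstruct by replaying the delta on the prior revision.
      return "".join(difflib.restore(data, 2))

  put("README", 0, "hello\n")
  put("README", 1, "hello\nworld\n")
  assert get("README", 1) == "hello\nworld\n"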

But by and large, the point is valid: as long as Subversion's back end
is tied to Berkeley DB, its reliability will suffer.  You can run out
of locks.  An improperly terminated operation can cause the database
to require recovery.  Access to the database from multiple Unix
accounts requires proper umasks or the database appears to become
corrupted.  The database will not work properly over a remote
filesystem.  Moving the database from one platform to another results
in cryptic failures.

Fortunately, it should be possible to replace Subversion's back end
without changing any other part of Subversion, even the server-side
repository logic.

8. Conclusion

The target audience of Tom's mail was Arch users, who can be expected
to generally agree with Tom's priorities and therefore find Arch more
interesting than Subversion.  In that context, much of his "diagnosis"
makes sense, at least where it is not inaccurate.  But the casual
reader should not conclude that Subversion is some kind of colossal
failure; it has the largest development and user community of any free
CVS replacement, and its users generally have positive things to say
about it.