A response to Tom Lord's "Diagnosing svn"
by Greg Hudson, 2004-01-28
-----------------------------------------

In February of 2003, Tom Lord wrote a missive entitled "Diagnosing svn" to the arch-users mailing list, in which he attempted to explain why Subversion is--from the perspective of the target audience--a failure. Since the mailing list archives from that period are not currently available, I have placed a reference copy at:

  http://web.mit.edu/ghudson/thoughts/diagnosing

To some readers with only a passing familiarity with Subversion, it struck a chord, and convinced them that Subversion is a failure when they were not convinced before. But much of its content is either inaccurate or combined with a controversial axiom in order to arrive at the conclusion that Subversion went about things the wrong way.

The claims I want to address are:

  1. Meta-CVS and DCVS are better at shoring up CVS than svn is.
  2. SVN developers had some big hammers and were looking for nails.
  3. SVN has an underdeveloped notion of version control.
  4. SVN erroneously used APIs to hide implementation issues.
  5. SVN developers were incapable of taking a step back.
  6. SVN unwisely layered onto Apache and DAV.

I'll add in one point of agreement:

  7. SVN should not have used Berkeley DB for its back end.

1. Meta-CVS and DCVS

"People have recently said things here along the lines of 'svn fails to significantly improve upon CVS and, to the degree it does, meta-CVS and dcvs do the same job in a better way' (I pretty much agree)"

Meta-CVS implements directory structure versioning using CVS as an underlying layer, much as CVS originally used (and in some sense, still uses) RCS as an underlying layer. DCVS is a reimplementation of a CVS server. From a practical perspective, neither appears to have a significant user or developer community, so let's evaluate this claim from the theoretical perspective.

Meta-CVS's layering strategy reduces initial development time, but it does not achieve any kind of meaningful compatibility with the installed base, it does not migrate away from the CVS code base (which is widely acknowledged to be awful), and it cannot achieve whole-tree versioning or failure-atomic commits. Moreover, the layering strategy almost certainly introduces puzzling failure modes where a CVS error message is presented in response to a Meta-CVS command.

By using the CVS client, DCVS achieves compatibility with the CVS installed base, but is hamstrung both by the CVS network protocol (which is also widely acknowledged to be awful) and by the client's operational model, which assumes that all versioning is performed by file pathname and that there are no directory or tree versions. So, while it can accomplish atomic commits and other improvements over CVS, it cannot accomplish proper directory structure versioning.

Does Subversion significantly improve upon CVS? That depends on what you most dislike about CVS. If it's the lack of directory structure versioning, the inability to see what has changed about a whole directory tree, the confusing and inefficient branch support, or a variety of implementation flaws, then Subversion may seem exciting and fresh. If it's a lack of support for highly distributed development or history-sensitive merging, or a fundamental dislike of the operational model, then Subversion in its current form may appear humdrum and derivative. Regardless, Meta-CVS and DCVS are not better paths to the same goal.
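To make the distinction between the two models concrete, here is a minimal sketch in Python. It is purely illustrative--it reflects neither CVS's nor Subversion's actual implementation, and all class and method names are invented for this example--but it shows the difference between per-file histories keyed by pathname and atomic whole-tree revisions:

    # Per-file model (CVS-style): each file has its own independent revision
    # history, keyed by pathname.  A multi-file commit is just a sequence of
    # per-file commits that can fail partway through, and there is no
    # first-class notion of a tree revision.
    class PerFileStore:
        def __init__(self):
            self.histories = {}                  # path -> list of contents

        def commit_file(self, path, content):
            self.histories.setdefault(path, []).append(content)

    # Whole-tree model (Subversion-style): every commit atomically produces
    # a new revision of the entire tree.
    class TreeStore:
        def __init__(self):
            self.revisions = [{}]                # revision -> {path: content}

        def commit(self, changes):
            # Either the whole new snapshot appears, or nothing does
            # (failure-atomic).  A None value means a deletion.
            new_tree = dict(self.revisions[-1])
            for path, content in changes.items():
                if content is None:
                    new_tree.pop(path, None)
                else:
                    new_tree[path] = content
            self.revisions.append(new_tree)      # one atomic step
            return len(self.revisions) - 1       # the new revision number

In the first model, a question like "what did the whole tree look like at revision N?" has no direct answer; in the second, it is a single lookup. That is the sense in which whole-tree versioning and failure-atomic commits are structural properties of the storage model, not features that can be layered onto a per-file foundation.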
2. Big hammers

Tom claims that Subversion developers had two big hammers in search of nails: the transactional filesystem and the associative lookup table.

"So here's the first mistake: the idea of a transactional FS is like a shiny new hammer. It's pretty natural to let it possess you and start running around looking for nails."

The problem with this claim is that a transactional FS, with a few annotations, is exactly the right hammer for version control as conceived of by Subversion ("taking snapshots of trees"). As much as one might fervently believe that this is the wrong conception of version control, it's a workable and very intuitive conception. See the next section for details.

"Application of patterns like property lists in a design bull session all too easily gives rise to the feeling that 'all the problems we're thinking about have natural solutions in this design' even though all you're really saying is 'the problems we need to solve can be expressed in terms of associative lookup'."

There may be some truth to this claim as it applies to Subversion's original design, but the resulting Subversion implementation makes very limited use of property lists, so it has limited applicability to the question of whether Subversion goes about version control in the right way.

Tom's general assertion that Subversion "underwent fuzzy design conceptualization" appears to rest on a combination of two factors: (1) an unfair comparison between what has gone on inside his head in the development of Arch (most of which is not written down) and what was publicly written down during Subversion's development and read by him, and (2) his belief that ambitious merge support is a fundamental and central feature of modern version control. Since Subversion's design was (like most software design) not written down with mathematical precision, and since Subversion made an early decision to put off history-sensitive merging until the future, he concludes that the entire design process was "fuzzy."

3. An under-developed notion of version control

"Suppose you have the same intuition that Walter expressed a while back, which I'll paraphrase as: 'The first and most fundamental task of a revision control system is to take snapshots of working directories.' If you don't believe that that's a seductive (even though wrong) intuition, go back and look at how I replied. It took many, quite abstract paragraphs. What revision control is really about (archival, access, and manipulation of changesets) is subtle and _non_-intuitive."

This is a subtle and important point, one which divides the centralized or tree-oriented version control systems (Perforce, Clearcase, CVS, Subversion) from the changeset-oriented ones (Bitkeeper, Arch). A full treatment of this issue could fill multiple journal articles, but one should recognize that it is an issue with two sides:

  * Changeset-oriented version control is more powerful, but it is power which is largely unnecessary in all but the most chaotic of development projects.

  * Changeset-oriented version control is harder to learn. In many environments, a shallow learning curve is the most important feature of a version control system.

  * Changeset-oriented version control is hard to get right. Perhaps the best support for this statement can be found in a March 2003 note from Larry McVoy to the linux-kernel list:

      http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/0130.html

  * Changeset-oriented version control can be built on top of a tree-oriented foundation, although it will have all the disadvantages listed above. As Tom himself notes, tree-oriented storage is a dual to changeset-oriented storage (see the sketch following this list); svk (http://svk.elixus.org/) serves as a working prototype of changeset-oriented version control implemented on top of Subversion.
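The duality claim in the last bullet is easy to state precisely, so here is a minimal sketch of it (illustrative Python only; trees are modeled as path-to-content dictionaries, and none of this reflects how Subversion, svk, or Arch actually store data). Diffing adjacent snapshots yields changesets, and replaying the changesets reconstructs the snapshots, so each representation can be derived from the other:

    def diff(old, new):
        """Derive a changeset from two tree snapshots: a mapping of
        path -> new content, where None means a deletion."""
        changes = {}
        for path in set(old) | set(new):
            if old.get(path) != new.get(path):
                changes[path] = new.get(path)
        return changes

    def apply_changeset(tree, changes):
        """Produce the next snapshot by applying a changeset to a tree."""
        result = dict(tree)
        for path, content in changes.items():
            if content is None:
                result.pop(path, None)
            else:
                result[path] = content
        return result

    # Round trip: snapshots -> changesets -> snapshots.
    snapshots = [{}, {"a": "1"}, {"a": "2", "b": "3"}, {"b": "3"}]
    changesets = [diff(a, b) for a, b in zip(snapshots, snapshots[1:])]

    tree = {}
    for cs in changesets:
        tree = apply_changeset(tree, cs)
    assert tree == snapshots[-1]

The interesting design questions (changeset identity, reordering, history-sensitive merging) live above this level, which is why the duality makes a changeset layer over a tree-oriented store possible but does not make it free.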
4. Use of APIs

"When you lack confidence about your intended way to implement something, a common pattern is to decide to hide the implementation under an API. That way you can always change the implementation later, right?"

"Sixth mistake: assuming that defining APIs all over the place would improve chances for success."

This claim is particularly puzzling to me, because from my perspective, Subversion's use of APIs is largely about allowing use of Subversion's functionality by third-party applications, and has very little to do with plastering over implementation flaws. In the few cases where Subversion envisioned multiple implementations of an API--the editor, reporter, repository access, and filesystem APIs--there are at least two dramatically different implementations of all but the filesystem API.
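As an illustration of the kind of API meant here, the following is a deliberately simplified sketch of an editor-style interface. It is illustrative Python--Subversion's real editor is a C callback table, and none of these names are taken from it--but it shows the pattern: a tree change is described as a series of calls, so that dramatically different consumers can implement the same interface.

    class TreeEditor:
        """Abstract consumer of a tree change, described as a series of calls."""
        def add_file(self, path, content): ...
        def delete_entry(self, path): ...
        def close_edit(self): ...

    class PrintEditor(TreeEditor):
        """One implementation: summarize the change for display."""
        def add_file(self, path, content):
            print("A", path)
        def delete_entry(self, path):
            print("D", path)
        def close_edit(self):
            print("done.")

    class ApplyEditor(TreeEditor):
        """A very different implementation: apply the same change to an
        in-memory tree."""
        def __init__(self, tree):
            self.tree = tree
        def add_file(self, path, content):
            self.tree[path] = content
        def delete_entry(self, path):
            self.tree.pop(path, None)
        def close_edit(self):
            pass

    def drive(editor):
        """Producers drive any editor the same way, without knowing which
        implementation they are talking to."""
        editor.add_file("README", "hello")
        editor.delete_entry("old.c")
        editor.close_edit()

    drive(PrintEditor())                  # prints a summary of the change
    tree = {"old.c": "stale"}
    drive(ApplyEditor(tree))              # applies the same change to a tree
    assert tree == {"README": "hello"}

The point of such an interface is exactly the one made above: it exists so that independently written producers and consumers can interoperate, not so that an uncertain implementation can hide behind it.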
5. Taking a step back

"If a team went away for six months and came back with SVN as it works today, I think it'd be pretty easy to say: 'That's a great and even useful prototype. It definately proves, at least schematically, the idea of using a transactional FS to back revision control. There are clearly some bad choices in the implementation, and clearly some important neglected revision control issues that competing projects are starting to leapfrog you over. And there's a heck of a lot of code here for what it does. Let's suspend development on it for a little while, and invest in a design effort to see how to get this thing _right_'."

Much of this claim dates to a series of Subversion mailing list exchanges in 2002 and early 2003 where Tom said pretty much just this. His comments failed to persuade, mostly for two reasons: they were too abstract (in particular, they tended to assume a lot of familiarity with Arch, at a time when it was not easy to rapidly gain such familiarity), and they appealed to priorities which were not shared, such as a belief in the high importance of ambitious, history-sensitive merging.

One could make a similar argument about Arch "failing to fail" by asserting that Unix-specificity and a steep learning curve are fundamental and insurmountable obstacles. Based on the current size of their user and developer communities, neither argument holds; both projects have been able to achieve some acceptance in spite of their shortcomings.

More generally, though, the tendency of software projects to have a "failure to fail" is not necessarily a result of "crappy socio-economic circumstances" but simply an acceptance of the fact that no large software project can be perfect, and that only the ones which make a drive for stability and acceptance, in spite of their flaws, can achieve success. Two essays on this topic:

  http://www.joelonsoftware.com/articles/fog0000000069.html
  http://www.neilgunton.com/rewrites_harmful/

6. Jumping on the W3C bandwagon

"SVN came into being at a time when it looked to many like HTTP and Apache were the spec and best implementation of the new distributed OS for the world that would solve everything beautifully. There was a kind of dog-pile onto everything W3C based on irrational exhuberence. Well, they weren't that OS and they don't solve everything beautifully."

There's some truth here, but it's greatly exaggerated. There was no dog-pile onto "everything W3C"; Subversion uses basic XML in various places, and uses the Apache httpd implementation and the HTTP, WebDAV, and (to some extent) DeltaV standards in one of its three repository access methods.

Although the use of these components has not been without difficulty, it has also not been without benefit. WebDAV compatibility allows desktop users to mount Subversion repositories as file shares and even modify them transparently with "auto-versioning". The use of Apache and Neon has allowed for relatively easy access to a limited but useful suite of authentication and authorization methods. At any rate, with the exception of the use of XML in the working copy library, it's easy to avoid all of that stuff by using svnserve, which is also faster and simpler than the HTTP access method.

7. A point of agreement: Berkeley DB

"Wanting to make progress simply and quickly, they spotted the Berekeley DB library. After all: it provides transactions with ACID properties for our favorite handwavy design tool -- the associative lookup table."

"Well, I think Berkeley DB is a lousy choice for this application. It creates administrative headaches, and it's optimized for simple associations, not hierarchical filessytems. It doesn't natively provide any sort of delta-compression -- you'll have to layer that. Ultimately _all_ that it buys you is transactions and locks -- every other aspect is a force-fit."

These statements are not completely true; in addition to transactions and locks, Berkeley DB also provides a convenient way to have many associations with relatively low overhead. And as noted before, Subversion developers have had no particular fascination with the associative lookup table, so the "handwavy" comment is hardly charitable. But by and large, the point is valid: as long as Subversion's back end is tied to Berkeley DB, its reliability will suffer:

  * You can run out of locks.
  * An improperly terminated operation can cause the database to require recovery.
  * Access to the database from multiple Unix accounts requires proper umasks, or the database appears to become corrupted.
  * The database will not work properly over a remote filesystem.
  * Moving the database from one platform to another results in cryptic failures.

Fortunately, it should be possible to replace Subversion's back end without changing any other part of Subversion, even the server-side repository logic.

8. Conclusion

The target audience of Tom's mail was Arch users, who can be expected to generally agree with Tom's priorities and therefore find Arch more interesting than Subversion. In that context, much of his "diagnosis" makes sense, at least where it is not inaccurate. But the casual reader should not conclude that Subversion is some kind of colossal failure; it has the largest development and user community of any free CVS replacement, and its users generally have positive things to say about it.