I've been thinking about version control a lot lately. I just wanted to get some thoughts I've been having out there so people who know what they're talking about can shoot them down.

I'm going to structure this as a series of predictions, none of which I particularly believe are correct. Nevertheless, the format makes for an easy true-or-false judgement later on. I'm also playing with the format because I've been reading GEB and feel like trying something new. So here goes:

1. Bazaar Will Never be a Serious Player

Bazaar, depending on your perspective, either takes the best of both Mercurial and Git, or the worst. I can sympathize with both views. On one hand, it is Python and pretty well cross-platform, and not the C-and-shell which caused so many portability issues for Git. It's also apparently supposed to be "getting there", speed-wise - although, the checkout of bzr trunk I have going on is certainly taking its time.

But bzr has at least one problem - and that is that it has broken repository-format compatibility at least once. This kept me from trying out Zed Shaw's projects for some time - the time investment to apt-get and then bzr branch is okay, but once I had to start worrying about versions, I gave up. Mercurial is working today.

And while problems can be forgiven, they need to be offset with something significantly new, and I don't see that with bzr. Mercurial, and Git to an even larger extent, are innovating in the DVCS space, while Bazaar is making a slower, less widely used reimplementation of all their features.

On a less-serious note, Bazaar also has another disadvantage: "bzr" is awkward to type compared to "git" and "hg". "hg" wins, as it's two keystrokes by different fingers - you can even alias gh to hg in your .bashrc and make it basically instantaneous. "git" requires an "A-B-A" pattern, but with alternating hands; "bzr" requires A-B-A but on the same hand, with a further distance between the first and third keystrokes, and a pinky-keystroke required for z. But hey, whatever. Maybe it's not a major issue, but call me when a version control program named qzzqx storms the world, k?

Bazaar does have one big advantage, and that is Canonical. I can still see Bazaar usage increasing greatly in a very specific and unlikely set of circumstances, among them Launchpad being open-sourced soon and 3rd-party hosting services using Launchpad taking off. But even if that were to happen, I'd be surprised if no one just wrote a Git or Mercurial backend for Launchpad's VCS features. But it could happen. It's just not likely.

I've finally got the bzr trunk checked out, and I do want to say that the code looks damn nice. I'd heard Mercurial's code pointed out to me as good Python code, but the ui-passing everywhere turned me off (if you've coded for the Mercurial APIs, you know what I'm talking about). Bazaar's source looks considerably cleaner, with sane style. It reminds me of Django's source. This is, of course, a 2-minute, mile-high overview, but that's what I see.

2. Mercurial is Worse (is Better)

I started recognizing Python as a better Blub than Lisp (I'll expand on that someday) when I learned that many Python "warts" - such as statement-less lambdas - were in fact conscious decisions. The Python developers looked at their choices, considered the arguments for more powerful lambdas, and decided against them. Python is as Python is because that's what the Python devs want it to be. There are certainly warts - the whole unicode/str thing in 2.x comes to mind - but some design decisions were made to trade power for clarity, and that's a fair trade.

With Mercurial, similar tradeoffs are apparent. Tags are implemented with a .hgtags file that simply maps a changeset hash to a name, one pair per line. I commented to a coworker how that seemed somewhat hacky, and his response was "what's wrong with it?" And he's right: that's all the tags really are: 1-to-1 mappings. A flat file is perfect for them, but my object-oriented mind objected to custom formats in a flat file. I mean, yuck! Shouldn't it at least be serialized somehow?

Of course, the flaw was with my dogmatic assumptions, and not with the implementation, which has worked just fine.
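To make the "what's wrong with it?" point concrete, here is a minimal sketch of how little code a flat tags file needs. It assumes each line is an identifier, a space, and a tag name; the function name parse_tags and the shortened identifiers in the sample are made up for illustration, and this is not Mercurial's own parser.

```python
def parse_tags(text):
    """Parse lines of the form '<identifier> <tag name>' into a dict."""
    tags = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first space only, so a tag name may contain spaces.
        identifier, name = line.split(" ", 1)
        # A later line for the same name overrides an earlier one.
        tags[name] = identifier
    return tags


# Hypothetical sample data; the identifiers are shortened placeholders.
sample = "0a1b2c3d release-1.0\n4e5f6a7b release-1.1\n"
tags = parse_tags(sample)
```

A dozen lines, no serialization library, and the file stays readable and diffable by hand, which is most of what you want from something that lives under version control.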

(I do have one beef with .hgtags and .hgignore files - if Mercurial keeps up that convention, there'll be too many .hgFOO files. Just stick them in .hg/, please! Not that I'm actually advocating this change - it'd be too complex and backwards-incompatible to be worth it, but if I had been involved 3 years ago, it's what I would have advocated.)

Likewise, Mercurial ignores directories and the cross-platform vagaries of versioning something that varies widely across operating systems and filesystems. It just versions files at a particular path instead, and creates directories on-the-fly as needed, as well as removing them when empty. This inspired the same WTH? reaction in me at first, but after more time has passed, the question has become, "why not?" A huge chunk of complexity and bugs was removed, and replaced with sane behavior that, in 90% of situations, is exactly what the programmer would have done anyway.
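The "files at a path" behavior is easy to sketch. The following is my own illustration in Python 3, not Mercurial's code: the helpers write_file and remove_file are hypothetical names, and the working directory is just a temporary folder.

```python
import os
import tempfile


def write_file(root, path, data):
    """Write data at root/path, creating parent directories as needed."""
    full = os.path.join(root, path)
    os.makedirs(os.path.dirname(full), exist_ok=True)
    with open(full, "wb") as f:
        f.write(data)


def remove_file(root, path):
    """Remove root/path, then prune any parent directories left empty."""
    full = os.path.join(root, path)
    os.remove(full)
    parent = os.path.dirname(full)
    # Walk upward, stopping at the root or at the first non-empty directory.
    while parent != root and not os.listdir(parent):
        os.rmdir(parent)
        parent = os.path.dirname(parent)


root = tempfile.mkdtemp()
write_file(root, "docs/api/index.txt", b"hello")
created = os.path.isdir(os.path.join(root, "docs", "api"))
remove_file(root, "docs/api/index.txt")
pruned = not os.path.exists(os.path.join(root, "docs"))
```

Directories never appear in the data model at all; they exist only as a side effect of the paths that happen to be present.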

2.1. But Worse is Worse!

But these simplifying abstractions are not complete, and as such they do leak, and one way they do so is with renames. Mercurial handles renames as a remove-and-add. This leaks in a few ways - first, in 'hg status', it actually looks like an add and a remove, which still makes me have to think twice when looking at that situation. More importantly, though, it can inflate the repository size when moving large numbers of binary files, as a copy of the file's contents is stored at each path at which it appeared (in the internal repository format).
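A toy model shows why remove-and-add costs space. This sketch assumes storage is keyed by path, as described above; the class PerPathStore is hypothetical and is not meant to mirror Mercurial's actual on-disk format.

```python
class PerPathStore:
    """Toy store that keeps a separate copy of the contents for every path."""

    def __init__(self):
        self.history = {}  # path -> list of stored contents

    def add(self, path, data):
        self.history.setdefault(path, []).append(data)

    def rename(self, old, new, data):
        # Modeled as remove-and-add: history under the old path is kept
        # (so `old` is not touched), and the contents are stored again
        # under the new path.
        self.add(new, data)

    def stored_bytes(self):
        return sum(len(d) for revs in self.history.values() for d in revs)


store = PerPathStore()
blob = b"x" * 1000
store.add("assets/logo.bin", blob)
store.rename("assets/logo.bin", "images/logo.bin", blob)
total = store.stored_bytes()
```

In this model, one rename of a 1000-byte file leaves 2000 bytes in the store, even though nothing about the contents changed.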

This is exacerbated by the fact that Mercurial repositories can never shrink. History cannot be (easily) rewritten. SVN made the same choice, and I think it's worth noting that after all their experience with this choice, they want to change that behavior. Couldn't we learn from their experience, and have partial-history and partial-tree cloning considered must-haves from the get-go?

The suggested solutions all tend to rely on what looks to my untrained eye like 'fooling' Mercurial itself by adding nonsense data to the repository, and then teaching Mercurial to look for these sentinel values and treat them specially. It'll probably work, and work well, in the true 'Worse is Better' sense - but it shouldn't be mistaken for built-in, semantic knowledge of partial or shallow checkouts by Mercurial itself.

I'm going to blatantly assume that this state of affairs is the result of further incompleteness in the simplifying abstractions that were chosen for Mercurial, but I'm really just guessing. It'd make sense though.

I don't know enough about Git to speak about what challenges its choice of abstractions causes. As you could probably have guessed by now, I'm going to anyway. I'll pretty much just make up stuff as I go along and pretend that I actually researched it.

3. Git's Model Will Die of Packing

You know what bugs me about Postgres? Vacuum. There's really no other way to do append-only databases... but it still bugs me.

Still, Postgres gets a free pass because they are the database experts, not me, and if the database wants to have a daemon process running, I'm okay with letting them handle that as they best judge, especially since a daemon will need to be running to accept connections anyway.

Git doesn't have that situation. Git repositories need maintenance, and yet, aren't maintained unless the programmer does it. This is unfortunate. I only know of one other piece of software that gets away with this; I'm thinking of ZODB. But ZODB only needs to be maintained by a server admin, and only every few months. Git has to be maintained (maybe if I say "hand-held" it'll convey my grimacing expression better) by every person who uses it. That's either going to change, or... well, people will keep complaining until it does change. There is no real other outcome; the chorus will just grow louder until patches start flowing, and until one of them goes into the master.

4. The Lines Between DVCS, Database, and Filesystem Will Blur

I think the next interesting DVCS innovation will be storing repository metadata in an abstractable layer - and that the metadata backends that will be particularly interesting will be the database-backed ones. Mercurial is in a good situation here, because Python 2.5 has sqlite3 built-in. Advantage Mercurial.

On the flip side, though, this will require treating repository metadata and file data separately, and that's something that Git already does. Deuce.

This separation will make gitweb-style services very interesting - repositories could simply be database rows, ditto for trees (to stick with Git terminology). The trees would form a graph of objects, either pointing to other rows or to file SHA1s. All those files could be objects stored in a shared (either globally or per-repo) content-addressable filesystem, presumably using SHA1 as that address.
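Here is a rough sketch of that split, using the sqlite3 and hashlib modules from Python's standard library. Everything about the schema is my own assumption: the table names, the add_file helper, and the flattening of trees into one row per path (a real design would probably keep the tree graph). A plain dict stands in for the shared content-addressable filesystem.

```python
import hashlib
import sqlite3


def put_blob(blobs, data):
    """Store data under its SHA-1 hex digest and return that address."""
    address = hashlib.sha1(data).hexdigest()
    blobs[address] = data  # identical contents collapse to one entry
    return address


# Metadata lives in the database; file contents live in the blob store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE repos (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
db.execute(
    "CREATE TABLE tree_entries ("
    " repo_id INTEGER REFERENCES repos(id),"
    " path TEXT,"
    " blob_sha1 TEXT,"
    " PRIMARY KEY (repo_id, path))"
)
blobs = {}  # stand-in for a shared content-addressable filesystem


def add_file(repo_name, path, data):
    """Record path -> SHA-1 for a repository, storing the contents once."""
    db.execute("INSERT OR IGNORE INTO repos (name) VALUES (?)", (repo_name,))
    (repo_id,) = db.execute(
        "SELECT id FROM repos WHERE name = ?", (repo_name,)
    ).fetchone()
    db.execute(
        "INSERT OR REPLACE INTO tree_entries VALUES (?, ?, ?)",
        (repo_id, path, put_blob(blobs, data)),
    )


add_file("project-a", "README", b"shared text\n")
add_file("project-b", "docs/README", b"shared text\n")
entry_count = db.execute("SELECT COUNT(*) FROM tree_entries").fetchone()[0]
```

Two repositories, two tree entries, one stored blob: because the address is derived from the contents, sharing across repositories (and across renames) falls out for free.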

Do My Homework (Or Hobby-Work, anyway)

This is actually something I've been meaning to do. It'd be perfect for deploying, say, on Amazon's S3. Hint: Hashes make for perfect bucket keys. Unfortunately, I just don't have the time, at least not right now. But the time to author this was doable, so it's what I did. I hope it helps someone, or at least mirrors what someone else has been thinking as well.