I've been thinking about version control a lot lately. I just wanted to get some thoughts I've been having out there so people who know what they're talking about can shoot them down.
I'm going to structure this as a series of predictions, none of which I particularly believe are correct. Nevertheless, the format makes for an easy true-or-false binary judgement later on. I'm also playing with the format because I've been reading GEB, and I feel like trying something new. So here goes:
1. Bazaar Will Never be a Serious Player
Bazaar, depending on your perspective, either takes the best of both Mercurial and Git, or the worst. I can sympathize with both views. On one hand, it is Python and pretty well cross-platform, and not the C-and-shell which caused so many portability issues for Git. It's also apparently supposed to be "getting there", speed-wise - although, the checkout of bzr trunk I have going on is certainly taking its time.
But bzr has at least one problem - and that is that it has broken repository-format compatibility at least once. This kept me from trying out Zed Shaw's projects for some time - the time investment to apt-get and then bzr branch is okay, but once I had to start worrying about versions, I gave up. Mercurial is working today.
And while problems can be forgiven, they need to be offset with something significantly new, and I don't see that with bzr. Mercurial, and Git to an even larger extent, are innovating in the DVCS space, while Bazaar is making a slower, less widely used reimplementation of all their features.
On a less-serious note, Bazaar also has another disadvantage: bzr is awkward to type compared to "git" and "hg". "hg" wins, as it's two keystrokes by different fingers - you can even alias gh to hg in your .bashrc and make it basically instantaneous. Git requires a "A-B-A" pattern, but with alternating hands; bzr requires A-B-A but on the same hand, with a further distance between the first and third keystrokes, and a pinky-keystroke required for z. But hey, whatever. Maybe it's not a major issue, but call me when a version control program named qzzqx storms the world, k?
Bazaar does have another big advantage, and that is Canonical. I can still see Bazaar usage increasing greatly in a very specific and unlikely set of circumstances, among them Launchpad being open-sourced soon and 3rd-party hosting services using Launchpad taking off. But even if that were to happen, I'd be surprised if no one just wrote a Git or Mercurial backend for Launchpad's VCS features. But it could happen. It's just not likely.
I've finally got the bzr trunk checked out, and I do want to say that the code looks damn nice. I'd heard Mercurial's code pointed out to me as good Python code, but the ui-passing everywhere turned me off (if you've coded for the Mercurial APIs, you know what I'm talking about). Bazaar's source looks considerably cleaner, with sane style. It reminds me of Django's source. This is, of course, a 2-minute, mile-high overview, but that's what I see.
2. Mercurial is Worse (is Better)
I started recognizing Python as a better Blub than Lisp (I'll expand on that someday) when I learned that many Python "warts" - such as statement-less lambdas - were in fact conscious decisions. The Python developers looked at their choices, considered the arguments for more powerful lambdas, and decided against them. Python is as Python is because that's what the Python devs want it to be. There are certainly warts - the whole unicode/str thing in 2.x comes to mind - but some design decisions were made to trade power for clarity, and that's a fair trade.
With Mercurial, similar tradeoffs are apparent. Tags are implemented with a .hgtags file that simply maps a changeset ID to a name, one pair per line. I commented to a coworker how that seemed somewhat hacky, and his response was "what's wrong with it?" And he's right: that's all the tags really are; 1-to-1 mappings. A flat file is perfect for them, but my object-oriented mind objected to custom formats in a flat file. I mean, yuck! Shouldn't it at least be serialized somehow?
Of course, the flaw was with my dogmatic assumptions, and not with the implementation, which has worked just fine.
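To show how little machinery a flat tag file needs, here is a minimal sketch in Python. It assumes each line pairs a changeset ID with a tag name, separated by a single space. The function name parse_tags and the sample IDs are made up for illustration, and Mercurial's real tag handling has more rules (for example, how entries from several heads are combined) that this ignores.

```python
def parse_tags(text):
    """Parse lines of the form '<changeset id> <tag name>' into a dict."""
    tags = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        node, name = line.split(" ", 1)
        # Simplification: a later line for the same tag overrides an earlier one.
        tags[name] = node
    return tags


# Hypothetical file contents; the IDs are invented for this example.
SAMPLE = (
    "0123456789abcdef0123456789abcdef01234567 v1.0\n"
    "89abcdef0123456789abcdef0123456789abcdef v1.1\n"
)

tags = parse_tags(SAMPLE)
```

That is the whole "parser": no serialization library, no schema, and the file stays readable in any text editor or diff.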
(I do have one beef with .hgtags and .hgignore files - if Mercurial keeps up that convention, there'll be too many .hgFOO files. Just stick them in .hg/, please! Not that I'm actually advocating this change - it'd be too complex and backwards-incompatible to be worth it, but if I had been involved 3 years ago, it's what I would have advocated.)
Likewise, Mercurial ignores directories and the cross-platform vagaries of versioning something that varies widely across operating systems and filesystems. It just versions files at a particular path instead, and creates directories on-the-fly as needed, as well as removing them when empty. This inspired the same WTH? reaction in me at first, but after more time has passed, the question has become, "why not?" A huge chunk of complexity and bugs was removed, and replaced with sane behavior that, in 90% of situations, is exactly what the programmer would have done anyway.
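The "files at a path" idea can be sketched in a few lines of Python. This is not Mercurial's code, only an illustration of the behavior described above under my own assumptions: the helper names write_tracked and remove_tracked are hypothetical, and the working directory is just a plain folder.

```python
import os
import tempfile


def write_tracked(root, relpath, data):
    """Write a file, creating any missing parent directories on the fly."""
    full = os.path.join(root, relpath)
    os.makedirs(os.path.dirname(full), exist_ok=True)
    with open(full, "wb") as f:
        f.write(data)


def remove_tracked(root, relpath):
    """Remove a file, then prune parent directories that became empty."""
    full = os.path.join(root, relpath)
    os.remove(full)
    parent = os.path.dirname(full)
    # Walk upwards, but never delete the root of the working directory.
    while os.path.abspath(parent) != os.path.abspath(root):
        try:
            os.rmdir(parent)  # only succeeds when the directory is empty
        except OSError:
            break  # something else still lives here, so stop pruning
        parent = os.path.dirname(parent)


with tempfile.TemporaryDirectory() as root:
    write_tracked(root, "docs/img/logo.png", b"not really a png")
    remove_tracked(root, "docs/img/logo.png")
    leftover = os.listdir(root)  # both docs/ and docs/img/ are gone again
```

Directories never appear as things to be tracked; they exist only as a side effect of the files inside them.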
2.1. But Worse is Worse!
But these simplifying abstractions are not complete, and as such they do leak, and one way they do so is with renames. Mercurial handles renames as a remove-and-add. This leaks in a few ways - first, in 'hg status', it actually looks like an add and a remove, which still makes me have to think twice when looking at that situation. More importantly, though, it can inflate the repository size when moving large numbers of binary files, as a copy of the file's contents is stored at each path at which it appeared (in the internal repository format).
This is exacerbated by the fact that Mercurial repositories can never shrink. History cannot be (easily) rewritten. SVN made the same choice, and I think it's worth noting that after all their experience with this choice, they want to change that behavior. Couldn't we learn from their experience, and have partial-history and partial-tree cloning considered must-haves from the get-go?
The suggested solutions all tend to rely on what looks to my untrained eye like 'fooling' Mercurial itself by adding nonsense data to the repository, and then teaching Mercurial to look for these sentinel values and treat them specially. It'll probably work, and work well, in the true 'Worse is Better' sense - but it shouldn't be mistaken for built-in, semantic knowledge of partial or shallow checkouts by Mercurial itself.
I'm going to blatantly assume that this state of affairs is the result of further incompleteness in the simplifying abstractions that were chosen for Mercurial, but I'm really just guessing. It'd make sense though.
I don't know enough about Git to speak about what challenges its choice of abstractions causes. As you could probably have guessed by now, I'm going to anyway. I'll pretty much just make up stuff as I go along and pretend that I actually researched it.
3. Git's Model Will Die of Packing
You know what bugs me about Postgres? Vacuum. There's really no other way to do append-only databases... but it still bugs me.
Still, Postgres gets a free pass because they are the database experts, not me, and if the database wants to have a daemon process running, I'm okay with letting them handle that as they best judge, especially since a daemon will need to be running to accept connections anyway.
Git doesn't have that situation. Git repositories need maintenance, and yet, aren't maintained unless the programmer does it. This is unfortunate. I only know of one other piece of software that gets away with this; I'm thinking of ZODB. But ZODB only needs to be maintained by a server admin, and only every few months. Git has to be maintained (maybe if I say "hand-held" it'll convey my grimacing expression better) by every person who uses it. That's either going to change, or... well, people will keep complaining until it does change. There is no real other outcome; the chorus will just grow louder until patches start flowing, and until one of them goes into the master.
4. The Lines Between DVCS, Database, and Filesystem Will Blur
I think the next interesting DVCS innovation will be storing repository metadata in an abstractable layer - and that the metadata backend that will be particularly interesting will be the database-backed ones. Mercurial is in a good situation here, because Python 2.5 has sqlite3 built-in. Advantage Mercurial.
On the flip side, though, this will require treating repository metadata and file data separately, and that's something that Git already does. Deuce.
This separation will make gitweb-style services very interesting - repositories could simply be database rows, ditto for trees (to stick with Git terminology). The trees would form a graph of objects, either pointing to other rows or to file SHA1s. All those files could be objects stored in a shared (either globally or per-repo) content-addressable filesystem, presumably using SHA1 as that address.
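Here is a rough sketch of that split, using Python's built-in sqlite3 and hashlib. Everything in it is an assumption for illustration: the tree_entries table layout, the helper names, and the plain dict standing in for the shared content-addressable store. It also hashes the raw file bytes, which is not how Git computes its object IDs.

```python
import hashlib
import sqlite3


def put_blob(blobs, data):
    """Store file content under its SHA1; identical content is stored once."""
    key = hashlib.sha1(data).hexdigest()
    blobs[key] = data
    return key


def make_db():
    """Create an in-memory metadata database with one row per tree entry."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tree_entries "
        "(tree_id INTEGER, name TEXT, kind TEXT, target TEXT)"
    )
    return db


def add_entry(db, tree_id, name, kind, target):
    """kind is 'tree' (target is another tree_id) or 'blob' (target is a SHA1)."""
    db.execute(
        "INSERT INTO tree_entries VALUES (?, ?, ?, ?)",
        (tree_id, name, kind, str(target)),
    )


def read_file(db, blobs, tree_id, path):
    """Follow tree rows along the path, then fetch the content by its hash."""
    parts = path.split("/")
    for i, part in enumerate(parts):
        row = db.execute(
            "SELECT kind, target FROM tree_entries "
            "WHERE tree_id = ? AND name = ?",
            (tree_id, part),
        ).fetchone()
        if row is None:
            raise KeyError(path)
        kind, target = row
        if kind == "tree":
            tree_id = int(target)  # descend into the subtree row set
        elif i == len(parts) - 1:
            return blobs[target]   # metadata ends here; content lives elsewhere
        else:
            raise KeyError(path)   # a file cannot have children
    raise KeyError(path)           # path named a tree, not a file


blobs = {}                      # stand-in for the shared content store
db = make_db()
key = put_blob(blobs, b"hello\n")
add_entry(db, 1, "src", "tree", 2)          # root tree (1) contains src/
add_entry(db, 2, "hello.txt", "blob", key)  # src/ contains hello.txt
content = read_file(db, blobs, 1, "src/hello.txt")
```

The database only ever holds names and hashes, so the blob store could be swapped for anything that maps a key to bytes without touching the tree rows.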
Do My Homework (Or Hobby-Work, anyway)
This is actually something I've been meaning to do. It'd be perfect for deploying, say, on Amazon's S3. Hint: Hashes make for perfect bucket keys. Unfortunately, I just don't have the time, at least not right now. But the time to author this was doable, so it's what I did. I hope it helps someone, or at least mirrors what someone else has been thinking as well.
Comments
Mikko
#4507, 2008-06-24T03:52:59Z
Git introduced automatic repacking in version 1.5.4, so 'needs to be hand-held by every user' is strictly not true any more. As I understand, repacking is triggered by default automatically every 50 commits in the most recent version, but this is configurable.
Shaun
#4508, 2008-06-24T04:47:55Z
"I do have one beef with .hgtags and .hgignore files"
You make a fair point, but consider also that by putting them under .hg/ you lose the ability to put them into version control.
Lee
#4511, 2008-06-24T18:21:44Z
Regarding "bzr". Funny enough, in most cases I actually like words I can type with one hand, especially the left hand as that means I don't have to move my right hand from the mouse, but since this happens at command line my hand is not likely to be on the mouse anyway. :P
Jakub Narebski
#4569, 2008-06-29T13:06:56Z
Two comments:
First, git has implemented automatic repacking ("git gc --auto"), which is safe (so it can be used from 'cron' for example), and is also invoked by commands which usually generate a large number of objects.
Second, Shawn has sent a patch series to the git mailing list adding support for storing git repositories on Amazon's S3... unfortunately this is for jgit (a reimplementation of git in Java).
Jakub Narebski
#4570, 2008-06-29T13:15:51Z
I have another beef with Mercurial, namely the way it treats tags and local branches.
First, tags are either versioned and transferred (in the committed .hgtags file) or not versioned and not transferred (in .hg/localtags or similar), while they should be non-versioned (they are repository metadata, of sorts; and should be outside the graph of revisions). This forces Mercurial to give the .hgtags file special treatment (as compared for example to the .hgignore file): the most recent version is always used. Git does it correctly; tags are outside the repository object database, they are refs and can be autofollowed.
Second, branches (from what I understand of Mercurial documentation) are a special version of tags, and being on a branch IIRC is having branch tag in ancestry, which is just plain weird and strange, not to say stupid.
Mercurial doesn't have annotated tags either, I think.
Katie Molnar
#17716, 2009-04-16T02:30:37Z
I was just thinking about this stuff today, having encountered Bazaar checking out a pygame GUI toolkit from Launchpad.
My first thought was "Oh cool, it's like Git." My second thought was "Wow, this is slow."
I'm not really sure what the aim is with bzr. It seems like a decent reinvention of the DVCS wheel, but it doesn't really do anything other tools don't do better already. I can't find one new feature in it.
I think the only thing going for it is Launchpad.
Ironically I showed up on this site for completely different reasons, hitting the discussion on Python's property builtin from Google.
And yes, "bzr" is a left-hand word, but so are "sed" and "gcc," so although it's uncomfortable, I think the similarity with "bizarre" is more likely to cause it trouble than the abbreviation. =)
Also, I see you're using Markdown. That's pretty awesome.