I believe I may have said this before, but it bears saying again: a big thank you to all of you who have posted insightful comments, both here and on Hacker News, in response to my recent article on git and its followup. I say without a shadow of exaggeration that I’ve learned more about git from these comments than from anything else I’ve ever read about it. (Yes, a couple of the comments were borderline abusive, but I think three bad ‘uns out of 300 is beating the averages pretty handily.)
Anyway: among many excellent comments, this long one by Weavejester @ Hacker News was the best of them all. As I read this, I felt a grin creeping slowly across my face, and at one point I literally laughed out loud as I realised not just what he was doing, but how well he was doing it. With his explicit permission, I am reposting it here in its entirety, because it deserves a much wider audience. (The title of this post is also Weavejester’s, based on the title of an article about monads.)
Here we go:
Mike, Git seems unintuitive because you don’t have a good grasp of what it does behind the scenes. Imagine trying to get to grips with a Unix shell, if you had no concept of files or directories. In such a scenario, even a simple command like “cat” would seem incomprehensible.
If you’ll indulge me, I’d like to propose a thought experiment.
Designing a patch database
Imagine you’re responsible for administering a busy open source project. You get dozens of patches a day from developers and you find it increasingly difficult to keep track of them. How might you go about managing this influx of patch files?
The first thing you might consider is how do you know what each patch is supposed to do? How do you know who to contact about the patch? Or when the patch was sent to you?
The solution to this is not too tricky; you just add some metadata to the patch detailing the author, the date, a description of the patch and so forth.
The next problem you face is that some patches rely on other patches. For instance, Bob might publicly post a patch for a great new scheduler, but then Carol might post a patch correcting some bugs in Bob’s code. Carol’s patch cannot be applied without first applying Bob’s patch.
So you allow each patch to have parents. The parent of Carol’s patch would be Bob’s patch.
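As a rough sketch of the parent idea, assuming a tiny in-memory database: each patch carries its metadata plus the labels of its parents, and a patch is only applied once all of its parents have been. The `Patch` class, the `apply_order` function and the labels are hypothetical names made up for illustration; they are not part of git.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Patch:
    """A hypothetical patch record: metadata plus the labels of its parents."""
    label: str
    author: str
    description: str
    parents: list[str] = field(default_factory=list)


def apply_order(patches: dict[str, Patch], target: str) -> list[str]:
    """Return the labels to apply, parents first, ending with `target`."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(label: str) -> None:
        if label in seen:
            return
        seen.add(label)
        # Every parent must be applied before the patch that depends on it.
        for parent in patches[label].parents:
            visit(parent)
        order.append(label)

    visit(target)
    return order


db = {
    "bob-scheduler": Patch("bob-scheduler", "Bob", "Great new scheduler"),
    "carol-fixes": Patch("carol-fixes", "Carol", "Fix bugs in Bob's scheduler",
                         parents=["bob-scheduler"]),
}
print(apply_order(db, "carol-fixes"))  # ['bob-scheduler', 'carol-fixes']
```

Asking for Carol's patch pulls in Bob's patch first, which is all the parent link is meant to guarantee.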
You’ve solved two major problems, but now you face one final one. If you want to talk to other people about these patches, you need a common naming scheme. It’s going to be problematic if you label a patch as ABC on your system, but a colleague labels a patch as XYZ. So you either need a central naming database, or some algorithm that can guarantee everyone gives the same label to the same patch.
Fortunately, we have such algorithms; they’re called one-way hashes. You take the contents of the patch, its metadata and parents, serialize all of that and SHA1 the result.
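A minimal sketch of that naming scheme, assuming the patch is serialized as JSON with sorted keys. That serialization is an assumption chosen for illustration and is not the format git actually uses, and `patch_id` is a hypothetical helper name.

```python
import hashlib
import json


def patch_id(diff: str, metadata: dict, parents: list) -> str:
    """Serialize diff, metadata and parent ids deterministically, then SHA-1 it."""
    payload = json.dumps(
        {"diff": diff, "metadata": metadata, "parents": parents},
        sort_keys=True,  # stable key order so everyone serializes identically
    ).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


bob = patch_id("+ new scheduler", {"author": "Bob"}, [])
carol = patch_id("+ bug fixes", {"author": "Carol"}, [bob])

# Anyone hashing the same contents gets the same label, with no central registry.
print(bob == patch_id("+ new scheduler", {"author": "Bob"}, []))
# Changing the parent changes the label, even if the diff is identical.
print(carol != patch_id("+ bug fixes", {"author": "Carol"}, []))
```

Because the parents are part of what gets hashed, a label pins down the patch's ancestry as well as the patch itself.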
Three perfectly logical solutions, and ones you may even have come up with yourself under similar circumstances.
Under this system, how would a merge be performed? Let’s say you have two patches, A and B, and you want to combine them somehow. One way is to just apply each in turn to your source, fix any differences that can’t be automatically resolved (conflicts), and then produce a new patch C from the combined diff.
That works, but now you have to store A, B and C in your patch database, and you don’t retain any history. But wait! Your patches can have parents, so what if you created a ‘merge’ patch, M, with parents A and B?

A   B
 \ /
  M
This is externally equivalent to what you did to produce C: patches A and B are applied to the source code, and then you apply M to resolve the differences. M will contain both the differences that can be resolved automatically, and any conflicts we have to resolve manually.
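To see why the merge patch keeps history where the combined patch C would not, here is a small sketch under the same assumptions as the thought experiment: patches are just labels with parent links. `Patch` and `ancestors` are hypothetical names for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    """A hypothetical patch that knows only its label and its parents."""
    label: str
    parents: tuple[str, ...] = ()


def ancestors(db: dict[str, Patch], label: str) -> set[str]:
    """Collect every patch reachable through parent links, including `label`."""
    found: set[str] = set()
    stack = [label]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(db[current].parents)
    return found


db = {
    "A": Patch("A"),
    "B": Patch("B"),
    "M": Patch("M", parents=("A", "B")),  # the merge patch has two parents
}
print(sorted(ancestors(db, "M")))  # ['A', 'B', 'M']
```

Starting from M, both A and B are still reachable, so nothing about where the merged code came from has been thrown away.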
Having solved your problem, you write the code to your patch database and present the resulting program to your colleague.
A user tries to merge
“How do I merge?” he asks.
“I’ve written a tool to help you do that,” you say, “Just specify the two patches you want to combine, and the tool will merge them together.”
“Um, it says I have a merge conflict.”
“Well, fix the problem, then tell the system to add your file to the ‘merge patch’ it’s making.”
Your colleague dutifully hacks away, and solves the conflict. “So I’ve fixed the file,” he says, “But when I tell it to ‘commit file’ it fails.”
“Remember, this is a patch database,” you reply, “We’re not dealing with files, we’re dealing with patches. You have to add your file changes to your patch, and then commit the patch. You can’t commit an individual file.”
“What? That’s not very intuitive,” he grumbles, “Hey! I’ve added the file to the patch, but it tells me the merge isn’t complete!”
“You need to add all of the files that have differences that were automatically resolved as well.”
“Why?”
“Because,” you explain patiently, “You might not like the way those files have been changed. It needs your approval that the way it’s resolved the differences is correct.”
“Why do I have to re-commit everything my buddy has made?” he complains, “Seriously, I want to just commit one file. What the hell is up with your system?”
So that’s it — sneaky old Weavejester has not only tricked me into designing git, but got me defending its design to my dumb-ass colleagues who don’t Get It.
Where do I go from here? I am not truly sure. I need to give this some time to sink in, and blog about something else for a while. But I think one distressingly likely outcome is that I’m going to buy the book [amazon.com, amazon.co.uk, free online version], learn git properly and then start alienating all my friends by telling them all, in the most patronising possible manner, that they’re thinking about version control all wrong and it’s really change control.
Ah, poop. I feel like C. S. Lewis must have felt when he famously wrote “In the Trinity Term of 1929 I gave in [...] perhaps that night, the most dejected and reluctant convert in all England.”