The contributions to the original binary search thread continue to trickle in, and now stand at an astonishing tally of 679 comments, most of them containing code. Thanks once more to all of you who’ve attempted this challenge, and to everyone who’s commented.
In one of the more interesting comments, Darius Bacon tested a sample of 20 of the Python submissions, and found that exactly 10% of them both passed the functional test and appeared to run in O(log n) time — which of course is exactly in line with Jon Bentley’s original statistic, that 10% of the professional programmers he’d examined were able to write correct binary search code (albeit under rather different conditions). It subsequently became apparent, though, that Darius’s testing code was overly strict, and that a further two of the sampled routines did run correctly (though not necessarily in O(log n) time). Still — it’s a surprisingly low hit-rate, especially when you bear in mind that, contrary to the rules, many of the programmers did use testing to help them develop the code. (You should read the second linked comment, as it has plenty of other interesting observations, too.)
[Update, an hour or two later: as Darius Bacon points out in a comment below, I misinterpreted his results. Once the testing code had been fixed, his sample of 21 routines found 9 that passed all test, of which 6 were O(log n).]
“It’s dumb not to test”
By far the most common comment on this challenge was that it’s dumb to try to write correct code without testing. Many, many people made this point, not only here but especially on Reddit. For example, someone calling him/herself K wrote: ” I don’t really see a point in this exercise – are you (or the author of the book) implying that if a programmer can’t implement this on first try, without testing, he’s not as good as one who can? IMO, it’s not a good metric for the ability of the programmer.”
First of all, let me be very clear what I wasn’t saying: I was not saying that it’s a bad thing to test your code. I am very much in favour of testing. So is Jon Bentley. That’s because neither of us is insane.
The point is this: testing is only one of the weapons in our armoury, and it’s one that has recently become so fashionable that a lot of programmers are in danger of forgetting the others. Although testing is valuable, even indispensible, it is also limited in important ways, and we need to have facility with other techniques as well.
Here are three limitations of testing:
- Tests can only show the presence of bugs, not their absence
- Tests can be buggy, just like tested code
- Tests increase confidence, not understanding
I’ll expand on each of these below. Before I do that, let me say one more time that testing is good. Please do not respond in the comments saying “Mike is a doofus because he says testing is bad.” Testing is very good. But like so many very good things — Doctor Who, sushi, Wi-Fi, Gold Miner ale — it turns out that it’s not the only very good thing you need.
So the purpose of the write-your-code-without testing exercise was not to discard testing because it’s useless; but, precisely because testing is useful, it’s important that we sometimes force ourselves to walk without that particular crutch … and perhaps we’ll find out how atrophied our muscles have become. Then we can discover what other areas we need to concentrate on, what techniques we need to bone up on, what tools we need to relearn and what exercises we need to do.
(I’m assuming here that we aspire to be excellent, rather than merely adequate, programmers.)
Tests can only show the presence of bugs, not their absence
I wrote this heading off the top of my head, then looked at it and thought it sounded a bit familiar. When I grepped around, I found that it was, almost word for word, a quote from an old friend:
Testing can show the presence of bugs, but not their absence.— Edsgar W. Dijkstra, University of Texas
So I guess I am in good company in making this point. To say the same thing in more general terms, as Tom Holtz is fond of saying, “Absence of evidence is not evidence of absence”. (He’s usually talking about fossils in a particular rock formation, but the principle is good more widely. And to be fair, absence of evidence is indeed evidence of absence, but it’s not proof of absence.)
You can test all you want; but all you have shown at any given point is that you have not yet found any bugs. We can see this principle at work in several of the posted solutions — I don’t want to pick on individuals, so I won’t name names, but I’ve seen submissions tagged with statements like “Mine works”, “works for me” and “no errors at the first try”, which have been subsequently shown to be buggy — presumably because the author’s own tests didn’t test all the edge cases.
Again, this is not to say that testing is useless: each passed test can legitimately increase confidence in the tested code. But it can never get you all the way. For that, you need something different.
Tests can be buggy, just like tested code
“Quis custodiet ipsos custodes?”, asks Juvenal (and let us not forget that quidquid latine dictum sit, altum viditur). “Who will watch the watchmen?” replies Alan Moore. “Who will test the tests?”, say I.
This is another problem that cropped up several times in the comments. For example:
- “My code worked the first time, but my test had a bug.” — nicholas.
- “Worked at first try though I did test it before submitting here, which took some time before I realized that while the binary search was working, my small test code was faulty.” — Robert Leffmann.
- “I felt my heart sink when it threw an exception on first run, but it turned out the error was forgetting to sort the test arrays in the test code, so all was ok.” — Paddy.
- “Had some bugs in the harness itself, but didn’t catch any in the binary search method.” — Dan Carroll.
We’ve all been there, I’m sure. As it happen, this very afternoon I spent an unpleasant hour or so trying to debug the test suite of a Perl module that I was building a Debian package for. We fall too easily into talking as though code and tests are two different things, but tests are code. We know that we’re not clever enough to write correct code all the time, so why would we think that we can reliably write correct tests?
Actually, there are two reason why this argument isn’t as strong as it sounds. One is that the code that makes up test suites is often — not always, but often — much simpler than the code being tested. In which case it’s more likely to be correct. I’ve been in situations where that’s not true, where the testing code is itself a major piece of work; but I admit those cases are the exception. The second reason why this argument is not super-strong is this: writing tests lets you triangulate. You’re looking at the same problem from two different directions, and if your tests are broken there’s a fair chance that your correct code will help you spot that fact and fix it. (If you’re really unlucky, your code and your test will both be broken but in such a way that the deficiencies of one patch over the mistakes in the other; but that seems to be rare.)
Tests increase confidence, not understanding
Here we reach my core point. It’s this. If you had to boil down the whole programming skill-set to a single core skill — a completely general one that applies across all that we do — what would you choose? I can imagine various different answers, but for me the answer would be: thinking clearly. I’d argue that the essence of good programming is thinking clearly about the problem; understanding; grokking.
Testing a piece of code is essentially a black-box process. If you hack together a binary search routine, you can and will write tests for it that are not based on how the code works, but only on what it claims to do. This is right and good — it’s proper separation of concerns. But what it means is that writing and running tests does not in itself increase your understanding of the code being tested.
And we need to understand.
In the case of binary search (and let’s all remember that this was chosen because it’s a particularly straightforward algorithm), we’ve seen posted code that does all sorts of horrible things — some of them kind of harmless; some of them with performance implications (like all those array-slicing Python implementations that run in O(n) time, and so are no better than a trivial linear search); and some positively detrimental. The point is this: no amount of testing will, in itself, give you any insight into these coding mistakes.
(Aaaaand I am going to say it yet again, in the desperate hope of not being misunderstood: this does not mean that I think testing is bad; it means that I think it serves some purposes but not all.)
If you’re not convinced, consider this pair of comments. From Jonathan Deutsch: “Writing code without testing is akin to doing math without a calculator”; and from Matthias Goergens: “Arithmetic without a calculator borders on nonsense. But for analysis, algebra or proofs in general, you’d need a very advanced calculator.” Unit tests are great for verifying your arithmetic; but they don’t help you to think through your calculus. For that, you need to think. There’s no way out of it. Sometimes, no amount of agile, pattern-oriented, test-driven, refactoring-focussed pair-programming methodology will get you out of the responsibility to invest some actual thought into your program.
It’s not all just Move Method and Pull Up Instance Variable, you know. Sometimes I wonder whether tools that can do mechanical things reliably for us are blunting our ability to think clearly about subtle problems. One of my very favourite comments on the original post was EoghanM’s: “I definitely thought about it a lot harder when I couldn’t test.”
“Test-driven development” is is not even a zinc bullet
Fred Brooks, perhaps best known as the author of The Mythical Man-Month [amazon.com, amazon.co.uk], is also the author of the classic 1986 paper, No Silver Bullet, in which he argued that “there is no single development, in either technology or management technique, which by itself promises even one order of magnitude improvement within a decade in productivity, in reliability, in simplicity”. Admittedly, what he says is impossible is a high goal — an order of magnitude — and I doubt that even the most committed Object Jockey, Extreme Programmer, Refactoring Guru or Test-Driven Developer would claim benefits that great for their favoured methodology. And yet, these fads keep coming along, and Test-Driven Development seems to be the fashionable one right now.
Well, OK, maybe a bit. Here’s where I think the cognitive mismatch is: a test-driven approach can be helpful in designing (and of course testing) an API; but it makes no contribution whatsoever to designing an algorithm. In other words, it might help with the easy bit, but doesn’t help at all with the actual work. But APIs are such a big deal nowadays (especially with the framework obsession and all), that we too easily fall into thinking that the API is the program. (Over-Fowlering the code can also contribute to this misapprehension.) That is a mistake on the same order as “the medium is the message“. The medium is not the the message, it’s what we need to get the message; the API is not the code, it’s how we invoke the code. But, oh my, isn’t it seductive? To spend our time breaking a problem down into smaller and smaller parts, designing lots of little classes, writing masses of test code, and telling ourselves that we’re being productive, without ever putting any effort in solving the actual problem.
The classic example of this is Peter Norvig’s and Ron Jeffries’ sudoku solvers. These are gathered and linked from the brief article Learning From Sudoku Solvers, which has some excellent comments. I urge you to go and look at these for yourself when you’re done here, but in a nutshell: Ron Jeffries wrote a series of five articles on the test-driven development of a sudoku solver that eventually faded away into nothingness without ever finding a solution, or even coming close; while Peter Norvig spent a while thinking hard, applied powerful non-trivial techniques, and wrote a 100-line Python program that solves any sudoku puzzle. [Jeffries’ articles are often unavailable, as his web-site comes and goes. If you can’t read them there, you can find them in Google’s cache.]
How did Norvig do that? By using skills that are not often thought about in these days of test-driven kool-ade, reflexive application of design patterns and automatic refactoring tools. Once more, let me be clear that all these things are good so far as they go — really, I agree with you, they are! — but they are no substitute for actually thinking, and thinking is what’s needed for hard problems.
I was going to say that in the last part of this article I’ll look at a couple of mental tools for thinking about code, but yet again I have gone on for much longer than I intended, so that will have to wait until a subsequent article. Soon, though, I promise! (And yes, I am well aware of the irony that this is one of a series of blog entries wherein I keep promising to get to the meat but never do, and that here in this very entry I castigate Ron Jeffries for a series of blog entries wherein he kept promising to get to the meat but never did. The difference is that we’ve been waiting three years for RJ to keep his promise but only three days for me to keep mine. Also, I know what my punchline is going to be.)
Before I wrap up for this time, I want to address one more point:
“But I would have had a working binary searcher more quickly if I’d been allowed to test along the way”
Yes, maybe you would. Bearing in mind that binary search is in fact a pretty simple algorithm, it’s likely true that you could have bashed your way through to a working implementation after a few iterations of coding and testing. But —
- Your code would be less clear
- You wouldn’t understand clearly why it works
- You would find it harder to extend or modify in future
Because as soon as it’s aimed at a non-trivial problem, test-driven development becomes debugging-driven development. It stops being “write the test, then write the code, then check that your test says the code is OK”. It starts being “write the test, then write the first draft of the code, the run the test to find the first bug, then patch the code, then write the code to find the next bug; GOTO 40”.
Let’s look at a hypothetical example of how this can happen. Tell me if this sounds hauntingly familiar.
I bang out a first draft of my binary search routine, and test it. I notice that it sometimes looks at an out-of-range array element and realise that I was setting the upper bound to the size of the array rather than the index of its last element. I fix that and see that sometimes it goes into an infinite loop. I guess that’s because when I iterate, I am setting the new upper bound one too high so that it doesn’t converge, so I try subtracting 1. Now I find it sometimes fails to find values that are included, so I remove the -1 and try setting the lower bound one higher instead. It seems to work. The next test shows that my code doesn’t handle an empty array, so I add a special case for that. Then I find it doesn’t work on an array of a single element, so I add another special case. A few more tests suggest that I’ve caught all the edge cases, so I pronounce myself done.
I’m sure I don’t need to (A) persuade you that the resulting code is suboptimal, (B) point out that it likely still contains bugs, or (C) draw your attention to the fact that more than a few of the submitted solutions a couple of posts ago look suspiciously like this. My fear is that most code looks like this — a patchwork of special cases that pretty much work most of the time, more or less. Testing helps us to have more confidence that where we’ve arrived is good enough; but it doesn’t help to make the code good. And that is what I want to talk about next time.
Usually, I gather my sushi photographs from many different sources. But for this article, I drew them all from the website of the Sushi 8 Japanese Restaurant in Mountain View, California. They just have such nice photographs! I have no idea whether their stuff is good or not, so don’t construe this as a recommendation, but it’s certainly easy on the eye.
Also: ignore this Technorati claim-code: EECRHMA873AV