Thanks to all of you who have taken part in yesterday’s selection-sort challenge, and everyone who’s still submitting code. (If you haven’t yet done so but plan to, then please don’t read this article until you’ve written your code according to the rules laid down in that article, otherwise you’ll see hints that you probably don’t want to see up front.)
I promised yesterday that I’d post some ideas about a test-suite in a subsequent article: this is it.
First up, here is the Ruby code that I wrote, and which I want to test. (I deliberately wrote it using low-level constructs, avoiding list operations, so it should look familiar to people who are used to other languages.)
def selection_sort!(a) for i in 0..a.size-2 do # Invariant: a[0..i-1] contain smallest values in order smallest = i for j in i+1..a.size-1 do # Invariant: a[smallest] is the smallest value in a[i..j-1] if a[j] < a[smallest] smallest = j end end a[i], a[smallest] = a[smallest], a[i] end end
Fairly simple code. (The function name ends with an exclamation mark because it updates the array a destructively. This is a helpful naming convention in Ruby, stolen from Scheme.)
So what do we need to test?
Sorting the empty array
First, the degenerate case. This is always a good way to start testing — in the words of Kernighan and Pike, “Make sure your code ‘does nothing’ gracefully” — a rule that they follow with a more general version, “Test programs at their boundary values”. So first up, we need to test that sorting an empty array does the right thing, i.e. nothing at all.
In a comment on the previous article, Sean Conner wrote:
[My] code doesn’t handle 0 sized arrays (as my tests bore out) but it works on arrays of 1 or more elements. Personally, I don’t consider a 0-element array as being valid, but that’s me.
I hope Sean won’t mind me picking on him, but I find this absolutely bewildering. Of course an empty set is a perfectly good set. It’s just as valid a set as an empty string is a valid string: how would we feel about a strlen() function that didn’t work on empty strings?
Or think about the ls (Unix) and dir (Windows) commands that list files in the working directory, sorted: what would we think of such a program if it didn’t work in empty directories, due to the sort routine not working on empty arrays?
So, rule number one: if your program fails on an empty array, it fails. Test on [].
Oh, and rule number two is like unto it: a routine needs to be able to sort an array of a single element (i.e., again, do nothing). I use the array [0].
Repeated elements
It’s not unusual for programs to branch the wrong way on equality — something else that Kernighan and Pike’s rules remind us about — so it’s often worth testing code against lists that have repeated elements.
For sorting, this is perhaps less likely to uncover bugs than when searching, but it’s still a case well worth covering. So as well as an empty array and a single-element array, my test suite includes arrays of two and three identical elements: [0, 0] and [0, 0, 0].
Elements in and out of order
Some broken sorts will appear to do the right thing if the elements are already in order — for example, the O(1) sort consisting of no statements at all will do this. Others, more rarely, might stumble on the right answer if the input is sorted in reverse. Yet others will be caught out by cases in which some of the elements are already in order and others are not. So my set of test cases includes in-order and out-of-order two-element arrays [0, 1] and [1, 0], plus the full set of six permutations of [0, 1, 2].
Because repeated elements can also cause problems, I also threw in [0, 1, 1] and the other two permutations of that array.
Since all the test-cases so far have been of very small arrays — three or fewer elements — I added the sequences 1..10 and 10..1, ascending and descending.
MAXINT
I admit it didn’t occur to me to test whether an array containing MAXINT is properly sorted, but some commenters on the original post have found bugs in their routines when the largest representable integer is part of the array.
I don’t have a test for this because I’m using Ruby, which has no MAXINT: it automatically promotes numbers from Fixed to Bignum when they get too big to fit into the Fixed representation. (It feels odd that we still have high-level languages that don’t do this: coping with integer overflow is a reasonable thing to do in C programs, but it seems bizarre in Java.)
Everything in together
I think that the cases I’ve listed above cover all the edges, so I capped off my set of canned tests with a slightly bigger and more random example, including some negative numbers and a repeated value. I used [42, 9, 17, 54, 602, -3, 54, 999, -11], but it could have been anything. I also added the already-sorted version of that same array, because — hey, why not? Tests are cheap!
Random testing
Is that enough? Well, it might be. But it’s surprising how often purely random test-cases can shake out bugs that carefully designed test-cases miss. Barton P. Miller and co-authors, in an article in the December 1990 Communication of the ACM, described how they tested almost 90 Unix utilities, on seven different versions of the operating system, and found that a quarter of them could be crashed simply by feeding them a random stream of input characters. In one case (running refer on a Sequent Symmetry under Dynix 3.0) they were able to crash the operating system!
Ever since I read that paper — and I am a bit shocked to find that it was twenty years ago! — I’ve had a healthy respect for value of random testing. Accordingly, I extended my test-suite with a set of randomly generated arrays to sort. (I used ten arrays each of 100 elements, chosen in the range -500,000 to 499,999, but it’s trivial to make more, or change or randomise their size, or tweak the range of values.)
How do we test for correctness?
So there I am with 19 hand-crafted test-case arrays to sort, and ten (or more) randomly generated ones. What does my actual test-harness look like?
The obvious thing to do is check that after sorting, each array is indeed sorted — something like
for i in (0..array.size-2) { assert array[i] <= array[i+1] }
But that’s not rigorous enough. For example, it would report success for the O(n) non-sort that sets all elements of the array to zero. So we also need to check that the result of the sort routine leaves the array elements as a permutation of what they were before.
We might check that the new array is a permutation of the old by ensuring that they are the same size and that each element in the source array is also in the sorted array. But that can fail when the source array has repeated elements. For example, consider a broken sort that takes the input [0, 0, 1] and yields the result [0, 1, 1]. The output array is the same size as the input array, and each element in the input is also in the output, but the sort is still broken. Not only that, but the simple way of doing this check is O(n^2), which is just as expensive as the sort itself.
So as it turns out, the simplest approach is just to test against a known-good sort routine. Since pretty much every environment has one of these, it’s easy to do; and since the system sort routines will in nearly every case use an O(n log n) algorithm, doing this additional sort and an O(n) comparison will also be much quicker than the naive permutation test — at least, for big arrays.
So that’s what I did: I sorted each test-case array once with my selection-sort routine and once with Ruby’s built-in sort function, and compared the two sorted arrays for equality.
My testing script
In case anyone cares, then, here is the complete testing script that I used:
tests = [ [], [ 0 ], [ 0, 0 ], [ 0, 0, 0 ], [ 0, 1 ], [ 1, 0 ], [ 0, 1, 2 ], [ 0, 2, 1 ], [ 1, 0, 2 ], [ 1, 2, 0 ], [ 2, 0, 1 ], [ 2, 1, 0 ], [ 0, 1, 1 ], [ 1, 0, 1 ], [ 1, 1, 0 ], [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ], [ 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 ], [ 42, 9, 17, 54, 602, -3, 54, 999, -11 ], [ -11, -3, 9, 17, 42, 54, 54, 602, 999 ], ] random_tests = [] 10.times do a = [] rand(100).times do a << rand(1000000)-500000 end random_tests << a end ntests = 0 nok = 0 (tests + random_tests).each_with_index do |a, i| copy = a.clone selection_sort!(copy) if (copy == a.sort) nok += 1 else puts "failed test #{i}: [#{a.join(' ')}] -> [#{copy.join(' ')}]" end ntests += 1 end printf "passed %d of %d tests (%d%%)\n", nok, ntests, 100*nok/ntests
I suggest that this is a pretty good set of tests (especially if you turn up the number, size and range of the randomly generated arrays), and I’d like to humbly suggest that those of you who have claimed success for your routines might like to translate this test suite into your implementation language and check that your routines really do work. (If you do this, please post the results as a comment on this article.)
But I’m certainly not claiming that this test-suite is complete and exhaustive! If you see any important cases that I missed, please point them out in the comments. Hopefully the outcome of this will be a nice, solid and rigorous test-harness for sort routines, which we can use again when we come to do more interesting sorting exercises later.
What I got wrong
Finally, honesty compels me to admit to the two mistakes I made in my own selection-sort routine. Well, three.
First, Error Zero: I was too impatient to run the code. Yes, after all the good work I’d done preaching about invariants, bound functions and preconditions and postconditions, I stupidly ignored my own awesome advice and didn’t think through my code properly, using these tools, before running it.
So, then: because of that, Error One was that I started the inner loop at one more than the value of the outer loop (which in itself is fine), but started the which-index-has-the-lowest-value-we’ve-seen-so-far variable as nil rather than the value of the outer loop-counter.
Error Two was a really dumb one, a mere typo. I had the swap statement inside the inner loop rather than immediately after it.
These are stupid mistakes, and I kind of wish I’d not told you about them now. Oh well.
I’ll get it right next time, for sure!
—
References
Miller, Barton P., Lars Frederiksen and Bryan So. 1990. An Empirical Study of the Reliability of UNIX Utilities. Communications of the ACM, 33(12).
I approve of writing out test-cases. Not only does this confirm to you personally that the code satisfies certain properties, but it’s great for other people who want to modify your code and need an authoritative way to test that they are still compliant with your specification.
I think automatic test-suites are very helpful for this arena. I defined the above and all I have to write is this, in the Haskell REPL:
quickCheck generates random values of type Ord a => [a], that is, lists of orderable values. It is just a fantastic way to quickly check for errors. In my humble opinion, all libraries should come with a set of tests. Just like you wouldn’t release a language implementation without a grammar. (Cough Perl, Markdown, cough.)
Real World Haskell has a chapter all about testing: http://book.realworldhaskell.org/read/testing-and-quality-assurance.html
By the way, regarding the previous post, I think using Haskell is cheating. You can say a good programmer writes good code in any language, but I think the language makes it easier (or harder). I’d find it harder in other languages.
My previously-untested C++ code passed all tests in the test suite.
Stupid mistakes or not, I think they bear out why I consider the “only 10%” a useless scare statistic. You aren’t in that 10% either for writing a sort, apparently. And, personally, I think that *really* doesn’t matter.
(I’m not sure if my last comment made it through, so let’s try again with pastebin…)
Here’s a translation of the tests into JavaScript:
http://pastebin.com/Pi6Sh5eT
It’s fairly literal, except for the 3 functions at the top. I also added a test that uses +/- Infinity and +/- Number.MAX_VALUE. My selection sort passes it.
passed 29 of 29 tests (100%)
My code ended up pretty similar to yours, so exploring the differences might be enlightening.
The for constructs you use are definitely the way to go (I used ugly while loops. Boo!), and it was clever (but of minimal optimization) to note that you can skip the last element completely in the outer loop.
I only swap values if they aren’t already identical, which leads to fewer swap operations in arrays with duplicates (and skips trying to swap my last element with itself)… would need to test whether the time spent on the extra compare is worth skipping those swaps… depends on your expected inputs.
Also, because I’m a sucker for trinary operators, I’d rewrite your inner if statement as:
Oof — I originally tested and passed all these cases except MAXINT in the input, which crashes me. Needed a less-than-or-equals.
Nah, I don’t mind being picked on. On reflection, you are right—a 0 element array is perfectly valid. On the other hand, being passed a NULL (or null or nil or undef) I do consider invalid (comes from my background in C—all my routines expect valid data (and in the case of pointers, a non-NULL pointer). To me, defensive programming hides bugs.
I’m generally not in favor of specifically testing for an empty array. It’s more than possible (in fact most of the implementations do) to write the code in such a way that empty arrays produce the correct output without an explicit test for them.
Mike, I don’t know what your original code is, but Error One might have been correct. I also put the swap in the inner loop (guarded by the ‘if’ condition). This still satisfied the invariant I had (a[0…k] <= a[k+1…n] & a[0…k] is sorted), and the code passed all my tests which covered all the cases u mentioned above.
Granted, swapping on the inner loop is less efficient, but by definition it should still be correct and satisfy O[n^2]!
Anyway, like you, I will take it out of the inner loop, but won't concede it's not correct :)
passed 29 of 29 tests (100%)
My code passed the test suite, though it lacks the optimizations mentioned by Yobgod. Not sure if i’d call that a bug or not.
rphan wrote:
I think you’re saying that sort routines, when written correctly, do not need a special-case test for a zero-length array? And I certainly agree: the best way to write this code is in such a way that it doesn’t need to handle such cases separately, but just does the right thing as a consequence of the loops being right.
But that certainly doesn’t mean that we don’t test for correct handling of empty arrays in the test suite!
It’s worth noting that your ruby sort program can sort any objects that can be compared. So it has potential to be useful for any comparable type.
That’s a nice feature of ruby!
But one small thing – it’s worth testing if the sort is stable if you are sorting anything other than plain integers. (meaning items that compare the same appear in the same order in the output).
It’s easy to be tripped up by this, and sort stability (or not) should be documented.
Good point on sort stability, Steve. I guess we’ll talk about that in a separate article some time.
As for Ruby’s (and other languages’) duck-typing, and the glorious way they make routines like sort Just Work for objects of any type … I’m sure you spotted the weasel words in the Challenge article: “Don’t bother making it generic for multiple types unless that’s trivial in your language — sorting a list of int is just fine.” In Ruby and philosophically similar languages, you have to go out of your way to prevent such a routine from working on all types; in Java and its brethren, it’s a lot of additional work to make it so. In exchange, of course, people on the Java side of the fence get compile-time type-checking, which can find some classes of bugs, and better run-time performance. There’s a whole nother article to be written about that trade-off; but the bottom line for me is that, except in vary rare special cases, I consider my time more valuable than the computer’s.
Empty arrays.
As much natural as empty arrays are in some high level languages like Ruby, they feel different in C. Like Sean, I did not consider them to be valid. My rationale was that malloc(0) is weird and that empty arrays would probably be NULL pointers in C, hence enjoying special treatment. That said, Mike’s post (plus some reasearch) made me change my mind.
For my defense, my choice was made explicit in the code as I added an assertion that stated the sort() function precondition: “array is not null and size of array is strictly greater than 0.”
That said… my code just happens to work with both a NULL array and an array of size 0 anyway. This is just because I did not do a “minus 1” in the outer for loop.
As a rule of thumb, I tend to avoid +1 and -1 in boundary indexes like the plague, and I generally consider +N/-N with N>=2 to be probably wrong (and if not I comment). These index shifts are usually a source of trouble, leading to overruns or underruns, especially when you come back later an modify the code.
Oh yeah. And it seems that my code was correct, but it’s not optimal.
My code passed the tests, and a more exhaustive random testing including MAXINT (and MININT if that matters). It uses unsigned integers for array size and has a simple check to exit on sizes less than 2, so it should handle the maximum array size as well.
Reporting that all tests passed.
My JavaScript implementation passed Jesse Merriman’s translation of these test cases.
(I hadn’t tested it ’til now.)
@Mike Taylor: That’s exactly what I meant! :)
Or, to clarify, many of the submissions had a:
if array is empty, return empty array
statement at the top. They were then followed by an implementation that loops over the array and in fact would have simply returned an empty array anyway given one for input. Not only does it make the conditional at the top redundant, but gives the impression that an empty array is an edge case that needs special handling, when more often than not it isn’t and doesn’t.
Mike, I noticed your code does:
for i in 0..a.size-2 do
[/code>
What does this do when a is empty? Or has less than 3 elements? If it passes, then I guess Ruby skips the loop of the upper count is negative. That … doesn’t sit well with me.
@Sean Conner:
If you mentally translate it to the conventional C version;
“for (i = 0; i <= size-2; i++) …"
it might make more sense…
You might argue that "a high-level language like Ruby" should always iterate from the smaller value to the larger, regardless of how they were specified, or that it should automatically decide to decrement the loop variable instead of incrementing it when appropriate, or that specifying a "range" with the first value larger than the last should give an exception… Such behaviour would tend to surprise programmers, though — and the benefits are questionable.
For those who submitted a Python function and want to try Mike’s test suite, it translates pretty directly. Tweak line 38 if your function returns a sorted copy rather than sorting in-place.
Oops, I mean tweak line 40. Forgot I had added the leading comment.
Sean Conner wrote:
If you ask Ruby to loop from a number up to another that is less than it, then the loop executes zero times. This seems obviously correct to me — what else could it do?
Sean,
Despite mostly writing it in low-level constructs, Mike did use one bit of so called idiomatic ruby there (because ruby doesn’t give you any choice to do a C style loop, afaik): the range operator, x..y.
Either value is allowed to be negative in a range. If Y is less than X, the range is essentially empty (if you call to_a on it, for instance, it will return []).
Don’t worry, you’d hardly ever do it that way anyways, that was just the most comparable to standard Algol language style way.
My algorithm passed all of your 19 tests as well as 100 random ones.
Just in case anyone wants to use the java code to test their algorithm (or wants to point out errors in my tests…):
Here’s my test harness for C. The tested function has the prototype
and sorts the array in place. I added a few tests that included INT_MIN and INT_MAX in the arrays. All tests passed.
Interesting. Had I used int instead of size_t (in my C version) then the 0 case would have been handled, but I would have been restricted to only INT_MAX items (minimum maximum of 32,767, if that makes any sense) instead of ULONG_MAX (4.2 billion, which is big enough for anybody).
The bug I initially hit was that with unsigned arithmetic in C, that 0 – 1 = U*_MAX (* being the base type), so it failed spectacularly. The convention in C is to use size_t types to iterate through arrays, as negative indices tend to fall into undefined behavior (and that may explain why “negative” indicies don’t sit well with me). And there’s always the fact that using signed quantities cuts the potential size of your arrays in half (not that they’ll necessarily ever get that large).
@Jakko: calling srand(15739603U) is technically invalid C code, as 15739603U could be larger than UINT_MAX on some platforms.
@Sean: Yep, I know, and I probably should have changed it before posting. Initially, the code was just test code to be run on one specific machine that I know to have 32-bit ints, so I consider it acceptable, but it shouldn’t have made it to wider distribution.
And you could have used long to get a larger size without hitting any problems with unsigned.
Re: Ruby num..num logic:
If the right one is smaller than the left, it does nothing, the range has no length. Unless it is negative.
If the right number is negative, it represents the nth digit from the *right* side. ie, [1..-1] will get everything but the first item in the array, [1..-2] will get everything but the first and last.
“Not only that, but the simple way of doing this check is O(n^2), which is just as expensive as the sort itself.”
Wait, what? There’s a case that inefficiency can be a good sign when you’re writing a test. It’s good to write a test in as simple and declarative a way as possible, so you reduce the chance of a bug in the test. You think, “What’s the relationship between the input and the output of this routine?”
“The output should have the same number of each input element, and all the elements should be in order.” Something like that. Then you write a test that tests exactly that in as literal and unoptimized way as possible.
Yes, unit tests shouldn’t take a long time. Using the above in Paul Winkler’s test makes it take 0.58 sec on my old machine. I would recommend everyone make your tests as stupid as possible. That is the right way. If they get slow, consider using fewer or smaller test cases.
The “compare against a known good function” approach has a couple of problems. First of all, it’s not always an option when you’re writing something you don’t already have! Second, a “known good” function can have a subtle bug. It might have been written by you, with the same blind spot you have now. Also, in this case it wouldn’t work with unstable sorts where different items compare “equal”.
Anyway, mine passed. Thanks, Paul Winkler. I thought it was funny that “from selection_sort import selection_sort” worked verbatim. But…is this a selection sort?
It does find the smallest and put it in place, it just also keeps the smallest seen so far in the array as it’s looking. Is the separate index- of- the- smallest- seen- so- far variable worth the speed you get from it? You did get a bug by initializing it wrong. The most reliable part is the one that isn’t there.
Pingback: How correct is correct enough? « The Reinvigorated Programmer