How slow are functional implementations of quicksort?

I’ve been talking with my colleague Jakub Skoczen.  He’s insisting that functional implementations of quicksort, which make and return new arrays at each level of recursion, must surely be much slower than imperative implementations that modify an array in place.

And I can see his point.

We all know that tail-call optimisation takes care of the grosser inefficiencies when recursion is used to faux-iterate along a list.  For example, the following recursive Lisp function returns a list whose elements are double those of the list passed in:

(defun double (x)
  (if (null x) x
      (cons (* 2 (car x)) (double (cdr x)))))

(For those of you unfamiliar with Lisp syntax, this says: if the argument to double is an empty list, then the result is empty; otherwise it’s twice the head of the list concatenated with the result of running the double function on the tail of the list.)

It’s easy to see how a clever compiler can recognise that the last thing the function does is re-invoke itself on a lightly modified version of the same argument, and that this can be optimised by changing the argument in place and jumping back up to the top of the function.  And indeed this is what many compilers will do.  (IIRC, Scheme guarantees this; other Lisp dialects may not.)

But consider a functional implementation of the quicksort algorithm.  Quicksort sorts an array by picking a pivot element at random, partitioning the array into two subarrays containing the elements less than and greater than the pivot, and returning the concatenation of three things: the result of quicksorting the less-than subarray, the pivot itself, and the result of quicksorting the greater-than subarray.  As an example, here is the Ruby version that I wrote for the JRuby post:

def qsort(a)
  return a if a.length <= 1
  pivot = a.delete_at rand(a.length)
  return (qsort(a.select { |x| x <= pivot }) +
          [ pivot ] +
          qsort(a.select { |x| x > pivot }))
end

This is doubly recursive: the qsort function calls itself in two places, and only one of those can be optimised as tail-recursion (right?)

Which surely means that the first of the two recursive qsort invocations has to be done properly, with a real recursive call, a new stack frame, and lots of constructing and copying.  Which means that Jakub has to be right, doesn’t it?

And so to my question: are we missing something?  Can a Sufficiently Smart Compiler somehow optimise this recursive quicksort into one that is as efficient as an imperative quicksort that swaps elements in place?  Do such optimisations exist even in theory?  Better still, are there any functional-language compilers that actually implement them?

Inquiring minds want to know!

44 responses to “How slow are functional implementations of quicksort?”

  1. Maxime Caron

It can be done using uniqueness types, as in Linear Lisp.

    see http://portal.acm.org/citation.cfm?id=181750

  2. Matt DiMeo

    The recursive calls in qsort aren’t tail calls.

    // tail call
    return f(differentarg) ;

    // non-tail-call
    return 1+f(differentarg) ;

    The first one can be optimized by changing the original parameter to the function and branching. The second one cannot be so optimized.

  3. again with the fucking sushi?

    I stopped reading after the first pic

  4. First, your version of qsort is not stable; you should do three selects, one for == pivot too (if you picked the last item as pivot, this is not needed).

    Second, you are iterating the array 2 times (3 times with my suggestion); instead use a split function that produces the 2 (3) arrays in one iteration (see the sketch at the end of this comment).

    In a lazy language, if you use only selected parts of the sorted list, it can skip the other thunks and thus not have to run the full qsort. But this same property will likely prevent any in-place optimization.

    In a strict language (where the compiler can prove no side effects), I still doubt it can optimize this into an in-place algorithm, though it can likely turn it into a loop. But that prevents it from stack-allocating the (smaller) arrays, which could reduce memory accounting (trading it for stack space).

    Indeed if anyone knows of compilers or papers that turn a functional qsort into an in-place algorithm, or do other smart tricks, I would love to know about it too. (So good question!)
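
    For instance, a single-pass three-way split might look like this in Haskell (a sketch only, over lists rather than arrays, as in most functional formulations; the names partition3 and qsort3 are made up for illustration):

    partition3 :: Ord a => a -> [a] -> ([a], [a], [a])
    partition3 pivot = foldr step ([], [], [])
      where
        step x (lt, eq, gt)
          | x < pivot  = (x : lt, eq, gt)
          | x == pivot = (lt, x : eq, gt)
          | otherwise  = (lt, eq, x : gt)

    qsort3 :: Ord a => [a] -> [a]
    qsort3 []       = []
    qsort3 (p : ps) = qsort3 lt ++ (p : eq) ++ qsort3 gt
      where (lt, eq, gt) = partition3 p ps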

  5. Be careful to distinguish a functional-looking solution in a language with destructive updates and a solution in a functional language–especially a _pure_ functional language.

    Okasaki’s “Purely Functional Data Structures” goes into this sort of issue in great depth. I’m no expert, but this is what I think I recall…

    If the language does not support destructive updates, then at runtime, on each recursion, a “new” data structure can be created which shares structure with the “old” version. So there is no need to copy the values from “old” to “new”; just point into the “old” structure where the values you want are. This can be very efficient.
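
    A toy illustration of that sharing, in Haskell (illustrative only, not from Okasaki):

    -- ys reuses the entire tail of xs by reference; only one new
    -- cons cell is allocated, and nothing is copied.
    xs, ys :: [Int]
    xs = [1, 2, 3, 4]
    ys = 0 : tail xs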

  6. > This is doubly recursive: the qsort function calls itself in two places, and only one of those can be optimised as tail-recursion (right?)

    I believe this is wrong. Neither is in a tail call position, from what I remember from Scheme class, because you are doing something with the result of both of them, concatenating them with the pivot element.

  7. My impression is that there are no compiler-side optimizations.
    A while ago I learned OCaml for fun, and the OCaml manual has a few paragraphs about implementing recursive algorithms that manage arrays.
    The manual points out that the most efficient way to solve this performance issue (of stack growth) is to use tail recursion.
    http://en.wikipedia.org/wiki/Tail_recursion

    This is a functional way to implement imperative-like behaviour.
    The OCaml manual points out the stack issues, but I don’t know about other languages/compilers.

    But please consider that tail recursion is good functional programming practice, which should be considered part of the basics.

    A long time has passed, so please forgive my vagueness.

  8. I don’t know about all functional implementations. In Clojure (based on Lisp) “all collections are immutable and persistent. In particular, the Clojure collections support efficient creation of ‘modified’ versions, by utilizing structural sharing, and make all of their performance bound guarantees for persistent use.” See also http://en.wikipedia.org/wiki/Persistent_data_structure.

    I don’t have benchmarks handy to back it up, but at least they are not naively recreated every time.

  9. In functional languages you are rarely dealing with arrays. Quicksort on a linked list, for example, can be rearranged by simply creating new nodes that point to other parts of the list, which is vastly cheaper than creating a new copy of the whole array.

    If you want to know more about how it is possible to create performant functional data structures and algorithms that maintain referential transparency and persistence, you should check out “Purely Functional Data Structures” by Chris Okasaki. His ~20 line red/black tree example is really amazing and illustrates that the power and expressiveness of these systems does not imply a performance penalty.

    As for sufficiently smart compilers, if that is your interest I recommend learning a little bit about what GHC does internally. There are some really amazing optimizations like deforestation and stream fusion which are made possible by functional languages.

  10. Your first Scheme example is not a tail call. The last thing the function does is call ‘cons, not recurse. You need something more like this:

    (defun double (x so-far)
      (if (null x) (reverse so-far)
          (double (cdr x) (cons (* 2 (car x)) so-far))))

  11. First, you actually haven’t implemented quicksort. Second, you’re looking at the wrong performance problem. The introduction of On Iteration by Andrei Alexandrescu (http://www.informit.com/articles/printerfriendly.aspx?p=1407357) has a really good deconstruction of the functional ‘quicksort’ algorithm. In short, “Quicksort, as defined by Hoare in his seminal paper [8], is an in-place algorithm.” Since functional languages can’t mutate state, you can never define a quicksort routine. So let’s consider what a Sufficiently Smart Compiler would need to know to do functional quicksort with only 0 or 1 data copies (i.e. internally in-place, with one possible copy if multiple references to the array or list exist). It has to know that the two filter/select routines create two mutually exclusive sets. It can then transform filter/select into an internal partition function.

    As for recursion itself, most quicksort algorithms I’ve seen in Java, C++, Wikipedia, etc. are all recursive.

    P.S. Performing a quicksort with floating point numbers is not bug-free. Specifically, the quicksort algorithm has the invariant that the elements have a total ordering, which floating point numbers don’t have.

  12. Okay, each step of a quicksort addresses the whole array once through: the first time in one pass, and subsequently broken up into 2^k parts, but adding up to one whole array for each level of depth. The levels of depth average log2(n) (worst case is n, but that is a pathological condition with good implementations).

    So quicksort is O(n log n). Now consider a pure functional variant. Each level of depth requires that we copy the entire array once. A bad implementation might copy the whole array for each comparison and swap but functional style dictates that you keep a list of swaps and apply them all at once. Copying an array is O(n) and you do it log n times so the functional version is O(n log n).

    So no, the functional version is not much slower. But it probably is slower than a good C implementation. Almost certainly at least three times as slow and probably slower. But so is an imperative version in Java or C-octothorpe.

    Clojure (and Scala and Haskell) has advanced data types that minimize the cost of copying arrays when elements are updated, mostly by using trees connecting subarrays of 32 or 64 elements and only copying the subarrays that change. All of that happens cheaply and transparently under the hood. It doesn’t help much with the granularity of quicksort, which will require a full copy.

    Clojure includes a mechanism for pulling back the immutability of arrays inside inner loops when performance is critical. Keeping transients inside one function preserves the immutability property for the rest of the program and speeds up some rare cases considerably. One nice thing about Clojure is that edge cases like this can be addressed without losing the abstractions that make the language powerful.

  13. I think another question worth asking is why you would want a compiler to change the actual function of the code so thoroughly. Debugging will become more difficult if the compiled code doesn’t look like the source code at all, surely? And a compiler that clever is more likely to cock up in tricky-to-pin-down ways than a simpler compiler. Hidden compiler errors are the worst thing that can happen to a coder. That said I may be the last person to comment – I’ve never written a single line of recursive code in a released project. Have I missed out? Well, clearly I’ve missed out on opaque code, inconsistent frame rates and hours of bug tracking, but what else?

  14. Canonical quicksort in Haskell:

    qsort []     = []
    qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++ qsort (filter (>= x) xs)
    

    The ‘filter’ command is iterative (tail recursive)

  15. @Brian

    This can be optimised away as there is such a thing as tail recursion modulo cons:

    http://en.wikipedia.org/wiki/Tail_recursion_modulo_cons#Tail_recursion_modulo_cons

    If you have a function

    (define (nums-from-to x y)
      (if (= x y)
          (list x)
          (cons x (nums-from-to (- x 1) y))))
    

    this can be optimised to be tail recursive. You could generalise this from CONS to primitive operators (+, -, etc.) and end up with quite aggressive TRMC->TR optimisation.

  16. As Brian says, Clojure allows you to do some mutable operations in an isolated fashion on things like arrays.

    Haskell has a similar thing called ST http://www.haskell.org/haskellwiki/Monad/ST which allows mutable state like arrays or references to things or whatever. You run the computation inside a monad, and the runST function is completely pure. `runST foo’ will always return the same result. Example: http://nickvanderweit.net/?p=14
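
    A minimal sketch of the idea (the names sortST and qsortInPlace are made up; only the Data.Array.ST interface is standard):

    import Control.Monad (when)
    import Control.Monad.ST (ST, runST)
    import Data.Array.ST (STUArray, getElems, newListArray, readArray, writeArray)

    -- Copy the input into a mutable unboxed array, sort it in place,
    -- and read the elements back out.  runST guarantees the mutation
    -- can't leak, so sortST is observably pure.
    sortST :: [Int] -> [Int]
    sortST xs = runST $ do
      let n = length xs
      arr <- newListArray (0, n - 1) xs
      qsortInPlace arr 0 (n - 1)
      getElems arr

    -- Textbook in-place quicksort (naive last-element pivot, Lomuto
    -- partition) over the index range [lo, hi].
    qsortInPlace :: STUArray s Int Int -> Int -> Int -> ST s ()
    qsortInPlace arr lo hi = when (lo < hi) $ do
      pivot <- readArray arr hi
      let swap a b = do
            u <- readArray arr a
            v <- readArray arr b
            writeArray arr a v
            writeArray arr b u
          -- walk j across the range, keeping elements < pivot left of i
          go i j
            | j >= hi   = return i
            | otherwise = do
                x <- readArray arr j
                if x < pivot
                  then swap i j >> go (i + 1) (j + 1)
                  else go i (j + 1)
      i <- go lo lo
      swap i hi
      qsortInPlace arr lo (i - 1)
      qsortInPlace arr (i + 1) hi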

  17. Since pure functions can be treated as black boxes, a sufficiently smart implementation of functional quicksort would be to copy the array and then run an in-place quicksort on the copy.

    Alternatively, instead of using an imperative in-place quicksort on the copy of the array, one could write a version that uses uniqueness types to allow the compiler to use mutation under the hood.

  18. The double function is not tail recursive. If you are used to C-like languages, an easy way to determine if a call is in the tail position is to ask yourself if inserting a “return” before the call would break the function. Your example gives:

    function double(x)
    {
      return isNull(x)
        ? x
        : cons(2 * car(x), return double(cdr(x)));
    }

    The first return is there to reproduce the implicit lisp return. The second return is the important one and will bypass the call to cons. This new function is broken and will always return null.

    In the next paragraph, you imply that a function must call itself in the tail position to be subject to tail-call elimination (“jumping back to the top of the function”). This is not the case. Any target will do. In your example, the call to cons is in the tail position and will be subject to tail-call elimination. This is important if you want to implement state machines or write your code in continuation passing style. TCE applies if the tail position contains a call. It does not care, must not care about the target. You can also handle self-tail-calls in a special way, but this buys you nothing but speed, whereas TCE allows functional languages to replace jump-based flow-control with function calls.

    Yes, only one branch of your ruby quick-sort can be in the tail position. The other will use stack space. Qsort always requires log(2, n) space to remember the boundaries. You could modify your implementation to reduce the constant factor by putting a recursion in the tail position, but the log(2, n) has to be there in both functional and imperative implementations.

    Function call overhead (stack frames) will not be the dominant factor in a functional qsort implementation. If you want the function to be more efficient, you should reduce the number of intermediate allocations, to reduce GC stress. When sorting a lisp list, you could build it as you go. When sorting an array, you could return a rope, building it as you go. In an impure language, the fastest qsort is probably something like function qsort(x) { return qsortInPlace(copy(x)); }.

    Just for fun, you could move both calls to qsort into tail positions by rewriting the code in continuation passing style (see the sketch at the end of this comment). It would probably be slower and definitely uglier. You would still need the log(2, n) space.

    A SSC might notice that all your calls to the array constructor end up in calls to append. It might also notice that they can’t end up anywhere else. It might also notice that they add up to the length of the original array. With that information, it might decide to skip the call to append and hand you a slice of a single n-long array. Don’t count on this kind of smarts.

    BTW, I don’t know Ruby, but I’m pretty sure that your implementation will modify the source array. Since the first iteration will work on the outside-visible array, that’s bad. Either that, or you will have multiple instances of the pivot in the result. That is also bad.
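
    Such a CPS rewrite might look like this in Haskell (a sketch; the continuations now hold the log(2, n) of state the stack held before):

    -- Both recursive calls are now syntactic tail calls; the work of
    -- combining the results moves into the continuation k.
    qsortCPS :: Ord a => [a] -> ([a] -> r) -> r
    qsortCPS []       k = k []
    qsortCPS (p : ps) k =
      qsortCPS [x | x <- ps, x <= p] $ \smaller ->
        qsortCPS [x | x <- ps, x > p] $ \larger ->
          k (smaller ++ p : larger)

    qsortViaCPS :: Ord a => [a] -> [a]
    qsortViaCPS xs = qsortCPS xs id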

  19. This performance question is difficult to answer because you will run into other language design differences, such as Clojure’s immutable collections or Lisp’s linked lists. None of these will ever compete with a mutable array. Even if the functional compiler is smart enough to produce the best possible linked-list code, it won’t be as fast as code that sorts a simple array with in-place updates. Even if the compiler is able to use a raw array and in-place updates as a temporary representation, there’s still the cost of converting the original data structure to that optimized form and, after sorting, converting it back to the higher-level (and inefficient) form. These are just constant-factor costs, but they are quite significant, particularly on large data sets – when you have millions of elements to sort, so they don’t fit in the L1 cache, each extra pass over the data will make your whole sort 2X slower, and there’s nothing you can do about that (compiler magic, parallelism, etc.).

  20. A couple points to consider, before declaring a “win” for imperative programming on this issue (not that that’s what you’re doing, just in case anyone interprets your post that way):

    1. A recursive function operating on a mutable array could still be considered functional (though not pure) if the visibility of the side-effects is limited. (See Clojure’s transients (http://clojure.org/transients))

    2. A well designed purely functional/immutable language is far less likely to use an array for storing a logical vector. Internally it would probably use some sort of tree-like structure, which would utilize shared structure so there would be much less need to create and copy arrays.

  21. I think a compiler can do a real recursion for one call and tail-recurse the other call.

  22. @Luke V.: Yes, if you have a functional language that supports arrays, it can be just as fast – btw, such a language can even be purely functional: you write code that in theory replaces an array’s content with a new array, but under the covers a Sufficiently Smart Compiler produces in-place updates of that array because it sees that the old value is not aliased and won’t be reachable after the sort.

    JavaFX Script’s sequences are a very good example of that: they are immutable, but internally backed by arrays, and the compiler is capable of Clojure-like usage of a mutable representation; the immutable sequence is just a thin wrapper over the mutable data, so these transitions have O(1) cost when the compiler is “smart enough”. The language doesn’t offer Lisp-like manipulation syntax that sees the sequence as car/cdr; programmers are instead presented with an array-like model, with operations that are efficient over arrays, including indexed access, initialization with generators, and bulk concatenation. The compiler is not yet very smart; it does some optimizations, but there are many more that are possible and should follow.

    A tree-like structure is indeed much better than a linked list with one node per element, but in practice it’s only an improvement when you need few updates. Sorting a random collection will still create a hideous amount of extra work.

  23. Is there any real benefit to doing a so-called “functional quicksort” vs a functional merge sort? (A sketch of the latter is at the end of this comment.)

    I think you may be discarding the benefits of the quicksort vs the merge sort in your functional formulation. The whole point of quicksort is that it has lower memory overhead and is more optimizable in languages like C.

    Mergesort actually has a stronger complexity guarantee. Its worst case runtime is O(n*log(n)), whereas for quicksort the worst case time is O(n^2). It’s only quicksort’s *average* runtime that is O(n*log(n)).

    I like functional programming, and I think there are many places where a functional implementation of an algorithm is clearer. However, often that comes at the expense of runtime complexity. But, in many cases, that just doesn’t matter!

    Honestly, one of my favorite languages is Python, and it is dog slow. But so what? In most cases a slow language is still plenty fast. For the cases that it isn’t, I’m still comfortable programming in C or C++…
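
    For comparison, here is the usual textbook list merge sort, which is a natural fit for immutable data (a sketch in Haskell):

    -- O(n log n) worst case; merge only ever walks and rebuilds list
    -- spines, so immutability costs little here.
    msort :: Ord a => [a] -> [a]
    msort []  = []
    msort [x] = [x]
    msort xs  = merge (msort front) (msort back)
      where (front, back) = splitAt (length xs `div` 2) xs

    merge :: Ord a => [a] -> [a] -> [a]
    merge [] ys = ys
    merge xs [] = xs
    merge (x:xs) (y:ys)
      | x <= y    = x : merge xs (y:ys)
      | otherwise = y : merge (x:xs) ys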

  24. Arild Haugstad

    First, as others have mentioned, the “last” thing the double-function does is the cons, not the recursive call, and similarly, the last thing the qsort-function does is the list construction. There are common techniques for writing “proper” tail-recursive functions, but that translation may be comparable to recognising and replacing a selection sort with a quicksort; “obvious” on some level of abstraction, but not something to expect the compiler will do.

    On the other hand, I get the impression that you consider the function calls expensive, and lament the “fact” that only one of the two recursive calls in a quicksort will be tail-call-optimised — unlike in imperative languages like C, where you get the “benefit” of being able/forced to handle the one side with a normal loop, and the other with either a recursive call (stack frame and all that) or by managing your very own stack implementation…

    There *will* be “extra” objects constructed, but the “expensive” part should be in the construction of the new arrays/lists for “<pivot” and “>pivot”, not in passing those to recursive calls… Still, allocating and releasing memory with an optimised GC is a lot cheaper than malloc/free in C (unless available memory is low), placing elements in “new” arrays isn’t much worse than swapping them in existing arrays, and calls/stack frames don’t need to be as expensive as those specified for the platform C calling convention — so it doesn’t need to be *much* slower. (Then again, it depends on your definition of “much”. I guess I need to test/compare; just not today…)

  25. 1) As others have pointed out, your example of a tail-recursive function in Scheme is not in fact tail-recursive, since it can’t reuse the same stack frame.

    2) This problem seems to me to be deeper than just an issue of recursion versus iteration; you’re really talking about mutable versus immutable data structures. If I recall, Clojure does some clever things to make copying its immutable sequence types less expensive, which I think maybe addresses the real issue of creating all those copies of the array. Can somebody who knows Clojure better (or knows another language that does this–Haskell? OCaml?) confirm or deny?

  26. Just for laughs I created a naive Haskell program to sort an array of integers and compared it with a naive C program on the same box. The Haskell version used 3.2G and 100% of one CPU core and took 25.3 seconds on 10k values. The C version did it in 0.002 sec. The C program sorted 1 million values in 0.115 sec.

    The figures speak for themselves. On the other hand, if you actually needed to do this in your Haskell program, it would be trivial to use the C FFI to use a C function to handle that bit of work.

  27. Many have pointed out that the function double is not tail-recursive. Chris Done commented that there is such a thing called “tail recursion modulo cons”, with which double can be implemented in a tail-recursive manner.

    My two cents:

    1. I’m sceptical of Chris’s claim that this optimisation can be generalised to other primitive operators like + or - (as an automatic transformation). Notice that to transform

    fact n = n * fact (n-1)

    to

    fact n = fact' n 1

    fact' n m = fact' (n-1) (n*m)

    (some cases omitted) we need to know that (*) is associative — the orders of multiplication are different in the two programs (the filled-in versions are sketched below). To verify such a transformation, the compiler would need more knowledge about the domain.

    One may say that fact and fact’ are different algorithms that happen to compute the same function.
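
    With the omitted cases filled in, the two versions look like this (a sketch):

    fact :: Integer -> Integer
    fact 0 = 1
    fact n = n * fact (n - 1)          -- multiplies n * (n-1 * (... * 1))

    fact' :: Integer -> Integer -> Integer
    fact' 0 m = m
    fact' n m = fact' (n - 1) (n * m)  -- accumulates in the other order

    -- fact n == fact' n 1 holds only because multiplication is
    -- associative (and commutative); a compiler can't assume that of
    -- an arbitrary operator.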

    2. Similarly, for a compiler to “optimise” a copying quicksort to one that uses in-place update, it has to know a vast amount of domain-specific knowledge. The two quicksorts should be considered different algorithms. As one commenter said, it’s like “optimising” a selection sort to quicksort.

    This is not usually what we expect of a compiler, and it is in fact dangerous for a compiler to attempt to do so without manually inserted heuristics or compiler pragmas.

  28. @Doug: indeed, I think the problem is not iteration vs. recursion but the space requirements. Both quicksorts run in O(n^2) worst case and O(n log n) average time, but the functionally pure variant uses quadratic space in the worst case, including all the copying that goes into that (and it also filters the array twice, instead of partitioning it, but that is only a constant factor).

    I would be quite surprised if Clojure can optimise this away. I’d guess that its persistent data structures will do a good job with a small number of changes to a list, keeping track of what has changed and falling back to the original structure for the rest, for example for the concat case (left + pivot + right).

    But these filter operations create a completely different array. You’d have the choice of either evaluating the predicate again for each operation on it, or building a new array. And in each recursion, we build up more and more filter predicates that need to be applied one after another. I’d expect Clojure to just build a new array.

    In any case, you are either looking at quadratic space, or at an additional n comparison operations for each recursion step, which would turn the algorithm into O(n^3) worst case and O(n^2 log n) average. Both kind of suck ;-)

  29. What the hell. Don’t you guys read? “double is not tail recursive”

    http://en.wikipedia.org/wiki/Tail_recursion_modulo_cons#Tail_recursion_modulo_cons

  30. Is there a different sort algorithm that is more of a natural match for the pure functional paradigm? Would it make more sense to compare ‘quicksort in a mutable-array environment’ with this instead?

  31. >In functional languages you are
    >rarely dealing with arrays.

    Only if you are focused on the Miranda/Haskell model of functional languages. There’s an entire other branch of functional languages which are array-based: FP, J, K, and others.

    Also, there’s an excellent paper (which I can’t find online) which shows that this favorite Haskell example isn’t a quicksort at all, but a deforested tree sort.

  32. I have a question… (I’m not a “functional” guru.)

    Mike’s post asks about functional implementations of quicksort, but
    I have understood that quicksort is an algorithm which was designed with an imperative style in mind and optimized for imperative languages.
    My question is:
    Are there functional-specific sorting algorithms (ones developed specifically to be efficient in a functional setting)?

  33. @Jonathan Hartley
    My post poses a similar question; I’m sorry I missed your post.

  34. Realistically, it seems unlikely that any algorithm that builds fresh lists on each recursion can ever be as fast as an in-place algorithm on typical hardware today. Even if allocating and releasing the extra memory is relatively fast in your language of choice, there is always some overhead. In any case, using extra copies probably means poor data locality, which means poor use of caches, which dominates performance on many modern architectures.

    To get effective performance out of poor man’s quicksort, you therefore need your optimiser to be smart enough to convert it into an in-place algorithm. The very first comment here, by Maxime Caron, is probably the most relevant on this point: typical type systems in today’s mainstream languages aren’t powerful enough to make the necessary guarantees, but there are more powerful type systems that would recognise the single-movement nature of quicksort’s partitioning algorithm. Given such a system, it would be theoretically possible for an optimiser to safely convert from poor man’s quicksort into real, in-place quicksort.

  35. Mike, your post is wrong on so many levels.

  36. I lost count at lg( n ).

  37. Steve Witham wrote:

    Mike, your post is wrong on so many levels.

    Regular readers will have noticed that I have a very liberal commenting policy: I’ll basically let anyone post anything so long as it’s not spam. But occasionally I see a comment like this one, which is merely abuse and has no actual point to make, and I wonder whether I should filter a bit more strictly.

    I’m interested in others’ thoughts on this.

    (To be clear, I have no problem whatsoever with a comment that begins with “your post is wrong on so many levels” and then goes on to explain why: comments of that kind have been some of the best on this blog, and I’ve learned a lot from them. The ones I’m not sure about are where the comment is just the abuse and nothing more.)

  38. Mike, I was going to say the same about Steve’s post lacking content, but thought complaining about it might be equally uninteresting. I’m with you, though. I’d filter that comment out, not because it’s a bit mean but because it’s not worth anything.

  39. Mike, it’s your blog so you get to choose. But I certainly see no point in posts that don’t even try to make an interesting point and would suggest that Steve’s post could be deleted without damaging the blog at all.

  40. Greg Black

    Sorry, the previous comment got a nick instead of my name from a borked WordPress login.

  41. I’m kind of disappointed with this post. It talks a lot about recursion when the real topic is immutability. There are impure functional languages, e.g. OCaml, which have mutable arrays.

  42. An integer array can be used to represent a data sort and yields results that are most often faster than quicksort. See http://armr.sourceforge.net/RomLists.html.

  43. Scott O'Dell

    Most functional algorithms can be implemented with tail-recursion, but the hoops you need to jump through often eliminate any readability and debugging benefits that functional programming normally provides.

    Here’s the quicksort written with tail-recursion (except within the less-greater function). It’s a long mess and at this point, a procedural program may be much more readable.

    (defun less-greater (pivot lis &optional (less nil) (greater nil))
      (cond ((null lis) (list less (list pivot) greater))
            ((< (car lis) pivot)
             (less-greater pivot (cdr lis) (cons (car lis) less) greater))
            (t (less-greater pivot (cdr lis) less (cons (car lis) greater)))))

    (defun qsort (lis)
      (qsort-help (list lis) nil))

    (defun qsort-help (queue result)
      (let* ((x (car queue))
             (pivot (car x))
             (lg nil))
        (if (< (length x) 2)
            (setf result (append result x) queue (cdr queue))
            (setf lg (less-greater pivot (cdr x)) queue (append lg (cdr queue))))
        (if (null queue) result
            (qsort-help queue result))))
    
  44. @Chris: Sorry, missed that post. That’s interesting, and it definitely makes intuitive sense that something like it would exist.
