It's only a small story: I was one of the group sent to Iceland for the Need for Speed Sprint on Python back in ~2006. Among the other participants was, of course, tim. I mostly worked on benchmarking to determine how much of a speedup we had achieved.
tim gave me some really eye-opening tips. One that's really stuck with me: take the best of many short runs of a benchmark rather than running it for a long time, because a long run is all but guaranteed to pick up interference from other parts of the OS (scheduling, etc.).
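The "best of many short runs" tip is easy to apply with the stdlib; here's a minimal sketch (the statement being timed is just an example):

```python
# Take the best of many short runs: timeit.repeat runs the statement
# several times, and the minimum is the timing least polluted by OS noise.
import timeit

timings = timeit.repeat(
    stmt="sorted(data)",
    setup="import random; data = [random.random() for _ in range(10_000)]",
    repeat=7,   # many short runs...
    number=20,  # ...each doing a modest amount of work
)
best = min(timings)  # keep the best, not the mean
print(f"best of {len(timings)} runs: {best:.4f}s for 20 sorts")
```

The mean mixes in whatever the scheduler did during the slow runs; the minimum is the closest you get to an uninterrupted measurement.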
tim is an insanely smart guy, kind, and funny.
One day for lunch we were ordering takeout from this little joint, I forget the name, it might have been Hamborgarabulla Tomasar, and I picked up the check for everyone (maybe 15 people). tim described the food as "shockingly good", which was absolutely accurate. And given how expensive food in Iceland was, it wasn't terribly bad price-wise either.
Interestingly, Timsort used to be broken [1]. The bug was found by researchers trying to formally prove its correctness. I recall Stijn de Gouw remarking that they wanted to have their proof artefact evaluated, but you'd need a seriously beefy machine to do so.
Timsort itself was not broken; the implementations were, which is a slightly (but crucially) different issue: the implementations didn't uphold the algorithm's run-stack invariants at the upper edges.
Not only that, but CPython's implementation's misbehaviour effectively couldn't be triggered, because you couldn't create an array large enough to reach it.
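For context, the invariant in question is on the stack of pending run lengths. A sketch of the whole-stack version that the fix enforces (naming is mine, and this is simplified; the buggy implementations only re-checked the topmost runs after a merge, so a violation could survive deeper in the stack):

```python
def invariant_holds(runs):
    """Check Timsort's run-stack invariant over the whole stack.

    `runs` is the stack of pending run lengths, bottom to top. For every
    adjacent triple, the lower run must be longer than the two above it
    combined, and each run must be longer than the one above it.
    """
    for i in range(len(runs)):
        if i >= 1 and not runs[i - 1] > runs[i]:
            return False
        if i >= 2 and not runs[i - 2] > runs[i - 1] + runs[i]:
            return False
    return True
```

The subtlety: `[50, 30, 28, 1]` satisfies the invariant for its top three runs (30 > 28 + 1, 28 > 1) but violates it one level down (50 < 30 + 28), which is exactly the kind of state the original merge logic could leave behind.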
Closely related is pattern-defeating quicksort ( https://github.com/orlp/pdqsort ), which adapts quicksort to take advantage of sorted runs. I've adapted a few quicksorts to pdqsort and seen good speedups (as people were often sorting partially sorted data).
Basically: Timsort is to mergesort as pdqsort is to quicksort
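One of pdqsort's adaptivity tricks can be sketched in isolation (this is a heavily simplified illustration of the idea, not pdqsort itself, which also does block partitioning and pattern-breaking; the function name and budget are my own choices): when a partition step does no swaps, optimistically try an insertion sort with a small budget of element moves, and bail out if the slice wasn't nearly sorted after all.

```python
def partial_insertion_sort(a, lo, hi, budget=8):
    """Insertion-sort a[lo:hi], giving up after `budget` element moves.

    Returns True if the slice ended up sorted, False if we bailed out.
    Even on bail-out the array remains a valid permutation.
    """
    moves = 0
    for i in range(lo + 1, hi):
        x = a[i]
        j = i
        while j > lo and a[j - 1] > x:
            a[j] = a[j - 1]  # shift larger element up
            j -= 1
            moves += 1
            if moves > budget:
                a[j] = x  # fill the hole so the array stays consistent
                return False
        a[j] = x
    return True
```

On almost-sorted input this finishes in a handful of moves and the quicksort recursion can stop early; on genuinely unsorted input it gives up after a constant amount of work, so the worst case isn't hurt.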
Hadn't heard of that one, and looking it up I stumbled upon this[1] delightful collection of esoteric sorting algorithms. Perfect for a Friday evening...
I thought it was widely known. IIRC a number of languages have adopted it as the stdlib default sorting algorithm for either their stable or unstable sort (can't remember which it is).
I'm not big into algorithms, but I believe pdqsort and Timsort were supposed to be the two best general-purpose sorts.
* <p>The implementation was adapted from Tim Peters's list sort for Python
* (<a href="http://svn.python.org/projects/python/trunk/Objects/listsort.txt">
* TimSort</a>). It uses techniques from Peter McIlroy's "Optimistic
* Sorting and Information Theoretic Complexity", in Proceedings of the
* Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp 467-474,
* January 1993.
Here [1] is a great (short, readable) paper from 2015 that explains Timsort and similar stack-merge sorting algorithms. It also gives a runtime analysis, and a simplified version of the algorithm that they call 'alpha-sort'.
Whenever there's a discussion about sorting, I like to put in a little reminder about the humble Radix Sort, which is often overlooked by academics because it's so simple. It deserves more attention, as it is super fast: O(kn). Academics disregard it because it won't sort floating point numbers (at least not without bit tricks), but the vast majority of real-world use cases deal with alphanumerics, so Radix Sort should get more press than it does.
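For the alphanumeric case, here's a minimal LSD radix sort sketch of my own (not a library implementation; it assumes byte-range characters): one stable counting sort per character position, from the last position to the first, which is where the O(kn) comes from.

```python
def radix_sort_strings(strings):
    """LSD radix sort for strings of characters with ord() < 256.

    O(k*n) for n strings of length <= k: one stable counting-sort pass
    per character position, rightmost first. Short strings are treated
    as padded with a sentinel that sorts before every real character.
    """
    if not strings:
        return []
    width = max(map(len, strings))
    # 0 is the pad sentinel; real characters map to ord(c) + 1.
    key = lambda s, pos: ord(s[pos]) + 1 if pos < len(s) else 0
    out = list(strings)
    for pos in range(width - 1, -1, -1):
        counts = [0] * 257  # 256 byte values + 1 pad bucket
        for s in out:
            counts[key(s, pos)] += 1
        total = 0
        for c in range(257):  # prefix sums -> bucket start offsets
            counts[c], total = total, total + counts[c]
        placed = [None] * len(out)
        for s in out:  # stable scatter into buckets
            k = key(s, pos)
            placed[counts[k]] = s
            counts[k] += 1
        out = placed
    return out
```

The stability of each pass is what makes the whole thing correct: ties on the current position preserve the order established by the later positions.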
The CPython source tree contains separate documentation [1] with a detailed description and discussion of the algorithm. Timsort optimizes for the number of comparisons (which are expensive in Python, since each comparison can call a Python function), so all the numbers there are given in those terms.
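You can watch those comparison counts yourself by wrapping elements in a class that counts its `__lt__` calls (Python's sort only uses `<`); the class and helper names here are my own:

```python
class Counted:
    """Wrapper that counts every < comparison made on it."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

def count_comparisons(data):
    """Return how many comparisons sorted() makes on this data."""
    Counted.comparisons = 0
    sorted(Counted(x) for x in data)
    return Counted.comparisons
```

On already-sorted input of n elements you should see exactly n - 1 comparisons (one run-detection scan and nothing else), the same for a strictly descending run (detected, then blindly reversed), and roughly n log n on random data.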
Number of comparisons is a particular sore point for me. Nobody is sorting integers. This is the goddamned 21st century, not 1980s-era video games. Just stop.
Show me a sort algorithm that is still efficient with five pointer indirections in compare() and I'm satisfied. Show me primitive data types and I start to wonder what you're hiding (read: lying about).
Sorting integers, floats, and strings are surely among the most widely used kinds of sorting? You have some table-like thing and want to sort it by a numeric column, or alphabetically by a string column. Sorting some generic structure is surely the edge case, and even then, the most common case is comparing one or two structure fields which are ints, floats, or strings.
In languages which try to be efficient, surely an expensive compare function is the outlier.
Creating a single benchmark seems tricky, since Timsort's key strength is that it performs particularly well on real-world data, where subsequences of the input are often already sorted (runs). With random data it should have O(n log n) time like any other comparison sort.
So with benchmarking the challenge is coming up with a set of real world-like test data that feels representative but isn’t just excessively playing to Timsort’s strengths. Not everyone’s “real world” data is the same.
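One hedged way to parameterize that (this scheme and the knob name are mine, not a standard): start from sorted data and randomly displace a chosen fraction of the elements, so "sortedness" becomes a dial you can sweep.

```python
import random

def partially_sorted(n, disorder, seed=0):
    """Generate n values; disorder=0.0 is fully sorted, 1.0 fully shuffled.

    Picks n*disorder positions and shuffles the values at those positions
    among themselves, leaving the rest in sorted order.
    """
    rng = random.Random(seed)
    data = list(range(n))
    idx = rng.sample(range(n), int(n * disorder))
    vals = [data[i] for i in idx]
    rng.shuffle(vals)
    for i, v in zip(idx, vals):
        data[i] = v
    return data
```

It's deliberately simple; "real world" data often has structure this doesn't capture (long sorted blocks, appended batches, reversed segments), which is exactly the commenter's point about representativeness.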
I'm trying to recall what the expected level of clustering is even for properly random inputs.
A few years before Timsort there was someone randomly shuffling the input in order to avoid the reverse-order corner case that kills so many sort algorithms.
For truly random inputs, about half of the entries should be pairs that are partially ordered, and 25% runs of 3, no? And if memory serves, Timsort also has a fast path for strictly descending runs, where it amortizes some of the comparisons done in the scan phase to do a blind reverse() call.
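This is easy to check with a quick simulation (the numbers below are mine, and "run" here means a maximal non-descending run): for a random permutation, the expected number of ascending runs is about (n+1)/2, so the average run length comes out right around 2.

```python
import random

def run_lengths(data):
    """Lengths of the maximal non-descending runs in `data`."""
    lengths = []
    run = 1
    for prev, cur in zip(data, data[1:]):
        if cur >= prev:
            run += 1
        else:
            lengths.append(run)
            run = 1
    lengths.append(run)
    return lengths

rng = random.Random(42)
data = rng.sample(range(100_000), 100_000)  # a random permutation
lengths = run_lengths(data)
avg = sum(lengths) / len(lengths)
print(f"{len(lengths)} runs, average length {avg:.3f}")
```

So random data gives Timsort essentially nothing to exploit beyond its minrun machinery, which is why the interesting benchmarks are the partially sorted ones.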
You don't have to agree on any one degree of sortedness. You can set up a giant matrix with lengths on one axis and sortedness on the other.
Each cell can have a color indicating if TimSort beats, say, some other hybrid of MergeSort; green shades suggest TimSort is winning, red shades suggest the contender is winning.
For each cell, do multiple runs with those sortedness/length parameters and pick the average.
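A rough sketch of that matrix (every choice here is mine, and it's not a serious benchmark: heapsort stands in for "some other sort" since the stdlib has no separate mergesort, and I take the minimum of a few repeats rather than the average):

```python
import heapq
import random
import timeit

def make_data(n, disorder, seed=0):
    """Sorted list of n ints with ~n*disorder elements swapped around."""
    rng = random.Random(seed)
    data = list(range(n))
    for i in rng.sample(range(n), int(n * disorder)):
        j = rng.randrange(n)
        data[i], data[j] = data[j], data[i]
    return data

def heap_sort(values):
    """Contender: heapsort via the stdlib heap (copies its input)."""
    heap = list(values)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

cells = {}
for n in (1_000, 10_000):
    for disorder in (0.0, 0.1, 0.5, 1.0):
        data = make_data(n, disorder)
        t_tim = min(timeit.repeat(lambda: sorted(data), number=5, repeat=3))
        t_heap = min(timeit.repeat(lambda: heap_sort(data), number=5, repeat=3))
        cells[n, disorder] = "tim" if t_tim < t_heap else "heap"
print(cells)
```

Render `cells` as a colored grid and you have the green/red picture described above; sweeping finer steps of `disorder` shows where the crossover sits (expect Timsort to win big at low disorder).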
> Why would lacking in evidence and rigor be considered a plus?
The point it’s making is about heuristics rather than theoretical complexity. Tim Peters provided plenty of evidence and rigorous work when he proposed it.
That you interpret it as “lack of evidence and rigor” is really about you.
I had the same thought at first. But I think the point is that this sort of heuristically cobbled-together sort wouldn't typically be the stuff of academic research, brilliant and practical though it is.
I actually found this valuable, without it being a "brag". The best type of sort for real-world problems is built with an understanding of real-world datasets. The types of questions and research that interest academia, while often valid, are distinct from those that concern practical engineers.
I get annoyed with titles like that. "Never heard of", "Didn't know", etc. I know people want clicks, but it sometimes feels like normal person-to-person courtesy isn't valued when the medium is the Internet.
It's especially "dangerous" in this case, because the people who would even care about sorting algorithms are going to include a subset of people who know a lot about sorting. Those people are likely to know about Timsort. It's like going to a heart surgery conference and giving a presentation titled "This one heart surgery method you never knew about." There's a good chance at least a few people in the audience would know it, and one of them may have invented it.
I had heard of that! I just don’t count a bug as a bad thing. Maybe the opposite. Software that has no bugs usually achieves this feat by not having any users. Timsort has tons of users, hence tons of attention, hence someone found a bug. That’s great!
I've done this too with blog posts, and it works okay, although many readers don't like it. I wanted a snappier way to say "something you didn't learn about in school," and I think this author did too.
Please don't link plagiarized content. This guy also linked to his own "Big O Notation" article on his page, where he says O(n) is polynomial:
>>>In our shopping list example, in the worst-case of our algorithm it prints out every item in the list sequentially. Since there are n items in the list, it takes O(n) polynomial time to complete the algorithm.
This may be the same author ... Anyway f(n)=n is definitely a polynomial and O(n) is a class of algorithms that run in polynomial time (i.e. are in P).
https://news.ycombinator.com/item?id=21196555
https://news.ycombinator.com/item?id=17436591