Hacker News new | past | comments | ask | show | jobs | submit login
Timsort (2019) (skerritt.blog)
189 points by akbarnama on July 29, 2022 | hide | past | favorite | 61 comments




Thanks! Macroexpanded:

Beating TimSort at Merging - https://news.ycombinator.com/item?id=27823180 - July 2021 (69 comments - including https://news.ycombinator.com/threads?id=tim-peters - maybe luck will strike again...)

Timsort, the Python sorting algorithm - https://news.ycombinator.com/item?id=21196555 - Oct 2019 (131 comments)

On the Worst-Case Complexity of TimSort - https://news.ycombinator.com/item?id=17883461 - Aug 2018 (74 comments)

Timsort is a sorting algorithm that is efficient for real-world data - https://news.ycombinator.com/item?id=17436591 - July 2018 (77 comments)

Functional verification with mechanical proofs of TimSort [pdf] - https://news.ycombinator.com/item?id=9778243 - June 2015 (1 comment)

Timsort - https://news.ycombinator.com/item?id=3214527 - Nov 2011 (27 comments)

Visualising Sorting Algorithms: Python's timsort - https://news.ycombinator.com/item?id=2092594 - Jan 2011 (3 comments)

Java has switched from Mergesort to TimSort - https://news.ycombinator.com/item?id=752677 - Aug 2009 (13 comments)

Visualizing Sorting Algorithms - https://news.ycombinator.com/item?id=750858 - Aug 2009 (3 comments)


Apparently a lot of people heard of it. :-) I just love these headlines.


Heard of it? I've bought tim an Icelandic hamburger!


Story time, please :)


It's only a small story: I was one of the group sent to Iceland for the Need for Speed Sprint on Python back in ~2006. Among the other participants was, of course, tim. I mostly worked on benchmarking to determine how much of a speedup we had achieved.

tim gave me some really eye-opening tips, one that's really stuck with me is that you want to take the best of many short runs of the benchmark rather than running it a long time, because longer runs are going to guarantee interference from other parts of the OS (scheduling, etc).

tim is an insanely smart guy, kind, and funny.

One day for lunch we were ordering takeout from this little joint, I forget the name, it might have been Hamborgarabulla Tomasar, and I picked up the check for everyone (maybe 15 people). tim described it as "shockingly good", which was absolutely accurate. For how expensive Iceland was for food, it wasn't too terribly bad price-wise either.

I wrote up more here, though I guess my blogs are offline now. https://lwn.net/Articles/185399/

So, not too crazy of a story. Not like the story of how I first met tim.


> So, not too crazy of a story. Not like the story of how I first met tim.

Are you going to make us ask every time?

How did you meet Tim?


That's a story I only tell in person. ;-)-ly-yrs


Interestingly, Timsort used to be broken [1]. They found that by trying to prove correctness. I recall Stijn de Gouw remarking that they wanted to have their proof artefact evaluated, but you'd need a seriously beefy machine to do so.

[1] http://www.envisage-project.eu/proving-android-java-and-pyth...


Timsort was not broken, the implementations were, which is a slightly (but crucially) different issue: the implementations don’t / didn’t uphold the algorithm’s invariants at the upper edges.

Not only that, but the cpython’s implementation’s misbehaviour effectively couldn’t be triggered because you couldn’t create an array large enough to reach it.


There is a follow up, the fix used in Java was not good enough [2].

[2] https://bugs.openjdk.org/browse/JDK-8203864


Closely related is pattern defeating quicksort ( https://github.com/orlp/pdqsort ), which adapts quicksort to take advantage of sorted runs. I've adapted a few quicksorts to pdqsort and seen good speedups (as people were often sorting partially sorted data)

Basically: Timsort is to mergesort as pdqsort is to quicksort


My favorite oddball sorting algorithm is StalinSort. It runs in (O)1 time, making it probably the fastest sorting algorithm in the world.

It iterates through the list, and deletes every item in the list that isn't ordered. The remaining list is therefore automatically ordered.


Hadn't heard of that one, and looking it up I stumbled upon this[1] delightful collection of esoteric sorting algorithms. Perfect for a Friday evening...

[1]: https://towardsdatascience.com/esoteric-sort-algorithms-and-...


If it iterates over every item in the list then it’s O(n) time


Just delete the whole list


Why would they ask to sort it if it is already sorted? Deleting the list resolves any out of order items.


BLARG. I knew that but I second guessed myself into getting it wrong. Thanks for the clarification.


I guess you could make a Blitzkrieg multithreaded variation, then it would run in O(Log N) time.


> Timsort is to mergesort as pdqsort is to quicksort

It’s the other way around, since timsort is a decade older than pdqsort.


Age has nothing to do with the accuracy of the analogy, which says nothing about the temporal relationship between timsort and pdqsort.


I thought it was widely known, IIRC a number of languages have adopted it as the stdlib default sorting algorithm for either unstable/stable sort (can't remember which is it?)

I'm not big into algorithms but I believe that pdqsort and timsort were supposed to be the best two general sorts.


Java uses it in Arrays.sort:

     * <p>The implementation was adapted from Tim Peters's list sort for Python
     * (<a href="http://svn.python.org/projects/python/trunk/Objects/listsort.txt">
     * TimSort</a>).  It uses techniques from Peter McIlroy's "Optimistic
     * Sorting and Information Theoretic Complexity", in Proceedings of the
     * Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp 467-474,
     * January 1993.


> either unstable/stable sort (can't remember which is it?)

Stable. It’s a hybrid mergesort.

Some of the langages which adopted it after Python are Java, Rust, or Swift.


Java and Python IIRC


Here [1] is a great (short, readable) paper from 2015 that explains Timsort and similar stack-merge sorting algorithms. It also gives a runtime analysis, and a simplified version of the algorithm that they call 'alpha-sort'.

[1] https://hal-upec-upem.archives-ouvertes.fr/hal-01212839/file...


Whenever there's a discussion about sorting I like to put in a little reminder about the humble Radix Sort which is often overlooked by academics because it's so simple. It deserves more attention as it is super fast: O(kn). Academics disregard it because it won't sort floating point numbers, but the vast majority of real-world use cases deal with alphanumerics, so Radix Sort should get more press than it does.


In many practical cases you can sort in O(n) since comparisons out to 512 bytes are a single op on a modern CPU.


Also interesting is pdqsort, which will become Go's standard sort.

https://news.ycombinator.com/item?id=31106157

https://news.ycombinator.com/item?id=31098822


Needs a 2005 in the title.


9 lines of Timsort was the centerpiece of the Oracle-Google Android Java lawsuit.

https://www.infoq.com/news/2012/05/nine-lines/


I was hoping to see some benchmark numbers


CPython source code contains a separate documentation [1] for a detailed description and discussion for the algorithm. TimSort optimizes for the number of comparisons (which are expensive in Python since each comparison can call a Python function) so all numbers are given as such.

[1] https://github.com/python/cpython/blob/main/Objects/listsort...


Number of comparisons is a particular sore point for me. Nobody is sorting integers. This is the goddamned 21st century, not 1980's era video games. Just stop.

Show me a sort algorithm that is still efficient with five pointer indirections in compare() and I'm satisfied. Show me primitive data types and I start to wonder what you're hiding (read: lying about).


What are you talking about? I sort integers every day


Sorting integers, floats and strings are surely some of the most widely used ways of sorting? You have some table-like thing, you want to sort it by some numeric column or alphabetically by some string column. Sorting some generic structure is surely the edge case, and even when you do that, surely the most common case is to compare one or two structure fields which are either ints, floats or strings.

In languages which try to be efficient, surely an expensive compare function is the outlier.


Creating a single benchmark seems tricky since Timsort’s key strength is it performs particularly well on real world data where subsequences of the input are often already sorted (runs). With random data it should have O(n log n) time like any other comparison sort.

So with benchmarking the challenge is coming up with a set of real world-like test data that feels representative but isn’t just excessively playing to Timsort’s strengths. Not everyone’s “real world” data is the same.


I'm trying to recall what the expected level of clustering is even for properly random inputs.

A few years before Timsort there was someone randomly shuffling the input in order to avoid the reverse-order corner case that kills so many sort algorithms.

For truly random inputs, about half of the entries should be pairs that are partially ordered, and 25% runs of 3, no? And if memory serves Timsort also has a fast path for strictly decrementing runs as well, where it amortizes some of the comparisons done in the scan phase to do a blind reverse() call.


This kind of almost sorted data is easily synthesised with any degree of already-sorted-ness.


Certainly! But then how do you decide what degree of already-sorted-ness fairly represents real world data?


You don’t have to agree on any one degree. You can set up a giant matrix with lengths on one axis, and the sortedness on the other.

Each cell can have a color indicating if TimSort beats, say, some other hybrid of MergeSort; green shades suggest TimSort is winning, red shades suggest the contender is winning.

For each cell, do multiple runs with those sortedness/length parameters and pick the average.


By collecting a lot of real world data?


> built for the real world — not constructed in academia.

Why would lacking in evidence and rigor be considered a plus?

Does Timsort pride itself of being the homeopathy of sorting algorithms?

What a bizzare brag to make.


> Why would lacking in evidence and rigor be considered a plus?

The point it’s making is about heuristics rather than theoretical complexity. Tim Peters provided plenty of evidence and rigorous work when he proposed it.

That you interpret it as “lack of evidence and rigor” is really about you.


On the contrary, it means it was actually supported by academic effort (the provision of evidence and rigorous work, as you put it).

Hence why I'm saying what a weird brag. It's not even true.


I had the same thought at first. But I think the point is that this sort of heuristically cobbled together sort wouldn't typically be the stuff of academic research, brilliant and practical though it is.


I actually found this valuable, without being a "brag". The best type of sort for real-world problems is build with an understanding of real-world datasets. The types of questions and research that interest academia, while often valid, are distinct from those that concern practical engineers.


“you’ve never heard of” is a bit presumptuous. I’ve heard of it a lot, nothing but good things.


I get annoyed with titles like that. "Never heard of", "Didn't know.", etc. I know people want clicks but it sometimes feel normal person to person courtesy aren't valued when the medium is the Internet.

It's especially "dangerous" in this case because the people who would even care about sorting algorithms are going to contain a subset of people who knows a lot about sorting. Those people are likely to know about Timsort. It's like going to a heart surgery conference and having a presentation about "This one heart surgery method you never knew about." Good chance at least a few people in the audience would know and may have invented it.


Then I have an interesting tidbit for you: its sorting wasn't 100% correct. Should be fixed now though.

http://www.envisage-project.eu/proving-android-java-and-pyth...


I had heard of that! I just don’t count a bug as a bad thing. Maybe the opposite. Software that has no bugs usually achieves this feat by not having any users. Timsort has tons of users, hence tons of attention, hence someone found a bug. That’s great!


It's well known.

Nevertheless, the article was good.

If the author is reading this, consider dropping the "you've never heard of" from the title; you don't want to assume things about your reader.


"lesser-known" is probably more appropriate choice, given that it is not a kind of algorithm you'd heard from CS curricula.


I've done this too with blog posts, and it works okay, although many readers don't like it. I wanted a snappier way to say "you didn't learn about in school," and I think this author did too.


Sometimes a little colorful language makes dry topics a bit more fun. I don't think the title was meant to be statistically analyzed. :-)


This blog steals content from other sites, original is at https://hackernoon.com/timsort-the-fastest-sorting-algorithm....

Unfortunately I cannot downvote.

Please don't link plagiarized content. This guy also linked to his own "Big O Notation" in his page, where he says O(n) is polynomial

>>>In our shopping list example, in the worst-case of our algorithm it prints out every item in the list sequentially. Since there are n items in the list, it takes O(n) polynomial time to complete the algorithm.


This may be the same author ... Anyway f(n)=n is definitely a polynomial and O(n) is a class of algorithms that run in polynomial time (i.e. are in P).


It's the same author; compare the URL to your link's author's last name.


Well, linear functions are still technically polynomials… haha!


O(n) is polynomial and it's the same author.


There are no downvotes on submissions, FWIW.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: