This requires that you know the distribution before sorting; I'm not sure I ever do.
I suppose you could pick the bounds by sampling: pick m random elements, sort them, and use those as the upper bounds. Better yet, pick km random elements, sort them, and use every k'th one as an upper bound. But the author's intention is that the number of classes is a (large!) fraction of the number of elements, so this would give you O(n log n) overall complexity, missing the point entirely.
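The sampling idea above can be sketched as follows; the function names and the oversampling factor k are my own choices, not part of anyone's reference implementation:

```python
import random

def sample_bounds(data, m, k=8):
    """Pick k*m random elements (with replacement), sort them, and use
    every k'th one as a class upper bound -- a sketch of the sampling
    approach described above. Returns m-1 bounds defining m classes."""
    sample = sorted(random.choices(data, k=k * m))
    # every k'th sample element becomes an upper bound; the last class
    # is implicitly unbounded above
    return [sample[i * k] for i in range(1, m)]

def classify(x, bounds):
    """Find the class of x. Linear scan for clarity; a real
    implementation would use bisect for O(log m) lookups."""
    for i, b in enumerate(bounds):
        if x < b:
            return i
    return len(bounds)
```

Note that sorting the k*m-element sample is what sneaks the log factor back in once m becomes a large fraction of n, which is the complexity problem mentioned above.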
I note that the classification algorithm, if run on already-sorted input, puts the classes in reverse order. If the classes are small, this doesn't matter, but if they were bigger, it would be worth using the TimSort trick of checking for runs of increasing elements and just flipping them.
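The run-flipping trick mentioned above might look something like this minimal sketch (an illustration of the idea, not TimSort's actual run-detection code):

```python
def reverse_descending_runs(a):
    """Scan for strictly descending runs and reverse each one in
    place, so a reverse-sorted region becomes a sorted region for
    almost no extra work -- the TimSort-style trick described above."""
    i, n = 0, len(a)
    while i < n - 1:
        if a[i] > a[i + 1]:
            # extend the descending run as far as it goes
            j = i + 1
            while j < n - 1 and a[j] > a[j + 1]:
                j += 1
            a[i:j + 1] = a[i:j + 1][::-1]  # flip the run
            i = j + 1
        else:
            i += 1
    return a
```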
I also note that this algorithm involves making three passes over the data: one to count the number of elements in each class, one to classify elements (this pass involves random access), and another to sort the classes. No worse than quicksort, but not ideal for an external sort, and maybe not too cache-friendly.
Related is samplesort https://de.wikipedia.org/wiki/Samplesort - it is a comparison sort, and yes, the complexity is also O(n log n), the same as quicksort and merge sort. But it may have fewer branches, and possibly better cache locality. It is stable, btw. Disadvantage: it uses O(n) extra memory.
Well, the same thing happened to me. I implemented it, wrote a benchmark, and later was told this is samplesort...
Historic footnote: samplesort was the Python sort algorithm before it was replaced with Timsort: https://bugs.python.org/issue587076 (especially see the attached timsort.txt)
And once you go through your dataset once, how do you store your distribution? By binning it into a histogram? That is precisely what bucket sort does. I guess there is a continuum of sorting algorithms between flashsort and bucket sort, according to the granularity of the binning.
If you know the type of distribution but not the parameters, you can sometimes go through the data once and estimate the parameters (maybe exactly) without "storing the distribution" via histograms. E.g. if you know the data is drawn from a uniform distribution U(a,b) but you don't know a or b.
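For the U(a,b) example above, a single pass suffices; a sketch below (the min/max tracking is the one-pass idea from the comment, while the bias correction is a standard textbook estimator I've added, not something the comment specifies):

```python
def estimate_uniform_params(data):
    """One-pass estimate of U(a, b) parameters. The sample min/max
    alone are biased inward, so the standard unbiased correction
    widens the range by the average gap between samples."""
    lo = hi = data[0]
    for x in data:  # single pass, O(1) extra memory
        if x < lo:
            lo = x
        elif x > hi:
            hi = x
    n = len(data)
    # unbiased estimators: a_hat = min - (max-min)/(n-1), symmetrically for b
    gap = (hi - lo) / (n - 1) if n > 1 else 0.0
    return lo - gap, hi + gap
```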
Maybe this goes without saying, but for a fixed accuracy, the number of samples required to estimate the distribution parameters does not depend on the number of elements to be sorted. Though maybe if you need to sort more elements, you need more accuracy.
For sure. If you had to sort a list of words, most would put ones near the top of the alphabet (like c, d, f) at the top, the middle like o,n, p near the middle, and ones from the end like t, s, v near the end, before putting them in the correct order.
My latter two examples weren't even in order, I was just recalling my impression.
An array of sorted hashes is also nice combined with interpolation search [0] (or hinted binary search) which can give a nice speedup compared to "naive" binary search.
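A minimal sketch of interpolation search on a sorted array of integer (hash) values, for reference; instead of probing the midpoint as binary search does, it probes where the key *should* sit if the values were uniformly distributed:

```python
def interpolation_search(a, key):
    """Return the index of key in sorted array a, or -1 if absent.
    Assumes integer values; on uniformly distributed data (like good
    hashes) this averages O(log log n) probes vs O(log n) for binary
    search."""
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= key <= a[hi]:
        if a[hi] == a[lo]:
            pos = lo
        else:
            # linearly interpolate the probe position between endpoints
            pos = lo + (key - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        if a[pos] == key:
            return pos
        if a[pos] < key:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```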
> Any good hashing function would already be uniformly distributed, right? In that case, using this algorithm over traditional ones makes little sense.
I think you're misunderstanding Paperweight's post and/or the idea behind flashsort. Hashes are uniformly distributed, hence you can use flashsort (instead of mergesort, quicksort, etc.) and get a time complexity of O(n) instead of O(n log n).
Many sorting algorithms that work on uniform distributions do exist and they often use plain old insertion sort under the hood. As I see it, Flashsort only exploits that by knowing the CDF beforehand, you can turn the data piecewise linear and then use more classic methods on that.
An interesting sorting algorithm I hadn't encountered before (even though it seems to be from 1998), but one that actually makes logical sense.
However, my first thought is that it doesn't seem so easy to implement (judging from its concept). Additionally, I'd have concerns about how much overhead this calculation adds compared to just a "simple" comparison.
Maybe the calculation is worth it if the comparison is costly enough? My guess would at least be that flashsort needs fewer comparisons, as we have a higher chance of "knowing" where things should go.
The Wikipedia article shares no plots/data (I guess I should dig deeper for that), but it would be interesting to see how well it fares against more modern and/or optimized versions of Quicksort, as it is unclear whether the claim that it becomes faster than Quicksort is correct :)
The reference implementation [0] might not be the easiest to translate to a new language, because it makes very good use of Fortran arrays, but it shouldn't be too hard.