This requires that you know the distribution before sorting; I'm not sure I ever do.
I suppose you could pick the bounds by sampling: pick m random elements, sort them, and use those as the upper bounds. Better yet, pick km random elements, sort them, and use every k'th one as an upper bound. But the author's intention is that the number of classes is a (large!) fraction of the number of elements, so this would give you O(n log n) overall complexity, missing the point entirely.
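The sampling idea above can be sketched as follows; the function names and the oversampling factor k are my own choices, not part of anyone's reference implementation:

```python
import random

def sample_bounds(data, m, k=8):
    """Pick k*m random elements (with replacement), sort them, and use
    every k'th one as a class upper bound -- a sketch of the sampling
    approach described above. Returns m-1 bounds defining m classes."""
    sample = sorted(random.choices(data, k=k * m))
    # every k'th sample element becomes an upper bound; the last class
    # is implicitly unbounded above
    return [sample[i * k] for i in range(1, m)]

def classify(x, bounds):
    """Find the class of x. Linear scan for clarity; a real
    implementation would use bisect for O(log m) lookups."""
    for i, b in enumerate(bounds):
        if x < b:
            return i
    return len(bounds)
```

Note that sorting the k*m-element sample is what sneaks the log factor back in once m becomes a large fraction of n, which is the complexity problem mentioned above.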
I note that the classification algorithm, if run on already-sorted input, puts the classes in reverse order. If the classes are small, this doesn't matter, but if they were bigger, it would be worth using the TimSort trick of checking for runs of increasing elements and just flipping them.
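The run-flipping trick mentioned above might look something like this minimal sketch (an illustration of the idea, not TimSort's actual run-detection code):

```python
def reverse_descending_runs(a):
    """Scan for strictly descending runs and reverse each one in
    place, so a reverse-sorted region becomes a sorted region for
    almost no extra work -- the TimSort-style trick described above."""
    i, n = 0, len(a)
    while i < n - 1:
        if a[i] > a[i + 1]:
            # extend the descending run as far as it goes
            j = i + 1
            while j < n - 1 and a[j] > a[j + 1]:
                j += 1
            a[i:j + 1] = a[i:j + 1][::-1]  # flip the run
            i = j + 1
        else:
            i += 1
    return a
```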
I also note that this algorithm involves making three passes over the data: one to count the number of elements in each class, one to classify elements (this pass involves random access), and another to sort the classes. No worse than quicksort, but not ideal for an external sort, and maybe not too cache-friendly.
Related is samplesort https://de.wikipedia.org/wiki/Samplesort - it is a comparison sort, and yes, the complexity is also O(n log n), the same as quicksort and merge sort. But it may have fewer branches, and possibly better cache locality. It is stable, btw. Disadvantage: it uses O(n) extra memory.
Well, the same thing happened to me. I implemented it, wrote a benchmark, and later was told this is samplesort...
Historic footnote: samplesort was the Python sort algorithm before it was replaced with Timsort: https://bugs.python.org/issue587076 (especially see the attached timsort.txt)
And once you go through your dataset once, how do you store your distribution? By binning it into a histogram? That is precisely what bucket sort does. I guess there is a continuum of sorting algorithms between flashsort and bucket sort, according to the granularity of the binning.
If you know the type of distribution but not the parameters, you can sometimes go through the data once and estimate the parameters (maybe exactly) without "storing the distribution" via histograms. E.g. if you know the data is drawn from a uniform distribution U(a,b) but you don't know a or b.
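For the U(a,b) example above, a single pass suffices; a sketch below (the min/max tracking is the one-pass idea from the comment, while the bias correction is a standard textbook estimator I've added, not something the comment specifies):

```python
def estimate_uniform_params(data):
    """One-pass estimate of U(a, b) parameters. The sample min/max
    alone are biased inward, so the standard unbiased correction
    widens the range by the average gap between samples."""
    lo = hi = data[0]
    for x in data:  # single pass, O(1) extra memory
        if x < lo:
            lo = x
        elif x > hi:
            hi = x
    n = len(data)
    # unbiased estimators: a_hat = min - (max-min)/(n-1), symmetrically for b
    gap = (hi - lo) / (n - 1) if n > 1 else 0.0
    return lo - gap, hi + gap
```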
Maybe this goes without saying, but for a fixed accuracy, the number of samples required to estimate the distribution parameters does not depend on the number of elements to be sorted. Though maybe if you need to sort more elements, you need more accuracy.
For sure. If you had to sort a list of words, most would put ones near the top of the alphabet (like c, d, f) at the top, the middle like o,n, p near the middle, and ones from the end like t, s, v near the end, before putting them in the correct order.
My latter two examples weren't even in order, I was just recalling my impression.
An array of sorted hashes is also nice combined with interpolation search [0] (or hinted binary search) which can give a nice speedup compared to "naive" binary search.
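A minimal sketch of interpolation search on a sorted array of integer (hash) values, for reference; instead of probing the midpoint as binary search does, it probes where the key *should* sit if the values were uniformly distributed:

```python
def interpolation_search(a, key):
    """Return the index of key in sorted array a, or -1 if absent.
    Assumes integer values; on uniformly distributed data (like good
    hashes) this averages O(log log n) probes vs O(log n) for binary
    search."""
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= key <= a[hi]:
        if a[hi] == a[lo]:
            pos = lo
        else:
            # linearly interpolate the probe position between endpoints
            pos = lo + (key - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        if a[pos] == key:
            return pos
        if a[pos] < key:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```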
> Any good hashing function would already be uniformly distributed, right? In that case, using this algorithm over traditional ones makes little sense.
I think you're misunderstanding Paperweight's post and/or the idea behind flashsort. Hashes are uniformly distributed, hence you can use flashsort (instead of mergesort, quicksort, etc.) and get a time complexity of O(n) instead of O(n log n).
Many sorting algorithms that work on uniform distributions do exist and they often use plain old insertion sort under the hood. As I see it, Flashsort only exploits that by knowing the CDF beforehand, you can turn the data piecewise linear and then use more classic methods on that.
An interesting sorting algorithm I hadn't encountered before (even though it seems to be from 1998), but one that actually makes logical sense.
However, my first thought is that it doesn't seem so easy to implement (judging from its concept). Additionally, I'd have concerns about how much overhead this calculation adds compared to just a "simple" comparison.
Maybe the calculation is worth it if the comparison is costly enough? My guess would at least be that flashsort needs fewer comparisons, as we have a higher chance of "knowing" where things should go.
The Wikipedia article shares no plots/data (I guess I should dig deeper for that), but it would be interesting to see how well it fares against more modern and/or optimized versions of Quicksort, as it is unclear whether the claim that it becomes faster than Quicksort is correct :)
The reference implementation [0] might not be the easiest to translate to a new language, because it makes very good use of Fortran arrays, but it shouldn't be too hard.