This is a great article and the diagrams illustrating the difference between a binary heap and a B-heap are awesome. The comments are amusingly negative, too.
It's worth noting that it was published in 2010, though. The Wikipedia article on B-heaps actually references this article: http://en.wikipedia.org/wiki/B-heap
That's good to know! I assumed from the comments -- most of which were from 2010 and admonished him for not being familiar with the data structure already -- that he was describing something which already existed at the time.
Anyhow, man, comments on HN are tedious sometimes. Everyone's glomming onto citation issues. Bo-ring.
I loved, loved, loved the visual representation of the two different data structures. It makes it really clear what's going on and why you'd want one structure vs. the other depending on memory access patterns: the B-heap layout makes it much easier to put frequently co-accessed and co-manipulated data in the same page of memory.
I'm a shitty systems programmer, but my understanding is that having to hop around among many different pages will cause the OS to barf out page faults. What you wind up saving in user time by optimizing for rebalancing operations you pay for 10-fold in system time.
The B-heap was, as far as I'm aware, a new data structure. There were other more sophisticated data structures for the same problem, and most of the criticism surrounded phk inventing his own rather than reading the literature and using one of the existing data structures.
Yes, you want to minimize the number of pages you touch; this is true both in the presence of VM pressure (to avoid page faults) and to a lesser degree even without it (because you want to avoid the cost of TLB misses).
Yes, that's what 90% of the comments on the article said, too. Welcome to the club. :P
He's talking about people's everyday understanding of data structures, including his own. That the data structure he described already existed when he banged it out is beside the point; most folks don't know about it, most CS programs don't teach it, and most software doesn't use it.
That's sort of like saying, "Well, all special relativity does is show Einstein's ignorance of Lorentz's work."
It's true in some sense, but ultimately an unproductive sentiment.
I think that knowing your audience goes a long way, particularly when using "You're doing it wrong" in the title.
The audience of this article included a lot of people who knew of the problem and knew of the research into cache-oblivious algorithms. That's a long way from merely not citing somebody: it's preaching to people as if they're ignorant when they're a fair ways ahead of you. Quite annoying when encountered.
I see his point, I'm not attacking that. I'm just not happy that he didn't do enough research on the topic to find more recent results, or talk with a real algorithms professor before writing up this entire article. One line could have changed those 90% of comments, just one line saying, "This kind of problem is tackled by the field of cache-oblivious data structures."
I really wish the theory and applied side split off into separate departments or divisions just like how math and applied math go their separate ways.
Cache-conscious data structures are more efficient than cache-oblivious data structures. While the latter are interesting from a research perspective, in practice the cache sizes are known and can be optimized for.
Would you use a cache-oblivious data structure that's 10x slower than its regular counterpart? Cache sizes don't change every day; I don't see the point.
In general there are lots of data structures that are wonderful on paper but whose constant factors make them infeasible in practice.
Most of the STL data structures guarantee stable element addresses, completely precluding any kind of cache-friendly rearrangement.
They also require fast stable iterators that are cheap to pass by value, constraining the implementation further. For this reason I am fairly confident that every std::map/set ever written will be a balanced tree with parent pointers. There's just no latitude to do anything else.
Cache-oblivious algorithms can be pretty wacky when you're used to not thinking about locality at all.
A mergesort can be understood by even a beginner after an hour or less in a language like Haskell. A funnelsort...? Well, I'll put it this way: even after reading the papers, I'm not sure how I would go about implementing it.
The article only cites papers from 1961 and 1964, and complains about how CS departments use out-of-date machine models when teaching algorithms. This is just not true at any of the top schools.