This is a great article and the diagrams illustrating the difference between a binary heap and a B-heap are awesome. The comments are amusingly negative, too.
It's worth noting that it was published in 2010, though. The Wikipedia article on B-heaps actually references this article: http://en.wikipedia.org/wiki/B-heap
That's good to know! I assumed from the comments -- most of which were from 2010 and admonished him for not being familiar with the data structure already -- that he was describing something which already existed at the time.
Anyhow, man, comments on HN are tedious sometimes. Everyone's glomming onto citation issues. Bo-ring.
I loved, loved, loved the visual representation of the two different data structures. It makes it really clear what's going on and why you'd want one structure vs. the other depending on memory access patterns: the B-heap layout makes it much easier to put frequently co-accessed and co-manipulated data in the same page of memory.
I'm a shitty systems programmer, but my understanding is that having to hop around among many different pages will cause the OS to barf out page faults. What you wind up saving in user time by optimizing for rebalancing operations you pay for 10-fold in system time.
The B-heap was, as far as I'm aware, a new data structure. There were other more sophisticated data structures for the same problem, and most of the criticism surrounded phk inventing his own rather than reading the literature and using one of the existing data structures.
Yes, you want to minimize the number of pages you touch; this is true both in the presence of VM pressure (to avoid page faults) and to a lesser degree even without it (because you want to avoid the cost of TLB misses).
Yes, that's what 90% of the comments on the article said, too. Welcome to the club. :P
He's talking about people's everyday understanding of data structures, including his own. That the data structure he described already existed when he banged it out is beside the point; most folks don't know about it, most CS programs don't teach it, and most software doesn't use it.
That's sort of like saying, "Well, all special relativity does is show Einstein's ignorance of Lorentz's work."
It's true in some sense, but ultimately an unproductive sentiment.
I think that knowing your audience goes a long way, particularly when using "You're doing it wrong" in the title.
The audience of this article included a lot of people who knew of the problem and knew of the research into cache-oblivious algorithms. That's a long way from merely not citing somebody: it's preaching to people as if they're ignorant when they're a fair ways ahead of you. Quite annoying when encountered.
I see his point, I'm not attacking that. I'm just not happy that he didn't do enough research on the topic to find more recent results, or talk with a real algorithms professor before writing up this entire article. One line could have changed those 90% of comments, just one line saying, "This kind of problem is tackled by the field of cache-oblivious data structures."
I really wish the theory and applied side split off into separate departments or divisions just like how math and applied math go their separate ways.
Cache-conscious data structures are more efficient than cache-oblivious data structures. While the latter are interesting from a research perspective, in practice the cache sizes are known and can be optimized for.
Would you use a cache-oblivious data structure that's 10x slower than its regular counterpart? Cache sizes don't change every day; I don't see the point.
In general there are lots of data structures that are wonderful on paper but whose constant factors make them infeasible in practice.
Most of the STL data structures guarantee stable element addresses, completely precluding any kind of cache-friendly rearrangement.
They also require fast stable iterators that are cheap to pass by value, constraining the implementation further. For this reason I am fairly confident that every std::map/set ever written will be a balanced tree with parent pointers. There's just no latitude to do anything else.
Cache-oblivious algorithms can be pretty wacky when you're used to not thinking about locality at all.
A mergesort can be understood by even a beginner after an hour or less in a language like Haskell. A funnelsort...? Well, I'll put it this way: even after reading the papers, I'm not sure how I would go about implementing it.
The article only cites papers from 1961 and 1964, and complains about how CS departments use out-of-date machine models when teaching algorithms. This is just not true at any of the top schools.