HyperAttention: Long-Context Attention in Near-Linear Time

lappa · on Oct 10, 2023

"For example, HyperAttention makes the inference time of ChatGLM2 50% faster on 32k context length while perplexity increases from 5.6 to 6.3."

"when half of all attention layers are patched (i.e., 14 layers), we verify that most of the tasks do not degrade more than 13%."

According to the paper, for most tasks it reduces benchmark scores substantially. Perhaps to the point where a smaller model would yield better inference time and higher benchmarks.

However, summarization benchmarks see almost no degredation, great!

janalsncm · on Oct 10, 2023

Smaller models will likely not have 32k context windows.

brucethemoose2 · on Oct 10, 2023

Why is that?

keonix · on Oct 11, 2023

I assume it's because such large context takes lots of memory, so you might as well have smarter model if you are not gonna fit in small vram anyway

brucethemoose2 · on Oct 11, 2023

Personally, I have found that Mistral 7B (with its native 8K context, and decent results stretched out even more) is performing much better than llama 13B tunes for storytelling, where that long context is really important.

And I think the optimized backends should implement that sliding 16k context soon...

Anyway, point is a huge context really helps certain types of queries, and VRAM usage is reasonable with a 7B model.

akomtu · on Oct 10, 2023

ML researchers are playing scientists: tweak a few parameters in an LLM, re-train it on a largish dataset (need access to $$$ GPUs), find metrics on which the tweaked LLM makes a barely noticeable improvement and make the other metrics where it actually gets worse look insignificant, write a paper, upload to arxiv, and update your resume.

marcinzm · on Oct 10, 2023

So true, if they were real scientists then they'd never publish negative results and only pass them along in their network to most efficiently gate keep the field!

brucethemoose2 · on Oct 11, 2023

> write a paper

Also publish yet another "SOTA framework" thats barebones and won't be maintained for very long!

The researchers who published the negative CFG LLM paper made an earnest effort to pull it into the popular frameworks. That really stands out in my memory, that is incredibly rare.

Tostino · on Oct 10, 2023

I'd rather the knowledge be out there, it helps you not go down dead end paths that you would otherwise have explored.

zaptheimpaler · on Oct 11, 2023

Every real ML position industry or academia explicitly asks for publications. The field gets what it asks for, and they get it good and hard.

sebzim4500 · on Oct 10, 2023

I genuinely can't tell if this is supposed to be a criticism.

They tried something, it improved some things and made other things worse.

timkam · on Oct 10, 2023

This paper presents formal results, apparently. Also, in the case of formal results, peer review by experts in the exact sub-field makes sense/increases trust, sure, but why not share on archive instead of waiting for a year (or however long the process takes in the particular instance)?

jsemrau · on Oct 11, 2023

Worse are comparison papers where the researchers are evaluating models against a benchmark. In another world, this would have been a blog post.