Hacker News

Sometimes I think the entire engineering profession collectively underwent a lobotomy. Techniques like caching partial computation results to avoid repeating expensive work were so basic a few decades ago that no one would have bothered to dignify them with a paper, let alone brand them with a fancy acronym and announce them like the second coming of Turing. Now we get breathless blog posts and community calls over the mind-blowing discovery that storing KV caches of repeated text speeds things up. Next we'll get a paper on using hash tables to look things up faster. Meanwhile, actual difficult problems in large-scale distributed inference and model interpretability get hand-waved so we can posture about reinventing memoisation. Tech never fails to take the obvious, put a bow on it, and sell it back to us as groundbreaking.
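For reference, the decades-old technique being rediscovered here is just memoisation; a minimal Python sketch (standard library only):

```python
import functools

# Memoisation: cache the results of an expensive pure function so repeated
# calls with the same arguments are served from the cache instead of recomputed.
@functools.lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))  # instant with the cache; naive recursion would take ages
```

Storing KV caches of repeated text is, at heart, this same idea applied to attention states.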


I've noticed this too. I wonder if it's a difference in experience levels. It feels odd seeing excitement at rediscovering what you and I think of as a well-known solution. To be fair, I was that kid at one time too. Still, it feels like these simpler things ought to be taught at university so new grads can focus more on solving domain problems.

Combine this with pressure from public or private investment, and the way to get ahead is to package anything as a prospect of revenue generation. I'm sure that's part of it too. Everything has to monetize, because some business school graduate hasn't "made it" until they have a yacht like their Ivy League friends.

Eh, this probably comes across as curmudgeonly or "who moved my cheese". But if there's one area that could improve this longstanding problem in tech, my guess is teaching the right skills and concepts at the collegiate level. And that's not a simple thing either.

Edit > Reading a bit more, this focuses on chat applications and seems to be a decent caching implementation tailored to that domain, which, I'm guessing, will allow AT&T and Verizon to save money on their gobsmackingly horrible AI chat bots in their mobile apps. As an individual, it's unclear how this benefits me, though. I don't think it does. ME: asks chat bot a question about insurance coverage. CHATBOT: instantly serves a canned response about how that's covered in my individual insurance plan, which I can read more about on their website (pro-tip: no, I can't, those details are never actually on the website).


Partial caching as a concept doesn’t matter. The hard part is figuring out how to make it work for cross attention, which sets up a data dependency from every entry to every preceding entry. Prefix caching of the KV cache is brain-dead easy. Computing a KV cache for random bits of text and then combining unrelated text in a way that makes the LLM still work coherently and correctly? That seems much harder to me.
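To make the easy half concrete, here’s the "brain-dead easy" prefix case as a toy sketch (my own simplification: one head, no learned projections, K and V are just the token embeddings; real LLMs do this per layer with projected K/V):

```python
import math

def attend(q, ks, vs):
    """Single-head attention of one query over all keys/values seen so far."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in ks]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    return [sum(w * v[i] for w, v in zip(ws, vs)) / z for i in range(len(vs[0]))]

def run(tokens, cache=None):
    """Process tokens left to right, extending (and reusing) the KV cache."""
    ks, vs = cache if cache is not None else ([], [])
    outs = []
    for t in tokens:
        ks.append(t)  # toy: K and V are the embeddings themselves
        vs.append(t)
        outs.append(attend(t, ks, vs))
    return outs, (ks, vs)

prefix = [[1.0, 0.0], [0.0, 1.0]]  # shared prompt, e.g. a system prompt
suffix = [[0.5, 0.5]]              # per-request continuation

full, _ = run(prefix + suffix)     # no caching: recompute everything
_, cache = run(prefix)             # compute the shared prefix once...
hit, _ = run(suffix, cache)        # ...then reuse its K/V for the suffix

assert full[-1] == hit[-1]         # identical output, prefix work skipped
```

This works precisely because the suffix sits after the cached prefix; the hard case upthread is splicing caches computed for unrelated text.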

It seems to me like you’re easily hand waving away a hard problem in a different part of the stack you’re less familiar with.


Let’s be honest: it’s fundamentally about analysing memory access patterns, spotting reuse opportunities, and orchestrating data flows. That’s classic systems engineering. Useful, yes. Rocket science, no. The real joke is how the profession has sunk so low that anything beyond a trivial for-loop becomes grounds for whitepapers, corporate branding, and breathless conference talks. In the past, we’d have quietly shipped this and moved on. Frankly, I’m surprised they haven’t patented it yet.


Caching and reuse broadly, yes. Getting cross attention to work mathematically correctly by stitching together precomputed KV caches for snippets of text is not that, unless you’ve redefined what classical systems engineering is.

Again, the novelty is in getting cross attention to work correctly despite the fact that you’re stitching together arbitrary caches. It’s akin to taking snippets of random compressed files and reconstructing a new, correct plain text. That’s obviously impossible for compression, yet something analogous has clearly been accomplished here with the KV cache for arbitrary models (i.e. not trained for it), despite the KV cache working like decompression, where all the preceding bytes have to be computed correctly for the subsequent token to be correct.
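A toy two-layer illustration of why naive stitching breaks (everything here is made up for illustration: one head, no projections, no MLPs or positional encodings, which in a real transformer only make it worse):

```python
import math

def attend(q, ks, vs):
    """Single-head attention of one query over a set of keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in ks]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    return [sum(w * v[i] for w, v in zip(ws, vs)) / z for i in range(len(vs[0]))]

def layer(xs):
    # Causal self-attention: each token attends to itself and everything before it.
    return [attend(x, xs[:i + 1], xs[:i + 1]) for i, x in enumerate(xs)]

prefix  = [[1.0, 0.0]]
snippet = [[0.0, 1.0], [1.0, 1.0]]

in_context = layer(prefix + snippet)[1:]  # snippet's layer-1 states after the prefix
alone      = layer(snippet)               # same snippet processed in isolation

# These layer-1 states are what the next layer's K/V would be computed from.
# They differ, so splicing in a cache precomputed in isolation hands later
# layers state the model never produced during a normal forward pass:
assert in_context != alone
```

That mismatch, compounding through every layer, is the part that is not "just cache management".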


I get the argument, but let's be blunt: every serious cache system deals with consistency, partial reuse, and correctness. That’s standard engineering, regardless of how much intimidating jargon you layer over it. Useful, sure. But watching the industry throw a circus around basic cache management, complete with papers and corporate branding, is exactly why so much of modern tech feels like a hype-driven clown show rather than a disciplined craft.


I really don’t understand what you’re saying. This isn’t about consistency of the data. If you don’t figure out a mathematically valid way to combine the precomputed values of snippets of text, the LLM just doesn’t work properly. Prefix cache management, which is just normal systems engineering, is not all this is doing. Stitching together cache fragments such that the LLM still reasons correctly about the text is hard. Have you read the paper?


I’m with you. It’s a bit shocking.


You're bragging about the easy parts and rolling your eyes at the hard parts.

Meanwhile the AI engineers are doing the exact opposite. Bragging about the hard parts and rolling their eyes at the easy parts.



