This is unsurprising and irrelevant. When you create a skill for a particular mo...

cheema33 · 2026-02-17T00:42:47 1771288967

> This is unsurprising and irrelevant. When you create a skill for a particular model, you don't typically ask the model to create the skill based solely on its own latent knowledge.

This!

The only surprising part about the paper is that somebody wrote a paper on skills without a good understanding of the topic.

stitched2gethr · 2026-02-17T02:37:37 1771295857

I had to scroll too far to find this take. 100%.

This is like saying the CLAUDE.md or AGENTS.md is irrelevant because the LLM generated it.

therealdrag0 · 2026-02-17T03:17:05 1771298225

Modern science encourages publishing non-surprising results.

And also I’ve seen my manager LARP as an engineer by asking a model to generate a best practices doc for a service repo without supplying any additional context. So this sort of paper helps discourage that behavior.

zahlman · 2026-02-16T23:21:43 1771284103

> Otherwise, you'd expect the effect to be similar to telling the model 'make a plan before acting, make not mistakes'.

Have there not been previous iterations of these tools where such techniques were actually effective?

gwern · 2026-02-17T00:53:46 1771289626

But that's a reason you should expect it to stop working soon, just like all the older tricks like "my grandmother will die". If you have a universal 'blind' prompt which can increase performance a little bit... the AI labs can just toss that into the training loop to teach the model to do it automatically, whatever 'it' was, like 'trying harder' or 'writing down a useful idea'. And then the prompt stops working because the next generations do it by default.

(This also suggests that you should expect them to generally be bad at judging novel self-generated prompts/skills - if they could judge those, they would already be using them! There is a generator-verifier gap, but it is already exploited heavily during post-training and not much low-hanging fruit left there.)

zahlman · 2026-02-17T03:56:02 1771300562

> But that's a reason you should expect it to stop working soon

I agree. (And it seems like it already stopped working, if I understood others here correctly.)

But again if I understood others here correctly, an academic paper like this would necessarily be studying models that are well behind the leading edge at time of publication. My argument is that the study authors shouldn't be faulted for investigating something that currently seems unlikely to work, because at the time of investigation it would have seemed much more likely to work.

rahimnathwani · 2026-02-16T23:32:06 1771284726

Yes, but this paper studied recent models.

zahlman · 2026-02-17T19:26:55 1771356415

Did it? I'm not convinced that it possibly could have. It takes time for papers to get published and the LLM world is moving rather quickly.

rahimnathwani · 2026-02-17T19:29:32 1771356572

> Did it?

Yes it did.

> I'm not convinced that it possibly could have. It takes time for papers to get published and the LLM world is moving rather quickly.

The paper was submitted to arXiv on 13th February, and we're here reading it, less than a week later.

But we don't have to assume. The list of models is right there in the paper, on page 5:

  We select seven frontier models: GPT-5.2 (OpenAI), Claude Opus 4.5, Claude Opus 4.6, Claude Sonnet 4.5, Claude Haiku 4.5 Anthropic), Gemini 3 Pro, and Gemini 3 Flash (Google). All models use temperature 0 for deterministic sampling.