
"Self-Generated Skills: No Skills provided, but the agent is prompted to generate relevant procedural knowledge before solving the task. This isolates the impact of LLMs’ latent domain knowledge"

This is a useful result, but it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills." Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to.

I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.



It's even worse than this: the "tasks" that are evaluated are limited to a single markdown file of instructions, plus an opaque verifier (pages 13-14). There are no problems involving existing codebases, refactors, or anything of the sort, where the key constraint is that the "problem definition" in the broadest sense doesn't fit in context.

So when we look at the prompt they gave to have the agent generate its own skills:

> Important: Generate Skills First Before attempting to solve this task, please follow these steps: 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed. 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks. 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name. 4. Then solve the task using the skills you created as reference.

There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills - not with a web search, not exploring an existing codebase for best practices and key files - only within its own hallucinations around the task description.

It also seems they're not even restarting the session after the skills are generated, judging from that fourth bullet? So it's just regurgitating the context that was used to generate the skills.

So yeah, your empty-codebase vibe coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including the context where you ask for a second feature on that just-vibe-coded codebase with a fresh session.


I don't see how "create an abstraction before attempting to solve the problem" will ever work as a decent prompt when you are not even steering it towards specifics.

If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.

LLMs are not mind readers.


If it were in the context of parachuting into a codebase, I’d make these skills an important familiarization exercise: how are tests made, what are patterns I see frequently, what are the most important user flows. By forcing myself to distill that first, I’d be better at writing code that is in keeping with the codebase’s style and overarching/subtle goals. But this makes zero sense in a green-field task.


There's overlap: with brownfield or legacy code you are strongly opinionated about the status quo, while on greenfield you are strongly opinionated with fewer constraints.

You have to work with conviction though. It's when you offload everything to the LLM that things start to drift from expectations, because you kept the expectations in your head and away from the prompt.


Do skills extracted from existing codebases cause better or worse code in that they bias the LLM towards existing bad practices? Or, can they assist in acknowledging these practices, and bias it towards actively ensuring they're fixed in new code? How dependent is this on the prompt used for the skill extraction? Are the skills an improvement over just asking to do this extraction at the start of the task?

Now this dynamic would be a good topic to research!


Interesting.

I think it's because AI models have learned that we prefer confident-sounding answers, and that we don't want to be pestered with questions before getting an answer.

That is, follow my prompt, and don't bother me about it.

Because if I am coming to an AI Agent to do something, it's because I'd rather be doing something else.


If I already know the problem space very well, we can tailor a skill that will help solve the problem exactly how I already know I want it to be solved.


> limited to a single markdown file of instructions

A single file of instructions is common in most benchmark papers, e.g. Terminal Bench. Also we have very complicated prompts like this one: https://www.skillsbench.ai/tasks/shock-analysis-supply

> opaque verifier

Could you specify which tasks' verifiers are unclear or defective for benchmarking purposes?

> No problems involving existing codebases, refactors, or anything of the like

Also not true; we have many such tasks, e.g. https://www.skillsbench.ai/tasks/fix-build-google-auto, https://www.skillsbench.ai/tasks/fix-build-agentops, https://www.skillsbench.ai/tasks/react-performance-debugging


That's actually super interesting, and it's why I really don't like the whole .md folder structures or even any CLAUDE.md. Most of the time you really just want to give it only what it needs for best results.

The headline is really bullshit, yes, but I like the testing tho.


CLAUDE.md in my projects only has coding / architecture guidelines. Here's what not to do. Here's what you should do. Here are my preferences. Here's where the important things are.

Even though my CLAUDE.md is small though, often my rules are ignored. Not always though, so it's still at least somewhat useful!


I’m pretty sure Claude just uses mine to keep a running list of pressure points for when I get cross with it.


I'm screwed when the robot psychological warfare begins. They'll make everything I read have 4 space indentation... and I'll just hand over the keys.


I'm trying out some other CC features, and I'm thinking maybe hooks can do something with this.

Have a hook on switching out of plan mode, and maybe on edits, that passes the change to Haiku along with the CLAUDE.md to see if it matches or not.
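A rough sketch of what that hook script could look like, assuming Claude Code's documented hook contract (the hook gets a JSON payload on stdin, and exit code 2 feeds stderr back to the agent). The substring check here is a trivial stand-in for an actual call to Haiku with CLAUDE.md in the prompt:

```python
import json
import sys

def check_edit(payload, rules):
    """Return a complaint string if the edited content hits a banned pattern.

    In a real hook this is where you'd call a cheap model (e.g. Haiku)
    with CLAUDE.md and the diff; the substring check is a placeholder.
    """
    tool_input = payload.get("tool_input", {})
    content = tool_input.get("new_string") or tool_input.get("content") or ""
    for rule in rules:
        if rule in content:
            return f"Edit violates CLAUDE.md rule: {rule!r}"
    return None

def main():
    payload = json.load(sys.stdin)           # hook payload arrives on stdin
    complaint = check_edit(payload, rules=["console.log("])
    if complaint:
        print(complaint, file=sys.stderr)    # stderr is surfaced to the agent
        return 2                             # exit code 2 blocks the tool call
    return 0

# Register as a PostToolUse hook command and call: sys.exit(main())
```

The rule list here is hardcoded for illustration; you'd parse it out of CLAUDE.md or let the small model judge free-form.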


What's the hook for switching out of plan mode? I'd like to launch a planning skill whenever Claude writes a plan, but it never picks up the skill, and I haven't found a hook that can force it to.


Man, that's what I've been trying to build the whole time, but I keep getting JSON parsing errors. I've debugged a lot, but it seems their Haiku is not consistent with the actual output. I want a hook that tells the agent at the end: make sure you've built and run the relevant tests. Let me know if you need anything else.
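FWIW, those JSON parsing errors are usually the model wrapping its JSON in prose or a markdown fence rather than emitting it raw. A small sketch of the tolerant parsing I'd try before giving up on Haiku (the function name is my own):

```python
import json
import re

def extract_json(text):
    """Pull the first JSON object out of model output that may wrap it
    in prose or a ```json fence. Returns None if nothing parses."""
    # Prefer the contents of a ``` / ```json fence if one is present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # Fall back to the outermost {...} span.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
```

Asking the model for JSON and then parsing defensively like this tends to be more robust than hoping it never adds a "Here is the JSON:" preamble.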


We didn't create that headline, yeah. Thanks for liking it.


The point of so-called 'skills' is to be short how-to reminders that the agent can pull into its context and then act upon. If the knowledge is already in the model, it will most likely be surfaced in reasoning phase anyway, so there's little benefit to writing it up as a skill, unless perhaps it's extremely relevant and hard to surface, and you want the model to skip that part of the reasoning.


There is a benefit to skills, though. If an AI keeps encoding common tasks as skills and scripts, the LLM eventually just becomes a dumb routing mechanism for ambiguous user requests, which ultimately drives down token usage.

If everything you want an LLM to do is already captured as code or simple skills, you can switch to dumber models that know enough to select the appropriate skill for a given user input, and not much else. You would only have to tap into more expensive heavy-duty LLMs when you are trying to do something that hasn't been done before.

Naturally, AI companies with vested interest in making sure you use as many tokens as possible will do everything they can to steer you away from this type of architecture. It’s a cache for LLM reasoning.
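As an illustrative sketch of that "cache" architecture (the skill names and the escalate() stub are invented here, not any real API), the routing layer can be nearly trivial:

```python
# Skills as a reasoning cache: a dumb keyword router dispatches known
# requests to stored procedures and only escalates novel requests to an
# expensive model.

SKILLS = {
    "deploy": "skills/deploy.md",          # captured, version-controlled procedures
    "rotate keys": "skills/rotate_keys.md",
}

def route(request):
    """Return a cached skill path for a known task, else escalate."""
    for keyword, skill in SKILLS.items():
        if keyword in request.lower():
            return skill                    # cheap path: cached procedure
    return escalate(request)                # expensive path: frontier model

def escalate(request):
    # Stand-in for a call to a heavy-duty model.
    return f"frontier-model({request!r})"
```

In practice the keyword match would itself be a small model choosing among skill descriptions, but the cost structure is the same: the expensive model only sees cache misses.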


AI companies don't want you to waste tokens, they benefit when you use them efficiently because they can serve more users on the infra that's the main bottleneck for them. It's Jevons' paradox in action.


>AI companies don't want you to waste tokens, they benefit when you use them efficiently because they can serve more users on the infra that's the main bottleneck for them.

No, the actual incentive is that people will eventually benchmark their models on bang-per-buck basis and models that chew through tokens are not going to be competitive. It's the same reason why the "Intel/AMD are intentionally sandbagging their CPUs so they can sell more CPUs" theory doesn't work.


Well, it only works when one competitor is far enough ahead they can play games like that.

At least currently in AI there is no moat, so we wouldn't expect that to be occurring.


I don't think that's necessarily true; they aren't really capacity constrained in practice (they might be behind the scenes and adjust training on the fly, but that's speculation), so wasting tokens effectively helps utilize their (potentially idle) inference GPUs.


Sounds like how humans work (which is good): having the more experienced human do the task if the novice fails should come after attempting to explain how the novice should do it.


I've been building a skill to help run manual tests on an app. So I go through and interactively steer toward a useful validation of a particular PR, navigating specifics of the app and what I care about and what I don't. Then in the end I have it build a skill that would have skipped backtracking and retries and the steering I did.

Then I do it again from scratch; this time it takes less steering. I have it update the skill further.

I've been doing this on a few different tests, building a skill that takes less and less steering to do app-specific and team-specific manual testing. The first few times through, it took longer than manually testing the feature. While I've only started doing this recently, it now takes less time than I would take, and it posts screenshots of the results and testing steps in the PR for dev review. Ongoing exploration!


I love the screenshots, I need to do something like that.


Yeah I care about LLM's generating skills after attempting tasks and learning lessons from those attempts, not before attempting a task for the first time. This result seems a little silly and detached from the reality of how skills are "auto-generated" in the real world.


That is my approach. I don't think the paper's authors have actually used skills.


Did you check our repos and sites? The repo is skills-native. Also, please don't be misled by the original title; we have this configuration to eliminate the impact of LLMs' internal knowledge. It's in the paper.


Yeah some of my most useful AI tooling are skills created via a “role play session”. Basically brain dumping to the agent and telling it to ask questions and figure out how to accomplish a task, then distilling it into a skill at the end which is much tighter and evidence based from the actual problem solving session


This was very insightful. I've only just begun playing with some agent workflows and building out documentation to help it navigate my code base. Asking it to give me the top 10 unanswered questions from analyzing the docs and code was very useful.


YAGNI is the best tool in your toolbox for AI agents. Don't build out what you think will be useful; layer things into your AI toolbox as they prove they are needed. Especially for Claude, running `/init` ends up with a lot of really unnecessary/hallucinated info. Keep it all simple and layer on top.


I would frame the 'post-trajectory generated skills' as feedback-generated skills, as does Letta: https://www.letta.com/blog/skill-learning. We haven't seen existing research or hypotheses debating whether the improvement from skills might come from the skill prompts themselves activating knowledge already in the LLM. That's why we added an ablation with 'pre-trajectory generated skills': we had that hypothesis, and this seemed a very clean way to test it. It is also very plausible that feedback-generated skills help, because they almost certainly contain the agent's failure modes on those specific tasks.


> Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to

Just last week I had Claude build me a skill for when I ask it to help me troubleshoot issues, and it came out quite good.

It did have some issues (Claude tends to over-specify based on anecdotal data), but it's a strong step in the right direction.

Also, "skills" are too broad in my opinion. I have one (that Claude wrote) with my personal data that I have available when I analyze my workouts.

I think there's ample room for self-generated skills when you have a rather long exchange on a domain you plan to revisit, _especially_ when it comes to telling Claude what not to do.


Yeah, they've got it backwards. I tried to sum it up in thisistheway.to/ai but what's been working for us is that every agent miss is a learning opportunity:

1. Capture the miss — What did the agent do? What did reality say?

2. Diagnose — What didn't it see? Missing data, constraint, feedback, or boundaries?

3. Choose a primitive — Observability, instructions, tooling, guardrails, or verification?

4. Encode as artifact — Version-controlled, repeatable, not just memory.

5. Promote to gate — When it's worth enforcing, make it a gate.

Every harness I set up includes this process in the primary set of agent instructions.
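A minimal sketch of how that loop could be encoded as a version-controlled artifact (the field names and primitive labels here are my own, not from any particular framework):

```python
from dataclasses import dataclass

# The five primitives a miss can be addressed with.
PRIMITIVES = {"observability", "instructions", "tooling", "guardrails", "verification"}

@dataclass
class Miss:
    what_agent_did: str     # 1. capture the miss
    what_reality_said: str
    diagnosis: str          # 2. what the agent couldn't see
    primitive: str          # 3. which primitive addresses it
    artifact: str           # 4. the version-controlled fix, e.g. a rule file
    is_gate: bool = False   # 5. promoted to an enforced check?

    def promote(self):
        """Promote the artifact to an enforced gate once it's proven worth it."""
        assert self.primitive in PRIMITIVES
        self.is_gate = True
        return self
```

Keeping misses as records like this, rather than loose memory, is what makes the "promote to gate" step auditable.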


> it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills

I'm reading this paper as: don't do this. If you deploy agents to your workforce, don't tell people to use skills; tell them to give the agents tasks. This sounds obvious but might not be to everyone. (And in any case, it's nice for researchers to have confirmed that pre-prompt skill writing doesn't work. It would have been neat if it had.)


> I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.

You mean the dude who writes articles on TechCrunch and Ars Technica based off of HN and Reddit thread titles because he doesn't understand what real journalism is? Sure, we can count on him :)


After several failures and then a success, I have the agent create the skill; on the next run it succeeds on the first try.


I interpreted it as "Allowing the LLM to add skills to itself as it completes a task doesn't provide a meaningful improvement over just letting it reason normally", which seems to be what the paper is fundamentally getting at.


> I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.

:D



