> limited to a single markdown file of instructions
A single file of instructions is common in most benchmark papers, e.g. Terminal Bench. We also have very complicated prompts, like this one: https://www.skillsbench.ai/tasks/shock-analysis-supply
> opaque verifier
Could you specify which tasks' verifiers are unclear or defective for benchmarking purposes?
> No problems involving existing codebases, refactors, or anything of the like
This is also not true; we have many such tasks, e.g. https://www.skillsbench.ai/tasks/fix-build-google-auto, https://www.skillsbench.ai/tasks/fix-build-agentops, https://www.skillsbench.ai/tasks/react-performance-debugging