The best benchmark is one that you build for your use-case. I finally did that f...

airstrike · 2025-12-02T18:15:11 1764699311

If you and others have any insights to share on structuring that benchmark, I'm all ears.

There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.

The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.

pants2 · 2025-12-02T20:00:01 1764705601

Generally, the easiest:

1. Sample a set of prompts / answers from historical usage.

2. Run those prompts through various frontier models again, and where the models don't agree on an answer, hand-pick the one you're looking for.

3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set.

4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
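A rough sketch of what step 3 could look like, assuming each model is wrapped in a plain function that takes a prompt and returns text. The names here (`run_benchmark`, `Result`, the stub models, the per-call costs) are made up for illustration; in practice `ask` would call OpenRouter or whatever API you use, and exact-match scoring only fits tasks with a single right answer.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    model: str
    accuracy: float      # fraction of test cases answered correctly
    avg_seconds: float   # mean wall-clock time per prompt
    cost_usd: float      # total cost for the whole test set


def run_benchmark(
    name: str,
    ask: Callable[[str], str],          # hypothetical wrapper around a model API
    test_set: list[tuple[str, str]],    # (prompt, expected answer) pairs
    cost_per_call: float,               # assumed flat price per request
) -> Result:
    correct = 0
    start = time.perf_counter()
    for prompt, expected in test_set:
        answer = ask(prompt)
        # Exact match after normalizing case and whitespace.
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    elapsed = time.perf_counter() - start
    n = len(test_set)
    return Result(name, correct / n, elapsed / n, cost_per_call * n)


# Stub "models" so the sketch runs without network access.
test_set = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
answers = dict(test_set)
stub_a = lambda prompt: answers[prompt]   # always right
stub_b = lambda prompt: "4"               # right only on the first prompt

results = [
    run_benchmark("stub-a", stub_a, test_set, cost_per_call=0.002),
    run_benchmark("stub-b", stub_b, test_set, cost_per_call=0.001),
]
best = max(results, key=lambda r: r.accuracy)
```

Picking by accuracy alone is just the simplest choice; you'd likely weigh cost and speed in too, depending on the use case.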

dotancohen · 2025-12-03T12:42:57 1764765777

How do you find and decide which obscure models to test? Do you manually review the model card for each new model on Hugging Face? Is there a better resource?

pants2 · 2025-12-05T04:49:38 1764910178

Just grab the top ~30 models on OpenRouter[1] and test them all. If that's too expensive, make a small 'screening' benchmark out of just a few of the hardest problems to see whether a model is even worth running through the full benchmark.

1. https://openrouter.ai/models?order=top-weekly&fmt=table
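One way the 'screening' idea could be sketched, assuming you already have pass/fail results from earlier runs: treat the problems that the fewest models solved as the hardest, and use those as the screen. The function name and the data layout are hypothetical.

```python
def hardest_problems(past_results: dict[str, list[bool]], k: int) -> list[int]:
    """Return indices of the k problems with the lowest pass rate.

    past_results maps a model name to one pass/fail flag per problem,
    in the same problem order for every model.
    """
    runs = list(past_results.values())
    n_problems = len(runs[0])
    pass_rate = [
        sum(run[i] for run in runs) / len(runs) for i in range(n_problems)
    ]
    # sorted() is stable, so ties keep their original problem order.
    return sorted(range(n_problems), key=lambda i: pass_rate[i])[:k]


# Made-up results for three models on four problems.
past = {
    "model-a": [True, False, True, False],
    "model-b": [True, False, False, False],
    "model-c": [True, True, False, False],
}
screen = hardest_problems(past, k=2)
```

A new model then only gets the full benchmark if it clears some threshold on the `screen` subset.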

dotancohen · 2025-12-05T07:59:06 1764921546

Thank you! I'll see about building a test suite.

Do you compare models' output subjectively, manually? Or do you have some objective measures? My use case would be to test diagnostic information summaries - the output is free text, not structured. The only way I can think to automate that would be with another LLM.

Advice welcome!

pants2 · 2025-12-05T16:53:51 1764953631

Yeah - things are easy when you can objectively score an output, otherwise as you said you'll probably need another LLM to score it. For summaries you can try to make that somewhat more objective, like length and "8/10 key points are covered in this summary."

Scoring outputs this way is also used in real training methods (like Group Relative Policy Optimization), so it's a legitimate approach.
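A minimal sketch of the "8/10 key points are covered" style of scoring, assuming you write the key points per test case by hand. The `judge` argument is where an LLM call would go; the `keyword_judge` below is only a stand-in substring check so the example runs offline, and all names are made up.

```python
from typing import Callable

# A judge decides whether one key point is covered by the summary.
Judge = Callable[[str, str], bool]


def keyword_judge(key_point: str, summary: str) -> bool:
    # Stand-in for an LLM judge: naive case-insensitive substring match.
    return key_point.lower() in summary.lower()


def coverage_score(summary: str, key_points: list[str], judge: Judge) -> float:
    """Fraction of key points the judge considers covered."""
    covered = sum(1 for point in key_points if judge(point, summary))
    return covered / len(key_points)


def within_length(summary: str, max_words: int) -> bool:
    """Simple objective check on summary length."""
    return len(summary.split()) <= max_words


summary = "Disk usage at 95%; fan failure detected on node 3."
key_points = ["disk usage", "fan failure", "memory leak"]

score = coverage_score(summary, key_points, keyword_judge)
short_enough = within_length(summary, max_words=20)
```

Swapping `keyword_judge` for a function that asks a model "does this summary cover this point, yes or no?" keeps the rest of the scoring unchanged.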

dotancohen · 2025-12-05T17:30:50 1764955850

Thank you. I will google Group Relative Policy Optimization to learn about that and the other training methods. If you have any resources handy that I should be reading, that would be appreciated as well. Have a great weekend.

pants2 · 2025-12-05T19:44:54 1764963894

Nothing off the top of my head! If you find anything good, let me know. GRPO is a training technique, so it's likely not exactly what you'd do for benchmarking, but it's interesting to read about anyway. Glad I could help!