- test harness for voice agents.
- agent configs from multiple platforms (Retell, VAPI, Bland, LiveKit) compile down to a unified AgentGraph IR (rough sketch of the idea after the list)
- import from one platform, test locally, export to another
- models are configured through LiteLLM and DSPy; if you're on a subscription, Claude Code can be used as a runner so you're not paying per API call
- metric judges produce continuous 0-1 scores instead of binary pass/fail, since a 0.65 and a 0.35 both fail a 0.7 threshold but represent very different agent behaviors (judge sketch after the list)
- results persist to DuckDB for querying across test history (example query after the list)
- currently adding auto-healing graph mutations, where a failed test proposes structural and prompt changes to the agent graph and those changes are validated against a regression suite (loop sketched after the list)
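The post doesn't spell out the AgentGraph IR, so here's a rough sketch of what a unified node/edge representation with per-platform import/export might look like. The field names and the from_retell/to_vapi methods are placeholders for illustration, not the project's actual API:

    # Hypothetical sketch of a unified agent-graph IR; field names and the
    # importer/exporter methods are illustrative, not the real voicetest API.
    from dataclasses import dataclass, field

    @dataclass
    class AgentNode:
        id: str
        prompt: str                          # node-level instructions
        tools: list[str] = field(default_factory=list)

    @dataclass
    class AgentEdge:
        source: str
        target: str
        condition: str                       # e.g. "caller asks for billing"

    @dataclass
    class AgentGraph:
        nodes: list[AgentNode]
        edges: list[AgentEdge]

        @classmethod
        def from_retell(cls, config: dict) -> "AgentGraph":
            # map one platform's flow JSON onto the shared node/edge shape
            nodes = [AgentNode(n["id"], n["instruction"]) for n in config["nodes"]]
            edges = [AgentEdge(e["from"], e["to"], e.get("condition", "")) for e in config["edges"]]
            return cls(nodes, edges)

        def to_vapi(self) -> dict:
            # emit the same graph in another platform's export format
            return {
                "nodes": [{"id": n.id, "prompt": n.prompt, "tools": n.tools} for n in self.nodes],
                "edges": [{"from": e.source, "to": e.target, "condition": e.condition} for e in self.edges],
            }

Once everything round-trips through one IR, "import from Retell, test locally, export to VAPI" is just a pair of adapter calls.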
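Here's a minimal sketch of the continuous-score judge idea, using LiteLLM directly (the prompt, model string, and threshold are placeholders; swapping the model string is all LiteLLM needs to route to a different provider, and the real project also wires this through DSPy):

    # Sketch of a metric judge that returns a continuous 0-1 score via LiteLLM.
    import litellm

    def judge_score(transcript: str, criterion: str, model: str = "gpt-4o-mini") -> float:
        prompt = (
            f"Criterion: {criterion}\n\nTranscript:\n{transcript}\n\n"
            "Rate how well the agent satisfied the criterion. "
            "Reply with a single number between 0 and 1."
        )
        response = litellm.completion(model=model, messages=[{"role": "user", "content": prompt}])
        # a real judge would parse this more defensively
        return float(response.choices[0].message.content.strip())

    # A 0.65 and a 0.35 both fail a 0.7 threshold, but the score tells you
    # whether the agent was close or completely off the rails.
    score = judge_score("...call transcript...", "agent confirms the callback number")
    passed = score >= 0.7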
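And a sketch of the DuckDB side; the table name and columns are made up, but this is the kind of query-across-history the continuous scores enable:

    # Persist judge scores to DuckDB so you can query across runs.
    import duckdb

    con = duckdb.connect("voicetest_results.duckdb")
    con.execute("""
        CREATE TABLE IF NOT EXISTS results (
            run_id TEXT, test_name TEXT, metric TEXT,
            score DOUBLE, ran_at TIMESTAMP DEFAULT current_timestamp
        )
    """)
    con.execute(
        "INSERT INTO results (run_id, test_name, metric, score) VALUES (?, ?, ?, ?)",
        ["run-42", "billing_handoff", "confirms_callback_number", 0.65],
    )

    # e.g. spot metrics that are drifting downward across test history
    con.sql("""
        SELECT test_name, metric, avg(score) AS avg_score, count(*) AS runs
        FROM results
        GROUP BY test_name, metric
        ORDER BY avg_score
    """).show()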
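The auto-healing part is still in progress; the loop I have in mind is roughly the following (all function names here are hypothetical placeholders):

    # A failed test proposes a change to the agent graph; the change is only
    # kept if it fixes the failure and the full regression suite still passes.
    def auto_heal(graph, failed_test, regression_suite, propose_mutation, run_test):
        candidate = propose_mutation(graph, failed_test)   # prompt tweak or structural edit
        if not run_test(candidate, failed_test):
            return graph                                   # mutation didn't fix the failure
        if all(run_test(candidate, t) for t in regression_suite):
            return candidate                               # fix accepted, no regressions
        return graph                                       # fix caused regressions; keep original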
https://github.com/voicetestdev/voicetest
Wrote up the architecture here https://peet.ldee.org/general/2026/02/03/testing-voice-ai-ag...