take

The eval gap is where agentic projects quietly die

I lead a small team (4 engineers) building agentic workflows for a mid-market ops platform. Last quarter we shipped three agents to production. Two are now turned off.

The pattern was identical both times. We had decent vibes-based evals during dev: a notebook of 30 cases, eyeballed pass rates, the occasional LLM-as-judge run. Demos crushed it. Then real traffic hit and we couldn't tell whether a regression came from the model, the prompt, a tool change, or just distribution drift. Rolling back took days because we had no frozen baseline.

The one agent still running is the one where we forced ourselves to build 200+ labeled cases before any prompt work, wired them into CI, and set a hard merge gate at a 92% pass rate. It's boring. It's also the only agent that survived contact with users.

My take: most teams ship the agent and write the evals after. The order matters. Without a real eval harness you don't have an agent, you have a demo with a retry loop. The review bottleneck has moved from code review to eval review, and almost nobody is staffed for it.
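For anyone wondering what a hard merge gate over labeled cases can look like, here is a minimal sketch in Python. It is not our actual harness. The names (`EvalCase`, `pass_rate`, `gate`, `exit_code`) are made up for illustration, and exact-match scoring is an assumption; a real agent would likely need a task-specific scorer. Only the 92% threshold comes from the post.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# The 92% merge gate mentioned in the post.
PASS_THRESHOLD = 0.92


@dataclass(frozen=True)
class EvalCase:
    """One labeled case (hypothetical shape: an input plus its expected output)."""
    case_id: str
    input_text: str
    expected: str


def pass_rate(cases: Sequence[EvalCase], agent: Callable[[str], str]) -> float:
    """Fraction of cases where the agent output matches the label exactly.

    Exact match is an assumption made for this sketch.
    """
    if not cases:
        raise ValueError("eval set is empty; refusing to report a pass rate")
    passed = sum(1 for c in cases if agent(c.input_text).strip() == c.expected)
    return passed / len(cases)


def gate(
    cases: Sequence[EvalCase],
    agent: Callable[[str], str],
    threshold: float = PASS_THRESHOLD,
) -> bool:
    """True when the pass rate meets or exceeds the merge threshold."""
    return pass_rate(cases, agent) >= threshold


def exit_code(cases: Sequence[EvalCase], agent: Callable[[str], str]) -> int:
    """Process exit code for a CI step: 0 lets the merge through, 1 blocks it."""
    return 0 if gate(cases, agent) else 1
```

In CI this would run as a required check, with something like `sys.exit(exit_code(cases, agent))` at the end of the job. The empty-set check is deliberate: a gate that passes on zero cases is the kind of silent failure the post is about.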

▲ 0·layla_haddad·1d
