intheloop

we cover >the work humans still do_

take

The eval setup nobody runs is the one that would actually matter

I run a small HCI lab, four PhDs and two RAs. For the past year we've been studying how product teams adopt coding agents. The pattern that keeps surprising me: every team has eval harnesses for output correctness, but almost none have evals for the human side. Time-to-trust, abandonment after a bad suggestion, the silent reroute where a senior just stops using the tool on Fridays because it burned them Monday. We instrumented one partner team of 11 engineers for six weeks. Agent acceptance rate looked healthy at 38 percent. But 60 percent of rejections came from three people, and those three were the ones doing architectural review. Nobody noticed because the dashboard only showed the team-wide aggregate. The productivity story leadership saw was a lie of averages. My take: until eval suites treat the operator as a variable, not a constant, the numbers we publish about agent performance are closer to marketing than measurement. I don't see anyone shipping this. It's tedious, the IRB is annoying, and it makes your tool look worse before it looks better.
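To make the "lie of averages" concrete, here is a minimal sketch of the breakdown a team-wide dashboard hides. It assumes each logged event is just an (operator, accepted) pair. The function names and the event log are hypothetical, and the numbers are synthetic, not data from the partner team.

```python
from collections import Counter

def acceptance_rate(events):
    """Team-wide share of suggestions accepted: the single dashboard number."""
    accepted = sum(1 for _, ok in events if ok)
    return accepted / len(events)

def rejection_share_by_operator(events):
    """Fraction of all rejections attributable to each operator, largest first."""
    rejections = Counter(op for op, ok in events if not ok)
    total = sum(rejections.values())
    return {op: n / total for op, n in rejections.most_common()}

def top_k_rejection_share(events, k=3):
    """Share of all rejections that comes from the k heaviest rejecters."""
    shares = rejection_share_by_operator(events)
    return sum(list(shares.values())[:k])

# Synthetic log (assumption): four operators, ten suggestions each.
events = (
    [("eng_a", True)] * 2 + [("eng_a", False)] * 8
    + [("eng_b", True)] * 8 + [("eng_b", False)] * 2
    + [("eng_c", True)] * 9 + [("eng_c", False)] * 1
    + [("eng_d", True)] * 9 + [("eng_d", False)] * 1
)

print("team acceptance rate:", acceptance_rate(events))
print("rejection share by operator:", rejection_share_by_operator(events))
print("share from top rejecter:", top_k_rejection_share(events, k=1))
```

In this toy log the team-wide rate looks fine while one operator accounts for most of the rejections, which is the shape of the problem described above. A real version would also need role labels so you can see whether the heavy rejecters are the people doing review.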
2·jiwoo_lee·23h

1 comment

0aminataDiallo·17h
The evals I want are multi-turn, multi-day runs against a real staging env with realistic tool failures injected, and nobody on my team of six has time to build that harness. So we keep shipping on single-turn rubric scores and finding the actual regressions in production three days later.
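For what it's worth, the failure-injection piece on its own can be small. Below is a rough sketch, assuming the agent calls its tools through plain Python callables. `ToolFailure`, `with_injected_failures`, and `run_episode` are hypothetical names, and this covers only seeded fault injection over a multi-turn loop, not the staging environment or multi-day scheduling.

```python
import random

class ToolFailure(Exception):
    """Raised when the wrapper injects a simulated tool failure."""

def with_injected_failures(tool, failure_rate, rng):
    """Wrap a tool so each call fails with probability failure_rate."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ToolFailure(f"injected failure in {tool.__name__}")
        return tool(*args, **kwargs)
    return wrapped

def run_episode(agent_step, tool, turns, failure_rate, seed):
    """Run a multi-turn episode and log, per turn, whether the step
    completed or died on a tool failure it did not handle itself."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    flaky_tool = with_injected_failures(tool, failure_rate, rng)
    log = []
    for turn in range(turns):
        try:
            log.append((turn, "ok", agent_step(turn, flaky_tool)))
        except ToolFailure as exc:
            log.append((turn, "tool_failure", str(exc)))
    return log

# Hypothetical stand-ins for a real tool and a real agent step.
def lookup(key):
    return f"value-for-{key}"

def naive_step(turn, tool):
    # This step does no retry or fallback, so every injected failure surfaces.
    return tool(turn)

for entry in run_episode(naive_step, lookup, turns=5, failure_rate=0.3, seed=7):
    print(entry)
```

The seed is the important design choice: when a run regresses, you can replay the exact same failure sequence against a new agent version instead of waiting for production to reproduce it.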