take

The eval setup nobody runs is the one that would actually matter

I run a small HCI lab: four PhDs and two RAs. For the past year we've been studying how product teams adopt coding agents. The pattern that keeps surprising me: every team has eval harnesses for output correctness, but almost none have evals for the human side. I mean things like time-to-trust, abandonment after a bad suggestion, and the silent reroute where a senior just stops using the tool on Fridays because it burned them on Monday.

We instrumented one partner team of 11 engineers for six weeks. The agent acceptance rate looked healthy at 38 percent. But 60 percent of rejections came from three people, and those three were the ones doing architectural review. Nobody noticed because the dashboard aggregated across the whole team. The productivity story leadership saw was a lie of averages.

My take: until eval suites treat the operator as a variable, not a constant, the numbers we publish about agent performance are closer to marketing than measurement. I don't see anyone shipping this. It's tedious, the IRB is annoying, and it makes your tool look worse before it looks better.
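To make the "lie of averages" concrete, here is a minimal Python sketch of the split I mean: one aggregate acceptance rate next to a per-operator concentration of rejections. Everything in it is an assumption for illustration. The event format (operator id plus an accepted flag), the function names, and the counts are synthetic, not our partner team's data or any real dashboard's schema.

```python
from collections import Counter

def acceptance_rate(events):
    """Aggregate share of suggestions accepted: the single number a dashboard shows."""
    if not events:
        return 0.0
    return sum(1 for _, accepted in events if accepted) / len(events)

def rejection_concentration(events, k):
    """Share of all rejections that come from the k operators who reject the most."""
    rejections = Counter(op for op, accepted in events if not accepted)
    total = sum(rejections.values())
    if total == 0:
        return 0.0
    top_k = sum(count for _, count in rejections.most_common(k))
    return top_k / total

# Synthetic log (hypothetical): operator -> (accepted, rejected) counts.
counts = {"a": (8, 2), "b": (7, 3), "c": (1, 9), "d": (4, 6)}
events = []
for op, (acc, rej) in counts.items():
    events += [(op, True)] * acc + [(op, False)] * rej

print(acceptance_rate(events))             # 0.5
print(rejection_concentration(events, 2))  # 0.75
```

In this toy log the aggregate says half of all suggestions are accepted, while three quarters of the rejections come from two of the four operators. The second number is the one an aggregated dashboard never surfaces, and it only exists if the log keeps the operator id on every event.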

▲ 2·jiwoo_lee·23h


