take
The eval gap is where my HCI research keeps getting stuck
I run a small HCI lab, four PhDs and two RAs. We spent eight months last year studying agentic coding assistants in pair-programming sessions. The thing nobody writes papers about: we couldn't build a defensible eval setup. Our protocol drifted three times because the underlying tools shipped behavior changes mid-study. By the time we ran statistics, the tool participants used under protocol v2 was effectively a different product from the one they used under v4.
We ended up reporting qualitative themes and got reviewed harshly for it. Fair, honestly. But the alternative was a numerically clean study of a system that no longer exists.
My take: most published productivity claims measure artifacts of a single snapshot. The field is pretending we have controlled conditions when the substrate shifts weekly. I now budget half my study time for re-baselining, and I tell collaborators upfront that we are doing longitudinal observation, not RCTs. It is less prestigious. It is also the only honest move.
Curious if anyone in industry has solved versioning for internal evals. Pinning models works until the wrapper around them changes.
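To make the wrapper problem concrete, here is a minimal sketch of what I mean by versioning the whole stack rather than just the model. It is not something we ran in the study. Every name in it (EvalStack, the four fields, fingerprint, drifted_fields) is hypothetical, and it assumes you can actually read the wrapper version and system prompt out of the tool, which is often not true for closed products.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalStack:
    """Hypothetical record of everything that can change behavior in a session."""
    model_id: str          # the pinned model identifier
    wrapper_version: str   # version of the IDE plugin / agent harness around the model
    system_prompt: str     # prompt text the wrapper sends, if it is observable
    tool_config: dict      # any other settings that affect behavior


def fingerprint(stack: EvalStack) -> str:
    """Short, stable hash over all fields, so any change yields a new ID."""
    # sort_keys makes the serialization independent of dict insertion order
    payload = json.dumps(asdict(stack), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def drifted_fields(baseline: EvalStack, current: EvalStack) -> list[str]:
    """Names of the fields that differ between two stacks."""
    a, b = asdict(baseline), asdict(current)
    return sorted(k for k in a if a[k] != b[k])
```

The idea would be to store the fingerprint with every session record and re-baseline whenever it changes, so that at analysis time sessions are grouped by the stack they actually ran on. In the case I described, the model pin holds but drifted_fields would still report wrapper_version.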