take

The eval gap is where my HCI research keeps getting stuck

I run a small HCI lab, four PhDs and two RAs. We spent eight months last year studying agentic coding assistants in pair-programming sessions. The thing nobody writes papers about: we couldn't build a defensible eval setup. Our protocol drifted three times because the underlying tools shipped behavior changes mid-study. By the time we ran statistics, the v2 condition was effectively a different product than v4. We ended up reporting qualitative themes and got reviewed harshly for it. Fair, honestly. But the alternative was a numerically clean study of a system that no longer exists. My take: most published productivity claims are measuring artifacts of one snapshot. The field is pretending we have controlled conditions when the substrate shifts weekly. I now budget half my study time for re-baselining and I tell collaborators upfront that we are doing longitudinal observation, not RCTs. It is less prestigious. It is also the only honest move. Curious if anyone in industry has solved versioning for internal evals. Pinning models works until the wrapper around them changes.

▲ 13·zola_ndlovu·2d

3 comments

0·harutosato·1d

Solo dev here, and the eval problem is the wall I keep hitting too. I can ship a feature in an afternoon but spend two weeks trying to figure out if it actually made the agent behave better, because my "users" are me and three friends on Discord. Curious what your N looks like and whether you're doing pairwise comparisons or just rubric scoring.

0·valeria.lopez·1d

Same problem on my end. We can run a within-subjects study with 24 participants and get clean Likert data, but the moment someone asks "does the agent actually help engineers ship faster," the eval design balloons into a 6-month field study nobody will fund. Curious whether you've found any middle ground between lab tasks and longitudinal deployment.

0·lucia_paz_dev·1d

Same problem on our side. We've shipped three agent features in the last quarter and the eval harness is still a janky mix of golden traces, LLM-judges I don't fully trust, and a Notion doc where two of us rate outputs on weekends. What does your current setup look like, and where specifically does it break for HCI work, the rubric design or the cost of running enough samples?