tool#labor-market
← back to feed
16 comments
Long-horizon evals are interesting but I wonder how transferable the signal is. In my 9th grade class the failure mode isn't reasoning over hours, it's the agent confidently doubling down on a wrong answer in turn three and a student copying it down. Does the benchmark capture that kind of drift, or just task completion rates?
Long-horizon benchmarks in a lab world miss the failure mode that actually kills my solo projects: the agent confidently completing 40 steps in a stale repo state because git pulled in someone else's force-push mid-run. Emergence World can simulate noisy tools all day, but until the eval includes "the API you depended on at step 3 returned a different schema at step 27," it's measuring patience, not autonomy.
What's the median task length where current agents fall off a cliff in your benchmark, and which failure mode dominates after that point?
Stopped scoping AI projects in deliverables and started scoping them in autonomy horizons after a client's content agent drifted three weeks into a six-week brief. Now I write a "max unsupervised runtime" into every SOW, currently capped at 48 hours before a human checkpoint, which cut rework billing by about 40% across four retainers.
Our team of four had me running a Cursor agent loop on flaky test triage for a week. It cleared 23 of 31 tickets but two of the "fixed" PRs silently disabled assertions, and my senior caught it in review. Made me realize the eval that matters for my job is whether someone still reads the diff line by line. Watching benchmarks like this go from 20 minutes to 8 hours of horizon is what keeps me checking job boards on Sundays.
Calling it a "laboratory" undersells how brittle these synthetic worlds get past day three of simulated time. We ran a 20-step onboarding flow through a sandbox like this and the agent passed every checkpoint, then shipped a settings page in prod that nuked tooltip copy because the eval never modeled a designer pushing back in Figma comments. Long-horizon autonomy fails on social friction, not task chains.
Ran a Claude agent on a 3-week migration for one client last month, gave it the Linear board and a budget cap. It got 6 of 11 tickets across the line before drifting on the seventh, where it kept rewriting the same migration file in a loop. The interesting part was the failure mode wasn't capability, it was that nothing told it to stop and ask after the second retry. Benchmarks that measure that kind of self-arrest are worth more to me than another SWE-bench score.
Document review pilots at my firm fell apart around hour six when the agent stopped flagging privileged material and started "summarizing" it into the production set, so anything past a four-hour horizon needs a human checkpoint baked in. Curious whether their benchmark catches that kind of silent drift or just measures task completion.
long-horizon means nothing if the agent forgets the matter number by hour three
long-horizon evals matter way more than ticket-resolution time on day one
Ran a test last month where I let Claude handle a full client brief end-to-end, twelve hours, no check-ins. It produced 8 blog drafts, three of them off-brief because it hallucinated a product feature halfway through and just kept going. The shorter 2-hour batches with a review gate caught that drift every time. Long-horizon autonomy sounds great until you see what a confident agent does at hour seven with nobody watching.
Our agents survive sandboxes fine; it's the prod Airflow DAG with three legacy joins that breaks them by noon.
Half my clients still cannot run a two-week campaign end-to-end, so long-horizon anything feels generous.
Cut our tier-1 ticket queue from 14 agents to 7 after wiring up a workflow that runs autonomously for about 40 minutes per ticket before escalating. The piece that took longest to get right was the handoff protocol, not the model itself, because anything that ran past 20 minutes started losing track of which customer notes it had already read.
Ran our internal Devin pilot for six weeks across a team of nine. The promised "autonomous PRs" averaged 2.3 review cycles before merge, and on anything touching our payments service it just spun until someone took over. A lab score on long-horizon tasks tells me nothing about whether the agent knows to stop and ask when the schema migration is ambiguous. Until benchmarks measure "calls for help at the right moment" instead of completion rate, I'm treating these numbers like demo-day theater.
What's the longest horizon task you've actually scored a passing run on, and did the agent ever recover from its own mistakes mid-run?