intheloop


take

The eval setup nobody on my team wanted to build

We're a two-person shop building a doc-extraction pipeline for invoices. For six weeks we shipped on vibes: I'd tweak a prompt, my cofounder would eyeball ten PDFs, we'd ship. A customer churned in March after we silently regressed on a vendor whose invoices use a two-column layout. The prompt change that helped scanned receipts broke that one customer's documents, and we had no way to know.

So I spent a weekend wiring up a dumb eval harness: 140 labeled invoices, exact match on line items, fuzzy match on totals, run in CI on every prompt change. It took maybe 14 hours end to end and caught three regressions in the first month.

The part I want to flag: I resisted this for weeks because it felt like premature infrastructure for a startup. It wasn't. The moment you have more than one customer with different document shapes, your head stops being a sufficient eval. I think founders skip evals because they confuse "move fast" with "don't measure." The teams I see shipping agentic stuff well treat eval sets as the actual product asset. The prompts are disposable.
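
For anyone picturing what "dumb" means here, a minimal sketch of a harness of that shape follows. It is not the code from this post. The names (`LineItem`, `Invoice`, `evaluate`, `ci_exit_code`) are hypothetical, and "fuzzy" is assumed to mean "within one cent", since the post does not say what tolerance was actually used.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable


@dataclass(frozen=True)
class LineItem:
    description: str
    quantity: int
    amount: Decimal


@dataclass
class Invoice:
    line_items: list[LineItem]
    total: Decimal


def line_items_match(expected: Invoice, predicted: Invoice) -> bool:
    # Exact match: same items, same field values, same order.
    return expected.line_items == predicted.line_items


def total_matches(expected: Invoice, predicted: Invoice,
                  tolerance: Decimal = Decimal("0.01")) -> bool:
    # "Fuzzy" is ASSUMED to mean an absolute tolerance on the total.
    return abs(expected.total - predicted.total) <= tolerance


def evaluate(cases: list[tuple[str, str, Invoice]],
             extract: Callable[[str], Invoice]) -> list[tuple[str, str]]:
    """Run the extractor over labeled cases; return (case_name, field) failures.

    Each case is (name, document, expected). `extract` stands in for the
    prompt-plus-model pipeline under test.
    """
    failures = []
    for name, document, expected in cases:
        predicted = extract(document)
        if not line_items_match(expected, predicted):
            failures.append((name, "line_items"))
        if not total_matches(expected, predicted):
            failures.append((name, "total"))
    return failures


def ci_exit_code(failures: list[tuple[str, str]]) -> int:
    # Non-zero exit fails the CI job, which blocks the prompt change.
    return 1 if failures else 0
```

In CI, a small script would load the labeled set, call `evaluate` with the current pipeline, print the failures, and pass `ci_exit_code(failures)` to `sys.exit`. Keeping the case name in each failure is what makes a regression on one vendor's layout visible instead of getting averaged away in an overall score.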
4·linh_nguyen·2d

1 comment

0laylahaddad·1d
Same story on our side. Got pushback for weeks until I ran a 50-doc benchmark by hand and showed the senior associates that the model was missing indemnity clauses in roughly 1 in 7 contracts. Suddenly everyone had time to help label.