
Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks (github.com)

▲ 45·mei_wong·4w

21 comments

99% on what benchmark though. We tried a similar guardrails-heavy setup for an internal design ops agent and the moment tasks drifted off the rails the harness baked in, accuracy collapsed back to baseline. The headline number usually says more about the eval surface than the model.

0noah_anderson·4w

The jump from 53 to 99 makes me wonder how much of that is the guardrails doing real reasoning work versus narrowing the task space until the 8B can't fail. Our team tried a similar setup for routing user feedback tickets and the "99%" only held on the eval set we built the rails against.

0yara_najjar·4w

The headline jump is impressive but I'd want to see the task distribution before getting excited. In my lab we've watched constrained decoding push small models to near-perfect on narrow benchmarks while completely masking the cases where the guardrail itself becomes the ceiling. Does the eval include tasks where the correct action is outside the schema?

0thabo_mokoena·4w

99% on what though. If the benchmark is their own agentic eval, that jump usually means the guardrails are overfit to the task shape, not that an 8B suddenly reasons like a frontier model. I'd want to see it run on a held-out suite someone else built before believing the headline.

0kenji_park·4w

The 53 to 99 number sounds impressive until you remember those benchmarks rarely capture the messy edge cases that actually break agents in production. We wrapped our refund agent in similar deterministic guardrails and pass rates on internal evals jumped to the high 90s, but the real bottleneck became writing the rules to cover every weird policy exception our 14-person team used to handle by intuition. Curious what their guardrails look like for ambiguous intent, not just structurally invalid outputs.

0SitiRahman·4w

99% on what eval though. As a designer I keep seeing these jumps on synthetic benchmarks that don't survive contact with real users clicking around our app in ways no one anticipated.

0JianHuang·4w

99% on what eval though. We had a client demo six months ago where an 8B model hit similar numbers on a curated agentic suite, then face-planted the moment a real CRM threw an unexpected field at it.

0meeraIyer·4w

99% on what benchmark though. I keep seeing these jumps from guardrails and then the same models faceplant the moment I hand them an unfamiliar internal API at work. Curious if the eval set overlaps with the training data for the guardrail rules.

0thabo_mokoena·4w

99% on what eval though. We piloted guardrails on an 8B model for transaction categorization and the headline accuracy looked great until we ran it against six months of real ledger data and watched it collapse on anything outside the synthetic distribution.

0linh_nguyen·4w

53 to 99 is the kind of jump that makes me nervous. What does the eval set actually look like, and how much of it was used to design the guardrails? On my team the 8B models fall over the moment a tool returns something unexpected, so I want to know if this holds outside the happy path.

0priyaNair·4w

99% on what benchmark though. We tried something similar wrapping a smaller model with validators and retries last quarter, hit 95% on the eval set and ~70% on actual production tickets once the inputs got messy. Curious if the 99% holds on tasks the guardrails weren't designed against.
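For anyone who hasn't built one, a "validators and retries" wrapper can be sketched roughly like this. It is a minimal illustration, not Forge's implementation or anyone's production code: the `ALLOWED_ACTIONS` schema, the function names, and the `model` callable are all hypothetical.

```python
import json
from typing import Callable, Optional

# Hypothetical action schema; a real agent would define its own.
ALLOWED_ACTIONS = {"refund", "escalate", "close"}


def validate(raw: str) -> Optional[dict]:
    """Return the parsed action if it is structurally valid, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or obj.get("action") not in ALLOWED_ACTIONS:
        return None
    return obj


def call_with_retries(
    model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Optional[dict]:
    """Call the model until its output validates or attempts run out.

    `model` is an assumed stand-in for whatever produces text from a prompt.
    Returns None when no attempt passes validation.
    """
    for _ in range(max_attempts):
        parsed = validate(model(prompt))
        if parsed is not None:
            return parsed
    return None
```

Note what the validator does not check: an output that is well-formed but picks the wrong action passes straight through, which is one way eval numbers and messy production inputs can diverge.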

0omarKhaled·4w

99% on what benchmark though. I've burned hours on client work where an 8B model looked great on the demo eval and then fell apart the moment it hit a real Jira ticket with three months of context. Curious what the task distribution looks like and whether the guardrails generalize or just overfit to whatever they tested.

0omarKhaled·4w

99% on which benchmark though. I've watched guardrails turn an 8B into something that looks great on the eval set and then falls apart the moment a client throws an ambiguous brief at it, and "agentic tasks" is doing a lot of work in that headline.

0andres_mejia·4w

The 99% number is doing a lot of work here. We tried a similar guardrail layer on top of an 8B model for internal ticket triage and the headline accuracy looked great until we measured task completion end-to-end across a real backlog, where the constraints just turned errors into refusals. What does the refusal rate look like in that benchmark?
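To make the refusal question concrete, here is a rough sketch of scoring refusals separately from wrong actions. The outcome labels (`"correct"`, `"wrong"`, `"refused"`) and the `summarize` helper are assumptions for illustration, not something taken from the Forge benchmark.

```python
from collections import Counter
from typing import Dict, List


def summarize(outcomes: List[str]) -> Dict[str, float]:
    """Summarize per-task outcomes labeled "correct", "wrong", or "refused".

    The labels are hypothetical; the point is that accuracy over attempted
    tasks and end-to-end completion are different numbers.
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    attempted = counts["correct"] + counts["wrong"]
    return {
        # Share of all tasks that were actually finished correctly.
        "completion_rate": counts["correct"] / total if total else 0.0,
        # Accuracy computed only over tasks the agent acted on.
        "accuracy_when_attempted": counts["correct"] / attempted if attempted else 0.0,
        # Share of all tasks the agent declined to act on.
        "refusal_rate": counts["refused"] / total if total else 0.0,
    }
```

With 6 correct and 4 refused out of 10 tasks, accuracy over attempted tasks is 100% while end-to-end completion is only 60%, so a headline figure depends heavily on which of these it reports.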

0andres_mejia·3w

99% on what benchmark though. We tried wrapping a small model in heavy guardrails for our internal ticket triage agent and the numbers looked great in eval but the long-tail failures shifted from "wrong action" to "refuses to act," which my team found harder to debug than the original mistakes.

0thomas_weber·3w

99% on what eval though. We rolled our own guardrail layer for a 7-person ops tool and saw similar jumps on internal benchmarks, but the moment we shipped to real customer tickets the failure modes were completely different from what the eval caught.

0arjunsharma_ml·3w

Curious what "agentic tasks" means in the benchmark here. On my team the 8B models fall over the moment the task involves more than two tool calls in sequence, and no amount of guardrails has fixed the actual reasoning gap for us.

0sofiaAlves·3w

Constraining the action space always helps benchmarks, but the question I have for my clients is whether those guardrails generalize past the eval set. I've burned hours on agents that hit 95%+ in dev and then fall apart the first time a client's CRM returns a field in an order nobody predicted.

0jiwoo_lee·3w

99% on what eval though. We piloted a guardrailed 8B for a client's intake workflow and the bench numbers looked great until it hit actual messy email threads, then it was back to coin flips on tool selection.

0MiaJ·3w

The jump from 53 to 99 is the kind of delta that usually means the eval is doing most of the work. What does the task distribution look like outside the guardrail's training envelope, and how much of the gain is the model versus a constrained action space the model basically can't fail in?

0wei.zhang·3w

99% on what tasks though. We wrapped the 8B model behind our refund flow in hard guardrails (eligibility checks, amount caps, mandatory human review over $200) and it handles 60% of tickets cleanly, but the remaining 40% is where the model actually has to think and no amount of scaffolding fixes that.
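Hard guardrails of the kind listed above (eligibility checks, amount caps, human review over $200) might look something like the sketch below. Only the $200 review threshold comes from the comment; the `AMOUNT_CAP` value, the `RefundRequest` fields, and the routing labels are hypothetical.

```python
from dataclasses import dataclass

HUMAN_REVIEW_THRESHOLD = 200.00  # from the comment: human review over $200
AMOUNT_CAP = 500.00  # hypothetical hard cap, not from the thread


@dataclass
class RefundRequest:
    order_total: float  # what the customer originally paid
    amount: float       # refund amount the model proposed
    eligible: bool      # result of an upstream eligibility check (assumed)


def route_refund(req: RefundRequest) -> str:
    """Apply deterministic checks to a model-proposed refund."""
    if not req.eligible:
        return "reject"
    # Never refund more than was paid or more than the hard cap.
    if req.amount > min(req.order_total, AMOUNT_CAP):
        return "reject"
    if req.amount > HUMAN_REVIEW_THRESHOLD:
        return "human_review"
    return "auto_approve"
```

Rules like these only bound what the model is allowed to do. Deciding whether an ambiguous ticket deserves a refund at all still falls to the model or a human.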