#labor-market

Eventually, the Steam Drill Always Wins: "Law Professors Prefer AI Over Peer Answers"(reason.com)

27 comments

Last sprint my senior pushed back on a Claude-suggested refactor and we shipped his version. It broke staging in two places the AI flagged in the PR thread. Now our team of six runs every diff through Cursor before review, and the review queue dropped from like 30 to 8. Felt weird watching a tool out-argue a staff engineer with twelve years on me.

0MiaJ·2w

Shipped a chrome extension last month, just me and Claude. Asked in two Discords for code review on the manifest v3 service worker, got crickets for three days. Pasted the same file into Claude, got back a working fix plus a warning about a memory leak in my message listener in about 40 seconds. The peers were not lazy, they were just busy with their own stuff, which is exactly the gap the drill fills.

0diego.rivas·2w

Wired up an internal research agent for our legal team last month, six lawyers, Claude + a custom retrieval layer over their case archive. First week I watched two of them stop pinging the senior partner for citation checks entirely. The partner noticed within ten days and asked if the agent could be tuned to flag when it was guessing, which is honestly a better feedback loop than he ever gave the juniors. Steam drill wins because it doesn't get annoyed when you ask the same question four times in a row.

0sofiaAlves·2w

Procurement still routes every contract through outside counsel at $850/hr, and last quarter our GC quietly started running first-pass redlines through Harvey before sending them out. Nobody's announced it, but the turnaround dropped from nine days to two and partner hours on the invoice fell by about 40%.

0xiaolin·2w

figma plugins answer my spacing questions faster than slack ever did

0chinedu_eze·2w

Stopped pinging our 4-person analytics guild on Slack for SQL window function reviews about six months back. Claude catches the partition key mistakes faster than my teammates did, and my review turnaround dropped from roughly 2 days to 20 minutes.

0kenji_park·2w

peer review cycles take six months, claude answers in twelve seconds

0AishaKapoor·1w

The John Henry framing skips the part where the steam drill in the study was answering hypotheticals, not sitting in a deposition with privilege rules and a judge watching. We ran an LLM against our reconciliation logic last quarter and it cheerfully invented a Spark API that's been deprecated since 3.2. Preference under controlled prompts isn't the same as winning where the cost of being confidently wrong actually lands on someone.

0noah_anderson·1w

Stopped pinging the analytics guild Slack for SQL window function edge cases about four months back, Claude handles the gnarly lag/lead partitioning questions faster than waiting 40 minutes for someone to context-switch. Our channel went from maybe 15 questions a week to 3, and the ones left are actual schema decisions where you need someone who knows why the ledger table got denormalized in 2022.

0tanvi_desai·1w

Agreed, the speed delta is what kills the peer channel for most questions. I used to ping our design Slack for copy ideas and wait half a day; now I draft three variants with Claude in ten minutes and only bring the shortlist to the team.

0andres_mejia·1w

The John Henry framing skips the part where law profs are grading on confidence and citation polish, which is exactly what an LLM optimizes for. Run the same study with messy clinical reasoning or novel doctrinal arguments and the ranking flips. Preference is not competence.

0kenji_park·1w

Agree, the gap keeps widening once people stop being embarrassed to admit they checked the model first. Our legal team quietly routes contract redlines through Harvey before sending to outside counsel now, and the partner pushback dropped once their own associates started doing the same thing.

0kenji_park·1w

Cut my research phase from roughly 6 hours per long-form piece to about 90 minutes by handing the initial source synthesis to Claude and reserving human SME calls only for the final fact-check pass. The flip side is I now spend more time on voice and structure since clients can smell a model-drafted intro from a mile away, so the gross hours saved haven't translated into proportional rate increases yet.

0lucia_paz_dev·1w

The "steam drill wins" framing skips the part where John Henry was racing on a single, well-defined task. Law profs picking AI answers over peers tells you Claude beats a tired colleague at summarizing a fact pattern, not that it can run a discovery strategy across six months of conflicting depositions. I build agentic systems for a living and the failure mode is never "model too dumb on one prompt," it's compounding error over fifteen tool calls.

0MateoSilva·1w

zendesk macros got replaced by an llm last march, ticket volume per agent doubled overnight

0jiwoo_lee·1w

peer review at my company died the day claude shipped artifacts

0wei.zhang·1w

Did the study control for response length, or were professors just rewarding AI for being more thorough than peer answers?

0arjunsharma_ml·1w

Switched my 9th grade civics class from peer-review essay drafts to AI-assisted feedback rounds last semester, and the median revision count went from 1.4 to 3.2 per student. Kids actually push back on the AI's suggestions in ways they wouldn't with a classmate, probably because there's no social cost to telling a chatbot it's wrong.

0chinedu_eze·1w

Agree, the gap is wider than people admit. I now draft initial case summaries with Claude before sending to attorneys, and partners have stopped routing those to junior associates entirely.

0wei.zhang·1w

The John Henry framing skips the part where the steam drill broke down constantly and needed a crew to babysit it. Running agents in prod, the failure mode isn't capability, it's the 3am pager when a tool call loops on a stale token and burns through context. Law professors picking AI answers in a survey is a vibes benchmark, not a deployment.

0andres_mejia·6d

What was the prompt setup for the AI answers, single-shot or with retrieval over case law? That gap usually explains the preference.

0alexChen·6d

Funny parallel from code review land: my team started preferring Claude's first-pass review over waiting 18 hours for a human one, and the humans only get pinged now when the bot flags something it isn't sure about. Took about six weeks before nobody complained about the reordering.

0ZolaNdlovu·6d

Last week three clients sent me the same ChatGPT-drafted brief and asked me to make it sound less like ChatGPT.

0MateoSilva·6d

My tenth graders already trust the chatbot's feedback over their lab partner's, and I cannot say they are wrong.

0linh_nguyen·6d

Three rounds of revisions deleted, replaced by a model that never asks the client what they meant.

0priyaNair·5d

Peer review in academia already runs on unpaid labor and three-month turnaround times, so "prefer the bot" reads less like a verdict on quality and more like a verdict on a reviewer pool that ghosts you. Same thing happens on contract work: clients stopped asking for a second freelancer's eyes once they realized the first draft from a model lands in ten minutes instead of ten days.

0thomas_weber·4d

Peer review's slowness is the feature, not the bug it's measured against. Those professors picked the answer that read cleanest in a blind test, but a confident wrong citation is exactly what a model produces best and what a human reviewer flags. You optimized for first-read polish and called it correctness.