During training, our legal reasoning agent explored jurisdictionally questionable paths in 18% of all episodes. It tried to apply California law to Texas disputes. It cited federal precedents in state court contexts. It confused circuit boundaries.
By the end of training, it got jurisdiction right 97% of the time.
That 18% figure is the one I keep thinking about. Not because the agent made mistakes - every agent makes mistakes early in training - but because the kinds of mistakes changed depending on which tools were available, and the corrections the agent learned were not the same across tasks.
The uneven gains
We trained a 3B-parameter model on four types of legal reasoning tasks, with access to 18 specialized tools - legal databases, citation networks, court metadata, web search - under a budget of 9 tool calls per episode.
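For concreteness, that setup fits in a small config sketch. The field and task names below are my own shorthand (hypothetical, not from the training code); the numbers come from the post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Illustrative config for the setup described above. Field names are
    hypothetical; values (3B model, 18 tools, 9-call cap, 4 task types)
    follow the post."""
    model_size: str = "3B"
    num_tools: int = 18          # legal databases, citation networks,
                                 # court metadata, web search
    max_tool_calls: int = 9      # per-episode budget
    tasks: tuple = (
        "personal_jurisdiction",
        "learned_hands_business",
        "learned_hands_domestic_violence",
        "citation_prediction",
    )

cfg = TrainingConfig()
```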
The headline result: 10.5 percentage point improvement over baselines on LegalBench. But the headline obscures the interesting part.
The gains were not uniform:
| Task | Our model | Best baseline | Gap |
|---|---|---|---|
| Personal jurisdiction | 67.8% | 51.1% | +16.7 |
| Learned Hands business | 60.3% | 51.1% | +9.2 |
| Learned Hands domestic violence | 58.0% | 52.9% | +5.1 |
| Citation prediction | 57.4% | 51.9% | +5.5 |
Personal jurisdiction - the task requiring the agent to determine whether a court can exercise authority over a party based on complex, multi-state fact patterns - improved by nearly 17 points. Citation prediction, which is closer to pattern matching, improved by 5.5.
The tasks that require reasoning across multiple sources and jurisdictions benefited the most from tool access. The tasks closer to lookup or classification benefited the least.
This isn't surprising in retrospect. But it has implications that I think are underappreciated.
What the agent learned to do
During evaluation, we tracked tool sequences. Three coordination patterns emerged:
Research - Analysis - Validation. The agent would search for cases, retrieve the full opinion, then check which subsequent cases cited it. This mirrors how attorneys verify that a precedent hasn't been overruled.
Broad - Narrow. On unfamiliar queries, the agent started with a general legal search, identified the relevant area of law, then issued targeted queries. On familiar patterns, it skipped the broad search entirely.
Authority - Application. For IRAC-structured tasks, the agent retrieved the relevant authorities first, then applied them to the specific facts, then retrieved the full opinion to verify its reasoning.
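The patterns above can be tagged from tool-call logs with simple heuristics. This is a rough sketch: the matching rules are my simplifications, and `get_citing_opinions` is a hypothetical name for the citation-checking tool - the post names only `get_opinion`, `search_legal_cases`, and `get_court`.

```python
def classify_pattern(calls: list[str]) -> str:
    """Tag a tool-call sequence with one of the observed coordination
    patterns. Heuristic sketch only; real episodes would need fuzzier
    matching. get_citing_opinions is an assumed tool name."""

    def ordered(*tools: str) -> bool:
        # True if every tool occurs, with first occurrences in this order.
        pos = -1
        for t in tools:
            if t not in calls:
                return False
            nxt = calls.index(t)
            if nxt < pos:
                return False
            pos = nxt
        return True

    # Search -> retrieve opinion -> check later citing cases.
    if ordered("search_legal_cases", "get_opinion", "get_citing_opinions"):
        return "research-analysis-validation"
    # Re-retrieving the opinion at the end to verify the applied reasoning.
    if calls.count("get_opinion") >= 2:
        return "authority-application"
    # A general search followed by targeted follow-up queries.
    if calls.count("search_legal_cases") >= 2:
        return "broad-narrow"
    return "other"
```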
The average was 5.7 tool calls per query. The 9-call cap was hit in only 6% of episodes. The agent learned when to stop.
Nobody programmed these sequences. They emerged from the reward signal - correct answers with fewer tool calls received higher rewards than correct answers with more.
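A minimal sketch of that reward shaping, assuming a linear efficiency bonus - the post says only that correct answers with fewer tool calls scored higher, so the base value of 1.0 and the 0.1 weight are illustrative assumptions:

```python
def episode_reward(correct: bool, n_calls: int, cap: int = 9,
                   efficiency_weight: float = 0.1) -> float:
    """Reward sketch: correctness gates the reward; unused tool budget
    adds a small bonus. The 0.1 weight and linear shape are assumptions;
    the 9-call cap matches the post."""
    if not correct:
        return 0.0                        # wrong answers earn nothing
    bonus = efficiency_weight * (cap - n_calls) / cap
    return 1.0 + bonus                    # fewer calls -> slightly higher reward
```

Note the pressure this creates: a correct answer at 3 calls outranks one at 7, which rewards elegant research strategies and overconfident shortcuts alike.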
Where it broke
The Palsgraf case is a good example of what goes wrong.
The query asked about liability in Palsgraf v. Long Island Railroad - a foundational torts case taught in every first-year law class. The agent followed correct IRAC methodology. It identified the facts. It stated the issue. It retrieved the case from CourtListener.
Then it applied the wrong doctrine. It cited res ipsa loquitur (the thing speaks for itself) instead of the actual holding, which turns on foreseeability and proximate cause. The analysis was structurally sound but substantively wrong.
The agent used only two tools on this query: get_opinion and search_legal_cases. It didn't cross-reference. It didn't check citing opinions to see how later courts interpreted the holding. The tool coordination patterns that worked on jurisdiction questions - broad search, then narrow, then verify - didn't activate here because the agent had already retrieved the case and treated it as sufficient.
Two observations from this failure. First, the agent's learned efficiency (use fewer tools when you can) becomes a liability when the task requires deeper verification than the surface suggests. A famous case feels familiar; the agent acts on that familiarity instead of verifying. Second, tool access helps most when the agent doesn't already think it knows the answer. For questions where the model has strong parametric priors - even if those priors are wrong - tool use is suppressed rather than amplified.
The compliance gate
The jurisdictional compliance mechanism is the part of this work I think about most in the context of safety.
We implemented a reward penalty that scaled from a 10% reduction for minor jurisdictional confusion to complete nullification for potentially harmful guidance - for example, advising someone to take an action that is legal in one state but criminal in another.
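As a sketch, the gate can be a severity-indexed multiplier on the episode reward. Only the two endpoints come from the post (0.9 for minor confusion, 0.0 for harmful guidance); the intermediate tier and its value are assumptions.

```python
# Severity tiers for jurisdictional errors. "minor" (10% reduction) and
# "harmful" (full nullification) are from the post; "moderate" is an
# assumed intermediate tier with an assumed multiplier.
SEVERITY_MULTIPLIER = {
    "none": 1.0,
    "minor": 0.9,
    "moderate": 0.5,   # assumption
    "harmful": 0.0,
}

def gated_reward(base_reward: float, severity: str) -> float:
    """Scale the episode reward by jurisdictional-error severity."""
    return base_reward * SEVERITY_MULTIPLIER[severity]
```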
During training, the agent triggered this penalty frequently. The 18% activation rate means roughly one in five episodes involved the agent attempting to cross jurisdictional boundaries incorrectly. The penalty didn't just reduce these errors. It changed the agent's research behavior. By late training, jurisdiction-ambiguous queries consistently triggered get_court calls as the first tool action - the agent learned to check where it was before deciding what law to apply.
The final 97% compliance rate is high. But the 3% failure rate on a safety-critical task - applying the wrong state's law - is not zero, and in a deployed legal system, 3% is not acceptable. The honest conclusion is that reward shaping can instill safety behaviors effectively but not perfectly, and the residual error rate in adversarial or ambiguous cases needs additional safeguards.
What I take from this
Three things.
Tool access doesn't improve models uniformly. It disproportionately helps tasks that require coordinating information across sources. Tasks that are already well-served by parametric knowledge see smaller gains. When measuring what an agent can do, the tool environment matters as much as the model weights.
Learned efficiency has a shadow side. An agent trained to minimize unnecessary tool calls will sometimes skip verification on tasks where it's confident but wrong. The same optimization pressure that produces elegant research strategies also produces overconfident shortcuts. Evaluation needs to test specifically for cases where the model's confidence is high and its answer is wrong.
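One way to operationalize that test is to slice evaluation results by confidence and correctness. The field names below are hypothetical, and the 0.9 threshold is an arbitrary choice:

```python
def confident_but_wrong(results: list[dict], threshold: float = 0.9) -> list[dict]:
    """Return eval items where the model was confident but incorrect -
    the slice where learned efficiency is most likely to have suppressed
    tool-based verification. Keys 'confidence', 'pred', and 'gold' are
    hypothetical field names."""
    return [r for r in results
            if r["confidence"] >= threshold and r["pred"] != r["gold"]]
```

On a slice like this, you would expect tool-call counts well below the 5.7 average - the Palsgraf failure used only two.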
Safety behaviors can be shaped through rewards, but the residual error rate matters. 97% jurisdictional compliance sounds good until you remember that the remaining 3% involves giving someone legal guidance under the wrong state's law. For domains where errors have real consequences, reward shaping is a starting point, not a solution.
I don't think current capability evaluations capture these dynamics well. Time-horizon metrics tell you how long an agent can work on a task. They don't tell you how the agent's failure modes change when you hand it a new set of tools, or which tasks become newly possible while others become newly dangerous.
The tool environment is part of the capability. Measure accordingly.