There are two ways to make an AI agent coordinate across tools effectively.
Path 1: Orchestrate it. Design a pipeline. Define stages. Decide which tool gets used when. Let a powerful general-purpose model execute the plan.
Path 2: Let it learn. Give the agent tools. Give it a reward signal. Let reinforcement learning figure out the strategy.
Both paths work. Both are producing real results right now. The interesting questions are about where they converge, where they diverge, and what each can do that the other can't.
Two Paradigms in the Wild
Orchestrated: Sakana's AI Scientist
Sakana's AI Scientist v2 (Yamada et al., 2025) automates the full research lifecycle through a designed pipeline: idea generation, experiment code, tree-search over variations, results analysis, paper writing, peer review.
The pipeline is sophisticated. The tree search over experiments is genuinely innovative - branching, debugging, refining, selecting the best path and drilling deeper. A dedicated experiment manager agent coordinates the stages.
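The branch-evaluate-drill-deeper loop can be sketched as a best-first search. This is a toy illustration only, not Sakana's implementation: `propose_variant` and `evaluate` are hypothetical stand-ins for the model proposing experiment variations and for actually running and scoring an experiment.

```python
import heapq

def propose_variant(config, k=2):
    # Hypothetical stand-in for the LLM proposing tweaked experiment configs.
    return [{**config, "lr": config["lr"] * f} for f in (0.5, 2.0)[:k]]

def evaluate(config):
    # Hypothetical stand-in for running the experiment and scoring the result.
    # In this toy, the score peaks at lr == 0.01.
    return -abs(config["lr"] - 0.01)

def tree_search(root, budget=20):
    """Best-first search over experiments: expand the most promising
    configuration, branch into variants, and drill deeper from the best."""
    frontier = [(-evaluate(root), 0, root)]  # max-heap via negated scores
    best, counter = root, 1
    while frontier and budget > 0:
        neg_score, _, node = heapq.heappop(frontier)
        if -neg_score > evaluate(best):
            best = node
        for child in propose_variant(node):
            heapq.heappush(frontier, (-evaluate(child), counter, child))
            counter += 1
            budget -= 1
    return best

best = tree_search({"lr": 0.08})
```

The selection pressure comes entirely from `evaluate`; the quality of the search is bounded by the quality of that scoring step, which is why the tight evaluation loop matters so much.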
But the research strategy - the decision of what to try, in what order, with which tools - is architecturally determined. The system uses Semantic Scholar for literature, code execution for experiments, LaTeX for writing. The pipeline defines when each tool gets called. The model's creativity lives within the stages, not in choosing the stages.
This is powerful. It produced the first fully AI-generated paper to pass peer review at an ICLR 2025 workshop. The pipeline works.
Learned: Search-Augmented RL
A different line of work - Search-R1 (Jin et al., 2025), R1-Searcher (Song et al., 2025), and extensions into multi-tool environments - takes the opposite approach. Instead of designing a pipeline, you give the agent access to tools during reinforcement learning training and let it discover its own strategies.
Search-R1 demonstrated that integrating a search engine directly into the RL training loop produces significant improvements - 41% on Qwen2.5-7B and 20% on Qwen2.5-3B over RAG baselines across seven QA datasets. The key insight is that the search engine becomes part of the environment, not a feature of the prompt. The model learns a policy for when to search, what to search for, and how to incorporate results into its reasoning.
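"Part of the environment" has a concrete shape: during each rollout, generation is interrupted whenever the policy emits a search action, the retrieved text is appended to the context, and generation resumes. Here is a minimal sketch of that interleaving, with `generate_step` and `retrieve` as hypothetical stand-ins for the policy model and the search engine:

```python
def generate_step(context):
    # Hypothetical policy step: decide to search, or emit a final answer.
    if "retrieved:" not in context:
        return ("search", "capital of France")
    return ("answer", "Paris")

def retrieve(query, corpus):
    # Hypothetical search engine: substring lookup over a tiny corpus.
    return [doc for doc in corpus if query in doc]

def rollout(question, corpus, max_turns=4):
    """Interleave generation and search. Retrieved text is appended to the
    context, so the search engine acts as part of the environment and the
    policy gradient flows through the decision of when and what to search."""
    context = question
    for _ in range(max_turns):
        action, payload = generate_step(context)
        if action == "search":
            docs = retrieve(payload, corpus)
            context += "\nretrieved: " + " | ".join(docs)
        else:
            return payload, context
    return None, context

corpus = ["The capital of France is Paris.", "Berlin is in Germany."]
answer, trace = rollout("What is the capital of France?", corpus)
# Outcome reward on the final answer is what trains the search policy.
reward = 1.0 if answer == "Paris" else 0.0
```

In the real training loop, the reward on the final answer is the only supervision; the search behavior is whatever policy that reward shapes.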
In my thesis at Cornell, I extended this paradigm to legal reasoning with 18 specialized tools - legal databases, web search, citation analysis, document summarization. The agent learned through GRPO which tools to use for which types of questions, when to cross-reference, and when to stop searching. Nobody programmed these patterns. They emerged from training. (The details of what the agent learned - including where it succeeded and where it failed - are in a separate post: What Tool Access Actually Changes.)
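The GRPO mechanic behind this is compact enough to sketch. For each question, a group of rollouts is sampled and each rollout's reward is normalized against its own group, which removes the need for a learned value critic. This is a minimal illustration of the advantage computation only (sampling, tool execution, and the policy-gradient update are omitted):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward against
    the mean and std of its own group, so rollouts that beat their peers
    get positive advantage and no separate value model is needed."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One question, four sampled rollouts with different tool-use trajectories:
group_rewards = [1.0, 0.0, 1.0, 0.0]  # e.g. outcome reward: answer correct?
advs = grpo_advantages(group_rewards)
```

Because the comparison is within-group, trajectories that reach the right answer with different tool sequences compete directly against each other, which is how coordination patterns get selected rather than programmed.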
Karpathy's Autoresearch: A Third Point
Karpathy's autoresearch is interesting because it doesn't fit either category cleanly. The agent modifies code, trains, measures, and iterates - a loop, not a pipeline. But the loop is simple enough that the "strategy" is mostly generated by the underlying model's reasoning at each step, guided by the human-written program.md.
There's no RL training loop. There's no designed pipeline. There's a capable model, a clear metric, and a constraint (5-minute budget). The model's general intelligence does the rest.
This raises the most important question in this entire discussion: as foundation models get more capable, does the distinction between orchestrated and learned matter?
Taking the Karpathy Challenge Seriously
It's tempting to dismiss this question by pointing to domain-specific gains from RL. But that's not a serious engagement with what autoresearch demonstrates.
Autoresearch works because Claude or GPT-4 class models have already been trained with enormous compute, including extensive RLHF and tool-use optimization. These models aren't naive about tool coordination - they've been trained on millions of examples of multi-step reasoning, code modification, and iterative problem-solving. When Karpathy gives such a model a clear metric and a tight feedback loop, it performs well without any additional domain-specific training.
So the honest version of the question is: given that frontier models keep improving their general tool-use capability through pre-training and RLHF, at what point does the marginal value of domain-specific RL training shrink to zero?
I don't think we're there yet, and here's the empirical reason. In our legal reasoning experiments, training-time tool integration outperformed inference-time multi-tool access on the same model and the same benchmark. The trained agent used fewer tool calls per query while achieving higher accuracy. It developed task-specific coordination patterns - different sequences for jurisdiction questions versus precedent analysis versus general consultation - that the same model couldn't produce at inference time with the same tools available.
But I want to be precise about what this shows and what it doesn't. It shows that for a 3B parameter model (Qwen 2.5 3B) operating in a specialized domain with 18 tools and a constrained budget, RL training produced measurably better tool coordination than inference-time reasoning. It does not show that this advantage would persist at larger model scales. Search-R1's own empirical study (Jin et al., 2025) found that larger models show bigger performance gaps between trained and untrained search - suggesting the advantage may grow with scale, not shrink. But neither result has been tested at the frontier model scale where Karpathy's autoresearch operates. That experiment hasn't been run.
The honest answer: the distinction between orchestrated and learned probably matters most at smaller model scales and in domains with many specialized tools where the optimal coordination pattern is non-obvious. As models get larger and more generally capable, the gap may narrow for simple tool environments. Whether it narrows for complex, multi-tool environments with domain-specific constraints (like jurisdictional compliance in law) is an open empirical question.
Where They Converge
Both paradigms share a core insight: tool access expands the capability envelope.
An agent with access to a legal database can answer questions it literally couldn't answer without one. An agent with Semantic Scholar access can find related work it didn't see in training. An agent with a code executor can test hypotheses instead of guessing.
Whether the tool strategy is orchestrated or learned, the expansion is real. FunSearch pairs an LLM with an evaluator - that's orchestration - and produces genuinely novel mathematical discoveries. Multi-tool RL agents learn their own coordination - that's learned - and produce efficient, task-adapted research strategies.
Both work. The capability expansion from tool access is the constant.
Where They Diverge
Adaptability
Orchestrated pipelines excel when you know the task structure in advance. Scientific research has a known structure: hypothesis, experiment, analysis, communication. Sakana's pipeline encodes this beautifully.
Learned strategies may excel when the task structure varies significantly. Legal reasoning across four different task types - judicial reasoning, precedent analysis, opinion generation, general consultation - required different coordination patterns for each. The RL training loop discovered this variation; a pipeline designer would have had to anticipate it.
The Evaluation Loop Dependency
Both paradigms depend on evaluation quality, but in different ways. Orchestrated systems need well-defined stage transitions - "when is the literature review complete enough to start experimenting?" Learned systems need reward signals that capture what matters - a poorly designed reward function produces agents that game the metric rather than solve the problem. Goodhart's Law applies to both, but at different levels of abstraction.
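Reward gaming is easy to demonstrate with a toy example. The two reward functions below are hypothetical, not drawn from any cited system: the first rewards tool activity itself and gets gamed by search spam; the second rewards the verified outcome and charges for tool use, which flips the ordering.

```python
def naive_reward(answer, tool_calls):
    # Gameable: rewards tool activity itself, so an agent can score highly
    # by spamming searches without ever answering correctly.
    return 0.1 * len(tool_calls)

def outcome_reward(answer, gold, tool_calls, cost_per_call=0.02):
    # Less gameable: reward the verified outcome, charge for each tool call.
    correct = 1.0 if answer == gold else 0.0
    return correct - cost_per_call * len(tool_calls)

# Under the naive reward, a spammy wrong trajectory beats a correct one:
assert naive_reward("wrong", ["search"] * 20) > naive_reward("Paris", ["search"])
# Under the outcome reward, the ordering flips:
assert outcome_reward("Paris", "Paris", ["search"]) > \
       outcome_reward("wrong", "Paris", ["search"] * 20)
```

The per-call cost is itself a design choice with the same failure mode in reverse: set it too high and the agent learns to answer without searching at all.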
Open Questions
1. Can learned tool strategies transfer across domains?
If an agent learns efficient research strategies in legal reasoning - when to search broadly vs. narrowly, when to cross-reference, when to stop - do those meta-strategies transfer to medical diagnosis, or financial research, or materials science?
The analogy: a skilled researcher who switches fields doesn't start from zero. They bring meta-skills about how to do research that transfer. Whether RL-trained agents develop similar transferable research skills is an empirical question nobody has tested rigorously. The closest evidence is R1-Searcher++'s (Song et al., 2025) finding that their framework generalizes from in-domain to out-of-domain datasets, but that's within QA tasks, not across domains.
2. What's the right evaluation framework for research agents?
METR measures time horizons - how long a task an agent can complete. Sakana measures peer review acceptance. Karpathy measures validation loss improvement. These are all useful, but they measure different things.
For a system that's supposed to do science, what should we measure? Novel insights per compute dollar? Reproducibility of findings? Quality of the research strategy itself, independent of the outcome?
3. When do orchestrated and learned approaches merge?
I suspect the future isn't "either/or." The most capable research agents will likely combine orchestrated high-level structure (the scientific method has stages for a reason) with learned low-level strategy (which tool to use, how to formulate queries, when to change direction).
Sakana's v2 tree search over experiments is already a step in this direction - an orchestrated pipeline with a learned search component. The natural extension is making more of the pipeline learnable while keeping the high-level structure intact. Whether this hybrid approach outperforms either pure paradigm is, again, an empirical question that hasn't been answered.
4. How do we evaluate research agents for safety?
As we push into Level 4 - AI that innovates - how do we evaluate whether an innovation-capable agent is safe to deploy?
METR and Transluce are building evaluation infrastructure for agent capabilities. But evaluating a research agent is harder than evaluating a coding agent. A coding agent's output is testable - does the code run? A research agent's output is a claim about the world - and verifying claims requires domain expertise, replication, and time.
The systems that solved this best - AlphaFold, GNoME, FunSearch - all had tight evaluation loops. The discovery was immediately verifiable. Extending this to domains with slower, more ambiguous evaluation is one of the hardest open problems in the field.