From AlphaFold to Autoresearch: AI Is Learning to Do Science

March 2026

There's a ladder of levels that DeepMind and OpenAI independently converged on. It maps AI progress in stages, and despite the two organizations developing their frameworks separately, they arrived at nearly identical structures. It's worth understanding before we go further.


The Levels

DeepMind published their framework in November 2023 ("Levels of AGI," Morris et al.). OpenAI followed with a similar version internally in July 2024. Both describe five stages of AI capability, each defined by what the system can do relative to a skilled human.

Level 1 - Conversational AI. The system can interact in natural language and performs at or somewhat above the level of an unskilled human on most tasks. Early ChatGPT, basic chatbots. DeepMind calls this "Emerging." This is where consumer AI lived in 2022-2023.

Level 2 - Reasoners. The system performs at or above the 50th percentile of skilled adults on most tasks. It can reason through multi-step problems, write competent code, analyze documents. DeepMind calls this "Competent." Most current frontier models - Claude, GPT-4, Gemini - operate at this level broadly.

Level 3 - Experts / Agents. The system reaches the 90th percentile of skilled adults. It can take sustained autonomous action - not just answer a question, but execute a multi-step plan over hours or days. DeepMind calls this "Expert," OpenAI calls it "Agents." Claude Opus 4.6 has a METR 50%-time horizon of 14.5 hours. Claude Code chains 21+ tool calls without human intervention. We are solidly here for many tasks.

Level 4 - Virtuosos / Innovators. The system reaches the 99th percentile - or produces genuinely novel contributions to a field. DeepMind's term is "Virtuoso," OpenAI's is "Innovator." This is the level where AI doesn't just apply existing knowledge well but creates new knowledge. The difference between a skilled practitioner and someone who advances the field.

Level 5 - Organizations. The system can do the work of an entire organization. DeepMind calls this "Superhuman" - performance that exceeds all humans in the domain.

Two things to note. First, a system can be at different levels for different tasks - Level 3 at coding, Level 2 at creative writing, Level 1 at physical reasoning. The levels describe capability per domain, not a single number. Second, the jump from Level 3 to Level 4 is qualitatively different from the earlier jumps. Going from Level 1 to 3 is about doing existing tasks better and faster. Going from Level 3 to 4 is about producing something new - a discovery, an invention, an insight that didn't exist before.

Level 4 is where things get interesting. And we're watching it arrive right now. Not all at once, but domain by domain, system by system. The clearest evidence is in automated scientific discovery.

This is the story of how it happened - from a protein-folding breakthrough to an AI agent running overnight experiments on your GPU.


Act 1: DeepMind Solves Specific Problems at Superhuman Level

AlphaFold (2020): Level 5 in a Single Domain

In November 2020, AlphaFold 2 entered the 14th Critical Assessment of protein Structure Prediction (CASP14) - a biennial blind competition where computational biologists try to predict protein structures.

It didn't just win. It broke the competition.

AlphaFold achieved a median GDT score of 92.4 out of 100 across all targets. For context, a GDT above 90 is considered competitive with experimental accuracy, and previous methods had plateaued around 60 for the hardest targets. On CASP's summed z-score ranking, AlphaFold scored 244.0; the next best group scored 90.8. CASP organizers proclaimed that protein structure prediction had been "largely solved."

Demis Hassabis and John Jumper won the 2024 Nobel Prize in Chemistry for this work.

Here's what matters for our story: AlphaFold is Level 5 - superhuman - but only for protein folding. It can't discover new drugs. It can't design experiments. It can't do anything other than predict how a protein folds from its amino acid sequence. It does that one thing better than any human who has ever lived.

GNoME (2023): The Same Pattern, New Domain

DeepMind's Graph Networks for Materials Exploration (GNoME) applied the same principle to materials science. Published in Nature in November 2023, GNoME predicted 2.2 million new crystal structures judged stable by current computational standards - equivalent to roughly 800 years' worth of discoveries at the pace of traditional experimental approaches.

Of those, 380,000 are the most stable, and 736 had already been independently created in labs by external researchers. DeepMind contributed the predictions to the Materials Project, an open-access database used by over 400,000 researchers.

Same pattern as AlphaFold: superhuman performance, single domain, closed evaluation loop (you can computationally verify whether a crystal structure is stable).

FunSearch (2023): LLMs Enter the Loop

FunSearch changed the game. Published in Nature alongside GNoME, it paired a pre-trained LLM with an automated evaluator to search for novel mathematical constructions.

The result: FunSearch discovered new constructions for the cap set problem in extremal combinatorics that surpassed the best-known solutions - including the largest improvement in the asymptotic lower bound in 20 years. It also found new heuristics for online bin packing that outperformed widely-used baselines.

What made FunSearch different from AlphaFold and GNoME:

FunSearch doesn't output "here's the answer." It outputs code that generates the answer. The programs are interpretable - researchers could inspect the discovered code and identify a new mathematical symmetry they hadn't previously known about.

This is the bridge to what comes next. FunSearch showed that LLMs, when paired with the right evaluation loop, can make genuine scientific discoveries. Not by being brilliant reasoners, but by being creative generators whose outputs are rigorously filtered.
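The pattern is easy to sketch. Below is a toy version of the generate-and-filter loop - my illustration, not DeepMind's code. Two hand-written bin-packing heuristics stand in for LLM-generated programs; an automated evaluator scores each one on the same random instances, and the best survives.

```python
import random

def evaluate(heuristic, n_trials=50, bin_cap=1.0):
    """Score a bin-packing heuristic on random instances.
    Fewer bins used means a better (higher) score."""
    rng = random.Random(0)   # identical instances for every candidate
    total_bins = 0
    for _ in range(n_trials):
        items = [rng.uniform(0.1, 0.7) for _ in range(30)]
        bins = []            # remaining capacity of each open bin
        for item in items:
            fits = [i for i, cap in enumerate(bins) if cap >= item]
            if fits:
                # place the item in the bin the heuristic scores highest
                best_bin = max(fits, key=lambda i: heuristic(item, bins[i]))
                bins[best_bin] -= item
            else:
                bins.append(bin_cap - item)
        total_bins += len(bins)
    return -total_bins

# Two hand-written heuristics standing in for LLM-generated programs.
def first_fit(item, cap):
    return 0.0               # all fitting bins tie -> first fitting bin wins

def best_fit(item, cap):
    return -(cap - item)     # prefer the tightest fit

pool = {"first_fit": evaluate(first_fit), "best_fit": evaluate(best_fit)}
best = max(pool, key=pool.get)
```

In the real system the candidate pool is refreshed by an LLM that mutates the best programs found so far; here the pool is fixed, but the division of labor is the same - the generator proposes, the evaluator decides.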


Act 2: Sakana Automates the Full Research Process

AI Scientist v1 (August 2024): The Ambition

In August 2024, Sakana AI released The AI Scientist - the first system designed to automate the entire scientific research lifecycle. Not just prediction. Not just search. Everything: idea generation, literature search, experiment design, code implementation, results analysis, figure generation, paper writing, and automated peer review.

The system could generate a complete machine learning research paper for approximately $15 per paper, discovering novel contributions in areas like diffusion models, transformers, and grokking.

But v1 had real limitations. It automated the form of research better than the substance: papers looked like papers, but the insights were often shallow.

AI Scientist v2 (April 2025): The Breakthrough

Eight months later, Sakana released v2 with three fundamental improvements.

First, they eliminated human templates. V1 needed researchers to provide a starting codebase. V2 generates its own experimental code from a general idea description, making it deployable across diverse ML domains without human scaffolding.

Second, they introduced agentic tree search. Instead of running experiments linearly, v2 uses a progressive tree-search algorithm managed by a dedicated experiment manager agent. The system branches experiments, debugs failures, refines approaches, and selects the best-performing path - then drills deeper. This is fundamentally different from "run experiment, write paper." It's iterative scientific exploration.

Third, they added VLM feedback. A Vision-Language Model reviews figures and manuscript aesthetics iteratively, closing the loop that v1 couldn't.
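The tree-search idea can be sketched in a few lines - this is an illustration of the concept, not Sakana's implementation. Nodes are experiment configurations; the most promising node is repeatedly expanded into variants, and a stand-in `run_experiment` plays the role of an actual training run.

```python
import heapq
import itertools
import math
import random

def run_experiment(config):
    """Stand-in for a real training run: returns a 'loss' (lower is
    better). Toy objective with its optimum at lr=0.01, width=256."""
    return math.log10(config["lr"] / 0.01) ** 2 + (config["width"] / 256 - 1) ** 2

def propose_variants(config, rng):
    """Stand-in for the agent proposing follow-up experiments."""
    variants = []
    for _ in range(3):
        c = dict(config)
        c["lr"] *= rng.choice([0.5, 1.0, 2.0])
        c["width"] = max(32, c["width"] + rng.choice([-64, 0, 64]))
        variants.append(c)
    return variants

def tree_search(root, budget=30, seed=0):
    rng = random.Random(seed)
    tiebreak = itertools.count()             # heap tie-breaker for equal losses
    best_loss = run_experiment(root)
    best_cfg = root
    frontier = [(best_loss, next(tiebreak), root)]
    for _ in range(budget):
        _, _, cfg = heapq.heappop(frontier)  # expand the most promising node
        for child in propose_variants(cfg, rng):
            child_loss = run_experiment(child)
            heapq.heappush(frontier, (child_loss, next(tiebreak), child))
            if child_loss < best_loss:
                best_loss, best_cfg = child_loss, child
    return best_cfg, best_loss

cfg, loss = tree_search({"lr": 0.1, "width": 128})
```

The key property: because expansion always returns to the best-scoring node anywhere in the tree, a branch that stalls gets abandoned automatically - which is what distinguishes this from a linear "run experiment, write paper" pipeline.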

The result: Sakana submitted three fully autonomous manuscripts to a peer-reviewed workshop at ICLR 2025. One manuscript - "Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization" - achieved an average reviewer score of 6.33, exceeding the average human acceptance threshold.

This was the first fully AI-generated paper to pass real peer review.

The paper reported a negative result - it found that a promising approach to regularization didn't work as expected. The fact that an AI system discovered, reported, and correctly framed a negative result through peer review is arguably more impressive than if it had found a positive one. It demonstrated genuine scientific reasoning, not just metric optimization.


Act 3: Karpathy Strips It Down to the Core

In March 2026, Andrej Karpathy released autoresearch - and took the opposite approach from Sakana. Instead of automating the full research lifecycle, he isolated a single question: what happens when you give an AI agent a real training setup and let it experiment autonomously?

The setup is deliberately minimal - just three files. The agent modifies train.py, trains for exactly 5 minutes, checks whether validation loss improved, keeps or discards the change, and repeats. No paper writing. No literature review. No figure generation. Just: modify, train, measure, iterate.

The constraint is elegant: fixed 5-minute wall-clock budget per experiment, regardless of what the agent changes. This means roughly 12 experiments per hour, roughly 100 experiments overnight. Architecture, hyperparameters, optimizer, batch size - everything is fair game. The single metric is val_bpb (validation bits per byte), lower is better.

And here's the insight that matters: you're not programming the Python. You're programming the program.md. The human writes meta-instructions - the research strategy, the priorities, the constraints. The agent does the science. Karpathy describes it as finding "the research org code that achieves the fastest research progress."
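The whole loop fits in a few lines. This is a toy reconstruction from the description above, not Karpathy's code: `train_for_budget` fakes a fixed-budget training run that returns a val_bpb, and `mutate` stands in for the agent editing train.py between runs.

```python
import math
import random

def train_for_budget(config, rng):
    """Stand-in for one fixed-budget training run. Returns a fake val_bpb
    (lower is better): a toy objective that rewards a learning rate near
    3e-4 and a larger batch size, plus run-to-run noise."""
    noise = rng.gauss(0, 0.002)
    return (1.0
            + 0.1 * math.log10(config["lr"] / 3e-4) ** 2
            - 0.0001 * config["batch_size"]
            + noise)

def mutate(config, rng):
    """Stand-in for the agent editing train.py between experiments."""
    c = dict(config)
    c["lr"] *= rng.choice([0.5, 1.0, 2.0])
    c["batch_size"] = max(8, c["batch_size"] + rng.choice([-8, 0, 8]))
    return c

def research_loop(config, n_experiments=100, seed=0):
    rng = random.Random(seed)
    best_bpb = train_for_budget(config, rng)    # baseline run
    for _ in range(n_experiments):
        candidate = mutate(config, rng)
        bpb = train_for_budget(candidate, rng)  # fixed budget per experiment
        if bpb < best_bpb:                      # keep only if val_bpb improved
            config, best_bpb = candidate, bpb   # otherwise discard the edit
    return config, best_bpb

cfg, bpb = research_loop({"lr": 1e-3, "batch_size": 32})
```

In the real system the mutation step is an LLM agent with access to the whole codebase, so "everything is fair game" - but the control flow is exactly this: propose, train under a fixed wall-clock budget, compare one number, keep or revert.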

39,000 GitHub stars in weeks. The idea clearly resonated.


Where Are We on the AGI Levels?

Looking at the evidence across these systems:

Level 3 is established. As described above, we're solidly here for many professional tasks.

Level 4 is emerging, unevenly.

AlphaFold and GNoME achieved Level 5 - superhuman - within their specific domains. But they can't generalize. FunSearch showed Level 4 capability in mathematical discovery - genuine innovation, but only when paired with the right evaluation loop. Sakana's AI Scientist v2 produced Level 4 output - a peer-reviewed contribution to scientific knowledge - but at workshop level, not at the level of proposing transformers or discovering new physics.

The pattern: Level 4 is arriving domain by domain, not all at once. Some domains (protein structure, crystal stability) are already at Level 5. Others (automated ML research) are at early Level 4. The frontier is uneven.

Here's where I think the field is underappreciating something: the evaluation loop is the bottleneck, not the model. AlphaFold, GNoME, and FunSearch all succeeded because verification is cheap and automated in their domains. You can computationally check whether a protein structure is stable, whether a crystal is thermodynamically favorable, whether a mathematical construction satisfies the constraints. The model generates candidates; the evaluator filters them. The tighter and cheaper the evaluation loop, the faster the domain reaches Level 4.
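To make "cheap verification" concrete: checking whether a candidate solution to the cap set problem is valid takes only a few lines, while finding large cap sets is the hard part FunSearch tackled. Here's a minimal verifier - my illustration, not FunSearch's actual evaluator.

```python
from itertools import combinations

def is_cap_set(points, n):
    """Verify a candidate cap set in (Z_3)^n: the points must be distinct
    n-dimensional vectors mod 3, and no three distinct points may sum to
    the zero vector coordinate-wise (i.e., no three lie on a line)."""
    if len(set(points)) != len(points):
        return False
    for a, b, c in combinations(points, 3):
        if all((a[i] + b[i] + c[i]) % 3 == 0 for i in range(n)):
            return False
    return True

# A valid cap set of size 4 in dimension 2 ...
assert is_cap_set([(0, 0), (1, 0), (0, 1), (1, 1)], 2)
# ... versus three collinear points, which any verifier must reject.
assert not is_cap_set([(0, 0), (1, 1), (2, 2)], 2)
```

An evaluator like this is exact, fast, and requires no human judgment - which is precisely the property that social science, clinical efficacy, and policy analysis lack.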

This has a corollary that most AI progress narratives ignore: domains where evaluation is expensive, slow, or ambiguous - social science, drug efficacy in humans, policy analysis, most of the humanities - may not reach Level 4 through this paradigm at all. Not because the models aren't capable enough, but because there's no cheap way to verify whether the model's output is a genuine contribution. Peer review at scale is slow. Clinical trials take years. Replication requires resources.

The implication is that Level 4 may arrive as a permanent patchwork - superhuman in some domains, barely competent in others - rather than as a uniform frontier that advances together. The AGI levels framework, as presented by both DeepMind and OpenAI, implies a single ladder. The evidence suggests it's more like separate ladders, one per domain, with the rungs spaced differently depending on how easy it is to check the answer.


What's Next

The generality question. Sakana and Karpathy are pushing toward general-purpose research agents. But there's a meaningful gap between "automate ML experiments" and "do science." Real scientific discovery often requires pulling in ideas from adjacent fields, recognizing unexpected connections, and changing research direction based on what you've read. Whether current architectures can develop this kind of judgment - or whether it requires fundamentally different approaches - remains an open empirical question. Researchers like Yann LeCun have argued that autoregressive LLMs lack the world models necessary for genuine understanding (LeCun, 2022); others, including work on world models from Li Fei-Fei's group and physical intelligence research, suggest that grounding language models in perception and action may close this gap. The honest answer is that nobody knows yet which path leads further.

The compounding effect. Karpathy's epigraph to autoresearch imagines autonomous AI agent swarms running research across "compute cluster megastructures in the skies," with a self-modifying codebase that has "grown beyond human comprehension." That's science fiction today. But the trajectory from AlphaFold to FunSearch to AI Scientist to autoresearch is unmistakable: each step removed another layer of human scaffolding while expanding what the AI could discover.

The evaluation bottleneck. If the argument above is right - that the evaluation loop determines when Level 4 arrives in a given domain - then the most important near-term research may not be in model capability at all. It may be in building better, cheaper, faster evaluation for domains that currently lack it. The group that figures out how to make drug discovery or materials testing as cheap to verify as mathematical proofs will unlock more Level 4 progress than any model improvement.

We are watching AI learn to do science. It started with proteins. It's now writing papers. The pace depends on whether we can build the evaluation infrastructure fast enough to keep up.