Claude Code's AI Code Review and Karpathy's autoresearch: When AI Writes, Reviews, and Experiments on Its Own
On the same day, two announcements tackled two very different problems. Anthropic shipped a system where AI reviews code that other AI wrote. Andrej Karpathy open-sourced a tool that lets an AI agent iterate through PyTorch training runs all night long. Together, they trace the outline of the same trend: the three core stages of the development cycle — writing code, reviewing it, and experimenting with it — are all moving into AI territory.
Background: Too Much AI-Generated Code
AI coding tools have flooded the zone with pull requests. According to Anthropic, its own engineers’ code output grew 200% over the past year.1 Claude Code alone crossed $2.5 billion in annualized revenue. When reviews can’t keep pace with the volume of code being generated, the resulting technical debt doesn’t sit in a backlog — it ships to production.
Karpathy’s problem sits at the other end. Validating a research idea means cycling through edit → train → evaluate → repeat, hundreds of times. When a human researcher drives that loop manually, GPUs sit idle every time they sleep, eat, or step away.
Code Review: AI Reviewing AI-Written Code
On March 9, 2026, Anthropic launched Code Review as part of Claude Code, available as a research preview for Teams and Enterprise customers.1
How It Works: A Team of Parallel Agents
Earlier AI code review tools worked by running a single model top-to-bottom through a pull request. Code Review takes a different approach. When a PR is opened, multiple agents fan out in parallel, each analyzing the code independently from a different angle. An aggregator agent then consolidates the findings, deduplicates overlapping issues, and sorts everything by severity.
```mermaid
flowchart TD
    PR[Pull Request Opened] --> D{Assess PR Complexity}
    D -->|Simple| FEW[Deploy Fewer Agents]
    D -->|Complex| MANY[Deploy More Agents]
    FEW --> A1[Agent A\nLogic Error Detection]
    MANY --> A1
    MANY --> A2[Agent B\nSecurity Vulnerability Scan]
    MANY --> A3[Agent C\nCross-check Against Legacy Bugs]
    A1 --> AGG[Aggregator Agent\nDeduplicate + Sort by Severity]
    A2 --> AGG
    A3 --> AGG
    AGG --> OUT[PR Comments\nWith Inline Annotations]
    OUT --> HUMAN[Human Reviewer Final Approval]
```
The number of agents deployed scales dynamically with the complexity of the PR. Average review time is around 20 minutes — slower than instant-response tools like GitHub Copilot, but Anthropic argues that depth is the point.2
What It Focuses On: Logic, Not Style
In an interview with TechCrunch, Anthropic’s head of product Cat Wu explained the thinking:
“Developers have already experienced a lot of automated AI feedback, and they’ve grown fatigued by comments they can’t act on immediately. So we made a deliberate choice to focus only on logic errors. That way, the most critical issues surface first.” — Cat Wu, Head of Product, Anthropic1
Issue severity is color-coded:
| Color | Meaning |
|---|---|
| Red | Requires immediate attention |
| Yellow | Potential issue worth reviewing |
| Purple | Related to existing code or historical bugs |
Code Review does not approve PRs. The final call always stays with a human reviewer.
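Anthropic has not published Code Review's internals, but the fan-out-and-aggregate flow described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the agents are stubs standing in for model calls, the findings are invented, and the color labels from the table above are reused as a severity ranking.

```python
import asyncio
from dataclasses import dataclass

# Severity order mirrors the article's color coding: red > yellow > purple.
SEVERITY_RANK = {"red": 0, "yellow": 1, "purple": 2}

@dataclass(frozen=True)  # frozen => hashable, so findings can be deduplicated in a set
class Finding:
    file: str
    line: int
    severity: str
    message: str

async def logic_agent(diff: str) -> list[Finding]:
    # Placeholder: a real agent would prompt a model to hunt for logic errors.
    return [Finding("app.py", 42, "red", "Off-by-one in pagination loop")]

async def security_agent(diff: str) -> list[Finding]:
    # Overlaps with the logic agent on purpose, to show deduplication.
    return [Finding("app.py", 42, "red", "Off-by-one in pagination loop"),
            Finding("db.py", 7, "yellow", "Unparameterized SQL query")]

async def history_agent(diff: str) -> list[Finding]:
    return [Finding("db.py", 7, "purple", "Resembles a previously fixed bug")]

async def review(diff: str) -> list[Finding]:
    # Fan out: every agent analyzes the same diff independently, in parallel.
    results = await asyncio.gather(
        logic_agent(diff), security_agent(diff), history_agent(diff)
    )
    # Aggregate: deduplicate identical findings, then sort by severity.
    merged = {f for agent_findings in results for f in agent_findings}
    return sorted(merged, key=lambda f: (SEVERITY_RANK[f.severity], f.file, f.line))

findings = asyncio.run(review("<pr diff>"))
for f in findings:
    print(f"[{f.severity}] {f.file}:{f.line} {f.message}")
```

The duplicate red finding from the two agents collapses into one, and the aggregated list surfaces red issues first, matching the "most critical issues first" design goal.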
The Numbers
Based on Anthropic’s internal data, only 16% of pull requests received substantive review comments before Code Review was introduced. After deployment, that figure jumped to 54%.2 Fewer changes were being merged without any meaningful review.
Pricing runs $15–25 per review — significantly higher than lightweight alternatives. That price point is a statement: Anthropic is betting on quality over throughput.
“This product is for large enterprise customers — think Uber, Salesforce, Accenture — who are already using Claude Code and need help managing the sheer volume of PRs it generates.” — Cat Wu1
autoresearch: Running Experiments Through the Night
Around the same time, Andrej Karpathy published autoresearch on GitHub in early March 2026.3 The project is roughly 630 lines of Python, and its core question is disarmingly simple: can an AI agent run experiments while the researcher is asleep?
```mermaid
flowchart LR
    H[Human\nWrites program.md] -->|Provides Instructions| AGENT[AI Agent]
    AGENT -->|Modifies train.py| TRAIN[Fixed 5-Minute Training Run]
    TRAIN -->|Measure val_bpb| EVAL{Improvement?}
    EVAL -->|Yes — Commit| AGENT
    EVAL -->|No — Rollback| AGENT
    AGENT -->|Repeat Loop| TRAIN
```
The Design: Three Files, Deliberately Simple
The project is intentionally minimal:
- `prepare.py`: Downloads data and trains the tokenizer. The agent never touches this.
- `train.py`: Contains the full GPT model, optimizer, and training loop. This is the only file the agent modifies: architecture, hyperparameters, batch size, and optimizer choices are all fair game.
- `program.md`: The research brief, written by a human and refined over time.
The agent modifies train.py, runs training for exactly five minutes, and checks val_bpb (validation bits per byte) — lower is better. If the result beats the previous run, it commits and moves on. If not, it rolls back and tries something else.
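Bits per byte is a tokenizer-independent way to score a language model: it converts the validation loss into how many bits the model needs per byte of raw text. A sketch of the conversion, assuming the loss is mean cross-entropy in nats per token (the function name and contract here are illustrative, not autoresearch's actual code):

```python
import math

def val_bpb(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    total_bits = loss_nats_per_token * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Example: 1.386 nats/token over 1,000 tokens covering 4,000 bytes of text
print(round(val_bpb(1.386, 1_000, 4_000), 3))  # -> 0.5
```

Because the denominator is bytes of raw text rather than tokens, the agent can change the tokenizer-facing parts of the model without gaming the metric.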
At that cadence, the agent can run roughly 12 experiments per hour — over 100 in a single overnight session.3
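The control loop itself is simple enough to sketch. This is a schematic of the commit-or-rollback logic described above, not the repository's actual code: the callables are injected so the loop can run without a GPU or a git checkout (in the real tool, committing and rolling back happen via git, and training is the fixed five-minute run).

```python
def experiment_loop(propose_edit, train_fn, commit, rollback,
                    n_rounds: int, best_bpb: float) -> float:
    """Keep an edit only if it improves validation bits per byte."""
    for _ in range(n_rounds):
        propose_edit()          # agent rewrites train.py
        bpb = train_fn()        # fixed 5-minute run, returns val_bpb
        if bpb < best_bpb:      # lower is better
            best_bpb = bpb
            commit(bpb)         # keep the improvement (git commit)
        else:
            rollback()          # discard the edit (git checkout)
    return best_bpb

# Dry run with canned results standing in for real training runs:
results = iter([1.00, 0.99, 1.02, 0.97])
log: list[str] = []
best = experiment_loop(
    propose_edit=lambda: None,
    train_fn=lambda: next(results),
    commit=lambda bpb: log.append(f"commit {bpb:.2f}"),
    rollback=lambda: log.append("rollback"),
    n_rounds=4,
    best_bpb=1.00,
)
print(best)  # -> 0.97
```

The greedy accept/reject rule means a regression never survives past one round, which is what makes an unattended overnight session safe: the worst case is a night of rollbacks, never a degraded checkpoint.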
Early Results and Real-World Use
In Karpathy’s initial published experiments, the agent autonomously reduced val_bpb from 1.0 to 0.97.4 His README opens with a striking framing:
“Once upon a time, frontier AI research was carried out by human computers in the gaps between eating, sleeping, and other pleasures of life. That era ended long ago. Research is now entirely the domain of autonomous AI agent swarms running on vast compute clusters high up in the sky.” — @karpathy, March 20263
Shopify CEO Tobi Lutke applied autoresearch to his own project and reported a 19% improvement in validation scores. An agent-optimized smaller model outperformed a larger model tuned by hand.4 Karpathy has since integrated some of the agent-discovered improvements into his broader nanochat framework.
The Design Philosophy Both Projects Share
On the surface, these two projects look very different. Code Review is a paid enterprise product wired into large-scale engineering workflows. autoresearch is an open-source experiment that runs on a single GPU. But structurally, they’re built around the same idea.
| | Code Review | autoresearch |
|---|---|---|
| Human’s Role | Final approval | Writing program.md |
| AI’s Role | Parallel review + severity ranking | Code edits + iterative experiments |
| Key Metric | Logic error count and severity | val_bpb |
| Loop Structure | PR opened → analysis → comments | Edit → train → evaluate → repeat |
| Human Intervention Point | End of loop | Loop design phase |
Neither system removes humans from the loop entirely. Humans design the loop and make the final call. AI handles the repetitive work inside it. This agentic architecture is already taking shape across multiple domains.
[!KEY] Code Review is an attempt to solve the AI-generated code overflow problem with more AI. The goal is to restore balance between the speed of production and the speed of review.
Open Questions
What these tools actually mean in practice is still being tested.
For Code Review, the key question is whether $15–25 per review delivers enough value to justify the cost. A 20-minute turnaround only makes sense in workflows that prioritize depth over speed, which may limit how broadly it gets adopted.
autoresearch raises a more fundamental question. Architectural improvements an agent discovers in small-scale experiments may not transfer cleanly to large-scale production models. Karpathy himself called this a “starting point.” Whether the approach gets more interesting as program.md evolves into a kind of “research organization code” and more agents are added is something only time will tell.
[!KEY] The core innovation in autoresearch is the fixed five-minute training window. It makes every experiment directly comparable regardless of platform, and establishes the minimum viable unit for meaningful autonomous iteration by an AI agent.
Footnotes
1. Cat Wu (Anthropic), TechCrunch, “Anthropic launches code review tool to check flood of AI-generated code,” 2026-03-09. https://techcrunch.com/2026/03/09/anthropic-launches-code-review-tool-to-check-flood-of-ai-generated-code/
2. VentureBeat, “Anthropic rolls out Code Review for Claude Code,” 2026-03-09. https://venturebeat.com/technology/anthropic-rolls-out-code-review-for-claude-code-as-it-sues-over-pentagon
3. Andrej Karpathy, GitHub, “karpathy/autoresearch,” 2026-03. https://github.com/karpathy/autoresearch
4. MarkTechPost, “Andrej Karpathy Open-Sources ‘Autoresearch’: A 630-Line Python Tool Letting AI Agents Run Autonomous ML Experiments on Single GPUs,” 2026-03-08. https://www.marktechpost.com/2026/03/08/andrej-karpathy-open-sources-autoresearch-a-630-line-python-tool-letting-ai-agents-run-autonomous-ml-experiments-on-single-gpus/