# Claude Code's AI Code Review and Karpathy's autoresearch: When AI Writes, Reviews, and Experiments on Its Own

*AI News · Claude Code · autoresearch · code review · AI agent · Karpathy*

On the same day, two announcements tackled two very different problems. Anthropic shipped a system where AI reviews code that other AI wrote. Andrej Karpathy open-sourced a tool that lets an AI agent iterate through PyTorch training runs all night long. Together, they trace the outline of the same trend: the three core stages of the development cycle — writing code, reviewing it, and experimenting with it — are all moving into AI territory.


## Background: Too Much AI-Generated Code

AI coding tools have flooded the zone with pull requests. According to Anthropic, its own engineers’ code output grew 200% over the past year.1 Claude Code alone crossed $2.5 billion in annualized revenue. When reviews can’t keep pace with the volume of code being generated, the resulting technical debt doesn’t sit in a backlog — it ships to production.

Karpathy’s problem sits at the other end. Validating a research idea means cycling through edit → train → evaluate → repeat, hundreds of times. When a human researcher drives that loop manually, GPUs sit idle every time they sleep, eat, or step away.


## Code Review: AI Reviewing AI-Written Code

On March 9, 2026, Anthropic launched Code Review as part of Claude Code, available as a research preview for Teams and Enterprise customers.1

### How It Works: A Team of Parallel Agents

Earlier AI code review tools worked by running a single model top-to-bottom through a pull request. Code Review takes a different approach. When a PR is opened, multiple agents fan out in parallel, each analyzing the code independently from a different angle. An aggregator agent then consolidates the findings, deduplicates overlapping issues, and sorts everything by severity.

```mermaid
flowchart TD
    PR[Pull Request Opened] --> D{Assess PR Complexity}
    D -->|Simple| FEW[Deploy Fewer Agents]
    D -->|Complex| MANY[Deploy More Agents]
    FEW --> A1[Agent A\nLogic Error Detection]
    MANY --> A1
    MANY --> A2[Agent B\nSecurity Vulnerability Scan]
    MANY --> A3[Agent C\nCross-check Against Legacy Bugs]
    A1 --> AGG[Aggregator Agent\nDeduplicate + Sort by Severity]
    A2 --> AGG
    A3 --> AGG
    AGG --> OUT[PR Comments\nWith Inline Annotations]
    OUT --> HUMAN[Human Reviewer Final Approval]
```

The number of agents deployed scales dynamically with the complexity of the PR. Average review time is around 20 minutes — slower than instant-response tools like GitHub Copilot, but Anthropic argues that depth is the point.2
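The fan-out-and-aggregate pattern described above can be sketched in a few lines of Python. Everything here is illustrative — the agent names, the finding fields, and the severity ordering are assumptions for the sketch, not Anthropic's actual implementation:

```python
# Hypothetical sketch of parallel review agents plus an aggregation step.
# Agents, findings, and severity labels are invented for illustration.
from concurrent.futures import ThreadPoolExecutor

SEVERITY_ORDER = {"red": 0, "yellow": 1, "purple": 2}

def logic_agent(diff):
    # Each agent inspects the same diff from one angle and returns findings.
    return [{"line": 42, "severity": "red", "note": "possible off-by-one"}]

def security_agent(diff):
    return [{"line": 42, "severity": "red", "note": "possible off-by-one"},
            {"line": 90, "severity": "yellow", "note": "unvalidated input"}]

def history_agent(diff):
    return [{"line": 7, "severity": "purple", "note": "resembles an old bug"}]

def review(diff, agents):
    # Fan out: run every agent on the diff in parallel.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda agent: agent(diff), agents)
    # Aggregate: deduplicate identical findings, then sort by severity.
    seen, merged = set(), []
    for finding in (f for batch in results for f in batch):
        key = (finding["line"], finding["note"])
        if key not in seen:
            seen.add(key)
            merged.append(finding)
    return sorted(merged, key=lambda f: SEVERITY_ORDER[f["severity"]])

comments = review("…diff text…", [logic_agent, security_agent, history_agent])
```

The duplicate finding from the two agents collapses into one comment, and the red-severity issue surfaces first — the same shape as the aggregator stage in the diagram.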

### What It Focuses On: Logic, Not Style

In an interview with TechCrunch, Anthropic’s head of product Cat Wu explained the thinking:

> “Developers have already experienced a lot of automated AI feedback, and they’ve grown fatigued by comments they can’t act on immediately. So we made a deliberate choice to focus only on logic errors. That way, the most critical issues surface first.” — Cat Wu, Head of Product, Anthropic1

Issue severity is color-coded:

| Color | Meaning |
|---|---|
| Red | Requires immediate attention |
| Yellow | Potential issue worth reviewing |
| Purple | Related to existing code or historical bugs |

Code Review does not approve PRs. The final call always stays with a human reviewer.

### The Numbers

Based on Anthropic’s internal data, only 16% of pull requests received substantive review comments before Code Review was introduced. After deployment, that figure jumped to 54%.2 Fewer changes were being merged without any meaningful review.

Pricing runs $15–25 per review — significantly higher than lightweight alternatives. That price point is a statement: Anthropic is betting on quality over throughput.

> “This product is for large enterprise customers — think Uber, Salesforce, Accenture — who are already using Claude Code and need help managing the sheer volume of PRs it generates.” — Cat Wu1


## autoresearch: Running Experiments Through the Night

Around the same time, Andrej Karpathy published autoresearch on GitHub in early March 2026.3 The project is roughly 630 lines of Python, and its core question is disarmingly simple: can an AI agent run experiments while the researcher is asleep?

```mermaid
flowchart LR
    H[Human\nWrites program.md] -->|Provides Instructions| AGENT[AI Agent]
    AGENT -->|Modifies train.py| TRAIN[Fixed 5-Minute Training Run]
    TRAIN -->|Measure val_bpb| EVAL{Improvement?}
    EVAL -->|Yes — Commit| AGENT
    EVAL -->|No — Rollback| AGENT
    AGENT -->|Repeat Loop| TRAIN
```

### The Design: Three Files, Deliberately Simple

The project is intentionally minimal:

  • `prepare.py`: Downloads data and trains the tokenizer. The agent never touches this.
  • `train.py`: Contains the full GPT model, optimizer, and training loop. This is the only file the agent modifies — architecture, hyperparameters, batch size, optimizer choices are all fair game.
  • `program.md`: The research brief, written by a human and refined over time.

The agent modifies `train.py`, runs training for exactly five minutes, and checks `val_bpb` (validation bits per byte) — lower is better. If the result beats the previous run, it commits and moves on. If not, it rolls back and tries something else.
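The commit-or-rollback loop is simple enough to sketch. This is not Karpathy's actual code — the callbacks below stand in for the agent's edit, the five-minute training run, and the git commit/rollback steps, and the canned scores stand in for real training:

```python
# Minimal sketch of autoresearch's greedy hill-climb loop, under the
# assumption that each callback maps to one step of the real workflow.
def autoresearch_loop(propose_edit, run_experiment, commit, rollback, n_rounds):
    """Keep an edit only if val_bpb improves; otherwise discard it."""
    best = float("inf")
    for _ in range(n_rounds):
        propose_edit()            # agent rewrites train.py
        score = run_experiment()  # fixed 5-minute run, returns val_bpb
        if score < best:
            best = score
            commit(score)         # e.g. git commit -am "val_bpb ..."
        else:
            rollback()            # e.g. git checkout -- train.py
    return best

# Toy run with canned scores standing in for real training results:
scores = iter([1.00, 1.02, 0.98, 0.97, 0.99])
kept = []
best = autoresearch_loop(
    propose_edit=lambda: None,
    run_experiment=lambda: next(scores),
    commit=lambda s: kept.append(s),
    rollback=lambda: None,
    n_rounds=5,
)
```

In the toy run, the regressions (1.02 and 0.99) are rolled back and only the improving runs are kept, ending at a best score of 0.97.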

At that cadence, the agent can run roughly 12 experiments per hour — over 100 in a single overnight session.3
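The throughput claim is back-of-envelope arithmetic, assuming negligible overhead between runs and a roughly nine-hour overnight session:

```python
# Experiment throughput at a fixed 5-minute budget per run,
# ignoring the agent's edit and evaluation time between runs.
run_minutes = 5
per_hour = 60 // run_minutes   # 12 experiments per hour
overnight = per_hour * 9       # ~9-hour overnight session -> over 100 runs
```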

### Early Results and Real-World Use

In Karpathy’s initial published experiments, the agent autonomously reduced val_bpb from 1.0 to 0.97.4 His README opens with a striking framing:

> “Once upon a time, frontier AI research was carried out by human computers in the gaps between eating, sleeping, and other pleasures of life. That era ended long ago. Research is now entirely the domain of autonomous AI agent swarms running on vast compute clusters high up in the sky.” — @karpathy, March 20263

Shopify CEO Tobi Lütke applied autoresearch to his own project and reported a 19% improvement in validation scores. An agent-optimized smaller model outperformed a larger model tuned by hand.4 Karpathy has since integrated some of the agent-discovered improvements into his broader nanochat framework.


## The Design Philosophy Both Projects Share

On the surface, these two projects look very different. Code Review is a paid enterprise product wired into large-scale engineering workflows. autoresearch is an open-source experiment that runs on a single GPU. But structurally, they’re built around the same idea.

| | Code Review | autoresearch |
|---|---|---|
| Human’s role | Final approval | Writing program.md |
| AI’s role | Parallel review + severity ranking | Code edits + iterative experiments |
| Key metric | Logic error count and severity | val_bpb |
| Loop structure | PR opened → analysis → comments | Edit → train → evaluate → repeat |
| Human intervention point | End of loop | Loop design phase |

Neither system removes humans from the loop entirely. Humans design the loop and make the final call. AI handles the repetitive work inside it. This agentic architecture is already taking shape across multiple domains.


> [!KEY] Code Review is an attempt to solve the AI-generated code overflow problem with more AI. The goal is to restore balance between the speed of production and the speed of review.


## Open Questions

What these tools actually mean in practice is still being tested.

For Code Review, the key question is whether $15–25 per review delivers enough value to justify the cost. A 20-minute turnaround only makes sense in workflows that prioritize depth over speed, which may limit how broadly it gets adopted.

autoresearch raises a more fundamental question. Architectural improvements an agent discovers in small-scale experiments may not transfer cleanly to large-scale production models. Karpathy himself called this a “starting point.” Whether things get more interesting as program.md evolves into a kind of “research organization code” and more agents are added — that’s something only time will tell.


> [!KEY] The core innovation in autoresearch is the fixed five-minute training window. It makes every experiment directly comparable regardless of platform, and establishes the minimum viable unit for meaningful autonomous iteration by an AI agent.
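A fixed wall-clock budget is easy to express in code. This toy sketch (a placeholder step function and a 50 ms budget in place of five minutes) shows why runs stay comparable: a faster architecture simply fits more optimizer steps into the same window, so every experiment is judged on identical wall-clock terms.

```python
# Sketch of a fixed wall-clock training budget; the step function and
# budget length here are placeholders for a real training loop.
import time

def train_for(budget_seconds, step):
    """Run training steps until the wall-clock budget is spent."""
    deadline = time.monotonic() + budget_seconds
    steps = 0
    while time.monotonic() < deadline:
        step()       # one optimizer step on one batch
        steps += 1
    return steps

# Toy 50 ms budget with a 10 ms "step" standing in for real training.
n = train_for(0.05, lambda: time.sleep(0.01))
```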


## Footnotes

  1. Cat Wu (Anthropic), TechCrunch, “Anthropic launches code review tool to check flood of AI-generated code,” 2026-03-09. https://techcrunch.com/2026/03/09/anthropic-launches-code-review-tool-to-check-flood-of-ai-generated-code/

  2. VentureBeat, “Anthropic rolls out Code Review for Claude Code,” 2026-03-09. https://venturebeat.com/technology/anthropic-rolls-out-code-review-for-claude-code-as-it-sues-over-pentagon

  3. Andrej Karpathy, GitHub, “karpathy/autoresearch,” 2026-03. https://github.com/karpathy/autoresearch

  4. MarkTechPost, “Andrej Karpathy Open-Sources ‘Autoresearch’: A 630-Line Python Tool Letting AI Agents Run Autonomous ML Experiments on Single GPUs,” 2026-03-08. https://www.marktechpost.com/2026/03/08/andrej-karpathy-open-sources-autoresearch-a-630-line-python-tool-letting-ai-agents-run-autonomous-ml-experiments-on-single-gpus/
