Andrej Karpathy's 630-Line Python Script That Does AI Research Itself
Quick summary
Karpathy released AutoResearch: 630 lines of Python where AI agents design, run, and interpret ML experiments with no human in the loop.
Andrej Karpathy does not waste words. The former Tesla AI head, OpenAI founding member, and creator of nanoGPT, makemore, and llm.c posted a single sentence alongside his new release: "The goal is to engineer your agents to make faster research progress indefinitely and without any of your own involvement."
The tool is called AutoResearch. It is 630 lines of Python. It runs on a single GPU. And it is designed to do something AI labs have pursued for years: conduct machine learning research autonomously.
What Is AutoResearch?
AutoResearch is an agentic research framework where an AI agent — powered by an LLM — takes a research question, designs experiments to investigate it, writes the code to run those experiments, executes them on a GPU, analyses the results, forms conclusions, and then designs follow-up experiments based on what it found. The loop continues until a stopping condition is met or the agent determines it has answered the question.
The key components in Karpathy's implementation:
| Component | What it does |
|---|---|
| Research planner | LLM generates a hypothesis and experimental design |
| Code generator | Writes Python/PyTorch code for the experiment |
| Execution engine | Runs the experiment on GPU, captures stdout/stderr/metrics |
| Result analyser | LLM reads outputs, interprets results, updates hypothesis |
| Loop controller | Decides whether to run follow-up experiments or terminate |
| Report generator | Summarises findings in readable form |
The entire system is 630 lines. No custom frameworks, no complex orchestration libraries — just Python, an LLM API call, and a GPU. Karpathy's philosophy of minimal dependencies runs through everything he builds.
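To make the "captures stdout/stderr/metrics" row concrete, here is a minimal sketch of how metrics could be pulled out of captured training output. The `key=value` log format and the `parse_metrics` helper are assumptions made for illustration; the article does not describe how AutoResearch itself parses logs.

```python
import re
from collections import defaultdict

# Hypothetical log format: "step=10 loss=2.31 acc=0.42".
# Matches integers, decimals, scientific notation, and the literals nan/inf.
_PAIR = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?|nan|inf)")


def parse_metrics(stdout: str) -> dict[str, list[float]]:
    """Collect every key=value number in the captured output, grouped by key."""
    series: dict[str, list[float]] = defaultdict(list)
    for line in stdout.splitlines():
        # Lines without any key=value pair (warnings, banners) are skipped.
        for key, value in _PAIR.findall(line):
            series[key].append(float(value))
    return dict(series)
```

Keeping `nan` as a parseable value matters here: a NaN loss is exactly the kind of signal the analysis step needs to see rather than silently drop.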
Why Karpathy's Work Gets Noticed
Karpathy is not a random developer releasing a research tool. His background matters for understanding why AutoResearch is significant:
He was a founding member of OpenAI, where he worked as a research scientist, before leaving in 2017 to lead Tesla's Autopilot AI programme for five years. At Tesla, he built one of the most demanding real-world AI deployment pipelines on earth. After leaving Tesla in 2022, he returned to OpenAI for about a year, then went independent in 2024, releasing a series of educational and research tools that have become widely used in the ML community.
His nanoGPT (a minimal GPT implementation whose model definition and training loop are each roughly 300 lines) became one of the most widely read codebases for understanding transformers. His llm.c project showed how to train GPT-2 in plain C/CUDA with no Python or PyTorch dependency. The pattern is consistent: take a complex AI concept, strip it to its minimum viable implementation, and publish it openly. AutoResearch follows the same philosophy applied to autonomous research agents.
When Karpathy says the goal is research progress "without any of your own involvement," he means it technically. The system is designed to run overnight or over a weekend and return a research report.
How Autonomous ML Research Actually Works
The hardest part of building an autonomous research agent is not the LLM prompting — it is the execution environment. ML experiments fail in dozens of ways that are not model failures:
- CUDA out of memory errors
- Shape mismatches in tensor operations
- NaN losses from bad learning rates
- Infinite training loops
- Dependency version conflicts
AutoResearch handles these through an execution harness that catches errors, feeds them back to the LLM as context ("the experiment failed with this error: ..."), and asks the LLM to revise the code. This retry loop is the engineering insight that makes the system usable rather than theoretical.
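The retry idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not Karpathy's code: `run_script`, `run_with_retries`, and the `revise` callback (which stands in for the LLM call) are hypothetical names, and the timeout is one simple way to stop an infinite training loop.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def run_script(code: str, timeout_s: float = 60.0) -> tuple[bool, str]:
    """Run code in a fresh interpreter; return (succeeded, tail of stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "experiment.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Guards against runaway or infinite training loops.
            return False, f"timed out after {timeout_s} seconds"
    # Keep only the end of stderr, where the traceback summary lives.
    return proc.returncode == 0, proc.stderr[-2000:]


def run_with_retries(
    code: str,
    revise: Callable[[str, str], str],
    max_attempts: int = 3,
    timeout_s: float = 60.0,
) -> tuple[bool, str, int]:
    """Retry a failing script, asking revise(code, error) for a fix each time."""
    for attempt in range(1, max_attempts + 1):
        ok, error = run_script(code, timeout_s)
        if ok:
            return True, code, attempt
        if attempt < max_attempts:
            code = revise(code, error)  # in a real agent: an LLM call
    return False, code, max_attempts


def fake_revise(code: str, error: str) -> str:
    """Stand-in for the LLM: repairs the single bug planted in the demo."""
    return code.replace("1 / 0", "1 / 1")
```

Calling `run_with_retries("print(1 / 0)", fake_revise)` fails once with a `ZeroDivisionError`, applies the stub fix, and succeeds on the second attempt. Running each attempt in a separate process means a crash or hang in generated code cannot take the agent down with it.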
The research loop looks like this:
1. Receive research question ("Does batch norm help in this architecture?")
2. Generate experiment design (2 variants: with and without BN)
3. Write training code for both variants
4. Execute on GPU, capture metrics (loss curves, final accuracy)
5. LLM analyses results: "BatchNorm improved convergence speed by 23% but final accuracy was equivalent"
6. Generate follow-up: "Test with different learning rates to isolate the effect"
7. Repeat until stopping condition
8. Write final report
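A claim like the one in step 5 ("improved convergence speed by 23%") only means something once convergence speed is defined. One common definition, assumed here purely for illustration, is the number of steps needed to first reach a target loss. The helper names and the loss curves below are made up.

```python
from typing import Optional, Sequence


def steps_to_target(losses: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based step at which loss first reaches `target` or lower."""
    for step, loss in enumerate(losses, start=1):
        if loss <= target:  # a NaN loss never satisfies this comparison
            return step
    return None  # the run never reached the target


def speedup_percent(baseline_steps: int, variant_steps: int) -> float:
    """Percentage fewer steps the variant needed compared with the baseline."""
    return 100.0 * (baseline_steps - variant_steps) / baseline_steps


# Illustrative loss curves, not real experiment data.
without_bn = [2.0, 1.6, 1.3, 1.1, 0.9, 0.8]
with_bn = [2.0, 1.4, 1.0, 0.8, 0.7, 0.6]

base = steps_to_target(without_bn, target=0.8)  # reached at step 6
fast = steps_to_target(with_bn, target=0.8)     # reached at step 4
```

Under this definition, a variant that needs 10 steps where the baseline needs 13 is about 23% faster, since `speedup_percent(13, 10)` is roughly 23.1.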
The stopping condition can be a fixed number of iterations, a time budget, a GPU cost budget, or a convergence criterion that the LLM evaluates qualitatively.
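The loop and its stopping conditions can be sketched as follows. This is a minimal outline assuming three injected callables (`plan`, `run`, `analyse`) in place of the LLM and GPU steps; the names and the `Budget` class are hypothetical and not taken from the AutoResearch source. It covers the iteration cap, the time budget, and an analyst-declared stop.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

Plan = Callable[[str, list[str]], str]           # question, findings -> design
Run = Callable[[str], dict]                      # design -> metrics
Analyse = Callable[[str, dict], tuple[str, bool]]  # -> (finding, done?)


@dataclass
class Budget:
    max_iterations: int
    max_seconds: float
    started: float = field(default_factory=time.monotonic)
    iterations: int = 0

    def exhausted(self) -> bool:
        """True once either the iteration cap or the time budget is used up."""
        elapsed = time.monotonic() - self.started
        return self.iterations >= self.max_iterations or elapsed >= self.max_seconds


def research_loop(question: str, plan: Plan, run: Run,
                  analyse: Analyse, budget: Budget) -> list[str]:
    """Plan, run, and analyse experiments until the budget or analyst says stop."""
    findings: list[str] = []
    while not budget.exhausted():
        design = plan(question, findings)   # earlier findings shape the next design
        result = run(design)
        finding, done = analyse(design, result)
        findings.append(finding)
        budget.iterations += 1
        if done:  # the analyst judged the question answered
            break
    return findings
```

Checking the budget before each iteration, rather than after, means an already-exhausted budget never starts a new GPU run. A cost budget could be added as one more condition in `exhausted`.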
What AutoResearch Can and Cannot Do Right Now
What it does well:
- Ablation studies on small to medium models (fits on a single GPU)
- Hyperparameter sensitivity analysis
- Architecture comparisons on standard benchmarks
- Reproducing and extending existing paper results
- Generating preliminary results to inform larger-scale experiments
What it cannot do:
- Large-scale experiments requiring multi-GPU or multi-node setups
- Novel mathematical derivations (it can test hypotheses but cannot derive new theory)
- Experiments that require custom data collection or real-world interaction
- Research requiring domain expertise that the base LLM lacks (highly specialised fields)
- Replace the intuition of an experienced researcher about which questions are worth asking
The single-GPU constraint is a practical limitation for 2026 frontier model research. Training runs for state-of-the-art language or vision models require clusters. AutoResearch is well-suited for research on smaller models, efficiency techniques, architecture choices in constrained settings, and reproducing findings at reduced scale.
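A rough back-of-the-envelope check shows where the single-GPU ceiling sits. The sketch below assumes plain fp32 training with Adam, which needs about 16 bytes per parameter (4 for weights, 4 for gradients, 8 for the two Adam moment buffers), and it ignores activations entirely. The function names and the 50% headroom figure are illustrative assumptions.

```python
def training_memory_gib(n_params: int, bytes_per_param: int = 16) -> float:
    """Rough lower bound on training memory in GiB, ignoring activations.

    16 bytes/param assumes fp32 weights (4) + fp32 gradients (4)
    + two fp32 Adam moment buffers (8).
    """
    return n_params * bytes_per_param / 2**30


def fits(n_params: int, vram_gib: float, headroom: float = 0.5) -> bool:
    """True if parameter-related state uses at most `headroom` of VRAM.

    The remaining fraction is left for activations and framework overhead;
    0.5 is an assumed rule of thumb, not a measured value.
    """
    return training_memory_gib(n_params) <= vram_gib * headroom
```

By this estimate a 124M-parameter model needs roughly 1.85 GiB for weights, gradients, and optimizer state, which sits comfortably inside an 8 GB card, while a 1B-parameter model needs about 14.9 GiB and already exceeds the assumed headroom on a 24 GB card.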
The Bigger Picture: AI Doing AI Research
AutoResearch is a concrete implementation of a concept that AI safety researchers have discussed for years: AI systems that accelerate their own development. If an AI agent can run ML experiments faster and more cheaply than a human researcher, and can do so 24 hours a day, the rate of AI research progress could increase substantially.
This is not hypothetical. The economics are straightforward: a researcher who can delegate exploratory experiments to an agent can focus their time on higher-level hypothesis generation and experimental design. A single researcher with AutoResearch running overnight could cover the ground that previously required a small team running manual experiments.
At scale, with multiple agents and multiple GPUs running in parallel across a research organisation, the acceleration becomes significant. Large labs such as Anthropic, OpenAI, and Google DeepMind are understood to have internal automated experiment infrastructure that does similar things at much larger scale. Karpathy's contribution is making a functional version of this available to individual researchers and small teams with a single GPU.
How to Use AutoResearch
AutoResearch is open source on Karpathy's GitHub. Requirements:
- Python 3.10+
- PyTorch with CUDA
- Any LLM API (OpenAI, Anthropic Claude, or a local model via Ollama)
- A single GPU (NVIDIA recommended; 8GB+ VRAM for most experiments)
Basic usage (Python):
    from autoresearch import AutoResearch

    agent = AutoResearch(
        llm="claude-opus-4-6",
        gpu_budget_hours=4,
        max_iterations=10,
    )

    report = agent.run(
        question="Does residual connection placement affect training stability in small transformers?",
        dataset="wikitext-103-small",
    )

    print(report.summary)
The LLM choice matters significantly. Karpathy's testing used Claude Opus 4.6 and GPT-5 for the planning and analysis steps. Smaller or local models work but produce lower-quality experimental designs and analyses.
What This Means for Indian ML Researchers
India has a large and growing ML research community — IIT, IISc, TIFR, CMI, and a wave of AI research labs within Indian tech companies. The historical constraint for Indian academic researchers has been compute: access to multi-GPU clusters is expensive and often limited through institutional allocations.
AutoResearch changes the calculus for exploratory research. A researcher with a single RTX 4090 (available for under ₹1,50,000 in India) can now run overnight research campaigns that would have required booking cluster time or running manual experiments across weeks. The LLM API cost per experiment is a few rupees at Claude or GPT-5 pricing.
For MSc and PhD students doing ML research, AutoResearch is potentially the most significant productivity tool released this year.
Key Takeaways
- Karpathy released AutoResearch: 630 lines of Python, single GPU, fully autonomous ML experiment loop
- The system designs experiments, writes code, runs them, analyses results, and iterates without human involvement
- Key engineering: error-handling retry loop that feeds GPU errors back to the LLM for code revision
- Best suited for ablation studies, architecture comparisons, hyperparameter sensitivity on small-to-medium models
- Cannot replace researcher intuition about which questions are worth asking, and cannot run large-scale frontier experiments
- Open source, works with Claude, GPT-5, or local LLMs via Ollama
- For Indian ML researchers: a single GPU + AutoResearch now covers exploratory research that previously required cluster access
- Jensen Huang later gifted Karpathy the first NVIDIA DGX Station GB300 — 748GB unified memory — specifically to run AutoResearch-style agent loops at frontier scale
Frequently Asked Questions
What is Karpathy AutoResearch?
AutoResearch is an open-source Python tool released by Andrej Karpathy that enables AI agents to conduct machine learning experiments autonomously. The agent designs experiments, writes code, runs them on a GPU, analyses results, and iterates — all without human involvement. It is 630 lines of code and runs on a single GPU.
Who is Andrej Karpathy?
Andrej Karpathy is one of the most influential figures in applied deep learning. He was a founding member of OpenAI, led Tesla's Autopilot AI programme for five years, and returned to OpenAI briefly before going independent. He is known for educational projects like nanoGPT, makemore, and llm.c — minimal implementations of complex AI concepts that have become widely used learning resources.
How does AutoResearch work?
AutoResearch runs a loop in which an LLM (1) generates an experiment design from a research question, (2) writes Python/PyTorch code for the experiment, (3) has that code executed on a GPU, (4) analyses the results, and (5) designs follow-up experiments, then repeats. A key feature is an error-handling loop that feeds GPU errors back to the LLM for code revision, making the system robust to the many ways ML experiments fail in practice.
Is AutoResearch open source?
Yes. AutoResearch is available on Karpathy's GitHub. It requires Python 3.10+, PyTorch with CUDA, and an LLM API (OpenAI, Anthropic Claude, or a local model via Ollama). A single GPU with 8GB+ VRAM is sufficient for most experiment types the tool is designed for.
What are the limitations of AutoResearch?
AutoResearch is limited to experiments that fit on a single GPU. It cannot derive new mathematical theory (it tests hypotheses empirically rather than proving results from first principles), and it cannot replace an experienced researcher's intuition about which questions are worth asking. It performs best on ablation studies and architecture comparisons rather than frontier-scale training runs.
Written by
Software Engineer based in Delhi, India. Writes about AI models, semiconductor supply chains, and tech geopolitics — covering the intersection of infrastructure and global events. 873+ posts cited by ChatGPT, Perplexity, and Gemini. Read in 167 countries.
