Karpathy's Autoresearch: AI Agents Train LLMs Overnight

Andrej Karpathy's autoresearch repo lets autonomous AI agents run LLM training experiments overnight. No manual coding required – agents modify train.py, run 5-minute experiments, and keep the changes that lower validation loss. Wake up to better models and detailed logs. The single-GPU setup, built on the nanochat architecture, puts this kind of research within reach of anyone with an NVIDIA GPU. It suits AI researchers who want to automate hyperparameter tuning, architecture search, and model optimization.

Karpathy's Autoresearch: Let AI Agents Revolutionize Your Model Training

The era of manual AI research is over. Andrej Karpathy's autoresearch repository (20.6k stars) introduces a groundbreaking approach: AI agents autonomously improve LLMs overnight without human intervention.

The Revolutionary Concept

Instead of researchers manually tweaking hyperparameters, architecture, and optimizers, autoresearch hands control to AI agents. The workflow:

  1. Agent edits train.py (GPT model, Muon+AdamW optimizer, training loop)
  2. Runs 5-minute training (fixed wall-clock budget)
  3. Evaluates on val_bpb (bits per byte, lower = better)
  4. Keeps improvements, discards failures
  5. Repeats ~100x overnight
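
The five steps above amount to a greedy keep-or-discard search. The sketch below illustrates that loop under stated assumptions; it is not the repository's code. The names greedy_search, toy_propose, and toy_evaluate are hypothetical, and toy_evaluate is a cheap stand-in for "train for 5 minutes and report val_bpb".

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def greedy_search(
    baseline: Config,
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    n_experiments: int,
) -> Tuple[Config, float, List[Tuple[int, float, bool]]]:
    """Keep a candidate only if it scores strictly lower than the best so far."""
    best = dict(baseline)
    best_score = evaluate(best)
    log: List[Tuple[int, float, bool]] = []
    for i in range(n_experiments):
        candidate = propose(best)      # step 1: the "agent" edits the config
        score = evaluate(candidate)    # steps 2-3: run and measure (lower = better)
        kept = score < best_score      # step 4: keep improvements, discard failures
        if kept:
            best, best_score = candidate, score
        log.append((i, score, kept))   # experiment log to read in the morning
    return best, best_score, log


def toy_evaluate(cfg: Config) -> float:
    # Hypothetical stand-in for a real training run: a bowl with its minimum at lr=0.02.
    return 1.0 + (cfg["lr"] - 0.02) ** 2


rng = random.Random(0)  # seeded so the toy run is repeatable


def toy_propose(cfg: Config) -> Config:
    # Hypothetical "edit": scale the learning rate by a random factor.
    return {"lr": cfg["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])}


baseline = {"lr": 0.5}
best, best_score, log = greedy_search(baseline, toy_propose, toy_evaluate, 100)
print(f"baseline score: {toy_evaluate(baseline):.4f}")
print(f"best score:     {best_score:.4f} with {best}")
print(f"kept {sum(k for _, _, k in log)} of {len(log)} experiments")
```

In the real workflow the proposal step is an agent editing train.py and the evaluation step is a full training run, so each iteration costs minutes rather than microseconds.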

Wake up to optimized models and detailed experiment logs.

Minimal 3-File Setup

uv sync
uv run prepare.py  # Download data + train tokenizer
uv run train.py    # Manual test (~5 min)

Core files:

  • prepare.py – Data prep + utilities (fixed)
  • train.py – Agent's playground (model + training)
  • program.md – Agent instructions (human-editable)

Production-Ready Design Choices

✅ Single editable file keeps diffs reviewable
✅ Fixed 5-min budget = fair architecture comparisons
✅ Self-contained – PyTorch + minimal deps
✅ Vocab-independent metric (val_bpb)
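
To see why bits per byte is vocab-independent, here is a minimal sketch of the standard definition: total cross-entropy converted from nats to bits, divided by the number of UTF-8 bytes the tokens cover. The function name bits_per_byte and the toy numbers are assumptions for illustration; the repository's own evaluation code may differ in detail.

```python
import math
from typing import Sequence


def bits_per_byte(token_nll_nats: Sequence[float], token_byte_lengths: Sequence[int]) -> float:
    """Total loss in bits divided by total bytes of text covered.

    token_nll_nats: per-token negative log-likelihood in nats
        (the unit PyTorch's cross-entropy uses).
    token_byte_lengths: number of UTF-8 bytes each token decodes to.
    """
    if len(token_nll_nats) != len(token_byte_lengths):
        raise ValueError("need one byte length per token")
    total_bytes = sum(token_byte_lengths)
    if total_bytes == 0:
        raise ValueError("text must cover at least one byte")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / total_bytes


# The same 12-byte text under two hypothetical tokenizers.
# Coarse vocab: 3 tokens of 4 bytes each, 4 bits of loss per token.
coarse = bits_per_byte([math.log(16)] * 3, [4] * 3)
# Fine vocab: 6 tokens of 2 bytes each, 2 bits of loss per token.
fine = bits_per_byte([math.log(4)] * 6, [2] * 6)
print(coarse, fine)  # per-token loss differs, bits per byte matches
```

Per-token loss would make the fine-grained tokenizer look twice as good here, while bits per byte shows the two models compress the text equally well. That is what lets an agent change the vocabulary without invalidating comparisons.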

Quick Start for H100 Users

# 1. Install (Python 3.10+)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

# 2. Prep data (~2 min)
uv run prepare.py

# 3. Test run (~5 min)
uv run train.py

Spin up Claude/Codex:

"Hi, read program.md and kick off a new experiment!"

Smaller Hardware? Try These Forks

Pro tips for low-compute setups: train on the TinyStories dataset and scale the configuration down to vocab_size=1024, DEPTH=4, and MAX_SEQ_LEN=256.
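
As a rough illustration of why these settings help, the sketch below gathers them in one place and computes two quantities they control directly. The class LowComputeConfig, its methods, and the model_dim value are hypothetical and not part of the repository; only the four setting values come from the tips above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LowComputeConfig:
    # Values taken from the low-compute tips; the field names are illustrative.
    dataset: str = "TinyStories"
    vocab_size: int = 1024
    depth: int = 4
    max_seq_len: int = 256

    def embedding_params(self, model_dim: int) -> int:
        # A token embedding table has vocab_size * model_dim parameters.
        return self.vocab_size * model_dim

    def attention_scores_per_head(self) -> int:
        # Full self-attention compares every position with every other one.
        return self.max_seq_len ** 2


cfg = LowComputeConfig()
# model_dim=256 is an assumed width chosen only for this example.
print(cfg.embedding_params(model_dim=256))   # 262144
print(cfg.attention_scores_per_head())       # 65536
```

A smaller vocabulary shrinks the embedding table linearly, and a shorter sequence length shrinks the attention score matrix quadratically, so both settings cut memory and time per step on small GPUs.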

Why This Changes Everything

  • Democratizes research: Single GPU → frontier progress
  • Platform-optimized: The fixed wall-clock budget means the search finds the best model for your hardware
  • Agent-programmable: Edit program.md to add multi-agent swarms
  • MIT licensed: Fork, extend, contribute

GitHub Repo (20.6k ⭐) – The future of AI research has arrived.