AutoResearch Architecture
An autonomous LLM-pretraining research agent by Andrej Karpathy: an AI agent that modifies code, trains models, evaluates results, and iterates indefinitely while you sleep.
autonomous-agent · llm-pretraining · ml-research · gpt · optimizer · karpathy
Legend: Core Engine (GPT + Training) · Optimizer (Muon + AdamW) · Data Pipeline · AI Agent (LLM Researcher) · Program / Config · CLI / Runtime · External (HuggingFace, GPU)

System Layers
Agent Layer (the autonomous researcher)
🤖 Claude / Codex Agent — LLM-powered researcher
📝 program.md — agent skill / instructions
⌨ Git Workflow — branch, commit, revert
📊 results.tsv — experiment log
Training Engine (the single file the agent edits)
🧠 GPT Model — transformer + RoPE + value embeddings
🔁 Training Loop — 5-min time budget, gradient accumulation
⚡ MuonAdamW — Muon (matrices) + AdamW (rest)
📈 LR Schedules — warmup + warmdown + decay
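The warmup + warmdown shape mentioned above can be sketched as a learning-rate multiplier. This is an illustrative sketch only — the exact schedule, fractions, and function names in train.py are assumptions:

```python
def lr_scale(step: int, total_steps: int,
             warmup_frac: float = 0.05, warmdown_frac: float = 0.3) -> float:
    """Hypothetical warmup -> constant -> linear-warmdown multiplier in [0, 1]."""
    warmup = max(1, int(total_steps * warmup_frac))
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup:
        return (step + 1) / warmup  # linear warmup from ~0 to 1
    if step < warmdown_start:
        return 1.0                  # constant plateau at peak LR
    # linear warmdown, reaching zero at the final step
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```

The multiplier is applied on top of the per-group base learning rates.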
Data & Evaluation Layer (fixed, read-only)
📦 Data Downloader — ClimbMix-400B parquet shards
🔢 BPE Tokenizer — rustbpe + tiktoken, 8192 vocab
📄 DataLoader — BOS-aligned, best-fit packing
✅ evaluate_bpb — fixed BPB metric (ground truth)
Infrastructure
💻 Single NVIDIA GPU — H100 (tested), CUDA
📦 PyTorch 2.9 — torch.compile, bf16 autocast
🌍 HuggingFace Hub — dataset hosting
📤 Flash Attention 3 — via the kernels package (varunneal / community)
📦 uv — package manager + runner
Core Flow — Autonomous Research Loop
1. Setup — Agent reads program.md, creates branch autoresearch/<tag>, reads all source files, and verifies data exists in ~/.cache/autoresearch/
2. Baseline Run — Execute uv run train.py > run.log 2>&1 with unmodified code. Record the initial val_bpb in results.tsv
3. Hypothesize — Agent proposes a change: architecture tweak, hyperparameter adjustment, optimizer modification, or simplification in train.py
4. Edit & Commit — Modify train.py (the only mutable file), then git commit the change to the experiment branch
5. Train (5 min) — Run training for a fixed TIME_BUDGET = 300 seconds of wall clock. The model trains on ClimbMix-400B with gradient accumulation, bf16, and torch.compile
6. Evaluate — Compute val_bpb via evaluate_bpb() on the pinned validation shard. Extract metrics from the log with grep
7. Keep or Discard — If val_bpb improved, keep the commit and advance. If it got worse, git reset back. Log the result to results.tsv
8. Loop Forever — The agent runs autonomously and indefinitely (~12 experiments/hour, ~100 overnight). The human wakes up to a log of discoveries
Integration Model
AI Agent (the Researcher)
Any LLM agent: Claude, Codex, etc.
Reads program.md as its skill / instruction set
Edits only train.py — everything else is read-only
Uses git for version control (branch per run)
Logs results to results.tsv (untracked)
Runs indefinitely with no human intervention
Compute & Data Stack
Single NVIDIA GPU (H100 tested), CUDA backend
Flash Attention 3 via kernels package (Hopper-aware)
torch.compile with bf16 autocast for speed
ClimbMix-400B dataset from HuggingFace (parquet shards)
rustbpe + tiktoken for fast BPE tokenization
uv for dependency management and script execution
Key Subsystem — GPT Model + MuonAdamW Optimizer
train.py (the single mutable file)
├── GPTConfig             ← dataclass: depth, heads, embed dim, window pattern
├── CausalSelfAttention   ← GQA + RoPE + value embeddings + FA3
├── MLP                   ← Linear → ReLU² → Linear (squared-ReLU activation)
├── Block                 ← pre-norm (RMSNorm) + Attn + MLP with residuals
├── GPT                   ← embedding + N blocks + logit softcap + LM head
├── MuonAdamW             ← hybrid optimizer: Muon for 2D matrices, AdamW for the rest
├── polar_express_coeffs  ← precomputed coefficients for Newton-Schulz orthogonalization
├── Hyperparameters       ← DEPTH, ASPECT_RATIO, LRs, BATCH_SIZE, WEIGHT_DECAY
└── Training Loop         ← time-budgeted loop with LR warmup/warmdown
Value Embeddings
ResFormer-style: alternating layers get input-dependent gated value residuals added to V projections
Sliding Window Attention
SSSL pattern: 3 short-window + 1 long-window layers cycling, last layer always full context
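The SSSL cycling can be written out per layer. A small sketch — the actual window sizes and how train.py encodes the pattern are assumptions:

```python
def window_pattern(depth: int) -> list:
    """'S' = short sliding window, 'L' = long/full window, cycling SSSL.
    The final layer is always full-context, regardless of where the cycle lands."""
    pattern = ["S" if i % 4 < 3 else "L" for i in range(depth)]
    pattern[-1] = "L"
    return pattern
```

For the default depth of 8 this yields SSSL SSSL, with the last layer already long.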
Muon Optimizer
Nesterov momentum + Polar Express orthogonalization + NorMuon variance reduction for matrix params
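Muon's core step replaces each matrix update with an approximately orthogonal matrix. A numpy sketch using the quintic Newton-Schulz iteration from the widely circulated Muon implementation — the repo's Polar Express variant precomputes different per-step coefficients, which we do not reproduce here:

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G by pushing all singular values toward 1.
    Coefficients are from the public quintic Muon iteration (an assumption here)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize: spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

Each iteration applies an odd polynomial to the singular values, so after a few steps they cluster near 1 without ever computing an SVD.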
Residual Scaling
Learnable per-layer resid_lambdas and x0_lambdas mixing current hidden state with initial embeddings
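The per-layer mixing described above amounts to a two-term blend. A minimal sketch, with names taken from the description (actual tensor shapes and broadcasting in train.py are assumptions):

```python
import numpy as np

def mix_residual(h, x0, resid_lambda, x0_lambda):
    """Blend the current hidden state h with the initial embedding stream x0.
    resid_lambda / x0_lambda are learnable scalars, one pair per layer."""
    return resid_lambda * h + x0_lambda * x0
```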
Logit Softcap
Tanh-based logit capping at 15 to prevent extreme values and stabilize training
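The cap is a smooth saturation, not a hard clip: near zero it is almost the identity, while large logits are squashed toward ±15. A one-line sketch:

```python
import numpy as np

def softcap(logits, cap: float = 15.0):
    """Tanh softcap: identity-like near zero, asymptotically bounded by +/- cap."""
    return cap * np.tanh(logits / cap)
```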
BPB Evaluation
Bits-per-byte metric: vocab-size-independent, sums per-token cross-entropy weighted by UTF-8 byte lengths
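The arithmetic behind BPB: total cross-entropy converted from nats to bits, divided by the total UTF-8 byte count, which is what makes it vocab-size-independent. A sketch — the actual evaluate_bpb signature in the repo is an assumption:

```python
import math

def bits_per_byte(token_nlls_nats, token_byte_lens):
    """token_nlls_nats: per-token cross-entropy losses in nats;
    token_byte_lens: UTF-8 byte length of each token's decoded text."""
    total_bits = sum(token_nlls_nats) / math.log(2)  # nats -> bits
    total_bytes = sum(token_byte_lens)
    return total_bits / total_bytes
```

Two tokenizers with different vocab sizes produce different per-token losses, but the byte denominator puts them on the same scale.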
Data & Output Model
ClimbMix-400B Dataset
6,542 parquet shards on HuggingFace, Column: text (string), Pinned val shard: shard_06542, Stored: ~/.cache/autoresearch/data/
BPE Tokenizer
rustbpe-trained, tiktoken-wrapped, vocab_size: 8192, 4 special tokens (reserved_0..3), GPT-4 style split pattern, Stored: ~/.cache/autoresearch/tokenizer/
results.tsv
commit (7-char hash), val_bpb (float, lower=better), memory_gb (peak VRAM), status: keep | discard | crash, description (text)
run.log Output
val_bpb, training_seconds, total_seconds, peak_vram_mb, mfu_percent, total_tokens_M, num_steps, num_params_M, depth
DataLoader Batch
inputs: [B, T] int64 (token IDs), targets: [B, T] int64 (shifted by one), BOS-aligned best-fit packing, zero padding (100% utilization), pin-memory + async GPU copy
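Best-fit packing assigns each document to the open row with the least remaining space that still fits it, opening a new row only when none fits. A simplified sketch — BOS insertion and overflow splitting in the real DataLoader are omitted here as unknowns:

```python
def best_fit_pack(doc_lens, row_len):
    """Assign each doc (by token length) to the fullest row that still fits it.
    Returns a list of rows, each a list of the doc lengths packed into it."""
    rows, free = [], []  # free[i] = remaining space in rows[i]
    for n in doc_lens:
        fits = [i for i, f in enumerate(free) if f >= n]
        if fits:
            i = min(fits, key=lambda i: free[i])  # tightest fit
            rows[i].append(n)
            free[i] -= n
        else:
            rows.append([n])
            free.append(row_len - n)
    return rows
```

Tight packing is what lets the loader reach full utilization with no wasted padding tokens.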
Model Config (default)
depth: 8, n_embd: 512, n_head: 4, head_dim: 128, seq_len: 2048, vocab: 8192, aspect_ratio: 64, window: SSSL pattern
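The defaults above hang together as a dataclass; note that n_embd = depth × aspect_ratio (8 × 64 = 512). Field names mirror the table, but the exact names and layout in train.py's GPTConfig are assumptions:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    depth: int = 8
    aspect_ratio: int = 64  # model width grows with depth: n_embd = depth * aspect_ratio
    n_head: int = 4
    head_dim: int = 128
    seq_len: int = 2048
    vocab_size: int = 8192

    @property
    def n_embd(self) -> int:
        return self.depth * self.aspect_ratio
```

Tying width to depth through a single aspect-ratio knob means the agent can scale the model by changing one hyperparameter.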
Package / Directory Map
autoresearch/
├── train.py           THE mutable file: GPT model, MuonAdamW optimizer, training loop, hyperparams
├── prepare.py         read-only: constants, data download, tokenizer training, dataloader, eval
├── program.md         agent instructions: setup, experiment loop, logging rules, constraints
├── pyproject.toml     dependencies: torch, kernels, rustbpe, tiktoken, numpy, pandas, matplotlib
├── analysis.ipynb     Jupyter notebook for plotting experiment progress from results.tsv
├── README.md          project overview, quickstart, design choices, platform notes
├── .python-version    Python version pin (3.10+)
├── .gitignore         ignores cache, logs, results
├── uv.lock            locked dependency graph
└── progress.png       sample experiment progress chart

~/.cache/autoresearch/
├── data/              downloaded parquet shards (shard_00000..06542)
└── tokenizer/         tokenizer.pkl + token_bytes.pt
The Key Insight
AutoResearch inverts the traditional ML research workflow: instead of a human writing code and running experiments, the human writes a program.md that programs an AI agent to be the researcher. The entire system is deliberately minimal — just three files — with the constraint that the agent can only modify train.py while a fixed 5-minute time budget makes every experiment directly comparable. This creates an autonomous research loop that runs ~100 experiments overnight, evolving the training code through a keep/discard selection process analogous to evolution.