AutoResearch Architecture
An autonomous LLM-pretraining research agent by Andrej Karpathy: an AI agent that modifies code, trains models, evaluates results, and iterates indefinitely while you sleep.
autonomous-agent · llm-pretraining · ml-research · gpt · optimizer · karpathy
Legend: Core Engine (GPT + Training) · Optimizer (Muon + AdamW) · Data Pipeline · AI Agent (LLM Researcher) · Program / Config · CLI / Runtime · External (HuggingFace, GPU)

System Layers
Agent Layer (the autonomous researcher)
🤖 Claude / Codex Agent — LLM-powered researcher
📝 program.md — agent skill / instructions
⌨ Git Workflow — branch, commit, revert
📊 results.tsv — experiment log
Training Engine (the single file the agent edits)
🧠 GPT Model — transformer + RoPE + value embeddings
🔁 Training Loop — 5-min time budget, gradient accumulation
⚡ MuonAdamW — Muon (matrices) + AdamW (rest)
📈 LR Schedules — warmup + warmdown + decay
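The warmup + warmdown shape mentioned above can be sketched as a learning-rate multiplier. This is an illustrative sketch only — the exact schedule, fractions, and function names in train.py are assumptions:

```python
def lr_scale(step: int, total_steps: int,
             warmup_frac: float = 0.05, warmdown_frac: float = 0.3) -> float:
    """Hypothetical warmup -> constant -> linear-warmdown multiplier in [0, 1]."""
    warmup = max(1, int(total_steps * warmup_frac))
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup:
        return (step + 1) / warmup  # linear warmup from ~0 to 1
    if step < warmdown_start:
        return 1.0                  # constant plateau at peak LR
    # linear warmdown, reaching zero at the final step
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```

The multiplier is applied on top of the per-group base learning rates.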
Data & Evaluation Layer (fixed, read-only)
📦 Data Downloader — ClimbMix-400B parquet shards
🔢 BPE Tokenizer — rustbpe + tiktoken, 8192 vocab
📄 DataLoader — BOS-aligned, best-fit packing
✅ evaluate_bpb — fixed BPB metric (ground truth)
Infrastructure
💻 Single NVIDIA GPU — H100 (tested), CUDA
📦 PyTorch 2.9 — torch.compile, bf16 autocast
🌍 HuggingFace Hub — dataset hosting
📤 Flash Attention 3 — via the kernels package (varunneal / community)
📦 uv — package manager + runner
Core Flow — Autonomous Research Loop
1. Setup — Agent reads program.md, creates branch autoresearch/<tag>, reads all source files, and verifies data exists in ~/.cache/autoresearch/
2. Baseline Run — Execute uv run train.py > run.log 2>&1 with unmodified code. Record the initial val_bpb in results.tsv
3. Hypothesize — Agent proposes a change: architecture tweak, hyperparameter adjustment, optimizer modification, or simplification in train.py
4. Edit & Commit — Modify train.py (the only mutable file), then git commit the change to the experiment branch
5. Train (5 min) — Run training for a fixed TIME_BUDGET = 300 seconds of wall clock. The model trains on ClimbMix-400B with gradient accumulation, bf16, and torch.compile
6. Evaluate — Compute val_bpb via evaluate_bpb() on the pinned validation shard. Extract metrics from the log with grep
7. Keep or Discard — If val_bpb improved, keep the commit and advance. If it got worse, git reset back. Log the result to results.tsv
8. Loop Forever — The agent runs autonomously and indefinitely (~12 experiments/hour, ~100 overnight). The human wakes up to a log of discoveries
Integration Model
AI Agent (the Researcher)
Any LLM agent: Claude, Codex, etc.
Reads program.md as its skill / instruction set
Edits only train.py — everything else is read-only
Uses git for version control (branch per run)
Logs results to results.tsv (untracked)
Runs indefinitely with no human intervention
Compute & Data Stack
Single NVIDIA GPU (H100 tested), CUDA backend
Flash Attention 3 via kernels package (Hopper-aware)
torch.compile with bf16 autocast for speed
ClimbMix-400B dataset from HuggingFace (parquet shards)
rustbpe + tiktoken for fast BPE tokenization
uv for dependency management and script execution
Key Subsystem — GPT Model + MuonAdamW Optimizer
train.py (the single mutable file)
├── GPTConfig             ← dataclass: depth, heads, embed dim, window pattern
├── CausalSelfAttention   ← GQA + RoPE + value embeddings + FA3
├── MLP                   ← Linear → ReLU² → Linear (squared-ReLU activation)
├── Block                 ← pre-norm (RMSNorm) + Attn + MLP with residuals
├── GPT                   ← embedding + N blocks + logit softcap + LM head
├── MuonAdamW             ← hybrid optimizer: Muon for 2D matrices, AdamW for the rest
├── polar_express_coeffs  ← precomputed coefficients for Newton-Schulz orthogonalization
├── Hyperparameters       ← DEPTH, ASPECT_RATIO, LRs, BATCH_SIZE, WEIGHT_DECAY
└── Training Loop         ← time-budgeted loop with LR warmup/warmdown
Value Embeddings
ResFormer-style: alternating layers get input-dependent gated value residuals added to V projections
Sliding Window Attention
SSSL pattern: 3 short-window + 1 long-window layers cycling, last layer always full context
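The SSSL cycling can be written out per layer. A small sketch — the actual window sizes and how train.py encodes the pattern are assumptions:

```python
def window_pattern(depth: int) -> list:
    """'S' = short sliding window, 'L' = long/full window, cycling SSSL.
    The final layer is always full-context, regardless of where the cycle lands."""
    pattern = ["S" if i % 4 < 3 else "L" for i in range(depth)]
    pattern[-1] = "L"
    return pattern
```

For the default depth of 8 this yields SSSL SSSL, with the last layer already long.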
Muon Optimizer
Nesterov momentum + Polar Express orthogonalization + NorMuon variance reduction for matrix params
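Muon's core step replaces each matrix update with an approximately orthogonal matrix. A numpy sketch using the quintic Newton-Schulz iteration from the widely circulated Muon implementation — the repo's Polar Express variant precomputes different per-step coefficients, which we do not reproduce here:

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G by pushing all singular values toward 1.
    Coefficients are from the public quintic Muon iteration (an assumption here)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize: spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

Each iteration applies an odd polynomial to the singular values, so after a few steps they cluster near 1 without ever computing an SVD.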
Residual Scaling
Learnable per-layer resid_lambdas and x0_lambdas mixing current hidden state with initial embeddings
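The per-layer mixing described above amounts to a two-term blend. A minimal sketch, with names taken from the description (actual tensor shapes and broadcasting in train.py are assumptions):

```python
import numpy as np

def mix_residual(h, x0, resid_lambda, x0_lambda):
    """Blend the current hidden state h with the initial embedding stream x0.
    resid_lambda / x0_lambda are learnable scalars, one pair per layer."""
    return resid_lambda * h + x0_lambda * x0
```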
Logit Softcap
Tanh-based logit capping at 15 to prevent extreme values and stabilize training
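The cap is a smooth saturation, not a hard clip: near zero it is almost the identity, while large logits are squashed toward ±15. A one-line sketch:

```python
import numpy as np

def softcap(logits, cap: float = 15.0):
    """Tanh softcap: identity-like near zero, asymptotically bounded by +/- cap."""
    return cap * np.tanh(logits / cap)
```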
BPB Evaluation
Bits-per-byte metric: vocab-size-independent, sums per-token cross-entropy weighted by UTF-8 byte lengths
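The arithmetic behind BPB: total cross-entropy converted from nats to bits, divided by the total UTF-8 byte count, which is what makes it vocab-size-independent. A sketch — the actual evaluate_bpb signature in the repo is an assumption:

```python
import math

def bits_per_byte(token_nlls_nats, token_byte_lens):
    """token_nlls_nats: per-token cross-entropy losses in nats;
    token_byte_lens: UTF-8 byte length of each token's decoded text."""
    total_bits = sum(token_nlls_nats) / math.log(2)  # nats -> bits
    total_bytes = sum(token_byte_lens)
    return total_bits / total_bytes
```

Two tokenizers with different vocab sizes produce different per-token losses, but the byte denominator puts them on the same scale.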
Data & Output Model
ClimbMix-400B Dataset
6,542 parquet shards on HuggingFace, Column: text (string), Pinned val shard: shard_06542, Stored: ~/.cache/autoresearch/data/
BPE Tokenizer
rustbpe-trained, tiktoken-wrapped, vocab_size: 8192, 4 special tokens (reserved_0..3), GPT-4 style split pattern, Stored: ~/.cache/autoresearch/tokenizer/
results.tsv
commit (7-char hash), val_bpb (float, lower=better), memory_gb (peak VRAM), status: keep | discard | crash, description (text)
run.log Output
val_bpb, training_seconds, total_seconds, peak_vram_mb, mfu_percent, total_tokens_M, num_steps, num_params_M, depth
DataLoader Batch
inputs: [B, T] int64 (token IDs), targets: [B, T] int64 (shifted by one), BOS-aligned best-fit packing, zero padding (100% utilization), pin-memory + async GPU copy
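Best-fit packing assigns each document to the open row with the least remaining space that still fits it, opening a new row only when none fits. A simplified sketch — BOS insertion and overflow splitting in the real DataLoader are omitted here as unknowns:

```python
def best_fit_pack(doc_lens, row_len):
    """Assign each doc (by token length) to the fullest row that still fits it.
    Returns a list of rows, each a list of the doc lengths packed into it."""
    rows, free = [], []  # free[i] = remaining space in rows[i]
    for n in doc_lens:
        fits = [i for i, f in enumerate(free) if f >= n]
        if fits:
            i = min(fits, key=lambda i: free[i])  # tightest fit
            rows[i].append(n)
            free[i] -= n
        else:
            rows.append([n])
            free.append(row_len - n)
    return rows
```

Tight packing is what lets the loader reach full utilization with no wasted padding tokens.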
Model Config (default)
depth: 8, n_embd: 512, n_head: 4, head_dim: 128, seq_len: 2048, vocab: 8192, aspect_ratio: 64, window: SSSL pattern
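The defaults above hang together as a dataclass; note that n_embd = depth × aspect_ratio (8 × 64 = 512). Field names mirror the table, but the exact names and layout in train.py's GPTConfig are assumptions:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    depth: int = 8
    aspect_ratio: int = 64  # model width grows with depth: n_embd = depth * aspect_ratio
    n_head: int = 4
    head_dim: int = 128
    seq_len: int = 2048
    vocab_size: int = 8192

    @property
    def n_embd(self) -> int:
        return self.depth * self.aspect_ratio
```

Tying width to depth through a single aspect-ratio knob means the agent can scale the model by changing one hyperparameter.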
Package / Directory Map
autoresearch/
├── train.py           THE mutable file: GPT model, MuonAdamW optimizer, training loop, hyperparams
├── prepare.py         read-only: constants, data download, tokenizer training, dataloader, eval
├── program.md         agent instructions: setup, experiment loop, logging rules, constraints
├── pyproject.toml     dependencies: torch, kernels, rustbpe, tiktoken, numpy, pandas, matplotlib
├── analysis.ipynb     Jupyter notebook for plotting experiment progress from results.tsv
├── README.md          project overview, quickstart, design choices, platform notes
├── .python-version    Python version pin (3.10+)
├── .gitignore         ignores cache, logs, results
├── uv.lock            locked dependency graph
└── progress.png       sample experiment progress chart

~/.cache/autoresearch/
├── data/              downloaded parquet shards (shard_00000..06542)
└── tokenizer/         tokenizer.pkl + token_bytes.pt
The Key Insight
AutoResearch inverts the traditional ML research workflow: instead of a human writing code and running experiments, the human writes a program.md that programs an AI agent to be the researcher. The entire system is deliberately minimal — just three files — with the constraint that the agent can only modify train.py while a fixed 5-minute time budget makes every experiment directly comparable. This creates an autonomous research loop that runs ~100 experiments overnight, evolving the training code through a keep/discard selection process analogous to evolution.