fiale-plus/autoresearch-ane
autoresearch


One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026.

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of nanochat. The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org. The default program.md in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this tweet.

How it works

The repo is deliberately kept small and only really has three files that matter:

  • prepare.py — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
  • train.py — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. This file is edited and iterated on by the agent.
  • program.md — baseline instructions for one agent. Point your agent here and let it go. This file is edited and iterated on by the human.

By design, training runs for a fixed 5-minute time budget (wall clock, excluding startup/compilation), regardless of the details of your compute. The metric is val_bpb (validation bits per byte) — lower is better, and vocab-size-independent so architectural changes are fairly compared.
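Bits per byte is just cross-entropy rescaled. A minimal sketch of the conversion (an illustrative helper, not the actual evaluation code in prepare.py):

```python
import math

def bits_per_byte(mean_nll_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Hypothetical helper: convert mean cross-entropy (nats/token) to bits/byte.

    A tokenizer with larger tokens sees fewer tokens per byte but a harder
    prediction task, so the two effects cancel and bpb stays comparable
    across vocab sizes.
    """
    bits_per_token = mean_nll_nats / math.log(2)
    return bits_per_token * total_tokens / total_bytes
```

For example, a mean loss of ln(4) nats/token (2 bits/token) over 4 tokens spanning 16 bytes gives 0.5 bits/byte.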

If you are new to neural networks, this "Dummy's Guide" looks pretty good for a lot more context.

Quick start

Requirements: A single NVIDIA GPU (tested on H100), Python 3.10+, uv.

# 1. Install uv project manager (if you don't already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py

# 4. Manually run a single training experiment (~5 min)
uv run train.py

If the commands above all run OK, your setup is working and you can switch into autonomous research mode.

Running the agent

Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permissions), then you can prompt something like:

Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.

The program.md file is essentially a super lightweight "skill".

Project structure

prepare.py      — constants, data prep + runtime utilities (do not modify)
train.py        — model, optimizer, training loop (agent modifies this)
program.md      — agent instructions
pyproject.toml  — dependencies

Design choices

  • Single file to modify. The agent only touches train.py. This keeps the scope manageable and diffs reviewable.
  • Fixed time budget. Training always runs for exactly 5 minutes, regardless of your specific platform, so you can expect roughly 12 experiments/hour and roughly 100 experiments while you sleep. This design decision has two upsides. First, experiments are directly comparable regardless of what the agent changes (model size, batch size, architecture, etc.). Second, autoresearch finds the model best suited to your platform within that time budget. The downside is that your runs (and results) are not comparable to those of people running on other compute platforms.
  • Self-contained. No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.
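The fixed wall-clock budget can be sketched as a loop like the following (illustrative only; the real train.py loop and its warmup handling may differ):

```python
import time

def train_with_budget(step_fn, warmup_steps: int = 1, budget_sec: float = 300.0) -> int:
    """Illustrative sketch of a fixed wall-clock training budget.

    The first warmup_steps run off the clock, mirroring the rule that
    startup/compilation time is excluded from the 5-minute budget.
    """
    for _ in range(warmup_steps):
        step_fn()  # e.g. triggers torch.compile, excluded from timing
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_sec:
        step_fn()
        steps += 1
    return steps
```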

Platform support

This code currently requires a single NVIDIA GPU. In principle it is quite possible to support CPU, MPS, and other platforms, but doing so would also bloat the code, and I'm not sure I want to take that on personally right now. People can reference (or have their agents reference) the full/parent nanochat repository, which has wider platform support and shows the various solutions (e.g. a Flash Attention 3 kernel fallback implementation, generic device support, autodetection). Feel free to create forks or discussions for other platforms; I'm happy to link to them here in the README in a new notable-forks section.

Seeing as there is a lot of interest in tinkering with autoresearch on much smaller compute platforms than an H100, a few extra words. If you're going to try running autoresearch on smaller computers (MacBooks etc.), I'd recommend one of the forks below. On top of that, here are some recommendations for aspiring forks on tuning the defaults down to much smaller models:

  1. To get half-decent results, use a dataset with much less entropy, e.g. the TinyStories dataset of GPT-4-generated short stories. Because the data is much narrower in scope, you will see reasonable results from much smaller models (if you sample from them after training).
  2. You might experiment with decreasing vocab_size, e.g. from 8192 down to 4096, 2048, 1024, or even a simple byte-level tokenizer with 256 possible bytes after UTF-8 encoding.
  3. In prepare.py, you'll want to lower MAX_SEQ_LEN a lot, depending on the computer even down to 256 etc. As you lower MAX_SEQ_LEN, you may want to experiment with increasing DEVICE_BATCH_SIZE in train.py slightly to compensate. The number of tokens per fwd/bwd pass is the product of these two.
  4. Also in prepare.py, you'll want to decrease EVAL_TOKENS so that your validation loss is evaluated on a lot less data.
  5. In train.py, the primary knob that controls model complexity is DEPTH (default 8). Many other variables are simply functions of it, so e.g. lower it to 4.
  6. You'll most likely want a WINDOW_PATTERN of just "L", because "SSSL" uses an alternating banded attention pattern that may be very inefficient on your hardware. Try it.
  7. You'll want to lower TOTAL_BATCH_SIZE a lot, but keep it a power of 2, e.g. down to even 2**14 (~16K) or so; hard to tell.

I think these are the reasonable hyperparameters to play with. Ask your favorite coding agent for help and paste it this guide along with the full source code.
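For a quick sanity check of the arithmetic behind points 3 and 7, here is a sketch using hypothetical values (the repo's actual defaults may differ):

```python
# Hypothetical downscaled settings; names mirror prepare.py/train.py constants.
def tokens_per_pass(max_seq_len: int, device_batch_size: int) -> int:
    # tokens in one fwd/bwd pass = MAX_SEQ_LEN * DEVICE_BATCH_SIZE (point 3)
    return max_seq_len * device_batch_size

small = tokens_per_pass(max_seq_len=256, device_batch_size=32)  # 8192 tokens/pass
total_batch_size = 2**14                                        # point 7: ~16K tokens
accum_steps = total_batch_size // small                         # grad-accum passes per optimizer step
```

With these illustrative numbers, each optimizer step needs 2 accumulation passes.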

ANE Backend (Apple Neural Engine)

This fork adds an ANE training backend that runs transformer training directly on the Apple Neural Engine via reverse-engineered private APIs. No GPU required — trains on the 15.8 TFLOPS ANE available in every Apple Silicon Mac.

How it works

  • Uses TinyStories dataset with Llama2 32K BPE tokenizer (ANE's native data format)
  • Dynamic weight pipeline: 13 ANE kernels compiled once at startup (~1s). Weights passed via IOSurface spatial dimensions using slice_by_size — no recompilation during training
  • Mega-kernel fusion: Forward pass uses fused sdpaWoFwd (SDPA + Wo projection in one kernel) and fused qkvBwd (Q+KV backward in one kernel), eliminating 12 IOSurface round-trips per step
  • Pipeline overlap: CPU gradient computations (dW cblas) run asynchronously during ANE forward pass. Embedding backward is also async.
  • After each Lion/Adam update, weights are transposed and re-staged to per-layer IOSurfaces
  • Metric is val_loss (cross-entropy), not val_bpb — experiments are compared within this framework
  • Agent edits only ane/experiment_config.h (architecture + optimizer hyperparameters + feature toggles)

Current best results

val_loss = 2.432 (~95 autonomous experiment cycles across 7 phases, ~67M param model, 5-min budget per cycle)

Starting from 6.109 baseline, key improvements discovered through autonomous experimentation:

| Phase | Change | val_loss | ms/step | Steps/5min |
|---|---|---|---|---|
| Static kernels | Baseline (NL=12, SEQ=256) | 6.109 | | ~400 |
| | NL=6, SEQ=512 + ACCUM=1 | 5.978 | | ~120 |
| | Optimizer tuning (betas, WD, anneal cycles) | 5.414 | | ~60 |
| + ncdrone optimizer | Loss scaling, softcap, diff LR, cosine sched | 5.023 | | ~120 |
| Dynamic pipeline | One-time compile, no recompilation | 3.89 | 250 | ~1340 |
| + vDSP Adam | Vectorized optimizer, parallel layer updates | 3.102 | 176 | ~1284 |
| + hyperparameter sweep | LR=5e-4, WD=0.1, SOFTCAP=30, MATRIX_LR=0.1 | 3.099 | 176 | ~1631 |
| + Lion + pipeline | Lion optimizer, async dW overlap, vocab compaction | 3.079 | 175 | ~1421 |
| + kernel fusion | Fused sdpaWoFwd + qkvBwd mega-kernels | 2.489 | 96 | ~2822 |
| + LOSS_SCALE=1024 | Better FP16 gradient stability (from Slavko ecosystem) | 2.477 | 97 | ~2800 |
| + EMBED_LR=1.0 | Equal LR for embeddings (was 2x, overfitting) | 2.432 | 99 | ~2700 |

Key discoveries

  • Dynamic weight pipeline (11x speedup): Compile 10 ANE kernels once at startup, pass weights via IOSurface spatial dimensions. Eliminated per-batch recompilation that consumed ~60% of wall time.
  • Mega-kernel fusion (45% faster steps): Fusing sdpaFwd+woFwd into one kernel and qBwd+kvBwd into another eliminated 12 IOSurface round-trips per step. The bottleneck was IOSurface lock/unlock/memcpy overhead, not compute.
  • Lion optimizer: Sign-based weight updates with no second moment buffer. ~2x faster per update than Adam. Counter-intuitively, works best with Adam-style hyperparams (LR=5e-4, WD=0.1), not the lower LR/higher WD recommended in the paper.
  • Vocab compaction (3.5x classifier speedup): Only ~9K of 32K tokens appear in TinyStories. Reducing the classifier SGEMM from 32K to 9K vocab is free accuracy-wise.
  • ACCUM ramping: Start with low ACCUM (noisy but many updates) for early training, ramp up each cycle for smoother gradients. Sweet spot: ACCUM=12-14 for Lion, ACCUM=20-48 for Adam.
  • LR schedule tuning is critical for multi-cycle runs: TOTAL_STEPS must match the actual training window. Too high → model overfits (train_loss=0.87, val_loss=3.9). Too low → LR exhausts early, later cycles waste time.
  • LOSS_SCALE=1024 (April 2026, from ecosystem): Slavko/ANE-Training benchmarks show FP16 gradient underflow is worse than expected. LOSS_SCALE=1024 (up from 512) stabilizes the backward pass and prevents silent gradient vanishing. Improved val_loss from 2.489 to 2.477.
  • EMBED_LR_SCALE=1.0 (April 2026): Embeddings were overfitting with 2× base LR. Equal LR (1.0) for both embeddings and norms gave better generalization, pushing val_loss from 2.477 to 2.432. The embedding matrix is already the largest parameter block (8M of 67M params) and doesn't need extra LR.
  • Adam is worse than Lion here (April 2026): Tested Adam (LR=3e-4) against Lion (LR=5e-4). Adam achieved val_loss=3.23 after 3 cycles at ~99ms/step, significantly worse than Lion's 2.51 at the same point. Lion's sign-based updates are more robust for ANE's FP16 compute path.
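For reference, the Lion update amounts to a few lines. This is a scalar sketch of the algorithm as described above, not the ANE C implementation:

```python
def lion_step(w, g, m, lr=5e-4, beta1=0.9, beta2=0.95, wd=0.1):
    """Scalar sketch of one Lion update (sign-based, single momentum buffer).

    No second-moment buffer and no divide-by-sqrt(v), which is plausibly
    why it behaves better than Adam on ANE's FP16 compute path.
    Hyperparameter defaults follow the values reported above.
    """
    c = beta1 * m + (1 - beta1) * g      # interpolate momentum and gradient
    update = (c > 0) - (c < 0)           # take the sign
    w = w - lr * (update + wd * w)       # step + decoupled weight decay
    m = beta2 * m + (1 - beta2) * g      # momentum update
    return w, m
```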

What didn't work

  • GQA with non-equal KV heads: Crashes the MIL compiler. Must keep N_KV_HEADS=HEADS.
  • Fused SDPA backward kernel: ANE compiler rejects with "Graph has a cycle path" — too complex.
  • Bigger architectures (DIM=1024, NLAYERS=8): Can't converge in 5-min budget. More steps always beats bigger models.
  • Lion paper hyperparams (LR/3, WD×3): Diverged badly. Empirical testing always beats paper defaults.
  • Adam optimizer (Phase 6): val_loss=3.23 after 3 cycles at ACCUM=8 — much worse than Lion (2.51 at same point). Lion's sign-based updates are more robust for ANE's FP16 path.
  • LOSS_SCALE=256: Too low, gradient underflow causes 5.5 plateau.
  • EMBED_LR_SCALE=2.0: Embeddings overfit. Equal LR (1.0) generalizes better.
  • WEIGHT_DECAY=0.05: Under-regularized, val_loss=2.45 vs 2.43 with WD=0.1.
  • WEIGHT_DECAY=0.2: Over-regularized, val_loss=2.48 vs 2.43 with WD=0.1.
  • SOFTCAP=20: Too aggressive, val_loss=2.53 vs 2.43 with SOFTCAP=30.
  • SOFTCAP=50: Too permissive, val_loss=2.54 vs 2.43 with SOFTCAP=30.
  • MATRIX_LR_SCALE=0.05: Too slow for weight matrices, val_loss=2.59.
  • LR=8e-4: Too aggressive, diverges.

Hyperparameters

The agent edits ane/experiment_config.h. All hyperparameters and their current best values:

Architecture (changing these resets checkpoint):

| Parameter | Value | Notes |
|---|---|---|
| DIM | 768 | Model dimension |
| HIDDEN | 2048 | FFN hidden dimension |
| HEADS | 12 | Attention heads (DIM must be divisible by HEADS) |
| SEQ | 512 | Sequence length. 512 is optimal; 1024 hits the ANE SRAM wall |
| NLAYERS | 6 | Transformer layers. 6 is the sweet spot: fewer layers = faster steps = more training in the 5-min budget |

Optimizer (safe to change between runs):

| Parameter | Value | Notes |
|---|---|---|
| LEARNING_RATE | 5e-4f | Base learning rate (scaled by the differential LR multipliers below) |
| ADAM_BETA1 | 0.9f | First-moment decay (used by both Adam and Lion) |
| ADAM_BETA2 | 0.95f | Second-moment decay / Lion momentum update |
| ADAM_EPS | 1e-8f | Adam epsilon (unused by Lion) |
| ACCUM_STEPS | 12 | Gradient accumulation steps per weight update + restage. Ramp up during training (2→12) |
| GRAD_CLIP_MAX | 1.0f | Global L2 gradient norm clip threshold |
| WEIGHT_DECAY | 0.1f | Decoupled weight decay. Applied only to weight matrices, not embeddings or RMSNorm |
| TOTAL_STEPS | 3000 | Cosine LR schedule denominator (adam_t units). Must match the optimal training window |
| LR_WARMUP_STEPS | 100 | Linear warmup steps before cosine decay |
| LR_MIN_FRAC | 0.1f | Cosine schedule decays LR to this fraction of max |
| LOSS_SCALE | 1024.0f | Loss scaling factor that prevents FP16 gradient underflow. 1024 (up from 512) gives better FP16 stability |
| SOFTCAP | 30.0f | Logit softcapping: cap * tanh(logits/cap), clamps logits to [-cap, cap] |
| EMBED_LR_SCALE | 1.0f | Embedding LR = base LR × this. Equal LR works better than 2× (embeddings overfit at 2×) |
| MATRIX_LR_SCALE | 0.1f | Weight matrix LR = base LR × this |
| USE_LION | 1 | Lion optimizer (sign-based updates, no second moment, ~2x faster per update) |
| USE_VOCAB_COMPACT | 1 | Vocab compaction: 32K→9K active tokens, 3.5x classifier SGEMM speedup |
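The SOFTCAP entry is compact enough to sketch directly, including the chain-rule correction needed in the backward pass (illustrative Python, not the ANE kernel code):

```python
import math

CAP = 30.0  # SOFTCAP value from the table above

def softcap_forward(logit: float, cap: float = CAP) -> float:
    # cap * tanh(x / cap): near-identity for |x| << cap, saturates at +/-cap
    return cap * math.tanh(logit / cap)

def softcap_backward(logit: float, upstream: float, cap: float = CAP) -> float:
    # chain rule: d/dx [cap * tanh(x / cap)] = 1 - tanh(x / cap)**2
    return upstream * (1.0 - math.tanh(logit / cap) ** 2)
```

Note the gradient vanishes for saturated logits, which is exactly what clamps runaway logits during training.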

Optimizer features:

  • Lion optimizer (default) — sign-based weight updates: w -= lr * sign(β1*m + (1-β1)*g), no second moment buffer. ~2x faster per update than Adam, half the optimizer memory. Toggle USE_LION to switch back to AdamW.
  • Gradient clipping — global L2 norm across all parameters using vDSP
  • Cosine LR schedule — linear warmup for LR_WARMUP_STEPS, then cosine decay to LR_MIN_FRAC of max LR. TOTAL_STEPS controls the schedule length — critical for multi-cycle training
  • Loss scaling (1024×) — scales gradients up before FP16 backward pass, undone during gradient averaging. 1024 (up from 512) prevents underflow and improves FP16 stability (ecosystem: Slavko/ANE-Training)
  • Logit softcapping — cap * tanh(logits/cap) before the softmax, with a chain-rule correction in the backward pass
  • Differential learning rates — embeddings at 1× base LR (equal), weight matrices at 0.1× base LR, norm params at 1× base LR
  • Vocab compaction — only ~9K of 32K tokens appear in TinyStories. Classifier SGEMM reduced 3.5x with zero accuracy impact
  • Residual scaling — residual connections scaled by 1/sqrt(2*NLAYERS) to stabilize deep residual streams
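The warmup + cosine schedule described above can be sketched as follows (names mirror the config table; the actual C implementation may differ in details):

```python
import math

def lr_at(step: int, base_lr=5e-4, warmup=100, total=3000, min_frac=0.1) -> float:
    """Linear warmup, then cosine decay to min_frac * base_lr."""
    if step < warmup:
        return base_lr * (step + 1) / warmup                 # LR_WARMUP_STEPS ramp
    t = min((step - warmup) / max(1, total - warmup), 1.0)   # clamp past TOTAL_STEPS
    cosine = 0.5 * (1.0 + math.cos(math.pi * t))             # 1 -> 0
    return base_lr * (min_frac + (1.0 - min_frac) * cosine)  # floor at LR_MIN_FRAC
```

Clamping t at 1.0 means the LR sits at the floor rather than going negative if training runs past TOTAL_STEPS, which matters for the multi-cycle runs discussed above.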

Differences from the CUDA backend

The ANE backend is a separate training stack, not a port of train.py. Key differences:

| | CUDA (train.py) | ANE (train_ane.m) |
|---|---|---|
| Optimizer | Muon + AdamW (per-parameter-group LRs, weight decay, momentum scheduling) | Lion (default) or AdamW (differential LR, cosine schedule, loss scaling, logit softcap, residual scaling) |
| Data | climbmix-400b, custom 8K BPE | TinyStories, Llama2 32K BPE |
| Metric | val_bpb (bits per byte) | val_loss (cross-entropy) |
| Language | Python / PyTorch | Objective-C / raw MIL kernels |
| Attention | Flash Attention, sliding window patterns | Standard attention via ANE conv ops |

Results are not comparable across backends — each is its own self-contained research loop.

Setup

# 1. Download TinyStories data (~1 GB)
cd ane && bash download_data.sh

# 2. Compile the training binary
make -C ane train_ane

# 3. Run a single experiment (5 minutes)
python harness_ane.py

# 4. Or run with custom wall time
ANE_WALL_TIME=60 python harness_ane.py

Running the autonomous agent

Point Claude at program_ane.md:

Hi, have a look at program_ane.md and let's kick off a new ANE experiment!

The agent will modify ane/experiment_config.h, run experiments via harness_ane.py, and iterate autonomously.

ANE-specific files

ane/                         # ANE training backend
├── experiment_config.h      # Agent's ONLY editing target
├── stories_config.h         # Model config, structs (includes experiment_config.h)
├── stories_io.h             # IOSurface I/O, dynamic weight staging, request helpers
├── stories_mil_dynamic.h    # Dynamic MIL kernel generators (13 kernels incl. fused mega-kernels)
├── stories_cpu_ops.h        # CPU ops (RMSNorm, SiLU bwd, cross-entropy, Adam, Lion, vocab compaction)
├── train_ane.m              # Training binary (dynamic pipeline, one-time compile, wall-time budget)
├── download_data.sh         # TinyStories data download
└── Makefile                 # Build train_ane binary
harness_ane.py               # ANE orchestrator
program_ane.md               # ANE agent instructions

The original CUDA files (prepare.py, train.py, program.md) remain untouched and work as before if you have a GPU.

Knowledge sources

This project builds on and references the following repositories:

  • maderix/ANE — First project to train transformers directly on the Apple Neural Engine using Objective-C and raw MIL kernel compilation. The ANE training backend in this repo is based on this work.
  • miolini/autoresearch-macos — MacOS fork of autoresearch adapted for Apple Silicon. Early reference for running autonomous research on Mac hardware.
  • slavko-at-klincov-it/ANE-Training — Full libane C API (76 private classes reverse-engineered), Metal fused Adam optimizer (33.7ms for 110M params), comprehensive M4 Mini benchmarks, and 10K-step training runs. Key insights: LOSS_SCALE=1024, activation clipping, QoS 9 for ANE dispatch, hardware utilization metrics.
  • mechramc/orion — Production ANE runtime for Stories110M/GPT-2. Delta compilation, Graph IR compiler, LoRA hot-swapping. Documents 14 ANE hardware constraints. Published as arXiv:2603.06728.
  • vipuldivyanshu92/ANEgpt — ANE transformer training with async CPU-ANE pipelining, kernel lifecycle separation, and per-operation profiling.
  • imperatormk/ane-train — Runtime IOSurface weight injection without recompilation. Passes weights as input tensors, compiles once, updates via memcpy each step. Runs a 28-block ConvNeXt UNet at ~3 it/s on M1. Discovered key constraints: IOSurface slot sizes must be strictly ascending for inputs / descending for outputs (silent zeros otherwise), matmul inner dim must be multiple of 32, grouped/depthwise conv fails with runtime weights.
  • christopherkarani/Espresso — Pure Swift ANE transformer inference achieving 4.76x throughput vs CoreML (1.08ms/token vs 5.09ms). Fused 3-layer kernels, zero-copy I/O. Actively maintained.
  • ncdrone/rustane — Rust-native hybrid ANE + Metal GPU training and inference engine. Community benchmark leaderboard.

Recent findings from the ecosystem (April 2026)

LOSS_SCALE=1024 improves FP16 stability (slavko-at-klincov-it/ANE-Training): Comprehensive M4 Mini benchmarks show LOSS_SCALE=1024 (up from 512) prevents FP16 gradient underflow. Our experiments confirmed a 0.012 val_loss improvement (2.489→2.477). Also: activation clipping (maxact=100) prevents activation explosion over long runs, and QoS 9 (Background) is the fastest for ANE dispatch.
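A standalone illustration of why loss scaling matters for FP16 (this reproduces the underflow effect with NumPy; it is not the ANE code path):

```python
import numpy as np

def scaled_backward_fp16(grad_fp32: float, loss_scale: float = 1024.0):
    """Standalone illustration of FP16 loss scaling.

    Gradients below FP16's smallest subnormal (~6e-8) flush to zero when
    cast to half precision. Scaling the loss (hence every gradient) up by
    LOSS_SCALE keeps them representable; dividing after the cast recovers
    the value.
    """
    naive = np.float16(grad_fp32)                # underflows to 0 for tiny grads
    scaled = np.float16(grad_fp32 * loss_scale)  # survives the cast
    recovered = float(np.float32(scaled)) / loss_scale
    return float(naive), recovered
```

For example, a true gradient of 2e-8 becomes exactly 0 without scaling, but round-trips to within about 1% with LOSS_SCALE=1024.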

Embedding LR equalization (our own Phase 6-7 experiments, April 2026): EMBED_LR_SCALE=2.0 caused the embeddings to overfit. Reducing it to 1.0 (equal LR for all parameters) was the single highest-impact change in Phase 6, improving val_loss from 2.477 to 2.432. The embedding matrix is already the largest parameter block (8M of 67M params) and doesn't need a higher LR.

Metal fused Adam (slavko-at-klincov-it/ANE-Training): GPU-accelerated optimizer using Metal compute shaders. 33.7ms for 110M params vs CPU Adam 77ms (2.3× faster). Float4-vectorized kernel. Not yet adopted — Lion is already fast enough and Adam is worse on this task.

E5 Runtime research (maderix/ANE PR #40): Custom MIL text can be compiled directly to ANE via MLE5ProgramLibraryOnDeviceAOTCompilationImpl. Legacy _ANEChainingRequest API is dead on macOS 15+; E5 runtime (MLE5Engine) is the modern path. 7 test programs (~7K lines of reverse-engineering experiments).

Mixed weight type limit (mechramc/orion issue #3): ANE programs with both RMSNorm and linear weights fail at 16 total weights. Pure linear can have 16. Each norm reduces the limit by 1. Practical limit: ~3 FFN layers per ANE mega-kernel.

M3 Ultra 512ch constraint (maderix/ANE issue #42): 512 channels is the ONLY valid channel count on M3 Ultra. Everything else fails with -4 or -3. Peak sustained: 8.77 TFLOPS at 128x conv depth.

Security fix needed (maderix/ANE PR #45): Untrusted model config fields (n_layers, dim, hidden_dim) can cause OOB writes and NULL deref. Bounds checking required after reading Config from disk.

ANE hardware constraints (compiled from ecosystem)

| Constraint | Source | Impact |
|---|---|---|
| 512 channels is the only valid channel count on M3 Ultra | maderix #42 | Hard limit; all others fail |
| Mixed weight limit: 16 - n_norms | mechramc #3 | Limits mega-kernel layer count |
| IOSurface slots must be ascending (in) / descending (out) | imperatormk | Silent zeros on violation |
| Matmul inner dim must be a multiple of 32 | imperatormk | Silent zeros on violation |
| ~100 kernel compilations before a process restart is needed | our experiments | ANECompilerService daemon leak |
| Thermal throttle after ~60 min continuous use (+60% step time) | our experiments | Plan for cooling breaks |
| reduce_sum on the channel dim requires a reshape to spatial first | our experiments | RMSNorm fusion crashes |
| GQA with non-equal KV heads crashes the MIL compiler | our experiments | Use equal head counts |
| constexpr_affine_dequantize incompatible with the dynamic pipeline | our experiments | Bakes weights at compile time |

License

MIT

About

AI agents running research on Apple Neural Engine (ANE)
