A monorepo of five self-contained PyTorch projects that walk the full educational arc of building, training, fine-tuning, and serving GPT-class language models — from a ~10M-parameter character-level toy you can train on a laptop CPU, up to an architecture-only blueprint for a frontier-scale (500B+) training platform.
Each subproject is independent: its own README, its own dependencies, its own tests. Pick the one that matches the scale you care about.
docs/ publishes a monthly state-of-the-art digest
covering LLM/AGI training, fine-tuning, and inference — ranked by ROI and
filtered for what's actually harvestable into these projects on consumer
hardware. Latest edition:
2026-06 — stop harvesting, start
measuring: two controlled A/Bs on real hardware (llamafied beats GPT-2 by
16.8 % ppl; FSDP2-over-PCIe flipped 0.69× → 1.28×), the MAI-Thinking-1
"hill-climbing" harvest, and four planned harvests landed and measured
(LoRA Without Regret, DeepConf, GSPO + RLPR). Prior:
2026-05 (Muon, Multi-Token Prediction,
Liger Kernel, DoRA/rsLoRA/NEFTune, FineWeb-Edu/DCLM data scaling).
Three projects in this repo come with published training plots and headline numbers from real single-GPU runs (all on 5060 Ti), plus one project with a discrete-event simulator that scales the same physics to a frontier cluster.
Four runs of a real PyTorch training loop on a single RTX 5060 Ti (16 GB, bf16) against char-level Tiny Shakespeare (1 MB). The whole sweep, including a 25 M parameter model overfitting hard for ~1.5 h, was run on a desktop GPU at home: no cluster, no API.
| Run | Params | Iters | ms/it | Wall | Final train | Best val | Final val | Overfit Δ |
|---|---|---|---|---|---|---|---|---|
| smoke | 0.86 M | 275 / 300 | 4.9 | ~2 s | 1.79 | 1.97 | 1.97 | ~0 |
| tiny | 10.65 M | 4,990 / 5,000 | 85.0 | ~8 min | 0.07 | 1.53 | 4.27 | 2.74 |
| tiny_clean | 10.65 M | 1,500 / 1,500 | 92.5 | ~2.3 min | 0.53 | 1.48 | 1.85 | 0.36 |
| small | 25.73 M | 15,000 / 15,000 | 347.9 | ~1.5 h | 0.03 | 1.86 | 5.47 | 3.60 |
All four rows were trained on the RTX 5060 Ti (16 GB, bf16). Loss columns are hardware-independent — comparable across all rows regardless of card — while ms/it and wall reflect the 5060 Ti's throughput.
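The Overfit Δ column appears to be final val loss minus best val loss. The sketch below recomputes it from the rounded table values under that assumption; the names `RUNS` and `overfit_gap` are illustrative and not taken from the repo, and results can differ from the table by about 0.01 because the table was presumably rounded from unrounded losses.

```python
# (best_val, final_val) per run, copied from the table above.
RUNS = {
    "smoke": (1.97, 1.97),
    "tiny": (1.53, 4.27),
    "tiny_clean": (1.48, 1.85),
    "small": (1.86, 5.47),
}


def overfit_gap(best_val: float, final_val: float) -> float:
    """How far validation loss drifted above its best point by the end of the run."""
    return final_val - best_val


# Assumption: Overfit delta = final val - best val (not stated explicitly in the source).
gaps = {name: overfit_gap(best, final) for name, (best, final) in RUNS.items()}

for name, gap in gaps.items():
    print(f"{name:<11} overfit gap = {gap:.2f}")
```

A gap near zero means the last checkpoint is about as good as the best one; a large gap means the best checkpoint, not the final one, is the one worth keeping.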
Two stories show up in the curves. tiny and small follow the classic
overfit pattern: best val arrives in the first ~1000 iters, then val
climbs monotonically while train collapses toward zero. tiny_clean
(same architecture as tiny, dropout=0.1, max_iters=1500) is the textbook
U-shaped counter-example: it lands on a better best-val while
overfitting roughly 8× less.
Side-by-side: tiny (no dropout) vs tiny_clean (dropout 0.1) on the
same 10.65 M model and same 1 MB dataset — the only intervention is
regularization + early stopping:
Per-run 3-panel plots (loss + LR + step time) and the parser that
generated them live at nanogpt-edu/out/ and
nanogpt-edu/tools/plot_nanogpt.py.
Full discussion in
nanogpt-edu/README.md.
A real from-scratch pretraining run of a 354 M-parameter GPT-2 on a
1 B-token slice of HuggingFaceFW/fineweb-edu, on a single RTX 5060 Ti
16 GB (Blackwell, sm_120). Same code as the MPS smoke run; just scaled
up to a real GPU and a real corpus.
| Metric | Value |
|---|---|
| Model | GPT-2 354 M (24 L × 1024 d × 16 H, tied embeddings) |
| Dataset | FineWeb-Edu sample-10BT, 1 B-token slice (streamed) |
| Tokens trained | 131 M (~0.37 tokens/param, vs ~20 for Chinchilla-optimal) |
| Wall-clock | 2 h 27 min (4 000 iters × 32 768 tok / step) |
| Throughput | 14.9 k tok/s sustained, 99 % GPU util |
| Peak VRAM | 11.9 GB allocated / 12.8 GB reserved (of 16 GB) |
| Train loss | 11.00 → 3.97 |
| Best val ppl | 58.2 (loss 4.064) at iter 3 800 |
Three panels: train loss (raw + EMA) and validation, cosine LR schedule, step time. The classic undertrained-Chinchilla shape — a fast ~400-iter drop as the model picks up the vocab and frequent bigrams, then a long slow descent as it actually starts modelling text. Val tracks train to within 0.05, no overfitting, plenty of capacity left.
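The perplexity figures quoted alongside each loss are consistent with ppl = exp(loss). A minimal sketch, assuming the loss is mean cross-entropy in nats (the PyTorch default); `perplexity` is an illustrative helper, not a function from the repo:

```python
import math


def perplexity(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss_nats)


# Val loss 4.064 is the best-val figure reported for the 354 M run.
best_val_ppl = perplexity(4.064)
print(f"val loss 4.064 -> ppl {best_val_ppl:.1f}")
```

Because the mapping is exponential, a small loss improvement late in training still moves perplexity by several points.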
Sample (T=0.7, k=50):
Photosynthesis is an important component of plant cell metabolism. It is important for the action of plants. The cell's cell activity is responsible for the formation of the micro-organisms…
Fluent English, vaguely on-topic, locally coherent within ~20 tokens — exactly what a 354 M model trained on 131 M tokens is supposed to sound like. The plumbing is correct; the model is just undertrained. Real GPT-2-345M (the OpenAI release) reaches val ppl ~26 on ~380× more compute.
Recipe:
midgpt/configs/gpt2_350m_fweb_5060ti.yaml.
Walkthrough + sample completions + calibration table:
midgpt/examples/5060ti_350m_fineweb.md.
A controlled architecture ablation on the same 2× 5060 Ti DDP harness: GPT-2 (learned-pos · LayerNorm · GELU) vs llamafied (RoPE · RMSNorm · SwiGLU · QK-norm), held iso-param (354.6 M vs 353.5 M) and iso-token (131 M), only the architecture varies.
| Arm | Best val ppl | @ iter | tok/s | Peak VRAM/GPU |
|---|---|---|---|---|
| A — GPT-2 (learned-pos·LN·GELU) | 57.8 | 3 800 | 19.3 k | ~11.9 GB |
| B — llamafied (RoPE·RMSNorm·SwiGLU·QK-norm) | 48.1 | 3 800 | 14.8 k | ~13.0 GB |
The Llama recipe wins by 16.8 % perplexity at equal compute — and leads at every one of 19 evals (by up to 40 % early in training), reaching GPT-2's final quality ~37 % sooner. The cost is ~23 % throughput + ~1 GB VRAM: SwiGLU's third matmul and QK-norm carry more activation memory (iso-param ≠ iso-activation), which OOM'd the naive config and forced a smaller micro-batch on the 16 GB card.
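The headline percentages follow directly from the A/B table. A short sketch of that arithmetic, with `relative_change` as an illustrative helper name:

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Signed fractional change of candidate relative to baseline."""
    return (candidate - baseline) / baseline


# Arm A (GPT-2) is the baseline, arm B (llamafied) the candidate.
ppl_change = relative_change(57.8, 48.1)        # negative = lower perplexity
tput_change = relative_change(19.3e3, 14.8e3)   # negative = fewer tok/s

print(f"perplexity: {ppl_change:+.1%}")
print(f"throughput: {tput_change:+.1%}")
```

Both numbers are relative to arm A, so the perplexity gain and the throughput cost are directly comparable.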
Full table + per-iteration trajectory + samples + the OOM-fix systems note:
midgpt/examples/5060ti_350m_llamafied_AB.md.
Single-GPU shake-out of distgpt's full multi-node training stack (the collectives no-op at world_size=1, but every other code path — DeviceMesh, FSDP2 wrapping policy, streaming dataloader with mid-epoch resume, DCP sharded checkpointer, SpikeMonitor + RewindController, AdamW + cosine + per-group WD — is on the critical path).
| Metric | Value |
|---|---|
| Model | 416 M Llama-arch (24 L × 1024 d × 16 H, GQA 4:1, tied embeddings, RoPE + RMSNorm + SwiGLU) |
| Dataset | FineWeb-Edu sample-10BT, 1 B-token slice (shared with midgpt/) |
| Tokens trained | 98 M (~0.24 tokens/param for 416 M, vs ~20 for Chinchilla-optimal) |
| Wall-clock | 2 h 22 min (3 000 steps × 32 768 tok / step) |
| Throughput | 11.5 k tok/s sustained, ~98 % GPU util |
| Peak VRAM | 12.0 GB allocated / 12.1 GB reserved (of 16 GB) |
| Train loss | 11.02 → 4.58 |
| Best val ppl | 60.7 (loss 4.105) at step 2 800 |
The point of the run isn't to beat midgpt/ (it doesn't — fewer tokens,
no GQA-speedup at this scale, FSDP wrapping overhead) but to prove the
distributed-training plumbing actually trains a model: a sharded DCP
checkpoint resumes cleanly, the streaming loader's LoaderState survives
restart, the spike monitor stays out of the way on a noisy small-batch
run. The writeup documents two real bugs the run surfaced (a
SpikeMonitor rewind loop that wasted 6 hours retraining the same 100
steps, and the reason running two ranks on one consumer 5060 Ti does
not work under NCCL 2.28) along with the fixes that landed alongside.
Recipe:
distgpt/configs/gpt_416m_fweb_5060ti.yaml.
Walkthrough + bug post-mortems + 2-rank-on-one-GPU notes:
distgpt/examples/5060ti_416m_fineweb.md.
A second 5060 Ti lights up the real FSDP2 collectives (all-gather +
reduce-scatter + 2-rank-sharded DCP) — over a PHB PCIe link with no
NVLink, so NCCL_P2P_DISABLE=1 routes every collective through host
memory. The naive dp=2 config ran at only 0.69× the throughput of one
GPU; two fixes (reshard_after_forward=false + gating gradient-sync to
the last micro-step) flip it to 1.28×, i.e. positive scaling.
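Gating gradient-sync to the last micro-step means gradients accumulate locally and the reduce-scatter runs once per optimizer step instead of once per micro-batch. The sketch below only models that schedule in plain Python and does not call PyTorch. In FSDP2 the toggle appears to be `set_requires_gradient_sync` on the wrapped module, but treat that as an assumption and check the PyTorch docs. The accumulation depth of 8 is hypothetical, not the repo's setting.

```python
def should_sync(micro_step: int, accum_steps: int) -> bool:
    """Sync gradients across ranks only on the last micro-step of an accumulation window."""
    return micro_step == accum_steps - 1


def collectives_per_optimizer_step(accum_steps: int, gated: bool) -> int:
    """Count gradient reduce-scatters issued for one optimizer step."""
    return sum(
        1
        for micro_step in range(accum_steps)
        if (should_sync(micro_step, accum_steps) if gated else True)
    )


ACCUM_STEPS = 8  # hypothetical accumulation depth
naive = collectives_per_optimizer_step(ACCUM_STEPS, gated=False)
gated = collectives_per_optimizer_step(ACCUM_STEPS, gated=True)
print(f"naive: {naive} syncs/step, gated: {gated} sync/step")
```

On a PCIe link without P2P, every avoided collective is a round trip through host memory that no longer happens, which is why the gating matters more here than it would over NVLink.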
| Metric | Value |
|---|---|
| GPUs / parallelism | 2× RTX 5060 Ti, FSDP2 dp=2 (PCIe PHB, no NVLink) |
| Tokens trained | 295 M (~0.71 tokens/param for 416 M, vs ~20 for Chinchilla-optimal) |
| Wall-clock | 5 h 33 min (4 500 steps × 65 536 tok / step) |
| Throughput | 14.7 k tok/s aggregate (7.4 k / GPU) |
| Per-GPU MFU | 10.3 % (vs 16.2 % single-GPU — the PCIe tax) |
| Peak VRAM | 12.8 GB allocated / 14.3 GB reserved per GPU |
| Train loss | 11.03 → 3.93 (low 3.66 at step 3 870) |
| Best val ppl | 41.6 (loss 3.728) at step 4 250 |
The cosine tail is the lesson: val plateaus noisily at ~4.0 from step 2 750–3 250, then the LR decay grinds it down to 41.6 ppl in the final quarter — beating the step-2 500 checkpoint and landing exactly on the single-GPU run's "295 M tokens → ppl ~42" forecast. Full calibration story (naive→optimized tables, the competing-VRAM OOM, why 2 ranks on one card is a dead end) in the same walkthrough.
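The table's token count, throughput, and the ~0.71 ratio in the Tokens trained row can all be recomputed from steps, tokens per step, wall-clock, and parameter count. A minimal sketch; `RunStats` is an illustrative container, not a class from the repo:

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    steps: int
    tokens_per_step: int
    wall_seconds: int
    n_params: float
    n_gpus: int = 1

    @property
    def tokens(self) -> int:
        return self.steps * self.tokens_per_step

    @property
    def tok_per_s(self) -> float:
        """Aggregate throughput across all GPUs."""
        return self.tokens / self.wall_seconds

    @property
    def tokens_per_param(self) -> float:
        return self.tokens / self.n_params


# 2x 5060 Ti run: 4 500 steps x 65 536 tok/step in 5 h 33 min, 416 M params.
dp2 = RunStats(4500, 65536, 5 * 3600 + 33 * 60, 416e6, n_gpus=2)
print(f"{dp2.tokens / 1e6:.0f} M tokens, {dp2.tok_per_s / 1e3:.1f} k tok/s aggregate, "
      f"{dp2.tok_per_s / dp2.n_gpus / 1e3:.1f} k tok/s per GPU, "
      f"{dp2.tokens_per_param:.2f} tokens/param")
```

The recomputed throughput lands within rounding of the reported 14.7 k tok/s, since the wall-clock in the table is only given to the minute.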
Discrete-event simulator (pure Python, no torch) that runs the full
program end-to-end: Chinchilla-style scaling laws, MFU → throughput →
wall time, Poisson GPU failures, rolling $ accounting, eval-score
prediction, safety thresholds, serving cost models. Optional
--real-gpu flag probes local CUDA devices and recalibrates
seconds_per_step from a few real training steps so the simulated wall
clock and $ figures match the silicon you actually own.
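The MFU → throughput → wall time chain can be sketched with the common 6·N·D approximation for training FLOPs. This is an assumption about the general approach, not the simulator's actual code; the function names, the 140 B-token budget, and the failure rate are hypothetical, and only the 989 TFLOP/s H100 spec and the 50 % MFU come from this README.

```python
SECONDS_PER_DAY = 86_400


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs as 6 * params * tokens."""
    return 6.0 * n_params * n_tokens


def wall_days(n_params: float, n_tokens: float, n_gpus: int,
              peak_flops_per_gpu: float, mfu: float) -> float:
    """Wall-clock days given cluster size, per-GPU peak FLOP/s, and MFU."""
    cluster_flops_per_s = n_gpus * peak_flops_per_gpu * mfu
    return training_flops(n_params, n_tokens) / cluster_flops_per_s / SECONDS_PER_DAY


def expected_failures(n_gpus: int, days: float, failures_per_gpu_day: float) -> float:
    """Mean of a Poisson failure process over the whole run."""
    return n_gpus * days * failures_per_gpu_day


# Hypothetical inputs: 7 B params, 140 B tokens, 512 GPUs at 989 TFLOP/s, 50 % MFU.
days = wall_days(7e9, 140e9, 512, 989e12, 0.5)
failures = expected_failures(512, days, failures_per_gpu_day=1e-3)
print(f"{days:.2f} days, ~{failures:.2f} expected GPU failures")
```

Wall time is inversely proportional to effective FLOP/s, so recalibrating the per-GPU figure from measured steps rescales the clock and the dollar totals without touching the loss or eval predictions.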
| Run | Cluster | Wall | Final loss | MMLU | Arena ELO | Total $ | Throughput model |
|---|---|---|---|---|---|---|---|
| 1b | 64× H100 | 3.7 d | 2.21 | 50.6% | 1515 | $0.93 M | 50% MFU × spec |
| 7b | 512× H100 | 4.8 d | 2.02 | 62.7% | 1711 | $1.02 M | 50% MFU × spec |
| 70b | 4,096× H100 | 13.2 d | 1.88 | 76.8% | 1985 | $3.31 M | 50% MFU × spec |
| 400b | 16,384× H100 | 54.0 d | 1.81 | 84.2% | 2142 | $42.42 M | 50% MFU × spec |
| 7b_realgpu | 512× H100 | 430.7 d | 2.03 | 62.7% | 1711 | $11.48 M | RTX 3050 bf16 (measured) |
The 7B-vs-7B-realgpu comparison is the punchline: same simulated cluster, but calibrating against an actually-measured 4.2 TFLOP/s per RTX 3050 (vs H100's 989 TFLOP/s spec) blows wall-clock from 5 days to 14 months and cost from $1 M to $11.5 M — eval scores are identical because scaling laws don't care how fast the GPUs are.
All five runs ship with per-run 3-panel plots (loss + cumulative $ +
cumulative failures), JSON summaries, and a reproducible CLI. See
frontier-platform/README.md
for the full story.
Three reproducible recipes that walk the consumer-GPU ladder, sharing
the same transformers + peft + trl plumbing and only differing in
model size / dataset / memory recipe. All three are 1-epoch LoRA r=16
runs against Qwen/Qwen2.5-Coder-*.
| Recipe | GPU | Base | Dataset | Packing | Grad-ckpt | Wall | Peak VRAM | Loss end |
|---|---|---|---|---|---|---|---|---|
| lora_3050.yaml | RTX 3050 8 GB | 0.5B | builtin 320 (memorize) | ✗ | ✗ | 1m 24s | 2.3 GB | 0.45 |
| lora_3050_1p5b.yaml | RTX 3050 8 GB | 1.5B | Magicoder-Py 2k | ✗ | ✓ | 24m 05s | 7.5 GB | 0.58 |
| lora_5060ti.yaml | RTX 5060 Ti 16 GB | 3B | Magicoder-Py 2.5k | ✓ | ✗ | 11m 59s | 15.1 GB | 0.55 |
The 5060 Ti recipe is the one to read: 2× the model in half the wall-clock of the 1.5B-on-3050 push recipe, because the 16 GB budget lets you (a) disable gradient checkpointing and (b) enable packing.
Left: training progress (% of 1 epoch). The 0.5B/builtin run drops to 0.45 because it's memorizing 320 short pairs — a smoke test. The two Magicoder runs land at 0.55–0.58 (real generalization on held-out prompts; see the 5060 Ti example for Levenshtein DP, BFS, LRU cache, and a retry decorator all generated correctly at T=0.2). Right: same losses on a log wall-clock axis — the 5060 Ti curve sits to the left of the 1.5B-on-3050 curve at every loss level despite training a 2× larger model.
coder-finetune/examples/5060ti_lora.md
walks through the headline run: Qwen/Qwen2.5-Coder-3B LoRA r=16 on 2,500
Python rows of ise-uiuc/Magicoder-OSS-Instruct-75K at seq_len=1024,
packed, gradient checkpointing off:
| Metric | Value |
|---|---|
| Wall-clock | 11 min 59 s (1 epoch, 161 packed steps) |
| Peak VRAM allocated | 13.87 GB |
| Peak VRAM reserved | 15.10 GB (of 16 GB) |
| Trainable params | 29.9 M (0.96 % of 3.09 B) |
| Train loss | 0.80 → 0.55 |
| Mean-token-acc | 0.82 → 0.85 |
| Tokens trained | 1.28 M |
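The small trainable fraction follows from how LoRA is sized: each adapted weight of shape d_out × d_in gains two low-rank factors with r·(d_in + d_out) parameters in total. A sketch of that count; the 2048 × 2048 layer is a hypothetical example, not Qwen's actual shape, and the helper names are illustrative:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(trainable: float, total: float) -> float:
    return trainable / total


# Hypothetical square projection at r=16.
per_layer = lora_params(2048, 2048, 16)

# Figures reported in the table above: 29.9 M trainable of 3.09 B total.
fraction = trainable_fraction(29.9e6, 3.09e9)
print(f"{per_layer:,} adapter params per 2048x2048 layer; {fraction:.2%} of the model trainable")
```

The adapter count grows linearly with rank, which is why an r=256 adapter is 16× the size of an r=16 one.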
coder-finetune/configs/lora_3050_1p5b.RESULTS.md
documents the 8-GB limit-pusher: same Magicoder dataset, 2,000 rows,
seq_len=1024, grad-ckpt on (the 3050 has ~500 MB headroom left at that
point — without it the recipe OOMs at step 1).
coder-finetune/examples/3050_lora.md
is the smoke run: Qwen/Qwen2.5-Coder-0.5B against a 16-pair built-in
instruction set (16 pairs × 20 repeats = 320 examples). No HF dataset download required.
Loss collapses from 2.85 to 0.45 in 80 steps because it's memorizing — the
point is to validate the whole plumbing end-to-end before reaching for a
real dataset.
Plotter: scripts/plot_training.py
(single-run) and
scripts/plot_compare_recipes.py
(cross-recipe) — both reusable for any TRL trainer_state.json.
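Reading a loss curve out of a trainer_state.json can be sketched as below. It relies on the Hugging Face Trainer convention of a `log_history` list whose training entries carry `step` and `loss`; `load_loss_curve` is an illustrative helper rather than the repo's plotter, and the demo values are made up.

```python
import json
import tempfile
from pathlib import Path


def load_loss_curve(path: Path) -> list:
    """Return (step, loss) pairs from a trainer_state.json, skipping eval-only entries."""
    state = json.loads(Path(path).read_text())
    return [
        (entry["step"], entry["loss"])
        for entry in state.get("log_history", [])
        if "step" in entry and "loss" in entry
    ]


# Demo on a tiny synthetic file (values are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "trainer_state.json"
    demo_path.write_text(json.dumps({
        "log_history": [
            {"step": 10, "loss": 0.8},
            {"step": 20, "eval_loss": 0.7},  # eval entry, no training loss
            {"step": 20, "loss": 0.55},
        ]
    }))
    curve = load_loss_curve(demo_path)

print(curve)
```

The resulting list can be handed to any plotting library; filtering on the `loss` key keeps eval and summary entries out of the training curve.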
The June SOTA edition turned four planned harvests into measured, charted runs on the 2× 5060 Ti. The throughline is honesty: none produced a flashy "we beat it" headline — each surfaced the real precondition or sizing fact, which is the more useful result.
- LoRA Without Regret: r=16 vs r=256 (coder-finetune, writeup). Three iso-rank A/Bs: r=256 ties r=16 at convergence but loses at every fixed epoch budget (the 16× adapter is slow to warm up), and a bigger 30k×9-language mixture didn't flip it. The binding constraint is training budget, not dataset size.
- DeepConf: test-time confidence filtering (nanogpt-edu, writeup). On a verifiable char-level addition model: confidence robustly tracks correctness, and online early-abort trades tokens for accuracy on a clean curve (~10 % fewer at near-iso accuracy). The offline vote-lift is a large-k/long-trace sizing fact.
- GSPO vs GRPO + RLPR (frontier-platform, writeup). GSPO's sequence-level importance ratio is ~4× lower-variance than GRPO's token ratio and wins on the MoE policy; RLPR's verifier-free reward sharpens the policy (answer-prob 0.44 → 0.70), but needs an SFT warm-start + KL anchor.
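The difference between the two importance ratios can be sketched in a few lines. This follows the GSPO formulation as commonly described, where the sequence-level ratio is the length-normalized (geometric-mean) likelihood ratio; it is not the repo's implementation, and the log-probabilities are invented for illustration.

```python
import math


def token_ratios(new_logps: list, old_logps: list) -> list:
    """GRPO-style: one importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps: list, old_logps: list) -> float:
    """GSPO-style: one ratio per sequence, exp of the mean per-token log-ratio."""
    log_ratios = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(log_ratios) / len(log_ratios))


# Made-up per-token log-probs under the new and old policy.
new_logps = [-1.0, -2.0, -0.5]
old_logps = [-1.2, -1.5, -0.5]

per_token = token_ratios(new_logps, old_logps)
per_sequence = sequence_ratio(new_logps, old_logps)
print([round(r, 3) for r in per_token], round(per_sequence, 3))
```

Averaging in log space damps individual token outliers, which is one intuition for why the sequence-level ratio tends to have lower variance; the ~4× figure above is the measured result, not something this sketch reproduces.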
| Project | Scale | What it teaches | Hardware |
|---|---|---|---|
| nanogpt-edu/ | 10M–100M | A correct transformer + training loop in ~500 lines: RoPE, RMSNorm, SwiGLU, AMP, cosine LR. | 1 GPU or CPU |
| midgpt/ | 124M–1.5B | GPT-2 scale with the real production toolbox: tiktoken BPE, gradient checkpointing, gradient accumulation, DDP, resumable runs, HellaSwag eval. | 1–8 GPUs, single node |
| distgpt/ | 1B–70B | Real multi-node training: FSDP2 + Tensor Parallel + Pipeline Parallel on a 3D device mesh, sharded DCP checkpoints, loss-spike rewind, streaming dataloader. | Multi-node cluster |
| coder-finetune/ | 0.5B–7B | Post-training on a single consumer GPU: full FT, LoRA, and QLoRA via HuggingFace transformers + peft + trl, plus GRPO/RLVR with verifiable unit-test rewards. HumanEval+ in a Docker sandbox. | 1 consumer GPU (≥6 GB) |
| frontier-platform/ | 1B–500B+ | Architecture-only blueprint: data acquisition → filtering → dedup → tokenizer → pretrain → SFT → RLHF/DPO → eval → red-team → serving → observability. Interfaces + design docs; bodies are NotImplementedError. | Design doc; no GPUs required |
The projects are designed to be read in order. Each one reuses the vocabulary of the previous and adds one production concern:
nanogpt-edu → midgpt → distgpt → coder-finetune → frontier-platform
minimal real tokenizer 3D parallelism post-training the whole system
correct AMP / grad-ckpt DCP checkpoints LoRA / QLoRA around training
transformer single-node DDP spike rewind SFT + GRPO/RLVR HumanEval+
coder-finetune is the orthogonal track: instead of pretraining from
scratch, it takes pretrained weights and aligns them for code.
frontier-platform zooms back out to show the dozen production systems
that surround the training loop in a real frontier lab.
Each subproject installs independently. There is no top-level build.
# Smallest: train a tiny GPT on TinyShakespeare in ~5 minutes
cd nanogpt-edu
python -m venv .venv && .venv/bin/pip install -r requirements.txt
.venv/bin/python prepare_shakespeare.py
.venv/bin/python train.py --config configs/tiny.py
.venv/bin/python sample.py --ckpt out/ckpt.pt --prompt "ROMEO:"

# GPT-2 scale on one node
cd midgpt
pip install -r requirements.txt
python prepare.py --dataset wikitext103
torchrun --standalone --nproc_per_node 8 train.py --config configs/gpt2_350m.yaml

# Fine-tune a code model on a consumer GPU
cd coder-finetune
pip install -r requirements.txt
python train.py --config configs/lora.yaml
python eval/run_humaneval.py --model out/lora --n-samples 1

# Multi-node FSDP2 + TP + PP
cd distgpt
pip install -e .
# launch via Slurm or torchrun-elastic; see distgpt/scripts/

# Read the blueprint
cd frontier-platform
pip install -e .
$EDITOR docs/00-overview.md

Every subproject ships pytest smoke tests:

cd <subproject> && pytest

Tests run without installing the package: they use a sys.path shim so
you can iterate without a reinstall.
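A sys.path shim of this kind usually lives in a conftest.py and puts the project root on the import path before tests are collected. The sketch below is an assumption about the general pattern, not the repo's actual shim; `add_project_root` and the example path are hypothetical.

```python
import sys
from pathlib import Path


def add_project_root(anchor_file: str, levels_up: int = 1) -> Path:
    """Prepend the directory `levels_up` above anchor_file's folder to sys.path (once)."""
    root = Path(anchor_file).resolve().parents[levels_up]
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root


# In a real tests/conftest.py this would be add_project_root(__file__).
# The path below is a hypothetical stand-in so the snippet runs anywhere.
root = add_project_root("/tmp/example-project/tests/conftest.py")
print(root)
```

Because the root is inserted at the front of sys.path, tests import the working-tree sources rather than any previously installed copy.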
To run everything (pytest in each subproject + ruff at the root) in one shot:
python3 tools/orchestrate.py # tests + lint
python3 tools/orchestrate.py --tests # tests only
python3 tools/orchestrate.py --lint # lint only
python3 tools/orchestrate.py -p midgpt # one project

CI mirrors this matrix in .github/workflows/tests.yml.
LLM-playground/
├── nanogpt-edu/ # 10M–100M, single-file, educational
├── midgpt/ # 124M–1.5B, single-node DDP, tiktoken
├── distgpt/ # 1B–70B, multi-node FSDP2 + TP + PP
├── coder-finetune/ # 0.5B–7B, SFT / LoRA / QLoRA on HF
├── frontier-platform/ # 1B–500B+, architecture blueprint + design docs
├── docs/ # SOTA Watch — monthly LLM & AGI research digest
├── tools/orchestrate.py # one-shot test+lint runner across all subprojects
├── pyproject.toml # shared ruff config (no shared build)
├── .github/workflows/ # CI matrix: pytest each subproject + repo-wide ruff
├── JAAICODE.md # AI-assistant project instructions
└── README.md # this file
- Python ≥ 3.10, from __future__ import annotations, PEP-604 unions (str | None), built-in generics.
- @dataclass for configs and small value types.
- YAML configs in configs/; checkpoints and artefacts in out/.
- No cross-subproject imports: each project is deliberately standalone.
These are study projects. nanogpt-edu, midgpt, distgpt, and
coder-finetune are runnable code. frontier-platform is a design doc with
typed skeletons — every public function has a signature and a docstring,
but most bodies raise NotImplementedError. Running a real frontier model
takes thousands of GPUs and tens of millions of dollars; this repo is the
map, not the territory.
See individual subprojects.










