tomz/LLM-playground

LLM-playground

A monorepo of five self-contained PyTorch projects that walk the full educational arc of building, training, fine-tuning, and serving GPT-class language models — from a ~10M-parameter character-level toy you can train on a laptop CPU, up to an architecture-only blueprint for a frontier-scale (500B+) training platform.

Each subproject is independent: its own README, its own dependencies, its own tests. Pick the one that matches the scale you care about.

SOTA Watch — monthly LLM & AGI digest

docs/ publishes a monthly state-of-the-art digest covering LLM/AGI training, fine-tuning, and inference, ranked by ROI and filtered for what's actually harvestable into these projects on consumer hardware. Latest edition: 2026-06, "stop harvesting, start measuring": two controlled A/Bs on real hardware (llamafied beats GPT-2 by 16.8 % ppl; FSDP2-over-PCIe flipped 0.69× → 1.28×), the MAI-Thinking-1 "hill-climbing" harvest, and four planned harvests landed and measured (LoRA Without Regret, DeepConf, GSPO + RLPR). Prior: 2026-05 (Muon, Multi-Token Prediction, Liger Kernel, DoRA/rsLoRA/NEFTune, FineWeb-Edu/DCLM data scaling).

Results gallery

Four projects in this repo come with published training plots and headline numbers from real consumer-GPU runs (RTX 5060 Ti throughout, plus an RTX 3050 for two of the coder-finetune recipes), plus one project with a discrete-event simulator that scales the same physics to a frontier cluster.

nanogpt-edu/ — real training on an RTX 5060 Ti

Four runs of a real PyTorch training loop on a single RTX 5060 Ti (16 GB, bf16) against char-level Tiny Shakespeare (1 MB). The whole sweep, including a 25 M parameter model overfitting hard for ~1.5 h, was run on a desktop GPU at home: no cluster, no API.

| Run | Params | Iters | ms/it | Wall | Final train | Best val | Final val | Overfit Δ |
|---|---|---|---|---|---|---|---|---|
| smoke | 0.86 M | 275 / 300 | 4.9 | ~2 s | 1.79 | 1.97 | 1.97 | ~0 |
| tiny | 10.65 M | 4,990 / 5,000 | 85.0 | ~8 min | 0.07 | 1.53 | 4.27 | 2.74 |
| tiny_clean | 10.65 M | 1,500 / 1,500 | 92.5 | ~2.3 min | 0.53 | 1.48 | 1.85 | 0.36 |
| small | 25.73 M | 15,000 / 15,000 | 347.9 | ~1.5 h | 0.03 | 1.86 | 5.47 | 3.60 |

All four rows were trained on the RTX 5060 Ti (16 GB, bf16). Loss columns are hardware-independent — comparable across all rows regardless of card — while ms/it and wall reflect the 5060 Ti's throughput.

The classic "best val arrives in the first ~1000 iters and then val climbs monotonically while train collapses to zero" overfit story (tiny, small) — and the textbook U-shaped counter-example (tiny_clean: same architecture, dropout=0.1, max_iters=1500), which lands on a better best-val while overfitting 8× less:

nanogpt-edu cross-run comparison

Side-by-side: tiny (no dropout) vs tiny_clean (dropout 0.1) on the same 10.65 M model and same 1 MB dataset — the only intervention is regularization + early stopping:

tiny vs tiny_clean
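The early-stopping half of that intervention boils down to keeping the checkpoint with the lowest validation loss instead of the last one. A minimal sketch, assuming a hypothetical `best_checkpoint` helper and a synthetic loss curve (neither comes from nanogpt-edu):

```python
def best_checkpoint(val_losses: list[float], patience: int = 3) -> tuple[int, int]:
    """Return (best_index, stop_index) for a sequence of validation losses.

    Training stops once `patience` consecutive evals fail to improve on the
    best loss seen so far. stop_index is the eval at which that happens, or
    the last eval if patience never runs out.
    """
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best_idx, bad_evals = 0, 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx, bad_evals = i, 0      # new best: reset the counter
        elif i > 0:
            bad_evals += 1
            if bad_evals >= patience:
                return best_idx, i
    return best_idx, len(val_losses) - 1


# Synthetic U-shaped curve (made-up numbers, not a real run).
curve = [2.5, 1.9, 1.6, 1.5, 1.7, 2.1, 2.8, 3.5]
best, stop = best_checkpoint(curve, patience=3)
overfit_delta = curve[-1] - curve[best]     # "final val - best val", as in the table
```

On this toy curve the helper picks eval 3 as the best checkpoint and stops at eval 6, which is the same shape of decision that separates tiny_clean from tiny.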

Per-run 3-panel plots (loss + LR + step time) and the parser that generated them live at nanogpt-edu/out/ and nanogpt-edu/tools/plot_nanogpt.py. Full discussion in nanogpt-edu/README.md.

midgpt/ — 350M GPT-2 pretraining on RTX 5060 Ti

A real from-scratch pretraining run of a 354 M-parameter GPT-2 on a 1 B-token slice of HuggingFaceFW/fineweb-edu, on a single RTX 5060 Ti 16 GB (Blackwell, sm_120). Same code as the MPS smoke run; just scaled up to a real GPU and a real corpus.

| Metric | Value |
|---|---|
| Model | GPT-2 354 M (24 L × 1024 d × 16 H, tied embeddings) |
| Dataset | FineWeb-Edu sample-10BT, 1 B-token slice (streamed) |
| Tokens trained | 131 M (~0.37 tokens per parameter; Chinchilla-optimal is ~20) |
| Wall-clock | 2 h 27 min (4 000 iters × 32 768 tok / step) |
| Throughput | 14.9 k tok/s sustained, 99 % GPU util |
| Peak VRAM | 11.9 GB allocated / 12.8 GB reserved (of 16 GB) |
| Train loss | 11.00 → 3.97 |
| Best val ppl | 58.2 (loss 4.064) at iter 3 800 |

midgpt 350M training curves

Three panels: train loss (raw + EMA) and validation, cosine LR schedule, step time. The classic undertrained-Chinchilla shape — a fast ~400-iter drop as the model picks up the vocab and frequent bigrams, then a long slow descent as it actually starts modelling text. Val tracks train to within 0.05, no overfitting, plenty of capacity left.
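The headline numbers in the table are easy to cross-check by hand. A small sketch, assuming the reported loss is mean cross-entropy in nats (so perplexity is `exp(loss)`) and using only figures quoted above:

```python
import math

iters = 4_000
tokens_per_step = 32_768
params = 354e6
throughput_tok_s = 14.9e3
best_val_loss = 4.064

tokens = iters * tokens_per_step                  # 131,072,000 tokens
tokens_per_param = tokens / params                # ~0.37
wall_hours = tokens / throughput_tok_s / 3600     # ~2.44 h, i.e. about 2 h 27 min
perplexity = math.exp(best_val_loss)              # ~58.2
```

The token count, wall-clock, and perplexity rows all fall out of the same four inputs, which is a cheap sanity check to repeat on any new run.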

Sample (T=0.7, k=50):

Photosynthesis is an important component of plant cell metabolism. It is important for the action of plants. The cell's cell activity is responsible for the formation of the micro-organisms…

Fluent English, vaguely on-topic, locally coherent within ~20 tokens — exactly what a 354 M model trained on 131 M tokens is supposed to sound like. The plumbing is correct; the model is just undertrained. Real GPT-2-345M (the OpenAI release) reaches val ppl ~26 on ~380× more compute.

Recipe: midgpt/configs/gpt2_350m_fweb_5060ti.yaml. Walkthrough + sample completions + calibration table: midgpt/examples/5060ti_350m_fineweb.md.

midgpt/ — llamafied vs GPT-2 A/B (350M, iso-param, iso-token)

A controlled architecture ablation on the same 2× 5060 Ti DDP harness: GPT-2 (learned-pos · LayerNorm · GELU) vs llamafied (RoPE · RMSNorm · SwiGLU · QK-norm), held iso-param (354.6 M vs 353.5 M) and iso-token (131 M) so that only the architecture varies.

| Arm | Best val ppl | @ iter | tok/s | Peak VRAM/GPU |
|---|---|---|---|---|
| A: GPT-2 (learned-pos·LN·GELU) | 57.8 | 3 800 | 19.3 k | ~11.9 GB |
| B: llamafied (RoPE·RMSNorm·SwiGLU·QK-norm) | 48.1 | 3 800 | 14.8 k | ~13.0 GB |

The Llama recipe wins by 16.8 % perplexity at equal parameter and token budgets, and leads at every one of 19 evals (by up to 40 % early in training), reaching GPT-2's final quality ~37 % sooner. The cost is ~23 % throughput + ~1 GB VRAM: SwiGLU's third matmul and QK-norm carry more activation memory (iso-param ≠ iso-activation), which OOM'd the naive config and forced a smaller micro-batch on the 16 GB card.
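To see why iso-param does not imply iso-activation, it helps to size the two MLP variants side by side. This is a rough sketch under stated assumptions: the common "two-thirds of 4·d" SwiGLU width (midgpt's actual hidden size and rounding are not given here), no biases, and a naive eager-autograd count of hidden-width tensors kept per token. All function names are illustrative.

```python
def gelu_mlp_params(d_model: int, mult: int = 4) -> int:
    """Weights in a 2-matrix MLP: up (d -> mult*d) and down (mult*d -> d)."""
    return 2 * d_model * (mult * d_model)


def swiglu_params(d_model: int, hidden: int) -> int:
    """Weights in a 3-matrix SwiGLU MLP: gate and up (d -> hidden), down (hidden -> d)."""
    return 3 * d_model * hidden


def iso_param_hidden(d_model: int, mult: int = 4) -> int:
    """Hidden width giving SwiGLU roughly the same weight count as the GELU MLP."""
    return (2 * mult * d_model) // 3


d = 1024
hidden = iso_param_hidden(d)            # 2730
gelu_w = gelu_mlp_params(d)             # 8,388,608
swiglu_w = swiglu_params(d, hidden)     # 8,386,560

# Hidden-width values a naive eager autograd would keep per token (ASSUMED model):
#   GELU MLP: pre-activation, activation      -> 2 tensors of width 4*d
#   SwiGLU:   gate, silu(gate), up, product   -> 4 tensors of width `hidden`
gelu_acts = 2 * 4 * d                   # 8192
swiglu_acts = 4 * hidden                # 10920
```

Under these assumptions the weight counts agree to within 0.03 %, while the SwiGLU block holds about a third more intermediate values per token, which points the same way as the extra VRAM in the table.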

llamafied vs GPT-2 350M A/B

Full table + per-iteration trajectory + samples + the OOM-fix systems note: midgpt/examples/5060ti_350m_llamafied_AB.md.

distgpt/ — 416M Llama-arch (RoPE+SwiGLU+GQA) on RTX 5060 Ti

Single-GPU shake-out of distgpt's full multi-node training stack (the collectives no-op at world_size=1, but every other code path — DeviceMesh, FSDP2 wrapping policy, streaming dataloader with mid-epoch resume, DCP sharded checkpointer, SpikeMonitor + RewindController, AdamW + cosine + per-group WD — is on the critical path).

| Metric | Value |
|---|---|
| Model | 416 M Llama-arch (24 L × 1024 d × 16 H, GQA 4:1, tied embeddings, RoPE + RMSNorm + SwiGLU) |
| Dataset | FineWeb-Edu sample-10BT, 1 B-token slice (shared with midgpt/) |
| Tokens trained | 98 M (~0.24 tokens per parameter for 416 M; Chinchilla-optimal is ~20) |
| Wall-clock | 2 h 22 min (3 000 steps × 32 768 tok / step) |
| Throughput | 11.5 k tok/s sustained, ~98 % GPU util |
| Peak VRAM | 12.0 GB allocated / 12.1 GB reserved (of 16 GB) |
| Train loss | 11.02 → 4.58 |
| Best val ppl | 60.7 (loss 4.105) at step 2 800 |

distgpt 416M training curves

The point of the run isn't to beat midgpt/ (it doesn't: fewer tokens, no GQA speedup at this scale, FSDP wrapping overhead) but to prove the distributed-training plumbing actually trains a model: a sharded DCP checkpoint resumes cleanly, the streaming loader's LoaderState survives restart, and the spike monitor stays out of the way on a noisy small-batch run. The writeup documents two real bugs the run surfaced, plus the fixes that landed alongside: a SpikeMonitor rewind loop that wasted 6 hours retraining the same 100 steps over and over, and the reason two ranks on one consumer 5060 Ti don't work under NCCL 2.28.

Recipe: distgpt/configs/gpt_416m_fweb_5060ti.yaml. Walkthrough + bug post-mortems + 2-rank-on-one-GPU notes: distgpt/examples/5060ti_416m_fineweb.md.

Going multi-GPU — genuine 2× 5060 Ti FSDP2 over PCIe

A second 5060 Ti lights up the real FSDP2 collectives (all-gather + reduce-scatter + 2-rank-sharded DCP) over a PHB PCIe link with no NVLink, so NCCL_P2P_DISABLE=1 routes every collective through host memory. The naive dp=2 config ran at only 0.69× the throughput of one GPU; two fixes (reshard_after_forward=false + gating gradient-sync to the last micro-step) flip it to 1.28× positive scaling.
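The second fix is a general gradient-accumulation pattern: only the last micro-step of an optimizer step should pay for the gradient collective. The sketch below is duck-typed so it runs without torch. It assumes the model exposes `set_requires_gradient_sync(bool)`, which to our understanding is the method FSDP2-wrapped modules provide; `accumulate` and `RecordingModel` are hypothetical names, not distgpt code.

```python
from typing import Callable, Protocol, Sequence, TypeVar

B = TypeVar("B")


class GradSyncGate(Protocol):
    """Anything exposing FSDP2-style gradient-sync gating (assumed interface)."""

    def set_requires_gradient_sync(self, requires_gradient_sync: bool) -> None: ...


def accumulate(
    model: GradSyncGate,
    micro_batches: Sequence[B],
    forward_backward: Callable[[B], None],
) -> None:
    """Run forward/backward on each micro-batch, syncing grads only on the last."""
    last = len(micro_batches) - 1
    for i, batch in enumerate(micro_batches):
        model.set_requires_gradient_sync(i == last)
        forward_backward(batch)


class RecordingModel:
    """Stand-in that records the flags instead of touching any collective."""

    def __init__(self) -> None:
        self.sync_flags: list[bool] = []

    def set_requires_gradient_sync(self, requires_gradient_sync: bool) -> None:
        self.sync_flags.append(requires_gradient_sync)


model = RecordingModel()
accumulate(model, ["mb0", "mb1", "mb2", "mb3"], forward_backward=lambda batch: None)
```

With four micro-batches the recorded flags are three `False` and one `True`, so the reduce-scatter over PCIe happens once per optimizer step instead of four times.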

| Metric | Value |
|---|---|
| GPUs / parallelism | 2× RTX 5060 Ti, FSDP2 dp=2 (PCIe PHB, no NVLink) |
| Tokens trained | 295 M (~0.71 tokens per parameter for 416 M; Chinchilla-optimal is ~20) |
| Wall-clock | 5 h 33 min (4 500 steps × 65 536 tok / step) |
| Throughput | 14.7 k tok/s aggregate (7.4 k / GPU) |
| Per-GPU MFU | 10.3 % (vs 16.2 % single-GPU: the PCIe tax) |
| Peak VRAM | 12.8 GB allocated / 14.3 GB reserved per GPU |
| Train loss | 11.03 → 3.93 (low 3.66 at step 3 870) |
| Best val ppl | 41.6 (loss 3.728) at step 4 250 |

distgpt 416M 2-GPU training curves

The cosine tail is the lesson: val plateaus noisily at ~4.0 from step 2 750–3 250, then the LR decay grinds it down to 41.6 ppl in the final quarter — beating the step-2 500 checkpoint and landing exactly on the single-GPU run's "295 M tokens → ppl ~42" forecast. Full calibration story (naive→optimized tables, the competing-VRAM OOM, why 2 ranks on one card is a dead end) in the same walkthrough.
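The per-GPU MFU row in the table above can be sanity-checked without knowing the card's peak FLOP/s: for a fixed model, MFU scales with per-GPU tokens per second. A short sketch using only the throughput and MFU figures quoted in the two distgpt tables:

```python
single_gpu_tok_s = 11.5e3      # single-GPU run
aggregate_tok_s = 14.7e3       # 2-GPU run, both GPUs combined
single_gpu_mfu = 16.2          # percent, single-GPU run
n_gpus = 2

per_gpu_tok_s = aggregate_tok_s / n_gpus                   # 7,350 tok/s
scaling = aggregate_tok_s / single_gpu_tok_s               # ~1.28x
predicted_mfu = single_gpu_mfu * per_gpu_tok_s / single_gpu_tok_s   # ~10.4 %
```

The predicted value lands within a tenth of a point of the reported 10.3 %, so the throughput and MFU rows are consistent with each other.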

frontier-platform/ — simulated 1B → 400B program

Discrete-event simulator (pure Python, no torch) that runs the full program end-to-end: Chinchilla-style scaling laws, MFU → throughput → wall time, Poisson GPU failures, rolling $ accounting, eval-score prediction, safety thresholds, serving cost models. Optional --real-gpu flag probes local CUDA devices and recalibrates seconds_per_step from a few real training steps so the simulated wall clock and $ figures match the silicon you actually own.

| Run | Cluster | Wall | Final loss | MMLU | Arena ELO | Total $ | Throughput model |
|---|---|---|---|---|---|---|---|
| 1b | 64× H100 | 3.7 d | 2.21 | 50.6% | 1515 | $0.93 M | 50% MFU × spec |
| 7b | 512× H100 | 4.8 d | 2.02 | 62.7% | 1711 | $1.02 M | 50% MFU × spec |
| 70b | 4,096× H100 | 13.2 d | 1.88 | 76.8% | 1985 | $3.31 M | 50% MFU × spec |
| 400b | 16,384× H100 | 54.0 d | 1.81 | 84.2% | 2142 | $42.42 M | 50% MFU × spec |
| 7b_realgpu | 512× H100 | 430.7 d | 2.03 | 62.7% | 1711 | $11.48 M | RTX 3050 bf16 (measured) |

The 7B-vs-7B-realgpu comparison is the punchline: same simulated cluster, but calibrating against an actually-measured 4.2 TFLOP/s per RTX 3050 (vs H100's 989 TFLOP/s spec) blows wall-clock from 5 days to 14 months and cost from $1 M to $11.5 M — eval scores are identical because scaling laws don't care how fast the GPUs are.

frontier-platform size sweep

spec-sheet vs real-GPU calibration

All five runs ship with per-run 3-panel plots (loss + cumulative $ + cumulative failures), JSON summaries, and a reproducible CLI. See frontier-platform/README.md for the full story.
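The cumulative-failures panel is driven by Poisson GPU failures. As an illustration of what that means in a discrete-event setting, here is a minimal sampler that draws failure times from exponential inter-arrival gaps. The function name, the failure rate, and the cluster numbers are all hypothetical, and this is not the simulator's own code.

```python
import random


def sample_failure_times(
    rate_per_hour: float, horizon_hours: float, rng: random.Random
) -> list[float]:
    """Sample event times of a Poisson process on [0, horizon_hours).

    Inter-arrival gaps of a Poisson process are exponential with the same rate.
    """
    if rate_per_hour <= 0 or horizon_hours <= 0:
        return []
    times: list[float] = []
    t = 0.0
    while True:
        t += rng.expovariate(rate_per_hour)
        if t >= horizon_hours:
            return times
        times.append(t)


rng = random.Random(0)                     # seeded for reproducibility
n_gpus = 512                               # hypothetical cluster size
per_gpu_rate = 1e-4                        # ASSUMED failures per GPU-hour
horizon = 24.0 * 5                         # five simulated days
failures = sample_failure_times(n_gpus * per_gpu_rate, horizon, rng)
```

Because independent Poisson processes add, the cluster-level rate is simply the per-GPU rate times the GPU count, which is why larger clusters see failures so much more often over the same wall-clock window.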

coder-finetune/ — LoRA on a single consumer GPU, three recipes

Three reproducible recipes that walk the consumer-GPU ladder, sharing the same transformers + peft + trl plumbing and only differing in model size / dataset / memory recipe. All three are 1-epoch LoRA r=16 runs against Qwen/Qwen2.5-Coder-*.

| Recipe | GPU | Base | Dataset | Packing | Grad-ckpt | Wall | Peak VRAM | Loss end |
|---|---|---|---|---|---|---|---|---|
| lora_3050.yaml | RTX 3050 8 GB | 0.5B | builtin 320 (memorize) | | | 1m 24s | 2.3 GB | 0.45 |
| lora_3050_1p5b.yaml | RTX 3050 8 GB | 1.5B | Magicoder-Py 2k | | on | 24m 05s | 7.5 GB | 0.58 |
| lora_5060ti.yaml | RTX 5060 Ti 16 GB | 3B | Magicoder-Py 2.5k | on | off | 11m 59s | 15.1 GB | 0.55 |

The 5060 Ti recipe is the one to read: 2× the model in half the wall-clock of the 1.5B-on-3050 push recipe, because the 16 GB budget lets you (a) disable gradient checkpointing and (b) enable packing.
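Packing is what turns many short instruction pairs into a few dense sequences with no padding. A simplified sketch of the idea, assuming a hypothetical `pack_sequences` helper that concatenates tokenized examples with an EOS separator and drops the incomplete tail (the trl implementation the recipes rely on may differ in detail):

```python
def pack_sequences(
    examples: list[list[int]], seq_len: int, eos_id: int
) -> list[list[int]]:
    """Concatenate examples (each followed by EOS) and cut into full seq_len blocks."""
    stream: list[int] = []
    for ids in examples:
        stream.extend(ids)
        stream.append(eos_id)
    n_blocks = len(stream) // seq_len       # the incomplete tail block is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_blocks)]


# Toy token ids; 0 plays the role of EOS.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks = pack_sequences(examples, seq_len=4, eos_id=0)
```

Here four examples of uneven length become three full blocks, so every position in every training step carries a real token, which is where the wall-clock saving comes from.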

cross-recipe loss comparison

Left: training progress (% of 1 epoch). The 0.5B/builtin run drops to 0.45 because it's memorizing 320 short pairs — a smoke test. The two Magicoder runs land at 0.55–0.58 (real generalization on held-out prompts; see the 5060 Ti example for Levenshtein DP, BFS, LRU cache, and a retry decorator all generated correctly at T=0.2). Right: same losses on a log wall-clock axis — the 5060 Ti curve sits to the left of the 1.5B-on-3050 curve at every loss level despite training a 2× larger model.

5060 Ti / 3B recipe — 12 minutes, 13.9 GB peak

coder-finetune/examples/5060ti_lora.md walks through the headline run: Qwen/Qwen2.5-Coder-3B LoRA r=16 on 2,500 Python rows of ise-uiuc/Magicoder-OSS-Instruct-75K at seq_len=1024, packed, gradient checkpointing off:

| Metric | Value |
|---|---|
| Wall-clock | 11 min 59 s (1 epoch, 161 packed steps) |
| Peak VRAM allocated | 13.87 GB |
| Peak VRAM reserved | 15.10 GB (of 16 GB) |
| Trainable params | 29.9 M (0.96 % of 3.09 B) |
| Train loss | 0.80 → 0.55 |
| Mean-token-acc | 0.82 → 0.85 |
| Tokens trained | 1.28 M |
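The trainable-parameter figure follows from the LoRA formula: a rank-r adapter on a `d_out × d_in` weight adds `r · (d_in + d_out)` parameters. The sketch below reproduces the count under two assumptions that are not stated in this README: the layer shapes (recalled from the public Qwen2.5-3B config, so verify against `config.json`) and adapters on all seven linear projections per block.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


# ASSUMED shapes for Qwen2.5-Coder-3B; check config.json before relying on them.
hidden, kv_dim, mlp, layers, r = 2048, 256, 11008, 36, 16

per_layer = (
    lora_params(hidden, hidden, r)      # q_proj
    + lora_params(hidden, kv_dim, r)    # k_proj (grouped-query: narrower output)
    + lora_params(hidden, kv_dim, r)    # v_proj
    + lora_params(hidden, hidden, r)    # o_proj
    + lora_params(hidden, mlp, r)       # gate_proj
    + lora_params(hidden, mlp, r)       # up_proj
    + lora_params(mlp, hidden, r)       # down_proj
)
trainable = per_layer * layers          # 29,933,568
```

Under those assumptions the count rounds to the 29.9 M reported in the table. That agreement is suggestive rather than proof of the exact target-module list.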

3B LoRA on 5060 Ti — training curves

1.5B / RTX 3050 recipe — 24 minutes, 7.5 GB peak

coder-finetune/configs/lora_3050_1p5b.RESULTS.md documents the 8-GB limit-pusher: same Magicoder dataset, 2,000 rows, seq_len=1024, grad-ckpt on. Without gradient checkpointing the recipe OOMs at step 1; with it, the 3050 still has only ~500 MB of headroom left.

1.5B LoRA on 3050 — training curves

0.5B / RTX 3050 recipe — 84 seconds, 2.3 GB peak

coder-finetune/examples/3050_lora.md is the smoke run: Qwen/Qwen2.5-Coder-0.5B against a 16-pair built-in instruction set (× repeat = 320 examples). No HF dataset download required. Loss collapses from 2.85 to 0.45 in 80 steps because it's memorizing — the point is to validate the whole plumbing end-to-end before reaching for a real dataset.

Plotter: scripts/plot_training.py (single-run) and scripts/plot_compare_recipes.py (cross-recipe) — both reusable for any TRL trainer_state.json.

SOTA-harvest measurements — the 2026-06 "measure the technique" runs

The June SOTA edition turned four planned harvests into measured, charted runs on the 2× 5060 Ti. The throughline is honesty: none produced a flashy "we beat it" headline — each surfaced the real precondition or sizing fact, which is the more useful result.

  • LoRA Without Regret: r=16 vs r=256 (coder-finetune, writeup). Three iso-rank A/Bs: r=256 ties r=16 at convergence but loses at every fixed epoch budget (the 16× adapter is slow to warm up), and a bigger 30k×9-language mixture didn't flip it. The binding constraint is training budget, not dataset size.

    LoRA r=16 vs r=256

  • DeepConf: test-time confidence filtering (nanogpt-edu, writeup). On a verifiable char-level addition model: confidence robustly tracks correctness, and online early-abort trades tokens for accuracy on a clean curve (~10 % fewer at near-iso accuracy). The offline vote-lift is a large-k/long-trace sizing fact.

    DeepConf tradeoff + confidence separation

  • GSPO vs GRPO + RLPR (frontier-platform, writeup). GSPO's sequence-level importance ratio is ~4× lower-variance than GRPO's token ratio and wins on the MoE policy; RLPR's verifier-free reward sharpens the policy (answer-prob 0.44 → 0.70), but needs an SFT warm-start + KL anchor.

    GRPO vs GSPO + RLPR
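The difference between the two importance ratios in the last bullet is easy to state in code. GRPO weights each token by its own probability ratio, while GSPO uses one length-normalised ratio per sequence, the exponential of the mean token log-ratio. A sketch on made-up log-probabilities (the function names and numbers are illustrative, not taken from frontier-platform):

```python
import math


def token_ratios(new_logps: list[float], old_logps: list[float]) -> list[float]:
    """GRPO-style per-token importance ratios pi_new / pi_old."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps: list[float], old_logps: list[float]) -> float:
    """GSPO-style sequence ratio: (pi_new(y|x) / pi_old(y|x)) ** (1 / len(y))."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


# Made-up per-token log-probs for one sampled response.
old = [-1.0, -2.0, -0.5, -3.0]
new = [-0.8, -2.6, -0.4, -2.9]

per_token = token_ratios(new, old)      # one ratio per token, spread around 1
per_seq = sequence_ratio(new, old)      # a single ratio for the whole response
```

Because the sequence ratio is a geometric mean of the token ratios, it always lies between their minimum and maximum, which gives some intuition for the lower variance reported above.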

The five projects

| Project | Scale | What it teaches | Hardware |
|---|---|---|---|
| nanogpt-edu/ | 10M–100M | A correct transformer + training loop in ~500 lines: RoPE, RMSNorm, SwiGLU, AMP, cosine LR. | 1 GPU or CPU |
| midgpt/ | 124M–1.5B | GPT-2 scale with the real production toolbox: tiktoken BPE, gradient checkpointing, gradient accumulation, DDP, resumable runs, HellaSwag eval. | 1–8 GPUs, single node |
| distgpt/ | 1B–70B | Real multi-node training: FSDP2 + Tensor Parallel + Pipeline Parallel on a 3D device mesh, sharded DCP checkpoints, loss-spike rewind, streaming dataloader. | Multi-node cluster |
| coder-finetune/ | 0.5B–7B | Post-training on a single consumer GPU: full FT, LoRA, and QLoRA via HuggingFace transformers + peft + trl, plus GRPO/RLVR with verifiable unit-test rewards. HumanEval+ in a Docker sandbox. | 1 consumer GPU (≥6 GB) |
| frontier-platform/ | 1B–500B+ | Architecture-only blueprint: data acquisition → filtering → dedup → tokenizer → pretrain → SFT → RLHF/DPO → eval → red-team → serving → observability. Interfaces + design docs; bodies are NotImplementedError. | Design doc; no GPUs required |

The complexity ladder

The projects are designed to be read in order. Each one reuses the vocabulary of the previous and adds one production concern:

nanogpt-edu  →  midgpt        →  distgpt          →  coder-finetune    →  frontier-platform
  minimal       real tokenizer    3D parallelism      post-training         the whole system
  correct       AMP / grad-ckpt   DCP checkpoints     LoRA / QLoRA          around training
  transformer   single-node DDP   spike rewind        SFT + GRPO/RLVR       HumanEval+

coder-finetune is the orthogonal track: instead of pretraining from scratch, it takes pretrained weights and aligns them for code. frontier-platform zooms back out to show the dozen production systems that surround the training loop in a real frontier lab.

Quickstart

Each subproject installs independently. There is no top-level build.

```bash
# Smallest: train a tiny GPT on TinyShakespeare in ~5 minutes
cd nanogpt-edu
python -m venv .venv && .venv/bin/pip install -r requirements.txt
.venv/bin/python prepare_shakespeare.py
.venv/bin/python train.py --config configs/tiny.py
.venv/bin/python sample.py --ckpt out/ckpt.pt --prompt "ROMEO:"
```

```bash
# GPT-2 scale on one node
cd midgpt
pip install -r requirements.txt
python prepare.py --dataset wikitext103
torchrun --standalone --nproc_per_node 8 train.py --config configs/gpt2_350m.yaml
```

```bash
# Fine-tune a code model on a consumer GPU
cd coder-finetune
pip install -r requirements.txt
python train.py --config configs/lora.yaml
python eval/run_humaneval.py --model out/lora --n-samples 1
```

```bash
# Multi-node FSDP2 + TP + PP
cd distgpt
pip install -e .
# launch via Slurm or torchrun-elastic; see distgpt/scripts/
```

```bash
# Read the blueprint
cd frontier-platform
pip install -e .
$EDITOR docs/00-overview.md
```

Testing

Every subproject ships pytest smoke tests:

```bash
cd <subproject> && pytest
```

Tests run without installing the package — they use a sys.path shim so you can iterate without a reinstall.
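A sys.path shim of this kind is usually a few lines in a `conftest.py`. The sketch below shows the general pattern only; the helper name and the assumed `tests/` layout are illustrative and may not match what each subproject actually does.

```python
import sys
from pathlib import Path


def add_project_root(root: Path) -> None:
    """Put `root` at the front of sys.path once, so tests import the working tree."""
    root_str = str(root.resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


# In a tests/conftest.py one level below the project root this would be:
#     add_project_root(Path(__file__).parent.parent)
# Here the current directory stands in for the project root.
add_project_root(Path.cwd())
```

The membership check keeps the shim idempotent, so repeated imports of `conftest.py` do not keep growing `sys.path`.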

To run everything (pytest in each subproject + ruff at the root) in one shot:

```bash
python3 tools/orchestrate.py            # tests + lint
python3 tools/orchestrate.py --tests    # tests only
python3 tools/orchestrate.py --lint     # lint only
python3 tools/orchestrate.py -p midgpt  # one project
```

CI mirrors this matrix in .github/workflows/tests.yml.

Repository layout

LLM-playground/
├── nanogpt-edu/         # 10M–100M, single-file, educational
├── midgpt/              # 124M–1.5B, single-node DDP, tiktoken
├── distgpt/             # 1B–70B, multi-node FSDP2 + TP + PP
├── coder-finetune/      # 0.5B–7B, SFT / LoRA / QLoRA on HF
├── frontier-platform/   # 1B–500B+, architecture blueprint + design docs
├── docs/                # SOTA Watch — monthly LLM & AGI research digest
├── tools/orchestrate.py # one-shot test+lint runner across all subprojects
├── pyproject.toml       # shared ruff config (no shared build)
├── .github/workflows/   # CI matrix: pytest each subproject + repo-wide ruff
├── JAAICODE.md          # AI-assistant project instructions
└── README.md            # this file

Conventions

  • Python ≥ 3.10, from __future__ import annotations, PEP-604 unions (str | None), built-in generics.
  • @dataclass for configs and small value types.
  • YAML configs in configs/; checkpoints and artefacts in out/.
  • No cross-subproject imports — each project is deliberately standalone.

Status

These are study projects. nanogpt-edu, midgpt, distgpt, and coder-finetune are runnable code. frontier-platform is a design doc with typed skeletons — every public function has a signature and a docstring, but most bodies raise NotImplementedError. Running a real frontier model takes thousands of GPUs and tens of millions of dollars; this repo is the map, not the territory.

License

See individual subprojects.
