Training Trace Runtime is a C11 runtime that records online SGD updates for a single-parameter linear model into a fixed-size binary trace, validates trace integrity (CRC32 + semantic checks), and replays or compares runs deterministically.
Training bugs often hide behind nondeterminism. Deterministic event logs let you:
- replay a run exactly,
- bisect the first divergent step,
- detect silent corruption (record CRC),
- detect numerical drift (ULP comparison),
- distinguish model bugs from data/IO bugs.
All fields are little-endian.

TraceHeader (64 bytes)

| Offset | Size | Field |
|---|---|---|
| 0 | 4 | magic[4] = "DGRR" |
| 4 | 2 | version (1) |
| 6 | 2 | _pad0 |
| 8 | 8 | seed |
| 16 | 8 | step_count |
| 24 | 8 | learning_rate |
| 32 | 4 | file_crc32 |
| 36 | 4 | header_crc32 (CRC32 of bytes [0..35]) |
| 40 | 8 | created_at (Unix seconds) |
| 48 | 16 | _reserved |

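
TraceRecord (72 bytes)

| Offset | Size | Field |
|---|---|---|
| 0 | 8 | step_id |
| 8 | 8 | w_before |
| 16 | 8 | b_before |
| 24 | 8 | grad_w |
| 32 | 8 | grad_b |
| 40 | 8 | w_after |
| 48 | 8 | b_after |
| 56 | 8 | loss |
| 64 | 4 | record_crc32 (CRC32 of bytes [0..63]) |
| 68 | 4 | _pad0 |
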
Two _Static_assert checks, sizeof(TraceHeader) == 64 and sizeof(TraceRecord) == 72, enforce these ABI invariants at compile time.
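
The layout tables above can be sketched as C structs. This is an illustration, not the project's actual header: the tables give only offsets and sizes, so the field types are assumptions, as are the helper names crc32_bytes and record_crc and the choice of CRC-32 variant (the reflected polynomial 0xEDB88320 used by zlib).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed field types: the layout tables specify offsets and sizes only. */
typedef struct {
    char     magic[4];      /* "DGRR" */
    uint16_t version;       /* 1 */
    uint16_t _pad0;
    uint64_t seed;
    uint64_t step_count;
    double   learning_rate;
    uint32_t file_crc32;
    uint32_t header_crc32;  /* CRC32 of header bytes [0..35] */
    int64_t  created_at;    /* Unix seconds */
    uint8_t  _reserved[16];
} TraceHeader;

typedef struct {
    uint64_t step_id;
    double   w_before, b_before;
    double   grad_w, grad_b;
    double   w_after, b_after;
    double   loss;
    uint32_t record_crc32;  /* CRC32 of record bytes [0..63] */
    uint32_t _pad0;
} TraceRecord;

_Static_assert(sizeof(TraceHeader) == 64, "TraceHeader must be 64 bytes");
_Static_assert(sizeof(TraceRecord) == 72, "TraceRecord must be 72 bytes");
_Static_assert(offsetof(TraceHeader, header_crc32) == 36, "header_crc32 offset");
_Static_assert(offsetof(TraceRecord, record_crc32) == 64, "record_crc32 offset");

/* Bitwise CRC-32, reflected polynomial 0xEDB88320 (assumed variant). */
static uint32_t crc32_bytes(const void *data, size_t len) {
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical helper: CRC over the 64 bytes that precede record_crc32. */
static uint32_t record_crc(const TraceRecord *r) {
    return crc32_bytes(r, offsetof(TraceRecord, record_crc32));
}
```

Hashing the in-memory struct matches the on-disk bytes only on a little-endian host; a portable writer would serialize each field explicitly before computing the CRC.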
Model:

$$ \hat{y} = w x + b, \quad L = (\hat{y} - y)^2 = (w x + b - y)^2 $$

Gradients:

$$ \frac{\partial L}{\partial w} = 2 (w x + b - y)\, x, \quad \frac{\partial L}{\partial b} = 2 (w x + b - y) $$

Update:

$$ w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \quad b \leftarrow b - \alpha \frac{\partial L}{\partial b} $$

Concrete step (x=2, y=3, w=0.5, b=0.1, α=0.01):
- error: $e = wx + b - y = 0.5 \cdot 2 + 0.1 - 3 = -1.9$
- $\partial L/\partial w = 2(-1.9)(2) = -7.6$
- $\partial L/\partial b = 2(-1.9) = -3.8$
- $w' = 0.5 - 0.01(-7.6) = 0.576$
- $b' = 0.1 - 0.01(-3.8) = 0.138$
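
The same step can be written as a small C function. This is a minimal sketch: the names Params, StepOut, and sgd_step are illustrative and do not come from the runtime's source.

```c
/* Hypothetical types for illustration; not the runtime's actual API. */
typedef struct { double w, b; } Params;
typedef struct { double grad_w, grad_b, loss; } StepOut;

/* One online SGD step for y_hat = w*x + b with squared-error loss.
 * Updates *p in place and returns the gradients and the pre-update loss. */
static StepOut sgd_step(Params *p, double x, double y, double lr) {
    double e = p->w * x + p->b - y;      /* prediction error */
    StepOut out;
    out.grad_w = 2.0 * e * x;            /* dL/dw */
    out.grad_b = 2.0 * e;                /* dL/db */
    out.loss   = e * e;                  /* L at the old parameters */
    p->w -= lr * out.grad_w;
    p->b -= lr * out.grad_b;
    return out;
}
```

Calling sgd_step with w=0.5, b=0.1, x=2, y=3, lr=0.01 reproduces the numbers above up to floating-point rounding: gradients of about -7.6 and -3.8, and updated parameters of about 0.576 and 0.138.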
Loss aggregation uses Kahan compensated summation.
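
For reference, Kahan summation carries a running compensation term that recovers the low-order bits lost in each addition. The sketch below shows the textbook algorithm; the names KahanAcc and kahan_add are illustrative, and the runtime's own accumulator may be structured differently.

```c
/* Hypothetical accumulator type for illustration. */
typedef struct {
    double sum;  /* running total */
    double c;    /* compensation: rounding error carried from earlier adds */
} KahanAcc;

/* Add x to the accumulator using Kahan compensated summation. */
static void kahan_add(KahanAcc *acc, double x) {
    double y = x - acc->c;        /* apply the carried correction */
    double t = acc->sum + y;      /* low-order bits of y may be lost here */
    acc->c = (t - acc->sum) - y;  /* recover what was lost (negated) */
    acc->sum = t;
}
```

Start from a zero-initialized KahanAcc and call kahan_add once per step loss. Flags that relax IEEE semantics, such as -ffast-math, can let the compiler simplify the compensation term away, so a build that depends on this behavior should avoid them.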
./dgrr train [--steps N] [--lr FLOAT] [--seed UINT64] [--out PATH]
./dgrr replay --step S [--trace PATH]
./dgrr inspect --step S [--trace PATH]
./dgrr rollback --step S [--trace PATH]
./dgrr compare --a PATH_A --b PATH_B [--ulp-tol N]

Exit codes:
- 0: success
- 1: usage error
- 2: IO error
- 3: trace validation failure
- 4: divergence detected
- 5: replay integrity failure

Replay recomputes:

$$ w_{\text{re}} = w_{\text{before}} - \text{lr} \cdot \text{grad}_w, \quad b_{\text{re}} = b_{\text{before}} - \text{lr} \cdot \text{grad}_b $$

and compares against stored w_after / b_after using ULP tolerance.
compare validates both traces, then scans in lockstep and compares w_after, b_after, loss at each step with ULP tolerance. It reports the first mismatching step and values.
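
The lockstep scan can be sketched as a loop that returns the index of the first step where any compared field exceeds the tolerance. StepValues is a hypothetical reduced record holding only the three compared fields, and the ULP helper repeats the assumptions that NaN is maximally distant and that +0.0 equals -0.0; none of these names come from the runtime's source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reduced record: only the fields that compare inspects. */
typedef struct { double w_after, b_after, loss; } StepValues;

/* ULP distance via an order-preserving integer mapping (assumed method). */
static uint64_t ulp_distance(double a, double b) {
    const uint64_t sign = UINT64_C(1) << 63;
    uint64_t ua, ub;
    if (a != a || b != b) return UINT64_MAX;   /* NaN: maximally distant */
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ua = (ua & sign) ? sign - (ua & ~sign) : sign + ua;
    ub = (ub & sign) ? sign - (ub & ~sign) : sign + ub;
    return ua > ub ? ua - ub : ub - ua;
}

/* Scan two runs in lockstep. Returns the index of the first step where any
 * field differs by more than ulp_tol, or n if all n steps agree. */
static size_t first_divergence(const StepValues *a, const StepValues *b,
                               size_t n, uint64_t ulp_tol) {
    for (size_t i = 0; i < n; i++) {
        if (ulp_distance(a[i].w_after, b[i].w_after) > ulp_tol ||
            ulp_distance(a[i].b_after, b[i].b_after) > ulp_tol ||
            ulp_distance(a[i].loss,    b[i].loss)    > ulp_tol)
            return i;
    }
    return n;
}
```

A return value smaller than n corresponds to the "divergence detected" outcome, and the caller can print the stored values at that index from both traces.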
make
make debug
make test

License: MIT (see LICENSE).