Training Trace Runtime

What it is

Training Trace Runtime is a C11 runtime that records online SGD updates for a single-feature linear model (one weight w, one bias b) into a binary trace of fixed-size records, validates trace integrity (CRC32 plus semantic checks), and replays or compares runs deterministically.

Why determinism matters in ML systems

Training bugs often hide behind nondeterminism. Deterministic event logs let you:

replay a run exactly,
bisect the first divergent step,
detect silent corruption (record CRC),
detect numerical drift (ULP comparison),
distinguish model bugs from data/IO bugs.

Binary trace format

All fields are little-endian.

TraceHeader (64 bytes)
Offset  Size  Field
0       4     magic[4] = "DGRR"
4       2     version (1)
6       2     _pad0
8       8     seed
16      8     step_count
24      8     learning_rate
32      4     file_crc32
36      4     header_crc32 (CRC32 bytes [0..35])
40      8     created_at (unix sec)
48      16    _reserved

TraceRecord (72 bytes)
Offset  Size  Field
0       8     step_id
8       8     w_before
16      8     b_before
24      8     grad_w
32      8     grad_b
40      8     w_after
48      8     b_after
56      8     loss
64      4     record_crc32 (CRC32 bytes [0..63])
68      4     _pad0

Compile-time _Static_assert checks that sizeof(TraceHeader) == 64 and sizeof(TraceRecord) == 72 enforce these ABI invariants.
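The layout tables above can be sketched as C structs. This is a minimal sketch, not the project's actual header: the field types are inferred from the listed sizes, the CRC-32 variant (reflected polynomial 0xEDB88320) is an assumption because the README only says "CRC32", and the helpers crc32_bytes, trace_record_seal, and trace_record_crc_ok are hypothetical names. Hashing the in-memory struct matches the on-disk bytes only on a little-endian host.

```c
#include <stddef.h>
#include <stdint.h>

/* Field types are assumptions inferred from the sizes in the layout tables. */
typedef struct {
    char     magic[4];      /* "DGRR" */
    uint16_t version;       /* 1 */
    uint16_t _pad0;
    uint64_t seed;
    uint64_t step_count;
    double   learning_rate;
    uint32_t file_crc32;
    uint32_t header_crc32;  /* CRC32 of bytes [0..35] */
    int64_t  created_at;    /* Unix seconds; signedness is a guess */
    uint8_t  _reserved[16];
} TraceHeader;

typedef struct {
    uint64_t step_id;
    double   w_before, b_before;
    double   grad_w, grad_b;
    double   w_after, b_after;
    double   loss;
    uint32_t record_crc32;  /* CRC32 of bytes [0..63] */
    uint32_t _pad0;
} TraceRecord;

_Static_assert(sizeof(TraceHeader) == 64, "TraceHeader must be 64 bytes");
_Static_assert(sizeof(TraceRecord) == 72, "TraceRecord must be 72 bytes");
_Static_assert(offsetof(TraceHeader, header_crc32) == 36, "header CRC offset");
_Static_assert(offsetof(TraceRecord, record_crc32) == 64, "record CRC offset");

/* Bitwise CRC-32, reflected polynomial 0xEDB88320 (assumed variant). */
static uint32_t crc32_bytes(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift out one bit; XOR in the polynomial if that bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Store the CRC of bytes [0..63] into the record. */
static void trace_record_seal(TraceRecord *r) {
    r->record_crc32 = crc32_bytes(r, offsetof(TraceRecord, record_crc32));
}

/* Return 1 if the stored CRC matches the record's first 64 bytes. */
static int trace_record_crc_ok(const TraceRecord *r) {
    return r->record_crc32 == crc32_bytes(r, offsetof(TraceRecord, record_crc32));
}
```

Using offsetof for the CRC span keeps the checked range tied to the struct definition, so a layout change trips the static assertions instead of silently hashing the wrong bytes.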

Mathematical foundation

Model:

\[ \hat{y} = w x + b,\quad L = (\hat{y} - y)^2 = (w x + b - y)^2 \]

Gradients:

\[ \frac{\partial L}{\partial w} = 2(w x + b - y)\,x,\quad \frac{\partial L}{\partial b} = 2(w x + b - y) \]

Update:

\[ w \leftarrow w - \alpha \frac{\partial L}{\partial w},\quad b \leftarrow b - \alpha \frac{\partial L}{\partial b} \]

Concrete step (x=2, y=3, w=0.5, b=0.1, α=0.01):

error: \(e = w x + b - y = 0.5 \cdot 2 + 0.1 - 3 = -1.9\)
\(\partial L/\partial w = 2 \cdot (-1.9) \cdot 2 = -7.6\)
\(\partial L/\partial b = 2 \cdot (-1.9) = -3.8\)
\(w' = 0.5 - 0.01 \cdot (-7.6) = 0.576\)
\(b' = 0.1 - 0.01 \cdot (-3.8) = 0.138\)
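The concrete step above can be reproduced in a few lines of C. This is an illustrative sketch of the formulas, not the runtime's code; the type and function names (Params, Grads, sgd_gradients, sgd_update) are hypothetical.

```c
/* Hypothetical types for one SGD step on y_hat = w*x + b. */
typedef struct { double w, b; } Params;
typedef struct { double grad_w, grad_b, loss; } Grads;

/* Squared-error loss and its gradients for a single sample (x, y). */
static Grads sgd_gradients(Params p, double x, double y) {
    double e = p.w * x + p.b - y;   /* prediction error */
    Grads g;
    g.grad_w = 2.0 * e * x;         /* dL/dw */
    g.grad_b = 2.0 * e;             /* dL/db */
    g.loss   = e * e;               /* L = e^2 */
    return g;
}

/* Plain gradient-descent update with learning rate lr. */
static Params sgd_update(Params p, Grads g, double lr) {
    Params next;
    next.w = p.w - lr * g.grad_w;
    next.b = p.b - lr * g.grad_b;
    return next;
}
```

Calling sgd_gradients on w=0.5, b=0.1 with x=2, y=3 and then sgd_update with lr=0.01 yields values that match the hand calculation up to floating-point rounding.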

Loss aggregation uses Kahan compensated summation.
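For readers unfamiliar with the technique, here is a minimal sketch of Kahan compensated summation. The KahanSum type and kahan_add function are hypothetical names, and how the runtime actually wires this into loss aggregation is not shown in this README.

```c
/* Running sum plus a compensation term that carries the low-order
 * bits lost when a small value is added to a large sum. */
typedef struct {
    double sum;
    double comp;
} KahanSum;

static void kahan_add(KahanSum *k, double value) {
    double y = value - k->comp;   /* apply the correction from the last step */
    double t = k->sum + y;        /* low-order bits of y may be lost here */
    k->comp = (t - k->sum) - y;   /* recover what was lost (negated) */
    k->sum = t;
}
```

Adding 1e-16 ten times to a sum that starts at 1.0 leaves a naive accumulator at exactly 1.0, while the compensated sum moves to about 1 + 1e-15. The compensation depends on strict IEEE 754 evaluation, so options such as -ffast-math can optimize it away.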

CLI reference

./dgrr train [--steps N] [--lr FLOAT] [--seed UINT64] [--out PATH]
./dgrr replay --step S [--trace PATH]
./dgrr inspect --step S [--trace PATH]
./dgrr rollback --step S [--trace PATH]
./dgrr compare --a PATH_A --b PATH_B [--ulp-tol N]

Exit codes:

0 success
1 usage error
2 IO error
3 trace validation failure
4 divergence detected
5 replay integrity failure
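A wrapper program that launches dgrr may want symbolic names for these codes. The enum below is a sketch: the numeric values and descriptions come from the table above, but the identifiers (DgrrExitCode, DGRR_EXIT_*, dgrr_exit_describe) are hypothetical and not taken from the project's headers.

```c
/* Hypothetical names; values mirror the exit-code table above. */
typedef enum {
    DGRR_EXIT_OK            = 0,
    DGRR_EXIT_USAGE         = 1,
    DGRR_EXIT_IO            = 2,
    DGRR_EXIT_VALIDATION    = 3,
    DGRR_EXIT_DIVERGENCE    = 4,
    DGRR_EXIT_REPLAY_FAILED = 5
} DgrrExitCode;

/* Map an exit status to the description used in the table. */
static const char *dgrr_exit_describe(int code) {
    switch (code) {
    case DGRR_EXIT_OK:            return "success";
    case DGRR_EXIT_USAGE:         return "usage error";
    case DGRR_EXIT_IO:            return "IO error";
    case DGRR_EXIT_VALIDATION:    return "trace validation failure";
    case DGRR_EXIT_DIVERGENCE:    return "divergence detected";
    case DGRR_EXIT_REPLAY_FAILED: return "replay integrity failure";
    default:                      return "unknown exit code";
    }
}
```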

Replay integrity model

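Replay recomputes each update from the stored inputs (lr is the learning rate, \(\alpha\) above):

\[ w_{\mathrm{re}} = w_{\mathrm{before}} - \mathrm{lr} \cdot \mathrm{grad\_w},\quad b_{\mathrm{re}} = b_{\mathrm{before}} - \mathrm{lr} \cdot \mathrm{grad\_b} \]

and compares the results against the stored w_after and b_after within a ULP tolerance.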

Divergence detection

The compare command validates both traces, then scans them in lockstep, comparing w_after, b_after, and loss at each step within the ULP tolerance (--ulp-tol). It reports the first mismatching step and the differing values.
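The lockstep scan can be sketched as follows. The snippet is self-contained, so it carries its own small ULP helper. StepState, ulp_diff, and first_divergence are hypothetical names, and the sketch assumes both traces have already been validated and contain the same number of steps.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the three fields compared at each step. */
typedef struct {
    double w_after, b_after, loss;
} StepState;

/* ULP distance via an order-preserving integer mapping of the bit pattern. */
static uint64_t ulp_diff(double a, double b) {
    const uint64_t SIGN = 0x8000000000000000ULL;
    uint64_t ua, ub;
    if (a != a || b != b) return UINT64_MAX;   /* NaN never matches */
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ua = (ua & SIGN) ? SIGN - (ua & ~SIGN) : SIGN + ua;
    ub = (ub & SIGN) ? SIGN - (ub & ~SIGN) : SIGN + ub;
    return ua > ub ? ua - ub : ub - ua;
}

/* Return the index of the first step where any compared field differs by
 * more than tol ULPs, or n if the two runs agree on all n steps. */
static size_t first_divergence(const StepState *a, const StepState *b,
                               size_t n, uint64_t tol) {
    for (size_t i = 0; i < n; i++) {
        if (ulp_diff(a[i].w_after, b[i].w_after) > tol ||
            ulp_diff(a[i].b_after, b[i].b_after) > tol ||
            ulp_diff(a[i].loss,    b[i].loss)    > tol) {
            return i;
        }
    }
    return n;
}
```

Returning n for "no divergence" lets the caller tell agreement apart from a mismatch at step 0 without a separate flag.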

Building

make
make debug

Running tests

make test

License

MIT (see LICENSE).
