W4ilops/training-trace-runtime

Training Trace Runtime

What it is

Training Trace Runtime is a C11 runtime that records online SGD updates for a single-parameter linear model into a fixed-size binary trace, validates trace integrity (CRC32 + semantic checks), and replays or compares runs deterministically.

Why determinism matters in ML systems

Training bugs often hide behind nondeterminism. Deterministic event logs let you:

  • replay a run exactly,
  • bisect the first divergent step,
  • detect silent corruption (record CRC),
  • detect numerical drift (ULP comparison),
  • distinguish model bugs from data/IO bugs.

Binary trace format

All fields are little-endian.

TraceHeader (64 bytes)
Offset  Size  Field
0       4     magic[4] = "DGRR"
4       2     version (1)
6       2     _pad0
8       8     seed
16      8     step_count
24      8     learning_rate
32      4     file_crc32
36      4     header_crc32 (CRC32 of bytes [0..35])
40      8     created_at (unix seconds)
48      16    _reserved

TraceRecord (72 bytes)
Offset  Size  Field
0       8     step_id
8       8     w_before
16      8     b_before
24      8     grad_w
32      8     grad_b
40      8     w_after
48      8     b_after
56      8     loss
64      4     record_crc32 (CRC32 of bytes [0..63])
68      4     _pad0

_Static_assert(sizeof(TraceHeader)==64) and _Static_assert(sizeof(TraceRecord)==72) enforce ABI invariants.
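
A minimal sketch of how the two layouts above might be declared in C11, with a record CRC computed over bytes [0..63]. The field types (fixed-width integers, double for every 8-byte floating-point value, a signed 64-bit created_at), the helper names crc32_ieee, record_seal, and record_is_intact, and the choice of the reflected IEEE 802.3 CRC-32 polynomial are all assumptions; the README states only offsets, sizes, and the byte ranges the CRCs cover.

```c
#include <stddef.h>
#include <stdint.h>

/* Field types are assumptions; offsets and sizes follow the tables above.
   Mapping these structs directly onto file bytes is only valid on a
   little-endian host. */
typedef struct {
    char     magic[4];       /* "DGRR" */
    uint16_t version;        /* 1 */
    uint16_t _pad0;
    uint64_t seed;
    uint64_t step_count;
    double   learning_rate;
    uint32_t file_crc32;
    uint32_t header_crc32;   /* CRC32 of bytes [0..35] */
    int64_t  created_at;     /* unix seconds */
    uint8_t  _reserved[16];
} TraceHeader;

typedef struct {
    uint64_t step_id;
    double   w_before, b_before;
    double   grad_w, grad_b;
    double   w_after, b_after;
    double   loss;
    uint32_t record_crc32;   /* CRC32 of bytes [0..63] */
    uint32_t _pad0;
} TraceRecord;

/* C11 requires the message argument; the one-argument form is C23. */
_Static_assert(sizeof(TraceHeader) == 64, "TraceHeader must be 64 bytes");
_Static_assert(sizeof(TraceRecord) == 72, "TraceRecord must be 72 bytes");
_Static_assert(offsetof(TraceHeader, header_crc32) == 36, "header CRC offset");
_Static_assert(offsetof(TraceRecord, record_crc32) == 64, "record CRC offset");

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, init and final XOR
   0xFFFFFFFF). Assumed variant; the project may use a different one. */
uint32_t crc32_ieee(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Store the CRC of every byte that precedes record_crc32. */
void record_seal(TraceRecord *r) {
    r->record_crc32 = crc32_ieee(r, offsetof(TraceRecord, record_crc32));
}

/* Nonzero when the stored CRC still matches the record contents. */
int record_is_intact(const TraceRecord *r) {
    return r->record_crc32 == crc32_ieee(r, offsetof(TraceRecord, record_crc32));
}
```

Sealing a record and then changing any of its first 64 bytes should make record_is_intact return 0, which is the kind of silent corruption the per-record CRC is meant to catch.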

Mathematical foundation

Model:

\[ \hat y = w x + b,\quad L = (\hat y - y)^2 = (wx + b - y)^2 \]

Gradients:

\[ \frac{\partial L}{\partial w} = 2(wx+b-y)x,\quad \frac{\partial L}{\partial b} = 2(wx+b-y) \]

Update:

\[ w \leftarrow w - \alpha \frac{\partial L}{\partial w},\quad b \leftarrow b - \alpha \frac{\partial L}{\partial b} \]

Concrete step (x=2, y=3, w=0.5, b=0.1, α=0.01):

  • error \(e = wx + b - y = 0.5\cdot 2 + 0.1 - 3 = -1.9\)
  • \(\partial L/\partial w = 2(-1.9)(2) = -7.6\)
  • \(\partial L/\partial b = 2(-1.9) = -3.8\)
  • \(w' = 0.5 - 0.01(-7.6) = 0.576\)
  • \(b' = 0.1 - 0.01(-3.8) = 0.138\)
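
The formulas and the worked step above can be sketched as two small C functions. The type names Params and StepGrads and the function names squared_error_grads and sgd_update are hypothetical, chosen only for this illustration; they are not taken from the project's source.

```c
/* Hypothetical types for illustration only. */
typedef struct { double w, b; } Params;
typedef struct { double grad_w, grad_b, loss; } StepGrads;

/* Gradients of L = (w*x + b - y)^2 with respect to w and b. */
StepGrads squared_error_grads(Params p, double x, double y) {
    double e = p.w * x + p.b - y;   /* prediction error */
    StepGrads g;
    g.grad_w = 2.0 * e * x;
    g.grad_b = 2.0 * e;
    g.loss   = e * e;
    return g;
}

/* One SGD update: parameter minus learning rate times gradient. */
Params sgd_update(Params p, StepGrads g, double lr) {
    Params next;
    next.w = p.w - lr * g.grad_w;
    next.b = p.b - lr * g.grad_b;
    return next;
}
```

Calling these with w = 0.5, b = 0.1, x = 2, y = 3, and a learning rate of 0.01 should reproduce the values in the worked step, up to floating-point rounding.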

Loss aggregation uses Kahan compensated summation.
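
For reference, a minimal sketch of Kahan compensated summation. The accumulator type KahanAcc and the function kahan_add are hypothetical names; the README does not show how the runtime structures its accumulator. The sketch assumes the compiler is not reordering floating-point arithmetic (for example, no -ffast-math), since that can optimize the compensation term away.

```c
/* Hypothetical accumulator: running sum plus a compensation term that
   carries the low-order bits lost by each addition. */
typedef struct { double sum, c; } KahanAcc;

void kahan_add(KahanAcc *acc, double x) {
    double y = x - acc->c;          /* apply the correction from the last step */
    double t = acc->sum + y;        /* low-order bits of y may be lost here */
    acc->c = (t - acc->sum) - y;    /* recover what was lost (negated) */
    acc->sum = t;
}
```

The benefit shows up when many small losses are added to a much larger running total: a plain `sum += x` drops each contribution that falls below half a ULP of the total, while the compensated version carries those bits forward until they add up to a representable change.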

CLI reference

./dgrr train [--steps N] [--lr FLOAT] [--seed UINT64] [--out PATH]
./dgrr replay --step S [--trace PATH]
./dgrr inspect --step S [--trace PATH]
./dgrr rollback --step S [--trace PATH]
./dgrr compare --a PATH_A --b PATH_B [--ulp-tol N]

Exit codes:

  • 0 success
  • 1 usage error
  • 2 IO error
  • 3 trace validation failure
  • 4 divergence detected
  • 5 replay integrity failure
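
If a C harness needs to interpret these codes, they could be mirrored as an enum. The identifiers below (DgrrExit, the DGRR_EXIT_* constants, and dgrr_exit_name) are hypothetical and may not match the names used in the project's source; only the numeric values and descriptions come from the list above.

```c
/* Hypothetical names; numeric values follow the exit-code list above. */
typedef enum {
    DGRR_EXIT_OK           = 0,
    DGRR_EXIT_USAGE        = 1,
    DGRR_EXIT_IO           = 2,
    DGRR_EXIT_VALIDATION   = 3,
    DGRR_EXIT_DIVERGENCE   = 4,
    DGRR_EXIT_REPLAY       = 5
} DgrrExit;

/* Human-readable description for a process exit status. */
const char *dgrr_exit_name(int code) {
    switch (code) {
    case DGRR_EXIT_OK:         return "success";
    case DGRR_EXIT_USAGE:      return "usage error";
    case DGRR_EXIT_IO:         return "IO error";
    case DGRR_EXIT_VALIDATION: return "trace validation failure";
    case DGRR_EXIT_DIVERGENCE: return "divergence detected";
    case DGRR_EXIT_REPLAY:     return "replay integrity failure";
    default:                   return "unknown";
    }
}
```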

Replay integrity model

Replay recomputes:

\[ w_{\mathrm{re}} = w_{\mathrm{before}} - \mathrm{lr}\cdot \mathrm{grad\_w},\quad b_{\mathrm{re}} = b_{\mathrm{before}} - \mathrm{lr}\cdot \mathrm{grad\_b} \]

and compares against stored w_after / b_after using ULP tolerance.
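
One common way to implement such a check is to map each double to an integer that preserves ordering and take the difference. The sketch below assumes IEEE 754 binary64 doubles and treats any NaN as a mismatch; the function names ordered_bits, ulp_distance, within_ulps, and replay_step_ok are hypothetical, and the runtime's actual ULP metric may differ.

```c
#include <stdint.h>
#include <string.h>

/* Map a double to an integer that increases monotonically with its value.
   Both +0.0 and -0.0 map to 0. */
static int64_t ordered_bits(double x) {
    int64_t i;
    memcpy(&i, &x, sizeof i);
    return i < 0 ? INT64_MIN - i : i;
}

/* Number of representable doubles between a and b (0 when equal). */
uint64_t ulp_distance(double a, double b) {
    int64_t ia = ordered_bits(a), ib = ordered_bits(b);
    return ia >= ib ? (uint64_t)ia - (uint64_t)ib
                    : (uint64_t)ib - (uint64_t)ia;
}

/* Nonzero when a and b are within tol ULPs; NaN never matches. */
int within_ulps(double a, double b, uint64_t tol) {
    if (a != a || b != b) return 0;
    return ulp_distance(a, b) <= tol;
}

/* Recompute the update from the stored inputs and compare it against the
   stored outputs. */
int replay_step_ok(double w_before, double b_before,
                   double grad_w, double grad_b, double lr,
                   double w_after, double b_after, uint64_t ulp_tol) {
    double w_re = w_before - lr * grad_w;
    double b_re = b_before - lr * grad_b;
    return within_ulps(w_re, w_after, ulp_tol) &&
           within_ulps(b_re, b_after, ulp_tol);
}
```

A tolerance of 0 demands bit-identical results (apart from the sign of zero). A small nonzero tolerance absorbs differences such as fused multiply-add contraction between builds.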

Divergence detection

compare validates both traces, then scans them in lockstep, comparing w_after, b_after, and loss at each step with ULP tolerance. It reports the first mismatching step and the differing values.
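
The lockstep scan might look like the sketch below. It is an illustration under assumptions: StepState is a hypothetical cut-down view of a record holding only the three compared fields, first_divergence and the other helper names are invented for this example, and the scan covers only the first n steps both traces share, since the README does not say how traces of different lengths are handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the three fields that compare inspects. */
typedef struct { double w_after, b_after, loss; } StepState;

/* ULP distance between two IEEE 754 doubles via an order-preserving
   integer mapping. */
static uint64_t ulp_distance(double a, double b) {
    int64_t ia, ib;
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    if (ia < 0) ia = INT64_MIN - ia;
    if (ib < 0) ib = INT64_MIN - ib;
    return ia >= ib ? (uint64_t)ia - (uint64_t)ib
                    : (uint64_t)ib - (uint64_t)ia;
}

/* Nonzero when the values differ by more than tol ULPs or either is NaN. */
static int differs(double a, double b, uint64_t tol) {
    if (a != a || b != b) return 1;
    return ulp_distance(a, b) > tol;
}

/* Index of the first step where any compared field differs, or -1 if the
   first n steps agree within tol ULPs. */
ptrdiff_t first_divergence(const StepState *a, const StepState *b,
                           size_t n, uint64_t tol) {
    for (size_t i = 0; i < n; i++) {
        if (differs(a[i].w_after, b[i].w_after, tol) ||
            differs(a[i].b_after, b[i].b_after, tol) ||
            differs(a[i].loss,    b[i].loss,    tol))
            return (ptrdiff_t)i;
    }
    return -1;
}
```

Because every step after the first divergence inherits its error, stopping at the first mismatch is what makes the report useful for bisecting a bug.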

Building

make
make debug

Running tests

make test

License

MIT (see LICENSE).

About

Deterministic ML training replay runtime in pure C for tracing, inspecting, and replaying parameter updates.
