Skip to content

feat: FP stability CI (Verrou) + reduced-energy scheme + HLLC xi-factor fix#1403

Open
sbryngelson wants to merge 31 commits intoMFlowCode:masterfrom
sbryngelson:fp-stability
Open

feat: FP stability CI (Verrou) + reduced-energy scheme + HLLC xi-factor fix#1403
sbryngelson wants to merge 31 commits intoMFlowCode:masterfrom
sbryngelson:fp-stability

Conversation

@sbryngelson
Copy link
Copy Markdown
Member

@sbryngelson sbryngelson commented May 6, 2026

Summary

Three changes, all motivated by floating-point cancellation in stiffened-EOS flows:

1. FP stability CI suite (./mfc.sh fp-stability)

Adds a Verrou-based CI workflow that runs six 1-D cases under randomized IEEE-754 rounding and compares the L∞ deviation from a nearest-rounding reference.

Closes #650.

Case Description Threshold Ill-conditioned operation
sod_standard Standard Sod, ideal gas 1e-13 — (well-conditioned canary)
sod_strong p_L/p_R=100,000, ideal gas 1e-10 HLLC xi-factor (s_L−vel)/(s_L−s_S) near sonic contact
water_stiffened Stiffened EOS, π_inf=4046 1e-8 Pressure recovery p=(E−π_inf)/γ, loses ~4 digits when π_inf/p≈40,000
air_water_interface Two-fluid air/water isobaric contact 1e-10 Mixed-cell pressure recovery with α_water≪1
bubble_rp Bubbly water, pressure step 2:1, Keller-Miksis RP ODE 1e-8 (p_bub − p_ext) cancels near bubble equilibrium
low_mach Water shock with low_Mach=1 HLLC correction 1e-7 Low-Mach velocity perturbation ~u/c cancels severely at M≈0

On each run, verrou_dd_sym and verrou_dd_line bisect deterministically (float-mode, --nruns=1) to identify the minimal responsible functions and source lines. Logs are uploaded as CI artifacts; on GitHub Actions, failing source lines appear as inline ::warning:: annotations in the PR diff. A step-summary table with float-proxy and VPREC sweep results is written to the Actions run UI.

2. HLLC xi-factor denominator protection

xi_L/R = (s_L/R − vel)/(s_L/R − s_S) divided by near-zero when s_L ≈ s_S at a sonic contact. Fixed by clamping:

xi_L = (s_L - vel_L(dir_idx(1))) / min(s_L - s_S, -sgm_eps)
xi_R = (s_R - vel_R(dir_idx(1))) / max(s_R - s_S,  sgm_eps)

3. Reduced-energy (Ẽ) scheme for model_eqns=2

For the Allaire 5-equation model with stiffened EOS, pressure recovery
p = (E − π_inf − KE − qv)/γ loses ~log₁₀(π_inf/p) digits when π_inf ≫ p (water: π_inf=4046, p≈1 → ~4 digits lost).

Fix: store Ẽ = E − π_mix in the conserved energy slot. Pressure recovery becomes p = (Ẽ − KE − qv)/γ — no cancellation.

Changes required for consistency:

  • m_variables_conversion.fpp: s_compute_pressure and s_convert_primitive_to_conservative scoped to model_eqns==2; explicit branch for model_eqns=1/3 (physical E) prevents fallthrough to Tait EOS for the Saurel 6-eq model
  • m_riemann_solvers.fpp: HLL and LF non-MHD blocks use Ẽ states with π_inf restored for enthalpy H; HLLC block for model_eqns=3 unchanged (physical E throughout)
  • m_rhs.fpp: non-conservative source S_Ẽ = −Σᵢ πᵢ·rhs(αᵢ) added for x/y/z, gated on model_eqns==2 .and. .not.(bubbles_euler .or. mhd), with separate loops for HLL vs HLLC/LF
  • m_sim_helpers.fpp: IGR enthalpy computation updated for Ẽ convention
  • m_cbc.fpp: CBC energy flux updated for Ẽ convention

Test plan

  • ./mfc.sh precheck -j 8 — all 6 lint checks pass
  • ./mfc.sh build -j 8 — all 3 targets compile
  • model_eqns=2 HLLC 1D/2D tests (8 cases) — pass
  • model_eqns=3 HLLC 1D/2D/3D tests (6 cases) — pass
  • model_eqns=1 and HLL baseline tests — pass
  • Full CPU debug test suite (585 tests) — all pass

sbryngelson added 11 commits May 5, 2026 22:39
Adds ./mfc.sh fp-stability — a persistent floating-point stability test
suite using Verrou's random IEEE-754 rounding mode.

For each registered test case the runner:
  1. Generates initial conditions via pre_process
  2. Runs simulation once with --rounding-mode=nearest (reference)
  3. Runs simulation N times with --rounding-mode=random
  4. Reports max L-inf deviation vs threshold (PASS/FAIL)

Two cases probe known ill-conditioning in MFC:
  - sod_strong: 1-D Sod p_L/p_R=100,000 — HLLC xi-factor cancellation
    (s_L - vel_L)/(s_L - s_S) near sonic contact
  - water_stiffened: 1-D water shock pi_inf=4046 — pressure recovery
    p=(E-pi_inf)/gamma loses ~4 decimal digits on low-pressure side

Requires a Verrou-enabled Valgrind at $VERROU_HOME/bin/valgrind
(default: $HOME/.local/verrou). Silently skips if not found.
Binaries are auto-discovered from build/install/ or passed explicitly.
- Add sod_standard (p_L/p_R=10, well-conditioned, threshold=1e-13) as a
  canary that should always pass under any FP rounding
- Enlarge sod_strong and water_stiffened from 25→50 cells and 5→10 steps
  to give Verrou more arithmetic to perturb and raise sensitivity
- Run verrou_dd_sym automatically on any failing case: generates dd_run.sh
  and dd_cmp.py per case, sets PYTHONPATH for verrou Python libs, saves
  rddmin_summary and full logs to fp-stability-logs/ (uploaded as CI artifact)
- Add python3-numpy to CI apt-get and upload-artifact step (if: always())
- Add four new test cases total: sod_standard, sod_strong, water_stiffened,
  air_water_interface (two-fluid isobaric contact, stiffened EOS)
- Feature B: float-proxy run (--rounding-mode=float) measures single-precision
  sensitivity without recompiling; skippable with --no-float-proxy
- Feature C: VPREC mantissa sweep at [52,23,16,10] bits shows precision floor
  curve for each case; skippable with --no-vprec
- Feature D: verrou_dd_sym symbol-level delta-debug on failure; --no-dd-sym
- Feature E: verrou_dd_line source-line delta-debug on failure; --no-dd-line
- Add --no-float-proxy/--no-vprec/--no-dd-sym/--no-dd-line CLI flags
- Remove dead _max_diff() function (superseded by _max_diff_np())
- Update FP_STABILITY_COMMAND description to document all 4 cases and features
Store Ẽ = E - pi_inf_mix (reduced energy) in the conserved energy slot
for the Allaire 5-equation model (model_eqns=2). This eliminates
catastrophic cancellation when recovering pressure via
p = (Ẽ - KE - qv)/gamma vs the old p = (E - pi_inf - KE - qv)/gamma
where pi_inf >> p (e.g. water: pi_inf=4046 >> p=1).

Key changes:
- m_variables_conversion.fpp: scope Ẽ storage to model_eqns=2 only;
  add explicit model_eqns=1/3 branch (physical E) to avoid fallthrough
  to Tait EOS for model_eqns=3 (Saurel 6-equation)
- m_riemann_solvers.fpp: HLL and LF non-MHD blocks use Ẽ as states
  with pi_inf added back for enthalpy H; HLLC Block 1 (model_eqns=3)
  and Block 4 (model_eqns=2) retain their existing correct handling;
  xi-factor denominators protected with sgm_eps (Fix 1)
- m_rhs.fpp: add non-conservative Ẽ source S_Ẽ = -sum_i(pi_inf_i *
  rhs(alpha_i)) for x/y/z directions, gated on model_eqns==2 and
  .not.(bubbles_euler .or. mhd)
- m_sim_helpers.fpp: igr branch drops pi_inf from pressure recovery
  (Ẽ stored) and restores it for physical H
- m_cbc.fpp: energy flux drops dpi_inf_dt term (Ẽ not pi_inf in flux)
@sbryngelson sbryngelson changed the title feat: add fp-stability command for Verrou-based FP instability testing feat: FP stability CI (Verrou) + reduced-energy scheme + HLLC xi-factor fix May 6, 2026
sbryngelson added 15 commits May 6, 2026 16:08
The stiffened-EOS pressure recovery p=(E-pi_inf)/gamma suffers
catastrophic cancellation (~2.5e-9 max_dev under random rounding vs
the previous 1e-10 gate). The algorithmic fix (reduced-energy / Etilde
storage scheme) lives on feat/reduced-energy. Until that PR merges,
raise the gate to 1e-8 so CI is green while still catching regressions.
…fp-stability

On every run, fp_stability.py now writes to GITHUB_STEP_SUMMARY a
markdown table with pass/fail, max_dev, float proxy, and the full
VPREC precision sweep showing which bit levels (52/23/16/10) fail.

On failure, dd_line source locations are emitted as ::warning::
annotations so the responsible Fortran lines appear highlighted
directly in the PR diff view.

Both are no-ops outside GitHub Actions (env var guards).
The VPREC sweep is a sensitivity curve, not a pass/fail test.
Comparing reduced-precision runs against the double-precision threshold
marks every case as failing at 23b/16b/10b, which is noise.
Show raw dev numbers only; the main table has the actual pass/fail.
…_binary

- cons.unindent()/cons.print() were after the try/finally, so any
  MFCException raised inside _run_case would leave the console
  permanently over-indented for all subsequent case output.
- _find_binary used os.getcwd() which breaks if fp-stability is
  invoked from a subdirectory; MFC_ROOT_DIR is always correct.
Previously dd_sym/dd_line only ran on test failure, so passing runs
gave no source-level instability info.

Now dd_sym and dd_line always run, using a sensitivity threshold of
max_dev/10 rather than the pass/fail threshold. This isolates the
source lines responsible for the dominant FP variation even when the
case passes. Results are capped at top 10 locations per case.

The GitHub step summary now always shows a 'Top FP hotspots (dd_line)'
section, and ::warning:: annotations are emitted for all cases
(labelled 'hotspot' on pass, 'FAIL' on failure) so the sensitive
Fortran lines are visible in the PR diff regardless of CI outcome.
… run

verrou_dd_sym sets VERROU_ROUNDING_MODE=nearest when producing its reference
run, then leaves it unset for test runs.  Hardcoding --rounding-mode=float as
a CLI arg overrides that env var (CLI takes precedence in Valgrind), so both
reference AND full-perturbation test end up in float mode, give identical
output, deviation=0, and dd exits 42.

Fix: use ${VERROU_ROUNDING_MODE:-float} in dd_run.sh so verrou_dd_sym's
nearest-rounding reference is preserved, while test steps still default to
float mode (deterministic, --nruns=1 sufficient).
…to fp-stability

Three new Verrou analysis passes, each enabled by default with --no-X to skip:

F. --check-cancellation=yes (--no-cancellation): uses --cc-gen-file for structured
   per-line output of catastrophic cancellation sites in MFC source.

G. --backend=mcaquad MCA runs (--no-mca): N samples with Monte Carlo Arithmetic;
   reports max deviation and a significant-bits lower bound: s = -log2(dev/scale).

H. --check-max-float=yes (--no-float-max): detects double→float conversions that
   would overflow to ±Inf; reports source locations from Valgrind error log.

Results added to GitHub step summary (cancellation sites table, float-max sites
table, MCA sig-bits column in the main results table).
Previously only dd_line locations were annotated (minimal delta-debug
set, often just 1 line).  Now also emit the top 3 cancellation sites
per case as ::notice:: so the PR diff shows a richer set of FP hotspots.
…eedback

- Add _get_source_context() to embed 5-line annotated snippets in the
  GitHub step summary for each dd_line hotspot and cancellation site
- Fix workflow on: push: to only fire on master and fp-stability branches,
  plus add weekly cron schedule (Monday 03:00 UTC)
- Fix log_dir to use MFC_ROOT_DIR instead of os.getcwd()
- Remove unreachable 'or 5' fallback on ARG('samples') (default=5 in CLI)
@MFlowCode MFlowCode deleted a comment from github-actions Bot May 7, 2026
In the 5-equation HLLC block, xi_L ≈ 1 near the sonic contact point, so
computing (xi_L - 1) by post-hoc subtraction from 1 loses significant bits.

Add xi_L_m1 = (s_S - vel_L)/(s_L - s_S), computed directly instead of as
xi_L - 1, and use it throughout the mass, momentum, energy, volume-fraction,
color-function, and species flux loops.

The momentum flux also uses the algebraic identity
  xi*(dir_flg*s_S + (1-dir_flg)*u_i) - u_i = (dir_flg*s_L + (1-dir_flg)*u_i)*xi_m1
which avoids both the xi - 1 and the xi*s_S - u (normal direction) forms.

For the energy flux the star-state term xi_L*(E + expr) - E is rewritten as
E*xi_L_m1 + xi_L*expr, removing the E*(xi_L - 1) cancellation.

All 162 1D regression tests pass; all 6 FP-stability suite cases pass.
@sbryngelson sbryngelson marked this pull request as ready for review May 8, 2026 19:55
@qodo-code-review
Copy link
Copy Markdown
Contributor

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

Claude Code Review

Head SHA: a41594f

Files changed:

  • 6
  • .github/workflows/fp-stability.yml
  • .gitignore
  • src/simulation/m_riemann_solvers.fpp
  • toolchain/main.py
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/fp_stability.py

Findings

1. Cancellation fix (xi_L_m1/xi_R_m1) only applied to one of four HLLC blocks

xi_L_m1 and xi_R_m1 are declared once at the subroutine level and computed only inside the fourth HLLC block (diff hunk at ~line 3129 of m_riemann_solvers.fpp):

xi_L_m1 = (s_S - vel_L(dir_idx(1)))/min(s_L - s_S, -sgm_eps)
xi_R_m1 = (s_S - vel_R(dir_idx(1)))/max(s_R - s_S, sgm_eps)

The flux-formula refactoring that substitutes xi_L_m1/xi_R_m1 for (xi_L - 1._wp) (continuity, momentum, energy, advection, species, color function) is also exclusive to block 4, confirmed by the GPU data clause update (diff hunk at ~line 2847) adding xi_L_m1, xi_R_m1 only to that block's private list.

The three earlier HLLC blocks (diff hunks at ~lines 2041, 2332, 2689) receive only the denominator-clamping change:

xi_L = (s_L - vel_L(dir_idx(1)))/min(s_L - s_S, -sgm_eps)
xi_R = (s_R - vel_R(dir_idx(1)))/max(s_R - s_S, sgm_eps)

Their downstream flux formulas are unchanged, meaning they still evaluate (xi_L - 1.0_wp) and (xi_R - 1.0_wp). This subtraction catastrophically cancels when xi_L ≈ 1 (i.e., near a sonic contact — exactly the scenario this PR targets). The denominator clamping prevents divide-by-zero but does not eliminate that cancellation.

If any of blocks 1–3 is exercised by configurations involving near-sonic contacts, the FP instability remains. The PR should either apply the xi_m1 reformulation to all four blocks, or document why blocks 1–3 are safe from the sonic-contact case.


2. Test case input files are absent from the PR

toolchain/mfc/fp_stability.py defines:

CASES_DIR = os.path.join(MFC_ROOT_DIR, "tests", "fp_stability", "cases")

Each of the six registered cases (sod_standard, sod_strong, water_stiffened, air_water_interface, bubble_rp, low_mach) requires pre_process.inp and simulation.inp under CASES_DIR/<name>/. _run_case calls shutil.copy2 on these files immediately; if either is absent the run aborts with a Python exception.

The .gitignore hunk (!/tests/fp_stability/**) whitelists those paths so they are tracked in git, which implies they should be committed as part of this PR. They are absent from the changed-files list. The CI workflow step "Run FP Stability Suite" will therefore fail on the first case with a FileNotFoundError even when Verrou and the MFC binaries are correctly installed and cached.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.90%. Comparing base (141ff01) to head (ef20c51).

Files with missing lines Patch % Lines
src/simulation/m_riemann_solvers.fpp 81.25% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1403   +/-   ##
=======================================
  Coverage   64.90%   64.90%           
=======================================
  Files          72       72           
  Lines       18874    18876    +2     
  Branches     1571     1571           
=======================================
+ Hits        12250    12252    +2     
  Misses       5649     5649           
  Partials      975      975           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Valgrind verrou to check for sketchy float operations

1 participant