MRB-650 maps simplified#92

Open: jonasbhend wants to merge 110 commits into main from MRB-650-Maps-simplified

Conversation

Contributor

@jonasbhend jonasbhend commented Jan 7, 2026

Opt-in metric maps for runs and baselines

Adds a pipeline that produces temporally aggregated BIAS/RMSE/MAE maps for both model runs and baselines. The work is computationally heavy, so it's gated behind a new --maps flag on evalml experiment.

What's new

CLI / config

  • evalml experiment <config> --maps triggers map plotting alongside the standard pipeline.
  • New optional metric_maps: block in config YAML controls params, leadtimes, metrics, regions, and seasons. Sensible defaults; existing configs work unchanged.
  • JSON schema regenerated.
  • The domains originally called "centraleurope" and "switzerland" cover mostly the same area. I suggest making the "switzerland" domain much smaller, so that more spatial detail is visible, especially in the complex topography of the Alps.
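
To illustrate how an optional config block can leave existing configs untouched, here is a minimal sketch. The class name `MetricMapsSketch`, the helper `load_metric_maps`, and every default value are assumptions made for illustration; only the field names (params, leadtimes, metrics, regions, seasons) come from the description above, and the real evalml config model may look different.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetricMapsSketch:
    """Hypothetical stand-in for the optional `metric_maps:` config block."""

    # All defaults are illustrative assumptions, not evalml's actual defaults.
    params: list[str] = field(default_factory=lambda: ["T_2M"])
    leadtimes: list[int] = field(default_factory=lambda: [24])
    metrics: list[str] = field(default_factory=lambda: ["BIAS", "RMSE", "MAE"])
    regions: list[str] = field(default_factory=lambda: ["centraleurope", "switzerland"])
    seasons: list[str] = field(default_factory=lambda: ["all"])


def load_metric_maps(config: dict) -> MetricMapsSketch:
    """Fall back to defaults when the block is absent from the parsed YAML."""
    return MetricMapsSketch(**config.get("metric_maps", {}))
```

A config without a `metric_maps:` key yields the defaults, while a partial block overrides only the fields it names.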

Workflow

  • New target metric_maps_all, only built when --maps is set.
  • New rules verification_metrics_maps (runs, GRIB input) and verification_metrics_maps_baseline (baselines, zarr input).
  • New plot_summary_stat_maps / plot_summary_stat_maps_baseline produce the seasonal map plots.
  • Map plots live under results/{experiment}/metric_maps/{runs,baselines}/.
  • Removed obsolete spatial-data filtering in report_experiment_dashboard.py — no longer needed now that metric maps live in their own files instead of verif_aggregated.nc.

Script

  • workflow/scripts/verification_metric_maps.py is a single unified script handling both run (GRIB) and baseline (zarr) inputs via mutually exclusive --run_root / --baseline_root.
  • Streaming aggregation: BIAS/RMSE/MAE accumulators stream over init times; no per-init-time error fields are written to disk.
  • --reftimes flag restricts processing to the configured hindcast period (essential for baselines, whose zarr is a continuous archive).
  • Per-season stratification (DJF/MAM/JJA/SON/all).
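
The streaming aggregation can be sketched roughly as follows. This is not the code in verification_metric_maps.py; the class name `StreamingErrorStats` and its methods are hypothetical, and the sketch ignores seasons and assumes plain NumPy arrays on a common grid. It only shows why no per-init-time error fields need to be stored: three running sums and a count are enough to derive BIAS, RMSE and MAE at the end.

```python
import numpy as np


class StreamingErrorStats:
    """Running sums from which BIAS, RMSE and MAE are derived (hypothetical)."""

    def __init__(self, shape):
        self.n = np.zeros(shape)       # number of valid samples per grid point
        self.sum_e = np.zeros(shape)   # sum of errors
        self.sum_se = np.zeros(shape)  # sum of squared errors
        self.sum_ae = np.zeros(shape)  # sum of absolute errors

    def update(self, forecast, truth):
        """Add the error field of one init time; NaNs are skipped."""
        err = forecast - truth
        valid = np.isfinite(err)
        err = np.where(valid, err, 0.0)
        self.n += valid
        self.sum_e += err
        self.sum_se += err**2
        self.sum_ae += np.abs(err)

    def finalize(self):
        """Turn the accumulated sums into score maps (NaN where n == 0)."""
        with np.errstate(invalid="ignore", divide="ignore"):
            return {
                "BIAS": np.where(self.n > 0, self.sum_e / self.n, np.nan),
                "RMSE": np.where(self.n > 0, np.sqrt(self.sum_se / self.n), np.nan),
                "MAE": np.where(self.n > 0, self.sum_ae / self.n, np.nan),
            }
```

Memory use is independent of the number of init times, at the cost of an explicit loop over them.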

Testing

End-to-end validated with the example configs.
A 40-init-time real-scale run completes in ~26 min wall-clock.

Deferred to follow-up PRs

  • GRIB loader per-call overhead (TODO marker added in data_input/__init__.py)
  • Wind direction / vector visualisation
  • Reorganisation of pre-existing scalar plots under plots/
  • Consolidation of load_fct_data_from_grib and load_state_from_grib
  • Same-grid short-circuit in map_forecast_to_truth
  • Nice country polygons for map plots
  • plot-every-pixel approach for the switzerland domain

Authors

Co-authored-by: Louis Frey louis.frey@meteoswiss.ch
Co-authored-by: Francesco Zanetta francesco.zanetta@meteoswiss.ch
Co-authored-by: Jonas Bhend jonas.bhend@meteoswiss.ch

@Louis-Frey Louis-Frey force-pushed the MRB-650-Maps-simplified branch from 2185fd6 to 9eb4643 Compare January 22, 2026 12:43
jonasbhend and others added 29 commits January 27, 2026 16:28
summary statistics. (No changes to code yet.)
For Bias, RMSE and MAE map plots.
Francesco. Got a long way towards the png plots.

Co-authored-by: Francesco Zanetta <francesco.zanetta@meteoswiss.ch>
properly working). Output written to .png now
working.
detailed inspection of results at smaller spatial
scale.
Louis-Frey added 11 commits May 5, 2026 11:34
Remove dev-commentary comments in the import cell, sever the unused
`var` cross-cell dependency, and drop a trailing empty marimo cell.
No behaviour change.
These were workarounds for transient node issues at development time
and shouldn't be baked into the workflow.
# Conflicts:
#	src/evalml/cli.py
#	src/verification/__init__.py
@Louis-Frey Louis-Frey marked this pull request as ready for review May 6, 2026 09:17
@Louis-Frey
Contributor

I have fixed all the issues that, in my view, remained, and tested with the ICON example configs. With anemoi-inference pinned to version 0.10.0, they run without error. The PR is ready; please check again, @dnerini @jonasbhend @frazane.

Louis-Frey added 3 commits May 8, 2026 10:57
Both loaders assumed >=2 lead times when disaggregating TOT_PREC, so
the maps rules (which pass a single step) crashed: load_baseline_from_zarr
with "fmin which has no identity" on the empty .diff(), and
load_fct_data_from_grib with "not all values found in index 'lead_time'"
when anemoi-inference omits step 0.

Push step-0 augmentation into the loaders: fetch step 0 alongside the
requested steps for cumulative-from-start params, synthesize it as 0 if
the GRIB lacks it, and drop it from the output. The maps script's
_preceding_step shim is no longer needed and is removed.

Production callers (regular verification, plot rules) all start at step 0,
so their behavior is unchanged.
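
A simplified sketch of the step-0 augmentation, assuming NumPy arrays with lead time on axis 0. The function name `disaggregate` is hypothetical, and the real loaders work on GRIB and zarr data; what they return when step 0 is itself requested is not modelled here.

```python
import numpy as np


def disaggregate(steps, cumulative):
    """Convert accumulated-from-start values into per-interval amounts.

    steps: requested lead times in hours, ascending.
    cumulative: accumulated values at those steps, lead time on axis 0.
    """
    steps = np.asarray(steps)
    cumulative = np.asarray(cumulative, dtype=float)
    if steps[0] != 0:
        # Synthesize step 0: nothing has accumulated at initialisation time.
        steps = np.concatenate([[0], steps])
        zero = np.zeros_like(cumulative[:1])
        cumulative = np.concatenate([zero, cumulative], axis=0)
    # Differencing consumes the leading step 0, so it never reaches the output.
    return steps[1:], np.diff(cumulative, axis=0)
```

With a single requested step, the difference is taken against the synthesized zero rather than over an empty array.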
Aligns the plot rules and script with the rest of the metric-maps
naming convention. No logic change; only rule names, the script
filename, and the log path are renamed. Output paths are unchanged.
# Conflicts:
#	src/verification/__init__.py
#	workflow/scripts/verification_metrics.py
Comment thread .gitignore
Comment thread src/plotting/__init__.py Outdated
Comment thread workflow/rules/plot.smk Outdated
Comment thread workflow/scripts/verification_metrics.py Outdated
Louis-Frey added 8 commits May 8, 2026 15:03
Keep commented line for potential future interactive map plotting
(e.g., design stuff like nicer country-border polygons, or plotting every pixel).
The on-the-fly wind-speed computation was a half-baked feature outside
the scope of MRB-650 (which is about metric maps). Comprehensive
wind-speed support — covering point verification, dashboard, and
proper config-driven derivation — belongs in a separate focused PR.

Map plots of SP_10M continue to work because the metric-maps pipeline
derives wind speed independently in verification_metric_maps.py.
The "hard-code for the moment, can still make smarter later on" line
was the same anti-pattern as the recently-removed proposal comments.
Replace it (plus the surrounding rationale block) with a tight
impersonal description of the colour-scheme choice.
Contributor Author

@jonasbhend jonasbhend left a comment

Hi @Louis-Frey. Wow, this has come a long way since I last had a look. Congrats!

I have a number of points that I would like to discuss. The more top-level ones are summarized below:

  1. Naming
    In the evaluation part, we use the terms scores and metrics, where metrics are independent of the ground truth (i.e. a 'statistic' of the forecast alone). This is not set in stone in the community, but I think it would help to keep the naming consistent at least throughout evalml.

  2. Config
    Instead of computing the scores for a specific lead time, I think we should be able to compute the scores for all the lead times specified in the corresponding entries of the runs in the config. This would allow us to store results (.nc files) that are largely independent of the time slices selected for visualization. If the time it takes to compute all steps is really prohibitive, we should discuss it.

  3. Precip-related code
    Computing error components for all steps should also allow us to avoid issues with disaggregation (and the need to load adjacent time steps for accumulated quantities). I have the impression that there is quite a lot of precip-related code that may become redundant.

  4. Output paths
    Currently the spatial scores are stored in files such as: output/data/baselines/ICON-CH1-EPS/metric_maps/T_2M_24.nc. I think at least for baselines, we need a different approach, as we run into problems when using the same baseline in different experiments (e.g. verification against analysis, verification against stations or low and high-resolution analysis data).

  5. Adjacent work
    There are a number of changes that are related to other ongoing work, most notably the harmonization of data input across experiments and showcases. I would therefore strongly suggest coordinating. What is not clear to me is which functionality this PR needs that the current data_input module lacks. If we manage to integrate all the necessary features into the data_input module, the harmonization is likely to be much easier.

@cosunae is working on a side-by-side visualization of maps for the showcases. Maybe this functionality could be leveraged to show side-by-side animations (plots) of maps of scores?

Addendum:
My heart bleeds when I see the manual iteration through reftime and summing up of components. I was really hoping we could avoid this (I know we can't currently). So kudos to you, @Louis-Frey, for implementing this nonetheless.

# ---------------------------------------------------------------------------


def _open_zarr_component(root: Path, param: str) -> xr.DataArray:
Contributor Author

Is this functionality that is unique to the maps application or could this be introduced in load_truth_data (in fact in load_analysis_from_zarr)?

Comment on lines +146 to +149
# ---------------------------------------------------------------------------
# GRIB step helpers
# ---------------------------------------------------------------------------

Contributor Author

Is something missing here, or is it just not necessary any more?

Comment on lines +489 to +501
parser.add_argument(
    "--baseline_root",
    type=Path,
    default=None,
    help="Root directory of a baseline (e.g. /path/to/ICON-CH1-EPS), containing FCST<YY>.zarr files.",
)
parser.add_argument(
    "--baseline_zarrs",
    type=Path,
    nargs="+",
    default=None,
    help="Explicit list of baseline zarr paths (used by Snakemake for dependency tracking).",
)
Contributor Author

Do we really need both? Baseline zarrs are not an output from a downstream rule, so baseline_root should be sufficient.
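
If only --baseline_root were kept, the zarr stores could be discovered from the root directory. A minimal sketch, assuming the stores sit directly below the root and follow the FCST<YY>.zarr naming from the help text; the function name `discover_baseline_zarrs` is hypothetical and not part of the PR.

```python
from __future__ import annotations

from pathlib import Path


def discover_baseline_zarrs(baseline_root: Path) -> list[Path]:
    """List the FCST<YY>.zarr stores directly below a baseline root."""
    zarrs = sorted(baseline_root.glob("FCST*.zarr"))
    if not zarrs:
        raise FileNotFoundError(f"no FCST*.zarr stores found under {baseline_root}")
    return zarrs
```

Whether Snakemake still needs the explicit list for its own input tracking is a separate question that this sketch does not answer.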

Comment on lines +542 to +544
args.output = (
    source / f"verification_metric_maps_{args.param}_step{args.step:03d}h.nc"
)
Contributor Author

If I understand correctly, this would try to write into the baseline_zarr root (e.g. /store_new/mch/msopr/ml/ICON-CH1-EPS) if no output is provided. That would almost certainly fail, or not be what we want. Instead, I would expect the maps NetCDF files to live under output/data/runs/<run_id>/<netcdf>.nc or output/data/baselines/<baseline_id>/<netcdf>.nc. I suggest changing this accordingly.
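
One possible shape for such a default, as a sketch only. The helper `default_output_path` and its parameters are hypothetical; the directory layout follows the suggestion above and the file name pattern is taken from the snippet under review.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional


def default_output_path(
    param: str,
    step: int,
    run_id: Optional[str] = None,
    baseline_id: Optional[str] = None,
    data_root: Path = Path("output/data"),
) -> Path:
    """Build a default NetCDF path below output/data instead of the input root."""
    if (run_id is None) == (baseline_id is None):
        raise ValueError("exactly one of run_id or baseline_id must be given")
    if run_id is not None:
        sub = Path("runs") / run_id
    else:
        sub = Path("baselines") / baseline_id
    return data_root / sub / f"verification_metric_maps_{param}_step{step:03d}h.nc"
```

For example, `default_output_path("T_2M", 24, baseline_id="ICON-CH1-EPS")` resolves to a file below output/data/baselines/ICON-CH1-EPS/ and never touches the zarr archive.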

f"{args.param}.MAE": _seasonal_da(
    lambda n, s: np.where(n > 0, accum_sum_ae[s] / n, np.nan)
),
f"{args.param}.N": _seasonal_da(lambda n, s: np.where(n > 0, n, np.nan)),
Contributor Author

The components also allow computing the standard deviation of the error:

f"{args.param}.STDE": _seasonal_da(
    # np.maximum guards against a slightly negative variance from rounding
    lambda n, s: np.where(n > 0, np.sqrt(np.maximum(accum_sum_se[s] / n - (accum_sum_e[s] / n) ** 2, 0)), np.nan)
),
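
As a quick sanity check of that identity, the following sketch uses synthetic error fields (not evalml data) and local variable names chosen for illustration. It confirms that the accumulated sums reproduce the population standard deviation and that RMSE² = BIAS² + STDE².

```python
import numpy as np

# Synthetic errors: 40 init times on a 3 x 4 grid (illustrative values only).
rng = np.random.default_rng(42)
errors = rng.normal(loc=0.5, scale=2.0, size=(40, 3, 4))

n = errors.shape[0]
sum_e = errors.sum(axis=0)        # accumulated errors
sum_se = (errors**2).sum(axis=0)  # accumulated squared errors

bias = sum_e / n
rmse = np.sqrt(sum_se / n)
# Clip at zero so floating-point rounding cannot yield a negative variance.
stde = np.sqrt(np.maximum(sum_se / n - bias**2, 0.0))

# STDE from the sums equals the population standard deviation (ddof=0).
assert np.allclose(stde, errors.std(axis=0))
# RMSE decomposes into a systematic (BIAS) and a random (STDE) part.
assert np.allclose(rmse**2, bias**2 + stde**2)
```

Note that this is the population form (division by n); a sample standard deviation would need the n - 1 correction.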

@Louis-Frey
Contributor

Hi @jonasbhend, thanks a lot for the feedback and the kudos :-) I have to leave soon, so I will address your points next week. Regarding the

> manual iteration through reftime and summing up

I have to give credit to @frazane too; it was mainly his idea.

verification_metric_maps.py now keys error accumulators by (season,
init_hour) instead of season only, producing netcdfs with a new
init_hour dimension (integer hour, -999 = "all"; matches the convention
in verification_aggregation.py).

plot_metric_maps.mo.py + plot.smk + Snakefile wire init_hour through
as a wildcard so per-init-hour plots can be requested via the new
metric_maps.init_hours config field (default ["all"], unchanged
behaviour). MetricMapsConfig + config.schema.json + the eight example
configs gain the init_hours scaffolding.

This lets future per-init-hour analyses re-aggregate from the existing
netcdfs without re-reading GRIBs.
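
The (season, init_hour) keying might look roughly like the sketch below. The function `accumulator_keys` and the month-to-season table are hypothetical, and assigning the season by init time (rather than valid time) is an assumption; only the -999 sentinel for "all" comes from the commit message.

```python
from datetime import datetime

ALL_HOURS = -999  # sentinel meaning "all init hours", per the commit message

# Meteorological seasons by calendar month.
SEASON_BY_MONTH = {
    12: "DJF", 1: "DJF", 2: "DJF",
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
}


def accumulator_keys(init_time: datetime) -> list:
    """Return every (season, init_hour) accumulator one init time feeds."""
    season = SEASON_BY_MONTH[init_time.month]
    return [
        (s, h)
        for s in (season, "all")
        for h in (init_time.hour, ALL_HOURS)
    ]
```

Under these assumptions each init time updates four accumulators, so the "all" slices are available without a second pass over the data.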

4 participants