Summary
apply_inverse_transfer_function_single_position uses multiprocessing.Pool to parallelize reconstruction across timepoints. In pipeline contexts (e.g. Nextflow + Slurm), the orchestrator already handles parallelism at the position level: each Slurm job processes a single position. Within that job, mp.Pool forks child processes that each appear to end up with a full copy of the input data (via fork + copy-on-write), leading to very high memory usage and poor performance on large volumes.
Since the core compute (numpy/scipy 3D FFTs, Tikhonov solves) releases the GIL, thread-based parallelism would avoid the data-copying overhead while achieving equivalent CPU utilization.
Observed behavior
- Input: 5 timepoints × 1 channel × 86 Z × 1600 Y × 1370 X, float32 (~750 MB per timepoint)
- Settings: 3D Tikhonov with TV_iterations: 1, z_padding: 5, num_processes=4
- Context: Nextflow dispatches one Slurm job per position (4 positions running concurrently), each calling apply_inverse_transfer_function_single_position with num_processes=4
- Result: 0 out of 5 timepoints completed after 1h16m. Each of the 4 worker processes consumed 35–45 GB RSS (vs ~750 MB per volume), indicating massive memory duplication from forking.
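For reference, the sizes quoted above can be checked with a quick back-of-envelope calculation. This is only arithmetic on the reported dimensions and RSS range; it does not measure anything, and part of the multiple may come from intermediates such as padded or complex-valued FFT buffers rather than from duplication alone.

```python
# Back-of-envelope memory estimate using the figures reported above.
Z, Y, X = 86, 1600, 1370
BYTES_PER_FLOAT32 = 4

bytes_per_timepoint = Z * Y * X * BYTES_PER_FLOAT32
mb_per_timepoint = bytes_per_timepoint / 1e6  # decimal megabytes

num_timepoints = 5
gb_all_timepoints = num_timepoints * bytes_per_timepoint / 1e9

# Observed RSS range per worker process, in GB, as reported above.
observed_rss_gb = (35, 45)
ratio_low = observed_rss_gb[0] * 1e9 / bytes_per_timepoint
ratio_high = observed_rss_gb[1] * 1e9 / bytes_per_timepoint

print(f"{mb_per_timepoint:.0f} MB per timepoint")
print(f"{gb_all_timepoints:.2f} GB for all {num_timepoints} timepoints")
print(f"worker RSS is roughly {ratio_low:.0f}x to {ratio_high:.0f}x one timepoint")
```

Even the full 5-timepoint input is under 4 GB, so a single worker holding 35 to 45 GB is far more than the raw input accounts for.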
Relevant code
https://github.com/mehta-lab/waveorder/blob/main/waveorder/cli/apply_inverse_transfer_function.py
with mp.Pool(num_processes) as p:
    p.starmap(...)
Suggestion
Rather than replacing the existing mp.Pool path (which may work well for standalone CLI usage), it would be useful to add an alternative thread-based parallelization option — e.g. a num_threads parameter that uses concurrent.futures.ThreadPoolExecutor instead.
This is analogous to what iohub did in czbiohub-sf/iohub#396 (landed in iohub v0.3.1), where process_single_position was switched from multiprocessing.Pool to ThreadPoolExecutor since numpy/scipy/torch all release the GIL.
Thread-based parallelism avoids the memory duplication that makes mp.Pool expensive when an external orchestrator (Nextflow, Snakemake, etc.) is already handling coarse-grained parallelism across positions.
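A minimal sketch of what the opt-in could look like, to make the suggestion concrete. Everything here except num_processes and the proposed num_threads is hypothetical: reconstruct_timepoint, run_over_timepoints, volumes and gain are placeholder names, not waveorder's API, and the pure-Python placeholder does not release the GIL, so this only illustrates the dispatch logic, not the speedup.

```python
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor


def reconstruct_timepoint(volumes, t, gain):
    """Hypothetical stand-in for the per-timepoint reconstruction."""
    # Reads the shared input in place; with threads nothing is pickled or copied.
    return t, sum(volumes[t]) * gain


def run_over_timepoints(volumes, gain, num_processes=1, num_threads=None):
    """Use threads when num_threads is given, otherwise keep the mp.Pool path."""
    args = [(volumes, t, gain) for t in range(len(volumes))]
    if num_threads is not None:
        with ThreadPoolExecutor(max_workers=num_threads) as executor:
            # executor.map preserves input order, like Pool.starmap.
            return list(executor.map(lambda a: reconstruct_timepoint(*a), args))
    # Existing behaviour: arguments are pickled and sent to worker processes.
    with mp.Pool(num_processes) as p:
        return p.starmap(reconstruct_timepoint, args)


# Toy input: three "timepoints" of two values each (assumed data).
volumes = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
results = run_over_timepoints(volumes, gain=2.0, num_threads=2)
print(results)
```

Keeping num_processes as the default path means standalone CLI users see no change, while a pipeline could pass num_threads to stay within a single process per Slurm job.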
Environment
- waveorder 3.0.2
- Python 3.12
- HPC node: 16 CPUs, 244 GB RAM
- Lustre filesystem