Replace mp.Pool with ThreadPoolExecutor in apply_inverse_transfer_function_single_position #552

@aofei-liu

Summary

apply_inverse_transfer_function_single_position uses multiprocessing.Pool to parallelize reconstruction across timepoints. In pipeline contexts (e.g. Nextflow + Slurm), the orchestrator already handles parallelism at the position level — each Slurm job processes a single position. Within that job, mp.Pool forks child processes that each get a full copy of the input data (via fork + copy-on-write), leading to very high memory usage and poor performance on large volumes.

Since the core compute (numpy/scipy 3D FFTs, Tikhonov solves) releases the GIL, thread-based parallelism would avoid the data-copying overhead while achieving equivalent CPU utilization.
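The memory side of this argument can be illustrated with a small sketch. It assumes a toy (T, Z, Y, X) array and a hypothetical `reconstruct_stub` function standing in for the real per-timepoint reconstruction; neither is waveorder code. It only shows that worker threads read views of one shared array rather than per-worker copies, and does not measure GIL behavior or speed.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Small stand-in for a (T, Z, Y, X) dataset; real volumes are far larger.
data = np.random.default_rng(0).random((4, 8, 16, 16), dtype=np.float32)


def reconstruct_stub(t):
    """Hypothetical stand-in for a per-timepoint reconstruction."""
    volume = data[t]  # basic indexing returns a view, not a copy
    shares = np.shares_memory(volume, data)
    spectrum = np.fft.fftn(volume)  # placeholder for the real compute
    return t, shares, float(np.abs(spectrum).sum())


# All threads live in one process and read the same `data` buffer.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(reconstruct_stub, range(data.shape[0])))

for t, shares, total in results:
    print(f"t={t} shares_input_memory={shares} spectrum_sum={total:.1f}")
```

Each worker still allocates its own intermediates (here the FFT output), so threads remove the duplicated input, not the per-task working memory.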

Observed behavior

  • Input: 5 timepoints × 1 channel × 86 Z × 1600 Y × 1370 X, float32 (~750 MB per timepoint)
  • Settings: 3D Tikhonov with TV_iterations: 1, z_padding: 5, num_processes=4
  • Context: Nextflow dispatches one Slurm job per position (4 positions running concurrently), each calling apply_inverse_transfer_function_single_position with num_processes=4
  • Result: 0 out of 5 timepoints completed after 1h16m. Each of the 4 worker processes consumed 35–45 GB RSS (vs ~750 MB per volume), indicating massive memory duplication from forking.
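As a quick check of the sizes quoted above, the arithmetic can be written out as follows. The shape and dtype come from the report; the variable names are only illustrative, and the ratio ignores padding and intermediate buffers that the reconstruction itself needs.

```python
# Shape from the report: 5 timepoints, 1 channel, 86 Z, 1600 Y, 1370 X, float32.
timepoints, z, y, x = 5, 86, 1600, 1370
bytes_per_voxel = 4  # float32

bytes_per_timepoint = z * y * x * bytes_per_voxel
total_bytes = timepoints * bytes_per_timepoint

print(f"{bytes_per_timepoint / 1e6:.0f} MB per timepoint")
print(f"{total_bytes / 1e9:.2f} GB for all {timepoints} timepoints")

# Observed worker RSS (35 to 45 GB) relative to one input volume.
low_ratio = 35e9 / bytes_per_timepoint
high_ratio = 45e9 / bytes_per_timepoint
print(f"worker RSS is roughly {low_ratio:.0f}x to {high_ratio:.0f}x one volume")
```

Even the whole 5-timepoint input is under 4 GB, so a single worker at 35 GB or more holds far more than the raw input data.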

Relevant code

https://github.com/mehta-lab/waveorder/blob/main/waveorder/cli/apply_inverse_transfer_function.py

with mp.Pool(num_processes) as p:
    p.starmap(...)

Suggestion

Rather than replacing the existing mp.Pool path (which may work well for standalone CLI usage), it would be useful to add an alternative thread-based parallelization option — e.g. a num_threads parameter that uses concurrent.futures.ThreadPoolExecutor instead.
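A minimal sketch of what such an option could look like is below. The dispatcher `run_over_timepoints`, its signature, and `fake_reconstruct` are hypothetical names for illustration and are not part of waveorder; only `multiprocessing.Pool` and `concurrent.futures.ThreadPoolExecutor` are real APIs.

```python
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor


def run_over_timepoints(func, args_list, num_processes=1, num_threads=None):
    """Hypothetical dispatcher: threads if num_threads is set, else mp.Pool."""
    if num_threads is not None:
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            # Unpack each argument tuple, mirroring Pool.starmap semantics.
            # list() forces completion and re-raises any worker exception.
            return list(pool.map(lambda args: func(*args), args_list))
    # Existing behavior is kept as the default path.
    with mp.Pool(num_processes) as p:
        return p.starmap(func, args_list)


def fake_reconstruct(t, scale):
    """Placeholder for the real per-timepoint work."""
    return t * scale


# Thread path: no child processes, so no per-worker copy of the inputs.
out = run_over_timepoints(
    fake_reconstruct, [(t, 10) for t in range(5)], num_threads=4
)
print(out)
```

Keeping `num_processes` as the default leaves standalone CLI behavior unchanged, while pipeline users could opt in to threads per job.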

This is analogous to what iohub did in czbiohub-sf/iohub#396 (landed in iohub v0.3.1), where process_single_position was switched from multiprocessing.Pool to ThreadPoolExecutor since numpy/scipy/torch all release the GIL.

Thread-based parallelism avoids the memory duplication that makes mp.Pool expensive when an external orchestrator (Nextflow, Snakemake, etc.) is already handling coarse-grained parallelism across positions.

Environment

  • waveorder 3.0.2
  • Python 3.12
  • HPC node: 16 CPUs, 244 GB RAM
  • Lustre filesystem
