Replace mp.Pool with ThreadPoolExecutor in apply_inverse_transfer_function_single_position

## Summary

`apply_inverse_transfer_function_single_position` uses `multiprocessing.Pool` to parallelize reconstruction across timepoints. In pipeline contexts (e.g. Nextflow + Slurm), the orchestrator already handles parallelism at the position level: each Slurm job processes a single position. Within that job, `mp.Pool` forks child processes, and each child appears to end up with its own copy of the input data (fork's copy-on-write sharing only helps while pages stay unmodified). The result is very high memory usage and poor performance on large volumes.

Since the core compute (numpy/scipy 3D FFTs, Tikhonov solves) releases the GIL, thread-based parallelism should avoid the data-copying overhead while achieving comparable CPU utilization.
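To illustrate the point about copying, here is a minimal sketch showing that worker threads operate on the same input object rather than on per-worker copies. `reconstruct_timepoint` is a hypothetical stand-in, not the waveorder function, and its pure-Python body would not itself benefit from threads; it only demonstrates the dispatch pattern.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def reconstruct_timepoint(data, t):
    """Hypothetical stand-in for the per-timepoint reconstruction."""
    # Return the identity of the input so we can check it was not copied.
    return t, id(data), sum(data[t])


# Toy input: 5 "timepoints" of 1000 values each.
data = [[float(t)] * 1000 for t in range(5)]

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map preserves the order of the input timepoints.
    results = list(executor.map(partial(reconstruct_timepoint, data), range(len(data))))

# Every worker saw the very same object as the caller.
shared = all(obj_id == id(data) for _, obj_id, _ in results)
print("input shared across threads:", shared)
```

With real GIL-releasing array operations in place of the stand-in, the same pattern would let the threads run concurrently on shared data.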

## Observed behavior

- **Input**: 5 timepoints × 1 channel × 86 Z × 1600 Y × 1370 X, float32 (~750 MB per timepoint)
- **Settings**: 3D Tikhonov with `TV_iterations: 1`, `z_padding: 5`, `num_processes=4`
- **Context**: Nextflow dispatches one Slurm job per position (4 positions running concurrently), each calling `apply_inverse_transfer_function_single_position` with `num_processes=4`
- **Result**: 0 out of 5 timepoints completed after 1h16m. Each of the 4 worker processes consumed 35–45 GB RSS (vs ~750 MB per volume), indicating massive memory duplication from forking.
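As a rough check on the figures above, the per-timepoint size follows directly from the array shape. The sketch below assumes 4 bytes per float32 value and decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes). The RSS-to-volume ratio is illustrative only, since the unit behind the reported RSS is an assumption.

```python
import math

# Shape and dtype taken from the input described above.
shape_zyx = (86, 1600, 1370)
num_timepoints = 5
bytes_per_value = 4  # float32

bytes_per_timepoint = math.prod(shape_zyx) * bytes_per_value
bytes_per_position = bytes_per_timepoint * num_timepoints

print(f"per timepoint: {bytes_per_timepoint / 1e6:.0f} MB")
print(f"per position:  {bytes_per_position / 1e9:.2f} GB")

# Observed per-worker RSS range, assumed here to be decimal GB.
for rss_gb in (35, 45):
    ratio = rss_gb * 1e9 / bytes_per_timepoint
    print(f"{rss_gb} GB RSS is about {ratio:.0f}x one timepoint")
```

Even the whole 5-timepoint position is under 4 GB by this estimate, so per-worker RSS in the tens of GB is far beyond what the raw input alone would explain.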

## Relevant code

https://github.com/mehta-lab/waveorder/blob/main/waveorder/cli/apply_inverse_transfer_function.py

```python
with mp.Pool(num_processes) as p:
    p.starmap(...)
```

## Suggestion

Rather than replacing the existing `mp.Pool` path outright (which may work well for standalone CLI usage), it would be useful to add an alternative thread-based parallelization option, e.g. a `num_threads` parameter that uses `concurrent.futures.ThreadPoolExecutor` instead.
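One possible shape for such an option is sketched here. `make_executor` and `square` are hypothetical names, and the process path uses `ProcessPoolExecutor` in place of `mp.Pool` purely so that both paths share the `Executor` interface. That swap is an assumption for illustration, not how the current code works.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Optional


def make_executor(num_processes: int = 1, num_threads: Optional[int] = None) -> Executor:
    """Hypothetical helper: threads when num_threads is set, else processes."""
    if num_threads is not None:
        return ThreadPoolExecutor(max_workers=num_threads)
    return ProcessPoolExecutor(max_workers=num_processes)


def square(x: int) -> int:
    # Placeholder for the per-timepoint work.
    return x * x


if __name__ == "__main__":
    # Thread path: no child processes, so no per-worker copy of the input.
    with make_executor(num_threads=4) as executor:
        print(list(executor.map(square, range(5))))
```

Keeping `num_processes` as the default path would leave standalone CLI behavior unchanged, while pipeline callers could opt into threads.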

This is analogous to what iohub did in [czbiohub-sf/iohub#396](https://github.com/czbiohub-sf/iohub/pull/396) (landed in iohub v0.3.1), where `process_single_position` was switched from `multiprocessing.Pool` to `ThreadPoolExecutor` since numpy/scipy/torch all release the GIL.

Thread-based parallelism avoids the memory duplication that makes `mp.Pool` expensive when an external orchestrator (Nextflow, Snakemake, etc.) is already handling coarse-grained parallelism across positions.

## Environment

- waveorder 3.0.2
- Python 3.12
- HPC node: 16 CPUs, 244 GB RAM
- Lustre filesystem
