
DAOS-19058 pydaos: torch surface worker errors in parallel_list (#18414) #18488

Open: enakta wants to merge 1 commit into release/2.8 from 0xe0f/DAOS-19058-release-2.8


@enakta enakta commented Jun 12, 2026

Contributor

Worker processes spawned by _Dfs.parallel_list may raise exceptions that never reach the calling process. The result is an indefinite hang during Dataset and IterableDataset construction, with no error surfaced to the user.

Replace the manual Process + Queue scheme and its queued/processed counter with a multiprocessing.Pool driven by imap_unordered. Pool re-raises worker exceptions in the parent when their results are consumed, so a worker error now propagates as a raised OSError instead of a deadlock, and the Pool context manager reaps all workers on any exit path.
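
For illustration only, a minimal sketch of the pattern described above. It is not the pydaos code: `list_dir`, the fake paths, and the simulated failure are hypothetical, and the `fork` start method is assumed (Linux) to keep the sketch self-contained.

```python
import multiprocessing as mp
import os


def list_dir(path):
    """Hypothetical worker: return (path, entry names) or raise OSError."""
    if path.startswith("/missing"):
        # Simulate a failing backend call inside the worker process.
        raise OSError(2, os.strerror(2), path)
    return path, ["a", "b"]


def parallel_list(paths, workers=2):
    """List all paths in worker processes; a worker error re-raises here."""
    # Assumption: "fork" start method (Linux), so workers inherit list_dir.
    ctx = mp.get_context("fork")
    results = {}
    # Leaving the with-block terminates the workers, on success or on error.
    with ctx.Pool(processes=workers) as pool:
        # Consuming the iterator re-raises any exception raised by a worker.
        for path, names in pool.imap_unordered(list_dir, paths):
            results[path] = names
    return results
```

With this shape, a call such as `parallel_list(["/x", "/missing/z"])` raises OSError in the caller instead of waiting on a queue that will never be filled.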

concurrent.futures.ProcessPoolExecutor would be even better, but its initializer/initargs arguments are unavailable before Python 3.7, and the target runtime includes EL8.8 / Python 3.6.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Worker processes spawned by _Dfs.parallel_list may raise exceptions that
never reach the calling process. The result is an indefinite hang during
Dataset and IterableDataset construction, with no error surfaced to the
user.

Replace the manual Process + Queue scheme and its queued/processed
counter with a multiprocessing.Pool driven by imap_unordered. Pool
re-raises worker exceptions in the parent when their results are
consumed, so a worker error now propagates as a raised OSError instead
of a deadlock, and the Pool context manager reaps all workers on any
exit path.

`concurrent.futures.ProcessPoolExecutor` would be even better,
but its initializer/initargs arguments are unavailable before
Python 3.7, and the target runtime includes EL8.8 / Python 3.6.

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
@enakta enakta added the clean-cherry-pick label (Cherry-pick from another branch that did not require additional edits) Jun 12, 2026
@github-actions


Ticket title is 'pytorch parallel_list does not surface worker process errors, causing silent hangs'
Status is 'In Review'
Labels: '2.8.0rc1,request_for_2.8'
https://daosio.atlassian.net/browse/DAOS-19058

@enakta enakta marked this pull request as ready for review June 12, 2026 03:13
@enakta enakta requested review from a team as code owners June 12, 2026 03:13
@enakta enakta added the waiting-for-merge-approval label (Waiting for merge approval) Jun 12, 2026
@daosbuild3

Collaborator


Labels

  • clean-cherry-pick: Cherry-pick from another branch that did not require additional edits
  • waiting-for-merge-approval: Waiting for merge approval


4 participants