Skip to content

feat: fix the vLLM DP path#2517

Open
guyueh1 wants to merge 21 commits into
NVIDIA-NeMo:mainfrom
guyueh1:support_vllm_dp
Open

feat: fix the vLLM DP path#2517
guyueh1 wants to merge 21 commits into
NVIDIA-NeMo:mainfrom
guyueh1:support_vllm_dp

Conversation

@guyueh1
Copy link
Copy Markdown
Contributor

@guyueh1 guyueh1 commented May 18, 2026

What does this PR do ?

Previously nemo-rl doesn't work for vllm's native DP (EP>TP), this PR wants to support this case.

The following basic tests have passed, now trying the nightly test

# eval test
uv run examples/run_eval.py \
generation.model_name=Qwen/Qwen3-30B-A3B \
cluster.num_nodes=2 \
cluster.gpus_per_node=4 \
generation.vllm_cfg.tensor_parallel_size=4 \
generation.vllm_cfg.expert_parallel_size=8 \
generation.vllm_cfg.async_engine=true \

# grpo test
uv run examples/run_grpo.py \
--config examples/configs/grpo_math_1B_megatron.yaml \
policy.model_name=Qwen/Qwen3-30B-A3B \
cluster.num_nodes=4 \
cluster.gpus_per_node=4 \
policy.generation.colocated.enabled=false \
policy.generation.colocated.resources.num_nodes=2 \
policy.generation.colocated.resources.gpus_per_node=4 \
policy.generation.vllm_cfg.tensor_parallel_size=4 \
policy.generation.vllm_cfg.expert_parallel_size=8 \
policy.generation.vllm_cfg.async_engine=true \
policy.megatron_cfg.expert_model_parallel_size=8 \
policy.sequence_packing.enabled=false \

New nightly test figures:
H100 with EP=8 async engine
image
https://wandb.ai/nvidia/nemo-rl/runs/4mcplb63

H100 with TP=2 EP=16 sync engine
image
https://wandb.ai/nvidia/nemo-rl/runs/b3mon8zg

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners May 18, 2026 04:46
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@guyueh1 guyueh1 requested review from yuki-97 and removed request for a team May 18, 2026 04:46
@guyueh1 guyueh1 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 18, 2026
@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented May 18, 2026

/ok to test efc6fc2

Guyue Huang added 2 commits May 18, 2026 09:36
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented May 18, 2026

/ok to test 9f381f2

@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented May 19, 2026

fast CI is failing for uuidgen: command not found, trying the full set to see if it helps

@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels May 19, 2026
@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented May 22, 2026

the added nightly test exceeds the 1340 hour quota for nightly, should we increase it? (now it's 1345) @terrykong @chtruong814

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented May 22, 2026

/ok to test ad114f0

@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented May 22, 2026

Ran a test on llama3-8B with vLLM TP=1 & PP=1 to study the impact of changing distributed_executor_backend from None to mp , there is no observable performance difference.
W B Chart 5_22_2026, 4_49_37 PM

guyueh1 added 2 commits May 22, 2026 16:57
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 force-pushed the support_vllm_dp branch from d0e091a to d03ceee Compare May 24, 2026 00:53
@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented May 24, 2026

/ok to test d03ceee

Comment thread nemo_rl/models/generation/vllm/vllm_generation.py Outdated
Comment thread nemo_rl/models/generation/vllm/vllm_worker.py Outdated
os.environ["VLLM_DP_SIZE"] = str(vllm_dp_size)
os.environ["VLLM_DP_RANK"] = str(vllm_dp_rank)
# Always set local rank to 0 because we only expose GPUs belong to this DP rank to the worker; if we set it to the actual local rank, it will cause the worker to hang.
os.environ["VLLM_DP_RANK_LOCAL"] = str(0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it have some issue when we have local dp > 1?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it works when we have only dp (which means on one node the local dp = gpus_per_node) and I confirmed the workers are placed on all GPUs; on the contrary if we set them to rank % 8, then it will cause a failure

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming we have 2nodes * 8GPUs, is that mean os.environ["VLLM_DP_RANK_LOCAL"] = str(0) works for both DP8 and DP4? (I think for DP4 there should be 2 DP groups locally?)

@@ -462,19 +462,41 @@ def _patch_vllm_hermes_tool_parser_thread_safety():
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the file in vLLM is removed, can we link to https://github.com/vllm-project/vllm/tree/v0.20.0/examples/rl instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which file?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry comment on the wrong line. I mean line 463 # See details in https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/data_parallel.py

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/data_parallel.py is removed in vLLM now, and I think we can link to another instead.

Comment thread tests/test_suites/nightly.txt Outdated
@yuki-97
Copy link
Copy Markdown
Contributor

yuki-97 commented May 25, 2026

@terrykong could you help to take a review as well?

@yuki-97 yuki-97 requested a review from terrykong May 25, 2026 07:40
guyueh1 added 2 commits May 25, 2026 11:06
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented May 25, 2026

/ok to test 2272d47

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented May 26, 2026

/ok to test 0ea633c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support use vLLM DP + EP in async engine

2 participants