feat: fix the vLLM DP path#2517
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
/ok to test efc6fc2
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
/ok to test 9f381f2
fast CI is failing for
The added nightly test exceeds the 1340-hour quota for nightly; should we increase it? (It is now 1345.) @terrykong @chtruong814
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
/ok to test ad114f0
/ok to test d03ceee
os.environ["VLLM_DP_SIZE"] = str(vllm_dp_size)
os.environ["VLLM_DP_RANK"] = str(vllm_dp_rank)
# Always set the local rank to 0 because we only expose the GPUs belonging to this DP rank to the worker; setting it to the actual local rank causes the worker to hang.
os.environ["VLLM_DP_RANK_LOCAL"] = str(0)
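A minimal sketch of the same assignment, wrapped in a helper so the values can be inspected before they are exported. The helper name `build_dp_env` is hypothetical and not part of this PR; only the three environment variable names and the fixed local rank of 0 come from the diff above.

```python
import os


def build_dp_env(vllm_dp_size: int, vllm_dp_rank: int) -> dict[str, str]:
    """Return the DP-related environment variables for one worker.

    Hypothetical helper for illustration; the PR sets os.environ directly.
    """
    if not 0 <= vllm_dp_rank < vllm_dp_size:
        raise ValueError(
            f"vllm_dp_rank={vllm_dp_rank} is out of range for vllm_dp_size={vllm_dp_size}"
        )
    return {
        "VLLM_DP_SIZE": str(vllm_dp_size),
        "VLLM_DP_RANK": str(vllm_dp_rank),
        # Fixed at 0, mirroring the diff: the worker is assumed to see only
        # the GPUs that belong to its own DP rank.
        "VLLM_DP_RANK_LOCAL": "0",
    }


# Example: the fourth rank (index 3) of an 8-way DP group.
env = build_dp_env(vllm_dp_size=8, vllm_dp_rank=3)
os.environ.update(env)
```

Returning a dict instead of mutating `os.environ` in place keeps the logic testable without touching process state.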
Will this cause issues when local DP > 1?
No. It works when we use only DP (so on one node the local DP size equals gpus_per_node), and I confirmed that the workers are placed on all GPUs. Conversely, setting it to rank % 8 causes a failure.
Assuming we have 2 nodes × 8 GPUs, does that mean os.environ["VLLM_DP_RANK_LOCAL"] = str(0) works for both DP8 and DP4? (I think for DP4 there should be 2 DP groups locally?)
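To make the layouts in this question concrete, here is a small sketch that enumerates which GPUs each DP rank would see on a 2 nodes × 8 GPUs cluster. It assumes a DP rank never spans nodes and that each worker is restricted to its own GPUs; the function `dp_layout` and its field names are hypothetical. It only illustrates the placement and does not by itself confirm how vLLM behaves in the DP4 case.

```python
def dp_layout(num_nodes: int, gpus_per_node: int, gpus_per_dp_rank: int) -> list[dict]:
    """Enumerate DP ranks and the GPUs each one would be given (illustrative only)."""
    if gpus_per_node % gpus_per_dp_rank != 0:
        raise ValueError("gpus_per_dp_rank must divide gpus_per_node")
    ranks_per_node = gpus_per_node // gpus_per_dp_rank
    layout = []
    for node in range(num_nodes):
        for slot in range(ranks_per_node):
            first_gpu = slot * gpus_per_dp_rank
            layout.append(
                {
                    "dp_rank": node * ranks_per_node + slot,
                    "node": node,
                    # GPUs exposed to this worker (node-local indices).
                    "visible_gpus": list(range(first_gpu, first_gpu + gpus_per_dp_rank)),
                    # Mirrors the diff: the local rank is always 0 because the
                    # worker is assumed to see only its own GPUs.
                    "dp_rank_local": 0,
                }
            )
    return layout


# DP8 on 16 GPUs: 2 GPUs per rank, 4 DP ranks per node.
dp8 = dp_layout(num_nodes=2, gpus_per_node=8, gpus_per_dp_rank=2)
# DP4 on 16 GPUs: 4 GPUs per rank, 2 DP ranks per node.
dp4 = dp_layout(num_nodes=2, gpus_per_node=8, gpus_per_dp_rank=4)
```

In the DP4 layout, the second rank on each node would be handed GPUs 4 to 7 while still reporting a local rank of 0, which is the case the question is asking about.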
@@ -462,19 +462,41 @@ def _patch_vllm_hermes_tool_parser_thread_safety():
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
I saw that this file has been removed from vLLM; can we link to https://github.com/vllm-project/vllm/tree/v0.20.0/examples/rl instead?
Ah, sorry, I commented on the wrong line. I meant line 463: # See details in https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/data_parallel.py
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/data_parallel.py has since been removed from vLLM, so I think we should link to another file instead.
@terrykong could you also take a look and review?
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
/ok to test 2272d47
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
/ok to test 0ea633c

What does this PR do ?
Previously, nemo-rl did not work with vLLM's native DP (EP > TP); this PR adds support for that case.
The following basic tests have passed; now trying the nightly test.
New nightly test figures:

H100 with EP=8 async engine
https://wandb.ai/nvidia/nemo-rl/runs/4mcplb63
H100 with TP=2 EP=16 sync engine

https://wandb.ai/nvidia/nemo-rl/runs/b3mon8zg
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information