Conversation

@jianjunzhong (Contributor) commented Jan 5, 2026

What does this PR do?

As described in #4903 (comment), this PR addresses the issue where consecutive bundles in a single cross-node placement group may be scheduled to different nodes (Ray issue #59547). Building on @wuxibin89's work in #4583, we implement a solution that follows the approach used in vLLM: after instantiating the workers, we sort them by IP address, adjust their ranks accordingly, and then proceed with initialization.
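For illustration, here is a minimal sketch of the sort-and-reassign idea, assuming a two-phase worker (a cheap `__init__` followed by an explicit `init_worker` call). The class, its methods, and the `sort_workers_by_ip` helper are hypothetical names, not the actual verl implementation:

```python
import ray


@ray.remote
class Worker:
    def get_node_ip(self) -> str:
        # Report which node this worker actually landed on.
        return ray.util.get_node_ip_address()

    def init_worker(self, rank: int) -> int:
        # Phase 2: real initialization runs only after the final rank is known.
        self.rank = rank
        return rank


def sort_workers_by_ip(workers):
    """Reorder workers so that workers on the same node get consecutive ranks."""
    ips = ray.get([w.get_node_ip.remote() for w in workers])
    # Sort per octet (assumes IPv4) so "10.0.0.9" sorts before "10.0.0.10";
    # sorted() is stable, so same-node workers keep their relative order.
    order = sorted(range(len(workers)), key=lambda i: tuple(int(p) for p in ips[i].split(".")))
    return [workers[i] for i in order]


if __name__ == "__main__":
    ray.init()
    workers = [Worker.remote() for _ in range(4)]  # phase 1: cheap __init__
    workers = sort_workers_by_ip(workers)          # reorder by node IP
    # Assign final ranks in the new order, then run the real initialization.
    ray.get([w.init_worker.remote(rank) for rank, w in enumerate(workers)])
```

Deferring the heavy setup to `init_worker` is what makes the post-hoc rank adjustment possible: ranks are only assigned once all workers have been placed and sorted.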

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here:
    [ray] feat: resource pool create one placement group across all nodes #4583
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is one of feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all of the following items before requesting a review; otherwise, the reviewer may deprioritize this PR.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a significant and valuable change to worker initialization by creating a single placement group and sorting workers by IP address. This should resolve the scheduling inconsistencies described above. The two-phase initialization (__init__ followed by init_worker) is a good architectural pattern for this purpose and has been correctly propagated to most worker types. My review focuses on two critical/high-severity issues related to the new architecture: one concerning the debuggability of workers due to naming, and the other a potential critical bug in the initialization of fused workers.
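For context, here is a minimal sketch of the single cross-node placement group this review refers to (the cluster shape is illustrative; this is not verl code). It shows why consecutive bundle indices alone do not pin consecutive ranks to the same node, which is what the IP-based sorting compensates for:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

num_nodes, gpus_per_node = 2, 4  # illustrative cluster shape
# One bundle per GPU across the whole cluster, in a single placement group.
bundles = [{"CPU": 1, "GPU": 1}] * (num_nodes * gpus_per_node)
pg = placement_group(bundles, strategy="PACK")
ray.get(pg.ready())


@ray.remote(num_gpus=1)
class Worker:
    pass


# Pin each worker to one bundle. Consecutive bundle indices are not guaranteed
# to sit on the same node, which is why ranks are adjusted after placement.
workers = [
    Worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(len(bundles))
]
```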

Comment on lines 599 to 600
# TODO: worker_name may be inaccurate after adjust_rank, need to fix
name = f"{self.name_prefix}{cia_name}_{pg_idx}:{local_rank}" # e.g. Worker_2:5
Severity: high

The worker name is generated using the initial rank and local_rank before IP-based sorting. After sorting, the worker's actual rank will change, but its name in Ray will remain the same. This discrepancy can make debugging very difficult, as the actor's name will not correspond to its logical rank within the worker group. This is especially problematic when trying to identify or re-attach to detached workers by name.

The TODO comment acknowledges this. This should be addressed to improve maintainability and reduce confusion during debugging.
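A concrete illustration of the mismatch (hypothetical values, not verl code): the Ray actor name is fixed at creation time from the pre-sort local rank, but the logical rank can change after the IP-based sort.

```python
# Name is built from the pre-sort local_rank, mirroring the f-string above.
name_prefix, cia_name, pg_idx, local_rank = "", "Worker", 2, 5
name = f"{name_prefix}{cia_name}_{pg_idx}:{local_rank}"  # "Worker_2:5"

# After the IP-based sort, the same actor may hold a different logical rank,
# so the ":5" suffix in the name no longer matches the rank the worker group uses.
rank_after_sort = 3
print(f"actor name {name!r}, logical rank after sort {rank_after_sort}")
```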

wuxibin89 and others added 2 commits January 10, 2026 17:41
Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
