
[WIP] RL multinode #152

Open · rafapi wants to merge 15 commits into main

Conversation

@rafapi (Collaborator) commented Dec 18, 2024

No description provided.

@rafapi rafapi changed the base branch from main to multinode_deepspeed December 18, 2024 19:42
@rafapi rafapi changed the base branch from multinode_deepspeed to main December 19, 2024 18:34
@rizar (Collaborator) left a comment

Good stuff! Please see my comments below.

examples/rl_gsm8k/orchestrate_rl.py
@@ -69,6 +69,11 @@ def __init__(
        self.stderr_file: Optional[TextIO] = None
        self.stats = {}

        # Add node rank awareness
        self.node_rank = int(os.environ.get("RANK", 0))
        self.port_offset = self.node_rank * 1000  # Ensure different port ranges for each node
Collaborator:

why different port ranges for each node?

Collaborator (Author):

Theoretically we should be able to update the sync port if the one selected by the toolkit environment is already in use. In practice that didn't work, and I found out that those clashes were actually caused by the subprocess subshell.

@rizar (Collaborator) commented Dec 21, 2024:

I'm not sure I understand. These vLLMs are running on different nodes, aren't they?
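
If the clashes really come from the subprocess subshell rather than from cross-node collisions, an alternative to reserving a fixed 1000-port range per node is to probe for a free port at startup. A minimal sketch, assuming a hypothetical find_free_port helper that is not part of this PR:

import socket

def find_free_port(base_port: int = 8000, max_tries: int = 100) -> int:
    """Return the first port at or above base_port that accepts a bind."""
    for port in range(base_port, base_port + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))
                return port
            except OSError:
                continue  # port already in use, try the next one
    raise RuntimeError(f"No free port found in [{base_port}, {base_port + max_tries})")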

# Check GPU availability
num_gpus = torch.cuda.device_count()
print('###############################')
Collaborator:

Please use logger.info, and maybe a message like "I'm rank X, training on Y GPUs".

Collaborator (Author):

That's just a sanity check and will be deleted soon. It was there to verify that this number matches the number of processes, but things are clear now.
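
For reference, a minimal sketch of the suggested logging; the logger setup and reading the rank from the RANK environment variable are assumptions about the surrounding code, not part of this PR:

import logging
import os

import torch

logger = logging.getLogger(__name__)

node_rank = int(os.environ.get("RANK", 0))
num_gpus = torch.cuda.device_count()
logger.info("I'm rank %d, training on %d GPU(s)", node_rank, num_gpus)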

"--deepspeed_multinode_launcher",
"standard",
"--same_network",
])
Collaborator:
Would multi-node training with accelerate work without DeepSpeed? If not, we should raise an exception.
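
A minimal sketch of the suggested guard; the function and parameter names (check_launcher_config, num_nodes, use_deepspeed) are hypothetical, not fields from this codebase:

def check_launcher_config(num_nodes: int, use_deepspeed: bool) -> None:
    """Fail fast if multi-node training is requested without DeepSpeed."""
    if num_nodes > 1 and not use_deepspeed:
        raise ValueError(
            "Multi-node training currently requires the DeepSpeed launcher; "
            "enable DeepSpeed or run on a single node."
        )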

run = None

def safe_wandb_log(metrics, step):
    if run is not None:
Collaborator:
Can we just check the rank there and do nothing unless it is 0?
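
A minimal sketch of that rank-gated version; reading RANK from the environment mirrors common launcher setups, and the exact integration with the module-level run object is an assumption:

import os

run = None  # set to a wandb run on rank 0 only, e.g. run = wandb.init(...)

def safe_wandb_log(metrics, step):
    # Non-zero ranks do nothing, so only rank 0 reports to W&B.
    if int(os.environ.get("RANK", "0")) != 0:
        return
    if run is not None:
        run.log(metrics, step=step)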
