Extend HuggingFace Trainer test to multinode multiGPU #303
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Branch force-pushed from f4914fb to f2faef8
I've left a couple of comments / questions, otherwise looks already great!
trainer.train()
trainer.save_model()
What exactly is the model being saved here? Are we sure the local weights trained on each worker shard are shared across workers, so that the saved model is trained on the global dataset?
Thinking about how to check that...
Any ideas?
Quickly looking at the code, it doesn't look like the model is wrapped to sync the weights globally. I'd check the logs to get more evidence. We may need to adapt the example e.g. with https://huggingface.co/docs/transformers/accelerate or with something equivalent to DDP. Otherwise that dilutes the value of this example in distributed mode.
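For instance, a quick way to gather that evidence could be a helper along these lines, called right after trainer.train(). This is only a sketch, assuming the launcher has initialized torch.distributed; the helper name assert_weights_in_sync is hypothetical, not something in the existing test:

```python
import torch
import torch.distributed as dist

def assert_weights_in_sync(model):
    """Hypothetical check: verify all ranks ended up with identical weights.

    Computes a simple checksum of the model parameters on each rank and
    gathers it from every rank; if the weights are kept in sync (as DDP
    should guarantee), all checksums match.
    """
    if not (dist.is_available() and dist.is_initialized()):
        print("torch.distributed is not initialized - single-process training")
        return

    device = next(model.parameters()).device
    checksum = torch.zeros(1, device=device)
    for p in model.parameters():
        checksum += p.detach().float().sum()

    checksums = [torch.zeros_like(checksum) for _ in range(dist.get_world_size())]
    dist.all_gather(checksums, checksum)

    if dist.get_rank() == 0:
        values = [c.item() for c in checksums]
        print(f"per-rank parameter checksums: {values}")
        assert max(values) - min(values) < 1e-3, "weights diverged across ranks"
```

Even without that, the training logs should show whether torch.distributed / DDP is active.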
If I get it right, the Trainer should already use distributed training under the hood: https://huggingface.co/docs/transformers/main/main_classes/trainer#id1
The Trainer class provides an API for feature-complete training in PyTorch, and it supports distributed training on multiple GPUs/TPUs, mixed precision for NVIDIA GPUs, AMD GPUs, and torch.amp for PyTorch.
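As a rough illustration of that (not code from this PR): when the script is launched by torchrun, or with the RANK / WORLD_SIZE / MASTER_ADDR environment variables the Training operator injects, TrainingArguments picks up the distributed environment automatically and the Trainer handles the DDP wrapping itself, with no explicit wrapping code needed:

```python
from transformers import TrainingArguments

# Hypothetical sanity-check snippet for debugging only: construct the
# arguments and inspect the distributed setup the Trainer would use.
args = TrainingArguments(output_dir="/tmp/test_trainer")

# Reports ParallelMode.DISTRIBUTED under torchrun / the Training operator,
# and ParallelMode.NOT_DISTRIBUTED for a plain single-process run.
print(f"parallel_mode={args.parallel_mode}, "
      f"world_size={args.world_size}, local_rank={args.local_rank}")
```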
Awesome, exactly what we wanted to see!
That's OK if that's DDP. The model is a bit small, and it's good to have an example for DDP.
We can plan to work on another example for FSDP with a larger model, WDYT?
We can plan to work on another example for FSDP with a larger model, WDYT?
Sounds good.
Though I'm wondering whether we should rather put FSDP (maybe even DDP) as an example into https://github.com/opendatahub-io/distributed-workloads/tree/main/examples, as it doesn't cover the Training operator capabilities much and rather covers Trainer usage itself.
I agree, let's cover FSDP with an example.
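When we pick that up, the example could stay on the Trainer API and only switch the sharding strategy. A minimal sketch, assuming we use the Trainer's built-in FSDP integration (the wrap class and values below are placeholders for whatever larger model we choose):

```python
from transformers import TrainingArguments

# Placeholder values: an FSDP example only pays off with a model considerably
# larger than the one used in the current test.
training_args = TrainingArguments(
    output_dir="/tmp/fsdp_example",
    per_device_train_batch_size=1,
    # Shard parameters, gradients and optimizer state across ranks and
    # auto-wrap the transformer blocks of the chosen model.
    fsdp="full_shard auto_wrap",
    fsdp_config={"transformer_layer_cls_to_wrap": ["GPT2Block"]},
)
```

The rest of the script (tokenizer, dataset, Trainer) could stay the same as in the DDP test.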
Maybe such an FSDP example could be placed in the Training operator upstream?
It could, yes, though I'm thinking of an example equivalent to the LLM fine-tuning example we have for Ray, that is, using the KFTO SDK from a workbench, and something more state-of-the-art with LoRA / QLoRA, etc.
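For that kind of example, the model preparation could look roughly like this, using peft for the LoRA adapter (base model and hyperparameters below are purely illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; a real example would likely use a larger LLM.
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the adapter weights
    target_modules=["c_attn"],  # attention projection in GPT-2; model-specific
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the base model so only the small LoRA adapters are trained.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the tiny fraction of trainable weights
```

The wrapped model can then go through the same Trainer / KFTO SDK flow; QLoRA would additionally load the base model quantized to 4-bit.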
Branch force-pushed from f2faef8 to f4708e1
Awesome work! /lgtm
Description
Extend HuggingFace Trainer tests to cover multinode and multiGPU scenarios.
Note: The Prometheus GPU utilization check is currently limited to NVIDIA only; it will be extended to AMD once the AMD Prometheus integration is fully ready in the test environment.
Note: MultiGPU AMD tests are failing with a generic NCCL error; we will re-run them once we upgrade to ROCm 6.3.
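A possible debugging aid for that re-run (just a suggestion, not something the tests set today) is enabling NCCL debug logging in the training script before the process group is created, for example:

```python
import os

# Hypothetical debugging aid: must be set before torch.distributed initializes
# the process group, e.g. at the very top of the training script.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
```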
How Has This Been Tested?
Manually, by running the tests against an OCP cluster with NVIDIA and AMD GPUs.
Merge criteria: