Extend HuggingFace Trainer test to multinode multiGPU #303
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Branch force-pushed from f4914fb to f2faef8
I've left a couple of comments / questions, otherwise looks already great!
trainer.train()
trainer.save_model()
What exactly is the model being saved here? Are we sure the local weights trained on each worker shard are shared across workers, so that the saved model is trained on the global dataset?
Thinking about how to check that...
Any ideas?
Quickly looking at the code, it doesn't look like the model is wrapped to sync the weights globally. I'd check the logs to get more evidence. We may need to adapt the example e.g. with https://huggingface.co/docs/transformers/accelerate or with something equivalent to DDP. Otherwise that dilutes the value of this example in distributed mode.
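For instance, a quick way to gather that evidence could be a helper along these lines, called right after trainer.train(). This is only a sketch, assuming the launcher has initialized torch.distributed; the helper name assert_weights_in_sync is hypothetical, not something in the existing test:

```python
import torch
import torch.distributed as dist

def assert_weights_in_sync(model):
    """Hypothetical check: verify all ranks ended up with identical weights.

    Computes a simple checksum of the model parameters on each rank and
    gathers it from every rank; if the weights are kept in sync (as DDP
    should guarantee), all checksums match.
    """
    if not (dist.is_available() and dist.is_initialized()):
        print("torch.distributed is not initialized - single-process training")
        return

    device = next(model.parameters()).device
    checksum = torch.zeros(1, device=device)
    for p in model.parameters():
        checksum += p.detach().float().sum()

    checksums = [torch.zeros_like(checksum) for _ in range(dist.get_world_size())]
    dist.all_gather(checksums, checksum)

    if dist.get_rank() == 0:
        values = [c.item() for c in checksums]
        print(f"per-rank parameter checksums: {values}")
        assert max(values) - min(values) < 1e-3, "weights diverged across ranks"
```

Even without that, the training logs should show whether torch.distributed / DDP is active.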
If I get it right, the Trainer should already use distributed training under the hood: https://huggingface.co/docs/transformers/main/main_classes/trainer#id1
The Trainer class provides an API for feature-complete training in PyTorch, and it supports distributed training on multiple GPUs/TPUs, mixed precision for NVIDIA GPUs, AMD GPUs, and torch.amp for PyTorch.
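As a rough illustration of that (not code from this PR): when the script is launched by torchrun, or with the RANK / WORLD_SIZE / MASTER_ADDR environment variables the Training operator injects, TrainingArguments picks up the distributed environment automatically and the Trainer handles the DDP wrapping itself, with no explicit wrapping code needed:

```python
from transformers import TrainingArguments

# Hypothetical sanity-check snippet for debugging only: construct the
# arguments and inspect the distributed setup the Trainer would use.
args = TrainingArguments(output_dir="/tmp/test_trainer")

# Reports ParallelMode.DISTRIBUTED under torchrun / the Training operator,
# and ParallelMode.NOT_DISTRIBUTED for a plain single-process run.
print(f"parallel_mode={args.parallel_mode}, "
      f"world_size={args.world_size}, local_rank={args.local_rank}")
```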
Awesome, exactly what we wanted to see!
That's OK if that's DDP. The model is a bit small, and it's good to have an example for DDP.
We can plan to work on another example for FSDP with a larger model, WDYT?
We can plan to work on another example for FSDP with a larger model, WDYT?
Sounds good.
Though I'm wondering whether we should rather put FSDP (maybe even DDP) as an example into https://github.com/opendatahub-io/distributed-workloads/tree/main/examples, as it doesn't cover the Training operator capabilities much and rather covers Trainer usage itself.
I agree, let's cover FSDP with an example.
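When we pick that up, the example could stay on the Trainer API and only switch the sharding strategy. A minimal sketch, assuming we use the Trainer's built-in FSDP integration (the wrap class and values below are placeholders for whatever larger model we choose):

```python
from transformers import TrainingArguments

# Placeholder values: an FSDP example only pays off with a model considerably
# larger than the one used in the current test.
training_args = TrainingArguments(
    output_dir="/tmp/fsdp_example",
    per_device_train_batch_size=1,
    # Shard parameters, gradients and optimizer state across ranks and
    # auto-wrap the transformer blocks of the chosen model.
    fsdp="full_shard auto_wrap",
    fsdp_config={"transformer_layer_cls_to_wrap": ["GPT2Block"]},
)
```

The rest of the script (tokenizer, dataset, Trainer) could stay the same as in the DDP test.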
Maybe such an FSDP example could be placed in the Training operator upstream?
It could, yes, though I'm thinking of an example equivalent to the LLM fine-tuning example we have for Ray, that is, using the KFTO SDK from a workbench, and something more state-of-the-art with LoRA / QLoRA, etc.
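For that kind of example, the model preparation could look roughly like this, using peft for the LoRA adapter (base model and hyperparameters below are purely illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; a real example would likely use a larger LLM.
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the adapter weights
    target_modules=["c_attn"],  # attention projection in GPT-2; model-specific
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the base model so only the small LoRA adapters are trained.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the tiny fraction of trainable weights
```

The wrapped model can then go through the same Trainer / KFTO SDK flow; QLoRA would additionally load the base model quantized to 4-bit.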
Branch force-pushed from f2faef8 to f4708e1
Awesome work! /lgtm
Description
Extend HuggingFace Trainer tests to cover multinode and multiGPU scenarios.
Note: The Prometheus GPU utilization check is currently limited to NVIDIA only; it will be extended to AMD once the AMD Prometheus integration is fully ready in the test environment.
Note: MultiGPU AMD tests are failing with a generic NCCL error; we will re-run them once we upgrade to ROCm 6.3.
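A possible debugging aid for that re-run (just a suggestion, not something the tests set today) is enabling NCCL debug logging in the training script before the process group is created, for example:

```python
import os

# Hypothetical debugging aid: must be set before torch.distributed initializes
# the process group, e.g. at the very top of the training script.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
```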
How Has This Been Tested?
Manually, by running the tests against an OCP cluster with NVIDIA and AMD GPUs.
Merge criteria: