Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix allreduce bug #197

Merged
merged 1 commit into from
Oct 18, 2023
Merged

Fix allreduce bug #197

merged 1 commit into from
Oct 18, 2023

Conversation

Binyang2014
Copy link
Contributor

@Binyang2014 Binyang2014 commented Oct 16, 2023

Fix allreduce correctness issue

@Binyang2014 Binyang2014 changed the title fix allreduce bug Fix allreduce bug Oct 18, 2023
@Binyang2014 Binyang2014 marked this pull request as ready for review October 18, 2023 02:34
Copy link
Contributor

@chhwang chhwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why couldn't we see correctness failure earlier from mscclpp-test?

Copy link
Contributor

@saeedmaleki saeedmaleki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't understand why it wasn't an error before?

@Binyang2014
Copy link
Contributor Author

Previously, we did not encounter this error because each rank initialized its elements with the rank number. For instance, all elements for rank 3 were initialized to 3. In the all-reduce algorithm, we divide the tensor into 8 shards. The expected behavior is to place shard 1 on GPU 1, shard 2 on GPU 2, and so on. However, the current implementation places shard 0 on both GPU 1 and GPU 2. As all elements within each shard are identical, this results in consistent output, masking the error. To detect such issues in the future, we need a more robust data initialization approach, such as using the formula: (rank_number + 1) * shard_number.

@chhwang
Copy link
Contributor

chhwang commented Oct 18, 2023

Passed integration tests: https://github.com/microsoft/mscclpp/actions/runs/6562648457

@chhwang chhwang merged commit 6f43282 into main Oct 18, 2023
11 of 13 checks passed
@chhwang chhwang deleted the binyli/bug-fix branch October 18, 2023 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants