Merge OpenAI Triton commit 0702320 #3149

Merged · whitneywhtsang merged 17 commits into main from whitneywhtsang/merge on Jan 13, 2025

Conversation

@whitneywhtsang (Contributor) commented on Jan 13, 2025

This PR changes the Triton base from 3bac3be to 0702320 (Jan 13).
Pass rate: 97.63%

Please do not squash and merge this PR.

Jokeren and others added 13 commits January 11, 2025 16:28
…end to LLVM codegen. Ignore NaN when set. (#5582)
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
# Overview

Atomics in triton have two optional attributes (see the sketch after this list):
1) `sem` -- describing the memory semantics of the operation
2) `scope` -- describing which threads will see the effect of a memory
operation (e.g., GPU, CTA)
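As a point of reference, here is a minimal sketch of how these attributes surface in the Python API. The kernel and argument names are illustrative, and the accepted strings (`"acq_rel"`, `"gpu"`, `"cta"`, `"sys"`) are assumptions based on recent Triton releases:

```python
import triton
import triton.language as tl

@triton.jit
def bump_counter(counter_ptr):
    # Illustrative only: one global counter, incremented by every program.
    # `sem` picks the memory ordering of the operation, `scope` picks which
    # threads must observe the update (device-wide here).
    tl.atomic_add(counter_ptr, 1, sem="acq_rel", scope="gpu")
```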

Presently, the `scope` is ignored by the AMD backend and defaults to
`agent`-scope in the emitted LLVM (which roughly corresponds to `gpu`
memscope in triton). This is correct (in most cases? maybe not all?), as
this is a "stricter" scope than CTA (and I'm guessing it is rare that
system scope is needed for AMD kernels, so no bugs have shown up). That
being said, emitting atomics at CTA scope can be more efficient since
there can be fewer cache invalidations/barriers.

I think that this is fixable by just passing through the attribute to
the generated `llvm.atomicrmw` op. There are some additional
optimizations potentially possible (e.g., !amdgpu.no.remote.memory,
since Triton doesn't support this today), but it isn't clear to me if
those would have any real impact on end-to-end performance and those
optimizations would be specific to the `sys`-scope that doesn't appear
to be frequently used.

# Testing

I added a lit test to ensure that the generated LLVM instructions have
the correct sem/scope attributes for atomicrmw, but I also ran the
following 386 unit tests locally on an MI300x:

```bash
pytest test/unit/language/test_core.py -k test_atomic_
```

I then locally ran some kernels with the scope set to CTA/SYSTEM to make
sure that they worked.
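A smoke test of that kind might look like the following sketch (names, sizes, and the histogram workload are illustrative, not the actual kernels that were run). It keeps all updates inside a single program so that the narrower `"cta"` scope is valid; swapping in `"sys"` only widens visibility:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def histogram_kernel(x_ptr, hist_ptr, N, BLOCK: tl.constexpr):
    # Single-program launch: every update stays inside one CTA, so the
    # narrower scope="cta" is sufficient for correctness here.
    offs = tl.arange(0, BLOCK)
    for start in range(0, N, BLOCK):
        mask = start + offs < N
        x = tl.load(x_ptr + start + offs, mask=mask, other=0)
        tl.atomic_add(hist_ptr + x, 1, mask=mask, scope="cta")

x = torch.randint(0, 16, (4096,), device="cuda", dtype=torch.int32)
hist = torch.zeros(16, device="cuda", dtype=torch.int32)
histogram_kernel[(1,)](x, hist, x.numel(), BLOCK=1024)
assert torch.equal(hist, torch.bincount(x, minlength=16).to(torch.int32))
```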
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
There is currently a weird bug causing capability overrides to persist
when users pass `arch=None`. Rather than making
`CUDABackend.sw_capability` stateful, we now retrieve the capability lazily
from the compilation options.

Also fixes an AMD bug encountered in the wild.
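The lazy pattern described here can be illustrated with a small, purely hypothetical sketch (none of these names are the real Triton classes): the capability is derived from each compilation's options at the point of use instead of being cached on the backend object, so an override cannot outlive the call that supplied it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Options:
    arch: Optional[str] = None        # e.g. "sm_90", or None for "use the device"

class Backend:
    def __init__(self, device_capability: int):
        self.device_capability = device_capability   # queried once from the driver

    def capability(self, options: Options) -> int:
        # Lazy: derived from this compile's options every time, never stored,
        # so an override from one call cannot leak into the next.
        if options.arch is not None:
            return int(options.arch.removeprefix("sm_"))
        return self.device_capability

backend = Backend(device_capability=80)
assert backend.capability(Options(arch="sm_90")) == 90   # explicit override
assert backend.capability(Options(arch=None)) == 80      # no stale override
```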
…t_cmd.py` (#5588)

Relates to triton-lang/triton#5537

---------

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This reverts commit 70359fa which was
causing some of our internal tests to fail.

Co-authored-by: Adam P. Goucher <goucher@statslab.cam.ac.uk>
…elates to c++20 (#5585)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
We found regressions for the MoE kernel with fp8 inputs. This PR effectively
reverts part of #4767 and disables the swap-operand feature for fp8-input
matmul kernels for now while we investigate the regression.
@whitneywhtsang self-assigned this on Jan 13, 2025
@whitneywhtsang changed the title from "Merge OpenAI Triton commit 3bac3be" to "Merge OpenAI Triton commit 7cc6799" on Jan 13, 2025
@whitneywhtsang marked this pull request as ready for review on January 13, 2025 19:30
@whitneywhtsang merged commit 865cfae into main on Jan 13, 2025
5 checks passed
@whitneywhtsang deleted the whitneywhtsang/merge branch on January 13, 2025 21:59
@whitneywhtsang changed the title from "Merge OpenAI Triton commit 7cc6799" to "Merge OpenAI Triton commit 0702320" on Jan 13, 2025