Workaround for matmul kernel crash with i8xf32 operands. #12

Open

wants to merge 1 commit into base: main

Conversation

@3gx commented Aug 29, 2024

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

     The BlockedToMMA pass creates a layout with kWidth=4 when one operand is
     i8. However, the TritonGPU-to-LLVM lowering pass does not support
     lowering the other, f32 operand with kWidth=4, causing a segmentation
     fault.

     To work around this, if the operands' minBitWidth is 8 and their
     maxBitWidth is 32, we use a minBitWidth of 16 instead of 8, creating a
     layout with kWidth=2 for both the i8 and f32 operands (a minimal sketch
     of this logic follows the checklist below).
    
    
  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /python/test for end-to-end tests

  • Select one of the following.

    • I have not added any lit tests.
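
For illustration, a minimal sketch of the clamping logic described in the PR description, assuming kWidth is derived as 32 / minBitWidth (the function and variable names are illustrative, not the actual Triton source):

#include <algorithm>

// Sketch of the kWidth selection described above.
int chooseKWidth(int lhsBitWidth, int rhsBitWidth) {
  int minBitWidth = std::min(lhsBitWidth, rhsBitWidth);
  int maxBitWidth = std::max(lhsBitWidth, rhsBitWidth);
  // Workaround: the TritonGPU-to-LLVM lowering cannot handle an f32 operand
  // with kWidth=4, so for the i8 x f32 combination treat the narrow operand
  // as if it were 16 bits wide.
  if (minBitWidth == 8 && maxBitWidth == 32)
    minBitWidth = 16;
  return 32 / minBitWidth;  // i8 x f32 now gets kWidth=2 for both operands
}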

google-cla bot commented Aug 29, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

@3gx (Author) commented Aug 30, 2024

I signed CLA.

@Moerafaat (Member)

This change, as mentioned in the title, would only work around the issue, not fix it. Effectively, it removes mixed-precision behavior for any matmul with s8. Also, the current change would regress s8 x . The ideal way we would hope to handle the issue is to fix the limitations of Triton during its lowering to LLVM, while still allowing proper mixed-precision mma to happen.

@3gx (Author) commented Sep 5, 2024

Could you elaborate on what you mean by "removes mixed-precision behavior for any matmuls with s8"? I ask because the lowered code contains a cast from i8 to f32 before feeding data to the tf32 mma op, which is necessary since the other operand is already f32. Could you also clarify what you mean by "the current change would regress s8 x "? Perhaps you could provide an example to illustrate this point? Thank you.

@Moerafaat (Member) commented Sep 9, 2024

My apologies for replying late.

Regarding s8 x : we can consider the example of s8 x f16.
This change will cause the "kwidth" attribute (which can be inspected in the MLIR dump of the AccelerateMatmul pass) to differ before and after the change: it will be 4 before and 2 after, which affects how the data is loaded.
I have attached the LLVM IR before and after the change for you to inspect, given the HLO below.

HloModule m

ENTRY e {
  p0 = s8[16,32] parameter(0)
  p0c = bf16[16,32] convert(p0)
  p1 = bf16[32,8] parameter(1)
  ROOT _ = bf16[16,8] dot(p0c, p1),
    lhs_contracting_dims={1}, rhs_contracting_dims={0}
}

I haven't looked deeply into the performance impact, but it is clear that the change is not local.
llvm-after-change.txt
llvm-before-change.txt

As you can see, the change will impact other use cases. I'm not sure what the performance impact is (it would be nice if you could profile it). The constraints could be tighter to only match on s8 x f32 combinations (see the sketch below), but that would still be working around the issue.
I hope this explains it a bit more.
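
A hedged sketch of the tighter constraint suggested above, matching only the s8 x f32 combination; the helper name and the use of mlir::Type predicates here are assumptions for illustration, not the actual pass code:

#include "mlir/IR/Types.h"

// Clamp only when one operand element type is i8 and the other is f32, so
// that other mixed-precision cases (e.g. s8 x bf16) keep their kWidth.
static bool isI8xF32(mlir::Type lhsElemTy, mlir::Type rhsElemTy) {
  return (lhsElemTy.isInteger(8) && rhsElemTy.isF32()) ||
         (lhsElemTy.isF32() && rhsElemTy.isInteger(8));
}

With such a predicate, the bit-width clamp would apply only to the crashing combination and leave every other mixed-precision matmul untouched.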

@3gx changed the title from "Workaround for matmul kernel crash with i8 operand" to "Workaround for matmul kernel crash with i8xf32 operands." on Sep 10, 2024
@3gx (Author) commented Sep 10, 2024

Thank you for the details. I think I understand the issue with the proposed workaround. I have updated this PR with changes that should not affect other mixed-precision matrix multiplications; I verified that the i8xf16 kWidth remains 4 with this workaround.

The issue stems from the LLVM lowering pass not supporting f32 with kWidth=4 when lowering for Ampere tensor cores. I am not familiar with Ampere tensor cores and cannot estimate the effort required to fix the issue in the lowering pass.
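
For reference, a small self-contained check of the kWidth values discussed in this thread (i8 x f16 stays at 4, i8 x f32 drops to 2), again assuming the 32 / minBitWidth rule; this is an illustrative standalone program, not code from the PR:

#include <algorithm>
#include <cassert>

// Illustrative reimplementation of the kWidth choice after the update.
int kWidthFor(int lhsBits, int rhsBits) {
  int minBits = std::min(lhsBits, rhsBits);
  int maxBits = std::max(lhsBits, rhsBits);
  if (minBits == 8 && maxBits == 32)  // only the i8 x f32 combination
    minBits = 16;
  return 32 / minBits;
}

int main() {
  assert(kWidthFor(8, 16) == 4);  // i8 x f16: kWidth remains 4, as verified above
  assert(kWidthFor(8, 32) == 2);  // i8 x f32: clamped to avoid the lowering crash
  return 0;
}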
