Port "sub-group transpose reduction" to default path #2266

victor-eds · 2024-09-17T12:28:12Z

#2109 explores layout conversion in the advanced path to improve reduction performance (see #1637 for the investigation). Porting this to the default path would involve a transformation similar to the following, applied once heuristics have confirmed it is profitable:

1. Reshape the input tensor so that no data movement is needed and elements can be reduced within the work-item: `tt.reshape`
2. Perform the reduction within the work-item: `tt.reduce`
3. Convert the layout so that a transposition within the sub-group is performed, as explained in the investigation: `triton_gpu.convert_layout`
4. Finalize the reduction (within the work-item and possibly within the work-group): `tt.reduce`
5. Convert back to the initial layout: `triton_gpu.convert_layout`

Note: step 5 can be dropped if the new layout is beneficial for performance.
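
To make the data flow concrete, here is a minimal pure-Python model of steps 1 to 4 for a row-wise sum. It is only a sketch and not the actual lowering: the sub-group size of 16, the assumption that lane `l` owns columns `l, l + 16, l + 32, ...` of every row, and the function name `sub_group_transpose_reduce` are all illustrative choices, not taken from #2109 or #1637.

```python
SUB_GROUP_SIZE = 16  # assumed number of work-items (lanes) per sub-group


def sub_group_transpose_reduce(tile):
    """Row-sum a SUB_GROUP_SIZE x (k * SUB_GROUP_SIZE) tile, lane by lane."""
    s = SUB_GROUP_SIZE
    rows, cols = len(tile), len(tile[0])
    assert rows == s and cols % s == 0, "sketch only handles this shape"

    # Steps 1-2: lane l is assumed to own columns l, l + s, l + 2s, ... of
    # every row. Grouping those columns is only a change of view, and each
    # lane can sum its own elements without talking to other lanes.
    # regs[l][r] is lane l's partial sum for row r.
    regs = [[sum(tile[r][l::s]) for r in range(rows)] for l in range(s)]

    # Step 3: transpose within the sub-group, so that lane r ends up
    # holding every lane's partial sum for row r.
    regs = [list(per_row) for per_row in zip(*regs)]

    # Step 4: finish the reduction inside each work-item. Lane r now holds
    # the full sum of row r; step 5 (converting back) is left out.
    return [sum(lane_regs) for lane_regs in regs]


tile = [[r * 64 + c for c in range(64)] for r in range(16)]
assert sub_group_transpose_reduce(tile) == [sum(row) for row in tile]
```

The only cross-lane communication in this model is the transpose in step 3; both reductions stay within a work-item. The result is left in the transposed layout, which corresponds to dropping step 5.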

victor-eds added the performance label Sep 17, 2024

victor-eds self-assigned this Sep 17, 2024

vlad-penkin added this to the 4.0 [Performance] Core milestone Sep 17, 2024

vlad-penkin added the codegen: attention and enhancement labels Sep 17, 2024

victor-eds changed the title ~~Port #2109 to default path~~ Port "sub-group transpose reduction" to default path Sep 18, 2024

victor-eds removed their assignment Sep 18, 2024
