Some Triton kernels generated by Inductor have low efficiency on PVC 1550 compared to A100 #2229

Open
jianyizh opened this issue Sep 13, 2024 · 2 comments

@jianyizh

Hi, I found that some Triton kernels generated by Inductor for the TorchBench vit-base model are slower than on A100. The achieved bandwidth seems pretty low. I'm using the public PyTorch master branch with an XPU build.
PVC:
python cat_layernorm.py
0.403ms 0.039GB 96.59GB/s
python gelu.py
0.362ms 0.155GB 428.07GB/s
python layernorm.py
0.296ms 0.058GB 196.21GB/s
python safe_softmax.py
0.495ms 0.238GB 481.51GB/s
A100:
python cat_layernorm_nv.py
0.089ms 0.039GB 437.12GB/s
python gelu_nv.py
0.144ms 0.155GB 1073.02GB/s
python layernorm_nv.py
0.062ms 0.058GB 930.15GB/s
python safe_softmax_nv.py
0.261ms 0.238GB 913.15GB/s
reproducer.zip
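
For reference, a minimal sketch of the kind of measurement these scripts report (my assumption of what the attached reproducer roughly does; the shape and device pick are illustrative, the attached reproducer.zip is authoritative): compile a layernorm with Inductor, time the resulting Triton kernel, and report bytes moved divided by time.

```python
import time
import torch

# Pick XPU if available (PVC), otherwise CUDA (A100).
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cuda"

def sync():
    (torch.xpu if device == "xpu" else torch.cuda).synchronize()

# Illustrative ViT-base-like activation shape: (batch * seq_len, hidden).
x = torch.randn(64 * 197, 768, device=device, dtype=torch.float16)
ln = torch.nn.LayerNorm(768, device=device, dtype=torch.float16)
fn = torch.compile(ln)  # Inductor emits a Triton layernorm kernel here

# Warm up so compilation and autotuning are excluded from the timing.
for _ in range(10):
    fn(x)
sync()

iters = 100
t0 = time.perf_counter()
for _ in range(iters):
    fn(x)
sync()
ms = (time.perf_counter() - t0) / iters * 1e3

# Bytes moved: read the input once and write the output once (fp16).
gb = 2 * x.numel() * x.element_size() / 1e9
print(f"{ms:.3f}ms {gb:.3f}GB {gb / (ms / 1e3):.2f}GB/s")
```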

vlad-penkin commented Sep 13, 2024

@jianyizh I've got higher numbers for your reproducer:

| Kernel | Original PVC time (ms) | Original PVC data (GB) | Original PVC bandwidth (GB/s) | PVC 1100 time (ms) | PVC 1100 data (GB) | PVC 1100 bandwidth (GB/s) | A100 time (ms) | A100 data (GB) | A100 bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|---|---|
| cat_layernorm | 0.403 | 0.039 | 96.59 | 0.290 | 0.039 | 134.47 | 0.089 | 0.039 | 437.12 |
| gelu | 0.362 | 0.155 | 428.07 | 0.201 | 0.155 | 770.94 | 0.144 | 0.155 | 1073.02 |
| layernorm | 0.296 | 0.058 | 196.21 | 0.131 | 0.058 | 443.38 | 0.062 | 0.058 | 930.15 |
| safe_softmax | 0.495 | 0.238 | 481.51 | 0.437 | 0.238 | 545.68 | 0.261 | 0.238 | 913.15 |

Could you please provide more details on your env by running these two commands:

  • ./scripts/capture-hw-details.sh
  • pip list | grep -iE "torch|triton"

@jianyizh

@vlad-penkin Thank you for the reply. I think the performance of the two layernorm kernels (~50%) and softmax (~60%) is still not good compared with A100.
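
A quick back-of-the-envelope check of these ratios, assuming they compare the PVC 1100 bandwidth column in the table above against A100 (my reading of the comment, not the original author's statement):

```python
# Relative achieved bandwidth, PVC 1100 vs. A100, from the table above.
pvc_1100 = {"cat_layernorm": 134.47, "layernorm": 443.38, "safe_softmax": 545.68}
a100 = {"cat_layernorm": 437.12, "layernorm": 930.15, "safe_softmax": 913.15}
for k in pvc_1100:
    print(f"{k}: {pvc_1100[k] / a100[k]:.0%} of A100 bandwidth")
# cat_layernorm: 31% of A100 bandwidth
# layernorm: 48% of A100 bandwidth
# safe_softmax: 60% of A100 bandwidth
```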

./scripts/capture-hw-details.sh
LIBIGC1_VERSION=1.0.16900.24-914
LEVEL_ZERO_VERSION=1.3.29735.27-914
AGAMA_VERSION=914
GPU_DEVICE=Intel(R) Data Center GPU Max 1550

pip list | grep -iE "torch|triton"
bert_pytorch 0.0.1a4 /home/sdp/jianyi/oob/benchmark/torchbenchmark/models/BERT_pytorch
functorch 1.14.0a0+b71aa0b
pytorch-labs-segment-anything-fast 0.2
torch 2.6.0a0+gite6b6835 /home/sdp/jianyi/pytorch
torch_geometric 2.4.0
torchao 0.5.0
torchaudio 2.5.0a0+97ed7b3 /home/sdp/jianyi/audio/src
torchvision 0.20.0a0+838ad6c /home/sdp/jianyi/vision
triton 3.0.0
triton-xpu 3.0.0b2

vlad-penkin self-assigned this Sep 13, 2024