Update flash_attention_fwd_benchmark.py #2265

Open

anmyachev wants to merge 7 commits into main from anmyachev-patch-1
+7 −9
Conversation
Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev force-pushed the anmyachev-patch-1 branch from 80a8f26 to 92998b2 on September 19, 2024 15:05

Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>

This reverts commit de9335c.
Signed-off-by: Anatoly Myachev <[email protected]>
anmyachev commented on Sep 19, 2024:
```python
torch_fn = lambda: torch.nn.functional.scaled_dot_product_attention(
    q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, scale=sm_scale).to(torch.float32)
atol = 1e-1 if N_CTX == 16384 else 1e-2
benchmark_suit.assert_close(triton_fn(), torch_fn(), atol=atol, rtol=1e-3, err_msg='triton to torch')
```
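For context, here is a minimal runnable sketch of the check above. The tensor shapes, the sm_scale value, and the use of torch.testing.assert_close (standing in for benchmark_suit.assert_close) are assumptions for illustration; triton_fn is the Triton forward kernel in the real benchmark and is only stubbed out here.

```python
import torch

# Assumed shapes for illustration; the real benchmark sweeps Z, H, N_CTX,
# D_HEAD and runs in float16 on XPU.
Z, H, N_CTX, D_HEAD = 1, 2, 1024, 64
q, k, v = (torch.randn(Z, H, N_CTX, D_HEAD) for _ in range(3))
sm_scale = D_HEAD**-0.5

# Reference implementation: PyTorch's built-in attention, upcast to fp32.
torch_fn = lambda: torch.nn.functional.scaled_dot_product_attention(
    q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False,
    scale=sm_scale).to(torch.float32)
triton_fn = torch_fn  # placeholder: the benchmark calls the Triton kernel here

# The PR relaxes the absolute tolerance for N_CTX == 16384, where
# accumulated rounding error grows with sequence length.
atol = 1e-1 if N_CTX == 16384 else 1e-2
torch.testing.assert_close(triton_fn(), torch_fn(), atol=atol, rtol=1e-3)
```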
Using ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE, the available memory is doubled and there is no longer an out-of-memory error for upstream PyTorch (however, this affects performance).
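A hedged sketch of how the variable could be applied, assuming it must be visible before the Level Zero driver (and therefore PyTorch's XPU runtime) initializes; the COMPOSITE/FLAT semantics in the comments come from the Level Zero spec, not from this PR:

```python
import os

# Must be set before anything initializes the Level Zero driver, i.e.
# before importing torch. Per the Level Zero spec, COMPOSITE exposes a
# multi-stack card as a single root device (so one allocator sees the
# whole card's memory), while FLAT exposes each stack as its own device.
os.environ["ZE_FLAT_DEVICE_HIERARCHY"] = "COMPOSITE"

import torch  # noqa: E402 -- deliberately imported after setting the variable
```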
CI error:

```
torch.OutOfMemoryError: XPU out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity
of 64.00 GiB. Of the allocated memory 32.81 GiB is allocated by PyTorch, and 0 bytes is reserved by
PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
```
It's strange that the total capacity is 64.00 GiB; I need to understand why (I would expect the capacity to be larger). UPD: it may be related to the ZE_FLAT_DEVICE_HIERARCHY environment variable, see https://spec.oneapi.io/level-zero/latest/core/PROG.html#environment-variables.
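To check which hierarchy mode is in effect, one could query what the runtime reports. A hedged sketch, assuming a PyTorch build with XPU support; the ~64 GiB vs ~128 GiB figures are expectations for a two-stack GPU Max card, not measurements from this CI run:

```python
import torch

# FLAT mode should show each stack as its own ~64 GiB device;
# COMPOSITE mode should show one root device with the combined capacity.
for i in range(torch.xpu.device_count()):
    props = torch.xpu.get_device_properties(i)
    print(f"device {i}: {props.name}, {props.total_memory / 2**30:.2f} GiB")
```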