Profiling llama on AMD backend, found long times CopyHostToDevice #4855

CrimsonDump · 2024-05-16T08:12:02Z

I've been running Llama 2 with vLLM and tracing it with rocprof. The trace shows ~8 ms of CopyHostToDevice CUDA time in a ~12 ms decode phase, though these copies do not seem to slow down the core kernels (paged_attention, gemm, etc.). The trace also gives some clues that these CopyHostToDevice calls happen while computing logits in the Sampler, which confuses me: I thought all weights were loaded once at the very beginning (as happens on NVIDIA), not in every decode phase.

So I'm wondering: what are these CopyHostToDevice calls doing? Is it true that the sampler needs to load weights in every decode phase?
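
For anyone who wants to reproduce the ratio, here is a minimal sketch of how the copy share of a decode step can be tallied. It assumes the trace has already been exported to (name, duration in ms) pairs; the event list below is hypothetical and only mirrors the rough 8 ms / 12 ms split described above, it is not real rocprof output. `summarize_trace` and `copy_share` are illustrative helper names, not part of rocprof or vLLM.

```python
from collections import defaultdict


def summarize_trace(events):
    """Sum durations (ms) per event name; return (per-name totals, grand total)."""
    totals = defaultdict(float)
    for name, duration_ms in events:
        totals[name] += duration_ms
    return dict(totals), sum(totals.values())


def copy_share(events, copy_name="CopyHostToDevice"):
    """Fraction of the summed event time spent in events named copy_name.

    Note: this simply adds durations, so it ignores any overlap between
    copies and kernels running concurrently.
    """
    totals, grand_total = summarize_trace(events)
    if grand_total == 0:
        return 0.0
    return totals.get(copy_name, 0.0) / grand_total


# Hypothetical events for one decode step (illustrative numbers only).
decode_step = [
    ("paged_attention", 1.5),
    ("gemm", 2.5),
    ("CopyHostToDevice", 5.0),
    ("CopyHostToDevice", 3.0),
]

totals, grand_total = summarize_trace(decode_step)
share = copy_share(decode_step)
print(f"CopyHostToDevice: {totals['CopyHostToDevice']:.1f} ms "
      f"of {grand_total:.1f} ms ({share:.0%})")
```

Grouping by name like this makes it easier to see whether the copy time comes from a few large transfers or many small ones, which is useful context for the question above.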

Replies: 0 comments
