Profiling Llama on the AMD backend: long CopyHostToDevice times #4855
Unanswered
CrimsonDump
asked this question in
Q&A
I've been running Llama 2 with vLLM and tracing it with rocprof. The trace shows ~8 ms of CopyHostToDevice CUDA time in a ~12 ms decode phase, although these copies do not seem to delay the core kernels (paged_attention, gemm, etc.). Some clues in the trace suggest that these CopyHostToDevice calls are made while computing logits in the Sampler. That confuses me, because I thought all weights were loaded once at the very beginning (as happens on the NVIDIA backend), not in every decode phase.
So I'm wondering: what are these CopyHostToDevice calls doing? Is it true that the sampler needs to load weights in every decode phase?
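For anyone who wants to check the same numbers, here is a minimal sketch of how the copy time can be totalled from a trace. It assumes the trace was exported as JSON in the Chrome trace event format, where each event has a `name` and a `dur` in microseconds. The function name `summarize_events` and the sample events are hypothetical and for illustration only; they are not measured data and not part of vLLM or rocprof.

```python
import json
from collections import defaultdict


def summarize_events(trace, needle="CopyHostToDevice"):
    """Return {event name: (count, total duration in microseconds)}
    for all events whose name contains `needle`.

    `trace` is either a list of events or a dict with a "traceEvents" list
    (both layouts are assumed here, following the Chrome trace event format).
    """
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(lambda: [0, 0.0])  # name -> [count, total_us]
    for ev in events:
        # Skip entries without a duration (e.g. instant markers or metadata).
        if not isinstance(ev, dict) or "dur" not in ev:
            continue
        name = ev.get("name", "")
        if needle in name:
            totals[name][0] += 1
            totals[name][1] += float(ev["dur"])
    return {name: (count, total) for name, (count, total) in totals.items()}


# Illustrative events only (made-up durations, not real measurements).
sample_events = [
    {"name": "CopyHostToDevice", "ph": "X", "ts": 0, "dur": 500},
    {"name": "paged_attention_kernel", "ph": "X", "ts": 100, "dur": 300},
    {"name": "CopyHostToDevice", "ph": "X", "ts": 700, "dur": 250},
    {"name": "marker", "ph": "i", "ts": 5},
]

if __name__ == "__main__":
    # With a real trace file this would be: trace = json.load(open(path))
    for name, (count, total_us) in summarize_events(sample_events).items():
        print(f"{name}: {count} calls, {total_us / 1000:.3f} ms total")
```

Comparing the summed copy time against the wall-clock length of one decode step should show whether the copies actually extend the step or overlap with the kernels.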