Mistral-7B-Instruct-v0.2 inference slows down when using the offline_inference.py example, possibly due to improper usage #4813
Closed · Alf-Z-SymphoMe started this conversation in General · Replies: 0 comments
I am trying to use the Mistral-7B model in such a way that the input prompt contains a fixed part plus some variables derived from other Python scripts.
The way I run the model is with `python run_model.py`: this runs the scripts that generate the variable part of the prompt (extracted from data provided by the user) and then runs the Mistral-7B model itself. However, the model is loaded from scratch every time I run the code, and that loading takes most of the total inference time. I would like to avoid this, and I am trying to figure out how the vllm library could help.
I tried combining my script with the `offline_inference.py` example from the vllm-project repo, but while it shortens the model loading time, it actually slows down the overall inference.
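For context, here is a minimal sketch of the offline approach I mean: the model is loaded once through vllm's `LLM` class and then reused for each new prompt. The fixed text and the `build_variable_prompt` helper are just placeholders for my actual prompt-building scripts.

```python
from vllm import LLM, SamplingParams

# Load the model once; this is the expensive step that should only be paid once.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

FIXED_PART = "You are given some user data. Answer based on it.\n"

def build_variable_prompt(user_data: str) -> str:
    # Placeholder for the scripts that derive the variable part of the prompt.
    return f"Data provided by the user: {user_data}\n"

# Reuse the already-loaded model for every new piece of user data.
for user_data in ["first input", "second input"]:
    prompt = FIXED_PART + build_variable_prompt(user_data)
    outputs = llm.generate([prompt], sampling_params)
    print(outputs[0].outputs[0].text)
```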
Which of the examples in the vllm-project repo would be a good fit for my case? Maybe `api_client.py`?
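For reference, my understanding is that `api_client.py` is the client half of a client/server setup: the model stays loaded in a separately started server process, and `run_model.py` would only send HTTP requests to it. A rough sketch of what that client side might look like, assuming the demo server from `vllm.entrypoints.api_server` is running on its default port 8000 (the exact request/response fields may differ between vllm versions):

```python
import requests

# Assumes the vLLM demo API server was started separately, e.g.:
#   python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2
# so the model stays loaded for as long as that server process is alive.
API_URL = "http://localhost:8000/generate"

def query_server(prompt: str) -> list[str]:
    payload = {
        "prompt": prompt,
        "temperature": 0.7,
        "max_tokens": 256,
    }
    response = requests.post(API_URL, json=payload)
    response.raise_for_status()
    # The demo server returns the generated text(s) under the "text" key.
    return response.json()["text"]

if __name__ == "__main__":
    print(query_server("Fixed part of the prompt plus the variable part."))
```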