Mistral-7B-Instruct-v0.2 inference slows down when using the offline_inference.py example, possibly due to improper usage #4813
Closed · Alf-Z-SymphoMe started this conversation in General · Replies: 0 comments
I am trying to use the Mistral-7B model in such a way that the input prompt contains a fixed part plus some variables derived from other Python scripts.
The way I run the model is with `python run_model.py`: this runs the scripts that generate the variable part of the prompt (extracted from data provided by the user) and then runs the Mistral-7B model itself. However, the model is loaded from scratch every time I run the code, and that loading takes most of the total inference time. I would like to avoid this, and I am trying to figure out how the vllm library could help.
I tried combining my script with the `offline_inference.py` example from the vllm-project repo, but while it shortens the model loading time, it actually slows down the overall inference.
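For context, here is a minimal sketch of the offline approach I mean: the model is loaded once through vllm's `LLM` class and then reused for each new prompt. The fixed text and the `build_variable_prompt` helper are just placeholders for my actual prompt-building scripts.

```python
from vllm import LLM, SamplingParams

# Load the model once; this is the expensive step that should only be paid once.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

FIXED_PART = "You are given some user data. Answer based on it.\n"

def build_variable_prompt(user_data: str) -> str:
    # Placeholder for the scripts that derive the variable part of the prompt.
    return f"Data provided by the user: {user_data}\n"

# Reuse the already-loaded model for every new piece of user data.
for user_data in ["first input", "second input"]:
    prompt = FIXED_PART + build_variable_prompt(user_data)
    outputs = llm.generate([prompt], sampling_params)
    print(outputs[0].outputs[0].text)
```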
Which of the examples in the vllm-project repo would be a good fit for my case? Maybe `api_client.py`?
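For reference, my understanding is that `api_client.py` is the client half of a client/server setup: the model stays loaded in a separately started server process, and `run_model.py` would only send HTTP requests to it. A rough sketch of what that client side might look like, assuming the demo server from `vllm.entrypoints.api_server` is running on its default port 8000 (the exact request/response fields may differ between vllm versions):

```python
import requests

# Assumes the vLLM demo API server was started separately, e.g.:
#   python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2
# so the model stays loaded for as long as that server process is alive.
API_URL = "http://localhost:8000/generate"

def query_server(prompt: str) -> list[str]:
    payload = {
        "prompt": prompt,
        "temperature": 0.7,
        "max_tokens": 256,
    }
    response = requests.post(API_URL, json=payload)
    response.raise_for_status()
    # The demo server returns the generated text(s) under the "text" key.
    return response.json()["text"]

if __name__ == "__main__":
    print(query_server("Fixed part of the prompt plus the variable part."))
```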