Three vLLM instances serving Qwen 7B are deployed on L20 GPUs, with Nginx at the outermost layer for load balancing. Each instance is started with `python3 -m vllm.entrypoints.openai.api_server`. The deployment architecture is: client → Nginx (load balancer) → one of the three vLLM instances.
I have two questions regarding online deployment:
1. For a `/v1/chat/completions` request, can the backend instance that produced the returned result be identified through configuration? (One possible approach is sketched after this list.)
2. Some requests take a very long time. Is there a way to forcibly terminate such a request? (See the second sketch below.)
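
On the first question: the vLLM OpenAI-compatible server does not, by itself, say which instance produced a response, so a common approach is to have Nginx attach the upstream address as a response header (e.g. `add_header X-Upstream-Addr $upstream_addr always;` inside the proxying `location` block) and read that header on the client. Below is a minimal sketch, assuming the Nginx front end listens on `http://localhost:8000` and such a header has been configured; the header name, port, and served model name are assumptions, not part of the original setup.

```python
import requests

# Assumed gateway address and header name; adjust to your Nginx setup.
GATEWAY = "http://localhost:8000"
UPSTREAM_HEADER = "X-Upstream-Addr"  # set via `add_header X-Upstream-Addr $upstream_addr always;`

payload = {
    "model": "Qwen-7B",  # whatever --served-model-name / model path the instances were started with
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}

resp = requests.post(f"{GATEWAY}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()

# The body is a standard OpenAI-style completion; the header tells us which
# backend instance Nginx routed this particular request to.
print("served by:", resp.headers.get(UPSTREAM_HEADER, "<header not configured>"))
print(resp.json()["choices"][0]["message"]["content"])
```

The response body stays a standard OpenAI-style completion; only the added header carries the routing information.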
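
On the second question: a practical way to cut off a long-running request is to enforce a client-side timeout and drop the connection; recent vLLM versions generally abort the generation once they notice the client has disconnected, and Nginx's `proxy_read_timeout` can act as a server-side backstop. A minimal sketch with an assumed 30-second read timeout (the gateway address and limits are illustrative):

```python
import requests

GATEWAY = "http://localhost:8000"  # assumed Nginx front-end address

payload = {
    "model": "Qwen-7B",
    "messages": [{"role": "user", "content": "Write a very long story."}],
    "max_tokens": 2048,
}

try:
    # If the response does not arrive within 30 s, requests raises Timeout and
    # closes the connection; the dropped connection is what lets the backend
    # abort the in-flight generation.
    resp = requests.post(
        f"{GATEWAY}/v1/chat/completions",
        json=payload,
        timeout=(5, 30),  # (connect timeout, read timeout) in seconds
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
except requests.Timeout:
    print("request took too long and was cancelled on the client side")
```

Capping `max_tokens` per request is the simplest guard against runaway generations; the timeout covers requests that stall for other reasons.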