Three vLLM instances serving Qwen 7B are deployed on L20 GPUs, with Nginx at the outermost layer for load balancing. Each instance is started with `python3 -m vllm.entrypoints.openai.api_server`. The deployment architecture is: client → Nginx (load balancer) → one of the three vLLM instances.
I have two questions regarding online deployment:
1. For a `/v1/chat/completions` request, can the backend instance that produced the returned result be identified through configuration? (One possible approach is sketched after this list.)
2. Some requests take a very long time. Is there a way to forcibly terminate such a request? (See the second sketch below.)
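
On the first question: the vLLM OpenAI-compatible server does not, by itself, say which instance produced a response, so a common approach is to have Nginx attach the upstream address as a response header (e.g. `add_header X-Upstream-Addr $upstream_addr always;` inside the proxying `location` block) and read that header on the client. Below is a minimal sketch, assuming the Nginx front end listens on `http://localhost:8000` and such a header has been configured; the header name, port, and served model name are assumptions, not part of the original setup.

```python
import requests

# Assumed gateway address and header name; adjust to your Nginx setup.
GATEWAY = "http://localhost:8000"
UPSTREAM_HEADER = "X-Upstream-Addr"  # set via `add_header X-Upstream-Addr $upstream_addr always;`

payload = {
    "model": "Qwen-7B",  # whatever --served-model-name / model path the instances were started with
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}

resp = requests.post(f"{GATEWAY}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()

# The body is a standard OpenAI-style completion; the header tells us which
# backend instance Nginx routed this particular request to.
print("served by:", resp.headers.get(UPSTREAM_HEADER, "<header not configured>"))
print(resp.json()["choices"][0]["message"]["content"])
```

The response body stays a standard OpenAI-style completion; only the added header carries the routing information.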
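
On the second question: a practical way to cut off a long-running request is to enforce a client-side timeout and drop the connection; recent vLLM versions generally abort the generation once they notice the client has disconnected, and Nginx's `proxy_read_timeout` can act as a server-side backstop. A minimal sketch with an assumed 30-second read timeout (the gateway address and limits are illustrative):

```python
import requests

GATEWAY = "http://localhost:8000"  # assumed Nginx front-end address

payload = {
    "model": "Qwen-7B",
    "messages": [{"role": "user", "content": "Write a very long story."}],
    "max_tokens": 2048,
}

try:
    # If the response does not arrive within 30 s, requests raises Timeout and
    # closes the connection; the dropped connection is what lets the backend
    # abort the in-flight generation.
    resp = requests.post(
        f"{GATEWAY}/v1/chat/completions",
        json=payload,
        timeout=(5, 30),  # (connect timeout, read timeout) in seconds
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
except requests.Timeout:
    print("request took too long and was cancelled on the client side")
```

Capping `max_tokens` per request is the simplest guard against runaway generations; the timeout covers requests that stall for other reasons.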