Skip to content

Latest commit

 

History

History
215 lines (172 loc) · 7.03 KB

0002-gep.md

File metadata and controls

215 lines (172 loc) · 7.03 KB
MEP Title Discussion Implementation
2
Unifed API
Link
Link

Unified API

Abstract

This GEP discusses the unified API that handles the interaction with LLM providers. This represents the feature set for MVP.

Motivation

The Unified API abstracts much of the overhead of interacting with LLM providers. It stores all the provider specific logic in a single place and allowing developers to focus on the core logic. It also introduces durability to LLM apps. Additionally, unifying the API allows optimizations to be leveraged such as latency and cost improvements.

Requirements

  • R1: Handles all provider specific logic
  • R2: Easily maintained
  • R3: API schemas must unify common LLM request params (e.g. topP, N, etc)
  • R4: API routes must unify common LLM endpoints/API
  • R5: API schemas should allow to customize provider-specific params (e.g. a different messages for OpenAI/Anthropic)
  • R6: Focus on building out a single endpoint for /Chat. Most apps will be using this and then we can focus our efforts on making this part of our API super effective
  • R7: Disallow overrides in the MVP. All params are updated via config. This will change over time but I want users to tell which params they want to override (and how) instead of us making the assumption.

Design

Routes

routes:
    chat: /v1/language/{pool-id}/chat/ 
    transcribers: /v1/speech/transcribers/{pool-id}/  
    speech-synthesizer:  /v1/speech/synthesizers/{pool-id}/ 
    multi-modal: /v1/multi/{pool-id}/multimodal/ 
    embedding: /v1/embeddings/{pool-id}/embed/ 

User Request Schema for Chat Route

Below is an example user request to the Chat route. The provider specific params will look different. This request schema matches the OpenAI Chat request schema. The message field is required and the messageHistory field is optional. The role and content fields must be specified. These fields will be parsed when a specific provider is selected and messages will be passed in the correct schema to that provider.

Unified API schema example:

{
 "message":
      {
        "role": "user",
        "content": "Where was it played?"
      },
    "messageHistory": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"},
      {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}
    ]
}

If a user wants to specify different prompts for different models, they can do so. This assumes the following models have been added to the gateway via the config file.

{
  "model": {
    "name": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Where was it played?"
      }
    ],
    "messageHistory": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"},
      {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}
    ]
  },
  "model": {
    "name": "command-light",
    "messages": [
      {
        "role": "user",
        "content": "Answer this question in two words: Where was it played?"
      }
    ],
    "messageHistory": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"},
      {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}
    ]
  }
}

Provider Request Schemas

OpenAi Chat Endpoint Request Schema

{
    "model": "gpt-3.5-turbo",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"},
    ]
}

Cohere Chat Endpoint Request Schema

{
    "model": "command-light",
    "chat_history": [
      {"role": "USER", "message": "Who won the world series in 2020?"},
      {"role": "CHATBOT", "message": "The Los Angeles Dodgers won the World Series in 2020."}
    ],
    "message": "Where was it played?"
}

Response Schema For Chat

The response form the LLM will be unified for all providers. This way end users can expect the same reponse regardless of provider. Provider specific metadata will be included in the response.

Unified Response Schema from OpenAI API to user

{
  "provider": "openai",
  "pool": "fallback",
  "model": "gpt-3.5-turbo",
  "cached": false,
  "provider_response": {
    "response_id": {
      "system_fingerprint": "fp_44709d6fcb"
    },
    "message": {
      "role": "assistant",
      "content": "\n\nHello there, how may I assist you today?"
    },
    "token_count": {
      "prompt_tokens": 9,
      "response_tokens": 12,
      "total_tokens": 21
    }
  }
}

Unified Response Schema from Cohere API to user

{
  "provider": "cohere",
  "pool": "latency",
  "model": "command-light",
  "cached": false,
  "provider_response": {
    "response_id": {
      "response_id": "12af01ff-ee43-44f6-bdf9-b5aa14a6f738",
      "generation_id": "8a345fc4-afa6-4ddf-8a1a-f601edb6c82b"
    },
    "message": {
      "role": "assistant",
      "content": "Isaac Newton was born on January 4, 1643. However, there is some debate over his birth date, with some records suggesting he was born on December 25, 1642, or even as early as October 24, 1642. Is there anything else you would like to know about Sir Isaac Newton?"
    },
    "token_count": {
      "prompt_tokens": 92,
      "response_tokens": 144,
      "total_tokens": 236
    }
  }
}

Example

In a RAG use case:

  • The LLM dev indexes their data in a vectorDB

  • The LLM dev crafts prompt, search for relevant context in the vectorDB and adds it to the prompt. Also, they experiment with params and prompt to find a set that works the best

  • Up to this point, this was experiments. Now it's time to make that app more production ready. The LLM dev deploys/uses hosted vectorDB. Along with that, they deploy Glide with one default provider pool. The pool contains one model they experimented

  • The Q&A app uses Glide client to work with LLM. It also maintains the conversational memory and sends it as soon as a conversation develops. The app uses /v1/default/chat for that, for example.

  • To make it more resilient, the LLM dev adds two more LLMs in the default pool.

  • After more experiments and testing, turned out that one of fallback models in the pool works badly with the prompt used for the main model. The LLM dev overrides the prompt for that model, so now each time the application sends request to Glide, it specifies two prompts: the main one (used by two models) and one more for one of the fallback models.

Reference

LangchainGo: https://github.com/tmc/langchaingo/tree/main LLMLite Config: https://github.com/BerriAI/litellm/blob/main/litellm/proxy/proxy_server.py

Future Work

  • Additional routes (multi-model, embedding, speech-synthesizer, transcriber)
  • Additional providers
  • Streaming responses