
How to use MLflow for Multi-Model Serving with External LLMs?

Large Language Models (LLMs) are artificial intelligence models trained on massive amounts of text data, allowing them to understand and process language with remarkable sophistication. Their ability to handle complex tasks and generate human-like text makes them increasingly important in various fields.


As LLMs and other machine learning models become more powerful, deploying and managing them effectively becomes crucial. MLflow is an open-source platform that streamlines the entire machine learning lifecycle, from experimentation and training to deployment and monitoring. It provides tools to:

  • Track and compare different training runs of your models, making it easier to identify the best-performing configuration.

  • Package trained models into a standard format, facilitating deployment across different environments.

  • Manage different versions of your models and track their performance.


You can build more capable applications by using MLflow to connect your programs with different external language models. Such applications are flexible and can handle a variety of situations because they can route each request to the language model best suited to the job.


Let's explore how to use MLflow for Multi-Model Serving with External LLMs.


What is Multi-Model Serving?

Multi-model serving (MMS) is a technique in machine learning that allows you to deploy and manage multiple machine learning models behind a single serving endpoint. This means a single application or service can access and use the predictions or outputs of several models, depending on the specific task or situation.


Here are some key characteristics of multi-model serving:

  • Flexibility:  MMS lets you choose the most suitable model for a given task at runtime. This can be based on model specialization, accuracy requirements, or resource constraints.

  • Scalability: By efficiently sharing resources across multiple models, MMS can improve overall serving efficiency and handle increased workloads.

  • Fault Tolerance: If one model becomes unavailable or encounters issues, the serving endpoint can still function by routing requests to other available models.

  • A/B Testing:  MMS facilitates A/B testing where you can compare the performance of different models on the same data and choose the best-performing one.


There are two main approaches to multi-model serving:

  1. Model Selection: In this approach, the serving endpoint analyzes the incoming request and selects the most appropriate model based on pre-defined criteria. This could involve factors like the type of data, the desired task, or specific user requirements.

  2. Traffic Splitting: Incoming traffic is distributed among different models according to a pre-defined ratio. This is helpful when you want to evaluate the performance of multiple models simultaneously or leverage the strengths of various models for different use cases. (A minimal sketch contrasting the two approaches follows this list.)
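
To make the distinction concrete, here is a minimal, purely illustrative sketch of both routing strategies in Python. The request structure, model names, and selection rule are hypothetical placeholders for whatever logic your serving layer actually applies.

import random

def route_by_selection(request: dict) -> str:
    # Model selection: choose a model based on pre-defined criteria (here, the task type)
    return "summarization-llm" if request.get("task") == "summarize" else "chat-llm"

def route_by_traffic_split(weights: dict) -> str:
    # Traffic splitting: choose a model at random according to a pre-defined ratio
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Example usage with hypothetical model names
print(route_by_selection({"task": "summarize"}))               # summarization-llm
print(route_by_traffic_split({"model_a": 50, "model_b": 50}))  # either model, roughly 50/50 over many calls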


In this article, we will use traffic-splitting multi-model serving to route requests between two external Large Language Models (LLMs).


Multi-Model Serving vs. Single-Model Deployment for LLMs

Feature             | Multi-Model Serving                | Single-Model Deployment
--------------------|------------------------------------|-----------------------------------------------
Flexibility         | High                               | Low
Scalability         | Improved                           | Potentially limited
Fault Tolerance     | Higher                             | Lower
A/B Testing         | Facilitated                        | Difficult
Complexity          | More complex to set up and manage  | Simpler setup and management
Resource Efficiency | Potentially more efficient         | Can be efficient if the single model meets all needs
Development Time    | Longer                             | Faster

Additional considerations:

  • Cost: Multi-model serving can incur higher costs due to the potential use of multiple external services or increased resource consumption.

  • Latency: Complex traffic routing logic in multi-model serving might introduce slight latency overhead compared to single-model deployment.


Multi-model serving offers greater flexibility, scalability, and fault tolerance than single-model deployment for LLMs. It's particularly beneficial when you need to:

  • Utilize different LLM strengths for various tasks.

  • Build robust systems with redundancy.

  • Conduct A/B testing for model comparison.


How to use MLflow for Multi-Model Serving with External LLMs?

The steps involved in creating a serving endpoint using MLflow on Azure Databricks to route traffic between multiple external Large Language Models (LLMs) are:


STEP 1. Installation and Configuration of mlflow.deployments

Since Azure Databricks offers a managed version of MLflow, you typically don't need a separate installation. However, to leverage the mlflow.deployments functionality for creating serving endpoints, some configuration might be required. Here's what you might encounter (a minimal notebook check follows the list):

  • Cluster Configuration: Depending on your cluster setup in Databricks, you might need to ensure libraries like mlflow and its dependencies are available on the cluster workers. This could involve installing required libraries using cluster configuration options within the Databricks workspace.

  • Spark Version Compatibility: The mlflow.deployments library might have specific Spark version compatibility requirements. Verify the compatibility with your current Databricks environment to avoid potential issues.
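
As a quick sanity check, a minimal notebook cell like the following confirms that MLflow and its Databricks deployment client are available. The explicit %pip install is shown as a comment because it is only needed if your cluster's runtime does not already bundle a recent MLflow; treat this as a sketch rather than a required step.

# Run in a Databricks notebook cell; uncomment the install line only if needed:
# %pip install --upgrade mlflow

import mlflow
import mlflow.deployments

# Confirm the MLflow version available on the cluster
print(mlflow.__version__)

# Confirm that a deployment client targeting Databricks can be created
client = mlflow.deployments.get_deploy_client("databricks")
print(type(client))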


STEP 2. Specifying the "served_entities" Section

This section within the endpoint configuration defines the details of the external LLMs you want to integrate:

  • Unique Names: Each LLM must have a unique name within the endpoint. This allows the serving logic to differentiate between them when routing requests. The code snippet uses "served_model_name_1" and "served_model_name_2" for the two LLMs.

  • Provider Details: Specify the provider of the LLM service. In the example, "openai" is used for OpenAI's gpt-4 and "anthropic" for Anthropic's claude-3-opus-20240229.

  • Task Type: Define the type of task the LLM handles. The task "llm/v1/chat" indicates a chat-completion task, but this can vary depending on the specific LLM's capabilities.

  • External Model Configuration:  This section provides details for accessing the external service. The code snippet demonstrates using Databricks Secrets to store API keys:

  • openai_config: Contains the key "openai_api_key", which references a Databricks secret holding the OpenAI API key.

  • anthropic_config: Similar to openai_config, this holds the key "anthropic_api_key", which references a secret containing the Anthropic API key for claude-3-opus-20240229.


Essentially, the "served_entities" section creates a mapping between the unique names and the external LLM service details.
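
As a focused illustration, a single entry of the "served_entities" list follows the pattern below; the full configuration in the Code section expands this to two entries. The secret scope and key names are placeholders to replace with your own.

# Shape of one served entity: a unique serving name mapped to an external model.
# The secret scope and key names are placeholders for your own Databricks Secrets.
served_entity = {
    # Unique name used by the endpoint's routing logic
    "name": "served_model_name_1",
    "external_model": {
        # External model to call and its provider
        "name": "gpt-4",
        "provider": "openai",
        # Task type the model serves
        "task": "llm/v1/chat",
        # Provider-specific configuration, with the API key stored in Databricks Secrets
        "openai_config": {
            "openai_api_key": "{{secrets/my_openai_secret_scope/openai_api_key}}"
        },
    },
}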


STEP 3. Setting Up Traffic Allocation with the "traffic_config" Section

This section defines how incoming requests are distributed among the different LLMs:

  • Traffic Splitting: The code utilizes traffic splitting to route requests. This involves specifying the percentage of traffic directed to each LLM. In the example, a 50/50 split is used with:

  • "traffic_percentage": 50 for "served_model_name_1" (gpt-4).

  • "traffic_percentage": 50 for "served_model_name_2" (claude-3-opus-20240229).

  • Alternative Approaches: While the code shows a 50/50 split, you can adjust the percentages to favor specific models based on their strengths or resource requirements. Additionally,  advanced configurations might allow for dynamic traffic allocation based on real-time factors like model performance or workload distribution.


The "traffic_config" section dictates the load-balancing strategy for routing requests to the available external LLMs.


Code

This code demonstrates a basic example with a 50/50 split. To create different traffic allocation strategies, adjust the "traffic_percentage" values in the "routes" list.

import mlflow.deployments

# Assuming mlflow.deployments is already configured within your Databricks environment

# Create a client to interact with the Databricks deployment service
client = mlflow.deployments.get_deploy_client("databricks")

# Define the endpoint name
endpoint_name = "mix-chat-endpoint"

# Define the configuration dictionary for the endpoint
config = {
    # List of served entities (external LLMs)
    "served_entities": [
        {
            # Unique name for the first LLM
            "name": "served_model_name_1",
            # External model configuration for the first LLM
            "external_model": {
                # Name of the external model to call
                "name": "gpt-4",
                # Provider of the external model service
                "provider": "openai",
                # Task type this LLM can handle (e.g., chat)
                "task": "llm/v1/chat",
                # Configuration for accessing the external service (OpenAI in this case)
                "openai_config": {
                    # Accessing OpenAI API key securely using Databricks Secrets
                    "openai_api_key": "{{secrets/my_openai_secret_scope/openai_api_key}}"
                }
            }
        },
        {
            # Unique name for the second LLM
            "name": "served_model_name_2",
            # External model configuration for the second LLM
            "external_model": {
                # Name of the external model to call
                "name": "claude-3-opus-20240229",
                # Provider of the external model service
                "provider": "anthropic",
                # Task type this LLM can handle (e.g., chat)
                "task": "llm/v1/chat",
                # Configuration for accessing the external service (Anthropic in this case)
                "anthropic_config": {
                    # Accessing Anthropic API key securely using Databricks Secrets
                    "openai_api_key": "{{secrets/my_anthropic_secret_scope/anthropic_api_key}}"
                }
            }
        }
    ],
    # Traffic allocation configuration
    "traffic_config": {
        # Routes for distributing traffic
        "routes": [
            # 50% traffic to served_model_name_1 (gpt-4)
            {"served_model_name": "served_model_name_1", "traffic_percentage": 50},
            # 50% traffic to served_model_name_2 (claude-3-opus-20240229)
            {"served_model_name": "served_model_name_2", "traffic_percentage": 50}
        ]
    }
}

# Create the serving endpoint using the client and configuration
client.create_endpoint(name=endpoint_name, config=config)

print(f"Serving endpoint '{endpoint_name}' created successfully!")

Conclusion

This article explored the powerful combination of MLflow and external Large Language Models (LLMs). This approach opens doors for innovative applications across various fields:

  • Natural Language Processing (NLP): Enhance the accuracy and efficiency of tasks like machine translation, text summarization, and sentiment analysis by leveraging the strengths of different LLMs.

  • Creative Content Generation:  Develop applications to generate unique and engaging content in various formats, from marketing copy to poems and scripts.

  • Intelligent Chatbots:  Build chatbots that can hold more natural and nuanced conversations by utilizing the capabilities of multiple LLMs.


The future of multi-model serving with external LLMs is bright!
