You are not dreaming! You can deploy an open-source LLM in a single command line.

Deploying advanced language models can be a challenge! But this once arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications.
In this guide, we will walk through deploying the Mistral-Small-24B-Instruct-2501 model using vLLM on OVHcloud’s AI Deploy platform. This combination offers a powerful solution for efficient and scalable AI model serving.
Deploying a model is great, but doing it quickly is even better!
🤯 What if a single command line was enough? That’s the challenge we’re tackling today!
Context
Before deployment, let’s take a closer look at our key technologies!
Mistral Small
The mistralai/Mistral-Small-24B-Instruct-2501
model, from MistralAI, is a 24-billion-parameter instruction-fine-tuned version of the base model Mistral-Small-24B-Base-2501, renowned for its compact size and for performance comparable to that of larger models.
To serve this model efficiently, we will utilize vLLM, an open-source library for LLM inference.
vLLM
vLLM is a highly optimized serving engine designed to efficiently run large language models. It takes advantage of several key optimizations, such as:
- PagedAttention: an attention mechanism that reduces memory fragmentation and enables more efficient use of GPU memory
- Continuous Batching: vLLM dynamically adjusts batch sizes in real time, ensuring that the GPU is always used efficiently, even with multiple simultaneous requests
- Tensor parallelism: enables model inference across multiple GPUs to boost performance
- Optimized kernel implementations: vLLM uses custom CUDA kernels for faster execution, reducing latency compared to traditional inference frameworks
These features make vLLM one of the best choices for large models such as Mistral Small 24B, enabling low-latency, high-throughput inference on the latest GPUs.
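To get a feel for the library itself, here is a minimal sketch of vLLM's offline inference API (independent of the deployment below, which uses vLLM's OpenAI-compatible server instead). It assumes vLLM is installed locally (pip install vllm) and uses a small, non-gated model so it fits on a single GPU:
# Minimal sketch of vLLM's offline inference API (assumes `pip install vllm`).
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")  # replace with any Hugging Face model id you have access to
params = SamplingParams(temperature=0.7, max_tokens=64)
# Continuous batching: the engine schedules both prompts together on the GPU.
outputs = llm.generate(["What is PagedAttention?", "Explain tensor parallelism."], params)
for out in outputs:
    print(out.outputs[0].text)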
Combined with OVHcloud’s AI Deploy platform, this lets you deploy the model in a single command line.
AI Deploy
OVHcloud AI Deploy is a Container as a Service (CaaS) platform designed to help you deploy, manage and scale AI models. It allows you to optimally deploy your applications and APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.
The key benefits are:
- Easy to use: bring your own custom Docker image and deploy it with a single command line or a few clicks
- High-performance computing: a complete range of GPUs available (H100, A100, V100S, L40S and L4)
- Scalability and flexibility: supports automatic scaling, allowing your model to effectively handle fluctuating workloads
- Cost-efficient: billing per minute, no surcharges
✅ To go further, some prerequisites must be checked!
Prerequisites
Before you begin, ensure that you have:
- OVHcloud account: access to the OVHcloud Control Panel
- ovhai CLI available: install the ovhai CLI
- AI Deploy access: ensure you have a user for AI Deploy
- Hugging Face access: create a Hugging Face account and generate an access token
- Gated model authorization: make sure you have been granted access to the gated Mistral-Small-24B-Instruct-2501 model (you can check this with the snippet just below)
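Not sure whether your token can actually reach the gated repository? Here is a minimal sketch, assuming the huggingface_hub Python package is installed (pip install huggingface_hub) and that your Hugging Face token is exported as MY_HF_TOKEN (as done in the next section):
# Check the Hugging Face token and the access to the gated model (assumes MY_HF_TOKEN is exported).
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ["MY_HF_TOKEN"])
print("Logged in as:", api.whoami()["name"])  # confirms the token is valid
try:
    # Fetching the model metadata fails if you were not granted access to the gated repository.
    api.model_info("mistralai/Mistral-Small-24B-Instruct-2501")
    print("Access to Mistral-Small-24B-Instruct-2501 is granted.")
except Exception as err:
    print("Access check failed:", err)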
🚀 Having all the ingredients for our recipe, it’s time to deploy!
Deployment of the Mistral Small 24B LLM
Let’s go and deploy the mistralai/Mistral-Small-24B-Instruct-2501 model.
Manage access tokens
Export your Hugging Face token.
export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
Create a token to access your AI Deploy app once it is deployed.
ovhai token create --role operator ai_deploy_token=my_operator_token
This returns the following output:
Id: 47292486-fb98-4a5b-8451-600895597a2b
Created At: 20-02-25 11:53:05
Updated At: 20-02-25 11:53:05
Spec:
Name: ai_deploy_token=my_operator_token
Role: AiTrainingOperator
Label Selector:
Status:
Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Version: 1
You can now store and export your access token:
export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Launch Mistral Small LLM with AI Deploy
You are ready to start Mistral-Small-24B using vLLM and AI Deploy:
ovhai app run --name vllm-mistral-small \
--default-http-port 8000 \
--label ai_deploy_token=my_operator_token \
--gpu 2 \
--flavor l40s-1-gpu \
-e OUTLINES_CACHE_DIR=/tmp/.outlines \
-e HF_TOKEN=$MY_HF_TOKEN \
-e HF_HOME=/hub \
-e HF_DATASETS_TRUST_REMOTE_CODE=1 \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
-v standalone:/hub:rw \
-v standalone:/workspace:rw \
vllm/vllm-openai:latest \
-- bash -c "python3 -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Small-24B-Instruct-2501 \
--tensor-parallel-size 2 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral \
--dtype half"
Let’s break down the different parameters of this command.
1. Start your AI Deploy app
Launch a new app using the ovhai CLI and give it a name.
ovhai app run --name vllm-mistral-small
2. Define access
Define the HTTP API port and restrict access to your token.
--default-http-port 8000
--label ai_deploy_token=my_operator_token
3. Configure GPU resources
Specifies the hardware flavor (l40s-1-gpu), which refers to an NVIDIA L40S GPU, and the number of GPUs (2).
--gpu 2
--flavor l40s-1-gpu
⚠️WARNING! For this model, two L40S are sufficient, but if you want to deploy another model, you will need to check which GPUs it requires. Note that you can also access A100 and H100 GPUs for your larger models.
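Why two GPUs? A rough, back-of-the-envelope estimate (an approximation, not an exact figure) shows that the FP16 weights alone nearly fill a single 48 GB L40S, leaving no room for the KV cache:
# Rough estimate of the memory footprint of the weights (approximation only).
params = 24e9          # ~24 billion parameters
bytes_per_param = 2    # FP16 (--dtype half) = 2 bytes per parameter
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")  # ~48 GB, i.e. a whole L40S (48 GB)
# With --tensor-parallel-size 2, the weights are split across the two L40S GPUs (~24 GB each),
# leaving headroom for the KV cache and activations.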
4. Set up environment variables
Configure caching for the Outlines library (used for efficient text generation):
-e OUTLINES_CACHE_DIR=/tmp/.outlines
Pass the Hugging Face token ($MY_HF_TOKEN) for model authentication and download:
-e HF_TOKEN=$MY_HF_TOKEN
Set the Hugging Face cache directory to /hub (where models will be stored):
-e HF_HOME=/hub
Allow execution of custom remote code from Hugging Face datasets (required for some model behaviors):
-e HF_DATASETS_TRUST_REMOTE_CODE=1
Disable Hugging Face Hub transfer acceleration (to use standard model downloading):
-e HF_HUB_ENABLE_HF_TRANSFER=0
5. Mount persistent volumes
Mounts two persistent storage volumes:
- /hub → stores Hugging Face model files
- /workspace → main working directory
The rw flag means read-write access.
-v standalone:/hub:rw
-v standalone:/workspace:rw
6. Choose the target Docker image
Uses the vllm/vllm-openai:latest Docker image (a pre-configured vLLM OpenAI API server).
vllm/vllm-openai:latest
7. Run the model inside the container
Runs a bash shell inside the container and executes a Python command to launch the vLLM API server:
python3 -m vllm.entrypoints.openai.api_server
→ Starts the OpenAI-compatible vLLM API server
--model mistralai/Mistral-Small-24B-Instruct-2501 → loads the Mistral Small 24B model from Hugging Face
--tensor-parallel-size 2 → distributes the model across the 2 GPUs
--tokenizer_mode mistral → uses the Mistral tokenizer
--load_format mistral → uses Mistral’s model loading format
--config_format mistral → ensures the model configuration follows Mistral’s standard
--dtype half → uses FP16 (half-precision floating point) for optimized GPU performance
You can now check if your AI Deploy app is alive:
ovhai app get <your_vllm_app_id>
💡 Is your app in RUNNING status? Perfect! You can check in the logs that the server has started…
ovhai app logs <your_vllm_app_id>
⚠️WARNING! This step may take a little time, as the model weights must be downloaded and loaded…
After a few minutes, you should get the following information in the logs:
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Started server process [13]
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Waiting for application startup.
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Application startup complete.
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
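You can also poll the OpenAI-compatible API directly to confirm the model is being served. Below is a minimal sketch, assuming the requests package is installed and the environment variables exported earlier; replace <your_vllm_app_id> with your own app id:
# Minimal readiness check against the deployed vLLM server (assumes `pip install requests`).
import os
import requests
base_url = "https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net"  # replace with your app URL
headers = {"Authorization": f"Bearer {os.environ['MY_OVHAI_ACCESS_TOKEN']}"}
# /v1/models lists the models served by the OpenAI-compatible vLLM server.
resp = requests.get(f"{base_url}/v1/models", headers=headers, timeout=30)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # expect the Mistral Small model id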
🚦 Are all the indicators green? Then it’s off to inference!
Request and send prompt to the LLM
Send the following request, asking the question of your choice:
curl https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net/v1/chat/completions \
-H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Small-24B-Instruct-2501",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Give me the name of OVHcloud’s founder."}
],
"stream": false
}'
This returns the following result:
{
"id":"chatcmpl-d6ea734b524bd851668e71d4111ba496",
"object":"chat.completion",
"created":1740059807,
"model":"mistralai/Mistral-Small-24B-Instruct-2501",
"choices":[
{
"index":0,
"message":{
"role":"assistant",
"reasoning_content":null,
"content":"The founder of OVHcloud is Octave Klaba.",
"tool_calls":[]
},
"logprobs":null,
"finish_reason":"stop",
"stop_reason":null
}
],
"usage":{
"prompt_tokens":22,
"total_tokens":35,
"completion_tokens":13,
"prompt_tokens_details":null
},
"prompt_logprobs":null
}
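Since vLLM exposes an OpenAI-compatible API, you can also query your deployment from Python. Here is a minimal sketch assuming the official openai package (pip install openai); the AI Deploy access token is passed as the API key, which the client sends as a Bearer header:
# Query the deployed model with the OpenAI Python client (assumes `pip install openai`).
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net/v1",  # replace with your app URL
    api_key=os.environ["MY_OVHAI_ACCESS_TOKEN"],  # sent as "Authorization: Bearer ..."
)
response = client.chat.completions.create(
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me the name of OVHcloud's founder."},
    ],
)
print(response.choices[0].message.content)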
Conclusion
By following these steps, you have successfully deployed the mistralai/Mistral-Small-24B-Instruct-2501
model using vLLM on OVHcloud’s AI Deploy platform. This setup provides a scalable and efficient solution for serving advanced language models in production environments.
For further customization and optimization, refer to the vLLM documentation and OVHcloud AI Deploy resources.
💪 Challenge met! You can now enjoy the power of your LLM, deployed in a single command line!
Want even more simplicity? You can also use ready-to-use APIs with AI Endpoints!
But… what’s next?