Mistral Archives - OVHcloud Blog

Reference Architecture: deploying the Mistral Large 123B model in a sovereign environment with OVHcloud

Eléa Petton — Wed, 18 Jun 2025 12:45:51 +0000

Are you ready to think bigger with the Mistral Large model 🚀 ?

Mistral Large model deployed on OVHcloud infrastructure

As Artificial Intelligence (AI) becomes a strategic pillar for both enterprises and public institutions, data sovereignty and infrastructure control have become essential. Deploying advanced large language models (LLMs) like Mistral Large, under a commercial license, requires a secure, high-performance environment that complies with European data regulations.

OVHcloud Machine Learning Services offer a trusted solution for deploying AI models in a fully sovereign cloud environment — hosted in Europe, under EU jurisdiction, and fully GDPR-compliant.

This Reference Architecture will show you how to:

Access Mistral AI registry using your own license
Download the Mistral Large 123B model automatically using AI Training
Store the model into a dedicated bucket with OVHcloud Object Storage
Deploy a production-ready inference API for Mistral Large using AI Deploy

Context

Mistral Large model

The Mistral Large model is a state-of-the-art large language model (LLM) developed by Mistral AI, a French AI company. It’s designed to compete with top-tier models like GPT-4 and Claude, while emphasizing performance and efficiency.

This is a model with 123 billion parameters. Mistral AI recommends deploying this model in FP8 with 4 H100 GPUs. For more information, refer to Mistral documentation.
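As a rough sanity check of that recommendation, the memory needed for the weights alone can be estimated from the parameter count and the bytes stored per parameter. The sketch below assumes 1 byte per parameter for FP8 and 80 GB of memory per H100; both figures are assumptions made for this example, not values taken from the Mistral documentation. It also ignores the KV cache, activations and runtime overhead, which need room as well.

```python
def weights_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Assumptions: FP8 stores one byte per parameter; one H100 offers 80 GB.
N_PARAMS = 123e9
H100_MEMORY_GB = 80
N_GPUS = 4

weights_gb = weights_memory_gb(N_PARAMS, bytes_per_param=1)
total_gpu_gb = N_GPUS * H100_MEMORY_GB
headroom_gb = total_gpu_gb - weights_gb

print(f"weights: {weights_gb:.0f} GB, GPU memory: {total_gpu_gb} GB, headroom: {headroom_gb:.0f} GB")
```

Under these assumptions, the weights take well under half of the combined GPU memory, which leaves space for the cache that grows with the number and length of concurrent requests.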

This model requires the use of a commercial licence. To do this, you need to create an account on La Plateforme via the Mistral AI console (console.mistral.ai).

AI Training

OVHcloud AI Training is a fully managed platform designed to help you train and tune Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs) efficiently. Whether you’re working on computer vision, NLP, or tabular data, this solution lets you launch training jobs on high-performance GPUs in seconds.

What are the key benefits?

Easy to use: launch processing or training jobs in one CLI command or a few clicks using your own Docker image
High-performance computing: access GPUs like H100, A100, V100S, L40S, and L4 as of June 2025 – new references are added regularly
Cost-efficient: pay-per-minute billing with no upfront commitment. You only pay for compute time used, with precise control over resources thanks to automatic job stop and synchronisation

💡 Why do we need AI Training? To download the Mistral Large model automatically and efficiently, using a single command to launch the job.

AI Deploy

OVHcloud AI Deploy is a Container as a Service (CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications / APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.

The key benefits are:

Easy to use: bring your own custom Docker image and deploy it with a single command or a few clicks
High-performance computing: a complete range of GPUs available (H100, A100, V100S, L40S and L4)
Scalability and flexibility: supports automatic scaling, allowing your model to effectively handle fluctuating workloads
Cost-efficient: billing per minute, no surcharges

✅ To go further, some prerequisites must be checked!

Overview of the Mistral Large deployment architecture

Here is how Mistral Large 123B will be deployed:

Install the ovhai CLI
Create a bucket for model storage
Retrieve the license information from Mistral Console
Configure and set up the environment
Download the Mistral Large model weights
Deploy the Mistral Large service
Test it with a simple request, then with more advanced usage through LangChain

Let’s go for the set up and deployment of your own Mistral Large service!

Prerequisites

Before you begin, ensure you have:

A Mistral AI license to access the Mistral Large model
An OVHcloud Public Cloud account
An OpenStack user with the following roles:
- Administrator
- AI Training Operator
- Object Storage Operator

🚀 Having all the ingredients for our recipe, it’s time to deploy the Mistral Large model on 4 H100!

Architecture guide: Mistral Large on OVHcloud infrastructure

Let’s go for the set up and deployment of the Mistral Large model!

✅ Note
In this example, Mistral Large 25.02 is used. Choose the Mistral model covered by your licence and repeat the same steps, adapting the model name and version.

⚙️ Also consider that all of the following steps can be automated using OVHcloud APIs!

Step 1 – Install `ovhai` CLI

If the ovhai CLI is not installed, start by setting up your CLI environment.

curl https://cli.gra.ai.cloud.ovh.net/install.sh | bash

Secondly, login using your OpenStack credentials.

ovhai login -u  -p

Now, it’s time to create your bucket inside OVHcloud Object Storage!

Step 2 – Provision Object Storage

Go to Public Cloud > Storage > Object Storage in the OVHcloud Control Panel.
Create a datastore and a new S3 bucket (e.g., s3-mistral-large-model).
Register the datastore with the ovhai CLI:

ovhai datastore add s3  https://s3.gra.perf.cloud.ovh.net/ gra   --store-credentials-locally

💡 Note that, for this use case, we recommend the High Performance Object Storage range using https://s3.gra.perf.cloud.ovh.net/ instead of https://s3.gra.io.cloud.ovh.net/

Step 3 – Access the Mistral AI registry

⚠️ Please note that you must have a licence for the Mistral Large model to be able to carry out the following steps.

Go to the Mistral AI platform: https://console.mistral.ai/home
Retrieve credentials and the license key from the Mistral console: https://console.mistral.ai/on-premise/licenses
Authenticate to the Mistral AI Docker registry:

docker login  --username $DOCKER_USERNAME --password $DOCKER_PASSWORD

Add the private registry to the config using the ovhai CLI:

ovhai registry add

Check that it is present in the list:

ovhai registry list

Step 4 – Define environment variables

The next step is to define an .env file that will list all the environment variables required to download and deploy the Mistral Large model.

Create the .env file, enter the following information:

SERVED_MODEL=mistral-large-2502
RECIPES_VERSION=v0.0.76
TP_SIZE=4
LICENSE_KEY=
DOCKER_IMAGE_INFERENCE_ENGINE=<mistral-inference-server-docker-image>
DOCKER_IMAGE_MISTRAL_UTILS=<mistral-utils-docker-image>

Then, create a script to load these environment variables easily. Name it load_env.sh:

#!/bin/bash

# Check that the .env file exists
if [ ! -f .env ]; then
  echo "Error: .env not found"
  # The script is meant to be sourced: use return so the current shell stays open
  return 1 2>/dev/null || exit 1
fi

# Export all the variables defined in .env
export $(grep -v '^#' .env | xargs)

echo "Environment variables are loaded from .env"

Now, run this script:

source load_env.sh
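If you drive these steps from a Python script rather than from a shell, the same file can be read with a few lines of standard-library code. This is a minimal sketch: parse_env is a hypothetical helper, not part of the ovhai CLI, and it only handles plain KEY=VALUE lines (no quoting and no variable expansion).

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    variables = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line, ignore it
        variables[key.strip()] = value.strip()
    return variables

# Sample content mirroring the first lines of the .env file described above.
sample = """\
# model settings
SERVED_MODEL=mistral-large-2502
RECIPES_VERSION=v0.0.76
TP_SIZE=4
LICENSE_KEY=
"""

env = parse_env(sample)
print(env)
```

Every value is kept as a string, so a numeric setting such as TP_SIZE has to be converted explicitly where a number is needed.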

✅ You have everything you need to start the implementation!

Step 5 – Download Mistral Large model weights

The aim here is to download the model and its artefacts into the S3 bucket created earlier.

To achieve this, you can launch a download job that will run automatically with AI Training.

💡 Here’s a tip!
Note that here you are not using AI Training to train models, but as an easy-to-use Container as a Service solution. With a single command line, you can launch a one-shot download of the Mistral Large model with automatic synchronisation to Object Storage.

Launch the AI Training download job by attaching the object container:

ovhai job run --name DOWNLOAD_MISTRAL_LARGE_123B \
              --cpu 12 \
              --volume s3-mistral-large-model@/:/opt/ml/model:RW \
              -e RECIPES_VERSION=$RECIPES_VERSION \
              $DOCKER_IMAGE_MISTRAL_UTILS \
                -- bash -c "cd /app/mistral-rclone && \
                  poetry run python mistral-rclone.py \
                  --license-key $LICENSE_KEY \
                  --download-model $SERVED_MODEL"

Full command explained:

ovhai job run

This is the core command to run a job using the OVHcloud AI Training platform.

--name DOWNLOAD_MISTRAL_LARGE_123B

Sets a custom name for the job. For example, DOWNLOAD_MISTRAL_LARGE_123B.

--cpu 12

Allocates 12 CPUs for the job.

--volume s3-mistral-large-model@/:/opt/ml/model:RW

This mounts your OVHcloud Object Storage volume into the job’s file system:
– s3-mistral-large-model@/: refers to your S3 bucket volume from the OVHcloud Object Storage
– :/opt/ml/model: mounts the volume into the container under /opt/ml/model
– RW: enables Read/Write permissions

-e RECIPES_VERSION=$RECIPES_VERSION

This is from your environment variables defined previously.

$DOCKER_IMAGE_MISTRAL_UTILS

This is the Mistral Large utils Docker image you are running inside the job.

-- bash -c "cd /app/mistral-rclone && \
poetry run python mistral-rclone.py \
--license-key $LICENSE_KEY \
--download-model $SERVED_MODEL"

Refers to the specific command to launch the model download.

Note that synchronisation with Object Storage will be automatic at the end of the AI Training job.
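Since these steps can be automated, the download job above can also be assembled programmatically. The sketch below only builds the argument list, using the flags from the command shown earlier, and prints it without launching anything. build_download_job_command is a hypothetical helper, and the values in example_env are placeholders chosen for illustration.

```python
import shlex

def build_download_job_command(env: dict[str, str]) -> list[str]:
    """Assemble the argument list for the download job shown above."""
    # Command executed inside the container; values are quoted for the inner shell.
    inner = (
        "cd /app/mistral-rclone && "
        "poetry run python mistral-rclone.py "
        f"--license-key {shlex.quote(env['LICENSE_KEY'])} "
        f"--download-model {shlex.quote(env['SERVED_MODEL'])}"
    )
    return [
        "ovhai", "job", "run",
        "--name", "DOWNLOAD_MISTRAL_LARGE_123B",
        "--cpu", "12",
        # Volume specification copied as-is from the command above.
        "--volume", "s3-mistral-large-model@/:/opt/ml/model:RW",
        "-e", f"RECIPES_VERSION={env['RECIPES_VERSION']}",
        env["DOCKER_IMAGE_MISTRAL_UTILS"],
        "--", "bash", "-c", inner,
    ]

# Placeholder values for illustration only.
example_env = {
    "SERVED_MODEL": "mistral-large-2502",
    "RECIPES_VERSION": "v0.0.76",
    "LICENSE_KEY": "my-license-key",
    "DOCKER_IMAGE_MISTRAL_UTILS": "registry.example/mistral-utils:tag",
}

command = build_download_job_command(example_env)
print(shlex.join(command))  # pass `command` to subprocess.run(...) to launch the job
```

Passing a list to subprocess.run avoids a second layer of shell quoting on your machine; only the inner bash -c string is interpreted by a shell, inside the container.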

⚠️ WARNING!
Wait for the job to go to DONE before proceeding to the next step.

Check that the various elements are present in the bucket:

ovhai bucket object list s3-mistral-large-model@

The bucket must be organized and split into 4 different folders:

grammars
recipes
tokenizers
weights

Note that a total of 6 elements must be present.
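This check is easy to script once you have the object names as a list of strings. The sketch below verifies that each of the four expected top-level folders appears at least once. How you obtain the names is left out, because the output format of the listing command is not shown here; the keys in the example are hypothetical.

```python
EXPECTED_FOLDERS = {"grammars", "recipes", "tokenizers", "weights"}

def missing_folders(object_keys: list[str]) -> set[str]:
    """Return the expected top-level folders that no object key falls under."""
    present = {key.split("/", 1)[0] for key in object_keys if "/" in key}
    return EXPECTED_FOLDERS - present

# Hypothetical object keys, for illustration only; real names will differ.
keys = [
    "grammars/example.bin",
    "recipes/example.yaml",
    "tokenizers/example.json",
]

print(sorted(missing_folders(keys)))  # the "weights" folder is absent in this sample
```

An empty result means all four folders are present; anything else suggests that the download job has not finished or did not synchronise completely.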

🚀 It’s all there? So let’s move on to the deployment of the Mistral Large model!

Step 6 – Deploy Mistral Large service

To deploy the Mistral Large 123B model using the previously downloaded weights, you will use OVHcloud’s AI Deploy product.

But first you need to create an API key that will allow you to consume the model and query it, in particular through its OpenAI compatibility.

Creation of an access token:

ovhai token create --role read mistral_large=api_key_reader

Export this token as an environment variable:

export MY_OVHAI_MISTRAL_LARGE_TOKEN=

Launch the Mistral Large service with AI Deploy by running the following command:

ovhai app run --name DEPLOY_MISTRAL_LARGE_123B \
              --gpu 4 \
              --flavor h100-1-gpu \
              --default-http-port 5000 \
              --label mistral_large=api_key_reader \
              -e SERVED_MODEL=$SERVED_MODEL \
              -e RECIPES_VERSION=$RECIPES_VERSION \
              -e TP_SIZE=$TP_SIZE \
              --volume s3-mistral-large-model@/:/opt/ml/model:RW \
              --volume standalone:/tmp:RW \
              --volume standalone:/workspace:RW \
              $DOCKER_IMAGE_INFERENCE_ENGINE

Full command explained:

ovhai app run

This is the core command to run an app / API using the OVHcloud AI Deploy platform.

--name DEPLOY_MISTRAL_LARGE_123B

Sets a custom name for the app. For example, DEPLOY_MISTRAL_LARGE_123B.

--default-http-port 5000

Exposes port 5000 as the default HTTP endpoint.

--gpu 4

Allocates 4 GPUs for the app.

--flavor h100-1-gpu

Chooses H100 GPUs for the app.

--volume s3-mistral-large-model@/:/opt/ml/model:RW

Mounts the bucket that holds the downloaded model into the container under /opt/ml/model.

--label mistral_large=api_key_reader

Restricts access to requests that carry your token.

-e SERVED_MODEL=$SERVED_MODEL
-e RECIPES_VERSION=$RECIPES_VERSION
-e TP_SIZE=$TP_SIZE

These are environment variables defined previously.

--volume standalone:/tmp:RW
--volume standalone:/workspace:RW

Mounts two persistent storage volumes:
– /tmp
– /workspace → Main working directory

$DOCKER_IMAGE_INFERENCE_ENGINE

This is the Mistral Large inference Docker image you are running inside the app.

It may take a few minutes for the resources to be allocated and for the Docker image to be pulled.

To check the progress and get additional information about the AI Deploy app, run the following command:

ovhai app get

Once in RUNNING status, the model will be loaded. To check that the load was successful, you can check the container logs:

ovhai app logs

⚠️ WARNING!
To consume the service, you must wait for the app to go into RUNNING status, AND for the model to finish loading.

🎉 Everything ready? You can now start playing with the model!

Step 7 – Test the Mistral Large model by sending your first requests

Access the API doc via your app URL:

https://.app.gra.ai.cloud.ovh.net/docs

To find the information, please refer to https://console.mistral.ai/on-premise/licenses

Test with a basic cURL:

curl -X 'POST' \
'https://.app.gra.ai.cloud.ovh.net/v1/chat/completions' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $MY_OVHAI_MISTRAL_LARGE_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "mistral-large-",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant!"
    },
    {
      "role": "user",
      "content": "What is the capital of France?"     
    }
  ]
}'
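The same request can be built in Python with the standard library only. The sketch below constructs the request without sending it, so it can be inspected first. build_chat_request is a hypothetical helper, and the URL, token and model name are placeholders to replace with your own values.

```python
import json
import urllib.request

def build_chat_request(base_url: str, token: str, model: str, question: str) -> urllib.request.Request:
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant!"},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder values: substitute your own app URL, token and model name.
request = build_chat_request(
    base_url="https://my-app.example",
    token="my-token",
    model="my-model",
    question="What is the capital of France?",
)

print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then json.load(...) on the response.
```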

⚠️ Note that you also have to replace the model name with the one you are using:
"model": "mistral-large-"

To take the implementation a step further and take advantage of all the features of this endpoint, you can also integrate it with LangChain thanks to its full OpenAI compatibility.

LangChain integration:

import os
import time

# In recent LangChain releases, ChatOpenAI is also available from the langchain_openai package
from langchain.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

def chat_completion_basic(new_message: str):

  # Replace the model name and the app URL with your own values
  model = ChatOpenAI(model_name="mistral-large-",
                     openai_api_key=os.environ["MY_OVHAI_MISTRAL_LARGE_TOKEN"],
                     openai_api_base='https://.app.gra.ai.cloud.ovh.net/v1',
                    )

  prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant!"),
    ("human", "{question}"),
  ])

  chain = prompt | model

  print("🤖: ")
  # Stream the answer chunk by chunk
  for r in chain.stream({"question": new_message}):
    print(r.content, end="", flush=True)
    time.sleep(0.150)

chat_completion_basic("What is the capital of France?")

🥹 Congratulations! You have successfully completed the deployment!

Conclusion

You can now consume your Mistral Large 123B in a secure environment!

The result of your implementation? The deployment of a sovereign, scalable, production-quality 123B LLM, powered by OVHcloud AI Deploy.

➡️ To go further?

Update your model in a single command line and without interruption following this documentation
Scale out to additional replicas under heavy load to ensure high availability using this method

Mistral Small 24B served with vLLM and AI Deploy – a single command to deploy an LLM (Part 1)

Eléa Petton — Mon, 24 Feb 2025 10:08:37 +0000

You are not dreaming! You can deploy an open-source LLM in a single command line.

Deploying advanced language models can be a challenge! But this sometimes arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications.

In this guide, we will walk through deploying the Mistral-Small-24B-Instruct-2501 model using vLLM on OVHcloud’s AI Deploy platform. This combination offers a powerful solution for efficient and scalable AI model serving.

Deploying a model is great, but doing it quickly is even better!

🤯 What if a single command line was enough? That’s the challenge we’re tackling today!

Context

Before deployment, let’s take a closer look at our key technologies!

Mistral Small

The mistralai/Mistral-Small-24B-Instruct-2501 is a 24-billion-parameter instruction-fine-tuned model, renowned for its compact size and performance comparable to larger models.

This model, from MistralAI, is an instruction-fine-tuned version of the base model: Mistral-Small-24B-Base-2501.

To serve this model efficiently, we will utilize vLLM, an open-source library for LLM inference.

vLLM

vLLM (Virtual LLM) is a highly optimized serving engine designed to efficiently run large language models. It takes advantage of several key optimizations, such as:

PagedAttention: an attention mechanism that reduces memory fragmentation and enables more efficient use of GPU memory
Continuous Batching: vLLM dynamically adjusts batch sizes in real time, ensuring that the GPU is always used efficiently, even with multiple simultaneous requests
Tensor parallelism: enables model inference across multiple GPUs to boost performance
Optimized kernel implementations: vLLM uses custom CUDA kernels for faster execution, reducing latency compared to traditional inference frameworks

These features make vLLM one of the best choices for large models such as Mistral Small 24B, enabling low-latency, high-throughput inference on the latest GPUs.
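To give an intuition for PagedAttention, the sketch below models only the bookkeeping idea behind it: cache memory is handed out in fixed-size blocks as a sequence grows, instead of being reserved for the maximum sequence length up front. It is a toy illustration, not vLLM's code; the block size of 16 tokens and the maximum length of 2048 tokens are assumptions chosen for the example.

```python
import math

def blocks_needed(n_tokens: int, block_size: int) -> int:
    """Number of fixed-size cache blocks required to hold n_tokens."""
    return math.ceil(n_tokens / block_size)

def reserved_slots_paged(n_tokens: int, block_size: int) -> int:
    """Token slots reserved when memory is allocated block by block."""
    return blocks_needed(n_tokens, block_size) * block_size

BLOCK_SIZE = 16      # assumed block size, in tokens
MAX_SEQ_LEN = 2048   # assumed up-front reservation when no paging is used

for n_tokens in (10, 100, 1000):
    paged = reserved_slots_paged(n_tokens, BLOCK_SIZE)
    print(f"{n_tokens:>5} tokens: paged reserves {paged}, "
          f"contiguous reserves {MAX_SEQ_LEN}, unused with paging: {paged - n_tokens}")
```

With block-wise allocation, the unused space per sequence is always smaller than one block, whereas an up-front reservation wastes everything between the actual length and the maximum.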

By deploying on OVHcloud’s AI Deploy platform, you can deploy this model in a single command line.

AI Deploy

The key benefits are:

Easy to use: bring your own custom Docker image and deploy it with a single command or a few clicks
High-performance computing: a complete range of GPUs available (H100, A100, V100S, L40S and L4)
Scalability and flexibility: supports automatic scaling, allowing your model to effectively handle fluctuating workloads
Cost-efficient: billing per minute, no surcharges

✅ To go further, some prerequisites must be checked!

Prerequisites

Before you begin, ensure that you have:

OVHcloud account: access to the OVHcloud Control Panel
ovhai CLI available: install the ovhai CLI
AI Deploy access: ensure you have a user for AI Deploy
Hugging Face access: create a Hugging Face account and generate an access token
Gated model authorization: make sure you have been granted access to the Mistral-Small-24B-Instruct-2501 model

🚀 Having all the ingredients for our recipe, it’s time to deploy!

Deployment of the Mistral Small 24B LLM

Let’s go for the deployment of the model mistralai/Mistral-Small-24B-Instruct-2501

Manage access tokens

Export your Hugging Face token.

export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

Create a token to access your AI Deploy app once it is deployed.

ovhai token create --role operator ai_deploy_token=my_operator_token

Returning the following output:

Id:         47292486-fb98-4a5b-8451-600895597a2b
Created At: 20-02-25 11:53:05
Updated At: 20-02-25 11:53:05
Spec:
  Name:           ai_deploy_token=my_operator_token
  Role:           AiTrainingOperator
  Label Selector:
Status:
  Value:   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  Version: 1

You can now store and export your access token:

export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Launch Mistral Small LLM with AI Deploy

You are ready to start Mistral-Small-24B using vLLM and AI Deploy:

ovhai app run --name vllm-mistral-small \
              --default-http-port 8000 \
              --label ai_deploy_token=my_operator_token \
              --gpu 2 \
              --flavor l40s-1-gpu \
              -e OUTLINES_CACHE_DIR=/tmp/.outlines \
              -e HF_TOKEN=$MY_HF_TOKEN \
              -e HF_HOME=/hub \
              -e HF_DATASETS_TRUST_REMOTE_CODE=1 \
              -e HF_HUB_ENABLE_HF_TRANSFER=0 \
              -v standalone:/hub:rw \
              -v standalone:/workspace:rw \
              vllm/vllm-openai:v0.8.2 \
              -- bash -c "python3 -m vllm.entrypoints.openai.api_server \
                        --model mistralai/Mistral-Small-24B-Instruct-2501 \
                        --tensor-parallel-size 2 \
                        --tokenizer_mode mistral \
                        --load_format mistral \
                        --config_format mistral \
                        --dtype half"

How to understand the different parameters of this command?

1. Start your AI Deploy app

Launch a new app using ovhai CLI and name it.

ovhai app run --name vllm-mistral-small

2. Define access

Define the HTTP API port and restrict access to your token.

--default-http-port 8000
--label ai_deploy_token=my_operator_token

3. Configure GPU resources

Specifies the hardware type (l40s-1-gpu, which refers to an NVIDIA L40S GPU) and the number of GPUs (2).

--gpu 2 --flavor l40s-1-gpu

⚠️WARNING! For this model, two L40S are sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access A100 and H100 GPUs for your larger models.

4. Set up environment variables

Configure the cache directory for the Outlines library (used for structured text generation):

-e OUTLINES_CACHE_DIR=/tmp/.outlines

Pass the Hugging Face token ($MY_HF_TOKEN) for model authentication and download:

-e HF_TOKEN=$MY_HF_TOKEN

Set the Hugging Face cache directory to /hub (where models will be stored):

-e HF_HOME=/hub

Allow execution of custom remote code from Hugging Face datasets (required for some model behaviors):

-e HF_DATASETS_TRUST_REMOTE_CODE=1

Disable Hugging Face Hub transfer acceleration (to use standard model downloading):

-e HF_HUB_ENABLE_HF_TRANSFER=0

5. Mount persistent volumes

Mounts two persistent storage volumes:

/hub → Stores Hugging Face model files
/workspace → Main working directory

The rw flag means read-write access.

-v standalone:/hub:rw -v standalone:/workspace:rw

6. Choose the target Docker image

Uses the vllm/vllm-openai:v0.8.2 Docker image (a pre-configured vLLM OpenAI API server).

vllm/vllm-openai:v0.8.2

7. Running the model inside the container

Runs a bash shell inside the container and executes a Python command to launch the vLLM API server:

python3 -m vllm.entrypoints.openai.api_server → Starts the OpenAI-compatible vLLM API server
--model mistralai/Mistral-Small-24B-Instruct-2501 → Loads the Mistral Small 24B model from Hugging Face
--tensor-parallel-size 2 → Distributes the model across 2 GPUs
--tokenizer_mode mistral → Uses the Mistral tokenizer
--load_format mistral → Uses Mistral’s model loading format
--config_format mistral → Ensures the model configuration follows Mistral’s standard
--dtype half → Uses FP16 (half-precision floating point) for optimized GPU performance

You can now check if your AI Deploy app is alive:

ovhai app get

💡Is your app in RUNNING status? Perfect! You can check in the logs that the server is started…

ovhai app logs

⚠️WARNING! This step may take a little time as the model must be loaded…
After a few minutes, you should get the following information in the logs:

2025-02-20T13:48:07Z [app] [tcmzt] INFO: Started server process [13]
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Waiting for application startup.
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Application startup complete.
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

🚦 Are all the indicators green? Then it’s off to inference!

Request and send prompt to the LLM

Launch the following query by asking the question of your choice:

curl https://.app.gra.ai.cloud.ovh.net/v1/chat/completions \
  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}
    ],
    "stream": false
  }'

Returning the following result:

{
  "id":"chatcmpl-d6ea734b524bd851668e71d4111ba496",
  "object":"chat.completion",
  "created":1740059807,
  "model":"mistralai/Mistral-Small-24B-Instruct-2501",
  "choices":[
    {
      "index":0,
      "message":{
        "role":"assistant",
        "reasoning_content":null, 
        "content":"The founder of OVHcloud is Octave Klaba.",
        "tool_calls":[]
      },
      "logprobs":null,
      "finish_reason":"stop",
      "stop_reason":null
    }
  ],
  "usage":{
    "prompt_tokens":22,
    "total_tokens":35,
    "completion_tokens":13,
    "prompt_tokens_details":null
  },
  "prompt_logprobs":null
}
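In an application, you would usually extract the answer and the token usage from this JSON rather than read it by eye. The sketch below parses a trimmed copy of the response above with the standard library; the field names are the ones visible in that response.

```python
import json

# Trimmed copy of the response shown above.
raw_response = """
{
  "model": "mistralai/Mistral-Small-24B-Instruct-2501",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "The founder of OVHcloud is Octave Klaba."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 22, "total_tokens": 35, "completion_tokens": 13}
}
"""

response = json.loads(raw_response)
answer = response["choices"][0]["message"]["content"]
usage = response["usage"]

print(answer)
print(f"{usage['prompt_tokens']} prompt + {usage['completion_tokens']} completion "
      f"= {usage['total_tokens']} tokens")
```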

Conclusion

By following these steps, you have successfully deployed the mistralai/Mistral-Small-24B-Instruct-2501 model using vLLM on OVHcloud’s AI Deploy platform. This setup provides a scalable and efficient solution for serving advanced language models in production environments.

For further customization and optimization, refer to the vLLM documentation and OVHcloud AI Deploy resources.

💪 Challenge accepted! You can now enjoy the power of your LLM deployed in a single command line!

Want even more simplicity? You can also use ready-to-use APIs with AI Endpoints!

But… what’s next?

How to serve LLMs with vLLM and OVHcloud AI Deploy

Mathieu Busquet — Wed, 29 May 2024 12:22:26 +0000

In this tutorial, we will walk you through the process of serving large language models (LLMs), providing step-by-step instructions.

Introduction

In recent years, large language models (LLMs) have become increasingly popular, with open-source models like Mistral and LLaMA gaining widespread attention. In particular, the LLaMA 3 model, released on April 18, 2024, is one of today’s most powerful open-source LLMs.

However, serving these LLMs can be challenging, particularly on hardware with limited resources. Indeed, even on expensive hardware, LLMs can be surprisingly slow, with high VRAM utilization and throughput limitations.

This is where vLLM comes in. vLLM is an open-source project that enables fast and easy-to-use LLM inference and serving. Designed for optimal performance and resource utilization, vLLM supports a range of LLM architectures and offers flexible customization options. That’s why we are going to use it to efficiently deploy and scale our LLMs.

Objective

In this guide, you will discover how to deploy an LLM with vLLM and the OVHcloud AI Deploy solution. This will enable you to benefit from vLLM’s optimisations and OVHcloud’s GPU computing resources. Your LLM will then be exposed through a secured API.

🎁 And for those who do not want to bother with the deployment process, a surprise awaits you at the end of the article. We are going to introduce you to our new solution for using LLMs, called AI Endpoints. This product makes it easy to integrate AI capabilities into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it’s in alpha, it’s free!

Requirements

To deploy your vLLM server, you need:

An OVHcloud account to access the OVHcloud Control Panel
A Public Cloud project
A user for the AI Products, related to this Public Cloud project
The OVHcloud AI CLI installed on your local computer (to interact with the AI products by running commands).
Docker installed on your local computer, or access to a Debian Docker Instance, which is available on the Public Cloud

Once these conditions have been met, you are ready to serve your LLMs.

Building a Docker image

Since the OVHcloud AI Deploy solution is based on Docker images, we will be using a Docker image to deploy our vLLM inference server.

As a reminder, Docker is a platform that allows you to create, deploy, and run applications in containers. Docker containers are standalone and executable packages that include everything needed to run an application (code, libraries, system tools).

To create this Docker image, we will need to write the following Dockerfile into a new folder:

mkdir my_vllm_image
cd my_vllm_image
nano Dockerfile

# 🐳 Base image
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 👱 Set the working directory inside the container
WORKDIR /workspace

# 📚 Install missing system packages (git) so we can clone the vLLM project repository
RUN apt-get update && apt-get install -y git
RUN git clone https://github.com/vllm-project/vllm/

# 📚 Install the Python dependencies
RUN pip3 install --upgrade pip
RUN pip3 install vllm 

# 🔑 Give correct access rights to the OVHcloud user
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace

Let’s take a closer look at this Dockerfile to understand it:

FROM: Specifies the base image for our Docker image. We choose the PyTorch image since it comes with CUDA, cuDNN and torch, which are needed by vLLM.
WORKDIR /workspace: We set the working directory for the Docker container to /workspace, which is the default folder when we use AI Deploy.
RUN: Runs commands at build time. We use it to install git and clone the vLLM repository into the /workspace directory, to upgrade pip to the latest version, and to install the vLLM library.
ENV HOME=/workspace: This sets the HOME environment variable to /workspace. This is a requirement to use the OVHcloud AI Products.
RUN chown -R 42420:42420 /workspace: This changes the owner of the /workspace directory to the user and group with IDs of 42420 (OVHcloud user). This is also a requirement to use the OVHcloud AI Products.

This Dockerfile does not contain a CMD instruction and therefore does not launch our vLLM server. Do not worry about that: we will do it directly from AI Deploy to have more flexibility.

Once your Dockerfile is written, launch the following command to build your image:

docker build . -t vllm_image:latest

Push the image into the shared registry

Once you have built the Docker image, you will need to push it to a registry to make it accessible from AI Deploy. A registry is a service that allows you to store and distribute Docker images, making it easy to deploy them in different environments.

Several registries can be used (OVHcloud Managed Private Registry, Docker Hub, GitHub packages, …). In this tutorial, we will use the OVHcloud shared registry. More information is available in the Registries documentation.

To find the address of your shared registry, use the following command (ovhai CLI needs to be installed on your computer):

ovhai registry list

Then, log in on your shared registry with your usual AI Platform user credentials:

docker login -u  -p

Once you are logged in to the registry, tag the compiled image and push it into your shared registry:

docker tag vllm_image:latest /vllm_image:latest
docker push /vllm_image:latest

vLLM inference server deployment

Once your image has been pushed, it can be used with AI Deploy, using either the ovhai CLI or the OVHcloud Control Panel (UI).

Creating an access token

Tokens are used as unique authenticators to securely access the AI Deploy apps. By creating a token, you can ensure that only authorized requests are allowed to interact with the vLLM endpoint. You can create this token by using the OVHcloud Control Panel (UI) or by running the following command:

ovhai token create vllm --role operator --label-selector name=vllm

This will give you a token that you will need to keep.
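Conceptually, the label selector ties the token to the apps that carry the matching label. The sketch below is a toy model of that scoping idea, not AI Deploy's actual access-control implementation. It only covers a single key=value selector, and parse_selector and token_applies are hypothetical names.

```python
def parse_selector(selector: str) -> tuple[str, str]:
    """Split a 'key=value' selector into its key and value."""
    key, _, value = selector.partition("=")
    return key, value

def token_applies(selector: str, app_labels: dict[str, str]) -> bool:
    """True when the app carries the label the selector asks for."""
    key, value = parse_selector(selector)
    return app_labels.get(key) == value

selector = "name=vllm"  # selector used when creating the token above

print(token_applies(selector, {"name": "vllm"}))        # app labelled name=vllm
print(token_applies(selector, {"name": "other-app"}))   # app with a different label
print(token_applies(selector, {}))                      # app without any label
```

This is why the app has to be started with the same label: without it, the token does not match the app.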

Creating a Hugging Face token (optional)

Note that some models, such as LLaMA 3, require you to accept their license. To use them, you need to create a Hugging Face account, accept the model’s license, and generate a token from your account settings that will allow you to access the model.

For example, when visiting the Hugging Face Gemma model page while logged in, you will be asked to accept the license conditions.

If you want to use this model, you will have to acknowledge the license, and then make sure to create a token in the tokens section.

In the next step, we will set this token as an environment variable (named HF_TOKEN). Doing this will enable us to use any LLM whose conditions of use we have accepted.

Run the AI Deploy application

Run the following command to deploy your vLLM server by running your customized Docker image:

ovhai app run /vllm_image:latest \
  --name vllm_app \
  --flavor h100-1-gpu \
  --gpu 1 \
  --env HF_TOKEN="" \
  --label name=vllm \
  --default-http-port 8080 \
  -- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model  --dtype half

You just need to change the address of your registry to the one you used, and the name of the LLM you want to use. Also pay attention to the name of the image, its tag, and the label selector of your label if you haven’t used the same ones as those given in this tutorial.

Parameters explanation

<registry-address>/vllm_image:latest is the image on which the app is based.
--name vllm_app is an optional argument that allows you to give your app a custom name, making it easier to manage all your apps.
--flavor h100-1-gpu indicates that we want to run our app on H100 GPU(s). You can access the full list of available GPUs by running ovhai capabilities flavor list.
--gpu 1 indicates that we request 1 GPU for that app.
--env HF_TOKEN is an optional argument that allows us to set our Hugging Face token as an environment variable. This gives us access to models for which we have accepted the conditions.
--label name=vllm restricts access to our LLM: only tokens whose label selector matches name=vllm are accepted.
--default-http-port 8080 indicates that the app listens on port 8080, which is the port exposed through the app URL.
-- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model <model> starts the vLLM API server. Everything after -- is the command executed inside the container. The specified model will be downloaded from Hugging Face. Here is a list of those that are supported by vLLM. Many arguments can be used to optimize your inference; for example, --dtype half loads the weights in half precision (FP16).
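To make the structure of this command easier to see, here is a small sketch that assembles the same arguments as a Python list. It only reuses the flags described above; the function name build_app_run_command and the example image and model values are hypothetical placeholders, not values from this tutorial.

```python
import shlex


def build_app_run_command(image: str, model: str, hf_token: str = "") -> list[str]:
    """Assemble the `ovhai app run` arguments described above."""
    command = [
        "ovhai", "app", "run", image,
        "--name", "vllm_app",
        "--flavor", "h100-1-gpu",
        "--gpu", "1",
    ]
    # HF_TOKEN is optional: only needed for models behind a license gate.
    if hf_token:
        command += ["--env", f"HF_TOKEN={hf_token}"]
    command += [
        "--label", "name=vllm",
        "--default-http-port", "8080",
        # Everything after "--" is executed inside the container.
        "--",
        "python", "-m", "vllm.entrypoints.api_server",
        "--host", "0.0.0.0", "--port", "8080",
        "--model", model, "--dtype", "half",
    ]
    return command


# Hypothetical image and model names, for illustration only.
command = build_app_run_command("registry.example.com/vllm_image:latest", "my-org/my-model")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be passed to subprocess.run if you prefer to launch the app from a script.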

When this ovhai app run command is executed, several pieces of information will appear in your terminal. Get the ID of your application, and open the Info URL in a new tab. Wait a few minutes for your application to launch. When it is RUNNING, you can stream its logs by executing:

ovhai app logs -f <app-id>

This will allow you to track the server launch, the model download, and any errors you may encounter, for example if you chose a model whose license you have not accepted.

If all goes well, you should see the following output, which means that your server is up and running:

Started server process [11]
Waiting for application startup.
Application startup complete.
Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
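If you would rather wait for the server from a script than watch the logs, the following sketch polls the app until it answers. It assumes that the vLLM API server exposes a /health route returning HTTP 200 once it is up; check this against the vLLM version in your image. The helper names http_probe and wait_until_ready are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, token: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request on `url` answers with HTTP 200."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections and timeouts.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30, delay: float = 10.0) -> bool:
    """Call `probe` until it succeeds or the number of attempts is exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A typical call would be wait_until_ready(lambda: http_probe(APP_URL + "/health", TOKEN)), where APP_URL and TOKEN are your app URL and access token.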

Interacting with your LLM

Once the server is up and running, we can interact with our LLM by hitting the /generate endpoint.

Using cURL

Make sure you change the ID to that of your application so that you target the right endpoint. In order for the request to be accepted, also specify the token that you generated previously by executing ovhai token create. Feel free to adapt the parameters of the request (prompt, max_tokens, temperature, etc.).

curl --request POST \
  --url https://<app-id>.app.gra.ai.cloud.ovh.net/generate \
  --header 'Authorization: Bearer <your-token>' \
  --header 'Content-Type: application/json' \
  --data '{
        "prompt": "<your-prompt>",
        "max_tokens": 50,
        "n": 1,
        "stream": false
}'

Using Python

Here too, you need to add your personal token and the correct link for your application.

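import requests

# Replace <app-id> with the ID of your AI Deploy app
APP_URL = "https://<app-id>.app.gra.ai.cloud.ovh.net"
# Token generated with `ovhai token create`
TOKEN = "AI_TOKEN_generated_with_CLI"

url = f"{APP_URL}/generate"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}",
}
data = {
    "prompt": "What is an LLM in AI?",
    "max_tokens": 100,
    "temperature": 0,
}

# json= serializes the payload to JSON for us
response = requests.post(url, headers=headers, json=data, timeout=60)
# Fail early on HTTP errors such as 401 (invalid token)
response.raise_for_status()

# The server returns the generated texts as a list under the "text" key
print(response.json()["text"][0])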
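Depending on the vLLM version, the texts returned by /generate may start with the prompt you sent. The sketch below shows one way to keep only the generated part. The helper name extract_completions is hypothetical, and the prompt-echo behaviour is an assumption to verify against your own responses.

```python
def extract_completions(prompt: str, payload: dict) -> list[str]:
    """Return the generated texts, removing the echoed prompt when present."""
    texts = payload.get("text")
    if not isinstance(texts, list):
        raise ValueError(f"Unexpected response body: {payload!r}")
    completions = []
    for text in texts:
        # Drop the prompt only if the server actually echoed it.
        if text.startswith(prompt):
            text = text[len(prompt):]
        completions.append(text.strip())
    return completions


# Hand-written sample body, not a real server response.
sample = {"text": ["What is an LLM? A large language model."]}
print(extract_completions("What is an LLM?", sample))
```

With a real request, you would call extract_completions(data["prompt"], response.json()) instead of indexing the JSON body directly.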

OVHcloud AI Endpoints

If you are not interested in building your own image and deploying your own LLM inference server, you can use OVHcloud’s new AI Endpoints product, which will definitely make your life easier!

AI Endpoints is a serverless solution that provides AI APIs, enabling you to easily use pre-trained and optimized AI models in your applications.

Overview of AI Endpoints

You can use LLM as a Service, choosing the desired model (such as LLaMA, Mistral, or Mixtral) and making an API call to use it in your application. This will allow you to interact with these models without even having to deploy them!

In addition to LLM capabilities, AI Endpoints also offers a range of other AI models, including speech-to-text, translation, summarization, embeddings and computer vision.

Best of all, AI Endpoints is currently in alpha phase and is free to use, making it an accessible and affordable solution for developers seeking to explore the possibilities of AI. Check this article and try it out today to discover the power of AI!

Join our Discord server to interact with the community and send us your feedback (#ai-endpoints channel)!
