<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Eléa Petton, Author at OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/author/elea-petton/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/author/elea-petton/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 10 Apr 2026 09:23:41 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Eléa Petton, Author at OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/author/elea-petton/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Reference Architecture: Deploying a vision-language model with vLLM on OVHcloud MKS for high performance inference and full observability</title>
		<link>https://blog.ovhcloud.com/reference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 10 Apr 2026 07:48:53 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30455</guid>

					<description><![CDATA[Ensure complete&#160;digital sovereignty&#160;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&#160;Managed Kubernetes Service. This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&#160;OVHcloud Managed Kubernetes Service&#160;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&#160;Qwen3-VL-8B-Instruct&#160;multimodal model (vision + text) with OpenAI-compatible API endpoints. This comprehensive [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em><em>Ensure complete&nbsp;<strong>digital sovereignty</strong>&nbsp;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&nbsp;<strong>Managed Kubernetes Service</strong>.</em></em></p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" width="703" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg" alt="vLLM on OVHcloud MKS for high availability and full observability" class="wp-image-31153" style="width:710px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg 703w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-206x300.jpg 206w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-768x1118.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-1055x1536.jpg 1055w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm.jpg 1260w" sizes="(max-width: 703px) 100vw, 703px" /><figcaption class="wp-element-caption"><em><em>vLLM on OVHcloud MKS for high availability and full observability</em></em></figcaption></figure>



<p>This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&nbsp;<a href="https://www.ovhcloud.com/fr/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Kubernetes Service</a>&nbsp;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen3-VL-8B-Instruct</a>&nbsp;multimodal model (vision + text) with OpenAI-compatible API endpoints.</p>



<p>This comprehensive guide shows you how to deploy, automatically scale, and monitor vLLM-based LLM workloads on the OVHcloud infrastructure.</p>



<p><strong>What are the key benefits?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-effectiveness:</strong>&nbsp;Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability:</strong>&nbsp;Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure:</strong>&nbsp;Keep all metrics and data within European datacentres</li>



<li><strong>Scalable by design:</strong>&nbsp;Automatically scale GPU inference replicas based on real workload demand</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p><strong>OVHcloud MKS</strong>&nbsp;is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p><strong>How does this benefit you?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Pay only for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, provided upgrades and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking, and persistent storage</li>



<li><strong>Scalable and flexible</strong>: Scale workloads and node pools easily to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p>In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Architecture overview</h2>



<p>This reference architecture demonstrates a basic deployment of vLLM for vision-language model inference on OVHcloud Managed Kubernetes Service, featuring:</p>



<ul class="wp-block-list">
<li><strong>High-availability deployment</strong>&nbsp;with 2 GPU nodes (NVIDIA L40S)</li>



<li><strong>Optimised GPU utilisation</strong>&nbsp;with proper driver configuration</li>



<li><strong>Scalable infrastructure</strong>&nbsp;supporting vision-language models</li>



<li><strong>Comprehensive monitoring</strong>&nbsp;using Prometheus, Grafana, and DCGM</li>



<li><strong>Full observability</strong>&nbsp;for both application and hardware metrics</li>
</ul>



<p><strong>Data flow</strong>:</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg" alt="" class="wp-image-30985" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1536x806.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-2048x1075.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Data Flow</em></figcaption></figure>



<ol class="wp-block-list">
<li><strong>Inference request:</strong>
<ul class="wp-block-list">
<li>User → LoadBalancer → Gateway → NGINX Ingress → &#8220;Qwen3 VL&#8221; Service → vLLM Pod → GPU</li>



<li>Response follows reverse path with streaming support</li>
</ul>
</li>



<li><strong>Metrics collection:</strong>
<ul class="wp-block-list">
<li>vLLM Pods expose <code>/metrics</code> endpoint (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">8000</mark></strong></code>)</li>



<li>DCGM Exporters expose GPU metrics (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">9400</mark></strong></code>)</li>



<li>Prometheus scrapes both endpoints every 30 seconds</li>



<li>Grafana queries Prometheus for visualization</li>
</ul>
</li>



<li><strong>Load distribution</strong>
<ul class="wp-block-list">
<li>NGINX Ingress uses cookie-based session affinity</li>



<li>vLLM Service uses ClientIP session affinity</li>



<li>Anti-affinity ensures 1 pod per GPU node</li>
</ul>
</li>
</ol>
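
<p>For example, once the stack from the steps below is up, a single streaming request exercises this entire path end to end. This is a sketch: it assumes the <code>$EXTERNAL_IP</code> variable and the <code>qwen3-vl-8b</code> model name that are set up later in this guide.</p>



<pre class="wp-block-code"><code class=""># -N disables curl buffering so streamed tokens appear as they arrive<br>curl -N http://$EXTERNAL_IP/v1/chat/completions \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "qwen3-vl-8b",<br>    "stream": true,<br>    "messages": [{"role": "user", "content": "Describe this architecture in one sentence."}]<br>  }'</code></pre>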



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">&nbsp;</a><strong><code>Administrator</code></strong>&nbsp;role</li>



<li><strong>Hugging Face access</strong>&nbsp;–&nbsp;<em>create a&nbsp;<a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></em></li>



<li><code><strong>kubectl</strong></code>&nbsp;and&nbsp;<code><strong>helm</strong></code>&nbsp;(version 3.x or later) already installed</li>
</ul>



<p><strong>🚀 Now you have all the ingredients, it’s time to deploy the recipe for&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen/Qwen3-VL-8B-Instruct</a>&nbsp;using vLLM and MKS!</strong></p>



<h2 class="wp-block-heading">Architecture guide: Native GPU deployment of vLLM on MKS with full stack observability</h2>



<p>This reference architecture describes a<strong>&nbsp;Large Language Model</strong>&nbsp;deployment using&nbsp;<strong>the vLLM inference server&nbsp;</strong>and&nbsp;<strong>Kubernetes</strong>, giving you a service that is both highly available and observable in real time.</p>



<h3 class="wp-block-heading">Step 1 &#8211; Create MKS cluster and Node pools</h3>



<p>From the&nbsp;<a href="https://www.ovh.com/manager/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">OVHcloud Control Panel</a>, create a Kubernetes cluster using <strong>MKS</strong>.</p>



<p>Navigate to: <code>Public Cloud</code> → <code>Managed Kubernetes Service</code> → <code>Create a cluster</code></p>



<h4 class="wp-block-heading">1. Configure cluster</h4>



<p>Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Name:</strong> <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code></li>



<li><strong>Location</strong>: 1-AZ Region &#8211; Gravelines (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">GRA11</mark></strong></code>)</li>



<li><strong>Plan:</strong> Free (or Standard)</li>



<li><strong>Network</strong>: attach a <strong>Private network </strong>(e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">0000 - AI Private Network</mark></strong></code>)</li>



<li><strong>Version:</strong> Latest stable (e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">1.34</mark></strong></code>)</li>
</ul>



<h4 class="wp-block-heading">2. Create GPU Node pool</h4>



<p>During the cluster creation, configure the vLLM Node pool for GPUs:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></code></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">L40S-90</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">2</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<p><strong>Why L40S-90?</strong></p>



<ul class="wp-block-list">
<li>Cost-effective for single-model deployment (1 GPU per node)</li>



<li>Sufficient RAM (90GB) for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Qwen3-VL-8B</mark></code></strong> model</li>
</ul>



<p>You should see your cluster (e.g.&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code>) in the list, along with the following information:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="930" height="588" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png" alt="" class="wp-image-30745" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png 930w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-300x190.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-768x486.png 768w" sizes="(max-width: 930px) 100vw, 930px" /></figure>



<p>You can now set up the node pool dedicated to monitoring.</p>



<h4 class="wp-block-heading">3. Create CPU Node pool</h4>



<p>From your cluster, click on <code><strong><mark class="has-inline-color has-ast-global-color-0-color">Add a node pool</mark></strong></code> and configure it as follows:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">B2-15</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">1</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>✅ <strong>Note</strong></p>



<p><strong><em>The monitoring stack can run on GPU nodes if cost is a concern. A dedicated CPU node provides better isolation and resource management.</em></strong></p>
</blockquote>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="365" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png" alt="" class="wp-image-30743" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-300x107.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-768x274.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation.png 1283w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the status is green with the&nbsp;<strong><code><mark class="has-inline-color has-ast-global-color-0-color">OK</mark></code></strong>&nbsp;label, you can proceed to the next step.</p>



<h4 class="wp-block-heading">4. Configure Kubernetes access</h4>



<p>Once your nodes have been provisioned, you can download the <strong>Kubeconfig file</strong> and configure kubectl with your MKS cluster.</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p>Returning:</p>



<pre class="wp-block-preformatted"><code>NAME                     STATUS   ROLES    AGE   VERSION<br>monitoring-node-xxxxxx   Ready    &lt;none&gt;   1d    v1.34.2<br>vllm-node-yyyyyy         Ready    &lt;none&gt;   1d    v1.34.2<br>vllm-node-zzzzzz         Ready    &lt;none&gt;   1d    v1.34.2</code></pre>



<p>Before going further, add a label to the CPU node for monitoring workloads.</p>



<pre class="wp-block-code"><code class="">CPU_NODE=$(kubectl get nodes -o json | \<br>  jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" == null) | .metadata.name')<br>kubectl label node $CPU_NODE node-role=monitoring</code></pre>



<p>Finally, check with the following command:</p>
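
<p>One way to produce the view below is with kubectl custom columns (a sketch; the <code>node-role</code> label is the one set in the previous command):</p>



<pre class="wp-block-code"><code class=""># list each node with its allocatable GPU count and its node-role label<br>kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,ROLE:.metadata.labels.node-role'</code></pre>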



<pre class="wp-block-code"><code class="">NAME                     GPU      ROLE<br>monitoring-node-xxxxxx   &lt;none&gt;   monitoring<br>vllm-node-yyyyyy         1        &lt;none&gt;<br>vllm-node-zzzzzz         1        &lt;none&gt;</code></pre>



<p>Once both nodes are in <strong>Ready</strong> status, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 2 &#8211; Install GPU operator</h3>



<p>To start, consider setting up the GPU operator.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ Note</strong></p>



<p><em><strong>This step is based on this OVHcloud documentation: <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-kubernetes-deploy-gpu-application?id=kb_article_view&amp;sysparm_article=KB0049707" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deploying a GPU application on OVHcloud Managed Kubernetes Service</a></strong></em></p>
</blockquote>



<h4 class="wp-block-heading">1. Add NVIDIA helm repository and create namespace</h4>



<p>Add NVIDIA helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add nvidia https://helm.ngc.nvidia.com/nvidia<br>helm repo update</code></pre>



<p>And create the namespace as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace gpu-operator</code></pre>



<h4 class="wp-block-heading">2. Install GPU operator with correct configuration</h4>



<p>The GPU Operator must be configured with specific driver versions to ensure compatibility with vLLM containers.</p>



<p>However, the default installation uses recent drivers (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">580.x</mark></strong></code> with <strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 13.x</mark></code></strong>) which are incompatible with vLLM containers (<strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.x</mark></code></strong>).</p>



<p><strong>Solution:</strong> Force driver version <strong><code><mark class="has-inline-color has-ast-global-color-0-color">535.183.01</mark></code></strong> (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.2</mark></strong></code>).</p>



<pre class="wp-block-code"><code class="">helm install gpu-operator nvidia/gpu-operator \<br>  -n gpu-operator \<br>  --set driver.enabled=true \<br>  --set driver.version="535.183.01" \<br>  --set toolkit.enabled=true \<br>  --set operator.defaultRuntime=containerd \<br>  --set devicePlugin.enabled=true \<br>  --set dcgmExporter.enabled=true \<br>  --set dcgmExporter.image="dcgm-exporter" \<br>  --set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04" \<br>  --set gfd.enabled=true \<br>  --set migManager.enabled=false \<br>  --set nodeStatusExporter.enabled=true \<br>  --set validator.driver.enable=false \<br>  --set validator.toolkit.enable=false \<br>  --set validator.plugin.enable=false \<br>  --timeout 20m</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>✅ <strong>Note </strong></p>



<p><em><strong>Specifying the DCGM version may only be necessary if you encounter problems with the default image (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color">‘ImagePullBackOff’</mark></code>). If this is the case, add the following parameters:<br><code><mark class="has-inline-color has-ast-global-color-0-color">--set dcgmExporter.repository="nvcr.io/nvidia/k8s"<br>--set dcgmExporter.image="dcgm-exporter"<br>--set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04"</mark></code></strong></em></p>
</blockquote>



<pre class="wp-block-code"><code class="">kubectl get pods -n gpu-operator</code></pre>



<p>Note that all pods should reach <strong>Running</strong> state in 5-10 minutes.</p>
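
<p>If you prefer to block until the driver pods are up rather than polling, here is a convenience sketch (it targets the driver DaemonSet label used later in this guide; other operator pods may still be initialising):</p>



<pre class="wp-block-code"><code class=""># wait until every NVIDIA driver pod reports Ready<br>kubectl wait --for=condition=Ready pod \<br>  -l app=nvidia-driver-daemonset \<br>  -n gpu-operator \<br>  --timeout=15m</code></pre>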



<p>You can also check the GPU availability:</p>



<pre class="wp-block-code"><code class="">kubectl get nodes -o json | jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | "\(.metadata.name): \(.status.allocatable."nvidia.com/gpu") GPU(s)"'</code></pre>



<p>Returning:</p>



<pre class="wp-block-preformatted"><code>vllm-node-yyyyyy: 1 GPU(s)<br>vllm-node-zzzzzz: 1 GPU(s)</code></pre>



<p>And you can test to run <code><strong><mark class="has-inline-color has-ast-global-color-0-color">nvidia-smi</mark></strong></code>:</p>



<pre class="wp-block-code"><code class="">DRIVER_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -1)<br>kubectl exec -n gpu-operator $DRIVER_POD -- nvidia-smi</code></pre>



<p>If the GPU tests are working properly, you can move on to the DCGM service configuration.</p>



<h4 class="wp-block-heading">3. Configure DCGM service</h4>



<p><strong>Why is DCGM Exporter required?</strong></p>



<p>DCGM (Data Center GPU Manager) is NVIDIA&#8217;s official tool for monitoring GPUs in production. The goal is to collect and display metrics from both GPU nodes.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="571" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg" alt="" class="wp-image-30746" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-300x167.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-768x428.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1536x856.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1.jpg 1733w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>GPU monitoring with DCGM</em></figcaption></figure>



<p>The metrics provided are:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_UTIL</mark></strong></code> &#8211; GPU utilisation (%)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_TEMP</mark></code></strong> &#8211; GPU temperature (°C)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_USED</mark></code></strong> &#8211; VRAM used (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_FREE</mark></code></strong> &#8211; Free VRAM (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_POWER_USAGE</mark></code></strong> &#8211; Power consumption (W)</li>



<li>And 50+ other GPU metrics</li>
</ul>
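
<p>You can eyeball these metrics straight from the exporter before wiring up Prometheus. Here is a sketch using a temporary port-forward (assuming the default <code>nvidia-dcgm-exporter</code> service created by the GPU Operator):</p>



<pre class="wp-block-code"><code class=""># forward the exporter port locally, query a few key metrics, then stop the forward<br>kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &amp;<br>sleep 2<br>curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|GPU_TEMP|FB_USED)'<br>kill %1</code></pre>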



<p>Next, ensure the DCGM service has the correct labels and port configuration:</p>



<pre class="wp-block-code"><code class="">kubectl patch svc nvidia-dcgm-exporter -n gpu-operator --type merge -p '{<br>  "metadata": {<br>    "labels": {<br>      "app": "nvidia-dcgm-exporter"<br>    }<br>  },<br>  "spec": {<br>    "ports": [<br>      {<br>        "name": "metrics",<br>        "port": 9400,<br>        "targetPort": 9400,<br>        "protocol": "TCP"<br>      }<br>    ]<br>  }<br>}'</code></pre>



<p>Verify the endpoints (should show 2 IPs, one per GPU node).</p>



<pre class="wp-block-code"><code class="">kubectl get endpoints nvidia-dcgm-exporter -n gpu-operator</code></pre>



<pre class="wp-block-preformatted"><code>NAME                   ENDPOINTS                   AGE<br>nvidia-dcgm-exporter   x.x.x.x:9400,x.x.x.x:9400   17d</code></pre>



<h3 class="wp-block-heading">Step 3 &#8211; Deploy Qwen3 VL 8B with vLLM inference server</h3>



<p>The deployment of the <strong>Qwen 3 VL 8B</strong> model on two L40S GPU nodes is carried out in several stages.</p>



<h4 class="wp-block-heading">1. Create namespace and Hugging Face secret</h4>



<p>Start by creating the namespace:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace vllm</code></pre>



<p>Next, you must retrieve your Hugging Face token and replace the&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">HF_TOKEN</mark></strong></code>&nbsp;value with your own:</p>



<pre class="wp-block-code"><code class="">export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"</code></pre>



<p>Create your secret as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create secret generic huggingface-secret \<br>  --from-literal=token=$HF_TOKEN \<br>  --namespace=vllm</code></pre>



<p>Verify that you obtain the following output by running:</p>



<pre class="wp-block-code"><code class="">kubectl get secret huggingface-secret -n vllm</code></pre>



<pre class="wp-block-preformatted"><code>NAME                 TYPE     DATA   AGE<br>huggingface-secret   Opaque   1      14d</code></pre>
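
<p>If you ever need to double-check the stored token, you can decode it back out of the secret (a sketch):</p>



<pre class="wp-block-code"><code class=""># the secret stores the token base64-encoded under the "token" key<br>kubectl get secret huggingface-secret -n vllm -o jsonpath='{.data.token}' | base64 -d</code></pre>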



<h4 class="wp-block-heading">2. Create vLLM deployment configuration</h4>



<p>First, you can create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-deployment-2nodes.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-deployment-2nodes.yaml</a></strong></code> file.</p>



<p>Deploy vLLM:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-deployment-2nodes.yaml</code></pre>



<p>You can monitor the deployment (it should take 8-10 minutes for model download and loading).</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n vllm -o wide -w</code></pre>



<p>Expected output after 10 minutes:</p>



<pre class="wp-block-code"><code class="">NAME               READY  STATUS   RESTARTS  AGE  IP       NODE  <br>qwen3-vl-xxxx-yyy  1/1    Running  0         1d   X.X.X.X  vllm-node-yyyyyy<br>qwen3-vl-xxxx-zzz  1/1    Running  0         1d   X.X.X.X  vllm-node-zzzzzz</code></pre>



<p>You can also check the container logs:</p>



<pre class="wp-block-code"><code class="">kubectl logs -f -n vllm &lt;pod-name&gt;</code></pre>



<p>You should find in the logs: &#8220;<code>Uvicorn running on http://0.0.0.0:8000</code>&#8221;</p>
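
<p>Before exposing anything publicly, you can also probe the server from inside the cluster. Here is a sketch using a temporary port-forward (vLLM&#8217;s OpenAI-compatible server exposes a <code>/health</code> endpoint that returns HTTP 200 once the engine is ready):</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n vllm svc/qwen3-vl-service 8000:8000 &amp;<br>sleep 2<br># expect "200"<br>curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health<br>kill %1</code></pre>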



<p>Is everything installed correctly? Then let&#8217;s move on to the next step.</p>



<h4 class="wp-block-heading">3. Add service label</h4>



<p>Ensure the service has the correct label for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">ServiceMonitor</mark></code></strong> discovery.</p>



<pre class="wp-block-code"><code class="">kubectl label svc qwen3-vl-service -n vllm app=qwen3-vl --overwrite</code></pre>



<p>You can now verify by launching the following command.</p>



<pre class="wp-block-code"><code class="">kubectl get svc qwen3-vl-service -n vllm --show-labels | grep "app=qwen3-vl"</code></pre>



<p>Returning:</p>



<p><code>qwen3-vl-service&nbsp; ClusterIP&nbsp; X.X.X.X &nbsp;&lt;none&gt;  8000/TCP  1d &nbsp;app=qwen3-vl</code></p>



<h3 class="wp-block-heading">Step 4 &#8211; Install NGINX ingress controller</h3>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><mark style="color:#cf2e2e" class="has-inline-color">⚠️ <strong>Moving beyond Ingress</strong></mark></p>



<p><strong><em><mark style="color:#cf2e2e" class="has-inline-color">Follow this <a href="https://blog.ovhcloud.com/moving-beyond-ingress-why-should-ovhcloud-managed-kubernetes-service-mks-users-start-looking-at-the-gateway-api/" data-wpel-link="internal">tutorial</a> if you want to use Gateway instead of Ingress.</mark></em></strong></p>
</blockquote>



<h4 class="wp-block-heading">1. Add helm repository and configure Ingress</h4>



<p>First of all, add helm repository:</p>



<pre class="wp-block-code"><code class="">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx<br>helm repo update</code></pre>



<p>Create configuration file with <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/ingress/ingress-nginx-values.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ingress-nginx-values.yaml</a></strong></code>.</p>



<p>Then, install NGINX Ingress:</p>



<pre class="wp-block-code"><code class="">helm install ingress-nginx ingress-nginx/ingress-nginx \<br>  --namespace ingress-nginx \<br>  --create-namespace \<br>  -f ingress-nginx-values.yaml \<br>  --wait</code></pre>



<p>Wait for LoadBalancer IP. The external IP assignment should take 1-2 minutes.</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n ingress-nginx ingress-nginx-controller -w</code></pre>



<p>Once <code><strong><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></strong></code> is no longer <code>&lt;pending&gt;</code>, press Ctrl+C and export it:</p>



<pre class="wp-block-code"><code class="">export EXTERNAL_IP=&lt;EXTERNAL-IP&gt;<br>echo "API URL: http://$EXTERNAL_IP"</code></pre>



<h4 class="wp-block-heading">2. Create vLLM Ingress resource</h4>



<p>Next, create vLLM Ingress using <strong><code><a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-ingress.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-ingress.yaml</a></code></strong>.</p>



<p>Apply it as follows:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-ingress.yaml</code></pre>



<p>You can now test different API calls to verify that your deployment is functional.</p>



<h4 class="wp-block-heading">3. Test API</h4>



<p>Firstly, check if the model is available:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/models | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "object": "list",<br>  "data": [<br>    {<br>      "id": "qwen3-vl-8b",<br>      "object": "model",<br>      "created": 1772472143,<br>      "owned_by": "vllm",<br>      "root": "Qwen/Qwen3-VL-8B-Instruct",<br>      "parent": null,<br>      "max_model_len": 8192,<br>      "permission": [<br>        {<br>          "id": "modelperm-8fb35cdd3208b068",<br>          "object": "model_permission",<br>          "created": 1772472143,<br>          "allow_create_engine": false,<br>          "allow_sampling": true,<br>          "allow_logprobs": true,<br>          "allow_search_indices": false,<br>          "allow_view": true,<br>          "allow_fine_tuning": false,<br>          "organization": "*",<br>          "group": null,<br>          "is_blocking": false<br>        }<br>      ]<br>    }<br>  ]<br>}</code></pre>



<p>Next, test inference using the following request:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/chat/completions \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "qwen3-vl-8b",<br>    "messages": [{"role": "user", "content": "Count from 1 to 10."}],<br>    "max_tokens": 100<br>  }' | jq '.choices[0].message.content'</code></pre>



<p><code>"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"</code></p>



<p>Great! You&#8217;re almost there…</p>



<h3 class="wp-block-heading">Step 5 &#8211; Install Prometheus stack</h3>



<p>Now, set up the monitoring stack that provides complete observability for&nbsp;<strong>application-level&nbsp;</strong>(vLLM) and&nbsp;<strong>hardware-level</strong>&nbsp;(GPU) metrics:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="763" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg" alt="" class="wp-image-30871" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-300x223.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-768x572.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1536x1144.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture.jpg 1673w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Monitoring architecture</em></figcaption></figure>



<h4 class="wp-block-heading">1. Add helm repository and create namespace</h4>



<p>Add Prometheus helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts<br>helm repo update</code></pre>



<p>Then, create the <code><strong><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></strong></code> Namespace.</p>



<pre class="wp-block-code"><code class="">kubectl create namespace monitoring</code></pre>



<h4 class="wp-block-heading">2. Create Prometheus deployment configuration and installation</h4>



<p>First, create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/prometheus.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">prometheus.yaml</a></strong></code> file.</p>



<p>Install Prometheus stack:</p>



<pre class="wp-block-code"><code class="">helm install prometheus prometheus-community/kube-prometheus-stack \<br>  -n monitoring \<br>  -f prometheus.yaml \<br>  --timeout 10m \<br>  --wait</code></pre>



<p>Now,&nbsp;monitor its installation and wait until the pods are ready:</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n monitoring -w</code></pre>



<p>If all pods are running successfully, you can proceed to the next step.</p>



<h4 class="wp-block-heading">3. Check that the installation is operational</h4>



<p>First, expose Grafana locally in the background:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &amp;</code></pre>



<p>Test Grafana health:</p>



<pre class="wp-block-code"><code class="">curl -s http://localhost:3000/api/health | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "database": "ok",<br>  "version": "12.3.3",<br>  "commit": "2a14494b2d6ab60f860d8b27603d0ccb264336f6"<br>}</code></pre>



<p>You can now access Grafana locally via <strong><a href="http://localhost:3000" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>http://localhost:3000</code></a></strong>. Log in with:</p>



<ul class="wp-block-list">
<li>Login: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">admin</mark></strong></code></li>



<li>Password: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">Admin123!vLLM</mark></strong></code></li>
</ul>
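
<p>If you did not set these credentials in <code>prometheus.yaml</code>, you can recover the generated admin password from the chart&#8217;s secret (a sketch; the secret name follows the <code>prometheus</code> release name used here):</p>



<pre class="wp-block-code"><code class="">kubectl get secret prometheus-grafana -n monitoring \<br>  -o jsonpath='{.data.admin-password}' | base64 -d; echo</code></pre>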



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="518" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png" alt="" class="wp-image-30804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-300x152.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-768x389.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2.png 1322w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Well done! You can now proceed to the configuration step.</p>



<h3 class="wp-block-heading">Step 6 &#8211; Configure ServiceMonitors</h3>



<p>ServiceMonitors are used to tell Prometheus which endpoints to scrape for metrics.</p>



<h4 class="wp-block-heading">1. Create vLLM ServiceMonitor</h4>



<p>Retrieve the file from the GitHub repository: <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/vllm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-servicemonitor.yaml</a></strong></code>.</p>



<p>Next, apply and check that the ServiceMonitor <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-metrics</mark></strong></code> exists:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-servicemonitor.yaml<br>kubectl get servicemonitor -n vllm</code></pre>



<h4 class="wp-block-heading">2. Create DCGM ServiceMonitor</h4>



<p>First, create the <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/dcgm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dcgm-servicemonitor.yaml</a></strong></code> file.</p>



<p>Once again, apply and verify:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f dcgm-servicemonitor.yaml<br>kubectl get servicemonitor -n gpu-operator</code></pre>



<pre class="wp-block-preformatted"><code>gpu-operator                  1d<br>nvidia-dcgm-exporter          1d<br>nvidia-node-status-exporter   1d</code></pre>



<h4 class="wp-block-heading">3. Configure Prometheus for Cross-Namespace discovery</h4>



<p>Apply a patch to allow Prometheus to discover ServiceMonitors in all namespaces.</p>



<pre class="wp-block-code"><code class="">kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type merge -p '{<br>  "spec": {<br>    "serviceMonitorNamespaceSelector": {},<br>    "podMonitorNamespaceSelector": {}<br>  }<br>}'</code></pre>



<p>Now you have to restart Prometheus.</p>



<ol class="wp-block-list">
<li>Delete Prometheus pod to force configuration reload</li>



<li>Wait for Prometheus to restart</li>
</ol>



<pre class="wp-block-code"><code class="">kubectl delete pod prometheus-prometheus-kube-prometheus-prometheus-0 -n monitoring<br><br>kubectl wait --for=condition=Ready \<br>  pod/prometheus-prometheus-kube-prometheus-prometheus-0 \<br>  -n monitoring \<br>  --timeout=180s</code></pre>



<p>Wait about 2 minutes for discovery and finally, verify targets:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring \<br>  prometheus-prometheus-kube-prometheus-prometheus-0 9090:9090 &amp;</code></pre>



<p>You can open in browser: <a href="http://localhost:9090/targets" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://localhost:9090/targets</mark></strong></code></a> and search for:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></strong></code></li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">dcgm</mark></code></strong></li>
</ul>



<p>Note that the expected targets are: </p>



<ul class="wp-block-list">
<li>serviceMonitor/vllm/vllm-metrics/0   (2/2 UP)</li>



<li>serviceMonitor/gpu-operator/nvidia-dcgm-exporter/0 (2/2 UP)</li>
</ul>
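
<p>You can also confirm this from the command line against the Prometheus API, using the port-forward opened above (a sketch; the <code>scrapePool</code> field matches the target names listed here):</p>



<pre class="wp-block-code"><code class=""># print each vLLM/DCGM scrape pool with its health ("up" expected)<br>curl -s http://localhost:9090/api/v1/targets | \<br>  jq -r '.data.activeTargets[] | select(.scrapePool | test("vllm|dcgm")) | "\(.scrapePool) \(.health)"'</code></pre>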



<h3 class="wp-block-heading">Step 7 &#8211; Create Grafana dashboards</h3>



<p>In this final step, the goal is to create two Grafana dashboards: one for the software side (vLLM application metrics) and one for the hardware side (GPU and system consumption).</p>



<h4 class="wp-block-heading">1. vLLM application metrics</h4>



<p>The dashboard provides insights into vLLM application performance, request handling, and resource utilization based on the following metrics:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>vllm:request_success_total</code></td><td>Counter</td><td>Total successful requests</td><td>count</td><td>Request Rate, Total Requests</td></tr><tr><td><code>vllm:num_requests_running</code></td><td>Gauge</td><td>Requests currently being processed</td><td>count</td><td>Queue Depth, Active Requests</td></tr><tr><td><code>vllm:num_requests_waiting</code></td><td>Gauge</td><td>Requests waiting in queue</td><td>count</td><td>Queue Depth, Queued Requests</td></tr><tr><td><code>vllm:time_to_first_token_seconds</code></td><td>Histogram</td><td>Latency until first token generated</td><td>seconds</td><td>TTFT P50/P95/P99</td></tr><tr><td><code>vllm:e2e_request_latency_seconds</code></td><td>Histogram</td><td>Total end-to-end latency</td><td>seconds</td><td>E2E Latency P50/P95/P99</td></tr><tr><td><code>vllm:generation_tokens_total</code></td><td>Counter</td><td>Total tokens generated (output)</td><td>count</td><td>Token Generation Rate, Throughput</td></tr><tr><td><code>vllm:prompt_tokens_total</code></td><td>Counter</td><td>Total prompt tokens (input)</td><td>count</td><td>Token Generation Rate, Avg Tokens</td></tr><tr><td><code>vllm:kv_cache_usage_perc</code></td><td>Gauge</td><td>GPU KV cache utilization</td><td>0-1 (0-100%)</td><td>KV Cache Usage</td></tr><tr><td><code>vllm:prefix_cache_hits_total</code></td><td>Counter</td><td>Number of prefix cache hits</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:prefix_cache_queries_total</code></td><td>Counter</td><td>Number of prefix cache queries</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:request_queue_time_seconds</code></td><td>Histogram</td><td>Time spent waiting in queue</td><td>seconds</td><td>Request Queue Time</td></tr><tr><td><code>vllm:request_prefill_time_seconds</code></td><td>Histogram</td><td>Prefill phase time</td><td>seconds</td><td>Prefill Time</td></tr><tr><td><code>vllm:request_decode_time_seconds</code></td><td>Histogram</td><td>Decode phase time</td><td>seconds</td><td>Decode Time</td></tr><tr><td><code>vllm:inter_token_latency_seconds</code></td><td>Histogram</td><td>Latency between each token</td><td>seconds</td><td>Inter-Token Latency</td></tr><tr><td><code>vllm:num_preemptions_total</code></td><td>Counter</td><td>Number of preemptions (OOM)</td><td>count</td><td>Preemptions</td></tr><tr><td><code>vllm:prompt_tokens_cached_total</code></td><td>Counter</td><td>Prompt tokens cached</td><td>count</td><td>Cached Tokens</td></tr><tr><td><code>vllm:request_prompt_tokens</code></td><td>Histogram</td><td>Prompt size distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:request_generation_tokens</code></td><td>Histogram</td><td>Generated tokens distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:iteration_tokens_total</code></td><td>Histogram</td><td>Tokens per iteration</td><td>count</td><td>(Advanced analysis)</td></tr></tbody></table></figure>



<p>This <strong>vLLM Grafana dashboard</strong> is composed of 23 panels:</p>






<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Nombre</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>12</td><td>Request Rate, Queue Depth, TTFT, E2E Latency, Token Gen, Cache Usage, Cache Hit, Queue Time, Prefill/Decode, Inter-Token, Preemptions, Avg Tokens</td></tr><tr><td><strong>Stat</strong></td><td>10</td><td>Throughput, TTFT P95, Active Req, Queued Req, Cache Hit Rate, Cache Usage, Total Req, Total Tokens, Cached Tokens, Preemptions</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Pod Performance</td></tr></tbody></table></figure>



<p>Now create the dashboard using <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/vllm-app-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-app-dashboard.json</a></strong></code>. Then, launch:</p>



<pre class="wp-block-code"><code class="">echo "Importing vLLM application dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @vllm-app-dashboard.json | jq '.status, .url'</code></pre>



<p>Next, you can access the vLLM dashboard and follow metrics in real time:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png" alt="" class="wp-image-30858" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Tracking hardware consumption is also essential for comprehensive monitoring.</p>



<h4 class="wp-block-heading">2. GPU hardware metrics</h4>



<p>Take advantage of the most useful DCGM metrics to check both the functioning and consumption of your hardware resources:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Normal Thresholds</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>DCGM_FI_DEV_GPU_UTIL</code></td><td>Gauge</td><td>GPU utilization (compute)</td><td>% (0-100)</td><td>70-95% optimal</td><td>GPU Utilization</td></tr><tr><td><code>DCGM_FI_DEV_GPU_TEMP</code></td><td>Gauge</td><td>GPU temperature</td><td>°C</td><td>&lt; 85°C normal</td><td>GPU Temperature</td></tr><tr><td><code>DCGM_FI_DEV_FB_USED</code></td><td>Gauge</td><td>VRAM used</td><td>MB</td><td>Variable by model</td><td>GPU Memory Used</td></tr><tr><td><code>DCGM_FI_DEV_FB_FREE</code></td><td>Gauge</td><td>VRAM free</td><td>MB</td><td>&gt; 2GB recommended</td><td>GPU Memory Free</td></tr><tr><td><code>DCGM_FI_DEV_POWER_USAGE</code></td><td>Gauge</td><td>Power consumption</td><td>Watts</td><td>&lt; 300W (L40S)</td><td>GPU Power Usage</td></tr><tr><td><code>DCGM_FI_DEV_SM_CLOCK</code></td><td>Gauge</td><td>GPU clock speed (compute)</td><td>MHz</td><td>Variable</td><td>GPU Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_MEM_CLOCK</code></td><td>Gauge</td><td>Memory clock speed</td><td>MHz</td><td>Variable</td><td>Memory Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL</code></td><td>Counter</td><td>Total NVLink bandwidth</td><td>bytes/s</td><td>(If multi-GPU)</td><td>NVLink Bandwidth</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_TX_BYTES</code></td><td>Counter</td><td>PCIe data transmitted</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe TX</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_RX_BYTES</code></td><td>Counter</td><td>PCIe data received</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe RX</td></tr><tr><td><code>DCGM_FI_DEV_ECC_DBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC double-bit errors</td><td>count</td><td>0 ideal</td><td>(Health check)</td></tr><tr><td><code>DCGM_FI_DEV_ECC_SBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC single-bit errors</td><td>count</td><td>&lt; 10/day acceptable</td><td>(Health check)</td></tr></tbody></table></figure>



<p>This&nbsp;<strong>hardware Grafana dashboard</strong>&nbsp;is composed of 13 panels with GPU hardware and system metrics. A detailed view is also available for GPU utilisation (%), temperature (°C), VRAM (GB) and power (W).</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Count</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>8</td><td>GPU Util, GPU Mem, GPU Temp, GPU Power, CPU Usage, RAM Usage, Network I/O, Disk I/O</td></tr><tr><td><strong>Stat</strong></td><td>4</td><td>Avg GPU Util, Avg GPU Temp, Total GPU Mem, Total GPU Power</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Hardware Status</td></tr></tbody></table></figure>



<p>Now create the hardware dashboard using <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/hardware-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">hardware-dashboard.json</a></strong></code> and load it as follows:</p>



<pre class="wp-block-code"><code class="">echo "Importing hardware dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @hardware-dashboard.json | jq '.status, .url'</code></pre>



<p>Finally, track resource consumption using this hardware dashboard:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png" alt="" class="wp-image-30859" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congratulations! Everything is working. You can now test your model and track the various metrics in real time.</p>



<h3 class="wp-block-heading">Step 8 &#8211; LLM testing and performance tracking</h3>



<p>Start by installing Python dependencies:</p>



<pre class="wp-block-code"><code class="">pip3 install openai tqdm</code></pre>



<p>Set the <strong><mark class="has-inline-color has-ast-global-color-0-color">APP_URL</mark></strong> value to your vLLM service external IP (the script below shows an example IP) and launch the performance test using the following <a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/llm-inference-performance-test.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong>Python code</strong></code></a>:</p>



<pre class="wp-block-code"><code class="">import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br>from tqdm import tqdm<br><br>APP_URL = "http://94.23.185.22/v1"<br>MODEL = "qwen3-vl-8b"<br><br>CONCURRENT_WORKERS = 500          # concurrency<br>REQUESTS_PER_WORKER = 10<br>MAX_TOKENS = 200                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># openai compliance<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key="foo"<br>)<br><br># basic metrics<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception as e:<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("\n-&gt; STARTING PERFORMANCE TEST:")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n-&gt; BENCH RESULTS:")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>



<p>Returning:</p>



<pre class="wp-block-preformatted"><code>-&gt; STARTING PERFORMANCE TEST:</code><br><code>Concurrency: 500<br>Total requests: 5000</code><br><code><br>-&gt; BENCH RESULTS:<br>Total requests sent: 5000<br>Successful requests: 5000<br>Errors: 0<br>Total wall time: 225.54s<br>Avg latency: 21.45s<br>Min latency: 6.06s<br>Max latency: 25.19s<br>Throughput: 22.17 req/s</code></pre>



<p>Don&#8217;t forget to track GPU and vLLM metrics in your Grafana dashboards!</p>



<h2 class="wp-block-heading">Conslusion</h2>



<p>This reference architecture demonstrates a<strong>&nbsp;vLLM deployment on OVHcloud Managed Kubernetes Service (MKS)</strong>&nbsp;with comprehensive GPU monitoring. Benefits include:</p>



<ul class="wp-block-list">
<li><strong>High Performance</strong>: GPU-accelerated inference with L40S</li>



<li><strong>Scalability</strong>: Kubernetes-native, horizontal scaling-ready</li>



<li><strong>Reliability</strong>: Health checks, auto-restart, monitoring</li>



<li><strong>API Compatibility</strong>: OpenAI-compatible endpoints</li>



<li><strong>Multimodality</strong>: Vision &amp; text capabilities</li>



<li><strong>Full stack monitoring</strong>: Complete vLLM application and hardware dashboards</li>
</ul>



<h2 class="wp-block-heading">Going Further</h2>



<p>Your current architecture is&nbsp;<strong>functional</strong>. If you wish to harden it into a <strong>full production-ready solution</strong>, consider the following enhancements:</p>



<ol class="wp-block-list">
<li><strong>Authentication &amp; authorization</strong>
<ul class="wp-block-list">
<li>vLLM API authentication</li>



<li>Grafana authentication</li>



<li>Prometheus security</li>
</ul>
</li>



<li><strong>High availability &amp; load balancing</strong>
<ul class="wp-block-list">
<li>Grafana high availability with multiple replicas and shared storage</li>



<li>Prometheus high availability</li>



<li>vLLM Horizontal Pod Autoscaling (HPA) based on custom metrics</li>
</ul>
</li>



<li><strong>Data persistence &amp; backup</strong>
<ul class="wp-block-list">
<li>Prometheus long-term storage with persistent storage</li>



<li>Grafana Dashboard Backup</li>
</ul>
</li>



<li><strong>Observability enhancements</strong>
<ul class="wp-block-list">
<li>Distributed tracing by adding OpenTelemetry for request tracing</li>



<li>Alerting rules with production-ready alert rules</li>
</ul>
</li>
</ol>



<p></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability%2F&amp;action_name=Reference%20Architecture%3A%20Deploying%20a%20vision-language%20model%20with%20vLLM%20on%20OVHcloud%20MKS%20for%20high%20performance%20inference%20and%20full%20observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: Custom metric autoscaling for LLM inference with vLLM on OVHcloud AI Deploy and observability using MKS</title>
		<link>https://blog.ovhcloud.com/reference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 10 Feb 2026 08:51:11 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[MKS]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30203</guid>

					<description><![CDATA[Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability. This reference architecture describes a comprehensive solution for deploying, autoscaling and monitoring vLLM-based LLM workloads on OVHcloud infrastructure. It combinesAI Deploy, used for model serving with custom metric autoscaling, and Managed Kubernetes Service (MKS), which [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks%2F&amp;action_name=Reference%20Architecture%3A%20Custom%20metric%20autoscaling%20for%20LLM%20inference%20with%20vLLM%20on%20OVHcloud%20AI%20Deploy%20and%20observability%20using%20MKS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em><strong>Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability.</strong></em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg" alt="" class="wp-image-30579" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability based on OVHcloud infrastructure</em></figcaption></figure>



<p>This reference architecture describes a comprehensive solution for <strong>deploying, autoscaling and monitoring vLLM-based LLM workloads</strong> on OVHcloud infrastructure. It combines <strong>AI Deploy</strong>, used for <strong>model serving with custom metric autoscaling</strong>, and <strong>Managed Kubernetes Service (MKS)</strong>, which hosts the monitoring and observability stack.</p>



<p>By leveraging <strong>application-level Prometheus metrics exposed by vLLM</strong>, AI Deploy can automatically scale inference replicas based on real workload demand, ensuring <strong>high availability, consistent performance under load and efficient GPU utilisation</strong>. This autoscaling mechanism allows the platform to react dynamically to traffic spikes while maintaining predictable latency for end users.</p>



<p>On top of this scalable inference layer, the monitoring architecture provides <strong>observability</strong> through <strong>Prometheus</strong>, <strong>Grafana</strong> and Alertmanager. It enables real-time performance monitoring, capacity planning, and operational insights, while ensuring <strong>full data sovereignty</strong> for organisations running Large Language Models (LLMs) in production environments.</p>



<p><strong>What are the key benefits</strong>?</p>



<ul class="wp-block-list">
<li><strong>Cost-effective</strong>: Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability</strong>: Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure</strong>: All metrics and data remain within European datacentres</li>



<li><strong>Production-ready</strong>: Persistent storage, high availability, and automated monitoring</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">AI Deploy</h3>



<p>OVHcloud AI Deploy is a<strong>&nbsp;Container as a Service</strong>&nbsp;(CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications/APIs based on Machine Learning (ML), Deep Learning (DL) or Large Language Models (LLMs).</p>



<p><strong>Key points to keep in mind</strong>:</p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong>&nbsp;Bring your own custom Docker image and deploy it with a single command line or a few clicks</li>



<li><strong>High-performance computing:</strong>&nbsp;A complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong>&nbsp;Supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong>&nbsp;Billing per minute, no surcharges</li>
</ul>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p><strong>OVHcloud MKS</strong> is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p><strong>What should you keep in mind?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Only pay for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, upgrades and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking and persistent storage</li>



<li><strong>Scalability and flexibility</strong>: Easily scale workloads and node pools to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p>In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Overview of the architecture</h2>



<p>This reference architecture describes a <strong>complete, secure and scalable solution</strong> to:</p>



<ul class="wp-block-list">
<li>Deploy an LLM with vLLM and <strong>AI Deploy</strong>, benefiting from automatic scaling based on custom metrics to ensure high service availability &#8211; vLLM exposes <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>/metrics</strong></mark></code> via its public HTTPS endpoint on AI Deploy</li>



<li>Collect, store and visualise these vLLM metrics using Prometheus and Grafana on <strong>MKS</strong></li>
</ul>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="1200" height="630" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg" alt="" class="wp-image-30578" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-768x403.jpg 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability architecture overview</em></figcaption></figure>



<p>Here you will find the main components of the architecture. The solution comprises three main layers:</p>



<ol class="wp-block-list">
<li><strong>Model serving layer</strong> with AI Deploy
<ul class="wp-block-list">
<li>vLLM containers running on top of GPUs for LLM inference</li>



<li>vLLM inference server exposing Prometheus metrics</li>



<li>Automatic scaling based on custom metrics to ensure high availability</li>



<li>HTTPS endpoints with Bearer token authentication</li>
</ul>
</li>



<li><strong>Monitoring and observability infrastructure</strong> using Kubernetes
<ul class="wp-block-list">
<li>Prometheus for metrics collection and storage</li>



<li>Grafana for visualisation and dashboards</li>



<li>Persistent volume storage for long-term retention</li>
</ul>
</li>



<li><strong>Network layer</strong>
<ul class="wp-block-list">
<li>Secure HTTPS communication between components</li>



<li>OVHcloud LoadBalancer for external access</li>
</ul>
</li>
</ol>



<p>Before going further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> </a><strong><code><mark class="has-inline-color has-ast-global-color-0-color">Administrator</mark></code></strong> role</li>



<li><strong>ovhai CLI available</strong> &#8211;&nbsp;<em>install the&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a></em></li>



<li>A <strong>Hugging Face access</strong> &#8211; <em>create a&nbsp;<a href="https://huggingface.co/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">access token</a></em></li>



<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></strong></code> installed and <code><strong><mark class="has-inline-color has-ast-global-color-0-color">helm</mark></strong></code> installed (at least version 3.x)</li>
</ul>



<p><strong>🚀 Now you have all the ingredients for our recipe, it’s time to deploy Ministral 3 14B using AI Deploy and the vLLM Docker container!</strong></p>



<h2 class="wp-block-heading">Architecture guide: From autoscaling to observability for LLMs served by vLLM</h2>



<p>Let’s set up and deploy this architecture!</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg" alt="" class="wp-image-30580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Overview of the deployment workflow</em></figcaption></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><strong><em>In this example, <a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a> is used. Choose the open-source model of your choice and follow the same steps, adapting the model slug (from Hugging Face), the versions and the GPU(s) flavour.</em></strong></p>
</blockquote>



<p><em>Remember that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Manage access tokens</h3>



<p>Before deploying the model, set up the two credentials this architecture relies on: a <strong>Hugging Face token</strong> to authenticate model downloads, and an <strong>AI Deploy Bearer token</strong> to secure access to the inference endpoint once it is deployed.</p>



<p>Export your&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face token</a>.</p>



<pre class="wp-block-code"><code class="">export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx</code></pre>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Create a Bearer token</a>&nbsp;to access your AI Deploy app once it&#8217;s been deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>Returning the following output:</p>



<pre class="wp-block-preformatted"><code>Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-01-26 11:53:05<br>Updated At: 20-01-26 11:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</code></pre>



<p>You can now store and export your access token:</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h3 class="wp-block-heading">Step 2 &#8211; LLM deployment using AI Deploy</h3>



<p>Before introducing the monitoring stack, this architecture starts with the <strong>deployment of the <strong>Ministral 3 14B</strong> on OVHcloud AI Deploy</strong>, configured to <strong>autoscale based on custom Prometheus metrics exposed by vLLM itself</strong>.</p>



<h4 class="wp-block-heading">1. Define the targeted vLLM metric for autoscaling</h4>



<p>Before proceeding with the deployment of the <strong>Ministral 3 14B</strong> endpoint, you have to choose the metric you want to use as the trigger for scaling.</p>



<p>Instead of relying solely on CPU/RAM utilisation, AI Deploy allows autoscaling decisions to be driven by <strong>application-level signals</strong>.</p>



<p>To do this, you can consult the <a href="https://docs.vllm.ai/en/latest/design/metrics/#v1-metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">metrics exposed by vLLM</a>.</p>



<p>In this example, you can use a basic metric such as <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>vllm:num_requests_running</strong></mark></code> to scale the number of replicas based on <strong>real inference load</strong>.</p>



<p>This enables:</p>



<ul class="wp-block-list">
<li>Faster reaction to traffic spikes</li>



<li>Better GPU utilisation</li>



<li>Reduced inference latency under load</li>



<li>Cost-efficient scaling</li>
</ul>



<p>Finally, the configuration chosen for scaling this application is as follows:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Value</th><th>Description</th></tr></thead><tbody><tr><td>Metric source</td><td><code>/metrics</code></td><td>vLLM Prometheus endpoint</td></tr><tr><td>Metric name</td><td><code>vllm:num_requests_running</code></td><td>Number of in-flight requests</td></tr><tr><td>Aggregation</td><td><code>AVERAGE</code></td><td>Mean across replicas</td></tr><tr><td>Target value</td><td><code>50</code></td><td>Desired load per replica</td></tr><tr><td>Min replicas</td><td><code>1</code></td><td>Baseline capacity</td></tr><tr><td>Max replicas</td><td><code>3</code></td><td>Burst capacity</td></tr></tbody></table></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><em><strong>You can choose the metric that best suits your use case. You can also apply a patch to your AI Deploy deployment at any time to change the target metric for scaling</strong></em>.</p>
</blockquote>



<p>When the <strong>average number of running requests exceeds 50</strong>, AI Deploy automatically provisions <strong>additional GPU-backed replicas</strong>.</p>



<h4 class="wp-block-heading">2. Deploy Ministral 3 14B using AI Deploy</h4>



<p>Now you can deploy the LLM using the <strong><code>ovhai</code> CLI</strong>.</p>



<p>Key elements necessary for proper functioning:</p>



<ul class="wp-block-list">
<li>GPU-based inference: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">1 x H100</mark></code></strong></li>



<li>vLLM OpenAI-compatible Docker image: <a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.13.0</mark></code></strong></a></li>



<li>Custom autoscaling rules based on Prometheus metrics: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm:num_requests_running</mark></strong></code></li>
</ul>



<p>Below is the reference command used to deploy the <strong><a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a></strong>:</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>  --name vllm-ministral-14B-autoscaling-custom-metric \<br>  --default-http-port 8000 \<br>  --label ai_deploy_token=my_operator_token \<br>  --gpu 1 \<br>  --flavor h100-1-gpu \<br>  -e OUTLINES_CACHE_DIR=/tmp/.outlines \<br>  -e HF_TOKEN=$MY_HF_TOKEN \<br>  -e HF_HOME=/hub \<br>  -e HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>  -e HF_HUB_ENABLE_HF_TRANSFER=0 \<br>  -v standalone:/hub:rw \<br>  -v standalone:/workspace:rw \<br>  --liveness-probe-path /health \<br>  --liveness-probe-port 8000 \<br>  --liveness-initial-delay-seconds 300 \<br>  --probe-path /v1/models \<br>  --probe-port 8000 \<br>  --initial-delay-seconds 300 \<br>  --auto-min-replicas 1 \<br>  --auto-max-replicas 3 \<br>  --auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics" \<br>  --auto-custom-metric-format PROMETHEUS \<br>  --auto-custom-value-location vllm:num_requests_running \<br>  --auto-custom-target-value 50 \<br>  --auto-custom-metric-aggregation-type AVERAGE \<br>  vllm/vllm-openai:v0.13.0 \<br>  -- bash -c "python3 -m vllm.entrypoints.openai.api_server \<br>    --model mistralai/Ministral-3-14B-Instruct-2512 \<br>    --tokenizer_mode mistral \<br>    --load_format mistral \<br>    --config_format mistral \<br>    --enable-auto-tool-choice \<br>    --tool-call-parser mistral \<br>    --enable-prefix-caching"</code></pre>



<p>Let’s break down the different parameters of this command.</p>



<h5 class="wp-block-heading"><strong>a. Start your AI Deploy app</strong></h5>



<p>Launch a new app using&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a>&nbsp;and name it.</p>



<p><code><strong>ovhai app run --name vllm-ministral-14B-autoscaling-custom-metric</strong></code></p>



<h5 class="wp-block-heading"><strong>b. Define access</strong></h5>



<p>Define the HTTP API port and restrict access to your token.</p>



<p><strong><code>--default-http-port 8000</code><br><code>--label ai_deploy_token=my_operator_token</code></strong></p>



<h5 class="wp-block-heading"><strong>c. Configure GPU resources</strong></h5>



<p>Specify the hardware type (<code><strong>h100-1-gpu</strong></code>), which refers to an&nbsp;<strong>NVIDIA H100 GPU</strong>, and the number of GPUs (<strong>1</strong>).</p>



<p><code><strong>--gpu 1<br>--flavor h100-1-gpu</strong></code></p>



<p><strong><mark>⚠️WARNING!</mark></strong>&nbsp;For this model, one H100 is sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access L40S and A100 GPUs for your LLM deployment.</p>



<h5 class="wp-block-heading"><strong>d. Set up environment variables</strong></h5>



<p>Configure caching for the&nbsp;<strong>Outlines library</strong>&nbsp;(used for efficient text generation):</p>



<p><code><strong>-e OUTLINES_CACHE_DIR=/tmp/.outlines</strong></code></p>



<p>Pass the&nbsp;<strong>Hugging Face token</strong>&nbsp;(<code>$MY_HF_TOKEN</code>) for model authentication and download:</p>



<p><code><strong>-e HF_TOKEN=$MY_HF_TOKEN</strong></code></p>



<p>Set the&nbsp;<strong>Hugging Face cache directory</strong>&nbsp;to&nbsp;<code>/hub</code>&nbsp;(where models will be stored):</p>



<p><code><strong>-e HF_HOME=/hub</strong></code></p>



<p>Allow execution of&nbsp;<strong>custom remote code</strong>&nbsp;from Hugging Face datasets (required for some model behaviours):</p>



<p><code><strong>-e HF_DATASETS_TRUST_REMOTE_CODE=1</strong></code></p>



<p>Disable&nbsp;<strong>Hugging Face Hub transfer acceleration</strong>&nbsp;(to use standard model downloading):</p>



<p><code><strong>-e HF_HUB_ENABLE_HF_TRANSFER=0</strong></code></p>



<h5 class="wp-block-heading"><strong>e. Mount persistent volumes</strong></h5>



<p>Mount&nbsp;<strong>two persistent storage volumes</strong>:</p>



<ol class="wp-block-list">
<li><code>/hub</code>&nbsp;→ Stores Hugging Face model files</li>



<li><code>/workspace</code>&nbsp;→ Main working directory</li>
</ol>



<p>The&nbsp;<code>rw</code>&nbsp;flag means&nbsp;<strong>read-write access</strong>.</p>



<p><code><strong>-v standalone:/hub:rw<br>-v standalone:/workspace:rw</strong></code></p>



<h5 class="wp-block-heading"><strong>f. Health checks and readiness</strong></h5>



<p>Configure <strong>liveness and readiness probes</strong>:</p>



<ol class="wp-block-list">
<li><code>/health</code> verifies the container is alive</li>



<li><code>/v1/models</code> confirms the model is loaded and ready to serve requests</li>
</ol>



<p>The long initial delays (300 seconds) correspond to the startup time of vLLM and the loading of the model on the GPU; adjust them to match your model&#8217;s actual load time.</p>



<p><code><strong>--liveness-probe-path /health<br>--liveness-probe-port 8000<br>--liveness-initial-delay-seconds 300<br><br>--probe-path /v1/models<br>--probe-port 8000<br>--initial-delay-seconds 300</strong></code></p>



<h5 class="wp-block-heading"><strong>g. Autoscaling configuration (custom metrics)</strong></h5>



<p>First set the minimum and maximum number of replicas.</p>



<p><strong><code>--auto-min-replicas 1<br>--auto-max-replicas 3</code></strong></p>



<p>This guarantees basic availability (one replica always up) while allowing for peak capacity.</p>



<p>Then enable autoscaling based on application-level metrics exposed by vLLM.</p>



<p><strong><code>--auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics"<br>--auto-custom-metric-format PROMETHEUS<br>--auto-custom-value-location vllm:num_requests_running<br>--auto-custom-target-value 50<br>--auto-custom-metric-aggregation-type AVERAGE</code></strong></p>



<p>AI Deploy:</p>



<ul class="wp-block-list">
<li>Scrapes the local <mark class="has-inline-color has-ast-global-color-0-color"><strong><code>/metrics</code></strong></mark> endpoint</li>



<li>Parses Prometheus-formatted metrics</li>



<li>Extracts the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>vllm:num_requests_running</code></mark></strong> gauge</li>



<li>Computes the average value across replicas</li>
</ul>



<p>Scaling behaviour:</p>



<ul class="wp-block-list">
<li>When the average number of in-flight requests exceeds <strong><code><mark class="has-inline-color has-ast-global-color-0-color">50</mark></code></strong>, AI Deploy adds replicas</li>



<li>When load decreases, replicas are scaled down</li>
</ul>



<p>This approach ensures high availability and predictable latency under fluctuating traffic.</p>
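


<p>AI Deploy does not publish its exact controller algorithm, but target-value autoscaling typically follows the same ratio rule as the Kubernetes HPA. Below is a simplified sketch of the scaling decision under that assumption, using the target of 50 and the 1–3 replica bounds configured above:</p>



<pre class="wp-block-code"><code class="">import math<br><br>def desired_replicas(avg_running, current, target=50, min_r=1, max_r=3):<br>    """HPA-style ratio rule (assumption): scale so that the average<br>    vllm:num_requests_running per replica moves toward the target."""<br>    raw = current * (avg_running / target)<br>    return max(min_r, min(max_r, math.ceil(raw)))<br><br>print(desired_replicas(avg_running=120, current=1))  # burst: 1 -&gt; 3 replicas<br>print(desired_replicas(avg_running=20, current=3))   # cooldown: 3 -&gt; 2 replicas</code></pre>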



<h5 class="wp-block-heading"><strong>h. Choose the target Docker image and the startup command</strong></h5>



<p>Use the official <strong><a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM OpenAI-compatible Docker image</a></strong>.</p>



<p><strong><code>vllm/vllm-openai:v0.13.0</code></strong></p>



<p>Finally, run the model inside the container using a Python command to launch the vLLM API server:</p>



<ul class="wp-block-list">
<li><strong><code>python3 -m vllm.entrypoints.openai.api_server</code></strong>&nbsp;→ Starts the OpenAI-compatible vLLM API server</li>



<li><strong><code>--model mistralai/Ministral-3-14B-Instruct-2512</code></strong>&nbsp;→ Loads the&nbsp;<strong>Ministral 3 14B</strong>&nbsp;model from Hugging Face</li>



<li><strong><code>--tokenizer_mode mistral</code></strong>&nbsp;→ Uses the&nbsp;<strong>Mistral tokenizer</strong></li>



<li><strong><code>--load_format mistral</code></strong>&nbsp;→ Uses Mistral’s model loading format</li>



<li><strong><code>--config_format mistral</code></strong>&nbsp;→ Ensures the model configuration follows Mistral’s standard</li>



<li><code><strong>--enable-auto-tool-choice </strong></code>→ Automatically invokes tools when needed (function/tool calling)</li>



<li><strong><code>--tool-call-parser mistral </code></strong>→ Tool calling support</li>



<li><strong><code>--enable-prefix-caching</code></strong> → Prefix caching for improved throughput and reduced latency</li>
</ul>



<p>You can now launch this command using <strong>ovhai CLI</strong>.</p>



<h4 class="wp-block-heading">3. Check AI Deploy app status</h4>



<p>You can now check if your&nbsp;<strong>AI Deploy</strong>&nbsp;app is alive:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;your_vllm_app_id&gt;</code></pre>



<p><strong>Is your app in&nbsp;<code>RUNNING</code>&nbsp;status?</strong>&nbsp;Perfect! You can check in the logs that the server is started:</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;your_vllm_app_id&gt;</code></pre>



<p><strong><mark>⚠️WARNING!</mark></strong>&nbsp;This step may take a little time as the LLM must be loaded.</p>



<h4 class="wp-block-heading">4. Test that the deployment is functional</h4>



<p>First you can request and send a prompt to the LLM. Launch the following query by asking the question of your choice:</p>



<pre class="wp-block-code"><code class="">curl https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \<br>  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "mistralai/Ministral-3-14B-Instruct-2512",<br>    "messages": [<br>      {"role": "system", "content": "You are a helpful assistant."},<br>      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}<br>    ],<br>    "stream": false<br>  }'</code></pre>



<p>You can also verify access to vLLM metrics.</p>



<pre class="wp-block-code"><code class="">curl -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/metrics</code></pre>



<p>If both tests show that the model deployment is functional and you receive HTTP 200 responses, you are ready to move on to the next step!</p>
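


<p>If you prefer scripting these checks, here is a small sketch that fetches the <code>/metrics</code> endpoint and prints the gauge used for autoscaling (it assumes the <code>requests</code> package is installed and uses the same app ID placeholder):</p>



<pre class="wp-block-code"><code class="">import os<br>import requests<br><br>APP_URL = "https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net"<br>headers = {"Authorization": f"Bearer {os.environ['MY_OVHAI_ACCESS_TOKEN']}"}<br><br>resp = requests.get(f"{APP_URL}/metrics", headers=headers, timeout=10)<br>resp.raise_for_status()  # expect HTTP 200<br><br># Prometheus text format: one "name{labels} value" pair per non-comment line<br>for line in resp.text.splitlines():<br>    if line.startswith("vllm:num_requests_running"):<br>        print(line)</code></pre>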



<p>The next step is to set up the observability and monitoring stack. This autoscaling mechanism is <strong>fully independent</strong> of the Prometheus instance used for observability:</p>



<ul class="wp-block-list">
<li>AI Deploy queries the local <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code></mark></strong> endpoint internally</li>



<li>Prometheus scrapes the <strong>same metrics endpoint</strong> externally for monitoring, dashboards and potentially alerting</li>
</ul>



<p>This ensures:</p>



<ul class="wp-block-list">
<li>A single source of truth for metrics</li>



<li>No duplication of exporters</li>



<li>Consistent signals for scaling and observability</li>
</ul>



<h3 class="wp-block-heading">Step 3 &#8211; Create an MKS cluster</h3>



<p>From <a href="https://manager.eu.ovhcloud.com/#/hub/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Control Panel</a>, create a Kubernetes cluster using the <strong>MKS</strong>.</p>



<p>Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Location</strong>: GRA (Gravelines) &#8211; <em>you can select the same region as for AI Deploy</em></li>



<li><strong>Network</strong>: Public</li>



<li><strong>Node pool</strong>:
<ul class="wp-block-list">
<li>Flavour: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">b2-15</mark></strong></code> (or something similar)</li>



<li>Number of nodes: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">3</mark></code></strong></li>



<li>Autoscaling: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">OFF</mark></code></strong></li>
</ul>
</li>



<li><strong>Name your node pool:</strong> <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></strong></li>
</ul>



<p>You should see your cluster (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>prometheus-vllm-metrics-ai-deploy</strong></mark></code>) in the list, along with the following information:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="632" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png" alt="" class="wp-image-30242" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-300x185.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-768x474.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1536x948.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-2048x1264.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the status is green with the <strong><mark style="color:#00d084" class="has-inline-color"><code>OK</code></mark></strong> label, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 4 &#8211; Configure Kubernetes access</h3>



<p>Download your <strong>kubeconfig file</strong> from the OVHcloud Control Panel and configure <strong><code><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></code></strong>:</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p>Now, you can create the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>values-prometheus.yaml</code></mark></strong> file:</p>



<pre class="wp-block-code"><code class=""># general configuration<br>nameOverride: "monitoring"<br>fullnameOverride: "monitoring"<br><br># Prometheus configuration<br>prometheus:<br>  prometheusSpec:<br>    # data retention (15d)<br>    retention: 15d<br>    <br>    # scrape interval (15s)<br>    scrapeInterval: 15s<br>    <br>    # persistent storage (required for production deployment)<br>    storageSpec:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed  # OVHcloud storage<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 50Gi  # (can be modified according to your needs)<br>    <br>    # scrape vLLM metrics from your AI Deploy instance (Ministral 3 14B)<br>    additionalScrapeConfigs:<br>      - job_name: 'vllm-ministral'<br>        scheme: https<br>        metrics_path: '/metrics'<br>        scrape_interval: 15s<br>        scrape_timeout: 10s<br>        <br>        # authentication using AI Deploy Bearer token stored Kubernetes Secret<br>        bearer_token_file: /etc/prometheus/secrets/vllm-auth-token/token<br>        static_configs:<br>          - targets:<br>              - '&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net'  # /!\ REPLACE THE &lt;APP_ID&gt; by yours /!\<br>            labels:<br>              service: 'vllm'<br>              model: 'ministral'<br>              environment: 'production'<br>        <br>        # TLS configuration<br>        tls_config:<br>          insecure_skip_verify: false<br>    <br>    # kube-prometheus-stack mounts the secret under /etc/prometheus/secrets/ and makes it accessible to Prometheus<br>    secrets:<br>      - vllm-auth-token<br><br># Grafana configuration (visualization layer)<br>grafana:<br>  enabled: true<br>  <br>  # disable automatic datasource provisioning<br>  sidecar:<br>    datasources:<br>      enabled: false<br>  <br>  # persistent dashboards<br>  persistence:<br>    enabled: true<br>    storageClassName: csi-cinder-high-speed<br>    size: 10Gi<br>  <br>  # /!\ DEFINE ADMIN PASSWORD - REPLACE "test" BY YOURS /!\<br>  adminPassword: "test"<br>  <br>  # access via OVHcloud LoadBalancer (public IP and managed LB)<br>  service:<br>    type: LoadBalancer<br>    port: 80<br>    annotations:<br>      # optional : limiter l'accès à certaines IPs<br>      # service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources: "1.2.3.4/32"<br>  <br># alertmanager (optional but recommended for production)<br>alertmanager:<br>  enabled: true<br>  <br>  alertmanagerSpec:<br>    storage:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 10Gi<br><br># cluster observability components<br>nodeExporter:<br>  enabled: true<br>  <br>kubeStateMetrics:<br>  enabled: true</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><strong><em>On OVHcloud MKS, persistent storage is handled automatically through the Cinder CSI driver. When a PersistentVolumeClaim (PVC) references a supported <code>storageClassName</code> such as <code>csi-cinder-high-speed</code>, OVHcloud dynamically provisions the underlying Block Storage volume and attaches it to the node running the pod. This enables stateful components like Prometheus, Alertmanager and Grafana to persist data reliably without any manual volume management, making the architecture fully cloud-native and operationally simple.</em></strong></p>
</blockquote>



<p>Then create the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></code></strong> namespace:</p>



<pre class="wp-block-code"><code class=""># create namespace<br>kubectl create namespace monitoring<br><br># verify creation<br>kubectl get namespaces | grep monitoring</code></pre>



<p>Finally, configure the Bearer token secret to access vLLM metrics.</p>



<pre class="wp-block-code"><code class=""># create bearer token secret<br>kubectl create secret generic vllm-auth-token \<br>  --from-literal=token='"$MY_OVHAI_ACCESS_TOKEN"' \<br>  -n monitoring<br><br># verify secret creation<br>kubectl get secret vllm-auth-token -n monitoring<br><br># test token (optional)<br>kubectl get secret vllm-auth-token -n monitoring \<br>  -o jsonpath='{.data.token}' | base64 -d </code></pre>



<p>Right, if everything is working, let&#8217;s move on to deployment.</p>



<h3 class="wp-block-heading">Step 5 &#8211; Deploy Prometheus stack</h3>



<p>Add the Prometheus Helm repository and install the monitoring stack. The deployment creates:</p>



<ul class="wp-block-list">
<li>Prometheus StatefulSet with persistent storage</li>



<li>Grafana deployment with LoadBalancer access</li>



<li>Alertmanager for future alert configuration (optional)</li>



<li>Supporting components (node exporters, kube-state-metrics)</li>
</ul>



<pre class="wp-block-code"><code class=""># add Helm repository<br>helm repo add prometheus-community \<br>  https://prometheus-community.github.io/helm-charts<br>helm repo update<br><br># install monitoring stack<br>helm install monitoring prometheus-community/kube-prometheus-stack \<br>  --namespace monitoring \<br>  --values values-prometheus.yaml \<br>  --wait</code></pre>



<p>Then you can retrieve the LoadBalancer IP address to access Grafana:</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n monitoring monitoring-grafana</code></pre>



<p>Finally, open your browser to <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://&lt;EXTERNAL-IP&gt;</mark></strong></code> and login with:</p>



<ul class="wp-block-list">
<li><strong>Username</strong>: <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>admin</strong></mark></code></li>



<li><strong>Password</strong>: as configured in your <code><strong><mark class="has-inline-color has-ast-global-color-0-color">values-prometheus.yaml</mark></strong></code> file</li>
</ul>



<h3 class="wp-block-heading">Step 6 &#8211; Create Grafana dashboards</h3>



<p>In this step, you will access the Grafana interface, add your Prometheus as a new data source, then create a complete dashboard with different vLLM metrics.</p>



<h4 class="wp-block-heading">1. Add a new data source in Grafana</h4>



<p>First of all, create a new Prometheus connection inside Grafana:</p>



<ul class="wp-block-list">
<li>Navigate to <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Connections</code></mark></strong> → <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Data sources</code></mark></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Add data source</mark></code></strong></li>



<li>Select <strong>Prometheus</strong></li>



<li>Configure URL: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://monitoring-prometheus:9090</mark></strong></code></li>



<li>Click <strong>Save &amp; test</strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="609" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png" alt="" class="wp-image-30247" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-300x178.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-768x457.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1536x913.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-2048x1218.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now that your Prometheus has been configured as a new data source, you can create your Grafana dashboard.</p>



<h4 class="wp-block-heading">2. Create your monitoring dashboard</h4>



<p>To begin with, you can use the following pre-configured Grafana dashboard by downloading this JSON file locally:</p>





<p>In the left-hand menu, select <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboard</mark></code></strong>:</p>



<ol class="wp-block-list">
<li>Navigate to <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboards</mark></code></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Import</mark></code></strong></li>



<li>Upload the provided dashboard JSON</li>



<li>Select <strong>Prometheus</strong> as datasource</li>



<li>Click <strong>Import</strong> and select the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">vLLM-metrics-grafana-monitoring.json</mark></code></strong> file</li>
</ol>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="449" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png" alt="" class="wp-image-30250" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-300x131.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-768x337.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1536x673.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-2048x897.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The dashboard provides real-time visibility for <strong>Ministral 3 14B</strong> deployed with vLLM container and OVHcloud AI Deploy.</p>



<p>You can now track:</p>



<ul class="wp-block-list">
<li><strong>Performance metrics</strong>: TTFT, inter-token latency, end-to-end latency</li>



<li><strong>Throughput indicators</strong>: Requests per second, token generation rates</li>



<li><strong>Resource utilisation</strong>: KV cache usage, active/waiting requests</li>



<li><strong>Capacity indicators</strong>: Queue depth, preemption rates</li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png" alt="" class="wp-image-30253" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Here are the key metrics tracked and displayed in the Grafana dashboard:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric Category</th><th>Prometheus Metric</th><th>Description</th><th>Use case</th></tr></thead><tbody><tr><td><strong>Latency</strong></td><td><code>vllm:time_to_first_token_seconds</code></td><td>Time until first token generation</td><td>User experience monitoring</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:inter_token_latency_seconds</code></td><td>Time between tokens</td><td>Throughput optimisation</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:e2e_request_latency_seconds</code></td><td>End-to-end request time</td><td>SLA monitoring</td></tr><tr><td><strong>Throughput</strong></td><td><code>vllm:request_success_total</code></td><td>Successful requests counter</td><td>Capacity planning</td></tr><tr><td><strong>Resource</strong></td><td><code>vllm:kv_cache_usage_perc</code></td><td>KV cache memory usage</td><td>Memory management</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_running</code></td><td>Active requests</td><td>Load monitoring</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_waiting</code></td><td>Queued requests</td><td>Overload detection</td></tr><tr><td><strong>Capacity</strong></td><td><code>vllm:num_preemptions_total</code></td><td>Request preemptions</td><td>Peak load indicator</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:prompt_tokens_total</code></td><td>Input tokens processed</td><td>Usage analytics</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:generation_tokens_total</code></td><td>Output tokens generated</td><td>Cost tracking</td></tr></tbody></table></figure>



<p>Well done, you now have at your disposal:</p>



<ul class="wp-block-list">
<li>An endpoint of the Ministral 3 14B model deployed with vLLM thanks to <strong>OVHcloud AI Deploy</strong> and its autoscaling strategies based on custom metrics</li>



<li>Prometheus for metrics collection and Grafana for visualisation/dashboards thanks to <strong>OVHcloud MKS</strong></li>
</ul>



<p><strong>But how can you check that everything will work when the load increases?</strong></p>



<h3 class="wp-block-heading">Step 7 &#8211; Test autoscaling and real-time visualisation</h3>



<p>The first objective here is to force AI Deploy to:</p>



<ul class="wp-block-list">
<li>Increase <code>vllm:num_requests_running</code></li>



<li>&#8216;Saturate&#8217; a single replica</li>



<li>Trigger the <strong>scale up</strong></li>



<li>Observe replica increase + latency drop</li>
</ul>



<h4 class="wp-block-heading">1. Autoscaling testing strategy</h4>



<p>The goal is to combine:</p>



<ul class="wp-block-list">
<li><strong>High concurrency</strong></li>



<li><strong>Long prompts</strong> (KVcache heavy)</li>



<li><strong>Long generations</strong></li>



<li><strong>Bursty load</strong></li>
</ul>



<p>This is what vLLM autoscaling actually reacts to.</p>



<p>To do so, a Python script can simulate the expected behaviour:</p>



<pre class="wp-block-code"><code class="">import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br>from tqdm import tqdm<br><br>APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/v1" # /!\ REPLACE THE &lt;APP_ID&gt; by yours /!\<br>MODEL = "mistralai/Ministral-3-14B-Instruct-2512"<br>API_KEY = $MY_OVHAI_ACCESS_TOKEN<br><br>CONCURRENT_WORKERS = 500          # concurrency (main scaling trigger)<br>REQUESTS_PER_WORKER = 25<br>MAX_TOKENS = 768                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># openai compliance<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key=API_KEY,<br>)<br><br># basic metrics<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception as e:<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("Starting autoscaling stress test...")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n=== AUTOSCALING BENCH RESULTS ===")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>



<p><strong>How can you verify that autoscaling is working and that the load is being handled correctly without latency skyrocketing?</strong></p>



<h4 class="wp-block-heading">2. Hardware and platform-level monitoring</h4>



<p>First, <strong>AI Deploy Grafana</strong> answers <strong>&#8216;What resources are being used and how many replicas exist?</strong>&#8216;.</p>



<p>GPU utilisation, GPU memory, CPU, RAM and replica count are monitored through <strong>OVHcloud AI Deploy Grafana</strong> (monitoring URL), which exposes infrastructure and runtime metrics for the AI Deploy application. This layer provides visibility into <strong>resource saturation and scaling events</strong> managed by the AI Deploy platform itself.</p>



<p>Access it using the following URL (do not forget to replace <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>&lt;APP_ID&gt;</strong></mark></code> by yours): <strong><code>https://monitoring.gra.ai.cloud.ovh.net/d/app/app-monitoring?var-app=</code><mark class="has-inline-color has-ast-global-color-0-color"><code>&lt;APP_ID&gt;</code></mark><code>&amp;orgId=1</code></strong></p>



<p>For example, check GPU/RAM metrics:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png" alt="" class="wp-image-30260" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can also monitor scale ups and downs in real time, as well as information on HTTP calls and much more!</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png" alt="" class="wp-image-30261" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h4 class="wp-block-heading">3. Software and application-level monitoring</h4>



<p>Next, the combination of MKS + Prometheus + Grafana answers <strong>&#8216;How does the inference engine behave internally?&#8217;</strong></p>



<p>In fact, vLLM internal metrics (request concurrency, token throughput, latency indicators, KV cache pressure, etc.) are collected via the <strong>vLLM <code>/metrics</code> endpoint</strong> and scraped by <strong>Prometheus running on OVHcloud MKS</strong>, then visualised in a <strong>dedicated Grafana instance</strong>. This layer focuses on <strong>model behaviour and inference performance</strong>.</p>



<p>Find all these metrics via (just replace <strong><code><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></code></strong>): <strong><code>http://<mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark>/d/vllm-ministral-monitoring/ministral-14b-vllm-metrics-monitoring?orgId=1</code></strong></p>



<p>Find key metrics such as TTFT (Time-to-First-Token):</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png" alt="" class="wp-image-30263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can also find some information about <strong>&#8216;Model load and throughput&#8217;</strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png" alt="" class="wp-image-30264" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To go further and add even more metrics, you can refer to the vLLM documentation on &#8216;<a href="https://docs.vllm.ai/en/v0.7.2/getting_started/examples/prometheus_grafana.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus and Grafana</a>&#8216;.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>This reference architecture provides a scalable and production-ready approach for deploying LLM inference on OVHcloud using <strong>AI Deploy</strong> and the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-apps-deployments?id=kb_article_view&amp;sysparm_article=KB0047997#advanced-custom-metrics-for-autoscaling" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">autoscaling on custom metric feature</a>.</p>



<p>OVHcloud <strong>MKS</strong> is dedicated to running Prometheus and Grafana, enabling secure scraping and visualisation of <strong>vLLM internal metrics</strong> exposed via the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code> </mark></strong>endpoint.</p>



<p>By scraping vLLM metrics securely from AI Deploy into Prometheus and exposing them through Grafana, the architecture provides full visibility into model behaviour, performance and load, enabling informed scaling analysis, troubleshooting and capacity planning in production environments.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks%2F&amp;action_name=Reference%20Architecture%3A%20Custom%20metric%20autoscaling%20for%20LLM%20inference%20with%20vLLM%20on%20OVHcloud%20AI%20Deploy%20and%20observability%20using%20MKS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: build a sovereign n8n RAG workflow for AI agent using OVHcloud Public Cloud solutions</title>
		<link>https://blog.ovhcloud.com/reference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 27 Jan 2026 13:12:03 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Managed Database]]></category>
		<category><![CDATA[n8n]]></category>
		<category><![CDATA[Object Storage]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[S3]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29694</guid>

					<description><![CDATA[What if an n8n workflow, deployed in a&#160;sovereign environment, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection. In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&#160;Large Language Models&#160;(LLMs) [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions%2F&amp;action_name=Reference%20Architecture%3A%20build%20a%20sovereign%20n8n%20RAG%20workflow%20for%20AI%20agent%20using%20OVHcloud%20Public%20Cloud%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em><em>What if an n8n workflow, deployed in a&nbsp;</em><strong><em>sovereign environment</em></strong><em>, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection.</em></em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg" alt="" class="wp-image-30002" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-300x169.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-768x432.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1536x864.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag.jpg 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>n8n workflow overview</em></figcaption></figure>



<p>In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&nbsp;<strong>Large Language Models</strong>&nbsp;(LLMs) is becoming a strategic differentiator.</p>



<p><strong>How?</strong>&nbsp;By building&nbsp;<strong>Agentic RAG systems</strong>&nbsp;capable of retrieving, reasoning, and acting autonomously based on external knowledge.</p>



<p>To make this possible, engineers need a way to connect&nbsp;<strong>retrieval pipelines (RAG)</strong>&nbsp;with&nbsp;<strong>tool-based orchestration</strong>.</p>



<p>This article outlines a&nbsp;<strong>reference architecture</strong>&nbsp;for building a&nbsp;<strong>fully automated RAG pipeline orchestrated by n8n</strong>, leveraging&nbsp;<strong>OVHcloud AI Endpoints</strong>&nbsp;and&nbsp;<strong>PostgreSQL with pgvector</strong>&nbsp;as core components.</p>



<p>The final result will be a system that automatically ingests Markdown documentation from&nbsp;<strong>Object Storage</strong>, creates embeddings with OVHcloud’s&nbsp;<strong>BGE-M3</strong>&nbsp;model available on AI Endpoints, and stores them in a&nbsp;<strong>Managed Database PostgreSQL</strong>&nbsp;with pgvector extension.</p>



<p>Lastly, you’ll be able to build an AI Agent that lets you chat with an LLM (<strong>GPT-OSS-120B</strong>&nbsp;on AI Endpoints). This agent, utilising the RAG implementation carried out upstream, will be an expert on OVHcloud products.</p>



<p>You can further improve the process by using an&nbsp;<strong>LLM guard</strong>&nbsp;to protect the questions sent to the LLM, and set up a chat memory to use conversation history for higher response quality.</p>



<p><strong>But what about n8n?</strong></p>



<p><a href="https://n8n.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>n8n</strong></a>, the open-source workflow automation tool,&nbsp;offers many benefits and connects seamlessly with over&nbsp;<strong>300</strong>&nbsp;APIs, apps, and services:</p>



<ul class="wp-block-list">
<li><strong>Open-source</strong>: n8n is a 100% self-hostable solution, which means you retain full data control;</li>



<li><strong>Flexible</strong>: combines low-code nodes and custom JavaScript/Python logic;</li>



<li><strong>AI-ready</strong>: includes useful integrations for LangChain, OpenAI, and embedding support capabilities;</li>



<li><strong>Composable</strong>: enables simple connections between data, APIs, and models in minutes;</li>



<li><strong>Sovereign by design</strong>: suitable for privacy-sensitive or regulated sectors.</li>
</ul>



<p>This reference architecture serves as a blueprint for building a sovereign, scalable Retrieval Augmented Generation (<strong>RAG</strong>) platform using&nbsp;<strong>n8n</strong>&nbsp;and&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;solutions.</p>



<p>This setup shows how to orchestrate data ingestion, generate embeddings, and enable conversational AI by combining&nbsp;<strong>OVHcloud Object Storage</strong>,&nbsp;<strong>Managed Databases with PostgreSQL</strong>,&nbsp;<strong>AI Endpoints</strong>&nbsp;and&nbsp;<strong>AI Deploy</strong>. <strong>The result?</strong>&nbsp;An AI environment that is fully integrated, protects privacy, and is exclusively hosted on <strong>OVHcloud’s European infrastructure</strong>.</p>



<h2 class="wp-block-heading">Overview of the n8n workflow architecture for RAG </h2>



<p>The workflow involves the following steps:</p>



<ul class="wp-block-list">
<li><strong>Ingestion:</strong>&nbsp;documentation in markdown format is fetched from <strong>OVHcloud Object Storage (S3);</strong></li>



<li><strong>Preprocessing:</strong> n8n cleans and normalises the text, removing YAML front-matter and encoding noise;</li>



<li><strong>Vectorisation:</strong>&nbsp;each document is embedded using the <strong>BGE-M3</strong> model, which is available via <strong>OVHcloud AI Endpoints;</strong></li>



<li><strong>Persistence:</strong> vectors and metadata are stored in <strong>OVHcloud PostgreSQL Managed Database</strong> using pgvector;</li>



<li><strong>Retrieval:</strong> when a user sends a query, n8n triggers a <strong>LangChain Agent</strong> that retrieves relevant chunks from the database;</li>



<li><strong>Reasoning and actions:</strong>&nbsp;the <strong>AI Agent node</strong> combines LLM reasoning, memory, and tool usage to generate a contextual response or trigger downstream actions (Slack reply, Notion update, API call, etc.).</li>
</ul>



<p>In this tutorial, all services are deployed within the <strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you start, double-check that you have:</p>



<ul class="wp-block-list">
<li>an <strong>OVHcloud Public Cloud</strong> account</li>



<li>an <strong>OpenStack user</strong> with the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">&nbsp;following roles</a>:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>



<li>an <strong>API key</strong> for <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-endpoints-getting-started?id=kb_article_view&amp;sysparm_article=KB0065401" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></li>



<li><strong>ovhai CLI available</strong> – <em>install the </em><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a></li>



<li><strong>Hugging Face access</strong> – <em>create a </em><a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Hugging Face account</em></a><em> and generate an </em><a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>access token</em></a></li>
</ul>



<p><strong>🚀 Now that you have everything you need, you can start building your n8n workflow!</strong></p>



<h2 class="wp-block-heading">Architecture guide: n8n agentic RAG workflow</h2>



<p>You’re all set to configure and deploy your n8n workflow.</p>



<p>⚙️<em> Keep in mind that the following steps can be completed using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Build the RAG data ingestion pipeline</h3>



<p>This first step involves building the foundation of the entire RAG workflow by preparing the elements you need:</p>



<ul class="wp-block-list">
<li>n8n deployment</li>



<li>Object Storage bucket creation</li>



<li>PostgreSQL database creation</li>



<li>and more</li>
</ul>



<p>Remember to set up the proper credentials in n8n so the different elements can connect and function.</p>



<h4 class="wp-block-heading">1. Deploy n8n on OVHcloud VPS</h4>



<p>OVHcloud provides <a href="https://www.ovhcloud.com/en-gb/vps/vps-n8n/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>VPS solutions compatible with n8n</strong></a><strong>.</strong> Get a ready-to-use virtual server with <strong>pre-installed n8n </strong>and start building automation workflows without manual setup. With plans ranging from <strong>6 vCores&nbsp;/&nbsp;12 GB RAM</strong> to <strong>24 vCores&nbsp;/&nbsp;96 GB RAM</strong>, you can choose the capacity that suits your workload.</p>



<p><strong>How to set up n8n on a VPS?</strong></p>



<p>Setting up n8n on an OVHcloud VPS generally involves:</p>



<ul class="wp-block-list">
<li>Choosing and provisioning your OVHcloud VPS plan;</li>



<li>Connecting to your server via SSH and carrying out the initial server configuration, which includes updating the OS;</li>



<li>Installing n8n, typically with Docker (recommended for ease of management and updates), or npm by following this <a href="https://help.ovhcloud.com/csm/en-gb-vps-install-n8n?id=kb_article_view&amp;sysparm_article=KB0072179" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>;</li>



<li>Configuring n8n with a domain name, SSL certificate for HTTPS, and any necessary environment variables for databases or settings.</li>
</ul>



<p>While OVHcloud provides a robust VPS platform, you can find detailed n8n installation guides in the <a href="https://docs.n8n.io/hosting/installation/docker/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official n8n documentation</a>.</p>



<p>Once the setup is complete, you can configure the database and the Object Storage bucket.</p>



<h4 class="wp-block-heading">2. Create Object Storage bucket</h4>



<p>First, you have to set up your data source. Here you can store all your documentation in an S3-compatible <a href="https://www.ovhcloud.com/en-gb/public-cloud/object-storage/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Object Storage</a> bucket.</p>



<p>Here, assume that all the documentation files are in Markdown format.</p>



<p>From the <strong>OVHcloud Control Panel</strong>, create a new Object Storage container with the <strong>S3-compatible API</strong> solution by following this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-storage-s3-getting-started-object-storage?id=kb_article_view&amp;sysparm_article=KB0034674" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>.</p>



<p>When the bucket is ready, add your Markdown documentation to it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png" alt="" class="wp-image-29733" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Note:</strong>&nbsp;For this tutorial, we’re using the various OVHcloud product documentation available as open source in the GitHub repository maintained by OVHcloud members.</p>



<p><em>Click this </em><a href="https://github.com/ovh/docs.git" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>link</em></a><em> to access the repository.</em></p>
</blockquote>
</blockquote>



<p>How do you do that? Extract all the <strong><code>guide.en-gb.md</code></strong> files from the GitHub repository and rename each one to match its parent folder.</p>



<p>Example: the documentation about ovhai CLI installation, <code><strong>docs/pages/public_cloud/ai_machine_learning/cli_10_howto_install_cli/guide.en-gb.md</strong></code>, is stored in the <strong>ovhcloud-products-documentation-md</strong> bucket as <strong><code>cli_10_howto_install_cli.md</code></strong>.</p>
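<p>As an illustration, here is a minimal Node.js sketch that performs this extraction from a local clone of the repository (the <code>./docs</code> and <code>./md-export</code> paths are assumptions, so adapt them to your setup); you can then upload the resulting files to your bucket:</p>



<pre class="wp-block-code"><code class="">// Minimal sketch: collect every guide.en-gb.md from a local clone of<br>// the ovh/docs repository and rename it after its parent folder.<br>// SRC and OUT are assumed paths - adapt them to your environment.<br>const fs = require('fs');<br>const path = require('path');<br><br>const SRC = './docs';      // local clone of https://github.com/ovh/docs<br>const OUT = './md-export'; // files to upload to the S3 bucket<br><br>fs.mkdirSync(OUT, { recursive: true });<br><br>function walk(dir) {<br>  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {<br>    const full = path.join(dir, entry.name);<br>    if (entry.isDirectory()) {<br>      walk(full);<br>    } else if (entry.name === 'guide.en-gb.md') {<br>      // e.g. .../cli_10_howto_install_cli/guide.en-gb.md<br>      // becomes cli_10_howto_install_cli.md<br>      const parent = path.basename(path.dirname(full));<br>      fs.copyFileSync(full, path.join(OUT, `${parent}.md`));<br>    }<br>  }<br>}<br><br>walk(SRC);</code></pre>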



<p>You should get an overview that looks like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png" alt="" class="wp-image-29735" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Take note of the following elements and create a new credential in n8n named <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">OVHcloud S3 gra credentials</mark></strong></code>:</p>



<ul class="wp-block-list">
<li>S3 Endpoint: <a href="https://s3.gra.io.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://s3.gra.io.cloud.ovh.net/</mark></code></strong></a></li>



<li>Region: <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">gra</mark></code></strong></li>



<li>Access Key ID: <strong><code>&lt;your_object_storage_user_access_key&gt;</code></strong></li>



<li>Secret Access Key: <strong><code>&lt;your_object_storage_user_secret_key&gt;</code></strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png" alt="" class="wp-image-29736" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, create a new n8n node by selecting&nbsp;<strong>S3</strong>, then&nbsp;<strong>Get Multiple Files</strong>.<br>Configure this node as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png" alt="" class="wp-image-29740" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Connect the node to the previous one before moving on to the next step.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png" alt="" class="wp-image-29741" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>With the first phase done, you can now configure the vector DB.</p>



<h4 class="wp-block-heading">3. Configure PostgreSQL Managed DB (pgvector)</h4>



<p>In this step, you can set up the vector database that lets you store the embeddings generated from your documents.</p>



<p>How? By using OVHcloud’s Managed Databases for&nbsp;<a href="https://www.ovhcloud.com/en-gb/public-cloud/postgresql/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PostgreSQL</a>&nbsp;with the pgvector extension. Go to your OVHcloud Control Panel and follow the steps.</p>



<p>1. Navigate to&nbsp;<strong>Databases &amp; Analytics &gt; Databases</strong></p>



<p><strong>2. Create a new database and select&nbsp;<em>PostgreSQL</em>&nbsp;and a datacenter location</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png" alt="" class="wp-image-29758" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>3. Select&nbsp;<em>Production</em>&nbsp;plan and&nbsp;<em>Instance type</em></strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png" alt="" class="wp-image-29759" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>4. Reset the user password and save it</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png" alt="" class="wp-image-29762" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>5. Whitelist the IP of your n8n instance as follows</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png" alt="" class="wp-image-29761" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>6. Take note of the following parameters</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png" alt="" class="wp-image-29760" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make a note of this information and create a new credential in n8n named&nbsp;<strong>OVHcloud PGvector credentials</strong>:</p>



<ul class="wp-block-list">
<li>Host:<strong>&nbsp;<code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_hostname&gt;</mark></code></strong></li>



<li>Database:&nbsp;<strong>defaultdb</strong></li>



<li>User:&nbsp;<code>avnadmin</code></li>



<li>Password:&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_password&gt;</mark></strong></code></li>



<li>Port:&nbsp;<strong>20184</strong></li>
</ul>



<p>Consider enabling the&nbsp;<strong>Ignore SSL Issues (Insecure)</strong>&nbsp;button as needed and setting the&nbsp;<strong>Maximum Number of Connections</strong>&nbsp;value to&nbsp;<strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1000</mark></code></strong>.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png" alt="" class="wp-image-29763" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>✅ You’re now connected to the database! But what about the PGvector extension?</p>



<p>Add a PostgreSQL node&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code>&nbsp;to your n8n workflow, and create the extension through an SQL query, which should look like this:</p>



<pre class="wp-block-code"><code class="">-- drop table as needed<br>DROP TABLE IF EXISTS md_embeddings;<br><br>-- activate pgvector<br>CREATE EXTENSION IF NOT EXISTS vector;<br><br>-- create table<br>CREATE TABLE md_embeddings (<br>    id SERIAL PRIMARY KEY,<br>    text TEXT,<br>    embedding vector(1024),<br>    metadata JSONB<br>);</code></pre>



<p>You should get this n8n node:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png" alt="" class="wp-image-29752" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Finally, you can create a new table named&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">md_embeddings</mark></strong></code>&nbsp;using this node. Add a&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Stop and Error</mark></strong></code>&nbsp;node to halt the workflow if the table setup fails.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png" alt="" class="wp-image-29753" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>All set! Your vector DB is prepped and ready for data! Keep in mind, you still need an&nbsp;<strong>embeddings model</strong> for the RAG data ingestion pipeline.</p>



<h4 class="wp-block-heading">4. Access to OVHcloud AI Endpoints</h4>



<p><strong>OVHcloud AI Endpoints</strong>&nbsp;is a managed service that provides&nbsp;<strong>ready-to-use APIs for AI models</strong>, including&nbsp;<strong>LLM, CodeLLM, embeddings, Speech-to-Text, and image models</strong>&nbsp;hosted within OVHcloud’s European infrastructure.</p>



<p>To vectorise the various documents in Markdown format, you have to select an embedding model:&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/bge-m3" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>BGE-M3</strong></a>.</p>



<p>Usually, your AI Endpoints API key should already be created. If not, head to the AI Endpoints menu in your OVHcloud Control Panel to generate a new API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png" alt="" class="wp-image-29775" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Once this is done, you can create new OpenAI credentials in your n8n.</p>



<p>Why do I need OpenAI credentials? Because the <strong>AI Endpoints API&nbsp;</strong>is fully compatible with OpenAI’s, integrating it is simple and ensures the&nbsp;<strong>sovereignty of your data.</strong></p>



<p>How? Thanks to a single endpoint&nbsp;<a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a>, you can request the different AI Endpoints models.</p>
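<p>As an illustration, here is a minimal Node.js sketch requesting an embedding through that endpoint with plain <code>fetch</code> (the <code>bge-m3</code> model identifier is an assumption, so verify it on the model page; the API key comes from the previous step):</p>



<pre class="wp-block-code"><code class="">// Minimal sketch: request an embedding from OVHcloud AI Endpoints<br>// through the OpenAI-compatible API. The model identifier is an<br>// assumption - verify it on the BGE-M3 model page.<br>const API_KEY = process.env.OVH_AI_ENDPOINTS_TOKEN; // your AI Endpoints API key<br><br>async function embed(text) {<br>  const res = await fetch('https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/embeddings', {<br>    method: 'POST',<br>    headers: {<br>      'Authorization': `Bearer ${API_KEY}`,<br>      'Content-Type': 'application/json',<br>    },<br>    body: JSON.stringify({ model: 'bge-m3', input: text }),<br>  });<br>  if (!res.ok) throw new Error(`HTTP ${res.status}`);<br>  const data = await res.json();<br>  return data.data[0].embedding; // 1024-dimension vector for BGE-M3<br>}<br><br>embed('Hello OVHcloud!').then(v =&gt; console.log(v.length)); // expect 1024</code></pre>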



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png" alt="" class="wp-image-29776" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This means you can create a new n8n node by selecting&nbsp;<strong>Postgres PGVector Store</strong>&nbsp;and&nbsp;<strong>Add documents to Vector Store</strong>.<br>Set up this node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png" alt="" class="wp-image-29781" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then configure the <strong>Data Loader</strong> with custom text splitting and the JSON data type.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png" alt="" class="wp-image-29780" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>For the text splitter, here are some options:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png" alt="" class="wp-image-29786" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To finish, select the&nbsp;<strong>BGE-M3</strong> embedding model from the model list and set the&nbsp;<strong>Dimensions</strong> to 1024.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png" alt="" class="wp-image-29784" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You now have everything you need to build the ingestion pipeline.</p>



<h4 class="wp-block-heading">5. Set up the ingestion pipeline loop</h4>



<p>To build a fully automated document ingestion and vectorisation pipeline, you have to integrate a few specific nodes, mainly:</p>



<ul class="wp-block-list">
<li>a <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></code></strong> that downloads each markdown file one by one so that it can be vectorised;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> that counts the number of files processed, which subsequently determines the number of requests sent to the embedding model;</li>



<li>an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition that lets you check when 400 requests have been reached;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node that pauses after every 400 requests to avoid getting rate-limited;</li>



<li>an S3 block <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> to download each markdown;</li>



<li>another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> to extract and process text from Markdown files by cleaning and removing special characters before sending it to the embeddings model;</li>



<li>a PostgreSQL node to <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL</mark></strong></code> query to check that the table contains vectors after the process (loop) is complete.</li>
</ul>



<h5 class="wp-block-heading">5.1. Create a loop to process each documentation file</h5>



<p>Begin by creating a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></strong></code> to process all the Markdown files one at a time. Set the <strong>batch size</strong> to <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1</mark></code></strong> in this loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png" alt="" class="wp-image-29788" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Add the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Loop</code></mark></strong> statement right after the S3 <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Get Multiple Files</mark></code></strong> node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png" alt="" class="wp-image-29797" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Time to put the loop’s content into action!</p>



<h5 class="wp-block-heading">5.2. Count the number of files using a code snippet</h5>



<p>Next, choose the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> node from the list to count how many files have been processed. Set the <strong>Mode</strong> to “Run Once for Each Item” and the <strong>Language</strong> to “JavaScript”, then add the following code snippet to the designated block.</p>



<pre class="wp-block-code"><code class="">// simple counter per item<br>const counter = $runIndex + 1;<br><br>return {<br>  counter<br>};</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png" alt="" class="wp-image-29792" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make sure this code snippet is included in the loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png" alt="" class="wp-image-29798" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can start adding the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong><code>if</code></strong></mark> part to the loop now.</p>



<h5 class="wp-block-heading">5.3. Add a condition that applies a rule every 400 requests</h5>



<p>Here, you need to create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node and add the following condition, set as an expression.</p>



<pre class="wp-block-code"><code class="">{{ (Number($json["counter"]) % 400) === 0 }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png" alt="" class="wp-image-29794" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Add it immediately after counting the files:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png" alt="" class="wp-image-29800" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If this condition <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">is true</mark></strong></code>, trigger the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node.</p>



<h5 class="wp-block-heading">5.4. Insert a pause after each set of 400 requests</h5>



<p>Then insert a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node to pause for a short time before resuming. Set <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Resume</mark></strong></code> to “After Time Interval” and the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait Amount</mark></strong></code> to 60 seconds.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png" alt="" class="wp-image-29796" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Connect it to the <strong>True</strong> output of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png" alt="" class="wp-image-29801" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, you can go ahead and download the Markdown file, and then process it.</p>



<h5 class="wp-block-heading">5.5. Launch documentation download</h5>



<p>To do this, create a new <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> S3 node and configure it with this File Key expression:</p>



<pre class="wp-block-code"><code class="">{{ $('Process each documentation file').item.json.Key }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png" alt="" class="wp-image-29804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Want to connect it? That’s easy: link it to the output of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node and to the <strong>False</strong> output of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition; this way, a file is processed only when the rate limit has not been exceeded.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png" alt="" class="wp-image-29805" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You’re almost done! Now you need to extract and process the text from the Markdown files – clean and remove any special characters before sending it to the embedding model.</p>



<h5 class="wp-block-heading">5.6 Clean Markdown text content</h5>



<p>Next, create another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> to process text from Markdown files:</p>



<pre class="wp-block-code"><code class="">// extract binary content<br>const binary = $input.item.binary.data;<br><br>// decoding into clean UTF-8 text<br>let text = Buffer.from(binary.data, 'base64').toString('utf8');<br><br>// cleaning - remove non-printable characters<br>text = text<br>  .replace(/[^\x09\x0A\x0D\x20-\x7EÀ-ÿ€£¥•–—‘’“”«»©®™°±§¶÷×]/g, ' ')<br>  .replace(/\s{2,}/g, ' ')<br>  .trim();<br><br>// check lenght<br>if (text.length &gt; 14000) {<br>  text = text.slice(0, 14000);<br>}<br><br>return [{<br>  text,<br>  fileName: binary.fileName,<br>  mimeType: binary.mimeType<br>}];</code></pre>



<p>Select the <em>“Run Once for Each Item”</em> <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Mode</mark></strong></code> and place the previous code in the dedicated JavaScript block.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png" alt="" class="wp-image-29806" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To finish, check that the output text has been sent to the document vectorisation system, which was set up in <strong>Step 3 – Configure PostgreSQL Managed DB (pgvector)</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png" alt="" class="wp-image-29808" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>How do I confirm that the table contains all elements after vectorisation?</p>



<h5 class="wp-block-heading">5.7 Double-check that the documents are in the table</h5>



<p>To confirm that your RAG system is working, make sure your vector database actually contains the embedded documents. To do this, use a PostgreSQL node with <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code> in your n8n workflow.</p>



<p>Then, run the following query:</p>



<pre class="wp-block-code"><code class="">-- count the number of elements<br>SELECT COUNT(*) FROM md_embeddings;</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png" alt="" class="wp-image-29818" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Done</mark></strong></code> section of your <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop</mark></strong>, so the elements are counted when the process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png" alt="" class="wp-image-29773" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congrats! You can now run the workflow to begin ingesting documents.</p>



<p>Click the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute workflow</mark></strong></code> button and wait until the vectorisation process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png" alt="" class="wp-image-29823" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember, everything should be green when it’s finished ✅.</p>



<h3 class="wp-block-heading">Step 2 – RAG chatbot</h3>



<p>With the data ingestion and vectorisation steps completed, you can now begin implementing your AI agent.</p>



<p>This involves building a <strong>RAG-based AI Agent</strong>&nbsp;by simply starting a chat with an LLM.</p>



<h4 class="wp-block-heading">1. Set up the chat box to start a conversation</h4>



<p>First, configure your AI Agent based on the RAG system, and add a new node in the same n8n workflow: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Trigger</mark></strong></code>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png" alt="" class="wp-image-29834" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This node will allow you to interact directly with your AI agent! But before that, you need to check that your message is safe.</p>



<h4 class="wp-block-heading">2. Set up your LLM Guard with AI Deploy</h4>



<p>To check whether a message is secure or not, use an LLM Guard.</p>



<p><strong>What’s an LLM Guard?</strong>&nbsp;This is a safety and control layer that sits between users and an LLM, or between the LLM and an external connection. Its main goal is to filter, monitor, and enforce rules on what goes into or comes out of the model 🔐.</p>



<p>You can use <a href="file:///Users/jdutse/Downloads/www.ovhcloud.com/en-gb/public-cloud/ai-deploy" data-wpel-link="internal">AI Deploy</a> from OVHcloud to deploy your desired LLM guard. With a single command line, this AI solution lets you deploy a Hugging Face model using vLLM Docker containers.</p>



<p>For more details, please refer to this <a href="https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/" data-wpel-link="internal">blog</a>.</p>



<p>For the use case covered in this article, you can use the open-source model <strong>meta-llama/Llama-Guard-3-8B</strong> available on <a href="https://huggingface.co/meta-llama/Llama-Guard-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a>.</p>



<h5 class="wp-block-heading">2.1 Create a Bearer token to request your custom AI Deploy endpoint</h5>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a token</a> to access your AI Deploy app once it’s deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>The following output is returned:</p>



<p><code><strong>Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-10-25 8:53:05<br>Updated At: 20-10-25 8:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</strong></code></p>



<p>You can now store and export your access token to add it as a new credential in n8n.</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h5 class="wp-block-heading">2.1 Start Llama Guard 3 model with AI Deploy</h5>



<p>Using the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai</mark></strong></code> CLI, launch the following command to start the vLLM inference server.</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>	--name vllm-llama-guard3 \<br>        --default-http-port 8000 \<br>        --gpu 1 \<br>	--flavor l40s-1-gpu \<br>        --label ai_deploy_token=my_operator_token \<br>	--env OUTLINES_CACHE_DIR=/tmp/.outlines \<br>	--env HF_TOKEN=$MY_HF_TOKEN \<br>	--env HF_HOME=/hub \<br>	--env HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>	--env HF_HUB_ENABLE_HF_TRANSFER=0 \<br>	--volume standalone:/workspace:RW \<br>	--volume standalone:/hub:RW \<br>	vllm/vllm-openai:v0.10.1.1 \<br>	-- bash -c python3 -m vllm.entrypoints.openai.api_server                       <br>                           --model meta-llama/Llama-Guard-3-8B \                     <br>                           --tensor-parallel-size 1 \                     <br>                           --dtype bfloat16</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai app run</mark></strong></code></li>
</ul>



<p>This is the core command to&nbsp;<strong>run an app</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--name vllm-llama-guard3</mark></strong></code></li>
</ul>



<p>Sets a&nbsp;<strong>custom name</strong>&nbsp;for the app. For example,&nbsp;<code>vllm-llama-guard3</code>.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--default-http-port 8000</mark></strong></code></li>
</ul>



<p>Exposes&nbsp;<strong>port 8000</strong>&nbsp;as the default HTTP endpoint. The vLLM server typically runs on port 8000.</p>



<ul class="wp-block-list">
<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--gpu 1</code></mark></strong></li>



<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--flavor l40s-1-gpu</code></mark></strong></li>
</ul>



<p>Allocates&nbsp;<strong>one L40S GPU</strong>&nbsp;for the app. You can adjust the GPU type and count depending on the model you need to deploy.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/workspace:RW</mark></strong></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/hub:RW</mark></strong></code></li>
</ul>



<p>Mounts&nbsp;<strong>two persistent storage volumes</strong>: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/workspace</code></mark></strong>, the main working directory, and <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">/hub</mark></strong></code>, used to store Hugging Face model files.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env OUTLINES_CACHE_DIR=/tmp/.outlines</mark></strong></code></li>



<li><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_TOKEN=$MY_HF_TOKEN</mark></code></strong></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HOME=/hub</mark></strong></code></li>



<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>--env HF_DATASETS_TRUST_REMOTE_CODE=1</strong></mark></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HUB_ENABLE_HF_TRANSFER=0</mark></strong></code></li>
</ul>



<p>These are Hugging Face&nbsp;<strong>environment variables</strong> you have to set. Please export your Hugging Face access token as an environment variable before starting the app: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">export MY_HF_TOKEN=***********</mark></strong></code></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.10.1.1</mark></strong></code></li>
</ul>



<p>Uses the&nbsp;<strong><code>vllm/vllm-openai</code></strong>&nbsp;Docker image (a pre-configured vLLM OpenAI API server).</p>



<ul class="wp-block-list">
<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>-- bash -c "python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-Guard-3-8B --tensor-parallel-size 1 --dtype bfloat16"</strong></mark></code></li>
</ul>



<p>Finally, run a<strong>&nbsp;bash shell</strong>&nbsp;inside the container and execute a Python command to launch the vLLM API server.</p>



<h5 class="wp-block-heading">2.2 Check to confirm your AI Deploy app is RUNNING</h5>



<p>Replace <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></strong></code> with yours.</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;app_id&gt;</code></pre>



<p>You should get:</p>



<p><code>History:<br>DATE STATE<br>20-10-25 09:58:00 QUEUED<br>20-10-25 09:58:01 INITIALIZING<br>20-10-25 09:58:07 PENDING<br>20-10-25 10:03:10&nbsp;<strong>RUNNING</strong><br>Info:<br>Message: App is running</code></p>



<h5 class="wp-block-heading">2.3 Create a new n8n credential with AI Deploy app URL and Bearer access token</h5>



<p>First, using your <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>&lt;app_id></strong></mark></code>, retrieve your AI Deploy app URL.</p>



<pre class="wp-block-code"><code class="">ovhai app get <span style="background-color: initial; font-family: inherit; font-size: inherit; text-align: initial; font-weight: inherit;">&lt;app_id&gt;</span> -o json | jq '.status.url' -r</code></pre>



<p>Then, create a new OpenAI credential from your n8n workflow, using your AI Deploy URL and the Bearer token as an API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png" alt="" class="wp-image-29837" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Don&#8217;t forget to replace <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>6e10e6a5-2862-4c82-8c08-26c458ca12c7</code></mark></strong> with your <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></code></strong>.</p>



<h5 class="wp-block-heading">2.4 Create the LLM Guard node in n8n workflow</h5>



<p>Create a new <strong>OpenAI node</strong> to <strong>Message a model</strong> and select the new AI Deploy credential for LLM Guard usage.</p>



<p>Next, create the prompt as follows:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png" alt="" class="wp-image-29840" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, use an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node to determine if the scenario is <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>safe</code></mark></strong> or <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>unsafe</code></mark></strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png" alt="" class="wp-image-29842" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">unsafe</mark></strong></code>, send an error message right away to stop the workflow.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png" alt="" class="wp-image-29843" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>But if the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">safe</mark></strong></code>, you can send the request to the AI Agent without issues 🔐.</p>



<h4 class="wp-block-heading">3. Set up AI Agent</h4>



<p>The&nbsp;<strong>AI Agent</strong>&nbsp;node in&nbsp;<strong>n8n</strong>&nbsp;acts as an intelligent orchestration layer that combines&nbsp;<strong>LLMs, memory, and external tools</strong>&nbsp;within an automated workflow.</p>



<p>It allows you to:</p>



<ul class="wp-block-list">
<li>Connect a <strong>Large Language Model</strong> using APIs (e.g., LLMs from AI Endpoints);</li>



<li>Use <strong>tools</strong> such as HTTP requests, databases, or RAG retrievers so the agent can take actions or fetch real information;</li>



<li>Maintain <strong>conversational memory</strong> via PostgreSQL databases;</li>



<li>Integrate directly with chat platforms (e.g., Slack, Teams) for interactive assistants (optional).</li>
</ul>



<p>Simply put, n8n becomes an&nbsp;<strong>agentic automation framework</strong>, enabling LLMs to not only provide answers, but also think, choose, and perform actions.</p>



<p>Please note that you can change and customise this n8n <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node to fit your use cases, using features like function calling or structured output. This is the most basic configuration for the given use case. You can go even further with different agents.</p>



<p>🧑‍💻&nbsp;<strong>How do I implement this RAG?</strong></p>



<p>First, create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node in <strong>n8n</strong> as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png" alt="" class="wp-image-29933" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, a series of steps is required, the first of which is creating the prompts.</p>



<h5 class="wp-block-heading">3.1 Create prompts</h5>



<p>In the AI Agent node on your n8n workflow, edit the user and system prompts.</p>



<p>Begin by creating the&nbsp;<strong>prompt</strong>,&nbsp;which is also the&nbsp;<strong>user message</strong>:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<p>Then create the <strong>System Message</strong> as shown below:</p>



<pre class="wp-block-code"><code class="">You have access to a retriever tool connected to a knowledge base.  <br>Before answering, always search for relevant documents using the retriever tool.  <br>Use the retrieved context to answer accurately.  <br>If no relevant documents are found, say that you have no information about it.</code></pre>



<p>You should get a configuration like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png" alt="" class="wp-image-29935" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>🤔 Well, an LLM is now needed for this to work!</p>



<h5 class="wp-block-heading">3.2 Select LLM using AI Endpoints API</h5>



<p>First, add an <strong>OpenAI Chat Model</strong> node, and then set it as the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Model</mark></strong></code> for your agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png" alt="" class="wp-image-29939" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, select one of the&nbsp;<a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Endpoints</a>&nbsp;from the list provided, because they are compatible with OpenAI APIs.</p>



<p>✅ <strong>How?</strong> By using the right API base URL: <a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png" alt="" class="wp-image-29936" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The <a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog/gpt-oss-120b/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>GPT OSS 120B</strong></a> model has been selected for this use case. Other models, such as Llama, Mistral, and Qwen, are also available.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><mark style="background-color:#fcb900" class="has-inline-color">⚠️ <strong>WARNING</strong> ⚠️</mark></p>



<p>If you are using a recent version of n8n, you will likely encounter the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/responses</code></mark></strong> issue (linked to OpenAI compatibility). To resolve this, disable the <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Use Responses API</mark></code></strong> toggle and everything will work correctly.</p>
</blockquote>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="829" height="675" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg" alt="" class="wp-image-30352" style="aspect-ratio:1.2281554640124863;width:409px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg 829w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-300x244.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-768x625.jpg 768w" sizes="auto, (max-width: 829px) 100vw, 829px" /><figcaption class="wp-element-caption"><em>Tips to fix /responses issue</em></figcaption></figure>



<p>Your LLM is now set to answer your questions! Don’t forget, it needs access to the knowledge base.</p>



<h5 class="wp-block-heading">3.3 Connect the knowledge base to the RAG retriever</h5>



<p>As usual, the first step is to create an n8n node called <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">PGVector Vector Store node</mark></strong></code> and enter your PGVector credentials.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png" alt="" class="wp-image-29943" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Tools</code></mark></strong> section of the AI Agent node.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png" alt="" class="wp-image-29944" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember to connect your PG vector database so that the retriever can access the previously generated embeddings. Here’s an overview of what you’ll get.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png" alt="" class="wp-image-29945" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>⏳Nearly done! The final step is to add the database memory.</p>



<h5 class="wp-block-heading">3.4 Manage conversation history with database memory</h5>



<p>Creating a&nbsp;<strong>Database Memory</strong>&nbsp;node in n8n (PostgreSQL) lets you link it to your AI Agent, so it can store and retrieve past conversation history. This enables the model to remember and use context across multiple interactions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png" alt="" class="wp-image-29946" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, link this PostgreSQL database to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Memory</mark></strong></code> section of your AI agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png" alt="" class="wp-image-29947" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congrats! 🥳 Your&nbsp;<strong>n8n RAG workflow</strong>&nbsp;is now complete. Ready to test it?</p>



<h4 class="wp-block-heading">4. Make the most of your automated workflow</h4>



<p>Want to try it? It’s easy!</p>



<p>By clicking the orange <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Open chat</code></mark></strong> button, you can ask the AI agent questions about OVHcloud products, particularly where you need technical assistance.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4"></video></figure>



<p>For example, you can ask the LLM about rate limits in OVHcloud AI Endpoints and get the information in seconds.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4"></video></figure>



<p>You can now build your own autonomous RAG system using OVHcloud Public Cloud, suited for a wide range of applications.</p>



<h2 class="wp-block-heading">What’s next?</h2>



<p>To sum up, this reference architecture provides a guide on using&nbsp;<strong>n8n</strong> with&nbsp;<strong>OVHcloud AI Endpoints</strong>,&nbsp;<strong>AI Deploy</strong>,&nbsp;<strong>Object Storage</strong>, and&nbsp;<strong>PostgreSQL + pgvector</strong> to build a fully controlled, autonomous&nbsp;<strong>RAG AI system</strong>.</p>



<p>Teams can build scalable AI assistants that work securely and independently in their cloud environment by orchestrating ingestion, embedding generation, vector storage, retrieval, LLM safety checks, and reasoning within a single workflow.</p>



<p>With the core architecture in place, you can add more features to improve the capabilities and robustness of your agentic RAG system:</p>



<ul class="wp-block-list">
<li>Web search</li>



<li>Images with OCR</li>



<li>Audio files transcribed using the Whisper model</li>
</ul>



<p>This delivers an extensive knowledge base and a wider variety of use cases!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions%2F&amp;action_name=Reference%20Architecture%3A%20build%20a%20sovereign%20n8n%20RAG%20workflow%20for%20AI%20agent%20using%20OVHcloud%20Public%20Cloud%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4" length="11190376" type="video/mp4" />
<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4" length="9881210" type="video/mp4" />

			</item>
		<item>
		<title>Reference Architecture: deploying the Mistral Large 123B model in a sovereign environment with OVHcloud</title>
		<link>https://blog.ovhcloud.com/reference-architecture-deploy-mistral-large-model-in-sovereign-environment-ovhcloud/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Wed, 18 Jun 2025 12:45:51 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Training]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29186</guid>

					<description><![CDATA[Are you ready to think bigger with the Mistral Large model 🚀 ? As Artificial Intelligence (AI) becomes a strategic pillar for both enterprises and public institutions, data sovereignty and infrastructure control have become essential. Deploying advanced large language models (LLMs) like Mistral Large, under a commercial license, requires a secure, high-performance environment that complies [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploy-mistral-large-model-in-sovereign-environment-ovhcloud%2F&amp;action_name=Reference%20Architecture%3A%C2%A0deploying%20the%20Mistral%20Large%20123B%20model%20in%20a%20sovereign%20environment%20with%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em><strong>Are you ready to think bigger with the Mistral Large model 🚀 ?</strong></em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="461" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-1024x461.png" alt="" class="wp-image-29249" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-1024x461.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-300x135.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-768x346.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-1536x691.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Mistral Large model deployed on OVHcloud infrastructure<br></em></figcaption></figure>



<p>As Artificial Intelligence (<strong>AI</strong>) becomes a strategic pillar for both enterprises and public institutions, <strong>data sovereignty</strong> and <strong>infrastructure control</strong> have become essential. Deploying advanced large language models (LLMs) like <strong>Mistral Large</strong>, under a commercial license, requires a secure, high-performance environment that complies with <strong>European data regulations</strong>.</p>



<p><strong>OVHcloud Machine Learning Services</strong> offer a trusted solution for deploying AI models in a <strong>fully sovereign cloud environment</strong> — hosted in Europe, under <strong>EU jurisdiction</strong>, and fully <strong>GDPR-compliant</strong>.</p>



<p>This <strong>Reference Architecture</strong> will show you how to:</p>



<ul class="wp-block-list">
<li>Access Mistral AI registry using your own license</li>



<li>Download the Mistral Large 123B model automatically using <strong>AI Training</strong></li>



<li>Store the model into a dedicated bucket with <strong>OVHcloud Object Storage</strong></li>



<li>Deploy a production-ready inference API for <strong>Mistral Large</strong> using <strong>AI Deploy</strong> </li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">Mistral Large model</h3>



<p>The <strong>Mistral Large</strong> model is a <strong>state-of-the-art large language model (LLM)</strong> developed by <strong><a href="https://mistral.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral AI</a>,</strong> a French AI company. It&#8217;s designed to compete with top-tier models like GPT-4 and Claude, while emphasizing performance and efficiency.</p>



<p>This is a model with <strong>123 billion</strong> parameters. <strong>Mistral AI</strong> recommends deploying this model in FP8 with 4 H100 GPUs. For more information, refer to <a href="https://help.mistral.ai/en/articles/235545-mistral-models" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral documentation</a>.</p>



<p>This model requires a <strong>commercial licence</strong>. To obtain one, you need to create an account on <a href="https://console.mistral.ai/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">La Plateforme</a> via the Mistral AI console (<strong>console.mistral.ai</strong>).</p>



<h3 class="wp-block-heading">AI Training </h3>



<p><strong>OVHcloud AI Training</strong> is a fully managed platform designed to help you <strong>train and tune</strong> Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs) efficiently. Whether you&#8217;re working on computer vision, NLP, or tabular data, this solution lets you launch training jobs on high-performance GPUs in seconds.</p>



<p><strong>What are the key benefits?</strong></p>



<ul class="wp-block-list">
<li><strong>Easy to use</strong>: launch processing or training jobs in one CLI command or a few clicks using your own Docker image</li>



<li><strong>High-performance computing</strong>: access GPUs like H100, A100, V100S, L40S, and L4 as of June 2025 &#8211; new references are added regularly</li>



<li><strong>Cost-efficient</strong>:<strong> </strong>pay-per-minute billing with no upfront commitment. You only pay for compute time used, with precise control over resources thanks to automatic job stop and synchronisation</li>
</ul>



<p><strong>💡 Why do we need AI Training? </strong>To download the Mistral Large model automatically and efficiently, using a single command to launch the job.</p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p>OVHcloud AI Deploy is a<strong>&nbsp;Container as a Service</strong>&nbsp;(CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications / APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.</p>



<p><strong>The key benefits are:</strong></p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong>&nbsp;bring your own custom Docker image and deploy it with a single command line or a few clicks</li>



<li><strong>High-performance computing:</strong>&nbsp;a complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong>&nbsp;supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong>&nbsp;billing per minute, no surcharges</li>
</ul>



<p>✅ To go further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Overview of the Mistral Large deployment architecture</h2>



<p>Here is how <strong>Mistral Large 123B</strong> will be deployed:</p>



<ol class="wp-block-list">
<li>Install the <strong>ovhai CLI</strong></li>



<li>Create a bucket for <strong>model storage</strong></li>



<li>Retrieve the <strong>license information</strong> from <a href="https://console.mistral.ai/on-premise/licenses" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral Console</a></li>



<li>Configure and set up the<strong> environment</strong></li>



<li>Download the <strong>Mistral Large model weights</strong></li>



<li>Deploy the <strong>Mistral Large service</strong></li>



<li>Test it with simple request and <strong>advanced usage</strong> thanks to LangChain</li>
</ol>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="173" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-1024x173.png" alt="" class="wp-image-29251" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-1024x173.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-300x51.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-768x130.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-1536x259.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Let’s get started with the setup and deployment of your own Mistral Large service!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>A <strong><a href="https://console.mistral.ai/on-premise/licenses" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral AI license</a></strong> to access to the <strong>Mistral Large model</strong></li>



<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the following roles:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Training Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>
</ul>



<p><strong>🚀 With all the ingredients for our recipe in place, it’s time to deploy the Mistral Large model on 4 H100 GPUs!</strong></p>



<h2 class="wp-block-heading">Architecture guide:&nbsp;Mistral Large on OVHcloud infrastructure</h2>



<p>Let’s dive into the setup and deployment of the <strong>Mistral Large</strong> model!</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ Note</strong></p>
<cite><strong>In this example, the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Mistral Large 25.02</code></mark> is used. Choose the Mistral model covered by your licence and repeat the same steps, adapting the model name and version.</strong></cite></blockquote>



<p>⚙️<em>&nbsp;Also consider that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Install&nbsp;<code>ovhai</code>&nbsp;CLI</h3>



<p>If the <code><strong>ovhai</strong></code> CLI is not installed, start by setting up your CLI environment.</p>



<pre class="wp-block-code"><code class="">curl https://cli.gra.ai.cloud.ovh.net/install.sh | bash</code></pre>



<p>Then, log in using your&nbsp;<strong>OpenStack credentials</strong>.</p>



<pre class="wp-block-code"><code class="">ovhai login -u &lt;openstack-username&gt; -p &lt;openstack-password&gt;</code></pre>



<p>Now, it’s time to create your bucket inside OVHcloud Object Storage!</p>



<h3 class="wp-block-heading">Step 2 – Provision Object Storage</h3>



<ol class="wp-block-list">
<li>Go to&nbsp;<strong>Public Cloud &gt; Storage &gt; Object Storage</strong>&nbsp;in the OVHcloud Control Panel.</li>



<li>Create a&nbsp;<strong>datastore</strong>&nbsp;and a new&nbsp;<strong>S3 bucket</strong>&nbsp;(e.g.,&nbsp;<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>s3-mistral-large-model</code>)</mark></strong>.</li>



<li>Register the datastore with the&nbsp;<code>ovhai</code>&nbsp;CLI:</li>
</ol>



<pre class="wp-block-code"><code class="">ovhai datastore add s3 &lt;ALIAS&gt; https://s3.gra.perf.cloud.ovh.net/ gra &lt;my-access-key&gt; &lt;my-secret-key&gt; --store-credentials-locally</code></pre>



<p>💡 <em>Note that, for this use case, we recommend the <strong>High Performance Object Storage</strong> range using <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>https://s3.gra.perf.cloud.ovh.net/</strong></mark></code> instead of <code>https://s3.gra.io.cloud.ovh.net/</code></em></p>
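


<p>You can then confirm that the alias has been registered:</p>



<pre class="wp-block-code"><code class="">ovhai datastore list</code></pre>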



<h3 class="wp-block-heading">Step 3 &#8211; Access the Mistral AI registry</h3>



<p><em>⚠️ Please note that you must have a <strong>licence for the Mistral Large model </strong>to be able to carry out the following steps.</em></p>



<ul class="wp-block-list">
<li>Go to the Mistral AI platform: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://console.mistral.ai/home</mark></strong></li>



<li>Retrieve <strong>credentials</strong> and the <strong>license key</strong> from the Mistral console:<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"> https://console.mistral.ai/on-premise/licenses</mark></strong></li>



<li>Authenticate to the Mistral AI Docker registry:</li>
</ul>



<pre class="wp-block-code"><code class="">docker login &lt;mistral-ai-registry&gt; --username $DOCKER_USERNAME --password $DOCKER_PASSWORD</code></pre>



<ul class="wp-block-list">
<li>Add the private registry to the config using the <code><strong>ovhai</strong></code> CLI:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai registry add &lt;mistral-ai-registry&gt;</code></pre>



<ul class="wp-block-list">
<li>Check that it is present in the list:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai registry list</code></pre>



<h3 class="wp-block-heading">Step 4 &#8211; Define environment variables</h3>



<p>The next step is to define a<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"> <strong><code>.env</code></strong></mark> file that lists all the environment variables required to download and deploy the Mistral Large model.</p>



<ul class="wp-block-list">
<li>Create the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong><code>.env</code></strong></mark> file and enter the following information:</li>
</ul>



<pre class="wp-block-code"><code class=""><code>SERVED_MODEL=mistral-large-2502
RECIPES_VERSION=v0.0.76TP_SIZE=4
LICENSE_KEY=&lt;your-mistral-license-key&gt;
DOCKER_IMAGE_INFERENCE_ENGINE=&lt;<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">mistral-inference-server</span>-docker-image&gt;
DOCKER_IMAGE_MISTRAL_UTILS=<span style="background-color: rgba(248, 248, 242, 0.2); font-family: inherit; font-size: inherit; font-weight: inherit;">&lt;</span><span style="font-family: inherit; font-size: inherit; font-weight: inherit; background-color: initial;">mistral-utils</span><span style="background-color: rgba(248, 248, 242, 0.2); font-family: inherit; font-size: inherit; font-weight: inherit;">-docker-image&gt;</span></code></code></pre>



<ul class="wp-block-list">
<li>Then, create a script to load these environment variables easily. Name it <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">load_env.sh</mark></strong></code>:</li>
</ul>



<pre class="wp-block-code"><code class="">#!/bin/bash

# Vérifie si le fichier .env existe
if [ ! -f .env ]; then
  echo "Error: .env not found"
  exit 1
fi

# Exporter toutes les variables du .env
export $(grep -v '^#' .env | xargs)

echo "Environment variables are loaded from .env"</code></pre>



<ul class="wp-block-list">
<li>Now, run this script:</li>
</ul>



<pre class="wp-block-code"><code class="">source load_env.sh</code></pre>



<p>✅ You have everything you need to start the implementation!</p>



<h3 class="wp-block-heading">Step 5 &#8211; Download Mistral Large model weights</h3>



<p>The aim here is to download the model and its artefacts into the S3 bucket created earlier.</p>



<p>To achieve this, you can launch a download job that will run automatically with AI Training.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong> 💡 Here&#8217;s a tip! </strong></p>
<cite><strong>Note that here you are not using AI Training to train models, but as an easy-to-use Container as a Service solution. With a single command line, you can launch a one-shot download of the Mistral Large model with automatic synchronisation to Object Storage.</strong></cite></blockquote>



<ul class="wp-block-list">
<li>Launch the <strong>AI Training</strong> download job by attaching the object container:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai job run --name DOWNLOAD_MISTRAL_LARGE_123B \
              --cpu 12 \
              --volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW \
              -e RECIPES_VERSION=$RECIPES_VERSION \
              $<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">DOCKER_IMAGE_MISTRAL_UTILS</span> \
                -- bash -c "cd /app/mistral-rclone &amp;&amp; \ 
                  poetry run python mistral-rclone.py \
                  --license-key $LICENSE_KEY \
                  --download-model $SERVED_MODEL"</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai job run</code></li>
</ul>



<p>This is the core command to&nbsp;<strong>run a job</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Training</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code>--name DOWNLOAD_MISTRAL_LARGE_123B</code></li>
</ul>



<p>Sets a&nbsp;<strong>custom name</strong>&nbsp;for the job. For example,&nbsp;<code>DOWNLOAD_MISTRAL_LARGE_123B</code>.</p>



<ul class="wp-block-list">
<li><code>--cpu&nbsp;12</code></li>
</ul>



<p>Allocates&nbsp;<strong>12 CPU</strong>&nbsp;for the job.</p>



<ul class="wp-block-list">
<li><code>--volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW</code></li>
</ul>



<p>This mounts your&nbsp;<strong>OVHcloud Object Storage volume</strong>&nbsp;into the job’s file system:<br>–&nbsp;<code>s3-mistral-large-model@&lt;ALIAS&gt;/</code>: refers to your&nbsp;<strong>S3 bucket volume</strong>&nbsp;from the OVHcloud Object Storage<br>–&nbsp;<code>:/opt/ml/model</code>: mounts the volume into the container under&nbsp;<code>/opt/ml/model</code><br>–&nbsp;<code>RW</code>: enables&nbsp;<strong>Read/Write</strong>&nbsp;permissions</p>



<ul class="wp-block-list">
<li><code>-e RECIPES_VERSION=$RECIPES_VERSION</code></li>
</ul>



<p>This passes one of the <strong>environment variables</strong>&nbsp;defined previously.</p>



<ul class="wp-block-list">
<li><code>$DOCKER_IMAGE_MISTRAL_UTILS</code></li>
</ul>



<p>This is the<strong>&nbsp;Mistral Large utils Docker image</strong>&nbsp;you are running inside the job.</p>



<ul class="wp-block-list">
<li><code>-- bash -c "cd /app/mistral-rclone &amp;&amp; \</code><br><code>               poetry run python mistral-rclone.py \</code><br><code>                   --license-key $LICENSE_KEY \</code><br><code>                   --download-model $SERVED_MODEL"</code></li>
</ul>



<p>Refers to the specific command to <strong>launch the model download</strong>.</p>



<p><em>Note that synchronisation with Object Storage will be <strong>automatic at the end of the AI Training job</strong>.</em></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>⚠️ <strong>WARNING!</strong></p>
<cite><strong>Wait for the job to go to <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">DONE</mark></code> before proceeding to the next step</strong>.</cite></blockquote>
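


<p>While waiting, you can keep an eye on the job; the subcommands below mirror the <code>ovhai app get</code> call used elsewhere in this guide:</p>



<pre class="wp-block-code"><code class=""># Replace &lt;job_id&gt; with the ID returned by `ovhai job run`
ovhai job get &lt;job_id&gt;     # shows the current state (QUEUED, RUNNING, DONE...)
ovhai job logs &lt;job_id&gt;    # prints the download logs</code></pre>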



<ul class="wp-block-list">
<li>Check that the various elements are present in the bucket:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai bucket object list s3-mistral-large-model@&lt;ALIAS&gt;</code></pre>



<p>The bucket must be organized and split into 4 different folders:</p>



<ul class="wp-block-list">
<li>grammars</li>



<li>recipes</li>



<li>tokenizers</li>



<li>weights</li>
</ul>



<p>Note that a total of 6 elements must be present.</p>



<p>🚀 Is it all there? Then let&#8217;s move on to the <strong>deployment of the Mistral Large model</strong>!</p>



<h3 class="wp-block-heading">Step 6 &#8211; Deploy Mistral Large service</h3>



<p>To deploy the Mistral Large 123B model using the previously downloaded weights, you will use OVHcloud&#8217;s <strong>AI Deploy </strong>product.</p>



<p>But first, you need to create an API key that will allow you to consume and query the model, in particular through its OpenAI compatibility.</p>



<ul class="wp-block-list">
<li>Creation of an access token:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai token create --role read mistral_large=api_key_reader</code></pre>



<ul class="wp-block-list">
<li>Export this token as an environment variable:</li>
</ul>



<pre class="wp-block-code"><code class="">export MY_OVHAI_MISTRAL_LARGE_TOKEN=&lt;your_ovh_access_token_value&gt;</code></pre>



<ul class="wp-block-list">
<li>Launch the <strong>Mistral Large service</strong> with <strong>AI Deploy </strong>by running the following command:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai app run --name DEPLOY_MISTRAL_LARGE_123B \
              --gpu 4 \
              --flavor h100-1-gpu \
              --default-http-port 5000 \
              --label mistral_large=api_key_reader \
              -e SERVED_MODEL=$SERVED_MODEL \
              -e RECIPES_VERSION=$RECIPES_VERSION \
              -e TP_SIZE=$TP_SIZE \
              --volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW \
              --volume standalone:/tmp:RW \
              --volume standalone:/workspace:RW \
              $<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">DOCKER_IMAGE_INFERENCE_ENGINE</span></code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai app run</code></li>
</ul>



<p>This is the core command to&nbsp;<strong>run an app / API</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code>--name DEPLOY_MISTRAL_LARGE_123B</code></li>
</ul>



<p>Sets a&nbsp;<strong>custom name</strong>&nbsp;for the app. For example,&nbsp;<code>DEPLOY_MISTRAL_LARGE_123B</code>.</p>



<ul class="wp-block-list">
<li><code>--default-http-port 5000</code></li>
</ul>



<p>Exposes&nbsp;<strong>port 5000</strong>&nbsp;as the default HTTP endpoint.</p>



<ul class="wp-block-list">
<li><code>--gpu 4</code></li>
</ul>



<p>Allocates&nbsp;<strong>4 GPUs</strong>&nbsp;for the app.</p>



<ul class="wp-block-list">
<li><code>--flavor h100-1-gpu</code></li>
</ul>



<p>Chooses&nbsp;<strong>H100 GPUs</strong>&nbsp;for the app.</p>



<ul class="wp-block-list">
<li><code>--volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW</code></li>
</ul>



<p>This mounts your&nbsp;<strong>OVHcloud Object Storage volume</strong>&nbsp;into the app’s file system:<br>–&nbsp;<code>s3-mistral-large-model@&lt;ALIAS&gt;/</code>: refers to your&nbsp;<strong>S3 bucket volume</strong>&nbsp;from the OVHcloud Object Storage<br>–&nbsp;<code>:/opt/ml/model</code>: mounts the volume into the container under&nbsp;<code>/opt/ml/model</code><br>–&nbsp;<code>RW</code>: enables&nbsp;<strong>Read/Write</strong>&nbsp;permissions</p>



<ul class="wp-block-list">
<li><code>--label mistral_large=api_key_reader</code></li>
</ul>



<p>Means that access to the app is restricted to your token.</p>



<ul class="wp-block-list">
<li><code>-e SERVED_MODEL=$SERVED_MODEL</code></li>



<li><code>-e RECIPES_VERSION=$RECIPES_VERSION</code></li>



<li><code>-e TP_SIZE=$TP_SIZE</code></li>
</ul>



<p>These are&nbsp;<strong>environment variables</strong>&nbsp;defined previously.</p>



<ul class="wp-block-list">
<li><code>--volume standalone:/tmp:RW</code></li>



<li><code>--volume standalone:/workspace:RW</code></li>
</ul>



<p>Mounts&nbsp;<strong>two persistent storage volumes</strong>:<br>&#8211; <code>/tmp</code>&nbsp;→ Temporary files<br>&#8211; <code>/workspace</code>&nbsp;→ Main working directory</p>



<ul class="wp-block-list">
<li><code>$<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">DOCKER_IMAGE_INFERENCE_ENGINE</span></code></li>
</ul>



<p>This is the<strong>&nbsp;Mistral Large inference Docker image</strong>&nbsp;you are running inside the app.</p>



<p><em>It may take a few minutes for the resources to be allocated and for the <strong>Docker image</strong> to be pulled.</em></p>



<p>To check the progress and get additional information about the <strong>AI Deploy app</strong>, run the following command:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;ai_deploy_mistral_app_id&gt;</code></pre>



<p>Once the app is in <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></code></strong> status, the model will start loading. To check that the load was successful, you can check the container logs:</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;ai_deploy_mistral_app_id&gt;</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>⚠️ <strong>WARNING!</strong></p>
<cite><strong>To consume the service, you must wait for the app to go into <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></code> status, AND for the model to finish loading.</strong></cite></blockquote>
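


<p><em>Since the endpoint is OpenAI-compatible, one hedged way to detect that the model has finished loading is to poll the <code>/v1/models</code> route until it answers, as in the sketch below (the app URL is a placeholder to adapt).</em></p>



<pre class="wp-block-code"><code class=""># Minimal sketch: wait until the OpenAI-compatible /v1/models route answers,
# which is an assumed signal that the model has finished loading.
import os
import time

import requests

BASE_URL = "https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net"  # placeholder
TOKEN = os.environ["MY_OVHAI_MISTRAL_LARGE_TOKEN"]

while True:
    try:
        r = requests.get(
            f"{BASE_URL}/v1/models",
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        if r.status_code == 200:
            print("Model ready:", r.json())
            break
        print("Not ready yet, HTTP status:", r.status_code)
    except requests.RequestException as exc:
        print("Not reachable yet:", exc)
    time.sleep(30)</code></pre>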



<p>🎉 Everything ready? Then it&#8217;s time to start playing with the model!</p>



<h3 class="wp-block-heading">Step 7 &#8211; Test the Mistral Large model by sending your first requests</h3>



<ul class="wp-block-list">
<li>Access the API doc via your app URL:</li>
</ul>



<p><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code><strong>https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/docs</strong></code></mark></p>



<p>To find your license information, please refer to <a href="https://console.mistral.ai/on-premise/licenses" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://console.mistral.ai/on-premise/licenses</mark></strong></a></p>



<ul class="wp-block-list">
<li>Test with a basic cURL:</li>
</ul>



<pre class="wp-block-code"><code class="">curl -X 'POST' \
'https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $MY_OVHAI_MISTRAL_LARGE_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "mistral-large-&lt;version&gt;",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant!"
    },
    {
      "role": "user",
      "content": "What is the capital of France?"     
    }
  ]
}'</code></pre>



<p><strong>⚠️ Note that you also have to replace <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>&lt;version&gt;</code></mark> in the model name with the one you are using: </strong><br><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code><strong>"model": "mistral-large-&lt;version&gt;"</strong></code></mark></p>
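


<p><em>Since the endpoint is OpenAI-compatible, you can send the same request from Python with the official <code>openai</code> client. A minimal sketch (the app URL, model version and token are the same placeholders as above):</em></p>



<pre class="wp-block-code"><code class=""># Minimal sketch using the official `openai` client (v1+) against the
# OpenAI-compatible endpoint; URL and model version are placeholders.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/v1",
    api_key=os.environ["MY_OVHAI_MISTRAL_LARGE_TOKEN"],
)

response = client.chat.completions.create(
    model="mistral-large-&lt;version&gt;",
    messages=[
        {"role": "system", "content": "You are a helpful assistant!"},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)</code></pre>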



<p>To take the implementation a step further and take advantage of all the features of this endpoint, you can also integrate it with <strong>LangChain</strong> thanks to its full OpenAI compatibility.</p>



<ul class="wp-block-list">
<li>LangChain integration:</li>
</ul>



<pre class="wp-block-code"><code class="">import time
import os 
from langchain.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

def chat_completion_basic(new_message: str):

  model = ChatOpenAI(model_name="mistral-large-&lt;version&gt;",
                        openai_api_key=$MY_OVHAI_MISTRAL_LARGE_TOKEN,
                        openai_api_base='https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/v1',
                       )

  prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant!"),
    ("human", "{question}"),
  ])

  chain = prompt | model

  print("🤖: ")
  for r in chain.stream({"question", new_message}):
    print(r.content, end="", flush=True)
    time.sleep(0.150)

chat_completion_basic("What is the capital of France?)</code></pre>



<p>🥹 Congratulations! You have successfully completed the deployment!</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>You can now consume your <strong>Mistral Large 123B</strong> model in a secure environment!</p>



<p>The result of your implementation? The deployment of a sovereign, scalable, production-quality 123B LLM, powered by <strong>OVHcloud AI Deploy</strong>.</p>



<p>➡️ <strong>To go further? </strong></p>



<ul class="wp-block-list">
<li>Update your model in a single command line and without interruption following this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-update-custom-docker-image?id=kb_article_view&amp;sysparm_article=KB0057968" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a></li>



<li>Scale out to additional replicas in the event of a heavy load to ensure high availability using this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-apps-deployments?id=kb_article_view&amp;sysparm_article=KB0047997" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">method</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploy-mistral-large-model-in-sovereign-environment-ovhcloud%2F&amp;action_name=Reference%20Architecture%3A%C2%A0deploying%20the%20Mistral%20Large%20123B%20model%20in%20a%20sovereign%20environment%20with%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: set up MLflow Remote Tracking Server on OVHcloud</title>
		<link>https://blog.ovhcloud.com/mlflow-remote-tracking-server-ovhcloud-databases-object-storage-ai-solutions/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 15 Apr 2025 07:52:46 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebooks]]></category>
		<category><![CDATA[AI Training]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Managed Database]]></category>
		<category><![CDATA[MLflow]]></category>
		<category><![CDATA[Object Storage]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28564</guid>

					<description><![CDATA[Travel through the Data &#38; AI universe of OVHcloud with the MLflow integration. As Artificial Intelligence (AI) continues to grow in importance, Data Scientists and Machine Learning Engineers need a robust and scalable platform to manage the entire Machine Learning (ML) lifecycle. MLflow, an open-source platform, provides a comprehensive framework for managing ML experiments, models, [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmlflow-remote-tracking-server-ovhcloud-databases-object-storage-ai-solutions%2F&amp;action_name=Reference%20Architecture%3A%20set%20up%20MLflow%20Remote%20Tracking%20Server%20on%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Travel through the Data &amp; AI universe of OVHcloud with the <em>MLflow</em> integration.</em></p>



<figure class="wp-block-image aligncenter size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/mlflow_ref_archi.svg" alt="" class="wp-image-28689"/><figcaption class="wp-element-caption"><em>Mlflow Remote Tracking Server on OVHcloud</em></figcaption></figure>



<p>As <strong>Artificial Intelligence</strong> (AI) continues to grow in importance, <em>Data Scientists</em> and <em>Machine Learning Engineers</em> need a robust and scalable platform to manage the entire Machine Learning (ML) lifecycle. <br><a href="https://mlflow.org/docs/latest/introduction/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MLflow</a>, an open-source platform, provides a comprehensive framework for managing ML experiments, models, and deployments. </p>



<p><strong>MLflow</strong> offers many benefits and provides a complete framework for ML lifecycle management with features such as:</p>



<ul class="wp-block-list">
<li>Experiment tracking and model management</li>



<li>Reproducibility and collaboration</li>



<li>Scalability, flexibility, and integration</li>



<li>Automated ML and model serving capabilities</li>



<li>Improved model accuracy, faster time-to-market, and reduced costs.</li>
</ul>



<p>In this reference architecture, you will explore how to leverage remote experiment tracking with the <strong>MLflow Tracking Server</strong> on the <a href="https://www.ovhcloud.com/fr/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Public Cloud</a> infrastructure.<br>In fact, you will be able to build a scalable and efficient ML platform, streamlining your ML workflow and accelerating model development using <strong>OVHcloud AI Notebooks</strong>, <strong>AI Training</strong>, <strong>Managed Databases (PostgreSQL)</strong>, and <strong>Object Storage</strong>.</p>



<p><strong>The result?</strong> A fully remote, <strong>production-ready ML experiment tracking pipeline</strong>, powered by OVHcloud&#8217;s Data &amp; Machine Learning Services (e.g. AI Notebooks and AI Training).</p>



<h2 class="wp-block-heading">Overview of the MLflow server architecture</h2>



<p>Here is how MLflow will be configured:</p>



<ul class="wp-block-list">
<li><strong>Development and training environment:</strong> create and train models with <strong>AI Notebooks</strong></li>



<li><strong>Remote Tracking Server</strong>: hosted in an <strong>AI Training</strong> job (Container as a Service)</li>



<li><strong>Backend Store</strong>: benefit from a managed <strong>PostgreSQL</strong> database (DBaaS).</li>



<li><strong>Artifact Store</strong>: use OVHcloud <strong>Object Storage</strong> (S3-compatible).</li>
</ul>



<figure class="wp-block-image aligncenter size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/mlflow_overview.svg" alt="" class="wp-image-28688"/><figcaption class="wp-element-caption"><em>MLflow remote server deployment steps</em></figcaption></figure>



<p>In the following tutorial, all services are deployed within the <strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An <strong>OVHcloud Public Cloud</strong> account</li>



<li>An <strong>OpenStack user</strong> with the following roles:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Training Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>
</ul>



<p><strong>🚀 Having all the ingredients for our recipe, it’s time to set up your MLflow remote tracking server!</strong></p>



<h2 class="wp-block-heading">Architecture guide: MLflow remote tracking server</h2>



<p>Let’s set up and deploy your custom MLflow tracking tool!</p>



<p>⚙️<em> Also consider that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h4 class="wp-block-heading">Step 1 – Install <code>ovhai</code> CLI</h4>



<p>Firstly, start by setting up your CLI environment.</p>



<pre class="wp-block-code"><code class="">curl https://cli.gra.ai.cloud.ovh.net/install.sh | bash</code></pre>



<p>Secondly, login using your <strong>OpenStack credentials</strong>.</p>



<pre class="wp-block-code"><code class="">ovhai login -u &lt;openstack-username&gt; -p &lt;openstack-password&gt;</code></pre>



<p>Now, it&#8217;s time to create your bucket inside OVHcloud Object Storage!</p>



<h4 class="wp-block-heading">Step 2 – Provision Object Storage (Artifact Store)</h4>



<ol class="wp-block-list">
<li>Go to <strong>Public Cloud &gt; Storage &gt; Object Storage</strong> in the OVHcloud Control Panel.</li>



<li>Create a <strong>datastore</strong> and a new <strong>S3 bucket</strong> (e.g., <code>mlflow-s3-bucket</code>).</li>



<li>Register the datastore with the <code>ovhai</code> CLI:</li>
</ol>



<pre class="wp-block-code"><code class="">ovhai datastore add s3 &lt;ALIAS&gt; https://s3.gra.io.cloud.ovh.net/ gra &lt;my-access-key&gt; &lt;my-secret-key&gt; --store-credentials-locally</code></pre>



<h4 class="wp-block-heading">Step 3 – Create PostgreSQL Managed DB (Backend Store)</h4>



<p>1. Navigate to <strong>Databases &amp; Analytics &gt; Databases</strong></p>



<p><strong>2. Create a new <em>PostgreSQL</em> instance with <em>Essential plan</em></strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="627" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-1024x627.png" alt="" class="wp-image-28580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-1024x627.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-300x184.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-768x470.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-1536x941.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-2048x1254.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>3. Select <em>Location</em> and <em>Node type</em></strong></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="661" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-1024x661.png" alt="" class="wp-image-28581" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-1024x661.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-300x194.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-768x495.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-1536x991.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-2048x1321.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>4. Reset the user password</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="2384" height="1340" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited.png" alt="" class="wp-image-28590" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited.png 2384w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-1536x863.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-2048x1151.png 2048w" sizes="auto, (max-width: 2384px) 100vw, 2384px" /></figure>



<p><strong>5. Take note of the following parameters</strong></p>



<p>Go to your database dashboard:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="640" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-1024x640.png" alt="" class="wp-image-28583" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-1024x640.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-768x480.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-1536x960.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-2048x1280.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, copy the <strong>connection information</strong>:</p>



<pre class="wp-block-code"><code class="">&lt;db_hostname&gt;
&lt;db_username&gt;
&lt;db_password&gt;
&lt;db_name&gt;
&lt;db_port&gt;
&lt;ssl_mode&gt;</code></pre>



<p>Your <strong>Backend Store</strong> is now ready to use!</p>



<h4 class="wp-block-heading">Step 4 -Build you custom MLflow Docker image and </h4>



<p><strong>1. Develop the MLflow launch script</strong></p>



<p>Firstly, write a Bash script that launches the server: <strong><em>mlflow_server.sh</em></strong></p>



<pre class="wp-block-code"><code class="">echo "The MLflow server is starting..."

mlflow server \
  --backend-store-uri postgresql://${POSTGRE_USER}:${POSTGRE_PASSWORD}@${PG_HOST}:${PG_PORT}/${PG_DB}?sslmode=${SSL_MODE} \
  --default-artifact-root ${S3_BUCKET_NAME}/ \
  --host 0.0.0.0 \
  --port 5000</code></pre>



<p><strong>2. Create Dockerfile</strong></p>



<p>Install the required Python dependency and grant ownership of the <strong>/mlruns</strong> path to the OVHcloud user.</p>



<pre class="wp-block-code"><code class="">FROM ghcr.io/mlflow/mlflow:latest

# Install Python dependencies
RUN pip install psycopg2-binary

COPY mlflow_server.sh .

# Change the ownership of `mlruns` directory to the OVHcloud user (42420:42420)
RUN mkdir -p /mlruns
RUN chown -R 42420:42420 /mlruns

# Start MLflow server inside container
CMD ["bash", "mlflow_server.sh"]</code></pre>



<p><strong>3. Build your custom MLflow docker image</strong></p>



<p>Build the docker image using the previous Dockerfile.</p>



<pre class="wp-block-code"><code class="">docker build . -t mlflow-server-ai-training:latest</code></pre>



<p><strong>4. Tag and push the docker image to your registry</strong></p>



<p>Finally, you can push the Docker image to your registry.</p>



<pre class="wp-block-code"><code class="">docker tag mlflow-server-ai-training:latest &lt;your-registry-address&gt;/mlflow-server-ai-training:latest</code></pre>



<pre class="wp-block-code"><code class="">docker push &lt;your-registry-address&gt;/mlflow-server-ai-training:latest</code></pre>



<p>Congrats! You can now use the Docker image to launch the MLflow server.</p>



<h4 class="wp-block-heading">Step 5 &#8211; Start MLflow Tracking Server inside container</h4>



<p>You can use AI Training to start the MLflow server inside a job.</p>



<p><strong>1. Using <code>ovhai</code> CLI, run the following command inside terminal</strong></p>



<pre class="wp-block-code"><code class="">ovhai job run --name mlflow-server \
              --default-http-port 5000 \
              --cpu 4 \
              -v mlflow-s3-bucket@DEMO/:/artifacts:RW:cache \
              -e POSTGRE_USER=avnadmin \
              -e POSTGRE_PASSWORD=&lt;db_password&gt; \
              -e S3_ENDPOINT=https://s3.gra.io.cloud.ovh.net/ \
              -e S3_BUCKET_NAME=mlflow-s3-bucket \
              -e PG_HOST=&lt;db_hostname&gt; \
              -e PG_DB=defaultdb \
              -e PG_PORT=20184 \
              -e SSL_MODE=require \
              &lt;your_registry_address&gt;/mlflow-server-ai-training:latest</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai job run</code></li>
</ul>



<p>This is the core command to <strong>run a job</strong> using the <strong>OVHcloud AI Training</strong> platform.</p>



<ul class="wp-block-list">
<li><code>--name mlflow-server</code></li>
</ul>



<p>Sets a <strong>custom name</strong> for the job. For example, <code>mlflow-server</code>.</p>



<ul class="wp-block-list">
<li><code>--default-http-port 5000</code></li>
</ul>



<p>Exposes <strong>port 5000</strong> as the default HTTP endpoint. MLflow’s web UI typically runs on port 5000, so this ensures the UI is accessible once the job is running.</p>



<ul class="wp-block-list">
<li><code>--cpu 4</code></li>
</ul>



<p>Allocates <strong>4 CPUs</strong> for the job. You can adjust this based on how heavy your MLflow workload is.</p>



<ul class="wp-block-list">
<li><code>-v mlflow-s3-bucket@DEMO/:/artifacts:RW:cache</code></li>
</ul>



<p>This mounts your <strong>OVHcloud Object Storage volume</strong> into the job’s file system:<br>&#8211; <code>mlflow-s3-bucket@DEMO/</code>: refers to your <strong>S3 bucket volume</strong> from the OVHcloud Object Storage<br>&#8211; <code>:/artifacts</code>: mounts the volume into the container under <code>/artifacts</code><br>&#8211; <code>RW</code>: enables <strong>Read/Write</strong> permissions<br>&#8211; <code>cache</code>: enables <strong>volume caching</strong>, improving performance for frequent reads/writes</p>



<ul class="wp-block-list">
<li><code>-e POSTGRE_USER=avnadmin</code></li>



<li><code>-e POSTGRE_PASSWORD=&lt;db_password&gt;</code></li>



<li><code>-e PG_HOST=&lt;db_hostname&gt;</code></li>



<li><code>-e PG_DB=defaultdb</code></li>



<li><code>-e PG_PORT=20184</code></li>



<li><code>-e SSL_MODE=require</code></li>
</ul>



<p>These are <strong>environment variables</strong> for connecting to the <strong>PostgreSQL </strong>backend store:<br>&#8211; <code>avnadmin</code>: the default admin user for OVHcloud’s managed PostgreSQL<br>&#8211; <code>POSTGRE_PASSWORD</code>: must be replaced with your actual database password<br>&#8211; <code>PG_HOST</code>: the hostname of your managed PostgreSQL instance<br>&#8211; <code>PG_DB</code>: the name of the database to use (default: <code>defaultdb</code>)<br>&#8211; <code>PG_PORT</code>: the port your PostgreSQL server is listening on<br>&#8211; <code>SSL_MODE</code>: enforce SSL connection to secure DB traffic</p>



<ul class="wp-block-list">
<li><code>-e S3_ENDPOINT=https://s3.gra.io.cloud.ovh.net/</code></li>
</ul>



<p>Tells MLflow where the <strong>S3-compatible endpoint</strong> is hosted. This is specific to OVHcloud&#8217;s GRA (Gravelines) region Object Storage.</p>



<ul class="wp-block-list">
<li><code>-e S3_BUCKET_NAME=mlflow-s3-bucket</code></li>
</ul>



<p>Sets the <strong>name of the S3 bucket</strong> where MLflow should store artifacts (models, metrics, etc.).</p>



<ul class="wp-block-list">
<li><code>&lt;your_registry_address&gt;/mlflow-server-ai-training:latest</code></li>
</ul>



<p>This is the<strong> custom MLflow Docker image</strong> you are running inside the job.</p>



<p><strong>2. Check if your AI Training job is RUNNING</strong></p>



<p>Replace the <code>&lt;job_id&gt;</code> by yours.</p>



<pre class="wp-block-code"><code class="">ovhai job get &lt;job_id&gt;</code></pre>



<p>You should obtain:</p>



<p><code>History:<br>    DATE                  STATE<br>    04-04-25 09:58:00     QUEUED<br>    04-04-25 09:58:01     INITIALIZING<br>    04-04-25 09:58:07     PENDING<br>    04-04-25 09:58:10     <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></strong><br>  Info:<br>    Message:   Job is running</code></p>



<p><strong>3. Retrieve the IP and external IP of your AI Training job</strong></p>



<p>Using your <code>&lt;job_id&gt;</code>, you can retrieve your AI Training <strong>job IP</strong>.</p>



<pre class="wp-block-code"><code class="">ovhai job get &lt;job_id&gt; -o json | jq '.status.ip' -r</code></pre>



<p>For example, you may obtain something like this: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong></p>



<p>You also need the External IP:</p>



<pre class="wp-block-code"><code class="">ovhai job get &lt;job_id&gt; -o json | jq '.status.externalIp' -r</code></pre>



<p>Returning the IP address you will have to whitelist to be able to connect to your database (e.g. <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>51.210.38.188</strong></mark>)</p>



<h4 class="wp-block-heading">Step 6 – Whitelist AI Training job IP in PostgreSQL DB</h4>



<p>From <strong>Databases &amp; Analytics &gt; Databases</strong>, edit your DB configuration to <strong>allow access from the job External IP</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="475" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-1024x475.png" alt="" class="wp-image-28593" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-1024x475.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-300x139.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-768x356.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-1536x712.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-2048x950.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, you can see that the job External IP is now whitelisted.</p>



<p>Well done! Your MLflow server and the backend store are now connected.</p>
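


<p><em>As an optional sanity check, you can query the tracking server from any machine that can reach it, for example with the minimal sketch below (the job URL is a placeholder; <code>mlflow.search_experiments()</code> requires MLflow 2.x).</em></p>



<pre class="wp-block-code"><code class=""># Minimal sketch: verify the tracking server answers by listing experiments.
# The job URL is a placeholder; requires `pip install mlflow` (2.x) locally.
import mlflow

mlflow.set_tracking_uri("https://&lt;job_id&gt;.job.gra.ai.cloud.ovh.net")

# A fresh server should at least return the "Default" experiment.
for exp in mlflow.search_experiments():
    print(exp.experiment_id, exp.name)</code></pre>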



<h4 class="wp-block-heading">Step 7 –  Create an AI Notebook</h4>



<p>It&#8217;s time to train and track your Machine Learning models using MLflow!</p>



<p>To do so, use the OVHcloud <code>ovhai</code> CLI and start a new AI Notebook with GPU.</p>



<pre class="wp-block-code"><code class="">ovhai notebook run conda jupyterlab \
  --name mlflow-notebook \
  --framework-version conda-py311-cudaDevel11.8 \
  --gpu 1</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai notebook run</code></li>
</ul>



<p>This is the core command to <strong>run a notebook</strong> using the <strong>OVHcloud AI Notebooks</strong> platform.</p>



<ul class="wp-block-list">
<li><code>--name mlflow-notebook</code></li>
</ul>



<p>Sets a <strong>custom name</strong> for the notebook. In this case, you can name it <code>mlflow-notebook</code>.</p>



<ul class="wp-block-list">
<li><code>--framework-version conda-py311-cudaDevel11.8</code></li>
</ul>



<p>Defines the framework and version you want to use in your notebook. Here, you are using Python 3.11 with the Conda framework and CUDA compatibility.</p>



<ul class="wp-block-list">
<li><code>--gpu 1</code></li>
</ul>



<p>Allocates <strong>1 GPU</strong> for the job, by default a <strong>Tesla V100S</strong> from NVIDIA (<code>ai1-1-gpu</code>). You can select the flavor you want from the OVHcloud GPU range.</p>



<p>Then, check if your AI Notebook is RUNNING.</p>



<pre class="wp-block-code"><code class="">ovhai notebook get &lt;notebook_id&gt;</code></pre>



<p>Once your notebook is in RUNNING status, you should be able to access it using its URL:</p>



<p><code>State:          <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></strong><br>Duration:       1411412   <br>Url:            <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://&lt;notebook_id&gt;.notebook.gra.ai.cloud.ovh.net</mark></strong><br>Grpc Address:   &lt;notebook_id&gt;.nb-grpc.gra.ai.cloud.ovh.net:443<br>Info Url:       https://ui.gra.ai.cloud.ovh.net/notebook/&lt;notebook_id&gt;</code></p>



<p>You can start your AI model development inside the notebook.</p>



<h4 class="wp-block-heading">Step 8 – Model training inside Jupyter notebook</h4>



<p>To begin with, set up your notebook environment.</p>



<p><strong>1. Create the <code>requirements.txt</code> file</strong></p>



<pre class="wp-block-code"><code class="">numpy==2.2.3
scipy==1.15.2
mlflow==2.20.3
scikit-learn==1.6.1</code></pre>



<p><strong>2. Install dependencies</strong></p>



<p>From a notebook cell, launch the following command.</p>



<pre class="wp-block-code"><code class="">!pip3 install -r requirements.txt</code></pre>



<p>Perfect! You can start coding&#8230;</p>



<p><strong>3. Import Python libraries</strong></p>



<p>Here, you have to import os, mlflow and scikit-learn.</p>



<pre class="wp-block-code"><code class=""># import dependencies
import os
import mlflow
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor</code></pre>



<p>In another notebook cell, set the MLflow tracking URI. Note that you have to replace <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong> by your own <strong>job IP</strong>.</p>



<pre class="wp-block-code"><code class="">mlflow.set_tracking_uri("http://<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong>:5000")</code></pre>



<p>Then start training your model!</p>



<pre class="wp-block-code"><code class="">mlflow.autolog()

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

# Create and train models.
rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
rf.fit(X_train, y_train)

# Use the model to make predictions on the test dataset.
predictions = rf.predict(X_test)</code></pre>



<p><strong>Output:</strong></p>



<p><code>🏃 View run dashing-foal-850 at: http://<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong>:5000/#/experiments/0/runs/e7dad7c073634ec28675c0defce2b9ec </code><br><code>🧪 View experiment at: http://<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong>:5000/#/experiments/0</code></p>



<p>Congrats! You can now track your model training from the <strong>MLflow remote server</strong>&#8230;</p>
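


<p><em>Beyond <code>mlflow.autolog()</code>, you can also log parameters and metrics explicitly. A minimal sketch, reusing the tracking URI set above and the <code>y_test</code> and <code>predictions</code> variables from the previous cell (the run, parameter and metric names are illustrative):</em></p>



<pre class="wp-block-code"><code class=""># Minimal sketch: explicit logging alongside autolog, reusing the tracking
# URI set above. Run, parameter and metric names are illustrative.
import mlflow
from sklearn.metrics import mean_squared_error

with mlflow.start_run(run_name="rf-manual-logging"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("test_mse", mean_squared_error(y_test, predictions))</code></pre>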



<h4 class="wp-block-heading">Step 9 – Track and compare models from MLflow remote server</h4>



<p>Finally, access the MLflow dashboard using the job URL: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://&lt;job_id&gt;.job.gra.ai.cloud.ovh.net</code></mark></strong></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-1024x578.png" alt="" class="wp-image-28598" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-1536x867.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-2048x1155.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, you can check your model trainings and evaluations:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="577" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-1024x577.png" alt="" class="wp-image-28599" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-1024x577.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-1536x866.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-2048x1154.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>What a success! You can finally use your MLflow server to evaluate, compare and archive your various training runs.</p>



<h4 class="wp-block-heading">Step 10 &#8211; Monitor everything remotely</h4>



<p>You now have a complete Machine Learning pipeline with remote experiment tracking. Access:</p>



<ul class="wp-block-list">
<li><strong>Metrics, Parameters, and Tags</strong> → PostgreSQL</li>



<li><strong>Artifacts (Models, Files)</strong> → S3 bucket</li>
</ul>



<p>This setup is reusable, automatable, and production-ready!</p>



<h2 class="wp-block-heading">What’s next?</h2>



<ul class="wp-block-list">
<li>Automate deployment with <strong><a href="https://eu.api.ovh.com/" data-wpel-link="exclude">OVHcloud APIs</a></strong></li>



<li>Run different training sessions in parallel and compare them with your <strong>remote MLflow tracking server</strong></li>



<li>Use <strong><a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a></strong> to serve your trained models</li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmlflow-remote-tracking-server-ovhcloud-databases-object-storage-ai-solutions%2F&amp;action_name=Reference%20Architecture%3A%20set%20up%20MLflow%20Remote%20Tracking%20Server%20on%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Mistral Small 24B served with vLLM and AI Deploy &#8211; a single command to deploy an LLM (Part 1)</title>
		<link>https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Mon, 24 Feb 2025 10:08:37 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28212</guid>

					<description><![CDATA[You are not dreaming! You can deploy open-source LLM in a single command line. Deploying advanced language models can be a challenge! But this sometimes this arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications. In this guide, we will walk through deploying the Mistral-Small-24B-Instruct-2501 model using vLLM [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm%2F&amp;action_name=Mistral%20Small%2024B%20served%20with%20vLLM%20and%20AI%20Deploy%20%26%238211%3B%20a%20single%20command%20to%20deploy%20an%20LLM%20%28Part%201%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><strong><em>You are not dreaming! You can deploy open-source LLM in a single command line</em>.</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="724" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1024x724.png" alt="Rocket in MistralAI colors in a data center with a French rooster showing rapid LLM deployment" class="wp-image-28219" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1024x724.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-300x212.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-768x543.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1536x1086.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy.png 2000w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Deploying advanced language models can be a challenge! But this sometimes arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications.</p>



<p>In this guide, we will walk through deploying the <strong><a href="https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Instruct-2501</a></strong> model using <strong>vLLM</strong> on OVHcloud&#8217;s <a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy platform</a>. This combination offers a powerful solution for efficient and scalable AI model serving.</p>



<p>Deploying a model is great, but doing it quickly is even better!</p>



<p>🤯 <strong>What if a single command line was enough?</strong> That&#8217;s the challenge we&#8217;re tackling today!</p>



<h2 class="wp-block-heading">Context</h2>



<p>Before deployment, let’s take a closer look at our key technologies!</p>



<h3 class="wp-block-heading">Mistral Small</h3>



<p>The <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code> is a 24-billion-parameter instruction-fine-tuned model, renowned for its compact size and performance comparable to larger models.</p>



<p>This model, from <a href="https://mistral.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MistralAI</a>, is an instruction-fine-tuned version of the base model:&nbsp;<a href="https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Base-2501</a>.</p>



<p>To serve this model efficiently, we will utilize vLLM, an open-source library for <strong>LLM inference</strong>.</p>



<h3 class="wp-block-heading">vLLM</h3>



<p><a href="https://docs.vllm.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM</a> (<strong>Virtual LLM</strong>) is a highly optimized service engine designed to efficiently run large language models. It takes advantage of several key optimizations, such as:</p>



<ul class="wp-block-list">
<li><strong>PagedAttention:</strong> an attention mechanism that reduces memory fragmentation and enables more efficient use of GPU memory</li>



<li><strong>Continuous Batching:</strong> vLLM dynamically adjusts batch sizes in real time, ensuring that the GPU is always used efficiently, even with multiple simultaneous requests</li>



<li><strong>Tensor parallelism:</strong> enables model inference across multiple GPUs to boost performance</li>



<li><strong>Optimized kernel implementations:</strong> vLLM uses custom CUDA kernels for faster execution, reducing latency compared to traditional inference frameworks</li>
</ul>



<p>These features make vLLM one of the best choices for large models such as Mistral Small 24B, enabling low-latency, high-throughput inference on the latest GPUs.</p>



<p>By deploying on OVHcloud&#8217;s AI Deploy platform, you can deploy this model in a single command line.</p>



<h3 class="wp-block-heading">AI Deploy </h3>



<p>OVHcloud AI Deploy is a<strong> Container as a Service</strong> (CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications / APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.</p>



<p>The key benefits are:</p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong> bring your own custom Docker image and deploy it in a single command line or a few clicks</li>



<li><strong>High-performance computing:</strong> a complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong> supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong> billing per minute, no surcharges</li>
</ul>



<p>✅ To go further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure that you have:</p>



<ul class="wp-block-list">
<li><strong>OVHcloud account</strong>: access to the&nbsp;<a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude">OVHcloud Control Panel</a></li>



<li><strong>ovhai CLI available:</strong> install the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ovhai CLI</a></li>



<li><strong>AI Deploy access</strong>: ensure you have a <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for AI Deploy</a></li>



<li><strong>Hugging Face access</strong>: create an <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a> and generate an <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></li>



<li><strong>Gated model authorization</strong>: be sure you have been granted access to <a href="https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Instruct-2501</a> model</li>
</ul>



<p><strong>🚀 Having all the ingredients for our recipe, it&#8217;s time to deploy!</strong></p>



<h2 class="wp-block-heading">Deployment of the Mistral Small 24B LLM</h2>



<p>Let&#8217;s go for the deployment of the model <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code></p>



<h3 class="wp-block-heading">Manage access tokens</h3>



<p>Export your <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face token</a>.</p>



<pre class="wp-block-code"><code class="">export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx</code></pre>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a token</a> to access your AI Deploy app once it will be deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>Returning the following output:</p>



<p><code><strong>Id:         47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-02-25 11:53:05<br>Updated At: 20-02-25 11:53:05<br>Spec:<br>  Name:           ai_deploy_token=my_operator_token<br>  Role:           AiTrainingOperator<br>  Label Selector: <br>Status:<br>  Value:   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>  Version: 1</strong></code></p>



<p>You can now store and export your access token:</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</span></code></pre>



<h3 class="wp-block-heading">Launch Mistral Small LLM with AI Deploy</h3>



<p>You are ready to start<strong> Mistral-Small-24B</strong> using vLLM and AI Deploy:</p>



<pre class="wp-block-code"><code class="">ovhai app run --name vllm-mistral-small \
              --default-http-port 8000 \
              --label ai_deploy_token=my_operator_token \
              --gpu 2 \
              --flavor l40s-1-gpu \
              -e OUTLINES_CACHE_DIR=/tmp/.outlines \
              -e HF_TOKEN=$MY_HF_TOKEN \
              -e HF_HOME=/hub \
              -e HF_DATASETS_TRUST_REMOTE_CODE=1 \
              -e HF_HUB_ENABLE_HF_TRANSFER=0 \
              -v standalone:/hub:rw \
              -v standalone:/workspace:rw \
              vllm/vllm-openai:v0.8.2 \
              -- bash -c "python3 -m vllm.entrypoints.openai.api_server \
                        --model mistralai/Mistral-Small-24B-Instruct-2501 \
                        --tensor-parallel-size 2 \
                        --tokenizer_mode mistral \
                        --load_format mistral \
                        --config_format mistral \
                        --dtype half"</code></pre>



<p><strong>How to understand the different parameters of this command?</strong></p>



<h5 class="wp-block-heading">1. Start your AI Deploy app</h5>



<p>Launch a new app using <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ovhai CLI</a> and name it.</p>



<p><code><strong>ovhai app run --name vllm-mistral-small</strong></code></p>



<h5 class="wp-block-heading">2. Define access</h5>



<p>Define the HTTP API port and restrict access to your token.</p>



<p><strong><code>--default-http-port 8000</code><br><code>--label ai_deploy_token=my_operator_token</code></strong></p>



<h5 class="wp-block-heading">3. Configure GPU resources</h5>



<p>Specifies the hardware flavor (<code><strong>l40s-1-gpu</strong></code>), which refers to an <strong>NVIDIA L40S GPU</strong>, and the number of GPUs (<code><strong>2</strong></code>).</p>



<p><code><strong>--gpu 2<br>--flavor l40s-1-gpu</strong></code></p>



<p><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️WARNING!</mark></strong> For this model, two L40S are sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access A100 and H100 GPUs for your larger models.</p>



<h5 class="wp-block-heading">4. Set up environment variables</h5>



<p>Configure caching for the <strong>Outlines library</strong> (used for efficient text generation):</p>



<p><code><strong>-e OUTLINES_CACHE_DIR=/tmp/.outlines</strong></code></p>



<p>Pass the <strong>Hugging Face token</strong> (<code>$MY_HF_TOKEN</code>) for model authentication and download:</p>



<p><code><strong>-e HF_TOKEN=$MY_HF_TOKEN</strong></code></p>



<p>Set the <strong>Hugging Face cache directory</strong> to <code>/hub</code> (where models will be stored):</p>



<p><code><strong>-e HF_HOME=/hub</strong></code></p>



<p>Allow execution of <strong>custom remote code</strong> from Hugging Face datasets (required for some model behaviors):</p>



<p><code><strong>-e HF_DATASETS_TRUST_REMOTE_CODE=1</strong></code></p>



<p>Disable <strong>Hugging Face Hub transfer acceleration</strong> (to use standard model downloading):</p>



<p><code><strong>-e HF_HUB_ENABLE_HF_TRANSFER=0</strong></code></p>



<h5 class="wp-block-heading">5. Mount persistent volumes</h5>



<p>Mounts <strong>two persistent storage volumes</strong>:</p>



<ul class="wp-block-list">
<li><code>/hub</code> → Stores Hugging Face model files</li>



<li><code>/workspace</code> → Main working directory</li>
</ul>



<p>The <code>rw</code> flag means <strong>read-write access</strong>.</p>



<p><code><strong>-v standalone:/hub:rw<br>-v standalone:/workspace:rw</strong></code></p>



<h5 class="wp-block-heading">6. Choose the target Docker image</h5>



<p>Uses the <strong><code>vllm/vllm-openai:v0.8.2</code></strong> Docker image (a pre-configured vLLM OpenAI API server).</p>



<p><strong><code>vllm/vllm-openai:v0.8.2</code></strong></p>



<h5 class="wp-block-heading">7. Running the model inside the container</h5>



<p>Runs a<strong> bash shell</strong> inside the container and executes a Python command to launch the vLLM API server:</p>



<ul class="wp-block-list">
<li><strong><code>python3 -m vllm.entrypoints.openai.api_server</code></strong> → Starts the OpenAI-compatible vLLM API server</li>



<li><strong><code>--model mistralai/Mistral-Small-24B-Instruct-2501</code></strong> → Loads the <strong>Mistral Small 24B</strong> model from Hugging Face</li>



<li><strong><code>--tensor-parallel-size 2</code></strong> → Distributes the model across <strong>2 GPUs</strong></li>



<li><strong><code>--tokenizer_mode mistral</code></strong> → Uses the <strong>Mistral tokenizer</strong></li>



<li><strong><code>--load_format mistral</code></strong> → Uses Mistral’s model loading format</li>



<li><strong><code>--config_format mistral</code></strong> → Ensures the model configuration follows Mistral&#8217;s standard</li>



<li><strong><code>--dtype half</code></strong> → Uses <strong>FP16 (half-precision floating point)</strong> for optimized GPU performance</li>
</ul>



<p>You can now check if your <strong>AI Deploy</strong> app is alive:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;your_vllm_app_id&gt;</code></pre>



<p>💡<strong>Is your app in <code>RUNNING</code> status?</strong> Perfect! You can check in the logs that the server has started&#8230;</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;your_vllm_app_id&gt;</code></pre>



<p><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️WARNING!</mark></strong> This step may take a little time as the model must be loaded&#8230;<br>After a few minutes, you should get the following information in the logs:</p>



<p><code><strong>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Started server process [13]<br>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Waiting for application startup.<br>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Application startup complete.<br>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)</strong></code></p>



<p>🚦 <strong>Are all the indicators green? </strong>Then it&#8217;s off to inference!</p>



<h3 class="wp-block-heading">Request and send prompt to the LLM</h3>



<p>Launch the following query by asking the question of your choice:</p>



<pre class="wp-block-code"><code class="">curl https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \
  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}
    ],
    "stream": false
  }'</code></pre>



<p>Returning the following result:</p>



<pre class="wp-block-code"><code class="">{
  "id":"chatcmpl-d6ea734b524bd851668e71d4111ba496",
  "object":"chat.completion",
  "created":1740059807,
  "model":"mistralai/Mistral-Small-24B-Instruct-2501",
  "choices":[
    {
      "index":0,
      "message":{
        "role":"assistant",
        "reasoning_content":null, 
        "content":"The founder of OVHcloud is Octave Klaba.",
        "tool_calls":[]
      },
      "logprobs":null,
      "finish_reason":"stop",
      "stop_reason":null
    }
  ],
  "usage":{
    "prompt_tokens":22,
    "total_tokens":35,
    "completion_tokens":13,
    "prompt_tokens_details":null
  },
  "prompt_logprobs":null
}</code></pre>
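

<p>Since the server exposes OpenAI-compatible endpoints, you can also query it from Python with the official <code>openai</code> client. Here is a minimal sketch equivalent to the <code>curl</code> call above (the app URL and the token variable name are placeholders to replace with your own values):</p>



<pre class="wp-block-code"><code class=""># Minimal Python equivalent of the curl request above,
# using the OpenAI-compatible API served by vLLM
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1",
    api_key=os.environ["MY_OVHAI_ACCESS_TOKEN"],
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me the name of OVHcloud’s founder."},
    ],
)

print(response.choices[0].message.content)</code></pre>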



<h2 class="wp-block-heading">Conclusion</h2>



<p>By following these steps, you have successfully deployed the <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code> model using <strong>vLLM</strong> on OVHcloud&#8217;s AI Deploy platform. This setup provides a scalable and efficient solution for serving advanced language models in production environments.</p>



<p>For further customization and optimization, refer to the <a href="https://docs.vllm.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM documentation</a> and the <a href="https://help.ovhcloud.com/csm/en-ie-documentation-public-cloud-ai-and-machine-learning-ai-deploy?id=kb_browse_cat&amp;kb_id=574a8325551974502d4c6e78b7421938&amp;kb_category=3241efc6a052d910f078d4b4ef43651f&amp;spa=1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Deploy resources</a>.</p>



<p>💪 <strong>Challenges taken!</strong> You can now enjoy the power of your LLM deployed in a single command line!</p>



<p>Want even more simplicity? You can also use ready-to-use APIs with <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>!</p>



<p><strong><em>But… what’s next?</em></strong></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm%2F&amp;action_name=Mistral%20Small%2024B%20served%20with%20vLLM%20and%20AI%20Deploy%20%26%238211%3B%20a%20single%20command%20to%20deploy%20an%20LLM%20%28Part%201%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Master Speech AI and build your own Video Translator app with AI Endpoints!</title>
		<link>https://blog.ovhcloud.com/master-speech-ai-and-build-your-own-video-translator-app-with-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Thu, 25 Jul 2024 08:10:49 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=27091</guid>

					<description><![CDATA[Extend the impact of any video by embarking on the development of a transcription and translation solution for your multimedia content. Today, media and social networks are omnipresent in our professional and personal lives: videos, tweets, posts, forums and Twitch lives&#8230; These different types of media enable companies and content creators to promote their activities [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmaster-speech-ai-and-build-your-own-video-translator-app-with-ai-endpoints%2F&amp;action_name=Master%20Speech%20AI%20and%20build%20your%20own%20Video%20Translator%20app%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Extend the impact of any video by embarking on the development of a transcription and translation solution for your multimedia content.</em></p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="923" height="618" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post.png" alt="" class="wp-image-27116" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post.png 923w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-768x514.png 768w" sizes="auto, (max-width: 923px) 100vw, 923px" /><figcaption class="wp-element-caption"><em>A robot capable of transcribing, translating and dubbing videos in any language</em></figcaption></figure>



<p>Today, <strong>media</strong> and <strong>social networks</strong> are omnipresent in our professional and personal lives: videos, tweets, posts, forums and Twitch lives&#8230; These different types of media enable companies and content creators to promote their activities and build community loyalty.</p>



<p>But have you ever wondered about the <strong>role of language</strong> when creating your content? Using just one language can be a <strong>hindrance to your business</strong>!</p>



<p><strong>Transcribing</strong> and<strong> translating</strong> your videos could be the solution! Adapt your videos into different languages and make the content accessible to a wider audience, increasing its reach and impact.</p>



<p>💡 <strong>How can we achieve this?</strong></p>



<p>By automatically subtitling and dubbing voices using AI APIs! With <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, you will benefit from APIs based on several <strong>Speech AI</strong> models: ASR (Automatic Speech Recognition), NMT (Neural Machine Translation) and TTS (Text To Speech).</p>



<h3 class="wp-block-heading">Objective</h3>



<p>Whatever your level in AI, whether you&#8217;re a beginner or an expert, this tutorial will enable you to create your own powerful <strong>Video Translator</strong> in just a few lines of code.</p>



<p>The aim of this article is to show you how to make the most of Speech AI&#8217;s APIs, with no prior knowledge required!</p>



<p>⚡️ <strong>How to?</strong></p>



<p>By <strong>connecting the Speech AI API endpoints</strong> using easy-to-implement features, and<strong> developing a web-app</strong> using the Gradio framework.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="665" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles-1024x665.png" alt="" class="wp-image-27117" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles-1024x665.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles-300x195.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles-768x498.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles.png 1057w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>AI Endpoints “puzzles” connexion</em></figcaption></figure>



<p>🚀 <strong>At the end?</strong></p>



<p>We will have a web app that lets you <strong>upload a video</strong> in French, <strong>transcribe</strong> it and then <strong>subtitle</strong> it in English. </p>



<p>But that&#8217;s not all… You will also be able to <strong>dub the voice</strong> of a video into another language!</p>



<p>👀 Before we start coding, let&#8217;s take a look at the different concepts&#8230;</p>



<h3 class="wp-block-heading">Concept</h3>



<p>To better understand the technologies that revolve around the&nbsp;<strong>Video Translator</strong>, let’s start by examining the models and notions of ASR, NMT, TTS…</p>



<h5 class="wp-block-heading">AI Endpoints in a few words</h5>



<p><a href="https://labs.ovhcloud.com/en/ai-endpoints/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a>&nbsp;is a new serverless platform powered by OVHcloud and designed for developers.</p>



<p>The aim of&nbsp;<strong>AI Endpoints</strong>&nbsp;is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise.</p>



<p>It offers a&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/catalog" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">curated catalog</a>&nbsp;of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.</p>



<p>AI Endpoints provides access to advanced AI models, including Large Language Models (LLMs), natural language processing, translation, speech recognition, image recognition, and more.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png" alt="" class="wp-image-28463" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18.png 1312w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>OVHcloud AI Endpoints website</em></figcaption></figure>



<p>To know more about AI Endpoints, refer to this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">website</a>.</p>



<p>AI Endpoints offers several ASR APIs in different languages… But what does ASR mean?</p>



<h5 class="wp-block-heading">Transcribe video using ASR</h5>



<p><strong>Automatic Speech Recognition</strong>&nbsp;(ASR) technology, also known as Speech-To-Text, is the process of converting spoken language into written text.</p>



<p>This process consists of several stages, including preparing the speech signal, extracting features, creating acoustic models, developing language models, and utilizing speech recognition engines.</p>



<p>With AI Endpoints, we simplify the use of ASR technology through our&nbsp;<strong>ready-to-use inference APIs</strong>. Learn how to use our APIs by following this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">link</a>.</p>



<p>These APIs can be used to transcribe audio from video into text, which can then be sent to an NMT model in order to translate it into another language.</p>



<h5 class="wp-block-heading">Translate thanks to NMT</h5>



<p>NMT, for <strong>Neural Machine Translation</strong>, is a subfield of Machine Translation (MT) that uses Artificial Neural Networks (ANNs) to predict or generate translations from one language to another. </p>



<p>If you want to learn more, the best way is to try it out for yourself! You can do so by following this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/23600fb3-7b3e-4d39-823c-7d66b8203e24" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">link</a>.</p>



<p>In this particular application, the NMT models will translate the results of the ASR (Automatic Speech Recognition) endpoint into another language.</p>



<p>Then, there are two options:</p>



<ul class="wp-block-list">
<li><strong>generate a subtitle .srt file</strong> based on the NMT translation</li>



<li><strong>apply voice dubbing</strong> thanks to speech synthesis</li>
</ul>



<p>🤯 Would you like to use speech synthesis? Challenge accepted, that’s what TTS is for.</p>



<h5 class="wp-block-heading">Allow voice dubbing with TTS</h5>



<p>TTS stands for&nbsp;<strong>Text-To-Speech</strong>, which is a type of technology that converts written text into spoken words.</p>



<p>This technology uses Artificial Intelligence algorithms to interpret and generate human-like speech from text input.</p>



<p>It is commonly used in various applications such as voice assistants, audiobooks, language learning platforms, and accessibility tools for individuals with visual or reading impairments.</p>



<p>With AI Endpoints, TTS is easy to use thanks to the&nbsp;<strong>turnkey inference APIs</strong>. Test it for free&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</p>



<p>🤖 Are you ready to start coding the&nbsp;<strong>Video Translator</strong>? Let&#8217;s go!</p>



<h3 class="wp-block-heading">Technical implementation of the Audio Virtual Assistant</h3>



<p>This technical section covers the following points:</p>



<ul class="wp-block-list">
<li>send your video (<strong>.mp4</strong>) and extract audio as <strong>.wav</strong> file</li>



<li>use<strong> ASR endpoint</strong> to transcribe audio into text</li>



<li>translate ASR transcription into the target language using <strong>NMT endpoint</strong></li>



<li>create .srt file to add video subtitles</li>



<li>use <strong>TTS endpoint</strong> to convert NMT translation into spoken words</li>



<li>implement voice dubbing function to merge generated audio with input video</li>
</ul>



<p>Finally, create a web app with <a href="https://www.gradio.app/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gradio</a> to make it easy to use!</p>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/speech-ai-video-translator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="567" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1-1024x567.png" alt="" class="wp-image-27165" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1-1024x567.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1-300x166.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1-768x425.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1.png 1095w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Working principle of the web app resulting from technical implementation</em></figcaption></figure>



<p>In order to build the <strong>Video Translator</strong>, start by setting up the environment.</p>



<h5 class="wp-block-heading">Set up the environment</h5>



<p>In order to use the&nbsp;<strong>AI Endpoints</strong>&nbsp;APIs easily, create a&nbsp;<code><strong>.env</strong></code>&nbsp;file to store environment variables.</p>



<pre class="wp-block-code"><code class="">ASR_GRPC_ENDPOINT=nvr-asr-fr-fr.endpoints-grpc.kepler.ai.cloud.ovh.net:443
NMT_GRPC_ENDPOINT=nvr-nmt-en-fr.endpoints-grpc.kepler.ai.cloud.ovh.net:443
TTS_GRPC_ENDPOINT=nvr-tts-en-us.endpoints-grpc.kepler.ai.cloud.ovh.net:443
OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;ai-endpoints-api-token&gt;</code></pre>



<p>⚠️&nbsp;<em>Test AI Endpoints and get your&nbsp;<strong>free token</strong>&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a></em></p>



<p>In the next step, install the needed Python dependencies.</p>



<p>If the library&nbsp;<strong><code>ffmpeg</code></strong>&nbsp;is not already installed, launch the following command:&nbsp;</p>



<pre class="wp-block-code"><code class=""><code>sudo apt install ffmpeg</code></code></pre>



<p>Create the&nbsp;<code><strong>requirements.txt</strong></code>&nbsp;file with the following libraries and launch the installation. Note that <code>pydub</code> and <code>python-dotenv</code> are listed as well, since the snippets below rely on them for audio segment handling and <code>.env</code> loading.</p>



<p>⚠️<em>The environment workspace is based on&nbsp;<strong>Python 3.11</strong></em></p>



<p><code>nvidia-riva-client==2.15.0<br>gradio==4.16.0<br>moviepy==1.0.3<br>librosa==0.10.1<br>pysrt==1.1.2<br>pydub==0.25.1<br>python-dotenv==1.0.1</code></p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<p>Once this is done, you have to create five Python files:</p>



<ul class="wp-block-list">
<li><code><strong>asr.py</strong></code> &#8211; transcribe audio into text</li>



<li><code><strong>nmt.py</strong></code> &#8211; translate the transcription into another language</li>



<li><strong><code>tts.py</code></strong> &#8211; synthesize the text into speech</li>



<li><code><strong>utils.py</strong></code> &#8211; extract audio from video, connect ASR, NMT and TTS functions together and merge the result with the input video</li>



<li><strong><code>main.py</code></strong> &#8211; create the Gradio app to make it easy to use</li>
</ul>



<p><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️ </mark></strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>Note that only a few functions will be covered in this article! To create the entire app, refer to the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/speech-ai-video-translator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repo</a>, which contains all the code </strong></mark><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️</mark></strong></p>



<h5 class="wp-block-heading">Transcribe the audio part of the video in writing</h5>



<p>First, create the&nbsp;<strong>Automatic Speech Recognition</strong>&nbsp;(ASR) function in order to obtain the video transcription into French.</p>



<p>💡 <strong>How does it work?</strong></p>



<p>The <strong><code>asr_transcription</code></strong> function allows you to transcribe the audio part of the video into text and to get the beginning and the end of each sentence thanks to the <code><strong>enable_word_time_offsets</strong></code> parameter.</p>



<pre class="wp-block-code"><code class="">def asr_transcription(audio_input):

    # connect with asr server
    asr_service = riva.client.ASRService(
                    riva.client.Auth(
                        uri=os.environ.get('ASR_GRPC_ENDPOINT'), 
                        use_ssl=True, 
                        metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]]
                    )
                )

    # set up config
    asr_config = riva.client.RecognitionConfig(
    language_code="fr-FR",
        max_alternatives=1,
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
        audio_channel_count = 1,
    )
    
    # open and read audio file
    with open(audio_input, 'rb') as fh:
        audio = fh.read()
    
    riva.client.add_audio_file_specs_to_config(asr_config, audio)

    # return response
    resp = asr_service.offline_recognize(audio, asr_config)
    output_asr = []
    
    # extract sentence information
    for s in range(len(resp.results)):
        output_sentence = []
        sentence = resp.results[s].alternatives[0].transcript
        output_sentence.append(sentence)
        
        # start time of the first word and end time of the last word of the sentence
        words = resp.results[s].alternatives[0].words
        start_sentence = words[0].start_time
        end_sentence = words[-1].end_time
        
        # add start time and stop time of the sentence
        output_sentence.append(start_sentence)
        output_sentence.append(end_sentence)
       
        # final asr transcription and time sequences
        output_asr.append(output_sentence)
        
    # return response
    return output_asr</code></pre>



<p>🎉 Congratulations! Your ASR function is ready to use.</p>



<p>⏳ But that&#8217;s just the beginning! Now you have to build the translation part&#8230;.</p>



<h5 class="wp-block-heading">Translate French text into English</h5>



<p>Then, build the&nbsp;<strong>Neural Machine Translation</strong>&nbsp;(NMT) function to transform the French transcription into English.</p>



<p>➡️ <strong>In practice?</strong></p>



<p>The <code><strong>nmt_translation</strong></code> function allows you to translate the different sentences into English. Don&#8217;t forget to keep the start and end times for each sentence!</p>



<pre class="wp-block-code"><code class="">def nmt_translation(output_asr):
    
    # connect with nmt server
    nmt_service = riva.client.NeuralMachineTranslationClient(
                    riva.client.Auth(
                        uri=os.environ.get('NMT_GRPC_ENDPOINT'), 
                        use_ssl=True, 
                        metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]]
                    )
                )

    # set up config
    model_name = 'fr_en_24x6'
    
    output_nmt = []
    for s in range(len(output_asr)):
        output_nmt.append(output_asr[s])
        text_translation = nmt_service.translate([output_asr[s][0]], model_name, "fr", "en")
        output_nmt[s][0]=text_translation.translations[0].text
        
    # return response
    return output_nmt</code></pre>
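

<p>At this point you already have everything needed for the first of the two options mentioned earlier: the subtitle file. As a hedged sketch (assuming the start / end times returned by ASR are in milliseconds, which is consistent with the silence handling in the TTS step below; the function name is illustrative), a <code>.srt</code> file can be generated from the <code>[text, start, end]</code> triplets with <code>pysrt</code>:</p>



<pre class="wp-block-code"><code class="">import pysrt

def create_srt_file(output_nmt, video_title):

    # one subtitle entry per translated sentence
    subs = pysrt.SubRipFile()
    for i, (text, start_ms, end_ms) in enumerate(output_nmt, start=1):
        subs.append(pysrt.SubRipItem(
            index=i,
            start=pysrt.SubRipTime(milliseconds=int(start_ms)),
            end=pysrt.SubRipTime(milliseconds=int(end_ms)),
            text=text,
        ))

    # export the subtitles as a .srt file
    srt_file = f"{outputs_path}/videos/{video_title}.srt"
    subs.save(srt_file, encoding="utf-8")
    return srt_file</code></pre>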



<p>⚡️ You’re almost there! Now all you have to do is build the TTS function.</p>



<h5 class="wp-block-heading">Synthesize the translated text into spoken words</h5>



<p>Finally, create the&nbsp;<strong>Text To Speech</strong>&nbsp;(TTS) function to synthesize the translated English text into audio.</p>



<p>👀 <strong>How to?</strong></p>



<p>Firstly, create the <strong><code>tts_transcription</code></strong> function, dedicated to audio generation and silence management based on the ASR and NMT results.</p>



<pre class="wp-block-code"><code class="">def tts_transcription(output_nmt, video_input, video_title, voice_type):
    
    # connect with tts server
    tts_service = riva.client.SpeechSynthesisService(
                    riva.client.Auth(
                        uri=os.environ.get('TTS_GRPC_ENDPOINT'), 
                        use_ssl=True, 
                        metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]]
                    )
                )

    # set up tts config
    sample_rate_hz = 16000
    req = { 
            "language_code"  : "en-US",
            "encoding"       : riva.client.AudioEncoding.LINEAR_PCM ,  
            "sample_rate_hz" : sample_rate_hz,                       
            "voice_name"     : f"English-US.{voice_type}"                    
    }

    output_audio = 0
    output_audio_file = f"{outputs_path}/audios/{video_title}.wav"
    for i in range(len(output_nmt)):
        
        # add silence between audio sample
        if i==0:
            duration_silence = output_nmt[i][1]
        else:
            duration_silence = output_nmt[i][1] - output_nmt[i-1][2]
        silent_segment = AudioSegment.silent(duration = duration_silence)
        output_audio += silent_segment
        
        # create tts transcription
        req["text"] = output_nmt[i][0]
        resp = tts_service.synthesize(**req)
        sound_segment = AudioSegment(
            resp.audio,
            sample_width=2,
            frame_rate=16000,
            channels=1,
        )

        output_audio += sound_segment
    
    # export new audio as wav file
    output_audio.export(output_audio_file, format="wav")
    
    # add new voice on video
    voice_dubbing = add_audio_on_video(output_audio_file, video_input, video_title)
    
    return voice_dubbing</code></pre>



<p>Secondly, build the <code><strong>add_audio_on_video</strong></code> function in order to merge the new audio onto the video.</p>



<pre class="wp-block-code"><code class="">def add_audio_on_video(translated_audio, video_input, video_title):

    videoclip = VideoFileClip(video_input)
    audioclip = AudioFileClip(translated_audio)

    new_audioclip = CompositeAudioClip([audioclip])
    new_videoclip = f"{outputs_path}/videos/{video_title}.mp4"
    
    videoclip.audio = new_audioclip
    videoclip.write_videofile(new_videoclip)
    
    return new_videoclip</code></pre>



<p>🤖 Congratulations! Now you&#8217;re ready to put the puzzle pieces together with the <strong><code>utils.py</code></strong> file.</p>



<h5 class="wp-block-heading">Combine the results of the various functions</h5>



<p>This is the most important step! Connect the functions output to return the final video&#8230;</p>



<p>🚀 <strong>What to do?</strong></p>



<p>1. Create the <code>main.py </code>to implement the Gradio app</p>



<p><strong> ➡️ Access the code <a href="https://github.com/ovh/public-cloud-examples/blob/main/ai/ai-endpoints/speech-ai-video-translator/main.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<p>2. Build <code>utils.py</code> file to connect the results to each other</p>



<p><strong>➡️ Refer to this Python <a href="https://github.com/ovh/public-cloud-examples/blob/main/ai/ai-endpoints/speech-ai-video-translator/utils.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">file</a>.</strong></p>
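

<p>To give an idea of what this glue code looks like, here is an illustrative sketch of a function chaining the pieces together (the real <code>utils.py</code> in the repo is more complete; the function name and the default voice are assumptions):</p>



<pre class="wp-block-code"><code class="">def video_translator(video_input, voice_type="Female-1"):

    video_title = os.path.splitext(os.path.basename(video_input))[0]

    # extract the audio track of the input video as a mono .wav file
    audio_input = f"{outputs_path}/audios/{video_title}_input.wav"
    VideoFileClip(video_input).audio.write_audiofile(
        audio_input, ffmpeg_params=["-ac", "1"]
    )

    # chain the three endpoints: ASR -&gt; NMT -&gt; TTS voice dubbing
    output_asr = asr_transcription(audio_input)
    output_nmt = nmt_translation(output_asr)
    return tts_transcription(output_nmt, video_input, video_title, voice_type)</code></pre>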



<p>😎 Well done! You can now use your web app to translate any video from French to English.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="623" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-1024x623.png" alt="" class="wp-image-27166" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-1024x623.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-300x183.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-768x467.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-1536x935.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-2048x1246.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Video Translator web app overview</em></figcaption></figure>



<p>🚀 That’s it! Now get the most out of your tool by launching it locally.</p>



<h5 class="wp-block-heading">Launch Video Translator web app locally</h5>



<p>In this last step, you can start the Gradio app locally by launching the following command:</p>



<pre class="wp-block-code"><code class="">python3 main.py</code></pre>



<p>Benefit from the full power of your tool as follows!</p>



<figure class="wp-block-video aligncenter"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/demo_video_translator_ai_endpoints.mp4"></video><figcaption class="wp-element-caption"><em>Video Translator demo</em></figcaption></figure>



<p>☁️ It’s also possible to make your interface accessible to everyone…</p>



<h5 class="wp-block-heading">Go further</h5>



<p>If you want to go further and deploy your web app in the cloud, refer to the following articles and tutorials.</p>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal">Deploy a custom Docker image for Data Science project</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-build-use-custom-image?id=kb_article_view&amp;sysparm_article=KB0057413" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Deploy – Tutorial – Build &amp; use a custom Docker image</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-gradio-sketch-recognition?id=kb_article_view&amp;sysparm_article=KB0048083" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Deploy – Tutorial – Deploy a Gradio app for sketch recognition</a></li>
</ul>



<h3 class="wp-block-heading">Conclusion of the Audio Virtual Assistant</h3>



<p>Congratulations 🎉! You have learned how to build your own&nbsp;<strong>Video Translator</strong>&nbsp;thanks to AI Endpoints.</p>



<p>You’ve also seen how easy it is to use&nbsp;<strong>AI Endpoints</strong>&nbsp;to create innovative turnkey solutions.</p>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/speech-ai-video-translator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<p>🚀&nbsp;<strong>What’s next?</strong>&nbsp;Implement an <a href="https://blog.ovhcloud.com/build-a-powerful-audio-virtual-assistant-with-ai-endpoints/" data-wpel-link="internal">Audio Virtual Assistant</a> in less than 100 lines of code!</p>



<h3 class="wp-block-heading">References</h3>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">Enhance your applications with AI Endpoints</a></li>



<li><a href="https://blog.ovhcloud.com/create-audio-summarizer-assistant-with-ai-endpoints/" data-wpel-link="internal">Create your own Audio Summarizer assistant with AI Endpoints!</a></li>



<li><a href="https://blog.ovhcloud.com/build-a-powerful-audio-virtual-assistant-with-ai-endpoints/" data-wpel-link="internal">Build a powerful Audio Virtual Assistant in less than 100 lines of code with AI Endpoints!</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmaster-speech-ai-and-build-your-own-video-translator-app-with-ai-endpoints%2F&amp;action_name=Master%20Speech%20AI%20and%20build%20your%20own%20Video%20Translator%20app%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/07/demo_video_translator_ai_endpoints.mp4" length="4268394" type="video/mp4" />

			</item>
		<item>
		<title>Chatbot memory management with LangChain and AI Endpoints</title>
		<link>https://blog.ovhcloud.com/chatbot-memory-management-with-langchain-and-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Thu, 11 Jul 2024 14:03:03 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=27066</guid>

					<description><![CDATA[Use Conversational Memory to enable your chatbot to answer multiple questions using its knowledge based on previous interactions. When it comes to Conversational Applications, especially those with interfaces, the ability to remember information about past interactions is paramount. Imagine you&#8217;re talking to a Virtual Assistant or Chatbot, and you want it to remember details of [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fchatbot-memory-management-with-langchain-and-ai-endpoints%2F&amp;action_name=Chatbot%20memory%20management%20with%20LangChain%20and%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Use <strong>Conversational Memory</strong> to enable your chatbot to answer multiple questions using its knowledge based on previous interactions.</em></p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="812" height="509" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-langchain-python-blog-post-2.png" alt="" class="wp-image-27078" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-langchain-python-blog-post-2.png 812w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-langchain-python-blog-post-2-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-langchain-python-blog-post-2-768x481.png 768w" sizes="auto, (max-width: 812px) 100vw, 812px" /><figcaption class="wp-element-caption"><em>A robot assistant with a lot of knowledge talking to a human</em></figcaption></figure>



<p>When it comes to <strong>Conversational Applications</strong>, especially those with interfaces, the ability to remember information about past interactions is paramount.</p>



<p>Imagine you&#8217;re talking to a <strong>Virtual Assistant</strong> or <strong>Chatbot</strong>, and you want it to remember details of previous conversations&#8230;</p>



<p><a href="https://python.langchain.com/v0.2/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LangChain</a>&#8216;s <strong>Memory</strong> module is the solution that rescues our conversation models from the constraints of short-term memory!</p>



<p>In this article we will learn how it is possible to use <strong>OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a></strong>, especially <strong>Mistral7B</strong> API, and <strong>LangChain</strong> in order to add a <strong>Memory window</strong> to a Chatbot.</p>



<p>This step-by-step tutorial will introduce the different types of memory in LangChain. Then, we will compare the Mistral7b model used without memory and the one benefiting from the memory window.</p>



<h3 class="wp-block-heading">Introduction</h3>



<p>Before getting our hands into the code, let&#8217;s contextualize it by introducing AI Endpoints and the notion of memory in the LLM domain.</p>



<h5 class="wp-block-heading">AI Endpoints in a few words</h5>



<p><a href="https://labs.ovhcloud.com/en/ai-endpoints/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a>&nbsp;is a new serverless platform powered by OVHcloud and designed for developers.</p>



<p>The aim of&nbsp;<strong>AI Endpoints</strong>&nbsp;is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise.</p>



<p>It offers a&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/catalog" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">curated catalog</a>&nbsp;of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.</p>



<p>AI Endpoints provides access to advanced AI models, including Large Language Models (LLMs), natural language processing, translation, speech recognition, image recognition, and more.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png" alt="" class="wp-image-28463" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18.png 1312w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>OVHcloud AI Endpoints website</em></figcaption></figure>



<p>To know more about AI Endpoints, refer to this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">website</a>.</p>



<h5 class="wp-block-heading">Conversational Memory concept</h5>



<p><strong>Conversational memory</strong> for LLMs (Language Learning Models) refers to the ability of these models to remember and use information from previous interactions within the same conversation. </p>



<p>It works in a similar way to how humans use <strong>short-term memory</strong> in day-to-day conversations. </p>



<p>This feature is essential for <strong>maintaining context</strong> and coherence throughout a dialogue. It allows the model to recall details, facts, or inquiries mentioned earlier in the conversation (<strong>chat history</strong>), and use that information effectively to generate more relevant responses corresponding to the <strong>new user inputs</strong>.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="873" height="739" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-history-input.png" alt="" class="wp-image-27083" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-history-input.png 873w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-history-input-300x254.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-history-input-768x650.png 768w" sizes="auto, (max-width: 873px) 100vw, 873px" /><figcaption class="wp-element-caption"><em>Distinction between chat history and new user input</em></figcaption></figure>



<p><strong>Conversational memory</strong> can be implemented through various techniques and architectures, especially in LangChain.</p>



<h3 class="wp-block-heading">Memory&nbsp;types in LangChain</h3>



<p>LangChain offers several types of conversational memory with the <code><strong>ConversationChain</strong></code>.</p>



<p>Each memory type may have its own parameters and concepts that need to be understood&#8230;</p>



<h5 class="wp-block-heading">ConversationBufferMemory</h5>



<p>The first component is the <code><strong>ConversationBufferMemory</strong></code>. This is an extremely simple form of memory that simply holds a list of chat messages in a buffer and passes them on to the prompt model.</p>



<p>All conversation interactions between the human and the AI are passed to the parameter <strong><code>history</code></strong>.</p>



<h5 class="wp-block-heading">ConversationSummaryMemory</h5>



<p>The second component solves a problem that arises when using <code><strong>ConversationBufferMemory</strong></code>: we quickly consume a large number of tokens, often exceeding the context window limit of even the most advanced LLMs.</p>



<p>The solution may be to use the <code><strong>ConversationSummaryMemory</strong></code> component, which limits excessive token consumption while still exploiting memory. This type of memory summarizes the history of interactions before sending it to the dedicated parameter (<code><strong>history</strong></code>).</p>



<h5 class="wp-block-heading">ConversationBufferWindowMemory</h5>



<p>The third one is the <code><strong>ConversationBufferWindowMemory</strong></code>. It introduces a window into the buffer memory, keeping only the <strong>K most recent interactions</strong>. </p>



<p>⚠️ <em>Note that while this approach reduces the number of tokens used, it also discards all interactions older than the K most recent ones.</em></p>



<h5 class="wp-block-heading">ConversationSummaryBufferMemory</h5>



<p>The <code><strong>ConversationSummaryBufferMemory</strong></code> component is a mix of <strong><code>ConversationSummaryMemory</code></strong> and <code><strong>ConversationBufferWindowMemory</strong></code>. </p>



<p>It summarizes the earliest interactions while retaining the latest tokens in the human / AI conversation.</p>
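

<p>As a quick illustration, here is how these four components are typically instantiated (a sketch; <code>llm</code> stands for the chat model configured in the technical section below, and the parameter values are examples to adapt):</p>



<pre class="wp-block-code"><code class="">from langchain.memory import (
    ConversationBufferMemory,
    ConversationBufferWindowMemory,
    ConversationSummaryBufferMemory,
    ConversationSummaryMemory,
)

# full chat history, passed verbatim to the prompt
buffer_memory = ConversationBufferMemory()

# history summarized by the LLM before being injected into the prompt
summary_memory = ConversationSummaryMemory(llm=llm)

# only the K most recent interactions are kept
window_memory = ConversationBufferWindowMemory(k=10)

# summary of the oldest interactions + raw buffer of the latest tokens
summary_buffer_memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=650)</code></pre>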



<p>🧠 Let&#8217;s move on to the technical part and take a look at the component<strong> <code>ConversationBufferWindowMemory</code></strong>!</p>



<h3 class="wp-block-heading">How to add a conversational memory window?</h3>



<p>This technical section covers the following points:</p>



<ul class="wp-block-list">
<li>set up the dev environment</li>



<li>test the Mistral7B model without conversational memory</li>



<li>implement<strong> ConversationBufferWindowMemory</strong> to benefit from the model knowledge during the conversation</li>
</ul>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="690" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1-1024x690.png" alt="" class="wp-image-27080" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1-1024x690.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1-300x202.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1-768x517.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1.png 1048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Comparaison between chatbot with and without memory</em></figcaption></figure>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/python-langchain-conversational-memory" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</strong></p>



<h5 class="wp-block-heading">Set up the environment</h5>



<p>In order to use the&nbsp;<strong>AI Endpoints&nbsp;Mistral7B</strong>&nbsp;API easily, create a&nbsp;<code><strong>.env</strong></code>&nbsp;file to store environment variables.</p>



<pre class="wp-block-code"><code class="">LLM_AI_ENDPOINT=https://mistral-7b-instruct-v0-3.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1

OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;ai-endpoints-api-token&gt;</code></pre>



<p>⚠️&nbsp;<em>Test AI Endpoints and get your&nbsp;<strong>free token</strong>&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a></em></p>



<p>In the next step, install the needed Python dependencies.</p>



<p>Create the&nbsp;<code><strong>requirements.txt</strong></code>&nbsp;file with the following libraries and launch the installation.</p>



<p>⚠️<em>The environment workspace is based on&nbsp;<strong>Python 3.11</strong></em></p>



<p><code>python-dotenv==1.0.1<br>langchain_openai==0.1.14<br>langchain==0.2.17</code><br><code>openai==1.68.2</code></p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<p>Once this is done, you can create a notebook named&nbsp;<strong><code>chatbot-memory-langchain.ipynb</code></strong>.</p>



<p>First, import the Python libraries as follows:</p>



<pre class="wp-block-code"><code class="">import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory</code></pre>



<p>Then, load the environment variables:</p>



<pre class="wp-block-code"><code class="">load_dotenv()

# access the environment variables from the .env file
ai_endpoint_token = os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN")
ai_endpoint_mistral7b = os.getenv("LLM_AI_ENDPOINT")</code></pre>



<p>👀 You are now ready to test your LLM without conversational memory!</p>



<h5 class="wp-block-heading">Test Mistral7b without Conversational Memory</h5>



<p>Test your model in a basic way and see what happens with the context&#8230;</p>



<pre class="wp-block-code"><code class=""># Set up the LLM
llm = ChatOpenAI(
        model_name="Mistral-7B-Instruct-v0.3", 
        openai_api_key=ai_endpoint_token,
        openai_api_base=ai_endpoint_mistral7b, 
        max_tokens=512,
        temperature=0.0
)

prompt = ChatPromptTemplate.from_messages([
("system", "You are an assistant. Answer to the question."),
("human", "{question}"),
])

# Create the conversation chain
chain = prompt | llm

# Start the conversation
question = "Hello, my name is Elea"
response = chain.invoke(question)
print(f"👤: {question}")
print(f"🤖: {response.content}")

question = "What is the capital of France?"
response = chain.invoke(question)
print(f"👤: {question}")
print(f"🤖: {response.content}")

question = "Do you know my name?"
response = chain.invoke(question)
print(f"👤: {question}")
print(f"🤖: {response.content}")</code></pre>



<p>You should obtain the following result:</p>



<p><code>👤: Hello, my name is Elea <br>🤖: Hello Elea, nice to meet you. How can I assist you today? <br>👤: What is the capital of France? <br>🤖: The capital city of France is Paris. Paris is one of the most famous and visited cities in the world. It is known for its art, culture, and cuisine. <br>👤: Do you know my name? <br>🤖: I'm an assistant and I don't have the ability to know your name without being told.</code></p>



<p>Note here that the model <strong>does not</strong> store the conversation in memory, since it no longer remembers the first name sent in the first prompt.</p>



<p><strong>But how to fix it?</strong></p>



<p>💡 You can solve it using <strong>ConversationBufferWindowMemory</strong> from LangChain&#8230;</p>



<h5 class="wp-block-heading">Add Memory Window to your LLM</h5>



<p>In this step, we add a Conversation Window Memory using the following component:</p>



<p><code><strong>memory = ConversationBufferWindowMemory(k=10)</strong></code></p>



<p><strong>Parameter <code>k</code> defines the number of recorded interactions.<br></strong>➡️ <em>Note that if we set&nbsp;<strong>k=1</strong>, it means that the window will remember the single latest interaction between the human and AI. That is the latest human input and the latest AI response.</em></p>



<p>Then, we have to create the conversation chain:</p>



<p><code><strong>conversation = ConversationChain(llm=llm, memory=memory)</strong></code></p>



<pre class="wp-block-code"><code class=""># Set up the LLM
llm = ChatOpenAI(
        model_name="Mistral-7B-Instruct-v0.3", 
        openai_api_key=ai_endpoint_token,
        openai_api_base=ai_endpoint_mistral7b,
        max_tokens=512,
        temperature=0.0
)

# Add Conversation Window Memory
memory = ConversationBufferWindowMemory(k=10)

# Create the conversation chain
conversation = ConversationChain(llm=llm, memory=memory)

# Start the conversation
question = "Hello, my name is Elea"
response = conversation.predict(input=question)
print(f"👤: {question}")
print(f"🤖: {response}")

question = "What is the capital of France?"
response = conversation.predict(input=question)
print(f"👤: {question}")
print(f"🤖: {response}")

question = "Do you know my name?"
response = conversation.predict(input=question)
print(f"👤: {question}")
print(f"🤖: {response}")</code></pre>



<p>Finally, you should obtain this type of output:</p>



<p><code>👤: Hello, my name is Elea </code><br><code>🤖: Hello Elea, nice to meet you. I'm an AI designed to assist and engage in friendly conversations. How can I help you today? Would you like to know a joke, play a game, or discuss a specific topic? I'm here to help and provide lots of specific details from my context. If I don't know the answer to a question, I'll truthfully say I don't know. So, what would you like to talk about today? I'm all ears! </code><br><code>👤: What is the capital of France? </code><br><code>🤖: The capital city of France is Paris. Paris is one of the most famous and romantic cities in the world. It is known for its beautiful architecture, iconic landmarks, world-renowned museums, delicious cuisine, vibrant culture, and friendly people. Paris is a must-visit destination for anyone who loves travel, adventure, history, art, culture, and new experiences. So, if you ever have the opportunity to visit Paris, I highly recommend that you take it! You won't be disappointed! </code><br><code>👤: Do you know my name? </code><br><code>🤖: Yes, I do. Your name is Elea. How can I help you today, Elea? Would you like to know a joke, play a game, or discuss a specific topic? I'm here to help and provide lots of specific details from my context. If I don't know the answer to a question, I'll truthfully say I don't know. So, what would you like to talk about today, Elea? I'm all ears!</code></p>



<p>As you can see, thanks to the <code><strong>ConversationBufferWindowMemory</strong></code>, your model keeps track of the conversation and retrieves previously exchanged information.</p>



<p>⚠️ Here, the memory window is <code><strong>k=10</strong></code>, so feel free to customize the <code><strong>k</strong></code> value to suit your needs.</p>



<h3 class="wp-block-heading">Conclusion</h3>



<p><strong>Congratulations!</strong> You can now benefit from the memory generated by the history of your interactions with the LLM.</p>



<p>🤖 This will enable you to streamline exchanges with the Chatbot and get more relevant answers!</p>



<p>In this blog, we explored the <strong>LangChain Memory</strong> module and, more specifically, the <code><strong>ConversationBufferWindowMemory</strong></code> component.</p>



<p>This has enabled us to understand the importance of memory in the creation of a Chatbot or Virtual assistant!</p>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/python-langchain-conversational-memory" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</strong></p>



<p>🚀&nbsp;<strong>What’s next?</strong>&nbsp;If you would like to find out more, take a look at the following article on <a href="https://blog.ovhcloud.com/memory-chatbot-using-ai-endpoints-and-langchain4j/" data-wpel-link="internal">memory chatbot with LangChain4j</a>.</p>



<h3 class="wp-block-heading">References</h3>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">Enhance your applications with AI Endpoints</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-and-langchain-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints and LangChain to create a chatbot</a></li>



<li><a href="https://medium.com/@ranadevrat/how-to-develop-a-chatbot-using-the-open-source-llm-mistral-7b-lang-chain-memory-79f9fb3016df" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">How to develop a chatbot using the open-source&nbsp;LLM Mistral-7B, Lang Chain Memory, ConversationChain, and Flask</a></li>



<li><a href="https://python.langchain.com/v0.1/docs/modules/memory/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LangChain memory module documentation</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fchatbot-memory-management-with-langchain-and-ai-endpoints%2F&amp;action_name=Chatbot%20memory%20management%20with%20LangChain%20and%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Build a powerful Audio Virtual Assistant in less than 100 lines of code with AI Endpoints!</title>
		<link>https://blog.ovhcloud.com/build-a-powerful-audio-virtual-assistant-with-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 09 Jul 2024 13:16:29 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=27063</guid>

					<description><![CDATA[Raise your hands off the keyboard and chat with your LLM by voice with this Audio Virtual Assistant! Nowadays, the creation of virtual assistants has become more accessible than ever, thanks to advances in AI (Artificial Intelligence), particularly in the field of&#160;SpeechAI&#160;and the&#160;GenAI&#160;models. We will explore how&#160;OVHcloud&#160;AI Endpoints&#160;can be leveraged to design and develop an&#160;Audio [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbuild-a-powerful-audio-virtual-assistant-with-ai-endpoints%2F&amp;action_name=Build%20a%20powerful%20Audio%20Virtual%20Assistant%20in%20less%20than%20100%20lines%20of%20code%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Take your hands off the keyboard and chat with your LLM by voice with this Audio Virtual Assistant!</em></p>



<figure class="wp-block-image aligncenter"><img loading="lazy" decoding="async" width="924" height="618" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post.png" alt="" class="wp-image-27006" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post.png 924w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-768x514.png 768w" sizes="auto, (max-width: 924px) 100vw, 924px" /><figcaption class="wp-element-caption"><em>An audio robot assistant talking to a human about the recipe for apple pie</em></figcaption></figure>



<p>Nowadays, the creation of virtual assistants has become more accessible than ever, thanks to advances in AI (Artificial Intelligence), particularly in the field of&nbsp;<strong>SpeechAI</strong>&nbsp;and the&nbsp;<strong>GenAI</strong>&nbsp;models.</p>



<p>We will explore how&nbsp;<strong>OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a></strong>&nbsp;can be leveraged to design and develop an&nbsp;<strong>Audio Virtual Assistant</strong>&nbsp;capable of processing and understanding verbal questions, providing accurate answers, and returning answers verbally through speech synthesis.</p>



<p>In this step-by-step tutorial, we will take a look at how to send audio captured through the microphone to an&nbsp;<strong>LLM</strong>&nbsp;(Large Language Model) via a written transcript produced by&nbsp;<strong>ASR</strong>&nbsp;(Automatic Speech Recognition). The response is then spoken aloud by a&nbsp;<strong>TTS</strong>&nbsp;(Text To Speech) model.</p>



<h3 class="wp-block-heading">Objectives</h3>



<p>Whatever your level in AI, whether you’re a beginner or an expert, this tutorial will enable you to create your own powerful&nbsp;<strong>Audio Virtual Assistant</strong>&nbsp;in just a few lines of code.</p>



<p><strong>How to?</strong></p>



<p>By connecting your AI Endpoints like puzzles!</p>



<ul class="wp-block-list">
<li>Retrieve the written transcript of your oral question with ASR endpoint</li>



<li>Get the answer to your question with an LLM endpoint</li>



<li>Take advantage of the TTS endpoint with the oral reply</li>
</ul>



<figure class="wp-block-image aligncenter"><img loading="lazy" decoding="async" width="998" height="568" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-puzzles.png" alt="" class="wp-image-27008" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-puzzles.png 998w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-puzzles-300x171.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-puzzles-768x437.png 768w" sizes="auto, (max-width: 998px) 100vw, 998px" /><figcaption class="wp-element-caption"><em>AI Endpoints “puzzles” connexion</em></figcaption></figure>



<p>👀 But first of all, a few definitions are needed to fully understand the technical implementation that follows.</p>



<h3 class="wp-block-heading">Concept</h3>



<p>To better understand the technologies that revolve around the&nbsp;<strong>Audio Virtual Assistant</strong>, let’s start by examining the models and notions of ASR, LLM, TTS…</p>



<h5 class="wp-block-heading">AI Endpoints in a few words</h5>



<p><a href="https://labs.ovhcloud.com/en/ai-endpoints/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a>&nbsp;is a new serverless platform powered by OVHcloud and designed for developers.</p>



<p>The aim of&nbsp;<strong>AI Endpoints</strong>&nbsp;is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise.</p>



<p>It offers a&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/catalog" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">curated catalog</a>&nbsp;of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.</p>



<p>AI Endpoints provides access to advanced AI models, including Large Language Models (LLMs), natural language processing, translation, speech recognition, image recognition, and more.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png" alt="" class="wp-image-28463" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18.png 1312w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>OVHcloud AI Endpoints website</em></figcaption></figure>



<p>To learn more about AI Endpoints, refer to this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">website</a>.</p>



<p>AI Endpoints offers several ASR APIs in different languages… But what does ASR mean?</p>



<h5 class="wp-block-heading">Questioning with ASR</h5>



<p><strong>Automatic Speech Recognition</strong>&nbsp;(ASR) technology, also known as Speech-To-Text, is the process of converting spoken language into written text.</p>



<p>This process consists of several stages, including preparing the speech signal, extracting features, creating acoustic models, developing language models, and utilizing speech recognition engines.</p>



<p>With AI Endpoints, we simplify the use of ASR technology through our&nbsp;<strong>ready-to-use inference APIs</strong>. Learn how to use our APIs by following this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">link</a>.</p>



<p>These APIs can be used to transcribe a recorded audio question into text, which can then be sent to a Large Language Model (LLM) for an answer.</p>



<h5 class="wp-block-heading">Answering using LLM</h5>



<p>LLMs, or&nbsp;<strong>Large Language Models</strong>, are known for producing text that is similar to how humans write.</p>



<p>They use complex algorithms to predict patterns in human language, understand context, and provide relevant responses. With LLM, virtual assistants can engage in meaningful and dynamic conversations with users.</p>



<p>If you want to learn more, the best way is to try it out for yourself! You can do so by following this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/018a19ab-167f-473b-8ec5-acb44380d175" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">link</a>.</p>



<p>In this particular application, the LLM will be configured to answer the user question based on the results of the ASR (Automatic Speech Recognition) endpoint.</p>



<p>🤯 Would you like a verbal response? Don’t worry, that’s what TTS is for.</p>



<h5 class="wp-block-heading">Expressing orally through TTS</h5>



<p>TTS stands for&nbsp;<strong>Text-To-Speech</strong>, which is a type of technology that converts written text into spoken words.</p>



<p>This technology uses Artificial Intelligence algorithms to interpret and generate human-like speech from text input.</p>



<p>It is commonly used in various applications such as voice assistants, audiobooks, language learning platforms, and accessibility tools for individuals with visual or reading impairments.</p>



<p>With AI Endpoints, TTS is easy to use thanks to the&nbsp;<strong>turnkey inference APIs</strong>. Test it for free&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</p>



<p>🤖 Are you ready to start coding the&nbsp;<strong>Audio Virtual Assistant</strong>? Here we go: 3, 2, 1, begin!</p>



<h3 class="wp-block-heading">Technical implementation of the Audio Virtual Assistant</h3>



<p>This technical section covers the following points:</p>



<ul class="wp-block-list">
<li>the use of the ASR endpoint inside Python code to transcribe the audio request</li>



<li>the implementation of the TTS function to convert the LLM response into spoken words</li>



<li>the creation of a Chatbot app using LLMs and&nbsp;<a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Streamlit</a></li>
</ul>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/audio-virtual-assistant" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</strong></p>



<figure class="wp-block-image aligncenter"><img loading="lazy" decoding="async" width="818" height="645" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-1.png" alt="" class="wp-image-27014" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-1.png 818w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-1-300x237.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-1-768x606.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /><figcaption class="wp-element-caption"><em>Working principle of the web app resulting from technical implementation</em></figcaption></figure>



<p>To build the&nbsp;<strong>Audio Virtual Assistant</strong>, start by setting up the environment.</p>



<h5 class="wp-block-heading">Set up the environment</h5>



<p>In order to use<strong>&nbsp;AI Endpoints</strong>&nbsp;APIs easily, create a&nbsp;<code><strong>.env</strong></code>&nbsp;file to store environment variables.</p>



<pre class="wp-block-code"><code class="">ASR_AI_ENDPOINT=https://whisper-large-v3.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1<br>TTS_GRPC_ENDPOINT=nvr-tts-en-us.endpoints-grpc.kepler.ai.cloud.ovh.net:443<br>LLM_AI_ENDPOINT=https://mixtral-8x7b-instruct-v01.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1<br>OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;ai-endpoints-api-token></code></pre>



<p>⚠️&nbsp;<em><strong>Make sure to replace the token value (<code>OVH_AI_ENDPOINTS_ACCESS_TOKEN</code>) with your own.</strong> If you do not have one yet, follow the instructions in the <a href="https://help.ovhcloud.com/csm/de-public-cloud-ai-endpoints-getting-started?id=kb_article_view&amp;sysparm_article=KB0065406" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints &#8211; Getting Started</a> guide.</em></p>



<p>In this tutorial, we will be using the <strong>Whisper-Large-V3</strong> and <strong>Mixtral-8x7b-Instruct-V01</strong> models. Feel free to replace them with other models available in the <a href="https://catalog.endpoints.ai.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints catalog</a>.</p>



<p>In the next step, install the needed Python dependencies.</p>



<p>Create the&nbsp;<code><strong>requirements.txt</strong></code>&nbsp;file with the following libraries and launch the installation.</p>



<p>⚠️<em>The working environment is based on&nbsp;<strong>Python 3.11</strong></em></p>



<p><code>openai==1.68.2<br>streamlit==1.36.0<br>streamlit-mic-recorder==0.0.8<br>nvidia-riva-client==2.15.1<br>python-dotenv==1.0.1</code></p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<p>Once this is done, you can create a Python file named&nbsp;<strong><code>audio-virtual-assistant-app.py</code></strong>.</p>



<p>Then, import the Python libraries as follows:</p>



<pre class="wp-block-code"><code lang="" class="">import os<br>import numpy as np<br>from openai import OpenAI<br>import riva.client<br>from dotenv import load_dotenv<br>import streamlit as st<br>from streamlit_mic_recorder import mic_recorder</code></pre>



<p>After these lines, load and access the environment variables from your <code>.env</code> file:</p>



<pre class="wp-block-code"><code lang="" class=""># access the environment variables from the .env file<br>load_dotenv()<br><br>ASR_AI_ENDPOINT = os.environ.get('ASR_AI_ENDPOINT')<br>TTS_GRPC_ENDPOINT = os.environ.get('TTS_GRPC_ENDPOINT')<br>LLM_AI_ENDPOINT = os.environ.get('LLM_AI_ENDPOINT')<br>OVH_AI_ENDPOINTS_ACCESS_TOKEN = os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN')</code></pre>



<p>Next, define the clients that will be used to interact with the models:</p>



<pre class="wp-block-code"><code lang="" class="">llm_client = OpenAI(<br>    base_url=LLM_AI_ENDPOINT,<br>    api_key=OVH_AI_ENDPOINTS_ACCESS_TOKEN<br>)<br><br>tts_client = riva.client.SpeechSynthesisService(<br>    riva.client.Auth(<br>        uri=TTS_GRPC_ENDPOINT,<br>        use_ssl=True,<br>        metadata_args=[["authorization", f"bearer {OVH_AI_ENDPOINTS_ACCESS_TOKEN}"]]<br>    )<br>)<br><br>asr_client = OpenAI(<br>    base_url=ASR_AI_ENDPOINT,<br>    api_key=OVH_AI_ENDPOINTS_ACCESS_TOKEN<br>)</code></pre>



<p>💡 You are now ready to start coding your web app!</p>



<h5 class="wp-block-heading">Transcribe input question with ASR</h5>



<p>First, create the&nbsp;<strong>Automatic Speech Recognition</strong>&nbsp;(ASR) function in order to transcribe microphone input into text:</p>



<pre class="wp-block-code"><code class="">def asr_transcription(question, asr_client):<br>    return asr_client.audio.transcriptions.create(<br>        model="whisper-large-v3",<br>        file=question<br>    ).text</code></pre>



<p><strong>How does it work?</strong></p>



<ul class="wp-block-list">
<li>The audio input from the microphone recording is passed as <code><strong>question</strong></code></li>



<li>A call is made to the ASR AI Endpoint named&nbsp;<code><strong>whisper-large-v3</strong></code></li>



<li>The text from the transcript response is returned by the function</li>
</ul>
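


<p>As a quick sanity check, you can call this function outside Streamlit with a local recording (the file name <code>question.wav</code> is hypothetical, used here for illustration only):</p>



<pre class="wp-block-code"><code class=""># minimal sketch: transcribe a local .wav file with the function above
with open("question.wav", "rb") as audio_file:
    print(asr_transcription(audio_file, asr_client))</code></pre>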



<p>🎉 Congratulations! Your ASR function is ready: you can now transcribe audio questions.</p>



<h5 class="wp-block-heading">Generate LLM response to input question</h5>



<p>Now, create a function that calls the LLM client to provide responses to questions:</p>



<pre class="wp-block-code"><code lang="" class="">def llm_answer(input, llm_client):<br>    response = llm_client.chat.completions.create(<br>                model="Mixtral-8x7B-Instruct-v0.1", <br>                messages=input,<br>                temperature=0,<br>                max_tokens=1024,<br>            )<br>    msg = response.choices[0].message.content<br><br>    return msg</code></pre>



<p><strong>In this function:</strong></p>



<ul class="wp-block-list">
<li>The conversation/messages are retrieved as parameters</li>



<li>A call is made to the chat completion LLM endpoint, using the <code>Mixtral-8x7B-Instruct-v0.1</code> model</li>



<li>The model’s response is extracted and the final message text is returned</li>
</ul>
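


<p>Here again, a one-off call is enough to test the function on its own (a minimal sketch with a hard-coded question):</p>



<pre class="wp-block-code"><code class=""># minimal sketch: ask a single question outside Streamlit
messages = [{"role": "user", "content": "Give me a short recipe for apple pie."}]
print(llm_answer(messages, llm_client))</code></pre>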



<p>⏳ Almost there! All that remains is to implement the TTS to transform the LLM response into spoken words.</p>



<h5 class="wp-block-heading">Return the response using TTS</h5>



<p>Then, build the&nbsp;<strong>Text To Speech</strong>&nbsp;(TTS) function in order to transform the written answer into an oral reply:</p>



<pre class="wp-block-code"><code class="">def tts_synthesis(response, tts_client):<br><br>    # set up config<br>    sample_rate_hz = 48000<br>    req = {<br>            "language_code"  : "en-US",                           # languages: en-US<br>            "encoding"       : riva.client.AudioEncoding.LINEAR_PCM ,<br>            "sample_rate_hz" : sample_rate_hz,                    # sample rate: 48KHz audio<br>            "voice_name"     : "English-US.Female-1"              # voices: `English-US.Female-1`, `English-US.Male-1`<br>    }<br><br>    # return response<br>    req["text"] = response<br>    synthesized_response = tts_client.synthesize(**req)<br>    <br>    return np.frombuffer(synthesized_response.audio, dtype=np.int16), sample_rate_hz</code></pre>



<p><strong>In practice?</strong></p>



<ul class="wp-block-list">
<li>The LLM response is retrieved</li>



<li>A call is made to the TTS AI Endpoint named&nbsp;<code><strong>nvr-tts-en-us</strong></code></li>



<li>The audio sample and the sample rate are returned to play the audio automatically</li>
</ul>
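


<p>To check the output without Streamlit, you can write the synthesized samples to a <code>.wav</code> file with Python’s standard <code>wave</code> module (a minimal sketch; the file name <code>reply.wav</code> is arbitrary):</p>



<pre class="wp-block-code"><code class="">import wave

# minimal sketch: save the synthesized speech to a playable .wav file
audio_samples, sample_rate_hz = tts_synthesis("Hello from AI Endpoints!", tts_client)
with wave.open("reply.wav", "wb") as wav_file:
    wav_file.setnchannels(1)              # mono
    wav_file.setsampwidth(2)              # 16-bit PCM (np.int16)
    wav_file.setframerate(sample_rate_hz)
    wav_file.writeframes(audio_samples.tobytes())</code></pre>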



<p>⚡️ You’re almost there! Now all you have to do is build your&nbsp;<strong>Chatbot app</strong>.</p>



<h5 class="wp-block-heading">Build the LLM chat app with Streamlit</h5>



<p>In this last step, create the&nbsp;<strong>Chatbot app</strong>&nbsp;using&nbsp;<strong><a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Streamlit</a></strong>, an open-source Python library that lets you quickly create user interfaces for Machine Learning models and demos. A working code example follows the list below.</p>



<p><strong>What to do?</strong></p>



<ul class="wp-block-list">
<li>Create a first Streamlit container to put the title using&nbsp;<strong><code>st.container()</code></strong>&nbsp;and&nbsp;<code><strong>st.title()</strong></code></li>



<li>Add a second container for bot and user messages thanks to the following components:&nbsp;<strong><code>st.container()</code></strong>&nbsp;;&nbsp;<strong><code>st.session_state()</code></strong>&nbsp;;&nbsp;<code><strong>st.chat_message()</strong></code></li>



<li>Use a third container for the microphone recorder, the calls to the ASR, LLM and TTS functions, and the automatic audio player</li>
</ul>



<pre class="wp-block-code"><code class=""># streamlit interface<br>with st.container():<br>    st.title("💬 Audio Virtual Assistant Chatbot")<br><br>with st.container(height=600):<br>    messages = st.container()<br><br>    if "messages" not in st.session_state:<br>        st.session_state["messages"] = [{"role": "system", "content": "Hello, I'm AVA!", "avatar":"🤖"}]<br><br>    for msg in st.session_state.messages:<br>        messages.chat_message(msg["role"], avatar=msg["avatar"]).write(msg["content"])<br><br>with st.container():<br><br>    placeholder = st.empty()<br>    _, recording = placeholder.empty(), mic_recorder(<br>            start_prompt="START RECORDING YOUR QUESTION ⏺️", <br>            stop_prompt="STOP ⏹️", <br>            format="wav",<br>            use_container_width=True,<br>            key='recorder'<br>        )<br><br>    if recording:  <br>        user_question = asr_transcription(recording['bytes'], asr_client)<br><br>        if prompt := user_question:<br>            st.session_state.messages.append({"role": "user", "content": prompt, "avatar":"👤"})<br>            messages.chat_message("user", avatar="👤").write(prompt)<br>            msg = llm_answer(st.session_state.messages, llm_client)<br>            st.session_state.messages.append({"role": "assistant", "content": msg, "avatar": "🤖"})<br>            messages.chat_message("system", avatar="🤖").write(msg)<br><br>            if msg is not None:<br>                audio_samples, sample_rate_hz = tts_synthesis(msg, tts_client)<br>                placeholder.audio(audio_samples, sample_rate=sample_rate_hz, autoplay=True)</code></pre>



<p>Now, the&nbsp;<strong>Audio Virtual Assistant</strong>&nbsp;is ready to use!</p>



<figure class="wp-block-image aligncenter"><img loading="lazy" decoding="async" width="834" height="916" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-16.png" alt="" class="wp-image-27051" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-16.png 834w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-16-273x300.png 273w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-16-768x844.png 768w" sizes="auto, (max-width: 834px) 100vw, 834px" /><figcaption class="wp-element-caption"><em>Audio Virtual Assistant web app</em></figcaption></figure>



<p>🚀 That’s it! Now get the most out of your tool by launching it locally.</p>



<h5 class="wp-block-heading">Launch Streamlit chatbot app locally</h5>



<p>Finally, you can start this Streamlit app locally by launching the following command:</p>



<pre class="wp-block-code"><code class="">streamlit run audio-virtual-assistant.py </code></pre>



<p>Benefit from the full power of your tool, as shown in the demo below!</p>



<figure class="wp-block-video aligncenter"><video height="792" style="aspect-ratio: 1122 / 792;" width="1122" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-demo.mp4"></video></figure>



<h5 class="wp-block-heading">Improvements</h5>



<p>By default, the <code><strong>nvr-tts-en-us</strong></code> model supports only a limited number of characters per request when generating audio. If you exceed this limit, you will encounter errors in your application.</p>



<p>To work around this limitation, you can replace the existing <code><strong>tts_synthesis</strong></code> function with the following implementation, which processes text in chunks:</p>



<pre class="wp-block-code"><code class="">def tts_synthesis(response, tts_client):<br>    # Split response into chunks of max 1000 characters<br>    max_chunk_length = 1000<br>    words = response.split()<br>    chunks = []<br>    current_chunk = ""<br><br>    for word in words:<br>        if len(current_chunk) + len(word) + 1 &lt;= max_chunk_length:<br>            current_chunk += " " + word if current_chunk else word<br>        else:<br>            chunks.append(current_chunk)<br>            current_chunk = word<br>    if current_chunk:<br>        chunks.append(current_chunk)<br><br>    all_audio = np.array([], dtype=np.int16)<br>    sample_rate_hz = 16000<br><br>    # Process each chunk and concatenate the resulting audio<br>    for text in chunks:<br>        req = {<br>            "language_code": "en-US",<br>            "encoding": riva.client.AudioEncoding.LINEAR_PCM,<br>            "sample_rate_hz": sample_rate_hz,<br>            "voice_name": "English-US.Female-1",<br>            "text": text.strip(),<br>        }<br>        synthesized = tts_client.synthesize(**req)<br>        audio_segment = np.frombuffer(synthesized.audio, dtype=np.int16)<br>        all_audio = np.concatenate((all_audio, audio_segment))<br><br>    return all_audio, sample_rate_hz</code></pre>



<p>☁️ Moreover, it’s also possible to make your interface accessible to everyone…</p>



<h5 class="wp-block-heading">Go further</h5>



<p>If you want to go further and deploy your web app in the cloud, refer to the following articles and tutorials.</p>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal"></a><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-streamlit-app-for-eda-and-interactive-prediction-part-2/" data-wpel-link="internal">Deploy a custom Docker image for Data Science project – Streamlit app for EDA and interactive prediction (Part 2)</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-build-use-custom-image?id=kb_article_view&amp;sysparm_article=KB0057413" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Deploy – Tutorial – Build &amp; use a custom Docker image</a></li>
</ul>



<h3 class="wp-block-heading">Conclusion of the Audio Virtual Assistant</h3>



<p>Well done 🎉! You have learned how to build your own&nbsp;<strong>Audio Virtual Assistant</strong>&nbsp;in a few lines of code.</p>



<p>You’ve also seen how easy it is to use&nbsp;<strong>AI Endpoints</strong>&nbsp;to create innovative turnkey solutions.</p>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/audio-virtual-assistant" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</strong></p>



<p>🚀&nbsp;<strong>What’s next?</strong>&nbsp;Implement a&nbsp;<a href="https://blog.ovhcloud.com/rag-chatbot-using-ai-endpoints-and-langchain/" data-wpel-link="internal">RAG chatbot</a>&nbsp;to specialize this Audio Virtual Assistant on your data!</p>



<h3 class="wp-block-heading">References</h3>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">Enhance your applications with AI Endpoints</a></li>



<li><a href="https://blog.ovhcloud.com/rag-chatbot-using-ai-endpoints-and-langchain/" data-wpel-link="internal">RAG chatbot using AI Endpoints and LangChain</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-langchain-and-javascript-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints, LangChain and Javascript to create a chatbot</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-and-langchain-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints and LangChain to create a chatbot</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbuild-a-powerful-audio-virtual-assistant-with-ai-endpoints%2F&amp;action_name=Build%20a%20powerful%20Audio%20Virtual%20Assistant%20in%20less%20than%20100%20lines%20of%20code%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-demo.mp4" length="2573065" type="video/mp4" />

			</item>
		<item>
		<title>Create your own Audio Summarizer assistant with AI Endpoints!</title>
		<link>https://blog.ovhcloud.com/create-audio-summarizer-assistant-with-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Thu, 04 Jul 2024 07:39:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26964</guid>

					<description><![CDATA[Do you dream of being able to summarize hours of meetings in a matter of seconds? Don&#8217;t go away, we&#8217;ll explain it all here! Introduction Are you looking for a way to efficiently summarize your meetings, broadcasts, and podcasts for quick reference or to provide to others? Look no further! In this blog post, you [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-audio-summarizer-assistant-with-ai-endpoints%2F&amp;action_name=Create%20your%20own%20Audio%20Summarizer%20assistant%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Do you dream of being able to<strong> summarize hours of meetings in a matter of seconds</strong>? Don&#8217;t go away, we&#8217;ll explain it all here!</em></p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="923" height="618" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio_transcriber_summarizer_blog_post-1.png" alt="" class="wp-image-26972" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio_transcriber_summarizer_blog_post-1.png 923w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio_transcriber_summarizer_blog_post-1-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio_transcriber_summarizer_blog_post-1-768x514.png 768w" sizes="auto, (max-width: 923px) 100vw, 923px" /><figcaption class="wp-element-caption"><em>Robot assistant transcribing and summarizing audios into short texts</em></figcaption></figure>



<h3 class="wp-block-heading">Introduction</h3>



<p>Are you looking for a way to efficiently summarize your meetings, broadcasts, and podcasts for quick reference or to provide to others? Look no further!</p>



<p>In this blog post, you will learn how to create an Audio Summarizer assistant that can not only transcribe but also summarize all your audio files.</p>



<p>Thanks to <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, it&#8217;s never been easier to create a virtual assistant that can help you stay on top of your meetings and keep track of important information.</p>



<p>This article will explore how AI APIs can be used to create an advanced virtual assistant that transcribes and summarizes any audio file thanks to <strong>ASR</strong> (Automatic Speech Recognition) technologies and famous <strong>LLMs</strong> (Large Language Models).</p>



<h3 class="wp-block-heading">Objectives</h3>



<p>Whether you&#8217;re a professional, a student or just want to make the most of your time, this step-by-step guide will show you how to create an Audio Summarizer assistant that will help you summarize your meetings, shows and podcasts, allowing you to concentrate on what really matters!</p>



<p><strong>How to?</strong></p>



<p>By connecting your AI Endpoints like puzzles!</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="895" height="541" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-puzzles.png" alt="" class="wp-image-26981" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-puzzles.png 895w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-puzzles-300x181.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-puzzles-768x464.png 768w" sizes="auto, (max-width: 895px) 100vw, 895px" /><figcaption class="wp-element-caption"><em>AI Endpoints &#8220;puzzles&#8221; connexion </em></figcaption></figure>



<p>👀 But first of all, a few definitions are needed to fully understand the technical implementation that follows.</p>



<h3 class="wp-block-heading">Concept</h3>



<p>In order to better understand the technologies that revolve around the Audio Summarizer, let&#8217;s start by looking at the tools and notions of <strong>ASR</strong>, <strong>LLM</strong>, &#8230;</p>



<h5 class="wp-block-heading">AI Endpoints in a few words</h5>



<p><a href="https://labs.ovhcloud.com/en/ai-endpoints/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> is a new serverless platform powered by OVHcloud and designed for developers.</p>



<p>The aim of <strong>AI Endpoints</strong> is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise. </p>



<p>It offers a <a href="https://endpoints.ai.cloud.ovh.net/catalog" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">curated catalog</a> of world-renowned AI models and Nvidia&#8217;s optimized models, with a commitment to privacy as data is not stored or shared during or after model use. </p>



<p>AI Endpoints provides access to advanced AI models, including Large Language Models (LLMs), natural language processing, translation, speech recognition, image recognition, and more. </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png" alt="" class="wp-image-28463" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18.png 1312w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>OVHcloud AI Endpoints website</em></figcaption></figure>



<p>To learn more about AI Endpoints, refer to this <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">website</a>.</p>



<p>AI Endpoints offers several ASR APIs in different languages&#8230; But what does ASR mean?</p>



<h5 class="wp-block-heading">It all starts with ASR</h5>



<p><strong>Automatic Speech Recognition</strong>&nbsp;(ASR) is a technology that converts spoken language into written text. </p>



<p>It is a complex process that involves several stages, including speech signal preprocessing, feature extraction, acoustic modeling, language modeling, and a speech recognition engine.</p>



<p>AI Endpoints makes it easy, with ready-to-use inference APIs. Discover how to use them <a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</p>



<p>In this context, ASR will be used to transcribe long audio files into text in order to summarize them with LLMs.</p>



<h5 class="wp-block-heading">Making summary with LLM</h5>



<p>The famous LLMs, or <strong>Large Language Models</strong>, are designed to generate human-like text.</p>



<p>They use complex algorithms to predict patterns in human language, understand context, and provide relevant responses. With LLM, virtual assistants can engage in meaningful and dynamic conversations with users.</p>



<p>To find out more, what better way than to test it yourself? Follow this <a href="https://endpoints.ai.cloud.ovh.net/models/018a19ab-167f-473b-8ec5-acb44380d175" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">link</a>.</p>



<p>For the current use case, the LLM prompt will specify that the model should generate a summary of the input text, based on the result of the ASR endpoint.</p>



<p>🤖 Do you want to start coding the Audio Summarizer? 3, 2, 1, get ready, go!</p>



<h3 class="wp-block-heading">Technical implementation of the Audio Summarizer</h3>



<p>In this technical part, the following points will be discussed:</p>



<ul class="wp-block-list">
<li>the use of the ASR API inside Python code</li>



<li>the integration of the Mixtral8x7B LLM</li>



<li>the creation of a web app with <a href="https://gradio.app/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gradio</a></li>
</ul>



<p><strong>➡️ Access the full code <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/audio-summarizer-assistant" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="818" height="489" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post.png" alt="" class="wp-image-26974" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post.png 818w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-300x179.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-768x459.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /><figcaption class="wp-element-caption"><em>Working principle of the web app resulting from technical implementation</em></figcaption></figure>



<p>To build the Audio Summarizer, start by setting up the environment.</p>



<h5 class="wp-block-heading">Set up the environment</h5>



<p>In order to use<strong> AI Endpoints</strong> APIs easily, create a <code><strong>.env</strong></code> file to store environment variables.</p>



<pre class="wp-block-code"><code class="">ASR_AI_ENDPOINT=https://whisper-large-v3.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1<br>LLM_AI_ENDPOINT=https://mixtral-8x7b-instruct-v01.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1<br>OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;ai-endpoints-api-token></code></pre>



<p>⚠️ <em>Test AI Endpoints and get your <strong>free token</strong> <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></em></p>



<p>In the next step, install the needed Python dependencies.</p>



<p>Create the <code><strong>requirements.txt</strong></code> file with the following libraries and launch the installation.</p>



<p>⚠️<em>The working environment is based on <strong>Python 3.11</strong></em></p>



<p><code>openai==1.68.2<br>gradio==4.44.1<br>pydub==0.25.1<br>python-dotenv==1.0.1</code></p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<p>Once this is done, you can create a Python file named <code><strong>audio-summarizer-app.py</strong></code>.</p>



<p>Then, import the Python libraries as follows:</p>



<pre class="wp-block-code"><code class="">import gradio as gr<br>import io<br>import os<br>import requests<br>from pydub import AudioSegment<br>from dotenv import load_dotenv<br>from openai import OpenAI<br>import functools</code></pre>



<p>Now, load and access the environment variables.</p>



<pre class="wp-block-code"><code class=""># access the environment variables from the .env file<br>load_dotenv()<br><br>asr_ai_endpoint_url = os.environ.get('ASR_AI_ENDPOINT') <br>llm_ai_endpoint_url = os.getenv("LLM_AI_ENDPOINT")<br>ai_endpoint_token = os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN")</code></pre>



<p>Then define the clients that communicate with your APIs and authenticate your requests:</p>



<pre class="wp-block-code"><code class="">asr_client = OpenAI(<br>    base_url=asr_ai_endpoint_url,<br>    api_key=ai_endpoint_token<br>)<br><br>llm_client = OpenAI(<br>    base_url=llm_ai_endpoint_url,<br>    api_key=ai_endpoint_token<br>)</code></pre>



<p>💡 You are now ready to start coding your web app!</p>



<h5 class="wp-block-heading">Transcribe audio file with ASR</h5>



<p>First, create the <strong>Automatic Speech Recognition</strong> (ASR) function in order to transcribe audio files into text.</p>



<p><strong>How does it work?</strong></p>



<ul class="wp-block-list">
<li>The audio file is preprocessed as follows: <code><strong>.wav</strong></code> format, <code><strong>1</strong></code> channel, <strong><code>16000</code></strong> frame rate</li>



<li>The transformed audio <code><strong>processed_audio</strong></code> is read</li>



<li>An API call is made to the ASR AI Endpoint named <code><strong>whisper-large-v3</strong></code></li>



<li>The full response is stored in the <code><strong>response</strong></code> variable and its text is returned by the function</li>
</ul>



<pre class="wp-block-code"><code class="">def asr_transcription(asr_client, audio):<br>    <br>    if audio is None:<br>        return " "<br><br>    else:<br>        # preprocess audio <br>        processed_audio = "/tmp/my_audio.wav"<br>        audio_input = AudioSegment.from_file(audio, "mp3")<br>        process_audio_to_wav = audio_input.set_channels(1)<br>        process_audio_to_wav = process_audio_to_wav.set_frame_rate(16000)<br>        process_audio_to_wav.export(processed_audio, format="wav")<br>    <br>        with open(processed_audio, "rb") as audio_file:<br>            response = asr_client.audio.transcriptions.create(<br>                model="whisper-large-v3",<br>                file=audio_file,<br>                response_format="verbose_json",<br>                timestamp_granularities=["segment"]<br>            )<br><br>        # return complete transcription<br>        return response.text</code></pre>



<p>🎉 Congratulations! Your ASR function is ready to use.</p>



<p>Now it&#8217;s time to call an LLM to summarize the transcribed text.</p>



<h5 class="wp-block-heading">Summarize audio with LLM</h5>



<p>In this second step, create the <strong>Chat Completion</strong> function to use Mixtral8x7B effectively.</p>



<p><strong>What to do?</strong></p>



<ul class="wp-block-list">
<li>Check that the transcription exists</li>



<li>Use the OpenAI API compatibility to call the LLM</li>



<li>Customize your prompt in order to <strong>specify LLM task</strong></li>



<li>Return the audio summary</li>
</ul>



<pre class="wp-block-code"><code class="">def chat_completion(llm_client, new_message):<br><br>    if new_message==" ":<br>        return "Please, send an input audio to get its summary!"<br>    <br>    else:<br><br>        # prompt<br>        history_openai_format = [{"role": "user", "content": f"Summarize the following text in a few words: {new_message}"}]<br>        # return summary<br>        return llm_client.chat.completions.create(<br>            model="Mixtral-8x7B-Instruct-v0.1",<br>            messages=history_openai_format,<br>            temperature=0,<br>            max_tokens=1024<br>        ).choices.pop().message.content</code></pre>



<p>⚡️ You&#8217;re almost there! Now all you have to do is build your web app.</p>



<p>To make your solution easy to use, what better way than to quickly create an interface with just a few lines of code?</p>



<h5 class="wp-block-heading">Build Gradio app</h5>



<p><strong>Gradio</strong> is an open-source Python library that lets you quickly create user interfaces for Machine Learning models and demos.</p>



<p><strong>What does it mean in practice?</strong></p>



<p>Inside a Gradio Block, you can:</p>



<ul class="wp-block-list">
<li>Define a theme for your UI</li>



<li>Add a title to your web app with <strong><code>gr.HTML()</code></strong></li>



<li>Upload audio thanks to the dedicated component, <strong><code>gr.Audio()</code></strong></li>



<li>Obtain the result of the written transcription with the <strong><code>gr.Textbox()</code></strong></li>



<li>Get a summary of the audio with the powerful LLM and a second <code><strong>gr.Textbox()</strong></code> component</li>



<li>Add a clear button with <code><strong>gr.ClearButton()</strong></code> to reset the page of the web app</li>
</ul>



<pre class="wp-block-code"><code class=""># create partial functions with bound client instances<br>asr_transcribe_fn = functools.partial(asr_transcription, asr_client)<br>chat_completion_fn = functools.partial(chat_completion, llm_client)<br><br><br># gradio<br>with gr.Blocks(theme=gr.themes.Default(primary_hue="blue"), fill_height=True) as demo:<br><br>    # add title and description<br>    with gr.Row():<br>        gr.HTML(<br>            """<br>            &lt;div align="center"><br>                &lt;h1>Welcome on Audio Summarizer web app 💬!&lt;/h1><br>                &lt;i>Transcribe and summarize your broadcast, meetings, conversations, potcasts and much more...&lt;/i><br>            &lt;/div><br>            &lt;br><br>            """<br>        )<br>        <br>    # audio zone for user question<br>    gr.Markdown("## Upload your audio file 📢")<br>    with gr.Row():<br>        inp_audio = gr.Audio(<br>            label = "Audio file in .wav or .mp3 format:",<br>            sources = ['upload'],<br>            type = "filepath",<br>        )<br><br>    # written transcription of user question<br>    with gr.Row():<br>        inp_text = gr.Textbox(<br>            label = "Audio transcription into text:",<br>        )<br>        <br>    # chabot answer<br>    gr.Markdown("## Chatbot summarization 🤖")<br>    with gr.Row():<br>        out_resp = gr.Textbox(<br>            label = "Get a summary of your audio:",<br>        )<br><br>    with gr.Row():<br>        <br>        # clear inputs<br>        clear = gr.ClearButton([inp_audio, inp_text, out_resp])<br>  <br>    # update functions<br>    inp_audio.change(<br>        fn = asr_transcribe_fn,<br>        inputs = inp_audio,<br>        outputs = inp_text<br>      )<br>    inp_text.change(<br>        fn = chat_completion_fn,<br>        inputs = inp_text,<br>        outputs = out_resp<br>      )</code></pre>



<p>Then, you can launch it from the <strong><code>main</code></strong> entry point.</p>



<pre class="wp-block-code"><code class="">if __name__ == '__main__':
 
    demo.launch(server_name="0.0.0.0", server_port=8000)</code></pre>



<p>Now, the web app is ready to be used!</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="702" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14-1024x702.png" alt="" class="wp-image-26982" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14-1024x702.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14-300x206.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14-768x526.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14.png 1167w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Audio Summarizer web app overview</em></figcaption></figure>



<p>🚀 That&#8217;s it! Now get the most out of your tool by launching it locally.</p>



<h5 class="wp-block-heading">Launch Gradio web app locally</h5>



<p>Finally, you can start this Gradio app locally by launching the following command:</p>



<pre class="wp-block-code"><code class="">python audio-summarizer-app.py</code></pre>



<p>Benefit from the full power of your tool and save time!</p>



<p>☁️ It&#8217;s also possible to make your interface accessible to everyone&#8230;</p>



<figure class="wp-block-video aligncenter"><video height="792" style="aspect-ratio: 1122 / 792;" width="1122" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/video-audio-summarizer.mp4"></video><figcaption class="wp-element-caption"><em>Audio Summarizer assistant demo</em></figcaption></figure>



<h5 class="wp-block-heading">Go Further</h5>



<p>If you want to go further and deploy your web app in the cloud, refer to the following articles and tutorials.</p>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal">Deploy a custom Docker image for Data Science project</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-build-use-custom-image?id=kb_article_view&amp;sysparm_article=KB0057413" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy &#8211; Tutorial &#8211; Build &amp; use a custom Docker image</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-gradio-sketch-recognition?id=kb_article_view&amp;sysparm_article=KB0048083" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy &#8211; Tutorial &#8211; Deploy a Gradio app for sketch recognition</a></li>
</ul>



<h3 class="wp-block-heading">Conclusion of the Audio Summarizer </h3>



<p>Well done 🎉! You have learned how to build your own <strong>Audio Summarizer</strong> app in a few lines of code.</p>



<p>You&#8217;ve also seen how easy it is to use <strong>AI Endpoints</strong> to create innovative turnkey solutions.</p>



<p><strong>➡️ Access the full code <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/audio-summarizer-assistant" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<p><strong>What&#8217;s next?</strong> Modify your prompt and add translation to get the summary in another language 💡</p>



<h3 class="wp-block-heading">References</h3>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">Enhance your applications with AI Endpoints</a></li>



<li><a href="https://blog.ovhcloud.com/rag-chatbot-using-ai-endpoints-and-langchain/" data-wpel-link="internal">RAG chatbot using AI Endpoints and LangChain</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-langchain-and-javascript-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints, LangChain and Javascript to create a chatbot</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-and-langchain-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints and LangChain to create a chatbot</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-audio-summarizer-assistant-with-ai-endpoints%2F&amp;action_name=Create%20your%20own%20Audio%20Summarizer%20assistant%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/07/video-audio-summarizer.mp4" length="7716155" type="video/mp4" />

			</item>
	</channel>
</rss>
