<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Open Source Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/open-source/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/open-source/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Tue, 10 Feb 2026 08:51:12 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Open Source Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/open-source/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Reference Architecture: Custom metric autoscaling for LLM inference with vLLM on OVHcloud AI Deploy and observability using MKS</title>
		<link>https://blog.ovhcloud.com/reference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 10 Feb 2026 08:51:11 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[MKS]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30203</guid>

					<description><![CDATA[Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability. This reference architecture describes a comprehensive solution for deploying, autoscaling and monitoring vLLM-based LLM workloads on OVHcloud infrastructure. It combines AI Deploy, used for model serving with custom metric autoscaling, and Managed Kubernetes Service (MKS), which [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em><strong>Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability.</strong></em></p>



<figure class="wp-block-image aligncenter size-large"><img fetchpriority="high" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg" alt="" class="wp-image-30579" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3.jpg 1200w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability based on OVHcloud infrastructure</em></figcaption></figure>



<p>This reference architecture describes a comprehensive solution for <strong>deploying, autoscaling and monitoring vLLM-based LLM workloads</strong> on OVHcloud infrastructure. It combines <strong>AI Deploy</strong>, used for <strong>model serving with custom metric autoscaling</strong>, and <strong>Managed Kubernetes Service (MKS)</strong>, which hosts the monitoring and observability stack.</p>



<p>By leveraging <strong>application-level Prometheus metrics exposed by vLLM</strong>, AI Deploy can automatically scale inference replicas based on real workload demand, ensuring <strong>high availability, consistent performance under load and efficient GPU utilisation</strong>. This autoscaling mechanism allows the platform to react dynamically to traffic spikes while maintaining predictable latency for end users.</p>



<p>On top of this scalable inference layer, the monitoring architecture provides <strong>observability</strong> through <strong>Prometheus</strong>, <strong>Grafana</strong> and Alertmanager. It enables real-time performance monitoring, capacity planning, and operational insights, while ensuring <strong>full data sovereignty</strong> for organisations running Large Language Models (LLMs) in production environments.</p>



<p><strong>What are the key benefits</strong>?</p>



<ul class="wp-block-list">
<li><strong>Cost-effective</strong>: Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability</strong>: Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure</strong>: All metrics and data remain within European datacentres</li>



<li><strong>Production-ready</strong>: Persistent storage, high availability, and automated monitoring</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">AI Deploy</h3>



<p>OVHcloud AI Deploy is a<strong>&nbsp;Container as a Service</strong>&nbsp;(CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications/APIs based on Machine Learning (ML), Deep Learning (DL) or Large Language Models (LLMs).</p>



<p><strong>Key points to keep in mind</strong>:</p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong>&nbsp;Bring your own custom Docker image and deploy it with a single command line or in a few clicks</li>



<li><strong>High-performance computing:</strong>&nbsp;A complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong>&nbsp;Supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong>&nbsp;Billing per minute, no surcharges</li>
</ul>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p><strong>OVHcloud MKS</strong> is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p><strong>What should you keep in mind?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Only pay for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, upgrades and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking and persistent storage</li>



<li><strong>Scalability and flexibility</strong>: Easily scale workloads and node pools to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p>In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Overview of the architecture</h2>



<p>This reference architecture describes a <strong>complete, secure and scalable solution</strong> to:</p>



<ul class="wp-block-list">
<li>Deploy an LLM with vLLM and <strong>AI Deploy</strong>, benefiting from automatic scaling based on custom metrics to ensure high service availability &#8211; vLLM exposes <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>/metrics</strong></mark></code> via its public HTTPS endpoint on AI Deploy</li>



<li>Collect, store and visualise these vLLM metrics using Prometheus and Grafana on <strong>MKS</strong></li>
</ul>



<figure class="wp-block-image aligncenter size-full"><img decoding="async" width="1200" height="630" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg" alt="" class="wp-image-30578" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-768x403.jpg 768w" sizes="(max-width: 1200px) 100vw, 1200px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability architecture overview</em></figcaption></figure>



<p>The solution comprises three main layers:</p>



<ol class="wp-block-list">
<li><strong>Model serving layer</strong> with AI Deploy
<ul class="wp-block-list">
<li>vLLM containers running on top of GPUs for LLM inference</li>



<li>vLLM inference server exposing Prometheus metrics</li>



<li>Automatic scaling based on custom metrics to ensure high availability</li>



<li>HTTPS endpoints with Bearer token authentication</li>
</ul>
</li>



<li><strong>Monitoring and observability infrastructure</strong> using Kubernetes
<ul class="wp-block-list">
<li>Prometheus for metrics collection and storage</li>



<li>Grafana for visualisation and dashboards</li>



<li>Persistent volume storage for long-term retention</li>
</ul>
</li>



<li><strong>Network layer</strong>
<ul class="wp-block-list">
<li>Secure HTTPS communication between components</li>



<li>OVHcloud LoadBalancer for external access</li>
</ul>
</li>
</ol>



<p>Before going further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark class="has-inline-color has-ast-global-color-0-color">Administrator</mark></code></strong></a> role</li>



<li><strong>ovhai CLI available</strong> &#8211;&nbsp;<em>install the&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a></em></li>



<li>A <strong>Hugging Face access</strong> &#8211; <em>create a&nbsp;<a href="https://huggingface.co/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">access token</a></em></li>



<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></strong></code> and <code><strong><mark class="has-inline-color has-ast-global-color-0-color">helm</mark></strong></code> (version 3.x or later) installed</li>
</ul>



<p><strong>🚀 Now you have all the ingredients for our recipe, it’s time to deploy Ministral 3 14B using AI Deploy and the vLLM Docker container!</strong></p>



<h2 class="wp-block-heading">Architecture guide: From autoscaling to observability for LLMs served by vLLM</h2>



<p>Let’s set up and deploy this architecture!</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg" alt="" class="wp-image-30580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2.jpg 1200w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Overview of the deployment workflow</em></figcaption></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><strong><em>In this example, <a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a> is used. Choose the open-source model of your choice and follow the same steps, adapting the model slug (from Hugging Face), the versions and the GPU(s) flavour.</em></strong></p>
</blockquote>



<p><em>Remember that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Manage access tokens</h3>



<p>The first step is to manage the access tokens needed for this deployment: a <strong>Hugging Face token</strong> to download the model weights, and an <strong>AI Deploy application token</strong> to secure access to the deployed endpoint.</p>



<p>Export your&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face token</a>.</p>



<pre class="wp-block-code"><code class="">export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx</code></pre>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Create a Bearer token</a>&nbsp;to access your AI Deploy app once it&#8217;s been deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>This returns the following output:</p>



<pre class="wp-block-code"><code class="">Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-01-26 11:53:05<br>Updated At: 20-01-26 11:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</code></pre>



<p>You can now store and export your access token:</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h3 class="wp-block-heading">Step 2 &#8211; LLM deployment using AI Deploy</h3>



<p>Before introducing the monitoring stack, this architecture starts with the <strong>deployment of Ministral 3 14B on OVHcloud AI Deploy</strong>, configured to <strong>autoscale based on custom Prometheus metrics exposed by vLLM itself</strong>.</p>



<h4 class="wp-block-heading">1. Define the targeted vLLM metric for autoscaling</h4>



<p>Before proceeding with the deployment of the <strong>Ministral 3 14B</strong> endpoint, you have to choose the metric you want to use as the trigger for scaling.</p>



<p>Instead of relying solely on CPU/RAM utilisation, AI Deploy allows autoscaling decisions to be driven by <strong>application-level signals</strong>.</p>



<p>To do this, you can consult the <a href="https://docs.vllm.ai/en/latest/design/metrics/#v1-metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">metrics exposed by vLLM</a>.</p>



<p>In this example, you can use a basic metric such as <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>vllm:num_requests_running</strong></mark></code> to scale the number of replicas based on <strong>real inference load</strong>.</p>



<p>This enables:</p>



<ul class="wp-block-list">
<li>Faster reaction to traffic spikes</li>



<li>Better GPU utilisation</li>



<li>Reduced inference latency under load</li>



<li>Cost-efficient scaling</li>
</ul>



<p>Finally, the configuration chosen for scaling this application is as follows:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Value</th><th>Description</th></tr></thead><tbody><tr><td>Metric source</td><td><code>/metrics</code></td><td>vLLM Prometheus endpoint</td></tr><tr><td>Metric name</td><td><code>vllm:num_requests_running</code></td><td>Number of in-flight requests</td></tr><tr><td>Aggregation</td><td><code>AVERAGE</code></td><td>Mean across replicas</td></tr><tr><td>Target value</td><td><code>50</code></td><td>Desired load per replica</td></tr><tr><td>Min replicas</td><td><code>1</code></td><td>Baseline capacity</td></tr><tr><td>Max replicas</td><td><code>3</code></td><td>Burst capacity</td></tr></tbody></table></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><em><strong>You can choose the metric that best suits your use case. You can also apply a patch to your AI Deploy deployment at any time to change the target metric for scaling</strong></em>.</p>
</blockquote>



<p>When the <strong>average number of running requests exceeds 50</strong>, AI Deploy automatically provisions <strong>additional GPU-backed replicas</strong>.</p>
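<p>Conceptually, this target-value rule boils down to a small calculation. The sketch below is a minimal illustration of HPA-style scaling maths, not AI Deploy&#8217;s actual controller implementation:</p>

```python
import math

def desired_replicas(avg_requests_running: float, current_replicas: int,
                     target_per_replica: float = 50.0,
                     min_replicas: int = 1, max_replicas: int = 3) -> int:
    """Sketch of a target-value autoscaling rule (HPA-style).

    With AVERAGE aggregation, the controller tries to keep the mean of
    vllm:num_requests_running per replica at the target value.
    """
    # total in-flight requests across all replicas
    total = avg_requests_running * current_replicas
    # replicas needed so each one handles roughly target_per_replica requests
    needed = math.ceil(total / target_per_replica)
    # clamp to the configured bounds
    return max(min_replicas, min(max_replicas, needed))

# Example: 2 replicas averaging 80 in-flight requests each
# -> 160 total, 4 replicas needed, capped at max_replicas=3
print(desired_replicas(80, 2))  # 3
```

<p>The same formula explains why the deployment never drops below one replica (baseline availability) or exceeds three (burst capacity).</p>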



<h4 class="wp-block-heading">2. Deploy Ministral 3 14B using AI Deploy</h4>



<p>Now you can deploy the LLM using the <strong><code>ovhai</code> CLI</strong>.</p>



<p>Key elements required for the deployment to function properly:</p>



<ul class="wp-block-list">
<li>GPU-based inference: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">1 x H100</mark></code></strong></li>



<li>vLLM OpenAI-compatible Docker image: <a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.13.0</mark></code></strong></a></li>



<li>Custom autoscaling rules based on Prometheus metrics: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm:num_requests_running</mark></strong></code></li>
</ul>



<p>Below is the reference command used to deploy the <strong><a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a></strong>:</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>  --name vllm-ministral-14B-autoscaling-custom-metric \<br>  --default-http-port 8000 \<br>  --label ai_deploy_token=my_operator_token \<br>  --gpu 1 \<br>  --flavor h100-1-gpu \<br>  -e OUTLINES_CACHE_DIR=/tmp/.outlines \<br>  -e HF_TOKEN=$MY_HF_TOKEN \<br>  -e HF_HOME=/hub \<br>  -e HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>  -e HF_HUB_ENABLE_HF_TRANSFER=0 \<br>  -v standalone:/hub:rw \<br>  -v standalone:/workspace:rw \<br>  --liveness-probe-path /health \<br>  --liveness-probe-port 8000 \<br>  --liveness-initial-delay-seconds 300 \<br>  --probe-path /v1/models \<br>  --probe-port 8000 \<br>  --initial-delay-seconds 300 \<br>  --auto-min-replicas 1 \<br>  --auto-max-replicas 3 \<br>  --auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics" \<br>  --auto-custom-metric-format PROMETHEUS \<br>  --auto-custom-value-location vllm:num_requests_running \<br>  --auto-custom-target-value 50 \<br>  --auto-custom-metric-aggregation-type AVERAGE \<br>  vllm/vllm-openai:v0.13.0 \<br>  -- bash -c "python3 -m vllm.entrypoints.openai.api_server \<br>    --model mistralai/Ministral-3-14B-Instruct-2512 \<br>    --tokenizer_mode mistral \<br>    --load_format mistral \<br>    --config_format mistral \<br>    --enable-auto-tool-choice \<br>    --tool-call-parser mistral \<br>    --enable-prefix-caching"</code></pre>



<p>Let’s break down the different parameters of this command.</p>



<h5 class="wp-block-heading"><strong>a. Start your AI Deploy app</strong></h5>



<p>Launch a new app using&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a>&nbsp;and name it.</p>



<p><code><strong>ovhai app run --name vllm-ministral-14B-autoscaling-custom-metric</strong></code></p>



<h5 class="wp-block-heading"><strong>b. Define access</strong></h5>



<p>Define the HTTP API port and restrict access to your token.</p>



<p><strong><code>--default-http-port 8000</code><br><code>--label ai_deploy_token=my_operator_token</code></strong></p>



<h5 class="wp-block-heading"><strong>c. Configure GPU resources</strong></h5>



<p>Specify the hardware type (<code><strong>h100-1-gpu</strong></code>), which refers to an&nbsp;<strong>NVIDIA H100 GPU</strong>, and the number of GPUs (<strong>1</strong>).</p>



<p><code><strong>--gpu 1<br>--flavor h100-1-gpu</strong></code></p>



<p><strong><mark>⚠️WARNING!</mark></strong>&nbsp;For this model, one H100 is sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access L40S and A100 GPUs for your LLM deployment.</p>



<h5 class="wp-block-heading"><strong>d. Set up environment variables</strong></h5>



<p>Configure caching for the&nbsp;<strong>Outlines library</strong>&nbsp;(used for efficient text generation):</p>



<p><code><strong>-e OUTLINES_CACHE_DIR=/tmp/.outlines</strong></code></p>



<p>Pass the&nbsp;<strong>Hugging Face token</strong>&nbsp;(<code>$MY_HF_TOKEN</code>) for model authentication and download:</p>



<p><code><strong>-e HF_TOKEN=$MY_HF_TOKEN</strong></code></p>



<p>Set the&nbsp;<strong>Hugging Face cache directory</strong>&nbsp;to&nbsp;<code>/hub</code>&nbsp;(where models will be stored):</p>



<p><code><strong>-e HF_HOME=/hub</strong></code></p>



<p>Allow execution of&nbsp;<strong>custom remote code</strong>&nbsp;from Hugging Face datasets (required for some model behaviours):</p>



<p><code><strong>-e HF_DATASETS_TRUST_REMOTE_CODE=1</strong></code></p>



<p>Disable&nbsp;<strong>Hugging Face Hub transfer acceleration</strong>&nbsp;(to use standard model downloading):</p>



<p><code><strong>-e HF_HUB_ENABLE_HF_TRANSFER=0</strong></code></p>



<h5 class="wp-block-heading"><strong>e. Mount persistent volumes</strong></h5>



<p>Mount&nbsp;<strong>two persistent storage volumes</strong>:</p>



<ol class="wp-block-list">
<li><code>/hub</code>&nbsp;→ Stores Hugging Face model files</li>



<li><code>/workspace</code>&nbsp;→ Main working directory</li>
</ol>



<p>The&nbsp;<code>rw</code>&nbsp;flag means&nbsp;<strong>read-write access</strong>.</p>



<p><code><strong>-v standalone:/hub:rw<br>-v standalone:/workspace:rw</strong></code></p>



<h5 class="wp-block-heading"><strong>f. Health checks and readiness</strong></h5>



<p>Configure <strong>liveness and readiness probes</strong>:</p>



<ol class="wp-block-list">
<li><code>/health</code> verifies the container is alive</li>



<li><code>/v1/models</code> confirms the model is loaded and ready to serve requests</li>
</ol>



<p>The long initial delays (300 seconds) correspond to the startup time of vLLM and the loading of the model on the GPU; they can be adjusted to match the startup time of your own model.</p>



<p><code><strong>--liveness-probe-path /health<br>--liveness-probe-port 8000<br>--liveness-initial-delay-seconds 300<br><br>--probe-path /v1/models<br>--probe-port 8000<br>--initial-delay-seconds 300</strong></code></p>
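<p>The probe mechanics can be sketched as a simple polling loop. This is an illustrative simplification only (the real probes are run by the platform); the stub probe below stands in for a GET on <code>/v1/models</code>:</p>

```python
import time
from typing import Callable

def wait_until_ready(probe: Callable[[], bool],
                     initial_delay: float = 0.0,
                     period: float = 0.1,
                     timeout: float = 5.0) -> bool:
    """Simplified readiness-probe loop: True once probe() succeeds."""
    time.sleep(initial_delay)          # e.g. 300 s in the deployment above
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():                    # e.g. GET /v1/models returning HTTP 200
            return True
        time.sleep(period)             # wait one probe period between attempts
    return False

# Example with a stub that succeeds on the third attempt
calls = {"n": 0}
def stub_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_ready(stub_probe))  # True
```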



<h5 class="wp-block-heading"><strong>g. Autoscaling configuration (custom metrics)</strong></h5>



<p>First set the minimum and maximum number of replicas.</p>



<p><strong><code>--auto-min-replicas 1<br>--auto-max-replicas 3</code></strong></p>



<p>This guarantees basic availability (one replica always up) while allowing for peak capacity.</p>



<p>Then enable autoscaling based on application-level metrics exposed by vLLM.</p>



<p><strong><code>--auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics"<br>--auto-custom-metric-format PROMETHEUS<br>--auto-custom-value-location vllm:num_requests_running<br>--auto-custom-target-value 50<br>--auto-custom-metric-aggregation-type AVERAGE</code></strong></p>



<p>AI Deploy:</p>



<ul class="wp-block-list">
<li>Scrapes the local <mark class="has-inline-color has-ast-global-color-0-color"><strong><code>/metrics</code></strong></mark> endpoint</li>



<li>Parses Prometheus-formatted metrics</li>



<li>Extracts the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>vllm:num_requests_running</code></mark></strong> gauge</li>



<li>Computes the average value across replicas</li>
</ul>



<p>Scaling behaviour:</p>



<ul class="wp-block-list">
<li>When the average number of in-flight requests exceeds <strong><code><mark class="has-inline-color has-ast-global-color-0-color">50</mark></code></strong>, AI Deploy adds replicas</li>



<li>When load decreases, replicas are scaled down</li>
</ul>



<p>This approach ensures high availability and predictable latency under fluctuating traffic.</p>



<h5 class="wp-block-heading"><strong>h. Choose the target Docker image and the startup command</strong></h5>



<p>Use the official <strong><a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM OpenAI-compatible Docker image</a></strong>.</p>



<p><strong><code>vllm/vllm-openai:v0.13.0</code></strong></p>



<p>Finally, run the model inside the container using a Python command to launch the vLLM API server:</p>



<ul class="wp-block-list">
<li><strong><code>python3 -m vllm.entrypoints.openai.api_server</code></strong>&nbsp;→ Starts the OpenAI-compatible vLLM API server</li>



<li><strong><code>--model mistralai/Ministral-3-14B-Instruct-2512</code></strong>&nbsp;→ Loads the&nbsp;<strong>Ministral 3 14B</strong>&nbsp;model from Hugging Face</li>



<li><strong><code>--tokenizer_mode mistral</code></strong>&nbsp;→ Uses the&nbsp;<strong>Mistral tokenizer</strong></li>



<li><strong><code>--load_format mistral</code></strong>&nbsp;→ Uses Mistral’s model loading format</li>



<li><strong><code>--config_format mistral</code></strong>&nbsp;→ Ensures the model configuration follows Mistral’s standard</li>



<li><code><strong>--enable-auto-tool-choice </strong></code>→ Automatically invokes tools when needed (function/tool calling)</li>



<li><strong><code>--tool-call-parser mistral </code></strong>→ Tool calling support</li>



<li><strong><code>--enable-prefix-caching</code></strong> → Prefix caching for improved throughput and reduced latency</li>
</ul>



<p>You can now launch this command using <strong>ovhai CLI</strong>.</p>



<h4 class="wp-block-heading">3. Check AI Deploy app status</h4>



<p>You can now check if your&nbsp;<strong>AI Deploy</strong>&nbsp;app is alive:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;your_vllm_app_id&gt;</code></pre>



<p><strong>Is your app in&nbsp;<code>RUNNING</code>&nbsp;status?</strong>&nbsp;Perfect! You can check in the logs that the server is started:</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;your_vllm_app_id&gt;</code></pre>



<p><strong><mark>⚠️WARNING!</mark></strong>&nbsp;This step may take a little time as the LLM must be loaded.</p>



<h4 class="wp-block-heading">4. Test that the deployment is functional</h4>



<p>First, send a prompt to the LLM. Launch the following query, asking the question of your choice:</p>



<pre class="wp-block-code"><code class="">curl https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \<br>  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "mistralai/Ministral-3-14B-Instruct-2512",<br>    "messages": [<br>      {"role": "system", "content": "You are a helpful assistant."},<br>      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}<br>    ],<br>    "stream": false<br>  }'</code></pre>
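<p>If you prefer scripting over <code>curl</code>, the same request can be assembled in Python. This is a sketch: the app ID, token and question are placeholders, and nothing is sent until you POST the result (for example with the <code>requests</code> library):</p>

```python
import json

def build_chat_request(app_id: str, token: str, question: str):
    """Build the URL, headers and JSON body for the chat/completions call above.

    `app_id` and `token` are placeholders for your AI Deploy app ID and
    Bearer token; this function only builds the request, it sends nothing.
    """
    url = f"https://{app_id}.app.gra.ai.cloud.ovh.net/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": "mistralai/Ministral-3-14B-Instruct-2512",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "stream": False,
    }).encode()
    return url, headers, body

url, headers, body = build_chat_request("my-app-id", "my-token",
                                        "Who founded OVHcloud?")
# e.g. requests.post(url, headers=headers, data=body, timeout=60)
```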



<p>You can also verify access to vLLM metrics.</p>



<pre class="wp-block-code"><code class="">curl -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/metrics</code></pre>
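<p>The <code>/metrics</code> response uses the Prometheus text exposition format. The following sketch shows how the autoscaling gauge could be extracted from such a payload; the sample values below are made up for illustration:</p>

```python
# Parse a Prometheus text-format exposition and extract one gauge.
# The SAMPLE payload is illustrative, not real vLLM output.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistralai/Ministral-3-14B-Instruct-2512"} 12.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="mistralai/Ministral-3-14B-Instruct-2512"} 3.0
"""

def read_gauge(payload: str, metric: str) -> float:
    """Return the first sample value for `metric` in a Prometheus text payload."""
    for line in payload.splitlines():
        if line.startswith("#"):       # skip HELP/TYPE comment lines
            continue
        # match the metric name followed by a label set or a space
        if line.startswith(metric) and line[len(metric)] in "{ ":
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(metric)

print(read_gauge(SAMPLE, "vllm:num_requests_running"))  # 12.0
```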



<p>If both tests return HTTP 200 responses, the model deployment is functional and you are ready to move on to the next step!</p>



<p>The next step is to set up the observability and monitoring stack. This autoscaling mechanism is <strong>fully independent</strong> of the Prometheus instance used for observability:</p>



<ul class="wp-block-list">
<li>AI Deploy queries the local <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code></mark></strong> endpoint internally</li>



<li>Prometheus scrapes the <strong>same metrics endpoint</strong> externally for monitoring, dashboards and potentially alerting</li>
</ul>



<p>This ensures:</p>



<ul class="wp-block-list">
<li>A single source of truth for metrics</li>



<li>No duplication of exporters</li>



<li>Consistent signals for scaling and observability</li>
</ul>



<h3 class="wp-block-heading">Step 3 &#8211; Create an MKS cluster</h3>



<p>From <a href="https://manager.eu.ovhcloud.com/#/hub/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Control Panel</a>, create a Kubernetes cluster using the <strong>MKS</strong>.</p>



<p>Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Location</strong>: GRA (Gravelines) &#8211; <em>you can select the same region as for AI Deploy</em></li>



<li><strong>Network</strong>: Public</li>



<li><strong>Node pool</strong>:
<ul class="wp-block-list">
<li>Flavour: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">b2-15</mark></strong></code> (or something similar)</li>



<li>Number of nodes: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">3</mark></code></strong></li>



<li>Autoscaling: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">OFF</mark></code></strong></li>
</ul>
</li>



<li><strong>Name your node pool:</strong> <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></strong></li>
</ul>



<p>You should see your cluster (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>prometheus-vllm-metrics-ai-deploy</strong></mark></code>) in the list, along with the following information:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="632" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png" alt="" class="wp-image-30242" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-300x185.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-768x474.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1536x948.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-2048x1264.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the status is green with the <strong><mark style="color:#00d084" class="has-inline-color"><code>OK</code></mark></strong> label, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 4 &#8211; Configure Kubernetes access</h3>



<p>Download your <strong>kubeconfig file</strong> from the OVHcloud Control Panel and configure <strong><code><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></code></strong>:</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p>Now you can create the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>values-prometheus.yaml</code></mark></strong> file:</p>



<pre class="wp-block-code"><code class=""># general configuration<br>nameOverride: "monitoring"<br>fullnameOverride: "monitoring"<br><br># Prometheus configuration<br>prometheus:<br>  prometheusSpec:<br>    # data retention (15d)<br>    retention: 15d<br>    <br>    # scrape interval (15s)<br>    scrapeInterval: 15s<br>    <br>    # persistent storage (required for production deployment)<br>    storageSpec:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed  # OVHcloud storage class<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 50Gi  # (can be modified according to your needs)<br>    <br>    # scrape vLLM metrics from your AI Deploy instance (Ministral 3 14B)<br>    additionalScrapeConfigs:<br>      - job_name: 'vllm-ministral'<br>        scheme: https<br>        metrics_path: '/metrics'<br>        scrape_interval: 15s<br>        scrape_timeout: 10s<br>        <br>        # authentication using the AI Deploy Bearer token stored in a Kubernetes Secret<br>        bearer_token_file: /etc/prometheus/secrets/vllm-auth-token/token<br>        static_configs:<br>          - targets:<br>              - '&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net'  # /!\ REPLACE &lt;APP_ID&gt; with yours /!\<br>            labels:<br>              service: 'vllm'<br>              model: 'ministral'<br>              environment: 'production'<br>        <br>        # TLS configuration<br>        tls_config:<br>          insecure_skip_verify: false<br>    <br>    # kube-prometheus-stack mounts the secret under /etc/prometheus/secrets/ and makes it accessible to Prometheus<br>    secrets:<br>      - vllm-auth-token<br><br># Grafana configuration (visualization layer)<br>grafana:<br>  enabled: true<br>  <br>  # disable automatic datasource provisioning<br>  sidecar:<br>    datasources:<br>      enabled: false<br>  <br>  # persistent dashboards<br>  persistence:<br>    enabled: true<br>    storageClassName: csi-cinder-high-speed<br>    size: 10Gi<br>  <br>  # /!\ DEFINE ADMIN PASSWORD - REPLACE "test" WITH YOUR OWN /!\<br>  adminPassword: "test"<br>  <br>  # access via OVHcloud LoadBalancer (public IP and managed LB)<br>  service:<br>    type: LoadBalancer<br>    port: 80<br>    annotations:<br>      # optional: restrict access to specific IPs<br>      # service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources: "1.2.3.4/32"<br>  <br># alertmanager (optional but recommended for production)<br>alertmanager:<br>  enabled: true<br>  <br>  alertmanagerSpec:<br>    storage:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 10Gi<br><br># cluster observability components<br>nodeExporter:<br>  enabled: true<br>  <br>kubeStateMetrics:<br>  enabled: true</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><strong><em>On OVHcloud MKS, persistent storage is handled automatically through the Cinder CSI driver. When a PersistentVolumeClaim (PVC) references a supported <code>storageClassName</code> such as <code>csi-cinder-high-speed</code>, OVHcloud dynamically provisions the underlying Block Storage volume and attaches it to the node running the pod. This enables stateful components like Prometheus, Alertmanager and Grafana to persist data reliably without any manual volume management, making the architecture fully cloud-native and operationally simple.</em></strong></p>
</blockquote>
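<p>As an aside, you can verify this dynamic provisioning behaviour on its own before deploying the full stack. The following is a minimal sketch of a standalone PersistentVolumeClaim (the claim name <code>test-claim</code> is just an example, not part of the monitoring setup); applying it with <code>kubectl apply -f</code> should result in a Cinder Block Storage volume being created automatically:</p>

```yaml
# minimal sketch: on OVHcloud MKS, creating this claim triggers
# dynamic provisioning of a Block Storage volume via the Cinder CSI driver
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim        # hypothetical name, for illustration only
spec:
  storageClassName: csi-cinder-high-speed
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

<p>Once a pod references the claim, the volume is attached to the node running that pod; deleting the claim releases the volume according to the storage class reclaim policy.</p>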



<p>Then create the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></code></strong> namespace:</p>



<pre class="wp-block-code"><code class=""># create namespace<br>kubectl create namespace monitoring<br><br># verify creation<br>kubectl get namespaces | grep monitoring</code></pre>



<p>Finally, configure the Bearer token secret to access vLLM metrics:</p>



<pre class="wp-block-code"><code class=""># create bearer token secret<br>kubectl create secret generic vllm-auth-token \<br>  --from-literal=token="$MY_OVHAI_ACCESS_TOKEN" \<br>  -n monitoring<br><br># verify secret creation<br>kubectl get secret vllm-auth-token -n monitoring<br><br># test token (optional)<br>kubectl get secret vllm-auth-token -n monitoring \<br>  -o jsonpath='{.data.token}' | base64 -d</code></pre>



<p>Right, if everything is working, let&#8217;s move on to deployment.</p>



<h3 class="wp-block-heading">Step 5 &#8211; Deploy Prometheus stack</h3>



<p>Add the Prometheus Helm repository and install the monitoring stack. The deployment creates:</p>



<ul class="wp-block-list">
<li>Prometheus StatefulSet with persistent storage</li>



<li>Grafana deployment with LoadBalancer access</li>



<li>Alertmanager for future alert configuration (optional)</li>



<li>Supporting components (node exporters, kube-state-metrics)</li>
</ul>



<pre class="wp-block-code"><code class=""># add Helm repository<br>helm repo add prometheus-community \<br>  https://prometheus-community.github.io/helm-charts<br>helm repo update<br><br># install monitoring stack<br>helm install monitoring prometheus-community/kube-prometheus-stack \<br>  --namespace monitoring \<br>  --values values-prometheus.yaml \<br>  --wait</code></pre>



<p>Then you can retrieve the LoadBalancer IP address to access Grafana:</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n monitoring monitoring-grafana</code></pre>



<p>Finally, open your browser to <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://&lt;EXTERNAL-IP&gt;</mark></strong></code> and login with:</p>



<ul class="wp-block-list">
<li><strong>Username</strong>: <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>admin</strong></mark></code></li>



<li><strong>Password</strong>: as configured in your <code><strong><mark class="has-inline-color has-ast-global-color-0-color">values-prometheus.yaml</mark></strong></code> file</li>
</ul>



<h3 class="wp-block-heading">Step 6 &#8211; Create Grafana dashboards</h3>



<p>In this step, you will access the Grafana interface, add your Prometheus as a new data source, then create a complete dashboard with the different vLLM metrics.</p>



<h4 class="wp-block-heading">1. Add a new data source in Grafana</h4>



<p>First of all, create a new Prometheus connection inside Grafana:</p>



<ul class="wp-block-list">
<li>Navigate to <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Connections</code></mark></strong> → <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Data sources</code></mark></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Add data source</mark></code></strong></li>



<li>Select <strong>Prometheus</strong></li>



<li>Configure URL: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://monitoring-prometheus:9090</mark></strong></code></li>



<li>Click <strong>Save &amp; test</strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="609" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png" alt="" class="wp-image-30247" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-300x178.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-768x457.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1536x913.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-2048x1218.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now that your Prometheus has been configured as a new data source, you can create your Grafana dashboard.</p>



<h4 class="wp-block-heading">2. Create your monitoring dashboard</h4>



<p>To begin with, you can use the following pre-configured Grafana dashboard by downloading this JSON file locally:</p>





<p>In the left-hand menu, select <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboard</mark></code></strong>:</p>



<ol class="wp-block-list">
<li>Navigate to <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboards</mark></code></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Import</mark></code></strong></li>



<li>Upload the provided dashboard JSON</li>



<li>Select <strong>Prometheus</strong> as datasource</li>



<li>Click <strong>Import</strong> and select the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">vLLM-metrics-grafana-monitoring.json</mark></code></strong> file</li>
</ol>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="449" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png" alt="" class="wp-image-30250" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-300x131.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-768x337.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1536x673.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-2048x897.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The dashboard provides real-time visibility for <strong>Ministral 3 14B</strong> deployed with the vLLM container on OVHcloud AI Deploy.</p>



<p>You can now track:</p>



<ul class="wp-block-list">
<li><strong>Performance metrics</strong>: TTFT, inter-token latency, end-to-end latency</li>



<li><strong>Throughput indicators</strong>: Requests per second, token generation rates</li>



<li><strong>Resource utilisation</strong>: KV cache usage, active/waiting requests</li>



<li><strong>Capacity indicators</strong>: Queue depth, preemption rates</li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png" alt="" class="wp-image-30253" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Here are the key metrics tracked and displayed in the Grafana dashboard:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric Category</th><th>Prometheus Metric</th><th>Description</th><th>Use case</th></tr></thead><tbody><tr><td><strong>Latency</strong></td><td><code>vllm:time_to_first_token_seconds</code></td><td>Time until first token generation</td><td>User experience monitoring</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:inter_token_latency_seconds</code></td><td>Time between tokens</td><td>Throughput optimisation</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:e2e_request_latency_seconds</code></td><td>End-to-end request time</td><td>SLA monitoring</td></tr><tr><td><strong>Throughput</strong></td><td><code>vllm:request_success_total</code></td><td>Successful requests counter</td><td>Capacity planning</td></tr><tr><td><strong>Resource</strong></td><td><code>vllm:kv_cache_usage_perc</code></td><td>KV cache memory usage</td><td>Memory management</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_running</code></td><td>Active requests</td><td>Load monitoring</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_waiting</code></td><td>Queued requests</td><td>Overload detection</td></tr><tr><td><strong>Capacity</strong></td><td><code>vllm:num_preemptions_total</code></td><td>Request preemptions</td><td>Peak load indicator</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:prompt_tokens_total</code></td><td>Input tokens processed</td><td>Usage analytics</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:generation_tokens_total</code></td><td>Output tokens generated</td><td>Cost tracking</td></tr></tbody></table></figure>
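<p>As an illustration, the latency metrics in the table above are exposed as Prometheus histograms, so dashboard panels typically wrap them in <code>histogram_quantile</code>, while counters are turned into rates. A few example queries (sketches only; adjust the time window and label filters to your setup):</p>

```promql
# p95 time to first token over the last 5 minutes
histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))

# request throughput (successful requests per second)
sum(rate(vllm:request_success_total[5m]))

# generated tokens per second
sum(rate(vllm:generation_tokens_total[5m]))

# current queue pressure (gauge, no rate needed)
vllm:num_requests_waiting
```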



<p>Well done, you now have at your disposal:</p>



<ul class="wp-block-list">
<li>An endpoint of the Ministral 3 14B model deployed with vLLM thanks to <strong>OVHcloud AI Deploy</strong> and its autoscaling strategies based on custom metrics</li>



<li>Prometheus for metrics collection and Grafana for visualisation/dashboards thanks to <strong>OVHcloud MKS</strong></li>
</ul>



<p><strong>But how can you check that everything will work when the load increases?</strong></p>



<h3 class="wp-block-heading">Step 7 &#8211; Test autoscaling and real-time visualisation</h3>



<p>The first objective here is to force AI Deploy to:</p>



<ul class="wp-block-list">
<li>Increase <code>vllm:num_requests_running</code></li>



<li>&#8216;Saturate&#8217; a single replica</li>



<li>Trigger the <strong>scale up</strong></li>



<li>Observe replica increase + latency drop</li>
</ul>



<h4 class="wp-block-heading">1. Autoscaling testing strategy</h4>



<p>The goal is to combine:</p>



<ul class="wp-block-list">
<li><strong>High concurrency</strong></li>



<li><strong>Long prompts</strong> (KV cache heavy)</li>



<li><strong>Long generations</strong></li>



<li><strong>Bursty load</strong></li>
</ul>



<p>This is what vLLM autoscaling actually reacts to.</p>



<p>To do so, a Python script can simulate the expected behaviour:</p>



<pre class="wp-block-code"><code class="">import os<br>import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br><br>APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/v1"  # /!\ REPLACE &lt;APP_ID&gt; with yours /!\<br>MODEL = "mistralai/Ministral-3-14B-Instruct-2512"<br>API_KEY = os.environ["MY_OVHAI_ACCESS_TOKEN"]<br><br>CONCURRENT_WORKERS = 500          # concurrency (main scaling trigger)<br>REQUESTS_PER_WORKER = 25<br>MAX_TOKENS = 768                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># OpenAI-compatible client<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key=API_KEY,<br>)<br><br># basic metrics<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception:<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("Starting autoscaling stress test...")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n=== AUTOSCALING BENCH RESULTS ===")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>
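<p>Averages can hide tail behaviour during a burst. As a small, hypothetical complement to the script above, latency percentiles (p50/p95/p99) can be derived from the collected <code>latencies</code> list with a nearest-rank helper:</p>

```python
# hypothetical add-on for the stress-test script above:
# report tail latencies, not just the average
def percentile(values, pct):
    """Return the pct-th percentile (0-100) using nearest-rank on sorted data."""
    if not values:
        raise ValueError("no latencies recorded")
    ordered = sorted(values)
    # nearest-rank index, clamped to the list bounds
    idx = min(len(ordered) - 1, max(0, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

if __name__ == "__main__":
    # sample latencies in seconds (made-up values for illustration)
    sample = [0.8, 1.1, 1.3, 2.0, 2.4, 3.1, 3.5, 4.2, 5.0, 9.7]
    print(f"p50: {percentile(sample, 50):.2f}s")
    print(f"p95: {percentile(sample, 95):.2f}s")
    print(f"p99: {percentile(sample, 99):.2f}s")  # dominated by the slowest request
```

<p>A rising gap between p50 and p99 during the test is a good hint that a single replica is saturating before the scale-up kicks in.</p>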



<p><strong>How can you verify that autoscaling is working and that the load is being handled correctly without latency skyrocketing?</strong></p>



<h4 class="wp-block-heading">2. Hardware and platform-level monitoring</h4>



<p>First, <strong>AI Deploy Grafana</strong> answers <strong>&#8216;What resources are being used and how many replicas exist?&#8217;</strong>.</p>



<p>GPU utilisation, GPU memory, CPU, RAM and replica count are monitored through <strong>OVHcloud AI Deploy Grafana</strong> (monitoring URL), which exposes infrastructure and runtime metrics for the AI Deploy application. This layer provides visibility into <strong>resource saturation and scaling events</strong> managed by the AI Deploy platform itself.</p>



<p>Access it using the following URL (do not forget to replace <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>&lt;APP_ID&gt;</strong></mark></code> with your own): <strong><code>https://monitoring.gra.ai.cloud.ovh.net/d/app/app-monitoring?var-app=</code><mark class="has-inline-color has-ast-global-color-0-color"><code>&lt;APP_ID&gt;</code></mark><code>&amp;orgId=1</code></strong></p>



<p>For example, check GPU/RAM metrics:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png" alt="" class="wp-image-30260" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can also monitor scale ups and downs in real time, as well as information on HTTP calls and much more!</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png" alt="" class="wp-image-30261" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h4 class="wp-block-heading">3. Software and application-level monitoring</h4>



<p>Next, the combination of MKS + Prometheus + Grafana answers <strong>&#8216;How does the inference engine behave internally?&#8217;</strong>.</p>



<p>In fact, vLLM internal metrics (request concurrency, token throughput, latency indicators, KV cache pressure, etc.) are collected via the <strong>vLLM <code>/metrics</code> endpoint</strong> and scraped by <strong>Prometheus running on OVHcloud MKS</strong>, then visualised in a <strong>dedicated Grafana instance</strong>. This layer focuses on <strong>model behaviour and inference performance</strong>.</p>
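<p>Before wiring dashboard panels, it can help to sanity-check what the <code>/metrics</code> endpoint actually returns. Below is a minimal sketch of a parser for simple (non-histogram) samples in the Prometheus text exposition format, demonstrated on a made-up payload resembling vLLM output; fetching the real endpoint would use the same Bearer token as the Prometheus scrape configuration:</p>

```python
# minimal sketch: extract simple (non-histogram) samples from the
# Prometheus text exposition format returned by vLLM's /metrics endpoint
def parse_metric(payload, name):
    """Return the first value of `name` found in a metrics payload, or None."""
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE/comment lines
            continue
        metric, _, value = line.rpartition(" ")
        # match either a bare metric name or one carrying labels
        if metric == name or metric.startswith(name + "{"):
            return float(value)
    return None

# made-up sample payload, for illustration only
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="ministral"} 7.0
vllm:num_requests_waiting{model_name="ministral"} 42.0
"""

print(parse_metric(SAMPLE, "vllm:num_requests_running"))  # 7.0
```

<p>Note this naive <code>rpartition</code> split only works for label values without spaces; for anything serious, a proper client such as <code>prometheus_client</code>&#8217;s text parser is the better choice.</p>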



<p>Find all these metrics at the following URL (just replace <strong><code><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></code></strong>): <strong><code>http://<mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark>/d/vllm-ministral-monitoring/ministral-14b-vllm-metrics-monitoring?orgId=1</code></strong></p>



<p>Find key metrics such as TTFT (time to first token):</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png" alt="" class="wp-image-30263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can also find some information about <strong>&#8216;Model load and throughput&#8217;</strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png" alt="" class="wp-image-30264" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To go further and add even more metrics, you can refer to the vLLM documentation on &#8216;<a href="https://docs.vllm.ai/en/v0.7.2/getting_started/examples/prometheus_grafana.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus and Grafana</a>&#8216;.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>This reference architecture provides a scalable and production-ready approach for deploying LLM inference on OVHcloud using <strong>AI Deploy</strong> and the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-apps-deployments?id=kb_article_view&amp;sysparm_article=KB0047997#advanced-custom-metrics-for-autoscaling" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">autoscaling on custom metrics feature</a>.</p>



<p>OVHcloud <strong>MKS</strong> is dedicated to running Prometheus and Grafana, enabling secure scraping and visualisation of <strong>vLLM internal metrics</strong> exposed via the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code> </mark></strong>endpoint.</p>



<p>By scraping vLLM metrics securely from AI Deploy into Prometheus and exposing them through Grafana, the architecture provides full visibility into model behaviour, performance and load, enabling informed scaling analysis, troubleshooting and capacity planning in production environments.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: build a sovereign n8n RAG workflow for AI agent using OVHcloud Public Cloud solutions</title>
		<link>https://blog.ovhcloud.com/reference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 27 Jan 2026 13:12:03 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Managed Database]]></category>
		<category><![CDATA[n8n]]></category>
		<category><![CDATA[Object Storage]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[S3]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29694</guid>

					<description><![CDATA[What if an n8n workflow, deployed in a&#160;sovereign environment, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection. In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&#160;Large Language Models&#160;(LLMs) [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>What if an n8n workflow, deployed in a&nbsp;<strong>sovereign environment</strong>, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection.</em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg" alt="" class="wp-image-30002" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-300x169.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-768x432.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1536x864.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag.jpg 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>n8n workflow overview</em></figcaption></figure>



<p>In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&nbsp;<strong>Large Language Models</strong>&nbsp;(LLMs) is becoming a strategic differentiator.</p>



<p><strong>How?</strong>&nbsp;By building&nbsp;<strong>Agentic RAG systems</strong>&nbsp;capable of retrieving, reasoning, and acting autonomously based on external knowledge.</p>



<p>To make this possible, engineers need a way to connect&nbsp;<strong>retrieval pipelines (RAG)</strong>&nbsp;with&nbsp;<strong>tool-based orchestration</strong>.</p>



<p>This article outlines a&nbsp;<strong>reference architecture</strong>&nbsp;for building a&nbsp;<strong>fully automated RAG pipeline orchestrated by n8n</strong>, leveraging&nbsp;<strong>OVHcloud AI Endpoints</strong>&nbsp;and&nbsp;<strong>PostgreSQL with pgvector</strong>&nbsp;as core components.</p>



<p>The final result will be a system that automatically ingests Markdown documentation from&nbsp;<strong>Object Storage</strong>, creates embeddings with OVHcloud’s&nbsp;<strong>BGE-M3</strong>&nbsp;model available on AI Endpoints, and stores them in a&nbsp;<strong>Managed Database PostgreSQL</strong>&nbsp;with pgvector extension.</p>



<p>Lastly, you’ll be able to build an AI Agent that lets you chat with an LLM (<strong>GPT-OSS-120B</strong>&nbsp;on AI Endpoints). This agent, utilising the RAG implementation carried out upstream, will be an expert on OVHcloud products.</p>



<p>You can further improve the process by using an&nbsp;<strong>LLM guard</strong>&nbsp;to protect the questions sent to the LLM, and set up a chat memory to use conversation history for higher response quality.</p>



<p><strong>But what about n8n?</strong></p>



<p><a href="https://n8n.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>n8n</strong></a>, the open-source workflow automation tool,&nbsp;offers many benefits and connects seamlessly with over&nbsp;<strong>300</strong>&nbsp;APIs, apps, and services:</p>



<ul class="wp-block-list">
<li><strong>Open-source</strong>: n8n is a 100% self-hostable solution, which means you retain full data control;</li>



<li><strong>Flexible</strong>: combines low-code nodes and custom JavaScript/Python logic;</li>



<li><strong>AI-ready</strong>: includes useful integrations for LangChain, OpenAI, and embedding support capabilities;</li>



<li><strong>Composable</strong>: enables simple connections between data, APIs, and models in minutes;</li>



<li><strong>Sovereign by design</strong>: compliant with privacy-sensitive or regulated sectors.</li>
</ul>



<p>This reference architecture serves as a blueprint for building a sovereign, scalable Retrieval Augmented Generation (<strong>RAG</strong>) platform using&nbsp;<strong>n8n</strong>&nbsp;and&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;solutions.</p>



<p>This setup shows how to orchestrate data ingestion, generate embeddings, and enable conversational AI by combining&nbsp;<strong>OVHcloud Object Storage</strong>,&nbsp;<strong>Managed Databases with PostgreSQL</strong>,&nbsp;<strong>AI Endpoints</strong>&nbsp;and&nbsp;<strong>AI Deploy</strong>. <strong>The result?</strong>&nbsp;An AI environment that is fully integrated, protects privacy, and is exclusively hosted on <strong>OVHcloud’s European infrastructure</strong>.</p>



<h2 class="wp-block-heading">Overview of the n8n workflow architecture for RAG </h2>



<p>The workflow involves the following steps:</p>



<ul class="wp-block-list">
<li><strong>Ingestion:</strong>&nbsp;documentation in markdown format is fetched from <strong>OVHcloud Object Storage (S3);</strong></li>



<li><strong>Preprocessing:</strong> n8n cleans and normalises the text, removing YAML front-matter and encoding noise;</li>



<li><strong>Vectorisation:</strong>&nbsp;each document is embedded using the <strong>BGE-M3</strong> model, which is available via <strong>OVHcloud AI Endpoints;</strong></li>



<li><strong>Persistence:</strong> vectors and metadata are stored in <strong>OVHcloud PostgreSQL Managed Database</strong> using pgvector;</li>



<li><strong>Retrieval:</strong> when a user sends a query, n8n triggers a <strong>LangChain Agent</strong> that retrieves relevant chunks from the database;</li>



<li><strong>Reasoning and actions:</strong>&nbsp;the <strong>AI Agent node</strong> combines LLM reasoning, memory, and tool usage to generate a contextual response or trigger downstream actions (Slack reply, Notion update, API call, etc.).</li>
</ul>
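<p>In code terms, the preprocessing step above amounts to stripping the YAML front-matter and normalising whitespace. Here is a minimal sketch in plain JavaScript (the kind of logic you would put in an n8n Code node); the exact delimiters and cleaning rules are assumptions, not the workflow’s literal implementation:</p>

```javascript
// Strip a leading YAML front-matter block (--- ... ---) and
// normalise whitespace before the text is sent for embedding.
function preprocessMarkdown(raw) {
  const withoutFrontMatter = raw.replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n/, '');
  return withoutFrontMatter
    .replace(/\u00a0/g, ' ')    // non-breaking spaces from HTML exports
    .replace(/[ \t]+/g, ' ')    // collapse repeated spaces and tabs
    .replace(/\n{3,}/g, '\n\n') // squeeze excessive blank lines
    .trim();
}

const doc = '---\ntitle: Install the ovhai CLI\n---\n\n# Guide\n\nSome   text.\n';
console.log(preprocessMarkdown(doc)); // -> '# Guide\n\nSome text.'
```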



<p>In this tutorial, all services are deployed within the <strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you start, double-check that you have:</p>



<ul class="wp-block-list">
<li>an <strong>OVHcloud Public Cloud</strong> account</li>



<li>an <strong>OpenStack user</strong> with the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">&nbsp;following roles</a>:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>



<li>An <strong>API key</strong> for <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-endpoints-getting-started?id=kb_article_view&amp;sysparm_article=KB0065401" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></li>



<li><strong>ovhai CLI available</strong> – <em>install the </em><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a></li>



<li><strong>Hugging Face access</strong> – <em>create a </em><a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Hugging Face account</em></a><em> and generate an </em><a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>access token</em></a></li>
</ul>



<p><strong>🚀 Now that you have everything you need, you can start building your n8n workflow!</strong></p>



<h2 class="wp-block-heading">Architecture guide: n8n agentic RAG workflow</h2>



<p>You’re all set to configure and deploy your n8n workflow.</p>



<p>⚙️<em> Keep in mind that the following steps can be completed using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Build the RAG data ingestion pipeline</h3>



<p>This first step involves building the foundation of the entire RAG workflow by preparing the elements you need:</p>



<ul class="wp-block-list">
<li>n8n deployment</li>



<li>Object Storage bucket creation</li>



<li>PostgreSQL database creation</li>



<li>and more</li>
</ul>



<p>Remember to set up the proper credentials in n8n so the different elements can connect and function.</p>



<h4 class="wp-block-heading">1. Deploy n8n on OVHcloud VPS</h4>



<p>OVHcloud provides <a href="https://www.ovhcloud.com/en-gb/vps/vps-n8n/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>VPS solutions compatible with n8n</strong></a><strong>.</strong> Get a ready-to-use virtual server with <strong>pre-installed n8n </strong>and start building automation workflows without manual setup. With plans ranging from <strong>6 vCores&nbsp;/&nbsp;12 GB RAM</strong> to <strong>24 vCores&nbsp;/&nbsp;96 GB RAM</strong>, you can choose the capacity that suits your workload.</p>



<p><strong>How to set up n8n on a VPS?</strong></p>



<p>Setting up n8n on an OVHcloud VPS generally involves:</p>



<ul class="wp-block-list">
<li>Choosing and provisioning your OVHcloud VPS plan;</li>



<li>Connecting to your server via SSH and carrying out the initial server configuration, which includes updating the OS;</li>



<li>Installing n8n, typically with Docker (recommended for ease of management and updates), or npm by following this <a href="https://help.ovhcloud.com/csm/en-gb-vps-install-n8n?id=kb_article_view&amp;sysparm_article=KB0072179" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>;</li>



<li>Configuring n8n with a domain name, SSL certificate for HTTPS, and any necessary environment variables for databases or settings.</li>
</ul>



<p>While OVHcloud provides a robust VPS platform, you can find detailed n8n installation guides in the <a href="https://docs.n8n.io/hosting/installation/docker/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official n8n documentation</a>.</p>



<p>Once n8n is up and running, you can set up the database and the Object Storage bucket.</p>



<h4 class="wp-block-heading">2. Create Object Storage bucket</h4>



<p>First, you have to set up your data source. Here you can store all your documentation in an S3-compatible <a href="https://www.ovhcloud.com/en-gb/public-cloud/object-storage/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Object Storage</a> bucket.</p>



<p>In this tutorial, all the documentation files are assumed to be in Markdown format.</p>



<p>From <strong>OVHcloud Control Panel</strong>, create a new Object Storage container with <strong>S3-compatible API </strong>solution; follow this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-storage-s3-getting-started-object-storage?id=kb_article_view&amp;sysparm_article=KB0034674" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>.</p>



<p>When the bucket is ready, add your Markdown documentation to it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png" alt="" class="wp-image-29733" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Note:</strong>&nbsp;For this tutorial, we’re using the various OVHcloud product documentation guides available as open source in the GitHub repository maintained by OVHcloud members.</p>



<p><em>Click this </em><a href="https://github.com/ovh/docs.git" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>link</em></a><em> to access the repository.</em></p>
</blockquote>



<p>How do you do that? Extract all the <strong>guide.en-gb.md</strong> files from the GitHub repository and rename each one to match its parent folder.</p>



<p>Example: the documentation about ovhai CLI installation, <code><strong>docs/pages/public_cloud/ai_machine_learning/cli_10_howto_install_cli/guide.en-gb.md</strong></code>, is stored in the <strong>ovhcloud-products-documentation-md</strong> bucket as <strong>cli_10_howto_install_cli.md</strong>.</p>
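<p>The renaming rule can be expressed as a tiny helper (hypothetical, shown only to make the convention explicit):</p>

```javascript
// Derive the bucket object key from a guide's repository path:
// the file is renamed after its parent folder.
function objectKeyFor(guidePath) {
  const parts = guidePath.split('/').filter(Boolean);
  const parent = parts[parts.length - 2]; // folder containing guide.en-gb.md
  return `${parent}.md`;
}

console.log(objectKeyFor(
  'docs/pages/public_cloud/ai_machine_learning/cli_10_howto_install_cli/guide.en-gb.md'
)); // -> cli_10_howto_install_cli.md
```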



<p>You should get an overview that looks like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png" alt="" class="wp-image-29735" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make a note of the following elements and create a new credential in n8n named <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">OVHcloud S3 gra credentials</mark></strong></code>:</p>



<ul class="wp-block-list">
<li>S3 Endpoint: <a href="https://s3.gra.io.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://s3.gra.io.cloud.ovh.net/</mark></code></strong></a></li>



<li>Region: <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">gra</mark></code></strong></li>



<li>Access Key ID: <strong><code>&lt;your_object_storage_user_access_key&gt;</code></strong></li>



<li>Secret Access Key: <strong><code>&lt;your_object_storage_user_secret_key&gt;</code></strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png" alt="" class="wp-image-29736" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, create a new n8n node by selecting&nbsp;<strong>S3</strong>, then&nbsp;<strong>Get Multiple Files</strong>.<br>Configure this node as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png" alt="" class="wp-image-29740" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Connect the node to the previous one before moving on to the next step.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png" alt="" class="wp-image-29741" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>With the first phase done, you can now configure the vector DB.</p>



<h4 class="wp-block-heading">3. Configure PostgreSQL Managed DB (pgvector)</h4>



<p>In this step, you can set up the vector database that lets you store the embeddings generated from your documents.</p>



<p>How? By using an OVHcloud managed&nbsp;<a href="https://www.ovhcloud.com/en-gb/public-cloud/postgresql/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PostgreSQL</a> database with the pgvector extension. Go to your OVHcloud Control Panel and follow the steps below.</p>



<p><strong>1. Navigate to&nbsp;<em>Databases &amp; Analytics &gt; Databases</em></strong></p>



<p><strong>2. Create a new database and select&nbsp;<em>PostgreSQL</em>&nbsp;and a datacenter location</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png" alt="" class="wp-image-29758" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>3. Select&nbsp;<em>Production</em>&nbsp;plan and&nbsp;<em>Instance type</em></strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png" alt="" class="wp-image-29759" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>4. Reset the user password and save it</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png" alt="" class="wp-image-29762" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>5. Whitelist the IP of your n8n instance as follows</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png" alt="" class="wp-image-29761" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>6. Take note of the following parameters</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png" alt="" class="wp-image-29760" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make a note of this information and create a new credential in n8n named&nbsp;<strong>OVHcloud PGvector credentials</strong>:</p>



<ul class="wp-block-list">
<li>Host:<strong>&nbsp;<code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_hostname&gt;</mark></code></strong></li>



<li>Database:&nbsp;<strong>defaultdb</strong></li>



<li>User:&nbsp;<code>avnadmin</code></li>



<li>Password:&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_password&gt;</mark></strong></code></li>



<li>Port:&nbsp;<strong>20184</strong></li>
</ul>



<p>Consider enabling the&nbsp;<strong>Ignore SSL Issues (Insecure)</strong>&nbsp;toggle if needed and setting the&nbsp;<strong>Maximum Number of Connections</strong>&nbsp;value to&nbsp;<strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1000</mark></code></strong>.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png" alt="" class="wp-image-29763" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
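<p>For reference, the same settings map onto a plain connection object (here in JavaScript, e.g. for node-postgres); the host and password are placeholders from your Control Panel, and the <code>ssl</code> entry mirrors the Ignore SSL Issues toggle:</p>

```javascript
// Sketch of the equivalent connection configuration;
// DB_HOST / DB_PASSWORD stand in for <db_hostname> / <db_password>.
const dbConfig = {
  host: process.env.DB_HOST || '<db_hostname>',
  database: 'defaultdb',
  user: 'avnadmin',
  password: process.env.DB_PASSWORD || '<db_password>',
  port: 20184,
  max: 1000,                          // Maximum Number of Connections
  ssl: { rejectUnauthorized: false }, // "Ignore SSL Issues (Insecure)"
};

console.log(dbConfig.database);
```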



<p>✅ You’re now connected to the database! But what about the PGvector extension?</p>



<p>Add a PostgreSQL node to your n8n workflow (<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code>) and create the extension through an SQL query, which should look like this:</p>



<pre class="wp-block-code"><code class="">-- drop table as needed<br>DROP TABLE IF EXISTS md_embeddings;<br><br>-- activate pgvector<br>CREATE EXTENSION IF NOT EXISTS vector;<br><br>-- create table<br>CREATE TABLE md_embeddings (<br>    id SERIAL PRIMARY KEY,<br>    text TEXT,<br>    embedding vector(1024),<br>    metadata JSONB<br>);</code></pre>



<p>You should get this n8n node:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png" alt="" class="wp-image-29752" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
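<p>When inserting rows yourself later (for example from a Code node), a pgvector value can be passed as a bracketed literal matching the <code>md_embeddings</code> table above. This helper is illustrative, not part of the n8n node:</p>

```javascript
// Format a JavaScript array as a pgvector literal for the
// md_embeddings table, e.g. [0.1,0.2,0.3].
function toVectorLiteral(embedding) {
  return `[${embedding.map(Number).join(',')}]`;
}

// Parameterised INSERT matching the table created above.
const sql =
  'INSERT INTO md_embeddings (text, embedding, metadata) VALUES ($1, $2::vector, $3)';
const params = [
  'some chunk of documentation',
  toVectorLiteral([0.1, 0.2, 0.3]),
  JSON.stringify({ source: 'cli_10_howto_install_cli.md' }),
];
console.log(sql, params);
```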



<p>Finally, you can create a new table named&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">md_embeddings</mark></strong></code>&nbsp;using this node. Add a&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Stop and Error</mark></strong></code>&nbsp;node to halt the workflow if setting up the table fails.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png" alt="" class="wp-image-29753" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>All set! Your vector DB is prepped and ready for data! Keep in mind, you still need an&nbsp;<strong>embeddings model</strong> for the RAG data ingestion pipeline.</p>



<h4 class="wp-block-heading">4. Access to OVHcloud AI Endpoints</h4>



<p><strong>OVHcloud AI Endpoints</strong>&nbsp;is a managed service that provides&nbsp;<strong>ready-to-use APIs for AI models</strong>, including&nbsp;<strong>LLM, CodeLLM, embeddings, Speech-to-Text, and image models</strong>&nbsp;hosted within OVHcloud’s European infrastructure.</p>



<p>To vectorise the various documents in Markdown format, you have to select an embedding model:&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/bge-m3" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>BGE-M3</strong></a>.</p>



<p>Usually, your AI Endpoints API key should already be created. If not, head to the AI Endpoints menu in your OVHcloud Control Panel to generate a new API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png" alt="" class="wp-image-29775" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Once this is done, you can create new OpenAI credentials in your n8n.</p>



<p>Why do I need OpenAI credentials? Because the <strong>AI Endpoints API&nbsp;</strong>is fully compatible with OpenAI’s, integration is simple and ensures the&nbsp;<strong>sovereignty of your data.</strong></p>



<p>How? Thanks to a single endpoint,&nbsp;<a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a>, you can query all the AI Endpoints models.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png" alt="" class="wp-image-29776" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
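<p>In practice, an embeddings call is a standard OpenAI-style request against that base URL. The sketch below only builds the request; the <code>bge-m3</code> model identifier is an assumption, so check the model’s page on AI Endpoints for the exact value:</p>

```javascript
// Build an OpenAI-compatible embeddings request for AI Endpoints.
const AI_ENDPOINTS_BASE = 'https://oai.endpoints.kepler.ai.cloud.ovh.net/v1';

function buildEmbeddingRequest(texts) {
  return {
    url: `${AI_ENDPOINTS_BASE}/embeddings`,
    body: { model: 'bge-m3', input: texts }, // model name: assumed identifier
  };
}

// Sending it is then a single HTTP POST with your API key, e.g.:
// fetch(url, { method: 'POST',
//   headers: { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
//   body: JSON.stringify(body) });
console.log(buildEmbeddingRequest(['What is AI Deploy?']).url);
```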



<p>This means you can create a new n8n node by selecting&nbsp;<strong>Postgres PGVector Store</strong>&nbsp;and&nbsp;<strong>Add documents to Vector Store</strong>.<br>Set up this node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png" alt="" class="wp-image-29781" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then configure the <strong>Data Loader</strong> with a custom text splitting and a JSON type.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png" alt="" class="wp-image-29780" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>For the text splitter, here are some options:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png" alt="" class="wp-image-29786" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
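<p>To make the idea concrete, a character-based splitter with overlap looks roughly like this; the chunk size and overlap values are illustrative, not the defaults of any n8n splitter:</p>

```javascript
// Split text into fixed-size chunks with a small overlap so that
// sentences cut at a boundary still appear in two chunks.
function splitText(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(splitText('abcdefghij', 4, 1)); // -> [ 'abcd', 'defg', 'ghij' ]
```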



<p>To finish, select the&nbsp;<strong>BGE-M3</strong> embedding model from the model list and set the&nbsp;<strong>Dimensions</strong> to 1024.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png" alt="" class="wp-image-29784" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You now have everything you need to build the ingestion pipeline.</p>



<h4 class="wp-block-heading">5. Set up the ingestion pipeline loop</h4>



<p>To fully automate document ingestion and vectorisation, you need to chain a few specific nodes, mainly:</p>



<ul class="wp-block-list">
<li>a <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></code></strong> node that iterates over the markdown files one by one so that each can be vectorised;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> that counts the number of files processed, which subsequently determines the number of requests sent to the embedding model;</li>



<li>an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition that checks whether 400 requests have been reached;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node that pauses after every 400 requests to avoid getting rate-limited;</li>



<li>an S3 block <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> to download each markdown;</li>



<li>another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> to extract and process text from Markdown files by cleaning and removing special characters before sending it to the embeddings model;</li>



<li>a PostgreSQL node to <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL</mark></strong></code> query to check that the table contains vectors after the process (loop) is complete.</li>
</ul>
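<p>The counter, <code>If</code> and <code>Wait</code> trio boils down to a simple check; the 400-request threshold comes from the list above, while the pause duration is whatever you configure in the Wait node:</p>

```javascript
// After every 400 embedding requests, pause to avoid rate limiting.
const REQUEST_BATCH = 400;

function shouldPause(processedCount) {
  return processedCount > 0 && processedCount % REQUEST_BATCH === 0;
}

console.log(shouldPause(400)); // -> true
console.log(shouldPause(401)); // -> false
```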



<h5 class="wp-block-heading">5.1. Create a loop to process each documentation file</h5>



<p>Begin by creating a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></strong></code> to process all the Markdown files one at a time. Set the <strong>batch size</strong> to <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1</mark></code></strong> in this loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png" alt="" class="wp-image-29788" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Add the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Loop</code></mark></strong> statement right after the S3 <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Get Many Files</mark></code></strong> node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png" alt="" class="wp-image-29797" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Time to put the loop’s content into action!</p>



<h5 class="wp-block-heading">5.2. Count the number of files using a code snippet</h5>



<p>Next, choose the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> node from the list to count how many files have been processed. Set the <code><strong>Mode</strong></code> to &#8220;Run Once for Each Item&#8221; and the <strong>Language</strong> to &#8220;JavaScript&#8221;, then add the following code snippet to the designated block.</p>



<pre class="wp-block-code"><code class="">// simple counter per item<br>const counter = $runIndex + 1;<br><br>return {<br>  counter<br>};</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png" alt="" class="wp-image-29792" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make sure this code snippet is included in the loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png" alt="" class="wp-image-29798" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can start adding the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong><code>if</code></strong></mark> part to the loop now.</p>



<h5 class="wp-block-heading">5.3. Add a condition that applies a rule every 400 requests</h5>



<p>Here, you need to create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node and add the following condition, set as an expression.</p>



<pre class="wp-block-code"><code class="">{{ (Number($json["counter"]) % 400) === 0 }}</code></pre>
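<p>Outside n8n, the gate implemented by this expression is just a modulo check — a quick standalone sketch:</p>

```javascript
// True once every `batchLimit` files: the If node routes these items to the Wait node.
function shouldPause(counter, batchLimit = 400) {
  return Number(counter) % batchLimit === 0;
}

console.log(shouldPause(400)); // → true
console.log(shouldPause(399)); // → false
```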



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png" alt="" class="wp-image-29794" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Add it immediately after counting the files:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png" alt="" class="wp-image-29800" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If this condition <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">is true</mark></strong></code>, trigger the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node.</p>



<h5 class="wp-block-heading">5.4. Insert a pause after each set of 400 requests</h5>



<p>Then insert a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node to pause before resuming. Set <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Resume</mark></strong></code> to &#8220;After Time Interval&#8221; and the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait Amount</mark></strong></code> to &#8220;60:00&#8221; seconds.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png" alt="" class="wp-image-29796" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Link it to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition when this is <strong>True</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png" alt="" class="wp-image-29801" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, you can go ahead and download the Markdown file, and then process it.</p>



<h5 class="wp-block-heading">5.5. Launch documentation download</h5>



<p>To do this, create a new <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> S3 node and configure it with this File Key expression:</p>



<pre class="wp-block-code"><code class="">{{ $('Process each documentation file').item.json.Key }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png" alt="" class="wp-image-29804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Connecting it is easy: link it to the output of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node and to the <strong>False</strong> branch of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node, so that a file is processed only when the rate limit has not been exceeded.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png" alt="" class="wp-image-29805" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You’re almost done! Now you need to extract and process the text from the Markdown files – clean and remove any special characters before sending it to the embedding model.</p>



<h5 class="wp-block-heading">5.6. Clean Markdown text content</h5>



<p>Next, create another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> to process text from Markdown files:</p>



<pre class="wp-block-code"><code class="">// extract binary content<br>const binary = $input.item.binary.data;<br><br>// decode into clean UTF-8 text<br>let text = Buffer.from(binary.data, 'base64').toString('utf8');<br><br>// cleaning - remove non-printable characters<br>text = text<br>  .replace(/[^\x09\x0A\x0D\x20-\x7EÀ-ÿ€£¥•–—‘’“”«»©®™°±§¶÷×]/g, ' ')<br>  .replace(/\s{2,}/g, ' ')<br>  .trim();<br><br>// check length<br>if (text.length &gt; 14000) {<br>  text = text.slice(0, 14000);<br>}<br><br>return [{<br>  text,<br>  fileName: binary.fileName,<br>  mimeType: binary.mimeType<br>}];</code></pre>
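<p>The cleaning step can also be exercised as a standalone function for quick local testing — same regexes and 14,000-character cap, minus the n8n binary handling:</p>

```javascript
// Strip non-printable characters, collapse whitespace runs, cap the length.
function cleanMarkdown(raw, maxLen = 14000) {
  const text = raw
    .replace(/[^\x09\x0A\x0D\x20-\x7EÀ-ÿ€£¥•–—‘’“”«»©®™°±§¶÷×]/g, ' ')
    .replace(/\s{2,}/g, ' ')
    .trim();
  return text.length > maxLen ? text.slice(0, maxLen) : text;
}

console.log(cleanMarkdown('# Title\u0000\u0007   with   noise')); // → "# Title with noise"
```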



<p>Select the <em>“Run Once for Each Item”</em> <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Mode</mark></strong></code> and place the previous code in the dedicated JavaScript block.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png" alt="" class="wp-image-29806" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To finish, check that the output text has been sent to the document vectorisation system, which was set up in <strong>Step 3 – Configure PostgreSQL Managed DB (pgvector)</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png" alt="" class="wp-image-29808" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>How do I confirm that the table contains all elements after vectorisation?</p>



<h5 class="wp-block-heading">5.7. Double-check that the documents are in the table</h5>



<p>To confirm that your RAG system is working, make sure your vector database actually contains vectors: use a PostgreSQL node with <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code> in your n8n workflow.</p>



<p>Then, run the following query:</p>



<pre class="wp-block-code"><code class="">-- count the number of elements<br>SELECT COUNT(*) FROM md_embeddings;</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png" alt="" class="wp-image-29818" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Done</mark></strong></code> section of your <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop</mark></strong>, so the elements are counted when the process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png" alt="" class="wp-image-29773" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congrats! You can now run the workflow to begin ingesting documents.</p>



<p>Click the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute workflow</mark></strong></code> button and wait until the vectorisation process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png" alt="" class="wp-image-29823" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember, everything should be green when it’s finished ✅.</p>



<h3 class="wp-block-heading">Step 2 – RAG chatbot</h3>



<p>With the data ingestion and vectorisation steps completed, you can now begin implementing your AI agent.</p>



<p>This involves building a <strong>RAG-based AI Agent</strong>&nbsp;by simply starting a chat with an LLM.</p>



<h4 class="wp-block-heading">1. Set up the chat box to start a conversation</h4>



<p>First, configure your AI Agent based on the RAG system, and add a new node in the same n8n workflow: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Trigger</mark></strong></code>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png" alt="" class="wp-image-29834" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This node will allow you to interact directly with your AI agent! But before that, you need to check that your message is safe.</p>



<h4 class="wp-block-heading">2. Set up your LLM Guard with AI Deploy</h4>



<p>To check whether a message is secure or not, use an LLM Guard.</p>



<p><strong>What’s an LLM Guard?</strong>&nbsp;This is a safety and control layer that sits between users and an LLM, or between the LLM and an external connection. Its main goal is to filter, monitor, and enforce rules on what goes into or comes out of the model 🔐.</p>



<p>You can use <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-deploy" data-wpel-link="internal">AI Deploy</a> from OVHcloud to deploy your desired LLM guard. With a single command line, this AI solution lets you deploy a Hugging Face model using vLLM Docker containers.</p>



<p>For more details, please refer to this <a href="https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/" data-wpel-link="internal">blog</a>.</p>



<p>For the use case covered in this article, you can use the open-source model <strong>meta-llama/Llama-Guard-3-8B</strong> available on <a href="https://huggingface.co/meta-llama/Llama-Guard-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a>.</p>
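<p>Llama Guard 3 replies with a short verdict: <code>safe</code>, or <code>unsafe</code> followed by the violated category code (e.g. <code>S1</code>). A small helper of the kind a downstream safe/unsafe check relies on could normalise that reply — a sketch, not part of the workflow itself:</p>

```javascript
// Normalise a Llama Guard verdict: "safe" → true, "unsafe\nS1" → false.
function isSafe(guardReply) {
  const verdict = guardReply.trim().toLowerCase();
  return verdict === 'safe' || verdict.startsWith('safe\n');
}

console.log(isSafe('safe'));       // → true
console.log(isSafe('unsafe\nS1')); // → false
```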



<h5 class="wp-block-heading">2.1 Create a Bearer token to request your custom AI Deploy endpoint</h5>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a token</a> to access your AI Deploy app once it’s deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>The following output is returned:</p>



<p><code><strong>Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-10-25 8:53:05<br>Updated At: 20-10-25 8:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</strong></code></p>



<p>You can now store and export your access token to add it as a new credential in n8n.</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h5 class="wp-block-heading">2.2 Start the Llama Guard 3 model with AI Deploy</h5>



<p>Using the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai</mark></strong></code> CLI, launch the following command to start the vLLM inference server.</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>  --name vllm-llama-guard3 \<br>  --default-http-port 8000 \<br>  --gpu 1 \<br>  --flavor l40s-1-gpu \<br>  --label ai_deploy_token=my_operator_token \<br>  --env OUTLINES_CACHE_DIR=/tmp/.outlines \<br>  --env HF_TOKEN=$MY_HF_TOKEN \<br>  --env HF_HOME=/hub \<br>  --env HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>  --env HF_HUB_ENABLE_HF_TRANSFER=0 \<br>  --volume standalone:/workspace:RW \<br>  --volume standalone:/hub:RW \<br>  vllm/vllm-openai:v0.10.1.1 \<br>  -- bash -c "python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-Guard-3-8B --tensor-parallel-size 1 --dtype bfloat16"</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai app run</mark></strong></code></li>
</ul>



<p>This is the core command to&nbsp;<strong>run an app</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--name vllm-llama-guard3</mark></strong></code></li>
</ul>



<p>Sets a&nbsp;<strong>custom name</strong>&nbsp;for the app, here&nbsp;<code>vllm-llama-guard3</code>.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--default-http-port 8000</mark></strong></code></li>
</ul>



<p>Exposes&nbsp;<strong>port 8000</strong>&nbsp;as the default HTTP endpoint. vLLM server typically runs on port 8000.</p>



<ul class="wp-block-list">
<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--gpu&nbsp;</code>1</mark></strong></li>



<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--flavor l40s-1-gpu</code></mark></strong></li>
</ul>



<p>Allocates&nbsp;<strong>one L40S GPU</strong>&nbsp;to the app. You can adjust the GPU type and count depending on the model you want to deploy.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/workspace:RW</mark></strong></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/hub:RW</mark></strong></code></li>
</ul>



<p>Mounts&nbsp;<strong>two persistent storage volumes</strong>: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/workspace</code></mark></strong>, the main working directory, and <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">/hub</mark></strong></code>, which stores the Hugging Face model files.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env OUTLINES_CACHE_DIR=/tmp/.outlines</mark></strong></code></li>



<li><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_TOKEN=$MY_HF_TOKEN</mark></code></strong></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HOME=/hub</mark></strong></code></li>



<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>--env HF_DATASETS_TRUST_REMOTE_CODE=1</strong></mark></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HUB_ENABLE_HF_TRANSFER=0</mark></strong></code></li>
</ul>



<p>These are Hugging Face&nbsp;<strong>environment variables</strong> you have to set. Export your Hugging Face access token as an environment variable before starting the app: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">export MY_HF_TOKEN=***********</mark></strong></code></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.10.1.1</mark></strong></code></li>
</ul>



<p>Use the&nbsp;<strong><code>vllm/vllm-openai</code></strong>&nbsp;Docker image (a pre-configured vLLM OpenAI API server).</p>



<ul class="wp-block-list">
<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>-- bash -c "python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-Guard-3-8B --tensor-parallel-size 1 --dtype bfloat16"</strong></mark></code></li>
</ul>



<p>Finally, this runs a<strong>&nbsp;bash shell</strong>&nbsp;inside the container and executes a Python command to launch the vLLM API server.</p>



<h5 class="wp-block-heading">2.3 Check that your AI Deploy app is RUNNING</h5>



<p>Replace <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></strong></code> with your own app ID.</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;app_id&gt;</code></pre>



<p>You should get:</p>



<p><code>History:<br>DATE STATE<br>20-10-25 09:58:00 QUEUED<br>20-10-25 09:58:01 INITIALIZING<br>20-10-25 09:58:07 PENDING<br>20-10-25 10:03:10&nbsp;<strong>RUNNING</strong><br>Info:<br>Message: App is running</code></p>



<h5 class="wp-block-heading">2.4 Create a new n8n credential with the AI Deploy app URL and Bearer access token</h5>



<p>First, using your <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>&lt;app_id></strong></mark></code>, retrieve your AI Deploy app URL.</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;app_id&gt; -o json | jq '.status.url' -r</code></pre>
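<p>Note that vLLM exposes its OpenAI-compatible API under the <code>/v1</code> path, so the base URL for the n8n credential is the app URL with <code>/v1</code> appended — a small sketch (the example hostname is hypothetical; verify the path against your deployment):</p>

```javascript
// Append /v1 to the AI Deploy app URL, avoiding a double slash.
function openaiBaseUrl(appUrl) {
  return appUrl.replace(/\/+$/, '') + '/v1';
}

// Hypothetical app URL, for illustration only.
console.log(openaiBaseUrl('https://my-app.example.net/')); // → https://my-app.example.net/v1
```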



<p>Then, create a new OpenAI credential from your n8n workflow, using your AI Deploy URL and the Bearer token as an API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png" alt="" class="wp-image-29837" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Don&#8217;t forget to replace <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>6e10e6a5-2862-4c82-8c08-26c458ca12c7</code></mark></strong> with your <span style="background-color: initial; font-family: inherit; font-size: inherit; text-align: initial; font-weight: inherit;"><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></code></strong></span>.</p>



<h5 class="wp-block-heading">2.4 Create the LLM Guard node in n8n workflow</h5>



<p>Create a new <strong>OpenAI node</strong> to <strong>Message a model</strong> and select the new AI Deploy credential for LLM Guard usage.</p>



<p>Next, create the prompt as follows:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png" alt="" class="wp-image-29840" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, use an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node to determine if the scenario is <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>safe</code></mark></strong> or <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>unsafe</code></mark></strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png" alt="" class="wp-image-29842" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">unsafe</mark></strong></code>, send an error message right away to stop the workflow.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png" alt="" class="wp-image-29843" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>But if the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">safe</mark></strong></code>, you can send the request to the AI Agent without issues 🔐.</p>
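<p>The branching performed by the <code>If</code> node can be sketched as plain code. This assumes the guard model replies with a message starting with <code>safe</code> or <code>unsafe</code> (Llama Guard-style output), which may differ for your guard model:</p>

```python
def route(guard_reply: str) -> str:
    """Mimic the If node: 'unsafe' stops the workflow, 'safe' goes on."""
    verdict = guard_reply.strip().lower()
    if verdict.startswith("unsafe"):
        return "error"  # send the error message and stop the workflow
    return "agent"      # forward the user message to the AI Agent
```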



<h4 class="wp-block-heading">3. Set up AI Agent</h4>



<p>The&nbsp;<strong>AI Agent</strong>&nbsp;node in&nbsp;<strong>n8n</strong>&nbsp;acts as an intelligent orchestration layer that combines&nbsp;<strong>LLMs, memory, and external tools</strong>&nbsp;within an automated workflow.</p>



<p>It allows you to:</p>



<ul class="wp-block-list">
<li>Connect a <strong>Large Language Model</strong> using APIs (e.g., LLMs from AI Endpoints);</li>



<li>Use <strong>tools</strong> such as HTTP requests, databases, or RAG retrievers so the agent can take actions or fetch real information;</li>



<li>Maintain <strong>conversational memory</strong> via PostgreSQL databases;</li>



<li>Integrate directly with chat platforms (e.g., Slack, Teams) for interactive assistants (optional).</li>
</ul>



<p>Simply put, n8n becomes an&nbsp;<strong>agentic automation framework</strong>, enabling LLMs to not only provide answers, but also think, choose, and perform actions.</p>



<p>Please note that you can change and customise this n8n <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node to fit your use cases, using features like function calling or structured output. This is the most basic configuration for the given use case. You can go even further with different agents.</p>



<p>🧑‍💻&nbsp;<strong>How do I implement this RAG?</strong></p>



<p>First, create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node in <strong>n8n</strong> as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png" alt="" class="wp-image-29933" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, a series of steps is required, the first of which is creating the prompts.</p>



<h5 class="wp-block-heading">3.1 Create prompts</h5>



<p>In the AI Agent node on your n8n workflow, edit the user and system prompts.</p>



<p>Begin by creating the&nbsp;<strong>prompt</strong>,&nbsp;which is also the&nbsp;<strong>user message</strong>:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<p>Then create the <strong>System Message</strong> as shown below:</p>



<pre class="wp-block-code"><code class="">You have access to a retriever tool connected to a knowledge base.  <br>Before answering, always search for relevant documents using the retriever tool.  <br>Use the retrieved context to answer accurately.  <br>If no relevant documents are found, say that you have no information about it.</code></pre>



<p>You should get a configuration like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png" alt="" class="wp-image-29935" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>🤔 Well, an LLM is now needed for this to work!</p>



<h5 class="wp-block-heading">3.2 Select LLM using AI Endpoints API</h5>



<p>First, add an <strong>OpenAI Chat Model</strong> node, and then set it as the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Model</mark></strong></code> for your agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png" alt="" class="wp-image-29939" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, select one of the&nbsp;<a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Endpoints</a>&nbsp;from the list provided; they are compatible with the OpenAI API.</p>



<p>✅ <strong>How?</strong> By using the right API base URL: <a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png" alt="" class="wp-image-29936" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The <a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog/gpt-oss-120b/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>GPT OSS 120B</strong></a> model has been selected for this use case. Other models, such as Llama, Mistral, and Qwen, are also available.</p>
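<p>Outside of n8n, the same model can be queried through the OpenAI-compatible chat completions route of AI Endpoints. The sketch below uses only the Python standard library; the model name <code>gpt-oss-120b</code> and the <code>OVH_AI_ENDPOINTS_TOKEN</code> environment variable are assumptions to adapt to your own setup:</p>

```python
import json
import os
import urllib.request

# OpenAI-compatible chat completions route of OVHcloud AI Endpoints
ENDPOINT = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/chat/completions"

def build_chat_request(user_msg: str, model: str = "gpt-oss-120b") -> dict:
    """Assemble the OpenAI-style payload (model name is an assumption)."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are an OVHcloud product expert."},
            {"role": "user", "content": user_msg},
        ],
    }

def ask(user_msg: str) -> str:
    """Send the request; the token env var name is an assumption."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_chat_request(user_msg)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OVH_AI_ENDPOINTS_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```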



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><mark style="background-color:#fcb900" class="has-inline-color">⚠️ <strong>WARNING</strong> ⚠️</mark></p>



<p>If you are using a recent version of n8n, you will likely encounter the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/responses</code></mark></strong> issue (linked to OpenAI compatibility). To resolve this, disable the <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Use Responses API</mark></code></strong> toggle; everything will then work correctly.</p>
</blockquote>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="829" height="675" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg" alt="" class="wp-image-30352" style="aspect-ratio:1.2281554640124863;width:409px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg 829w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-300x244.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-768x625.jpg 768w" sizes="auto, (max-width: 829px) 100vw, 829px" /><figcaption class="wp-element-caption"><em>Tips to fix /responses issue</em></figcaption></figure>



<p>Your LLM is now set to answer your questions! Don’t forget, it needs access to the knowledge base.</p>



<h5 class="wp-block-heading">3.3 Connect the knowledge base to the RAG retriever</h5>



<p>As usual, the first step is to create an n8n node called <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">PGVector Vector Store node</mark></strong></code> and enter your PGVector credentials.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png" alt="" class="wp-image-29943" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Tools</code></mark></strong> section of the AI Agent node.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png" alt="" class="wp-image-29944" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember to connect your PGVector database so that the retriever can access the previously generated embeddings. Here’s an overview of what you’ll get.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png" alt="" class="wp-image-29945" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
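<p>Under the hood, the retriever runs a nearest-neighbour query against the pgvector table. A hedged sketch of such a query follows; the table and column names are assumptions, as n8n creates and manages its own schema:</p>

```python
def knn_query(table: str = "n8n_vectors") -> str:
    """Top-k cosine-distance search; execute with (query_embedding, k) params.

    `<=>` is pgvector's cosine-distance operator; the table and column
    names here are assumptions, not necessarily what n8n creates.
    """
    return (
        f"SELECT text, metadata, embedding <=> %s::vector AS distance "
        f"FROM {table} ORDER BY distance LIMIT %s"
    )
```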



<p>⏳Nearly done! The final step is to add the database memory.</p>



<h5 class="wp-block-heading">3.4 Manage conversation history with database memory</h5>



<p>Creating a&nbsp;<strong>Database Memory</strong>&nbsp;node in n8n (PostgreSQL) lets you link it to your AI Agent, so it can store and retrieve past conversation history. This enables the model to remember and use context across multiple interactions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png" alt="" class="wp-image-29946" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then link this PostgreSQL database to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Memory</mark></strong></code> section of your AI Agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png" alt="" class="wp-image-29947" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
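<p>The memory node stores each exchange in a session-scoped PostgreSQL table. Here is a hedged sketch of the kind of query it relies on (table and column names are assumptions, since n8n manages this schema itself):</p>

```python
def history_query(session_id: str, limit: int = 20):
    """Fetch the most recent messages for one chat session.

    Table and column names are assumptions; n8n's Postgres memory node
    creates and manages its own schema.
    """
    sql = (
        "SELECT message FROM n8n_chat_histories "
        "WHERE session_id = %s ORDER BY id DESC LIMIT %s"
    )
    return sql, (session_id, limit)
```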



<p>Congrats! 🥳 Your&nbsp;<strong>n8n RAG workflow</strong>&nbsp;is now complete. Ready to test it?</p>



<h4 class="wp-block-heading">4. Make the most of your automated workflow</h4>



<p>Want to try it? It’s easy!</p>



<p>By clicking the orange <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Open chat</code></mark></strong> button, you can ask the AI agent questions about OVHcloud products, particularly when you need technical assistance.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4"></video></figure>



<p>For example, you can ask the LLM about rate limits in OVHcloud AI Endpoints and get the information in seconds.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4"></video></figure>



<p>You can now build your own autonomous RAG system using OVHcloud Public Cloud, suited for a wide range of applications.</p>



<h2 class="wp-block-heading">What’s next?</h2>



<p>To sum up, this reference architecture provides a guide on using&nbsp;<strong>n8n</strong> with&nbsp;<strong>OVHcloud AI Endpoints</strong>,&nbsp;<strong>AI Deploy</strong>,&nbsp;<strong>Object Storage</strong>, and&nbsp;<strong>PostgreSQL + pgvector</strong> to build a fully controlled, autonomous&nbsp;<strong>RAG AI system</strong>.</p>



<p>Teams can build scalable AI assistants that run securely and independently in their cloud environment by orchestrating ingestion, embedding generation, vector storage, retrieval, LLM safety checks, and reasoning within a single workflow.</p>



<p>With the core architecture in place, you can add more features to improve the capabilities and robustness of your agentic RAG system:</p>



<ul class="wp-block-list">
<li>Web search</li>



<li>Images with OCR</li>



<li>Audio files transcribed using the Whisper model</li>
</ul>



<p>This delivers an extensive knowledge base and a wider variety of use cases!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions%2F&amp;action_name=Reference%20Architecture%3A%20build%20a%20sovereign%20n8n%20RAG%20workflow%20for%20AI%20agent%20using%20OVHcloud%20Public%20Cloud%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4" length="11190376" type="video/mp4" />
<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4" length="9881210" type="video/mp4" />

			</item>
		<item>
		<title>Celebrating 10 Years of Impact: Looking Forward to 2035</title>
		<link>https://blog.ovhcloud.com/celebrating-10-years-of-impact-looking-forward-to-2035/</link>
		
		<dc:creator><![CDATA[Philip Marais]]></dc:creator>
		<pubDate>Mon, 09 Jun 2025 10:40:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Startup Program]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Startup Program]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29047</guid>

					<description><![CDATA[The Startup Program is 10 years old this year! As we mark our 10th anniversary, we are not just reflecting on the past decade – we are looking ahead to the future and the impact we can have by 2035. The key to achieving this vision lies with YOU, our valued members of OVHcloud’s unique [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcelebrating-10-years-of-impact-looking-forward-to-2035%2F&amp;action_name=Celebrating%2010%20Years%20of%20Impact%3A%20Looking%20Forward%20to%202035&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>The <a href="https://startup.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Startup Program</a> is 10 years old this year! As we mark our 10th anniversary, we are not just reflecting on the past decade – we are looking ahead to the future and the impact we can have by 2035. </p>



<p>The key to achieving this vision lies with YOU, our valued members of OVHcloud’s unique data sovereign ecosystem, including startups, scaleups, incubators, accelerators, venture capital companies, government agencies, technology partners, and other enablers. Together, we are united around a common vision of data freedom, innovation, and mutual growth.</p>



<h4 class="wp-block-heading">Global Report 2025: 10 Years of Impact</h4>



<p>To capture the essence of our unique ecosystem, we have compiled a comprehensive report, <strong>&#8220;Global Report 2025 &#8211; 10 Years of Impact&#8221;</strong>. This report showcases key stories from our ecosystem, including:</p>



<ul class="wp-block-list">
<li>Our support for <a href="https://harfanglab.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Harfanglab</a>, a French scaleup that&#8217;s developed cutting-edge technologies to anticipate and neutralise cyberattacks, raising almost €30m and leveraging OVHcloud and the Startup Program to drive innovation, data sovereignty, and cybersecurity excellence.</li>



<li>The success of <a href="https://internxt.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Internxt</a>, a Southern Europe scaleup alumni, which has become a recognized privacy-first alternative to mainstream cloud providers, offering secure, user-centric, and environmentally sustainable file-sharing and storage solutions that protect user privacy and data sovereignty.</li>



<li>The journey of female founders Jeanne Le Peillet and Cecile Doan, who developed a collaborative design SaaS solution, <a href="https://www.beink.fr/en/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Beink Dream</a>, selected for the France 2030 initiative.</li>



<li>The acquisition of Startup Program alumnus <a href="https://github.com/open-io" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenIO</a> by OVHcloud, which has become our high-performance object storage solution.</li>
</ul>



<p><em>“OVHcloud is a great partner if you are looking for a long-term, reliable, affordable and robust provider. The synergy between Internxt&#8217;s mission to protect user privacy and OVHcloud&#8217;s commitment to data sovereignty has been pivotal.”</em></p>



<p><em>Fran Villalba Segarra, Founder &amp; CEO at Internxt</em></p>



<h4 class="wp-block-heading">The Startup Program: A Decade of Growth</h4>



<p>The report also highlights the Startup Program&#8217;s journey over the last decade, including how we operate, our partnerships with incubators, accelerators, venture capital companies, and other enablers, what sets us apart, and how we have successfully supported over 5,000 members to date.</p>



<h4 class="wp-block-heading">Key Statistics</h4>



<figure class="wp-block-table aligncenter"><table><tbody><tr><td class="has-text-align-center" data-align="center"><strong>5000+</strong> <br>Startups have joined our program</td><td class="has-text-align-center" data-align="center"><strong>100+</strong> <br>Ecosystem enablers (Accelerators etc.)</td><td class="has-text-align-center" data-align="center"><strong>Thousands</strong> <br>of hours of free mentorship and support</td><td class="has-text-align-center" data-align="center"><strong>€ Millions</strong> <br>in free cloud credits given</td></tr></tbody></table></figure>



<h4 class="wp-block-heading">Personalised Support</h4>



<p>What sets our Startup Program apart is our personal touch. As <a href="https://www.linkedin.com/in/philip-marais/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Philip Marais</a>, Global Startup Program Director at OVHcloud, explains: <em>&#8220;You&#8217;re personally onboarded by a manager in your region, have free support from our engineers to solve technical and migration issues, and access to our unique ecosystem to grow your business.&#8221;</em></p>



<h4 class="wp-block-heading">Download the Report</h4>



<p>To learn more about our ecosystem, our plans for the future, and the impact we can have by 2035, download the <strong><a href="https://startup.ovhcloud.com/en/globalreport2025/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">&#8220;Global Report 2025 &#8211; 10 Years of Impact&#8221;</a></strong> now.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><a href="https://startup.ovhcloud.com/en/globalreport2025/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="512" height="512" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/download.png" alt="" class="wp-image-29049" style="width:118px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/download.png 512w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/download-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/download-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/download-70x70.png 70w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="970" height="250" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/Email-Signature-–-1.jpg" alt="" class="wp-image-29054" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/Email-Signature-–-1.jpg 970w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/Email-Signature-–-1-300x77.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/Email-Signature-–-1-768x198.jpg 768w" sizes="auto, (max-width: 970px) 100vw, 970px" /></figure>



<p>Our 5000+ startups&#8217; journey with OVHcloud highlights how the right cloud partnership can help overcome challenges, achieve sustainable growth, and scale globally. If you’re a startup looking to transform your business, we encourage you to join the <strong><a href="https://startup.ovhcloud.com/en/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud Startup Program</a></strong> or contact OVHcloud to discover how our solutions can support your journey!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcelebrating-10-years-of-impact-looking-forward-to-2035%2F&amp;action_name=Celebrating%2010%20Years%20of%20Impact%3A%20Looking%20Forward%20to%202035&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: set up MLflow Remote Tracking Server on OVHcloud</title>
		<link>https://blog.ovhcloud.com/mlflow-remote-tracking-server-ovhcloud-databases-object-storage-ai-solutions/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 15 Apr 2025 07:52:46 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebooks]]></category>
		<category><![CDATA[AI Training]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Managed Database]]></category>
		<category><![CDATA[MLflow]]></category>
		<category><![CDATA[Object Storage]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28564</guid>

					<description><![CDATA[Travel through the Data &#38; AI universe of OVHcloud with the MLflow integration. As Artificial Intelligence (AI) continues to grow in importance, Data Scientists and Machine Learning Engineers need a robust and scalable platform to manage the entire Machine Learning (ML) lifecycle. MLflow, an open-source platform, provides a comprehensive framework for managing ML experiments, models, [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmlflow-remote-tracking-server-ovhcloud-databases-object-storage-ai-solutions%2F&amp;action_name=Reference%20Architecture%3A%20set%20up%20MLflow%20Remote%20Tracking%20Server%20on%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Travel through the Data &amp; AI universe of OVHcloud with the <em>MLflow</em> integration.</em></p>



<figure class="wp-block-image aligncenter size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/mlflow_ref_archi.svg" alt="" class="wp-image-28689"/><figcaption class="wp-element-caption"><em>Mlflow Remote Tracking Server on OVHcloud</em></figcaption></figure>



<p>As <strong>Artificial Intelligence</strong> (AI) continues to grow in importance, <em>Data Scientists</em> and <em>Machine Learning Engineers</em> need a robust and scalable platform to manage the entire Machine Learning (ML) lifecycle. <br><a href="https://mlflow.org/docs/latest/introduction/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MLflow</a>, an open-source platform, provides a comprehensive framework for managing ML experiments, models, and deployments. </p>



<p><strong>MLflow</strong> provides a complete framework for ML lifecycle management, with features such as:</p>



<ul class="wp-block-list">
<li>Experiment tracking and model management</li>



<li>Reproducibility and collaboration</li>



<li>Scalability, flexibility, and integration</li>



<li>Automated ML and model serving capabilities</li>



<li>Improved model accuracy, faster time-to-market, and reduced costs.</li>
</ul>



<p>In this reference architecture, you will explore how to leverage remote experiment tracking with the <strong>MLflow Tracking Server</strong> on the <a href="https://www.ovhcloud.com/fr/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Public Cloud</a> infrastructure.<br>This way, you will be able to build a scalable and efficient ML platform, streamlining your ML workflow and accelerating model development using <strong>OVHcloud AI Notebooks</strong>, <strong>AI Training</strong>, <strong>Managed Databases (PostgreSQL)</strong>, and <strong>Object Storage</strong>.</p>



<p><strong>The result?</strong> A fully remote, <strong>production-ready ML experiment tracking pipeline</strong>, powered by OVHcloud&#8217;s Data &amp; Machine Learning Services (e.g. AI Notebooks and AI Training).</p>



<h2 class="wp-block-heading">Overview of the MLflow server architecture</h2>



<p>Here is how MLflow will be configured:</p>



<ul class="wp-block-list">
<li><strong>Development and training environment:</strong> create and train models with <strong>AI Notebooks</strong></li>



<li><strong>Remote Tracking Server</strong>: host in an <strong>AI Training</strong> job (Container as a Service)</li>



<li><strong>Backend Store</strong>: benefit from a managed <strong>PostgreSQL</strong> database (DBaaS).</li>



<li><strong>Artifact Store</strong>: use OVHcloud <strong>Object Storage</strong> (S3-compatible).</li>
</ul>



<figure class="wp-block-image aligncenter size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/mlflow_overview.svg" alt="" class="wp-image-28688"/><figcaption class="wp-element-caption"><em>MLflow remote server deployment steps</em></figcaption></figure>



<p>In the following tutorial, all services are deployed within the <strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An <strong>OVHcloud Public Cloud</strong> account</li>



<li>An <strong>OpenStack user</strong> with the following roles:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Training Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>
</ul>



<p><strong>🚀 Having all the ingredients for our recipe, it’s time to set up your MLflow remote tracking server!</strong></p>



<h2 class="wp-block-heading">Architecture guide: MLflow remote tracking server</h2>



<p>Let’s move on to setting up and deploying your custom MLflow tracking tool!</p>



<p>⚙️<em> Also consider that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h4 class="wp-block-heading">Step 1 – Install <code>ovhai</code> CLI</h4>



<p>Firstly, start by setting up your CLI environment.</p>



<pre class="wp-block-code"><code class="">curl https://cli.gra.ai.cloud.ovh.net/install.sh | bash</code></pre>



<p>Secondly, login using your <strong>OpenStack credentials</strong>.</p>



<pre class="wp-block-code"><code class="">ovhai login -u &lt;openstack-username&gt; -p &lt;openstack-password&gt;</code></pre>



<p>Now, it&#8217;s time to create your bucket inside OVHcloud Object Storage!</p>



<h4 class="wp-block-heading">Step 2 – Provision Object Storage (Artifact Store)</h4>



<ol class="wp-block-list">
<li>Go to <strong>Public Cloud &gt; Storage &gt; Object Storage</strong> in the OVHcloud Control Panel.</li>



<li>Create a <strong>datastore</strong> and a new <strong>S3 bucket</strong> (e.g., <code>mlflow-s3-bucket</code>).</li>



<li>Register the datastore with the <code>ovhai</code> CLI:</li>
</ol>



<pre class="wp-block-code"><code class="">ovhai datastore add s3 &lt;ALIAS&gt; https://s3.gra.io.cloud.ovh.net/ gra &lt;my-access-key&gt; &lt;my-secret-key&gt; --store-credentials-locally</code></pre>



<h4 class="wp-block-heading">Step 3 – Create PostgreSQL Managed DB (Backend Store)</h4>



<p>1. Navigate to <strong>Databases &amp; Analytics &gt; Databases</strong></p>



<p><strong>2. Create a new <em>PostgreSQL</em> instance with <em>Essential plan</em></strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="627" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-1024x627.png" alt="" class="wp-image-28580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-1024x627.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-300x184.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-768x470.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-1536x941.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-2048x1254.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>3. Select <em>Location</em> and <em>Node type</em></strong></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="661" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-1024x661.png" alt="" class="wp-image-28581" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-1024x661.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-300x194.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-768x495.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-1536x991.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-2048x1321.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>4. Reset the user password</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="2384" height="1340" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited.png" alt="" class="wp-image-28590" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited.png 2384w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-1536x863.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-2048x1151.png 2048w" sizes="auto, (max-width: 2384px) 100vw, 2384px" /></figure>



<p><strong>5. Take note of the following parameters</strong></p>



<p>Go to your database dashboard:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="640" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-1024x640.png" alt="" class="wp-image-28583" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-1024x640.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-768x480.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-1536x960.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-2048x1280.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, copy the <strong>connection information</strong>:</p>



<pre class="wp-block-code"><code class="">&lt;db_hostname&gt;
&lt;db_username&gt;
&lt;db_password&gt;
&lt;db_name&gt;
&lt;db_port&gt;
&lt;ssl_mode&gt;</code></pre>



<p>Your <strong>Backend Store</strong> is now ready to use!</p>



<h4 class="wp-block-heading">Step 4 – Build your custom MLflow Docker image</h4>



<p><strong>1. Develop the MLflow launch script</strong></p>



<p>Firstly, write a bash script that launches the server: <strong><em>mlflow_server.sh</em></strong></p>



<pre class="wp-block-code"><code class="">#!/bin/bash

echo "The MLflow server is starting..."

mlflow server \
  --backend-store-uri postgresql://${POSTGRE_USER}:${POSTGRE_PASSWORD}@${PG_HOST}:${PG_PORT}/${PG_DB}?sslmode=${SSL_MODE} \
  --default-artifact-root ${S3_BUCKET_NAME}/ \
  --host 0.0.0.0 \
  --port 5000</code></pre>
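

<p>Before baking this script into an image, you can sanity-check the URI interpolation. The snippet below is only an illustration (the hostname and password are placeholders of ours, not real values); it rebuilds the backend-store URI from the same variables as <em>mlflow_server.sh</em>:</p>

```python
# Placeholder values standing in for your real connection parameters
env = {
    "POSTGRE_USER": "avnadmin",
    "POSTGRE_PASSWORD": "secret",
    "PG_HOST": "postgresql-abc.database.cloud.ovh.net",
    "PG_PORT": "20184",
    "PG_DB": "defaultdb",
    "SSL_MODE": "require",
}

# Same interpolation as the bash script above
backend_store_uri = (
    "postgresql://{POSTGRE_USER}:{POSTGRE_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"
    "?sslmode={SSL_MODE}"
).format(**env)
print(backend_store_uri)
```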



<p><strong>2. Create Dockerfile</strong></p>



<p>Install the required Python dependency and give the rights on the<strong> /mlruns</strong> path to the OVHcloud user.</p>



<pre class="wp-block-code"><code class="">FROM ghcr.io/mlflow/mlflow:latest

# Install Python dependencies
RUN pip install psycopg2-binary

COPY mlflow_server.sh .

# Change the ownership of `mlruns` directory to the OVHcloud user (42420:42420)
RUN mkdir -p /mlruns
RUN chown -R 42420:42420 /mlruns

# Start MLflow server inside container
CMD ["bash", "mlflow_server.sh"]</code></pre>



<p><strong>3. Build your custom MLflow Docker image</strong></p>



<p>Build the Docker image from the previous Dockerfile.</p>



<pre class="wp-block-code"><code class="">docker build . -t mlflow-server-ai-training:latest</code></pre>



<p><strong>4. Tag and push the Docker image to your registry</strong></p>



<p>Finally, you can push the Docker image to your registry.</p>



<pre class="wp-block-code"><code class="">docker tag mlflow-server-ai-training:latest &lt;your-registry-address&gt;/mlflow-server-ai-training:latest</code></pre>



<pre class="wp-block-code"><code class="">docker push &lt;your-registry-address&gt;/mlflow-server-ai-training:latest</code></pre>



<p>Congrats! You can now use this Docker image to launch the MLflow server.</p>



<h4 class="wp-block-heading">Step 5 &#8211; Start the MLflow Tracking Server inside a container</h4>



<p>You can use AI Training to start the MLflow server inside a job.</p>



<p><strong>1. Using <code>ovhai</code> CLI, run the following command inside terminal</strong></p>



<pre class="wp-block-code"><code class="">ovhai job run --name mlflow-server \
              --default-http-port 5000 \
              --cpu 4 \
              -v mlflow-s3-bucket@DEMO/:/artifacts:RW:cache \
              -e POSTGRE_USER=avnadmin \
              -e POSTGRE_PASSWORD=&lt;db_password&gt; \
              -e S3_ENDPOINT=https://s3.gra.io.cloud.ovh.net/ \
              -e S3_BUCKET_NAME=mlflow-s3-bucket \
              -e PG_HOST=&lt;db_hostname&gt; \
              -e PG_DB=defaultdb \
              -e PG_PORT=20184 \
              -e SSL_MODE=require \
              &lt;your_registry_address&gt;/mlflow-server-ai-training:latest</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai job run</code></li>
</ul>



<p>This is the core command to <strong>run a job</strong> using the <strong>OVHcloud AI Training</strong> platform.</p>



<ul class="wp-block-list">
<li><code>--name mlflow-server</code></li>
</ul>



<p>Sets a <strong>custom name</strong> for the job. For example, <code>mlflow-server</code>.</p>



<ul class="wp-block-list">
<li><code>--default-http-port 5000</code></li>
</ul>



<p>Exposes <strong>port 5000</strong> as the default HTTP endpoint. MLflow’s web UI typically runs on port 5000, so this ensures the UI is accessible once the job is running.</p>



<ul class="wp-block-list">
<li><code>--cpu 4</code></li>
</ul>



<p>Allocates <strong>4 CPUs</strong> for the job. You can adjust this based on how heavy your MLflow workload is.</p>



<ul class="wp-block-list">
<li><code>-v mlflow-s3-bucket@DEMO/:/artifacts:RW:cache</code></li>
</ul>



<p>This mounts your <strong>OVHcloud Object Storage volume</strong> into the job’s file system:<br>&#8211; <code>mlflow-s3-bucket@DEMO/</code>: refers to your <strong>S3 bucket volume</strong> from the OVHcloud Object Storage<br>&#8211; <code>:/artifacts</code>: mounts the volume into the container under <code>/artifacts</code><br>&#8211; <code>RW</code>: enables <strong>Read/Write</strong> permissions<br>&#8211; <code>cache</code>: enables <strong>volume caching</strong>, improving performance for frequent reads/writes</p>
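

<p>This volume string packs four pieces of information into a single argument. If you script your job launches, a small helper (our own illustration, not part of the <code>ovhai</code> CLI) can make the format explicit:</p>

```python
def parse_volume_spec(spec):
    """Split an ovhai volume spec of the form
    bucket@ALIAS/:/mount/path:PERM:cache into its parts."""
    source, mount_path, permissions, caching = spec.split(":")
    bucket, alias = source.split("@")
    return {
        "bucket": bucket,
        "alias": alias.rstrip("/"),
        "mount_path": mount_path,
        "permissions": permissions,
        "caching": caching,
    }

vol = parse_volume_spec("mlflow-s3-bucket@DEMO/:/artifacts:RW:cache")
print(vol)
```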



<ul class="wp-block-list">
<li><code>-e POSTGRE_USER=avnadmin</code></li>



<li><code>-e POSTGRE_PASSWORD=&lt;db_password&gt;</code></li>



<li><code>-e PG_HOST=&lt;db_hostname&gt;</code></li>



<li><code>-e PG_DB=defaultdb</code></li>



<li><code>-e PG_PORT=20184</code></li>



<li><code>-e SSL_MODE=require</code></li>
</ul>



<p>These are <strong>environment variables</strong> for connecting to the <strong>PostgreSQL </strong>backend store:<br>&#8211; <code>avnadmin</code>: the default admin user for OVHcloud’s managed PostgreSQL<br>&#8211; <code>POSTGRE_PASSWORD</code>: must be replaced with your actual database password<br>&#8211; <code>PG_HOST</code>: the hostname of your managed PostgreSQL instance<br>&#8211; <code>PG_DB</code>: the name of the database to use (default: <code>defaultdb</code>)<br>&#8211; <code>PG_PORT</code>: the port your PostgreSQL server is listening on<br>&#8211; <code>SSL_MODE</code>: enforce SSL connection to secure DB traffic</p>
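

<p>A typo in one of these variables only surfaces once MLflow tries to reach the database, so a small pre-flight check can save a debugging round trip. This is a sketch of our own, not part of MLflow or the <code>ovhai</code> CLI:</p>

```python
# Variables the launch script above expects to find in the job environment
REQUIRED_VARS = [
    "POSTGRE_USER", "POSTGRE_PASSWORD",
    "PG_HOST", "PG_DB", "PG_PORT", "SSL_MODE",
]

def missing_vars(environ):
    # Report variables that are absent or empty
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# Example: a job launched without PG_HOST
example_env = {"POSTGRE_USER": "avnadmin", "POSTGRE_PASSWORD": "secret",
               "PG_DB": "defaultdb", "PG_PORT": "20184", "SSL_MODE": "require"}
print(missing_vars(example_env))
```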



<ul class="wp-block-list">
<li><code>-e S3_ENDPOINT=https://s3.gra.io.cloud.ovh.net/</code></li>
</ul>



<p>Tells MLflow where the <strong>S3-compatible endpoint</strong> is hosted. This is specific to OVHcloud&#8217;s GRA (Gravelines) region Object Storage.</p>



<ul class="wp-block-list">
<li><code>-e S3_BUCKET_NAME=mlflow-s3-bucket</code></li>
</ul>



<p>Sets the <strong>name of the S3 bucket</strong> where MLflow should store artifacts (models, metrics, etc.).</p>



<ul class="wp-block-list">
<li><code>&lt;your_registry_address&gt;/mlflow-server-ai-training:latest</code></li>
</ul>



<p>This is the <strong>custom MLflow Docker image</strong> you are running inside the job.</p>



<p><strong>2. Check if your AI Training job is RUNNING</strong></p>



<p>Replace the <code>&lt;job_id&gt;</code> by yours.</p>



<pre class="wp-block-code"><code class="">ovhai job get &lt;job_id&gt;</code></pre>



<p>You should obtain:</p>



<p><code>History:<br>    DATE                  STATE<br>    04-04-25 09:58:00     QUEUED<br>    04-04-25 09:58:01     INITIALIZING<br>    04-04-25 09:58:07     PENDING<br>    04-04-25 09:58:10     <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></strong><br>  Info:<br>    Message:   Job is running</code></p>



<p><strong>3. Recover the IP and external IP of your AI Training job</strong></p>



<p>Using, your <code>&lt;job_id&gt;</code>, you can retrieve your AI Training <strong>job IP</strong>.</p>



<pre class="wp-block-code"><code class="">ovhai job get &lt;job_id&gt; -o json | jq '.status.ip' -r</code></pre>



<p>For example, you can obtain something like that: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong></p>



<p>You also need the External IP:</p>



<pre class="wp-block-code"><code class="">ovhai job get &lt;job_id&gt; -o json | jq '.status.externalIp' -r</code></pre>



<p>This returns the IP address you will have to whitelist so that the job can connect to your database (e.g. <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>51.210.38.188</strong></mark>).</p>
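

<p>If you prefer not to depend on <code>jq</code>, the same two fields can be read with Python&#8217;s standard library. The JSON below is trimmed down to the two <code>.status</code> paths queried above (the real output contains many more fields):</p>

```python
import json

# Reduced example of the JSON returned by: ovhai job get JOB_ID -o json
raw = '{"status": {"ip": "10.42.80.176", "externalIp": "51.210.38.188"}}'

status = json.loads(raw)["status"]
job_ip = status["ip"]               # used as the MLflow tracking URI host
external_ip = status["externalIp"]  # address to whitelist on the database
print(job_ip, external_ip)
```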



<h4 class="wp-block-heading">Step 6 – Whitelist AI Training job IP in PostgreSQL DB</h4>



<p>From <strong>Databases &amp; Analytics &gt; Databases</strong>, edit your DB configuration to <strong>allow access from the job External IP</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="475" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-1024x475.png" alt="" class="wp-image-28593" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-1024x475.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-300x139.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-768x356.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-1536x712.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-2048x950.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, you can see that the job External IP is now whitelisted.</p>
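

<p>A single job IP corresponds to a <code>/32</code> CIDR mask. If the authorised-IPs form asks for CIDR notation (the exact format the panel accepts may vary, so treat this as a convenience check), Python&#8217;s standard library can validate the entry before you paste it:</p>

```python
import ipaddress

# Express a single host address as a /32 network
external_ip = "51.210.38.188"
allowed = ipaddress.ip_network(external_ip + "/32")
print(allowed, allowed.num_addresses)
```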



<p>Well done! Your MLflow server and the backend store are now connected.</p>



<h4 class="wp-block-heading">Step 7 – Create an AI Notebook</h4>



<p>It&#8217;s time to train and track your Machine Learning models using MLflow!</p>



<p>To do so, use the OVHcloud <code>ovhai</code> CLI and start a new AI Notebook with GPU.</p>



<pre class="wp-block-code"><code class="">ovhai notebook run conda jupyterlab \
  --name mlflow-notebook \
  --framework-version conda-py311-cudaDevel11.8 \
  --gpu 1</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai notebook run</code></li>
</ul>



<p>This is the core command to <strong>run a notebook</strong> using the <strong>OVHcloud AI Notebooks</strong> platform.</p>



<ul class="wp-block-list">
<li><code>--name mlflow-notebook</code></li>
</ul>



<p>Sets a <strong>custom name</strong> for the notebook. In this case, you can name it <code>mlflow-notebook</code>.</p>



<ul class="wp-block-list">
<li><code>--framework-version conda-py311-cudaDevel11.8</code></li>
</ul>



<p>Defines the framework and version you want to use in your notebook. Here, you are using Python 3.11 with the Conda framework and CUDA compatibility.</p>



<ul class="wp-block-list">
<li><code>--gpu 1</code></li>
</ul>



<p>Allocates <strong>1 GPU</strong> for the notebook, by default an NVIDIA <strong>Tesla V100S</strong> (<code>ai1-1-gpu</code>). You can select the flavor you want from the OVHcloud GPU range.</p>



<p>Then, check if your AI Notebook is RUNNING.</p>



<pre class="wp-block-code"><code class="">ovhai notebook get &lt;notebook_id&gt;</code></pre>



<p>Once your notebook is in RUNNING status, you should be able to access it using its URL:</p>



<p><code>State:          <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></strong><br>Duration:       1411412   <br>Url:            <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://&lt;notebook_id&gt;.notebook.gra.ai.cloud.ovh.net</mark></strong><br>Grpc Address:   &lt;notebook_id&gt;.nb-grpc.gra.ai.cloud.ovh.net:443<br>Info Url:       https://ui.gra.ai.cloud.ovh.net/notebook/&lt;notebook_id&gt;</code></p>



<p>You can now start developing your AI model inside the notebook.</p>



<h4 class="wp-block-heading">Step 8 – Model training inside Jupyter notebook</h4>



<p>To begin with, set up your notebook environment.</p>



<p><strong>1. Create the <code>requirements.txt</code> file</strong></p>



<pre class="wp-block-code"><code class="">numpy==2.2.3
scipy==1.15.2
mlflow==2.20.3
scikit-learn==1.6.1</code></pre>



<p><strong>2. Install dependencies</strong></p>



<p>From a notebook cell, launch the following command.</p>



<pre class="wp-block-code"><code class="">!pip3 install -r requirements.txt</code></pre>



<p>Perfect! You can start coding&#8230;</p>



<p><strong>3. Import Python libraries</strong></p>



<p>Here, you have to import os, mlflow and scikit-learn.</p>



<pre class="wp-block-code"><code class=""># import dependencies
import os
import mlflow
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor</code></pre>



<p>In another notebook cell, set the MLflow tracking URI. Note that you have to replace <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong> by your own <strong>job IP</strong>.</p>



<pre class="wp-block-code"><code class="">mlflow.set_tracking_uri("http://<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong>:5000")</code></pre>



<p>Then start training your model!</p>



<pre class="wp-block-code"><code class="">mlflow.autolog()

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

# Create and train models.
rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
rf.fit(X_train, y_train)

# Use the model to make predictions on the test dataset.
predictions = rf.predict(X_test)</code></pre>



<p><strong>Output:</strong></p>



<p><code>🏃 View run dashing-foal-850 at: http://<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong>:5000/#/experiments/0/runs/e7dad7c073634ec28675c0defce2b9ec </code><br><code>🧪 View experiment at: http://<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong>:5000/#/experiments/0</code></p>



<p>Congrats! You can now track your model training from<strong> MLflow remote server</strong>&#8230;</p>



<h4 class="wp-block-heading">Step 9 – Track and compare models from MLflow remote server</h4>



<p>Finally, access the MLflow dashboard using the job URL: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://&lt;job_id&gt;.job.gra.ai.cloud.ovh.net</code></mark></strong></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-1024x578.png" alt="" class="wp-image-28598" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-1536x867.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-2048x1155.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, you can check your model trainings and evaluations:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="577" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-1024x577.png" alt="" class="wp-image-28599" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-1024x577.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-1536x866.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-2048x1154.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>What a success! You can finally use your MLflow server to evaluate, compare, and archive your training runs.</p>



<h4 class="wp-block-heading">Step 10 &#8211; Monitor everything remotely</h4>



<p>You now have a complete Machine Learning pipeline with remote experiment tracking. Access:</p>



<ul class="wp-block-list">
<li><strong>Metrics, Parameters, and Tags</strong> → PostgreSQL</li>



<li><strong>Artifacts (Models, Files)</strong> → S3 bucket</li>
</ul>



<p>This setup is reusable, automatable, and production-ready!</p>



<h2 class="wp-block-heading">What’s next?</h2>



<ul class="wp-block-list">
<li>Automate deployment with <strong><a href="https://eu.api.ovh.com/" data-wpel-link="exclude">OVHcloud APIs</a></strong></li>



<li>Run different training sessions in parallel and compare them with your <strong>remote MLflow tracking server</strong></li>



<li>Use <strong><a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a></strong> to serve your trained models</li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmlflow-remote-tracking-server-ovhcloud-databases-object-storage-ai-solutions%2F&amp;action_name=Reference%20Architecture%3A%20set%20up%20MLflow%20Remote%20Tracking%20Server%20on%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Bare Metal Pod: Genesis</title>
		<link>https://blog.ovhcloud.com/bare-metal-pod-genesis/</link>
		
		<dc:creator><![CDATA[David Mondon]]></dc:creator>
		<pubDate>Tue, 01 Apr 2025 07:10:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[bare metal]]></category>
		<category><![CDATA[engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Security]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28439</guid>

					<description><![CDATA[Today, we&#8217;re going to embark on a journey of discovery, and unveil our latest product: Bare Metal Pod. You know us for the services we provide: bare metal servers, managed and unmanaged virtualisation platform, our 40+ public cloud services, domain names and telco. This is just the tip of the iceberg, and to understand why [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbare-metal-pod-genesis%2F&amp;action_name=Bare%20Metal%20Pod%3A%20Genesis&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="683" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/Copy-of-Blog-post-1200x8001-1-1024x683.png" alt="" class="wp-image-28486" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/Copy-of-Blog-post-1200x8001-1-1024x683.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Copy-of-Blog-post-1200x8001-1-300x200.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Copy-of-Blog-post-1200x8001-1-768x512.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Copy-of-Blog-post-1200x8001-1.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Today, we&#8217;re going to embark on a journey of discovery, and unveil our latest product: <a href="https://www.ovhcloud.com/en-ie/bare-metal/secnumcloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Bare Metal Pod</a>.</p>



<p>You know us for the services we provide: bare metal servers, managed and unmanaged virtualisation platform, our 40+ public cloud services, domain names and telco.</p>



<p>This is just the tip of the iceberg, and to understand why we built and now offer Bare Metal Pod, we have to dig deeper.</p>



<p>So let’s begin this journey exploring the origins of Bare Metal Pod, and in later articles we’ll cover the more technical details—there’s a lot to touch on.</p>



<h3 class="wp-block-heading"><strong>The OVHcloud way: more than just servers</strong></h3>



<p>As a cloud services provider, we supply the different platforms mentioned above. But most importantly, we have to take care of the infrastructure dedicated to these services, from the buildings, power and cooling to the software stack and automation required.</p>



<p>And we’ve been doing just this since 2001. It all started with the opening of our first datacentre in Paris, then building our own servers the next year, and our proprietary water-cooling solution the year after that.</p>



<p>At the core, we are all about <strong>efficiency, automation, and sustainability</strong>:</p>



<ul class="wp-block-list">
<li><strong>Repurposing buildings</strong> as datacentres</li>



<li><strong>Designing our own servers</strong> to optimise performance and cost</li>



<li><strong>Maximising cooling efficiency</strong> to cut waste</li>



<li><strong>Automating everything</strong> to reduce errors and delays</li>
</ul>



<p>And, in all modesty&#8230; we&#8217;re pretty good at these.</p>



<h3 class="wp-block-heading"><strong>Optimising datacentres like a pro</strong></h3>



<p>Basically, building our own servers in our Croix (FR) and Beauharnois (CA) plants means packing <strong>a ton of servers into a square metre. </strong>We’re talking about 4 custom racks, each hosting 48 servers, all in just 3 sq.m and using up to 160kW of 12V DC power. This gives us a server density of about 5000W per sq/ft, which beats out 90% of the industry.</p>



<p>And on top of that, we’ve got our proprietary water-cooling system—we save energy by not using AC for our servers. To further optimise air cooling, each of our racks is equipped with a large condenser (we call it a <strong>chilled door</strong>) at the rear of the rack, dissipating regular server heat into our water system. This keeps the datacentre comfortably warm for our staff and the network equipment, and extends hardware lifespan (less maintenance, fewer replacements, fewer outages&#8230; so <strong>more savings</strong>).</p>



<p>In addition to the physical optimisations we’ve just mentioned is our <strong>automation system</strong>. When a server or a cluster of servers have been assembled and tested in our plant, it’s sent to the datacentre, racked and connected to power, network, and water-cooling systems by our DC staff.</p>



<p>And from there, everything is automated. From server power management, discovery, testing, and readiness checks, to the moment it’s selected by a customer using their Control Panel, and then configured. No human interaction is required, meaning no delay and no error.</p>



<p>And these operations have been optimised and refined for over 20 years.</p>



<h3 class="wp-block-heading"><strong>Enter Project Gold-o-rack</strong></h3>



<p>So in June 2023, a small team was assembled to review, analyse and build a new version of this system. We had 3 goals:</p>



<ul class="wp-block-list">
<li>Provide customers with dedicated <strong>on-premises autonomous racks</strong></li>



<li>Offer custom-built, plug-and-play <strong>Bare Metal Pods</strong></li>



<li>Upgrade the automation and security of our <strong>own datacentres</strong></li>
</ul>



<p>And that’s how <strong>Project Gold-o-rack</strong> came to be—a tribute to <strong>Goldorak (Grendizer)</strong>, the legendary <strong>70s anime mecha</strong> that crushed its enemies with style. Like its namesake, our system is <strong>powerful, autonomous, and unstoppable</strong>.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/Final.png" alt="" class="wp-image-28440" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/Final.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Final-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Final-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Final-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Final-70x70.png 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Using open-source technology was a must, as we absolutely can’t do without transparency and community support. So we went for <strong>OpenStack</strong>, <strong>Netbox</strong> and <strong>Grafana</strong>, developed our own network management and automation system, and much more.</p>



<p>By <strong>September 2023</strong>—just <strong>three months later</strong>—we had a fully functional <strong>24U rack</strong>, deployable and operational in <strong>25 minutes</strong>. That’s not just fast—that’s <strong>insanely fast</strong>.</p>



<p>Security was a top priority since these racks would be installed in <strong>third-party datacentres</strong>. We quickly applied for <strong>SecNumCloud qualification</strong>, leveraging our existing compliance expertise.</p>



<p>Then, it hit us: <strong>why not offer this as a full-fledged product?</strong> And that’s how <strong>Bare Metal Pod</strong> came to be—dedicated, secure, and fully automated.</p>



<p>We structured the product into <strong>three key components</strong>:</p>



<ol class="wp-block-list" start="1">
<li><strong>On-Prem Cloud Platform (OPCP):</strong> The autonomous rack, with its own <strong>KMS and encryption mechanisms</strong></li>



<li><strong>Bare Metal Pod:</strong> Built on <strong>OPCP</strong>, hosted in <strong>our datacentres</strong>, and <strong>SecNumCloud-compliant</strong></li>



<li><strong>Cloud Store:</strong> A software catalogue enabling automated deployment within the rack</li>
</ol>



<p>In June 2024, OPCP was ready, just 12 months after the first meeting… and shortly afterwards we got the green light from ANSSI, allowing us to pursue the SecNumCloud qualification process.</p>



<p>And if you were at, or watched our Summit Keynote in November 2024, you definitely saw it live…</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="576" height="577" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/Capture-decran-2025-03-28-094957.png" alt="BM POD Summit 2024" class="wp-image-28470" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/Capture-decran-2025-03-28-094957.png 576w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Capture-decran-2025-03-28-094957-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Capture-decran-2025-03-28-094957-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/Capture-decran-2025-03-28-094957-70x70.png 70w" sizes="auto, (max-width: 576px) 100vw, 576px" /></figure>






<h3 class="wp-block-heading"><strong>What’s under the hood?</strong></h3>



<p>As an autonomous rack, it contains:</p>



<ul class="wp-block-list">
<li> Power Distribution Units</li>



<li> Network equipment for internal and external connectivity</li>



<li> Servers, including a <strong>Pod Controller</strong></li>
</ul>



<p>There are <strong>9 Bare Metal server models</strong> available, ranging from 16 to <strong>256 cores</strong>, from 128 GB to <strong>2.5 TB of memory</strong>, up to 792 TB of raw NVMe SSD storage, and <strong>Nvidia L4 and L40S GPUs</strong>, depending on your needs.</p>



<p>And the best part is that you can mix and match them to build and manage the perfect autonomous rack, while keeping <strong>full control over security and resources</strong>.</p>



<p>We’ve got a total of 607 models in Bare Metal Pod, enough for nearly any configuration and need. And with up to 1500 servers in a single Pod, the possibilities are endless.</p>



<p>And on top of these servers, we are building an automated software library: <strong>the Cloud Store</strong>. Enclosed in the Bare Metal Pod, the Cloud Store will offer the Pod admin a selection of OSes, virtualisation platforms and various software that can be <strong>pushed, installed, and configured automatically on the servers</strong> in the Pod, with built-in <strong>security, monitoring, and logging</strong> integrated into the Pod monitoring tools.</p>



<p>And herein<sup data-fn="116cf438-18fd-4e6b-9424-87a974fecaf9" class="fn"><a href="#116cf438-18fd-4e6b-9424-87a974fecaf9" id="116cf438-18fd-4e6b-9424-87a974fecaf9-link">1</a></sup> lies the main challenge: making sure an entire collection of software from various vendors can coexist and interact with a single, open-source monitoring platform, a KMS, and an IAM without breaking anything…</p>



<h3 class="wp-block-heading"><strong>Coming up next…</strong></h3>



<p>That’s a wrap for now! In the next article, we’ll deep-dive into <strong>hardware, networking, and security</strong>. Stay tuned!</p>



<h3 class="wp-block-heading">Some of the Bare Metal servers options:</h3>



<ul class="wp-block-list">
<li><strong>Scale A1 &#8211; A8</strong>: Equipped with 4th Gen Intel Xeon Gold or AMD EPYC 9004 series processors, these servers provide between 16 to 256 cores and 128 GB to 1 TB of DDR5 ECC RAM. They are suitable for:
<ul class="wp-block-list">
<li>Hosting SaaS and PaaS solutions</li>



<li>Virtualisation</li>



<li>Database hosting</li>



<li>Containerisation and orchestration</li>



<li>Confidential computing</li>



<li>High-performance computing</li>
</ul>
</li>



<li><strong>Scale-GPU 1 &#8211; 3</strong>: Featuring NVIDIA L4 GPU cards (x2 or x4) and up to 1.2 TB of DDR5 ECC RAM, these servers are ideal for:
<ul class="wp-block-list">
<li>3D modelling</li>



<li>Media streaming</li>



<li>Virtual Desktop Infrastructure (VDI)</li>



<li>Data inference</li>
</ul>
</li>
</ul>



<ul class="wp-block-list">
<li><strong>HGR-HCI I1 &#8211; I4</strong>: With dual 5th Gen Intel Xeon Gold or 4th Gen AMD EPYC 9004 series processors, these servers provide between 16 to 72 cores and up to 2.5 TB of DDR5 ECC RAM. They are suitable for:
<ul class="wp-block-list">
<li>Hyperconverged infrastructure</li>



<li>Virtualisation</li>



<li>Database hosting</li>



<li>Containerisation and orchestration</li>



<li>Confidential computing</li>



<li>High-performance computing</li>
</ul>
</li>



<li><strong>HGR-SDS 1 &#8211; 2</strong>: Equipped with dual 5th Gen Intel Xeon Gold processors, these servers offer between 16 to 48 cores and up to 1.5 TB of DDR5 ECC RAM. They are ideal for:
<ul class="wp-block-list">
<li>Software-defined storage solutions</li>



<li>Object storage solutions</li>



<li>Big data</li>



<li>Database hosting</li>
</ul>
</li>



<li><strong>HGR-STOR 1 &#8211; 2</strong>: Featuring a 5th Gen Intel Xeon Gold processor with 36 cores and up to 512 GB of DDR5 ECC RAM, these servers are designed for:
<ul class="wp-block-list">
<li>Archiving</li>



<li>Database hosting</li>



<li>Backup and disaster recovery plans</li>
</ul>
</li>



<li><strong>HGR-AI-2</strong>: Equipped with NVIDIA L40s GPU cards (x2 or x4) and up to 2.3 TB of DDR5 ECC RAM, these servers are optimized for:
<ul class="wp-block-list">
<li>Machine learning</li>



<li>Deep learning</li>
</ul>
</li>
</ul>



<p>(And many other options… you get the idea.)</p>


<ol class="wp-block-footnotes"><li id="116cf438-18fd-4e6b-9424-87a974fecaf9"><a href="https://www.collinsdictionary.com/dictionary/english/herein" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"> </a>My editor liked the word and I found it cool too. <a href="https://www.collinsdictionary.com/dictionary/english/herein" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.collinsdictionary.com/dictionary/english/herein</a> <a href="#116cf438-18fd-4e6b-9424-87a974fecaf9-link" aria-label="Jump to footnote reference 1">↩︎</a></li></ol><img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbare-metal-pod-genesis%2F&amp;action_name=Bare%20Metal%20Pod%3A%20Genesis&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deep Dive into DeepSeek-R1 &#8211; Part 1</title>
		<link>https://blog.ovhcloud.com/deep-dive-into-deepseek-r1-part-1/</link>
		
		<dc:creator><![CDATA[Fabien Ric]]></dc:creator>
		<pubDate>Thu, 06 Mar 2025 09:56:20 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[DeepSeek]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28199</guid>

					<description><![CDATA[Introduction A few weeks ago, the release of the open-source large language model DeepSeek-R1 has taken the AI world by storm. The Chinese research team claimed their new reasoning model was on par with OpenAI&#8217;s flagship model o1, open-sourced the model and gave details about the work behind it. In this blog post series, we [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeep-dive-into-deepseek-r1-part-1%2F&amp;action_name=Deep%20Dive%20into%20DeepSeek-R1%20%26%238211%3B%20Part%201&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="512" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-1024x512.png" alt="A cute whale with a baseball cap, using a computer, representing DeepSeek." class="wp-image-28353" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-1024x512.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-300x150.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-768x384.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-1536x768.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Introduction</h2>



<p>A few weeks ago, the release of the open-source large language model DeepSeek-R1 took the AI world by storm. The Chinese research team claimed their new reasoning model was on par with OpenAI&#8217;s flagship model o1, open-sourced the model, and gave details about the work behind it.</p>



<p>In this blog post series, we will dive into the DeepSeek-R1 model family and see how you can run it on OVHcloud to build a simple chatbot that handles reasoning.</p>



<p>The &#8220;R&#8221; in DeepSeek-R1 stands for &#8220;Reasoning&#8221;, so let&#8217;s start by defining what a reasoning model is.</p>



<h2 class="wp-block-heading">What are reasoning models?</h2>



<p>Reasoning models are large language models (LLMs) capable of reflecting on a problem before generating an answer. Traditionally, LLMs have been improved by spending more compute at training time (more data, more parameters, more training iterations): this is <strong>training-time compute</strong>. Reasoning models, however, differ from standard LLMs in their use of <strong>test-time compute</strong>: during inference, they spend more time and resources to generate and refine a better answer.</p>



<p>Reasoning models excel at tasks that require understanding and working through a problem step-by-step, such as mathematics, riddles, puzzles, coding, planning tasks and agentic workflows. They may be counterproductive for use cases that don&#8217;t require reasoning capabilities, such as knowledge facts (for example, <em>who discovered penicillin)</em>.</p>



<p>In a classroom, a reasoning model would be the student who takes time to understand the question, splits the problem into manageable steps and details the resolution process, instead of rushing to write down the answer.</p>



<p>Here is a comparison between the outputs of a standard LLM and a reasoning LLM, on an example prompt:</p>



<figure data-wp-context="{&quot;imageId&quot;:&quot;69cd1c9fa4ed1&quot;}" data-wp-interactive="core/image" data-wp-key="69cd1c9fa4ed1" class="wp-block-image aligncenter size-full wp-lightbox-container"><img loading="lazy" decoding="async" width="1029" height="492" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14.png" alt="A diagram showing the differences between standard LLM and reasoning LLM outputs for a given prompt." class="wp-image-28318" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14.png 1029w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14-300x143.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14-1024x490.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14-768x367.png 768w" sizes="auto, (max-width: 1029px) 100vw, 1029px" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			aria-label="Enlarge"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.imageButtonRight"
			data-wp-style--top="state.imageButtonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<p>The reasoning model has generated more tokens, showing how it plans to solve the problem, before the actual answer. In the case of DeepSeek-R1, you can see it generates its reasoning content inside <code>&lt;think&gt;...&lt;/think&gt;</code> tags.</p>



<p>A standard LLM can also show reasoning abilities, which are often more visible when using a technique called <a href="https://arxiv.org/abs/2201.11903" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Chain-of-Thought (CoT) prompting</a>, by adding phrases such as &#8220;let&#8217;s think step-by-step&#8221; to the prompt.</p>



<p>However, a reasoning LLM has been trained to behave this way. Its reasoning skill is internalized, so it doesn&#8217;t require specific prompting techniques to trigger the chain-of-thought process.</p>
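<p>As a side note, when consuming such a model in an application, you often want to separate the reasoning trace from the final answer. A minimal parser for this (the <code>split_reasoning</code> helper is our own illustrative name, not part of any DeepSeek tooling) could look like this:</p>

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a reasoning model's output into (reasoning, answer).

    Assumes the model wraps its chain of thought in <think>...</think>
    tags, as DeepSeek-R1 does.
    """
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if match is None:
        return "", text.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after </think>
    return reasoning, answer

output = "<think>\nThe user wants Pi. Monte Carlo works.\n</think>\n\nUse the Monte Carlo method."
reasoning, answer = split_reasoning(output)
```

<p>Chat interfaces typically collapse or hide the reasoning part and only display the final answer to the user.</p>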



<p>It&#8217;s important to note that DeepSeek-R1 is not the first reasoning model; OpenAI led the way by releasing their o1 model in September 2024.</p>



<p>The two main reasons why DeepSeek-R1 made headlines are its open-source nature and the paper released by the research team, which gives many details on how they trained the model, with valuable insights for the open-source community on creating reasoning models. In particular, the key highlight of the paper is the observation that reasoning behaviour can emerge through Reinforcement Learning (RL) alone, without supervised fine-tuning.</p>



<h2 class="wp-block-heading">The DeepSeek-R1 model family</h2>



<p>You may have heard about DeepSeek-R1, but it&#8217;s not the only model of the DeepSeek family: DeepSeek-V3, DeepSeek-R1-Zero and distilled models are also available. So what are the differences between these models?</p>



<p>First, let&#8217;s go through some definitions and an overview of how language models are trained.</p>



<h3 class="wp-block-heading">Language model training overview</h3>



<p>The large language models available in apps and playgrounds are usually trained in 3 steps:</p>



<ol class="wp-block-list">
<li>A <strong>base model</strong> is trained on an unsupervised language modeling task (for instance, next token prediction) with a dataset of trillions of tokens (also called <em>pre-training</em>),</li>



<li>An <strong>instruct model</strong> is trained from the base model, by fine-tuning it on a massive dataset of instructions, conversations, questions and answers, to improve the performance of the model on the prompts frequently encountered in a chat,</li>



<li>The <strong>final model</strong> is the instruct model further trained to better handle human preferences, avoid generating harmful content, etc., with techniques such as RLHF (reinforcement learning from human feedback) and DPO (direct preference optimization).</li>
</ol>



<figure data-wp-context="{&quot;imageId&quot;:&quot;69cd1c9fa5617&quot;}" data-wp-interactive="core/image" data-wp-key="69cd1c9fa5617" class="wp-block-image aligncenter size-full wp-lightbox-container"><img loading="lazy" decoding="async" width="1459" height="239" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image.png" alt="A diagram showing the 3 training steps of a LLM." class="wp-image-28268" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image.png 1459w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-300x49.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-1024x168.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-768x126.png 768w" sizes="auto, (max-width: 1459px) 100vw, 1459px" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			aria-label="Enlarge"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.imageButtonRight"
			data-wp-style--top="state.imageButtonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>






<h3 class="wp-block-heading">DeepSeek-V3 training</h3>



<p>According to the <a href="https://arxiv.org/pdf/2412.19437" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">technical report provided by DeepSeek</a>, DeepSeek-V3 is a mixture-of-experts (MoE) language model trained with the same kind of process, which is described in the image below:</p>



<ul class="wp-block-list">
<li><strong>DeepSeek-V3-Base</strong> is trained with 14.8 trillion tokens,</li>



<li>A dataset of 1.5 million instructions examples is used to fine-tune the base model,</li>



<li>This instruct model goes through reinforcement learning with several reward models. The final model is <strong>DeepSeek-V3</strong>.</li>
</ul>



<figure data-wp-context="{&quot;imageId&quot;:&quot;69cd1c9fa5c7d&quot;}" data-wp-interactive="core/image" data-wp-key="69cd1c9fa5c7d" class="wp-block-image aligncenter size-full wp-lightbox-container"><img loading="lazy" decoding="async" width="1453" height="242" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8.png" alt="A diagram showing the 3 training steps of DeepSeek-V3." class="wp-image-28288" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8.png 1453w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8-300x50.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8-1024x171.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8-768x128.png 768w" sizes="auto, (max-width: 1453px) 100vw, 1453px" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			aria-label="Enlarge"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.imageButtonRight"
			data-wp-style--top="state.imageButtonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<p>For the reinforcement learning step, DeepSeek uses their algorithm called <strong>GRPO</strong> (<a href="https://arxiv.org/pdf/2402.03300" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">group relative policy optimization</a>), which uses several reward models to assess the quality of the content generated by the model. The score given by each reward model is combined into a final score, used to update the model so that it maximizes its global score the next time.</p>
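<p>To give an intuition of the &#8220;group relative&#8221; part, here is a simplified sketch (our own illustration, not DeepSeek&#8217;s code) of the scoring step: responses sampled for the same prompt are scored against their own group, by normalising the combined rewards with the group&#8217;s mean and standard deviation. The full GRPO objective then uses these advantages in a clipped policy update, which is omitted here:</p>

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Turn a group's combined reward scores into relative advantages.

    Responses scoring above the group average get a positive advantage,
    those below get a negative one. Assumes the rewards in the group are
    not all identical (otherwise the standard deviation would be zero).
    """
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / sigma for r in rewards]

# Hypothetical combined scores for 4 responses sampled from one prompt:
rewards = [0.2, 0.8, 0.5, 0.9]
advantages = group_relative_advantages(rewards)
```

<p>Because each response is judged relative to its own group, GRPO avoids training a separate value network to estimate a baseline, which keeps the reinforcement learning step cheaper.</p>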



<h3 class="wp-block-heading">DeepSeek-R1 model series training</h3>



<p><strong>DeepSeek-R1</strong> models are built with a different training pipeline, using the base model of DeepSeek-V3. The diagram below shows the main steps of the process designed by DeepSeek to create several reasoning models mentioned in their <a href="https://arxiv.org/pdf/2501.12948" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">technical report</a>:</p>



<figure data-wp-context="{&quot;imageId&quot;:&quot;69cd1c9fa626a&quot;}" data-wp-interactive="core/image" data-wp-key="69cd1c9fa626a" class="wp-block-image aligncenter size-full wp-lightbox-container"><img loading="lazy" decoding="async" width="1262" height="1323" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-12.png" alt="A diagram showing the training process of DeepSeek-R1, DeepSeek-R1-Zero and DeepSeek-Distill models." class="wp-image-28301" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-12.png 1262w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-12-286x300.png 286w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-12-977x1024.png 977w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-12-768x805.png 768w" sizes="auto, (max-width: 1262px) 100vw, 1262px" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			aria-label="Enlarge"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.imageButtonRight"
			data-wp-style--top="state.imageButtonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<p>Let&#8217;s walk through it step-by-step (no pun intended):</p>



<p>1. The main breakthrough described in DeepSeek&#8217;s paper: they managed to train the DeepSeek-V3-Base 671B model to learn the reasoning capability with reinforcement learning only, which doesn&#8217;t require labeled data, as opposed to supervised fine-tuning. They use the same GRPO algorithm as before, with two rewards. The first one scores the accuracy of the generated content, using &#8220;rule-based&#8221; experts instead of full reward models, which would themselves need to be trained and would require significant resources. For example, to assess whether the model generated correct Python code, one expert could compile the generated code and give a score based on the number of errors, while another could generate test cases and check whether the generated code passes them. The second reward concerns the format of the model&#8217;s responses, which must enclose the reasoning content in <code>&lt;think&gt;...&lt;/think&gt;</code> tags. The resulting model is <strong>DeepSeek-R1-Zero</strong>. However, it has limitations that make it unsuitable for direct use, such as language mixing and poor readability.</p>



<p>2. To overcome these limitations, DeepSeek uses DeepSeek-R1-Zero to create a cold-start reasoning dataset, augmented with other data from sources not explicitly mentioned. DeepSeek-V3-Base is trained with this cold-start data, before applying a new round of reinforcement learning.</p>



<p>3. They use the same RL approach to get a new reasoning model that generates better-quality output. Using this model, they build a 100x bigger reasoning dataset, growing from 5k to 600k samples, with DeepSeek-V3 acting as a quality judge. This dataset is then completed with 200k samples generated with DeepSeek-V3 on non-reasoning tasks.</p>



<p>4. A second stage of supervised fine-tuning is done with the dataset built earlier.</p>



<p>5. The model is then aligned with human preferences through a final round of reinforcement learning, using a dedicated human-preference reward. The resulting model is <strong>DeepSeek-R1</strong>.</p>



<p>6. Finally, DeepSeek experimented with fine-tuning much smaller models than DeepSeek-V3 (LLaMa 3.3 70B, Qwen 2.5 32B&#8230;) with the dataset built at step 3. In the paper, they call this process <strong>distillation</strong>. However, it must not be mistaken for the <em>knowledge distillation</em> technique frequently used in deep learning, where a student model learns from the probability distribution of a teacher model. Here, the term &#8220;distillation&#8221; refers to the fact that the reasoning skill is &#8220;distilled&#8221; into the base model, but it&#8217;s plain old supervised fine-tuning. This is how the <strong>DeepSeek-R1-Distill</strong> model series is trained. The quality of the dataset enables the resulting distilled models to beat much larger models on reasoning tasks, as shown in the benchmark below:</p>



<figure data-wp-context="{&quot;imageId&quot;:&quot;69cd1c9fa6b03&quot;}" data-wp-interactive="core/image" data-wp-key="69cd1c9fa6b03" class="wp-block-image aligncenter size-full is-resized wp-lightbox-container"><img loading="lazy" decoding="async" width="770" height="312" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-13.png" alt="A screen capture of benchmark data table." class="wp-image-28310" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-13.png 770w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-13-300x122.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-13-768x311.png 768w" sizes="auto, (max-width: 770px) 100vw, 770px" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			aria-label="Enlarge"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.imageButtonRight"
			data-wp-style--top="state.imageButtonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button><figcaption class="wp-element-caption"><em>Benchmark of distilled models on several reasoning tasks (source: DeepSeek R1 technical paper)</em></figcaption></figure>



<h3 class="wp-block-heading">Recap</h3>



<p>The table below summarizes the differences between the models of the DeepSeek-R1 series:</p>



<figure class="wp-block-table"><table><tbody><tr><td>Model</td><td>Description</td></tr><tr><td>DeepSeek-R1-Zero</td><td>Intermediate 671B reasoning model trained from DeepSeek-V3 exclusively with reinforcement learning, and used to bootstrap DeepSeek-R1 training.</td></tr><tr><td>DeepSeek-R1</td><td>671B reasoning model trained from DeepSeek-V3.</td></tr><tr><td>DeepSeek-R1-Distill</td><td>Smaller models fine-tuned for reasoning with a dataset generated by an intermediate version of DeepSeek-R1.</td></tr></tbody></table></figure>



<h2 class="wp-block-heading">Run DeepSeek-R1 on OVHcloud</h2>



<p>Now that we&#8217;ve seen the differences between all DeepSeek models, let&#8217;s try to use them!</p>



<h3 class="wp-block-heading">AI Endpoints</h3>



<p>The fastest way to test DeepSeek-R1 is to use OVHcloud<strong> AI Endpoints</strong>.</p>



<p><strong>DeepSeek-R1-Distill-Llama-70B</strong> is already available, ready to use and optimized for inference speed. Check it out here: <a href="https://endpoints.ai.cloud.ovh.net/models/a011515c-0042-41b2-9a00-ec8b5d34462d" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://endpoints.ai.cloud.ovh.net/models/a011515c-0042-41b2-9a00-ec8b5d34462d</a></p>



<p>AI Endpoints makes it easy to integrate AI into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it’s in beta, it’s <strong>free</strong>!</p>



<p>Here is an example cURL command to use DeepSeek-R1 Distill Llama 70B on the OpenAI compatible endpoint provided by OVHcloud AI Endpoints:</p>



<pre class="wp-block-code"><code class="">curl -X 'POST' \
  'https://deepseek-r1-distill-llama-70b.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "max_tokens": 4096,
  "messages": [
    {
      "content": "How can I calculate an approximation of Pi in Python?",
      "role": "user"
    }
  ],
  "model": null,
  "seed": null,
  "stream": false,
  "temperature": 0.7,
  "top_p": 1
}'</code></pre>



<p>We can see in the output the thinking process followed by the answer; both have been truncated for clarity.</p>



<pre class="wp-block-code"><code class="">{
    "id": "chatcmpl-8c21b2e3fac44d43b63c06fa25e58091",
    "object": "chat.completion",
    "created": 1741199564,
    "model": "DeepSeek-R1-Distill-Llama-70B",
    "choices":
    [
        {
            "index": 0,
            "message":
            {
                "role": "assistant",
                "content": "&lt;think&gt;\nOkay, the user is asking how to approximate Pi using Python. I need to think about different methods they can use. Let's see, there are a few common approaches. \n\nFirst, there's the Monte Carlo method. ... Let me structure the response with each method as a separate section, explaining what it is, how it works, and providing the code. Then, the user can pick which one they prefer based on their situation.\n&lt;/think&gt;\n\nThere are several ways to approximate the value of Pi (π) using Python. Below are a few methods:\n\n### 1. Using the Monte Carlo Method..."
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage":
    {
        "prompt_tokens": 14,
        "completion_tokens": 1377,
        "total_tokens": 1391
    }
}</code></pre>
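<p>As the response above shows, the reasoning arrives wrapped in <code>&lt;think&gt;&#8230;&lt;/think&gt;</code> tags inside the same <code>content</code> field as the answer. Here is a minimal sketch of how a client could separate the two (the helper name <code>split_reasoning</code> is ours, purely for illustration):</p>

```python
# Split a DeepSeek-R1 style completion into its reasoning and its final answer.
# The model wraps its chain of thought in <think>...</think> tags, as seen in
# the JSON response above.

def split_reasoning(content: str) -> tuple[str, str]:
    """Return (thinking, answer) extracted from a raw assistant message."""
    start, end = "<think>", "</think>"
    if start in content and end in content:
        before, _, rest = content.partition(start)
        thinking, _, answer = rest.partition(end)
        return thinking.strip(), (before + answer).strip()
    # No reasoning tags: everything is the answer.
    return "", content.strip()

raw = "<think>\nLet me consider the Monte Carlo method...\n</think>\n\nHere is the code."
thinking, answer = split_reasoning(raw)
print(answer)  # Here is the code.
```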



<p>Stéphane Philippart, Developer Relations Advocate at OVHcloud, has written a blog post covering everything you need to know to get up to speed with AI Endpoints and run this model: <a href="https://blog.ovhcloud.com/release-of-deepseek-r1-on-ovhcloud-ai-endpoints/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">Release of DeepSeek-R1 on OVHcloud AI Endpoints</a></p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p>What if you want to run another version of DeepSeek-R1, such as the Qwen 7B distilled version?</p>



<p>You can use another OVHcloud AI product, <strong>AI Deploy</strong>, to create your own serving endpoint, with <a href="https://docs.vllm.ai/en/stable/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">vLLM</a> as the inference engine. It is open-source, fast and well maintained, ensuring maximal compatibility with even the most recent AI models.</p>



<p>Eléa Petton, Solution Architect at OVHcloud, has written a blog post explaining in detail how to serve an open-source model with vLLM on AI Deploy. Just replace the Mistral Small model with the DeepSeek distilled version you want to use (e.g. <strong>deepseek-ai/DeepSeek-R1-Distill-Qwen-7B</strong>) and adapt the number of L40S cards needed (one is enough for the 7B version): <a href="https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">Mistral Small 24B served with vLLM and AI Deploy – a single command to deploy an LLM (Part 1)</a></p>



<h3 class="wp-block-heading">Next up, creating a reasoning chatbot with DeepSeek-R1</h3>



<p>In part 2 of this blog post series, we will use a DeepSeek-R1-Distill model to create a chatbot that will handle reasoning gracefully, by showing the thinking process of the model.</p>



<p>We will develop our chatbot with OVHcloud AI Endpoints and the Python library <a href="https://www.gradio.app/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Gradio</a>, which makes it quick to create simple chat interfaces.</p>



<p>Here is a screenshot of the finalized chatbot we will build:</p>



<figure data-wp-context="{&quot;imageId&quot;:&quot;69cd1c9fa8eb3&quot;}" data-wp-interactive="core/image" data-wp-key="69cd1c9fa8eb3" class="wp-block-image aligncenter size-full wp-lightbox-container"><img loading="lazy" decoding="async" width="723" height="1173" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/chatbot.png" alt="A screenshot of a chatbot application developed with DeepSeek-R1 and Gradio in Python." class="wp-image-28328" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/chatbot.png 723w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/chatbot-185x300.png 185w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/chatbot-631x1024.png 631w" sizes="auto, (max-width: 723px) 100vw, 723px" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			aria-label="Enlarge"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.imageButtonRight"
			data-wp-style--top="state.imageButtonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<p>Stay tuned for the next article in this DeepSeek-R1 series. In the meantime, try out DeepSeek-R1 on AI Endpoints and AI Deploy and let us know what you &lt;think&gt;!</p>



<h3 class="wp-block-heading">Resources</h3>



<p>If you want to learn more about DeepSeek-R1 and the topics we covered in this blog post, such as test-time compute, GRPO, reinforcement learning and reasoning models, we suggest having a look at these resources:</p>



<ul class="wp-block-list">
<li><a href="https://arxiv.org/pdf/2501.12948" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">DeepSeek-R1 technical report</a>, by the DeepSeek team</li>



<li><a href="https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">The Illustrated DeepSeek-R1</a>, by Jay Alamar</li>



<li><a href="https://magazine.sebastianraschka.com/p/understanding-reasoning-llms" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Understanding Reasoning LLMs</a>, by Sebastian Raschka</li>



<li><a href="https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">A Visual Guide to Reasoning LLMs</a>, by Maarten Grootendorst</li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeep-dive-into-deepseek-r1-part-1%2F&amp;action_name=Deep%20Dive%20into%20DeepSeek-R1%20%26%238211%3B%20Part%201&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Managed Valkey: Our Commitment to Open Source and Customer choice.</title>
		<link>https://blog.ovhcloud.com/valkey-open-source-commitment/</link>
		
		<dc:creator><![CDATA[Jonathan Clarke]]></dc:creator>
		<pubDate>Mon, 03 Mar 2025 10:04:51 +0000</pubDate>
				<category><![CDATA[Accelerating with OVHcloud]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[customer]]></category>
		<category><![CDATA[Open Source]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28254</guid>

					<description><![CDATA[OVHcloud strives for openness, expert takes and customer centricity. As part of this commitment, we keep adapting to next-gen industry shifts. Our goal remains the same: to provide our community with top market solutions. Up to now, OVHcloud had offered a Managed Caching service based on the renowned Redis engine. We helped our customers to [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fvalkey-open-source-commitment%2F&amp;action_name=Managed%20Valkey%3A%20Our%20Commitment%20to%20Open%20Source%20and%20Customer%20choice.&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>OVHcloud strives for <strong>openness, expert takes and customer centricity</strong>. As part of this commitment, we keep adapting to next-gen industry shifts. Our goal remains the same: to provide our community with top market solutions.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="841" height="561" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/Image1.png" alt="" class="wp-image-28243" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/Image1.png 841w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/Image1-300x200.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/Image1-768x512.png 768w" sizes="auto, (max-width: 841px) 100vw, 841px" /></figure>



<p>Up to now, OVHcloud had offered a <strong>Managed Caching</strong> service based on the renowned <strong>Redis engine</strong>. We helped our customers to effortlessly scale their caching or real-time data management instances. A particularly useful asset for example in e-commerce sites and apps.</p>



<p>As <strong>Redis changed its licensing model in 2024</strong>, introducing <a href="https://blog.ovhcloud.com/new-redis-licensing-model-and-ovhcloud-managed-databases-for-caching/" data-wpel-link="internal"><strong>dual source-available licences</strong></a> with <a href="https://redis.io/blog/redis-adopts-dual-source-available-licensing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Redis 7.4</a>, we took the strategic decision to <strong>discontinue this Managed Caching service</strong>. We will therefore transition to <strong>Managed Valkey</strong>, its open-source alternative, in the Spring of 2025.</p>



<h3 class="wp-block-heading"><strong>Why Valkey? A fully open-source alternative</strong></h3>



<p><a href="https://valkey.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Valkey</a> is an open-source fork of Redis developed under the Linux Foundation. Through Valkey, our users benefit from a seamless and reliable alternative to Redis OSS. Its mission aligns with OVHcloud’s core values:</p>



<ul class="wp-block-list">
<li><strong>100% open-source</strong>: unlike Redis’ new licensing model, Valkey remains fully open-source. True to our word: ensuring freedom, transparency, security and long-term sustainability.</li>



<li><strong>Seamless compatibility</strong>: Valkey is designed as a <strong>drop-in replacement</strong> for Redis. This means customers can continue to use the same commands, data structures, and configurations.</li>



<li><strong>Community-driven innovation</strong>: as an Open-Source project, Valkey benefits from a thriving developer ecosystem, allowing for <strong>collaborative advancements</strong> and continuous improvements.</li>
</ul>



<p>By choosing <strong>Managed Valkey</strong>, OVHcloud is ensuring that customers have access to a <strong>high-performance caching solution</strong>. Both <strong>future-proof and aligned with open-source values</strong>.</p>



<p>On top of that, this choice aligns with our partner <a href="https://aiven.io/blog/introducing-aiven-for-valkey" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Aiven’s decision</a> to stand firmly as an early supporter of and committer to Valkey. Thanks to our <strong>long-term partnership with this unicorn</strong>, OVHcloud is now able to provide multiple managed services to its ecosystem. We recently demonstrated this partnership at the last OVHcloud Summit, where <a href="https://ovh.to/RLz2ghA" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">we presented the <strong>many specific use cases</strong></a> we cover together.</p>



<h3 class="wp-block-heading"><strong>What does this mean for OVHcloud customers?</strong></h3>



<p><strong>Managed Valkey</strong>, which we will run on version 8.0, ensures full compatibility with Redis OSS 7.2.4, making it easier for users to transition their existing applications without disruption:</p>



<ul class="wp-block-list">
<li><strong>Effortless migration</strong>: our teams have designed an <strong>automated transition process</strong>. This requires no changes on the customer’s end and no service disruption.</li>



<li><strong>Full support &amp; documentation</strong>: comprehensive <strong>guides, migration tools, and expert assistance</strong> will be available to facilitate the switch.</li>
</ul>



<h3 class="wp-block-heading"><strong>Looking ahead: OVHcloud’s vision for Managed Databases</strong></h3>



<p>Beyond Valkey, we are constantly expanding our <a href="https://www.ovhcloud.com/en/public-cloud/databases/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Managed Databases portfolio</strong></a>. Our goal? To provide <strong>cost-efficient, high-performance, secured and transparent solutions</strong> for our customers. Recent developments include:</p>



<ul class="wp-block-list">
<li><strong>Expanding database engines to new regions</strong> to ensure high availability and low latency worldwide, more recently in Singapore and the USA.</li>



<li><strong>Upcoming ClickHouse Deployment</strong>, enhancing our analytics database functionality by Q3 2025.</li>
</ul>



<p>Our <strong>Databases Public Cloud Roadmap </strong><a href="https://github.com/orgs/ovh/projects/16/views/1?sliceBy%5Bvalue%5D=Managed+Databases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">on GitHub</a> remains available for customers who want to stay informed about our future developments.</p>



<p>🚀 <strong>Let’s build the future of open source databases – together!</strong></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fvalkey-open-source-commitment%2F&amp;action_name=Managed%20Valkey%3A%20Our%20Commitment%20to%20Open%20Source%20and%20Customer%20choice.&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Mistral Small 24B served with vLLM and AI Deploy &#8211; a single command to deploy an LLM (Part 1)</title>
		<link>https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Mon, 24 Feb 2025 10:08:37 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28212</guid>

					<description><![CDATA[You are not dreaming! You can deploy open-source LLM in a single command line. Deploying advanced language models can be a challenge! But this sometimes this arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications. In this guide, we will walk through deploying the Mistral-Small-24B-Instruct-2501 model using vLLM [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm%2F&amp;action_name=Mistral%20Small%2024B%20served%20with%20vLLM%20and%20AI%20Deploy%20%26%238211%3B%20a%20single%20command%20to%20deploy%20an%20LLM%20%28Part%201%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><strong><em>You are not dreaming! You can deploy an open-source LLM in a single command line</em>.</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="724" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1024x724.png" alt="Rocket in MistralAI colors in a data center with a French rooster showing rapid LLM deployment" class="wp-image-28219" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1024x724.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-300x212.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-768x543.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1536x1086.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy.png 2000w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Deploying advanced language models can be a challenge! But this sometimes arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications.</p>



<p>In this guide, we will walk through deploying the <strong><a href="https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Instruct-2501</a></strong> model using <strong>vLLM</strong> on OVHcloud&#8217;s <a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy platform</a>. This combination offers a powerful solution for efficient and scalable AI model serving.</p>



<p>Deploying a model is great, but doing it quickly is even better!</p>



<p>🤯 <strong>What if a single command line was enough?</strong> That&#8217;s the challenge we&#8217;re tackling today!</p>



<h2 class="wp-block-heading">Context</h2>



<p>Before deployment, let’s take a closer look at our key technologies!</p>



<h3 class="wp-block-heading">Mistral Small</h3>



<p>The <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code> is a 24-billion-parameter instruction-fine-tuned model, renowned for its compact size and performance comparable to larger models.</p>



<p>This model, from <a href="https://mistral.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MistralAI</a>, is an instruction-fine-tuned version of the base model:&nbsp;<a href="https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Base-2501</a>.</p>



<p>To serve this model efficiently, we will utilize vLLM, an open-source library for <strong>LLM inference</strong>.</p>



<h3 class="wp-block-heading">vLLM</h3>



<p><a href="https://docs.vllm.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM</a> (<strong>Virtual LLM</strong>) is a highly optimized serving engine designed to run large language models efficiently. It takes advantage of several key optimizations, such as:</p>



<ul class="wp-block-list">
<li><strong>PagedAttention:</strong> an attention mechanism that reduces memory fragmentation and enables more efficient use of GPU memory</li>



<li><strong>Continuous Batching:</strong> vLLM dynamically adjusts batch sizes in real time, ensuring that the GPU is always used efficiently, even with multiple simultaneous requests</li>



<li><strong>Tensor parallelism:</strong> enables model inference across multiple GPUs to boost performance</li>



<li><strong>Optimized kernel implementations:</strong> vLLM uses custom CUDA kernels for faster execution, reducing latency compared to traditional inference frameworks</li>
</ul>



<p>These features make vLLM one of the best choices for large models such as Mistral Small 24B, enabling low-latency, high-throughput inference on the latest GPUs.</p>



<p>With OVHcloud&#8217;s AI Deploy platform, you can deploy this model in a single command line.</p>



<h3 class="wp-block-heading">AI Deploy </h3>



<p>OVHcloud AI Deploy is a <strong>Container as a Service</strong> (CaaS) platform designed to help you deploy, manage and scale AI models. It allows you to optimally deploy your applications and APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.</p>



<p>The key benefits are:</p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong> bring your own custom Docker image and deploy it with a single command line or a few clicks</li>



<li><strong>High-performance computing:</strong> a complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong> supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong> billing per minute, no surcharges</li>
</ul>



<p>✅ To go further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure that you have:</p>



<ul class="wp-block-list">
<li><strong>OVHcloud account</strong>: access to the&nbsp;<a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude">OVHcloud Control Panel</a></li>



<li><strong>ovhai CLI available:</strong> install the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ovhai CLI</a></li>



<li><strong>AI Deploy access</strong>: ensure you have a <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for AI Deploy</a></li>



<li><strong>Hugging Face access</strong>: create a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a> and generate an <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></li>



<li><strong>Gated model authorization</strong>: be sure you have been granted access to <a href="https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Instruct-2501</a> model</li>
</ul>



<p><strong>🚀 Having all the ingredients for our recipe, it&#8217;s time to deploy!</strong></p>



<h2 class="wp-block-heading">Deployment of the Mistral Small 24B LLM</h2>



<p>Let&#8217;s deploy the <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code> model.</p>



<h3 class="wp-block-heading">Manage access tokens</h3>



<p>Export your <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face token</a>.</p>



<pre class="wp-block-code"><code class="">export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx</code></pre>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a token</a> to access your AI Deploy app once it is deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>Returning the following output:</p>



<pre class="wp-block-code"><code class="">Id:         47292486-fb98-4a5b-8451-600895597a2b
Created At: 20-02-25 11:53:05
Updated At: 20-02-25 11:53:05
Spec:
  Name:           ai_deploy_token=my_operator_token
  Role:           AiTrainingOperator
  Label Selector: 
Status:
  Value:   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  Version: 1</code></pre>



<p>You can now store and export your access token:</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h3 class="wp-block-heading">Launch Mistral Small LLM with AI Deploy</h3>



<p>You are ready to start<strong> Mistral-Small-24B</strong> using vLLM and AI Deploy:</p>



<pre class="wp-block-code"><code class="">ovhai app run --name vllm-mistral-small \
              --default-http-port 8000 \
              --label ai_deploy_token=my_operator_token \
              --gpu 2 \
              --flavor l40s-1-gpu \
              -e OUTLINES_CACHE_DIR=/tmp/.outlines \
              -e HF_TOKEN=$MY_HF_TOKEN \
              -e HF_HOME=/hub \
              -e HF_DATASETS_TRUST_REMOTE_CODE=1 \
              -e HF_HUB_ENABLE_HF_TRANSFER=0 \
              -v standalone:/hub:rw \
              -v standalone:/workspace:rw \
              vllm/vllm-openai:v0.8.2 \
              -- bash -c "python3 -m vllm.entrypoints.openai.api_server \
                        --model mistralai/Mistral-Small-24B-Instruct-2501 \
                        --tensor-parallel-size 2 \
                        --tokenizer_mode mistral \
                        --load_format mistral \
                        --config_format mistral \
                        --dtype half"</code></pre>



<p><strong>What do the different parameters of this command mean?</strong></p>



<h5 class="wp-block-heading">1. Start your AI Deploy app</h5>



<p>Launch a new app using <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ovhai CLI</a> and name it.</p>



<p><code><strong>ovhai app run --name vllm-mistral-small</strong></code></p>



<h5 class="wp-block-heading">2. Define access</h5>



<p>Define the HTTP API port and restrict access to your token.</p>



<p><strong><code>--default-http-port 8000</code><br><code>--label ai_deploy_token=my_operator_token</code></strong></p>



<h5 class="wp-block-heading">3. Configure GPU resources</h5>



<p>Specify the hardware flavor (<code><strong>l40s-1-gpu</strong></code>), which refers to an <strong>NVIDIA L40S GPU</strong>, and the number of GPUs (<code><strong>2</strong></code>).</p>



<p><code><strong>--gpu 2<br>--flavor l40s-1-gpu</strong></code></p>



<p><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️WARNING!</mark></strong> For this model, two L40S cards are sufficient, but if you want to deploy another model, you will need to check which GPUs you need. Note that you also have access to A100 and H100 GPUs for your larger models.</p>
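<p>A quick back-of-the-envelope calculation shows why two 48&nbsp;GB L40S cards make sense here. This is only a rough sketch of the weight footprint in FP16 (the <code>--dtype half</code> used below); it deliberately ignores the KV cache and activation overhead, which is exactly the headroom the second GPU provides:</p>

```python
# Back-of-the-envelope VRAM estimate for serving a model in FP16.
# Weights only: KV cache and activations add on top of this.

def fp16_weight_gb(n_params_billion: float) -> float:
    bytes_per_param = 2  # FP16 stores each parameter in 2 bytes
    return n_params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

L40S_VRAM_GB = 48  # per-card memory of an NVIDIA L40S

weights = fp16_weight_gb(24)        # Mistral Small 24B
print(weights)                      # 48.0 GB of weights alone
print(weights / (2 * L40S_VRAM_GB)) # 0.5: half the 2-GPU budget,
                                    # leaving room for the KV cache
```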



<h5 class="wp-block-heading">4. Set up environment variables</h5>



<p>Configure caching for the <strong>Outlines library</strong> (used for efficient text generation):</p>



<p><code><strong>-e OUTLINES_CACHE_DIR=/tmp/.outlines</strong></code></p>



<p>Pass the <strong>Hugging Face token</strong> (<code>$MY_HF_TOKEN</code>) for model authentication and download:</p>



<p><code><strong>-e HF_TOKEN=$MY_HF_TOKEN</strong></code></p>



<p>Set the <strong>Hugging Face cache directory</strong> to <code>/hub</code> (where models will be stored):</p>



<p><code><strong>-e HF_HOME=/hub</strong></code></p>



<p>Allow execution of <strong>custom remote code</strong> from Hugging Face datasets (required for some model behaviors):</p>



<p><code><strong>-e HF_DATASETS_TRUST_REMOTE_CODE=1</strong></code></p>



<p>Disable <strong>Hugging Face Hub transfer acceleration</strong> (to use standard model downloading):</p>



<p><code><strong>-e HF_HUB_ENABLE_HF_TRANSFER=0</strong></code></p>



<h5 class="wp-block-heading">5. Mount persistent volumes</h5>



<p>Mounts <strong>two persistent storage volumes</strong>:</p>



<ul class="wp-block-list">
<li><code>/hub</code> → Stores Hugging Face model files</li>



<li><code>/workspace</code> → Main working directory</li>
</ul>



<p>The <code>rw</code> flag means <strong>read-write access</strong>.</p>



<p><code><strong>-v standalone:/hub:rw<br>-v standalone:/workspace:rw</strong></code></p>



<h5 class="wp-block-heading">6. Choose the target Docker image</h5>



<p>Uses the <strong><code>vllm/vllm-openai:v0.8.2</code></strong> Docker image (a pre-configured vLLM OpenAI API server).</p>



<p><strong><code>vllm/vllm-openai:v0.8.2</code></strong></p>



<h5 class="wp-block-heading">7. Run the model inside the container</h5>



<p>Runs a<strong> bash shell</strong> inside the container and executes a Python command to launch the vLLM API server:</p>



<ul class="wp-block-list">
<li><strong><code>python3 -m vllm.entrypoints.openai.api_server</code></strong> → Starts the OpenAI-compatible vLLM API server</li>



<li><strong><code>--model mistralai/Mistral-Small-24B-Instruct-2501</code></strong> → Loads the <strong>Mistral Small 24B</strong> model from Hugging Face</li>



<li><strong><code>--tensor-parallel-size 2</code></strong> → Distributes the model across <strong>2 GPUs</strong></li>



<li><strong><code>--tokenizer_mode mistral</code></strong> → Uses the <strong>Mistral tokenizer</strong></li>



<li><strong><code>--load_format mistral</code></strong> → Uses Mistral’s model loading format</li>



<li><strong><code>--config_format mistral</code></strong> → Ensures the model configuration follows Mistral&#8217;s standard</li>



<li><strong><code>--dtype half</code></strong> → Uses <strong>FP16 (half-precision floating point)</strong> for optimized GPU performance</li>
</ul>



<p>You can now check if your <strong>AI Deploy</strong> app is alive:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;your_vllm_app_id&gt;</code></pre>



<p>💡<strong>Is your app in <code>RUNNING</code> status?</strong> Perfect! You can check in the logs that the server is started&#8230;</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;your_vllm_app_id&gt;</code></pre>



<p><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️WARNING!</mark></strong> This step may take a little time as the model must be loaded&#8230;<br>After a few minutes, you should get the following information in the logs:</p>



<pre class="wp-block-code"><code class="">2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Started server process [13]
2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Waiting for application startup.
2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Application startup complete.
2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)</code></pre>



<p>🚦 <strong>Are all the indicators green? </strong>Then it&#8217;s off to inference!</p>



<h3 class="wp-block-heading">Request and send prompt to the LLM</h3>



<p>Launch the following query by asking the question of your choice:</p>



<pre class="wp-block-code"><code class="">curl https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \
  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}
    ],
    "stream": false
  }'</code></pre>



<p>Returning the following result:</p>



<pre class="wp-block-code"><code class="">{
  "id":"chatcmpl-d6ea734b524bd851668e71d4111ba496",
  "object":"chat.completion",
  "created":1740059807,
  "model":"mistralai/Mistral-Small-24B-Instruct-2501",
  "choices":[
    {
      "index":0,
      "message":{
        "role":"assistant",
        "reasoning_content":null, 
        "content":"The founder of OVHcloud is Octave Klaba.",
        "tool_calls":[]
      },
      "logprobs":null,
      "finish_reason":"stop",
      "stop_reason":null
    }
  ],
  "usage":{
    "prompt_tokens":22,
    "total_tokens":35,
    "completion_tokens":13,
    "prompt_tokens_details":null
  },
  "prompt_logprobs":null
}</code></pre>
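<p>If you want to script against this endpoint, the assistant&#8217;s answer can be pulled out of the JSON response directly in the shell. A minimal sketch (a proper JSON parser such as <code>jq</code> is more robust; this <code>sed</code> one-liner only handles simple, unescaped <code>content</code> fields, and the sample response below is hard-coded for illustration):</p>

```shell
# Extract the assistant's answer from a chat completion response.
# NOTE: the response is hard-coded here for illustration; in practice,
# pipe the output of the curl command shown above into the same sed command.
response='{"choices":[{"message":{"content":"The founder of OVHcloud is Octave Klaba."}}]}'
answer=$(printf '%s' "$response" | sed -n 's/.*"content":"\([^"]*\)".*/\1/p')
echo "$answer"
```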



<h2 class="wp-block-heading">Conclusion</h2>



<p>By following these steps, you have successfully deployed the <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code> model using <strong>vLLM</strong> on OVHcloud&#8217;s AI Deploy platform. This setup provides a scalable and efficient solution for serving advanced language models in production environments.</p>



<p>For further customization and optimization, refer to the <a href="https://help.ovhcloud.com/csm/en-ie-documentation-public-cloud-ai-and-machine-learning-ai-deploy?id=kb_browse_cat&amp;kb_id=574a8325551974502d4c6e78b7421938&amp;kb_category=3241efc6a052d910f078d4b4ef43651f&amp;spa=1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM documentation</a> and OVHcloud AI Deploy resources.</p>



<p>💪 <strong>Challenges taken!</strong> You can now enjoy the power of your LLM deployed in a single command line!</p>



<p>Want even more simplicity? You can also use ready-to-use APIs with <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>!</p>



<p><strong><em>But… what’s next?</em></strong></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Enhancing Kubernetes Security: Detecting Threats in OVHcloud Managed Kubernetes cluster (MKS) Audit Logs with Falco</title>
		<link>https://blog.ovhcloud.com/enhancing-kubernetes-security-detecting-threats-in-ovhcloud-managed-kubernetes-cluster-mks-audit-logs-with-falco/</link>
		
		<dc:creator><![CDATA[Aurélie Vache]]></dc:creator>
		<pubDate>Tue, 11 Feb 2025 08:58:40 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Security]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=27886</guid>

					<description><![CDATA[Several months ago we discovered Falco, a Cloud Native, near real-time threat detection tool, and we saw how to install it on an OVHcloud MKS cluster. Today we will connect our Falco instance to a MKS cluster in order to retrieve Kubernetes Audit Logs events and check that everything is OK in our cluster. Concretely, [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="484" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/falco-blogpost-plugin-mks-1-1024x484.jpg" alt="" class="wp-image-28194" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/falco-blogpost-plugin-mks-1-1024x484.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/falco-blogpost-plugin-mks-1-300x142.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/falco-blogpost-plugin-mks-1-768x363.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/falco-blogpost-plugin-mks-1-1536x725.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/falco-blogpost-plugin-mks-1.jpg 1749w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Several months ago we discovered <a href="https://falco.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Falco</a>, a Cloud Native, near real-time threat detection tool, and we saw <a href="https://blog.ovhcloud.com/near-real-time-threats-detection-with-falco-on-ovhcloud-managed-kubernetes/" data-wpel-link="internal">how to install it on an OVHcloud MKS cluster</a>.</p>



<p>Today we will connect our Falco instance to a MKS cluster in order to retrieve <strong>Kubernetes Audit Logs</strong> events and check that everything is OK in our cluster.</p>



<p>Concretely, in this blog post we will:</p>



<ul class="wp-block-list">
<li>deploy an OVHcloud LDP (Logs Data Platform)</li>



<li>create a data stream into this LDP</li>



<li>connect an OVHcloud MKS cluster to the data stream (to send Audit Logs into it)</li>



<li>use the <strong>k8saudit-ovh</strong> Falco plugin to retrieve the Audit Logs of a MKS cluster in real time</li>



<li>test a rule and detect security events based on MKS audit logs activity</li>
</ul>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>This blog post presupposes that you already have a working&nbsp;<a href="https://www.ovhcloud.com/fr/public-cloud/kubernetes/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud Managed Kubernetes</a>&nbsp;(MKS) cluster, and a running instance of Falco.</p>



<p>If that is not the case, follow the <a href="https://blog.ovhcloud.com/near-real-time-threats-detection-with-falco-on-ovhcloud-managed-kubernetes/" data-wpel-link="internal">Near real-time threats detection with Falco on OVHcloud Managed Kubernetes</a> blog post.</p>



<h2 class="wp-block-heading">Deploying a Logs Data Platform (LDP)</h2>



<p>LDP is OVHcloud&#8217;s managed platform for collecting, processing, analyzing and storing the logs of OVHcloud products. To access our Kubernetes cluster&#8217;s Audit Logs, we need to deploy an LDP.</p>



<p>Find more information on our dedicated&nbsp;<a href="https://www.ovhcloud.com/en/identity-security-operations/logs-data-platform/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">LDP page</a>.</p>



<p>We can deploy an LDP through the OVHcloud Control Panel or the API. In this blog post, we will deploy it through the Control Panel.</p>



<p>First, you have to log in to the&nbsp;<a href="https://www.ovh.com/manager/#/dedicated/dbaas/logs/order" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">OVHcloud Control Panel</a>, click on the <strong>Bare Metal Cloud</strong> section located at the top in the header and then click on the <strong>Logs Data Platform</strong> in the sidebar.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="529" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-1-1024x529.png" alt="" class="wp-image-27901" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-1-1024x529.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-1-300x155.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-1-768x396.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-1-1536x793.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-1-2048x1057.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Choose the LDP plan you want: <em>Standard</em> (free) or <em>Enterprise</em>, depending on your needs.</p>



<p>Select a <strong>region</strong> (<em>North America</em> or <em>Europe</em>). We will choose &#8220;<strong>GRA</strong>&#8221; for this blog post. Click on the <strong>Order</strong> button and follow the instructions.</p>



<p>After several minutes your LDP will be created. </p>



<p>Refresh the page, click on the newly deployed LDP, then enter a password and click on the <strong>Save</strong> button.</p>



<h2 class="wp-block-heading">Creating a Data stream and retrieving the Websocket URL</h2>



<p>Our Kubernetes Audit Logs will be stored in a data stream so click on the <strong>Data stream</strong> tab and then click on the <strong>Add data stream</strong> button.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="466" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-3-1024x466.png" alt="" class="wp-image-27905" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-3-1024x466.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-3-300x137.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-3-768x350.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-3-1536x700.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-3-2048x933.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Choose a name for the data stream. I like to use the name of my MKS cluster followed by &#8220;-audit-logs&#8221;, so it is easy to see what the data stream is for. My MKS cluster&#8217;s name is &#8220;my-rancher-mks-cluster&#8221;, so let&#8217;s name it &#8220;my-rancher-mks-cluster-audit-logs&#8221;. Fill in the description (mandatory).</p>



<p>The OVHcloud Audit Logs Falco plugin you will use receives the audit logs through a Websocket, so you need to enable <strong>Websocket broadcasting</strong>, then click on the <strong>Save</strong> button.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="730" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-5-1024x730.png" alt="" class="wp-image-27909" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-5-1024x730.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-5-300x214.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-5-768x548.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-5-1536x1095.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-5-2048x1460.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now, to retrieve the Websocket URL of your data stream, click on the <strong>Data stream</strong> tab, then click on the <strong>&#8230;</strong> button (on the right of your data stream&#8217;s row), and click on the <strong>Monitor in real time</strong> action.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="674" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-6-1024x674.png" alt="" class="wp-image-27913" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-6-1024x674.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-6-300x197.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-6-768x505.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-6-1536x1011.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-6-2048x1347.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Finally, click on the <strong>Action</strong> button, then on <strong>Copy Websocket address</strong>, and save the LDP Websocket URL somewhere ;-).</p>



<p>Note that the Websocket address has the following format: <code>wss://&lt;region&gt;.logs.ovh.com/tail/?tk=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx</code></p>
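<p>As a small sketch, you can derive the value expected by the <strong>k8saudit-ovh</strong> Falco plugin from this address by stripping the <code>wss://</code> scheme (the plugin&#8217;s <code>open_params</code> takes the address without the scheme); the region and token below are placeholders, not real values:</p>

```shell
# Hypothetical Websocket address copied from the Control Panel (placeholder token).
LDP_WSS_URL='wss://gra1.logs.ovh.com/tail/?tk=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx'

# The plugin's open_params expects the address without the wss:// scheme.
OPEN_PARAMS="${LDP_WSS_URL#wss://}"
echo "$OPEN_PARAMS"
```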



<h2 class="wp-block-heading">Connect a MKS cluster to a LDP data stream</h2>



<p>Now we need to send the Kubernetes Audit Logs of our MKS cluster into the data stream.</p>



<p>For that, in the OVHcloud Control Panel, click on the <strong>Public Cloud</strong> section in the header and then in <strong>Managed Kubernetes Service</strong> in the sidebar.</p>



<p>Click on your Kubernetes cluster (my-rancher-mks-cluster for example), then in the <strong>Logs</strong> tab and click on the <strong>Subscribe</strong> button.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="500" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-7-1024x500.png" alt="" class="wp-image-27917" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-7-1024x500.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-7-300x146.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-7-768x375.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-7-1536x750.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-7.png 2040w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Click on the <strong>Add data stream</strong> button to visualize the Audit Logs of your cluster in real time. Then select the LDP instance and click on the <strong>Subscribe</strong> button for the data stream you created:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="544" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-8-1024x544.png" alt="" class="wp-image-27918" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-8-1024x544.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-8-300x159.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-8-768x408.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-8-1536x815.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-8.png 2046w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Retrieve the MKS Audit Logs with Falco</h2>



<p>Falco can receive <strong>Events</strong>, compare them to a set of <strong>Rules</strong> to determine the actions to perform and generate <strong>Alerts</strong> to different endpoints. </p>



<p>Thanks to the <strong>k8saudit-ovh</strong> plugin, Falco can receive a new sort of <strong>Events</strong>: the Audit Logs of your MKS cluster. These events also come with their own <a href="https://github.com/falcosecurity/plugins/blob/main/plugins/k8saudit/rules/k8s_audit_rules.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">set of rules</a>.</p>



<p>Concretely, when a user executes <strong>kubectl</strong> commands in an OVHcloud MKS cluster, Audit Logs are generated. Falco listens to them and, depending on the configured rules, generates alerts.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="961" height="327" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image.png" alt="" class="wp-image-28190" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image.png 961w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image-300x102.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image-768x261.png 768w" sizes="auto, (max-width: 961px) 100vw, 961px" /></figure>



<p>Let&#8217;s install (or update) Falco in a MKS cluster with a configuration that uses this plugin.</p>



<p>Create a <strong>values.yaml</strong> file with the following content:</p>



<pre class="wp-block-code"><code class="">tty: true
kubernetes: false

# Just a Deployment with 1 replica (instead of a Daemonset) to have only one Pod that pulls the MKS Audit Logs from a OVHcloud LDP
controller:
  kind: deployment
  deployment:
    replicas: 1

falco:
  rule_matching: all
  rules_files:
    - /etc/falco/k8s_audit_rules.yaml
    - /etc/falco/rules.d
  plugins:
    - name: k8saudit-ovh
      library_path: libk8saudit-ovh.so
      open_params: "&lt;region&gt;.logs.ovh.com/tail/?tk=&lt;ID&gt;" # Replace with your LDP Websocket URL
    - name: json
      library_path: libjson.so
      init_config: ""
  # Plugins that Falco will load. Note: the same plugins are installed by the falcoctl-artifact-install init container.
  load_plugins: [k8saudit-ovh, json]

driver:
  enabled: false
collectors:
  enabled: false

# use falcoctl to install automatically the plugin and the rules
falcoctl:
  artifact:
    install:
      enabled: true
    follow:
      enabled: true
  config:
    indexes:
    - name: falcosecurity
      url: https://falcosecurity.github.io/falcoctl/index.yaml
    artifact:
      allowedTypes:
        - plugin
        - rulesfile
      install:
        resolveDeps: false
        refs: [k8saudit-rules:0, k8saudit-ovh:0.1, json:0]
      follow:
        refs: [k8saudit-rules:0]</code></pre>



<p>This <strong>values.yaml </strong>file will install Falco with the <strong>k8saudit-ovh</strong> and the <strong>json</strong> plugins. </p>



<p>Install the latest version of Falco with the&nbsp;<code>helm install</code>&nbsp;command:</p>



<pre class="wp-block-code"><code class="">$ helm install falco --create-namespace --namespace falco --values=values.yaml falcosecurity/falco</code></pre>



<p>This command will install the latest version of Falco, with the k8saudit-ovh and json plugins, and create a new&nbsp;<code>falco</code>&nbsp;namespace:</p>



<pre class="wp-block-code"><code class="">$ helm install falco --create-namespace --namespace falco --values=values.yaml falcosecurity/falco

NAME: falco
LAST DEPLOYED: Mon Feb 10 10:15:20 2025
NAMESPACE: falco
STATUS: deployed
REVISION: 1
NOTES:
No further action should be required.</code></pre>



<p>Or, if you already have Falco deployed in a Kubernetes cluster, you can use the <code>helm upgrade</code> command instead:</p>



<pre class="wp-block-code"><code class="">$ helm upgrade falco --create-namespace --namespace falco --values=values.yaml falcosecurity/falco</code></pre>



<p>You can check if the Falco pods are correctly running:</p>



<pre class="wp-block-code"><code class="">$ kubectl get pods -n falco

NAME                                      READY   STATUS    RESTARTS   AGE
falco-6b8bc77d8b-v24jr                    2/2     Running   0          96s
falco-falcosidekick-67877d6946-4hmbn      1/1     Running   0          96s
falco-falcosidekick-67877d6946-tpjk6      1/1     Running   0          96s
falco-falcosidekick-ui-78b96fd57d-4wb6q   1/1     Running   0          96s
falco-falcosidekick-ui-78b96fd57d-v7rnm   1/1     Running   0          96s
falco-falcosidekick-ui-redis-0            1/1     Running   0          96s</code></pre>



<p>Wait and execute the command again if the pods are in “Init” or “ContainerCreating” state.</p>



<p>Once the Falco pod is ready, run the following command to see the logs:</p>



<pre class="wp-block-code"><code class="">kubectl logs -l app.kubernetes.io/name=falco -n falco -c falco</code></pre>



<p>You should see logs like this:</p>



<pre class="wp-block-code"><code class="">$ kubectl logs -l app.kubernetes.io/name=falco -n falco -c falco

Mon Feb 10 09:15:35 2025:    /etc/falco/k8s_audit_rules.yaml | schema validation: ok
Mon Feb 10 09:15:35 2025: Hostname value has been overridden via environment variable to: my-pool-1-node-921b61
Mon Feb 10 09:15:35 2025: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Mon Feb 10 09:15:35 2025: Starting health webserver with threadiness 2, listening on 0.0.0.0:8765
Mon Feb 10 09:15:35 2025: Loaded event sources: syscall, k8s_audit
Mon Feb 10 09:15:35 2025: Enabled event sources: k8s_audit
Mon Feb 10 09:15:35 2025: Opening 'k8s_audit' source with plugin 'k8saudit-ovh'
{"hostname":"my-pool-1-node-921b61","output":"09:15:40.698757000: Warning K8s Operation performed by user not in allowed list of users (user=csi-cinder-controller target=csi-6afb06dce281b86b7bab718b5d966dc261b2b1554941ae449519a128cb2e3fb3/volumeattachments verb=patch uri=/apis/storage.k8s.io/v1/volumeattachments/csi-6afb06dce281b86b7bab718b5d966dc261b2b1554941ae449519a128cb2e3fb3/status resp=200)","output_fields":{"evt.time":1739178940698757000,"ka.response.code":"200","ka.target.name":"csi-6afb06dce281b86b7bab718b5d966dc261b2b1554941ae449519a128cb2e3fb3","ka.target.resource":"volumeattachments","ka.uri":"/apis/storage.k8s.io/v1/volumeattachments/csi-6afb06dce281b86b7bab718b5d966dc261b2b1554941ae449519a128cb2e3fb3/status","ka.user.name":"csi-cinder-controller","ka.verb":"patch"},"priority":"Warning","rule":"Disallowed K8s User","source":"k8s_audit","tags":["k8s"],"time":"2025-02-10T09:15:40.698757000Z"}
{"hostname":"my-pool-1-node-921b61","output":"09:15:57.508657000: Warning K8s Operation performed by user not in allowed list of users (user=yacht target=my-pool-1.18051c0a88716868/events verb=patch uri=/api/v1/namespaces/default/events/my-pool-1.18051c0a88716868 resp=403)","output_fields":{"evt.time":1739178957508657000,"ka.response.code":"403","ka.target.name":"my-pool-1.18051c0a88716868","ka.target.resource":"events","ka.uri":"/api/v1/namespaces/default/events/my-pool-1.18051c0a88716868","ka.user.name":"yacht","ka.verb":"patch"},"priority":"Warning","rule":"Disallowed K8s User","source":"k8s_audit","tags":["k8s"],"time":"2025-02-10T09:15:57.508657000Z"}
{"hostname":"my-pool-1-node-921b61","output":"09:15:57.807013000: Warning K8s Operation performed by user not in allowed list of users (user=yacht target=my-pool-1/nodepools verb=update uri=/apis/kube.cloud.ovh.com/v1alpha1/nodepools/my-pool-1/status resp=200)","output_fields":{"evt.time":1739178957807013000,"ka.response.code":"200","ka.target.name":"my-pool-1","ka.target.resource":"nodepools","ka.uri":"/apis/kube.cloud.ovh.com/v1alpha1/nodepools/my-pool-1/status","ka.user.name":"yacht","ka.verb":"update"},"priority":"Warning","rule":"Disallowed K8s User","source":"k8s_audit","tags":["k8s"],"time":"2025-02-10T09:15:57.807013000Z"}</code></pre>



<p>The logs confirm that Falco <strong>k8saudit-ovh</strong> plugin and the <strong>k8saudit</strong> rules have been loaded correctly 💪.</p>
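<p>To get a quick overview of which rules are firing, you can summarize the alerts by rule name straight from the pod logs. A minimal sketch (the sample lines below are trimmed stand-ins for the real JSON alerts; on the cluster you would pipe <code>kubectl logs -l app.kubernetes.io/name=falco -n falco -c falco</code> into the same function; <code>jq</code> would be cleaner if installed):</p>

```shell
# Count Falco alerts per rule name from JSON alert lines on stdin.
summarize_rules() {
  sed -n 's/.*"rule":"\([^"]*\)".*/\1/p' | sort | uniq -c | sort -rn
}

# Trimmed sample alert lines, for illustration only.
summary=$(printf '%s\n' \
  '{"rule":"Disallowed K8s User","priority":"Warning"}' \
  '{"rule":"Disallowed K8s User","priority":"Warning"}' \
  '{"rule":"Attach/Exec Pod","priority":"Notice"}' \
  | summarize_rules)
echo "$summary"
```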



<h2 class="wp-block-heading">Testing Falco</h2>



<p>In order to test Falco we need to know which rules are installed by default. In our case, as defined in the values.yaml file, the <strong>k8saudit-ovh</strong> plugin follows the rules in the <a href="https://github.com/falcosecurity/plugins/blob/main/plugins/k8saudit/rules/k8s_audit_rules.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">k8s_audit_rules.yaml</a> file. You can take a look at it to get familiar with them.</p>



<p>In this blog post we will test one of the well-known default k8s audit rules:</p>



<pre class="wp-block-code"><code class="">- rule: Attach/Exec Pod
  desc: &gt;
    Detect any attempt to attach/exec to a pod
  condition: kevt_started and pod_subresource and (kcreate or kget) and ka.target.subresource in (exec,attach) and not user_known_exec_pod_activities
  output: Attach/Exec to pod (user=%ka.user.name pod=%ka.target.name resource=%ka.target.resource ns=%ka.target.namespace action=%ka.target.subresource command=%ka.uri.param[command])
  priority: NOTICE
  source: k8s_audit
  tags: [k8s]</code></pre>



<p>This rule is interesting because an event will be generated when a user executes commands in a pod.</p>



<p>Let&#8217;s test the rule!</p>



<p>In a tab of your terminal, watch the coming logs:</p>



<pre class="wp-block-code"><code class="">$ kubectl logs -l app.kubernetes.io/name=falco -n falco -c falco -f</code></pre>



<p>In another tab of your terminal, create a Nginx pod and execute a command into it:</p>



<pre class="wp-block-code"><code class="">$ kubectl run nginx --image=nginx

$ kubectl exec -it nginx -- cat /etc/shadow</code></pre>



<p>Several seconds later, you should see this <strong>Attach/Exec to pod</strong> alert in the logs:</p>



<pre class="wp-block-code"><code class="">...
{"hostname":"my-pool-1-node-921b61","output":"09:29:46.302906000: Notice Attach/Exec to pod (user=kubernetes-admin pod=nginx-676b6c5bbc-4xc6t resource=pods ns=hello-app action=exec command=cat)","output_fields":{"evt.time":1739179786302906000,"ka.target.name":"nginx-676b6c5bbc-4xc6t","ka.target.namespace":"hello-app","ka.target.resource":"pods","ka.target.subresource":"exec","ka.uri.param[command]":"cat","ka.user.name":"kubernetes-admin"},"priority":"Notice","rule":"Attach/Exec Pod","source":"k8s_audit","tags":["k8s"],"time":"2025-02-10T09:29:46.302906000Z"}
...</code></pre>



<p>🎉</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Ensuring the security of Kubernetes clusters is important. Audit Logs contain a lot of information that often goes unused, so don&#8217;t hesitate to use this new plugin.</p>



<p>We installed the new k8saudit-ovh plugin in an OVHcloud MKS cluster, but note that you can deploy it in a Kubernetes cluster at another Cloud provider, and even in a Falco instance running locally 💪.</p>



<p>We visualized the logs/the events in the terminal but you can also visualize them in the <a href="https://github.com/falcosecurity/falcosidekick" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">sidekick</a> UI, create a custom rule and even use <a href="https://github.com/falcosecurity/falco-talon" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Talon</a> to execute some actions.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Backdoor in xz/liblzma (CVE-2024-3094)</title>
		<link>https://blog.ovhcloud.com/backdoor-in-xz-liblzma-cve-2024-3094/</link>
		
		<dc:creator><![CDATA[Julien Levrard]]></dc:creator>
		<pubDate>Tue, 02 Apr 2024 12:37:49 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Security]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26523</guid>

					<description><![CDATA[On March 29th, Andres Freund, a Postgres developer working at Microsoft, noticed that response times while authenticating to OpenSSH on a Debian Sid installation were about 500 ms longer than usual. He investigated the behaviour and concluded that liblzma, part of the xz library, was compromised by a complex backdoor injected into distribution packages [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>On March 29th, <a href="https://twitter.com/AndresFreundTec" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Andres Freund</a>, a Postgres developer working at Microsoft, noticed that response times while authenticating to OpenSSH on a Debian Sid installation were about 500 ms longer than usual. <a href="https://www.openwall.com/lists/oss-security/2024/03/29/4" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">He investigated</a> the behaviour and concluded that liblzma, part of the xz library, was compromised by a complex backdoor injected into distribution packages during build. Versions 5.6.0 and 5.6.1 of the library are impacted. Further investigations led to the discovery of an elaborate supply chain attack. The maintainer team seems to have been infiltrated over a long period of time (several years) by malevolent actors.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="967" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2024/04/cri-967x1024.png" alt="" class="wp-image-26531" style="width:400px" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/04/cri-967x1024.png 967w, https://blog.ovhcloud.com/wp-content/uploads/2024/04/cri-283x300.png 283w, https://blog.ovhcloud.com/wp-content/uploads/2024/04/cri-768x814.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/04/cri.png 1011w" sizes="auto, (max-width: 967px) 100vw, 967px" /></figure>



<p>The story of this backdoor deserves a deep analysis, which is out of scope here, but it raises a lot of questions for open-source communities and the whole IT sector.</p>



<h2 class="wp-block-heading">What systems are impacted?</h2>



<p>Since the vulnerability was detected relatively quickly, no major distribution had yet integrated those versions of the XZ library.</p>



<p>Only the distributions with a very fast pace of software integration (Rolling releases, testing, so-called &#8220;unstable&#8221;) had integrated the corrupted version at detection time.</p>



<h2 class="wp-block-heading">As an OVHcloud customer, what are the risks?</h2>



<p>No Linux images provided by OVHcloud to customers for automated installation are impacted, so no customer should be vulnerable to this backdoor when using images provided by OVHcloud.</p>



<p>In some corner cases, the backdoor might have been installed on your system:</p>



<ul class="wp-block-list">
<li>If you installed a vulnerable distribution yourself, during the timespan when the compromise was not yet discovered, outside of the OVHcloud automated installation process (for instance, a Linux distribution in &#8220;rolling release&#8221; mode)</li>
</ul>



<ul class="wp-block-list">
<li>If you activated edge repositories on your&nbsp;system (for instance, &#8220;experimental&#8221;, &#8220;unstable&#8221; or &#8220;testing&#8221; for Debian, &#8220;edge&#8221; for Alpine, &#8220;update-proposed&#8221; for Ubuntu)</li>
</ul>



<ul class="wp-block-list">
<li>If you installed software that packages the vulnerable version of the library</li>
</ul>



<ul class="wp-block-list">
<li>If you use an alternative package manager&nbsp;(for instance Homebrew)</li>
</ul>



<p>The backdoor is quite complex, so even in such cases, you might have deployed the corrupted version of the XZ library without your system actually being vulnerable. Refer to your distribution/software security advisory page to get more information.</p>



<h2 class="wp-block-heading">How can I check if I use a backdoored version of the library?</h2>



<p>Check your active version of the XZ library:</p>



<pre class="wp-block-code"><code class="">debian@lab:~$ strings `which xz` | grep "(XZ Utils)"
xz (XZ Utils) 5.2.5</code></pre>



<p>Note: the command &#8220;xz -V&#8221; would provide similar output. However, it is not good practice to execute a binary that might be compromised.</p>



<p>Ensure the active version of the XZ library is not one of the known vulnerable ones (5.6.0 and 5.6.1). If you have a compromised version of XZ, follow the security recommendations from your distribution. In some cases, a patch has been released to correct the vulnerability; in other cases, reverting to an older version of the library is recommended.</p>
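<p>If you need to run this check on many machines, the comparison can be wrapped in a small helper. A minimal sketch, feeding it the version string reported by the <code>strings</code> command above rather than executing the binary itself:</p>

```shell
# Flag the known-backdoored xz versions (5.6.0 and 5.6.1).
# This only checks the version string; it does not tell you whether
# the system is actually exploitable (see your distribution's advisory).
check_xz_version() {
  case "$1" in
    5.6.0|5.6.1) echo "COMPROMISED: xz $1, follow your distribution's advisory" ;;
    *)           echo "OK: xz $1 is not a known-backdoored version" ;;
  esac
}

check_xz_version 5.2.5
check_xz_version 5.6.1
```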



<h2 class="wp-block-heading">In any case, apply the following recommendations:</h2>



<ul class="wp-block-list">
<li>Reduce the exposure of your server&#8217;s administration interfaces: filter at the network level which source IPs are allowed to connect over SSH.</li>
<li>Use a bastion host to connect to your server for administration (for instance: <a href="https://github.com/ovh/the-bastion" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/the-bastion</a>).</li>
<li>Perform regular backups of your data and system configurations, and regularly test your ability to rebuild your service from those backups.</li>
</ul>
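As one possible way to implement the SSH filtering above, here is a hedged nftables ruleset fragment (a sketch, not OVHcloud guidance; the table/chain names and the admin address 203.0.113.10, taken from the documentation range, are placeholders to replace with your own):

```
table inet filter {
  chain input {
    type filter hook input priority 0; policy accept;

    # Accept SSH only from the trusted admin address (placeholder)
    ip saddr 203.0.113.10 tcp dport 22 accept

    # Drop SSH connection attempts from everywhere else
    tcp dport 22 drop
  }
}
```

Such a fragment could be loaded with `nft -f <file>`; adapt it to your existing firewall setup rather than applying it as-is.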



<h2 class="wp-block-heading">External references:</h2>



<p><a href="https://www.openwall.com/lists/oss-security/2024/03/29/4" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.openwall.com/lists/oss-security/2024/03/29/4</a></p>



<p><a href="https://www.cisa.gov/news-events/alerts/2024/03/29/reported-supply-chain-compromise-affecting-xz-utils-data-compression-library-cve-2024-3094" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.cisa.gov/news-events/alerts/2024/03/29/reported-supply-chain-compromise-affecting-xz-utils-data-compression-library-cve-2024-3094</a></p>



<p><a href="https://lists.debian.org/debian-security-announce/2024/msg00057.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://lists.debian.org/debian-security-announce/2024/msg00057.html</a></p>



<p><a href="https://news.opensuse.org/2024/03/29/xz-backdoor/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://news.opensuse.org/2024/03/29/xz-backdoor/</a></p>



<p><a href="https://access.redhat.com/security/cve/CVE-2024-3094#cve-cvss-v3" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://access.redhat.com/security/cve/CVE-2024-3094#cve-cvss-v3</a></p>



<p><a href="https://archlinux.org/news/the-xz-package-has-been-backdoored/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://archlinux.org/news/the-xz-package-has-been-backdoored/</a></p>



<p><a href="https://boehs.org/node/everything-i-know-about-the-xz-backdoor" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://boehs.org/node/everything-i-know-about-the-xz-backdoor</a></p>



<p><a href="https://gynvael.coldwind.pl/?lang=en&amp;id=782" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://gynvael.coldwind.pl/?lang=en&amp;id=782</a></p>



<p><a href="https://www.wiz.io/blog/cve-2024-3094-critical-rce-vulnerability-found-in-xz-utils" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.wiz.io/blog/cve-2024-3094-critical-rce-vulnerability-found-in-xz-utils</a></p>



<p><a href="https://research.swtch.com/xz-timeline" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://research.swtch.com/xz-timeline</a></p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
