Reference Architecture: Custom metric autoscaling for LLM inference with vLLM on OVHcloud AI Deploy and observability using MKS

Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability.

vLLM metrics monitoring and observability based on OVHcloud infrastructure

This reference architecture describes a comprehensive solution for deploying, autoscaling and monitoring vLLM-based LLM workloads on OVHcloud infrastructure. It combines AI Deploy, used for model serving with custom metric autoscaling, and the Managed Kubernetes Service (MKS), which hosts the monitoring and observability stack.

By leveraging application-level Prometheus metrics exposed by vLLM, AI Deploy can automatically scale inference replicas based on real workload demand, ensuring high availability, consistent performance under load and efficient GPU utilisation. This autoscaling mechanism allows the platform to react dynamically to traffic spikes while maintaining predictable latency for end users.

On top of this scalable inference layer, the monitoring architecture provides observability through Prometheus, Grafana and Alertmanager. It enables real-time performance monitoring, capacity planning, and operational insights, while ensuring full data sovereignty for organisations running Large Language Models (LLMs) in production environments.

What are the key benefits?

  • Cost-effective: Leverage managed services to minimise operational overhead
  • Real-time observability: Track Time-to-First-Token (TTFT), throughput, and resource utilisation
  • Sovereign infrastructure: All metrics and data remain within European datacentres
  • Production-ready: Persistent storage, high availability, and automated monitoring

Context

AI Deploy

OVHcloud AI Deploy is a Container as a Service (CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications/APIs based on Machine Learning (ML), Deep Learning (DL) or Large Language Models (LLMs).

Key points to keep in mind:

  • Easy to use: Bring your own custom Docker image and deploy it with a single command line or a few clicks
  • High-performance computing: A complete range of GPUs available (H100, A100, V100S, L40S and L4)
  • Scalability and flexibility: Supports automatic scaling, allowing your model to effectively handle fluctuating workloads
  • Cost-efficient: Billing per minute, no surcharges

Managed Kubernetes Service

OVHcloud MKS is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.

What should you keep in mind?

  • Cost-efficient: Only pay for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane
  • Fully managed Kubernetes: Certified upstream Kubernetes with automated control plane management, upgrades and high availability
  • Production-ready by design: Built-in integrations with OVHcloud Load Balancers, networking and persistent storage
  • Scalability and flexibility: Easily scale workloads and node pools to match application demand
  • Open and portable: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in

In the following guide, all services are deployed within the OVHcloud Public Cloud.

Overview of the architecture

This reference architecture describes a complete, secure and scalable solution to:

  • Deploy an LLM with vLLM and AI Deploy, benefiting from automatic scaling based on custom metrics to ensure high service availability – vLLM exposes /metrics via its public HTTPS endpoint on AI Deploy
  • Collect, store and visualise these vLLM metrics using Prometheus and Grafana on MKS

vLLM metrics monitoring and observability architecture overview

Here you will find the main components of the architecture. The solution comprises three main layers:

  1. Model serving layer with AI Deploy
    • vLLM containers running on top of GPUs for LLM inference
    • vLLM inference server exposing Prometheus metrics
    • Automatic scaling based on custom metrics to ensure high availability
    • HTTPS endpoints with Bearer token authentication
  2. Monitoring and observability infrastructure using Kubernetes
    • Prometheus for metrics collection and storage
    • Grafana for visualisation and dashboards
    • Persistent volume storage for long-term retention
  3. Network layer
    • Secure HTTPS communication between components
    • OVHcloud LoadBalancer for external access

Before going any further, some prerequisites must be checked!

Prerequisites

Before you begin, ensure you have:

  • An OVHcloud Public Cloud account
  • An OpenStack user with the Administrator role
  • ovhai CLI available – install the ovhai CLI
  • A Hugging Face access token – create a Hugging Face account and generate an access token
  • kubectl installed and helm installed (at least version 3.x)

🚀 Now that you have all the ingredients for our recipe, it’s time to deploy Ministral 3 14B using AI Deploy and the vLLM Docker container!

Architecture guide: From autoscaling to observability for LLMs served by vLLM

Let’s set up and deploy this architecture!

Overview of the deployment workflow

Note

In this example, mistralai/Ministral-3-14B-Instruct-2512 is used. Choose the open-source model of your choice and follow the same steps, adapting the model slug (from Hugging Face), the versions and the GPU(s) flavour.

Remember that all of the following steps can be automated using OVHcloud APIs!

Step 1 – Manage access tokens

Before deploying the model, set up the access tokens you will need: a Hugging Face token to download the model weights, and an AI Deploy Bearer token to secure access to the inference endpoint once it is deployed.

Export your Hugging Face token.

export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

Create a Bearer token to access your AI Deploy app once it’s been deployed.

ovhai token create --role operator ai_deploy_token=my_operator_token

This returns the following output:

Id: 47292486-fb98-4a5b-8451-600895597a2b
Created At: 20-01-26 11:53:05
Updated At: 20-01-26 11:53:05
Spec:
  Name: ai_deploy_token=my_operator_token
  Role: AiTrainingOperator
  Label Selector:
Status:
  Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  Version: 1

You can now store and export your access token:

export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Step 2 – LLM deployment using AI Deploy

Before introducing the monitoring stack, this architecture starts with the deployment of the Ministral 3 14B on OVHcloud AI Deploy, configured to autoscale based on custom Prometheus metrics exposed by vLLM itself.

1. Define the targeted vLLM metric for autoscaling

Before proceeding with the deployment of the Ministral 3 14B endpoint, you have to choose the metric you want to use as the trigger for scaling.

Instead of relying solely on CPU/RAM utilisation, AI Deploy allows autoscaling decisions to be driven by application-level signals.

To do this, you can consult the metrics exposed by vLLM.

In this example, you can use a basic metric such as vllm:num_requests_running to scale the number of replicas based on real inference load.

This enables:

  • Faster reaction to traffic spikes
  • Better GPU utilisation
  • Reduced inference latency under load
  • Cost-efficient scaling

Finally, the configuration chosen for scaling this application is as follows:

Parameter | Value | Description
Metric source | /metrics | vLLM Prometheus endpoint
Metric name | vllm:num_requests_running | Number of in-flight requests
Aggregation | AVERAGE | Mean across replicas
Target value | 50 | Desired load per replica
Min replicas | 1 | Baseline capacity
Max replicas | 3 | Burst capacity

Note

You can choose the metric that best suits your use case. You can also apply a patch to your AI Deploy deployment at any time to change the target metric for scaling.

When the average number of running requests exceeds 50, AI Deploy automatically provisions additional GPU-backed replicas.
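
If you want to double-check the available signals before wiring one into autoscaling, the short sketch below lists every vllm:* metric advertised on the /metrics endpoint. It is only an illustration: the requests dependency and the VLLM_METRICS_URL environment variable are assumptions of this example, and for an AI Deploy app you would point it at the public HTTPS endpoint with your Bearer token.

import os
import re

import requests  # assumption: third-party HTTP client installed via pip

# Point this at your vLLM /metrics endpoint (hypothetical env var for this sketch).
METRICS_URL = os.environ.get("VLLM_METRICS_URL", "http://localhost:8000/metrics")
TOKEN = os.environ.get("MY_OVHAI_ACCESS_TOKEN", "")  # required for an AI Deploy endpoint

headers = {"Authorization": f"Bearer {TOKEN}"} if TOKEN else {}
body = requests.get(METRICS_URL, headers=headers, timeout=10).text

# In the Prometheus exposition format, '# HELP <name> <description>' documents each metric.
for line in body.splitlines():
    match = re.match(r"# HELP (vllm:\S+) (.*)", line)
    if match:
        print(f"{match.group(1)} -> {match.group(2)}")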

2. Deploy Ministral 3 14B using AI Deploy

Now you can deploy the LLM using the ovhai CLI.

Key elements necessary for proper functioning:

  • GPU-based inference: 1 x H100
  • vLLM OpenAI-compatible Docker image: vllm/vllm-openai:v0.13.0
  • Custom autoscaling rules based on Prometheus metrics: vllm:num_requests_running

Below is the reference command used to deploy the mistralai/Ministral-3-14B-Instruct-2512:

ovhai app run \
--name vllm-ministral-14B-autoscaling-custom-metric \
--default-http-port 8000 \
--label ai_deploy_token=my_operator_token \
--gpu 1 \
--flavor h100-1-gpu \
-e OUTLINES_CACHE_DIR=/tmp/.outlines \
-e HF_TOKEN=$MY_HF_TOKEN \
-e HF_HOME=/hub \
-e HF_DATASETS_TRUST_REMOTE_CODE=1 \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
-v standalone:/hub:rw \
-v standalone:/workspace:rw \
--liveness-probe-path /health \
--liveness-probe-port 8000 \
--liveness-initial-delay-seconds 300 \
--probe-path /v1/models \
--probe-port 8000 \
--initial-delay-seconds 300 \
--auto-min-replicas 1 \
--auto-max-replicas 3 \
--auto-custom-api-url "http://<SELF>:8000/metrics" \
--auto-custom-metric-format PROMETHEUS \
--auto-custom-value-location vllm:num_requests_running \
--auto-custom-target-value 50 \
--auto-custom-metric-aggregation-type AVERAGE \
vllm/vllm-openai:v0.13.0 \
-- bash -c "python3 -m vllm.entrypoints.openai.api_server \
--model mistralai/Ministral-3-14B-Instruct-2512 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral \
--enable-auto-tool-choice \
--tool-call-parser mistral \
--enable-prefix-caching"

How should you read the different parameters of this command?

a. Start your AI Deploy app

Launch a new app using ovhai CLI and name it.

ovhai app run --name vllm-ministral-14B-autoscaling-custom-metric

b. Define access

Define the HTTP API port and restrict access to your token.

--default-http-port 8000
--label ai_deploy_token=my_operator_token

c. Configure GPU resources

Specify the hardware type (h100-1-gpu), which refers to an NVIDIA H100 GPU, and the number of GPUs (1).

--gpu 1
--flavor h100-1-gpu

⚠️WARNING! For this model, one H100 is sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access L40S and A100 GPUs for your LLM deployment.

d. Set up environment variables

Configure caching for the Outlines library (used for efficient text generation):

-e OUTLINES_CACHE_DIR=/tmp/.outlines

Pass the Hugging Face token ($MY_HF_TOKEN) for model authentication and download:

-e HF_TOKEN=$MY_HF_TOKEN

Set the Hugging Face cache directory to /hub (where models will be stored):

-e HF_HOME=/hub

Allow execution of custom remote code from Hugging Face datasets (required for some model behaviours):

-e HF_DATASETS_TRUST_REMOTE_CODE=1

Disable Hugging Face Hub transfer acceleration (to use standard model downloading):

-e HF_HUB_ENABLE_HF_TRANSFER=0

e. Mount persistent volumes

Mount two persistent storage volumes:

  1. /hub → Stores Hugging Face model files
  2. /workspace → Main working directory

The rw flag means read-write access.

-v standalone:/hub:rw
-v standalone:/workspace:rw

f. Health checks and readiness

Configure liveness and readiness probes:

  1. /health verifies the container is alive
  2. /v1/models confirms the model is loaded and ready to serve requests

The long initial delays (300 seconds) correspond to the startup time of vLLM and the loading of the model on the GPU; adjust them to match the startup time of your own model.

--liveness-probe-path /health
--liveness-probe-port 8000
--liveness-initial-delay-seconds 300

--probe-path /v1/models
--probe-port 8000
--initial-delay-seconds 300
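
To make the probe behaviour concrete, here is a minimal sketch of what the readiness check effectively waits for: /v1/models answering 200 once the model is loaded. It is an illustration only, not AI Deploy's internal probe code; the requests dependency and the VLLM_APP_URL environment variable are assumptions of this example.

import os
import time

import requests  # assumption: third-party HTTP client

APP_URL = os.environ["VLLM_APP_URL"]  # hypothetical env var, e.g. https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net
TOKEN = os.environ["MY_OVHAI_ACCESS_TOKEN"]

headers = {"Authorization": f"Bearer {TOKEN}"}
deadline = time.time() + 600  # vLLM startup + model loading can take several minutes

while time.time() < deadline:
    try:
        resp = requests.get(f"{APP_URL}/v1/models", headers=headers, timeout=5)
        if resp.status_code == 200:
            print("Model loaded and ready:", [m["id"] for m in resp.json()["data"]])
            break
    except requests.RequestException:
        pass  # endpoint not reachable yet, keep waiting
    time.sleep(10)
else:
    print("Model not ready within 10 minutes")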

g. Autoscaling configuration (custom metrics)

First set the minimum and maximum number of replicas.

--auto-min-replicas 1
--auto-max-replicas 3

This guarantees basic availability (one replica always up) while allowing for peak capacity.

Then enable autoscaling based on application-level metrics exposed by vLLM.

--auto-custom-api-url "http://<SELF>:8000/metrics"
--auto-custom-metric-format PROMETHEUS
--auto-custom-value-location vllm:num_requests_running
--auto-custom-target-value 50
--auto-custom-metric-aggregation-type AVERAGE

AI Deploy:

  • Scrapes the local /metrics endpoint
  • Parses Prometheus-formatted metrics
  • Extracts the vllm:num_requests_running gauge
  • Computes the average value across replicas

Scaling behaviour:

  • When the average number of in-flight requests exceeds 50, AI Deploy adds replicas
  • When load decreases, replicas are scaled down

This approach ensures high availability and predictable latency under fluctuating traffic.
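
As an illustration of the arithmetic behind this kind of custom-metric scaling, the sketch below applies the usual Kubernetes HPA-style formula (desired = ceil(current replicas × observed average / target), clamped between the min and max replicas). It is a conceptual model of the behaviour described above, not AI Deploy's exact internal algorithm.

import math

MIN_REPLICAS, MAX_REPLICAS, TARGET = 1, 3, 50  # values from the deployment command above

def desired_replicas(current_replicas: int, avg_requests_running: float) -> int:
    # Standard HPA-style rule: scale proportionally to the observed/target ratio.
    desired = math.ceil(current_replicas * avg_requests_running / TARGET)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, desired))

print(desired_replicas(1, 20))   # 1 -> load below target, stay at baseline
print(desired_replicas(1, 120))  # 3 -> ~120 in-flight requests call for 3 replicas
print(desired_replicas(3, 30))   # 2 -> load dropped, scale back down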

h. Choose the target Docker image and the startup command

Use the official vLLM OpenAI-compatible Docker image.

vllm/vllm-openai:v0.13.0

Finally, run the model inside the container using a Python command to launch the vLLM API server:

  • python3 -m vllm.entrypoints.openai.api_server → Starts the OpenAI-compatible vLLM API server
  • --model mistralai/Ministral-3-14B-Instruct-2512 → Loads the Ministral 3 14B model from Hugging Face
  • --tokenizer_mode mistral → Uses the Mistral tokenizer
  • --load_format mistral → Uses Mistral’s model loading format
  • --config_format mistral → Ensures the model configuration follows Mistral’s standard
  • --enable-auto-tool-choice → Automatic call of tools if necessary (function/tool call)
  • --tool-call-parser mistral → Tool calling support
  • --enable-prefix-caching → Prefix caching for improved throughput and reduced latency

You can now launch this command using ovhai CLI.

3. Check AI Deploy app status

You can now check if your AI Deploy app is alive:

ovhai app get <your_vllm_app_id>

Is your app in RUNNING status? Perfect! You can check in the logs that the server has started:

ovhai app logs <your_vllm_app_id>

⚠️WARNING! This step may take a little time as the LLM must be loaded.

4. Test that the deployment is functional

First, send a prompt to the LLM. Launch the following query, asking the question of your choice:

curl https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net/v1/chat/completions \
-H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Ministral-3-14B-Instruct-2512",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Give me the name of OVHcloud’s founder."}
],
"stream": false
}'

You can also verify access to vLLM metrics.

curl -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net/metrics
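
Since the endpoint is OpenAI-compatible, you can run the same two checks from Python, which is also how the load-test script in Step 7 talks to the model. The sketch below assumes the openai and requests packages are installed and the environment variables from Step 1 are exported; replace the app URL with yours.

import os

import requests  # assumption: third-party HTTP client
from openai import OpenAI

APP_URL = "https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net"  # replace with your app URL
TOKEN = os.environ["MY_OVHAI_ACCESS_TOKEN"]

# 1) Chat completion through the OpenAI-compatible API
client = OpenAI(base_url=f"{APP_URL}/v1", api_key=TOKEN)
answer = client.chat.completions.create(
    model="mistralai/Ministral-3-14B-Instruct-2512",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me the name of OVHcloud's founder."},
    ],
)
print(answer.choices[0].message.content)

# 2) The /metrics endpoint should answer 200 with Prometheus-formatted metrics
metrics = requests.get(f"{APP_URL}/metrics", headers={"Authorization": f"Bearer {TOKEN}"}, timeout=10)
print("Metrics endpoint status:", metrics.status_code)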

If these tests succeed with HTTP 200 responses, the model deployment is functional and you are ready to move on to the next step!

The next step is to set up the observability and monitoring stack. Note that the autoscaling mechanism is fully independent of the Prometheus instance used for observability:

  • AI Deploy queries the local /metrics endpoint internally
  • Prometheus scrapes the same metrics endpoint externally for monitoring, dashboards and potentially alerting

This ensures:

  • A single source of truth for metrics
  • No duplication of exporters
  • Consistent signals for scaling and observability

Step 3 – Create an MKS cluster

From the OVHcloud Control Panel, create a Kubernetes cluster using MKS.

Consider using the following configuration for the current use case:

  • Location: GRA (Gravelines) – you can select the same region as for AI Deploy
  • Network: Public
  • Node pool:
    • Flavour: b2-15 (or something similar)
    • Number of nodes: 3
    • Autoscaling: OFF
  • Name your node pool: monitoring

You should see your cluster (e.g. prometheus-vllm-metrics-ai-deploy) in the list, along with the following information:

If the status is green with the OK label, you can proceed to the next step.

Step 4 – Configure Kubernetes access

Download your kubeconfig file from the OVHcloud Control Panel and configure kubectl:

# configure kubectl with your MKS cluster
export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml

# verify cluster connectivity
kubectl cluster-info
kubectl get nodes

Now, you can create the values-prometheus.yaml file:

# general configuration
nameOverride: "monitoring"
fullnameOverride: "monitoring"

# Prometheus configuration
prometheus:
  prometheusSpec:
    # data retention (15d)
    retention: 15d

    # scrape interval (15s)
    scrapeInterval: 15s

    # persistent storage (required for production deployment)
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: csi-cinder-high-speed # OVHcloud storage
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi # (can be modified according to your needs)

    # scrape vLLM metrics from your AI Deploy instance (Ministral 3 14B)
    additionalScrapeConfigs:
      - job_name: 'vllm-ministral'
        scheme: https
        metrics_path: '/metrics'
        scrape_interval: 15s
        scrape_timeout: 10s

        # authentication using the AI Deploy Bearer token stored in a Kubernetes Secret
        bearer_token_file: /etc/prometheus/secrets/vllm-auth-token/token
        static_configs:
          - targets:
              - '<APP_ID>.app.gra.ai.cloud.ovh.net' # /!\ REPLACE THE <APP_ID> by yours /!\
            labels:
              service: 'vllm'
              model: 'ministral'
              environment: 'production'

        # TLS configuration
        tls_config:
          insecure_skip_verify: false

    # kube-prometheus-stack mounts the secret under /etc/prometheus/secrets/ and makes it accessible to Prometheus
    secrets:
      - vllm-auth-token

# Grafana configuration (visualization layer)
grafana:
  enabled: true

  # disable automatic datasource provisioning
  sidecar:
    datasources:
      enabled: false

  # persistent dashboards
  persistence:
    enabled: true
    storageClassName: csi-cinder-high-speed
    size: 10Gi

  # /!\ DEFINE ADMIN PASSWORD - REPLACE "test" BY YOURS /!\
  adminPassword: "test"

  # access via OVHcloud LoadBalancer (public IP and managed LB)
  service:
    type: LoadBalancer
    port: 80
    annotations:
      # optional: restrict access to specific IPs
      # service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources: "1.2.3.4/32"

# alertmanager (optional but recommended for production)
alertmanager:
  enabled: true

  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: csi-cinder-high-speed
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

# cluster observability components
nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true

Note

On OVHcloud MKS, persistent storage is handled automatically through the Cinder CSI driver. When a PersistentVolumeClaim (PVC) references a supported storageClassName such as csi-cinder-high-speed, OVHcloud dynamically provisions the underlying Block Storage volume and attaches it to the node running the pod. This enables stateful components like Prometheus, Alertmanager and Grafana to persist data reliably without any manual volume management, making the architecture fully cloud-native and operationally simple.

Then create the monitoring namespace:

# create namespace
kubectl create namespace monitoring

# verify creation
kubectl get namespaces | grep monitoring

Finally, configure the Bearer token secret to access vLLM metrics.

# create bearer token secret
kubectl create secret generic vllm-auth-token \
--from-literal=token="$MY_OVHAI_ACCESS_TOKEN" \
-n monitoring

# verify secret creation
kubectl get secret vllm-auth-token -n monitoring

# test token (optional)
kubectl get secret vllm-auth-token -n monitoring \
-o jsonpath='{.data.token}' | base64 -d

Right, if everything is working, let’s move on to deployment.

Step 5 – Deploy Prometheus stack

Add the Prometheus Helm repository and install the monitoring stack. The deployment creates:

  • Prometheus StatefulSet with persistent storage
  • Grafana deployment with LoadBalancer access
  • Alertmanager for future alert configuration (optional)
  • Supporting components (node exporters, kube-state-metrics)

# add Helm repository
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm repo update

# install monitoring stack
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values values-prometheus.yaml \
--wait

Then you can retrieve the LoadBalancer IP address to access Grafana:

kubectl get svc -n monitoring monitoring-grafana

Finally, open your browser to http://<EXTERNAL-IP> and log in with:

  • Username: admin
  • Password: as configured in your values-prometheus.yaml file
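
Before building dashboards, you may want to confirm that Prometheus is actually scraping the vllm-ministral job. A minimal sketch using the Prometheus HTTP API is shown below; it assumes you first run kubectl port-forward -n monitoring svc/monitoring-prometheus 9090:9090 in another terminal and that the service name matches the fullnameOverride set in values-prometheus.yaml.

import requests  # assumption: third-party HTTP client

# List active scrape targets and check the health of the vllm-ministral job.
targets = requests.get("http://localhost:9090/api/v1/targets", timeout=10).json()
for target in targets["data"]["activeTargets"]:
    if target["labels"].get("job") == "vllm-ministral":
        print(target["scrapeUrl"], "->", target["health"])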

Step 6 – Create Grafana dashboards

In this step, you will access the Grafana interface, add Prometheus as a new data source, and then create a complete dashboard with the different vLLM metrics.

1. Add a new data source in Grafana

First of all, create a new Prometheus connection inside Grafana:

  • Navigate to Connections → Data sources → Add data source
  • Select Prometheus
  • Configure URL: http://monitoring-prometheus:9090
  • Click Save & test

Now that your Prometheus has been configured as a new data source, you can create your Grafana dashboard.

2. Create your monitoring dashboard

To begin with, you can use the pre-configured Grafana dashboard by downloading the vLLM-metrics-grafana-monitoring.json file locally.

In the left-hand menu, select Dashboards:

  1. Navigate to Dashboards → Import
  2. Upload the vLLM-metrics-grafana-monitoring.json file you just downloaded
  3. Select Prometheus as the data source
  4. Click Import

The dashboard provides real-time visibility into Ministral 3 14B deployed with the vLLM container on OVHcloud AI Deploy.

You can now track:

  • Performance metrics: TTFT, inter-token latency, end-to-end latency
  • Throughput indicators: Requests per second, token generation rates
  • Resource utilisation: KV cache usage, active/waiting requests
  • Capacity indicators: Queue depth, preemption rates

Here are the key metrics tracked and displayed in the Grafana dashboard:

Metric Category | Prometheus Metric | Description | Use case
Latency | vllm:time_to_first_token_seconds | Time until first token generation | User experience monitoring
Latency | vllm:inter_token_latency_seconds | Time between tokens | Throughput optimisation
Latency | vllm:e2e_request_latency_seconds | End-to-end request time | SLA monitoring
Throughput | vllm:request_success_total | Successful requests counter | Capacity planning
Resource | vllm:kv_cache_usage_perc | KV cache memory usage | Memory management
Queue | vllm:num_requests_running | Active requests | Load monitoring
Queue | vllm:num_requests_waiting | Queued requests | Overload detection
Capacity | vllm:num_preemptions_total | Request preemptions | Peak load indicator
Tokens | vllm:prompt_tokens_total | Input tokens processed | Usage analytics
Tokens | vllm:generation_tokens_total | Output tokens generated | Cost tracking
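
These metrics can also be queried directly with PromQL, outside Grafana, which is handy for quick checks or automation. The sketch below queries the p95 TTFT and the number of running requests through the Prometheus HTTP API (with the same port-forward as before); histogram metric names can vary slightly between vLLM versions, so adjust the queries if needed.

import requests  # assumption: third-party HTTP client

PROM_URL = "http://localhost:9090"  # via kubectl port-forward -n monitoring svc/monitoring-prometheus 9090:9090

QUERIES = {
    "p95 TTFT (s)": 'histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))',
    "running requests": 'sum(vllm:num_requests_running)',
}

for name, promql in QUERIES.items():
    result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10).json()["data"]["result"]
    value = result[0]["value"][1] if result else "no data"
    print(f"{name}: {value}")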

Well done, you now have at your disposal:

  • An endpoint of the Ministral 3 14B model deployed with vLLM thanks to OVHcloud AI Deploy and its autoscaling strategies based on custom metrics
  • Prometheus for metrics collection and Grafana for visualisation/dashboards thanks to OVHcloud MKS

But how can you check that everything will work when the load increases?

Step 7 – Test autoscaling and real-time visualisation

The first objective here is to:

  • Increase vllm:num_requests_running
  • ‘Saturate’ a single replica
  • Trigger the scale-up
  • Observe the replica increase and the latency drop

1. Autoscaling testing strategy

The goal is to combine:

  • High concurrency
  • Long prompts (KV cache heavy)
  • Long generations
  • Bursty load

This is what vLLM autoscaling actually reacts to.

To do so, the following Python script can simulate the expected behaviour:

import os
import time
import threading
import random
from statistics import mean
from openai import OpenAI

APP_URL = "https://<APP_ID>.app.gra.ai.cloud.ovh.net/v1" # /!\ REPLACE THE <APP_ID> by yours /!\
MODEL = "mistralai/Ministral-3-14B-Instruct-2512"
API_KEY = os.environ["MY_OVHAI_ACCESS_TOKEN"]  # Bearer token exported in Step 1

CONCURRENT_WORKERS = 500 # concurrency (main scaling trigger)
REQUESTS_PER_WORKER = 25
MAX_TOKENS = 768 # generation pressure

# some random prompts
SHORT_PROMPTS = [
    "Summarize the theory of relativity.",
    "Explain what a transformer model is.",
    "What is Kubernetes autoscaling?"
]

MEDIUM_PROMPTS = [
    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",
    "Describe how vLLM manages KV cache and why it impacts inference performance."
]

LONG_PROMPTS = [
    "Write a very detailed technical explanation of how large language models perform inference, "
    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "
    "GPU memory management, and how batching affects latency and throughput. Use examples.",
]

PROMPT_POOL = (
    SHORT_PROMPTS * 2 +
    MEDIUM_PROMPTS * 4 +
    LONG_PROMPTS * 6 # bias toward long prompts
)

# OpenAI-compatible client
client = OpenAI(
    base_url=APP_URL,
    api_key=API_KEY,
)

# basic metrics
latencies = []
errors = 0
lock = threading.Lock()

# worker
def worker(worker_id):
    global errors
    for _ in range(REQUESTS_PER_WORKER):
        prompt = random.choice(PROMPT_POOL)

        start = time.time()
        try:
            client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=MAX_TOKENS,
                temperature=0.7,
            )
            elapsed = time.time() - start

            with lock:
                latencies.append(elapsed)

        except Exception:
            with lock:
                errors += 1

# run
threads = []
start_time = time.time()

print("Starting autoscaling stress test...")
print(f"Concurrency: {CONCURRENT_WORKERS}")
print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")

for i in range(CONCURRENT_WORKERS):
    t = threading.Thread(target=worker, args=(i,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

total_time = time.time() - start_time

# results
print("\n=== AUTOSCALING BENCH RESULTS ===")
print(f"Total requests sent: {len(latencies) + errors}")
print(f"Successful requests: {len(latencies)}")
print(f"Errors: {errors}")
print(f"Total wall time: {total_time:.2f}s")

if latencies:
    print(f"Avg latency: {mean(latencies):.2f}s")
    print(f"Min latency: {min(latencies):.2f}s")
    print(f"Max latency: {max(latencies):.2f}s")
    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")

How can you verify that autoscaling is working and that the load is being handled correctly without latency skyrocketing?

2. Hardware and platform-level monitoring

First, AI Deploy Grafana answers ‘What resources are being used and how many replicas exist?‘.

GPU utilisation, GPU memory, CPU, RAM and replica count are monitored through OVHcloud AI Deploy Grafana (monitoring URL), which exposes infrastructure and runtime metrics for the AI Deploy application. This layer provides visibility into resource saturation and scaling events managed by the AI Deploy platform itself.

Access it using the following URL (do not forget to replace <APP_ID> by yours): https://monitoring.gra.ai.cloud.ovh.net/d/app/app-monitoring?var-app=<APP_ID>&orgId=1

For example, check GPU/RAM metrics:

You can also monitor scale ups and downs in real time, as well as information on HTTP calls and much more!

3. Software and application-level monitoring

Next, the combination of MKS + Prometheus + Grafana answers ‘How does the inference engine behave internally?’.

In fact, vLLM internal metrics (request concurrency, token throughput, latency indicators, KV cache pressure, etc.) are collected via the vLLM /metrics endpoint and scraped by Prometheus running on OVHcloud MKS, then visualised in a dedicated Grafana instance. This layer focuses on model behaviour and inference performance.

Find all these metrics via (just replace <EXTERNAL-IP>): http://<EXTERNAL-IP>/d/vllm-ministral-monitoring/ministral-14b-vllm-metrics-monitoring?orgId=1

Find key metrics such as TTFT, etc.:

You can also find some information about ‘Model load and throughput’:

To go further and add even more metrics, you can refer to the vLLM documentation on ‘Prometheus and Grafana‘.

Conclusion

This reference architecture provides a scalable and production-ready approach for deploying LLM inference on OVHcloud using AI Deploy and its custom metric autoscaling feature.

OVHcloud MKS is dedicated to running Prometheus and Grafana, enabling secure scraping and visualisation of vLLM internal metrics exposed via the /metrics endpoint.

By scraping vLLM metrics securely from AI Deploy into Prometheus and exposing them through Grafana, the architecture provides full visibility into model behaviour, performance and load, enabling informed scaling analysis, troubleshooting and capacity planning in production environments.