<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GPU Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/gpu/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/gpu/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 10 Apr 2026 09:23:41 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>GPU Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/gpu/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Reference Architecture: Deploying a vision-language model with vLLM on OVHcloud MKS for high performance inference and full observability</title>
		<link>https://blog.ovhcloud.com/reference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 10 Apr 2026 07:48:53 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30455</guid>

					<description><![CDATA[Ensure complete&#160;digital sovereignty&#160;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&#160;Managed Kubernetes Service. This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&#160;OVHcloud Managed Kubernetes Service&#160;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&#160;Qwen3-VL-8B-Instruct&#160;multimodal model (vision + text) with OpenAI-compatible API endpoints. This comprehensive [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em><em>Ensure complete&nbsp;<strong>digital sovereignty</strong>&nbsp;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&nbsp;<strong>Managed Kubernetes Service</strong>.</em></em></p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" width="703" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg" alt="vLLM on OVHcloud MKS for high availability and full observability" class="wp-image-31153" style="width:710px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg 703w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-206x300.jpg 206w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-768x1118.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-1055x1536.jpg 1055w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm.jpg 1260w" sizes="(max-width: 703px) 100vw, 703px" /><figcaption class="wp-element-caption"><em><em>vLLM on OVHcloud MKS for high availability and full observability</em></em></figcaption></figure>



<p>This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&nbsp;<a href="https://www.ovhcloud.com/fr/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Kubernetes Service</a>&nbsp;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen3-VL-8B-Instruct</a>&nbsp;multimodal model (vision + text) with OpenAI-compatible API endpoints.</p>



<p>This comprehensive guide shows you how to deploy, automatically scale, and monitor vLLM-based LLM workloads on OVHcloud infrastructure.</p>



<p><strong>What are the key benefits?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-effectiveness:</strong>&nbsp;Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability:</strong>&nbsp;Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure:</strong>&nbsp;Keep all metrics and data within European datacentres</li>



<li><strong>Scalable by design:</strong>&nbsp;Automatically scale GPU inference replicas based on real workload demand</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p><strong>OVHcloud MKS</strong>&nbsp;is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p><strong>How does this benefit you?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Pay only for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, managed upgrades, and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking, and persistent storage</li>



<li><strong>Scalable and flexible</strong>: Easily scale workloads and node pools to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p>In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Architecture overview</h2>



<p>This reference architecture demonstrates a basic deployment of vLLM for vision-language model inference on OVHcloud Managed Kubernetes Service, featuring:</p>



<ul class="wp-block-list">
<li><strong>High-availability deployment</strong>&nbsp;with 2 GPU nodes (NVIDIA L40S)</li>



<li><strong>Optimised GPU utilisation</strong>&nbsp;with proper driver configuration</li>



<li><strong>Scalable infrastructure</strong>&nbsp;supporting vision-language models</li>



<li><strong>Comprehensive monitoring</strong>&nbsp;using Prometheus, Grafana, and DCGM</li>



<li><strong>Full observability</strong>&nbsp;for both application and hardware metrics</li>
</ul>



<p><strong>Data flow</strong>:</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg" alt="" class="wp-image-30985" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1536x806.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-2048x1075.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Data Flow</em></figcaption></figure>



<ol class="wp-block-list">
<li><strong>Inference request:</strong>
<ul class="wp-block-list">
<li>User → LoadBalancer → Gateway → NGINX Ingress → &#8220;Qwen3 VL&#8221; Service → vLLM Pod → GPU</li>



<li>Response follows reverse path with streaming support</li>
</ul>
</li>



<li><strong>Metrics collection:</strong>
<ul class="wp-block-list">
<li>vLLM Pods expose <code>/metrics</code> endpoint (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">8000</mark></strong></code>)</li>



<li>DCGM Exporters expose GPU metrics (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">9400</mark></strong></code>)</li>



<li>Prometheus scrapes both endpoints every 30 seconds</li>



<li>Grafana queries Prometheus for visualization</li>
</ul>
</li>



<li><strong>Load distribution:</strong>
<ul class="wp-block-list">
<li>NGINX Ingress uses cookie-based session affinity</li>



<li>vLLM Service uses ClientIP session affinity (see the check after this list)</li>



<li>Anti-affinity ensures 1 pod per GPU node</li>
</ul>
</li>
</ol>
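


<p>Once the stack from the steps below is deployed, you can confirm the affinity settings from point 3 directly on the cluster (a quick sketch; the service and Ingress names come from the later steps):</p>



<pre class="wp-block-code"><code class=""># service-level ClientIP affinity (expected value: ClientIP)<br>kubectl get svc qwen3-vl-service -n vllm -o jsonpath='{.spec.sessionAffinity}'<br><br># ingress-level cookie affinity annotations<br>kubectl get ingress -n vllm -o jsonpath='{.items[0].metadata.annotations}'</code></pre>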



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code>Administrator</code></strong></a>&nbsp;role</li>



<li><strong>Hugging Face access</strong>&nbsp;–&nbsp;<em>create a&nbsp;<a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></em></li>



<li><code><strong>kubectl</strong></code>&nbsp;and&nbsp;<code><strong>helm</strong></code>&nbsp;(version 3.x or later) already installed &#8211; <em>a quick check follows this list</em></li>
</ul>
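


<p>As a quick sanity check before starting, you can confirm the client tools are available (your reported versions will differ):</p>



<pre class="wp-block-code"><code class="">kubectl version --client<br>helm version --short</code></pre>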



<p><strong>🚀 Now you have all the ingredients, it’s time to deploy the recipe for&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen/Qwen3-VL-8B-Instruct</a>&nbsp;using vLLM and MKS!</strong></p>



<h2 class="wp-block-heading">Architecture guide: Native GPU deployment of vLLM on MKS with full stack observability</h2>



<p>This reference architecture describes a&nbsp;<strong>Large Language Model</strong>&nbsp;deployment using the&nbsp;<strong>vLLM inference server</strong>&nbsp;and&nbsp;<strong>Kubernetes</strong>, delivering a service that is both highly available and monitorable in real time.</p>



<h3 class="wp-block-heading">Step 1 &#8211; Create MKS cluster and Node pools</h3>



<p>From the&nbsp;<a href="https://www.ovh.com/manager/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">OVHcloud Control Panel</a>, create a Kubernetes cluster using <strong>MKS</strong>.</p>



<p>Navigate to: <code>Public Cloud</code> → <code>Managed Kubernetes Service</code> → <code>Create a cluster</code></p>



<h4 class="wp-block-heading">1. Configure cluster</h4>



<p>Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Name:</strong> <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code></li>



<li><strong>Location</strong>: 1-AZ Region &#8211; Gravelines (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">GRA11</mark></strong></code>)</li>



<li><strong>Plan:</strong> Free (or Standard)</li>



<li><strong>Network</strong>: attach a <strong>Private network </strong>(e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">0000 - AI Private Network</mark></strong></code>)</li>



<li><strong>Version:</strong> Latest stable (e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">1.34</mark></strong></code>)</li>
</ul>



<h4 class="wp-block-heading">2. Create GPU Node pool</h4>



<p>During the cluster creation, configure the vLLM Node pool for GPUs:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></code></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">L40S-90</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">2</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<p><strong>Why L40S-90?</strong></p>



<ul class="wp-block-list">
<li>Cost-effective for single-model deployment (1 GPU per node)</li>



<li>Sufficient RAM (90GB) for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Qwen3-VL-8B</mark></code></strong> model</li>
</ul>



<p>You should see your cluster (e.g.&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code>) in the list, along with the following information:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="930" height="588" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png" alt="" class="wp-image-30745" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png 930w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-300x190.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-768x486.png 768w" sizes="(max-width: 930px) 100vw, 930px" /></figure>



<p>You can now set up the node pool dedicated to monitoring.</p>



<h4 class="wp-block-heading">3. Create CPU Node pool</h4>



<p>From your cluster, click on <code><strong><mark class="has-inline-color has-ast-global-color-0-color">Add a node pool</mark></strong></code> and configure it as follows:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">B2-15</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">1</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>✅ <strong>Note</strong></p>



<p><strong><em>The monitoring stack can run on GPU nodes if cost is a concern, but a dedicated CPU node provides better isolation and resource management.</em></strong></p>
</blockquote>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="365" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png" alt="" class="wp-image-30743" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-300x107.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-768x274.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation.png 1283w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the status is green with the&nbsp;<strong><code><mark class="has-inline-color has-ast-global-color-0-color">OK</mark></code></strong>&nbsp;label, you can proceed to the next step.</p>



<h4 class="wp-block-heading">4. Configure Kubernetes access</h4>



<p>Once your nodes have been provisioned, you can download the <strong>Kubeconfig file</strong> and configure kubectl with your MKS cluster.</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p>Returning:</p>



<pre class="wp-block-code"><code class="">NAME                     STATUS   ROLES    AGE   VERSION<br>monitoring-node-xxxxxx   Ready    &lt;none&gt;   1d    v1.34.2<br>vllm-node-yyyyyy         Ready    &lt;none&gt;   1d    v1.34.2<br>vllm-node-zzzzzz         Ready    &lt;none&gt;   1d    v1.34.2</code></pre>



<p>Before going further, add a label to the CPU node for monitoring workloads.</p>



<pre class="wp-block-code"><code class="">CPU_NODE=$(kubectl get nodes -o json | \<br>  jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" == null) | .metadata.name')<br>kubectl label node $CPU_NODE node-role=monitoring</code></pre>



<p>Finally, check the node layout; one way is a <code>custom-columns</code> query like the following sketch, which produces the view below:</p>
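


<pre class="wp-block-code"><code class="">kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,ROLE:.metadata.labels.node-role'</code></pre>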



<pre class="wp-block-code"><code class="">NAME                     GPU      ROLE<br>monitoring-node-xxxxxx   &lt;none&gt;   monitoring<br>vllm-node-yyyyyy         1        &lt;none&gt;<br>vllm-node-zzzzzz         1        &lt;none&gt;</code></pre>



<p>Once both nodes are in <strong>Ready</strong> status, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 2 &#8211; Install GPU operator</h3>



<p>To start, consider setting up the GPU operator.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ Note</strong></p>



<p><em><strong>This step is based on this OVHcloud documentation: <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-kubernetes-deploy-gpu-application?id=kb_article_view&amp;sysparm_article=KB0049707" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deploying a GPU application on OVHcloud Managed Kubernetes Service</a></strong></em></p>
</blockquote>



<h4 class="wp-block-heading">1. Add NVIDIA helm repository and create namespace</h4>



<p>Add NVIDIA helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add nvidia https://helm.ngc.nvidia.com/nvidia<br>helm repo update</code></pre>



<p>Then create the namespace as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace gpu-operator</code></pre>



<h4 class="wp-block-heading">2. Install GPU operator with correct configuration</h4>



<p>The GPU Operator must be configured with specific driver versions to ensure compatibility with vLLM containers.</p>



<p>However, the default installation uses recent drivers (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">580.x</mark></strong></code> with <strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 13.x</mark></code></strong>) which are incompatible with vLLM containers (<strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.x</mark></code></strong>).</p>



<p><strong>Solution:</strong> Force driver version <strong><code><mark class="has-inline-color has-ast-global-color-0-color">535.183.01</mark></code></strong> (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.2</mark></strong></code>).</p>



<pre class="wp-block-code"><code class="">helm install gpu-operator nvidia/gpu-operator \<br>  -n gpu-operator \<br>  --set driver.enabled=true \<br>  --set driver.version="535.183.01" \<br>  --set toolkit.enabled=true \<br>  --set operator.defaultRuntime=containerd \<br>  --set devicePlugin.enabled=true \<br>  --set dcgmExporter.enabled=true \<br>  --set dcgmExporter.image="dcgm-exporter" \<br>  --set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04" \<br>  --set gfd.enabled=true \<br>  --set migManager.enabled=false \<br>  --set nodeStatusExporter.enabled=true \<br>  --set validator.driver.enable=false \<br>  --set validator.toolkit.enable=false \<br>  --set validator.plugin.enable=false \<br>  --timeout 20m</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>✅ <strong>Note </strong></p>



<p><em><strong>Specifying the DCGM version may only be necessary if you encounter problems with the default image (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color">‘ImagePullBackOff’</mark></code>). If this is the case, add the following parameters:<br><code><mark class="has-inline-color has-ast-global-color-0-color">--set dcgmExporter.repository="nvcr.io/nvidia/k8s"<br>--set dcgmExporter.image="dcgm-exporter"<br>--set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04"</mark></code></strong></em></p>
</blockquote>



<pre class="wp-block-code"><code class="">kubectl get pods -n gpu-operator</code></pre>



<p>Note that all pods should reach <strong>Running</strong> state in 5-10 minutes.</p>



<p>You can also check the GPU availability:</p>



<pre class="wp-block-code"><code class="">kubectl get nodes -o json | jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | "\(.metadata.name): \(.status.allocatable."nvidia.com/gpu") GPU(s)"'</code></pre>



<p>Returning:</p>



<pre class="wp-block-code"><code class="">vllm-node-yyyyyy: 1 GPU(s)<br>vllm-node-zzzzzz: 1 GPU(s)</code></pre>



<p>And you can test to run <code><strong><mark class="has-inline-color has-ast-global-color-0-color">nvidia-smi</mark></strong></code>:</p>



<pre class="wp-block-code"><code class="">DRIVER_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -1)<br>kubectl exec -n gpu-operator $DRIVER_POD -- nvidia-smi</code></pre>



<p>If the GPU tests are working properly, you can move on to the DCGM service configuration.</p>



<h4 class="wp-block-heading">3. Configure DCGM service</h4>



<p><strong>Why is DCGM Exporter required?</strong></p>



<p>DCGM (Data Centre GPU Manager) is NVIDIA&#8217;s official tool for monitoring GPUs in production. The goal is to be able to collect and display metrics from both GPU nodes.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="571" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg" alt="" class="wp-image-30746" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-300x167.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-768x428.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1536x856.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1.jpg 1733w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>GPU monitoring with DCGM</em></figcaption></figure>



<p>The metrics provided are:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_UTIL</mark></strong></code> &#8211; GPU utilisation (%)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_TEMP</mark></code></strong> &#8211; GPU temperature (°C)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_USED</mark></code></strong> &#8211; VRAM used (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_FREE</mark></code></strong> &#8211; Free VRAM (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_POWER_USAGE</mark></code></strong> &#8211; Power consumption (W)</li>



<li>And 50+ other GPU metrics</li>
</ul>
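


<p>To see these metrics in raw form, you can port-forward the exporter service created by the GPU Operator and query its <code>/metrics</code> endpoint directly (a quick check; run it after the operator installation above):</p>



<pre class="wp-block-code"><code class=""># expose the DCGM exporter locally<br>kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &amp;<br><br># sample one metric (one line per GPU)<br>curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL</code></pre>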



<p>Next, ensure the DCGM service has the correct labels and port configuration:</p>



<pre class="wp-block-code"><code class="">kubectl patch svc nvidia-dcgm-exporter -n gpu-operator --type merge -p '{<br>  "metadata": {<br>    "labels": {<br>      "app": "nvidia-dcgm-exporter"<br>    }<br>  },<br>  "spec": {<br>    "ports": [<br>      {<br>        "name": "metrics",<br>        "port": 9400,<br>        "targetPort": 9400,<br>        "protocol": "TCP"<br>      }<br>    ]<br>  }<br>}'</code></pre>



<p>Verify the endpoints (should show 2 IPs, one per GPU node).</p>



<pre class="wp-block-code"><code class="">kubectl get endpoints nvidia-dcgm-exporter -n gpu-operator</code></pre>



<pre class="wp-block-code"><code class="">NAME                   ENDPOINTS                   AGE<br>nvidia-dcgm-exporter   x.x.x.x:9400,x.x.x.x:9400   17d</code></pre>



<h3 class="wp-block-heading">Step 3 &#8211; Deploy Qwen3 VL 8B with vLLM inference server</h3>



<p>The deployment of the <strong>Qwen 3 VL 8B</strong> model on two L40S GPU nodes is carried out in several stages.</p>



<h4 class="wp-block-heading">1. Create namespace and Hugging Face secret</h4>



<p>Start by creating the namespace:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace vllm</code></pre>



<p>Next, you must retrieve your Hugging Face token and replace the&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">HF_TOKEN</mark></strong></code>&nbsp;value with your own:</p>



<pre class="wp-block-code"><code class="">export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"</code></pre>



<p>Create your secret as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create secret generic huggingface-secret \<br>  --from-literal=token=$HF_TOKEN \<br>  --namespace=vllm</code></pre>



<p>Verify that you obtain the following output by launching:</p>



<pre class="wp-block-code"><code class="">kubectl get secret huggingface-secret -n vllm</code></pre>



<pre class="wp-block-code"><code class="">NAME                 TYPE     DATA   AGE<br>huggingface-secret   Opaque   1      14d</code></pre>



<h4 class="wp-block-heading">2. Create vLLM deployment configuration</h4>



<p>First, you can create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-deployment-2nodes.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-deployment-2nodes.yaml</a></strong></code> file.</p>



<p>Deploy vLLM:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-deployment-2nodes.yaml</code></pre>



<p>You can monitor the deployment (it should take 8-10 minutes for model download and loading).</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n vllm -o wide -w</code></pre>



<p>Expected output after 10 minutes:</p>



<pre class="wp-block-code"><code class="">NAME               READY  STATUS   RESTARTS  AGE  IP       NODE  <br>qwen3-vl-xxxx-yyy  1/1    Running  0         1d   X.X.X.X  vllm-node-yyyyyy<br>qwen3-vl-xxxx-zzz  1/1    Running  0         1d   X.X.X.X  vllm-node-zzzzzz</code></pre>



<p>You can also check the container logs:</p>



<pre class="wp-block-code"><code class="">kubectl logs -f -n vllm &lt;pod-name&gt;</code></pre>



<p>You should find in the logs: &#8220;<code>Uvicorn running on http://0.0.0.0:8000</code>&#8221;</p>
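


<p>Before exposing the service through an Ingress, you can already send a request through a port-forward (a minimal smoke test; the <code>qwen3-vl-service</code> name comes from the linked manifest):</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n vllm svc/qwen3-vl-service 8000:8000 &amp;<br>curl -s http://localhost:8000/v1/models | jq '.data[0].id'</code></pre>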



<p>Is everything installed correctly? Then let&#8217;s move on to the next step.</p>



<h4 class="wp-block-heading">3. Add service label</h4>



<p>Ensure the service has the correct label for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">ServiceMonitor</mark></code></strong> discovery.</p>



<pre class="wp-block-code"><code class="">kubectl label svc qwen3-vl-service -n vllm app=qwen3-vl --overwrite</code></pre>



<p>You can now verify by launching the following command.</p>



<pre class="wp-block-code"><code class="">kubectl get svc qwen3-vl-service -n vllm --show-labels | grep "app=qwen3-vl"</code></pre>



<p>Returning:</p>



<pre class="wp-block-code"><code class="">qwen3-vl-service   ClusterIP   X.X.X.X   &lt;none&gt;   8000/TCP   1d   app=qwen3-vl</code></pre>



<h3 class="wp-block-heading">Step 4 &#8211; Install NGINX ingress controller</h3>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><mark style="color:#cf2e2e" class="has-inline-color">⚠️ <strong>Moving beyond Ingress</strong></mark></p>



<p><strong><em><mark style="color:#cf2e2e" class="has-inline-color">Follow this <a href="https://blog.ovhcloud.com/moving-beyond-ingress-why-should-ovhcloud-managed-kubernetes-service-mks-users-start-looking-at-the-gateway-api/" data-wpel-link="internal">tutorial</a> if you want to use Gateway instead of Ingress.</mark></em></strong></p>
</blockquote>



<h4 class="wp-block-heading">1. Add helm repository and configure Ingress</h4>



<p>First of all, add helm repository:</p>



<pre class="wp-block-code"><code class="">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx<br>helm repo update</code></pre>



<p>Create configuration file with <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/ingress/ingress-nginx-values.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ingress-nginx-values.yaml</a></strong></code>.</p>



<p>Then, install NGINX Ingress:</p>



<pre class="wp-block-code"><code class="">helm install ingress-nginx ingress-nginx/ingress-nginx \<br>  --namespace ingress-nginx \<br>  --create-namespace \<br>  -f ingress-nginx-values.yaml \<br>  --wait</code></pre>



<p>Wait for the LoadBalancer IP; the external IP assignment should take 1-2 minutes.</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n ingress-nginx ingress-nginx-controller -w</code></pre>



<p>Once <code><strong><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></strong></code> is no longer <code>&lt;pending&gt;</code>, press Ctrl+C and export it:</p>



<pre class="wp-block-code"><code class="">export EXTERNAL_IP=&lt;EXTERNAL-IP&gt;<br>echo "API URL: http://$EXTERNAL_IP"</code></pre>



<h4 class="wp-block-heading">2. Create vLLM Ingress resource</h4>



<p>Next, create vLLM Ingress using <strong><code><a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-ingress.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-ingress.yaml</a></code></strong>.</p>



<p>Apply it as follows:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-ingress.yaml</code></pre>



<p>You can now test different API calls to verify that your deployment is functional.</p>



<h4 class="wp-block-heading">3. Test API</h4>



<p>First, check that the model is available:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/models | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "object": "list",<br>  "data": [<br>    {<br>      "id": "qwen3-vl-8b",<br>      "object": "model",<br>      "created": 1772472143,<br>      "owned_by": "vllm",<br>      "root": "Qwen/Qwen3-VL-8B-Instruct",<br>      "parent": null,<br>      "max_model_len": 8192,<br>      "permission": [<br>        {<br>          "id": "modelperm-8fb35cdd3208b068",<br>          "object": "model_permission",<br>          "created": 1772472143,<br>          "allow_create_engine": false,<br>          "allow_sampling": true,<br>          "allow_logprobs": true,<br>          "allow_search_indices": false,<br>          "allow_view": true,<br>          "allow_fine_tuning": false,<br>          "organization": "*",<br>          "group": null,<br>          "is_blocking": false<br>        }<br>      ]<br>    }<br>  ]<br>}</code></pre>



<p>Next, test inference using the following request:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/chat/completions \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "qwen3-vl-8b",<br>    "messages": [{"role": "user", "content": "Count from 1 to 10."}],<br>    "max_tokens": 100<br>  }' | jq '.choices[0].message.content'</code></pre>



<p><code>"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"</code></p>



<p>Great! You&#8217;re almost there…</p>



<h3 class="wp-block-heading">Step 5 &#8211; Install Prometheus stack</h3>



<p>Now, set up the monitoring stack that provides complete observability for&nbsp;<strong>application-level&nbsp;</strong>(vLLM) and&nbsp;<strong>hardware-level</strong>&nbsp;(GPU) metrics:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="763" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg" alt="" class="wp-image-30871" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-300x223.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-768x572.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1536x1144.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture.jpg 1673w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Monitoring architecture</em></figcaption></figure>



<h4 class="wp-block-heading">1. Add helm repository and create namespace</h4>



<p>Add Prometheus helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts<br>helm repo update</code></pre>



<p>Then, create the <code><strong><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></strong></code> Namespace.</p>



<pre class="wp-block-code"><code class="">kubectl create namespace monitoring</code></pre>



<h4 class="wp-block-heading">2. Create Prometheus deployment configuration and installation</h4>



<p>First, create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/prometheus.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">prometheus.yaml</a></strong></code> file.</p>



<p>Install Prometheus stack:</p>



<pre class="wp-block-code"><code class="">helm install prometheus prometheus-community/kube-prometheus-stack \<br>  -n monitoring \<br>  -f prometheus.yaml \<br>  --timeout 10m \<br>  --wait</code></pre>



<p>Now,&nbsp;monitor its installation and wait until the pods are ready:</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n monitoring -w</code></pre>



<p>If all pods are running successfully, you can proceed to the next step.</p>



<h4 class="wp-block-heading">3. Check that the installation is operational</h4>



<p>First, access Grafana in the background:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &amp;</code></pre>



<p>Test Grafana health:</p>



<pre class="wp-block-code"><code class="">curl -s http://localhost:3000/api/health | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "database": "ok",<br>  "version": "12.3.3",<br>  "commit": "2a14494b2d6ab60f860d8b27603d0ccb264336f6"<br>}</code></pre>



<p>You can now access Grafana locally via <strong><a href="http://localhost:3000" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>http://localhost:3000</code></a></strong>. Log in with:</p>



<ul class="wp-block-list">
<li>Login: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">admin</mark></strong></code></li>



<li>Password: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">Admin123!vLLM</mark></strong></code></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="518" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png" alt="" class="wp-image-30804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-300x152.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-768x389.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2.png 1322w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Well done! You can now proceed to the configuration step.</p>



<h3 class="wp-block-heading">Step 6 &#8211; Configure ServiceMonitors</h3>



<p>ServiceMonitors tell Prometheus which endpoints to scrape for metrics.</p>



<h4 class="wp-block-heading">1. Create vLLM ServiceMonitor</h4>



<p>Retrieve the file from the GitHub repository: <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/vllm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-servicemonitor.yaml</a></strong></code>.</p>



<p>Next, apply and check that the ServiceMonitor <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-metrics</mark></strong></code> exists:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-servicemonitor.yaml<br>kubectl get servicemonitor -n vllm</code></pre>



<h4 class="wp-block-heading">2. Create DCGM ServiceMonitor</h4>



<p>First, create the <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/dcgm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dcgm-servicemonitor.yaml</a></strong></code> file.</p>



<p>Once again, apply and verify:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f dcgm-servicemonitor.yaml<br>kubectl get servicemonitor -n gpu-operator</code></pre>



<pre class="wp-block-preformatted"><code>gpu-operator                  1d<br>nvidia-dcgm-exporter          1d<br>nvidia-node-status-exporter   1d</code></pre>



<h4 class="wp-block-heading">3. Configure Prometheus for Cross-Namespace discovery</h4>



<p>Apply a patch to allow Prometheus to discover ServiceMonitors in all namespaces.</p>



<pre class="wp-block-code"><code class="">kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type merge -p '{<br>  "spec": {<br>    "serviceMonitorNamespaceSelector": {},<br>    "podMonitorNamespaceSelector": {}<br>  }<br>}'</code></pre>



<p>Now you have to restart Prometheus.</p>



<ol class="wp-block-list">
<li>Delete Prometheus pod to force configuration reload</li>



<li>Wait for Prometheus to restart</li>
</ol>



<pre class="wp-block-code"><code class="">kubectl delete pod prometheus-prometheus-kube-prometheus-prometheus-0 -n monitoring<br><br>kubectl wait --for=condition=Ready \<br>  pod/prometheus-prometheus-kube-prometheus-prometheus-0 \<br>  -n monitoring \<br>  --timeout=180s</code></pre>



<p>Wait about 2 minutes for discovery and finally, verify targets:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring \<br>  prometheus-prometheus-kube-prometheus-prometheus-0 9090:9090 &amp;</code></pre>



<p>You can open <a href="http://localhost:9090/targets" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://localhost:9090/targets</mark></strong></code></a> in your browser and search for:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></strong></code></li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">dcgm</mark></code></strong></li>
</ul>



<p>Note that the expected targets are: </p>



<ul class="wp-block-list">
<li>serviceMonitor/vllm/vllm-metrics/0   (2/2 UP)</li>



<li>serviceMonitor/gpu-operator/nvidia-dcgm-exporter/0 (2/2 UP)</li>
</ul>
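


<p>You can also confirm that data is flowing with a direct query against the port-forwarded Prometheus API, using one of the DCGM metrics listed earlier:</p>



<pre class="wp-block-code"><code class=""># expect one result per GPU node<br>curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result'</code></pre>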



<h3 class="wp-block-heading">Step 7 &#8211; Create Grafana dashboards</h3>



<p>In this final step, the goal is to create two Grafana dashboards: one tracking the software side with vLLM metrics, and one tracking the hardware metrics that monitor GPU consumption and the system.</p>



<h4 class="wp-block-heading">1. vLLM application metrics</h4>



<p>The dashboard provides insights into vLLM application performance, request handling, and resource utilisation based on the following metrics:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>vllm:request_success_total</code></td><td>Counter</td><td>Total successful requests</td><td>count</td><td>Request Rate, Total Requests</td></tr><tr><td><code>vllm:num_requests_running</code></td><td>Gauge</td><td>Requests currently being processed</td><td>count</td><td>Queue Depth, Active Requests</td></tr><tr><td><code>vllm:num_requests_waiting</code></td><td>Gauge</td><td>Requests waiting in queue</td><td>count</td><td>Queue Depth, Queued Requests</td></tr><tr><td><code>vllm:time_to_first_token_seconds</code></td><td>Histogram</td><td>Latency until first token generated</td><td>seconds</td><td>TTFT P50/P95/P99</td></tr><tr><td><code>vllm:e2e_request_latency_seconds</code></td><td>Histogram</td><td>Total end-to-end latency</td><td>seconds</td><td>E2E Latency P50/P95/P99</td></tr><tr><td><code>vllm:generation_tokens_total</code></td><td>Counter</td><td>Total tokens generated (output)</td><td>count</td><td>Token Generation Rate, Throughput</td></tr><tr><td><code>vllm:prompt_tokens_total</code></td><td>Counter</td><td>Total prompt tokens (input)</td><td>count</td><td>Token Generation Rate, Avg Tokens</td></tr><tr><td><code>vllm:kv_cache_usage_perc</code></td><td>Gauge</td><td>GPU KV cache utilization</td><td>0-1 (0-100%)</td><td>KV Cache Usage</td></tr><tr><td><code>vllm:prefix_cache_hits_total</code></td><td>Counter</td><td>Number of prefix cache hits</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:prefix_cache_queries_total</code></td><td>Counter</td><td>Number of prefix cache queries</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:request_queue_time_seconds</code></td><td>Histogram</td><td>Time spent waiting in queue</td><td>seconds</td><td>Request Queue Time</td></tr><tr><td><code>vllm:request_prefill_time_seconds</code></td><td>Histogram</td><td>Prefill phase time</td><td>seconds</td><td>Prefill Time</td></tr><tr><td><code>vllm:request_decode_time_seconds</code></td><td>Histogram</td><td>Decode phase time</td><td>seconds</td><td>Decode Time</td></tr><tr><td><code>vllm:inter_token_latency_seconds</code></td><td>Histogram</td><td>Latency between each token</td><td>seconds</td><td>Inter-Token Latency</td></tr><tr><td><code>vllm:num_preemptions_total</code></td><td>Counter</td><td>Number of preemptions (OOM)</td><td>count</td><td>Preemptions</td></tr><tr><td><code>vllm:prompt_tokens_cached_total</code></td><td>Counter</td><td>Prompt tokens cached</td><td>count</td><td>Cached Tokens</td></tr><tr><td><code>vllm:request_prompt_tokens</code></td><td>Histogram</td><td>Prompt size distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:request_generation_tokens</code></td><td>Histogram</td><td>Generated tokens distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:iteration_tokens_total</code></td><td>Histogram</td><td>Tokens per iteration</td><td>count</td><td>(Advanced analysis)</td></tr></tbody></table></figure>



<p>This <strong>vLLM Grafana dashboard</strong> is composed of 23 panels:</p>






<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Nombre</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>12</td><td>Request Rate, Queue Depth, TTFT, E2E Latency, Token Gen, Cache Usage, Cache Hit, Queue Time, Prefill/Decode, Inter-Token, Preemptions, Avg Tokens</td></tr><tr><td><strong>Stat</strong></td><td>10</td><td>Throughput, TTFT P95, Active Req, Queued Req, Cache Hit Rate, Cache Usage, Total Req, Total Tokens, Cached Tokens, Preemptions</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Pod Performance</td></tr></tbody></table></figure>



<p>Now create the dashboard using <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/vllm-app-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-app-dashboard.json</a></strong></code>. Then, launch:</p>



<pre class="wp-block-code"><code class="">echo "Importing vLLM application dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @vllm-app-dashboard.json | jq '.status, .url'</code></pre>



<p>Next, you can access the vLLM dashboard and follow metrics in real time:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png" alt="" class="wp-image-30858" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Tracking hardware consumption is just as essential for comprehensive monitoring.</p>



<h4 class="wp-block-heading">2. GPU hardware metrics</h4>



<p>Take advantage of the most useful DCGM metrics to check both the functioning and consumption of your hardware resources:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Normal Thresholds</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>DCGM_FI_DEV_GPU_UTIL</code></td><td>Gauge</td><td>GPU utilization (compute)</td><td>% (0-100)</td><td>70-95% optimal</td><td>GPU Utilization</td></tr><tr><td><code>DCGM_FI_DEV_GPU_TEMP</code></td><td>Gauge</td><td>GPU temperature</td><td>°C</td><td>&lt; 85°C normal</td><td>GPU Temperature</td></tr><tr><td><code>DCGM_FI_DEV_FB_USED</code></td><td>Gauge</td><td>VRAM used</td><td>MB</td><td>Variable by model</td><td>GPU Memory Used</td></tr><tr><td><code>DCGM_FI_DEV_FB_FREE</code></td><td>Gauge</td><td>VRAM free</td><td>MB</td><td>&gt; 2GB recommended</td><td>GPU Memory Free</td></tr><tr><td><code>DCGM_FI_DEV_POWER_USAGE</code></td><td>Gauge</td><td>Power consumption</td><td>Watts</td><td>&lt; 300W (L40S)</td><td>GPU Power Usage</td></tr><tr><td><code>DCGM_FI_DEV_SM_CLOCK</code></td><td>Gauge</td><td>GPU clock speed (compute)</td><td>MHz</td><td>Variable</td><td>GPU Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_MEM_CLOCK</code></td><td>Gauge</td><td>Memory clock speed</td><td>MHz</td><td>Variable</td><td>Memory Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL</code></td><td>Counter</td><td>Total NVLink bandwidth</td><td>bytes/s</td><td>(If multi-GPU)</td><td>NVLink Bandwidth</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_TX_BYTES</code></td><td>Counter</td><td>PCIe data transmitted</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe TX</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_RX_BYTES</code></td><td>Counter</td><td>PCIe data received</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe RX</td></tr><tr><td><code>DCGM_FI_DEV_ECC_DBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC double-bit errors</td><td>count</td><td>0 ideal</td><td>(Health check)</td></tr><tr><td><code>DCGM_FI_DEV_ECC_SBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC single-bit errors</td><td>count</td><td>&lt; 10/day acceptable</td><td>(Health check)</td></tr></tbody></table></figure>



<p>This&nbsp;<strong>hardware Grafana dashboard</strong>&nbsp;is composed of 13 panels with GPU hardware and system metrics. A detailed view is also available for GPU utilisation (%), temperature (°C), VRAM (GB) and power (W).</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Count</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>8</td><td>GPU Util, GPU Mem, GPU Temp, GPU Power, CPU Usage, RAM Usage, Network I/O, Disk I/O</td></tr><tr><td><strong>Stat</strong></td><td>4</td><td>Avg GPU Util, Avg GPU Temp, Total GPU Mem, Total GPU Power</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Hardware Status</td></tr></tbody></table></figure>



<p>Retrieve <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/hardware-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">hardware-dashboard.json</a></strong></code> and load it as follows:</p>



<pre class="wp-block-code"><code class="">echo "Importing hardware dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @hardware-dashboard.json | jq '.status, .url'</code></pre>



<p>Finally, track resource consumption using this hardware dashboard:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png" alt="" class="wp-image-30859" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congratulations! Everything is working. You can now test your model and track the various metrics in real time.</p>



<h3 class="wp-block-heading">Step 8 &#8211; LLM testing and performance tracking</h3>



<p>Start by installing Python dependencies:</p>



<pre class="wp-block-code"><code class="">pip3 install openai tqdm</code></pre>



<p>Set <code><strong><mark class="has-inline-color has-ast-global-color-0-color">APP_URL</mark></strong></code> to your vLLM service external IP (the IP shown in the script is only an example) and launch the performance test using the following <a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/llm-inference-performance-test.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong>Python code</strong></code></a>:</p>



<pre class="wp-block-code"><code class="">import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br>from tqdm import tqdm<br><br>APP_URL = "http://94.23.185.22/v1"<br>MODEL = "qwen3-vl-8b"<br><br>CONCURRENT_WORKERS = 500          # concurrency<br>REQUESTS_PER_WORKER = 10<br>MAX_TOKENS = 200                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># openai compliance<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key="foo"<br>)<br><br># basic metrics<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception as e:<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("\n-&gt; STARTING PERFORMANCE TEST:")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n-&gt; BENCH RESULTS:")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>



<p>Returning:</p>



<pre class="wp-block-preformatted"><code>-&gt; STARTING PERFORMANCE TEST:</code><br><code>Concurrency: 500<br>Total requests: 5000</code><br><code><br>-&gt; BENCH RESULTS:<br>Total requests sent: 5000<br>Successful requests: 5000<br>Errors: 0<br>Total wall time: 225.54s<br>Avg latency: 21.45s<br>Min latency: 6.06s<br>Max latency: 25.19s<br>Throughput: 22.17 req/s</code></pre>



<p>Don&#8217;t forget to track GPU and vLLM metrics in your Grafana dashboards!</p>



<h2 class="wp-block-heading">Conslusion</h2>



<p>This reference architecture demonstrates a<strong>&nbsp;vLLM deployment on OVHcloud Managed Kubernetes Service (MKS)</strong>&nbsp;with comprehensive GPU monitoring. Benefits include:</p>



<ul class="wp-block-list">
<li><strong>High Performance</strong>: GPU-accelerated inference with L40S</li>



<li><strong>Scalability</strong>: Kubernetes-native, horizontal scaling-ready</li>



<li><strong>Reliability</strong>: Health checks, auto-restart, monitoring</li>



<li><strong>API Compatibility</strong>: OpenAI-compatible endpoints</li>



<li><strong>Multimodality</strong>: Vision &amp; text capabilities</li>



<li><strong>Full stack monitoring</strong>: Complete vLLM application and hardware dashboards</li>
</ul>



<h2 class="wp-block-heading">Going Further</h2>



<p>Your current architecture is&nbsp;<strong>functional</strong>. However, if desired, <strong>it can be hardened into a full production-ready&nbsp;solution.</strong></p>



<p><strong>Wish to take production hardening a step further?</strong> Consider the following enhancements:</p>



<ol class="wp-block-list">
<li><strong>Authentication &amp; authorization</strong>
<ul class="wp-block-list">
<li>vLLM API authentication (see the sketch after this list)</li>



<li>Grafana authentication</li>



<li>Prometheus security</li>
</ul>
</li>



<li><strong>High availability &amp; load balancing</strong>
<ul class="wp-block-list">
<li>Grafana high availability with multiple replicas and shared storage</li>



<li>Prometheus high availability</li>



<li>vLLM Horizontal Pod Autoscaling (HPA) based on custom metrics</li>
</ul>
</li>



<li><strong>Data persistence &amp; backup</strong>
<ul class="wp-block-list">
<li>Prometheus long-term storage with persistent storage</li>



<li>Grafana Dashboard Backup</li>
</ul>
</li>



<li><strong>Observability enhancements</strong>
<ul class="wp-block-list">
<li>Distributed tracing by adding OpenTelemetry for request tracing</li>



<li>Alerting rules with production-ready alert rules</li>
</ul>
</li>
</ol>
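

<p>As an illustration of the first point, vLLM&#8217;s OpenAI-compatible server can require an API key (for instance through its <code>--api-key</code> option or the <code>VLLM_API_KEY</code> environment variable). A minimal sketch of an authenticated client call, with placeholder values instead of the IP and dummy key used in the benchmark above, could look like this:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">from openai import OpenAI

# Placeholders: adapt the URL, key and model name to your own deployment.
client = OpenAI(
    base_url="http://&lt;YOUR_LOAD_BALANCER_IP&gt;/v1",
    api_key="&lt;YOUR_VLLM_API_KEY&gt;",  # must match the key configured server-side
)

response = client.chat.completions.create(
    model="qwen3-vl-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(response.choices[0].message.content)</code></pre>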



<p></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability%2F&amp;action_name=Reference%20Architecture%3A%20Deploying%20a%20vision-language%20model%20with%20vLLM%20on%20OVHcloud%20MKS%20for%20high%20performance%20inference%20and%20full%20observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Pushing beyond the limits of embedded real-time AI for edge devices</title>
		<link>https://blog.ovhcloud.com/pushing-beyond-the-limits-of-embedded-real-time-ai-for-edge-devices/</link>
		
		<dc:creator><![CDATA[Katya Guez]]></dc:creator>
		<pubDate>Thu, 03 Apr 2025 22:44:40 +0000</pubDate>
				<category><![CDATA[OVHcloud Startup Program]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28560</guid>

					<description><![CDATA[Startup highlight: Interview with Kevin Conley, CEO at Applied Brain Research (ABR) Applied Brain Research (ABR) is a fabless semiconductor company founded by a team from the University of Waterloo’s Centre for Theoretical Neuroscience, under the leadership of Dr. Chris Eliasmith, the Centre’s founding chair, to commercialize brain-inspired AI inference solutions. Can you introduce Applied [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fpushing-beyond-the-limits-of-embedded-real-time-ai-for-edge-devices%2F&amp;action_name=Pushing%20beyond%20the%20limits%20of%20embedded%20real-time%20AI%20for%20edge%20devices&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h4 class="wp-block-heading"><strong><em>Startup highlight:</em> Interview with Kevin Conley, CEO at Applied Brain Research (ABR)</strong></h4>



<p><a href="https://www.appliedbrainresearch.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Applied Brain Research (ABR)</a> is a fabless semiconductor company founded by a team from the University of Waterloo’s Centre for Theoretical Neuroscience, under the leadership of <a href="https://www.linkedin.com/in/chris-eliasmith/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dr. Chris Eliasmith</a>, the Centre’s founding chair, to commercialize brain-inspired AI inference solutions.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><a href="https://www.appliedbrainresearch.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="144" height="52" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/PastedGraphic-4.jpg" alt="" class="wp-image-28553" style="width:222px;height:auto"/></a></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Can you introduce Applied Brain Research and its mission?</strong></p>



<p>ABR&#8217;s mission is to bring advanced time series inference out of data centers by empowering edge devices with advanced time series AI capability. Out of its research, ABR invented the Legendre Memory Unit which has established a new chapter of state space models for advanced time series processing. To enable low power edge devices with advanced capabilities like low latency natural language interfaces, ABR has developed an <a href="https://www.appliedbrainresearch.com/technology" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LMU powered time series processor AI accelerator ASIC </a>that will enter the market this year. Our CEO, <a href="https://www.linkedin.com/in/kconley/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kevin Conley</a>, is a semiconductor vet who has built billion dollar businesses in the past and plans to do the same with ABR’s cutting edge technology.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="885" height="441" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/Screenshot-2025-04-03-at-5.12.56-PM.png" alt="" class="wp-image-28552" style="width:634px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/Screenshot-2025-04-03-at-5.12.56-PM.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Screenshot-2025-04-03-at-5.12.56-PM-300x149.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Screenshot-2025-04-03-at-5.12.56-PM-768x383.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>What challenges did ABR face before partnering with OVHcloud?</strong></p>



<p>Our business depends on our ability to train advanced AI models for edge device applications, which creates technical and financial challenges. This requires the <strong>availability of <a href="https://www.ovhcloud.com/en-ca/public-cloud/gpu/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">advanced GPUs</a> for network training</strong>. Successfully optimizing neural networks for our chip is critical to our success. Our main challenges have been budgetary and scaling our R&amp;D efforts. We decided to explore cloud solutions because investment in our own training capability would be prohibitive both from a cost and management perspective.</p>



<p><strong>How did OVHcloud and the Startup Program help you overcome these challenges?</strong></p>



<p>The <a href="https://startup.ovhcloud.com/en-ca/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Startup Program</a> has let us explore different ways of leveraging OVHcloud’s resources in order to train next generation models, and fine-tune them in ways that would be very difficult to do in house. It lets us quickly expand or <strong>focus our efforts without having all of the infrastructure headaches</strong> that come along with that typically.</p>



<p><strong>Which OVHcloud services or features do you use, and how do they stand out from other solutions?</strong></p>



<p>We use the <a href="https://www.ovhcloud.com/en-ca/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Public Cloud service from OVHcloud</a>. Compared to other services from Google, Amazon, etc., OVHCloud provides the <strong>most cost-effective solution per FLOP</strong>.</p>



<p><strong>How has OVHcloud’s support helped you evolve your infrastructure to meet the demands of your business?</strong></p>



<p>The nature of our AI development cycle means that our usage of AI training hardware fluctuates over time. OVHcloud&#8217;s <a href="https://www.ovhcloud.com/en-ca/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">public cloud</a> allows us to <strong>dynamically scale up and down our AI training hardware </strong>in a cost effective manner.</p>



<p><strong>What tangible results have you achieved since collaborating with OVHcloud?</strong></p>



<p>We have pushed our networks to be the smallest possible ASR networks with the highest accuracy. This was possible because we could do hyperparameter searching using OVHcloud’s infrastructure. Specifically we’ve gotten <strong>less than 5% word error rates on full vocabulary speech transcription with a tiny 8 million parameter quantized network</strong>. We have built other state-of-the-art networks for TTS and Voice Control using the same infrastructure, but we haven’t announced their general availability yet on our platform.</p>



<p>Providing these networks as starting points for customers to use our chip greatly reduces the barrier to entry for taking advantage of our technologies. Broadening the pre-trained library available to customers will only improve that going forward, and this will be much more efficient using OVHcloud than doing it in house.</p>



<p><strong>What are your ambitions for the future of your startup, and how do you see it evolving within the cloud ecosystem? What future challenges do you foresee?</strong></p>



<p>We plan to offer our no code environment to all of our customers, which will effectively scale as we grow, and allow customers to build, train, and deploy all manner of models on our chip. This SaaS offering will be crucial as we deploy our chip to many markets. Our fundamental advantage is for <strong>time series processing</strong>, i.e. problems where the order of the data in time is important for making decisions. This includes everything from <a href="https://www.appliedbrainresearch.com/applications" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">speech and language processing</a> to heartbeat monitoring and fall detection. As a result, we are building a versatile, cloud-hosted development environment that will require rapid scalability.</p>



<p><strong>What advice would you give to other growth-stage startups considering the cloud or joining a support program?</strong></p>



<p>Probably the most important piece of advice is to <strong>take advantage of everything that’s provided</strong>. This requires some commitment on the side of the startup, but going to the meetings, asking questions, and leveraging the resources is the only way to get the most out of the program.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<figure class="wp-block-image size-large"><a href="https://startup.ovhcloud.com/en-ca/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><img loading="lazy" decoding="async" width="1024" height="341" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1-1024x341.jpg" alt="" class="wp-image-28562" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1-1024x341.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1-300x100.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1-768x256.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p><a href="https://www.appliedbrainresearch.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Applied Brain Research</a>’s journey with OVHcloud, joining the Startup Program then AI Accelerator, highlights how a startup can make the most of available resources to overcome challenges, achieve sustainable growth, and scale. If you’re a startup looking to transform your business, we encourage you to join the <strong><a href="https://startup.ovhcloud.com/en/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud Startup Program</a></strong> or contact OVHcloud to discover how our solutions can support your journey!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fpushing-beyond-the-limits-of-embedded-real-time-ai-for-edge-devices%2F&amp;action_name=Pushing%20beyond%20the%20limits%20of%20embedded%20real-time%20AI%20for%20edge%20devices&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Enhancing Customer Service with Interactive Avatars</title>
		<link>https://blog.ovhcloud.com/enhancing-customer-service-with-interactive-avatars/</link>
		
		<dc:creator><![CDATA[Leonard Pommereau]]></dc:creator>
		<pubDate>Thu, 20 Mar 2025 14:24:19 +0000</pubDate>
				<category><![CDATA[OVHcloud Startup Program]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[Scaleups]]></category>
		<category><![CDATA[Startup Program]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28147</guid>

					<description><![CDATA[Startup highlight: Interview with Fatma Chelly, Marketing Manager at Jumbo Mana Jumbo Mana is a deep-tech startup founded in 2022. Specializing in Agentic AI, the company creates conversational solutions, including avatars and digital assistants that provide precise, fast, reliable and engaging answers. Can you introduce Jumbo Mana and its mission? Jumbo Mana’s solution distinguishes by [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fenhancing-customer-service-with-interactive-avatars%2F&amp;action_name=Enhancing%20Customer%20Service%20with%20Interactive%20Avatars&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h4 class="wp-block-heading"><strong><em>Startup highlight:</em> Interview with Fatma Chelly, Marketing Manager at Jumbo Mana</strong></h4>



<p><a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jumbo Mana</a> is a deep-tech startup founded in 2022. Specializing in Agentic AI, the company creates conversational solutions, including avatars and digital assistants that provide precise, fast, reliable and engaging answers.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Can you introduce Jumbo Mana and its mission?</strong></p>



<p><a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jumbo Mana</a>’s solution distinguishes by a high level of adaptability making it easy to integrate for any customer, as well as a strong degree of accuracy in its answers. Jumbo Mana aims to offer (i) alternative solutions to sectors suffering from low-skilled labour recruitment issues (airports, subways, hospitality, education, etc.) and (ii) powerful digital assistants providing true and efficient answers to users with a high level of comprehension.</p>



<p>Jumbo Mana is at the forefront of the rapidly growing Agentic AI market and already delivers high-accuracy solutions that drive engagement across a wide range of industries.</p>



<p></p>



<figure class="wp-block-image aligncenter size-full is-resized"><a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="750" height="328" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/JumboMana-assistant.png" alt="" class="wp-image-28151" style="width:544px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/JumboMana-assistant.png 750w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/JumboMana-assistant-300x131.png 300w" sizes="auto, (max-width: 750px) 100vw, 750px" /></a></figure>



<p></p>



<p>Our guiding philosophy is to create <strong>technology that serves humanity</strong>. We envision developing a <strong>companion assistant</strong> that adapts to individual needs, simplifies daily life, and enhances human interaction. Our ultimate goal is to become the <strong>global leader in interactive AI avatars</strong>, combining technological innovation with a commitment to improving quality of life.</p>



<p><strong>What challenges did Jumbo Mana face before partnering with OVHcloud?</strong></p>



<p>Jumbo Mana primarily faced technical and operational challenges. One of the key issues was <strong>scalability</strong>: the rapid growth of our SaaS platform&#8217;s user base required an infrastructure capable of handling high traffic volumes while maintaining consistent performance and ensuring a seamless user experience. Additionally, <strong>security</strong> was a critical concern, as we needed to protect sensitive client data from cyber threats and preserve customer trust. Finally, <strong>data privacy and compliance</strong> posed a significant challenge, as strict adherence to the GDPR (General Data Protection Regulation) was mandatory, necessitating solutions that guaranteed data privacy and ensured full compliance with European regulations.</p>



<p>Jumbo Mana chose OVHcloud solutions as the ideal way to address our challenges. <strong>Scalability</strong> was ensured through cloud platforms that offer on-demand resources, allowing us to easily adjust to traffic fluctuations without requiring significant upfront hardware investments. <strong>Enhanced security</strong> was another key advantage, with OVHcloud providing advanced built-in security protocols and tools to protect the platform from evolving threats. In terms of <strong>regulatory compliance</strong>, OVHcloud’s adherence to GDPR standards and expertise in data privacy gave Jumbo Mana confidence in meeting legal obligations without compromise. Lastly, the <strong>cost-effectiveness</strong> of cloud solutions offered a more predictable and manageable cost structure compared to maintaining physical infrastructure, enabling better resource allocation.</p>



<p><strong>How did OVHcloud and the Startup Program help you overcome these challenges?</strong></p>



<p>OVHcloud’s wide array of services provided a strong foundation for Jumbo Mana’s technical architecture. The <a href="https://www.ovhcloud.com/en/public-cloud/gpu/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GPU cluster</a> allowed us to deploy our Triton server, which hosts and serves our AI models, enabling real-time performance and efficient inference. With OVHcloud’s <a href="https://www.ovhcloud.com/en/public-cloud/prices/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">competitive pricing</a>, we were able to deploy enough replicas of our AI services to ensure a robust infrastructure capable of handling high workloads. This not only increased the reliability of our platform but also allowed us to scale seamlessly as our user base grew. Moreover, the ease of obtaining additional quotas from OVHcloud made it simple to scale our infrastructure quickly, supporting the platform’s growth without operational bottlenecks.</p>



<p><strong>Which OVHcloud services or features do you use, and how do they stand out from other solutions?</strong></p>



<p>For the backend, we deployed the entire stack of backend microservices, APIs, and databases on OVHcloud&#8217;s<a href="https://www.ovhcloud.com/en/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> Kubernetes-managed clusters</a>. This streamlined the deployment process and allowed us to efficiently manage scaling, updates, and monitoring of all services. Kubernetes also provided the flexibility needed to adapt our deployments to fluctuating workloads, ensuring consistent performance for our users.</p>



<p><strong>How has OVHcloud&#8217;s support helped you evolve your infrastructure to meet the demands of your business?</strong></p>



<p>One of the key highlights of our collaboration was OVHcloud’s active and responsive customer support. Whenever challenges arose, their team provided quick and effective solutions, ensuring that any issues were resolved with minimal downtime. The <a href="https://startup.ovhcloud.com/en/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Startup Program</a> further complemented this with infrastructure credits and access to a vibrant ecosystem of partners and startups, helping us optimize our architecture for growth and performance.</p>



<figure class="wp-block-image aligncenter size-large"><a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="1024" height="465" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2-1024x465.png" alt="" class="wp-image-28154" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2-1024x465.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2-300x136.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2-768x349.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2.png 1187w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<p>With OVHcloud’s cost-effective solutions, scalability, and excellent support, <a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jumbo Mana</a> was able to overcome its challenges and establish a high-performing, reliable infrastructure. This partnership has been instrumental in enabling us to focus on our mission of delivering innovative Agentic AI solutions, while OVHcloud handles the complexities of infrastructure management.</p>



<p><strong>What tangible results have you achieved since collaborating with OVHcloud?</strong></p>



<p>Since partnering with OVHcloud, we have achieved significant advancements in the performance and scalability of our platform. Thanks to OVHcloud’s reliable infrastructure and robust <a href="https://www.ovhcloud.com/en/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes-managed clusters</a>, we have maintained zero downtime, even during high traffic periods or unexpected surges in demand. This reliability is critical for delivering a seamless user experience and meeting the expectations of our clients in industries like airports and education.</p>



<p>Our platform’s capacity has <strong>tripled</strong>, now supporting three times the number of concurrent users. This scalability is driven by the ability to dynamically allocate resources and deploy additional replicas as needed, enabling us to handle rapid growth across sectors like airports and education.</p>



<p>We’ve also <strong>doubled our GPU capacity while staying within the same budget</strong>, thanks to OVHcloud’s competitive GPU pricing. This enhancement has significantly boosted our ability to handle high-performance AI workloads, ensuring fast and efficient service delivery for our clients.</p>



<p><strong>What are your ambitions for the future of your startup, and how do you see it evolving within the cloud ecosystem? What future challenges do you foresee?</strong></p>



<p>Our future is focused on <strong>international expansion</strong> and significantly growing our client base. By leveraging OVHcloud’s scalable infrastructure and global presence, we aim to adapt to the needs of diverse markets while maintaining high performance and compliance. </p>



<p>As we grow, we anticipate challenges like <strong>global scalability</strong>, <strong>regulatory compliance across regions</strong>, and <strong>cost management at scale</strong>. The flexibility and innovation provided by OVHcloud’s ecosystem will play a critical role in helping us overcome these obstacles and achieve sustainable success.</p>



<p><strong>What advice would you give to other growth-stage startups considering the cloud or joining a support program?</strong></p>



<p>Pick a cloud that supports your growth with <strong>scalability, security, and compliance</strong>. Make the most of support programs to grow faster and stay focused on your business.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<figure class="wp-block-image size-large"><a href="https://startup.ovhcloud.com/en/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="1024" height="253" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-1024x253.png" alt="" class="wp-image-28155" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-1024x253.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-300x74.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-768x190.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-1536x379.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1.png 1870w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<p>Jumbo Mana&#8217;s journey with OVHcloud highlights how the right cloud partnership can help startups overcome challenges, achieve sustainable growth, and scale globally. If you’re a startup looking to transform your business, we encourage you to join the <strong><a href="https://startup.ovhcloud.com/en/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Startup Program</a></strong> or contact OVHcloud to discover how our solutions can support your journey!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fenhancing-customer-service-with-interactive-avatars%2F&amp;action_name=Enhancing%20Customer%20Service%20with%20Interactive%20Avatars&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Five ways to develop sovereign, sustainable AI solutions</title>
		<link>https://blog.ovhcloud.com/five-ways-to-develop-sovereign-sustainable-ai-solutions/</link>
		
		<dc:creator><![CDATA[Cezary Skarzynski]]></dc:creator>
		<pubDate>Mon, 27 Jan 2025 15:07:21 +0000</pubDate>
				<category><![CDATA[OVHcloud Startup Program]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Data Sovereignty]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[Startup Program]]></category>
		<category><![CDATA[Sustainability]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28039</guid>

					<description><![CDATA[Now that organisations understand AI and what it can achieve, businesses around the world are focusing on how to build it responsibly. Three of the five main themes at the Paris AI Action Summit examine the need for responsible AI, with separate streams on trust, public interest and good governance. These themes are not simple. [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffive-ways-to-develop-sovereign-sustainable-ai-solutions%2F&amp;action_name=Five%20ways%20to%20develop%20sovereign%2C%20sustainable%20AI%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Now that organisations understand AI and what it can achieve, businesses around the world are focusing on how to build it responsibly. Three of the five main themes at the <a href="https://www.elysee.fr/en/sommet-pour-l-action-sur-l-ia" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris AI Action Summit</a> examine the need for responsible AI, with separate streams on trust, public interest and good governance.</p>



<p>These themes are not simple. In addition to the core function of AI tools – for example, considering what an AI app does, how it does it, and whether bias is present – most businesses are starting to realise that they need to consider the deeper ‘AI supply chain’.</p>



<p>This is not just altruistic. A number of LLM tools are currently facing the risk of lawsuits for copyright infringement, because they may have been trained without due content permission. AI tools that present biased results are quickly exposed in the press, leading to reputational damage and a loss of customer trust. Some countries also have legislation permitting data usage for economic intelligence purposes – but in another region, this may represent a data breach. AI has also received negative publicity for ‘running hot’ and consuming large amounts of energy and water in datacenters.</p>



<p>However, AI can also be a tremendous force for good – if handled correctly. So, what should businesses be thinking about so that they get the most from AI, without incurring undue commercial or reputational risk?</p>



<h5 class="wp-block-heading"><strong>1- Consider Sovereignty from the Start</strong></h5>



<p>Understand your data ‘supply chain’ from the very beginning of the process. For example, if you’re using an external LLM for a chatbot, where was this developed? Which data was it trained on, and was this data acquired ethically?</p>



<p>“AI can often be a black box when it comes to processing data,” says Lex Avstreikh, Strategy Lead for Stockholm-based AI firm Hopsworks. “It’s far too complex to show how the system arrived at any one decision. But if you can show people the inputs and the outputs, then that goes a long way to building transparency and trust.”</p>



<h5 class="wp-block-heading"><strong>2- Plan for a Sovereign Future</strong></h5>



<p>It’s important to think about where data will be during its future lifecycle – will you be running in an external datacenter, and where will data be in transit and at rest? Where are the headquarters of the datacenter company in question and what does this mean from a regulatory and handling perspective? Perhaps most importantly, will your customers be happy with all of these arrangements?</p>



<p>This was the decision journey faced by Swedish AI firm Ebbot. In July 2020, the Data Protection Commissioner v. Facebook Ireland case, commonly referred to as Schrems II, resulted in the Court of Justice of the European Union (CJEU) issuing a decision that tightened the requirements around data protection and processing. Ebbot recognised the importance of data security and compliance and thus made it a priority to store and process all data within the EU.</p>



<h5 class="wp-block-heading"><strong>3- Location, location, location</strong></h5>



<p>Location isn’t just an important sovereignty concern – it’s also crucial to sustainability. Although Scandinavia may have very green energy, it’s easy to forget that many cloud providers will offer geographical ‘computing zones’ rather than defined locations, which can result in a less green footprint. CPU- and GPU-intensive tasks like model training should be run in green energy zones wherever possible, and are rarely latency-dependent; consequently, you can locate them far away if necessary.</p>



<p>When your AI app goes into production, also remember that backup and redundancy are a necessity – but will also increase your carbon footprint. Consider having a ‘low power’ or passive backup if commercially feasible – it will take longer to bring online in the case of emergency, but you’ll be consuming less power.</p>



<h5 class="wp-block-heading"><strong>4- Always Consider Necessity</strong></h5>



<p>A lot of organisations only consider hardware efficiency and power consumption during the development process, but green software is rapidly gaining popularity. Having efficient code which is still fit for purpose can have a huge impact on power consumption, particularly if you’re building an app for very broad use. “We’ll definitely see more efficient and specific LLMs, because they’re absolutely needed,” added Avstreikh.</p>



<p>Although organisations often consider the cost of development through FinOps initiatives, we are also seeing the dawn of GreenOps, which ensures that technology is as green as possible from end to end. To that effect, consider benchmarking the CPU and memory usage of your application, because less hardware-intensive apps are usually less power-hungry, as shown in the sketch below.</p>
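

<p>Such benchmarking does not have to be elaborate. As a rough sketch, a few lines of Python with the <code>psutil</code> library (one option among many profiling tools) can already give you an idea of an application&#8217;s CPU and memory appetite:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import psutil

process = psutil.Process()  # the current application process

# Sample CPU and resident memory usage once per second for a short window.
for _ in range(10):
    cpu = process.cpu_percent(interval=1)         # % of one core over the interval
    rss_mb = process.memory_info().rss / 1024**2  # resident memory in MiB
    print(f"CPU: {cpu:5.1f}%  RSS: {rss_mb:.1f} MiB")</code></pre>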



<h5 class="wp-block-heading"><strong>5- Re-use, recycle</strong></h5>



<p>Developing bespoke code can make sure that it’s as lean and efficient as possible, but it can also use needless computing power to develop. Many technology organisations will offer PaaS offerings that can automate common parts of the application development and deployment process. For example, consider our <a href="https://ovh.commander1.com/c3/?tcs=3810&amp;chn=display&amp;src=partnership&amp;cty_ads=multi&amp;lang_ads=en&amp;cty=US&amp;unvrse=multi&amp;pcat=multi&amp;subtpc=undefinite&amp;tactic=awrns&amp;objv=impressions&amp;site_domain=https://labs.ovhcloud.com&amp;cmp=display_PR_multi_en_US_multi_multi_undefinite_awrns_impressions&amp;crtive=dimg_image_728x90_STN-NE&amp;url=https%3A%2F%2Flabs.ovhcloud.com%2Fen%2Fai-endpoints%2F%3Fat_medium%3Ddisplay%26at_campaign%3Dpartnership%26at_creation%3Ddisplay_PR_multi_en_US_multi_multi_undefinite_awrns_impressions%26at_variant%3Ddimg_image_728x90_STN-NE" data-wpel-link="exclude">AI Endpoints solution</a>, which helps developers to access other AI models, from Bert to Mistral to Llama, all using a simple API.</p>



<p>This is not an easy process, but establishing responsible AI conduct in your organisation’s DNA will avoid complications further down the road, and also show customers that you are handling data – including theirs – in a responsible, secure way. With increasing numbers of organisations tracking not only their Scope 3 emissions, but also their data supply chains in a more comprehensive fashion, sovereignty and sustainability are two clear ‘musts’ for any modern AI company.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<figure class="wp-block-image aligncenter size-large is-resized"><a href="https://startup.ovhcloud.com/en/accelerator/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="1024" height="253" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-1024x253.png" alt="" class="wp-image-28042" style="width:626px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-1024x253.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-300x74.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-768x190.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-1536x379.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner.png 1870w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<p><em>If you’re a startup or scale-up building an AI solution, and would like to work with a sovereign, sustainable cloud provider in turn, you can find more information about OVHcloud – including our cloud credit scheme – on our <a href="https://ovh.commander1.com/c3/?tcs=3810&amp;chn=display&amp;src=partnership&amp;cty_ads=multi&amp;lang_ads=en&amp;cty=US&amp;unvrse=multi&amp;pcat=multi&amp;subtpc=undefinite&amp;tactic=awrns&amp;objv=impressions&amp;site_domain=https://startup.ovhcloud.com&amp;cmp=display_PR_multi_en_US_multi_multi_undefinite_awrns_impressions&amp;crtive=dimg_image_728x90_STN-NE&amp;url=https%3A%2F%2Fstartup.ovhcloud.com%2Fen%2F%3Fat_medium%3Ddisplay%26at_campaign%3Dpartnership%26at_creation%3Ddisplay_PR_multi_en_US_multi_multi_undefinite_awrns_impressions%26at_variant%3Ddimg_image_728x90_STN-NE" data-wpel-link="exclude">startup hub</a>.</em></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffive-ways-to-develop-sovereign-sustainable-ai-solutions%2F&amp;action_name=Five%20ways%20to%20develop%20sovereign%2C%20sustainable%20AI%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to serve LLMs with vLLM and OVHcloud AI Deploy</title>
		<link>https://blog.ovhcloud.com/how-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 29 May 2024 12:22:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMA]]></category>
		<category><![CDATA[LLaMA 3]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Mixtral]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26762</guid>

					<description><![CDATA[In this tutorial, we will learn how to serve Large Language Models (LLMs) using vLLM and the OVHcloud AI Products.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy%2F&amp;action_name=How%20to%20serve%20LLMs%20with%20vLLM%20and%20OVHcloud%20AI%20Deploy&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of serving large language models (LLMs), providing step-by-step instructions</em>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="345" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png" alt="" class="wp-image-25615" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-768x259.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1536x518.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-2048x690.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>



<h3 class="wp-block-heading">Introduction</h3>



<p>In recent years, <strong>large language models</strong> (LLMs) have become increasingly <strong>popular</strong>, with <strong>open-source</strong> models like <em>Mistral</em> and <em>LLaMA</em> gaining widespread attention. In particular, the <em>LLaMA 3</em> model, released on <em>April 18, 2024</em>, is one of today&#8217;s most powerful open-source LLMs.</p>



<p>However, <strong>serving these LLMs can be challenging</strong>, particularly on hardware with limited resources. Indeed, even on expensive hardware, LLMs can be surprisingly slow, with high VRAM utilization and throughput limitations.</p>



<p>This is where<strong><em> </em></strong><em><a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>vLLM</strong></a></em> comes in. <em><strong>vLLM</strong></em> is an <strong>open-source project</strong> that enables <strong>fast and easy-to-use LLM inference and serving</strong>. Designed for optimal performance and resource utilization, <em>vLLM</em> supports a range of <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLM architectures</a> and offers <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">flexible customization options</a>. That&#8217;s why we are going to use it to efficiently deploy and scale our LLMs.</p>



<h3 class="wp-block-heading">Objective</h3>



<p>In this guide, you will discover how to deploy an LLM thanks to <a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em></a> and the <strong><em>AI Deploy</em></strong> <em>OVHcloud</em> solution. This will enable you to benefit from <em>vLLM</em>&#8216;s optimisations and <em>OVHcloud</em>&#8216;s GPU computing resources. Your LLM will then be exposed through a secured API.</p>



<p>🎁 And for those who do not want to bother with the deployment process, <strong>a surprise awaits you at the <a href="#AI-ENDPOINTS">end of the article</a></strong>. We are going to introduce you to our new solution for using LLMs, called <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Endpoints</strong></a>. This product makes it easy to integrate AI capabilities into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it&#8217;s in alpha, it&#8217;s <strong>free</strong>!</p>



<h3 class="wp-block-heading">Requirements</h3>



<p>To deploy your <em>vLLM</em> server, you need:</p>



<ul class="wp-block-list">
<li>An <em>OVHcloud</em> account to access the <a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude"><em>OVHcloud Control Panel</em></a></li>



<li>A <em>Public Cloud</em> project</li>



<li>A <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for the AI Products</a>, related to this <em>Public Cloud</em> project</li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The <em>OVHcloud AI CLI</em></a> installed on your local computer (to interact with the AI products by running commands). </li>



<li><a href="https://www.docker.com/get-started" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a> installed on your local computer, <strong>or</strong> access to a Debian Docker Instance, which is available on the <a href="https://www.ovh.com/manager/public-cloud/" data-wpel-link="exclude"><em>Public Cloud</em></a></li>
</ul>



<p>Once these conditions have been met, you are ready to serve your LLMs.</p>



<h3 class="wp-block-heading">Building a Docker image</h3>



<p>Since the <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>OVHcloud AI Deploy</em></a> solution is based on <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Docker</em></a> images, we will be using a <em>Docker</em> image to deploy our <em>vLLM</em> inference server. </p>



<p>As a reminder, <em>Docker</em> is a platform that allows you to create, deploy, and run applications in containers. <em>Docker</em> containers are standalone and executable packages that include everything needed to run an application (code, libraries, system tools).</p>



<p>To create this <em>Docker</em> image, we will need to write the following <em><strong>Dockerfile</strong></em> into a new folder:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">mkdir my_vllm_image
nano Dockerfile</code></pre>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 🐳 Base image
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 👱 Set the working directory inside the container
WORKDIR /workspace

# 📚 Install missing system packages (git) so we can clone the vLLM project repository
RUN apt-get update &amp;&amp; apt-get install -y git
RUN git clone https://github.com/vllm-project/vllm/

# 📚 Install the Python dependencies
RUN pip3 install --upgrade pip
RUN pip3 install vllm 

# 🔑 Give correct access rights to the OVHcloud user
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace</code></pre>



<p>Let&#8217;s take a closer look at this <em>Dockerfile</em> to understand it:</p>



<ul class="wp-block-list">
<li><strong>FROM</strong>: Specify the base image for our <em>Docker</em> Image. We choose the <em>PyTorch</em> image since it comes with <em>CUDA</em>, <em>CuDNN</em> and <em>torch</em>, which is needed by <em>vLLM</em>. </li>



<li><strong>WORKDIR /workspace</strong>: We set the working directory for the <em>Docker</em> container to <em>/workspace</em>, which is the default folder when we use <em>AI Deploy</em>.</li>



<li><strong>RUN</strong>: Executes commands at build time. Here, we install <em>git</em> so we can clone the <a href="https://github.com/vllm-project/vllm/tree/main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em> repository</a> into the <em>/workspace</em> directory, then upgrade <em>pip</em> to the latest version (to make sure we have access to the latest libraries and dependencies) and install the <em>vLLM</em> library.</li>



<li><strong>ENV</strong> HOME=/workspace: This sets the <em>HOME</em> environment variable to <em>/workspace</em>. This is a requirement to use the <em>OVHcloud</em> AI Products.</li>



<li><strong>RUN chown -R 42420:42420 /workspace</strong>: This changes the owner of the <em>/workspace</em> directory to the user and group with IDs of <em>42420</em> (<em>OVHcloud</em> user). This is also a requirement to use the <em>OVHcloud</em> AI Products.</li>
</ul>



<p>This <em>Dockerfile</em> does not contain a <strong>CMD</strong> instruction and therefore does not launch our <em>vLLM</em> server. Do not worry about that, we will do it directly from <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a>&nbsp;to have more flexibility.</p>



<p>Once your Dockerfile is written, launch the following command to build your image:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker build . -t vllm_image:latest</code></pre>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>Once you have built the Docker image, you will need to push it to a <strong>registry</strong> to make it accessible from <em>AI Deploy</em>. A <strong>registry</strong> is a service that allows you to store and distribute <em>Docker</em> images, making it easy to deploy them in different environments.</p>



<p>Several registries can be used (<em><a href="https://www.ovhcloud.com/en-gb/public-cloud/managed-private-registry/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Private Registry</a>, <a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a>, <a href="https://github.com/features/packages" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub packages</a>, &#8230;</em>). In this tutorial, we will use the <strong><em>OVHcloud</em> <em>shared registry</em></strong>. More information is available in the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-manage-registries?id=kb_article_view&amp;sysparm_article=KB0057949" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registries documentation</a>.</p>



<p>To find the address of your shared registry, use the following command (<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a> needs to be installed on your computer):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">ovhai registry list</code></pre>



<p>Then, log in on your <em>shared registry</em> with your usual <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Platform user</em></a> credentials:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>Once you are logged in to the registry, tag the compiled image and push it into your shared registry:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker tag vllm_image:latest &lt;shared-registry-address&gt;/vllm_image:latest
docker push &lt;shared-registry-address&gt;/vllm_image:latest</code></pre>



<h3 class="wp-block-heading">vLLM inference server deployment</h3>



<p>Once your image has been pushed, it can be used with <em>AI Deploy</em>, using either the <em>ovhai CLI</em> or the <em>OVHcloud Control Panel (UI)</em>.</p>



<h5 class="wp-block-heading">Creating an access token </h5>



<p>Tokens are used as unique authenticators to securely access the <em>AI Deploy</em> apps. By creating a token, you can ensure that only authorized requests are allowed to interact with the <em>vLLM</em> endpoint. You can create this token by using the <em>OVHcloud Control Panel (UI)</em> or by running the following command:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai token create vllm --role operator --label-selector name=vllm</code></pre>



<p>This will give you a token that you will need to keep.</p>



<h5 class="wp-block-heading">Creating a Hugging Face token (optionnal)</h5>



<p>Note that some models, such as <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 3</a>, require you to accept their license. Hence, you need to create a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>, accept the model’s license, and generate a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">token</a> from your account settings; this token will allow you to access the model.</p>



<p>For example, when visiting the Hugging Face <a href="https://huggingface.co/google/gemma-2b" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gemma model page</a>, you’ll see this (if you are logged in):</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="716" height="312" src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png" alt="accept_model_conditions_hugging_face" class="wp-image-26768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png 716w, https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21-300x131.png 300w" sizes="auto, (max-width: 716px) 100vw, 716px" /></figure>



<p>If you want to use this model, you will have to acknowledge the license, and then make sure to create a token in the <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tokens section</a>.</p>



<p>In the next step, we will set this token as an environment variable (named <code>HF_TOKEN</code>). Doing this will enable us to use any LLM whose conditions of use we have accepted.</p>



<h5 class="wp-block-heading">Run the AI Deploy application</h5>



<p>Run the following command to deploy your <em>vLLM</em> server by running your customized <em>Docker</em> image:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai app run &lt;shared-registry-address&gt;/vllm_image:latest \
  --name vllm_app \
  --flavor h100-1-gpu \
  --gpu 1 \
  --env HF_TOKEN="&lt;YOUR_HUGGING_FACE_TOKEN&gt;" \
  --label name=vllm \
  --default-http-port 8080 \
  -- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code></pre>



<p><em>Just replace the registry address with the one you used, and set the name of the LLM you want to serve. Also pay attention to the image name, its tag, and the label selector if you haven&#8217;t used the same ones as those given in this tutorial.</em></p>



<p><strong>Parameters explanation</strong></p>



<ul class="wp-block-list">
<li><code>&lt;shared-registry-address&gt;/vllm_image:latest</code> is the image on which the app is based.</li>



<li><code>--name vllm_app</code> is an optional argument that allows you to give your app a custom name, making it easier to manage all your apps.</li>



<li><code>--flavor h100-1-gpu</code> indicates that we want to run our app on H100 GPU(s). You can access the full list of available GPUs by running <code>ovhai capabilities flavor list</code>.</li>



<li><code>--gpu 1</code> indicates that we request 1 GPU for that app.</li>



<li><code>--env HF_TOKEN</code> is an optional argument that allows us to set our Hugging Face token as an environment variable. This gives us access to models for which we have accepted the conditions.</li>



<li><code>--label name=vllm</code> allows us to privatize our LLM: only requests carrying the token created for the label selector <code>name=vllm</code> will be accepted.</li>



<li><code>--default-http-port 8080</code> indicates that the port exposed on the app URL is <code>8080</code>.</li>



<li><code>-- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt;</code> starts the vLLM API server (everything after the <code>--</code> separator is the command executed in the container). The specified &lt;model&gt; will be downloaded from Hugging Face. Here is a list of those that are <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">supported by vLLM</a>. <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Many arguments</a> can be used to optimize your inference.</li>
</ul>



<p>When this <code>ovhai app run</code> command is executed, several pieces of information will appear in your terminal. Get the ID of your application, and open the Info URL in a new tab. Wait a few minutes for your application to launch. When it is <strong>RUNNING</strong>, you can stream its logs by executing:</p>



<pre class="wp-block-code"><code class="">ovhai app logs -f &lt;APP_ID&gt;</code></pre>



<p>This will allow you to track the server launch, the model download, and any errors that may occur if you have used a model whose conditions of use you have not accepted.</p>



<p>If all goes well, you should see the following output, which means that your server is up and running:</p>



<pre class="wp-block-code"><code class="">Started server process [11]
Waiting for application startup.
Application startup complete.
Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre>



<h3 class="wp-block-heading">Interacting with your LLM</h3>



<p>Once the server is up and running, we can interact with our LLM by hitting the <code>/generate</code> endpoint.</p>



<p><strong>Using cURL</strong></p>



<p><em>Make sure you change the ID to that of your application so that you target the right endpoint. In order for the request to be accepted, also specify the token that you generated previously by executing</em> <code>ovhai token create</code>. Feel free to adapt the parameters of the request (<em>prompt</em>, <em>max_tokens</em>, <em>temperature</em>, &#8230;).</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">curl --request POST \                                             
  --url https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/generate \
  --header 'Authorization: Bearer &lt;AI_TOKEN_generated_with_CLI&gt;' \
  --header 'Content-Type: application/json' \
  --data '{
        "prompt": "&lt;YOUR_PROMPT&gt;",
        "max_tokens": 50,
        "n": 1,
        "stream": false
}'</code></pre>



<p><strong>Using Python</strong></p>



<p><em>Here too, you need to add your personal token and the correct link for your application.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests
import json

# change for your host
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "AI_TOKEN_generated_with_CLI"

url = f"{APP_URL}/generate"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}
data = {
    "prompt": "What a LLM is in AI?",
    "max_tokens": 100,
    "temperature": 0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json()["text"][0])</code></pre>



<h3 class="wp-block-heading" id="AI-ENDPOINTS">OVHcloud AI Endpoints</h3>



<p>If you are not interested in building your own image and deploying your own LLM inference server, you can use OVHcloud&#8217;s new <em><strong><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></strong></em> product, which will definitely make your life easier!</p>



<p><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is a serverless solution that provides AI APIs, enabling you to easily use pre-trained and optimized AI models in your applications. </p>



<figure class="wp-block-video"><video height="1400" style="aspect-ratio: 2560 / 1400;" width="2560" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4"></video></figure>



<p class="has-text-align-center"><em>Overview of AI Endpoints</em></p>



<p>You can use LLM as a Service, choosing the desired model (such as <em>LLaMA</em>, <em>Mistral</em>, or <em>Mixtral</em>) and making an API call to use it in your application. This will allow you to interact with these models without even having to deploy them!</p>
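

<p><em>As an illustration, here is a minimal Python sketch of such an API call, assuming the endpoint you pick is OpenAI-compatible; the base URL, model name and API key below are placeholders to replace with the values shown in the AI Endpoints catalog.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests

# All values below are placeholders: take the real endpoint URL, model
# name and API key from the AI Endpoints catalog.
ENDPOINT_URL = "&lt;YOUR_AI_ENDPOINT_URL&gt;/v1/chat/completions"
API_KEY = "&lt;YOUR_AI_ENDPOINTS_API_KEY&gt;"

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    json={
        "model": "&lt;MODEL_NAME&gt;",  # e.g. a Mistral or Mixtral model from the catalog
        "messages": [{"role": "user", "content": "What is an LLM?"}],
        "max_tokens": 100,
    },
)

print(response.json()["choices"][0]["message"]["content"])</code></pre>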



<p>In addition to LLM capabilities, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> also offers a range of other AI models, including speech-to-text, translation, summarization, embeddings and computer vision. </p>



<p>Best of all, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is currently in alpha phase and is <strong>free to use</strong>, making it an accessible and affordable solution for developers seeking to explore the possibilities of AI. Check <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">this article</a> and try it out today to discover the power of AI!</p>



<p>Join our <a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Discord server</a> to interact with the community and send us your feedbacks (#<em>ai-endpoints</em> channel)!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy%2F&amp;action_name=How%20to%20serve%20LLMs%20with%20vLLM%20and%20OVHcloud%20AI%20Deploy&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4" length="14424826" type="video/mp4" />

			</item>
		<item>
		<title>Fine-Tuning LLaMA 2 Models using a single GPU, QLoRA and AI Notebooks</title>
		<link>https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Fri, 21 Jul 2023 15:04:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebooks]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Fine-tuning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMa 2]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[QLoRA]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=25613</guid>

					<description><![CDATA[In this tutorial, we will walk you through the process of fine-tuning LLaMA 2 models, providing step-by-step instructions. All the code related to this article is available in our dedicated GitHub repository. You can reproduce all the experiments with OVHcloud AI Notebooks. Introduction On July 18, 2023, Meta released LLaMA 2, the latest version of [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks%2F&amp;action_name=Fine-Tuning%20LLaMA%202%20Models%20using%20a%20single%20GPU%2C%20QLoRA%20and%20AI%20Notebooks&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of fine-tuning <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a> models, providing step-by-step instructions.</em> </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg" alt="Fine-Tuning LLaMA 2 Models with a single GPU and OVHcloud" class="wp-image-25629" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p class="has-text-align-center"><em>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/llm/miniconda/llama2-fine-tuning/llama_2_finetuning.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a><a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/speech-to-text" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">.</a> You can reproduce all the experiments with</em> <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</p>



<h3 class="wp-block-heading">Introduction</h3>



<p>On July 18, 2023, <a href="https://about.meta.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Meta</a> released <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a>, the latest version of their <strong>Large Language Model </strong>(LLM).</p>



<p>Trained between January 2023 and July 2023 on 2 trillion tokens, these new models outperform other LLMs on many benchmarks, including reasoning, coding, proficiency, and knowledge tests. This release comes in different flavors, with parameter sizes of <strong>7B</strong>, <strong>13B</strong>, and a mind-blowing <strong>70B</strong>. The models are free for both commercial and research use in English.</p>



<p>To suit specific text generation needs and fine-tune these models, we will use <a href="https://arxiv.org/abs/2305.14314" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">QLoRA (Efficient Finetuning of Quantized LLMs)</a>, a highly efficient fine-tuning technique that involves quantizing a pretrained LLM to just 4 bits and adding small &#8220;Low-Rank Adapters&#8221;. This unique approach allows for fine-tuning LLMs <strong>using just a single GPU</strong>! This technique is supported by the <a href="https://huggingface.co/docs/peft/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PEFT library</a>.</p>



<p>To fine-tune our model, we will create an <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebook</a> with only 1 GPU.</p>



<h3 class="wp-block-heading">Mandatory requirements</h3>



<p>To successfully fine-tune LLaMA 2 models, you will need the following:</p>



<ul class="wp-block-list">
<li>Fill in Meta&#8217;s form to <a href="https://ai.meta.com/resources/models-and-libraries/llama-downloads/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">request access to the next version of Llama</a>. Indeed, the use of Llama 2 is governed by the Meta license, which you must accept in order to download the model weights and tokenizer.<br></li>



<li>Have a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a> account (with the same email address you entered in Meta&#8217;s form).<br></li>



<li>Have a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face token</a>.<br></li>



<li>Visit the page of one of the LLaMA 2 available models (version <a href="https://huggingface.co/meta-llama/Llama-2-7b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">7B</a>, <a href="https://huggingface.co/meta-llama/Llama-2-13b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">13B</a> or <a href="https://huggingface.co/meta-llama/Llama-2-70b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">70B</a>), and accept Hugging Face&#8217;s license terms and acceptable use policy.<br></li>



<li>Log in to the Hugging Face model Hub from your notebook&#8217;s terminal by running the <code>huggingface-cli login</code> command, and enter your token. You will not need to add your token as a git credential (a programmatic alternative is sketched just after this list).<br></li>



<li>Powerful Computing Resources: Fine-tuning the Llama 2 model requires substantial computational power. Ensure you are running code on GPU(s) when using <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Notebooks</a> or <a href="https://www.ovhcloud.com/en/public-cloud/ai-training/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Training</a>.</li>
</ul>
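

<p><em>If you prefer to log in programmatically rather than through the CLI prompt, the <code>huggingface_hub</code> library exposes a <code>login()</code> helper; a minimal sketch, assuming your token is exported as the <code>HF_TOKEN</code> environment variable:</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import os

from huggingface_hub import login

# Reads the token from the environment instead of an interactive prompt
login(token=os.environ["HF_TOKEN"])</code></pre>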



<h3 class="wp-block-heading">Set up your Python environment</h3>



<p>Create the following <code>requirements.txt</code> file:</p>



<pre class="wp-block-code"><code lang="" class="">torch
accelerate @ git+https://github.com/huggingface/accelerate.git
bitsandbytes
datasets==2.13.1
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
trl @ git+https://github.com/lvwerra/trl.git
scipy</code></pre>



<p>Then install the required libraries and import them:</p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">import argparse
import bitsandbytes as bnb
from datasets import load_dataset
from functools import partial
import os
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, BitsAndBytesConfig, \
    DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset</code></pre>



<h3 class="wp-block-heading">Download LLaMA 2 model</h3>



<p>As mentioned before, LLaMA 2 models come in different flavors which are 7B, 13B, and 70B. Your choice can be influenced by your computational resources. Indeed, larger models require more resources, memory, processing power, and training time.</p>



<p>To download the model you have been granted access to, <strong>make sure you are logged in to the Hugging Face model hub</strong>. As mentioned in the requirements step, you need to use the <code>huggingface-cli login</code> command.</p>



<p>The following function will help us to download the model and its tokenizer. It requires a bitsandbytes configuration that we will define later.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = '40960MB'  # memory budget per GPU; adjust to your hardware

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto", # dispatch efficiently the model on the available ressources
        max_memory = {i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # Needed for LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer</code></pre>



<h3 class="wp-block-heading">Download a Dataset</h3>



<p>There are many datasets that can help you fine-tune your model. You can even use your own dataset!</p>



<p>In this tutorial, we are going to download and use the <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks Dolly 15k dataset</a>, which contains <strong>15,000 prompt/response pairs</strong>. It was crafted by over 5,000 <a href="https://www.databricks.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks</a> employees during March and April of 2023.</p>



<p>This dataset is designed specifically for fine-tuning large language models. Released under the <a href="https://creativecommons.org/licenses/by-sa/3.0/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">CC BY-SA 3.0 license</a>, it can be used, modified, and extended by any individual or company, even for commercial applications. So it&#8217;s a perfect fit for our use case!</p>



<p>However, like most datasets, this one has <strong>its limitations</strong>. Indeed, pay attention to the following points:</p>



<ul class="wp-block-list">
<li>It consists of content collected from the public internet, which means it may contain objectionable, incorrect or biased content, as well as typos, all of which could influence the behavior of models fine-tuned using this dataset.<br></li>



<li>Since the dataset has been created for Databricks by their own employees, it&#8217;s worth noting that the dataset reflects the interests and semantic choices of Databricks employees, which may not be representative of the global population at large.<br></li>



<li>We only have access to the <code>train</code> split of the dataset, which is its largest subset.</li>
</ul>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load the databricks dataset from Hugging Face
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")</code></pre>



<h3 class="wp-block-heading">Explore dataset</h3>



<p>Once the dataset is downloaded, we can take a look at it to understand what it contains:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

*** OUTPUT ***
Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']</code></pre>



<p>As we can see, each sample is a dictionary that contains:</p>



<ul class="wp-block-list">
<li><strong>An instruction</strong>: What could be entered by the user, such as a question</li>



<li><strong>A context</strong>: Help to interpret the sample</li>



<li><strong>A response</strong>: Answer to the instruction</li>



<li><strong>A category</strong>: Classify the sample between Open Q&amp;A, Closed Q&amp;A, Extract information from Wikipedia, Summarize information from Wikipedia, Brainstorming, Classification, Creative writing</li>
</ul>
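

<p><em>A quick way to see these four fields on a concrete record is to print the first sample (the exact text will vary from one dataset version to another):</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Each record is a plain dictionary with the four fields described above
sample = dataset[0]
for key in dataset.column_names:
    print(f"{key}: {sample[key]}")</code></pre>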



<h3 class="wp-block-heading">Pre-processing dataset</h3>



<p><strong>Instruction fine-tuning</strong> is a common technique used to fine-tune a base LLM for a specific downstream use-case.</p>



<p>It will help us to format our prompts as follows: </p>



<pre class="wp-block-code"><code class="">Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Sea or Mountain

### Response:
I believe Mountain are more attractive but Ocean has it's own beauty and this tropical weather definitely turn you on! SO 50% 50%

### End</code></pre>



<p>To delimit each prompt part by hashtags, we can use the following function:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_prompt_formats(sample):
    """
    Format various fields of the sample ('instruction', 'context', 'response')
    Then concatenate them using two newline characters 
    :param sample: Sample dictionary
    """

    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"
    
    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}\n{sample['instruction']}"
    input_context = f"{INPUT_KEY}\n{sample['context']}" if sample["context"] else None
    response = f"{RESPONSE_KEY}\n{sample['response']}"
    end = f"{END_KEY}"
    
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    
    sample["text"] = formatted_prompt

    return sample</code></pre>



<p>Now, we will use our <strong>model tokenizer to process these prompts into tokenized ones</strong>. </p>



<p>The goal is to create input sequences of uniform length (which is suitable for fine-tuning the language model because it maximizes efficiency and minimizes computational overhead) that must not exceed the model&#8217;s maximum token limit.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """Format &amp; tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)
    
    # Apply preprocessing to each batch of the dataset &amp; and remove 'instruction', 'context', 'response', 'category' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["instruction", "context", "response", "text", "category"],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) &lt; max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset</code></pre>



<p>With these functions, our dataset will be ready for fine-tuning!</p>



<h3 class="wp-block-heading">Create a bitsandbytes configuration</h3>



<p>This will allow us to load our LLM in 4 bits. This way, we can divide the memory used by 4 and load the model on smaller devices. We choose to apply the bfloat16 compute data type and nested quantization for memory-saving purposes.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    return bnb_config</code></pre>



<p>To leverage the LoRA method, we need to wrap the model as a PeftModel.</p>



<p>To do this, we need to implement a <a href="https://huggingface.co/docs/peft/conceptual_guides/lora" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LoRa configuration</a>:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply Lora to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config</code></pre>



<p>The previous function needs the <strong>target modules</strong> in order to update the necessary matrices. The following function will retrieve them for our model:</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)</code></pre>



<p>Once everything is set up and the base model is prepared, we can use the <em>print_trainable_parameters()</em> helper function to see how many trainable parameters are in the model. </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params /= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )</code></pre>



<p>We expect the LoRA model to have far fewer trainable parameters than the original one, since LoRA freezes the base weights and only trains the small adapter matrices.</p>



<h3 class="wp-block-heading">Train</h3>



<p>Now that everything is ready, we can pre-process our dataset and load our model using the set configurations: </p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load model from HF with user's token and with bitsandbytes config

model_name = "meta-llama/Llama-2-7b-hf" 

bnb_config = create_bnb_config()

model, tokenizer = load_model(model_name, bnb_config)</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">## Preprocess dataset

max_length = get_max_length(model)

dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)</code></pre>



<p>Then, we can run our fine-tuning process: </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)
    
    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)
    
    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=20,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs
    
    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)
     
    do_train = True
    
    # Launch training
    print("Training...")
    
    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)    
    
    ###
    
    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)
    
    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()
    
    
output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)</code></pre>



<p><em>If you prefer to set a number of epochs (the entire training dataset will be passed through the model) instead of a number of training steps (forward and backward passes through the model with one batch of data), you can replace the <code>max_steps</code> argument with <code>num_train_epochs</code>, as sketched below.</em></p>
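

<p><em>For example, a sketch of the same <code>TrainingArguments</code> expressed in epochs (the values are the ones used above, with the step limit swapped out):</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    num_train_epochs=1,  # replaces max_steps=20: one full pass over the dataset
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
)</code></pre>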



<p>To later load and use the model for inference, we have used the <code>trainer.model.save_pretrained(output_dir)</code> function, which saves the fine-tuned adapter weights and their configuration.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png" alt="" class="wp-image-25619" width="870" height="422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-300x146.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-768x374.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results.png 1320w" sizes="auto, (max-width: 870px) 100vw, 870px" /></figure>



<p class="has-text-align-center">Fine-tuning llama2 results on <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">databricks-dolly-15k</a> dataset</p>



<p>Unfortunately, it is possible that the latest weights are not the best. To solve this problem, you can implement an <code>EarlyStoppingCallback</code>, from transformers, during your fine-tuning. This will enable you to evaluate your model regularly on a validation set, if you have one, and keep only the best weights, as sketched below.</p>
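

<p><em>A minimal sketch of such a setup, assuming you have split off a validation set named <code>eval_dataset</code> (the evaluation cadence and patience values below are illustrative, not prescriptive):</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=eval_dataset,  # assumption: a held-out validation split
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        evaluation_strategy="steps",   # evaluate regularly during training
        eval_steps=10,
        save_strategy="steps",         # must match the evaluation strategy
        save_steps=10,
        load_best_model_at_end=True,   # required by EarlyStoppingCallback
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    # Stop if eval_loss does not improve for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)</code></pre>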



<h3 class="wp-block-heading">Merge weights</h3>



<p>Once we have our fine-tuned weights, we can build our fine-tuned model and save it to a new directory, with its associated tokenizer. By performing these steps, we can have a memory-efficient fine-tuned model and tokenizer ready for inference!</p>



<pre class="wp-block-code"><code lang="python" class="language-python">model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)

# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)</code></pre>
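

<p><em>As a quick sanity check, here is a minimal generation sketch using the merged model saved above (the prompt and generation parameters are illustrative):</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the merged model and its tokenizer from the directory saved above
model = AutoModelForCausalLM.from_pretrained(output_merged_dir, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(output_merged_dir)

# Use the same prompt template as during fine-tuning
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat is a GPU?\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))</code></pre>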



<h3 class="wp-block-heading">Conclusion</h3>



<p>We hope you have enjoyed this article!</p>



<p>You are now able to fine-tune LLaMA 2 models on your own datasets!</p>



<p>In our next tutorial, you will discover how to <strong>Deploy your Fine-tuned LLM on <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Deploy</a> for inference</strong>!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks%2F&amp;action_name=Fine-Tuning%20LLaMA%202%20Models%20using%20a%20single%20GPU%2C%20QLoRA%20and%20AI%20Notebooks&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using GPU on Managed Kubernetes Service with NVIDIA GPU operator</title>
		<link>https://blog.ovhcloud.com/using-gpu-on-managed-kubernetes-service-with-nvidia-gpu-operator/</link>
		
		<dc:creator><![CDATA[Maxime Hurtrel]]></dc:creator>
		<pubDate>Wed, 19 Jan 2022 15:53:13 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[OVHcloud Managed Kubernetes]]></category>
		<category><![CDATA[Partnership]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=21533</guid>

					<description><![CDATA[Two years after launching our Managed Kubernetes service, we&#8217;re seeing a lot of diversity in the workloads that run in production. We have been challenged by some customers looking for GPU acceleration, and have teamed up with our partner NVIDIA to deliver high performance GPUs on Kubernetes. We&#8217;ve done it in a way that combines [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fusing-gpu-on-managed-kubernetes-service-with-nvidia-gpu-operator%2F&amp;action_name=Using%20GPU%20on%20Managed%20Kubernetes%20Service%20with%20NVIDIA%20GPU%20operator&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Two years after launching our Managed Kubernetes service, we&#8217;re seeing a lot of diversity in the workloads that run in production. We have been challenged by some customers looking for GPU acceleration, and have teamed up with our partner NVIDIA to deliver high performance GPUs on Kubernetes. We&#8217;ve done it in a way that combines <strong>simplicity, day-2-maintainability and total flexibility</strong>. The solution<strong> is now available in all OVHcloud regions where we offer Kubernetes and GPUs.</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="574" src="https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1-1024x574.png" alt="" class="wp-image-21539" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1-1024x574.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1-300x168.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1-768x431.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1.png 1416w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">The challenge behind a fully managed service</h2>



<p>Readers unfamiliar with our orchestration service and/or GPUs may be surprised that we did not yet offer this integration in general availability. This lies in the fact that our team is focused on providing a <strong>totally managed experience, including patching the OS (Operating System) and Kubelet of each Node each time it is required</strong>. To achieve this goal, we have built and maintained a single hardened image for the dozens of flavors, in each of the 10+ regions.<br>Based on the experience of selected beta users, we found that this approach doesn&#8217;t always work for use cases that require a very specific NVIDIA driver configuration. Working with our technical partners at NVIDIA, we found a solution that leverages GPUs in a simple way while still allowing fine tuning, such as the <a href="https://en.wikipedia.org/wiki/CUDA" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">CUDA</a> configuration for example.</p>



<h2 class="wp-block-heading">NVIDIA to the rescue </h2>



<p>This Keep-It-Simple-Stupid (KISS) solution relies on the great work of NVIDIA in building and maintaining an <strong>official <a href="https://github.com/NVIDIA/gpu-operator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">NVIDIA GPU operator</a></strong>. This Apache 2.0 licensed software uses the operator framework within Kubernetes to <strong>automate the management of all the NVIDIA software components needed to use GPUs</strong>, such as the NVIDIA drivers, the Kubernetes device plugin for GPUs, and others.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="895" src="https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator-1024x895.png" alt="" class="wp-image-21547" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator-1024x895.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator-300x262.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator-768x671.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator.png 1394w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>We ensured it was compliant with our fully maintained Operating System (OS), based on a recent Ubuntu LTS version. After testing it, <a href="https://docs.ovh.com/gb/en/kubernetes/deploying-gpu-application" data-wpel-link="exclude">we documented how to use it on our Managed Kubernetes Service.</a> We appreciate that this solution leverages an open source software that you can use on any compatible NVIDIA hardware. This allows you to guarantee consistent behavior in hybrid or multicloud scenarios, aligned with our <a href="https://www.ovhcloud.com/en/lp/manifesto/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">SMART </a>motto.</p>



<p>Here is an illustration describing the <strong>shared responsibility model </strong>of the stack:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="772" src="https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-1024x772.png" alt="" class="wp-image-21560" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-1024x772.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-300x226.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-768x579.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-1536x1157.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-2048x1543.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>All our OVHcloud Public Cloud customers can now <a href="https://docs.ovh.com/gb/en/kubernetes/deploying-gpu-application" data-wpel-link="exclude">leverage the feature</a>, adding a GPU node pool to any of their existing or new clusters. This can be done in the regions where both Kubernetes and T1 or T2 instances are available at the date this blog post is published: GRA5, GRA7 and GRA9 (France), DE1 (Germany) (available in the upcoming weeks) and BHS5 (Canada).<br>Note that GPU worker nodes are <strong>compatible with all released features, including <a href="https://docs.ovh.com/gb/en/kubernetes/using_vrack/" data-wpel-link="exclude">vRack technology</a> and <a href="https://docs.ovh.com/gb/en/kubernetes/using-cluster-autoscaler/" data-wpel-link="exclude">cluster autoscaling</a></strong>.</p>



<p>Having Kubernetes clusters with GPU options means that deploying typical AI/ML applications, such as Kubeflow, MLflow, JupyterHub or NVIDIA NGC, is easy and flexible. Do not hesitate to discuss this feature with other Kubernetes users on our <a href="https://gitter.im/ovh/kubernetes" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gitter Channel</a>. You may also have a look at our fully managed <a href="https://www.ovhcloud.com/en/public-cloud/ai-notebook/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Notebook</a> or <a href="https://www.ovhcloud.com/en/public-cloud/ai-training/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Training</a> services for an even simpler out-of-the-box experience and per-minute pricing!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fusing-gpu-on-managed-kubernetes-service-with-nvidia-gpu-operator%2F&amp;action_name=Using%20GPU%20on%20Managed%20Kubernetes%20Service%20with%20NVIDIA%20GPU%20operator&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Managing GPU pools efficiently in AI pipelines</title>
		<link>https://blog.ovhcloud.com/managing-gpu-pools-efficiently-in-ai-pipelines/</link>
		
		<dc:creator><![CDATA[Bastien Verdebout]]></dc:creator>
		<pubDate>Tue, 22 Dec 2020 16:18:36 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=20146</guid>

					<description><![CDATA[A growing number of companies are using artificial intelligence on a daily basis — and dealing with the back-end architecture can reveal some unexpected challenges. Whether the machine learning workload involves fraud detection, forecasts, chatbots, computer vision or NLP, it will need frequent access to computing power for training and fine-tuning. GPUs have proven to [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmanaging-gpu-pools-efficiently-in-ai-pipelines%2F&amp;action_name=Managing%20GPU%20pools%20efficiently%20in%20AI%20pipelines&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>A growing number of companies are using artificial intelligence on a daily basis — and dealing with the back-end architecture can reveal some unexpected challenges.</p>



<p>Whether the machine learning workload involves fraud detection, forecasts, chatbots, computer vision or NLP, it will need frequent access to computing power for training and fine-tuning.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0420-1024x537.png" alt="Managing GPU pools efficiently in AI pipelines" class="wp-image-20449" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420.png 1200w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>GPUs have proven to be a game-changer for deep learning. If you&#8217;re wondering why, you can find out more by reading our blog post about <a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">GPU architectures</a>. A few years ago, manufacturers such as NVIDIA began to develop specific ranges for cloud datacentres. You may be familiar with the NVIDIA TITAN RTX for gaming — and in our datacentres, we use NVIDIA A100, V100, Tesla and DGX GPUs for enterprise-grade workloads.</p>



<p>In short, GPUs are perfect for tasks that can be solved or improved by AI, and require a lot of processing power.<br>They offer optimal compute, and are widely used in deep learning. A growing number of companies are using AI, and GPUs seem to be the best choice for them.</p>



<p>However, when dealing with pools of GPUs, the back-end architecture can be really tricky.  </p>



<p><strong>So how do we use them to benefit a company with minimal hassle and headaches?</strong> <strong>On-premise or in the cloud?</strong></p>



<p>These are good questions that I&#8217;m keen to discuss here, from both a business and technical perspective.</p>






<h3 class="wp-block-heading">Dealing with GPU pools&#8230; The struggle is real.</h3>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0419.png" alt="One does not simply set up GPUs for Deep Learning" class="wp-image-20443" width="603" height="430" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0419.png 804w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0419-300x214.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0419-768x547.png 768w" sizes="auto, (max-width: 603px) 100vw, 603px" /></figure></div>



<p>For anyone who has had to deploy and manage more than 1 GPU for a data-AI team, I&#8217;m sure this topic will bring tears to your eyes, and make your voice tremble. Yes, it is indeed complicated.</p>



<p>I can talk about it on our blog, because our team of data scientists here at OVHcloud had to deal with the exact same annoying issues. Thankfully, we solved all of them — stay tuned!</p>



<p><strong>GPU sharing is hard</strong>. Even if one GPU is better than none, in most cases it will not be sufficient, and a GPU pool will be far more effective. From a tech perspective, dealing with a GPU pool — or worse, allowing your team to use this pool simultaneously — is very tricky. The market is really mature for CPU sharing (via hypervisors), but by design, a GPU has to be attached to a VM or container. This means that quite often, it needs to be &#8220;booked&#8221; for a specific workload. To get around this issue, you&#8217;ll need to provide a scale-out architecture with orchestration, so that you can dynamically assign GPUs to jobs over time. Whenever you tell yourself &#8220;<em>I want to launch this task with 4 GPUs for 2 days</em>&#8221;, you should simply be able to ask, and the back-end should work its magic for you.</p>



<p><strong>Setting up and maintaining an architecture is time-consuming.&nbsp; </strong>So you&#8217;ve deployed servers with GPUs, updated and upgraded your Linux distros, installed your main AI packages and CUDA drivers, and now you want to move on to something else. But wait — a new TensorFlow version has been released, and you also have a security patch to apply. What you initially thought would be a one-off task is now taking up 4-5 hours of your time per week.</p>



<p><strong>Diagnosing is quite complex</strong>. If, for whatever reason, something isn&#8217;t working as it should — good luck. You barely know who is doing what, and you can&#8217;t track jobs or usage unless you connect to the platform yourself and set up monitoring tools. Remember to grab your snorkel set, because you&#8217;ll need to deep-dive.</p>



<p><strong>Bottlenecks are almost inevitable</strong>. Imagine setting up a pool of GPUs based on your current AI project workloads. Your infrastructure is not really designed to scale automatically, and as soon as the AI workloads increase, your jobs have to be scheduled while the GPU fleet is being updated constantly. A backlog starts to accumulate, and a bottleneck is created as a result.</p>



<p><strong>Providing tools for teams to work collaboratively on code is mandatory.</strong> Usually, your team will need to share their data experiments — and the best way to do this for now is with <strong>JupyterLab Notebooks</strong> (we love them) or <strong>VSCode</strong>. But you&#8217;ll need to keep in mind that this is more software to set up and maintain.</p>



<p><strong>Securing data access is essential. </strong>The required data must be easily accessible, and sensitive data must be covered by security guarantees.</p>



<p><strong>Cost control is difficult. </strong>Even worse, for one reason or another (who said holidays?), you might need to stop almost all your GPU servers for a week or two — but to do this, you would need to wait for any ongoing jobs to be completed.</p>



<p>All jokes aside, while we may be passionate about tech and hardware, we have other things to do. Data engineers cannot achieve their full potential and talent in maintenance-based or billing-based tasks.</p>



<h3 class="wp-block-heading">Kubeflow to the rescue?</h3>



<p>Kubernetes 1.0 was launched 5 years ago. Whatever your opinion of it, in five years it has become the de facto standard for container orchestration in enterprise environments.</p>



<p>Data scientists use containers for portability, agility, and community — but Kubernetes was made to orchestrate services, not data experiments.</p>



<p>Kubernetes alone is not tailored for a data team. It presents too much complexity, with the sole benefit of solving the orchestration issue.</p>



<p><strong>We need something that not only improves orchestration, but also code contribution, tests and deployments.</strong></p>



<p>Luckily, <a href="https://www.kubeflow.org/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><strong>Kubeflow</strong> </a>appeared 2 years ago, and was open-sourced by Google at the time. Its main promise is to simplify complex ML workflows, for example <code>data processing =&gt; data labeling =&gt; training =&gt; serving</code>, and complete it with notebooks.</p>



<p>I do really love the promise, and the way they simplify ML pipelines. Kubeflow can be run over K8s clusters on-premise or in the cloud, and can also be set up on a single VM or even on a workstation (Linux/Mac/Windows).</p>



<p>Students can easily have their own ML environment. However, for the most advanced uses, a workstation or a single VM might be out of the question, and you would need a K8s cluster with Kubeflow installed on top of that. You&#8217;ll have a nice UI for starting notebooks and creating ML pipelines (processing/training/inference), <strong>but still zero GPU support by default</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.kubeflow.org/docs/images/central-ui.png" alt="" width="480" height="319"/><figcaption>Central Dashboard / Image property of Kubeflow.org</figcaption></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.kubeflow.org/docs/images/pipelines-xgboost-graph.png" alt="" width="480" height="270"/><figcaption>XGBoost pipeline / Image property of Kubeflow.org</figcaption></figure></div>



<p>Your GPU support will depend on your setup. It may differ if you host it on GCP, AWS, Azure, OVHcloud, on-premise, MicroK8s, or anything else.</p>



<p>For example, on AWS EKS, you need to declare GPU pools in your Kubeflow manifest:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># Official doc: https://www.kubeflow.org/docs/aws/customizing-aws/

# NodeGroup holds all configuration attributes that are specific to a node group
# You can have several node groups in your cluster.
nodeGroups:
  - name: eks-gpu
    instanceType: p2.xlarge
    availabilityZones: ["us-west-2b"]
    desiredCapacity: 2
    minSize: 0
    maxSize: 2
    volumeSize: 30
    ssh:
      allow: true
      publicKeyPath: '~/.ssh/id_rsa.pub'</code></pre>



<p>On GCP GKE, you will need to run this command to export a GPU pool:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Official doc: https://www.kubeflow.org/docs/gke/customizing-gke/#common-customizations
 
export GPU_POOL_NAME=&lt;name of the new GPU pool&gt;
 
gcloud container node-pools create ${GPU_POOL_NAME} \
--accelerator type=nvidia-tesla-k80,count=1 \
--zone us-central1-a --cluster ${KF_NAME} \
--num-nodes=1 --machine-type=n1-standard-4 --min-nodes=0 --max-nodes=5 --enable-autoscaling</code></pre>



<p>You will then need to install NVIDIA drivers on all the GPU nodes. NVIDIA maintains a <em>daemonset</em> which enables you to install them easily:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Official doc: https://www.kubeflow.org/docs/gke/customizing-gke/#common-customizations
 
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml</code></pre>



<p>Once you have done this, you will be able to create GPU pools (don&#8217;t forget to check your quotas beforehand — with a basic account, you are restricted by default, and you will need to contact their support).</p>



<h3 class="wp-block-heading">Okay, but do things get easier from here?</h3>



<p>As we say in France, especially in Normandy, yes but no.</p>



<p>Yes, Kubeflow does resolve some of the challenges we&#8217;ve mentioned — but some of the biggest challenges are yet to come, and they will take up a lot of your daily routine. Many manual operations will still require you to dig into specific K8s documentation, or guides published by cloud providers.</p>



<p>Below is a summary of <strong>Kubeflow vs GPU pool challenges</strong>.</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Challenges</th><th>Status</th></tr></thead><tbody><tr><td><strong>GPU pool with sharing option</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> but will require manual configuration (declaration in manifest, driver installation, etc.).</td></tr><tr><td><strong>Collaborative tools</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> definitely. Notebooks are provided via Kubeflow.</td></tr><tr><td><strong>Infrastructure maintenance</strong></td><td>Definitely <strong><span class="has-inline-color has-vivid-red-color">NO</span></strong>.<br>Now you have a Kubeflow cluster to maintain and operate.</td></tr><tr><td><strong>Infrastructure diagnosis</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span> BUT <span class="has-inline-color has-vivid-red-color">NO</span></strong>. Activity Dashboard and reporting tools based on SpartaKus, Logs, etc.<br>But provided to the data engineers, not data scientists themselves. They may come back to you.</td></tr><tr><td><strong>Infrastructure agility/flexibility</strong></td><td><strong>TRICKY</strong>. It will depend on your hosting implementation. If it&#8217;s on-premise, definitely no. You&#8217;ll need to buy hardware components (an NVIDIA V100 costs approximately $10K without chassis, electricity usage, etc.)<br>Some cloud providers can provide &#8220;auto-scaling GPU pools&#8221; from 0 to n, which is nice.</td></tr><tr><td><strong>Secured data access</strong></td><td><strong>TRICKY</strong>. It will depend on how you locate your data, and the technology used. It&#8217;s not a ready-to-use solution.</td></tr><tr><td><strong>Cost control</strong></td><td><strong>TRICKY.</strong> Again, it will depend on your hosting implementation. It&#8217;s not easy, since you need to take care of the infrastructure. Some hidden costs can appear, too (network traffic, monitoring, etc.).</td></tr></tbody></table><figcaption>Kubeflow vs Challenges</figcaption></figure>



<h3 class="wp-block-heading">Forget infrastructure, welcome to GPU platforms made for AI</h3>



<p>You can now find various third-party solutions on the market that go one step further. Instead of dealing with the architecture and the Kubernetes cluster, what if you simply focused on your machine learning or deep learning code?</p>



<p>There are well-known solutions such as <strong>Paperspace Gradient</strong> — or smaller ones, like <strong>Run:AI</strong> — and we&#8217;re pleased to offer another option on the market: <strong>AI Training</strong>. We&#8217;re using this post as a self-promotion opportunity (it&#8217;s our blog after all), but the logic remains the same for competitors.</p>



<p>What are the concepts behind it?</p>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Noinfrastructuretomanage">No infrastructure to manage</h4>



<p>You don&#8217;t need to set up and manage a K8s cluster, or a Kubeflow cluster.</p>



<p>You don&#8217;t need to declare GPU pools in your manifest.</p>



<p>You don&#8217;t need to install NVIDIA drivers on the nodes.</p>



<p>With GPU Platforms like OVHcloud AI Training, your neural network training is as simple as this:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"><code># Upload data directly to Object Storage</code>
<code>ovhai</code> <code>data upload myBucket@GRA train.zip</code>
&nbsp;
<code># Launch a job with 4 GPUs on a Pytorch environment, with Object Storage bucket directly linked to it</code>
<code>ovhai</code> <code>job run \</code>
 <code>    --gpu 4 \</code>
<code>    --volume myBucket@GRA:/data:RW \</code>
<code>    ovhcom/ai-training-pytorch:1.6.0</code></code></pre>



<p>This command will provide you with a JupyterLab notebook directly plugged into a pool of 4x NVIDIA GPUs, with the Pytorch environment installed. This is all you need to do, and the entire process takes around 15 seconds.</p>
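

<p>From there, you can follow the job without leaving your terminal. The exact subcommands below are a hedged sketch: check <code>ovhai --help</code> for what your CLI version provides:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># List your jobs, then stream the logs of one of them
# (&lt;job-id&gt; is a placeholder for the identifier returned by "ovhai job run")
ovhai job list
ovhai job logs &lt;job-id&gt;</code></pre>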



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Parralelizationforthewin">Parallel computing — a great advantage</h4>



<p>One of the most significant benefits is that since the infrastructure is not on your premises, you can count on the provider to scale it.</p>



<p>So you can run dozens of jobs simultaneously. A classic use case is to fine-tune all of your models once a week or once a month, with a few lines of bash:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"><code># Start a basic loop</code>
<code>for</code> <code>model in</code> <code>my_models_listing</code>
<code>do</code>
&nbsp;
<code># Launch a job with 4 GPUs on a Pytorch environment, with Object Storage bucket directly linked to it</code>
<code>echo</code> <code>"starting training of $model"</code>
<code>ovhai job run \</code>
<code>--gpu 3 \</code>
<code>--volume myBucket@GRA:/data:RW \</code>
<code>my_docker_repository/$model</code>
&nbsp;
<code>done</code></code></pre>



<p>If you have 10 models, this will launch 10 jobs of 3 GPUs each in a few seconds, and release them once each job is complete: from sequential to parallel work in one loop.</p>



<h5 class="wp-block-heading">Collaboration out of the box</h5>



<p>All of these platforms natively include notebooks, directly plugged into GPU power. With OVHcloud AI Training, we also provide pre-installed environments for TensorFlow, Hugging Face, Pytorch, MXNet and Fast.AI — and others will be added to this list soon.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="571" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/nbook-1024x571.png" alt="" class="wp-image-20259" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-1024x571.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-300x167.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-768x429.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-1536x857.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook.png 1672w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>JupyterLab Notebook</figcaption></figure></div>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Datasetaccessmadeeasy">Data set access made easy</h4>



<p>I haven&#8217;t tested all the GPU platforms on the market, but usually they provide some useful ways to access data. We aim to provide the best work environment for data science teams, so we are also offering an easy way for them to access their data — by enabling them to attach object storage containers during the job launch.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="271" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/container-1024x271.png" alt="" class="wp-image-20260" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/container-1024x271.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/container-300x79.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/container-768x204.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/container.png 1536w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>OVHcloud AI Training : attach Object Storage containers to notebooks</figcaption></figure></div>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Lastbutnotleast...CostControlsisareality">Cost control for users</h4>



<p>Third-party GPU platforms quite often provide clear pricing. This is the case for Paperspace, but not for Run:AI (I was unable to find their price list). This is also the case for OVHcloud AI Training.</p>



<ul class="wp-block-list"><li><strong>GPU power</strong>: You pay £1.58/hour/NVIDIA V100s GPU</li><li><strong>Storage</strong>: Standard price of OVHcloud Object Storage (compliant with AWS S3 protocol)</li><li><strong>Notebooks</strong>: Included</li><li><strong>Observability tools</strong>: Logs and metrics included</li><li><strong>Subscription</strong>: No, it&#8217;s pay-as-you-go, per minute</li></ul>



<p>And there we go — cost and budget estimation is now simple. Try it out for yourself!</p>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Missioncomplete?">Mission complete?</h4>



<p>Below is a summary addressing the major challenges to resolve when dealing with GPU pool sharing. It&#8217;s a big yes!</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Challenges</th><th>Status</th></tr></thead><tbody><tr><td><strong>GPU pool with sharing option</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> definitely. In fact, even many GPU pools in parallel, if you want to.</td></tr><tr><td><strong>Collaborative tools</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> definitely. Notebooks are always provided, as far as I know.</td></tr><tr><td><strong>Infrastructure maintenance</strong></td><td><span class="has-inline-color has-vivid-green-cyan-color"><strong>YES</strong> </span>definitely. Infrastructure is managed by the provider. You will need need to connect via SSH to debug.</td></tr><tr><td><strong>Infrastructure diagnosis</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span>. </strong>Logs and metrics provided on our side, at least.</td></tr><tr><td><strong>Infrastructure agility/flexibility</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span> </strong>definitely. Scale up or down one or more GPU pools, use them for 10 minutes or a full month, etc.</td></tr><tr><td><strong>Secured data access</strong></td><td>Depends on the solution you choose, but usually it&#8217;s a <strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> via simplified object storage access.</td></tr><tr><td><strong>Cost control</strong></td><td>Depends on the solution you choose, but usually is a <strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> with packaged prices and zero investments to make (zero CAPEX).</td></tr></tbody></table></figure>






<h3 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Conclusion">Conclusion</h3>



<p>If we go back to the main challenges faced by a company that requires shared GPU pools, we can say without a doubt that <strong>Kubernetes is a market-standard for AI pipeline orchestration</strong>.</p>



<p>An <strong>on-premise K8s cluster with Kubeflow</strong> is really interesting if the data cannot be processed into the cloud (e.g. banking, hospitals, any kind of sensitive data) or if your team has flat (and lower-level) GPU requirements. You can invest in a few GPUs and manage the fleet yourself with software on top. But if you need more power, <strong>very soon the cloud will become the only viable option</strong>. Hardware investments, hardware obsolescence, electricity usage and scaling will give you some headaches.</p>



<p>Then, depending on the situation, <strong>Kubeflow in the cloud might be really useful</strong>. It delivers powerful pipeline features, notebooks, and enables users to manage virtual GPU pools. </p>



<p>But if you want to avoid infrastructure tasks, control your spending, and focus on your added value and code, <strong>you might consider GPU platforms as your first choice</strong>.</p>



<p>However, there is no such thing as magic — and without knowing exactly what you want, even the best platform won&#8217;t be able to meet your needs. Yet some start-ups, not listed here, offer a combination of platforms and expertise to help you with your projects, infrastructure and use cases.</p>



<p>Thank you for reading, and don&#8217;t forget that we also offer inference at scale with ML Serving. This is the next logical step after training.</p>



<h5 class="wp-block-heading">Want to find out more?</h5>



<ul class="wp-block-list"><li>Solution page: <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-training/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.ovhcloud.com/en-gb/public-cloud/ai-training/</a></li><li>Public documentation: <a href="https://docs.ovh.com/gb/en/ai-training/" data-wpel-link="exclude">https://docs.ovh.com/gb/en/ai-training/</a></li><li>Community: <a href="http://community.ovh.com/en/" data-wpel-link="exclude">community.ovh.com/en/</a></li></ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmanaging-gpu-pools-efficiently-in-ai-pipelines%2F&amp;action_name=Managing%20GPU%20pools%20efficiently%20in%20AI%20pipelines&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How PCI-Express works and why you should care? #GPU</title>
		<link>https://blog.ovhcloud.com/how-pci-express-works-and-why-you-should-care-gpu/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Thu, 09 Jul 2020 10:16:00 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PCIe]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14485</guid>

					<description><![CDATA[What is PCI-Express ? Everyone, and I mean everyone, should pay attention when they do intensive Machine Learning / Deep Learning Training. As I explained in a previous blog post, GPUs have accelerated Artificial Intelligence evolution massively. However, building a GPUs server is not that easy. And failing to create an appropriate infrastructure can have [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-pci-express-works-and-why-you-should-care-gpu%2F&amp;action_name=How%20PCI-Express%20works%20and%20why%20you%20should%20care%3F%20%23GPU&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-1024x538.jpeg" alt="How PCI-Express works and why you should care? #GPU" class="wp-image-18783" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-1024x538.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h2 class="wp-block-heading">What is PCI-Express ?</h2>



<p>Everyone, and I mean everyone, should pay attention when they do intensive Machine Learning / Deep Learning Training. </p>



<p>As I explained in a previous blog post, GPUs have accelerated Artificial Intelligence evolution massively.  </p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png" alt="" class="wp-image-18103" width="334" height="254" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png 668w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED-300x228.png 300w" sizes="auto, (max-width: 334px) 100vw, 334px" /></a></figure></div>



<p>However, building a GPU server is not that easy. And failing to create an appropriate infrastructure can have consequences on training time.</p>



<p>If you use GPUs, you should know that there are 2 ways to connect them to the motherboard, so that they can communicate with the other components (network, CPU, storage devices). Solution 1 is through <strong>PCI Express</strong> and solution 2 through <strong>SXM2</strong>. We will talk about <strong>SXM2</strong> in the future. Today, we will focus on <strong>PCI Express</strong>, because it has a strong dependency on the choice of adjacent hardware, such as the PCI bus or the CPU.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>                     NVIDIA V100 with SXM2 design</th><th class="has-text-align-center" data-align="center">                          NVIDIA V100 with PCI express design</th></tr></thead><tbody><tr><td><img loading="lazy" decoding="async" width="609" height="644" class="wp-image-18763" style="width: 500px" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-01.jpg" alt="NVIDIA V100 with SXM2 design" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-01.jpg 609w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-01-284x300.jpg 284w" sizes="auto, (max-width: 609px) 100vw, 609px" /><br>Source : <a aria-label="undefined (opens in a new tab)" href="https://www.ebizpc.com/NVIDIA-Tesla-V100-900-2G502-0300-000-16GB-GPU-p/900-2g503-0310-000.htm" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.ebizpc.com/NVIDIA-Tesla-V100-900-2G502-0300-000-16GB-GPU-p/900-2g503-0310-000.htm</a></td><td class="has-text-align-center" data-align="center"><img loading="lazy" decoding="async" width="450" height="450" class="wp-image-18764" style="width: 500px" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-02.jpg" alt="NVIDIA V100 with PCI express design" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02.jpg 450w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02-300x300.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02-150x150.jpg 150w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02-70x70.jpg 70w" sizes="auto, (max-width: 450px) 100vw, 450px" /><br>Source : <a aria-label="undefined (opens in a new tab)" href="https://nvidiastore.com.br/nvidia-tesla-v100-16gb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://nvidiastore.com.br/nvidia-tesla-v100-16gb</a></td></tr></tbody></table><figcaption>SXM2 design VS PCI Express Design</figcaption></figure>



<p>This is a major element to consider when talking about deep learning: the data loading phase wastes compute time, so bandwidth between components and GPUs is a key bottleneck in most deep learning training contexts.</p>



<h2 class="wp-block-heading">How does PCI-Express work and why you should care about the number of PCIe lanes?</h2>



<h3 class="wp-block-heading">What is a PCI-Express Lanes and are there any associated CPU limitations?</h3>



<p>Each V100 GPU uses 16 PCIe lanes. What does that mean exactly?</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="618" height="442" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-03.png" alt="Extract from NVidia V100 product specification sheet" class="wp-image-18767" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03.png 618w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03-300x215.png 300w" sizes="auto, (max-width: 618px) 100vw, 618px" /><figcaption>Extract from NVidia V100 product specification <a href="https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf" target="_blank" aria-label="undefined (opens in a new tab)" rel="noreferrer noopener nofollow external" data-wpel-link="external">sheet</a></figcaption></figure></div>



<p>The <strong><em>&#8220;x16&#8221;</em></strong> means that the PCIe link has 16 dedicated lanes. So&#8230; next question: what is a PCI Express lane?</p>



<h4 class="wp-block-heading">What&#8217;s a PCI Express lane?</h4>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3.jpeg" alt="2 PCI Express Devices with its interconnexion" class="wp-image-18779" width="424" height="299" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3.jpeg 848w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3-300x211.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3-768x541.jpeg 768w" sizes="auto, (max-width: 424px) 100vw, 424px" /><figcaption>2 PCI Express Devices with its interconnexion : figure inspired of the awesome <a aria-label="undefined (opens in a new tab)" href="https://www.phhsnews.com/what-is-chipset-and-why-should-i-care3538" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">article</a> &#8211; what is chipset and why should I care</figcaption></figure></div>



<p>PCIe lanes are used to communicate between PCIe devices, or between a PCIe device and the CPU. A lane is composed of two differential wire pairs: one pair for inbound communications and one for outbound, so traffic flows in both directions at full speed (full duplex).</p>



<p>Lane communications are similar to network Layer 1 communications &#8211; it&#8217;s all about transferring bits as fast as possible through electrical wires! However, the technique used for the PCIe link is a bit different, as a PCIe device is composed of xN lanes. In our previous example N=16, but it could be any power of 2 from 1 to 16 (1/2/4/8/16).</p>



<h3 class="wp-block-heading">So… if PCIe is similar to network architecture it means that PCIe layers exist, doesn&#8217;t it?</h3>



<p>Yes! You are right: PCIe has 4 layers:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="724" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-1024x724.jpeg" alt="" class="wp-image-18723" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-1024x724.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-300x212.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-768x543.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02.jpeg 1280w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>






<h4 class="wp-block-heading"><strong>The Physical Layer (aka <em>the Big Negotiation Layer</em>)</strong></h4>



<p>The <strong><em>Physical Layer (PL)</em></strong> is responsible for negotiating with the other device the terms and conditions for receiving the raw packets (PLP, for Physical Layer Packets), i.e. the lane width and the frequency.</p>



<p>You should be aware that only the smallest number of lanes of the two devices will be used. This is why choosing the appropriate CPU is so important. CPUs can only manage a limited number of lanes, so <strong>having a nice GPU with 16 PCIe lanes and a CPU with an 8-lane PCIe bus will be as efficient as throwing away half your money because it doesn&#8217;t fit in your wallet.</strong></p>
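

<p>On a Linux host, you can observe the result of this negotiation directly with <code>lspci</code>. The sketch below is hedged: the bus address <code>01:00.0</code> is only an example, so locate your GPU first with <code>lspci | grep -i nvidia</code>:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># LnkCap = what the device supports; LnkSta = the width/speed actually negotiated
sudo lspci -vv -s 01:00.0 | grep -E "LnkCap:|LnkSta:"</code></pre>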



<p>Packets received at the <strong><em>Physical Layer (aka PHY)</em></strong> come from other PCIe devices or from the system (via <strong><em>Direct Memory Access — DMA</em></strong>, or from the CPU for instance) and are encapsulated in a frame.</p>



<p>The purpose of a Start-of-Frame is to say: “I am sending you data, this is the beginning,” and it takes just 1 byte to say that!</p>



<p>The <strong><em>End-of-Frame</em> </strong>word is also 1 byte to say “goodbye I’m done with it”.</p>



<p>This layer implements an <strong><em>8b/10b or 128b/130b encoding</em></strong>, which we will explain later, and which is mainly used for <strong><em>clock recovery</em></strong>.</p>



<h4 class="wp-block-heading"><strong>The Data Link Layer Packet (aka <em>Let’s put this mess in the right&nbsp;order</em>)</strong></h4>



<p>The <strong><em>Data Link Layer Packet (DLLP)</em></strong> starts with a <strong><em>Packet Sequence Number</em></strong>. This is really important, as a packet might get corrupted at some point, so it may need to be uniquely identified for retry purposes. The <strong><em>Sequence Number</em></strong> is coded on 2 bytes.</p>



<p>The <strong><em>Data Link Layer Packet</em></strong> then carries the <strong><em>Transaction Layer Packet</em></strong> and is closed with the <strong><em>LCRC (Link Cyclic Redundancy Check)</em></strong>, which is used to check the integrity of the <strong><em>Transaction Layer Packet (meaning the actual payload)</em></strong>.</p>



<p>If the <strong><em>LCRC</em></strong> is validated, then the <em><strong>Data Link Layer</strong></em> sends an <strong><em>ACK (ACKnowledge)</em></strong> signal to the <em><strong>emitter</strong></em> through the <strong><em>Physical Layer</em></strong>. Otherwise it sends a <strong><em>NAK (Not AcKnowledge)</em></strong> signal to the emitter, which will resend the frame associated with the <strong><em>sequence number</em></strong>; this retry is handled by the replay buffer on the <em><strong>emitter</strong></em> side.</p>



<h4 class="wp-block-heading"><strong>The Transaction Layer</strong></h4>



<p>The <strong><em>Transaction Layer</em></strong> is responsible for <strong>managing the actual payload (Header + Data)</strong>, as well as the (optional) message digest, the <strong><em>ECRC (End-to-End Cyclic Redundancy Check)</em></strong>. The <strong><em>Transaction Layer Packet</em></strong> comes from the <strong><em>Data Link Layer</em></strong>, where it has been <strong>decapsulated</strong>.</p>



<p>An <strong>integrity check</strong> is performed if needed/requested. This step checks the integrity of the business logic and ensures no packet corruption when passing data from the <strong><em>Data Link Layer</em></strong> to the <em><strong>Transaction Layer</strong></em>.</p>



<p>The header describes the type of transaction, such as:</p>



<ul class="wp-block-list"><li>Memory Transaction</li><li>I/O Transaction</li><li>Configuration Transaction</li><li>or Message Transaction</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-1024x600.jpeg" alt="PCIe Layers" class="wp-image-18781" width="512" height="300" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-1024x600.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-300x176.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-768x450.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E.jpeg 1368w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h4 class="wp-block-heading"><strong>The Application Layer</strong></h4>



<p>The role of the <em><strong>application layer</strong></em> is to handle the <strong><em>User Logic</em></strong>. This layer sends the <strong><em>Header and the data payload</em></strong> to the <strong><em>Transaction Layer</em></strong>. The magic happens in this layer, where data is routed to the different hardware components.</p>



<h3 class="wp-block-heading">How PCIe is communicating with the rest of the&nbsp;world?</h3>



<p>The PCIe link uses the <strong>packet-switching concept found in networking, in full-duplex mode.</strong></p>



<p>PCIe devices have an <strong>internal clock to orchestrate PCIe</strong> <em><strong>Data Transfer Cycles</strong></em>. This <strong><em>Data Transfer Cycle</em></strong> is also orchestrated thanks to the <strong><em>Reference Clock</em></strong>. The latter sends a signal through a <strong><em>dedicated lane</em> (which is not part of the x1/2/4/8/16/32 mentioned above)</strong>. This clock helps both the receiving and the emitting devices synchronize for packet communications.</p>



<p><strong>Each PCIe lane is used to send bytes in parallel with the other lanes</strong>. The <strong><em>Clock Synchronization</em></strong> mentioned above helps the receiver put those bytes back in the right order.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="618" height="442" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-03.png" alt="" class="wp-image-18767" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03.png 618w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03-300x215.png 300w" sizes="auto, (max-width: 618px) 100vw, 618px" /><figcaption>x16 means 16 lanes of parallel communication on generation 3 of PCIe&nbsp;protocol</figcaption></figure></div>



<h3 class="wp-block-heading">You may have the bytes in order but do you have the data integrity at the physical layer&nbsp;?</h3>



<p>To ensure <strong>integrity</strong>, PCIe devices use the <strong>8b/10b encoding scheme for PCIe generations 1 and 2</strong>, or the <strong>128b/130b encoding scheme for generations 3 and 4.</strong></p>



<p>These encodings are used to prevent the loss of temporal landmarks, especially when transmitting consecutive identical bits. This process is called &#8220;<strong><em>Clock Recovery</em></strong>&#8221;.</p>



<p>In the 128b/130b case, 128 bits of payload data are sent with 2 control bits added to them.</p>



<h4 class="wp-block-heading">Quick examples</h4>



<p><em>Let’s simplify it with an 8b/10b example:</em> according to IEEE 802.3 clause 36, table 36–1a, based on the Ethernet specifications, here is the 8b/10b encoding table:</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="600" height="546" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-04.png" alt="IEEE 802.3 clause 36, table 36–1a - 8b/10b encoding table" class="wp-image-18770" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-04.png 600w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-04-300x273.png 300w" sizes="auto, (max-width: 600px) 100vw, 600px" /><figcaption>IEEE 802.3 clause 36, table 36–1a &#8211; 8b/10b encoding table</figcaption></figure></div>



<p>So how can the receiver tell the difference between all those repeating 0s (code group name D0.0)?</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-1024x819.png" alt="Repeating bits everywhere" class="wp-image-18777" width="512" height="410" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-1024x819.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-300x240.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-768x615.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4.png 1381w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>8b/10b encoding is composed of 5b/6b + 3b/4b encodings.</p>



<p>Therefore, with a negative running disparity (<strong>rd-</strong>), <strong>00000 000</strong> will be encoded into <strong>100111 0100</strong>: the first 5 bits of the original data, <strong>00000</strong>, are encoded to <strong>100111</strong> using 5b/6b encoding, and the second group of 3 bits, <strong>000</strong>, is encoded into <strong>0100</strong> using 3b/4b encoding.</p>



<p>With a positive running disparity (<strong>rd+</strong>), the same byte <strong>00000 000</strong> turns into <strong>011000 1011</strong> instead.</p>



<p><strong>Therefore the original data, which was 8 bits, is now 10 bits, due to the control bits (1 control bit for 5b/6b and 1 for 3b/4b).</strong></p>
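

<p>A tiny shell loop makes the benefit visible: both 10-bit encodings of D0.0 contain exactly five ones and five zeros, so the line stays DC-balanced and always gives the receiver transitions to lock onto:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Count the ones in each 10-bit code word of D0.0 (both running-disparity variants)
for w in 1001110100 0110001011; do
  ones=$(printf '%s' "$w" | tr -cd 1 | wc -c)
  echo "$w has $ones ones and $((10 - ones)) zeros"
done</code></pre>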



<p>But don&#8217;t worry, I will draft a dedicated blog post about encoding later.</p>



<p><strong>PCIe generations 1 and 2 were designed with 8b/10b encoding</strong>, meaning that the <strong>actual data transmitted was only 80% of the total load</strong> (as 20%, i.e. 2 bits out of every 10, are used for clock synchronization).</p>



<p><strong>PCIe Gen3&amp;4 were designed with 128b/130b</strong>, meaning that the <strong>control bits now represent only 1.56% of the payload.</strong> Quite good, isn’t it?</p>
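

<p>Both percentages are easy to verify. Note that the post counts the 8b/10b overhead against the 10-bit code word, and the 128b/130b overhead against the 128-bit payload:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 8b/10b: 2 control bits per 10-bit word; 128b/130b: 2 control bits per 128-bit payload
awk 'BEGIN {
  printf "8b/10b control overhead:    %.2f%%\n", 100 * 2 / 10
  printf "128b/130b control overhead: %.2f%%\n", 100 * 2 / 128
}'</code></pre>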



<h3 class="wp-block-heading">Let’s calculate the PCIe bandwidth together</h3>



<p>Here is the table of PCIe version specifications:</p>



<figure class="wp-block-table"><table><thead><tr><th>Number of Lanes</th><th>PCIe 1.0 (2003)</th><th>PCIe 2.0 (2007)</th><th><strong>PCIe 3.0 (2010)</strong></th><th><strong>PCIe 4.0 (2017)</strong></th><th>PCIe 5.0 (2019)</th><th>PCIe 6.0 (2021)</th></tr></thead><tbody><tr><td>x1</td><td>250 MB/s</td><td>500 MB/s</td><td>1 GB/s</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td></tr><tr><td>x2</td><td>500 MB/s</td><td>1 GB/s</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td><td>16 GB/s</td></tr><tr><td>x4</td><td>1 GB/s</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td><td>16 GB/s</td><td>32 GB/s</td></tr><tr><td>x8</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td><td>16 GB/s</td><td>32 GB/s</td><td>64 GB/s</td></tr><tr><td><strong>x16</strong></td><td>4 GB/s</td><td>8 GB/s</td><td><strong>16 GB/s</strong></td><td>32 GB/s</td><td>64 GB/s</td><td>128 GB/s</td></tr></tbody></table><figcaption>consortium PCI-SIG PCIe theoretical bandwidth/Lane/Way specification sheet</figcaption></figure>



<figure class="wp-block-table"><table><thead><tr><th>                                </th><th>PCIe 1.0 (2003)</th><th>PCIe 2.0 (2007)</th><th>PCIe 3.0 (2010)</th><th>PCIe 4.0 (2017)</th><th>PCIe 5.0 (2019)</th><th>PCIe 6.0 (2021)</th></tr></thead><tbody><tr><td><strong>Frequency</strong></td><td>2.5 GT/s</td><td>5.0 GT/s</td><td>8.0 GT/s</td><td>16 GT/s</td><td>32 GT/s</td><td>64 GT/s</td></tr></tbody></table><figcaption>consortium PCI-SIG PCIe theoretical raw bit rate specification sheet</figcaption></figure>



<p>To obtain such numbers, let&#8217;s look at the general bandwidth formula:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-1024x155.jpeg" alt="" class="wp-image-18793" width="512" height="78" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-1024x155.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-300x46.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-768x117.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB.jpeg 1298w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<ul class="wp-block-list"><li>BW stands for Bandwidth</li><li>MT/s&nbsp;: Mega Transfers per second</li><li>Encoding could be 4b/5b/, 8b/10b, 128b/130b,&nbsp;…</li></ul>



<h4 class="wp-block-heading">For PCIe v1.0:</h4>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-1024x170.jpeg" alt="BW/lane\ (MB/s) = \ 2\ 500\ (MT/s)\ *\ \frac{8\ bits}{10\ bits} * \frac{1\ Byte}{8\ bits" class="wp-image-18785" width="512" height="85" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-1024x170.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-300x50.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-768x127.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227.jpeg 1231w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB.jpeg" alt="BW/lane\ (MB/s) = \ 250\ (MB/s)" class="wp-image-18788" width="347" height="79" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB.jpeg 806w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB-300x67.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB-768x172.jpeg 768w" sizes="auto, (max-width: 347px) 100vw, 347px" /></figure></div>



<h4 class="wp-block-heading">For PCIe v3.0 (the one that interest us for NVIDIA V100):</h4>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-1024x154.jpeg" alt="BW/lane\ (MB/s) = \ 8\ 000\ (MT/s)\ *\ \frac{128\ bits}{130\ bits} * \frac{1\ Byte}{8\ bits}" class="wp-image-18795" width="512" height="77" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-1024x154.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-300x45.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-768x115.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A.jpeg 1292w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F.jpeg" alt="BW/lane\ (MB/s) = \ 984.6\ (MB/s)" class="wp-image-18796" width="355" height="63" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F.jpeg 802w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F-300x53.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F-768x136.jpeg 768w" sizes="auto, (max-width: 355px) 100vw, 355px" /></figure></div>



<p>Therefore, with <strong>16 lanes for an NVIDIA V100 connected in PCIe v3.0</strong>, we have an effective data transfer rate (data bandwidth) <strong>of nearly 16GB/s/way</strong> (<strong>the actual bandwidth is 15.75GB/s/way</strong>).</p>



<p>You need to be careful not to get confused, as total bandwidth can also be interpreted as two-way bandwidth; in that case, the total x16 bandwidth is around 32GB/s.</p>



<p><em><strong>Note:</strong></em> Another element that we haven&#8217;t considered is that the maximum theoretical bandwidth needs to be reduced by around 1 Gb/s for the error correction protocols (<strong><em>ECRC</em></strong> and <strong><em>LCRC</em></strong>), as well as for the <strong><em>Header</em></strong> (<strong><em>Start tag, Sequence tag, Header</em></strong>) and <strong><em>Footer</em></strong> (<em><strong>End tag</strong></em>) overheads explained earlier in this blog post.</p>



<h3 class="wp-block-heading">In conclusion</h3>



<p>We have seen that PCI Express has evolved a lot, and that it&#8217;s based on the same concepts as networking. To get the best from PCIe devices, it is necessary to understand the fundamentals of the underlying infrastructure.</p>



<p>Failing to choose the right underlying motherboard, CPU or bus can lead to major performance bottlenecks and GPU under-performance.</p>



<p>To sum up:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Friends don&#8217;t let friends build their own GPUs hosts 😉</p><cite>Jean-Louis Quéguiner July 1<sup>st</sup>, 2020</cite></blockquote>



<p>If you liked this post but want to drill down a bit into the Deep Learning and AI aspect of things, don&#8217;t hesitate to check out my other blog posts:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png" alt="" class="wp-image-18099" width="515" height="376" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png 748w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028-300x219.png 300w" sizes="auto, (max-width: 515px) 100vw, 515px" /></a></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/what-does-training-neural-networks-mean/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png" alt="What does training neural networks mean?" class="wp-image-17932" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1.png 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/distributed-training-in-a-deep-learning-context/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png" alt="Distributed Learning in a Deep Learning context" class="wp-image-18106" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D.png 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure></div>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-pci-express-works-and-why-you-should-care-gpu%2F&amp;action_name=How%20PCI-Express%20works%20and%20why%20you%20should%20care%3F%20%23GPU&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Distributed Training in a Deep Learning Context</title>
		<link>https://blog.ovhcloud.com/distributed-training-in-a-deep-learning-context/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Tue, 05 May 2020 10:14:07 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Neural networks]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17871</guid>

					<description><![CDATA[Previously on OVHcloud Blog &#8230; In previous blog posts we have discussed a high level approach to deep learning as well as what is meant by &#8216;training&#8217; in relation to Deep Learning. Following the article, I had lots of questions entering my twitter inbox, especially regarding how GPUs actually works. I decided, therefore, to write [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdistributed-training-in-a-deep-learning-context%2F&amp;action_name=Distributed%20Training%20in%20a%20Deep%20Learning%20Context&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png" alt="Distributed Learning in a Deep Learning context" class="wp-image-18106" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Previously on OVHcloud Blog &#8230;</h3>



<p>In previous blog posts we have discussed a <a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude">high level approach to deep learning</a> as well as what is meant by &#8216;training&#8217; in relation to Deep Learning.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png" alt="" class="wp-image-18099" width="374" height="273" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png 748w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028-300x219.png 300w" sizes="auto, (max-width: 374px) 100vw, 374px" /></a></figure></div>



<p>Following the article, I had lots of questions entering my twitter inbox, especially regarding how GPUs actually work.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="410" height="157" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/image.png" alt="" class="wp-image-17882" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/image.png 410w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/image-300x115.png 300w" sizes="auto, (max-width: 410px) 100vw, 410px" /><figcaption>Don&#8217;t worry it&#8217;s a friend, he is ok with me sharing the DM 😉</figcaption></figure></div>



<p>I decided, therefore, to write an article on how GPUs work:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png" alt="" class="wp-image-18103" width="334" height="254" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png 668w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED-300x228.png 300w" sizes="auto, (max-width: 334px) 100vw, 334px" /></a></figure></div>



<p>During our R&amp;D process around hardware and AI models, the question of distributed training came up (quickly). But before looking in-depth at distributed training, I invite you to read the following article to understand how Deep Learning training actually works:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/what-does-training-neural-networks-mean/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png" alt="What does training neural networks mean?" class="wp-image-17932" width="476" height="249" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1.png 1200w" sizes="auto, (max-width: 476px) 100vw, 476px" /></a></figure></div>



<p>As previously discussed, Neural Network training depends on:</p>



<ul class="wp-block-list"><li>Input Data</li><li>Neural Network architecture composed of &#8216;Layers&#8217;</li><li>Weights</li><li>Learning Rate (step used to adjust neural network weights)</li></ul>



<h2 class="wp-block-heading">Why do we need distributed Learning</h2>



<p>Deep Learning is mainly used for pattern learning on unstructured data. <strong>Unstructured data &#8211; such as text corpora, images, video or sound &#8211; can represent a huge amount of data to train on.</strong></p>



<p>Training on such a corpus can take days or even weeks, because of the size of the data and/or the size of the network.</p>



<p>Multiple distributed learning approaches can be considered.</p>



<h2 class="wp-block-heading">The different Distributed Learning approaches</h2>



<p>There are two main categories of distributed training when it comes to Deep Learning, and both of them are based on the <strong><a rel="noreferrer noopener nofollow external" href="https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm" target="_blank" data-wpel-link="external">divide and conquer paradigm</a></strong>.</p>



<p>The first category is named <strong>&#8220;Distributed Data Parallelism&#8221;</strong>, where the <strong>data is split across multiple GPUs</strong>.</p>



<p>The second category is called <strong>&#8220;Model Parallelism&#8221;</strong>, where the deep learning <strong>model is split across multiple GPUs</strong>.</p>



<p>However, <strong>Distributed Data Parallelism</strong> is the most common approach, as it <strong>fits almost any problem</strong>. The second approach has some serious technical limitations in relation to model splitting. Splitting a model is a highly technical exercise, as you need to know the space used by each part of the network in the <strong>DRAM</strong> of the GPU. Once you have the <strong>DRAM usage per slice</strong>, you need to enforce the computation by <strong>hard-coding the placement of Neural Network layers onto the desired GPU</strong>. <strong>This approach makes it hardware-related</strong>, as the DRAM may vary from one GPU to the other, while <strong>Distributed Data Parallelism</strong> just requires <strong>data size adjustments (usually batch size), which is relatively simple</strong>.</p>



<p>The <strong>Distributed Data Parallelism</strong> model has two variants, each of which has its advantages and disadvantages. The first variant lets you train a model with <strong>synchronous weight adjustment</strong>: <strong>each training batch on each GPU returns the corrections</strong> that need to be made to the model, and <strong>every worker has to wait until all the others have finished their task before receiving the new set of weights</strong> to use in the next training batch.</p>



<p>The second variant lets you work in an <strong>asynchronous way</strong>: each batch on each GPU reports the corrections that need to be made to the neural network, and the <strong>weights coordinator</strong> sends a <strong>new set of weights without waiting for the other GPUs to finish training their own</strong> batch.</p>
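

<p>To make Distributed Data Parallelism concrete, here is a hedged launch sketch using PyTorch&#8217;s <code>torchrun</code> launcher (a tool more recent than this post; <code>train.py</code> and its <code>--batch-size</code> flag are hypothetical stand-ins for a script that wraps its model in <code>DistributedDataParallel</code>):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># One worker process per GPU; DistributedDataParallel all-reduces the gradients
# synchronously, so every worker waits for the others before the next batch
torchrun --nproc_per_node=4 train.py --batch-size 256</code></pre>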



<h2 class="wp-block-heading">3 cheat sheets to better understand Distributed Deep Learning</h2>



<p>In these cheat sheets, let&#8217;s assume you&#8217;re using Docker with a volume attached.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="942" height="1024" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/type1-942x1024.png" alt="" class="wp-image-18048" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-942x1024.png 942w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-276x300.png 276w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-768x835.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1.png 1004w" sizes="auto, (max-width: 942px) 100vw, 942px" /></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="904" height="1024" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/type2-904x1024.png" alt="" class="wp-image-18049" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-904x1024.png 904w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-265x300.png 265w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-768x870.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2.png 945w" sizes="auto, (max-width: 904px) 100vw, 904px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1543" height="2182" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/distrib-training1.jpeg" alt="" class="wp-image-18036" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1.jpeg 1543w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-212x300.jpeg 212w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-724x1024.jpeg 724w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-768x1086.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-1086x1536.jpeg 1086w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-1448x2048.jpeg 1448w" sizes="auto, (max-width: 1543px) 100vw, 1543px" /></figure>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78.png" alt="" class="wp-image-18096" width="320" height="322" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78.png 640w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78-150x150.png 150w" sizes="auto, (max-width: 320px) 100vw, 320px" /><figcaption>Now you need to choose your Distributed Training strategy (wisely)</figcaption></figure></div>






<h2 class="wp-block-heading">Further Readings</h2>



<p>While we have covered a lot in this blog post, we haven&#8217;t covered nearly all the aspects of Deep Learning distributed training &#8211; including prior work, history and the associated mathematics.</p>



<p>I highly suggest that you read the great paper <em><a href="https://stanford.edu/~rezab/classes/cme323/S16/projects_reports/hedge_usmani.pdf" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Parallel and Distributed Deep Learning</a></em> by <strong>Vishakh Hegde</strong> and <strong>Sheema Usmani</strong> (both from Stanford University).</p>



<p>As well as the article <em><a href="https://arxiv.org/pdf/1802.09941.pdf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis</a></em>, written by <strong>Tal Ben-Nun</strong> and <strong>Torsten Hoefler</strong> (ETH Zurich, Switzerland). I suggest that you start by jumping to <strong>section 6.3</strong>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdistributed-training-in-a-deep-learning-context%2F&amp;action_name=Distributed%20Training%20in%20a%20Deep%20Learning%20Context&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
