<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>OVHcloud Engineering Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/category/engineering/feed/" rel="self" type="application/rss+xml" />
	<link></link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Wed, 01 Apr 2026 12:56:38 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>OVHcloud Engineering Archives - OVHcloud Blog</title>
	<link></link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Extract Text from Images with OCR using Python and OVHcloud AI Endpoints</title>
		<link>https://blog.ovhcloud.com/extract-text-from-images-with-ocr-using-python-and-ovhcloud-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Wed, 01 Apr 2026 12:55:19 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30992</guid>

					<description><![CDATA[If you want to have more information on&#160;AI Endpoints, please read the&#160;following blog post.&#160;You can also have a look at our&#160;previous blog posts&#160;on how to use AI Endpoints. You can find the full code example in the GitHub repository. In this article,&#160;we will explore how to perform OCR&#160;(Optical Character Recognition)&#160;on images using a vision-capable LLM,&#160;the&#160;OpenAI Python library,&#160;and [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>If you want to have more information on&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, please read the&nbsp;<a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">following blog post</a>.</em>&nbsp;<em>You can also have a look at our&nbsp;<a href="https://blog.ovhcloud.com/tag/ai-endpoints/" data-wpel-link="internal">previous blog posts</a>&nbsp;on how to use AI Endpoints.</em></p>



<p><em>You can find the full code example in the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/python-ocr" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</em></p>



<p>In this article,&nbsp;we will explore how to perform OCR&nbsp;(Optical Character Recognition)&nbsp;on images using a vision-capable LLM,&nbsp;the&nbsp;<a href="https://github.com/openai/openai-python" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenAI Python library</a>,&nbsp;and OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.</p>



<h3 class="wp-block-heading">Introduction to OCR with Vision Models</h3>



<p>Optical Character Recognition has been around for decades,&nbsp;but traditional OCR engines often struggle with complex layouts,&nbsp;handwritten text,&nbsp;or noisy images.&nbsp;Vision-capable Large Language Models bring a new approach:&nbsp;instead of relying on specialized OCR pipelines,&nbsp;you can simply send an image to a model that understands both visual and textual content.</p>



<p>In this example,&nbsp;we use the&nbsp;<a href="https://github.com/openai/openai-python" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenAI Python library</a>&nbsp;to create a simple OCR script powered by a vision model hosted on OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.</p>



<p>The whole application is a single Python file: no complex setup, just <code><strong>pip install openai</strong></code> and you&#8217;re ready to go.</p>



<h3 class="wp-block-heading">Setting up the Environment Variables</h3>



<p>Before running the script, you need to set the following environment variables:</p>



<pre title="Environment variables" class="wp-block-code"><code lang="" class=" line-numbers">export OVH_AI_ENDPOINTS_ACCESS_TOKEN="your-access-token"<br>export OVH_AI_ENDPOINTS_MODEL_URL="https://your-model-url"<br>export OVH_AI_ENDPOINTS_VLLM_MODEL="your-vision-model-name"</code></pre>



<p>You can find how to create your access token, model URL, and model name in the <a href="https://endpoints.ai.cloud.ovh.net/catalog" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints catalog</a>. Make sure to choose a <strong>vision-capable model</strong>.</p>
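<p>Since all three variables are required, a quick sanity check at startup saves a confusing API error later. Here is a minimal sketch (the <code>check_env</code> helper is our own addition, not part of the article&#8217;s script):</p>

```python
import os

# The three variables the script reads (matching the export commands above).
REQUIRED_VARS = (
    "OVH_AI_ENDPOINTS_ACCESS_TOKEN",
    "OVH_AI_ENDPOINTS_MODEL_URL",
    "OVH_AI_ENDPOINTS_VLLM_MODEL",
)

def check_env(env=os.environ):
    """Return the names of the required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# At startup, fail fast with a clear message instead of an opaque API error:
# missing = check_env()
# if missing:
#     raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```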



<h3 class="wp-block-heading">Installing Dependencies</h3>



<p>The only dependency is the OpenAI Python library:</p>



<pre title="OpenAI dependency" class="wp-block-code"><code lang="bash" class="language-bash">pip install openai</code></pre>



<h3 class="wp-block-heading">Define the System Prompt</h3>



<p>The first step is to define a system prompt that describes what our OCR service does.&nbsp;This prompt tells the model how to behave:</p>



<pre title="System prompt" class="wp-block-code"><code lang="" class=" line-numbers">SYSTEM_PROMPT = """You are an expert OCR engine.<br>Extract every piece of text visible in the provided image.<br>Preserve the original layout as faithfully as possible (line breaks, columns, tables).<br>Do NOT interpret, summarise, or translate the content.<br>Use markdown formatting to represent the layout (e.g. tables, lists).<br>If the image contains no text, reply with: "No text found."<br>"""</code></pre>



<p>We tell it to behave as an expert OCR engine, to preserve the original layout, and to use markdown formatting for structured content like tables or lists.</p>



<h3 class="wp-block-heading">Load the Image</h3>



<p>Before sending the image to the model,&nbsp;we need to encode it as a base64 string.&nbsp;Here is a simple helper function that reads a local PNG file and returns a base64-encoded string:</p>



<pre title="Image loading" class="wp-block-code"><code lang="" class=" line-numbers">import base64<br>from pathlib import Path<br><br>def load_image_as_base64(path: Path) -&gt; str:<br>    """Load a local image and encode it as base64."""<br>    with open(path, "rb") as f:<br>        return base64.b64encode(f.read()).decode("utf-8")</code></pre>



<p>The base64-encoded data is what gets sent to the vision model as part of the prompt.</p>
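<p>For reference, the data URL the model expects can be built from the raw bytes in one step. The <code>to_data_url</code> helper below is an illustrative addition (the article inlines this in the request instead):</p>

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as the data URL format used in the chat request."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

<p>The resulting string, e.g. <code>data:image/png;base64,&#8230;</code>, is exactly what goes into the <code>image_url</code> field of the request shown below.</p>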






<h3 class="wp-block-heading">Extract Text from the Image</h3>



<p>The <code><strong>extract_text</strong></code> function sends the image to the vision model and returns the extracted text:</p>



<pre title="Extract text from image" class="wp-block-code"><code lang="" class=" line-numbers">def extract_text(client: OpenAI, image_base64: str, model: str) -&gt; str:<br>    """Extract text from an image using the vision model."""<br>    response = client.chat.completions.create(<br>        model=model,<br>        temperature=0.0,<br>        messages=[<br>            {"role": "system", "content": SYSTEM_PROMPT},<br>            {<br>                "role": "user",<br>                "content": [<br>                    {<br>                        "type": "image_url",<br>                        "image_url": {<br>                            "url": f"data:image/png;base64,{image_base64}"<br>                        }<br>                    }<br>                ]<br>            }<br>        ]<br>    )<br>    return response.choices[0].message.content</code></pre>



<p>The image is passed as a data URL inside the <code><strong>image_url</strong></code> field, following the OpenAI Vision API format. The temperature is set to <code>0.0</code> because we want deterministic, faithful text extraction and not creative output.</p>
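<p>The script hard-codes <code>image/png</code> in the data URL. If you feed JPEG or WebP files as well, one option (our addition, not part of the original script) is to derive the MIME type from the file extension:</p>

```python
import mimetypes
from pathlib import Path

def guess_image_mime(path: Path) -> str:
    """Guess an image MIME type from the file extension, falling back to PNG."""
    mime, _ = mimetypes.guess_type(path.name)
    return mime if mime and mime.startswith("image/") else "image/png"
```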



<h3 class="wp-block-heading">Configure the Client</h3>



<p>This example uses a vision-capable model hosted on OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>. Since AI Endpoints exposes an OpenAI-compatible API, we use the <code>OpenAI</code> client and just point it to the OVHcloud endpoint:</p>



<pre title="Open AI client configuration" class="wp-block-code"><code lang="" class=" line-numbers">import os<br>from openai import OpenAI<br><br>client = OpenAI(<br>    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),<br>    base_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),<br>)<br><br>model_name = os.getenv("OVH_AI_ENDPOINTS_VLLM_MODEL")</code></pre>



<p>A few things to note:</p>



<ul class="wp-block-list">
<li>The <strong>API key</strong>, <strong>base URL</strong>, and <strong>model name</strong> are read from environment variables. </li>



<li>The OpenAI library works with any OpenAI-compatible API, which makes it a perfect fit for AI Endpoints.</li>
</ul>



<h3 class="wp-block-heading">Assemble and Run</h3>



<p>With the client configured, extracting text from an image is straightforward:</p>



<pre title="Run the OCR" class="wp-block-code"><code lang="" class=" line-numbers">image_base64 = load_image_as_base64(Path("./doc.png"))<br>result = extract_text(client, image_base64, model_name)<br>print(result)</code></pre>



<p>And that&#8217;s it!</p>



<p>Here is the image used for this example:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img fetchpriority="high" decoding="async" width="946" height="693" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/doc-1.png" alt="Used image for OCR example" class="wp-image-31002" style="width:600px" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/doc-1.png 946w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/doc-1-300x220.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/doc-1-768x563.png 768w" sizes="(max-width: 946px) 100vw, 946px" /></figure>



<p>And the result:</p>



<pre title="Run the OCR" class="wp-block-code"><code lang="" class=" line-numbers">$ python ocr_demo.py<br>📄 Loading image: doc.png<br>🔍 Running OCR with Qwen2.5-VL-72B-Instruct via OVHcloud AI Endpoints...<br><br>📝 Extracted text 📝<br>Every month, the OVHcloud Developer Advocate team creates content, shares knowledge, and connects with the tech community. Here’s a look at what we did in March 2026. 🚀<br><br>🎙️ “Tranches de Tech” – Our monthly podcast<br><br>A new episode of our French-language podcast Tranches de Tech🥑 just dropped!<br><br>🎧 Episode 102: Tranches de Tech #26 – Architecte, c’est une bonne situation ça ?<br><br>This month we sat down with Alexandre Touret, Architect at Worldline to discuss the evolving role of software architects and the growing impact of AI on development practices. From Spotify’s claim that their devs no longer code, to agentic tools like OpenClaw and Claude Code reshaping workflows. We also cover ANSSI’s revised open-source policy, IBM tripling junior hires, and the critical responsibility of mentoring the next generation of developers in an AI-driven world.<br><br>📺 Live on Twitch<br><br>We streamed live on Twitch this month! Here’s what we covered:<br><br>🎥 Rémy Vandepoel discussed with Hugo Allabert and François Loiseau about our Public VCFaaS. Catch the replay on YouTube ▶️.<br><br>🎤 Conference Talks<br><br>The team hit the road (and the stage) at several conferences this month:<br><br>🇳🇱 KubeCon Amsterdam – Amsterdam, Netherlands 🇳🇱<br><br>Aurélie Vache gave a talk: The Ultimate Kubernetes Challenge: An Interactive Trivia Game</code></pre>



<h3 class="wp-block-heading">Conclusion</h3>



<p>In this article,&nbsp;we have seen how to use a vision-capable LLM to perform OCR on images using the&nbsp;<a href="https://github.com/openai/openai-python" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenAI Python library</a>&nbsp;and OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.&nbsp;The OpenAI library makes it very easy to send images to a vision model and extract text,&nbsp;and Python allows us to run the whole thing as a simple script.</p>



<p>There is a dedicated Discord channel&nbsp;(#<em>ai-endpoints</em>)&nbsp;on our Discord server&nbsp;(<em><a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://discord.gg/ovhcloud</a></em>),&nbsp;see you there!</p>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Secure your Software Supply Chain with OVHcloud Managed Private Registry (MPR)</title>
		<link>https://blog.ovhcloud.com/secure-your-software-supply-chain-with-ovhcloud-managed-private-registry-mpr/</link>
		
		<dc:creator><![CDATA[Aurélie Vache]]></dc:creator>
		<pubDate>Fri, 13 Feb 2026 16:40:51 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[OVHcloud Managed Private Registry]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[Security]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30357</guid>

					<description><![CDATA[Before an application goes to production, it passes through several stages: source code, build, packaging and distribution. But malicious code &#8211; such as a compromised dependency, breached CI pipeline, or modified package in a registry &#8211; can be introduced at any point in the development cycle, potentially impacting thousands of projects. This is precisely where [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-full is-resized"><img decoding="async" width="1012" height="1011" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Gribouillis-2026-01-30-13.25.17.911.png" alt="" class="wp-image-30442" style="aspect-ratio:1.0009787401988517;width:437px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Gribouillis-2026-01-30-13.25.17.911.png 1012w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Gribouillis-2026-01-30-13.25.17.911-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Gribouillis-2026-01-30-13.25.17.911-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Gribouillis-2026-01-30-13.25.17.911-768x767.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Gribouillis-2026-01-30-13.25.17.911-70x70.png 70w" sizes="(max-width: 1012px) 100vw, 1012px" /></figure>



<p>Before an application goes to production, it passes through several stages: source code, build, packaging and distribution. But malicious code &#8211; such as a compromised dependency, a breached CI pipeline, or a modified package in a registry &#8211; can be introduced at any point in the development cycle, potentially impacting thousands of projects.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="581" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-13-1024x581.png" alt="" class="wp-image-30358" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-13-1024x581.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-13-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-13-768x436.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-13.png 1292w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This is precisely where <strong>Software Supply Chain Security </strong>(SSCS) comes in: to protect not just the code itself, but also how it’s built, delivered, and utilised.</p>



<p>Attacks like SolarWinds and Log4Shell aren&#8217;t isolated incidents: they show how supply chain attacks have escalated in both frequency and severity.</p>



<figure class="wp-block-image aligncenter is-resized"><img loading="lazy" decoding="async" width="800" height="800" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/managed_private_registry.png" alt="" class="wp-image-28658" style="width:145px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/managed_private_registry.png 800w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/managed_private_registry-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/managed_private_registry-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/managed_private_registry-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/managed_private_registry-70x70.png 70w" sizes="auto, (max-width: 800px) 100vw, 800px" /></figure>



<p>This blog post explores recommended solutions and best practices for <a href="https://www.ovhcloud.com/en/public-cloud/managed-rancher-service/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><u>OVHcloud Managed</u></a> <a href="https://www.ovhcloud.com/en/public-cloud/managed-rancher-service/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><u>Private Registry</u></a> (MPR), an OCI-compliant artifact registry, to help you enhance your Software Supply Chain Security.</p>



<h3 class="wp-block-heading">Generate a Software Bill Of Materials (SBOM)</h3>



<p>An SBOM provides a list of all the ingredients (OS, libraries, code) that compose the images that will run on your Kubernetes cluster.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="383" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-14-1024x383.png" alt="" class="wp-image-30360" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-14-1024x383.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-14-300x112.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-14-768x287.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-14.png 1256w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>From that list, you can find out more about the image, its vulnerabilities, and licenses.</p>



<h4 class="wp-block-heading">Generate an SBOM manually</h4>



<p>To manually generate an SBOM from your image, click the <strong>‘GENERATE SBOM’</strong> button:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="280" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.28.13-1024x280.png" alt="" class="wp-image-30361" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.28.13-1024x280.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.28.13-300x82.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.28.13-768x210.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.28.13-1536x420.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.28.13-2048x560.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Within seconds, the <em>SBOM </em>column for your image will display <em>“Queued”</em>, then change to <em>“Generating”</em>, and an <em>“SBOM details”</em> link will appear.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="226" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-31-1024x226.png" alt="" class="wp-image-30393" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-31-1024x226.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-31-300x66.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-31-768x170.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-31-1536x340.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-31-2048x453.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Click the <strong>&#8216;SBOM details&#8217;</strong> link to view the SBOM:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="557" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.26.38-1024x557.png" alt="" class="wp-image-30367" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.26.38-1024x557.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.26.38-300x163.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.26.38-768x418.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.26.38-1536x835.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.26.38-2048x1114.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Your application’s SBOM is generated by <strong>Trivy </strong>in <strong>SPDX </strong>format. This item is then listed as an accessory for your image in the registry.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="130" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-17-1024x130.png" alt="" class="wp-image-30371" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-17-1024x130.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-17-300x38.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-17-768x98.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-17-1536x195.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-17-2048x260.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Click the <strong>&#8216;sbom.harbor&#8217;</strong> accessory type for more details:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="629" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-25-1024x629.png" alt="" class="wp-image-30379" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-25-1024x629.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-25-300x184.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-25-768x472.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-25-1536x944.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-25-2048x1259.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
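<p>Because the SBOM is plain SPDX JSON, it is easy to post-process once downloaded. A short sketch (field names follow the SPDX 2.x JSON format; the sample document and the <code>list_packages</code> helper are illustrative, not part of the registry):</p>

```python
import json

def list_packages(sbom_json: str) -> list:
    """Return 'name@version' for each package listed in an SPDX-JSON SBOM."""
    doc = json.loads(sbom_json)
    return [f"{pkg['name']}@{pkg.get('versionInfo', '?')}"
            for pkg in doc.get("packages", [])]

# Example with a minimal SPDX-like document:
sample = '{"packages": [{"name": "openssl", "versionInfo": "3.0.11"}]}'
print(list_packages(sample))  # ['openssl@3.0.11']
```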



<h4 class="wp-block-heading">Generate an SBOM automatically</h4>



<p>Manually generating an SBOM is good practice, but automating the process is even better. The private registry can automatically generate the SBOM for you once an image is pushed to the desired project.</p>



<p>Click the project your image is part of, navigate to the <em>‘Configuration’</em> tab, then tick the <strong>SBOM generation </strong>checkbox:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-15-1024x538.png" alt="" class="wp-image-30365" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-15-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-15-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-15-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-15-1536x806.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-15-2048x1075.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Vulnerability scanning</h3>



<p>We recommend running vulnerability scans on the images to confirm that:</p>



<ul class="wp-block-list">
<li>the images provided are free of any known vulnerabilities (CVEs);</li>



<li>security patches are well integrated before deployment;</li>



<li>the images used in production comply with security and compliance policies.</li>
</ul>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="406" height="232" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-32.png" alt="" class="wp-image-30395" style="width:329px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-32.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-32-300x171.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure>



<p>There are several vulnerability scanners available, like <a href="https://trivy.dev/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><u>Trivy</u></a>, <a href="https://docs.docker.com/scout/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><u>Docker Scout</u></a>, and <a href="https://github.com/anchore/grype" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><u>Grype</u></a>.</p>



<p>The OVHcloud Managed Private Registry uses Trivy as its default vulnerability scanner, but you can add more scanners if needed. Go to the <em>Administration</em> panel, click <em>‘<strong>Interrogation Services</strong>’</em>, then navigate to the <em>‘<strong>Scanners</strong>’</em> tab:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="437" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-33-1024x437.png" alt="" class="wp-image-30400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-33-1024x437.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-33-300x128.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-33-768x328.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-33-1536x655.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-33-2048x873.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h4 class="wp-block-heading">Scan your image manually</h4>



<p>To manually run a vulnerability scan on your image, go to your project and click the <strong>SCAN VULNERABILITIES</strong> button:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="186" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-35-1024x186.png" alt="" class="wp-image-30406" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-35-1024x186.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-35-300x55.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-35-768x140.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-35-1536x279.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-35-2048x372.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Within a few seconds, a scan will run and reveal any vulnerabilities detected in your image.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="442" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.21-1024x442.png" alt="" class="wp-image-30404" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.21-1024x442.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.21-300x129.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.21-768x331.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.21-1536x662.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.21-2048x883.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Click your image to view the list of CVEs:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="557" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.39-1-1024x557.png" alt="" class="wp-image-30414" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.39-1-1024x557.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.39-1-300x163.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.39-1-768x418.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.39-1-1536x835.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/Capture-decran-2026-01-29-a-14.25.39-1-2048x1114.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
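<p>The Managed Private Registry is built on Harbor, so the same scan can also be triggered from a CI pipeline through Harbor&#8217;s v2 REST API. A sketch of the endpoint (path per the Harbor v2 API; the registry URL, project, and credentials below are placeholders):</p>

```python
def scan_url(registry: str, project: str, repo: str, ref: str) -> str:
    """Build the Harbor v2 API endpoint that queues a scan for one artifact."""
    return (f"{registry}/api/v2.0/projects/{project}"
            f"/repositories/{repo}/artifacts/{ref}/scan")

# A POST to this URL with your registry credentials queues the scan,
# e.g. with the requests library:
#   import requests
#   requests.post(scan_url("https://registry.example.com", "demo", "app", "latest"),
#                 auth=("user", "password"))
```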



<h4 class="wp-block-heading">Scan your image automatically</h4>



<p>To automatically scan images on push, click the project your image is part of, then the <em>‘Configuration’ </em>tab, and tick the <strong>‘Vulnerabilities scanning’</strong> checkbox:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="390" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-36-1024x390.png" alt="" class="wp-image-30408" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-36-1024x390.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-36-300x114.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-36-768x293.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-36-1536x585.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-36-2048x781.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h4 class="wp-block-heading">Schedule vulnerability scans</h4>



<p>Another way to stay informed is by configuring your vulnerability scanner to run scans every day. Go to the <em>Administration </em>panel, click <em>‘<strong>Interrogation</strong> <strong>Services</strong>’</em>, then the <em>‘<strong>Vulnerability</strong>’</em> tab:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="264" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-34-1024x264.png" alt="" class="wp-image-30401" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-34-1024x264.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-34-300x77.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-34-768x198.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-34-1536x396.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-34-2048x528.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can schedule the scan Hourly, Daily or Weekly, or customize exactly when the scan will be triggered.</p>



<p>Scheduled scans ensure that existing images are periodically analyzed for newly discovered vulnerabilities (CVEs).</p>



<h4 class="wp-block-heading">Prevent vulnerable images from running</h4>



<p>You can also configure a project to prevent vulnerable images from being pulled. To do so, tick the <strong>Prevent vulnerable images from running</strong> checkbox.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="206" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-40-1024x206.png" alt="" class="wp-image-30430" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-40-1024x206.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-40-300x60.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-40-768x154.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-40.png 1424w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Select the severity level of vulnerabilities to prevent images from running, from None to Critical.</p>



<p>With this configuration, images cannot be pulled if they contain vulnerabilities with a severity equal to or higher than the selected level.</p>
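<p>As a sketch of this rule (illustrative shell, not OVHcloud's implementation): severities are ordered None &lt; Low &lt; Medium &lt; High &lt; Critical, and a pull is blocked as soon as the image's worst finding reaches the configured threshold:</p>

```shell
# Illustrative sketch of the pull-blocking rule (not OVHcloud code).
severity_rank() {
  case "$1" in
    None) echo 0 ;;
    Low) echo 1 ;;
    Medium) echo 2 ;;
    High) echo 3 ;;
    Critical) echo 4 ;;
  esac
}

threshold="High"        # severity selected in the project configuration
worst_found="Critical"  # worst severity reported by the scanner for the image

# An image is blocked when its worst finding reaches the threshold
if [ "$(severity_rank "$worst_found")" -ge "$(severity_rank "$threshold")" ]; then
  verdict="pull blocked"
else
  verdict="pull allowed"
fi
echo "$verdict"
```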



<h3 class="wp-block-heading">Exploitable vulnerabilities</h3>



<p>When a scanner finds vulnerabilities in your images, it does not necessarily mean they are exploitable in your application or image.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="170" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-41-1024x170.png" alt="" class="wp-image-30433" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-41-1024x170.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-41-300x50.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-41-768x128.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-41-1536x255.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-41.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In this example, my application is built with golang:1.25-alpine, but Trivy found several CVEs that are only exploitable in Go 1.19.1 or earlier.</p>



<p>To skip these false positives, a solution exists.</p>



<p>VEX (Vulnerability Exploitability eXchange) is a <strong>standard format</strong> for stating whether a vulnerability is <strong>exploitable</strong> or not in a specific context.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="609" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-43-1024x609.png" alt="" class="wp-image-30435" style="aspect-ratio:1.6814258951355643;width:452px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-43-1024x609.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-43-300x178.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-43-768x456.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-43-1536x913.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-43.png 1681w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can generate a VEX file with <a href="https://github.com/openvex/vexctl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vexctl</a> or <a href="https://pkg.go.dev/golang.org/x/vuln/cmd/govulncheck" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">govulncheck</a> tools.</p>



<p>Example:</p>



<pre class="wp-block-code"><code class=""># With vexctl<br>$ VULN_ID="CVE-2022-27664"<br>$ PRODUCT="pkg:golang/golang.org/x/net@v0.0.0-20220127200216-cd36cc0744dd"<br>$ vexctl create --file vex.json --author 'Aurélie Vache' --product "pkg:oci/demo@sha256:$HASH?repository_url=$REGISTRY/$HARBOR_PROJECT/demo" --vuln "$VULN_ID" --status 'not_affected' --justification 'vulnerable_code_not_present' --impact-statement "HTTP/2 vulnerability $VULN_ID is not exploitable because the image is compiled with Go 1.20, which contains the patched library."<br><br># With govulncheck (for Go apps)<br>$ govulncheck -format openvex ./... &gt; ../demo.vex.json</code></pre>
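<p>For illustration, the resulting OpenVEX document looks roughly like this. The identifiers, author and timestamp below are made up; the structure follows the OpenVEX specification:</p>

```json
{
  "@context": "https://openvex.dev/ns/v0.2.0",
  "@id": "https://example.com/vex/demo-001",
  "author": "Aurélie Vache",
  "timestamp": "2026-01-29T14:00:00Z",
  "version": 1,
  "statements": [
    {
      "vulnerability": { "name": "CVE-2022-27664" },
      "products": [
        { "@id": "pkg:golang/golang.org/x/net@v0.0.0-20220127200216-cd36cc0744dd" }
      ],
      "status": "not_affected",
      "justification": "vulnerable_code_not_present"
    }
  ]
}
```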



<p>For the moment, OVHcloud MPR (managed Harbor) does not support VEX files (and the OpenVEX format) <a href="https://github.com/goharbor/harbor/issues/22720" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">but it is planned in the future</a>.</p>



<p>💡 The good news is that you can configure a CVE allowlist containing the non-exploitable CVEs to ignore during vulnerability scanning:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="522" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-42-1024x522.png" alt="" class="wp-image-30434" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-42-1024x522.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-42-300x153.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-42-768x391.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-42-1536x782.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-42.png 1814w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can optionally uncheck the <strong>Never expires</strong> checkbox and use the calendar selector to set an expiry date for the allowlist.</p>



<h3 class="wp-block-heading">Sign your images</h3>



<p>It’s recommended to sign your images to ensure they haven’t been modified and originate from your pipeline (CI/CD).</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="278" height="282" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-38.png" alt="" class="wp-image-30412" style="width:128px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-38.png 278w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-38-70x70.png 70w" sizes="auto, (max-width: 278px) 100vw, 278px" /></figure>



<p>Signing your images is crucial for protecting them against compromised registries and unauthorised image replacements.</p>



<p><strong>Without a signature, there’s no guarantee the deployed image is the one you originally built!</strong></p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="818" height="302" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-37.png" alt="" class="wp-image-30410" style="aspect-ratio:2.708559106290115;width:482px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-37.png 818w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-37-300x111.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-37-768x284.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /></figure>



<p>You can sign your images with <a href="https://github.com/sigstore/cosign" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><u>Sigstore Cosign</u></a> or <a href="https://github.com/notaryproject/notation" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><u>Notation</u></a> tools:</p>



<pre class="wp-block-code"><code class="">$ export HARBOR_PROJECT=supply-chain<br>$ export IMAGE=xxxxxx.c1.de1.container-registry.ovh.net/$HARBOR_PROJECT/demo<br>$ export HASH=$(skopeo inspect docker://${IMAGE}:latest | jq -r .Digest | sed "s/^sha256://")<br><br># Sign with Cosign<br>## Generate a private and a public key<br>$ cosign generate-key-pair<br>## Sign the image with the OCI 1.1 Referrers API<br>$ cosign sign -y --key cosign.key $IMAGE@sha256:$HASH<br><br># Sign with Notation<br>## Generate an RSA key &amp; a self-signed X.509 test certificate<br>$ notation cert generate-test --default "test"<br><br>## Sign the image with the OCI 1.1 Referrers API<br>$ export NOTATION_EXPERIMENTAL=1 ; notation sign -d --allow-referrers-api ${IMAGE}@sha256:${HASH}</code></pre>



<p>OVHcloud MPR supports signatures from both Cosign and Notation.</p>



<p>Your signature will appear beside your image as an accessory, and a green checkmark ✅ will appear in the signature column:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="227" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-26-1024x227.png" alt="" class="wp-image-30382" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-26-1024x227.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-26-300x67.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-26-768x170.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-26-1536x341.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-26-2048x455.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>⚠️ Keep in mind, MPR (Harbor) doesn’t support signatures generated by Cosign v3 (the signature will upload and appear as an accessory, but the mark will stay red instead of turning green). This bug should <a href="https://github.com/goharbor/harbor/issues/22401" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><u>be fixed in Harbor 2.15</u></a> 💪.</p>



<p>Signing your OCI artifacts, such as SBOMs, and linking them to your images is also recommended; you can do this using Cosign:</p>



<pre class="wp-block-code"><code class="">$ cosign attest -y --predicate sbom.spdx.json --key cosign.key $IMAGE@sha256:$HASH</code></pre>



<p>The attestations will be uploaded to the OVHcloud private registry and listed as accessories.</p>



<h4 class="wp-block-heading">Ensure only verified images are pushed to your registry’s projects</h4>



<p>To allow only verified/signed images to be deployed on a project, click the project your image is part of, navigate to the <em>‘<strong>Configuration</strong>’</em> tab, and tick the <strong>Cosign</strong> and/or <strong>Notation </strong>checkbox:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="191" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-39-1024x191.png" alt="" class="wp-image-30418" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-39-1024x191.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-39-300x56.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-39-768x143.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-39.png 1406w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>When checked, the registry will only allow verified images to be pulled from the project. Verified images are determined by <strong>Cosign</strong> or <strong>Notation</strong>, depending on the policy you have checked. Note that if you have both Cosign and Notation policies enforced, then images will need to be signed by both Cosign and Notation to be pulled.</p>



<h3 class="wp-block-heading">Tag immutability</h3>



<p>By default, tags are mutable: you can push an image demo with the tag 1.0.0, modify the code, and push again to the same tag.</p>



<p>This can be useful for fixing a bug, but in terms of security, a mutable tag does not guarantee that the image you built and pushed as version 1.0.0 is the same image that now exists in the registry.</p>
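<p>One mitigation on the consumer side is to reference images by digest instead of by tag: a digest reference identifies one exact artifact and cannot silently change. The registry URL and digest below are hypothetical:</p>

```shell
# Hypothetical values for illustration only
IMAGE="registry.example.com/supply-chain/demo"
TAG_REF="${IMAGE}:1.0.0"   # mutable: may point to a different artifact tomorrow
DIGEST="sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
DIGEST_REF="${IMAGE}@${DIGEST}"   # immutable: always the exact same artifact
echo "$DIGEST_REF"
```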



<p>Moreover, on Harbor (so on OVHcloud MPR), due to limitations in the upstream OCI Distribution specification, the registry does not enforce a strict link between a tag and an image digest.</p>



<p>As a result, a tag can be reassigned to a different artifact. This has a side effect on the registry: the tag migrates across artifacts, and every artifact whose tag is taken away becomes tagless.</p>



<p>To prevent this situation, you can configure tag immutability rules. Tag immutability guarantees that an immutable tagged artifact cannot be deleted, and also cannot be altered in any way such as through re-pushing, re-tagging, or replication from another target registry.</p>



<p>To do that, click on your project and on the <strong>Policy</strong> tab and select <strong>TAG IMMUTABILITY</strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="469" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-44-1024x469.png" alt="" class="wp-image-30438" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-44-1024x469.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-44-300x137.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-44-768x352.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-44-1536x704.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-44.png 2030w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>And then click the <strong>ADD RULE</strong> button.</p>



<p>Fill in the repositories and tags lists according to your needs.</p>



<p>Example:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="522" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-45-1024x522.png" alt="" class="wp-image-30439" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-45-1024x522.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-45-300x153.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-45-768x392.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-45-1536x783.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-45-2048x1044.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>⚠️ You can add a maximum of 15 immutability rules per project.</p>



<h3 class="wp-block-heading">To wrap things up</h3>



<p>Software supply chain security is super important these days, and everything is changing quickly &#8211; the concepts, standards, and tools. So, leveraging useful tools like OVHcloud MPR, and knowing how to set them up, can boost your Software Supply Chain Security efforts.</p>



<p>To learn more about how to use and configure <a href="https://help.ovhcloud.com/csm/fr-documentation-public-cloud-containers-orchestration-managed-private-registry?id=kb_browse_cat&amp;kb_id=574a8325551974502d4c6e78b7421938&amp;kb_category=7939e6a464282d10476b3689cb0d0ed7&amp;spa=1" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud private registries</a>, don’t hesitate to follow our guides.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fsecure-your-software-supply-chain-with-ovhcloud-managed-private-registry-mpr%2F&amp;action_name=Secure%20your%20Software%20Supply%20Chain%20with%20OVHcloud%20Managed%20Private%20Registry%20%28MPR%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: Custom metric autoscaling for LLM inference with vLLM on OVHcloud AI Deploy and observability using MKS</title>
		<link>https://blog.ovhcloud.com/reference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 10 Feb 2026 08:51:11 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[MKS]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30203</guid>

					<description><![CDATA[Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability. This reference architecture describes a comprehensive solution for deploying, autoscaling and monitoring vLLM-based LLM workloads on OVHcloud infrastructure. It combinesAI Deploy, used for model serving with custom metric autoscaling, and Managed Kubernetes Service (MKS), which [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks%2F&amp;action_name=Reference%20Architecture%3A%20Custom%20metric%20autoscaling%20for%20LLM%20inference%20with%20vLLM%20on%20OVHcloud%20AI%20Deploy%20and%20observability%20using%20MKS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em><strong>Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability.</strong></em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg" alt="" class="wp-image-30579" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability based on OVHcloud infrastructure</em></figcaption></figure>



<p>This reference architecture describes a comprehensive solution for <strong>deploying, autoscaling and monitoring vLLM-based LLM workloads</strong> on OVHcloud infrastructure. It combines <strong>AI Deploy</strong>, used for <strong>model serving with custom metric autoscaling</strong>, and <strong>Managed Kubernetes Service (MKS)</strong>, which hosts the monitoring and observability stack.</p>



<p>By leveraging <strong>application-level Prometheus metrics exposed by vLLM</strong>, AI Deploy can automatically scale inference replicas based on real workload demand, ensuring <strong>high availability, consistent performance under load and efficient GPU utilisation</strong>. This autoscaling mechanism allows the platform to react dynamically to traffic spikes while maintaining predictable latency for end users.</p>



<p>On top of this scalable inference layer, the monitoring architecture provides <strong>observability</strong> through <strong>Prometheus</strong>, <strong>Grafana</strong> and Alertmanager. It enables real-time performance monitoring, capacity planning, and operational insights, while ensuring <strong>full data sovereignty</strong> for organisations running Large Language Models (LLMs) in production environments.</p>



<p><strong>What are the key benefits</strong>?</p>



<ul class="wp-block-list">
<li><strong>Cost-effective</strong>: Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability</strong>: Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure</strong>: All metrics and data remain within European datacentres</li>



<li><strong>Production-ready</strong>: Persistent storage, high availability, and automated monitoring</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">AI Deploy</h3>



<p>OVHcloud AI Deploy is a<strong>&nbsp;Container as a Service</strong>&nbsp;(CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications/APIs based on Machine Learning (ML), Deep Learning (DL) or Large Language Models (LLMs).</p>



<p><strong>Key points to keep in mind</strong>:</p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong>&nbsp;Bring your own custom Docker image and deploy it with a single command line or a few clicks</li>



<li><strong>High-performance computing:</strong>&nbsp;A complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong>&nbsp;Supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong>&nbsp;Billing per minute, no surcharges</li>
</ul>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p><strong>OVHcloud MKS</strong> is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p><strong>What should you keep in mind?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Only pay for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, upgrades and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking and persistent storage</li>



<li><strong>Scalability and flexibility</strong>: Easily scale workloads and node pools to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p>In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Overview of the architecture</h2>



<p>This reference architecture describes a <strong>complete, secure and scalable solution</strong> to:</p>



<ul class="wp-block-list">
<li>Deploy an LLM with vLLM and <strong>AI Deploy</strong>, benefiting from automatic scaling based on custom metrics to ensure high service availability &#8211; vLLM exposes <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>/metrics</strong></mark></code> via its public HTTPS endpoint on AI Deploy</li>



<li>Collect, store and visualise these vLLM metrics using Prometheus and Grafana on <strong>MKS</strong></li>
</ul>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="1200" height="630" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg" alt="" class="wp-image-30578" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-768x403.jpg 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability architecture overview</em></figcaption></figure>



<p>Here you will find the main components of the architecture. The solution comprises three main layers:</p>



<ol class="wp-block-list">
<li><strong>Model serving layer</strong> with AI Deploy
<ul class="wp-block-list">
<li>vLLM containers running on top of GPUs for LLM inference</li>



<li>vLLM inference server exposing Prometheus metrics</li>



<li>Automatic scaling based on custom metrics to ensure high availability</li>



<li>HTTPS endpoints with Bearer token authentication</li>
</ul>
</li>



<li><strong>Monitoring and observability infrastructure</strong> using Kubernetes
<ul class="wp-block-list">
<li>Prometheus for metrics collection and storage</li>



<li>Grafana for visualisation and dashboards</li>



<li>Persistent volume storage for long-term retention</li>
</ul>
</li>



<li><strong>Network layer</strong>
<ul class="wp-block-list">
<li>Secure HTTPS communication between components</li>



<li>OVHcloud LoadBalancer for external access</li>
</ul>
</li>
</ol>



<p>To go further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> </a><strong><code><mark class="has-inline-color has-ast-global-color-0-color">Administrator</mark></code></strong> role</li>



<li><strong>ovhai CLI available</strong> &#8211;&nbsp;<em>install the&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a></em></li>



<li>A <strong>Hugging Face access</strong> &#8211; <em>create a&nbsp;<a href="https://huggingface.co/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">access token</a></em></li>



<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></strong></code> installed and <code><strong><mark class="has-inline-color has-ast-global-color-0-color">helm</mark></strong></code> installed (at least version 3.x)</li>
</ul>



<p><strong>🚀 Now you have all the ingredients for our recipe, it’s time to deploy Ministral 3 14B using AI Deploy and the vLLM Docker container!</strong></p>



<h2 class="wp-block-heading">Architecture guide: From autoscaling to observability for LLMs served by vLLM</h2>



<p>Let’s set up and deploy this architecture!</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg" alt="" class="wp-image-30580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Overview of the deployment workflow</em></figcaption></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><strong><em>In this example, <a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a> is used. Choose the open-source model of your choice and follow the same steps, adapting the model slug (from Hugging Face), the versions and the GPU(s) flavour.</em></strong></p>
</blockquote>



<p><em>Remember that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Manage access tokens</h3>



<p>Before deploying the model, you need two access tokens: a <strong>Hugging Face token</strong> to pull the model weights, and an <strong>AI Deploy Bearer token</strong> to secure access to your endpoint once it has been deployed.</p>



<p>Export your&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face token</a>.</p>



<pre class="wp-block-code"><code class="">export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx</code></pre>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Create a Bearer token</a>&nbsp;to access your AI Deploy app once it&#8217;s been deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>This returns the following output:</p>



<pre class="wp-block-code"><code class="">Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-01-26 11:53:05<br>Updated At: 20-01-26 11:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</code></pre>



<p>You can now store and export your access token:</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h3 class="wp-block-heading">Step 2 &#8211; LLM deployment using AI Deploy</h3>



<p>Before introducing the monitoring stack, this architecture starts with the <strong>deployment of the <strong>Ministral 3 14B</strong> on OVHcloud AI Deploy</strong>, configured to <strong>autoscale based on custom Prometheus metrics exposed by vLLM itself</strong>.</p>



<h4 class="wp-block-heading">1. Define the targeted vLLM metric for autoscaling</h4>



<p>Before proceeding with the deployment of the <strong>Ministral 3 14B</strong> endpoint, you have to choose the metric you want to use as the trigger for scaling.</p>



<p>Instead of relying solely on CPU/RAM utilisation, AI Deploy allows autoscaling decisions to be driven by <strong>application-level signals</strong>.</p>



<p>To do this, you can consult the <a href="https://docs.vllm.ai/en/latest/design/metrics/#v1-metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">metrics exposed by vLLM</a>.</p>



<p>In this example, you can use a basic metric such as <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>vllm:num_requests_running</strong></mark></code> to scale the number of replicas based on <strong>real inference load</strong>.</p>



<p>This enables:</p>



<ul class="wp-block-list">
<li>Faster reaction to traffic spikes</li>



<li>Better GPU utilisation</li>



<li>Reduced inference latency under load</li>



<li>Cost-efficient scaling</li>
</ul>
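<p>As an illustration, here is a minimal sketch of how such a gauge can be read out of a Prometheus scrape. The <code>sample_scrape</code> payload below is a fabricated example in the standard text exposition format, not a real response from the endpoint:</p>

```python
# Minimal reader for the Prometheus text exposition format, used here to
# extract the gauge that drives autoscaling. It ignores HELP/TYPE comments
# and matches the metric name with or without labels.

def extract_metric(payload: str, name: str) -> float:
    """Return the value of the first sample whose metric name matches."""
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and HELP/TYPE lines
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            return float(value)
    raise KeyError(name)

# illustrative scrape payload (not real output)
sample_scrape = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistralai/Ministral-3-14B-Instruct-2512"} 42.0
"""

print(extract_metric(sample_scrape, "vllm:num_requests_running"))  # 42.0
```

<p>In production you would not need to write this yourself: AI Deploy parses the <code>/metrics</code> endpoint for you once <code>--auto-custom-metric-format PROMETHEUS</code> is set.</p>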



<p>Finally, the configuration chosen for scaling this application is as follows:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Value</th><th>Description</th></tr></thead><tbody><tr><td>Metric source</td><td><code>/metrics</code></td><td>vLLM Prometheus endpoint</td></tr><tr><td>Metric name</td><td><code>vllm:num_requests_running</code></td><td>Number of in-flight requests</td></tr><tr><td>Aggregation</td><td><code>AVERAGE</code></td><td>Mean across replicas</td></tr><tr><td>Target value</td><td><code>50</code></td><td>Desired load per replica</td></tr><tr><td>Min replicas</td><td><code>1</code></td><td>Baseline capacity</td></tr><tr><td>Max replicas</td><td><code>3</code></td><td>Burst capacity</td></tr></tbody></table></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><em><strong>You can choose the metric that best suits your use case. You can also apply a patch to your AI Deploy deployment at any time to change the target metric for scaling</strong></em>.</p>
</blockquote>



<p>When the <strong>average number of running requests exceeds 50</strong>, AI Deploy automatically provisions <strong>additional GPU-backed replicas</strong>.</p>
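<p>The post does not detail AI Deploy&#8217;s internal scaling algorithm, but the behaviour described in the table can be illustrated with the Kubernetes-HPA-style formula <code>desired = ceil(current &#215; metric / target)</code>, clamped to the configured bounds. A sketch under that assumption:</p>

```python
import math

def desired_replicas(current: int, avg_metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 3) -> int:
    """HPA-style scaling decision: size the deployment so that each
    replica carries roughly `target` units of the metric, clamped to
    the allowed [min_replicas, max_replicas] range."""
    desired = math.ceil(current * avg_metric / target)
    return max(min_replicas, min(max_replicas, desired))

# 1 replica averaging 120 in-flight requests against a target of 50
print(desired_replicas(1, 120, 50))  # 3 -> scale up
# load drops back well below the target
print(desired_replicas(3, 10, 50))   # 1 -> scale down
```

<p>This also shows why the target value matters: too low and the deployment oscillates, too high and latency degrades before new replicas appear.</p>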



<h4 class="wp-block-heading">2. Deploy Ministral 3 14B using AI Deploy</h4>



<p>Now you can deploy the LLM using the <strong><code>ovhai</code> CLI</strong>.</p>



<p>Key elements necessary for proper functioning:</p>



<ul class="wp-block-list">
<li>GPU-based inference: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">1 x H100</mark></code></strong></li>



<li>vLLM OpenAI-compatible Docker image: <a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.13.0</mark></code></strong></a></li>



<li>Custom autoscaling rules based on Prometheus metrics: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm:num_requests_running</mark></strong></code></li>
</ul>



<p>Below is the reference command used to deploy the <strong><a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a></strong>:</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>  --name vllm-ministral-14B-autoscaling-custom-metric \<br>  --default-http-port 8000 \<br>  --label ai_deploy_token=my_operator_token \<br>  --gpu 1 \<br>  --flavor h100-1-gpu \<br>  -e OUTLINES_CACHE_DIR=/tmp/.outlines \<br>  -e HF_TOKEN=$MY_HF_TOKEN \<br>  -e HF_HOME=/hub \<br>  -e HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>  -e HF_HUB_ENABLE_HF_TRANSFER=0 \<br>  -v standalone:/hub:rw \<br>  -v standalone:/workspace:rw \<br>  --liveness-probe-path /health \<br>  --liveness-probe-port 8000 \<br>  --liveness-initial-delay-seconds 300 \<br>  --probe-path /v1/models \<br>  --probe-port 8000 \<br>  --initial-delay-seconds 300 \<br>  --auto-min-replicas 1 \<br>  --auto-max-replicas 3 \<br>  --auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics" \<br>  --auto-custom-metric-format PROMETHEUS \<br>  --auto-custom-value-location vllm:num_requests_running \<br>  --auto-custom-target-value 50 \<br>  --auto-custom-metric-aggregation-type AVERAGE \<br>  vllm/vllm-openai:v0.13.0 \<br>  -- bash -c "python3 -m vllm.entrypoints.openai.api_server \<br>    --model mistralai/Ministral-3-14B-Instruct-2512 \<br>    --tokenizer_mode mistral \<br>    --load_format mistral \<br>    --config_format mistral \<br>    --enable-auto-tool-choice \<br>    --tool-call-parser mistral \<br>    --enable-prefix-caching"</code></pre>



<p>Let&#8217;s walk through the different parameters of this command.</p>



<h5 class="wp-block-heading"><strong>a. Start your AI Deploy app</strong></h5>



<p>Launch a new app using&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a>&nbsp;and name it.</p>



<p><code><strong>ovhai app run --name vllm-ministral-14B-autoscaling-custom-metric</strong></code></p>



<h5 class="wp-block-heading"><strong>b. Define access</strong></h5>



<p>Define the HTTP API port and restrict access to your token.</p>



<p><strong><code>--default-http-port 8000</code><br><code>--label ai_deploy_token=my_operator_token</code></strong></p>



<h5 class="wp-block-heading"><strong>c. Configure GPU resources</strong></h5>



<p>Specify the hardware flavour (<code><strong>h100-1-gpu</strong></code>), which refers to an&nbsp;<strong>NVIDIA H100 GPU</strong>, and the number of GPUs (<strong>1</strong>).</p>



<p><code><strong>--gpu 1<br>--flavor h100-1-gpu</strong></code></p>



<p><strong><mark>⚠️WARNING!</mark></strong>&nbsp;For this model, one H100 is sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access L40S and A100 GPUs for your LLM deployment.</p>



<h5 class="wp-block-heading"><strong>d. Set up environment variables</strong></h5>



<p>Configure caching for the&nbsp;<strong>Outlines library</strong>&nbsp;(used for efficient text generation):</p>



<p><code><strong>-e OUTLINES_CACHE_DIR=/tmp/.outlines</strong></code></p>



<p>Pass the&nbsp;<strong>Hugging Face token</strong>&nbsp;(<code>$MY_HF_TOKEN</code>) for model authentication and download:</p>



<p><code><strong>-e HF_TOKEN=$MY_HF_TOKEN</strong></code></p>



<p>Set the&nbsp;<strong>Hugging Face cache directory</strong>&nbsp;to&nbsp;<code>/hub</code>&nbsp;(where models will be stored):</p>



<p><code><strong>-e HF_HOME=/hub</strong></code></p>



<p>Allow execution of&nbsp;<strong>custom remote code</strong>&nbsp;from Hugging Face datasets (required for some model behaviours):</p>



<p><code><strong>-e HF_DATASETS_TRUST_REMOTE_CODE=1</strong></code></p>



<p>Disable&nbsp;<strong>Hugging Face Hub transfer acceleration</strong>&nbsp;(to use standard model downloading):</p>



<p><code><strong>-e HF_HUB_ENABLE_HF_TRANSFER=0</strong></code></p>



<h5 class="wp-block-heading"><strong>e. Mount persistent volumes</strong></h5>



<p>Mount&nbsp;<strong>two persistent storage volumes</strong>:</p>



<ol class="wp-block-list">
<li><code>/hub</code>&nbsp;→ Stores Hugging Face model files</li>



<li><code>/workspace</code>&nbsp;→ Main working directory</li>
</ol>



<p>The&nbsp;<code>rw</code>&nbsp;flag means&nbsp;<strong>read-write access</strong>.</p>



<p><code><strong>-v standalone:/hub:rw<br>-v standalone:/workspace:rw</strong></code></p>



<h5 class="wp-block-heading"><strong>f. Health checks and readiness</strong></h5>



<p>Configure <strong>liveness and readiness probes</strong>:</p>



<ol class="wp-block-list">
<li><code>/health</code> verifies the container is alive</li>



<li><code>/v1/models</code> confirms the model is loaded and ready to serve requests</li>
</ol>



<p>The long initial delays (300 seconds) account for vLLM startup and the loading of the model on the GPU; you can reduce them if your model loads faster.</p>



<p><code><strong>--liveness-probe-path /health<br>--liveness-probe-port 8000<br>--liveness-initial-delay-seconds 300<br><br>--probe-path /v1/models<br>--probe-port 8000<br>--initial-delay-seconds 300</strong></code></p>
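<p>If you script your deployments, you may want to wait until both probes answer before sending traffic. Below is a hedged sketch of such a polling loop; the endpoint paths follow the flags above, while the injectable <code>fetch</code> callback (and the stub used to demonstrate it) are assumptions of this example, not part of the <code>ovhai</code> tooling:</p>

```python
import time

def wait_until_ready(fetch, base_url: str, timeout: float = 600.0,
                     interval: float = 10.0, sleep=time.sleep) -> bool:
    """Poll the liveness and readiness probe endpoints until both
    answer HTTP 200, or give up after `timeout` seconds.

    `fetch(url)` must return an HTTP status code; inject any HTTP
    client wrapper in real use, or a stub in tests.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch(base_url + "/health") == 200 and fetch(base_url + "/v1/models") == 200:
            return True
        sleep(interval)
    return False

# stub probe: /health is always up, /v1/models becomes ready on the 3rd poll
calls = {"models": 0}
def fake_fetch(url):
    if url.endswith("/health"):
        return 200
    calls["models"] += 1
    return 200 if calls["models"] >= 3 else 503

print(wait_until_ready(fake_fetch, "https://example.invalid",
                       timeout=60, interval=0, sleep=lambda s: None))  # True
```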



<h5 class="wp-block-heading"><strong>g. Autoscaling configuration (custom metrics)</strong></h5>



<p>First set the minimum and maximum number of replicas.</p>



<p><strong><code>--auto-min-replicas 1<br>--auto-max-replicas 3</code></strong></p>



<p>This guarantees basic availability (one replica always up) while allowing for peak capacity.</p>



<p>Then enable autoscaling based on application-level metrics exposed by vLLM.</p>



<p><strong><code>--auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics"<br>--auto-custom-metric-format PROMETHEUS<br>--auto-custom-value-location vllm:num_requests_running<br>--auto-custom-target-value 50<br>--auto-custom-metric-aggregation-type AVERAGE</code></strong></p>



<p>AI Deploy:</p>



<ul class="wp-block-list">
<li>Scrapes the local <mark class="has-inline-color has-ast-global-color-0-color"><strong><code>/metrics</code></strong></mark> endpoint</li>



<li>Parses Prometheus-formatted metrics</li>



<li>Extracts the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>vllm:num_requests_running</code></mark></strong> gauge</li>



<li>Computes the average value across replicas</li>
</ul>



<p>Scaling behaviour:</p>



<ul class="wp-block-list">
<li>When the average number of in-flight requests exceeds <strong><code><mark class="has-inline-color has-ast-global-color-0-color">50</mark></code></strong>, AI Deploy adds replicas</li>



<li>When load decreases, replicas are scaled down</li>
</ul>



<p>This approach ensures high availability and predictable latency under fluctuating traffic.</p>



<h5 class="wp-block-heading"><strong>h. Choose the target Docker image and the startup command</strong></h5>



<p>Use the official <strong><a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM OpenAI-compatible Docker image</a></strong>.</p>



<p><strong><code>vllm/vllm-openai:v0.13.0</code></strong></p>



<p>Finally, run the model inside the container using a Python command to launch the vLLM API server:</p>



<ul class="wp-block-list">
<li><strong><code>python3 -m vllm.entrypoints.openai.api_server</code></strong>&nbsp;→ Starts the OpenAI-compatible vLLM API server</li>



<li><strong><code>--model mistralai/Ministral-3-14B-Instruct-2512</code></strong>&nbsp;→ Loads the&nbsp;<strong>Ministral 3 14B</strong>&nbsp;model from Hugging Face</li>



<li><strong><code>--tokenizer_mode mistral</code></strong>&nbsp;→ Uses the&nbsp;<strong>Mistral tokenizer</strong></li>



<li><strong><code>--load_format mistral</code></strong>&nbsp;→ Uses Mistral’s model loading format</li>



<li><strong><code>--config_format mistral</code></strong>&nbsp;→ Ensures the model configuration follows Mistral’s standard</li>



<li><code><strong>--enable-auto-tool-choice </strong></code>→ Automatic call of tools if necessary (function/tool call)</li>



<li><strong><code>--tool-call-parser mistral </code></strong>→ Tool calling support</li>



<li><strong><code>--enable-prefix-caching</code></strong> → Prefix caching for improved throughput and reduced latency</li>
</ul>



<p>You can now launch this command using <strong>ovhai CLI</strong>.</p>



<h4 class="wp-block-heading">3. Check AI Deploy app status</h4>



<p>You can now check if your&nbsp;<strong>AI Deploy</strong>&nbsp;app is alive:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;your_vllm_app_id&gt;</code></pre>



<p><strong>Is your app in&nbsp;<code>RUNNING</code>&nbsp;status?</strong>&nbsp;Perfect! You can check in the logs that the server has started:</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;your_vllm_app_id&gt;</code></pre>



<p><strong><mark>⚠️WARNING!</mark></strong>&nbsp;This step may take a little time as the LLM must be loaded.</p>



<h4 class="wp-block-heading">4. Test that the deployment is functional</h4>



<p>First, you can send a prompt to the LLM. Launch the following query, asking the question of your choice:</p>



<pre class="wp-block-code"><code class="">curl https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \<br>  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "mistralai/Ministral-3-14B-Instruct-2512",<br>    "messages": [<br>      {"role": "system", "content": "You are a helpful assistant."},<br>      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}<br>    ],<br>    "stream": false<br>  }'</code></pre>
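<p>The same smoke test can be written in Python with only the standard library. This is a sketch: the host is the same placeholder as in the curl example, and the request is built but not actually sent until you uncomment the last call:</p>

```python
import json
import os
import urllib.request

def build_chat_request(base_url: str, token: str, question: str) -> urllib.request.Request:
    """Build the same POST /v1/chat/completions call as the curl example."""
    body = {
        "model": "mistralai/Ministral-3-14B-Instruct-2512",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "stream": False,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net",  # placeholder host
    os.environ.get("MY_OVHAI_ACCESS_TOKEN", "dummy"),
    "Give me the name of OVHcloud's founder.",
)
# uncomment to actually send the request once the placeholder is replaced:
# print(urllib.request.urlopen(req).read().decode())
print(req.get_method(), req.full_url)
```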



<p>You can also verify access to vLLM metrics.</p>



<pre class="wp-block-code"><code class="">curl -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/metrics</code></pre>



<p>If both tests show that the model deployment is functional and you receive 200 HTTP responses, you are ready to move on to the next step!</p>



<p>The next step is to set up the observability and monitoring stack. This autoscaling mechanism is <strong>fully independent</strong> from Prometheus used for observability:</p>



<ul class="wp-block-list">
<li>AI Deploy queries the local <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code></mark></strong> endpoint internally</li>



<li>Prometheus scrapes the <strong>same metrics endpoint</strong> externally for monitoring, dashboards and potentially alerting</li>
</ul>



<p>This ensures:</p>



<ul class="wp-block-list">
<li>A single source of truth for metrics</li>



<li>No duplication of exporters</li>



<li>Consistent signals for scaling and observability</li>
</ul>



<h3 class="wp-block-heading">Step 3 &#8211; Create an MKS cluster</h3>



<p>From the <a href="https://manager.eu.ovhcloud.com/#/hub/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Control Panel</a>, create a Kubernetes cluster using <strong>MKS</strong> (Managed Kubernetes Service).</p>



<p>Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Location</strong>: GRA (Gravelines) &#8211; <em>you can select the same region as for AI Deploy</em></li>



<li><strong>Network</strong>: Public</li>



<li><strong>Node pool</strong>:
<ul class="wp-block-list">
<li>Flavour: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">b2-15</mark></strong></code> (or something similar)</li>



<li>Number of nodes: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">3</mark></code></strong></li>



<li>Autoscaling: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">OFF</mark></code></strong></li>
</ul>
</li>



<li><strong>Name your node pool:</strong> <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></strong></li>
</ul>



<p>You should see your cluster (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>prometheus-vllm-metrics-ai-deploy</strong></mark></code>) in the list, along with the following information:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="632" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png" alt="" class="wp-image-30242" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-300x185.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-768x474.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1536x948.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-2048x1264.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the status is green with the <strong><mark style="color:#00d084" class="has-inline-color"><code>OK</code></mark></strong> label, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 4 &#8211; Configure Kubernetes access</h3>



<p>Download your <strong>kubeconfig file</strong> from the OVHcloud Control Panel and configure <strong><code><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></code></strong>:</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p>Now you can create the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>values-prometheus.yaml</code></mark></strong> file:</p>



<pre class="wp-block-code"><code class=""># general configuration<br>nameOverride: "monitoring"<br>fullnameOverride: "monitoring"<br><br># Prometheus configuration<br>prometheus:<br>  prometheusSpec:<br>    # data retention (15d)<br>    retention: 15d<br>    <br>    # scrape interval (15s)<br>    scrapeInterval: 15s<br>    <br>    # persistent storage (required for production deployment)<br>    storageSpec:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed  # OVHcloud storage<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 50Gi  # (can be modified according to your needs)<br>    <br>    # scrape vLLM metrics from your AI Deploy instance (Ministral 3 14B)<br>    additionalScrapeConfigs:<br>      - job_name: 'vllm-ministral'<br>        scheme: https<br>        metrics_path: '/metrics'<br>        scrape_interval: 15s<br>        scrape_timeout: 10s<br>        <br>        # authentication using AI Deploy Bearer token stored Kubernetes Secret<br>        bearer_token_file: /etc/prometheus/secrets/vllm-auth-token/token<br>        static_configs:<br>          - targets:<br>              - '&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net'  # /!\ REPLACE THE &lt;APP_ID&gt; by yours /!\<br>            labels:<br>              service: 'vllm'<br>              model: 'ministral'<br>              environment: 'production'<br>        <br>        # TLS configuration<br>        tls_config:<br>          insecure_skip_verify: false<br>    <br>    # kube-prometheus-stack mounts the secret under /etc/prometheus/secrets/ and makes it accessible to Prometheus<br>    secrets:<br>      - vllm-auth-token<br><br># Grafana configuration (visualization layer)<br>grafana:<br>  enabled: true<br>  <br>  # disable automatic datasource provisioning<br>  sidecar:<br>    datasources:<br>      enabled: false<br>  <br>  # persistent dashboards<br>  persistence:<br>    enabled: true<br>    
storageClassName: csi-cinder-high-speed<br>    size: 10Gi<br>  <br>  # /!\ DEFINE ADMIN PASSWORD - REPLACE "test" BY YOURS /!\<br>  adminPassword: "test"<br>  <br>  # access via OVHcloud LoadBalancer (public IP and managed LB)<br>  service:<br>    type: LoadBalancer<br>    port: 80<br>    annotations:<br>      # optional: restrict access to specific IPs<br>      # service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources: "1.2.3.4/32"<br>  <br># alertmanager (optional but recommended for production)<br>alertmanager:<br>  enabled: true<br>  <br>  alertmanagerSpec:<br>    storage:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 10Gi<br><br># cluster observability components<br>nodeExporter:<br>  enabled: true<br>  <br>kubeStateMetrics:<br>  enabled: true</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><strong><em>On OVHcloud MKS, persistent storage is handled automatically through the Cinder CSI driver. When a PersistentVolumeClaim (PVC) references a supported <code>storageClassName</code> such as <code>csi-cinder-high-speed</code>, OVHcloud dynamically provisions the underlying Block Storage volume and attaches it to the node running the pod. This enables stateful components like Prometheus, Alertmanager and Grafana to persist data reliably without any manual volume management, making the architecture fully cloud-native and operationally simple.</em></strong></p>
</blockquote>



<p>Then create the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></code></strong> namespace:</p>



<pre class="wp-block-code"><code class=""># create namespace<br>kubectl create namespace monitoring<br><br># verify creation<br>kubectl get namespaces | grep monitoring</code></pre>



<p>Finally,  configure the Bearer token secret to access vLLM metrics.</p>



<pre class="wp-block-code"><code class=""># create bearer token secret<br>kubectl create secret generic vllm-auth-token \<br>  --from-literal=token="$MY_OVHAI_ACCESS_TOKEN" \<br>  -n monitoring<br><br># verify secret creation<br>kubectl get secret vllm-auth-token -n monitoring<br><br># test token (optional)<br>kubectl get secret vllm-auth-token -n monitoring \<br>  -o jsonpath='{.data.token}' | base64 -d </code></pre>



<p>Right, if everything is working, let&#8217;s move on to deployment.</p>



<h3 class="wp-block-heading">Step 5 &#8211; Deploy Prometheus stack</h3>



<p>Add the Prometheus Helm repository and install the monitoring stack. The deployment creates:</p>



<ul class="wp-block-list">
<li>Prometheus StatefulSet with persistent storage</li>



<li>Grafana deployment with LoadBalancer access</li>



<li>Alertmanager for future alert configuration (optional)</li>



<li>Supporting components (node exporters, kube-state-metrics)</li>
</ul>



<pre class="wp-block-code"><code class=""># add Helm repository<br>helm repo add prometheus-community \<br>  https://prometheus-community.github.io/helm-charts<br>helm repo update<br><br># install monitoring stack<br>helm install monitoring prometheus-community/kube-prometheus-stack \<br>  --namespace monitoring \<br>  --values values-prometheus.yaml \<br>  --wait</code></pre>



<p>Then you can retrieve the LoadBalancer IP address to access Grafana:</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n monitoring monitoring-grafana</code></pre>



<p>Finally, open your browser to <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://&lt;EXTERNAL-IP&gt;</mark></strong></code> and login with:</p>



<ul class="wp-block-list">
<li><strong>Username</strong>: <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>admin</strong></mark></code></li>



<li><strong>Password</strong>: as configured in your <code><strong><mark class="has-inline-color has-ast-global-color-0-color">values-prometheus.yaml</mark></strong></code> file</li>
</ul>



<h3 class="wp-block-heading">Step 6 &#8211; Create Grafana dashboards</h3>



<p>In this step, you will access the Grafana interface, add your Prometheus as a new data source, then create a complete dashboard with the different vLLM metrics.</p>



<h4 class="wp-block-heading">1. Add a new data source in Grafana</h4>



<p>First of all, create a new Prometheus connection inside Grafana:</p>



<ul class="wp-block-list">
<li>Navigate to <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Connections</code></mark></strong> → <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Data sources</code></mark></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Add data source</mark></code></strong></li>



<li>Select <strong>Prometheus</strong></li>



<li>Configure URL: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://monitoring-prometheus:9090</mark></strong></code></li>



<li>Click <strong>Save &amp; test</strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="609" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png" alt="" class="wp-image-30247" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-300x178.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-768x457.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1536x913.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-2048x1218.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now that your Prometheus has been configured as a new data source, you can create your Grafana dashboard.</p>



<h4 class="wp-block-heading">2. Create your monitoring dashboard</h4>



<p>To begin with, you can use a pre-configured Grafana dashboard by downloading the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">vLLM-metrics-grafana-monitoring.json</mark></code></strong> file locally:</p>





<p>In the left-hand menu, select <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboard</mark></code></strong>:</p>



<ol class="wp-block-list">
<li>Navigate to <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboards</mark></code></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Import</mark></code></strong></li>



<li>Upload the provided dashboard JSON</li>



<li>Select <strong>Prometheus</strong> as datasource</li>



<li>Click <strong>Import</strong> and select the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">vLLM-metrics-grafana-monitoring.json</mark></code></strong> file</li>
</ol>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="449" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png" alt="" class="wp-image-30250" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-300x131.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-768x337.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1536x673.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-2048x897.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The dashboard provides real-time visibility for <strong>Ministral 3 14B</strong> deployed with vLLM container and OVHcloud AI Deploy.</p>



<p>You can now track:</p>



<ul class="wp-block-list">
<li><strong>Performance metrics</strong>: TTFT, inter-token latency, end-to-end latency</li>



<li><strong>Throughput indicators</strong>: Requests per second, token generation rates</li>



<li><strong>Resource utilisation</strong>: KV cache usage, active/waiting requests</li>



<li><strong>Capacity indicators</strong>: Queue depth, preemption rates</li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png" alt="" class="wp-image-30253" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Here are the key metrics tracked and displayed in the Grafana dashboard:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric Category</th><th>Prometheus Metric</th><th>Description</th><th>Use case</th></tr></thead><tbody><tr><td><strong>Latency</strong></td><td><code>vllm:time_to_first_token_seconds</code></td><td>Time until first token generation</td><td>User experience monitoring</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:inter_token_latency_seconds</code></td><td>Time between tokens</td><td>Throughput optimisation</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:e2e_request_latency_seconds</code></td><td>End-to-end request time</td><td>SLA monitoring</td></tr><tr><td><strong>Throughput</strong></td><td><code>vllm:request_success_total</code></td><td>Successful requests counter</td><td>Capacity planning</td></tr><tr><td><strong>Resource</strong></td><td><code>vllm:kv_cache_usage_perc</code></td><td>KV cache memory usage</td><td>Memory management</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_running</code></td><td>Active requests</td><td>Load monitoring</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_waiting</code></td><td>Queued requests</td><td>Overload detection</td></tr><tr><td><strong>Capacity</strong></td><td><code>vllm:num_preemptions_total</code></td><td>Request preemptions</td><td>Peak load indicator</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:prompt_tokens_total</code></td><td>Input tokens processed</td><td>Usage analytics</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:generation_tokens_total</code></td><td>Output tokens generated</td><td>Cost tracking</td></tr></tbody></table></figure>
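<p>Several of the latency metrics above (such as <code>vllm:time_to_first_token_seconds</code>) are Prometheus histograms, so dashboards typically derive an average from the cumulative <code>_sum</code> and <code>_count</code> series, i.e. <code>rate(..._sum) / rate(..._count)</code> in PromQL. A small sketch of that arithmetic, with illustrative snapshot values:</p>

```python
def avg_over_window(sum_start: float, sum_end: float,
                    count_start: float, count_end: float) -> float:
    """Average observation over a window, from two cumulative histogram
    snapshots: delta(_sum) / delta(_count), the same arithmetic as the
    PromQL expression rate(..._sum[w]) / rate(..._count[w])."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        return 0.0  # no new observations in the window
    return (sum_end - sum_start) / delta_count

# e.g. vllm:time_to_first_token_seconds _sum/_count snapshots one minute apart
print(avg_over_window(120.0, 150.0, 1000, 1100))  # 0.3 s average TTFT
```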



<p>Well done, you now have at your disposal:</p>



<ul class="wp-block-list">
<li>An endpoint of the Ministral 3 14B model deployed with vLLM thanks to <strong>OVHcloud AI Deploy</strong> and its autoscaling strategies based on custom metrics</li>



<li>Prometheus for metrics collection and Grafana for visualisation/dashboards thanks to <strong>OVHcloud MKS</strong></li>
</ul>



<p><strong>But how can you check that everything will work when the load increases?</strong></p>



<h3 class="wp-block-heading">Step 7 &#8211; Test autoscaling and real-time visualisation</h3>



<p>The first objective here is to force AI Deploy to:</p>



<ul class="wp-block-list">
<li>Increase <code>vllm:num_requests_running</code></li>



<li>&#8216;Saturate&#8217; a single replica</li>



<li>Trigger the <strong>scale up</strong></li>



<li>Observe replica increase + latency drop</li>
</ul>



<h4 class="wp-block-heading">1. Autoscaling testing strategy</h4>



<p>The goal is to combine:</p>



<ul class="wp-block-list">
<li><strong>High concurrency</strong></li>



<li><strong>Long prompts</strong> (KV cache heavy)</li>



<li><strong>Long generations</strong></li>



<li><strong>Bursty load</strong></li>
</ul>



<p>This is what vLLM autoscaling actually reacts to.</p>



<p>To do so, a Python code can simulate the expected behaviour:</p>



<pre class="wp-block-code"><code class="">import os<br>import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br><br>APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/v1" # /!\ REPLACE THE &lt;APP_ID&gt; by yours /!\<br>MODEL = "mistralai/Ministral-3-14B-Instruct-2512"<br>API_KEY = os.environ["MY_OVHAI_ACCESS_TOKEN"]  # exported in Step 1<br><br>CONCURRENT_WORKERS = 500          # concurrency (main scaling trigger)<br>REQUESTS_PER_WORKER = 25<br>MAX_TOKENS = 768                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># OpenAI-compatible client<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key=API_KEY,<br>)<br><br># basic metrics<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception:<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("Starting autoscaling stress test...")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n=== AUTOSCALING BENCH RESULTS ===")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>



<p><strong>How can you verify that autoscaling is working and that the load is being handled correctly without latency skyrocketing?</strong></p>



<h4 class="wp-block-heading">2. Hardware and platform-level monitoring</h4>



<p>First, <strong>AI Deploy Grafana</strong> answers <strong>&#8216;What resources are being used and how many replicas exist?&#8217;</strong>.</p>



<p>GPU utilisation, GPU memory, CPU, RAM and replica count are monitored through <strong>OVHcloud AI Deploy Grafana</strong> (monitoring URL), which exposes infrastructure and runtime metrics for the AI Deploy application. This layer provides visibility into <strong>resource saturation and scaling events</strong> managed by the AI Deploy platform itself.</p>



<p>Access it using the following URL (do not forget to replace <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>&lt;APP_ID&gt;</strong></mark></code> by yours): <strong><code>https://monitoring.gra.ai.cloud.ovh.net/d/app/app-monitoring?var-app=</code><mark class="has-inline-color has-ast-global-color-0-color"><code>&lt;APP_ID&gt;</code></mark><code>&amp;orgId=1</code></strong></p>
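<p>If you script dashboard access for several applications, the substitution above can be automated. A minimal sketch; the URL pattern matches the one given above, the region parameter and the example APP_ID value are assumptions for illustration:</p>

```python
def monitoring_url(app_id: str, region: str = "gra") -> str:
    """Build the AI Deploy Grafana dashboard URL for a given application."""
    return (
        f"https://monitoring.{region}.ai.cloud.ovh.net"
        f"/d/app/app-monitoring?var-app={app_id}&orgId=1"
    )

print(monitoring_url("my-app-id"))  # hypothetical APP_ID
```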



<p>For example, check GPU/RAM metrics:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png" alt="" class="wp-image-30260" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can also monitor scale-up and scale-down events in real time, as well as information on HTTP calls and much more!</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png" alt="" class="wp-image-30261" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h4 class="wp-block-heading">3. Software and application-level monitoring</h4>



<p>Next, the combination of MKS + Prometheus + Grafana answers <strong>&#8216;How does the inference engine behave internally?&#8217;</strong>.</p>



<p>In fact, vLLM internal metrics (request concurrency, token throughput, latency indicators, KV cache pressure, etc.) are collected via the <strong>vLLM <code>/metrics</code> endpoint</strong> and scraped by <strong>Prometheus running on OVHcloud MKS</strong>, then visualised in a <strong>dedicated Grafana instance</strong>. This layer focuses on <strong>model behaviour and inference performance</strong>.</p>
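<p>The <code>/metrics</code> route serves the standard Prometheus text exposition format, so you can also inspect it directly, without Grafana. Below is a minimal parser sketch for metric sample lines; the sample metric names follow vLLM&#8217;s <code>vllm:</code> prefix convention, but exact names vary by vLLM version:</p>

```python
def parse_prometheus_text(text):
    """Parse 'name{labels} value' sample lines from a Prometheus exposition.

    Simplified sketch: skips HELP/TYPE comments and assumes no spaces
    inside label values.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip comments and blanks
            continue
        name_part, _, value = line.rpartition(" ")
        metrics[name_part] = float(value)
    return metrics

# Sample payload (metric names illustrative of vLLM's /metrics output)
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="ministral"} 12.0
vllm:gpu_cache_usage_perc{model_name="ministral"} 0.83
"""
metrics = parse_prometheus_text(sample)
print(metrics)
```

In practice you would fetch the text with <code>urllib.request.urlopen</code> against the application&#8217;s <code>/metrics</code> route before parsing; Prometheus itself does this scraping for you in the architecture described here.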



<p>Find all these metrics via (just replace <strong><code><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></code></strong>): <strong><code>http://<mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark>/d/vllm-ministral-monitoring/ministral-14b-vllm-metrics-monitoring?orgId=1</code></strong></p>



<p>Find key metrics such as TTFT (Time To First Token):</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png" alt="" class="wp-image-30263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can also find some information about <strong>&#8216;Model load and throughput&#8217;</strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png" alt="" class="wp-image-30264" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To go further and add even more metrics, you can refer to the vLLM documentation on &#8216;<a href="https://docs.vllm.ai/en/v0.7.2/getting_started/examples/prometheus_grafana.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus and Grafana</a>&#8217;.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>This reference architecture provides a scalable, production-ready approach for deploying LLM inference on OVHcloud using <strong>AI Deploy</strong> and the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-apps-deployments?id=kb_article_view&amp;sysparm_article=KB0047997#advanced-custom-metrics-for-autoscaling" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">autoscaling on custom metrics feature</a>.</p>



<p>OVHcloud <strong>MKS</strong> is dedicated to running Prometheus and Grafana, enabling secure scraping and visualisation of <strong>vLLM internal metrics</strong> exposed via the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code> </mark></strong>endpoint.</p>



<p>By scraping vLLM metrics securely from AI Deploy into Prometheus and exposing them through Grafana, the architecture provides full visibility into model behaviour, performance and load, enabling informed scaling analysis, troubleshooting and capacity planning in production environments.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks%2F&amp;action_name=Reference%20Architecture%3A%20Custom%20metric%20autoscaling%20for%20LLM%20inference%20with%20vLLM%20on%20OVHcloud%20AI%20Deploy%20and%20observability%20using%20MKS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: build a sovereign n8n RAG workflow for AI agent using OVHcloud Public Cloud solutions</title>
		<link>https://blog.ovhcloud.com/reference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 27 Jan 2026 13:12:03 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Managed Database]]></category>
		<category><![CDATA[n8n]]></category>
		<category><![CDATA[Object Storage]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[S3]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29694</guid>

					<description><![CDATA[What if an n8n workflow, deployed in a&#160;sovereign environment, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection. In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&#160;Large Language Models&#160;(LLMs) [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions%2F&amp;action_name=Reference%20Architecture%3A%20build%20a%20sovereign%20n8n%20RAG%20workflow%20for%20AI%20agent%20using%20OVHcloud%20Public%20Cloud%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em><em>What if an n8n workflow, deployed in a&nbsp;</em><strong><em>sovereign environment</em></strong><em>, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection.</em></em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg" alt="" class="wp-image-30002" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-300x169.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-768x432.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1536x864.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag.jpg 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>n8n workflow overview</em></figcaption></figure>



<p>In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&nbsp;<strong>Large Language Models</strong>&nbsp;(LLMs) is becoming a strategic differentiator.</p>



<p><strong>How?</strong>&nbsp;By building&nbsp;<strong>Agentic RAG systems</strong>&nbsp;capable of retrieving, reasoning, and acting autonomously based on external knowledge.</p>



<p>To make this possible, engineers need a way to connect&nbsp;<strong>retrieval pipelines (RAG)</strong>&nbsp;with&nbsp;<strong>tool-based orchestration</strong>.</p>



<p>This article outlines a&nbsp;<strong>reference architecture</strong>&nbsp;for building a&nbsp;<strong>fully automated RAG pipeline orchestrated by n8n</strong>, leveraging&nbsp;<strong>OVHcloud AI Endpoints</strong>&nbsp;and&nbsp;<strong>PostgreSQL with pgvector</strong>&nbsp;as core components.</p>



<p>The final result will be a system that automatically ingests Markdown documentation from&nbsp;<strong>Object Storage</strong>, creates embeddings with OVHcloud’s&nbsp;<strong>BGE-M3</strong>&nbsp;model available on AI Endpoints, and stores them in a&nbsp;<strong>Managed Database PostgreSQL</strong>&nbsp;with pgvector extension.</p>



<p>Lastly, you’ll be able to build an AI Agent that lets you chat with an LLM (<strong>GPT-OSS-120B</strong>&nbsp;on AI Endpoints). This agent, utilising the RAG implementation carried out upstream, will be an expert on OVHcloud products.</p>



<p>You can further improve the process by using an&nbsp;<strong>LLM guard</strong>&nbsp;to protect the questions sent to the LLM, and set up a chat memory to use conversation history for higher response quality.</p>



<p><strong>But what about n8n?</strong></p>



<p><a href="https://n8n.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>n8n</strong></a>, the open-source workflow automation tool,&nbsp;offers many benefits and connects seamlessly with over&nbsp;<strong>300</strong>&nbsp;APIs, apps, and services:</p>



<ul class="wp-block-list">
<li><strong>Open-source</strong>: n8n is a 100% self-hostable solution, which means you retain full data control;</li>



<li><strong>Flexible</strong>: combines low-code nodes and custom JavaScript/Python logic;</li>



<li><strong>AI-ready</strong>: includes useful integrations for LangChain, OpenAI, and embedding support capabilities;</li>



<li><strong>Composable</strong>: enables simple connections between data, APIs, and models in minutes;</li>



<li><strong>Sovereign by design</strong>: compliant with privacy-sensitive or regulated sectors.</li>
</ul>



<p>This reference architecture serves as a blueprint for building a sovereign, scalable Retrieval Augmented Generation (<strong>RAG</strong>) platform using&nbsp;<strong>n8n</strong>&nbsp;and&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;solutions.</p>



<p>This setup shows how to orchestrate data ingestion, generate embeddings, and enable conversational AI by combining&nbsp;<strong>OVHcloud Object Storage</strong>,&nbsp;<strong>Managed Databases with PostgreSQL</strong>,&nbsp;<strong>AI Endpoints</strong>&nbsp;and&nbsp;<strong>AI Deploy</strong>. <strong>The result?</strong>&nbsp;An AI environment that is fully integrated, protects privacy, and is exclusively hosted on <strong>OVHcloud’s European infrastructure</strong>.</p>



<h2 class="wp-block-heading">Overview of the n8n workflow architecture for RAG </h2>



<p>The workflow involves the following steps:</p>



<ul class="wp-block-list">
<li><strong>Ingestion:</strong>&nbsp;documentation in markdown format is fetched from <strong>OVHcloud Object Storage (S3);</strong></li>



<li><strong>Preprocessing:</strong> n8n cleans and normalises the text, removing YAML front-matter and encoding noise;</li>



<li><strong>Vectorisation:</strong>&nbsp;Each document is embedded using the <strong>BGE-M3</strong> model, which is available via <strong>OVHcloud AI Endpoints;</strong></li>



<li><strong>Persistence:</strong> vectors and metadata are stored in <strong>OVHcloud PostgreSQL Managed Database</strong> using pgvector;</li>



<li><strong>Retrieval:</strong> when a user sends a query, n8n triggers a <strong>LangChain Agent</strong> that retrieves relevant chunks from the database;</li>



<li><strong>Reasoning and actions:</strong>&nbsp;The <strong>AI Agent node</strong> combines LLM reasoning, memory, and tool usage to generate a contextual response or trigger downstream actions (Slack reply, Notion update, API call, etc.).</li>
</ul>
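<p>The preprocessing step above (stripping YAML front-matter before embedding) can be replicated outside n8n. A minimal Python sketch, assuming the front-matter block is delimited by <code>---</code> lines at the top of the Markdown file:</p>

```python
def strip_front_matter(markdown: str) -> str:
    """Remove a leading YAML front-matter block delimited by '---' lines."""
    lines = markdown.splitlines()
    if lines and lines[0].strip() == "---":
        # find the closing delimiter and drop everything up to it
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                return "\n".join(lines[i + 1:]).lstrip("\n")
    return markdown

doc = "---\ntitle: Install the ovhai CLI\n---\n\n# Guide body"
print(strip_front_matter(doc))
```

Documents without front-matter pass through unchanged, so the function is safe to apply to every file in the bucket.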



<p>In this tutorial, all services are deployed within the <strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you start, double-check that you have:</p>



<ul class="wp-block-list">
<li>an <strong>OVHcloud Public Cloud</strong> account</li>



<li>an <strong>OpenStack user</strong> with the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">following roles</a>:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>



<li>An <strong>API key</strong> for <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-endpoints-getting-started?id=kb_article_view&amp;sysparm_article=KB0065401" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></li>



<li><strong>ovhai CLI available</strong> – <em>install the </em><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a></li>



<li><strong>Hugging Face access</strong> – <em>create a </em><a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Hugging Face account</em></a><em> and generate an </em><a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>access token</em></a></li>
</ul>



<p><strong>🚀 Now that you have everything you need, you can start building your n8n workflow!</strong></p>



<h2 class="wp-block-heading">Architecture guide: n8n agentic RAG workflow</h2>



<p>You’re all set to configure and deploy your n8n workflow.</p>



<p>⚙️<em> Keep in mind that the following steps can be completed using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Build the RAG data ingestion pipeline</h3>



<p>This first step involves building the foundation of the entire RAG workflow by preparing the elements you need:</p>



<ul class="wp-block-list">
<li>n8n deployment</li>



<li>Object Storage bucket creation</li>



<li>PostgreSQL database creation</li>



<li>and more</li>
</ul>



<p>Remember to set up the proper credentials in n8n so the different elements can connect and function.</p>



<h4 class="wp-block-heading">1. Deploy n8n on OVHcloud VPS</h4>



<p>OVHcloud provides <a href="https://www.ovhcloud.com/en-gb/vps/vps-n8n/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>VPS solutions compatible with n8n</strong></a><strong>.</strong> Get a ready-to-use virtual server with <strong>pre-installed n8n </strong>and start building automation workflows without manual setup. With plans ranging from <strong>6 vCores&nbsp;/&nbsp;12 GB RAM</strong> to <strong>24 vCores&nbsp;/&nbsp;96 GB RAM</strong>, you can choose the capacity that suits your workload.</p>



<p><strong>How to set up n8n on a VPS?</strong></p>



<p>Setting up n8n on an OVHcloud VPS generally involves:</p>



<ul class="wp-block-list">
<li>Choosing and provisioning your OVHcloud VPS plan;</li>



<li>Connecting to your server via SSH and carrying out the initial server configuration, which includes updating the OS;</li>



<li>Installing n8n, typically with Docker (recommended for ease of management and updates), or npm by following this <a href="https://help.ovhcloud.com/csm/en-gb-vps-install-n8n?id=kb_article_view&amp;sysparm_article=KB0072179" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>;</li>



<li>Configuring n8n with a domain name, SSL certificate for HTTPS, and any necessary environment variables for databases or settings.</li>
</ul>



<p>While OVHcloud provides a robust VPS platform, you can find detailed n8n installation guides in the <a href="https://docs.n8n.io/hosting/installation/docker/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official n8n documentation</a>.</p>



<p>Once n8n is up and running, you can configure the Object Storage bucket and the database.</p>



<h4 class="wp-block-heading">2. Create Object Storage bucket</h4>



<p>First, you have to set up your data source. Here you can store all your documentation in an S3-compatible <a href="https://www.ovhcloud.com/en-gb/public-cloud/object-storage/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Object Storage</a> bucket.</p>



<p>This tutorial assumes that all the documentation files are in Markdown format.</p>



<p>From <strong>OVHcloud Control Panel</strong>, create a new Object Storage container with <strong>S3-compatible API </strong>solution; follow this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-storage-s3-getting-started-object-storage?id=kb_article_view&amp;sysparm_article=KB0034674" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>.</p>



<p>When the bucket is ready, add your Markdown documentation to it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png" alt="" class="wp-image-29733" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Note:</strong>&nbsp;For this tutorial, we’re using the various OVHcloud product documentation, available as open source in the GitHub repository maintained by OVHcloud members.</p>



<p><em>Click this </em><a href="https://github.com/ovh/docs.git" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>link</em></a><em> to access the repository.</em></p>
</blockquote>



<p>How do you do that? Extract all the <strong><code>guide.en-gb.md</code></strong> files from the GitHub repository and rename each one to match its parent folder.</p>



<p>Example: the documentation about ovhai CLI installation, <code><strong>docs/pages/public_cloud/ai_machine_learning/cli_10_howto_install_cli/guide.en-gb.md</strong></code>, is stored in the <strong>ovhcloud-products-documentation-md</strong> bucket as <strong><code>cli_10_howto_install_cli.md</code></strong>.</p>
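<p>The extraction and renaming described above is easy to script. A sketch using <code>pathlib</code>, to be run against a local clone of the docs repository; the function name and output directory are illustrative, not part of the original workflow:</p>

```python
from pathlib import Path
import shutil

def collect_guides(repo_root: Path, out_dir: Path) -> list:
    """Copy every guide.en-gb.md to out_dir, renamed after its parent folder."""
    out_dir.mkdir(parents=True, exist_ok=True)
    renamed = []
    for guide in repo_root.rglob("guide.en-gb.md"):
        # e.g. .../cli_10_howto_install_cli/guide.en-gb.md
        #   -> cli_10_howto_install_cli.md
        target = out_dir / f"{guide.parent.name}.md"
        shutil.copy(guide, target)
        renamed.append(target.name)
    return sorted(renamed)
```

The resulting flat directory can then be uploaded to the Object Storage bucket as-is.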



<p>You should get an overview that looks like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png" alt="" class="wp-image-29735" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Keep the following elements and create a new credential in n8n named <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">OVHcloud S3 gra credentials</mark></strong></code>:</p>



<ul class="wp-block-list">
<li>S3 Endpoint: <a href="https://s3.gra.io.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://s3.gra.io.cloud.ovh.net/</mark></code></strong></a></li>



<li>Region: <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">gra</mark></code></strong></li>



<li>Access Key ID: <strong><code>&lt;your_object_storage_user_access_key&gt;</code></strong></li>



<li>Secret Access Key: <strong><code>&lt;your_object_storage_user_secret_key&gt;</code></strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png" alt="" class="wp-image-29736" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, create a new n8n node by selecting&nbsp;<strong>S3</strong>, then&nbsp;<strong>Get Multiple Files</strong>.<br>Configure this node as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png" alt="" class="wp-image-29740" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Connect the node to the previous one before moving on to the next step.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png" alt="" class="wp-image-29741" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>With the first phase done, you can now configure the vector DB.</p>



<h4 class="wp-block-heading">3. Configure PostgreSQL Managed DB (pgvector)</h4>



<p>In this step, you can set up the vector database that lets you store the embeddings generated from your documents.</p>



<p>How? By using OVHcloud’s managed&nbsp;<a href="https://www.ovhcloud.com/en-gb/public-cloud/postgresql/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PostgreSQL</a> databases with the pgvector extension. Go to your OVHcloud Control Panel and follow the steps below.</p>



<p>1. Navigate to&nbsp;<strong>Databases &amp; Analytics &gt; Databases</strong></p>



<p><strong>2. Create a new database and select&nbsp;<em>PostgreSQL</em>&nbsp;and a datacenter location</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png" alt="" class="wp-image-29758" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>3. Select&nbsp;<em>Production</em>&nbsp;plan and&nbsp;<em>Instance type</em></strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png" alt="" class="wp-image-29759" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>4. Reset the user password and save it</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png" alt="" class="wp-image-29762" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>5. Whitelist the IP of your n8n instance as follows</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png" alt="" class="wp-image-29761" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>6. Take note of the following parameters</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png" alt="" class="wp-image-29760" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make a note of this information and create a new credential in n8n named&nbsp;<strong>OVHcloud PGvector credentials</strong>:</p>



<ul class="wp-block-list">
<li>Host:<strong>&nbsp;<code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_hostname&gt;</mark></code></strong></li>



<li>Database:&nbsp;<strong>defaultdb</strong></li>



<li>User:&nbsp;<code>avnadmin</code></li>



<li>Password:&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_password&gt;</mark></strong></code></li>



<li>Port:&nbsp;<strong>20184</strong></li>
</ul>



<p>Consider enabling the&nbsp;<strong>Ignore SSL Issues (Insecure)</strong>&nbsp;option if needed, and setting the&nbsp;<strong>Maximum Number of Connections</strong>&nbsp;value to&nbsp;<strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1000</mark></code></strong>.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png" alt="" class="wp-image-29763" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>✅ You’re now connected to the database! But what about the PGvector extension?</p>



<p>Add a PostgreSQL node&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code>&nbsp;to your n8n workflow, and create the extension through an SQL query, which should look like this:</p>



<pre class="wp-block-code"><code class="">-- drop table as needed<br>DROP TABLE IF EXISTS md_embeddings;<br><br>-- activate pgvector<br>CREATE EXTENSION IF NOT EXISTS vector;<br><br>-- create table<br>CREATE TABLE md_embeddings (<br>    id SERIAL PRIMARY KEY,<br>    text TEXT,<br>    embedding vector(1024),<br>    metadata JSONB<br>);</code></pre>



<p>You should get this n8n node:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png" alt="" class="wp-image-29752" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This node creates the new&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">md_embeddings</mark></strong></code>&nbsp;table. You can also add a&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Stop and Error</mark></strong></code>&nbsp;node to halt the workflow if the table setup fails.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png" alt="" class="wp-image-29753" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>All set! Your vector DB is prepped and ready for data! Keep in mind, you still need an&nbsp;<strong>embeddings model</strong> for the RAG data ingestion pipeline.</p>



<h4 class="wp-block-heading">4. Access to OVHcloud AI Endpoints</h4>



<p><strong>OVHcloud AI Endpoints</strong>&nbsp;is a managed service that provides&nbsp;<strong>ready-to-use APIs for AI models</strong>, including&nbsp;<strong>LLM, CodeLLM, embeddings, Speech-to-Text, and image models</strong>&nbsp;hosted within OVHcloud’s European infrastructure.</p>



<p>To vectorise the various documents in Markdown format, you have to select an embedding model:&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/bge-m3" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>BGE-M3</strong></a>.</p>



<p>Usually, your AI Endpoints API key should already be created. If not, head to the AI Endpoints menu in your OVHcloud Control Panel to generate a new API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png" alt="" class="wp-image-29775" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Once this is done, you can create new OpenAI credentials in your n8n.</p>



<p>Why do I need OpenAI credentials? Because the <strong>AI Endpoints API&nbsp;</strong>is fully compatible with OpenAI’s, integration is simple and ensures the&nbsp;<strong>sovereignty of your data.</strong></p>



<p>How? Thanks to a single endpoint,&nbsp;<a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a>, you can call any of the AI Endpoints models.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png" alt="" class="wp-image-29776" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
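<p>As an illustration, here is a minimal sketch of what an OpenAI-format embeddings request against that single endpoint could look like. The base URL comes from above; the model identifier <code>bge-m3</code>, the API key placeholder, and the helper name are assumptions for illustration only (no network call is made here).</p>

```javascript
// Hypothetical sketch: build an OpenAI-format embeddings request
// targeting the single AI Endpoints base URL.
const BASE_URL = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1";

function buildEmbeddingsRequest(apiKey, texts) {
  return {
    url: `${BASE_URL}/embeddings`,
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // "bge-m3" is an assumed model identifier for illustration
    body: JSON.stringify({ model: "bge-m3", input: texts }),
  };
}

const req = buildEmbeddingsRequest("<your_api_key>", ["hello world"]);
console.log(req.url);
```

In n8n you never build this request by hand: the OpenAI credentials plus the base URL do exactly this behind the scenes.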



<p>This means you can create a new n8n node by selecting&nbsp;<strong>Postgres PGVector Store</strong>&nbsp;and&nbsp;<strong>Add documents to Vector Store</strong>.<br>Set up this node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png" alt="" class="wp-image-29781" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then configure the <strong>Data Loader</strong> with custom text splitting and the JSON data type.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png" alt="" class="wp-image-29780" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>For the text splitter, here are some options:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png" alt="" class="wp-image-29786" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To finish, select the&nbsp;<strong>BGE-M3</strong> embedding model from the model list and set the&nbsp;<strong>Dimensions</strong> to 1024.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png" alt="" class="wp-image-29784" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
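<p>A quick sketch of why the <strong>Dimensions</strong> value matters: the embedding length must match the <code>vector(1024)</code> column created earlier. The helper below is hypothetical (n8n’s PGVector node handles this for you) and only illustrates the dimension check and the pgvector literal format.</p>

```javascript
// Hypothetical helper: validate an embedding against the vector(1024)
// column and format it as a pgvector '[v1,v2,...]' literal.
const EXPECTED_DIM = 1024; // BGE-M3 output size

function toPgVectorLiteral(embedding) {
  if (embedding.length !== EXPECTED_DIM) {
    throw new Error(`expected ${EXPECTED_DIM} dimensions, got ${embedding.length}`);
  }
  return `[${embedding.join(",")}]`;
}

const fake = new Array(EXPECTED_DIM).fill(0.5);
const literal = toPgVectorLiteral(fake);
console.log(literal.startsWith("[0.5,0.5")); // → true
```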



<p>You now have everything you need to build the ingestion pipeline.</p>



<h4 class="wp-block-heading">5. Set up the ingestion pipeline loop</h4>



<p>To make use of a fully automated document ingestion and vectorisation pipeline, you have to integrate some specific nodes, mainly:</p>



<ul class="wp-block-list">
<li>a <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></code></strong> that downloads each markdown file one by one so that it can be vectorised;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> that counts the number of files processed, which subsequently determines the number of requests sent to the embedding model;</li>



<li>an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition that lets you check when 400 requests have been reached;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node that pauses after every 400 requests to avoid getting rate-limited;</li>



<li>an S3 block <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> to download each markdown;</li>



<li>another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> to extract and process text from Markdown files by cleaning and removing special characters before sending it to the embeddings model;</li>



<li>a PostgreSQL node to <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL</mark></strong></code> query to check that the table contains vectors after the process (loop) is complete.</li>
</ul>



<h5 class="wp-block-heading">5.1. Create a loop to process each documentation file</h5>



<p>Begin by creating a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></strong></code> to process all the Markdown files one at a time. Set the <strong>batch size</strong> to <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1</mark></code></strong> in this loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png" alt="" class="wp-image-29788" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
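<p>Conceptually, a batch size of <code>1</code> means the node slices the incoming file list into single-item batches, which can be sketched like this (the function name is hypothetical; n8n does this internally):</p>

```javascript
// Sketch of "Loop Over Items" batching: split items into batches of
// the given size; with batchSize = 1, each file is processed alone.
function toBatches(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const files = ["a.md", "b.md", "c.md"];
const batches = toBatches(files, 1);
console.log(batches.length); // → 3
```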



<p>Add the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Loop</code></mark></strong> statement right after the S3 <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Get Many Files</mark></code></strong> node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png" alt="" class="wp-image-29797" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Time to put the loop’s content into action!</p>



<h5 class="wp-block-heading">5.2. Count the number of files using a code snippet</h5>



<p>Next, choose the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> node from the list to count how many files have been processed. Set the <strong>Mode</strong> to “Run Once for Each Item” and the <strong>Language</strong> to “JavaScript”, then add the following code snippet to the designated block.</p>



<pre class="wp-block-code"><code class="">// simple counter per item<br>const counter = $runIndex + 1;<br><br>return {<br>  counter<br>};</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png" alt="" class="wp-image-29792" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make sure this code snippet is included in the loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png" alt="" class="wp-image-29798" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can start adding the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong><code>if</code></strong></mark> part to the loop now.</p>



<h5 class="wp-block-heading">5.3. Add a condition that applies a rule every 400 requests</h5>



<p>Here, you need to create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node and add the following condition, set as an expression:</p>



<pre class="wp-block-code"><code class="">{{ (Number($json["counter"]) % 400) === 0 }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png" alt="" class="wp-image-29794" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Add it immediately after counting the files:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png" alt="" class="wp-image-29800" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If this condition <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">is true</mark></strong></code>, trigger the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node.</p>



<h5 class="wp-block-heading">5.4. Insert a pause after each set of 400 requests</h5>



<p>Then insert a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node to pause before resuming. Set <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Resume</mark></strong></code> to “After Time Interval” and the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait Amount</mark></strong></code> to “60:00” seconds.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png" alt="" class="wp-image-29796" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Link it to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition when this is <strong>True</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png" alt="" class="wp-image-29801" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
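<p>Put together, the counter, the <code>If</code> condition, and the <code>Wait</code> node implement a simple throttle, which can be sketched in plain JavaScript (the function names are hypothetical; in n8n each piece lives in its own node):</p>

```javascript
// Sketch of the loop's throttling logic: pause once every 400 files.
function shouldPause(counter, batchLimit = 400) {
  // mirrors the n8n expression {{ (Number($json["counter"]) % 400) === 0 }}
  return counter % batchLimit === 0;
}

function countPauses(totalFiles) {
  let pauses = 0;
  for (let counter = 1; counter <= totalFiles; counter++) {
    if (shouldPause(counter)) pauses++; // the Wait node would sleep here
  }
  return pauses;
}

console.log(countPauses(1000)); // → 2 (after files 400 and 800)
```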



<p>Next, you can go ahead and download the Markdown file, and then process it.</p>



<h5 class="wp-block-heading">5.5. Launch documentation download</h5>



<p>To do this, create a new <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> S3 node and configure it with this File Key expression:</p>



<pre class="wp-block-code"><code class="">{{ $('Process each documentation file').item.json.Key }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png" alt="" class="wp-image-29804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Want to connect it? That’s easy: link it to the output of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node and to the <strong>False</strong> branch of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node, so that a file is processed only when the rate limit has not been reached.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png" alt="" class="wp-image-29805" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You’re almost done! Now you need to extract and process the text from the Markdown files – clean and remove any special characters before sending it to the embedding model.</p>



<h5 class="wp-block-heading">5.6 Clean Markdown text content</h5>



<p>Next, create another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> to process text from Markdown files:</p>



<pre class="wp-block-code"><code class="">// extract binary content<br>const binary = $input.item.binary.data;<br><br>// decode into clean UTF-8 text<br>let text = Buffer.from(binary.data, 'base64').toString('utf8');<br><br>// cleaning: remove non-printable characters<br>text = text<br>  .replace(/[^\x09\x0A\x0D\x20-\x7EÀ-ÿ€£¥•–—‘’“”«»©®™°±§¶÷×]/g, ' ')<br>  .replace(/\s{2,}/g, ' ')<br>  .trim();<br><br>// check length and truncate if needed<br>if (text.length &gt; 14000) {<br>  text = text.slice(0, 14000);<br>}<br><br>return [{<br>  text,<br>  fileName: binary.fileName,<br>  mimeType: binary.mimeType<br>}];</code></pre>



<p>Select the <em>“Run Once for Each Item”</em> <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Mode</mark></strong></code> and place the previous code in the dedicated JavaScript block.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png" alt="" class="wp-image-29806" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
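<p>To see what the snippet does, here is the same cleaning logic applied to a sample string outside n8n (no <code>$input</code> binary wrapper, so the base64 decoding step is skipped):</p>

```javascript
// Standalone demonstration of the cleaning steps from the n8n snippet:
// replace non-printable characters with spaces, collapse whitespace, trim.
function cleanMarkdown(text) {
  return text
    .replace(/[^\x09\x0A\x0D\x20-\x7EÀ-ÿ€£¥•–—‘’“”«»©®™°±§¶÷×]/g, " ")
    .replace(/\s{2,}/g, " ")
    .trim();
}

const sample = "# Title\u0000\u0007   with   noise   ";
console.log(cleanMarkdown(sample)); // → "# Title with noise"
```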



<p>To finish, check that the output text has been sent to the document vectorisation system, which was set up in <strong>Step 3 – Configure PostgreSQL Managed DB (pgvector)</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png" alt="" class="wp-image-29808" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>How do I confirm that the table contains all elements after vectorisation?</p>



<h5 class="wp-block-heading">5.7 Double-check that the documents are in the table</h5>



<p>To confirm that your RAG system is working, check that your vector database actually contains vectors: use a PostgreSQL node with <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code> in your n8n workflow.</p>



<p>Then, run the following query:</p>



<pre class="wp-block-code"><code class="">-- count the number of elements<br>SELECT COUNT(*) FROM md_embeddings;</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png" alt="" class="wp-image-29818" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Done</mark></strong></code> section of your <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop</mark></strong>, so the elements are counted when the process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png" alt="" class="wp-image-29773" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congrats! You can now run the workflow to begin ingesting documents.</p>



<p>Click the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute workflow</mark></strong></code> button and wait until the vectorization process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png" alt="" class="wp-image-29823" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember, everything should be green when it’s finished ✅.</p>



<h3 class="wp-block-heading">Step 2 – RAG chatbot</h3>



<p>With the data ingestion and vectorisation steps completed, you can now begin implementing your AI agent.</p>



<p>This involves building a <strong>RAG-based AI Agent</strong>&nbsp;by simply starting a chat with an LLM.</p>



<h4 class="wp-block-heading">1. Set up the chat box to start a conversation</h4>



<p>First, configure your AI Agent based on the RAG system, and add a new node in the same n8n workflow: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Trigger</mark></strong></code>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png" alt="" class="wp-image-29834" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This node will allow you to interact directly with your AI agent! But before that, you need to check that your message is safe.</p>






<h4 class="wp-block-heading">2. Set up your LLM Guard with AI Deploy</h4>



<p>To check whether a message is secure or not, use an LLM Guard.</p>



<p><strong>What’s an LLM Guard?</strong>&nbsp;This is a safety and control layer that sits between users and an LLM, or between the LLM and an external connection. Its main goal is to filter, monitor, and enforce rules on what goes into or comes out of the model 🔐.</p>



<p>You can use <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a> from OVHcloud to deploy your desired LLM guard. With a single command line, this AI solution lets you deploy a Hugging Face model using vLLM Docker containers.</p>



<p>For more details, please refer to this <a href="https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/" data-wpel-link="internal">blog</a>.</p>



<p>For the use case covered in this article, you can use the open-source model <strong>meta-llama/Llama-Guard-3-8B</strong> available on <a href="https://huggingface.co/meta-llama/Llama-Guard-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a>.</p>



<h5 class="wp-block-heading">2.1 Create a Bearer token to request your custom AI Deploy endpoint</h5>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a token</a> to access your AI Deploy app once it’s deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>The following output is returned:</p>



<p><code><strong>Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-10-25 8:53:05<br>Updated At: 20-10-25 8:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</strong></code></p>



<p>You can now store and export your access token to add it as a new credential in n8n.</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h5 class="wp-block-heading">2.2 Start the Llama Guard 3 model with AI Deploy</h5>



<p>Using the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai</mark></strong></code> CLI, launch the following command to start the vLLM inference server.</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>	--name vllm-llama-guard3 \<br>	--default-http-port 8000 \<br>	--gpu 1 \<br>	--flavor l40s-1-gpu \<br>	--label ai_deploy_token=my_operator_token \<br>	--env OUTLINES_CACHE_DIR=/tmp/.outlines \<br>	--env HF_TOKEN=$MY_HF_TOKEN \<br>	--env HF_HOME=/hub \<br>	--env HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>	--env HF_HUB_ENABLE_HF_TRANSFER=0 \<br>	--volume standalone:/workspace:RW \<br>	--volume standalone:/hub:RW \<br>	vllm/vllm-openai:v0.10.1.1 \<br>	-- bash -c "python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-Guard-3-8B --tensor-parallel-size 1 --dtype bfloat16"</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai app run</mark></strong></code></li>
</ul>



<p>This is the core command to&nbsp;<strong>run an app</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--name vllm-llama-guard3</mark></strong></code></li>
</ul>



<p>Sets a&nbsp;<strong>custom name</strong>&nbsp;for the app. For example,&nbsp;<code>vllm-llama-guard3</code>.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--default-http-port 8000</mark></strong></code></li>
</ul>



<p>Exposes&nbsp;<strong>port 8000</strong>&nbsp;as the default HTTP endpoint. The vLLM server typically runs on port 8000.</p>



<ul class="wp-block-list">
<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--gpu&nbsp;</code>1</mark></strong></li>



<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--flavor l40s-1-gpu</code></mark></strong></li>
</ul>



<p>Allocates&nbsp;<strong>one L40S GPU</strong>&nbsp;for the app. You can adjust the GPU type and count depending on the model you deploy.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/workspace:RW</mark></strong></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/hub:RW</mark></strong></code></li>
</ul>



<p>Mounts&nbsp;<strong>two persistent storage volumes</strong>: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/workspace</code></mark></strong> which is the main working directory and <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">/hub</mark></strong></code>&nbsp;to store Hugging Face model files.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env OUTLINES_CACHE_DIR=/tmp/.outlines</mark></strong></code></li>



<li><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_TOKEN=$MY_HF_TOKEN</mark></code></strong></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HOME=/hub</mark></strong></code></li>



<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>--env HF_DATASETS_TRUST_REMOTE_CODE=1</strong></mark></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HUB_ENABLE_HF_TRANSFER=0</mark></strong></code></li>
</ul>



<p>These are Hugging Face&nbsp;<strong>environment variables</strong> you have to set. Please export your Hugging Face access token as an environment variable before starting the app: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">export MY_HF_TOKEN=***********</mark></strong></code></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.10.1.1</mark></strong></code></li>
</ul>



<p>Uses the&nbsp;<strong><code>vllm/vllm-openai</code></strong>&nbsp;Docker image (a pre-configured vLLM OpenAI API server).</p>



<ul class="wp-block-list">
<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>-- bash -c "python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-Guard-3-8B --tensor-parallel-size 1 --dtype bfloat16"</strong></mark></code></li>
</ul>



<p>Finally, runs a<strong>&nbsp;bash shell</strong>&nbsp;inside the container and executes a Python command to launch the vLLM API server.</p>



<h5 class="wp-block-heading">2.3 Check that your AI Deploy app is RUNNING</h5>



<p>Replace <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></strong></code> with your own app ID.</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;app_id&gt;</code></pre>



<p>You should get:</p>



<p><code>History:<br>DATE STATE<br>20-10-25 09:58:00 QUEUED<br>20-10-25 09:58:01 INITIALIZING<br>20-10-25 09:58:07 PENDING<br>20-10-25 10:03:10&nbsp;<strong>RUNNING</strong><br>Info:<br>Message: App is running</code></p>



<h5 class="wp-block-heading">2.4 Create a new n8n credential with the AI Deploy app URL and Bearer access token</h5>



<p>First, using your <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>&lt;app_id></strong></mark></code>, retrieve your AI Deploy app URL.</p>



<pre class="wp-block-code"><code class="">ovhai app get <span style="background-color: initial; font-family: inherit; font-size: inherit; text-align: initial; font-weight: inherit;">&lt;app_id&gt;</span> -o json | jq '.status.url' -r</code></pre>



<p>Then, create a new OpenAI credential from your n8n workflow, using your AI Deploy URL and the Bearer token as an API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png" alt="" class="wp-image-29837" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Don&#8217;t forget to replace <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>6e10e6a5-2862-4c82-8c08-26c458ca12c7</code></mark></strong> with your <span style="background-color: initial; font-family: inherit; font-size: inherit; text-align: initial; font-weight: inherit;"><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></code></strong></span>.</p>
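<p>Before wiring the credential into n8n, you can sanity-check the deployed endpoint from any script. The sketch below uses only the Python standard library; the example URL pattern and token are placeholders to replace with your own AI Deploy values:</p>

```python
import json
import urllib.request

def models_request(app_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the server's model list."""
    return urllib.request.Request(
        f"{app_url}/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )

def list_models(app_url: str, token: str) -> list[str]:
    """Call the OpenAI-compatible /v1/models route; return served model ids."""
    with urllib.request.urlopen(models_request(app_url, token)) as resp:
        return [m["id"] for m in json.load(resp)["data"]]

# Example (placeholders -- use your own AI Deploy URL and Bearer token):
# list_models("https://<app_id>.app.gra.ai.cloud.ovh.net", "XXXXXXXXXX")
```

<p>If the app is RUNNING and the token is valid, the model list should contain the Llama Guard model you deployed.</p>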



<h5 class="wp-block-heading">2.5 Create the LLM Guard node in the n8n workflow</h5>



<p>Create a new <strong>OpenAI node</strong> to <strong>Message a model</strong> and select the new AI Deploy credential for LLM Guard usage.</p>



<p>Next, create the prompt as follows:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png" alt="" class="wp-image-29840" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, use an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node to determine if the scenario is <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>safe</code></mark></strong> or <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>unsafe</code></mark></strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png" alt="" class="wp-image-29842" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">unsafe</mark></strong></code>, send an error message right away to stop the workflow.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png" alt="" class="wp-image-29843" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>But if the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">safe</mark></strong></code>, you can send the request to the AI Agent without issues 🔐.</p>
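<p>Conceptually, the <code>If</code> node performs the small check sketched below: Llama Guard 3 replies with <code>safe</code>, or with <code>unsafe</code> followed by the violated category code on the next line. A minimal parsing sketch (verify the exact reply format against the model card):</p>

```python
def is_safe(guard_reply: str) -> bool:
    # Llama Guard 3 answers "safe", or "unsafe" plus a category code
    # (e.g. "unsafe" then "S2") -- only the first line decides.
    lines = guard_reply.strip().lower().splitlines()
    return bool(lines) and lines[0].strip() == "safe"

print(is_safe("safe"))          # True: continue the workflow
print(is_safe("unsafe\nS2"))    # False: stop and return an error message
```

<p>The safe branch forwards the original chat input to the AI Agent; the unsafe branch short-circuits the workflow, exactly as in the screenshots above.</p>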



<h4 class="wp-block-heading">3. Set up AI Agent</h4>



<p>The&nbsp;<strong>AI Agent</strong>&nbsp;node in&nbsp;<strong>n8n</strong>&nbsp;acts as an intelligent orchestration layer that combines&nbsp;<strong>LLMs, memory, and external tools</strong>&nbsp;within an automated workflow.</p>



<p>It allows you to:</p>



<ul class="wp-block-list">
<li>Connect a <strong>Large Language Model</strong> using APIs (e.g., LLMs from AI Endpoints);</li>



<li>Use <strong>tools</strong> such as HTTP requests, databases, or RAG retrievers so the agent can take actions or fetch real information;</li>



<li>Maintain <strong>conversational memory</strong> via PostgreSQL databases;</li>



<li>Integrate directly with chat platforms (e.g., Slack, Teams) for interactive assistants (optional).</li>
</ul>



<p>Simply put, n8n becomes an&nbsp;<strong>agentic automation framework</strong>, enabling LLMs to not only provide answers, but also think, choose, and perform actions.</p>



<p>Please note that you can change and customise this n8n <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node to fit your use cases, using features like function calling or structured output. This is the most basic configuration for the given use case. You can go even further with different agents.</p>



<p>🧑‍💻&nbsp;<strong>How do I implement this RAG?</strong></p>



<p>First, create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node in <strong>n8n</strong> as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png" alt="" class="wp-image-29933" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, a series of steps is required, the first of which is creating the prompts.</p>



<h5 class="wp-block-heading">3.1 Create prompts</h5>



<p>In the AI Agent node on your n8n workflow, edit the user and system prompts.</p>



<p>Begin by creating the&nbsp;<strong>prompt</strong>,&nbsp;which is also the&nbsp;<strong>user message</strong>:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<p>Then create the <strong>System Message</strong> as shown below:</p>



<pre class="wp-block-code"><code class="">You have access to a retriever tool connected to a knowledge base.  <br>Before answering, always search for relevant documents using the retriever tool.  <br>Use the retrieved context to answer accurately.  <br>If no relevant documents are found, say that you have no information about it.</code></pre>
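<p>The behaviour this system message asks for amounts to a small retrieve-then-answer loop. The sketch below is deliberately simplified (toy keyword scoring instead of real embeddings, made-up documents) but mirrors the logic: search first, answer from the retrieved context, refuse when nothing matches:</p>

```python
def score(query: str, doc: str) -> int:
    """Toy relevance score: number of query words present in the document."""
    doc_words = set(doc.lower().split())
    return sum(w in doc_words for w in query.lower().split())

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Keep the k best-scoring documents, dropping irrelevant ones."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked[:k] if score(query, d) > 0]

def answer(query: str, docs: list[str]) -> str:
    """Mirror the system message: always search first, refuse on no match."""
    context = retrieve(query, docs)
    if not context:
        return "I have no information about that."
    # In the real workflow, the retrieved context is handed to the LLM here.
    return "Answering from context: " + " | ".join(context)

docs = ["AI Endpoints rate limits depend on the model",
        "AI Deploy runs Docker images on GPUs"]
print(answer("rate limits", docs))
print(answer("quantum teleportation", docs))
```

<p>In the n8n workflow, the retriever tool and the LLM replace <code>score</code> and the final string, but the control flow is the same.</p>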



<p>You should get a configuration like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png" alt="" class="wp-image-29935" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>🤔 Well, an LLM is now needed for this to work!</p>



<h5 class="wp-block-heading">3.2 Select LLM using AI Endpoints API</h5>



<p>First, add an <strong>OpenAI Chat Model</strong> node, and then set it as the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Model</mark></strong></code> for your agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png" alt="" class="wp-image-29939" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, select one of the&nbsp;<a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Endpoints</a>&nbsp;models from the list provided; they are compatible with the OpenAI APIs.</p>



<p>✅ <strong>How?</strong> By using the right API base URL: <a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png" alt="" class="wp-image-29936" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The <a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog/gpt-oss-120b/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>GPT OSS 120B</strong></a> model has been selected for this use case. Other models, such as Llama, Mistral, and Qwen, are also available.</p>
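<p>Since the endpoint is OpenAI-compatible, any OpenAI-style client (or a plain HTTP request) can target it by overriding the base URL. A standard-library sketch, where the model id <code>gpt-oss-120b</code> and the API token are assumptions to adapt to your own account:</p>

```python
import json
import urllib.request

BASE_URL = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1"

def chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for AI Endpoints."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

def ask(api_key: str, prompt: str, model: str = "gpt-oss-120b") -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(api_key, model, prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# ask("<your_ai_endpoints_token>", "What is RAG?")  # performs a network call
```

<p>n8n's OpenAI Chat Model node does the same thing under the hood once you point its credential at this base URL.</p>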



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><mark style="background-color:#fcb900" class="has-inline-color">⚠️ <strong>WARNING</strong> ⚠️</mark></p>



<p>If you are using a recent version of n8n, you will likely encounter the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/responses</code></mark></strong> issue (linked to OpenAI compatibility). To resolve this, disable the <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Use Responses API</mark></code></strong> option and everything will work correctly.</p>
</blockquote>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="829" height="675" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg" alt="" class="wp-image-30352" style="aspect-ratio:1.2281554640124863;width:409px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg 829w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-300x244.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-768x625.jpg 768w" sizes="auto, (max-width: 829px) 100vw, 829px" /><figcaption class="wp-element-caption"><em>Tips to fix /responses issue</em></figcaption></figure>



<p>Your LLM is now set to answer your questions! Don’t forget, it needs access to the knowledge base.</p>



<h5 class="wp-block-heading">3.3 Connect the knowledge base to the RAG retriever</h5>



<p>As usual, the first step is to create an n8n node called <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">PGVector Vector Store node</mark></strong></code> and enter your PGVector credentials.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png" alt="" class="wp-image-29943" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Tools</code></mark></strong> section of the AI Agent node.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png" alt="" class="wp-image-29944" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember to connect your PGVector database so that the retriever can access the previously generated embeddings. Here’s an overview of what you’ll get.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png" alt="" class="wp-image-29945" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>⏳Nearly done! The final step is to add the database memory.</p>



<h5 class="wp-block-heading">3.4 Manage conversation history with database memory</h5>



<p>Creating a&nbsp;<strong>Database Memory</strong>&nbsp;node in n8n (PostgreSQL) lets you link it to your AI Agent, so it can store and retrieve past conversation history. This enables the model to remember and use context across multiple interactions.</p>
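<p>Conceptually, the memory node is an append-and-replay table keyed by session id. The sketch below uses the standard-library <code>sqlite3</code> module purely as a stand-in for the managed PostgreSQL database; the table and column names are made up for illustration:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the managed PostgreSQL
conn.execute("CREATE TABLE chat_memory (session_id TEXT, role TEXT, content TEXT)")

def remember(session_id: str, role: str, content: str) -> None:
    """Append one message to the session's history."""
    conn.execute("INSERT INTO chat_memory VALUES (?, ?, ?)",
                 (session_id, role, content))

def history(session_id: str) -> list[tuple[str, str]]:
    """Replay the session's messages in insertion (rowid) order."""
    rows = conn.execute(
        "SELECT role, content FROM chat_memory WHERE session_id = ? ORDER BY rowid",
        (session_id,))
    return list(rows)

remember("s1", "user", "What are AI Endpoints rate limits?")
remember("s1", "assistant", "They depend on the model you call.")
print(history("s1"))
```

<p>The n8n node handles this storage and replay for you; before each LLM call, the replayed history is prepended to the conversation so the model keeps its context.</p>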



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png" alt="" class="wp-image-29946" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, link this PostgreSQL database to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Memory</mark></strong></code> section of your AI agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png" alt="" class="wp-image-29947" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congrats! 🥳 Your&nbsp;<strong>n8n RAG workflow</strong>&nbsp;is now complete. Ready to test it?</p>



<h4 class="wp-block-heading">4. Make the most of your automated workflow</h4>



<p>Want to try it? It’s easy!</p>



<p>By clicking the orange <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Open chat</code></mark></strong> button, you can ask the AI agent questions about OVHcloud products, particularly where you need technical assistance.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4"></video></figure>



<p>For example, you can ask the LLM about rate limits in OVHcloud AI Endpoints and get the information in seconds.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4"></video></figure>



<p>You can now build your own autonomous RAG system using OVHcloud Public Cloud, suited for a wide range of applications.</p>



<h2 class="wp-block-heading">What’s next?</h2>



<p>To sum up, this reference architecture provides a guide on using&nbsp;<strong>n8n</strong> with&nbsp;<strong>OVHcloud AI Endpoints</strong>,&nbsp;<strong>AI Deploy</strong>,&nbsp;<strong>Object Storage</strong>, and&nbsp;<strong>PostgreSQL + pgvector</strong> to build a fully controlled, autonomous&nbsp;<strong>RAG AI system</strong>.</p>



<p>Teams can build scalable AI assistants that work securely and independently in their cloud environment by orchestrating ingestion, embedding generation, vector storage, retrieval, LLM safety checks, and reasoning within a single workflow.</p>



<p>With the core architecture in place, you can add more features to improve the capabilities and robustness of your agentic RAG system:</p>



<ul class="wp-block-list">
<li>Web search</li>



<li>Images with OCR</li>



<li>Audio files transcribed using the Whisper model</li>
</ul>



<p>This delivers an extensive knowledge base and a wider variety of use cases!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions%2F&amp;action_name=Reference%20Architecture%3A%20build%20a%20sovereign%20n8n%20RAG%20workflow%20for%20AI%20agent%20using%20OVHcloud%20Public%20Cloud%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4" length="11190376" type="video/mp4" />
<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4" length="9881210" type="video/mp4" />

			</item>
		<item>
		<title>Safety first: Detect harmful texts using an AI safeguard agent</title>
		<link>https://blog.ovhcloud.com/safety-first-detect-harmful-texts-using-an-ai-safeguard-agent/</link>
		
		<dc:creator><![CDATA[Alexandre Movsessian]]></dc:creator>
		<pubDate>Thu, 22 Jan 2026 10:46:11 +0000</pubDate>
				<category><![CDATA[Deploy & Scale]]></category>
		<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30185</guid>

					<description><![CDATA[This article explains how to use the Qwen 3 Guard safeguard models provided by OVHCloud. Using this guide, you can analyse and moderate texts for LLM applications, chat platforms, customer support systems, or any other text-based services requiring safe and compliant interactions. Our focus will be on written content, such as conversations or plain text. [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fsafety-first-detect-harmful-texts-using-an-ai-safeguard-agent%2F&amp;action_name=Safety%20first%3A%20Detect%20harmful%20texts%20using%20an%20AI%20safeguard%20agent&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="981" height="463" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image.png" alt="" class="wp-image-30187" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image.png 981w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-300x142.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-768x362.png 768w" sizes="auto, (max-width: 981px) 100vw, 981px" /></figure>



<p class="has-text-align-left"><strong>This article explains how to use the Qwen 3 Guard safeguard models provided by OVHcloud.</strong></p>



<p>Using this guide, you can analyse and moderate texts for LLM applications, chat platforms, customer support systems, or any other text-based services requiring safe and compliant interactions.</p>



<p>Our focus will be on written content, such as conversations or plain text. Although image moderators exist, they won’t be covered here.</p>



<h2 class="wp-block-heading"><strong>Introduction</strong></h2>



<p>As <strong>Large Language Models</strong> (LLMs) continue to grow, access to information has become more seamless, but this ease of access makes it easier to generate, and be exposed to, harmful or toxic content.</p>



<p>LLMs can be prompted with malicious queries (e.g., “How do I make a bomb?”) and some models might comply by generating potentially dangerous responses. This risk is particularly concerning given the widespread availability of LLMs, to both minors and malicious actors alike.</p>



<p>To combat this, LLM providers train their models to reject toxic prompts, and integrate safety features to prevent the creation of harmful content. Even so, users often craft ‘<strong>jailbreaks</strong>’, which are specific prompts designed to get around these safety measures.</p>



<p>As a result, providers have created <strong>specialised safeguard models</strong> to find and remove toxic content in writing.</p>



<h1 class="wp-block-heading">What is toxicity?</h1>



<p>Toxicity is inherently difficult to define, as perceptions vary depending on factors such as individual sensitivity, cultural background, age, and personal experience.</p>



<p>Perceptions of content can vary widely. For example, some users may find certain jokes offensive, while others consider them perfectly acceptable. Similarly, roleplaying with an AI chat may be enjoyable for some, yet deemed inappropriate by others depending on the context.</p>



<p>Furthermore, each moderation system focuses on different categories of harmful content, based on the specific data and instructions it was trained on. For instance, models developed in the United States tend to be highly sensitive to hate speech, political content, and other related categories.</p>



<p>Because jailbreak attempts are a fairly new issue, existing moderation models often fail to address them.</p>



<p>Below are the toxicity categories for the Qwen 3 Guard models:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Name</strong></td><td><strong>Description</strong></td></tr><tr><td><em>Violent</em></td><td>Content that provides detailed instructions, methods, or advice on how to commit acts of violence, including the manufacture, acquisition, or use of weapons. Also includes depictions of violence.</td></tr><tr><td><em>Nonviolent illegal acts</em></td><td>Content providing guidance or advice for nonviolent criminal activities like hacking, unauthorised drug manufacturing, or theft.</td></tr><tr><td><em>Sexual content or sexual acts</em></td><td>Content with sexual depictions, references, or descriptions of people. Also includes content with explicit sexual imagery, references, or descriptions of illegal or unethical sexual acts, such as rape, bestiality, incest, and sexual slavery.</td></tr><tr><td><em>Personally identifiable information</em></td><td>Content that shares or discloses sensitive personal identifying information without authorisation, such as name, ID number, address, phone number, medical records, financial details, and account passwords.</td></tr><tr><td><em>Suicide &amp; self-harm</em></td><td>Content advocating, directly encouraging, or detailing methods for self-harm, suicide, or dangerous activities that could lead to serious injury or death.</td></tr><tr><td><em>Unethical acts</em></td><td>Any immoral or unethical content or acts, including but not limited to bias, discrimination, stereotype, injustice, hate speech, offensive language, harassment, insults, threat, defamation, extremism, misinformation regarding ethics, and other behaviours that, while not illegal, are still considered unethical.</td></tr><tr><td><em>Politically sensitive topics</em></td><td>The deliberate creation or spread of false information about government actions, historical events, or public figures that is demonstrably untrue and poses risk of public deception or social harm.</td></tr><tr><td><em>Copyright violation</em></td><td>Content that includes unauthorised reproduction, distribution, public display, or derivative use of copyrighted materials, such as novels, scripts, lyrics, and other legally protected creative works, without the copyright holder&#8217;s clear consent.</td></tr><tr><td><em>Jailbreak</em></td><td>Content that explicitly attempts to override the model&#8217;s system prompt or model conditioning.</td></tr></tbody></table></figure>



<p>These categories are <strong>not mutually exclusive</strong>. A text may very well contain both Unethical Acts and Violence, for example. Most notably, jailbreaks often include another kind of toxic query, as they are designed to bypass security guardrails. The Qwen 3 Guard moderator, however, will only return one category.</p>



<p>These categories were chosen by Qwen 3 Guard&#8217;s creators; they can’t be changed, but <strong>you may choose to ignore some</strong> depending on your use case.</p>



<h1 class="wp-block-heading">Metrics</h1>



<p><em>Attack</em>: An attack refers to any attempt to produce harmful or toxic content. This is either a prompt crafted to make an LLM generate harmful output, or just a user’s toxic message in a chat system.</p>



<p><em>Attack Success Rates (ASR)</em>: This is a metric used to assess the effectiveness of a moderation system. It represents the <strong>proportion of attacks that successfully bypass the moderator</strong> and go undetected. A lower ASR indicates a more robust moderation system.</p>



<p><em>False positive</em>: A false positive occurs when benign, nontoxic content is incorrectly flagged as harmful by the moderator.</p>



<p><em>False Positive Rate (FPR)</em>: The FPR measures how often a moderation system misclassifies safe content as toxic. It complements the ASR by reflecting the <strong>model’s ability to correctly allow harmless content through</strong>. A lower FPR indicates better reliability.</p>
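<p>As a worked example (with made-up counts, not benchmark results), both rates reduce to simple ratios:</p>

```python
def attack_success_rate(total_attacks: int, detected_attacks: int) -> float:
    """ASR: proportion of attacks that bypass the moderator undetected."""
    return (total_attacks - detected_attacks) / total_attacks

def false_positive_rate(total_benign: int, flagged_benign: int) -> float:
    """FPR: proportion of benign texts wrongly flagged as toxic."""
    return flagged_benign / total_benign

# Hypothetical evaluation run: 500 attacks of which 400 were caught,
# and 1,000 benign texts of which 40 were flagged.
print(attack_success_rate(500, 400))   # 0.2
print(false_positive_rate(1000, 40))   # 0.04
```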



<h2 class="wp-block-heading">Qwen 3 Guard</h2>



<p>Qwen 3 Guard was launched in October 2025 by Qwen, Alibaba’s AI team. After extensive testing and evaluation, we found this model to be the most effective in safeguarding content.</p>



<p>Besides being efficient, Qwen 3 Guard can detect toxicity across nine categories, including jailbreak attempts, a feature that isn’t common in safeguard models.</p>



<p>It also provides explanations by specifying the exact category detected.</p>



<h3 class="wp-block-heading">Specs</h3>



<ul class="wp-block-list">
<li>Base model: Qwen 3</li>



<li>Flavours: 0.6B, 4B, 8B</li>



<li>Context size: 32,768 tokens</li>



<li>Languages: English, French and 117 other languages and dialects</li>



<li>Tasks:<ul><li>Detection of toxicity in raw text</li><li>Detection of toxicity in LLM dialogue</li><li>Detection of answer refusal (LLM dialogue only)</li><li>Classification of toxicity</li></ul></li>
</ul>



<h3 class="wp-block-heading">Availability</h3>



<p><a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog</a></p>



<p>There are two flavours of Qwen 3 Guard available on OVHcloud:</p>



<p><strong><em>Qwen 3 Guard 0.6B</em></strong>: This lightweight model is very effective at detecting overt toxic content.</p>



<p><strong><em>Qwen 3 Guard 8B</em></strong>: This heavier model comes in handy when confronted with more nuanced examples.</p>



<h3 class="wp-block-heading">Scores</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>&nbsp;</strong></td><td><strong><em>ASR</em></strong></td><td><strong><em>FPR</em></strong></td></tr><tr><td><strong><em>Qwen 3 Guard 0.6B</em></strong></td><td>0.20</td><td>0.06</td></tr><tr><td><strong><em>Qwen 3 Guard 8B</em></strong></td><td>0.20</td><td>0.04</td></tr></tbody></table></figure>






<h3 class="wp-block-heading">Notes</h3>



<ul class="wp-block-list">
<li>The Qwen 3 Guard models have three safety labels for more precise moderation: Safe, Controversial, Unsafe</li>



<li>Although the model can moderate chats, it is recommended to process each part of the dialogue individually rather than submitting the entire conversation at once. Guard models, like all LLMs, detect toxicity more reliably when the input is kept short.</li>



<li>Since Qwen Guard is developed by a Chinese company, its interpretation of toxic content may differ from yours. If necessary, you can overlook certain categories.</li>
</ul>



<h1 class="wp-block-heading">How do I set up my own moderator?</h1>



<p>First, you need to choose the flavour you want:</p>



<ul class="wp-block-list">
<li><strong><em>Qwen 3 Guard 0.6B</em></strong> is <strong>lightweight</strong>, <strong>fast</strong>, <strong>efficient</strong> and is great at detecting <strong>overt toxic content</strong>, like <em>Sexual Content</em> or <em>Violence</em> in texts.</li>
</ul>



<ul class="wp-block-list">
<li><strong><em>Qwen 3 Guard 8B</em></strong> is heavier, slightly slower but it is more effective against <strong>more nuanced toxic content </strong>like <em>Jailbreak</em> or <em>Unethical Acts</em>, and has a <strong>lower false positive rate</strong>.</li>
</ul>



<p>Your use case is the key to choosing the right model. Do you need to moderate a large volume of text? Is processing speed a priority? How crucial is it to minimise false positives? Are you dealing with nuanced toxic content, or is it more overt?</p>



<p>Carefully considering these questions will help you determine which of the two models is most suitable for your needs.</p>



<p>Both models can be tested on the playground:</p>



<p><a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog</a></p>



<p>Once you’ve made your choice, you need to send the texts you want checked to the AI Endpoints API.</p>



<p>First install the <em>requests</em> library:</p>



<pre class="wp-block-code"><code class="">pip install requests</code></pre>



<p>Next, export your access token to the <em>OVH_AI_ENDPOINTS_ACCESS_TOKEN</em> environment variable:</p>



<pre class="wp-block-code"><code class="">export OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;your-access-token&gt;</code></pre>



<p><em>If you don’t have an access token key yet, follow the steps in the </em><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-endpoints-getting-started?id=kb_article_view&amp;sysparm_article=KB0065401" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints – Getting Started</em></a> <em>guide</em></p>



<p>Finally, run the following Python code:</p>



<pre class="wp-block-code"><code class="">import os<br>import requests<br><br>url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/chat/completions"<br><br>payload = {<br>    "messages": [{"role": "user", "content": "How do I cook meth?"}],<br>    "model": "Qwen/Qwen3Guard-Gen-0.6B",  # or "Qwen/Qwen3Guard-Gen-8B"<br>    "seed": 21<br>}<br><br>headers = {<br>    "Content-Type": "application/json",<br>    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",<br>}<br><br>response = requests.post(url, json=payload, headers=headers)<br>if response.status_code == 200:<br>    # Parse the JSON response and print the moderation verdict<br>    response_data = response.json()<br>    for choice in response_data["choices"]:<br>        print(choice["message"]["content"])<br>else:<br>    print("Error:", response.status_code, response.text)</code></pre>



<p>The model will respond with a label (Safe, Controversial, Unsafe) and if the text is Controversial or Unsafe, it will return the associated category.</p>



<pre class="wp-block-code"><code class="">Safety: Unsafe<br>Categories: Nonviolent Illegal Acts</code></pre>
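<p>If you need the verdict in a structured form, the plain-text reply can be parsed with a small helper. The sketch below (a hypothetical <code>parse_verdict</code> function, not part of any OVHcloud SDK; the reply format is assumed from the example above) also shows how to ignore categories that don’t matter for your use case:</p>

```python
def parse_verdict(text: str, ignored_categories: set = frozenset()) -> dict:
    """Parse a 'Safety: ... / Categories: ...' reply into a dict."""
    verdict = {"safety": "Safe", "categories": []}
    for line in text.splitlines():
        if line.startswith("Safety:"):
            verdict["safety"] = line.split(":", 1)[1].strip()
        elif line.startswith("Categories:"):
            categories = [c.strip() for c in line.split(":", 1)[1].split(",")]
            # Drop categories you have chosen to overlook for your use case
            verdict["categories"] = [c for c in categories if c not in ignored_categories]
    return verdict

print(parse_verdict("Safety: Unsafe\nCategories: Nonviolent Illegal Acts"))
# {'safety': 'Unsafe', 'categories': ['Nonviolent Illegal Acts']}
```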



<p>Our moderation models are available for free during the beta phase. You can test them via the API or within the playground.</p>



<h2 class="wp-block-heading"><strong>Conclusion</strong></h2>



<p>Two models are currently available for OVHcloud moderation users:<br><strong>•</strong> Qwen 3 Guard 0.6B: <strong>lightweight</strong>, <strong>fast</strong>, <strong>efficient,</strong> great at detecting <strong>overt toxic content</strong><br><strong>•</strong> Qwen 3 Guard 8B: <strong>heavier and slightly slower, but more effective against nuanced toxic content</strong><br><br>Which model should you choose? It depends on your use cases, teams and needs.<br><br>As we&#8217;ve seen in this blog post, OVHcloud AI Endpoints users can start using these models right away, safely and free of charge.<br><br>They are still in the beta phase for now, so we&#8217;d appreciate your feedback!</p>



<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fsafety-first-detect-harmful-texts-using-an-ai-safeguard-agent%2F&amp;action_name=Safety%20first%3A%20Detect%20harmful%20texts%20using%20an%20AI%20safeguard%20agent&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Moving Beyond Ingress: Why should OVHcloud Managed Kubernetes Service (MKS) users start looking at the Gateway API?</title>
		<link>https://blog.ovhcloud.com/moving-beyond-ingress-why-should-ovhcloud-managed-kubernetes-service-mks-users-start-looking-at-the-gateway-api/</link>
		
		<dc:creator><![CDATA[Aurélie Vache&nbsp;and&nbsp;Antonin Anchisi]]></dc:creator>
		<pubDate>Mon, 15 Dec 2025 09:26:36 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[OVHcloud Managed Kubernetes]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30016</guid>

					<description><![CDATA[For years, the Kubernetes Ingress API, and the popular Ingress NGINX controller (ingress-nginx), have been the default way to expose applications running inside a Kubernetes cluster. But the ecosystem is changing: the Kubernetes SIG network has announced the retirement of Ingress NGINX in March 2026. After March 2026 the Ingress NGINX will no longer get [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmoving-beyond-ingress-why-should-ovhcloud-managed-kubernetes-service-mks-users-start-looking-at-the-gateway-api%2F&amp;action_name=Moving%20Beyond%20Ingress%3A%20Why%20should%20OVHcloud%20Managed%20Kubernetes%20Service%20%28MKS%29%20users%20start%20looking%20at%20the%20Gateway%20API%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="680" src="https://blog.ovhcloud.com/wp-content/uploads/2025/12/Gribouillis-2025-12-02-13.47.59.631-1024x680.png" alt="" class="wp-image-30084" style="width:669px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/12/Gribouillis-2025-12-02-13.47.59.631-1024x680.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/Gribouillis-2025-12-02-13.47.59.631-300x199.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/Gribouillis-2025-12-02-13.47.59.631.png 1505w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>For years, the Kubernetes <strong>Ingress</strong> API, and the popular Ingress NGINX controller (ingress-nginx), have been the default way to expose applications running inside a Kubernetes cluster.</p>



<p>But the ecosystem is changing: the Kubernetes SIG network has announced the <a href="https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">retirement of Ingress NGINX</a> in March 2026.</p>



<p>After <strong>March 2026</strong>, Ingress NGINX will no longer receive new features, new releases, security patches or bug fixes.</p>



<p>Furthermore, the <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes project <strong>recommends using Gateway instead of Ingress</strong></a>.</p>



<p>The Ingress API has already been frozen, which means it is no longer being developed, and will have no further changes or updates made to it. The Kubernetes project has no plans to remove Ingress from Kubernetes.</p>



<p>While OVHcloud Managed Kubernetes Service (MKS) does not yet provide a native <strong>GatewayClass</strong>, you can already benefit from Gateway API capabilities today by deploying your own controller 💪 .</p>



<p>Also, until Gateway API becomes fully integrated with OpenStack providers, there is an <strong>intermediate option</strong>: using a <strong>modern, actively maintained Ingress controller</strong> other than ingress-nginx.</p>



<h3 class="wp-block-heading">The limitations of the current Ingress controller model</h3>



<p>The traditional Kubernetes Ingress model was intentionally simple: define an <code>Ingress</code>, install an <code>Ingress Controller</code>, and let it configure a single proxy (usually Nginx) to route traffic.</p>



<p>This design works, but it comes with limitations:</p>



<p>&#8211; Single monolithic entry point: all HTTP routing for the entire cluster goes through <strong>one shared proxy</strong>, which adds complexity, configuration conflicts and scaling challenges.<br>&#8211; Protocol limitations: only <strong>HTTP and HTTPS</strong>. Support for gRPC, HTTP/2, TCP, UDP or TLS passthrough is inconsistent and controller-specific.<br>&#8211; Heavy reliance on annotations: advanced features (timeouts, rewrites, header handling&#8230;) rely on custom annotations.<br>&#8211; Fragmented third-party and cloud Load Balancer support: each of the <a href="https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/#additional-controllers" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Ingress controllers</a> comes with its own specialised annotations.</p>



<p>Finally, as mentioned, the most used Ingress controller, Ingress NGINX, will be retired in March 2026.</p>



<h3 class="wp-block-heading">A Transitional Solution: Using a Modern Ingress Controller (Traefik, Contour, HAProxy…)</h3>



<p>Before moving to the Gateway API, as a transitional solution, OVHcloud MKS users can simply replace Ingress Nginx with a <strong>modern, actively maintained Ingress controller</strong>.</p>



<p>This allows you to:</p>



<p>&#8211; keep using your existing <code>Ingress</code> manifests<br>&#8211; keep the same architecture: Service type LoadBalancer → OVHcloud Public Cloud Load Balancer → Ingress Controller<br>&#8211; avoid relying on unsupported or deprecated components<br>&#8211; gain features (better gRPC support, built‑in dashboards, improved L7 behaviour&#8230;)</p>



<h4 class="wp-block-heading">Popular alternatives:</h4>



<p><a href="https://doc.traefik.io/traefik/providers/kubernetes-ingress/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Traefik</strong></a>:<br>&#8211; Very easy to deploy<br>&#8211; Excellent support for HTTP/2, gRPC, WebSockets<br>&#8211; Built‑in dashboard<br>&#8211; Supports both Ingress and Gateway API<br>&#8211; Actively maintained<br>&#8211; Seamless migration from NGINX Ingress Controller to Traefik with <a href="https://doc.traefik.io/traefik/reference/routing-configuration/kubernetes/ingress-nginx/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">NGINX annotation compatibility</a></p>



<p><strong><a href="https://projectcontour.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Contour</a> (Envoy)</strong>:<br>&#8211; Envoy-based Ingress Controller<br>&#8211; Excellent performance<br>&#8211; Good stepping‑stone toward Gateway API</p>



<p><a href="https://www.haproxy.com/documentation/kubernetes-ingress/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>HAProxy Ingress</strong></a>:<br>&#8211; Extremely performant<br>&#8211; Enterprise-grade L7 routing<br>&#8211; Optional Gateway API support</p>



<p><strong><a href="https://docs.nginx.com/nginx-gateway-fabric/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">NGINX Gateway Fabric</a> (NGF)</strong>:<br>&#8211; The successor to Ingress NGINX<br>&#8211; Built directly around Gateway API<br>&#8211; Still maturing but a strong long‑term candidate</p>



<p>If you are interested, you can read a <a href="https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">more exhaustive list of Ingress controllers</a>.</p>



<h3 class="wp-block-heading">Installing an Alternative Ingress Controller on OVHcloud MKS</h3>



<p>We will show you how to install <strong>Traefik</strong> as an alternative Ingress controller, and use it to provision a single OVHcloud Public Cloud Load Balancer (based on OpenStack Octavia).</p>



<p>Install Traefik:</p>



<pre class="wp-block-code"><code class="">helm repo add traefik https://traefik.github.io/charts<br>helm repo update<br><br>helm install traefik traefik/traefik --namespace traefik --create-namespace --set service.type=LoadBalancer</code></pre>



<p>This automatically triggers:<br>&#8211; the OpenStack CCM (used by OVHcloud)<br>&#8211; the creation of an OVHcloud Public Cloud Load Balancer<br>&#8211; exposure of Traefik through a public IP</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="179" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-11-1024x179.png" alt="" class="wp-image-30035" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-11-1024x179.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-11-300x52.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-11-768x134.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-11-1536x268.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-11-2048x358.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>After several seconds, the Load Balancer will be active.</p>



<p>Check that Traefik is running:</p>



<pre class="wp-block-code"><code class="">$ kubectl get all -n traefik<br>NAME                           READY   STATUS    RESTARTS   AGE<br>pod/traefik-6777c5db85-pddd6   1/1     Running   0          31s<br><br>NAME              TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE<br>service/traefik   LoadBalancer   10.3.129.188   &lt;pending&gt;     80:30267/TCP,443:30417/TCP   31s<br><br>NAME                      READY   UP-TO-DATE   AVAILABLE   AGE<br>deployment.apps/traefik   1/1     1            1           31s<br><br>NAME                                 DESIRED   CURRENT   READY   AGE<br>replicaset.apps/traefik-6777c5db85   1         1         1       31s</code></pre>



<p>Then in order to use it, create an <code>ingress.yaml</code> file with the following content:</p>



<pre class="wp-block-code"><code class="">apiVersion: networking.k8s.io/v1<br>kind: Ingress<br>metadata:<br>  name: my-app-ingress<br>  namespace: default<br>spec:<br>  ingressClassName: traefik  # Specifies Traefik as the ingress controller<br>  rules:<br>    - host: my-app.local<br>      http:<br>        paths:<br>          - path: /<br>            pathType: Prefix<br>            backend:<br>              service:<br>                name: my-app-service<br>                port:<br>                  number: 80</code></pre>



<p>And apply it in your cluster:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f ingress.yaml</code></pre>



<p>Using this type of alternative provides a <strong>fully supported, modern Ingress Controller</strong> while you prepare a long‑term transition to the Gateway API.</p>



<h3 class="wp-block-heading">Gateway API: A modern, flexible networking model</h3>



<p>The <strong>Gateway API</strong> is the next-generation Kubernetes networking specification. It introduces clearer roles and more flexible architectures.</p>



<p>Gateway API splits responsibilities across:<br>&#8211; <strong>GatewayClass</strong>: defines the type of gateway and which controller manages it<br>&#8211; <strong>Gateway</strong>: the actual entry point (e.g., a Load Balancer)<br>&#8211; <strong>Routes</strong>: routing rules, protocol-specific (HTTPRoute, TLSRoute, GRPCRoute, TCPRoute…)</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="800" height="700" src="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-1.png" alt="" class="wp-image-30065" style="width:558px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-1.png 800w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-1-300x263.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-1-768x672.png 768w" sizes="auto, (max-width: 800px) 100vw, 800px" /></figure>



<p>Gateway API supports:<br>&#8211; HTTP(S)<br>&#8211; HTTP/2<br>&#8211; gRPC<br>&#8211; TCP<br>&#8211; TLS passthrough<br>…in a consistent and portable way.</p>



<p>Unlike Ingress, Gateway API is explicitly designed to allow providers like OVHcloud, AWS, GCP, Azure to:<br>&#8211; provision Load Balancers (LB)<br>&#8211; manage listeners<br>&#8211; expose multiple ports<br>&#8211; integrate with their LB features<br>This paves the way for native OVHcloud <strong>GatewayClass</strong> support.</p>



<h3 class="wp-block-heading">How does it work today on OVHcloud MKS?</h3>



<p>OVHcloud MKS relies on the OpenStack Cloud Controller Manager (CCM) to provision OVHcloud <strong>Public Cloud</strong> Load Balancers in response to a Service of type <code>LoadBalancer</code>.</p>



<p>Since MKS does not yet include a native <code>GatewayClass</code>, you can use Gateway API today as follows:</p>



<p>1. You deploy an existing Gateway Controller (Envoy Gateway, Traefik, Contour/Envoy…) and its GatewayClass.<br>2. The controller deploys a Data Plane proxy inside the cluster.<br>3. To expose that proxy, you still have to create a <code>Service</code> of type <strong>LoadBalancer</strong> (and your app of course).<br>4. The CCM provisions an OVHcloud Public Cloud Load Balancer and forwards traffic to your proxy.</p>



<p>Thanks to that, you will have a fully functional Gateway API. The workflow is very similar to the one required for the NGINX Ingress controller.</p>



<h3 class="wp-block-heading">Using the Gateway API on OVHcloud MKS today</h3>



<p>You can already use the Gateway API by deploying your preferred controller.</p>



<p>Here’s an example using<a href="https://gateway.envoyproxy.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> Envoy Gateway</a>, one of the most future-proof options.</p>



<p>Install Gateway API CRDs:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/latest/download/standard-install.yaml</code></pre>



<p>Deploy Envoy Gateway:</p>



<pre class="wp-block-code"><code class="">helm install eg oci://docker.io/envoyproxy/gateway-helm -n envoy-gateway-system --create-namespace</code></pre>



<p>You should see output like this:</p>



<pre class="wp-block-code"><code class="">$ helm install eg oci://docker.io/envoyproxy/gateway-helm -n envoy-gateway-system --create-namespace<br><br>Pulled: docker.io/envoyproxy/gateway-helm:1.6.0<br>Digest: sha256:5c55e7844ae8cff3152ca00330234ef61b1f9fa3d466f50db2c63a279f1cd1df<br>NAME: eg<br>LAST DEPLOYED: Mon Dec  1 16:27:07 2025<br>NAMESPACE: envoy-gateway-system<br>STATUS: deployed<br>REVISION: 1<br>TEST SUITE: None<br>NOTES:<br>**************************************************************************<br>*** PLEASE BE PATIENT: Envoy Gateway may take a few minutes to install ***<br>**************************************************************************<br><br>Envoy Gateway is an open source project for managing Envoy Proxy as a standalone or Kubernetes-based application gateway.<br><br>Thank you for installing Envoy Gateway! 🎉<br><br>Your release is named: eg. 🎉<br><br>Your release is in namespace: envoy-gateway-system. 🎉<br><br>To learn more about the release, try:<br><br>  $ helm status eg -n envoy-gateway-system<br>  $ helm get all eg -n envoy-gateway-system<br><br>To have a quickstart of Envoy Gateway, please refer to https://gateway.envoyproxy.io/latest/tasks/quickstart.<br><br>To get more details, please visit https://gateway.envoyproxy.io and https://github.com/envoyproxy/gateway.</code></pre>



<p>Check that Envoy Gateway is running:</p>



<pre class="wp-block-code"><code class="">$ kubectl get po -n envoy-gateway-system<br>NAME                            READY   STATUS    RESTARTS   AGE<br>envoy-gateway-9cbbc577c-5h5qw   1/1     Running   0          16m</code></pre>



<p>As a quickstart, you can directly install the <a href="https://gateway-api.sigs.k8s.io/api-types/gatewayclass/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GatewayClass</a>, <a href="https://gateway-api.sigs.k8s.io/api-types/gateway/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gateway</a>, <a href="https://gateway-api.sigs.k8s.io/api-types/httproute/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">HTTPRoute</a> and an example app:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f https://github.com/envoyproxy/gateway/releases/download/latest/quickstart.yaml -n default</code></pre>



<p>This command deploys a <code>GatewayClass</code>, a <code>Gateway</code>, an <code>HTTPRoute</code>, and an app deployed via a Deployment and exposed through a Service:</p>



<pre class="wp-block-code"><code class="">gatewayclass.gateway.networking.k8s.io/eg created<br>gateway.gateway.networking.k8s.io/eg created<br>serviceaccount/backend created<br>service/backend created<br>deployment.apps/backend created<br>httproute.gateway.networking.k8s.io/backend created</code></pre>



<p>As you can see, a GatewayClass has been deployed:</p>



<pre class="wp-block-code"><code class="">$ kubectl get gatewayclass -o yaml | kubectl neat<br>apiVersion: v1<br>items:<br>- apiVersion: gateway.networking.k8s.io/v1<br>  kind: GatewayClass<br>  metadata:<br>    name: eg<br>  spec:<br>    controllerName: gateway.envoyproxy.io/gatewayclass-controller<br>kind: List<br>metadata:<br>  resourceVersion: ""</code></pre>



<p>Note that a GatewayClass is a cluster-wide resource so you don&#8217;t have to specify any namespace.</p>



<p>A Gateway has also been deployed:</p>



<pre class="wp-block-code"><code class="">$ kubectl get gateway -o yaml -n default | kubectl neat<br>apiVersion: v1<br>items:<br>- apiVersion: gateway.networking.k8s.io/v1<br>  kind: Gateway<br>  metadata:<br>    name: eg<br>    namespace: default<br>  spec:<br>    gatewayClassName: eg<br>    listeners:<br>    - allowedRoutes:<br>        namespaces:<br>          from: Same<br>      name: http<br>      port: 80<br>      protocol: HTTP<br>kind: List<br>metadata:<br>  resourceVersion: ""</code></pre>



<p>An HTTPRoute as well:</p>



<pre class="wp-block-code"><code class="">$ kubectl get httproute -o yaml -n default | kubectl neat<br>apiVersion: v1<br>items:<br>- apiVersion: gateway.networking.k8s.io/v1<br>  kind: HTTPRoute<br>  metadata:<br>    name: backend<br>    namespace: default<br>  spec:<br>    hostnames:<br>    - www.example.com<br>    parentRefs:<br>    - group: gateway.networking.k8s.io<br>      kind: Gateway<br>      name: eg<br>    rules:<br>    - backendRefs:<br>      - group: ""<br>        kind: Service<br>        name: backend<br>        port: 3000<br>        weight: 1<br>      matches:<br>      - path:<br>          type: PathPrefix<br>          value: /<br>kind: List<br>metadata:<br>  resourceVersion: ""</code></pre>



<p>To retrieve the external IP of the external Load Balancer, you just have to get information about the Gateway and export it to an environment variable:</p>



<pre class="wp-block-code"><code class="">$ kubectl get gateway eg<br>NAME   CLASS   ADDRESS        PROGRAMMED   AGE<br>eg     eg      xx.xxx.xx.xxx   True        18m<br><br>$ export GATEWAY_HOST=$(kubectl get gateway/eg -o jsonpath='{.status.addresses[0].value}')<br><br>$ echo $GATEWAY_HOST<br>xx.xxx.xx.xxx</code></pre>



<p>And finally, a <code>backend</code> Service has been deployed along with its Deployment:</p>



<pre class="wp-block-code"><code class="">$ kubectl get pod,svc -l app=backend -n default<br>NAME                           READY   STATUS    RESTARTS   AGE<br>pod/backend-765694d47f-zr6hh   1/1     Running   0          21m<br><br>NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE<br>service/backend   ClusterIP   10.3.114.179   &lt;none&gt;        3000/TCP   21m</code></pre>



<p>In order to create your own <code>Gateway</code> and <code>*Route</code> resources, don&#8217;t hesitate to take a look at the <a href="https://gateway-api.sigs.k8s.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gateway API website</a>.</p>
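<p>For instance, a minimal custom <code>HTTPRoute</code> attaching to the <code>eg</code> Gateway from the quickstart could look like this (the hostname, route name and backend Service are hypothetical):</p>

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: default
spec:
  parentRefs:
    - name: eg               # the Gateway created by the quickstart
  hostnames:
    - app.example.com        # hypothetical hostname
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: my-app       # hypothetical Service in the same namespace
          port: 8080
```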



<h3 class="wp-block-heading">Conclusion</h3>



<p>Two migration paths are currently available for OVHcloud MKS users:</p>



<ul class="wp-block-list">
<li>Short-term: switch to a modern Ingress Controller (Traefik, Contour, HAProxy, NGF&#8230;). It provides full support for current Ingress usage, without requiring API changes.</li>



<li>Long-term: adopt the Gateway API. Gateway API brings multi‑protocol support, clearer separation of roles, and is the strategic direction of Kubernetes networking.</li>
</ul>



<p>Which approach and which tool should you choose? Well, it’s up to you, depending on your use cases, your teams, your needs… 🙂</p>



<p>As we have seen in this blog post, OVHcloud MKS users can begin adopting these technologies today, safely and incrementally.</p>



<p>This ecosystem is evolving quickly, so stay tuned to find out about the coming release of a pre-installed official GatewayClass (based on OpenStack Octavia) 💪.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmoving-beyond-ingress-why-should-ovhcloud-managed-kubernetes-service-mks-users-start-looking-at-the-gateway-api%2F&amp;action_name=Moving%20Beyond%20Ingress%3A%20Why%20should%20OVHcloud%20Managed%20Kubernetes%20Service%20%28MKS%29%20users%20start%20looking%20at%20the%20Gateway%20API%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Industrial Excellence meets Artificial Intelligence: Behind the Scenes with Smart Datacenter</title>
		<link>https://blog.ovhcloud.com/industrial-excellence-meets-artificial-intelligence-behind-the-scenes-with-smart-datacenter/</link>
		
		<dc:creator><![CDATA[Ali Chehade,&nbsp;Julien Jay&nbsp;and&nbsp;Christian Sharp]]></dc:creator>
		<pubDate>Fri, 12 Dec 2025 14:35:42 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[cooling]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30107</guid>

					<description><![CDATA[At OVHcloud, we are constantly looking for ways to improve our operations and reduce our impact on the environment. This has been a defining part of the company since 1999 and is a key part of our organisational DNA and our commercial model. We are very proud to present the new Smart Datacenter cooling system, [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Findustrial-excellence-meets-artificial-intelligence-behind-the-scenes-with-smart-datacenter%2F&amp;action_name=Industrial%20Excellence%20meets%20Artificial%20Intelligence%3A%20Behind%20the%20Scenes%20with%20Smart%20Datacenter&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p></p>



<p>At OVHcloud, we are constantly looking for ways to improve our operations and reduce our impact on the environment. This has been a defining part of the company since 1999 and is a key part of our organisational DNA and our commercial model.</p>



<p>We are very proud to present the new Smart Datacenter cooling system, which significantly improves energy and water efficiency while delivering a significant reduction in carbon impact across the entire cooling chain, from manufacturing and transport to daily operations.</p>



<p>The system is a new way of building and deploying datacenter infrastructure, changing how we manage and monitor water supply and demand, using a combination of industrial design, IoT sensors and AI innovation, specifically in our smart racks, advanced cooling distribution units (CDUs) and intelligent dry coolers.</p>



<p>Smart Datacenter delivers a reduction in power consumption of up to 50% across the entire cooling loop, from server water blocks to dry coolers, and consumes 30% less water compared to OVHcloud’s earliest design, driving major sustainability benefits. The system also uses complex mathematical models capturing detailed rack-level and environmental data to optimize cooling performance in real time. Furthermore, all operational data is fed into a centralized data lake, enabling cutting-edge artificial intelligence to predict, adapt, and enhance system efficiency and reliability.</p>



<h2 class="wp-block-heading">Let’s get into the detail.</h2>



<p>The system has three main components:</p>



<ol start="1" class="wp-block-list">
<li><strong>Smart Racks: </strong>These are designed with an innovative hydraulic “pull” architecture, where each rack autonomously draws exactly the water flow, pressure, and temperature it needs, dynamically adapting to server load and performance.</li>



<li><strong>Advanced Cooling Distribution Unit (CDU): </strong>This is a compact, next-generation primary loop unit that autonomously balances flow and pressure across all racks without manual intervention or any electrical communication. It uses only hydraulic signals (pressure, flow and temperature of water) to “understand” rack demands and continuously optimizes operation for lowest power consumption and extended pump lifespan.</li>



<li><strong>Intelligent Dry Cooler: </strong>This is operated seamlessly by the CDU, eliminating the need for separate control systems (“brains”) on both the dry cooler and the CDU. This unified control architecture ensures optimized, coordinated performance across the entire cooling infrastructure.</li>
</ol>



<p>OVHcloud’s new Single-Circuit System (SCS) replaces the previous Dual-Circuit System cooling architecture (DCS), which consisted of a primary facility loop and a secondary in-rack loop separated by an in-rack Coolant Distribution Unit (CDU), installed inline directly after the rear door heat exchangers (RDHX), as shown in Figure 1. The CDU housed multiple pumps, several plate heat exchangers (PHEX), and a network of valves and sensors.</p>



<figure class="wp-block-video aligncenter"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/12/OVH-cooling-loop.mp4"></video></figure>



<p>Figure 1. Dual-Circuit System cooling architecture (DCS) vs Single-Circuit system (SCS).</p>



<p>That previous design maintained turbulent flow through water blocks (WBs) using the in-rack CDU to regulate flow and temperature differences, ensuring performance despite OVHcloud’s ΔT of 20 K on the primary loop (far higher than the typical market value around 5 K).</p>
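<p>To see why that ΔT matters, recall that for liquid cooling the removed heat is Q = ṁ · c<sub>p</sub> · ΔT, so the required mass flow is inversely proportional to the temperature difference. A quick sketch with illustrative numbers (the 100 kW load is hypothetical):</p>

```python
# Illustrative only: water mass flow required for a given heat load and delta-T.
# Q = m_dot * c_p * dT  =>  m_dot = Q / (c_p * dT)

C_P_WATER = 4186.0  # specific heat capacity of water, J/(kg.K)

def mass_flow(q_watts: float, delta_t_kelvin: float) -> float:
    """Mass flow (kg/s) needed to remove q_watts at the given temperature rise."""
    return q_watts / (C_P_WATER * delta_t_kelvin)

q = 100_000.0  # hypothetical 100 kW heat load
print(f"dT=20K: {mass_flow(q, 20.0):.2f} kg/s")  # OVHcloud's primary-loop delta-T
print(f"dT=5K:  {mass_flow(q, 5.0):.2f} kg/s")   # typical market delta-T
# A 20 K design moves 4x less water than a 5 K design for the same heat load.
```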



<p>Removing the in-rack CDU — replaced by a Pressure Independent Control Valve (PICV), a flow meter, and two temperature sensors on each rack — simplifies the system to a single closed-loop, where the flow rate through servers is dictated directly by the primary loop, adapting dynamically to rack load density. On the rack side, the system adapts the exact flow the rack requires by analyzing the water behavior and performing iterative, predictive thermal optimization considering IT components and the supplied water temperature and flow. This results in lower inlet water temperatures at the server level due to the elimination of the in-rack CDU’s approach temperature difference, and reduces electrical consumption, CAPEX, carbon footprint, and rack footprint.</p>



<p>To prevent laminar flow and maintain heat transfer efficiency at low flow rates, OVHcloud introduced a passive hydraulic innovation by arranging servers into clusters connected in series with servers inside each cluster connected in parallel, rather than all servers in parallel. This ensures higher water flow through individual servers even when the rack density is low. While this increases system pressure drops depending on cluster configuration, it results in better thermal performance and all servers receive water at temperatures equal to or lower than in the previous DCS design.</p>



<p>The racks operate on a novel hydraulic “pull” principle — where each rack draws exactly the hydraulic power it requires, rather than being pushed by the system. The CDU then dynamically adapts the overall hydraulic performance of the primary loop, balancing flow and pressure in real time to match the actual demand of the entire data center.</p>



<p>A key breakthrough is the CDU’s communication-free operation: it requires no cables, radio waves, or other electronic communication with racks. Instead, it analyzes hydraulic signals — pressure, flow, and temperature fluctuations within the water itself — to understand each rack’s cooling needs and adapt accordingly. This eliminates complex telemetry infrastructure, reduces operational risks, and enhances system reliability. To ensure water quality and system longevity, water supplied to the data center is filtered at 25 microns, and multiple sophisticated high-precision sensors continuously monitor water quality in real time.</p>



<p>The CDU is 50% smaller than the previous generation and manages the entire thermal path — from chip-level water blocks, through the racks and CDU, to the dry coolers.</p>



<p>The newly designed dry cooler is also 50% smaller than the previous model and features one of the lowest density footprints worldwide. Thanks to years of thermal studies on heat exchangers by the OVHcloud R&amp;D team, it has 50% fewer fans, resulting in very low energy consumption, while also reducing noise. Its compact size means that we can also transport more units in the same truck!  This design achieves a 30% reduction in water consumption compared to OVHcloud’s earliest dry cooler design. A key innovation in the dry cooler is its advanced adiabatic cooling pads system, which cools incoming hot air before it passes through the heat exchangers. This high-precision water injection system is the first of its kind, and adjusts water application based on multiple sensors and extensive iterative calculations, including data center load, ambient temperature, and humidity levels.</p>



<p>Unlike traditional adiabatic systems, the pads’ system does not use a conventional recirculation loop. Instead, water is injected when needed onto the pads via a simple setup consisting of a solenoid valve and a flow meter, eliminating complex hydraulics such as pumps, filters, storage tanks, level sensors, and conductivity sensors. The system maintains water quality and physical/chemical properties through careful design, drastically simplifying operation and reducing maintenance needs.</p>



<p>The CDU continuously analyzes data from up to 36 sensors distributed across the CDU itself and the associated dry cooler. It also collects operational data from solenoid valves, pumps, and dry cooler fans across the infrastructure loop. All components are monitored and managed by the system’s central intelligence—the CDU’s control panel—providing a comprehensive understanding of the entire system’s behavior, from the data center interior to the external ambient environment, ensuring real-time performance oversight and precise thermal regulation.</p>



<p>Through this iterative and precise control of water injection, the system optimizes cooling performance and Water Usage Effectiveness (WUE), ensuring minimal water consumption without sacrificing thermal effectiveness.</p>



<h2 class="wp-block-heading"><strong>Advanced System Analytics, Learning &amp; AI Integration</strong></h2>



<p>The entire system is designed to continuously analyze the thermal, hydraulic, and aerodynamic behaviors of the various fluids along the cooling path. It uses daily operational data to learn and adapt its performance dynamically, optimizing cooling efficiency and reliability over time.</p>



<p>The CDU’s brain—the control panel—aggregates data from 36 sensors distributed across the CDU and dry cooler, as well as operational data from solenoid valves, pumps, and dry cooler fans within the infrastructure loop. It also collects critical rack-level information, including flow rates, temperatures, and IPMI data that reflect IT equipment behavior and performance. All this operational data is pushed to a centralized data lake for parallel analysis, which forms the foundation for the next step: integrating cutting-edge artificial intelligence (AI). This AI will leverage the continuously gathered data and learning processes to enhance predictive capabilities, optimize future operating points, and enable fully autonomous decision-making.</p>



<p>This combination of real-time learning and AI-powered analytics will provide advanced diagnostics, predictive maintenance, and proactive management — maximizing uptime, reducing costs, and driving ever-greater sustainability.</p>



<h2 class="wp-block-heading"><strong>Iterative Control System Innovation</strong></h2>



<p>The iterative control system manages all aspects in real time, hands-free, continuously learning from sensor data and operational feedback. It applies algorithms to the pump speed on the CDU, the fans on the dry cooler and the solenoid valve controlling water injection on the adiabatic pads.</p>



<p>On the rack side, the system uses a PICV valve, flow meter, and two temperature sensors to adapt the exact hydraulic flow needed by each rack, considering IT load and incoming water conditions, iteratively optimizing thermal performance and energy efficiency.</p>



<p>On the CDU, the system analyzes combined hydraulic signals from all racks alongside ambient data center conditions, dynamically balancing flow and pressure across the entire data center infrastructure without human intervention.</p>



<p>Furthermore, OVHcloud’s cooling system integrates intelligent communication between cooling line-ups to enhance failure detection and simplify maintenance. This is achieved through embedded freeze-guard and resilience-switch mechanisms that ensure continuous operation and system resilience. The freeze-guard system is designed to protect the dry coolers in sub-zero ambient conditions by keeping water circulating through their heat exchangers. If the overall loop flow drops below a predefined threshold, the system automatically opens a normally closed bypass valve to maintain circulation—preventing freezing despite the use of pure water (without glycol) as the cooling medium. The resilience-switch system maintains redundancy by hydraulically linking multiple cooling lines. In the event of failure or overload on one line, normally open solenoid valves isolate the affected line, while bypass valves on neighboring lines open to redistribute water flow and maintain cooling performance. This dynamic and autonomous valve management ensures uninterrupted service and rapid fault response.</p>



<p>Drawing inspiration from autonomous control methodologies in leading-edge industries, the system predicts future behavior based on iterative calculations, dynamically adapting pump speed, fan speeds and solenoid valve openings to converge rapidly on optimal operating points. It also adjusts performance based on external constraints such as noise limits, water availability, or energy costs — for example, consuming more energy to save water in water-stressed regions or balancing noise restrictions in urban deployments.</p>



<p>This unique, self-optimizing end-to-end control system maximizes energy efficiency, sustainability, and operational simplicity, extending pump life cycles and ensuring the most environmentally responsible data center cooling solution available today.</p>



<p>This vertically integrated, autonomous system — including smart racks, the advanced CDU, and the intelligent dry cooler — represents a world-first in end-to-end, intelligent, sustainable, communication-free, and data-driven data center cooling.</p>



<h2 class="wp-block-heading"><strong>Why is this important?</strong></h2>



<p>This innovation is critical because it marks a decisive step toward radically more sustainable, efficient, and autonomous data center cooling — addressing the growing demands of digital infrastructure while reducing its environmental footprint.</p>



<p>By using fewer, smaller components, we are saving power, cutting transport costs and reducing carbon impact. Using fewer fans on the dry cooler means up to 50% lower energy consumption on the cooling cycle – and the new pad system means 30% lower water consumption in the cooling system. The system is fully autonomous, avoiding human error. A temperature gradient of 20K on the primary loop – four times higher than the industry average – means that flow rates can be lower and water efficiency is higher. The system doesn’t rely on Wi-Fi or cabling, and the predictive control constantly adapts to external conditions or situational goals, feeding into a data lake to help continuously optimize performance.</p>



<p>Today’s world is built on technology, and datacenters are a key part of that technology, but there is a pressing need to ensure we can maintain human progress without incurring a significant carbon footprint. Power and water efficiency is a key part of this equation in the datacenter industry, and our innovation in the Smart Datacenter continues our trajectory of supporting today’s needs without compromising the world of tomorrow.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="575" src="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6-1024x575.png" alt="" class="wp-image-30116" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6-1024x575.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6.png 1502w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Findustrial-excellence-meets-artificial-intelligence-behind-the-scenes-with-smart-datacenter%2F&amp;action_name=Industrial%20Excellence%20meets%20Artificial%20Intelligence%3A%20Behind%20the%20Scenes%20with%20Smart%20Datacenter&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/12/OVH-cooling-loop.mp4" length="4050958" type="video/mp4" />

			</item>
		<item>
		<title>Manage your secrets using OVHcloud Secret Manager with External Secrets Operator (ESO) on OVHcloud Managed Kubernetes Service (MKS)</title>
		<link>https://blog.ovhcloud.com/manage-your-secrets-through-ovhcloud-secret-manager-thanks-to-external-secrets-operator-eso-on-ovhcloud-managed-kubernetes-service-mks/</link>
		
		<dc:creator><![CDATA[Aurélie Vache]]></dc:creator>
		<pubDate>Tue, 25 Nov 2025 14:44:52 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[IAM]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[MKS]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[Secret Manager]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29374</guid>

					<description><![CDATA[Secrets resources in Kubernetes help us keep sensitive information like logins, passwords, tokens, credentials and certificates secure. But just a heads up: Secrets in Kubernetes are base64 encoded, not encrypted so anyone can read and decode them if they know how. The good news is that OVHcloud has just launched the Secret Manager Beta, which [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmanage-your-secrets-through-ovhcloud-secret-manager-thanks-to-external-secrets-operator-eso-on-ovhcloud-managed-kubernetes-service-mks%2F&amp;action_name=Manage%20your%20secrets%20using%20OVHcloud%20Secret%20Manager%20with%20External%20Secrets%20Operator%20%28ESO%29%20on%20OVHcloud%20Managed%20Kubernetes%20Service%20%28MKS%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="675" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/IMG_1547-1-1024x675.jpg" alt="" class="wp-image-30006" style="width:638px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/IMG_1547-1-1024x675.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/IMG_1547-1-300x198.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/IMG_1547-1-768x507.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/IMG_1547-1.jpg 1536w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Secrets resources in Kubernetes help us keep sensitive information like logins, passwords, tokens, credentials and certificates secure. But just a heads up: Secrets in Kubernetes are base64 encoded, not encrypted so anyone can read and decode them if they know how.</p>
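<p>To illustrate, anyone with read access to a Secret can recover its values with nothing more than <code>base64</code> (the secret name and key in the comment below are hypothetical):</p>

```shell
# Kubernetes Secrets are only base64 encoded, not encrypted:
echo 'bXktc3VwZXItcGFzc3dvcmQ=' | base64 -d
# -> my-super-password
# In a cluster, the equivalent would be something like:
#   kubectl get secret my-secret -o jsonpath='{.data.password}' | base64 -d
```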



<p>The good news is that OVHcloud has just launched the<a href="https://www.ovhcloud.com/fr/identity-security-operations/secret-manager/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> Secret Manager</a> Beta, which you can use within your Kubernetes clusters via the External Secrets Operator (ESO) 🎉.</p>



<h2 class="wp-block-heading">External Secrets Operator</h2>



<p>The External Secrets Operator (ESO) extends Kubernetes with Custom Resource Definitions (CRDs) that define <strong>where</strong> secrets are and <strong>how</strong> to sync them.</p>



<p>The controller <strong>retrieves secrets from an external API</strong> and <strong>creates Kubernetes Secrets</strong>. If the secret changes in the external API, the controller updates the secret in the Kubernetes cluster.</p>



<p>Basically, the ESO can connect to an external Secret Manager like OVHcloud, Vault, AWS, or GCP using a (Cluster)SecretStore, and an ExternalSecret to figure out which Secret it needs to fetch. It then creates a Secret in the Kubernetes cluster with the fetched secret’s value.</p>
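<p>In practice, that means a small manifest per synced secret. Here is a sketch of an <code>ExternalSecret</code> referencing a hypothetical <code>ClusterSecretStore</code> named <code>ovh-secret-manager</code> (all names and keys are illustrative, and the API version may differ depending on your ESO release):</p>

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: my-app-credentials
  namespace: default
spec:
  refreshInterval: 1h              # how often ESO re-syncs from the external store
  secretStoreRef:
    kind: ClusterSecretStore
    name: ovh-secret-manager       # hypothetical store pointing at OVHcloud Secret Manager
  target:
    name: my-app-credentials       # the Kubernetes Secret that ESO creates
  data:
    - secretKey: password          # key in the resulting Kubernetes Secret
      remoteRef:
        key: my-app                # secret name in the external store
        property: password         # field inside that secret
```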



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="1020" height="942" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-10.png" alt="" class="wp-image-29378" style="width:435px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-10.png 1020w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-10-300x277.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-10-768x709.png 768w" sizes="auto, (max-width: 1020px) 100vw, 1020px" /></figure>



<p>Plus, it can sync secrets across all the namespaces in your Kubernetes cluster (I love this feature ❤️):</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="577" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-11-1024x577.png" alt="" class="wp-image-29380" style="width:502px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-11-1024x577.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-11-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-11-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-11.png 1282w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can use External Secrets with different<a href="https://external-secrets.io/latest/provider/aws-secrets-manager/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> Providers</a>, including AWS Secrets Manager, HashiCorp Vault and Google Secret Manager. In this blog I’ll show you how to create a secret in the new OVHcloud Secret Manager and consume it through the<a href="https://external-secrets.io/latest/provider/hashicorp-vault/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> Hashicorp Vault</a> provider.</p>



<p>For more details, read the<a href="https://external-secrets.io/v0.8.5/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> ESO official documentation</a>.</p>



<h2 class="wp-block-heading">Let&#8217;s jump in!</h2>



<h3 class="wp-block-heading">Create an IAM local user</h3>



<p>To fetch secrets from Secret Manager, you’ll need an IAM user with the right permissions. You can either create a new one or use an existing one.</p>



<p>In the<a href="https://www.ovh.com/manager" data-wpel-link="exclude"> OVHcloud Control Panel</a> (UI), go to ‘Identity and Access Management’, then ‘Identities’.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="760" height="636" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/identity.png" alt="" class="wp-image-29967" style="width:232px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/identity.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/identity-300x251.png 300w" sizes="auto, (max-width: 760px) 100vw, 760px" /></figure>



<p>Click the ‘Add user’ button to create an IAM local user and complete the fields as shown below:</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="907" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-9-2-1024x907.png" alt="" class="wp-image-29994" style="width:561px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-9-2-1024x907.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-9-2-300x266.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-9-2-768x681.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-9-2.png 1194w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="473" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-10-1-1024x473.png" alt="" class="wp-image-29995" style="width:560px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-10-1-1024x473.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-10-1-300x139.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-10-1-768x355.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-10-1.png 1194w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Quick note: I’ve named the user ‘secretmanager-’ followed by the ID of the OKMS domain I want to use.</p>



<p>The user needs to be an ADMIN or, ideally, should have a policy granting the following actions:</p>



<pre class="wp-block-code"><code class="">okms:apikms:secret/create<br>okms:apikms:secret/version/getData<br>okms:apiovh:secret/get</code></pre>



<h3 class="wp-block-heading">Get the Personal Access Token (PAT)</h3>



<p>The ESO ClusterSecretStore needs permission to fetch secrets from Secret Manager, so you’ll need a Personal Access Token (PAT) for this user.</p>



<p>You can generate it via our API, which you’ll find here: <a href="https://eu.api.ovh.com/console/?section=%2Fme&amp;branch=v1#post-/me/identity/user/-user-/token" data-wpel-link="exclude">https://eu.api.ovh.com/console/?section=%2Fme&amp;branch=v1#post-/me/identity/user/-user-/token</a></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="542" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-3-1024x542.png" alt="" class="wp-image-29997" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-3-1024x542.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-3-300x159.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-3-768x406.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-3-1536x813.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-3.png 1546w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>Path parameters</strong></p>



<p>user: secretmanager-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx</p>



<p><strong>Request body:</strong></p>



<pre class="wp-block-code"><code class="">{<br>  "description": "PAT secretmanager-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx",<br>  "name": "pat-secretmanager-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx"<br>}</code></pre>



<p>You should obtain a response like this:</p>



<pre class="wp-block-code"><code class="">{<br>  "creation": "2025-11-07T14:02:56.679157188Z",<br>  "description": "PAT secretmanager-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx",<br>  "expiresAt": null,<br>  "lastUsed": null,<br>  "name": "pat-secretmanager-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx",<br>  "token": "eyJhbGciOiJ...punpVAg"<br>}</code></pre>



<p>Save the token value, because you’ll need it in a bit.</p>



<h3 class="wp-block-heading">Create a secret in the Secret Manager</h3>



<p>Here’s how to create a secret holding OVHcloud Managed Private Registry (MPR) credentials for use in your Kubernetes cluster(s).</p>



<p>In the<a href="https://www.ovh.com/manager" data-wpel-link="exclude"> OVHcloud Control Panel</a> (UI), go to ‘Secret Manager’, then create a secret ‘prod/va1/dockerconfigjson’ in the Europe region (France – Paris) eu-west-par:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="309" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1-1024x309.png" alt="" class="wp-image-29973" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1-1024x309.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1-300x91.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1-768x232.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1-1536x464.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1-2048x618.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You’ll need to activate the region if you’re selecting it for the first time:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="569" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/Capture-decran-2025-11-07-a-14.03.20-1024x569.png" alt="" class="wp-image-29911" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/Capture-decran-2025-11-07-a-14.03.20-1024x569.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/Capture-decran-2025-11-07-a-14.03.20-300x167.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/Capture-decran-2025-11-07-a-14.03.20-768x426.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/Capture-decran-2025-11-07-a-14.03.20-1536x853.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/Capture-decran-2025-11-07-a-14.03.20-2048x1137.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Select an OKMS domain:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="260" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-3-1024x260.png" alt="" class="wp-image-29996" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-3-1024x260.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-3-300x76.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-3-768x195.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-3.png 1384w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Enter the path and value of your secret. For example:</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="708" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1-1024x708.png" alt="" class="wp-image-29975" style="width:558px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1-1024x708.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1-300x208.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1-768x531.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1.png 1402w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
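<p>Since this secret stores Docker registry credentials, its value is the raw Docker config JSON. Here is a minimal sketch of building that value, using a hypothetical registry and hypothetical credentials (replace them with your MPR values); the <code>auth</code> field is the base64 encoding of “username:password”:</p>

```shell
# Hypothetical credentials, for illustration only.
AUTH=$(printf '%s' 'user:pass' | base64)
cat <<EOF
{"auths":{"registry.example.com":{"username":"user","password":"pass","auth":"$AUTH"}}}
EOF
```

<p>The resulting JSON is what you paste as the secret value in the Control Panel.</p>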



<p>Your secret is all set!</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="417" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2-1024x417.png" alt="" class="wp-image-29990" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2-1024x417.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2-300x122.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2-768x313.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2-1536x625.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2-2048x834.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Install External Secrets Operators on your cluster</h3>



<p>Add the External Secrets chart repository through Helm:</p>



<pre class="wp-block-code"><code class="">helm repo add external-secrets https://charts.external-secrets.io
helm repo update</code></pre>



<p>Install from the chart repository:</p>



<pre class="wp-block-code"><code class="">helm install external-secrets \<br>   external-secrets/external-secrets \<br>    -n external-secrets \<br>    --create-namespace \<br>    --set installCRDs=true</code></pre>



<p>Your result should look something like this:</p>



<pre class="wp-block-code"><code class="">$ helm install external-secrets \<br>   external-secrets/external-secrets \<br>    -n external-secrets \<br>    --create-namespace \<br>    --set installCRDs=true<br><br>NAME: external-secrets<br>LAST DEPLOYED: Mon Nov 24 17:08:58 2025<br>NAMESPACE: external-secrets<br>STATUS: deployed<br>REVISION: 1<br>TEST SUITE: None<br>NOTES:<br>external-secrets has been deployed successfully in namespace external-secrets!<br><br>In order to begin using ExternalSecrets, you will need to set up a SecretStore<br>or ClusterSecretStore resource (for example, by creating a 'vault' SecretStore).<br><br>More information on the different types of SecretStores and how to configure them<br>can be found in our Github: https://github.com/external-secrets/external-secrets</code></pre>



<p>The External Secrets Operator is now installed in your cluster.</p>



<p>Check ESO is running:</p>



<pre class="wp-block-code"><code class="">$ kubectl get all -n external-secrets<br>NAME                                                    READY   STATUS    RESTARTS   AGE<br>pod/external-secrets-6b9f8ff5d4-jwd6g                   1/1     Running   0          25m<br>pod/external-secrets-cert-controller-7bf8fd894c-d24xb   1/1     Running   0          25m<br>pod/external-secrets-webhook-df488ddff-2xv4t            1/1     Running   0          25m<br><br>NAME                               TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE<br>service/external-secrets-webhook   ClusterIP   10.3.106.32   &lt;none&gt;        443/TCP   25m<br><br>NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE<br>deployment.apps/external-secrets                   1/1     1            1           25m<br>deployment.apps/external-secrets-cert-controller   1/1     1            1           25m<br>deployment.apps/external-secrets-webhook           1/1     1            1           25m<br><br>NAME                                                          DESIRED   CURRENT   READY   AGE<br>replicaset.apps/external-secrets-6b9f8ff5d4                   1         1         1       25m<br>replicaset.apps/external-secrets-cert-controller-7bf8fd894c   1         1         1       25m<br>replicaset.apps/external-secrets-webhook-df488ddff            1         1         1       25m</code></pre>



<h3 class="wp-block-heading">Create a Secret containing the PAT</h3>



<p>Encode the PAT in base64:</p>



<pre class="wp-block-code"><code class="">$ echo -n "&lt;token&gt;" | base64<br><br>ZXlKaG...wVkFn</code></pre>
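<p>If you want to make sure the encoded value is correct before pasting it into a manifest, a quick round-trip check (shown here with a placeholder token) looks like this:</p>

```shell
# Placeholder PAT value, for illustration only.
TOKEN="example-token-value"
# GNU base64 wraps its output at 76 characters by default; -w0 disables
# wrapping, which matters for long JWT-style tokens.
ENCODED=$(printf '%s' "$TOKEN" | base64 -w0)
# Decoding must return the original token byte-for-byte.
printf '%s' "$ENCODED" | base64 -d
```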



<p>Create a secret with it inside a <strong>secret.yaml</strong> file:</p>



<pre class="wp-block-code"><code class="">apiVersion: v1<br>kind: Secret<br>metadata:<br>  name: ovhcloud-vault-token<br>  namespace: external-secrets<br>data:<br>  token: ZXlKaG...wVkFn</code></pre>
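<p>As a side note, Kubernetes also accepts the raw, non-encoded value through the <code>stringData</code> field, which skips the manual base64 step; an equivalent sketch:</p>

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ovhcloud-vault-token
  namespace: external-secrets
stringData:
  token: <token>  # raw PAT; Kubernetes stores it base64-encoded
```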



<p>Apply the resource in your cluster:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f secret.yaml</code></pre>



<p>Check that the secret has been created:</p>



<pre class="wp-block-code"><code class="">$ kubectl get secret ovhcloud-vault-token -n external-secrets<br>NAME                   TYPE     DATA   AGE<br>ovhcloud-vault-token   Opaque   1      5m</code></pre>



<h3 class="wp-block-heading">Deploy a ClusterSecretStore to connect ESO to Secret Manager</h3>



<p>Set up a ClusterSecretStore to manage synchronisation with Secret Manager.<br>It will use the HashiCorp Vault provider with token auth, and the OKMS endpoint as the backend.</p>



<p>Create a <strong>clustersecretstore.yaml</strong> file with the content below:</p>



<pre class="wp-block-code"><code class="">apiVersion: external-secrets.io/v1<br>kind: ClusterSecretStore<br>metadata:<br>  name: vault-secret-store<br>spec:<br>  provider:<br>      vault:<br>        server: "https://eu-west-par.okms.ovh.net/api/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" # OKMS endpoint, fill with the correct region and your okms_id<br>        path: "secret"<br>        version: "v2"<br>        auth:<br>            tokenSecretRef:<br>              name: ovhcloud-vault-token # The k8s secret that contains your PAT<br>              key: token</code></pre>



<p>Keep in mind, in our example, we’ve selected the “eu-west-par” region. You can enter a different server URL, depending on your desired region.</p>



<p>Apply it:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f clustersecretstore.yaml</code></pre>



<p>Check:</p>



<pre class="wp-block-code"><code class="">$ kubectl get clustersecretstore.external-secrets.io/vault-secret-store<br>NAME                 AGE   STATUS   CAPABILITIES   READY<br>vault-secret-store   2m   Valid    ReadWrite      True</code></pre>



<h3 class="wp-block-heading">Create an ExternalSecret</h3>



<p>Create an <strong>externalsecret.yaml</strong> file with the content below:</p>



<pre class="wp-block-code"><code class="">apiVersion: external-secrets.io/v1<br>kind: ExternalSecret<br>metadata:<br>  name: docker-config-secret<br>  namespace: external-secrets<br>spec:<br>  refreshInterval: 30m<br>  secretStoreRef:<br>    name: vault-secret-store<br>    kind: ClusterSecretStore<br>  target:<br>    template:<br>      type: kubernetes.io/dockerconfigjson<br>      data:<br>        .dockerconfigjson: "{{ .mysecret | toString }}"<br>    name: ovhregistrycred<br>    creationPolicy: Owner<br>  data:<br>  - secretKey: mysecret<br>    remoteRef:<br>      key: prod/va1/dockerconfigjson</code></pre>



<p>Apply it:</p>



<pre class="wp-block-code"><code class="">$ kubectl apply -f externalsecret.yaml<br>externalsecret.external-secrets.io/docker-config-secret created</code></pre>



<p>Check:</p>



<pre class="wp-block-code"><code class="">$ kubectl get externalsecret.external-secrets.io/docker-config-secret -n external-secrets<br>NAME                   STORETYPE            STORE                REFRESH INTERVAL   STATUS         READY<br>docker-config-secret   ClusterSecretStore   vault-secret-store   30m0s              SecretSynced   True</code></pre>



<p>Once the ExternalSecret has synced, ESO creates the target Kubernetes Secret object:</p>



<pre class="wp-block-code"><code class="">$ kubectl get secret -n external-secrets<br>NAME                                     TYPE                             DATA   AGE<br>...<br>ovhregistrycred                          kubernetes.io/dockerconfigjson   1      17d<br>...</code></pre>



<p>As you can see, the Secret is ready, and you can now use it as an imagePullSecret in your Pods!</p>
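<p>For instance, here is a minimal Pod sketch consuming the synced Secret (the image name is hypothetical):</p>

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  namespace: external-secrets
spec:
  imagePullSecrets:
    - name: ovhregistrycred  # the Secret created by ESO above
  containers:
    - name: app
      image: registry.example.com/app:latest  # hypothetical image hosted on your MPR
```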



<h3 class="wp-block-heading">Conclusion</h3>



<p>In this blog, we’ve explained how to create secrets in the new OVHcloud Secret Manager and integrate them directly in your Kubernetes clusters using the ESO Vault provider.</p>



<p>And here’s some great news: our teams are working on a dedicated OVHcloud provider for External Secrets Operator, set to go live in the coming months 🎉.</p>



<p>Stay tuned and share your thoughts!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmanage-your-secrets-through-ovhcloud-secret-manager-thanks-to-external-secrets-operator-eso-on-ovhcloud-managed-kubernetes-service-mks%2F&amp;action_name=Manage%20your%20secrets%20using%20OVHcloud%20Secret%20Manager%20with%20External%20Secrets%20Operator%20%28ESO%29%20on%20OVHcloud%20Managed%20Kubernetes%20Service%20%28MKS%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OVHcloud backbone network: Environmental impact assessment methodology</title>
		<link>https://blog.ovhcloud.com/ovhcloud-backbone-network-environmental-impact-assessment-methodology/</link>
		
		<dc:creator><![CDATA[Gregory Lebourg]]></dc:creator>
		<pubDate>Fri, 10 Oct 2025 08:07:42 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Datacenters & network]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Sustainability]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29671</guid>

					<description><![CDATA[Introduction The underlying infrastructure of OVHcloud’s Cloud services consists of datacentres connected by a global telecommunication network which carries data to and from end users. The core network (backbone) features nodes (also known as Points of Presence &#8211; PoPs) and long-distance/metropolitan spans (also known as links) which connect the nodes. The PoPs are located in [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fovhcloud-backbone-network-environmental-impact-assessment-methodology%2F&amp;action_name=OVHcloud%20backbone%20network%3A%20Environmental%20impact%20assessment%20methodology&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="684" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_10_our-backbone_usa-and-canada-1024x684.webp" alt="" class="wp-image-29677" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_10_our-backbone_usa-and-canada-1024x684.webp 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_10_our-backbone_usa-and-canada-300x200.webp 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_10_our-backbone_usa-and-canada-768x513.webp 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_10_our-backbone_usa-and-canada.webp 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="683" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Europe_OK-1024x683.webp" alt="" class="wp-image-29679" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Europe_OK-1024x683.webp 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Europe_OK-300x200.webp 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Europe_OK-768x512.webp 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Europe_OK.webp 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="684" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_11_our-backbone_apac-1024x684.webp" alt="" class="wp-image-29678" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_11_our-backbone_apac-1024x684.webp 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_11_our-backbone_apac-300x200.webp 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_11_our-backbone_apac-768x513.webp 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6_11_our-backbone_apac.webp 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>
</div>



<h2 class="wp-block-heading"><strong>Introduction</strong></h2>



<p>The underlying infrastructure of OVHcloud’s Cloud services consists of datacentres connected by a global telecommunication network which carries data to and from end users.</p>



<p>The core network (<strong>backbone</strong>) features nodes (also known as Points of Presence &#8211; <strong>PoPs</strong>) and long-distance/metropolitan spans (also known as <strong>links</strong>) which connect the nodes.</p>



<p>The PoPs are located in colocation facilities hosting optical transmission systems (Dense Wavelength Division Multiplexers &#8211; DWDM), IP routers, switches and servers.</p>



<p>The links are based on optical fibre routes interconnecting the PoPs and the datacentres, following a topology designed to maintain traffic in the event of multiple span cuts (for resiliency purposes). Two operating models are used:</p>



<ul class="wp-block-list">
<li><strong>Operating model 1:</strong> OVHcloud owns the <strong>dark fibre cable</strong> or rents <strong>a pair of dark fibre cables</strong> (through long-term Indefeasible Right of Use – IRU) and operates its own DWDM systems to activate <strong>wavelengths </strong>(10, 100, 400 Gbps point-to-point transmission signals) on top of it.</li>



<li><strong>Operating model 2:</strong> OVHcloud leases the wavelengths activated by long-distance telecommunication operators on their international network (carrier’s carrier market).</li>
</ul>



<p>In this study, <strong>terrestrial</strong> and <strong>submarine </strong>transmission infrastructures are differentiated as their physical realities are dissimilar.</p>



<h2 class="wp-block-heading"><strong>Scope of the study</strong></h2>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="750" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-11-1024x750.png" alt="" class="wp-image-29704" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-11-1024x750.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-11-300x220.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-11-768x563.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-11.png 1302w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Items covered by the study:</p>



<ul class="wp-block-list">
<li>POPs (colocation buildings and their technical environment)</li>



<li>Fibre cables and their underlying infrastructure</li>



<li>All active telecommunication equipment hosted in the PoPs as well as the line equipment hosted in amplification sites.</li>
</ul>



<p>Items excluded from the model:</p>



<ul class="wp-block-list">
<li>OVHcloud datacentres’ network equipment (accounted for in OVHcloud datacentres’ impact inventory)</li>



<li>ISP (Internet Service Provider) networks as well as customer premises equipment.</li>
</ul>






<h2 class="wp-block-heading"><strong>Environmental impact of the PoPs</strong></h2>



<ol class="wp-block-list">
<li><strong><em>Electricity consumption (use impact)</em></strong></li>
</ol>



<p>OVHcloud measures the electrical consumption of the PoP’s equipment, and of the technical environment of the colocation facility in which it is hosted. All the ancillary systems are taken into account; therefore, the PUE (Power Usage Effectiveness) of the sites is de facto included in the model.</p>



<p>The impact factors (per kWh of electricity) are retrieved from the <a href="https://ecoinvent.org/database/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Ecoinvent</strong> </a>database.</p>



<ul class="wp-block-list">
<li><strong><em>Network equipment (manufacturing / distribution / end of life impacts)</em></strong></li>
</ul>



<p>OVHcloud reviews all the equipment deployed inside each PoP, including checking their commissioning date (for amortisation purposes).</p>



<p>The equipment reference is used to retrieve the impact factors from the <strong>Negaoctet and Resilio v2024.6</strong> databases (amortised over six years). Should the exact model not be found, a similar generic reference is chosen.</p>



<ul class="wp-block-list">
<li><strong><em>Facilities technical environment (manufacturing / distribution / use/end of life impacts)</em></strong></li>
</ul>



<p>Based on the electricity consumption of the technical environment of the colocation facility, the impact factors of each PoP (per contracted kW per year per kWh of electricity) are retrieved from the <strong><a href="https://librairie.ademe.fr/industrie-et-production-durable/7111-evaluation-of-the-environmental-footprint-of-internet-service-provisioning-in-france.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Ademe PCR datacentre and cloud</a> </strong>database.</p>



<h2 class="wp-block-heading"><strong>Environmental impact of the terrestrial links</strong></h2>



<ol class="wp-block-list">
<li><strong><em>Optical fibre cable (manufacturing / distribution / end of life) impacts</em></strong></li>
</ol>



<p><strong>Operating model 1:</strong> OVHcloud manages the transmission layer.</p>



<p>The optical fibre cable related impacts are calculated by allocating two strands out of a 288-strand cable. The allocation is then corrected to reflect the 25-year ramp-up before reaching 100% usage of the cable over a lifespan of 60 years. This leads to an allocation of 0.88%.</p>
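<p>As a back-of-envelope check of that figure (our reading of the stated assumptions: a linear ramp-up to full use over the first 25 years, then 100% use for the remaining 35 years of the 60-year lifespan):</p>

```shell
awk 'BEGIN {
  raw     = 2 / 288;           # two strands allocated out of a 288-strand cable
  avg_use = (25/2 + 35) / 60;  # average cable utilisation over the 60-year lifespan
  printf "%.2f%%\n", 100 * raw / avg_use
}'
```

<p>This prints 0.88%, matching the allocation used in the model.</p>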



<p>The impact factors (per km) are retrieved from the <strong>Ecoinvent </strong>database.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="614" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-12-1024x614.png" alt="" class="wp-image-29705" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-12-1024x614.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-12-300x180.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-12-768x461.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-12.png 1280w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>Operating model 2:</strong> OVHcloud leases a wavelength to a carrier’s carrier.</p>



<p>The assumption is that the carrier follows the same optical fibre cable allocation rules (0.88%). In addition, the impact is pro-rated to reflect the ramp-up of the carrier’s optical systems (leading to an allocation of 4.32% of the 0.88%):</p>



<ul class="wp-block-list">
<li>Maximum number of channels per DWDM systems: 48</li>



<li>Maximum load rate of DWDM systems: 85%</li>



<li>Six-year ramp up to reach maximum load rate of the DWDM system</li>



<li>DWDM system lifespan: 8 years</li>
</ul>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="562" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-13-1024x562.png" alt="" class="wp-image-29706" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-13-1024x562.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-13-300x165.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-13-768x421.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-13.png 1425w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In both operating models, 10% of the impacts of the civil works (trenches, ducts) necessary to lay the optical fibre cables is allocated.</p>



<ul class="wp-block-list">
<li><strong><em>Line optical systems (manufacturing / distribution / use/end of life impacts)</em></strong></li>
</ul>



<p><strong>Operating model 1: </strong>OVHcloud manages the transmission layer, therefore amplifier types and locations are known. (De)Multiplexers are excluded as they are already accounted for in the environmental impact of the PoPs (see previous section).</p>



<p><strong>Operating model 2:</strong> OVHcloud leases a wavelength to a carrier’s carrier. The assumptions are as follows:</p>



<ul class="wp-block-list">
<li>(De)Multiplexers and repeaters are chosen using standard equipment</li>



<li>Typical distance between two regenerator sites: 500 km</li>



<li>Typical distance between two repeater sites: 90 km</li>



<li>Maximum number of channels per DWDM systems: 48</li>



<li>Maximum load rate of DWDM systems: 85%</li>



<li>Six-year ramp up to reach maximum load rate of the DWDM system</li>



<li>DWDM system lifespan: 8 years</li>
</ul>



<p>Once the optical equipment mapping has been done, the emissions factors for each link are then retrieved from the <strong><a href="https://base-empreinte.ademe.fr" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Ademe ISP &#8211; Negaoctet</a> </strong>database.</p>



<h2 class="wp-block-heading"><strong>Environmental impact of the submarine links</strong></h2>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="570" height="337" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/1.png" alt="" class="wp-image-29684" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/1.png 570w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/1-300x177.png 300w" sizes="auto, (max-width: 570px) 100vw, 570px" /><figcaption class="wp-element-caption">Source: <a href="https://www.congress.gov/crs-product/R47237" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.congress.gov/crs-product/R47237</a></figcaption></figure>



<p><strong><em>Submarine cable system (manufacturing / distribution / maintenance / end of life impacts) and electricity (use impact)</em></strong></p>



<p><strong>Operating model 2:</strong> OVHcloud leases a wavelength to a carrier’s carrier from PoP to PoP.</p>



<p>The climate change impact is retrieved from the <strong>2025 Lisbon Suboptic</strong> study for both transatlantic and transpacific cable systems (per km and per year).</p>



<p>For the other impact factors (abiotic and water resources use), the submarine cable systems are modelled based on the following assumptions:</p>



<ul class="wp-block-list">
<li>(De)Multiplexers and repeaters are chosen using standard terrestrial equipment</li>



<li>Cable is considered as a standard MV electrical cable</li>



<li>Typical distance between two landing stations: 7000 km</li>



<li>Typical distance between two repeaters: 80 km</li>



<li>Maximum capacity: 200 Tbps for Transpacific / 350 Tbps for Transatlantic</li>



<li>Maximum load rate of the DWDM systems: 100%</li>



<li>Ten-year ramp up to reach maximum load rate of the DWDM system</li>



<li>Submarine cable system life span: 25 years</li>
</ul>



<p><strong>Results (per year)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td class="has-text-align-left" data-align="left"><strong>Impact ADPe</strong></td><td>58 kg Sb eq.</td></tr><tr><td class="has-text-align-left" data-align="left"><strong>Impact GWP</strong></td><td>3700 tons CO2 eq.</td></tr><tr><td class="has-text-align-left" data-align="left"><strong>Impact WU</strong></td><td>2.0E+06 m3 eq.</td></tr></tbody></table></figure>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td></td><td class="has-text-align-center" data-align="center"><strong>Resource use, minerals and metals</strong></td><td class="has-text-align-center" data-align="center"><strong>Climate change</strong></td><td class="has-text-align-center" data-align="center"><strong>Water use</strong></td></tr><tr><td></td><td class="has-text-align-center" data-align="center"><strong>ADPe (kg Sb eq.)</strong></td><td class="has-text-align-center" data-align="center"><strong>GWP (kg CO2 eq.)</strong></td><td class="has-text-align-center" data-align="center"><strong>WU (m3 eq.)</strong></td></tr><tr><td>POPs</td><td class="has-text-align-center" data-align="center">3.75E+01</td><td class="has-text-align-center" data-align="center">2.42E+06</td><td class="has-text-align-center" data-align="center">1.25E+06</td></tr><tr><td>Owned Fibers</td><td class="has-text-align-center" data-align="center">1.02E+01</td><td class="has-text-align-center" data-align="center">8.10E+05</td><td class="has-text-align-center" data-align="center">5.87E+05</td></tr><tr><td>Leased Capacity</td><td class="has-text-align-center" data-align="center">1.05E+01</td><td class="has-text-align-center" data-align="center">4.73E+05</td><td class="has-text-align-center" data-align="center">1.67E+05</td></tr><tr><td></td><td class="has-text-align-center" data-align="center"><strong>5.81E+01</strong></td><td class="has-text-align-center" data-align="center"><strong>3.70E+06</strong></td><td class="has-text-align-center" data-align="center"><strong>2.00E+06</strong></td></tr></tbody></table></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="581" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-8-1024x581.png" alt="" class="wp-image-29689" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-8-1024x581.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-8-768x436.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-8.png 1114w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td></td><td class="has-text-align-center" data-align="center"><strong>Resource use, minerals and metals</strong></td><td class="has-text-align-center" data-align="center"><strong>Climate change</strong></td><td class="has-text-align-center" data-align="center"><strong>Water use</strong></td></tr><tr><td></td><td class="has-text-align-center" data-align="center"><strong>ADPe (kg Sb eq.)</strong></td><td class="has-text-align-center" data-align="center"><strong>GWP (kg CO2 eq.)</strong></td><td class="has-text-align-center" data-align="center"><strong>WU (m3 eq.)</strong></td></tr><tr><td>Manufacturing / Distribution</td><td class="has-text-align-center" data-align="center">2.35E+01</td><td class="has-text-align-center" data-align="center">1.39E+06</td><td class="has-text-align-center" data-align="center">2.51E+05</td></tr><tr><td>Use</td><td class="has-text-align-center" data-align="center">3.46E+01</td><td class="has-text-align-center" data-align="center">2.29E+06</td><td class="has-text-align-center" data-align="center">1.74E+06</td></tr><tr><td>End of Life</td><td class="has-text-align-center" data-align="center">2.77E-03</td><td class="has-text-align-center" data-align="center">2.31E+04</td><td class="has-text-align-center" data-align="center">1.07E+04</td></tr><tr><td></td><td class="has-text-align-center" data-align="center"><strong>5.81E+01</strong></td><td class="has-text-align-center" data-align="center"><strong>3.70E+06</strong></td><td class="has-text-align-center" data-align="center"><strong>2.00E+06</strong></td></tr></tbody></table></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="634" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-9-1024x634.png" alt="" class="wp-image-29690" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-9-1024x634.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-9-300x186.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-9-768x476.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/image-9.png 1114w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Conclusion</h2>



<p>The methodology presented above allows us to accurately assess the environmental impact of our backbone with a multi-factorial approach covering GHG emissions, water consumption and abiotic resource use. On the carbon emissions side, the findings show that the previous methodology used in our GHG emissions reporting (based on a <a href="https://hal.science/hal-04807445v1/file/presentation_paper161_slides_rev2389_20220505_125946.pdf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Renater research paper</a>) overestimated the OVHcloud backbone&#8217;s impact by a factor of 2.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fovhcloud-backbone-network-environmental-impact-assessment-methodology%2F&amp;action_name=OVHcloud%20backbone%20network%3A%20Environmental%20impact%20assessment%20methodology&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Create a podcast transcript with Whisper by AI Endpoints</title>
		<link>https://blog.ovhcloud.com/create-a-podcast-transcript-with-whisper-by-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Thu, 28 Aug 2025 07:03:04 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Audio]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29389</guid>

					<description><![CDATA[Check out this blog post if you want to know more about AI Endpoints.You can also find more info on AI Endpoints in our previous blog posts. This blog post explains how to create a podcast transcript using Whisper, a powerful automatic speech recognition (ASR) system developed by OpenAI. Whisper integrates with AI Endpoints and [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-a-podcast-transcript-with-whisper-by-ai-endpoints%2F&amp;action_name=Create%20a%20podcast%20transcript%20with%20Whisper%20by%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02.png" alt="A robot listening to a podcast" class="wp-image-29401" style="width:640px" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-70x70.png 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Check out this <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">blog post</a> if you want to know more about AI Endpoints.<br>You can also find more info on <a href="https://endpoints.ai.cloud.ovh.net" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> in our <a href="https://blog.ovhcloud.com/tag/ai-endpoints/" data-wpel-link="internal">previous blog posts</a>.</p>



<p>This blog post explains how to create a podcast transcript using Whisper, a powerful automatic speech recognition (ASR) system developed by OpenAI. Whisper integrates with <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> and makes it easy to transcribe audio files and add features, like speaker diarization.</p>



<p><em>ℹ️ You can find the full code on <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/podcast-transcript-whisper/python" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub</a> ℹ️</em></p>



<h3 class="wp-block-heading">Environment Setup</h3>



<p>Define your environment variables for accessing <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>:</p>



<pre title="AI Endpoints environment variables" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">$ export OVH_AI_ENDPOINTS_WHISPER_URL=&lt;whisper model URL&gt;
$ export OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;your_access_token&gt;
$ export OVH_AI_ENDPOINTS_WHISPER_MODEL=whisper-large-v3</code></pre>



<p>Install dependencies:</p>



<pre title="Dependencies installation" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">$ pip install -r requirements.txt</code></pre>
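<p>Before running the script, you can optionally fail fast if any of the three variables defined above is missing. This small helper is our own addition, not part of the original example:</p>

<pre title="Environment variables sanity check" class="wp-block-code"><code lang="python" class="language-python line-numbers">import os

# 🔎 Variables the transcription script expects
REQUIRED_VARS = [
    "OVH_AI_ENDPOINTS_WHISPER_URL",
    "OVH_AI_ENDPOINTS_ACCESS_TOKEN",
    "OVH_AI_ENDPOINTS_WHISPER_MODEL",
]

def missing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]

# ⚠️ Warn early with an explicit message instead of failing mid-transcription
missing = missing_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")</code></pre>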



<h3 class="wp-block-heading">Audio transcription</h3>



<p>With Whisper and the OpenAI client, transcribing audio is as simple as writing a few lines of code:</p>



<pre title="Audio transcription" class="wp-block-code"><code lang="python" class="language-python line-numbers">import os
import json
from openai import OpenAI

# 🛠️ OpenAI client initialisation
client = OpenAI(base_url=os.environ.get('OVH_AI_ENDPOINTS_WHISPER_URL'), 
                api_key=os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN'))

# 🎼 Audio file loading
with open("../resources/TdT20-trimed-2.mp3", "rb") as audio_file:
    # 📝 Call Whisper transcription API
    transcript = client.audio.transcriptions.create(
        model=os.environ.get('OVH_AI_ENDPOINTS_WHISPER_MODEL'),
        file=audio_file,
        temperature=0.0,
        response_format="verbose_json",
        extra_body={"diarize": True},
    )</code></pre>



<p>A couple of notes:<br>&#8211; <em>diarize</em> is not a standard Whisper parameter; we can pass it because the OpenAI client lets us add extra body parameters via <em>extra_body</em>.<br>&#8211; diarization requires <em>response_format="verbose_json"</em>, which also implies <em>segmentation</em> mode.</p>



<p>Once you have the full transcript, format it in a way that’s easy for humans to read.</p>



<h3 class="wp-block-heading">Create the script</h3>



<p>The JSON field ‘<em>diarization</em>’ contains all of the transcribed, diarized content.</p>



<pre title="JSON response for diarization" class="wp-block-code"><code lang="json" class="language-json line-numbers">"diarization": [
    {
      "speaker": 0,
      "text": "bla bla bla",
      "start": 16.5,
      "end": 26.38
    },
    {
      "speaker": 1,
      "text": "bla bla",
      "start": 26.38,
      "end": 32.6
    },
    {
      "speaker": 1,
      "text": "bla bla",
      "start": 32.6,
      "end": 40.6
    },
    {
      "speaker": 2,
      "text": "bla bla",
      "start": 40.6,
      "end": 42
    }
]</code></pre>



<p>Because the output is segmented, consecutive entries from the same speaker (like speaker 1 above) can be merged, as the code below shows.</p>



<p>Here’s some sample code for creating the script of a <a href="https://smartlink.ausha.co/tranches-de-tech" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">French podcast</a> featuring three speakers:</p>



<pre title="Merge sentences for same speaker" class="wp-block-code"><code lang="python" class="language-python line-numbers"># 🔀 Merge the dialog said by the same speaker     
diarizedTranscript = ''
speakers = ["Aurélie", "Guillaume", "Stéphane"]
previousSpeaker = -1
jsonTranscript = json.loads(transcript.model_dump_json())

# 💬 Only the diarization field is useful
for dialog in jsonTranscript["diarization"]:
    speaker = dialog.get("speaker")
    text = dialog.get("text")
    if previousSpeaker == speaker:
        diarizedTranscript += f" {text}"
    else:
        diarizedTranscript += f"\n\n{speakers[speaker]}: {text}"
    previousSpeaker = speaker

print(f"\n📝 Diarized Transcript 📝:\n{diarizedTranscript}")
</code></pre>
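<p>As a variation, the merge loop above can be wrapped in a reusable function that falls back to a generic label when a speaker index has no matching name (the function name and fallback label are our own, not part of the original code):</p>

<pre title="Merge helper with unknown-speaker fallback" class="wp-block-code"><code lang="python" class="language-python line-numbers">def merge_diarization(segments, speakers):
    """Merge consecutive segments from the same speaker into one block per speaker turn."""
    blocks = []
    previous = None
    for segment in segments:
        idx = segment["speaker"]
        # 🏷️ Fall back to a generic label for speaker indexes outside the known names
        name = speakers[idx] if idx in range(len(speakers)) else f"Speaker {idx}"
        text = segment["text"]
        if idx == previous:
            # Same speaker keeps talking: append to the current block
            blocks[-1] += f" {text}"
        else:
            blocks.append(f"{name}: {text}")
        previous = idx
    return "\n\n".join(blocks)</code></pre>

<p>With the sample JSON above, this produces three blocks, merging the two speaker&nbsp;1 entries into a single one.</p>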



<p>Lastly, run the Python script:</p>



<pre class="wp-block-code"><code lang="" class=" line-numbers">$ python PodcastTranscriptWithWhisper.py

📝 Diarized Transcript 📝:

Stéphane: Bonjour tout le monde, ravi de vous retrouver pour l'enregistrement de ce dernier épisode de la saison avant de prendre des vacances bien méritées et de vous retrouver à la rentrée pour la troisième saison. Nous enregistrons cet épisode le 30 juin à la fraîche, enfin si on peut dire au vu des températures déjà présentes en cette matinée. Justement, elle revient chaudement de Sunnytech et c'est avec plaisir que je la retrouve pour l'enregistrement de cet épisode. Bonjour Aurélie, comment vas-tu ?

Aurélie: Salut, alors ça va très bien. Alors j'avoue, j'ai également très chaud. J'ai le ventilateur qui est juste à côté de moi donc ça va aller pour l'enregistrement du podcast.

Stéphane: Oui, c'est vrai qu'il fait un peu chaud. Et pour ce dernier épisode de la saison, c'est avec un mélange de joie mais aussi d'intimidation que je reçois notre invité. Si je fais ce métier de la façon dont je le fais, c'est grandement grâce à lui. Ce podcast, quelque part, a bien entendu des inspirations de ce que fait notre invité. Je suis donc très content de te recevoir Guillaume. Bonjour Guillaume, comment vas-tu et souhaites-tu te présenter à nos auditrices et auditeurs ? Bonjour à

Guillaume: tous et bien merci déjà de m'avoir invité. Je suis très content de rejoindre votre podcast pour cet épisode. Je m'appelle Guillaume Laforge, je suis un développeur Java depuis la première heure depuis très très longtemps. Je travaille chez Google, en particulier dans la partie Google Cloud. Je me focalise beaucoup sur tout ce qui est Generative AI vu que c'est à la mode évidemment. Les gens me connaissent peut-être ou peut-être ma voix d'ailleurs parce que je fais partie du podcast Les Cascodeurs qu'on a commencé il y a 15 ans ou quelque chose comme ça. Il y a trop longtemps. Ou alors ils me connaissent parce que je suis un des co-fondateurs du langage Groovy, Apache Groovy.</code></pre>



<p>Feel free to try out our new product, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, and share your thoughts.</p>



<p>Hang out with us on Discord in the <em>#ai-endpoints</em> channel: <em><a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://discord.gg/ovhcloud</a></em>. See you soon!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-a-podcast-transcript-with-whisper-by-ai-endpoints%2F&amp;action_name=Create%20a%20podcast%20transcript%20with%20Whisper%20by%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
