<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Eléa Petton, Author at OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/author/elea-petton/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/author/elea-petton/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 10 Apr 2026 09:23:41 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Eléa Petton, Author at OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/author/elea-petton/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Reference Architecture: Deploying a vision-language model with vLLM on OVHcloud MKS for high performance inference and full observability</title>
		<link>https://blog.ovhcloud.com/reference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 10 Apr 2026 07:48:53 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30455</guid>

					<description><![CDATA[Ensure complete&#160;digital sovereignty&#160;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&#160;Managed Kubernetes Service. This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&#160;OVHcloud Managed Kubernetes Service&#160;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&#160;Qwen3-VL-8B-Instruct&#160;multimodal model (vision + text) with OpenAI-compatible API endpoints. This comprehensive [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em><em>Ensure complete&nbsp;<strong>digital sovereignty</strong>&nbsp;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&nbsp;<strong>Managed Kubernetes Service</strong>.</em></em></p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" width="703" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg" alt="vLLM on OVHcloud MKS for high availability and full observability" class="wp-image-31153" style="width:710px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg 703w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-206x300.jpg 206w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-768x1118.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-1055x1536.jpg 1055w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm.jpg 1260w" sizes="(max-width: 703px) 100vw, 703px" /><figcaption class="wp-element-caption"><em><em>vLLM on OVHcloud MKS for high availability and full observability</em></em></figcaption></figure>



<p>This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&nbsp;<a href="https://www.ovhcloud.com/fr/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Kubernetes Service</a>&nbsp;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen3-VL-8B-Instruct</a>&nbsp;multimodal model (vision + text) with OpenAI-compatible API endpoints.</p>



<p>This comprehensive guide shows you how to deploy, automatically scale, and monitor vLLM-based LLM workloads on the OVHcloud infrastructure.</p>



<p><strong>What are the key benefits?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-effectiveness:</strong>&nbsp;Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability:</strong>&nbsp;Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure:</strong>&nbsp;Keep all metrics and data within European datacentres</li>



<li><strong>Scalable by design:</strong>&nbsp;Automatically scale GPU inference replicas based on real workload demand</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p><strong>OVHcloud MKS</strong>&nbsp;is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p><strong>How does this benefit you?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Pay only for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, provided upgrades and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking, and persistent storage</li>



<li><strong>Scalable and flexible</strong>: Scale workloads and node pools easily to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p>In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Architecture overview</h2>



<p>This reference architecture demonstrates a basic deployment of vLLM for vision-language model inference on OVHcloud Managed Kubernetes Service, featuring:</p>



<ul class="wp-block-list">
<li><strong>High-availability deployment</strong>&nbsp;with 2 GPU nodes (NVIDIA L40S)</li>



<li><strong>Optimised GPU utilisation</strong>&nbsp;with proper driver configuration</li>



<li><strong>Scalable infrastructure</strong>&nbsp;supporting vision-language models</li>



<li><strong>Comprehensive monitoring</strong>&nbsp;using Prometheus, Grafana, and DCGM</li>



<li><strong>Full observability</strong>&nbsp;for both application and hardware metrics</li>
</ul>



<p><strong>Data flow</strong>:</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg" alt="" class="wp-image-30985" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1536x806.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-2048x1075.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Data Flow</em></figcaption></figure>



<ol class="wp-block-list">
<li><strong>Inference request:</strong>
<ul class="wp-block-list">
<li>User → LoadBalancer → Gateway → NGINX Ingress → &#8220;Qwen3 VL&#8221; Service → vLLM Pod → GPU</li>



<li>Response follows reverse path with streaming support</li>
</ul>
</li>



<li><strong>Metrics collection:</strong>
<ul class="wp-block-list">
<li>vLLM Pods expose <code>/metrics</code> endpoint (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">8000</mark></strong></code>)</li>



<li>DCGM Exporters expose GPU metrics (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">9400</mark></strong></code>)</li>



<li>Prometheus scrapes both endpoints every 30 seconds</li>



<li>Grafana queries Prometheus for visualization</li>
</ul>
</li>



<li><strong>Load distribution</strong>
<ul class="wp-block-list">
<li>NGINX Ingress uses cookie-based session affinity</li>



<li>vLLM Service uses ClientIP session affinity</li>



<li>Anti-affinity ensures 1 pod per GPU node</li>
</ul>
</li>
</ol>
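
<p>For example, once the stack from the steps below is up, a single streaming request exercises this entire path end to end. This is a sketch: it assumes the <code>$EXTERNAL_IP</code> variable and the <code>qwen3-vl-8b</code> model name that are set up later in this guide.</p>



<pre class="wp-block-code"><code class=""># -N disables curl buffering so streamed tokens appear as they arrive<br>curl -N http://$EXTERNAL_IP/v1/chat/completions \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "qwen3-vl-8b",<br>    "stream": true,<br>    "messages": [{"role": "user", "content": "Describe this architecture in one sentence."}]<br>  }'</code></pre>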



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">&nbsp;</a><strong><code>Administrator</code></strong>&nbsp;role</li>



<li><strong>Hugging Face access</strong>&nbsp;–&nbsp;<em>create a&nbsp;<a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></em></li>



<li><code><strong>kubectl</strong></code>&nbsp;and&nbsp;<code><strong>helm</strong></code>&nbsp;(version 3.x or later) already installed</li>
</ul>



<p><strong>🚀 Now you have all the ingredients, it’s time to deploy the recipe for&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen/Qwen3-VL-8B-Instruct</a>&nbsp;using vLLM and MKS!</strong></p>



<h2 class="wp-block-heading">Architecture guide: Native GPU deployment of vLLM on MKS with full stack observability</h2>



<p>This reference architecture describes a<strong>&nbsp;Large Language Model</strong>&nbsp;deployment using&nbsp;<strong>the vLLM inference server&nbsp;</strong>and&nbsp;<strong>Kubernetes</strong>, giving you a service that is both highly available and observable in real time.</p>



<h3 class="wp-block-heading">Step 1 &#8211; Create MKS cluster and Node pools</h3>



<p>From the&nbsp;<a href="https://www.ovh.com/manager/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">OVHcloud Control Panel</a>, create a Kubernetes cluster using <strong>MKS</strong>.</p>



<p>Navigate to: <code>Public Cloud</code> → <code>Managed Kubernetes Service</code> → <code>Create a cluster</code></p>



<h4 class="wp-block-heading">1. Configure cluster</h4>



<p>Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Name:</strong> <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code></li>



<li><strong>Location</strong>: 1-AZ Region &#8211; Gravelines (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">GRA11</mark></strong></code>)</li>



<li><strong>Plan:</strong> Free (or Standard)</li>



<li><strong>Network</strong>: attach a <strong>Private network </strong>(e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">0000 - AI Private Network</mark></strong></code>)</li>



<li><strong>Version:</strong> Latest stable (e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">1.34</mark></strong></code>)</li>
</ul>



<h4 class="wp-block-heading">2. Create GPU Node pool</h4>



<p>During the cluster creation, configure the vLLM Node pool for GPUs:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></code></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">L40S-90</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">2</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<p><strong>Why L40S-90?</strong></p>



<ul class="wp-block-list">
<li>Cost-effective for single-model deployment (1 GPU per node)</li>



<li>Sufficient RAM (90GB) for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Qwen3-VL-8B</mark></code></strong> model</li>
</ul>



<p>You should see your cluster (e.g.&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code>) in the list, along with the following information:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="930" height="588" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png" alt="" class="wp-image-30745" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png 930w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-300x190.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-768x486.png 768w" sizes="(max-width: 930px) 100vw, 930px" /></figure>



<p>You can now set up the node pool dedicated to monitoring.</p>



<h4 class="wp-block-heading">3. Create CPU Node pool</h4>



<p>From your cluster, click on <code><strong><mark class="has-inline-color has-ast-global-color-0-color">Add a node pool</mark></strong></code> and configure it as follows:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">B2-15</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">1</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>✅ <strong>Note</strong></p>



<p><strong><em>The monitoring stack can run on GPU nodes if cost is a concern. A dedicated CPU node provides better isolation and resource management.</em></strong></p>
</blockquote>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="365" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png" alt="" class="wp-image-30743" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-300x107.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-768x274.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation.png 1283w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the status is green with the&nbsp;<strong><code><mark class="has-inline-color has-ast-global-color-0-color">OK</mark></code></strong>&nbsp;label, you can proceed to the next step.</p>



<h4 class="wp-block-heading">4. Configure Kubernetes access</h4>



<p>Once your nodes have been provisioned, you can download the <strong>Kubeconfig file</strong> and configure kubectl with your MKS cluster.</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p>Returning:</p>



<pre class="wp-block-preformatted"><code>NAME                     STATUS   ROLES    AGE   VERSION<br>monitoring-node-xxxxxx   Ready    &lt;none&gt;   1d    v1.34.2<br>vllm-node-yyyyyy         Ready    &lt;none&gt;   1d    v1.34.2<br>vllm-node-zzzzzz         Ready    &lt;none&gt;   1d    v1.34.2</code></pre>



<p>Before going further, add a label to the CPU node for monitoring workloads.</p>



<pre class="wp-block-code"><code class="">CPU_NODE=$(kubectl get nodes -o json | \<br>  jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" == null) | .metadata.name')<br>kubectl label node $CPU_NODE node-role=monitoring</code></pre>



<p>Finally, check with the following command:</p>
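
<p>One way to produce the view below is with kubectl custom columns (a sketch; the <code>node-role</code> label is the one set in the previous command):</p>



<pre class="wp-block-code"><code class=""># list each node with its allocatable GPU count and its node-role label<br>kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,ROLE:.metadata.labels.node-role'</code></pre>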



<pre class="wp-block-code"><code class="">NAME                     GPU      ROLE<br>monitoring-node-xxxxxx   &lt;none&gt;   monitoring<br>vllm-node-yyyyyy         1        &lt;none&gt;<br>vllm-node-zzzzzz         1        &lt;none&gt;</code></pre>



<p>Once both nodes are in <strong>Ready</strong> status, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 2 &#8211; Install GPU operator</h3>



<p>To start, consider setting up the GPU operator.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ Note</strong></p>



<p><em><strong>This step is based on this OVHcloud documentation: <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-kubernetes-deploy-gpu-application?id=kb_article_view&amp;sysparm_article=KB0049707" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deploying a GPU application on OVHcloud Managed Kubernetes Service</a></strong></em></p>
</blockquote>



<h4 class="wp-block-heading">1. Add NVIDIA helm repository and create namespace</h4>



<p>Add NVIDIA helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add nvidia https://helm.ngc.nvidia.com/nvidia<br>helm repo update</code></pre>



<p>And create the namespace as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace gpu-operator</code></pre>



<h4 class="wp-block-heading">2. Install GPU operator with correct configuration</h4>



<p>The GPU Operator must be configured with specific driver versions to ensure compatibility with vLLM containers.</p>



<p>However, the default installation uses recent drivers (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">580.x</mark></strong></code> with <strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 13.x</mark></code></strong>) which are incompatible with vLLM containers (<strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.x</mark></code></strong>).</p>



<p><strong>Solution:</strong> Force driver version <strong><code><mark class="has-inline-color has-ast-global-color-0-color">535.183.01</mark></code></strong> (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.2</mark></strong></code>).</p>



<pre class="wp-block-code"><code class="">helm install gpu-operator nvidia/gpu-operator \<br>  -n gpu-operator \<br>  --set driver.enabled=true \<br>  --set driver.version="535.183.01" \<br>  --set toolkit.enabled=true \<br>  --set operator.defaultRuntime=containerd \<br>  --set devicePlugin.enabled=true \<br>  --set dcgmExporter.enabled=true \<br>  --set dcgmExporter.image="dcgm-exporter" \<br>  --set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04" \<br>  --set gfd.enabled=true \<br>  --set migManager.enabled=false \<br>  --set nodeStatusExporter.enabled=true \<br>  --set validator.driver.enable=false \<br>  --set validator.toolkit.enable=false \<br>  --set validator.plugin.enable=false \<br>  --timeout 20m</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>✅ <strong>Note </strong></p>



<p><em><strong>Specifying the DCGM version may only be necessary if you encounter problems with the default image (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color">‘ImagePullBackOff’</mark></code>). If this is the case, add the following parameters:<br><code><mark class="has-inline-color has-ast-global-color-0-color">--set dcgmExporter.repository="nvcr.io/nvidia/k8s"<br>--set dcgmExporter.image="dcgm-exporter"<br>--set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04"</mark></code></strong></em></p>
</blockquote>



<pre class="wp-block-code"><code class="">kubectl get pods -n gpu-operator</code></pre>



<p>Note that all pods should reach <strong>Running</strong> state in 5-10 minutes.</p>
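
<p>If you prefer to block until the driver pods are up rather than polling, here is a convenience sketch (it targets the driver DaemonSet label used later in this guide; other operator pods may still be initialising):</p>



<pre class="wp-block-code"><code class=""># wait until every NVIDIA driver pod reports Ready<br>kubectl wait --for=condition=Ready pod \<br>  -l app=nvidia-driver-daemonset \<br>  -n gpu-operator \<br>  --timeout=15m</code></pre>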



<p>You can also check the GPU availability:</p>



<pre class="wp-block-code"><code class="">kubectl get nodes -o json | jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | "\(.metadata.name): \(.status.allocatable."nvidia.com/gpu") GPU(s)"'</code></pre>



<p>Returning:</p>



<pre class="wp-block-preformatted"><code>vllm-node-yyyyyy: 1 GPU(s)<br>vllm-node-zzzzzz: 1 GPU(s)</code></pre>



<p>And you can test to run <code><strong><mark class="has-inline-color has-ast-global-color-0-color">nvidia-smi</mark></strong></code>:</p>



<pre class="wp-block-code"><code class="">DRIVER_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -1)<br>kubectl exec -n gpu-operator $DRIVER_POD -- nvidia-smi</code></pre>



<p>If the GPU tests are working properly, you can move on to the DCGM service configuration.</p>



<h4 class="wp-block-heading">3. Configure DCGM service</h4>



<p><strong>Why is DCGM Exporter required?</strong></p>



<p>DCGM (Data Center GPU Manager) is NVIDIA&#8217;s official tool for monitoring GPUs in production. The goal is to collect and display metrics from both GPU nodes.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="571" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg" alt="" class="wp-image-30746" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-300x167.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-768x428.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1536x856.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1.jpg 1733w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>GPU monitoring with DCGM</em></figcaption></figure>



<p>The metrics provided are:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_UTIL</mark></strong></code> &#8211; GPU utilisation (%)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_TEMP</mark></code></strong> &#8211; GPU temperature (°C)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_USED</mark></code></strong> &#8211; VRAM used (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_FREE</mark></code></strong> &#8211; Free VRAM (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_POWER_USAGE</mark></code></strong> &#8211; Power consumption (W)</li>



<li>And 50+ other GPU metrics</li>
</ul>
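
<p>You can eyeball these metrics straight from the exporter before wiring up Prometheus. Here is a sketch using a temporary port-forward (assuming the default <code>nvidia-dcgm-exporter</code> service created by the GPU Operator):</p>



<pre class="wp-block-code"><code class=""># forward the exporter port locally, query a few key metrics, then stop the forward<br>kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &amp;<br>sleep 2<br>curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|GPU_TEMP|FB_USED)'<br>kill %1</code></pre>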



<p>Next, ensure the DCGM service has the correct labels and port configuration:</p>



<pre class="wp-block-code"><code class="">kubectl patch svc nvidia-dcgm-exporter -n gpu-operator --type merge -p '{<br>  "metadata": {<br>    "labels": {<br>      "app": "nvidia-dcgm-exporter"<br>    }<br>  },<br>  "spec": {<br>    "ports": [<br>      {<br>        "name": "metrics",<br>        "port": 9400,<br>        "targetPort": 9400,<br>        "protocol": "TCP"<br>      }<br>    ]<br>  }<br>}'</code></pre>



<p>Verify the endpoints (should show 2 IPs, one per GPU node).</p>



<pre class="wp-block-code"><code class="">kubectl get endpoints nvidia-dcgm-exporter -n gpu-operator</code></pre>



<pre class="wp-block-preformatted"><code>NAME                   ENDPOINTS                   AGE<br>nvidia-dcgm-exporter   x.x.x.x:9400,x.x.x.x:9400   17d</code></pre>



<h3 class="wp-block-heading">Step 3 &#8211; Deploy Qwen3 VL 8B with vLLM inference server</h3>



<p>The deployment of the <strong>Qwen 3 VL 8B</strong> model on two L40S GPU nodes is carried out in several stages.</p>



<h4 class="wp-block-heading">1. Create namespace and Hugging Face secret</h4>



<p>Start by creating the namespace:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace vllm</code></pre>



<p>Next, you must retrieve your Hugging Face token and replace the&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">HF_TOKEN</mark></strong></code>&nbsp;value with your own:</p>



<pre class="wp-block-code"><code class="">export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"</code></pre>



<p>Create your secret as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create secret generic huggingface-secret \<br>  --from-literal=token=$HF_TOKEN \<br>  --namespace=vllm</code></pre>



<p>Verify that you obtain the following output by running:</p>



<pre class="wp-block-code"><code class="">kubectl get secret huggingface-secret -n vllm</code></pre>



<pre class="wp-block-preformatted"><code>NAME                 TYPE     DATA   AGE<br>huggingface-secret   Opaque   1      14d</code></pre>
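
<p>If you ever need to double-check the stored token, you can decode it back out of the secret (a sketch):</p>



<pre class="wp-block-code"><code class=""># the secret stores the token base64-encoded under the "token" key<br>kubectl get secret huggingface-secret -n vllm -o jsonpath='{.data.token}' | base64 -d</code></pre>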



<h4 class="wp-block-heading">2. Create vLLM deployment configuration</h4>



<p>First, you can create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-deployment-2nodes.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-deployment-2nodes.yaml</a></strong></code> file.</p>



<p>Deploy vLLM:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-deployment-2nodes.yaml</code></pre>



<p>You can monitor the deployment (it should take 8-10 minutes for model download and loading).</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n vllm -o wide -w</code></pre>



<p>Expected output after 10 minutes:</p>



<pre class="wp-block-code"><code class="">NAME               READY  STATUS   RESTARTS  AGE  IP       NODE  <br>qwen3-vl-xxxx-yyy  1/1    Running  0         1d   X.X.X.X  vllm-node-yyyyyy<br>qwen3-vl-xxxx-zzz  1/1    Running  0         1d   X.X.X.X  vllm-node-zzzzzz</code></pre>



<p>You can also check the container logs:</p>



<pre class="wp-block-code"><code class="">kubectl logs -f -n vllm &lt;pod-name&gt;</code></pre>



<p>You should find in the logs: &#8220;<code>Uvicorn running on http://0.0.0.0:8000</code>&#8221;</p>
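
<p>Before exposing anything publicly, you can also probe the server from inside the cluster. Here is a sketch using a temporary port-forward (vLLM&#8217;s OpenAI-compatible server exposes a <code>/health</code> endpoint that returns HTTP 200 once the engine is ready):</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n vllm svc/qwen3-vl-service 8000:8000 &amp;<br>sleep 2<br># expect "200"<br>curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health<br>kill %1</code></pre>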



<p>Is everything installed correctly? Then let&#8217;s move on to the next step.</p>



<h4 class="wp-block-heading">3. Add service label</h4>



<p>Ensure the service has the correct label for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">ServiceMonitor</mark></code></strong> discovery.</p>



<pre class="wp-block-code"><code class="">kubectl label svc qwen3-vl-service -n vllm app=qwen3-vl --overwrite</code></pre>



<p>You can now verify by launching the following command.</p>



<pre class="wp-block-code"><code class="">kubectl get svc qwen3-vl-service -n vllm --show-labels | grep "app=qwen3-vl"</code></pre>



<p>Returning:</p>



<p><code>qwen3-vl-service&nbsp; ClusterIP&nbsp; X.X.X.X &nbsp;&lt;none&gt;  8000/TCP  1d &nbsp;app=qwen3-vl</code></p>



<h3 class="wp-block-heading">Step 4 &#8211; Install NGINX ingress controller</h3>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><mark style="color:#cf2e2e" class="has-inline-color">⚠️ <strong>Moving beyond Ingress</strong></mark></p>



<p><strong><em><mark style="color:#cf2e2e" class="has-inline-color">Follow this <a href="https://blog.ovhcloud.com/moving-beyond-ingress-why-should-ovhcloud-managed-kubernetes-service-mks-users-start-looking-at-the-gateway-api/" data-wpel-link="internal">tutorial</a> if you want to use Gateway instead of Ingress.</mark></em></strong></p>
</blockquote>



<h4 class="wp-block-heading">1. Add helm repository and configure Ingress</h4>



<p>First of all, add helm repository:</p>



<pre class="wp-block-code"><code class="">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx<br>helm repo update</code></pre>



<p>Create configuration file with <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/ingress/ingress-nginx-values.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ingress-nginx-values.yaml</a></strong></code>.</p>



<p>Then, install NGINX Ingress:</p>



<pre class="wp-block-code"><code class="">helm install ingress-nginx ingress-nginx/ingress-nginx \<br>  --namespace ingress-nginx \<br>  --create-namespace \<br>  -f ingress-nginx-values.yaml \<br>  --wait</code></pre>



<p>Wait for LoadBalancer IP. The external IP assignment should take 1-2 minutes.</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n ingress-nginx ingress-nginx-controller -w</code></pre>



<p>Once <code><strong><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></strong></code> is no longer <code>&lt;pending&gt;</code>, press Ctrl+C and export it:</p>



<pre class="wp-block-code"><code class="">export EXTERNAL_IP=&lt;EXTERNAL-IP&gt;<br>echo "API URL: http://$EXTERNAL_IP"</code></pre>



<h4 class="wp-block-heading">2. Create vLLM Ingress resource</h4>



<p>Next, create vLLM Ingress using <strong><code><a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-ingress.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-ingress.yaml</a></code></strong>.</p>



<p>Apply it as follows:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-ingress.yaml</code></pre>



<p>You can now test different API calls to verify that your deployment is functional.</p>



<h4 class="wp-block-heading">3. Test API</h4>



<p>Firstly, check if the model is available:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/models | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "object": "list",<br>  "data": [<br>    {<br>      "id": "qwen3-vl-8b",<br>      "object": "model",<br>      "created": 1772472143,<br>      "owned_by": "vllm",<br>      "root": "Qwen/Qwen3-VL-8B-Instruct",<br>      "parent": null,<br>      "max_model_len": 8192,<br>      "permission": [<br>        {<br>          "id": "modelperm-8fb35cdd3208b068",<br>          "object": "model_permission",<br>          "created": 1772472143,<br>          "allow_create_engine": false,<br>          "allow_sampling": true,<br>          "allow_logprobs": true,<br>          "allow_search_indices": false,<br>          "allow_view": true,<br>          "allow_fine_tuning": false,<br>          "organization": "*",<br>          "group": null,<br>          "is_blocking": false<br>        }<br>      ]<br>    }<br>  ]<br>}</code></pre>



<p>Next, test inference using the following request:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/chat/completions \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "qwen3-vl-8b",<br>    "messages": [{"role": "user", "content": "Count from 1 to 10."}],<br>    "max_tokens": 100<br>  }' | jq '.choices[0].message.content'</code></pre>



<p><code>"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"</code></p>



<p>Great! You&#8217;re almost there…</p>



<h3 class="wp-block-heading">Step 5 &#8211; Install Prometheus stack</h3>



<p>Now, set up the monitoring stack that provides complete observability for&nbsp;<strong>application-level&nbsp;</strong>(vLLM) and&nbsp;<strong>hardware-level</strong>&nbsp;(GPU) metrics:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="763" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg" alt="" class="wp-image-30871" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-300x223.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-768x572.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1536x1144.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture.jpg 1673w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Monitoring architecture</em></figcaption></figure>



<h4 class="wp-block-heading">1. Add helm repository and create namespace</h4>



<p>Add Prometheus helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts<br>helm repo update</code></pre>



<p>Then, create the <code><strong><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></strong></code> Namespace.</p>



<pre class="wp-block-code"><code class="">kubectl create namespace monitoring</code></pre>



<h4 class="wp-block-heading">2. Create Prometheus deployment configuration and installation</h4>



<p>First, create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/prometheus.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">prometheus.yaml</a></strong></code> file.</p>



<p>Install Prometheus stack:</p>



<pre class="wp-block-code"><code class="">helm install prometheus prometheus-community/kube-prometheus-stack \<br>  -n monitoring \<br>  -f prometheus.yaml \<br>  --timeout 10m \<br>  --wait</code></pre>



<p>Now,&nbsp;monitor its installation and wait until the pods are ready:</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n monitoring -w</code></pre>



<p>If all pods are running successfully, you can proceed to the next step.</p>



<h4 class="wp-block-heading">3. Check that the installation is operational</h4>



<p>First, expose Grafana locally in the background:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &amp;</code></pre>



<p>Test Grafana health:</p>



<pre class="wp-block-code"><code class="">curl -s http://localhost:3000/api/health | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "database": "ok",<br>  "version": "12.3.3",<br>  "commit": "2a14494b2d6ab60f860d8b27603d0ccb264336f6"<br>}</code></pre>



<p>You can now access Grafana locally via <strong><a href="http://localhost:3000" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>http://localhost:3000</code></a></strong>. Log in with:</p>



<ul class="wp-block-list">
<li>Login: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">admin</mark></strong></code></li>



<li>Password: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">Admin123!vLLM</mark></strong></code></li>
</ul>
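
<p>If you did not set these credentials in <code>prometheus.yaml</code>, you can recover the generated admin password from the chart&#8217;s secret (a sketch; the secret name follows the <code>prometheus</code> release name used here):</p>



<pre class="wp-block-code"><code class="">kubectl get secret prometheus-grafana -n monitoring \<br>  -o jsonpath='{.data.admin-password}' | base64 -d; echo</code></pre>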



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="518" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png" alt="" class="wp-image-30804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-300x152.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-768x389.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2.png 1322w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Well done! You can now proceed to the configuration step.</p>



<h3 class="wp-block-heading">Step 6 &#8211; Configure ServiceMonitors</h3>



<p>ServiceMonitors are used to tell Prometheus which endpoints to scrape for metrics.</p>



<h4 class="wp-block-heading">1. Create vLLM ServiceMonitor</h4>



<p>Retrieve the file from the GitHub repository: <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/vllm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-servicemonitor.yaml</a></strong></code>.</p>



<p>Next, apply and check that the ServiceMonitor <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-metrics</mark></strong></code> exists:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-servicemonitor.yaml<br>kubectl get servicemonitor -n vllm</code></pre>



<h4 class="wp-block-heading">2. Create DCGM ServiceMonitor</h4>



<p>First, create the <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/dcgm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dcgm-servicemonitor.yaml</a></strong></code> file.</p>



<p>Once again, apply and verify:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f dcgm-servicemonitor.yaml<br>kubectl get servicemonitor -n gpu-operator</code></pre>



<pre class="wp-block-preformatted"><code>gpu-operator                  1d<br>nvidia-dcgm-exporter          1d<br>nvidia-node-status-exporter   1d</code></pre>



<h4 class="wp-block-heading">3. Configure Prometheus for Cross-Namespace discovery</h4>



<p>Apply a patch to allow Prometheus to discover ServiceMonitors in all namespaces.</p>



<pre class="wp-block-code"><code class="">kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type merge -p '{<br>  "spec": {<br>    "serviceMonitorNamespaceSelector": {},<br>    "podMonitorNamespaceSelector": {}<br>  }<br>}'</code></pre>



<p>Now you have to restart Prometheus.</p>



<ol class="wp-block-list">
<li>Delete Prometheus pod to force configuration reload</li>



<li>Wait for Prometheus to restart</li>
</ol>



<pre class="wp-block-code"><code class="">kubectl delete pod prometheus-prometheus-kube-prometheus-prometheus-0 -n monitoring<br><br>kubectl wait --for=condition=Ready \<br>  pod/prometheus-prometheus-kube-prometheus-prometheus-0 \<br>  -n monitoring \<br>  --timeout=180s</code></pre>



<p>Wait about 2 minutes for discovery and finally, verify targets:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring \<br>  prometheus-prometheus-kube-prometheus-prometheus-0 9090:9090 &amp;</code></pre>



<p>You can open in browser: <a href="http://localhost:9090/targets" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://localhost:9090/targets</mark></strong></code></a> and search for:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></strong></code></li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">dcgm</mark></code></strong></li>
</ul>



<p>Note that the expected targets are: </p>



<ul class="wp-block-list">
<li>serviceMonitor/vllm/vllm-metrics/0   (2/2 UP)</li>



<li>serviceMonitor/gpu-operator/nvidia-dcgm-exporter/0 (2/2 UP)</li>
</ul>
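
<p>You can also confirm this from the command line against the Prometheus API, using the port-forward opened above (a sketch; the <code>scrapePool</code> field matches the target names listed here):</p>



<pre class="wp-block-code"><code class=""># print each vLLM/DCGM scrape pool with its health ("up" expected)<br>curl -s http://localhost:9090/api/v1/targets | \<br>  jq -r '.data.activeTargets[] | select(.scrapePool | test("vllm|dcgm")) | "\(.scrapePool) \(.health)"'</code></pre>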



<h3 class="wp-block-heading">Step 7 &#8211; Create Grafana dashboards</h3>



<p>In this final step, the goal is to create two Grafana dashboards: one for the software side (vLLM application metrics) and one for the hardware side (GPU and system consumption).</p>



<h4 class="wp-block-heading">1. vLLM application metrics</h4>



<p>The dashboard provides insights into vLLM application performance, request handling, and resource utilization based on the following metrics:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>vllm:request_success_total</code></td><td>Counter</td><td>Total successful requests</td><td>count</td><td>Request Rate, Total Requests</td></tr><tr><td><code>vllm:num_requests_running</code></td><td>Gauge</td><td>Requests currently being processed</td><td>count</td><td>Queue Depth, Active Requests</td></tr><tr><td><code>vllm:num_requests_waiting</code></td><td>Gauge</td><td>Requests waiting in queue</td><td>count</td><td>Queue Depth, Queued Requests</td></tr><tr><td><code>vllm:time_to_first_token_seconds</code></td><td>Histogram</td><td>Latency until first token generated</td><td>seconds</td><td>TTFT P50/P95/P99</td></tr><tr><td><code>vllm:e2e_request_latency_seconds</code></td><td>Histogram</td><td>Total end-to-end latency</td><td>seconds</td><td>E2E Latency P50/P95/P99</td></tr><tr><td><code>vllm:generation_tokens_total</code></td><td>Counter</td><td>Total tokens generated (output)</td><td>count</td><td>Token Generation Rate, Throughput</td></tr><tr><td><code>vllm:prompt_tokens_total</code></td><td>Counter</td><td>Total prompt tokens (input)</td><td>count</td><td>Token Generation Rate, Avg Tokens</td></tr><tr><td><code>vllm:kv_cache_usage_perc</code></td><td>Gauge</td><td>GPU KV cache utilization</td><td>0-1 (0-100%)</td><td>KV Cache Usage</td></tr><tr><td><code>vllm:prefix_cache_hits_total</code></td><td>Counter</td><td>Number of prefix cache hits</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:prefix_cache_queries_total</code></td><td>Counter</td><td>Number of prefix cache queries</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:request_queue_time_seconds</code></td><td>Histogram</td><td>Time spent waiting in queue</td><td>seconds</td><td>Request Queue Time</td></tr><tr><td><code>vllm:request_prefill_time_seconds</code></td><td>Histogram</td><td>Prefill phase time</td><td>seconds</td><td>Prefill Time</td></tr><tr><td><code>vllm:request_decode_time_seconds</code></td><td>Histogram</td><td>Decode phase time</td><td>seconds</td><td>Decode Time</td></tr><tr><td><code>vllm:inter_token_latency_seconds</code></td><td>Histogram</td><td>Latency between each token</td><td>seconds</td><td>Inter-Token Latency</td></tr><tr><td><code>vllm:num_preemptions_total</code></td><td>Counter</td><td>Number of preemptions (OOM)</td><td>count</td><td>Preemptions</td></tr><tr><td><code>vllm:prompt_tokens_cached_total</code></td><td>Counter</td><td>Prompt tokens cached</td><td>count</td><td>Cached Tokens</td></tr><tr><td><code>vllm:request_prompt_tokens</code></td><td>Histogram</td><td>Prompt size distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:request_generation_tokens</code></td><td>Histogram</td><td>Generated tokens distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:iteration_tokens_total</code></td><td>Histogram</td><td>Tokens per iteration</td><td>count</td><td>(Advanced analysis)</td></tr></tbody></table></figure>



<p>This <strong>vLLM Grafana dashboard</strong> is composed of 23 panels:</p>






<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Nombre</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>12</td><td>Request Rate, Queue Depth, TTFT, E2E Latency, Token Gen, Cache Usage, Cache Hit, Queue Time, Prefill/Decode, Inter-Token, Preemptions, Avg Tokens</td></tr><tr><td><strong>Stat</strong></td><td>10</td><td>Throughput, TTFT P95, Active Req, Queued Req, Cache Hit Rate, Cache Usage, Total Req, Total Tokens, Cached Tokens, Preemptions</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Pod Performance</td></tr></tbody></table></figure>



<p>Now create the dashboard using <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/vllm-app-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-app-dashboard.json</a></strong></code>. Then, launch:</p>



<pre class="wp-block-code"><code class="">echo "Importing vLLM application dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @vllm-app-dashboard.json | jq '.status, .url'</code></pre>



<p>Next, you can access the vLLM dashboard and follow metrics in real time:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png" alt="" class="wp-image-30858" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Tracking hardware consumption is also essential for comprehensive monitoring.</p>



<h4 class="wp-block-heading">2. GPU hardware metrics</h4>



<p>Take advantage of the most useful DCGM metrics to check both the functioning and consumption of your hardware resources:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Normal Thresholds</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>DCGM_FI_DEV_GPU_UTIL</code></td><td>Gauge</td><td>GPU utilization (compute)</td><td>% (0-100)</td><td>70-95% optimal</td><td>GPU Utilization</td></tr><tr><td><code>DCGM_FI_DEV_GPU_TEMP</code></td><td>Gauge</td><td>GPU temperature</td><td>°C</td><td>&lt; 85°C normal</td><td>GPU Temperature</td></tr><tr><td><code>DCGM_FI_DEV_FB_USED</code></td><td>Gauge</td><td>VRAM used</td><td>MB</td><td>Variable by model</td><td>GPU Memory Used</td></tr><tr><td><code>DCGM_FI_DEV_FB_FREE</code></td><td>Gauge</td><td>VRAM free</td><td>MB</td><td>&gt; 2GB recommended</td><td>GPU Memory Free</td></tr><tr><td><code>DCGM_FI_DEV_POWER_USAGE</code></td><td>Gauge</td><td>Power consumption</td><td>Watts</td><td>&lt; 300W (L40S)</td><td>GPU Power Usage</td></tr><tr><td><code>DCGM_FI_DEV_SM_CLOCK</code></td><td>Gauge</td><td>GPU clock speed (compute)</td><td>MHz</td><td>Variable</td><td>GPU Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_MEM_CLOCK</code></td><td>Gauge</td><td>Memory clock speed</td><td>MHz</td><td>Variable</td><td>Memory Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL</code></td><td>Counter</td><td>Total NVLink bandwidth</td><td>bytes/s</td><td>(If multi-GPU)</td><td>NVLink Bandwidth</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_TX_BYTES</code></td><td>Counter</td><td>PCIe data transmitted</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe TX</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_RX_BYTES</code></td><td>Counter</td><td>PCIe data received</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe RX</td></tr><tr><td><code>DCGM_FI_DEV_ECC_DBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC double-bit errors</td><td>count</td><td>0 ideal</td><td>(Health check)</td></tr><tr><td><code>DCGM_FI_DEV_ECC_SBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC single-bit errors</td><td>count</td><td>&lt; 10/day acceptable</td><td>(Health check)</td></tr></tbody></table></figure>



<p>This&nbsp;<strong>hardware Grafana dashboard</strong>&nbsp;is composed of 13 panels with GPU hardware and system metrics. A detailed view is also available for GPU utilisation (%), temperature (°C), VRAM (GB) and power (W).</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Count</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>8</td><td>GPU Util, GPU Mem, GPU Temp, GPU Power, CPU Usage, RAM Usage, Network I/O, Disk I/O</td></tr><tr><td><strong>Stat</strong></td><td>4</td><td>Avg GPU Util, Avg GPU Temp, Total GPU Mem, Total GPU Power</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Hardware Status</td></tr></tbody></table></figure>



<p>Now create the hardware dashboard using <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/hardware-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">hardware-dashboard.json</a></strong></code> and load it as follows:</p>



<pre class="wp-block-code"><code class="">echo "Importing hardware dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @hardware-dashboard.json | jq '.status, .url'</code></pre>



<p>Finally, track resource consumption using this hardware dashboard:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png" alt="" class="wp-image-30859" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congratulations! Everything is working. You can now test your model and track the various metrics in real time.</p>



<h3 class="wp-block-heading">Step 8 &#8211; LLM testing and performance tracking</h3>



<p>Start by installing Python dependencies:</p>



<pre class="wp-block-code"><code class="">pip3 install openai tqdm</code></pre>



<p>Set the <strong><mark class="has-inline-color has-ast-global-color-0-color">APP_URL</mark></strong> value to your vLLM service external IP (the script below shows an example IP) and launch the performance test using the following <a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/llm-inference-performance-test.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong>Python code</strong></code></a>:</p>



<pre class="wp-block-code"><code class="">import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br>from tqdm import tqdm<br><br>APP_URL = "http://94.23.185.22/v1"<br>MODEL = "qwen3-vl-8b"<br><br>CONCURRENT_WORKERS = 500          # concurrency<br>REQUESTS_PER_WORKER = 10<br>MAX_TOKENS = 200                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># openai compliance<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key="foo"<br>)<br><br># basic metrics<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception as e:<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("\n-&gt; STARTING PERFORMANCE TEST:")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n-&gt; BENCH RESULTS:")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>



<p>Returning:</p>



<pre class="wp-block-preformatted"><code>-&gt; STARTING PERFORMANCE TEST:</code><br><code>Concurrency: 500<br>Total requests: 5000</code><br><code><br>-&gt; BENCH RESULTS:<br>Total requests sent: 5000<br>Successful requests: 5000<br>Errors: 0<br>Total wall time: 225.54s<br>Avg latency: 21.45s<br>Min latency: 6.06s<br>Max latency: 25.19s<br>Throughput: 22.17 req/s</code></pre>



<p>Don&#8217;t forget to track GPU and vLLM metrics in your Grafana dashboards!</p>



<h2 class="wp-block-heading">Conslusion</h2>



<p>This reference architecture demonstrates a<strong>&nbsp;vLLM deployment on OVHcloud Managed Kubernetes Service (MKS)</strong>&nbsp;with comprehensive GPU monitoring. Benefits include:</p>



<ul class="wp-block-list">
<li><strong>High Performance</strong>: GPU-accelerated inference with L40S</li>



<li><strong>Scalability</strong>: Kubernetes-native, horizontal scaling-ready</li>



<li><strong>Reliability</strong>: Health checks, auto-restart, monitoring</li>



<li><strong>API Compatibility</strong>: OpenAI-compatible endpoints</li>



<li><strong>Multimodality</strong>: Vision &amp; text capabilities</li>



<li><strong>Full stack monitoring</strong>: Complete vLLM application and hardware dashboards</li>
</ul>



<h2 class="wp-block-heading">Going Further</h2>



<p>Your current architecture is&nbsp;<strong>functional</strong>. If you wish to harden it into a <strong>full production-ready solution</strong>, consider the following enhancements:</p>



<ol class="wp-block-list">
<li><strong>Authentication &amp; authorization</strong>
<ul class="wp-block-list">
<li>vLLM API authentication</li>



<li>Grafana authentication</li>



<li>Prometheus security</li>
</ul>
</li>



<li><strong>High availability &amp; load balancing</strong>
<ul class="wp-block-list">
<li>Grafana high availability with multiple replicas and shared storage</li>



<li>Prometheus high availability</li>



<li>vLLM Horizontal Pod Autoscaling (HPA) based on custom metrics</li>
</ul>
</li>



<li><strong>Data persistence &amp; backup</strong>
<ul class="wp-block-list">
<li>Prometheus long-term storage with persistent storage</li>



<li>Grafana Dashboard Backup</li>
</ul>
</li>



<li><strong>Observability enhancements</strong>
<ul class="wp-block-list">
<li>Distributed tracing by adding OpenTelemetry for request tracing</li>



<li>Alerting rules with production-ready alert rules</li>
</ul>
</li>
</ol>



<p></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability%2F&amp;action_name=Reference%20Architecture%3A%20Deploying%20a%20vision-language%20model%20with%20vLLM%20on%20OVHcloud%20MKS%20for%20high%20performance%20inference%20and%20full%20observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: Custom metric autoscaling for LLM inference with vLLM on OVHcloud AI Deploy and observability using MKS</title>
		<link>https://blog.ovhcloud.com/reference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 10 Feb 2026 08:51:11 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[MKS]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30203</guid>

					<description><![CDATA[Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability. This reference architecture describes a comprehensive solution for deploying, autoscaling and monitoring vLLM-based LLM workloads on OVHcloud infrastructure. It combinesAI Deploy, used for model serving with custom metric autoscaling, and Managed Kubernetes Service (MKS), which [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks%2F&amp;action_name=Reference%20Architecture%3A%20Custom%20metric%20autoscaling%20for%20LLM%20inference%20with%20vLLM%20on%20OVHcloud%20AI%20Deploy%20and%20observability%20using%20MKS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em><strong>Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability.</strong></em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg" alt="" class="wp-image-30579" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability based on OVHcloud infrastructure</em></figcaption></figure>



<p>This reference architecture describes a comprehensive solution for <strong>deploying, autoscaling and monitoring vLLM-based LLM workloads</strong> on OVHcloud infrastructure. It combines <strong>AI Deploy</strong>, used for <strong>model serving with custom metric autoscaling</strong>, and <strong>Managed Kubernetes Service (MKS)</strong>, which hosts the monitoring and observability stack.</p>



<p>By leveraging <strong>application-level Prometheus metrics exposed by vLLM</strong>, AI Deploy can automatically scale inference replicas based on real workload demand, ensuring <strong>high availability, consistent performance under load and efficient GPU utilisation</strong>. This autoscaling mechanism allows the platform to react dynamically to traffic spikes while maintaining predictable latency for end users.</p>



<p>On top of this scalable inference layer, the monitoring architecture provides <strong>observability</strong> through <strong>Prometheus</strong>, <strong>Grafana</strong> and Alertmanager. It enables real-time performance monitoring, capacity planning, and operational insights, while ensuring <strong>full data sovereignty</strong> for organisations running Large Language Models (LLMs) in production environments.</p>



<p><strong>What are the key benefits</strong>?</p>



<ul class="wp-block-list">
<li><strong>Cost-effective</strong>: Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability</strong>: Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure</strong>: All metrics and data remain within European datacentres</li>



<li><strong>Production-ready</strong>: Persistent storage, high availability, and automated monitoring</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">AI Deploy</h3>



<p>OVHcloud AI Deploy is a<strong>&nbsp;Container as a Service</strong>&nbsp;(CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications/APIs based on Machine Learning (ML), Deep Learning (DL) or Large Language Models (LLMs).</p>



<p><strong>Key points to keep in mind</strong>:</p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong>&nbsp;Bring your own custom Docker image and deploy it with a single command line or a few clicks</li>



<li><strong>High-performance computing:</strong>&nbsp;A complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong>&nbsp;Supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong>&nbsp;Billing per minute, no surcharges</li>
</ul>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p><strong>OVHcloud MKS</strong> is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p><strong>What should you keep in mind?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Only pay for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, upgrades and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking and persistent storage</li>



<li><strong>Scalability and flexibility</strong>: Easily scale workloads and node pools to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p>In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Overview of the architecture</h2>



<p>This reference architecture describes a <strong>complete, secure and scalable solution</strong> to:</p>



<ul class="wp-block-list">
<li>Deploy an LLM with vLLM and <strong>AI Deploy</strong>, benefiting from automatic scaling based on custom metrics to ensure high service availability &#8211; vLLM exposes <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>/metrics</strong></mark></code> via its public HTTPS endpoint on AI Deploy</li>



<li>Collect, store and visualise these vLLM metrics using Prometheus and Grafana on <strong>MKS</strong></li>
</ul>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="1200" height="630" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg" alt="" class="wp-image-30578" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-768x403.jpg 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability architecture overview</em></figcaption></figure>



<p>Here you will find the main components of the architecture. The solution comprises three main layers:</p>



<ol class="wp-block-list">
<li><strong>Model serving layer</strong> with AI Deploy
<ul class="wp-block-list">
<li>vLLM containers running on top of GPUs for LLM inference</li>



<li>vLLM inference server exposing Prometheus metrics</li>



<li>Automatic scaling based on custom metrics to ensure high availability</li>



<li>HTTPS endpoints with Bearer token authentication</li>
</ul>
</li>



<li><strong>Monitoring and observability infrastructure</strong> using Kubernetes
<ul class="wp-block-list">
<li>Prometheus for metrics collection and storage</li>



<li>Grafana for visualisation and dashboards</li>



<li>Persistent volume storage for long-term retention</li>
</ul>
</li>



<li><strong>Network layer</strong>
<ul class="wp-block-list">
<li>Secure HTTPS communication between components</li>



<li>OVHcloud LoadBalancer for external access</li>
</ul>
</li>
</ol>



<p>Before going further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> </a><strong><code><mark class="has-inline-color has-ast-global-color-0-color">Administrator</mark></code></strong> role</li>



<li><strong>ovhai CLI available</strong> &#8211;&nbsp;<em>install the&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a></em></li>



<li>A <strong>Hugging Face access</strong> &#8211; <em>create a&nbsp;<a href="https://huggingface.co/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">access token</a></em></li>



<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></strong></code> installed and <code><strong><mark class="has-inline-color has-ast-global-color-0-color">helm</mark></strong></code> installed (at least version 3.x)</li>
</ul>



<p><strong>🚀 Now you have all the ingredients for our recipe, it’s time to deploy Ministral 3 14B using AI Deploy and the vLLM Docker container!</strong></p>



<h2 class="wp-block-heading">Architecture guide: From autoscaling to observability for LLMs served by vLLM</h2>



<p>Let’s set up and deploy this architecture!</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg" alt="" class="wp-image-30580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Overview of the deployment workflow</em></figcaption></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><strong><em>In this example, <a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a> is used. Choose the open-source model of your choice and follow the same steps, adapting the model slug (from Hugging Face), the versions and the GPU(s) flavour.</em></strong></p>
</blockquote>



<p><em>Remember that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Manage access tokens</h3>



<p>Before deploying the model, set up the two credentials this architecture relies on: a <strong>Hugging Face token</strong> to authenticate model downloads, and an <strong>AI Deploy Bearer token</strong> to secure access to the inference endpoint once it is deployed.</p>



<p>Export your&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face token</a>.</p>



<pre class="wp-block-code"><code class="">export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx</code></pre>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Create a Bearer token</a>&nbsp;to access your AI Deploy app once it&#8217;s been deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>Returning the following output:</p>



<pre class="wp-block-preformatted"><code>Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-01-26 11:53:05<br>Updated At: 20-01-26 11:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</code></pre>



<p>You can now store and export your access token:</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h3 class="wp-block-heading">Step 2 &#8211; LLM deployment using AI Deploy</h3>



<p>Before introducing the monitoring stack, this architecture starts with the <strong>deployment of the <strong>Ministral 3 14B</strong> on OVHcloud AI Deploy</strong>, configured to <strong>autoscale based on custom Prometheus metrics exposed by vLLM itself</strong>.</p>



<h4 class="wp-block-heading">1. Define the targeted vLLM metric for autoscaling</h4>



<p>Before proceeding with the deployment of the <strong>Ministral 3 14B</strong> endpoint, you have to choose the metric you want to use as the trigger for scaling.</p>



<p>Instead of relying solely on CPU/RAM utilisation, AI Deploy allows autoscaling decisions to be driven by <strong>application-level signals</strong>.</p>



<p>To do this, you can consult the <a href="https://docs.vllm.ai/en/latest/design/metrics/#v1-metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">metrics exposed by vLLM</a>.</p>



<p>In this example, you can use a basic metric such as <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>vllm:num_requests_running</strong></mark></code> to scale the number of replicas based on <strong>real inference load</strong>.</p>



<p>This enables:</p>



<ul class="wp-block-list">
<li>Faster reaction to traffic spikes</li>



<li>Better GPU utilisation</li>



<li>Reduced inference latency under load</li>



<li>Cost-efficient scaling</li>
</ul>



<p>Finally, the configuration chosen for scaling this application is as follows:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Value</th><th>Description</th></tr></thead><tbody><tr><td>Metric source</td><td><code>/metrics</code></td><td>vLLM Prometheus endpoint</td></tr><tr><td>Metric name</td><td><code>vllm:num_requests_running</code></td><td>Number of in-flight requests</td></tr><tr><td>Aggregation</td><td><code>AVERAGE</code></td><td>Mean across replicas</td></tr><tr><td>Target value</td><td><code>50</code></td><td>Desired load per replica</td></tr><tr><td>Min replicas</td><td><code>1</code></td><td>Baseline capacity</td></tr><tr><td>Max replicas</td><td><code>3</code></td><td>Burst capacity</td></tr></tbody></table></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><em><strong>You can choose the metric that best suits your use case. You can also apply a patch to your AI Deploy deployment at any time to change the target metric for scaling</strong></em>.</p>
</blockquote>



<p>When the <strong>average number of running requests exceeds 50</strong>, AI Deploy automatically provisions <strong>additional GPU-backed replicas</strong>.</p>



<h4 class="wp-block-heading">2. Deploy Ministral 3 14B using AI Deploy</h4>



<p>Now you can deploy the LLM using the <strong><code>ovhai</code> CLI</strong>.</p>



<p>Key elements necessary for proper functioning:</p>



<ul class="wp-block-list">
<li>GPU-based inference: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">1 x H100</mark></code></strong></li>



<li>vLLM OpenAI-compatible Docker image: <a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.13.0</mark></code></strong></a></li>



<li>Custom autoscaling rules based on Prometheus metrics: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm:num_requests_running</mark></strong></code></li>
</ul>



<p>Below is the reference command used to deploy the <strong><a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a></strong>:</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>  --name vllm-ministral-14B-autoscaling-custom-metric \<br>  --default-http-port 8000 \<br>  --label ai_deploy_token=my_operator_token \<br>  --gpu 1 \<br>  --flavor h100-1-gpu \<br>  -e OUTLINES_CACHE_DIR=/tmp/.outlines \<br>  -e HF_TOKEN=$MY_HF_TOKEN \<br>  -e HF_HOME=/hub \<br>  -e HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>  -e HF_HUB_ENABLE_HF_TRANSFER=0 \<br>  -v standalone:/hub:rw \<br>  -v standalone:/workspace:rw \<br>  --liveness-probe-path /health \<br>  --liveness-probe-port 8000 \<br>  --liveness-initial-delay-seconds 300 \<br>  --probe-path /v1/models \<br>  --probe-port 8000 \<br>  --initial-delay-seconds 300 \<br>  --auto-min-replicas 1 \<br>  --auto-max-replicas 3 \<br>  --auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics" \<br>  --auto-custom-metric-format PROMETHEUS \<br>  --auto-custom-value-location vllm:num_requests_running \<br>  --auto-custom-target-value 50 \<br>  --auto-custom-metric-aggregation-type AVERAGE \<br>  vllm/vllm-openai:v0.13.0 \<br>  -- bash -c "python3 -m vllm.entrypoints.openai.api_server \<br>    --model mistralai/Ministral-3-14B-Instruct-2512 \<br>    --tokenizer_mode mistral \<br>    --load_format mistral \<br>    --config_format mistral \<br>    --enable-auto-tool-choice \<br>    --tool-call-parser mistral \<br>    --enable-prefix-caching"</code></pre>



<p>Let’s break down the different parameters of this command.</p>



<h5 class="wp-block-heading"><strong>a. Start your AI Deploy app</strong></h5>



<p>Launch a new app using&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a>&nbsp;and name it.</p>



<p><code><strong>ovhai app run --name vllm-ministral-14B-autoscaling-custom-metric</strong></code></p>



<h5 class="wp-block-heading"><strong>b. Define access</strong></h5>



<p>Define the HTTP API port and restrict access to your token.</p>



<p><strong><code>--default-http-port 8000</code><br><code>--label ai_deploy_token=my_operator_token</code></strong></p>



<h5 class="wp-block-heading"><strong>c. Configure GPU resources</strong></h5>



<p>Specify the hardware type (<code><strong>h100-1-gpu</strong></code>), which refers to an&nbsp;<strong>NVIDIA H100 GPU</strong>, and the number of GPUs (<strong>1</strong>).</p>



<p><code><strong>--gpu 1<br>--flavor h100-1-gpu</strong></code></p>



<p><strong><mark>⚠️WARNING!</mark></strong>&nbsp;For this model, one H100 is sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access L40S and A100 GPUs for your LLM deployment.</p>



<h5 class="wp-block-heading"><strong>d. Set up environment variables</strong></h5>



<p>Configure caching for the&nbsp;<strong>Outlines library</strong>&nbsp;(used for efficient text generation):</p>



<p><code><strong>-e OUTLINES_CACHE_DIR=/tmp/.outlines</strong></code></p>



<p>Pass the&nbsp;<strong>Hugging Face token</strong>&nbsp;(<code>$MY_HF_TOKEN</code>) for model authentication and download:</p>



<p><code><strong>-e HF_TOKEN=$MY_HF_TOKEN</strong></code></p>



<p>Set the&nbsp;<strong>Hugging Face cache directory</strong>&nbsp;to&nbsp;<code>/hub</code>&nbsp;(where models will be stored):</p>



<p><code><strong>-e HF_HOME=/hub</strong></code></p>



<p>Allow execution of&nbsp;<strong>custom remote code</strong>&nbsp;from Hugging Face datasets (required for some model behaviours):</p>



<p><code><strong>-e HF_DATASETS_TRUST_REMOTE_CODE=1</strong></code></p>



<p>Disable&nbsp;<strong>Hugging Face Hub transfer acceleration</strong>&nbsp;(to use standard model downloading):</p>



<p><code><strong>-e HF_HUB_ENABLE_HF_TRANSFER=0</strong></code></p>



<h5 class="wp-block-heading"><strong>e. Mount persistent volumes</strong></h5>



<p>Mount&nbsp;<strong>two persistent storage volumes</strong>:</p>



<ol class="wp-block-list">
<li><code>/hub</code>&nbsp;→ Stores Hugging Face model files</li>



<li><code>/workspace</code>&nbsp;→ Main working directory</li>
</ol>



<p>The&nbsp;<code>rw</code>&nbsp;flag means&nbsp;<strong>read-write access</strong>.</p>



<p><code><strong>-v standalone:/hub:rw<br>-v standalone:/workspace:rw</strong></code></p>



<h5 class="wp-block-heading"><strong>f. Health checks and readiness</strong></h5>



<p>Configure <strong>liveness and readiness probes</strong>:</p>



<ol class="wp-block-list">
<li><code>/health</code> verifies the container is alive</li>



<li><code>/v1/models</code> confirms the model is loaded and ready to serve requests</li>
</ol>



<p>The long initial delays (300 seconds) correspond to the startup time of vLLM and the loading of the model on the GPU; adjust them to match your model&#8217;s actual load time.</p>



<p><code><strong>--liveness-probe-path /health<br>--liveness-probe-port 8000<br>--liveness-initial-delay-seconds 300<br><br>--probe-path /v1/models<br>--probe-port 8000<br>--initial-delay-seconds 300</strong></code></p>



<h5 class="wp-block-heading"><strong>g. Autoscaling configuration (custom metrics)</strong></h5>



<p>First set the minimum and maximum number of replicas.</p>



<p><strong><code>--auto-min-replicas 1<br>--auto-max-replicas 3</code></strong></p>



<p>This guarantees basic availability (one replica always up) while allowing for peak capacity.</p>



<p>Then enable autoscaling based on application-level metrics exposed by vLLM.</p>



<p><strong><code>--auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics"<br>--auto-custom-metric-format PROMETHEUS<br>--auto-custom-value-location vllm:num_requests_running<br>--auto-custom-target-value 50<br>--auto-custom-metric-aggregation-type AVERAGE</code></strong></p>



<p>AI Deploy:</p>



<ul class="wp-block-list">
<li>Scrapes the local <mark class="has-inline-color has-ast-global-color-0-color"><strong><code>/metrics</code></strong></mark> endpoint</li>



<li>Parses Prometheus-formatted metrics</li>



<li>Extracts the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>vllm:num_requests_running</code></mark></strong> gauge</li>



<li>Computes the average value across replicas</li>
</ul>



<p>Scaling behaviour:</p>



<ul class="wp-block-list">
<li>When the average number of in-flight requests exceeds <strong><code><mark class="has-inline-color has-ast-global-color-0-color">50</mark></code></strong>, AI Deploy adds replicas</li>



<li>When load decreases, replicas are scaled down</li>
</ul>



<p>This approach ensures high availability and predictable latency under fluctuating traffic.</p>
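


<p>AI Deploy does not publish its exact controller algorithm, but target-value autoscaling typically follows the same ratio rule as the Kubernetes HPA. Below is a simplified sketch of the scaling decision under that assumption, using the target of 50 and the 1–3 replica bounds configured above:</p>



<pre class="wp-block-code"><code class="">import math<br><br>def desired_replicas(avg_running, current, target=50, min_r=1, max_r=3):<br>    """HPA-style ratio rule (assumption): scale so that the average<br>    vllm:num_requests_running per replica moves toward the target."""<br>    raw = current * (avg_running / target)<br>    return max(min_r, min(max_r, math.ceil(raw)))<br><br>print(desired_replicas(avg_running=120, current=1))  # burst: 1 -&gt; 3 replicas<br>print(desired_replicas(avg_running=20, current=3))   # cooldown: 3 -&gt; 2 replicas</code></pre>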



<h5 class="wp-block-heading"><strong>h. Choose the target Docker image and the startup command</strong></h5>



<p>Use the official <strong><a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM OpenAI-compatible Docker image</a></strong>.</p>



<p><strong><code>vllm/vllm-openai:v0.13.0</code></strong></p>



<p>Finally, run the model inside the container using a Python command to launch the vLLM API server:</p>



<ul class="wp-block-list">
<li><strong><code>python3 -m vllm.entrypoints.openai.api_server</code></strong>&nbsp;→ Starts the OpenAI-compatible vLLM API server</li>



<li><strong><code>--model mistralai/Ministral-3-14B-Instruct-2512</code></strong>&nbsp;→ Loads the&nbsp;<strong>Ministral 3 14B</strong>&nbsp;model from Hugging Face</li>



<li><strong><code>--tokenizer_mode mistral</code></strong>&nbsp;→ Uses the&nbsp;<strong>Mistral tokenizer</strong></li>



<li><strong><code>--load_format mistral</code></strong>&nbsp;→ Uses Mistral’s model loading format</li>



<li><strong><code>--config_format mistral</code></strong>&nbsp;→ Ensures the model configuration follows Mistral’s standard</li>



<li><code><strong>--enable-auto-tool-choice </strong></code>→ Automatically invokes tools when needed (function/tool calling)</li>



<li><strong><code>--tool-call-parser mistral </code></strong>→ Tool calling support</li>



<li><strong><code>--enable-prefix-caching</code></strong> → Prefix caching for improved throughput and reduced latency</li>
</ul>



<p>You can now launch this command using <strong>ovhai CLI</strong>.</p>



<h4 class="wp-block-heading">3. Check AI Deploy app status</h4>



<p>You can now check if your&nbsp;<strong>AI Deploy</strong>&nbsp;app is alive:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;your_vllm_app_id&gt;</code></pre>



<p><strong>Is your app in&nbsp;<code>RUNNING</code>&nbsp;status?</strong>&nbsp;Perfect! You can check in the logs that the server is started:</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;your_vllm_app_id&gt;</code></pre>



<p><strong><mark>⚠️WARNING!</mark></strong>&nbsp;This step may take a little time as the LLM must be loaded.</p>



<h4 class="wp-block-heading">4. Test that the deployment is functional</h4>



<p>First you can request and send a prompt to the LLM. Launch the following query by asking the question of your choice:</p>



<pre class="wp-block-code"><code class="">curl https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \<br>  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "mistralai/Ministral-3-14B-Instruct-2512",<br>    "messages": [<br>      {"role": "system", "content": "You are a helpful assistant."},<br>      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}<br>    ],<br>    "stream": false<br>  }'</code></pre>



<p>You can also verify access to vLLM metrics.</p>



<pre class="wp-block-code"><code class="">curl -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/metrics</code></pre>



<p>If both tests show that the model deployment is functional and you receive HTTP 200 responses, you are ready to move on to the next step!</p>
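


<p>If you prefer scripting these checks, here is a small sketch that fetches the <code>/metrics</code> endpoint and prints the gauge used for autoscaling (it assumes the <code>requests</code> package is installed and uses the same app ID placeholder):</p>



<pre class="wp-block-code"><code class="">import os<br>import requests<br><br>APP_URL = "https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net"<br>headers = {"Authorization": f"Bearer {os.environ['MY_OVHAI_ACCESS_TOKEN']}"}<br><br>resp = requests.get(f"{APP_URL}/metrics", headers=headers, timeout=10)<br>resp.raise_for_status()  # expect HTTP 200<br><br># Prometheus text format: one "name{labels} value" pair per non-comment line<br>for line in resp.text.splitlines():<br>    if line.startswith("vllm:num_requests_running"):<br>        print(line)</code></pre>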



<p>The next step is to set up the observability and monitoring stack. This autoscaling mechanism is <strong>fully independent</strong> of the Prometheus instance used for observability:</p>



<ul class="wp-block-list">
<li>AI Deploy queries the local <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code></mark></strong> endpoint internally</li>



<li>Prometheus scrapes the <strong>same metrics endpoint</strong> externally for monitoring, dashboards and potentially alerting</li>
</ul>



<p>This ensures:</p>



<ul class="wp-block-list">
<li>A single source of truth for metrics</li>



<li>No duplication of exporters</li>



<li>Consistent signals for scaling and observability</li>
</ul>



<h3 class="wp-block-heading">Step 3 &#8211; Create an MKS cluster</h3>



<p>From <a href="https://manager.eu.ovhcloud.com/#/hub/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Control Panel</a>, create a Kubernetes cluster using the <strong>MKS</strong>.</p>



<p>Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Location</strong>: GRA (Gravelines) &#8211; <em>you can select the same region as for AI Deploy</em></li>



<li><strong>Network</strong>: Public</li>



<li><strong>Node pool</strong>:
<ul class="wp-block-list">
<li>Flavour: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">b2-15</mark></strong></code> (or something similar)</li>



<li>Number of nodes: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">3</mark></code></strong></li>



<li>Autoscaling: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">OFF</mark></code></strong></li>
</ul>
</li>



<li><strong>Name your node pool:</strong> <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></strong></li>
</ul>



<p>You should see your cluster (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>prometheus-vllm-metrics-ai-deploy</strong></mark></code>) in the list, along with the following information:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="632" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png" alt="" class="wp-image-30242" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-300x185.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-768x474.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1536x948.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-2048x1264.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the status is green with the <strong><mark style="color:#00d084" class="has-inline-color"><code>OK</code></mark></strong> label, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 4 &#8211; Configure Kubernetes access</h3>



<p>Download your <strong>kubeconfig file</strong> from the OVHcloud Control Panel and configure <strong><code><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></code></strong>:</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p>Now, you can create the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>values-prometheus.yaml</code></mark></strong> file:</p>



<pre class="wp-block-code"><code class=""># general configuration<br>nameOverride: "monitoring"<br>fullnameOverride: "monitoring"<br><br># Prometheus configuration<br>prometheus:<br>  prometheusSpec:<br>    # data retention (15d)<br>    retention: 15d<br>    <br>    # scrape interval (15s)<br>    scrapeInterval: 15s<br>    <br>    # persistent storage (required for production deployment)<br>    storageSpec:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed  # OVHcloud storage<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 50Gi  # (can be modified according to your needs)<br>    <br>    # scrape vLLM metrics from your AI Deploy instance (Ministral 3 14B)<br>    additionalScrapeConfigs:<br>      - job_name: 'vllm-ministral'<br>        scheme: https<br>        metrics_path: '/metrics'<br>        scrape_interval: 15s<br>        scrape_timeout: 10s<br>        <br>        # authentication using AI Deploy Bearer token stored Kubernetes Secret<br>        bearer_token_file: /etc/prometheus/secrets/vllm-auth-token/token<br>        static_configs:<br>          - targets:<br>              - '&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net'  # /!\ REPLACE THE &lt;APP_ID&gt; by yours /!\<br>            labels:<br>              service: 'vllm'<br>              model: 'ministral'<br>              environment: 'production'<br>        <br>        # TLS configuration<br>        tls_config:<br>          insecure_skip_verify: false<br>    <br>    # kube-prometheus-stack mounts the secret under /etc/prometheus/secrets/ and makes it accessible to Prometheus<br>    secrets:<br>      - vllm-auth-token<br><br># Grafana configuration (visualization layer)<br>grafana:<br>  enabled: true<br>  <br>  # disable automatic datasource provisioning<br>  sidecar:<br>    datasources:<br>      enabled: false<br>  <br>  # persistent dashboards<br>  persistence:<br>    enabled: true<br>    storageClassName: csi-cinder-high-speed<br>    size: 10Gi<br>  <br>  # /!\ DEFINE ADMIN PASSWORD - REPLACE "test" BY YOURS /!\<br>  adminPassword: "test"<br>  <br>  # access via OVHcloud LoadBalancer (public IP and managed LB)<br>  service:<br>    type: LoadBalancer<br>    port: 80<br>    annotations:<br>      # optional : limiter l'accès à certaines IPs<br>      # service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources: "1.2.3.4/32"<br>  <br># alertmanager (optional but recommended for production)<br>alertmanager:<br>  enabled: true<br>  <br>  alertmanagerSpec:<br>    storage:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 10Gi<br><br># cluster observability components<br>nodeExporter:<br>  enabled: true<br>  <br>kubeStateMetrics:<br>  enabled: true</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ <em>Note</em></strong></p>



<p><strong><em>On OVHcloud MKS, persistent storage is handled automatically through the Cinder CSI driver. When a PersistentVolumeClaim (PVC) references a supported <code>storageClassName</code> such as <code>csi-cinder-high-speed</code>, OVHcloud dynamically provisions the underlying Block Storage volume and attaches it to the node running the pod. This enables stateful components like Prometheus, Alertmanager and Grafana to persist data reliably without any manual volume management, making the architecture fully cloud-native and operationally simple.</em></strong></p>
</blockquote>



<p>Then create the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></code></strong> namespace:</p>



<pre class="wp-block-code"><code class=""># create namespace<br>kubectl create namespace monitoring<br><br># verify creation<br>kubectl get namespaces | grep monitoring</code></pre>



<p>Finally, configure the Bearer token secret to access vLLM metrics.</p>



<pre class="wp-block-code"><code class=""># create bearer token secret<br>kubectl create secret generic vllm-auth-token \<br>  --from-literal=token='"$MY_OVHAI_ACCESS_TOKEN"' \<br>  -n monitoring<br><br># verify secret creation<br>kubectl get secret vllm-auth-token -n monitoring<br><br># test token (optional)<br>kubectl get secret vllm-auth-token -n monitoring \<br>  -o jsonpath='{.data.token}' | base64 -d </code></pre>



<p>Right, if everything is working, let&#8217;s move on to deployment.</p>



<h3 class="wp-block-heading">Step 5 &#8211; Deploy Prometheus stack</h3>



<p>Add the Prometheus Helm repository and install the monitoring stack. The deployment creates:</p>



<ul class="wp-block-list">
<li>Prometheus StatefulSet with persistent storage</li>



<li>Grafana deployment with LoadBalancer access</li>



<li>Alertmanager for future alert configuration (optional)</li>



<li>Supporting components (node exporters, kube-state-metrics)</li>
</ul>



<pre class="wp-block-code"><code class=""># add Helm repository<br>helm repo add prometheus-community \<br>  https://prometheus-community.github.io/helm-charts<br>helm repo update<br><br># install monitoring stack<br>helm install monitoring prometheus-community/kube-prometheus-stack \<br>  --namespace monitoring \<br>  --values values-prometheus.yaml \<br>  --wait</code></pre>



<p>Then you can retrieve the LoadBalancer IP address to access Grafana:</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n monitoring monitoring-grafana</code></pre>



<p>Finally, open your browser to <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://&lt;EXTERNAL-IP&gt;</mark></strong></code> and login with:</p>



<ul class="wp-block-list">
<li><strong>Username</strong>: <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>admin</strong></mark></code></li>



<li><strong>Password</strong>: as configured in your <code><strong><mark class="has-inline-color has-ast-global-color-0-color">values-prometheus.yaml</mark></strong></code> file</li>
</ul>



<h3 class="wp-block-heading">Step 6 &#8211; Create Grafana dashboards</h3>



<p>In this step, you will access the Grafana interface, add your Prometheus as a new data source, then create a complete dashboard with different vLLM metrics.</p>



<h4 class="wp-block-heading">1. Add a new data source in Grafana</h4>



<p>First of all, create a new Prometheus connection inside Grafana:</p>



<ul class="wp-block-list">
<li>Navigate to <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Connections</code></mark></strong> → <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Data sources</code></mark></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Add data source</mark></code></strong></li>



<li>Select <strong>Prometheus</strong></li>



<li>Configure URL: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://monitoring-prometheus:9090</mark></strong></code></li>



<li>Click <strong>Save &amp; test</strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="609" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png" alt="" class="wp-image-30247" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-300x178.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-768x457.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1536x913.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-2048x1218.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now that your Prometheus has been configured as a new data source, you can create your Grafana dashboard.</p>



<h4 class="wp-block-heading">2. Create your monitoring dashboard</h4>



<p>To begin with, you can use the following pre-configured Grafana dashboard by downloading this JSON file locally:</p>





<p>In the left-hand menu, select <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboard</mark></code></strong>:</p>



<ol class="wp-block-list">
<li>Navigate to <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboards</mark></code></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Import</mark></code></strong></li>



<li>Upload the provided dashboard JSON</li>



<li>Select <strong>Prometheus</strong> as datasource</li>



<li>Click <strong>Import</strong> and select the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">vLLM-metrics-grafana-monitoring.json</mark></code></strong> file</li>
</ol>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="449" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png" alt="" class="wp-image-30250" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-300x131.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-768x337.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1536x673.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-2048x897.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The dashboard provides real-time visibility for <strong>Ministral 3 14B</strong> deployed with vLLM container and OVHcloud AI Deploy.</p>



<p>You can now track:</p>



<ul class="wp-block-list">
<li><strong>Performance metrics</strong>: TTFT, inter-token latency, end-to-end latency</li>



<li><strong>Throughput indicators</strong>: Requests per second, token generation rates</li>



<li><strong>Resource utilisation</strong>: KV cache usage, active/waiting requests</li>



<li><strong>Capacity indicators</strong>: Queue depth, preemption rates</li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png" alt="" class="wp-image-30253" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Here are the key metrics tracked and displayed in the Grafana dashboard:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric Category</th><th>Prometheus Metric</th><th>Description</th><th>Use case</th></tr></thead><tbody><tr><td><strong>Latency</strong></td><td><code>vllm:time_to_first_token_seconds</code></td><td>Time until first token generation</td><td>User experience monitoring</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:inter_token_latency_seconds</code></td><td>Time between tokens</td><td>Throughput optimisation</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:e2e_request_latency_seconds</code></td><td>End-to-end request time</td><td>SLA monitoring</td></tr><tr><td><strong>Throughput</strong></td><td><code>vllm:request_success_total</code></td><td>Successful requests counter</td><td>Capacity planning</td></tr><tr><td><strong>Resource</strong></td><td><code>vllm:kv_cache_usage_perc</code></td><td>KV cache memory usage</td><td>Memory management</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_running</code></td><td>Active requests</td><td>Load monitoring</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_waiting</code></td><td>Queued requests</td><td>Overload detection</td></tr><tr><td><strong>Capacity</strong></td><td><code>vllm:num_preemptions_total</code></td><td>Request preemptions</td><td>Peak load indicator</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:prompt_tokens_total</code></td><td>Input tokens processed</td><td>Usage analytics</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:generation_tokens_total</code></td><td>Output tokens generated</td><td>Cost tracking</td></tr></tbody></table></figure>



<p>Well done, you now have at your disposal:</p>



<ul class="wp-block-list">
<li>An endpoint of the Ministral 3 14B model deployed with vLLM thanks to <strong>OVHcloud AI Deploy</strong> and its autoscaling strategies based on custom metrics</li>



<li>Prometheus for metrics collection and Grafana for visualisation/dashboards thanks to <strong>OVHcloud MKS</strong></li>
</ul>



<p><strong>But how can you check that everything will work when the load increases?</strong></p>



<h3 class="wp-block-heading">Step 7 &#8211; Test autoscaling and real-time visualisation</h3>



<p>The first objective here is to force AI Deploy to:</p>



<ul class="wp-block-list">
<li>Increase <code>vllm:num_requests_running</code></li>



<li>&#8216;Saturate&#8217; a single replica</li>



<li>Trigger the <strong>scale up</strong></li>



<li>Observe replica increase + latency drop</li>
</ul>



<h4 class="wp-block-heading">1. Autoscaling testing strategy</h4>



<p>The goal is to combine:</p>



<ul class="wp-block-list">
<li><strong>High concurrency</strong></li>



<li><strong>Long prompts</strong> (KVcache heavy)</li>



<li><strong>Long generations</strong></li>



<li><strong>Bursty load</strong></li>
</ul>



<p>This is what vLLM autoscaling actually reacts to.</p>



<p>To do so, a Python script can simulate the expected behaviour:</p>



<pre class="wp-block-code"><code class="">import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br>from tqdm import tqdm<br><br>APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/v1" # /!\ REPLACE THE &lt;APP_ID&gt; by yours /!\<br>MODEL = "mistralai/Ministral-3-14B-Instruct-2512"<br>API_KEY = $MY_OVHAI_ACCESS_TOKEN<br><br>CONCURRENT_WORKERS = 500          # concurrency (main scaling trigger)<br>REQUESTS_PER_WORKER = 25<br>MAX_TOKENS = 768                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># openai compliance<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key=API_KEY,<br>)<br><br># basic metrics<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception as e:<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("Starting autoscaling stress test...")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n=== AUTOSCALING BENCH RESULTS ===")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>



<p><strong>How can you verify that autoscaling is working and that the load is being handled correctly without latency skyrocketing?</strong></p>



<h4 class="wp-block-heading">2. Hardware and platform-level monitoring</h4>



<p>First, <strong>AI Deploy Grafana</strong> answers <strong>&#8216;What resources are being used and how many replicas exist?</strong>&#8216;.</p>



<p>GPU utilisation, GPU memory, CPU, RAM and replica count are monitored through <strong>OVHcloud AI Deploy Grafana</strong> (monitoring URL), which exposes infrastructure and runtime metrics for the AI Deploy application. This layer provides visibility into <strong>resource saturation and scaling events</strong> managed by the AI Deploy platform itself.</p>



<p>Access it using the following URL (do not forget to replace <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>&lt;APP_ID&gt;</strong></mark></code> by yours): <strong><code>https://monitoring.gra.ai.cloud.ovh.net/d/app/app-monitoring?var-app=</code><mark class="has-inline-color has-ast-global-color-0-color"><code>&lt;APP_ID&gt;</code></mark><code>&amp;orgId=1</code></strong></p>



<p>For example, check GPU/RAM metrics:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png" alt="" class="wp-image-30260" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can also monitor scale ups and downs in real time, as well as information on HTTP calls and much more!</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png" alt="" class="wp-image-30261" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h4 class="wp-block-heading">3. Software and application-level monitoring</h4>



<p>Next, the combination of MKS + Prometheus + Grafana answers <strong>&#8216;How does the inference engine behave internally?&#8217;</strong></p>



<p>In fact, vLLM internal metrics (request concurrency, token throughput, latency indicators, KV cache pressure, etc.) are collected via the <strong>vLLM <code>/metrics</code> endpoint</strong> and scraped by <strong>Prometheus running on OVHcloud MKS</strong>, then visualised in a <strong>dedicated Grafana instance</strong>. This layer focuses on <strong>model behaviour and inference performance</strong>.</p>



<p>Find all these metrics via (just replace <strong><code><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></code></strong>): <strong><code>http://<mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark>/d/vllm-ministral-monitoring/ministral-14b-vllm-metrics-monitoring?orgId=1</code></strong></p>



<p>Find key metrics such as TTFT (Time-to-First-Token):</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png" alt="" class="wp-image-30263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can also find some information about <strong>&#8216;Model load and throughput&#8217;</strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png" alt="" class="wp-image-30264" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To go further and add even more metrics, you can refer to the vLLM documentation on &#8216;<a href="https://docs.vllm.ai/en/v0.7.2/getting_started/examples/prometheus_grafana.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus and Grafana</a>&#8216;.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>This reference architecture provides a scalable and production-ready approach for deploying LLM inference on OVHcloud using <strong>AI Deploy</strong> and the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-apps-deployments?id=kb_article_view&amp;sysparm_article=KB0047997#advanced-custom-metrics-for-autoscaling" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">autoscaling on custom metric feature</a>.</p>



<p>OVHcloud <strong>MKS</strong> is dedicated to running Prometheus and Grafana, enabling secure scraping and visualisation of <strong>vLLM internal metrics</strong> exposed via the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code> </mark></strong>endpoint.</p>



<p>By scraping vLLM metrics securely from AI Deploy into Prometheus and exposing them through Grafana, the architecture provides full visibility into model behaviour, performance and load, enabling informed scaling analysis, troubleshooting and capacity planning in production environments.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks%2F&amp;action_name=Reference%20Architecture%3A%20Custom%20metric%20autoscaling%20for%20LLM%20inference%20with%20vLLM%20on%20OVHcloud%20AI%20Deploy%20and%20observability%20using%20MKS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: build a sovereign n8n RAG workflow for AI agent using OVHcloud Public Cloud solutions</title>
		<link>https://blog.ovhcloud.com/reference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 27 Jan 2026 13:12:03 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Managed Database]]></category>
		<category><![CDATA[n8n]]></category>
		<category><![CDATA[Object Storage]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[S3]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29694</guid>

					<description><![CDATA[What if an n8n workflow, deployed in a&#160;sovereign environment, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection. In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&#160;Large Language Models&#160;(LLMs) [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions%2F&amp;action_name=Reference%20Architecture%3A%20build%20a%20sovereign%20n8n%20RAG%20workflow%20for%20AI%20agent%20using%20OVHcloud%20Public%20Cloud%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em><em>What if an n8n workflow, deployed in a&nbsp;</em><strong><em>sovereign environment</em></strong><em>, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection.</em></em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg" alt="" class="wp-image-30002" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-300x169.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-768x432.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1536x864.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag.jpg 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>n8n workflow overview</em></figcaption></figure>



<p>In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&nbsp;<strong>Large Language Models</strong>&nbsp;(LLMs) is becoming a strategic differentiator.</p>



<p><strong>How?</strong>&nbsp;By building&nbsp;<strong>Agentic RAG systems</strong>&nbsp;capable of retrieving, reasoning, and acting autonomously based on external knowledge.</p>



<p>To make this possible, engineers need a way to connect&nbsp;<strong>retrieval pipelines (RAG)</strong>&nbsp;with&nbsp;<strong>tool-based orchestration</strong>.</p>



<p>This article outlines a&nbsp;<strong>reference architecture</strong>&nbsp;for building a&nbsp;<strong>fully automated RAG pipeline orchestrated by n8n</strong>, leveraging&nbsp;<strong>OVHcloud AI Endpoints</strong>&nbsp;and&nbsp;<strong>PostgreSQL with pgvector</strong>&nbsp;as core components.</p>



<p>The final result will be a system that automatically ingests Markdown documentation from&nbsp;<strong>Object Storage</strong>, creates embeddings with OVHcloud’s&nbsp;<strong>BGE-M3</strong>&nbsp;model available on AI Endpoints, and stores them in a&nbsp;<strong>Managed Database PostgreSQL</strong>&nbsp;with pgvector extension.</p>



<p>Lastly, you’ll be able to build an AI Agent that lets you chat with an LLM (<strong>GPT-OSS-120B</strong>&nbsp;on AI Endpoints). This agent, utilising the RAG implementation carried out upstream, will be an expert on OVHcloud products.</p>



<p>You can further improve the process by using an&nbsp;<strong>LLM guard</strong>&nbsp;to protect the questions sent to the LLM, and set up a chat memory to use conversation history for higher response quality.</p>



<p><strong>But what about n8n?</strong></p>



<p><a href="https://n8n.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>n8n</strong></a>, the open-source workflow automation tool,&nbsp;offers many benefits and connects seamlessly with over&nbsp;<strong>300</strong>&nbsp;APIs, apps, and services:</p>



<ul class="wp-block-list">
<li><strong>Open-source</strong>: n8n is a 100% self-hostable solution, which means you retain full data control;</li>



<li><strong>Flexible</strong>: combines low-code nodes and custom JavaScript/Python logic;</li>



<li><strong>AI-ready</strong>: includes useful integrations for LangChain, OpenAI, and embedding support capabilities;</li>



<li><strong>Composable</strong>: enables simple connections between data, APIs, and models in minutes;</li>



<li><strong>Sovereign by design</strong>: suitable for privacy-sensitive or regulated sectors.</li>
</ul>



<p>This reference architecture serves as a blueprint for building a sovereign, scalable Retrieval Augmented Generation (<strong>RAG</strong>) platform using&nbsp;<strong>n8n</strong>&nbsp;and&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;solutions.</p>



<p>This setup shows how to orchestrate data ingestion, generate embeddings, and enable conversational AI by combining&nbsp;<strong>OVHcloud Object Storage</strong>,&nbsp;<strong>Managed Databases with PostgreSQL</strong>,&nbsp;<strong>AI Endpoints</strong>&nbsp;and&nbsp;<strong>AI Deploy</strong>. <strong>The result?</strong>&nbsp;An AI environment that is fully integrated, protects privacy, and is exclusively hosted on <strong>OVHcloud’s European infrastructure</strong>.</p>



<h2 class="wp-block-heading">Overview of the n8n workflow architecture for RAG </h2>



<p>The workflow involves the following steps:</p>



<ul class="wp-block-list">
<li><strong>Ingestion:</strong>&nbsp;documentation in markdown format is fetched from <strong>OVHcloud Object Storage (S3);</strong></li>



<li><strong>Preprocessing:</strong> n8n cleans and normalises the text, removing YAML front-matter and encoding noise;</li>



<li><strong>Vectorisation:</strong>&nbsp;each document is embedded using the <strong>BGE-M3</strong> model, which is available via <strong>OVHcloud AI Endpoints;</strong></li>



<li><strong>Persistence:</strong> vectors and metadata are stored in <strong>OVHcloud PostgreSQL Managed Database</strong> using pgvector;</li>



<li><strong>Retrieval:</strong> when a user sends a query, n8n triggers a <strong>LangChain Agent</strong> that retrieves relevant chunks from the database;</li>



<li><strong>Reasoning and actions:</strong>&nbsp;the <strong>AI Agent node</strong> combines LLM reasoning, memory, and tool usage to generate a contextual response or trigger downstream actions (Slack reply, Notion update, API call, etc.).</li>
</ul>



<p>In this tutorial, all services are deployed within the <strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you start, double-check that you have:</p>



<ul class="wp-block-list">
<li>an <strong>OVHcloud Public Cloud</strong> account</li>



<li>an <strong>OpenStack user</strong> with the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">&nbsp;following roles</a>:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>



<li>an <strong>API key</strong> for <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-endpoints-getting-started?id=kb_article_view&amp;sysparm_article=KB0065401" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></li>



<li><strong>ovhai CLI available</strong> – <em>install the </em><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a></li>



<li><strong>Hugging Face access</strong> – <em>create a </em><a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Hugging Face account</em></a><em> and generate an </em><a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>access token</em></a></li>
</ul>



<p><strong>🚀 Now that you have everything you need, you can start building your n8n workflow!</strong></p>



<h2 class="wp-block-heading">Architecture guide: n8n agentic RAG workflow</h2>



<p>You’re all set to configure and deploy your n8n workflow.</p>



<p>⚙️<em> Keep in mind that the following steps can be completed using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Build the RAG data ingestion pipeline</h3>



<p>This first step involves building the foundation of the entire RAG workflow by preparing the elements you need:</p>



<ul class="wp-block-list">
<li>n8n deployment</li>



<li>Object Storage bucket creation</li>



<li>PostgreSQL database creation</li>



<li>and more</li>
</ul>



<p>Remember to set up the proper credentials in n8n so the different elements can connect and function.</p>



<h4 class="wp-block-heading">1. Deploy n8n on OVHcloud VPS</h4>



<p>OVHcloud provides <a href="https://www.ovhcloud.com/en-gb/vps/vps-n8n/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>VPS solutions compatible with n8n</strong></a><strong>.</strong> Get a ready-to-use virtual server with <strong>pre-installed n8n </strong>and start building automation workflows without manual setup. With plans ranging from <strong>6 vCores&nbsp;/&nbsp;12 GB RAM</strong> to <strong>24 vCores&nbsp;/&nbsp;96 GB RAM</strong>, you can choose the capacity that suits your workload.</p>



<p><strong>How to set up n8n on a VPS?</strong></p>



<p>Setting up n8n on an OVHcloud VPS generally involves:</p>



<ul class="wp-block-list">
<li>Choosing and provisioning your OVHcloud VPS plan;</li>



<li>Connecting to your server via SSH and carrying out the initial server configuration, which includes updating the OS;</li>



<li>Installing n8n, typically with Docker (recommended for ease of management and updates), or npm by following this <a href="https://help.ovhcloud.com/csm/en-gb-vps-install-n8n?id=kb_article_view&amp;sysparm_article=KB0072179" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>;</li>



<li>Configuring n8n with a domain name, SSL certificate for HTTPS, and any necessary environment variables for databases or settings.</li>
</ul>



<p>While OVHcloud provides a robust VPS platform, you can find detailed n8n installation guides in the <a href="https://docs.n8n.io/hosting/installation/docker/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official n8n documentation</a>.</p>



<p>Once the setup is complete, you can configure the database and the Object Storage bucket.</p>



<h4 class="wp-block-heading">2. Create Object Storage bucket</h4>



<p>First, you have to set up your data source. Here you can store all your documentation in an S3-compatible <a href="https://www.ovhcloud.com/en-gb/public-cloud/object-storage/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Object Storage</a> bucket.</p>



<p>Here, assume that all the documentation files are in Markdown format.</p>



<p>From the <strong>OVHcloud Control Panel</strong>, create a new Object Storage container with the <strong>S3-compatible API</strong> solution by following this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-storage-s3-getting-started-object-storage?id=kb_article_view&amp;sysparm_article=KB0034674" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>.</p>



<p>When the bucket is ready, add your Markdown documentation to it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png" alt="" class="wp-image-29733" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Note:</strong>&nbsp;For this tutorial, we’re using the various OVHcloud product documentation available as open source in the GitHub repository maintained by OVHcloud members.</p>



<p><em>Click this </em><a href="https://github.com/ovh/docs.git" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>link</em></a><em> to access the repository.</em></p>
</blockquote>
</blockquote>



<p>How do you do that? Extract all the <strong><code>guide.en-gb.md</code></strong> files from the GitHub repository and rename each one to match its parent folder.</p>



<p>Example: the documentation about ovhai CLI installation, <code><strong>docs/pages/public_cloud/ai_machine_learning/cli_10_howto_install_cli/guide.en-gb.md</strong></code>, is stored in the <strong>ovhcloud-products-documentation-md</strong> bucket as <strong><code>cli_10_howto_install_cli.md</code></strong>.</p>
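<p>As an illustration, here is a minimal Node.js sketch that performs this extraction from a local clone of the repository (the <code>./docs</code> and <code>./md-export</code> paths are assumptions, so adapt them to your setup); you can then upload the resulting files to your bucket:</p>



<pre class="wp-block-code"><code class="">// Minimal sketch: collect every guide.en-gb.md from a local clone of<br>// the ovh/docs repository and rename it after its parent folder.<br>// SRC and OUT are assumed paths - adapt them to your environment.<br>const fs = require('fs');<br>const path = require('path');<br><br>const SRC = './docs';      // local clone of https://github.com/ovh/docs<br>const OUT = './md-export'; // files to upload to the S3 bucket<br><br>fs.mkdirSync(OUT, { recursive: true });<br><br>function walk(dir) {<br>  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {<br>    const full = path.join(dir, entry.name);<br>    if (entry.isDirectory()) {<br>      walk(full);<br>    } else if (entry.name === 'guide.en-gb.md') {<br>      // e.g. .../cli_10_howto_install_cli/guide.en-gb.md<br>      // becomes cli_10_howto_install_cli.md<br>      const parent = path.basename(path.dirname(full));<br>      fs.copyFileSync(full, path.join(OUT, `${parent}.md`));<br>    }<br>  }<br>}<br><br>walk(SRC);</code></pre>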



<p>You should get an overview that looks like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png" alt="" class="wp-image-29735" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Take note of the following elements and create a new credential in n8n named <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">OVHcloud S3 gra credentials</mark></strong></code>:</p>



<ul class="wp-block-list">
<li>S3 Endpoint: <a href="https://s3.gra.io.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://s3.gra.io.cloud.ovh.net/</mark></code></strong></a></li>



<li>Region: <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">gra</mark></code></strong></li>



<li>Access Key ID: <strong><code>&lt;your_object_storage_user_access_key&gt;</code></strong></li>



<li>Secret Access Key: <strong><code>&lt;your_object_storage_user_secret_key&gt;</code></strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png" alt="" class="wp-image-29736" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, create a new n8n node by selecting&nbsp;<strong>S3</strong>, then&nbsp;<strong>Get Multiple Files</strong>.<br>Configure this node as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png" alt="" class="wp-image-29740" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Connect the node to the previous one before moving on to the next step.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png" alt="" class="wp-image-29741" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>With the first phase done, you can now configure the vector DB.</p>



<h4 class="wp-block-heading">3. Configure PostgreSQL Managed DB (pgvector)</h4>



<p>In this step, you can set up the vector database that lets you store the embeddings generated from your documents.</p>



<p>How? By using OVHcloud’s Managed Databases for&nbsp;<a href="https://www.ovhcloud.com/en-gb/public-cloud/postgresql/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PostgreSQL</a>&nbsp;with the pgvector extension. Go to your OVHcloud Control Panel and follow the steps.</p>



<p>1. Navigate to&nbsp;<strong>Databases &amp; Analytics &gt; Databases</strong></p>



<p><strong>2. Create a new database and select&nbsp;<em>PostgreSQL</em>&nbsp;and a datacenter location</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png" alt="" class="wp-image-29758" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>3. Select&nbsp;<em>Production</em>&nbsp;plan and&nbsp;<em>Instance type</em></strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png" alt="" class="wp-image-29759" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>4. Reset the user password and save it</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png" alt="" class="wp-image-29762" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>5. Whitelist the IP of your n8n instance as follows</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png" alt="" class="wp-image-29761" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>6. Take note of the following parameters</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png" alt="" class="wp-image-29760" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make a note of this information and create a new credential in n8n named&nbsp;<strong>OVHcloud PGvector credentials</strong>:</p>



<ul class="wp-block-list">
<li>Host:<strong>&nbsp;<code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_hostname&gt;</mark></code></strong></li>



<li>Database:&nbsp;<strong>defaultdb</strong></li>



<li>User:&nbsp;<code>avnadmin</code></li>



<li>Password:&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_password&gt;</mark></strong></code></li>



<li>Port:&nbsp;<strong>20184</strong></li>
</ul>



<p>Consider enabling the&nbsp;<strong>Ignore SSL Issues (Insecure)</strong>&nbsp;button as needed and setting the&nbsp;<strong>Maximum Number of Connections</strong>&nbsp;value to&nbsp;<strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1000</mark></code></strong>.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png" alt="" class="wp-image-29763" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>✅ You’re now connected to the database! But what about the PGvector extension?</p>



<p>Add a PostgreSQL node&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code>&nbsp;to your n8n workflow, and create the extension through an SQL query, which should look like this:</p>



<pre class="wp-block-code"><code class="">-- drop table as needed<br>DROP TABLE IF EXISTS md_embeddings;<br><br>-- activate pgvector<br>CREATE EXTENSION IF NOT EXISTS vector;<br><br>-- create table<br>CREATE TABLE md_embeddings (<br>    id SERIAL PRIMARY KEY,<br>    text TEXT,<br>    embedding vector(1024),<br>    metadata JSONB<br>);</code></pre>



<p>You should get this n8n node:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png" alt="" class="wp-image-29752" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Finally, you can create a new table named&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">md_embeddings</mark></strong></code>&nbsp;using this node. Add a&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Stop and Error</mark></strong></code>&nbsp;node to halt the workflow if the table setup fails.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png" alt="" class="wp-image-29753" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>All set! Your vector DB is prepped and ready for data! Keep in mind, you still need an&nbsp;<strong>embeddings model</strong> for the RAG data ingestion pipeline.</p>



<h4 class="wp-block-heading">4. Access to OVHcloud AI Endpoints</h4>



<p><strong>OVHcloud AI Endpoints</strong>&nbsp;is a managed service that provides&nbsp;<strong>ready-to-use APIs for AI models</strong>, including&nbsp;<strong>LLM, CodeLLM, embeddings, Speech-to-Text, and image models</strong>&nbsp;hosted within OVHcloud’s European infrastructure.</p>



<p>To vectorise the various documents in Markdown format, you have to select an embedding model:&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/bge-m3" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>BGE-M3</strong></a>.</p>



<p>Usually, your AI Endpoints API key should already be created. If not, head to the AI Endpoints menu in your OVHcloud Control Panel to generate a new API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png" alt="" class="wp-image-29775" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Once this is done, you can create new OpenAI credentials in your n8n.</p>



<p>Why do I need OpenAI credentials? Because the <strong>AI Endpoints API&nbsp;</strong>is fully compatible with OpenAI’s, integrating it is simple and ensures the&nbsp;<strong>sovereignty of your data.</strong></p>



<p>How? Thanks to a single endpoint&nbsp;<a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a>, you can request the different AI Endpoints models.</p>
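<p>As an illustration, here is a minimal Node.js sketch requesting an embedding through that endpoint with plain <code>fetch</code> (the <code>bge-m3</code> model identifier is an assumption, so verify it on the model page; the API key comes from the previous step):</p>



<pre class="wp-block-code"><code class="">// Minimal sketch: request an embedding from OVHcloud AI Endpoints<br>// through the OpenAI-compatible API. The model identifier is an<br>// assumption - verify it on the BGE-M3 model page.<br>const API_KEY = process.env.OVH_AI_ENDPOINTS_TOKEN; // your AI Endpoints API key<br><br>async function embed(text) {<br>  const res = await fetch('https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/embeddings', {<br>    method: 'POST',<br>    headers: {<br>      'Authorization': `Bearer ${API_KEY}`,<br>      'Content-Type': 'application/json',<br>    },<br>    body: JSON.stringify({ model: 'bge-m3', input: text }),<br>  });<br>  if (!res.ok) throw new Error(`HTTP ${res.status}`);<br>  const data = await res.json();<br>  return data.data[0].embedding; // 1024-dimension vector for BGE-M3<br>}<br><br>embed('Hello OVHcloud!').then(v =&gt; console.log(v.length)); // expect 1024</code></pre>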



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png" alt="" class="wp-image-29776" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This means you can create a new n8n node by selecting&nbsp;<strong>Postgres PGVector Store</strong>&nbsp;and&nbsp;<strong>Add documents to Vector Store</strong>.<br>Set up this node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png" alt="" class="wp-image-29781" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then configure the <strong>Data Loader</strong> with custom text splitting and the JSON data type.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png" alt="" class="wp-image-29780" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>For the text splitter, here are some options:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png" alt="" class="wp-image-29786" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To finish, select the&nbsp;<strong>BGE-M3</strong> embedding model from the model list and set the&nbsp;<strong>Dimensions</strong> to 1024.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png" alt="" class="wp-image-29784" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You now have everything you need to build the ingestion pipeline.</p>



<h4 class="wp-block-heading">5. Set up the ingestion pipeline loop</h4>



<p>To build a fully automated document ingestion and vectorisation pipeline, you have to integrate a few specific nodes, mainly:</p>



<ul class="wp-block-list">
<li>a <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></code></strong> that downloads each markdown file one by one so that it can be vectorised;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> that counts the number of files processed, which subsequently determines the number of requests sent to the embedding model;</li>



<li>an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition that lets you check when 400 requests have been reached;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node that pauses after every 400 requests to avoid getting rate-limited;</li>



<li>an S3 block <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> to download each markdown;</li>



<li>another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> to extract and process text from Markdown files by cleaning and removing special characters before sending it to the embeddings model;</li>



<li>a PostgreSQL node to <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL</mark></strong></code> query to check that the table contains vectors after the process (loop) is complete.</li>
</ul>



<h5 class="wp-block-heading">5.1. Create a loop to process each documentation file</h5>



<p>Begin by creating a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></strong></code> to process all the Markdown files one at a time. Set the <strong>batch size</strong> to <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1</mark></code></strong> in this loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png" alt="" class="wp-image-29788" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Add the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Loop</code></mark></strong> statement right after the S3 <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Get Multiple Files</mark></code></strong> node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png" alt="" class="wp-image-29797" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Time to put the loop’s content into action!</p>



<h5 class="wp-block-heading">5.2. Count the number of files using a code snippet</h5>



<p>Next, choose the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> node from the list to count how many files have been processed. Set the <strong>Mode</strong> to “Run Once for Each Item” and the <strong>Language</strong> to “JavaScript”, then add the following code snippet to the designated block.</p>



<pre class="wp-block-code"><code class="">// simple counter per item<br>const counter = $runIndex + 1;<br><br>return {<br>  counter<br>};</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png" alt="" class="wp-image-29792" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make sure this code snippet is included in the loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png" alt="" class="wp-image-29798" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can start adding the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong><code>if</code></strong></mark> part to the loop now.</p>



<h5 class="wp-block-heading">5.3. Add a condition that applies a rule every 400 requests</h5>



<p>Here, you need to create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node and add the following condition, set as an expression.</p>



<pre class="wp-block-code"><code class="">{{ (Number($json["counter"]) % 400) === 0 }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png" alt="" class="wp-image-29794" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Add it immediately after counting the files:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png" alt="" class="wp-image-29800" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If this condition <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">is true</mark></strong></code>, trigger the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node.</p>



<h5 class="wp-block-heading">5.4. Insert a pause after each set of 400 requests</h5>



<p>Then insert a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node to pause for a short time before resuming. Set <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Resume</mark></strong></code> to “After Time Interval” and the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait Amount</mark></strong></code> to 60 seconds.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png" alt="" class="wp-image-29796" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Connect it to the <strong>True</strong> output of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png" alt="" class="wp-image-29801" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, you can go ahead and download the Markdown file, and then process it.</p>



<h5 class="wp-block-heading">5.5. Launch documentation download</h5>



<p>To do this, create a new <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> S3 node and configure it with this File Key expression:</p>



<pre class="wp-block-code"><code class="">{{ $('Process each documentation file').item.json.Key }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png" alt="" class="wp-image-29804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Want to connect it? That’s easy: link it to the output of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node and to the <strong>False</strong> output of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition; this way, a file is processed only when the rate limit has not been exceeded.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png" alt="" class="wp-image-29805" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You’re almost done! Now you need to extract and process the text from the Markdown files – clean and remove any special characters before sending it to the embedding model.</p>



<h5 class="wp-block-heading">5.6 Clean Markdown text content</h5>



<p>Next, create another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> to process text from Markdown files:</p>



<pre class="wp-block-code"><code class="">// extract binary content<br>const binary = $input.item.binary.data;<br><br>// decoding into clean UTF-8 text<br>let text = Buffer.from(binary.data, 'base64').toString('utf8');<br><br>// cleaning - remove non-printable characters<br>text = text<br>  .replace(/[^\x09\x0A\x0D\x20-\x7EÀ-ÿ€£¥•–—‘’“”«»©®™°±§¶÷×]/g, ' ')<br>  .replace(/\s{2,}/g, ' ')<br>  .trim();<br><br>// check lenght<br>if (text.length &gt; 14000) {<br>  text = text.slice(0, 14000);<br>}<br><br>return [{<br>  text,<br>  fileName: binary.fileName,<br>  mimeType: binary.mimeType<br>}];</code></pre>



<p>Select the <em>“Run Once for Each Item”</em> <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Mode</mark></strong></code> and place the previous code in the dedicated JavaScript block.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png" alt="" class="wp-image-29806" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To finish, check that the output text has been sent to the document vectorisation system, which was set up in <strong>Step 3 – Configure PostgreSQL Managed DB (pgvector)</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png" alt="" class="wp-image-29808" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>How do I confirm that the table contains all elements after vectorisation?</p>



<h5 class="wp-block-heading">5.7 Double-check that the documents are in the table</h5>



<p>To confirm that your RAG system is working, make sure your vector database actually contains the embedded documents. To do this, use a PostgreSQL node with <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code> in your n8n workflow.</p>



<p>Then, run the following query:</p>



<pre class="wp-block-code"><code class="">-- count the number of elements<br>SELECT COUNT(*) FROM md_embeddings;</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png" alt="" class="wp-image-29818" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Done</mark></strong></code> section of your <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop</mark></strong>, so the elements are counted when the process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png" alt="" class="wp-image-29773" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congrats! You can now run the workflow to begin ingesting documents.</p>



<p>Click the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute workflow</mark></strong></code> button and wait until the vectorisation process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png" alt="" class="wp-image-29823" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember, everything should be green when it’s finished ✅.</p>



<h3 class="wp-block-heading">Step 2 – RAG chatbot</h3>



<p>With the data ingestion and vectorisation steps completed, you can now begin implementing your AI agent.</p>



<p>This involves building a <strong>RAG-based AI Agent</strong>&nbsp;by simply starting a chat with an LLM.</p>



<h4 class="wp-block-heading">1. Set up the chat box to start a conversation</h4>



<p>First, configure your AI Agent based on the RAG system, and add a new node in the same n8n workflow: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Trigger</mark></strong></code>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png" alt="" class="wp-image-29834" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This node will allow you to interact directly with your AI agent! But before that, you need to check that your message is safe.</p>



<h4 class="wp-block-heading">2. Set up your LLM Guard with AI Deploy</h4>



<p>To check whether a message is secure or not, use an LLM Guard.</p>



<p><strong>What’s an LLM Guard?</strong>&nbsp;This is a safety and control layer that sits between users and an LLM, or between the LLM and an external connection. Its main goal is to filter, monitor, and enforce rules on what goes into or comes out of the model 🔐.</p>



<p>You can use <a href="file:///Users/jdutse/Downloads/www.ovhcloud.com/en-gb/public-cloud/ai-deploy" data-wpel-link="internal">AI Deploy</a> from OVHcloud to deploy your desired LLM guard. With a single command line, this AI solution lets you deploy a Hugging Face model using vLLM Docker containers.</p>



<p>For more details, please refer to this <a href="https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/" data-wpel-link="internal">blog</a>.</p>



<p>For the use case covered in this article, you can use the open-source model <strong>meta-llama/Llama-Guard-3-8B</strong> available on <a href="https://huggingface.co/meta-llama/Llama-Guard-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a>.</p>



<h5 class="wp-block-heading">2.1 Create a Bearer token to request your custom AI Deploy endpoint</h5>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a token</a> to access your AI Deploy app once it’s deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>The following output is returned:</p>



<p><code><strong>Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-10-25 8:53:05<br>Updated At: 20-10-25 8:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</strong></code></p>



<p>You can now store and export your access token to add it as a new credential in n8n.</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h5 class="wp-block-heading">2.1 Start Llama Guard 3 model with AI Deploy</h5>



<p>Using the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai</mark></strong></code> CLI, launch the following command to start the vLLM inference server.</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>	--name vllm-llama-guard3 \<br>        --default-http-port 8000 \<br>        --gpu 1 \<br>	--flavor l40s-1-gpu \<br>        --label ai_deploy_token=my_operator_token \<br>	--env OUTLINES_CACHE_DIR=/tmp/.outlines \<br>	--env HF_TOKEN=$MY_HF_TOKEN \<br>	--env HF_HOME=/hub \<br>	--env HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>	--env HF_HUB_ENABLE_HF_TRANSFER=0 \<br>	--volume standalone:/workspace:RW \<br>	--volume standalone:/hub:RW \<br>	vllm/vllm-openai:v0.10.1.1 \<br>	-- bash -c python3 -m vllm.entrypoints.openai.api_server                       <br>                           --model meta-llama/Llama-Guard-3-8B \                     <br>                           --tensor-parallel-size 1 \                     <br>                           --dtype bfloat16</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai app run</mark></strong></code></li>
</ul>



<p>This is the core command to&nbsp;<strong>run an app</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--name vllm-llama-guard3</mark></strong></code></li>
</ul>



<p>Sets a&nbsp;<strong>custom name</strong>&nbsp;for the app. For example,&nbsp;<code>vllm-llama-guard3</code>.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--default-http-port 8000</mark></strong></code></li>
</ul>



<p>Exposes&nbsp;<strong>port 8000</strong>&nbsp;as the default HTTP endpoint. The vLLM server typically runs on port 8000.</p>



<ul class="wp-block-list">
<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--gpu 1</code></mark></strong></li>



<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--flavor l40s-1-gpu</code></mark></strong></li>
</ul>



<p>Allocates&nbsp;<strong>one L40S GPU</strong>&nbsp;for the app. You can adjust the GPU type and count depending on the model you need to deploy.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/workspace:RW</mark></strong></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/hub:RW</mark></strong></code></li>
</ul>



<p>Mounts&nbsp;<strong>two persistent storage volumes</strong>: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/workspace</code></mark></strong>, the main working directory, and <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">/hub</mark></strong></code>, used to store Hugging Face model files.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env OUTLINES_CACHE_DIR=/tmp/.outlines</mark></strong></code></li>



<li><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_TOKEN=$MY_HF_TOKEN</mark></code></strong></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HOME=/hub</mark></strong></code></li>



<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>--env HF_DATASETS_TRUST_REMOTE_CODE=1</strong></mark></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HUB_ENABLE_HF_TRANSFER=0</mark></strong></code></li>
</ul>



<p>These are Hugging Face&nbsp;<strong>environment variables</strong> you have to set. Please export your Hugging Face access token as an environment variable before starting the app: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">export MY_HF_TOKEN=***********</mark></strong></code></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.10.1.1</mark></strong></code></li>
</ul>



<p>Uses the&nbsp;<strong><code>vllm/vllm-openai</code></strong>&nbsp;Docker image (a pre-configured vLLM OpenAI API server).</p>



<ul class="wp-block-list">
<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>-- bash -c "python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-Guard-3-8B --tensor-parallel-size 1 --dtype bfloat16"</strong></mark></code></li>
</ul>



<p>Finally, run a<strong>&nbsp;bash shell</strong>&nbsp;inside the container and execute a Python command to launch the vLLM API server.</p>



<h5 class="wp-block-heading">2.2 Check to confirm your AI Deploy app is RUNNING</h5>



<p>Replace <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></strong></code> with yours.</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;app_id&gt;</code></pre>



<p>You should get:</p>



<p><code>History:<br>DATE STATE<br>20-10-25 09:58:00 QUEUED<br>20-10-25 09:58:01 INITIALIZING<br>20-10-25 09:58:07 PENDING<br>20-10-25 10:03:10&nbsp;<strong>RUNNING</strong><br>Info:<br>Message: App is running</code></p>



<h5 class="wp-block-heading">2.3 Create a new n8n credential with AI Deploy app URL and Bearer access token</h5>



<p>First, using your <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>&lt;app_id></strong></mark></code>, retrieve your AI Deploy app URL.</p>



<pre class="wp-block-code"><code class="">ovhai app get <span style="background-color: initial; font-family: inherit; font-size: inherit; text-align: initial; font-weight: inherit;">&lt;app_id&gt;</span> -o json | jq '.status.url' -r</code></pre>



<p>Then, create a new OpenAI credential from your n8n workflow, using your AI Deploy URL and the Bearer token as an API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png" alt="" class="wp-image-29837" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Don&#8217;t forget to replace <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>6e10e6a5-2862-4c82-8c08-26c458ca12c7</code></mark></strong> with your <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></code></strong>.</p>



<h5 class="wp-block-heading">2.4 Create the LLM Guard node in n8n workflow</h5>



<p>Create a new <strong>OpenAI node</strong> to <strong>Message a model</strong> and select the new AI Deploy credential for LLM Guard usage.</p>



<p>Next, create the prompt as follows:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png" alt="" class="wp-image-29840" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, use an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node to determine if the scenario is <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>safe</code></mark></strong> or <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>unsafe</code></mark></strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png" alt="" class="wp-image-29842" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">unsafe</mark></strong></code>, send an error message right away to stop the workflow.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png" alt="" class="wp-image-29843" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>But if the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">safe</mark></strong></code>, you can send the request to the AI Agent without issues 🔐.</p>



<h4 class="wp-block-heading">3. Set up AI Agent</h4>



<p>The&nbsp;<strong>AI Agent</strong>&nbsp;node in&nbsp;<strong>n8n</strong>&nbsp;acts as an intelligent orchestration layer that combines&nbsp;<strong>LLMs, memory, and external tools</strong>&nbsp;within an automated workflow.</p>



<p>It allows you to:</p>



<ul class="wp-block-list">
<li>Connect a <strong>Large Language Model</strong> using APIs (e.g., LLMs from AI Endpoints);</li>



<li>Use <strong>tools</strong> such as HTTP requests, databases, or RAG retrievers so the agent can take actions or fetch real information;</li>



<li>Maintain <strong>conversational memory</strong> via PostgreSQL databases;</li>



<li>Integrate directly with chat platforms (e.g., Slack, Teams) for interactive assistants (optional).</li>
</ul>



<p>Simply put, n8n becomes an&nbsp;<strong>agentic automation framework</strong>, enabling LLMs to not only provide answers, but also think, choose, and perform actions.</p>



<p>Please note that you can change and customise this n8n <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node to fit your use cases, using features like function calling or structured output. This is the most basic configuration for the given use case. You can go even further with different agents.</p>



<p>🧑‍💻&nbsp;<strong>How do I implement this RAG?</strong></p>



<p>First, create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node in <strong>n8n</strong> as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png" alt="" class="wp-image-29933" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, a series of steps is required, the first of which is creating the prompts.</p>



<h5 class="wp-block-heading">3.1 Create prompts</h5>



<p>In the AI Agent node on your n8n workflow, edit the user and system prompts.</p>



<p>Begin by creating the&nbsp;<strong>prompt</strong>,&nbsp;which is also the&nbsp;<strong>user message</strong>:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<p>Then create the <strong>System Message</strong> as shown below:</p>



<pre class="wp-block-code"><code class="">You have access to a retriever tool connected to a knowledge base.  <br>Before answering, always search for relevant documents using the retriever tool.  <br>Use the retrieved context to answer accurately.  <br>If no relevant documents are found, say that you have no information about it.</code></pre>



<p>You should get a configuration like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png" alt="" class="wp-image-29935" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>🤔 Well, an LLM is now needed for this to work!</p>



<h5 class="wp-block-heading">3.2 Select LLM using AI Endpoints API</h5>



<p>First, add an <strong>OpenAI Chat Model</strong> node, and then set it as the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Model</mark></strong></code> for your agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png" alt="" class="wp-image-29939" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, select one of the&nbsp;<a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Endpoints</a>&nbsp;from the list provided, because they are compatible with OpenAI APIs.</p>



<p>✅ <strong>How?</strong> By using the right API base URL: <a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png" alt="" class="wp-image-29936" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The <a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog/gpt-oss-120b/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>GPT OSS 120B</strong></a> model has been selected for this use case. Other models, such as Llama, Mistral, and Qwen, are also available.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><mark style="background-color:#fcb900" class="has-inline-color">⚠️ <strong>WARNING</strong> ⚠️</mark></p>



<p>If you are using a recent version of n8n, you will likely encounter the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/responses</code></mark></strong> issue (linked to OpenAI compatibility). To resolve this, disable the <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Use Responses API</mark></code></strong> toggle and everything will work correctly.</p>
</blockquote>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="829" height="675" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg" alt="" class="wp-image-30352" style="aspect-ratio:1.2281554640124863;width:409px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg 829w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-300x244.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-768x625.jpg 768w" sizes="auto, (max-width: 829px) 100vw, 829px" /><figcaption class="wp-element-caption"><em>Tips to fix /responses issue</em></figcaption></figure>



<p>Your LLM is now set to answer your questions! Don’t forget, it needs access to the knowledge base.</p>



<h5 class="wp-block-heading">3.3 Connect the knowledge base to the RAG retriever</h5>



<p>As usual, the first step is to create an n8n node called <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">PGVector Vector Store node</mark></strong></code> and enter your PGVector credentials.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png" alt="" class="wp-image-29943" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Tools</code></mark></strong> section of the AI Agent node.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png" alt="" class="wp-image-29944" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember to connect your PG vector database so that the retriever can access the previously generated embeddings. Here’s an overview of what you’ll get.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png" alt="" class="wp-image-29945" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>⏳Nearly done! The final step is to add the database memory.</p>



<h5 class="wp-block-heading">3.4 Manage conversation history with database memory</h5>



<p>Creating a&nbsp;<strong>Database Memory</strong>&nbsp;node in n8n (PostgreSQL) lets you link it to your AI Agent, so it can store and retrieve past conversation history. This enables the model to remember and use context across multiple interactions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png" alt="" class="wp-image-29946" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, link this PostgreSQL database to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Memory</mark></strong></code> section of your AI agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png" alt="" class="wp-image-29947" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congrats! 🥳 Your&nbsp;<strong>n8n RAG workflow</strong>&nbsp;is now complete. Ready to test it?</p>



<h4 class="wp-block-heading">4. Make the most of your automated workflow</h4>



<p>Want to try it? It’s easy!</p>



<p>By clicking the orange <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Open chat</code></mark></strong> button, you can ask the AI agent questions about OVHcloud products, particularly where you need technical assistance.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4"></video></figure>



<p>For example, you can ask the LLM about rate limits in OVHcloud AI Endpoints and get the information in seconds.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4"></video></figure>



<p>You can now build your own autonomous RAG system using OVHcloud Public Cloud, suited for a wide range of applications.</p>



<h2 class="wp-block-heading">What’s next?</h2>



<p>To sum up, this reference architecture provides a guide on using&nbsp;<strong>n8n</strong> with&nbsp;<strong>OVHcloud AI Endpoints</strong>,&nbsp;<strong>AI Deploy</strong>,&nbsp;<strong>Object Storage</strong>, and&nbsp;<strong>PostgreSQL + pgvector</strong> to build a fully controlled, autonomous&nbsp;<strong>RAG AI system</strong>.</p>



<p>Teams can build scalable AI assistants that work securely and independently in their cloud environment by orchestrating ingestion, embedding generation, vector storage, retrieval, LLM safety checks, and reasoning within a single workflow.</p>



<p>With the core architecture in place, you can add more features to improve the capabilities and robustness of your agentic RAG system:</p>



<ul class="wp-block-list">
<li>Web search</li>



<li>Images with OCR</li>



<li>Audio files transcribed using the Whisper model</li>
</ul>



<p>This delivers an extensive knowledge base and a wider variety of use cases!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions%2F&amp;action_name=Reference%20Architecture%3A%20build%20a%20sovereign%20n8n%20RAG%20workflow%20for%20AI%20agent%20using%20OVHcloud%20Public%20Cloud%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4" length="11190376" type="video/mp4" />
<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4" length="9881210" type="video/mp4" />

			</item>
		<item>
		<title>Reference Architecture: deploying the Mistral Large 123B model in a sovereign environment with OVHcloud</title>
		<link>https://blog.ovhcloud.com/reference-architecture-deploy-mistral-large-model-in-sovereign-environment-ovhcloud/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Wed, 18 Jun 2025 12:45:51 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Training]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29186</guid>

					<description><![CDATA[Are you ready to think bigger with the Mistral Large model 🚀 ? As Artificial Intelligence (AI) becomes a strategic pillar for both enterprises and public institutions, data sovereignty and infrastructure control have become essential. Deploying advanced large language models (LLMs) like Mistral Large, under a commercial license, requires a secure, high-performance environment that complies [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploy-mistral-large-model-in-sovereign-environment-ovhcloud%2F&amp;action_name=Reference%20Architecture%3A%C2%A0deploying%20the%20Mistral%20Large%20123B%20model%20in%20a%20sovereign%20environment%20with%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em><strong>Are you ready to think bigger with the Mistral Large model 🚀 ?</strong></em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="461" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-1024x461.png" alt="" class="wp-image-29249" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-1024x461.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-300x135.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-768x346.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-1536x691.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Mistral Large model deployed on OVHcloud infrastructure<br></em></figcaption></figure>



<p>As Artificial Intelligence (<strong>AI</strong>) becomes a strategic pillar for both enterprises and public institutions, <strong>data sovereignty</strong> and <strong>infrastructure control</strong> have become essential. Deploying advanced large language models (LLMs) like <strong>Mistral Large</strong>, under a commercial license, requires a secure, high-performance environment that complies with <strong>European data regulations</strong>.</p>



<p><strong>OVHcloud Machine Learning Services</strong> offer a trusted solution for deploying AI models in a <strong>fully sovereign cloud environment</strong> — hosted in Europe, under <strong>EU jurisdiction</strong>, and fully <strong>GDPR-compliant</strong>.</p>



<p>This <strong>Reference Architecture</strong> will show you how to:</p>



<ul class="wp-block-list">
<li>Access Mistral AI registry using your own license</li>



<li>Download the Mistral Large 123B model automatically using <strong>AI Training</strong></li>



<li>Store the model into a dedicated bucket with <strong>OVHcloud Object Storage</strong></li>



<li>Deploy a production-ready inference API for <strong>Mistral Large</strong> using <strong>AI Deploy</strong> </li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">Mistral Large model</h3>



<p>The <strong>Mistral Large</strong> model is a <strong>state-of-the-art large language model (LLM)</strong> developed by <strong><a href="https://mistral.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral AI</a>,</strong> a French AI company. It&#8217;s designed to compete with top-tier models like GPT-4 and Claude, while emphasizing performance and efficiency.</p>



<p>This is a model with <strong>123 billion</strong> parameters. <strong>Mistral AI</strong> recommends deploying this model in FP8 with 4 H100 GPUs. For more information, refer to <a href="https://help.mistral.ai/en/articles/235545-mistral-models" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral documentation</a>.</p>



<p>This model requires a <strong>commercial licence</strong>. To obtain one, you need to create an account on <a href="https://console.mistral.ai/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">La Plateforme</a> via the Mistral AI console (<strong>console.mistral.ai</strong>).</p>



<h3 class="wp-block-heading">AI Training </h3>



<p><strong>OVHcloud AI Training</strong> is a fully managed platform designed to help you <strong>train and tune</strong> Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs) efficiently. Whether you&#8217;re working on computer vision, NLP, or tabular data, this solution lets you launch training jobs on high-performance GPUs in seconds.</p>



<p><strong>What are the key benefits?</strong></p>



<ul class="wp-block-list">
<li><strong>Easy to use</strong>: launch processing or training jobs in one CLI command or a few clicks using your own Docker image</li>



<li><strong>High-performance computing</strong>: access GPUs like H100, A100, V100S, L40S, and L4 as of June 2025 &#8211; new references are added regularly</li>



<li><strong>Cost-efficient</strong>:<strong> </strong>pay-per-minute billing with no upfront commitment. You only pay for compute time used, with precise control over resources thanks to automatic job stop and synchronisation</li>
</ul>



<p><strong>💡 Why do we need AI Training? </strong>To download the Mistral Large model automatically and efficiently, using a single command to launch the job.</p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p>OVHcloud AI Deploy is a<strong>&nbsp;Container as a Service</strong>&nbsp;(CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications / APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.</p>



<p><strong>The key benefits are:</strong></p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong>&nbsp;bring your own custom Docker image and deploy it with a single command line or a few clicks</li>



<li><strong>High-performance computing:</strong>&nbsp;a complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong>&nbsp;supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong>&nbsp;billing per minute, no surcharges</li>
</ul>



<p>✅ To go further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Overview of the Mistral Large deployment architecture</h2>



<p>Here is how <strong>Mistral Large 123B</strong> will be deployed:</p>



<ol class="wp-block-list">
<li>Install the <strong>ovhai CLI</strong></li>



<li>Create a bucket for <strong>model storage</strong></li>



<li>Retrieve the <strong>license information</strong> from <a href="https://console.mistral.ai/on-premise/licenses" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral Console</a></li>



<li>Configure and set up the<strong> environment</strong></li>



<li>Download the <strong>Mistral Large model weights</strong></li>



<li>Deploy the <strong>Mistral Large service</strong></li>



<li>Test it with simple request and <strong>advanced usage</strong> thanks to LangChain</li>
</ol>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="173" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-1024x173.png" alt="" class="wp-image-29251" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-1024x173.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-300x51.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-768x130.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-1536x259.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Let’s get started with the setup and deployment of your own Mistral Large service!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>A <strong><a href="https://console.mistral.ai/on-premise/licenses" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral AI license</a></strong> to access to the <strong>Mistral Large model</strong></li>



<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the following roles:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Training Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>
</ul>



<p><strong>🚀 With all the ingredients for our recipe in place, it’s time to deploy the Mistral Large model on 4 H100 GPUs!</strong></p>



<h2 class="wp-block-heading">Architecture guide:&nbsp;Mistral Large on OVHcloud infrastructure</h2>



<p>Let’s dive into the setup and deployment of the <strong>Mistral Large</strong> model!</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ Note</strong></p>
<cite><strong>In this example, the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Mistral Large 25.02</code></mark> is used. Choose the Mistral model covered by your licence and repeat the same steps, adapting the model name and version.</strong></cite></blockquote>



<p>⚙️<em>&nbsp;Also consider that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Install&nbsp;<code>ovhai</code>&nbsp;CLI</h3>



<p>If the <code><strong>ovhai</strong></code> CLI is not installed, start by setting up your CLI environment.</p>



<pre class="wp-block-code"><code class="">curl https://cli.gra.ai.cloud.ovh.net/install.sh | bash</code></pre>



<p>Then, log in using your&nbsp;<strong>OpenStack credentials</strong>.</p>



<pre class="wp-block-code"><code class="">ovhai login -u &lt;openstack-username&gt; -p &lt;openstack-password&gt;</code></pre>



<p>Now, it’s time to create your bucket inside OVHcloud Object Storage!</p>



<h3 class="wp-block-heading">Step 2 – Provision Object Storage</h3>



<ol class="wp-block-list">
<li>Go to&nbsp;<strong>Public Cloud &gt; Storage &gt; Object Storage</strong>&nbsp;in the OVHcloud Control Panel.</li>



<li>Create a&nbsp;<strong>datastore</strong>&nbsp;and a new&nbsp;<strong>S3 bucket</strong>&nbsp;(e.g.,&nbsp;<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>s3-mistral-large-model</code>)</mark></strong>.</li>



<li>Register the datastore with the&nbsp;<code>ovhai</code>&nbsp;CLI:</li>
</ol>



<pre class="wp-block-code"><code class="">ovhai datastore add s3 &lt;ALIAS&gt; https://s3.gra.perf.cloud.ovh.net/ gra &lt;my-access-key&gt; &lt;my-secret-key&gt; --store-credentials-locally</code></pre>



<p>💡 <em>Note that, for this use case, we recommend the <strong>High Performance Object Storage</strong> range using <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>https://s3.gra.perf.cloud.ovh.net/</strong></mark></code> instead of <code>https://s3.gra.io.cloud.ovh.net/</code></em></p>
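


<p>You can then confirm that the alias has been registered:</p>



<pre class="wp-block-code"><code class="">ovhai datastore list</code></pre>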



<h3 class="wp-block-heading">Step 3 &#8211; Access the Mistral AI registry</h3>



<p><em>⚠️ Please note that you must have a <strong>licence for the Mistral Large model </strong>to be able to carry out the following steps.</em></p>



<ul class="wp-block-list">
<li>Go to the Mistral AI platform: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://console.mistral.ai/home</mark></strong></li>



<li>Retrieve <strong>credentials</strong> and the <strong>license key</strong> from the Mistral console:<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"> https://console.mistral.ai/on-premise/licenses</mark></strong></li>



<li>Authenticate to the Mistral AI Docker registry:</li>
</ul>



<pre class="wp-block-code"><code class="">docker login &lt;mistral-ai-registry&gt; --username $DOCKER_USERNAME --password $DOCKER_PASSWORD</code></pre>



<ul class="wp-block-list">
<li>Add the private registry to the config using the <code><strong>ovhai</strong></code> CLI:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai registry add &lt;mistral-ai-registry&gt;</code></pre>



<ul class="wp-block-list">
<li>Check that it is present in the list:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai registry list</code></pre>



<h3 class="wp-block-heading">Step 4 &#8211; Define environment variables</h3>



<p>The next step is to define a<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"> <strong><code>.env</code></strong></mark> file that lists all the environment variables required to download and deploy the Mistral Large model.</p>



<ul class="wp-block-list">
<li>Create the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong><code>.env</code></strong></mark> file and enter the following information:</li>
</ul>



<pre class="wp-block-code"><code class=""><code>SERVED_MODEL=mistral-large-2502
RECIPES_VERSION=v0.0.76TP_SIZE=4
LICENSE_KEY=&lt;your-mistral-license-key&gt;
DOCKER_IMAGE_INFERENCE_ENGINE=&lt;<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">mistral-inference-server</span>-docker-image&gt;
DOCKER_IMAGE_MISTRAL_UTILS=<span style="background-color: rgba(248, 248, 242, 0.2); font-family: inherit; font-size: inherit; font-weight: inherit;">&lt;</span><span style="font-family: inherit; font-size: inherit; font-weight: inherit; background-color: initial;">mistral-utils</span><span style="background-color: rgba(248, 248, 242, 0.2); font-family: inherit; font-size: inherit; font-weight: inherit;">-docker-image&gt;</span></code></code></pre>



<ul class="wp-block-list">
<li>Then, create a script to load these environment variables easily. Name it <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">load_env.sh</mark></strong></code>:</li>
</ul>



<pre class="wp-block-code"><code class="">#!/bin/bash

# Vérifie si le fichier .env existe
if [ ! -f .env ]; then
  echo "Error: .env not found"
  exit 1
fi

# Exporter toutes les variables du .env
export $(grep -v '^#' .env | xargs)

echo "Environment variables are loaded from .env"</code></pre>



<ul class="wp-block-list">
<li>Now, run this script:</li>
</ul>



<pre class="wp-block-code"><code class="">source load_env.sh</code></pre>



<p>✅ You have everything you need to start the implementation!</p>



<h3 class="wp-block-heading">Step 5 &#8211; Download Mistral Large model weights</h3>



<p>The aim here is to download the model and its artefacts into the S3 bucket created earlier.</p>



<p>To achieve this, you can launch a download job that will run automatically with AI Training.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong> 💡 Here&#8217;s a tip! </strong></p>
<cite><strong>Note that here you are not using AI Training to train models, but as an easy-to-use Container as a Service solution. With a single command line, you can launch a one-shot download of the Mistral Large model with automatic synchronisation to Object Storage.</strong></cite></blockquote>



<ul class="wp-block-list">
<li>Launch the <strong>AI Training</strong> download job by attaching the object container:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai job run --name DOWNLOAD_MISTRAL_LARGE_123B \
              --cpu 12 \
              --volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW \
              -e RECIPES_VERSION=$RECIPES_VERSION \
              $<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">DOCKER_IMAGE_MISTRAL_UTILS</span> \
                -- bash -c "cd /app/mistral-rclone &amp;&amp; \ 
                  poetry run python mistral-rclone.py \
                  --license-key $LICENSE_KEY \
                  --download-model $SERVED_MODEL"</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai job run</code></li>
</ul>



<p>This is the core command to&nbsp;<strong>run a job</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Training</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code>--name DOWNLOAD_MISTRAL_LARGE_123B</code></li>
</ul>



<p>Sets a&nbsp;<strong>custom name</strong>&nbsp;for the job. For example,&nbsp;<code>DOWNLOAD_MISTRAL_LARGE_123B</code>.</p>



<ul class="wp-block-list">
<li><code>--cpu&nbsp;12</code></li>
</ul>



<p>Allocates&nbsp;<strong>12 CPU</strong>&nbsp;for the job.</p>



<ul class="wp-block-list">
<li><code>--volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW</code></li>
</ul>



<p>This mounts your&nbsp;<strong>OVHcloud Object Storage volume</strong>&nbsp;into the job’s file system:<br>–&nbsp;<code>s3-mistral-large-model@&lt;ALIAS&gt;/</code>: refers to your&nbsp;<strong>S3 bucket volume</strong>&nbsp;from the OVHcloud Object Storage<br>–&nbsp;<code>:/opt/ml/model</code>: mounts the volume into the container under&nbsp;<code>/opt/ml/model</code><br>–&nbsp;<code>RW</code>: enables&nbsp;<strong>Read/Write</strong>&nbsp;permissions</p>



<ul class="wp-block-list">
<li><code>-e RECIPES_VERSION=$RECIPES_VERSION</code></li>
</ul>



<p>This passes one of the <strong>environment variables</strong>&nbsp;defined previously.</p>



<ul class="wp-block-list">
<li><code>$DOCKER_IMAGE_MISTRAL_UTILS</code></li>
</ul>



<p>This is the<strong>&nbsp;Mistral Large utils Docker image</strong>&nbsp;you are running inside the job.</p>



<ul class="wp-block-list">
<li><code>-- bash -c "cd /app/mistral-rclone &amp;&amp; \</code><br><code>               poetry run python mistral-rclone.py \</code><br><code>                   --license-key $LICENSE_KEY \</code><br><code>                   --download-model $SERVED_MODEL"</code></li>
</ul>



<p>Refers to the specific command to <strong>launch the model download</strong>.</p>



<p><em>Note that synchronisation with Object Storage will be <strong>automatic at the end of the AI Training job</strong>.</em></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>⚠️ <strong>WARNING!</strong></p>
<cite><strong>Wait for the job to go to <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">DONE</mark></code> before proceeding to the next step</strong>.</cite></blockquote>
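


<p>While waiting, you can keep an eye on the job; the subcommands below mirror the <code>ovhai app get</code> call used elsewhere in this guide:</p>



<pre class="wp-block-code"><code class=""># Replace &lt;job_id&gt; with the ID returned by `ovhai job run`
ovhai job get &lt;job_id&gt;     # shows the current state (QUEUED, RUNNING, DONE...)
ovhai job logs &lt;job_id&gt;    # prints the download logs</code></pre>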



<ul class="wp-block-list">
<li>Check that the various elements are present in the bucket:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai bucket object list s3-mistral-large-model@&lt;ALIAS&gt;</code></pre>



<p>The bucket must be organized and split into 4 different folders:</p>



<ul class="wp-block-list">
<li>grammars</li>



<li>recipes</li>



<li>tokenizers</li>



<li>weights</li>
</ul>



<p>Note that a total of 6 elements must be present.</p>



<p>🚀 Is it all there? Then let&#8217;s move on to the <strong>deployment of the Mistral Large model</strong>!</p>



<h3 class="wp-block-heading">Step 6 &#8211; Deploy Mistral Large service</h3>



<p>To deploy the Mistral Large 123B model using the previously downloaded weights, you will use OVHcloud&#8217;s <strong>AI Deploy </strong>product.</p>



<p>But first, you need to create an API key that will allow you to consume and query the model, in particular through its OpenAI compatibility.</p>



<ul class="wp-block-list">
<li>Creation of an access token:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai token create --role read mistral_large=api_key_reader</code></pre>



<ul class="wp-block-list">
<li>Export this token as an environment variable:</li>
</ul>



<pre class="wp-block-code"><code class="">export MY_OVHAI_MISTRAL_LARGE_TOKEN=&lt;your_ovh_access_token_value&gt;</code></pre>



<ul class="wp-block-list">
<li>Launch the <strong>Mistral Large service</strong> with <strong>AI Deploy </strong>by running the following command:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai app run --name DEPLOY_MISTRAL_LARGE_123B \
              --gpu 4 \
              --flavor h100-1-gpu \
              --default-http-port 5000 \
              --label mistral_large=api_key_reader \
              -e SERVED_MODEL=$SERVED_MODEL \
              -e RECIPES_VERSION=$RECIPES_VERSION \
              -e TP_SIZE=$TP_SIZE \
              --volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW \
              --volume standalone:/tmp:RW \
              --volume standalone:/workspace:RW \
              $<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">DOCKER_IMAGE_INFERENCE_ENGINE</span></code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai app run</code></li>
</ul>



<p>This is the core command to&nbsp;<strong>run an app / API</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code>--name DEPLOY_MISTRAL_LARGE_123B</code></li>
</ul>



<p>Sets a&nbsp;<strong>custom name</strong>&nbsp;for the app. For example,&nbsp;<code>DEPLOY_MISTRAL_LARGE_123B</code>.</p>



<ul class="wp-block-list">
<li><code>--default-http-port 5000</code></li>
</ul>



<p>Exposes&nbsp;<strong>port 5000</strong>&nbsp;as the default HTTP endpoint.</p>



<ul class="wp-block-list">
<li><code>--gpu 4</code></li>
</ul>



<p>Allocates&nbsp;<strong>4 GPUs</strong>&nbsp;for the app.</p>



<ul class="wp-block-list">
<li><code>--flavor h100-1-gpu</code></li>
</ul>



<p>Chooses&nbsp;<strong>H100 GPUs</strong>&nbsp;for the app.</p>



<ul class="wp-block-list">
<li><code>--volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW</code></li>
</ul>



<p>This mounts your&nbsp;<strong>OVHcloud Object Storage volume</strong>&nbsp;into the app’s file system:<br>–&nbsp;<code>s3-mistral-large-model@&lt;ALIAS&gt;/</code>: refers to your&nbsp;<strong>S3 bucket volume</strong>&nbsp;from the OVHcloud Object Storage<br>–&nbsp;<code>:/opt/ml/model</code>: mounts the volume into the container under&nbsp;<code>/opt/ml/model</code><br>–&nbsp;<code>RW</code>: enables&nbsp;<strong>Read/Write</strong>&nbsp;permissions</p>



<ul class="wp-block-list">
<li><code>--label mistral_large=api_key_reader</code></li>
</ul>



<p>Means that access to the app is restricted to your token.</p>



<ul class="wp-block-list">
<li><code>-e SERVED_MODEL=$SERVED_MODEL</code></li>



<li><code>-e RECIPES_VERSION=$RECIPES_VERSION</code></li>



<li><code>-e TP_SIZE=$TP_SIZE</code></li>
</ul>



<p>These are&nbsp;<strong>environment variables</strong>&nbsp;defined previously.</p>



<ul class="wp-block-list">
<li><code>--volume standalone:/tmp:RW</code></li>



<li><code>--volume standalone:/workspace:RW</code></li>
</ul>



<p>Mounts&nbsp;<strong>two persistent storage volumes</strong>:<br>&#8211; <code>/tmp</code>&nbsp;→ Temporary files<br>&#8211; <code>/workspace</code>&nbsp;→ Main working directory</p>



<ul class="wp-block-list">
<li><code>$<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">DOCKER_IMAGE_INFERENCE_ENGINE</span></code></li>
</ul>



<p>This is the<strong>&nbsp;Mistral Large inference Docker image</strong>&nbsp;you are running inside the app.</p>



<p><em>It may take a few minutes for the resources to be allocated and for the <strong>Docker image</strong> to be pulled.</em></p>



<p>To check the progress and get additional information about the <strong>AI Deploy app</strong>, run the following command:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;ai_deploy_mistral_app_id&gt;</code></pre>



<p>Once the app is in <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></code></strong> status, the model will start loading. To check that the load was successful, you can check the container logs:</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;ai_deploy_mistral_app_id&gt;</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>⚠️ <strong>WARNING!</strong></p>
<cite><strong>To consume the service, you must wait for the app to go into <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></code> status, AND for the model to finish loading.</strong></cite></blockquote>
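


<p><em>Since the endpoint is OpenAI-compatible, one hedged way to detect that the model has finished loading is to poll the <code>/v1/models</code> route until it answers, as in the sketch below (the app URL is a placeholder to adapt).</em></p>



<pre class="wp-block-code"><code class=""># Minimal sketch: wait until the OpenAI-compatible /v1/models route answers,
# which is an assumed signal that the model has finished loading.
import os
import time

import requests

BASE_URL = "https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net"  # placeholder
TOKEN = os.environ["MY_OVHAI_MISTRAL_LARGE_TOKEN"]

while True:
    try:
        r = requests.get(
            f"{BASE_URL}/v1/models",
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        if r.status_code == 200:
            print("Model ready:", r.json())
            break
        print("Not ready yet, HTTP status:", r.status_code)
    except requests.RequestException as exc:
        print("Not reachable yet:", exc)
    time.sleep(30)</code></pre>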



<p>🎉 Everything ready? Then it&#8217;s time to start playing with the model!</p>



<h3 class="wp-block-heading">Step 7 &#8211; Test the Mistral Large model by sending your first requests</h3>



<ul class="wp-block-list">
<li>Access the API doc via your app URL:</li>
</ul>



<p><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code><strong>https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/docs</strong></code></mark></p>



<p>To find your license information, please refer to <a href="https://console.mistral.ai/on-premise/licenses" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://console.mistral.ai/on-premise/licenses</mark></strong></a></p>



<ul class="wp-block-list">
<li>Test with a basic cURL:</li>
</ul>



<pre class="wp-block-code"><code class="">curl -X 'POST' \
'https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $MY_OVHAI_MISTRAL_LARGE_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "mistral-large-&lt;version&gt;",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant!"
    },
    {
      "role": "user",
      "content": "What is the capital of France?"     
    }
  ]
}'</code></pre>



<p><strong>⚠️ Note that you also have to replace <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>&lt;version&gt;</code></mark> in the model name with the one you are using: </strong><br><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code><strong>"model": "mistral-large-&lt;version&gt;"</strong></code></mark></p>
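


<p><em>Since the endpoint is OpenAI-compatible, you can send the same request from Python with the official <code>openai</code> client. A minimal sketch (the app URL, model version and token are the same placeholders as above):</em></p>



<pre class="wp-block-code"><code class=""># Minimal sketch using the official `openai` client (v1+) against the
# OpenAI-compatible endpoint; URL and model version are placeholders.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/v1",
    api_key=os.environ["MY_OVHAI_MISTRAL_LARGE_TOKEN"],
)

response = client.chat.completions.create(
    model="mistral-large-&lt;version&gt;",
    messages=[
        {"role": "system", "content": "You are a helpful assistant!"},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)</code></pre>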



<p>To take the implementation a step further and take advantage of all the features of this endpoint, you can also integrate it with <strong>LangChain</strong> thanks to its full OpenAI compatibility.</p>



<ul class="wp-block-list">
<li>LangChain integration:</li>
</ul>



<pre class="wp-block-code"><code class="">import time
import os 
from langchain.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

def chat_completion_basic(new_message: str):

  model = ChatOpenAI(model_name="mistral-large-&lt;version&gt;",
                        openai_api_key=$MY_OVHAI_MISTRAL_LARGE_TOKEN,
                        openai_api_base='https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/v1',
                       )

  prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant!"),
    ("human", "{question}"),
  ])

  chain = prompt | model

  print("🤖: ")
  for r in chain.stream({"question", new_message}):
    print(r.content, end="", flush=True)
    time.sleep(0.150)

chat_completion_basic("What is the capital of France?)</code></pre>



<p>🥹 Congratulations! You have successfully completed the deployment!</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>You can now consume your <strong>Mistral Large 123B</strong> model in a secure environment!</p>



<p>The result of your implementation? The deployment of a sovereign, scalable, production-quality 123B LLM, powered by <strong>OVHcloud AI Deploy</strong>.</p>



<p>➡️ <strong>To go further? </strong></p>



<ul class="wp-block-list">
<li>Update your model in a single command line and without interruption following this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-update-custom-docker-image?id=kb_article_view&amp;sysparm_article=KB0057968" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a></li>



<li>Scale out to additional replicas in the event of a heavy load to ensure high availability using this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-apps-deployments?id=kb_article_view&amp;sysparm_article=KB0047997" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">method</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploy-mistral-large-model-in-sovereign-environment-ovhcloud%2F&amp;action_name=Reference%20Architecture%3A%C2%A0deploying%20the%20Mistral%20Large%20123B%20model%20in%20a%20sovereign%20environment%20with%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: set up MLflow Remote Tracking Server on OVHcloud</title>
		<link>https://blog.ovhcloud.com/mlflow-remote-tracking-server-ovhcloud-databases-object-storage-ai-solutions/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 15 Apr 2025 07:52:46 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebooks]]></category>
		<category><![CDATA[AI Training]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Managed Database]]></category>
		<category><![CDATA[MLflow]]></category>
		<category><![CDATA[Object Storage]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28564</guid>

					<description><![CDATA[Travel through the Data &#38; AI universe of OVHcloud with the MLflow integration. As Artificial Intelligence (AI) continues to grow in importance, Data Scientists and Machine Learning Engineers need a robust and scalable platform to manage the entire Machine Learning (ML) lifecycle. MLflow, an open-source platform, provides a comprehensive framework for managing ML experiments, models, [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmlflow-remote-tracking-server-ovhcloud-databases-object-storage-ai-solutions%2F&amp;action_name=Reference%20Architecture%3A%20set%20up%20MLflow%20Remote%20Tracking%20Server%20on%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Travel through the Data &amp; AI universe of OVHcloud with the <em>MLflow</em> integration.</em></p>



<figure class="wp-block-image aligncenter size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/mlflow_ref_archi.svg" alt="" class="wp-image-28689"/><figcaption class="wp-element-caption"><em>Mlflow Remote Tracking Server on OVHcloud</em></figcaption></figure>



<p>As <strong>Artificial Intelligence</strong> (AI) continues to grow in importance, <em>Data Scientists</em> and <em>Machine Learning Engineers</em> need a robust and scalable platform to manage the entire Machine Learning (ML) lifecycle. <br><a href="https://mlflow.org/docs/latest/introduction/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MLflow</a>, an open-source platform, provides a comprehensive framework for managing ML experiments, models, and deployments. </p>



<p><strong>MLflow</strong> offers many benefits and provides a complete framework for ML lifecycle management with features such as:</p>



<ul class="wp-block-list">
<li>Experiment tracking and model management</li>



<li>Reproducibility and collaboration</li>



<li>Scalability, flexibility, and integration</li>



<li>Automated ML and model serving capabilities</li>



<li>Improved model accuracy, faster time-to-market, and reduced costs.</li>
</ul>



<p>In this reference architecture, you will explore how to leverage remote experiment tracking with the <strong>MLflow Tracking Server</strong> on the <a href="https://www.ovhcloud.com/fr/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Public Cloud</a> infrastructure.<br>In fact, you will be able to build a scalable and efficient ML platform, streamlining your ML workflow and accelerating model development using <strong>OVHcloud AI Notebooks</strong>, <strong>AI Training</strong>, <strong>Managed Databases (PostgreSQL)</strong>, and <strong>Object Storage</strong>.</p>



<p><strong>The result?</strong> A fully remote, <strong>production-ready ML experiment tracking pipeline</strong>, powered by OVHcloud&#8217;s Data &amp; Machine Learning Services (e.g. AI Notebooks and AI Training).</p>



<h2 class="wp-block-heading">Overview of the MLflow server architecture</h2>



<p>Here is how MLflow will be configured:</p>



<ul class="wp-block-list">
<li><strong>Development and training environment:</strong> create and train models with <strong>AI Notebooks</strong></li>



<li><strong>Remote Tracking Server</strong>: hosted in an <strong>AI Training</strong> job (Container as a Service)</li>



<li><strong>Backend Store</strong>: benefit from a managed <strong>PostgreSQL</strong> database (DBaaS).</li>



<li><strong>Artifact Store</strong>: use OVHcloud <strong>Object Storage</strong> (S3-compatible).</li>
</ul>



<figure class="wp-block-image aligncenter size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/mlflow_overview.svg" alt="" class="wp-image-28688"/><figcaption class="wp-element-caption"><em>MLflow remote server deployment steps</em></figcaption></figure>



<p>In the following tutorial, all services are deployed within the <strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An <strong>OVHcloud Public Cloud</strong> account</li>



<li>An <strong>OpenStack user</strong> with the following roles:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Training Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>
</ul>



<p><strong>🚀 Having all the ingredients for our recipe, it’s time to set up your MLflow remote tracking server!</strong></p>



<h2 class="wp-block-heading">Architecture guide: MLflow remote tracking server</h2>



<p>Let’s set up and deploy your custom MLflow tracking tool!</p>



<p>⚙️<em> Also consider that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h4 class="wp-block-heading">Step 1 – Install <code>ovhai</code> CLI</h4>



<p>Firstly, start by setting up your CLI environment.</p>



<pre class="wp-block-code"><code class="">curl https://cli.gra.ai.cloud.ovh.net/install.sh | bash</code></pre>



<p>Secondly, login using your <strong>OpenStack credentials</strong>.</p>



<pre class="wp-block-code"><code class="">ovhai login -u &lt;openstack-username&gt; -p &lt;openstack-password&gt;</code></pre>



<p>Now, it&#8217;s time to create your bucket inside OVHcloud Object Storage!</p>



<h4 class="wp-block-heading">Step 2 – Provision Object Storage (Artifact Store)</h4>



<ol class="wp-block-list">
<li>Go to <strong>Public Cloud &gt; Storage &gt; Object Storage</strong> in the OVHcloud Control Panel.</li>



<li>Create a <strong>datastore</strong> and a new <strong>S3 bucket</strong> (e.g., <code>mlflow-s3-bucket</code>).</li>



<li>Register the datastore with the <code>ovhai</code> CLI:</li>
</ol>



<pre class="wp-block-code"><code class="">ovhai datastore add s3 &lt;ALIAS&gt; https://s3.gra.io.cloud.ovh.net/ gra &lt;my-access-key&gt; &lt;my-secret-key&gt; --store-credentials-locally</code></pre>



<h4 class="wp-block-heading">Step 3 – Create PostgreSQL Managed DB (Backend Store)</h4>



<p>1. Navigate to <strong>Databases &amp; Analytics &gt; Databases</strong></p>



<p><strong>2. Create a new <em>PostgreSQL</em> instance with <em>Essential plan</em></strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="627" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-1024x627.png" alt="" class="wp-image-28580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-1024x627.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-300x184.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-768x470.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-1536x941.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-13-2048x1254.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>3. Select <em>Location</em> and <em>Node type</em></strong></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="661" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-1024x661.png" alt="" class="wp-image-28581" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-1024x661.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-300x194.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-768x495.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-1536x991.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-14-2048x1321.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>4. Reset the user password</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="2384" height="1340" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited.png" alt="" class="wp-image-28590" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited.png 2384w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-1536x863.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-15-edited-2048x1151.png 2048w" sizes="auto, (max-width: 2384px) 100vw, 2384px" /></figure>



<p><strong>5. Take note of the following parameters</strong></p>



<p>Go to your database dashboard:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="640" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-1024x640.png" alt="" class="wp-image-28583" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-1024x640.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-768x480.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-1536x960.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-16-2048x1280.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, copy the <strong>connection information</strong>:</p>



<pre class="wp-block-code"><code class="">&lt;db_hostname&gt;
&lt;db_username&gt;
&lt;db_password&gt;
&lt;db_name&gt;
&lt;db_port&gt;
&lt;ssl_mode&gt;</code></pre>



<p>Your <strong>Backend Store</strong> is now ready to use!</p>



<h4 class="wp-block-heading">Step 4 -Build you custom MLflow Docker image and </h4>



<p><strong>1. Develop the MLflow launch script</strong></p>



<p>Firstly, write a Bash script that launches the server: <strong><em>mlflow_server.sh</em></strong></p>



<pre class="wp-block-code"><code class="">echo "The MLflow server is starting..."

mlflow server \
  --backend-store-uri postgresql://${POSTGRE_USER}:${POSTGRE_PASSWORD}@${PG_HOST}:${PG_PORT}/${PG_DB}?sslmode=${SSL_MODE} \
  --default-artifact-root ${S3_BUCKET_NAME}/ \
  --host 0.0.0.0 \
  --port 5000</code></pre>



<p><strong>2. Create Dockerfile</strong></p>



<p>Install the required Python dependency and grant ownership of the <strong>/mlruns</strong> path to the OVHcloud user.</p>



<pre class="wp-block-code"><code class="">FROM ghcr.io/mlflow/mlflow:latest

# Install Python dependencies
RUN pip install psycopg2-binary

COPY mlflow_server.sh .

# Change the ownership of `mlruns` directory to the OVHcloud user (42420:42420)
RUN mkdir -p /mlruns
RUN chown -R 42420:42420 /mlruns

# Start MLflow server inside container
CMD ["bash", "mlflow_server.sh"]</code></pre>



<p><strong>3. Build your custom MLflow docker image</strong></p>



<p>Build the docker image using the previous Dockerfile.</p>



<pre class="wp-block-code"><code class="">docker build . -t mlflow-server-ai-training:latest</code></pre>



<p><strong>4. Tag and push the docker image to your registry</strong></p>



<p>Finally, you can push the Docker image to your registry.</p>



<pre class="wp-block-code"><code class="">docker tag mlflow-server-ai-training:latest &lt;your-registry-address&gt;/mlflow-server-ai-training:latest</code></pre>



<pre class="wp-block-code"><code class="">docker push &lt;your-registry-address&gt;/mlflow-server-ai-training:latest</code></pre>



<p>Congrats! You can now use the Docker image to launch the MLflow server.</p>



<h4 class="wp-block-heading">Step 5 &#8211; Start MLflow Tracking Server inside container</h4>



<p>You can use AI Training to start the MLflow server inside a job.</p>



<p><strong>1. Using <code>ovhai</code> CLI, run the following command inside terminal</strong></p>



<pre class="wp-block-code"><code class="">ovhai job run --name mlflow-server \
              --default-http-port 5000 \
              --cpu 4 \
              -v mlflow-s3-bucket@DEMO/:/artifacts:RW:cache \
              -e POSTGRE_USER=avnadmin \
              -e POSTGRE_PASSWORD=&lt;db_password&gt; \
              -e S3_ENDPOINT=https://s3.gra.io.cloud.ovh.net/ \
              -e S3_BUCKET_NAME=mlflow-s3-bucket \
              -e PG_HOST=&lt;db_hostname&gt; \
              -e PG_DB=defaultdb \
              -e PG_PORT=20184 \
              -e SSL_MODE=require \
              &lt;your_registry_address&gt;/mlflow-server-ai-training:latest</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai job run</code></li>
</ul>



<p>This is the core command to <strong>run a job</strong> using the <strong>OVHcloud AI Training</strong> platform.</p>



<ul class="wp-block-list">
<li><code>--name mlflow-server</code></li>
</ul>



<p>Sets a <strong>custom name</strong> for the job. For example, <code>mlflow-server</code>.</p>



<ul class="wp-block-list">
<li><code>--default-http-port 5000</code></li>
</ul>



<p>Exposes <strong>port 5000</strong> as the default HTTP endpoint. MLflow’s web UI typically runs on port 5000, so this ensures the UI is accessible once the job is running.</p>



<ul class="wp-block-list">
<li><code>--cpu 4</code></li>
</ul>



<p>Allocates <strong>4 CPUs</strong> for the job. You can adjust this based on how heavy your MLflow workload is.</p>



<ul class="wp-block-list">
<li><code>-v mlflow-s3-bucket@DEMO/:/artifacts:RW:cache</code></li>
</ul>



<p>This mounts your <strong>OVHcloud Object Storage volume</strong> into the job’s file system:<br>&#8211; <code>mlflow-s3-bucket@DEMO/</code>: refers to your <strong>S3 bucket volume</strong> from the OVHcloud Object Storage<br>&#8211; <code>:/artifacts</code>: mounts the volume into the container under <code>/artifacts</code><br>&#8211; <code>RW</code>: enables <strong>Read/Write</strong> permissions<br>&#8211; <code>cache</code>: enables <strong>volume caching</strong>, improving performance for frequent reads/writes</p>



<ul class="wp-block-list">
<li><code>-e POSTGRE_USER=avnadmin</code></li>



<li><code>-e POSTGRE_PASSWORD=&lt;db_password&gt;</code></li>



<li><code>-e PG_HOST=&lt;db_hostname&gt;</code></li>



<li><code>-e PG_DB=defaultdb</code></li>



<li><code>-e PG_PORT=20184</code></li>



<li><code>-e SSL_MODE=require</code></li>
</ul>



<p>These are <strong>environment variables</strong> for connecting to the <strong>PostgreSQL </strong>backend store:<br>&#8211; <code>avnadmin</code>: the default admin user for OVHcloud’s managed PostgreSQL<br>&#8211; <code>POSTGRE_PASSWORD</code>: must be replaced with your actual database password<br>&#8211; <code>PG_HOST</code>: the hostname of your managed PostgreSQL instance<br>&#8211; <code>PG_DB</code>: the name of the database to use (default: <code>defaultdb</code>)<br>&#8211; <code>PG_PORT</code>: the port your PostgreSQL server is listening on<br>&#8211; <code>SSL_MODE</code>: enforce SSL connection to secure DB traffic</p>



<ul class="wp-block-list">
<li><code>-e S3_ENDPOINT=https://s3.gra.io.cloud.ovh.net/</code></li>
</ul>



<p>Tells MLflow where the <strong>S3-compatible endpoint</strong> is hosted. This is specific to OVHcloud&#8217;s GRA (Gravelines) region Object Storage.</p>



<ul class="wp-block-list">
<li><code>-e S3_BUCKET_NAME=mlflow-s3-bucket</code></li>
</ul>



<p>Sets the <strong>name of the S3 bucket</strong> where MLflow should store artifacts (models, metrics, etc.).</p>



<ul class="wp-block-list">
<li><code>&lt;your_registry_address&gt;/mlflow-server-ai-training:latest</code></li>
</ul>



<p>This is the<strong> custom MLflow Docker image</strong> you are running inside the job.</p>



<p><strong>2. Check if your AI Training job is RUNNING</strong></p>



<p>Replace the <code>&lt;job_id&gt;</code> by yours.</p>



<pre class="wp-block-code"><code class="">ovhai job get &lt;job_id&gt;</code></pre>



<p>You should obtain:</p>



<p><code>History:<br>    DATE                  STATE<br>    04-04-25 09:58:00     QUEUED<br>    04-04-25 09:58:01     INITIALIZING<br>    04-04-25 09:58:07     PENDING<br>    04-04-25 09:58:10     <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></strong><br>  Info:<br>    Message:   Job is running</code></p>



<p><strong>3. Retrieve the IP and external IP of your AI Training job</strong></p>



<p>Using your <code>&lt;job_id&gt;</code>, you can retrieve your AI Training <strong>job IP</strong>.</p>



<pre class="wp-block-code"><code class="">ovhai job get &lt;job_id&gt; -o json | jq '.status.ip' -r</code></pre>



<p>For example, you may obtain something like this: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong></p>



<p>You also need the External IP:</p>



<pre class="wp-block-code"><code class="">ovhai job get &lt;job_id&gt; -o json | jq '.status.externalIp' -r</code></pre>



<p>Returning the IP address you will have to whitelist to be able to connect to your database (e.g. <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>51.210.38.188</strong></mark>)</p>



<h4 class="wp-block-heading">Step 6 – Whitelist AI Training job IP in PostgreSQL DB</h4>



<p>From <strong>Databases &amp; Analytics &gt; Databases</strong>, edit your DB configuration to <strong>allow access from the job External IP</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="475" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-1024x475.png" alt="" class="wp-image-28593" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-1024x475.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-300x139.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-768x356.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-1536x712.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-19-2048x950.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, you can see that the job External IP is now whitelisted.</p>



<p>Well done! Your MLflow server and the backend store are now connected.</p>
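


<p><em>As an optional sanity check, you can query the tracking server from any machine that can reach it, for example with the minimal sketch below (the job URL is a placeholder; <code>mlflow.search_experiments()</code> requires MLflow 2.x).</em></p>



<pre class="wp-block-code"><code class=""># Minimal sketch: verify the tracking server answers by listing experiments.
# The job URL is a placeholder; requires `pip install mlflow` (2.x) locally.
import mlflow

mlflow.set_tracking_uri("https://&lt;job_id&gt;.job.gra.ai.cloud.ovh.net")

# A fresh server should at least return the "Default" experiment.
for exp in mlflow.search_experiments():
    print(exp.experiment_id, exp.name)</code></pre>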



<h4 class="wp-block-heading">Step 7 –  Create an AI Notebook</h4>



<p>It&#8217;s time to train and track your Machine Learning models using MLflow!</p>



<p>To do so, use the OVHcloud <code>ovhai</code> CLI and start a new AI Notebook with GPU.</p>



<pre class="wp-block-code"><code class="">ovhai notebook run conda jupyterlab \
  --name mlflow-notebook \
  --framework-version conda-py311-cudaDevel11.8 \
  --gpu 1</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai notebook run</code></li>
</ul>



<p>This is the core command to <strong>run a notebook</strong> using the <strong>OVHcloud AI Notebooks</strong> platform.</p>



<ul class="wp-block-list">
<li><code>--name mlflow-notebook</code></li>
</ul>



<p>Sets a <strong>custom name</strong> for the notebook. In this case, you can name it <code>mlflow-notebook</code>.</p>



<ul class="wp-block-list">
<li><code>--framework-version conda-py311-cudaDevel11.8</code></li>
</ul>



<p>Defines the framework and version you want to use in your notebook. Here, you are using Python 3.11 with the Conda framework and CUDA compatibility.</p>



<ul class="wp-block-list">
<li><code>--gpu 1</code></li>
</ul>



<p>Allocates <strong>1 GPU</strong> for the job, by default a <strong>Tesla V100S</strong> from NVIDIA (<code>ai1-1-gpu</code>). You can select the flavor you want from the OVHcloud GPU range.</p>



<p>Then, check if your AI Notebook is RUNNING.</p>



<pre class="wp-block-code"><code class="">ovhai notebook get &lt;notebook_id&gt;</code></pre>



<p>Once your notebook is in RUNNING status, you should be able to access it using its URL:</p>



<p><code>State:          <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></strong><br>Duration:       1411412   <br>Url:            <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://&lt;notebook_id&gt;.notebook.gra.ai.cloud.ovh.net</mark></strong><br>Grpc Address:   &lt;notebook_id&gt;.nb-grpc.gra.ai.cloud.ovh.net:443<br>Info Url:       https://ui.gra.ai.cloud.ovh.net/notebook/&lt;notebook_id&gt;</code></p>



<p>You can start your AI model development inside the notebook.</p>



<h4 class="wp-block-heading">Step 8 – Model training inside Jupyter notebook</h4>



<p>To begin with, set up your notebook environment.</p>



<p><strong>1. Create the <code>requirements.txt</code> file</strong></p>



<pre class="wp-block-code"><code class="">numpy==2.2.3
scipy==1.15.2
mlflow==2.20.3
scikit-learn==1.6.1</code></pre>



<p><strong>2. Install dependencies</strong></p>



<p>From a notebook cell, launch the following command.</p>



<pre class="wp-block-code"><code class="">!pip3 install -r requirements.txt</code></pre>



<p>Perfect! You can start coding&#8230;</p>



<p><strong>3. Import Python libraries</strong></p>



<p>Here, you have to import os, mlflow and scikit-learn.</p>



<pre class="wp-block-code"><code class=""># import dependencies
import os
import mlflow
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor</code></pre>



<p>In another notebook cell, set the MLflow tracking URI. Note that you have to replace <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong> by your own <strong>job IP</strong>.</p>



<pre class="wp-block-code"><code class="">mlflow.set_tracking_uri("http://<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong>:5000")</code></pre>



<p>Then start training your model!</p>



<pre class="wp-block-code"><code class="">mlflow.autolog()

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

# Create and train models.
rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
rf.fit(X_train, y_train)

# Use the model to make predictions on the test dataset.
predictions = rf.predict(X_test)</code></pre>



<p><strong>Output:</strong></p>



<p><code>🏃 View run dashing-foal-850 at: http://<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong>:5000/#/experiments/0/runs/e7dad7c073634ec28675c0defce2b9ec </code><br><code>🧪 View experiment at: http://<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">10.42.80.176</mark></strong>:5000/#/experiments/0</code></p>



<p>Congrats! You can now track your model training from the <strong>MLflow remote server</strong>&#8230;</p>
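


<p><em>Beyond <code>mlflow.autolog()</code>, you can also log parameters and metrics explicitly. A minimal sketch, reusing the tracking URI set above and the <code>y_test</code> and <code>predictions</code> variables from the previous cell (the run, parameter and metric names are illustrative):</em></p>



<pre class="wp-block-code"><code class=""># Minimal sketch: explicit logging alongside autolog, reusing the tracking
# URI set above. Run, parameter and metric names are illustrative.
import mlflow
from sklearn.metrics import mean_squared_error

with mlflow.start_run(run_name="rf-manual-logging"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("test_mse", mean_squared_error(y_test, predictions))</code></pre>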



<h4 class="wp-block-heading">Step 9 – Track and compare models from MLflow remote server</h4>



<p>Finally, access the MLflow dashboard using the job URL: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://&lt;job_id&gt;.job.gra.ai.cloud.ovh.net</code></mark></strong></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-1024x578.png" alt="" class="wp-image-28598" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-1536x867.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-23-2048x1155.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, you can check your model trainings and evaluations:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="577" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-1024x577.png" alt="" class="wp-image-28599" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-1024x577.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-1536x866.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/image-24-2048x1154.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>What a success! You can finally use your MLflow server to evaluate, compare and archive your various training runs.</p>



<h4 class="wp-block-heading">Step 10 &#8211; Monitor everything remotely</h4>



<p>You now have a complete Machine Learning pipeline with remote experiment tracking. Access:</p>



<ul class="wp-block-list">
<li><strong>Metrics, Parameters, and Tags</strong> → PostgreSQL</li>



<li><strong>Artifacts (Models, Files)</strong> → S3 bucket</li>
</ul>



<p>This setup is reusable, automatable, and production-ready!</p>



<h2 class="wp-block-heading">What’s next?</h2>



<ul class="wp-block-list">
<li>Automate deployment with <strong><a href="https://eu.api.ovh.com/" data-wpel-link="exclude">OVHcloud APIs</a></strong></li>



<li>Run different training sessions in parallel and compare them with your <strong>remote MLflow tracking server</strong></li>



<li>Use <strong><a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a></strong> to serve your trained models</li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmlflow-remote-tracking-server-ovhcloud-databases-object-storage-ai-solutions%2F&amp;action_name=Reference%20Architecture%3A%20set%20up%20MLflow%20Remote%20Tracking%20Server%20on%20OVHcloud&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Mistral Small 24B served with vLLM and AI Deploy &#8211; a single command to deploy an LLM (Part 1)</title>
		<link>https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Mon, 24 Feb 2025 10:08:37 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28212</guid>

					<description><![CDATA[You are not dreaming! You can deploy open-source LLM in a single command line. Deploying advanced language models can be a challenge! But this sometimes this arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications. In this guide, we will walk through deploying the Mistral-Small-24B-Instruct-2501 model using vLLM [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm%2F&amp;action_name=Mistral%20Small%2024B%20served%20with%20vLLM%20and%20AI%20Deploy%20%26%238211%3B%20a%20single%20command%20to%20deploy%20an%20LLM%20%28Part%201%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><strong><em>You are not dreaming! You can deploy open-source LLM in a single command line</em>.</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="724" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1024x724.png" alt="Rocket in MistralAI colors in a data center with a French rooster showing rapid LLM deployment" class="wp-image-28219" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1024x724.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-300x212.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-768x543.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1536x1086.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy.png 2000w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Deploying advanced language models can be a challenge! But this sometimes arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications.</p>



<p>In this guide, we will walk through deploying the <strong><a href="https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Instruct-2501</a></strong> model using <strong>vLLM</strong> on OVHcloud&#8217;s <a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy platform</a>. This combination offers a powerful solution for efficient and scalable AI model serving.</p>



<p>Deploying a model is great, but doing it quickly is even better!</p>



<p>🤯 <strong>What if a single command line was enough?</strong> That&#8217;s the challenge we&#8217;re tackling today!</p>



<h2 class="wp-block-heading">Context</h2>



<p>Before deployment, let’s take a closer look at our key technologies!</p>



<h3 class="wp-block-heading">Mistral Small</h3>



<p>The <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code> is a 24-billion-parameter instruction-fine-tuned model, renowned for its compact size and performance comparable to larger models.</p>



<p>This model, from <a href="https://mistral.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MistralAI</a>, is an instruction-fine-tuned version of the base model:&nbsp;<a href="https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Base-2501</a>.</p>



<p>To serve this model efficiently, we will utilize vLLM, an open-source library for <strong>LLM inference</strong>.</p>



<h3 class="wp-block-heading">vLLM</h3>



<p><a href="https://docs.vllm.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM</a> (<strong>Virtual LLM</strong>) is a highly optimized service engine designed to efficiently run large language models. It takes advantage of several key optimizations, such as:</p>



<ul class="wp-block-list">
<li><strong>PagedAttention:</strong> an attention mechanism that reduces memory fragmentation and enables more efficient use of GPU memory</li>



<li><strong>Continuous Batching:</strong> vLLM dynamically adjusts batch sizes in real time, ensuring that the GPU is always used efficiently, even with multiple simultaneous requests</li>



<li><strong>Tensor parallelism:</strong> enables model inference across multiple GPUs to boost performance</li>



<li><strong>Optimized kernel implementations:</strong> vLLM uses custom CUDA kernels for faster execution, reducing latency compared to traditional inference frameworks</li>
</ul>



<p>These features make vLLM one of the best choices for large models such as Mistral Small 24B, enabling low-latency, high-throughput inference on the latest GPUs.</p>



<p>By deploying on OVHcloud&#8217;s AI Deploy platform, you can deploy this model in a single command line.</p>



<h3 class="wp-block-heading">AI Deploy </h3>



<p>OVHcloud AI Deploy is a<strong> Container as a Service</strong> (CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications / APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.</p>



<p>The key benefits are:</p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong> bring your own custom Docker image and deploy it in a single command line or a few clicks</li>



<li><strong>High-performance computing:</strong> a complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong> supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong> billing per minute, no surcharges</li>
</ul>



<p>✅ To go further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure that you have:</p>



<ul class="wp-block-list">
<li><strong>OVHcloud account</strong>: access to the&nbsp;<a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude">OVHcloud Control Panel</a></li>



<li><strong>ovhai CLI available:</strong> install the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ovhai CLI</a></li>



<li><strong>AI Deploy access</strong>: ensure you have a <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for AI Deploy</a></li>



<li><strong>Hugging Face access</strong>: create an <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a> and generate an <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></li>



<li><strong>Gated model authorization</strong>: be sure you have been granted access to <a href="https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Instruct-2501</a> model</li>
</ul>



<p><strong>🚀 Having all the ingredients for our recipe, it&#8217;s time to deploy!</strong></p>



<h2 class="wp-block-heading">Deployment of the Mistral Small 24B LLM</h2>



<p>Let&#8217;s go for the deployment of the model <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code></p>



<h3 class="wp-block-heading">Manage access tokens</h3>



<p>Export your <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face token</a>.</p>



<pre class="wp-block-code"><code class="">export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx</code></pre>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a token</a> to access your AI Deploy app once it will be deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>Returning the following output:</p>



<p><code><strong>Id:         47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-02-25 11:53:05<br>Updated At: 20-02-25 11:53:05<br>Spec:<br>  Name:           ai_deploy_token=my_operator_token<br>  Role:           AiTrainingOperator<br>  Label Selector: <br>Status:<br>  Value:   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>  Version: 1</strong></code></p>



<p>You can now store and export your access token:</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=<span style="background-color: initial; font-family: inherit; font-size: inherit; font-weight: inherit;">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</span></code></pre>



<h3 class="wp-block-heading">Launch Mistral Small LLM with AI Deploy</h3>



<p>You are ready to start<strong> Mistral-Small-24B</strong> using vLLM and AI Deploy:</p>



<pre class="wp-block-code"><code class="">ovhai app run --name vllm-mistral-small \
              --default-http-port 8000 \
              --label ai_deploy_token=my_operator_token \
              --gpu 2 \
              --flavor l40s-1-gpu \
              -e OUTLINES_CACHE_DIR=/tmp/.outlines \
              -e HF_TOKEN=$MY_HF_TOKEN \
              -e HF_HOME=/hub \
              -e HF_DATASETS_TRUST_REMOTE_CODE=1 \
              -e HF_HUB_ENABLE_HF_TRANSFER=0 \
              -v standalone:/hub:rw \
              -v standalone:/workspace:rw \
              vllm/vllm-openai:v0.8.2 \
              -- bash -c "python3 -m vllm.entrypoints.openai.api_server \
                        --model mistralai/Mistral-Small-24B-Instruct-2501 \
                        --tensor-parallel-size 2 \
                        --tokenizer_mode mistral \
                        --load_format mistral \
                        --config_format mistral \
                        --dtype half"</code></pre>



<p><strong>How to understand the different parameters of this command?</strong></p>



<h5 class="wp-block-heading">1. Start your AI Deploy app</h5>



<p>Launch a new app using <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ovhai CLI</a> and name it.</p>



<p><code><strong>ovhai app run --name vllm-mistral-small</strong></code></p>



<h5 class="wp-block-heading">2. Define access</h5>



<p>Define the HTTP API port and restrict access to your token.</p>



<p><strong><code>--default-http-port 8000</code><br><code>--label ai_deploy_token=my_operator_token</code></strong></p>



<h5 class="wp-block-heading">3. Configure GPU resources</h5>



<p>Specifies the hardware flavor (<code><strong>l40s-1-gpu</strong></code>), which refers to an <strong>NVIDIA L40S GPU</strong>, and the number of GPUs (<code><strong>2</strong></code>).</p>



<p><code><strong>--gpu 2<br>--flavor l40s-1-gpu</strong></code></p>



<p><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️WARNING!</mark></strong> For this model, two L40S are sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access A100 and H100 GPUs for your larger models.</p>



<h5 class="wp-block-heading">4. Set up environment variables</h5>



<p>Configure caching for the <strong>Outlines library</strong> (used for efficient text generation):</p>



<p><code><strong>-e OUTLINES_CACHE_DIR=/tmp/.outlines</strong></code></p>



<p>Pass the <strong>Hugging Face token</strong> (<code>$MY_HF_TOKEN</code>) for model authentication and download:</p>



<p><code><strong>-e HF_TOKEN=$MY_HF_TOKEN</strong></code></p>



<p>Set the <strong>Hugging Face cache directory</strong> to <code>/hub</code> (where models will be stored):</p>



<p><code><strong>-e HF_HOME=/hub</strong></code></p>



<p>Allow execution of <strong>custom remote code</strong> from Hugging Face datasets (required for some model behaviors):</p>



<p><code><strong>-e HF_DATASETS_TRUST_REMOTE_CODE=1</strong></code></p>



<p>Disable <strong>Hugging Face Hub transfer acceleration</strong> (to use standard model downloading):</p>



<p><code><strong>-e HF_HUB_ENABLE_HF_TRANSFER=0</strong></code></p>



<h5 class="wp-block-heading">5. Mount persistent volumes</h5>



<p>Mounts <strong>two persistent storage volumes</strong>:</p>



<ul class="wp-block-list">
<li><code>/hub</code> → Stores Hugging Face model files</li>



<li><code>/workspace</code> → Main working directory</li>
</ul>



<p>The <code>rw</code> flag means <strong>read-write access</strong>.</p>



<p><code><strong>-v standalone:/hub:rw<br>-v standalone:/workspace:rw</strong></code></p>



<h5 class="wp-block-heading">6. Choose the target Docker image</h5>



<p>Uses the <strong><code>vllm/vllm-openai:v0.8.2</code></strong> Docker image (a pre-configured vLLM OpenAI API server).</p>



<p><strong><code>vllm/vllm-openai:v0.8.2</code></strong></p>



<h5 class="wp-block-heading">7. Running the model inside the container</h5>



<p>Runs a<strong> bash shell</strong> inside the container and executes a Python command to launch the vLLM API server:</p>



<ul class="wp-block-list">
<li><strong><code>python3 -m vllm.entrypoints.openai.api_server</code></strong> → Starts the OpenAI-compatible vLLM API server</li>



<li><strong><code>--model mistralai/Mistral-Small-24B-Instruct-2501</code></strong> → Loads the <strong>Mistral Small 24B</strong> model from Hugging Face</li>



<li><strong><code>--tensor-parallel-size 2</code></strong> → Distributes the model across <strong>2 GPUs</strong></li>



<li><strong><code>--tokenizer_mode mistral</code></strong> → Uses the <strong>Mistral tokenizer</strong></li>



<li><strong><code>--load_format mistral</code></strong> → Uses Mistral’s model loading format</li>



<li><strong><code>--config_format mistral</code></strong> → Ensures the model configuration follows Mistral&#8217;s standard</li>



<li><strong><code>--dtype half</code></strong> → Uses <strong>FP16 (half-precision floating point)</strong> for optimized GPU performance</li>
</ul>



<p>You can now check if your <strong>AI Deploy</strong> app is alive:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;your_vllm_app_id&gt;</code></pre>



<p>💡<strong>Is your app in <code>RUNNING</code> status?</strong> Perfect! You can check in the logs that the server has started&#8230;</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;your_vllm_app_id&gt;</code></pre>



<p><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️WARNING!</mark></strong> This step may take a little time as the model must be loaded&#8230;<br>After a few minutes, you should get the following information in the logs:</p>



<p><code><strong>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Started server process [13]<br>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Waiting for application startup.<br>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Application startup complete.<br>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)</strong></code></p>



<p>🚦 <strong>Are all the indicators green? </strong>Then it&#8217;s off to inference!</p>



<h3 class="wp-block-heading">Request and send prompt to the LLM</h3>



<p>Launch the following query by asking the question of your choice:</p>



<pre class="wp-block-code"><code class="">curl https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \
  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}
    ],
    "stream": false
  }'</code></pre>



<p>Returning the following result:</p>



<pre class="wp-block-code"><code class="">{
  "id":"chatcmpl-d6ea734b524bd851668e71d4111ba496",
  "object":"chat.completion",
  "created":1740059807,
  "model":"mistralai/Mistral-Small-24B-Instruct-2501",
  "choices":[
    {
      "index":0,
      "message":{
        "role":"assistant",
        "reasoning_content":null, 
        "content":"The founder of OVHcloud is Octave Klaba.",
        "tool_calls":[]
      },
      "logprobs":null,
      "finish_reason":"stop",
      "stop_reason":null
    }
  ],
  "usage":{
    "prompt_tokens":22,
    "total_tokens":35,
    "completion_tokens":13,
    "prompt_tokens_details":null
  },
  "prompt_logprobs":null
}</code></pre>
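

<p>Since the server exposes OpenAI-compatible endpoints, you can also query it from Python with the official <code>openai</code> client. Here is a minimal sketch equivalent to the <code>curl</code> call above (the app URL and the token variable name are placeholders to replace with your own values):</p>



<pre class="wp-block-code"><code class=""># Minimal Python equivalent of the curl request above,
# using the OpenAI-compatible API served by vLLM
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1",
    api_key=os.environ["MY_OVHAI_ACCESS_TOKEN"],
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me the name of OVHcloud’s founder."},
    ],
)

print(response.choices[0].message.content)</code></pre>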



<h2 class="wp-block-heading">Conclusion</h2>



<p>By following these steps, you have successfully deployed the <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code> model using <strong>vLLM</strong> on OVHcloud&#8217;s AI Deploy platform. This setup provides a scalable and efficient solution for serving advanced language models in production environments.</p>



<p>For further customization and optimization, refer to the <a href="https://docs.vllm.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM documentation</a> and the <a href="https://help.ovhcloud.com/csm/en-ie-documentation-public-cloud-ai-and-machine-learning-ai-deploy?id=kb_browse_cat&amp;kb_id=574a8325551974502d4c6e78b7421938&amp;kb_category=3241efc6a052d910f078d4b4ef43651f&amp;spa=1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Deploy resources</a>.</p>



<p>💪 <strong>Challenges taken!</strong> You can now enjoy the power of your LLM deployed in a single command line!</p>



<p>Want even more simplicity? You can also use ready-to-use APIs with <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>!</p>



<p><strong><em>But… what’s next?</em></strong></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm%2F&amp;action_name=Mistral%20Small%2024B%20served%20with%20vLLM%20and%20AI%20Deploy%20%26%238211%3B%20a%20single%20command%20to%20deploy%20an%20LLM%20%28Part%201%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Master Speech AI and build your own Video Translator app with AI Endpoints!</title>
		<link>https://blog.ovhcloud.com/master-speech-ai-and-build-your-own-video-translator-app-with-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Thu, 25 Jul 2024 08:10:49 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=27091</guid>

					<description><![CDATA[Extend the impact of any video by embarking on the development of a transcription and translation solution for your multimedia content. Today, media and social networks are omnipresent in our professional and personal lives: videos, tweets, posts, forums and Twitch lives&#8230; These different types of media enable companies and content creators to promote their activities [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmaster-speech-ai-and-build-your-own-video-translator-app-with-ai-endpoints%2F&amp;action_name=Master%20Speech%20AI%20and%20build%20your%20own%20Video%20Translator%20app%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Extend the impact of any video by embarking on the development of a transcription and translation solution for your multimedia content.</em></p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="923" height="618" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post.png" alt="" class="wp-image-27116" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post.png 923w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-768x514.png 768w" sizes="auto, (max-width: 923px) 100vw, 923px" /><figcaption class="wp-element-caption"><em>A robot capable of transcribing, translating and dubbing videos in any language</em></figcaption></figure>



<p>Today, <strong>media</strong> and <strong>social networks</strong> are omnipresent in our professional and personal lives: videos, tweets, posts, forums and Twitch lives&#8230; These different types of media enable companies and content creators to promote their activities and build community loyalty.</p>



<p>But have you ever wondered about the <strong>role of language</strong> when creating your content? Using just one language can be a <strong>hindrance to your business</strong>!</p>



<p><strong>Transcribing</strong> and<strong> translating</strong> your videos could be the solution! Adapt your videos into different languages and make the content accessible to a wider audience, increasing its reach and impact.</p>



<p>💡 <strong>How can we achieve this?</strong></p>



<p>By automatically subtitling and dubbing voices using AI APIs! With <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, you will benefit from APIs based on several <strong>Speech AI</strong> models: ASR (Automatic Speech Recognition), NMT (Neural Machine Translation) and TTS (Text To Speech).</p>



<h3 class="wp-block-heading">Objective</h3>



<p>Whatever your level in AI, whether you&#8217;re a beginner or an expert, this tutorial will enable you to create your own powerful <strong>Video Translator</strong> in just a few lines of code.</p>



<p>The aim of this article is to show you how to make the most of Speech AI&#8217;s APIs, with no prior knowledge required!</p>



<p>⚡️ <strong>How to?</strong></p>



<p>By <strong>connecting the Speech AI API endpoints</strong> using easy-to-implement features, and<strong> developing a web-app</strong> using the Gradio framework.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="665" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles-1024x665.png" alt="" class="wp-image-27117" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles-1024x665.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles-300x195.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles-768x498.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-puzzles.png 1057w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>AI Endpoints “puzzles” connexion</em></figcaption></figure>



<p>🚀 <strong>At the end?</strong></p>



<p>We will have a web app that lets you <strong>upload a video</strong> in French, <strong>transcribe</strong> it and then <strong>subtitle</strong> it in English. </p>



<p>But that&#8217;s not all… You will also be able to <strong>dub the voice</strong> of a video into another language!</p>



<p>👀 Before we start coding, let&#8217;s take a look at the different concepts&#8230;</p>



<h3 class="wp-block-heading">Concept</h3>



<p>To better understand the technologies that revolve around the&nbsp;<strong>Video Translator</strong>, let’s start by examining the models and notions of ASR, NMT, TTS…</p>



<h5 class="wp-block-heading">AI Endpoints in a few words</h5>



<p><a href="https://labs.ovhcloud.com/en/ai-endpoints/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a>&nbsp;is a new serverless platform powered by OVHcloud and designed for developers.</p>



<p>The aim of&nbsp;<strong>AI Endpoints</strong>&nbsp;is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise.</p>



<p>It offers a&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/catalog" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">curated catalog</a>&nbsp;of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.</p>



<p>AI Endpoints provides access to advanced AI models, including Large Language Models (LLMs), natural language processing, translation, speech recognition, image recognition, and more.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png" alt="" class="wp-image-28463" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18.png 1312w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>OVHcloud AI Endpoints website</em></figcaption></figure>



<p>To know more about AI Endpoints, refer to this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">website</a>.</p>



<p>AI Endpoints offers several ASR APIs in different languages… But what does ASR mean?</p>



<h5 class="wp-block-heading">Transcribe video using ASR</h5>



<p><strong>Automatic Speech Recognition</strong>&nbsp;(ASR) technology, also known as Speech-To-Text, is the process of converting spoken language into written text.</p>



<p>This process consists of several stages, including preparing the speech signal, extracting features, creating acoustic models, developing language models, and utilizing speech recognition engines.</p>



<p>With AI Endpoints, we simplify the use of ASR technology through our&nbsp;<strong>ready-to-use inference APIs</strong>. Learn how to use our APIs by following this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">link</a>.</p>



<p>These APIs can be used to transcribe audio from video into text, which can then be sent to an NMT model in order to translate it into another language.</p>



<h5 class="wp-block-heading">Translate thanks to NMT</h5>



<p>NMT, for <strong>Neural Machine Translation</strong>, is a subfield of Machine Translation (MT) that uses Artificial Neural Networks (ANNs) to predict or generate translations from one language to another. </p>



<p>If you want to learn more, the best way is to try it out for yourself! You can do so by following this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/23600fb3-7b3e-4d39-823c-7d66b8203e24" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">link</a>.</p>



<p>In this particular application, the NMT models will translate the results of the ASR (Automatic Speech Recognition) endpoint into another language.</p>



<p>Then, there are two options:</p>



<ul class="wp-block-list">
<li><strong>generate a subtitle .srt file</strong> based on the NMT translation</li>



<li><strong>apply voice dubbing</strong> thanks to speech synthesis</li>
</ul>



<p>🤯 Would you like to use speech synthesis? Challenge accepted, that’s what TTS is for.</p>



<h5 class="wp-block-heading">Allow voice dubbing with TTS</h5>



<p>TTS stands for&nbsp;<strong>Text-To-Speech</strong>, which is a type of technology that converts written text into spoken words.</p>



<p>This technology uses Artificial Intelligence algorithms to interpret and generate human-like speech from text input.</p>



<p>It is commonly used in various applications such as voice assistants, audiobooks, language learning platforms, and accessibility tools for individuals with visual or reading impairments.</p>



<p>With AI Endpoints, TTS is easy to use thanks to the&nbsp;<strong>turnkey inference APIs</strong>. Test it for free&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</p>



<p>🤖 Are you ready to start coding the&nbsp;<strong>Video Translator</strong>? Let&#8217;s go!</p>



<h3 class="wp-block-heading">Technical implementation of the Audio Virtual Assistant</h3>



<p>This technical section covers the following points:</p>



<ul class="wp-block-list">
<li>send your video (<strong>.mp4</strong>) and extract audio as <strong>.wav</strong> file</li>



<li>use<strong> ASR endpoint</strong> to transcribe audio into text</li>



<li>translate ASR transcription into the target language using <strong>NMT endpoint</strong></li>



<li>create .srt file to add video subtitles</li>



<li>use <strong>TTS endpoint</strong> to convert NMT translation into spoken words</li>



<li>implement voice dubbing function to merge generated audio with input video</li>
</ul>



<p>Finally, create a web app with <a href="https://www.gradio.app/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gradio</a> to make it easy to use!</p>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/speech-ai-video-translator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="567" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1-1024x567.png" alt="" class="wp-image-27165" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1-1024x567.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1-300x166.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1-768x425.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/master-speech-ai-blog-post-app-1.png 1095w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Working principle of the web app resulting from technical implementation</em></figcaption></figure>



<p>In order to build the <strong>Video Translator</strong>, start by setting up the environment.</p>



<h5 class="wp-block-heading">Set up the environment</h5>



<p>In order to use the&nbsp;<strong>AI Endpoints</strong>&nbsp;APIs easily, create a&nbsp;<code><strong>.env</strong></code>&nbsp;file to store environment variables.</p>



<pre class="wp-block-code"><code class="">ASR_GRPC_ENDPOINT=nvr-asr-fr-fr.endpoints-grpc.kepler.ai.cloud.ovh.net:443
NMT_GRPC_ENDPOINT=nvr-nmt-en-fr.endpoints-grpc.kepler.ai.cloud.ovh.net:443
TTS_GRPC_ENDPOINT=nvr-tts-en-us.endpoints-grpc.kepler.ai.cloud.ovh.net:443
OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;ai-endpoints-api-token&gt;</code></pre>



<p>⚠️&nbsp;<em>Test AI Endpoints and get your&nbsp;<strong>free token</strong>&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a></em></p>



<p>In the next step, install the needed Python dependencies.</p>



<p>If the library&nbsp;<strong><code>ffmpeg</code></strong>&nbsp;is not already installed, launch the following command:&nbsp;</p>



<pre class="wp-block-code"><code class=""><code>sudo apt install ffmpeg</code></code></pre>



<p>Create the&nbsp;<code><strong>requirements.txt</strong></code>&nbsp;file with the following libraries and launch the installation. Note that <code>pydub</code> and <code>python-dotenv</code> are listed as well, since the snippets below rely on them for audio segment handling and <code>.env</code> loading.</p>



<p>⚠️<em>The environment workspace is based on&nbsp;<strong>Python 3.11</strong></em></p>



<p><code>nvidia-riva-client==2.15.0<br>gradio==4.16.0<br>moviepy==1.0.3<br>librosa==0.10.1<br>pysrt==1.1.2<br>pydub==0.25.1<br>python-dotenv==1.0.1</code></p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<p>Once this is done, you have to create five Python files:</p>



<ul class="wp-block-list">
<li><code><strong>asr.py</strong></code> &#8211; transcribe audio into text</li>



<li><code><strong>nmt.py</strong></code> &#8211; translate the transcription into another language</li>



<li><strong><code>tts.py</code></strong> &#8211; synthesize the text into speech</li>



<li><code><strong>utils.py</strong></code> &#8211; extract audio from video, connect ASR, NMT and TTS functions together and merge the result with the input video</li>



<li><strong><code>main.py</code></strong> &#8211; create the Gradio app to make it easy to use</li>
</ul>



<p><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️ </mark></strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>Note that only a few functions will be covered in this article! To create the entire app, refer to the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/speech-ai-video-translator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repo</a>, which contains all the code </strong></mark><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️</mark></strong></p>



<h5 class="wp-block-heading">Transcribe the audio part of the video in writing</h5>



<p>First, create the&nbsp;<strong>Automatic Speech Recognition</strong>&nbsp;(ASR) function in order to obtain the video transcription into French.</p>



<p>💡 <strong>How does it work?</strong></p>



<p>The <strong><code>asr_transcription</code></strong> function allows you to transcribe the audio part of the video into text and to get the beginning and the end of each sentence thanks to the <code><strong>enable_word_time_offsets</strong></code> parameter.</p>



<pre class="wp-block-code"><code class="">def asr_transcription(audio_input):

    # connect with asr server
    asr_service = riva.client.ASRService(
                    riva.client.Auth(
                        uri=os.environ.get('ASR_GRPC_ENDPOINT'), 
                        use_ssl=True, 
                        metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]]
                    )
                )

    # set up config
    asr_config = riva.client.RecognitionConfig(
    language_code="fr-FR",
        max_alternatives=1,
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
        audio_channel_count = 1,
    )
    
    # open and read audio file
    with open(audio_input, 'rb') as fh:
        audio = fh.read()
    
    riva.client.add_audio_file_specs_to_config(asr_config, audio)

    # return response
    resp = asr_service.offline_recognize(audio, asr_config)
    output_asr = []
    
    # extract sentence information
    for s in range(len(resp.results)):
        output_sentence = []
        sentence = resp.results[s].alternatives[0].transcript
        output_sentence.append(sentence)
        
        # start time of the first word and end time of the last word of the sentence
        words = resp.results[s].alternatives[0].words
        start_sentence = words[0].start_time
        end_sentence = words[-1].end_time
        
        # add start time and stop time of the sentence
        output_sentence.append(start_sentence)
        output_sentence.append(end_sentence)
       
        # final asr transcription and time sequences
        output_asr.append(output_sentence)
        
    # return response
    return output_asr</code></pre>



<p>🎉 Congratulations! Your ASR function is ready to use.</p>



<p>⏳ But that&#8217;s just the beginning! Now you have to build the translation part&#8230;.</p>



<h5 class="wp-block-heading">Translate French text into English</h5>



<p>Then, build the&nbsp;<strong>Neural Machine Translation</strong>&nbsp;(NMT) function to transform the French transcription into English.</p>



<p>➡️ <strong>In practice?</strong></p>



<p>The <code><strong>nmt_translation</strong></code> function allows you to translate the different sentences into English. Don&#8217;t forget to keep the start and end times for each sentence!</p>



<pre class="wp-block-code"><code class="">def nmt_translation(output_asr):
    
    # connect with nmt server
    nmt_service = riva.client.NeuralMachineTranslationClient(
                    riva.client.Auth(
                        uri=os.environ.get('NMT_GRPC_ENDPOINT'), 
                        use_ssl=True, 
                        metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]]
                    )
                )

    # set up config
    model_name = 'fr_en_24x6'
    
    output_nmt = []
    for s in range(len(output_asr)):
        output_nmt.append(output_asr[s])
        text_translation = nmt_service.translate([output_asr[s][0]], model_name, "fr", "en")
        output_nmt[s][0]=text_translation.translations[0].text
        
    # return response
    return output_nmt</code></pre>
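

<p>At this point you already have everything needed for the first of the two options mentioned earlier: the subtitle file. As a hedged sketch (assuming the start / end times returned by ASR are in milliseconds, which is consistent with the silence handling in the TTS step below; the function name is illustrative), a <code>.srt</code> file can be generated from the <code>[text, start, end]</code> triplets with <code>pysrt</code>:</p>



<pre class="wp-block-code"><code class="">import pysrt

def create_srt_file(output_nmt, video_title):

    # one subtitle entry per translated sentence
    subs = pysrt.SubRipFile()
    for i, (text, start_ms, end_ms) in enumerate(output_nmt, start=1):
        subs.append(pysrt.SubRipItem(
            index=i,
            start=pysrt.SubRipTime(milliseconds=int(start_ms)),
            end=pysrt.SubRipTime(milliseconds=int(end_ms)),
            text=text,
        ))

    # export the subtitles as a .srt file
    srt_file = f"{outputs_path}/videos/{video_title}.srt"
    subs.save(srt_file, encoding="utf-8")
    return srt_file</code></pre>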



<p>⚡️ You’re almost there! Now all you have to do is build the TTS function.</p>



<h5 class="wp-block-heading">Synthesize the translated text into spoken words</h5>



<p>Finally, create the&nbsp;<strong>Text To Speech</strong>&nbsp;(TTS) function to synthesize the translated English text into audio.</p>



<p>👀 <strong>How to?</strong></p>



<p>Firstly, create the <strong><code>tts_transcription</code></strong> function, dedicated to audio generation and silence management based on the ASR and NMT results.</p>



<pre class="wp-block-code"><code class="">def tts_transcription(output_nmt, video_input, video_title, voice_type):
    
    # connect with tts server
    tts_service = riva.client.SpeechSynthesisService(
                    riva.client.Auth(
                        uri=os.environ.get('TTS_GRPC_ENDPOINT'), 
                        use_ssl=True, 
                        metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]]
                    )
                )

    # set up tts config
    sample_rate_hz = 16000
    req = { 
            "language_code"  : "en-US",
            "encoding"       : riva.client.AudioEncoding.LINEAR_PCM ,  
            "sample_rate_hz" : sample_rate_hz,                       
            "voice_name"     : f"English-US.{voice_type}"                    
    }

    output_audio = 0
    output_audio_file = f"{outputs_path}/audios/{video_title}.wav"
    for i in range(len(output_nmt)):
        
        # add silence between audio sample
        if i==0:
            duration_silence = output_nmt[i][1]
        else:
            duration_silence = output_nmt[i][1] - output_nmt[i-1][2]
        silent_segment = AudioSegment.silent(duration = duration_silence)
        output_audio += silent_segment
        
        # create tts transcription
        req["text"] = output_nmt[i][0]
        resp = tts_service.synthesize(**req)
        sound_segment = AudioSegment(
            resp.audio,
            sample_width=2,
            frame_rate=16000,
            channels=1,
        )

        output_audio += sound_segment
    
    # export new audio as wav file
    output_audio.export(output_audio_file, format="wav")
    
    # add new voice on video
    voice_dubbing = add_audio_on_video(output_audio_file, video_input, video_title)
    
    return voice_dubbing</code></pre>



<p>Secondly, build the <code><strong>add_audio_on_video</strong></code> function in order to merge the new audio onto the video.</p>



<pre class="wp-block-code"><code class="">def add_audio_on_video(translated_audio, video_input, video_title):

    videoclip = VideoFileClip(video_input)
    audioclip = AudioFileClip(translated_audio)

    new_audioclip = CompositeAudioClip([audioclip])
    new_videoclip = f"{outputs_path}/videos/{video_title}.mp4"
    
    videoclip.audio = new_audioclip
    videoclip.write_videofile(new_videoclip)
    
    return new_videoclip</code></pre>



<p>🤖 Congratulations! Now you&#8217;re ready to put the puzzle pieces together with the <strong><code>utils.py</code></strong> file.</p>



<h5 class="wp-block-heading">Combine the results of the various functions</h5>



<p>This is the most important step! Connect the functions output to return the final video&#8230;</p>



<p>🚀 <strong>What to do?</strong></p>



<p>1. Create the <code>main.py </code>to implement the Gradio app</p>



<p><strong> ➡️ Access the code <a href="https://github.com/ovh/public-cloud-examples/blob/main/ai/ai-endpoints/speech-ai-video-translator/main.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<p>2. Build <code>utils.py</code> file to connect the results to each other</p>



<p><strong>➡️ Refer to this Python <a href="https://github.com/ovh/public-cloud-examples/blob/main/ai/ai-endpoints/speech-ai-video-translator/utils.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">file</a>.</strong></p>
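

<p>To give an idea of what this glue code looks like, here is an illustrative sketch of a function chaining the pieces together (the real <code>utils.py</code> in the repo is more complete; the function name and the default voice are assumptions):</p>



<pre class="wp-block-code"><code class="">def video_translator(video_input, voice_type="Female-1"):

    video_title = os.path.splitext(os.path.basename(video_input))[0]

    # extract the audio track of the input video as a mono .wav file
    audio_input = f"{outputs_path}/audios/{video_title}_input.wav"
    VideoFileClip(video_input).audio.write_audiofile(
        audio_input, ffmpeg_params=["-ac", "1"]
    )

    # chain the three endpoints: ASR -&gt; NMT -&gt; TTS voice dubbing
    output_asr = asr_transcription(audio_input)
    output_nmt = nmt_translation(output_asr)
    return tts_transcription(output_nmt, video_input, video_title, voice_type)</code></pre>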



<p>😎 Well done! You can now use your web app to translate any video from French to English.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="623" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-1024x623.png" alt="" class="wp-image-27166" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-1024x623.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-300x183.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-768x467.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-1536x935.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-21-2048x1246.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Video Translator web app overview</em></figcaption></figure>



<p>🚀 That’s it! Now get the most out of your tool by launching it locally.</p>



<h5 class="wp-block-heading">Launch Video Translator web app locally</h5>



<p>In this last step, you can start the Gradio app locally by launching the following command:</p>



<pre class="wp-block-code"><code class="">python3 main.py</code></pre>



<p>Benefit from the full power of your tool as follows!</p>



<figure class="wp-block-video aligncenter"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/demo_video_translator_ai_endpoints.mp4"></video><figcaption class="wp-element-caption"><em>Video Translator demo</em></figcaption></figure>



<p>☁️ It’s also possible to make your interface accessible to everyone…</p>



<h5 class="wp-block-heading">Go further</h5>



<p>If you want to go further and deploy your web app in the cloud, refer to the following articles and tutorials.</p>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal">Deploy a custom Docker image for Data Science project</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-build-use-custom-image?id=kb_article_view&amp;sysparm_article=KB0057413" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Deploy – Tutorial – Build &amp; use a custom Docker image</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-gradio-sketch-recognition?id=kb_article_view&amp;sysparm_article=KB0048083" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Deploy – Tutorial – Deploy a Gradio app for sketch recognition</a></li>
</ul>



<h3 class="wp-block-heading">Conclusion of the Audio Virtual Assistant</h3>



<p>Congratulations 🎉! You have learned how to build your own&nbsp;<strong>Video Translator</strong>&nbsp;thanks to AI Endpoints.</p>



<p>You’ve also seen how easy it is to use&nbsp;<strong>AI Endpoints</strong>&nbsp;to create innovative turnkey solutions.</p>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/speech-ai-video-translator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<p>🚀&nbsp;<strong>What’s next?</strong>&nbsp;Implement an <a href="https://blog.ovhcloud.com/build-a-powerful-audio-virtual-assistant-with-ai-endpoints/" data-wpel-link="internal">Audio Virtual Assistant</a> in less than 100 lines of code!</p>



<h3 class="wp-block-heading">References</h3>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">Enhance your applications with AI Endpoints</a></li>



<li><a href="https://blog.ovhcloud.com/create-audio-summarizer-assistant-with-ai-endpoints/" data-wpel-link="internal">Create your own Audio Summarizer assistant with AI Endpoints!</a></li>



<li><a href="https://blog.ovhcloud.com/build-a-powerful-audio-virtual-assistant-with-ai-endpoints/" data-wpel-link="internal">Build a powerful Audio Virtual Assistant in less than 100 lines of code with AI Endpoints!</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmaster-speech-ai-and-build-your-own-video-translator-app-with-ai-endpoints%2F&amp;action_name=Master%20Speech%20AI%20and%20build%20your%20own%20Video%20Translator%20app%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/07/demo_video_translator_ai_endpoints.mp4" length="4268394" type="video/mp4" />

			</item>
		<item>
		<title>Chatbot memory management with LangChain and AI Endpoints</title>
		<link>https://blog.ovhcloud.com/chatbot-memory-management-with-langchain-and-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Thu, 11 Jul 2024 14:03:03 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=27066</guid>

					<description><![CDATA[Use Conversational Memory to enable your chatbot to answer multiple questions using its knowledge based on previous interactions. When it comes to Conversational Applications, especially those with interfaces, the ability to remember information about past interactions is paramount. Imagine you&#8217;re talking to a Virtual Assistant or Chatbot, and you want it to remember details of [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fchatbot-memory-management-with-langchain-and-ai-endpoints%2F&amp;action_name=Chatbot%20memory%20management%20with%20LangChain%20and%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Use <strong>Conversational Memory</strong> to enable your chatbot to answer multiple questions using its knowledge based on previous interactions.</em></p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="812" height="509" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-langchain-python-blog-post-2.png" alt="" class="wp-image-27078" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-langchain-python-blog-post-2.png 812w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-langchain-python-blog-post-2-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-langchain-python-blog-post-2-768x481.png 768w" sizes="auto, (max-width: 812px) 100vw, 812px" /><figcaption class="wp-element-caption"><em>A robot assistant with a lot of knowledge talking to a human</em></figcaption></figure>



<p>When it comes to <strong>Conversational Applications</strong>, especially those with interfaces, the ability to remember information about past interactions is paramount.</p>



<p>Imagine you&#8217;re talking to a <strong>Virtual Assistant</strong> or <strong>Chatbot</strong>, and you want it to remember details of previous conversations&#8230;</p>



<p><a href="https://python.langchain.com/v0.2/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LangChain</a>&#8216;s <strong>Memory</strong> module is the solution that rescues our conversation models from the constraints of short-term memory!</p>



<p>In this article we will learn how it is possible to use <strong>OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a></strong>, especially <strong>Mistral7B</strong> API, and <strong>LangChain</strong> in order to add a <strong>Memory window</strong> to a Chatbot.</p>



<p>This step-by-step tutorial will introduce the different types of memory in LangChain. Then, we will compare the Mistral7b model used without memory and the one benefiting from the memory window.</p>



<h3 class="wp-block-heading">Introduction</h3>



<p>Before getting our hands into the code, let&#8217;s contextualize it by introducing AI Endpoints and the notion of memory in the LLM domain.</p>



<h5 class="wp-block-heading">AI Endpoints in a few words</h5>



<p><a href="https://labs.ovhcloud.com/en/ai-endpoints/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a>&nbsp;is a new serverless platform powered by OVHcloud and designed for developers.</p>



<p>The aim of&nbsp;<strong>AI Endpoints</strong>&nbsp;is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise.</p>



<p>It offers a&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/catalog" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">curated catalog</a>&nbsp;of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.</p>



<p>AI Endpoints provides access to advanced AI models, including Large Language Models (LLMs), natural language processing, translation, speech recognition, image recognition, and more.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png" alt="" class="wp-image-28463" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18.png 1312w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>OVHcloud AI Endpoints website</em></figcaption></figure>



<p>To know more about AI Endpoints, refer to this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">website</a>.</p>



<h5 class="wp-block-heading">Conversational Memory concept</h5>



<p><strong>Conversational memory</strong> for LLMs (Language Learning Models) refers to the ability of these models to remember and use information from previous interactions within the same conversation. </p>



<p>It works in a similar way to how humans use <strong>short-term memory</strong> in day-to-day conversations. </p>



<p>This feature is essential for <strong>maintaining context</strong> and coherence throughout a dialogue. It allows the model to recall details, facts, or inquiries mentioned earlier in the conversation (<strong>chat history</strong>), and use that information effectively to generate more relevant responses corresponding to the <strong>new user inputs</strong>.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="873" height="739" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-history-input.png" alt="" class="wp-image-27083" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-history-input.png 873w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-history-input-300x254.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-memory-history-input-768x650.png 768w" sizes="auto, (max-width: 873px) 100vw, 873px" /><figcaption class="wp-element-caption"><em>Distinction between chat history and new user input</em></figcaption></figure>



<p><strong>Conversational memory</strong> can be implemented through various techniques and architectures, especially in LangChain.</p>



<h3 class="wp-block-heading">Memory&nbsp;types in LangChain</h3>



<p>LangChain offers several types of conversational memory with the <code><strong>ConversationChain</strong></code>.</p>



<p>Each memory type may have its own parameters and concepts that need to be understood&#8230;</p>



<h5 class="wp-block-heading">ConversationBufferMemory</h5>



<p>The first component is the <code><strong>ConversationBufferMemory</strong></code>. This is an extremely simple form of memory that simply holds a list of chat messages in a buffer and passes them on to the prompt model.</p>



<p>All conversation interactions between the human and the AI are passed to the parameter <strong><code>history</code></strong>.</p>



<h5 class="wp-block-heading">ConversationSummaryMemory</h5>



<p>The second component solves a problem that arises when using <code><strong>ConversationBufferMemory</strong></code>: we quickly consume a large number of tokens, often exceeding the context window limit of even the most advanced LLMs.</p>



<p>The solution may be to use the <code><strong>ConversationSummaryMemory</strong></code> component, which limits excessive token consumption while still exploiting memory. This type of memory summarizes the history of interactions before sending it to the dedicated parameter (<code><strong>history</strong></code>).</p>



<h5 class="wp-block-heading">ConversationBufferWindowMemory</h5>



<p>The third one is the <code><strong>ConversationBufferWindowMemory</strong></code>. It introduces a window into the buffer memory, keeping only the <strong>K most recent interactions</strong>. </p>



<p>⚠️ <em>Note that while this approach reduces the number of tokens used, it also discards all interactions older than the K most recent ones.</em></p>



<h5 class="wp-block-heading">ConversationSummaryBufferMemory</h5>



<p>The <code><strong>ConversationSummaryBufferMemory</strong></code> component is a mix of <strong><code>ConversationSummaryMemory</code></strong> and <code><strong>ConversationBufferWindowMemory</strong></code>. </p>



<p>It summarizes the earliest interactions while retaining the latest tokens in the human / AI conversation.</p>
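

<p>As a quick illustration, here is how these four components are typically instantiated (a sketch; <code>llm</code> stands for the chat model configured in the technical section below, and the parameter values are examples to adapt):</p>



<pre class="wp-block-code"><code class="">from langchain.memory import (
    ConversationBufferMemory,
    ConversationBufferWindowMemory,
    ConversationSummaryBufferMemory,
    ConversationSummaryMemory,
)

# full chat history, passed verbatim to the prompt
buffer_memory = ConversationBufferMemory()

# history summarized by the LLM before being injected into the prompt
summary_memory = ConversationSummaryMemory(llm=llm)

# only the K most recent interactions are kept
window_memory = ConversationBufferWindowMemory(k=10)

# summary of the oldest interactions + raw buffer of the latest tokens
summary_buffer_memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=650)</code></pre>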



<p>🧠 Let&#8217;s move on to the technical part and take a look at the component<strong> <code>ConversationBufferWindowMemory</code></strong>!</p>



<h3 class="wp-block-heading">How to add a conversational memory window?</h3>



<p>This technical section covers the following points:</p>



<ul class="wp-block-list">
<li>set up the dev environment</li>



<li>test the Mistral7B model without conversational memory</li>



<li>implement<strong> ConversationBufferWindowMemory</strong> to benefit from the model knowledge during the conversation</li>
</ul>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="690" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1-1024x690.png" alt="" class="wp-image-27080" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1-1024x690.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1-300x202.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1-768x517.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/chatbot-with-without-memory-langchain-1.png 1048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Comparaison between chatbot with and without memory</em></figcaption></figure>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/python-langchain-conversational-memory" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</strong></p>



<h5 class="wp-block-heading">Set up the environment</h5>



<p>In order to use the&nbsp;<strong>AI Endpoints&nbsp;Mistral7B</strong>&nbsp;API easily, create a&nbsp;<code><strong>.env</strong></code>&nbsp;file to store environment variables.</p>



<pre class="wp-block-code"><code class="">LLM_AI_ENDPOINT=https://mistral-7b-instruct-v0-3.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1

OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;ai-endpoints-api-token&gt;</code></pre>



<p>⚠️&nbsp;<em>Test AI Endpoints and get your&nbsp;<strong>free token</strong>&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a></em></p>



<p>In the next step, install the needed Python dependencies.</p>



<p>Create the&nbsp;<code><strong>requirements.txt</strong></code>&nbsp;file with the following libraries and launch the installation.</p>



<p>⚠️<em>The environment workspace is based on&nbsp;<strong>Python 3.11</strong></em></p>



<p><code>python-dotenv==1.0.1<br>langchain_openai==0.1.14<br>langchain==0.2.17</code><br><code>openai==1.68.2</code></p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<p>Once this is done, you can create a notebook named&nbsp;<strong><code>chatbot-memory-langchain.ipynb</code></strong>.</p>



<p>First, import the Python libraries as follows:</p>



<pre class="wp-block-code"><code class="">import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory</code></pre>



<p>Then, load the environment variables:</p>



<pre class="wp-block-code"><code class="">load_dotenv()

# access the environment variables from the .env file
ai_endpoint_token = os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN")
ai_endpoint_mistral7b = os.getenv("LLM_AI_ENDPOINT")</code></pre>



<p>👀 You are now ready to test your LLM without conversational memory!</p>



<h5 class="wp-block-heading">Test Mistral7b without Conversational Memory</h5>



<p>Test your model in a basic way and see what happens with the context&#8230;</p>



<pre class="wp-block-code"><code class=""># Set up the LLM
llm = ChatOpenAI(
        model_name="Mistral-7B-Instruct-v0.3", 
        openai_api_key=ai_endpoint_token,
        openai_api_base=ai_endpoint_mistral7b, 
        max_tokens=512,
        temperature=0.0
)

prompt = ChatPromptTemplate.from_messages([
("system", "You are an assistant. Answer to the question."),
("human", "{question}"),
])

# Create the conversation chain
chain = prompt | llm

# Start the conversation
question = "Hello, my name is Elea"
response = chain.invoke(question)
print(f"👤: {question}")
print(f"🤖: {response.content}")

question = "What is the capital of France?"
response = chain.invoke(question)
print(f"👤: {question}")
print(f"🤖: {response.content}")

question = "Do you know my name?"
response = chain.invoke(question)
print(f"👤: {question}")
print(f"🤖: {response.content}")</code></pre>



<p>You should obtain the following result:</p>



<p><code>👤: Hello, my name is Elea <br>🤖: Hello Elea, nice to meet you. How can I assist you today? <br>👤: What is the capital of France? <br>🤖: The capital city of France is Paris. Paris is one of the most famous and visited cities in the world. It is known for its art, culture, and cuisine. <br>👤: Do you know my name? <br>🤖: I'm an assistant and I don't have the ability to know your name without being told.</code></p>



<p>Note here that the model <strong>does not</strong> store the conversation in memory, since it no longer remembers the first name sent in the first prompt.</p>



<p><strong>But how to fix it?</strong></p>



<p>💡 You can solve it using <strong>ConversationBufferWindowMemory</strong> from LangChain&#8230;</p>



<h5 class="wp-block-heading">Add Memory Window to your LLM</h5>



<p>In this step, we add a Conversation Window Memory using the following component:</p>



<p><code><strong>memory = ConversationBufferWindowMemory(k=10)</strong></code></p>



<p><strong>Parameter <code>k</code> defines the number of recorded interactions.<br></strong>➡️ <em>Note that if we set&nbsp;<strong>k=1</strong>, it means that the window will remember the single latest interaction between the human and AI. That is the latest human input and the latest AI response.</em></p>



<p>Then, we have to create the conversation chain:</p>



<p><code><strong>conversation = ConversationChain(llm=llm, memory=memory)</strong></code></p>



<pre class="wp-block-code"><code class=""># Set up the LLM
llm = ChatOpenAI(
        model_name="Mistral-7B-Instruct-v0.3", 
        openai_api_key=ai_endpoint_token,
        openai_api_base=ai_endpoint_mistral7b,
        max_tokens=512,
        temperature=0.0
)

# Add Conversation Window Memory
memory = ConversationBufferWindowMemory(k=10)

# Create the conversation chain
conversation = ConversationChain(llm=llm, memory=memory)

# Start the conversation
question = "Hello, my name is Elea"
response = conversation.predict(input=question)
print(f"👤: {question}")
print(f"🤖: {response}")

question = "What is the capital of France?"
response = conversation.predict(input=question)
print(f"👤: {question}")
print(f"🤖: {response}")

question = "Do you know my name?"
response = conversation.predict(input=question)
print(f"👤: {question}")
print(f"🤖: {response}")</code></pre>



<p>Finally, you should obtain this type of output:</p>



<p><code>👤: Hello, my name is Elea </code><br><code>🤖: Hello Elea, nice to meet you. I'm an AI designed to assist and engage in friendly conversations. How can I help you today? Would you like to know a joke, play a game, or discuss a specific topic? I'm here to help and provide lots of specific details from my context. If I don't know the answer to a question, I'll truthfully say I don't know. So, what would you like to talk about today? I'm all ears! </code><br><code>👤: What is the capital of France? </code><br><code>🤖: The capital city of France is Paris. Paris is one of the most famous and romantic cities in the world. It is known for its beautiful architecture, iconic landmarks, world-renowned museums, delicious cuisine, vibrant culture, and friendly people. Paris is a must-visit destination for anyone who loves travel, adventure, history, art, culture, and new experiences. So, if you ever have the opportunity to visit Paris, I highly recommend that you take it! You won't be disappointed! </code><br><code>👤: Do you know my name? </code><br><code>🤖: Yes, I do. Your name is Elea. How can I help you today, Elea? Would you like to know a joke, play a game, or discuss a specific topic? I'm here to help and provide lots of specific details from my context. If I don't know the answer to a question, I'll truthfully say I don't know. So, what would you like to talk about today, Elea? I'm all ears!</code></p>



<p>As you can see, thanks to the <code><strong>ConversationBufferWindowMemory</strong></code>, your model keeps track of the conversation and retrieves previously exchanged information.</p>



<p>⚠️ Here, the memory window is <code><strong>k=10</strong></code>, so feel free to customize the <code><strong>k</strong></code> value to suit your needs.</p>



<h3 class="wp-block-heading">Conclusion</h3>



<p><strong>Congratulations!</strong> You can now benefit from the memory generated by the history of your interactions with the LLM.</p>



<p>🤖 This will enable you to streamline exchanges with the Chatbot and get more relevant answers!</p>



<p>In this blog, we explored the <strong>LangChain Memory</strong> module and, more specifically, the <code><strong>ConversationBufferWindowMemory</strong></code> component.</p>



<p>This has enabled us to understand the importance of memory in the creation of a Chatbot or Virtual assistant!</p>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/python-langchain-conversational-memory" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</strong></p>



<p>🚀&nbsp;<strong>What’s next?</strong>&nbsp;If you would like to find out more, take a look at the following article on <a href="https://blog.ovhcloud.com/memory-chatbot-using-ai-endpoints-and-langchain4j/" data-wpel-link="internal">memory chatbot with LangChain4j</a>.</p>



<h3 class="wp-block-heading">References</h3>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">Enhance your applications with AI Endpoints</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-and-langchain-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints and LangChain to create a chatbot</a></li>



<li><a href="https://medium.com/@ranadevrat/how-to-develop-a-chatbot-using-the-open-source-llm-mistral-7b-lang-chain-memory-79f9fb3016df" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">How to develop a chatbot using the open-source&nbsp;LLM Mistral-7B, Lang Chain Memory, ConversationChain, and Flask</a></li>



<li><a href="https://python.langchain.com/v0.1/docs/modules/memory/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LangChain memory module documentation</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fchatbot-memory-management-with-langchain-and-ai-endpoints%2F&amp;action_name=Chatbot%20memory%20management%20with%20LangChain%20and%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Build a powerful Audio Virtual Assistant in less than 100 lines of code with AI Endpoints!</title>
		<link>https://blog.ovhcloud.com/build-a-powerful-audio-virtual-assistant-with-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 09 Jul 2024 13:16:29 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=27063</guid>

					<description><![CDATA[Raise your hands off the keyboard and chat with your LLM by voice with this Audio Virtual Assistant! Nowadays, the creation of virtual assistants has become more accessible than ever, thanks to advances in AI (Artificial Intelligence), particularly in the field of&#160;SpeechAI&#160;and the&#160;GenAI&#160;models. We will explore how&#160;OVHcloud&#160;AI Endpoints&#160;can be leveraged to design and develop an&#160;Audio [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbuild-a-powerful-audio-virtual-assistant-with-ai-endpoints%2F&amp;action_name=Build%20a%20powerful%20Audio%20Virtual%20Assistant%20in%20less%20than%20100%20lines%20of%20code%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Take your hands off the keyboard and chat with your LLM by voice with this Audio Virtual Assistant!</em></p>



<figure class="wp-block-image aligncenter"><img loading="lazy" decoding="async" width="924" height="618" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post.png" alt="" class="wp-image-27006" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post.png 924w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-768x514.png 768w" sizes="auto, (max-width: 924px) 100vw, 924px" /><figcaption class="wp-element-caption"><em>An audio robot assistant talking to a human about the recipe for apple pie</em></figcaption></figure>



<p>Nowadays, the creation of virtual assistants has become more accessible than ever, thanks to advances in AI (Artificial Intelligence), particularly in the field of&nbsp;<strong>SpeechAI</strong>&nbsp;and the&nbsp;<strong>GenAI</strong>&nbsp;models.</p>



<p>We will explore how&nbsp;<strong>OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a></strong>&nbsp;can be leveraged to design and develop an&nbsp;<strong>Audio Virtual Assistant</strong>&nbsp;capable of processing and understanding verbal questions, providing accurate answers, and returning answers verbally through speech synthesis.</p>



<p>In this step-by-step tutorial, we will take a look at how to send audio captured through the microphone to an&nbsp;<strong>LLM</strong>&nbsp;(Large Language Model) via a written transcript produced by&nbsp;<strong>ASR</strong>&nbsp;(Automatic Speech Recognition). The response is then spoken aloud by a&nbsp;<strong>TTS</strong>&nbsp;(Text To Speech) model.</p>



<h3 class="wp-block-heading">Objectives</h3>



<p>Whatever your level in AI, whether you’re a beginner or an expert, this tutorial will enable you to create your own powerful&nbsp;<strong>Audio Virtual Assistant</strong>&nbsp;in just a few lines of code.</p>



<p><strong>How to?</strong></p>



<p>By connecting your AI Endpoints like puzzles!</p>



<ul class="wp-block-list">
<li>Retrieve the written transcript of your oral question with ASR endpoint</li>



<li>Get the answer to your question with an LLM endpoint</li>



<li>Take advantage of the TTS endpoint with the oral reply</li>
</ul>



<figure class="wp-block-image aligncenter"><img loading="lazy" decoding="async" width="998" height="568" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-puzzles.png" alt="" class="wp-image-27008" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-puzzles.png 998w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-puzzles-300x171.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-puzzles-768x437.png 768w" sizes="auto, (max-width: 998px) 100vw, 998px" /><figcaption class="wp-element-caption"><em>AI Endpoints “puzzles” connexion</em></figcaption></figure>



<p>👀 But first of all, a few definitions are needed to fully understand the technical implementation that follows.</p>



<h3 class="wp-block-heading">Concept</h3>



<p>To better understand the technologies that revolve around the&nbsp;<strong>Audio Virtual Assistant</strong>, let’s start by examining the models and notions of ASR, LLM, TTS…</p>



<h5 class="wp-block-heading">AI Endpoints in a few words</h5>



<p><a href="https://labs.ovhcloud.com/en/ai-endpoints/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a>&nbsp;is a new serverless platform powered by OVHcloud and designed for developers.</p>



<p>The aim of&nbsp;<strong>AI Endpoints</strong>&nbsp;is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise.</p>



<p>It offers a&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/catalog" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">curated catalog</a>&nbsp;of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.</p>



<p>AI Endpoints provides access to advanced AI models, including Large Language Models (LLMs), natural language processing, translation, speech recognition, image recognition, and more.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png" alt="" class="wp-image-28463" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18.png 1312w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>OVHcloud AI Endpoints website</em></figcaption></figure>



<p>To learn more about AI Endpoints, refer to this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">website</a>.</p>



<p>AI Endpoints offers several ASR APIs in different languages… But what does ASR mean?</p>



<h5 class="wp-block-heading">Questioning with ASR</h5>



<p><strong>Automatic Speech Recognition</strong>&nbsp;(ASR) technology, also known as Speech-To-Text, is the process of converting spoken language into written text.</p>



<p>This process consists of several stages, including preparing the speech signal, extracting features, creating acoustic models, developing language models, and utilizing speech recognition engines.</p>



<p>With AI Endpoints, we simplify the use of ASR technology through our&nbsp;<strong>ready-to-use inference APIs</strong>. Learn how to use our APIs by following this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">link</a>.</p>



<p>These APIs can be used to transcribe a recorded audio question into text, which can then be sent to a Large Language Model (LLM) for an answer.</p>



<h5 class="wp-block-heading">Answering using LLM</h5>



<p>LLMs, or&nbsp;<strong>Large Language Models</strong>, are known for producing text that is similar to how humans write.</p>



<p>They use complex algorithms to predict patterns in human language, understand context, and provide relevant responses. With LLM, virtual assistants can engage in meaningful and dynamic conversations with users.</p>



<p>If you want to learn more, the best way is to try it out for yourself! You can do so by following this&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/018a19ab-167f-473b-8ec5-acb44380d175" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">link</a>.</p>



<p>In this particular application, the LLM will be configured to answer the user question based on the results of the ASR (Automatic Speech Recognition) endpoint.</p>



<p>🤯 Would you like a verbal response? Don’t worry, that’s what TTS is for.</p>



<h5 class="wp-block-heading">Expressing orally through TTS</h5>



<p>TTS stands for&nbsp;<strong>Text-To-Speech</strong>, which is a type of technology that converts written text into spoken words.</p>



<p>This technology uses Artificial Intelligence algorithms to interpret and generate human-like speech from text input.</p>



<p>It is commonly used in various applications such as voice assistants, audiobooks, language learning platforms, and accessibility tools for individuals with visual or reading impairments.</p>



<p>With AI Endpoints, TTS is easy to use thanks to the&nbsp;<strong>turnkey inference APIs</strong>. Test it for free&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</p>



<p>🤖 Are you ready to start coding the&nbsp;<strong>Audio Virtual Assistant</strong>? Here we go: 3, 2, 1, begin!</p>



<h3 class="wp-block-heading">Technical implementation of the Audio Virtual Assistant</h3>



<p>This technical section covers the following points:</p>



<ul class="wp-block-list">
<li>the use of the ASR endpoint inside Python code to transcribe the audio request</li>



<li>the implementation of the TTS function to convert the LLM response into spoken words</li>



<li>the creation of a Chatbot app using LLMs and&nbsp;<a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Streamlit</a></li>
</ul>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/audio-virtual-assistant" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</strong></p>



<figure class="wp-block-image aligncenter"><img loading="lazy" decoding="async" width="818" height="645" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-1.png" alt="" class="wp-image-27014" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-1.png 818w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-1-300x237.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-app-blog-post-1-768x606.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /><figcaption class="wp-element-caption"><em>Working principle of the web app resulting from technical implementation</em></figcaption></figure>



<p>To build the&nbsp;<strong>Audio Virtual Assistant</strong>, start by setting up the environment.</p>



<h5 class="wp-block-heading">Set up the environment</h5>



<p>In order to use<strong>&nbsp;AI Endpoints</strong>&nbsp;APIs easily, create a&nbsp;<code><strong>.env</strong></code>&nbsp;file to store environment variables.</p>



<pre class="wp-block-code"><code class="">ASR_AI_ENDPOINT=https://whisper-large-v3.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1<br>TTS_GRPC_ENDPOINT=nvr-tts-en-us.endpoints-grpc.kepler.ai.cloud.ovh.net:443<br>LLM_AI_ENDPOINT=https://mixtral-8x7b-instruct-v01.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1<br>OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;ai-endpoints-api-token></code></pre>



<p>⚠️&nbsp;<em><strong>Make sure to replace the token value (<code>OVH_AI_ENDPOINTS_ACCESS_TOKEN</code>) with your own.</strong> If you do not have one yet, follow the instructions in the <a href="https://help.ovhcloud.com/csm/de-public-cloud-ai-endpoints-getting-started?id=kb_article_view&amp;sysparm_article=KB0065406" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints &#8211; Getting Started</a> guide.</em></p>



<p>In this tutorial, we will be using the <strong>Whisper-Large-V3</strong> and <strong>Mixtral-8x7b-Instruct-V01</strong> models. Feel free to replace them with other models available in the <a href="https://catalog.endpoints.ai.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints catalog</a>.</p>



<p>In the next step, install the needed Python dependencies.</p>



<p>Create the&nbsp;<code><strong>requirements.txt</strong></code>&nbsp;file with the following libraries and launch the installation.</p>



<p>⚠️<em>The working environment is based on&nbsp;<strong>Python 3.11</strong></em></p>



<p><code>openai==1.68.2<br>streamlit==1.36.0<br>streamlit-mic-recorder==0.0.8<br>nvidia-riva-client==2.15.1<br>python-dotenv==1.0.1</code></p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<p>Once this is done, you can create a Python file named&nbsp;<strong><code>audio-virtual-assistant-app.py</code></strong>.</p>



<p>Then, import the Python libraries as follows:</p>



<pre class="wp-block-code"><code lang="" class="">import os<br>import numpy as np<br>from openai import OpenAI<br>import riva.client<br>from dotenv import load_dotenv<br>import streamlit as st<br>from streamlit_mic_recorder import mic_recorder</code></pre>



<p>After these lines, load and access the environment variables from your <code>.env</code> file:</p>



<pre class="wp-block-code"><code lang="" class=""># access the environment variables from the .env file<br>load_dotenv()<br><br>ASR_AI_ENDPOINT = os.environ.get('ASR_AI_ENDPOINT')<br>TTS_GRPC_ENDPOINT = os.environ.get('TTS_GRPC_ENDPOINT')<br>LLM_AI_ENDPOINT = os.environ.get('LLM_AI_ENDPOINT')<br>OVH_AI_ENDPOINTS_ACCESS_TOKEN = os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN')</code></pre>



<p>Next, define the clients that will be used to interact with the models:</p>



<pre class="wp-block-code"><code lang="" class="">llm_client = OpenAI(<br>    base_url=LLM_AI_ENDPOINT,<br>    api_key=OVH_AI_ENDPOINTS_ACCESS_TOKEN<br>)<br><br>tts_client = riva.client.SpeechSynthesisService(<br>    riva.client.Auth(<br>        uri=TTS_GRPC_ENDPOINT,<br>        use_ssl=True,<br>        metadata_args=[["authorization", f"bearer {OVH_AI_ENDPOINTS_ACCESS_TOKEN}"]]<br>    )<br>)<br><br>asr_client = OpenAI(<br>    base_url=ASR_AI_ENDPOINT,<br>    api_key=OVH_AI_ENDPOINTS_ACCESS_TOKEN<br>)</code></pre>



<p>💡 You are now ready to start coding your web app!</p>



<h5 class="wp-block-heading">Transcribe input question with ASR</h5>



<p>First, create the&nbsp;<strong>Automatic Speech Recognition</strong>&nbsp;(ASR) function in order to transcribe microphone input into text:</p>



<pre class="wp-block-code"><code class="">def asr_transcription(question, asr_client):<br>    return asr_client.audio.transcriptions.create(<br>        model="whisper-large-v3",<br>        file=question<br>    ).text</code></pre>



<p><strong>How does it work?</strong></p>



<ul class="wp-block-list">
<li>The audio input from the microphone recording is passed as <code><strong>question</strong></code></li>



<li>A call is made to the ASR AI Endpoint named&nbsp;<code><strong>whisper-large-v3</strong></code></li>



<li>The text from the transcript response is returned by the function</li>
</ul>
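


<p>As a quick sanity check, you can call this function outside Streamlit with a local recording (the file name <code>question.wav</code> is hypothetical, used here for illustration only):</p>



<pre class="wp-block-code"><code class=""># minimal sketch: transcribe a local .wav file with the function above
with open("question.wav", "rb") as audio_file:
    print(asr_transcription(audio_file, asr_client))</code></pre>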



<p>🎉 Congratulations! Your ASR function is ready: you can now transcribe audio questions.</p>



<h5 class="wp-block-heading">Generate LLM response to input question</h5>



<p>Now, create a function that calls the LLM client to provide responses to questions:</p>



<pre class="wp-block-code"><code lang="" class="">def llm_answer(input, llm_client):<br>    response = llm_client.chat.completions.create(<br>                model="Mixtral-8x7B-Instruct-v0.1", <br>                messages=input,<br>                temperature=0,<br>                max_tokens=1024,<br>            )<br>    msg = response.choices[0].message.content<br><br>    return msg</code></pre>



<p><strong>In this function:</strong></p>



<ul class="wp-block-list">
<li>The conversation/messages are retrieved as parameters</li>



<li>A call is made to the chat completion LLM endpoint, using the <code>Mixtral-8x7B-Instruct-v0.1</code> model</li>



<li>The model’s response is extracted and the final message text is returned</li>
</ul>
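


<p>Here again, a one-off call is enough to test the function on its own (a minimal sketch with a hard-coded question):</p>



<pre class="wp-block-code"><code class=""># minimal sketch: ask a single question outside Streamlit
messages = [{"role": "user", "content": "Give me a short recipe for apple pie."}]
print(llm_answer(messages, llm_client))</code></pre>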



<p>⏳ Almost there! All that remains is to implement the TTS to transform the LLM response into spoken words.</p>



<h5 class="wp-block-heading">Return the response using TTS</h5>



<p>Then, build the&nbsp;<strong>Text To Speech</strong>&nbsp;(TTS) function in order to transform the written answer into an oral reply:</p>



<pre class="wp-block-code"><code class="">def tts_synthesis(response, tts_client):<br><br>    # set up config<br>    sample_rate_hz = 48000<br>    req = {<br>            "language_code"  : "en-US",                           # languages: en-US<br>            "encoding"       : riva.client.AudioEncoding.LINEAR_PCM ,<br>            "sample_rate_hz" : sample_rate_hz,                    # sample rate: 48KHz audio<br>            "voice_name"     : "English-US.Female-1"              # voices: `English-US.Female-1`, `English-US.Male-1`<br>    }<br><br>    # return response<br>    req["text"] = response<br>    synthesized_response = tts_client.synthesize(**req)<br>    <br>    return np.frombuffer(synthesized_response.audio, dtype=np.int16), sample_rate_hz</code></pre>



<p><strong>In practice?</strong></p>



<ul class="wp-block-list">
<li>The LLM response is retrieved</li>



<li>A call is made to the TTS AI Endpoint named&nbsp;<code><strong>nvr-tts-en-us</strong></code></li>



<li>The audio sample and the sample rate are returned to play the audio automatically</li>
</ul>
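


<p>To check the output without Streamlit, you can write the synthesized samples to a <code>.wav</code> file with Python’s standard <code>wave</code> module (a minimal sketch; the file name <code>reply.wav</code> is arbitrary):</p>



<pre class="wp-block-code"><code class="">import wave

# minimal sketch: save the synthesized speech to a playable .wav file
audio_samples, sample_rate_hz = tts_synthesis("Hello from AI Endpoints!", tts_client)
with wave.open("reply.wav", "wb") as wav_file:
    wav_file.setnchannels(1)              # mono
    wav_file.setsampwidth(2)              # 16-bit PCM (np.int16)
    wav_file.setframerate(sample_rate_hz)
    wav_file.writeframes(audio_samples.tobytes())</code></pre>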



<p>⚡️ You’re almost there! Now all you have to do is build your&nbsp;<strong>Chatbot app</strong>.</p>



<h5 class="wp-block-heading">Build the LLM chat app with Streamlit</h5>



<p>In this last step, create the&nbsp;<strong>Chatbot app</strong>&nbsp;using&nbsp;<strong><a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Streamlit</a></strong>, an open-source Python library that lets you quickly create user interfaces for Machine Learning models and demos. A working code example follows the list below.</p>



<p><strong>What to do?</strong></p>



<ul class="wp-block-list">
<li>Create a first Streamlit container to put the title using&nbsp;<strong><code>st.container()</code></strong>&nbsp;and&nbsp;<code><strong>st.title()</strong></code></li>



<li>Add a second container for bot and user messages thanks to the following components:&nbsp;<strong><code>st.container()</code></strong>&nbsp;;&nbsp;<strong><code>st.session_state()</code></strong>&nbsp;;&nbsp;<code><strong>st.chat_message()</strong></code></li>



<li>Use a third container for the microphone recorder, the calls to the ASR, LLM and TTS functions, and the automatic audio player</li>
</ul>



<pre class="wp-block-code"><code class=""># streamlit interface<br>with st.container():<br>    st.title("💬 Audio Virtual Assistant Chatbot")<br><br>with st.container(height=600):<br>    messages = st.container()<br><br>    if "messages" not in st.session_state:<br>        st.session_state["messages"] = [{"role": "system", "content": "Hello, I'm AVA!", "avatar":"🤖"}]<br><br>    for msg in st.session_state.messages:<br>        messages.chat_message(msg["role"], avatar=msg["avatar"]).write(msg["content"])<br><br>with st.container():<br><br>    placeholder = st.empty()<br>    _, recording = placeholder.empty(), mic_recorder(<br>            start_prompt="START RECORDING YOUR QUESTION ⏺️", <br>            stop_prompt="STOP ⏹️", <br>            format="wav",<br>            use_container_width=True,<br>            key='recorder'<br>        )<br><br>    if recording:  <br>        user_question = asr_transcription(recording['bytes'], asr_client)<br><br>        if prompt := user_question:<br>            st.session_state.messages.append({"role": "user", "content": prompt, "avatar":"👤"})<br>            messages.chat_message("user", avatar="👤").write(prompt)<br>            msg = llm_answer(st.session_state.messages, llm_client)<br>            st.session_state.messages.append({"role": "assistant", "content": msg, "avatar": "🤖"})<br>            messages.chat_message("system", avatar="🤖").write(msg)<br><br>            if msg is not None:<br>                audio_samples, sample_rate_hz = tts_synthesis(msg, tts_client)<br>                placeholder.audio(audio_samples, sample_rate=sample_rate_hz, autoplay=True)</code></pre>



<p>Now, the&nbsp;<strong>Audio Virtual Assistant</strong>&nbsp;is ready to use!</p>



<figure class="wp-block-image aligncenter"><img loading="lazy" decoding="async" width="834" height="916" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-16.png" alt="" class="wp-image-27051" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-16.png 834w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-16-273x300.png 273w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-16-768x844.png 768w" sizes="auto, (max-width: 834px) 100vw, 834px" /><figcaption class="wp-element-caption"><em>Audio Virtual Assistant web app</em></figcaption></figure>



<p>🚀 That’s it! Now get the most out of your tool by launching it locally.</p>



<h5 class="wp-block-heading">Launch Streamlit chatbot app locally</h5>



<p>Finally, you can start this Streamlit app locally by launching the following command:</p>



<pre class="wp-block-code"><code class="">streamlit run audio-virtual-assistant.py </code></pre>



<p>Benefit from the full power of your tool, as shown in the demo below!</p>



<figure class="wp-block-video aligncenter"><video height="792" style="aspect-ratio: 1122 / 792;" width="1122" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-demo.mp4"></video></figure>



<h5 class="wp-block-heading">Improvements</h5>



<p>By default, the <code><strong>nvr-tts-en-us</strong></code> model supports only a limited number of characters per request when generating audio. If you exceed this limit, you will encounter errors in your application.</p>



<p>To work around this limitation, you can replace the existing <code><strong>tts_synthesis</strong></code> function with the following implementation, which processes text in chunks:</p>



<pre class="wp-block-code"><code class="">def tts_synthesis(response, tts_client):<br>    # Split response into chunks of max 1000 characters<br>    max_chunk_length = 1000<br>    words = response.split()<br>    chunks = []<br>    current_chunk = ""<br><br>    for word in words:<br>        if len(current_chunk) + len(word) + 1 &lt;= max_chunk_length:<br>            current_chunk += " " + word if current_chunk else word<br>        else:<br>            chunks.append(current_chunk)<br>            current_chunk = word<br>    if current_chunk:<br>        chunks.append(current_chunk)<br><br>    all_audio = np.array([], dtype=np.int16)<br>    sample_rate_hz = 16000<br><br>    # Process each chunk and concatenate the resulting audio<br>    for text in chunks:<br>        req = {<br>            "language_code": "en-US",<br>            "encoding": riva.client.AudioEncoding.LINEAR_PCM,<br>            "sample_rate_hz": sample_rate_hz,<br>            "voice_name": "English-US.Female-1",<br>            "text": text.strip(),<br>        }<br>        synthesized = tts_client.synthesize(**req)<br>        audio_segment = np.frombuffer(synthesized.audio, dtype=np.int16)<br>        all_audio = np.concatenate((all_audio, audio_segment))<br><br>    return all_audio, sample_rate_hz</code></pre>



<p>☁️ Moreover, it’s also possible to make your interface accessible to everyone…</p>



<h5 class="wp-block-heading">Go further</h5>



<p>If you want to go further and deploy your web app in the cloud, refer to the following articles and tutorials.</p>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal"></a><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-streamlit-app-for-eda-and-interactive-prediction-part-2/" data-wpel-link="internal">Deploy a custom Docker image for Data Science project – Streamlit app for EDA and interactive prediction (Part 2)</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-build-use-custom-image?id=kb_article_view&amp;sysparm_article=KB0057413" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Deploy – Tutorial – Build &amp; use a custom Docker image</a></li>
</ul>



<h3 class="wp-block-heading">Conclusion of the Audio Virtual Assistant</h3>



<p>Well done 🎉! You have learned how to build your own&nbsp;<strong>Audio Virtual Assistant</strong>&nbsp;in a few lines of code.</p>



<p>You’ve also seen how easy it is to use&nbsp;<strong>AI Endpoints</strong>&nbsp;to create innovative turnkey solutions.</p>



<p><strong>➡️ Access the full code&nbsp;<a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/audio-virtual-assistant" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>.</strong></p>



<p>🚀&nbsp;<strong>What’s next?</strong>&nbsp;Implement a&nbsp;<a href="https://blog.ovhcloud.com/rag-chatbot-using-ai-endpoints-and-langchain/" data-wpel-link="internal">RAG chatbot</a>&nbsp;to specialize this Audio Virtual Assistant on your data!</p>



<h3 class="wp-block-heading">References</h3>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">Enhance your applications with AI Endpoints</a></li>



<li><a href="https://blog.ovhcloud.com/rag-chatbot-using-ai-endpoints-and-langchain/" data-wpel-link="internal">RAG chatbot using AI Endpoints and LangChain</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-langchain-and-javascript-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints, LangChain and Javascript to create a chatbot</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-and-langchain-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints and LangChain to create a chatbot</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbuild-a-powerful-audio-virtual-assistant-with-ai-endpoints%2F&amp;action_name=Build%20a%20powerful%20Audio%20Virtual%20Assistant%20in%20less%20than%20100%20lines%20of%20code%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-virtual-assistant-demo.mp4" length="2573065" type="video/mp4" />

			</item>
		<item>
		<title>Create your own Audio Summarizer assistant with AI Endpoints!</title>
		<link>https://blog.ovhcloud.com/create-audio-summarizer-assistant-with-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Thu, 04 Jul 2024 07:39:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26964</guid>

					<description><![CDATA[Do you dream of being able to summarize hours of meetings in a matter of seconds? Don&#8217;t go away, we&#8217;ll explain it all here! Introduction Are you looking for a way to efficiently summarize your meetings, broadcasts, and podcasts for quick reference or to provide to others? Look no further! In this blog post, you [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-audio-summarizer-assistant-with-ai-endpoints%2F&amp;action_name=Create%20your%20own%20Audio%20Summarizer%20assistant%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>Do you dream of being able to<strong> summarize hours of meetings in a matter of seconds</strong>? Don&#8217;t go away, we&#8217;ll explain it all here!</em></p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="923" height="618" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio_transcriber_summarizer_blog_post-1.png" alt="" class="wp-image-26972" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio_transcriber_summarizer_blog_post-1.png 923w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio_transcriber_summarizer_blog_post-1-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio_transcriber_summarizer_blog_post-1-768x514.png 768w" sizes="auto, (max-width: 923px) 100vw, 923px" /><figcaption class="wp-element-caption"><em>Robot assistant transcribing and summarizing audios into short texts</em></figcaption></figure>



<h3 class="wp-block-heading">Introduction</h3>



<p>Are you looking for a way to efficiently summarize your meetings, broadcasts, and podcasts for quick reference or to provide to others? Look no further!</p>



<p>In this blog post, you will learn how to create an Audio Summarizer assistant that can not only transcribe but also summarize all your audio files.</p>



<p>Thanks to <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, it&#8217;s never been easier to create a virtual assistant that can help you stay on top of your meetings and keep track of important information.</p>



<p>This article will explore how AI APIs can be used to create an advanced virtual assistant that transcribes and summarizes any audio file thanks to <strong>ASR</strong> (Automatic Speech Recognition) technologies and famous <strong>LLMs</strong> (Large Language Models).</p>



<h3 class="wp-block-heading">Objectives</h3>



<p>Whether you&#8217;re a professional, a student or just want to make the most of your time, this step-by-step guide will show you how to create an Audio Summarizer assistant that will help you summarize your meetings, shows and podcasts, allowing you to concentrate on what really matters!</p>



<p><strong>How to?</strong></p>



<p>By connecting your AI Endpoints like puzzles!</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="895" height="541" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-puzzles.png" alt="" class="wp-image-26981" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-puzzles.png 895w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-puzzles-300x181.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-puzzles-768x464.png 768w" sizes="auto, (max-width: 895px) 100vw, 895px" /><figcaption class="wp-element-caption"><em>AI Endpoints &#8220;puzzles&#8221; connexion </em></figcaption></figure>



<p>👀 But first of all, a few definitions are needed to fully understand the technical implementation that follows.</p>



<h3 class="wp-block-heading">Concept</h3>



<p>In order to better understand the technologies that revolve around the Audio Summarizer, let&#8217;s start by looking at the tools and notions of <strong>ASR</strong>, <strong>LLM</strong>, &#8230;</p>



<h5 class="wp-block-heading">AI Endpoints in a few words</h5>



<p><a href="https://labs.ovhcloud.com/en/ai-endpoints/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> is a new serverless platform powered by OVHcloud and designed for developers.</p>



<p>The aim of <strong>AI Endpoints</strong> is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise. </p>



<p>It offers a <a href="https://endpoints.ai.cloud.ovh.net/catalog" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">curated catalog</a> of world-renowned AI models and Nvidia&#8217;s optimized models, with a commitment to privacy as data is not stored or shared during or after model use. </p>



<p>AI Endpoints provides access to advanced AI models, including Large Language Models (LLMs), natural language processing, translation, speech recognition, image recognition, and more. </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png" alt="" class="wp-image-28463" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-18.png 1312w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>OVHcloud AI Endpoints website</em></figcaption></figure>



<p>To learn more about AI Endpoints, refer to this <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">website</a>.</p>



<p>AI Endpoints offers several ASR APIs in different languages&#8230; But what does ASR mean?</p>



<h5 class="wp-block-heading">It all starts with ASR</h5>



<p><strong>Automatic Speech Recognition</strong>&nbsp;(ASR) is a technology that converts spoken language into written text. </p>



<p>It is a complex process that involves several stages, including speech signal preprocessing, feature extraction, acoustic modeling, language modeling, and a speech recognition engine.</p>



<p>AI Endpoints makes it easy, with ready-to-use inference APIs. Discover how to use them <a href="https://endpoints.ai.cloud.ovh.net/models/30da93c6-f951-43d0-bb9b-ea6e75354af4" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</p>



<p>In this context, ASR will be used to transcribe long audio files into text in order to summarize them with LLMs.</p>



<h5 class="wp-block-heading">Making summary with LLM</h5>



<p>The famous LLMs, or <strong>Large Language Models</strong>, are designed to generate human-like text.</p>



<p>They use complex algorithms to predict patterns in human language, understand context, and provide relevant responses. With LLM, virtual assistants can engage in meaningful and dynamic conversations with users.</p>



<p>To find out more, what better way than to test it yourself? Follow this <a href="https://endpoints.ai.cloud.ovh.net/models/018a19ab-167f-473b-8ec5-acb44380d175" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">link</a>.</p>



<p>For the current use case, the LLM prompt will specify that the model should generate a summary of the input text, based on the result of the ASR endpoint.</p>



<p>🤖 Do you want to start coding the Audio Summarizer? 3, 2, 1, get ready, go!</p>



<h3 class="wp-block-heading">Technical implementation of the Audio Summarizer</h3>



<p>In this technical part, the following points will be discussed:</p>



<ul class="wp-block-list">
<li>the use of the ASR API inside Python code</li>



<li>the integration of the Mixtral8x7B LLM</li>



<li>the creation of a web app with <a href="https://gradio.app/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gradio</a></li>
</ul>



<p><strong>➡️ Access the full code <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/audio-summarizer-assistant" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="818" height="489" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post.png" alt="" class="wp-image-26974" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post.png 818w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-300x179.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/audio-summarizer-app-blog-post-768x459.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /><figcaption class="wp-element-caption"><em>Working principle of the web app resulting from technical implementation</em></figcaption></figure>



<p>To build the Audio Summarizer, start by setting up the environment.</p>



<h5 class="wp-block-heading">Set up the environment</h5>



<p>In order to use<strong> AI Endpoints</strong> APIs easily, create a <code><strong>.env</strong></code> file to store environment variables.</p>



<pre class="wp-block-code"><code class="">ASR_AI_ENDPOINT=https://whisper-large-v3.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1<br>LLM_AI_ENDPOINT=https://mixtral-8x7b-instruct-v01.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1<br>OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;ai-endpoints-api-token></code></pre>



<p>⚠️ <em>Test AI Endpoints and get your <strong>free token</strong> <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></em></p>



<p>In the next step, install the needed Python dependencies.</p>



<p>Create the <code><strong>requirements.txt</strong></code> file with the following libraries and launch the installation.</p>



<p>⚠️<em>The working environment is based on <strong>Python 3.11</strong></em></p>



<p><code>openai==1.68.2<br>gradio==4.44.1<br>pydub==0.25.1<br>python-dotenv==1.0.1</code></p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<p>Once this is done, you can create a Python file named <code><strong>audio-summarizer-app.py</strong></code>.</p>



<p>Then, import the Python libraries as follows:</p>



<pre class="wp-block-code"><code class="">import gradio as gr<br>import io<br>import os<br>import requests<br>from pydub import AudioSegment<br>from dotenv import load_dotenv<br>from openai import OpenAI<br>import functools</code></pre>



<p>Now, load and access the environment variables.</p>



<pre class="wp-block-code"><code class=""># access the environment variables from the .env file<br>load_dotenv()<br><br>asr_ai_endpoint_url = os.environ.get('ASR_AI_ENDPOINT') <br>llm_ai_endpoint_url = os.getenv("LLM_AI_ENDPOINT")<br>ai_endpoint_token = os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN")</code></pre>



<p>Then define the clients that communicate with your APIs and authenticate your requests:</p>



<pre class="wp-block-code"><code class="">asr_client = OpenAI(<br>    base_url=asr_ai_endpoint_url,<br>    api_key=ai_endpoint_token<br>)<br><br>llm_client = OpenAI(<br>    base_url=llm_ai_endpoint_url,<br>    api_key=ai_endpoint_token<br>)</code></pre>



<p>💡 You are now ready to start coding your web app!</p>



<h5 class="wp-block-heading">Transcribe audio file with ASR</h5>



<p>First, create the <strong>Automatic Speech Recognition</strong> (ASR) function in order to transcribe audio files into text.</p>



<p><strong>How does it work?</strong></p>



<ul class="wp-block-list">
<li>The audio file is preprocessed as follows: <code><strong>.wav</strong></code> format, <code><strong>1</strong></code> channel, <strong><code>16000</code></strong> frame rate</li>



<li>The transformed audio <code><strong>processed_audio</strong></code> is read</li>



<li>An API call is made to the ASR AI Endpoint named <code><strong>whisper-large-v3</strong></code></li>



<li>The full response is stored in the <code><strong>response</strong></code> variable and its text is returned by the function</li>
</ul>



<pre class="wp-block-code"><code class="">def asr_transcription(asr_client, audio):<br>    <br>    if audio is None:<br>        return " "<br><br>    else:<br>        # preprocess audio <br>        processed_audio = "/tmp/my_audio.wav"<br>        audio_input = AudioSegment.from_file(audio, "mp3")<br>        process_audio_to_wav = audio_input.set_channels(1)<br>        process_audio_to_wav = process_audio_to_wav.set_frame_rate(16000)<br>        process_audio_to_wav.export(processed_audio, format="wav")<br>    <br>        with open(processed_audio, "rb") as audio_file:<br>            response = asr_client.audio.transcriptions.create(<br>                model="whisper-large-v3",<br>                file=audio_file,<br>                response_format="verbose_json",<br>                timestamp_granularities=["segment"]<br>            )<br><br>        # return complete transcription<br>        return response.text</code></pre>



<p>🎉 Congratulations! Your ASR function is ready to use.</p>



<p>Now it&#8217;s time to call an LLM to summarize the transcribed text.</p>



<h5 class="wp-block-heading">Summarize audio with LLM</h5>



<p>In this second step, create the <strong>Chat Completion</strong> function to use Mixtral8x7B effectively.</p>



<p><strong>What to do?</strong></p>



<ul class="wp-block-list">
<li>Check that the transcription exists</li>



<li>Use the OpenAI API compatibility to call the LLM</li>



<li>Customize your prompt in order to <strong>specify LLM task</strong></li>



<li>Return the audio summary</li>
</ul>



<pre class="wp-block-code"><code class="">def chat_completion(llm_client, new_message):<br><br>    if new_message==" ":<br>        return "Please, send an input audio to get its summary!"<br>    <br>    else:<br><br>        # prompt<br>        history_openai_format = [{"role": "user", "content": f"Summarize the following text in a few words: {new_message}"}]<br>        # return summary<br>        return llm_client.chat.completions.create(<br>            model="Mixtral-8x7B-Instruct-v0.1",<br>            messages=history_openai_format,<br>            temperature=0,<br>            max_tokens=1024<br>        ).choices.pop().message.content</code></pre>



<p>⚡️ You&#8217;re almost there! Now all you have to do is build your web app.</p>



<p>To make your solution easy to use, what better way than to quickly create an interface with just a few lines of code?</p>



<h5 class="wp-block-heading">Build Gradio app</h5>



<p><strong>Gradio</strong> is an open-source Python library that lets you quickly create user interfaces for Machine Learning models and demos.</p>



<p><strong>What does it mean in practice?</strong></p>



<p>Inside a Gradio Block, you can:</p>



<ul class="wp-block-list">
<li>Define a theme for your UI</li>



<li>Add a title to your web app with <strong><code>gr.HTML()</code></strong></li>



<li>Upload audio thanks to the dedicated component, <strong><code>gr.Audio()</code></strong></li>



<li>Obtain the result of the written transcription with the <strong><code>gr.Textbox()</code></strong></li>



<li>Get a summary of the audio with the powerful LLM and a second <code><strong>gr.Textbox()</strong></code> component</li>



<li>Add a clear button with <code><strong>gr.ClearButton()</strong></code> to reset the page of the web app</li>
</ul>



<pre class="wp-block-code"><code class=""># create partial functions with bound client instances<br>asr_transcribe_fn = functools.partial(asr_transcription, asr_client)<br>chat_completion_fn = functools.partial(chat_completion, llm_client)<br><br><br># gradio<br>with gr.Blocks(theme=gr.themes.Default(primary_hue="blue"), fill_height=True) as demo:<br><br>    # add title and description<br>    with gr.Row():<br>        gr.HTML(<br>            """<br>            &lt;div align="center"><br>                &lt;h1>Welcome on Audio Summarizer web app 💬!&lt;/h1><br>                &lt;i>Transcribe and summarize your broadcast, meetings, conversations, potcasts and much more...&lt;/i><br>            &lt;/div><br>            &lt;br><br>            """<br>        )<br>        <br>    # audio zone for user question<br>    gr.Markdown("## Upload your audio file 📢")<br>    with gr.Row():<br>        inp_audio = gr.Audio(<br>            label = "Audio file in .wav or .mp3 format:",<br>            sources = ['upload'],<br>            type = "filepath",<br>        )<br><br>    # written transcription of user question<br>    with gr.Row():<br>        inp_text = gr.Textbox(<br>            label = "Audio transcription into text:",<br>        )<br>        <br>    # chabot answer<br>    gr.Markdown("## Chatbot summarization 🤖")<br>    with gr.Row():<br>        out_resp = gr.Textbox(<br>            label = "Get a summary of your audio:",<br>        )<br><br>    with gr.Row():<br>        <br>        # clear inputs<br>        clear = gr.ClearButton([inp_audio, inp_text, out_resp])<br>  <br>    # update functions<br>    inp_audio.change(<br>        fn = asr_transcribe_fn,<br>        inputs = inp_audio,<br>        outputs = inp_text<br>      )<br>    inp_text.change(<br>        fn = chat_completion_fn,<br>        inputs = inp_text,<br>        outputs = out_resp<br>      )</code></pre>



<p>Then, you can launch it from the <strong><code>main</code></strong> entry point.</p>



<pre class="wp-block-code"><code class="">if __name__ == '__main__':
 
    demo.launch(server_name="0.0.0.0", server_port=8000)</code></pre>



<p>Now, the web app is ready to be used!</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="702" src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14-1024x702.png" alt="" class="wp-image-26982" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14-1024x702.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14-300x206.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14-768x526.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/07/image-14.png 1167w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Audio Summarizer web app overview</em></figcaption></figure>



<p>🚀 That&#8217;s it! Now get the most out of your tool by launching it locally.</p>



<h5 class="wp-block-heading">Launch Gradio web app locally</h5>



<p>Finally, you can start this Gradio app locally by launching the following command:</p>



<pre class="wp-block-code"><code class="">python audio-summarizer-app.py</code></pre>



<p>Benefit from the full power of your tool and save time!</p>



<p>☁️ It&#8217;s also possible to make your interface accessible to everyone&#8230;</p>



<figure class="wp-block-video aligncenter"><video height="792" style="aspect-ratio: 1122 / 792;" width="1122" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/07/video-audio-summarizer.mp4"></video><figcaption class="wp-element-caption"><em>Audio Summarizer assistant demo</em></figcaption></figure>



<h5 class="wp-block-heading">Go Further</h5>



<p>If you want to go further and deploy your web app in the cloud, refer to the following articles and tutorials.</p>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal">Deploy a custom Docker image for Data Science project</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-build-use-custom-image?id=kb_article_view&amp;sysparm_article=KB0057413" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy &#8211; Tutorial &#8211; Build &amp; use a custom Docker image</a></li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-gradio-sketch-recognition?id=kb_article_view&amp;sysparm_article=KB0048083" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy &#8211; Tutorial &#8211; Deploy a Gradio app for sketch recognition</a></li>
</ul>



<h3 class="wp-block-heading">Conclusion of the Audio Summarizer </h3>



<p>Well done 🎉! You have learned how to build your own <strong>Audio Summarizer</strong> app in a few lines of code.</p>



<p>You&#8217;ve also seen how easy it is to use <strong>AI Endpoints</strong> to create innovative turnkey solutions.</p>



<p><strong>➡️ Access the full code <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/audio-summarizer-assistant" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<p><strong>What&#8217;s next?</strong> Modify your prompt and add translation to get the summary in another language 💡</p>



<h3 class="wp-block-heading">References</h3>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">Enhance your applications with AI Endpoints</a></li>



<li><a href="https://blog.ovhcloud.com/rag-chatbot-using-ai-endpoints-and-langchain/" data-wpel-link="internal">RAG chatbot using AI Endpoints and LangChain</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-langchain-and-javascript-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints, LangChain and Javascript to create a chatbot</a></li>



<li><a href="https://blog.ovhcloud.com/how-to-use-ai-endpoints-and-langchain-to-create-a-chatbot/" data-wpel-link="internal">How to use AI Endpoints and LangChain to create a chatbot</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-audio-summarizer-assistant-with-ai-endpoints%2F&amp;action_name=Create%20your%20own%20Audio%20Summarizer%20assistant%20with%20AI%20Endpoints%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/07/video-audio-summarizer.mp4" length="7716155" type="video/mp4" />

			</item>
	</channel>
</rss>
