<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>prometheus Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/prometheus/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/prometheus/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 10 Apr 2026 09:23:41 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>prometheus Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/prometheus/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Reference Architecture: Deploying a vision-language model with vLLM on OVHcloud MKS for high performance inference and full observability</title>
		<link>https://blog.ovhcloud.com/reference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 10 Apr 2026 07:48:53 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30455</guid>

					<description><![CDATA[Ensure complete&#160;digital sovereignty&#160;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&#160;Managed Kubernetes Service. This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&#160;OVHcloud Managed Kubernetes Service&#160;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&#160;Qwen3-VL-8B-Instruct&#160;multimodal model (vision + text) with OpenAI-compatible API endpoints. This comprehensive [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em><em>Ensure complete&nbsp;<strong>digital sovereignty</strong>&nbsp;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&nbsp;<strong>Managed Kubernetes Service</strong>.</em></em></p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" width="703" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg" alt="vLLM on OVHcloud MKS for high availability and full observability" class="wp-image-31153" style="width:710px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg 703w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-206x300.jpg 206w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-768x1118.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-1055x1536.jpg 1055w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm.jpg 1260w" sizes="(max-width: 703px) 100vw, 703px" /><figcaption class="wp-element-caption"><em><em>vLLM on OVHcloud MKS for high availability and full observability</em></em></figcaption></figure>



<p class="wp-block-paragraph">This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&nbsp;<a href="https://www.ovhcloud.com/fr/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Kubernetes Service</a>&nbsp;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen3-VL-8B-Instruct</a>&nbsp;multimodal model (vision + text) with OpenAI-compatible API endpoints.</p>



<p class="wp-block-paragraph">This comprehensive guide shows you how to deploy, automatically scale, and monitor vLLM-based LLM workloads on OVHcloud infrastructure.</p>



<p class="wp-block-paragraph"><strong>What are the key benefits?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-effectiveness:</strong>&nbsp;Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability:</strong>&nbsp;Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure:</strong>&nbsp;Keep all metrics and data within European datacentres</li>



<li><strong>Scalable by design:</strong>&nbsp;Automatically scale GPU inference replicas based on real workload demand</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p class="wp-block-paragraph"><strong>OVHcloud MKS</strong>&nbsp;is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p class="wp-block-paragraph"><strong>How does this benefit you?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Pay only for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, upgrades, and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking, and persistent storage</li>



<li><strong>Scalable and flexible</strong>: Scale workloads and node pools easily to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p class="wp-block-paragraph">In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Architecture overview</h2>



<p class="wp-block-paragraph">This reference architecture demonstrates a basic deployment of vLLM for vision-language model inference on OVHcloud Managed Kubernetes Service, featuring:</p>



<ul class="wp-block-list">
<li><strong>High-availability deployment</strong>&nbsp;with 2 GPU nodes (NVIDIA L40S)</li>



<li><strong>Optimised GPU utilisation</strong>&nbsp;with proper driver configuration</li>



<li><strong>Scalable infrastructure</strong>&nbsp;supporting vision-language models</li>



<li><strong>Comprehensive monitoring</strong>&nbsp;using Prometheus, Grafana, and DCGM</li>



<li><strong>Full observability</strong>&nbsp;for both application and hardware metrics</li>
</ul>



<p class="wp-block-paragraph"><strong>Data flow</strong>:</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg" alt="" class="wp-image-30985" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1536x806.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-2048x1075.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Data Flow</em></figcaption></figure>



<ol class="wp-block-list">
<li><strong>Inference request:</strong>
<ul class="wp-block-list">
<li>User → LoadBalancer → Gateway → NGINX Ingress → &#8220;Qwen3 VL&#8221; Service → vLLM Pod → GPU</li>



<li>Response follows reverse path with streaming support</li>
</ul>
</li>



<li><strong>Metrics collection:</strong>
<ul class="wp-block-list">
<li>vLLM Pods expose <code>/metrics</code> endpoint (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">8000</mark></strong></code>)</li>



<li>DCGM Exporters expose GPU metrics (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">9400</mark></strong></code>)</li>



<li>Prometheus scrapes both endpoints every 30 seconds</li>



<li>Grafana queries Prometheus for visualisation</li>
</ul>
</li>



<li><strong>Load distribution:</strong>
<ul class="wp-block-list">
<li>NGINX Ingress uses cookie-based session affinity</li>



<li>vLLM Service uses ClientIP session affinity</li>



<li>Anti-affinity ensures 1 pod per GPU node</li>
</ul>
</li>
</ol>



<h2 class="wp-block-heading">Prerequisites</h2>



<p class="wp-block-paragraph">Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code>Administrator</code></strong></a>&nbsp;role</li>



<li><strong>Hugging Face access</strong>&nbsp;–&nbsp;<em>create a&nbsp;<a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></em></li>



<li><code><strong>kubectl</strong></code>&nbsp;and&nbsp;<code><strong>helm</strong></code>&nbsp;(version 3.x or later) installed</li>
</ul>



<p class="wp-block-paragraph"><strong>🚀 Now you have all the ingredients, it’s time to deploy the recipe for&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen/Qwen3-VL-8B-Instruct</a>&nbsp;using vLLM and MKS!</strong></p>



<h2 class="wp-block-heading">Architecture guide: Native GPU deployment of vLLM on MKS with full stack observability</h2>



<p class="wp-block-paragraph">This reference architecture describes a&nbsp;<strong>Large Language Model</strong>&nbsp;deployment using the&nbsp;<strong>vLLM inference server</strong>&nbsp;and&nbsp;<strong>Kubernetes</strong>, giving you a service that is both highly available and observable in real time.</p>



<h3 class="wp-block-heading">Step 1 &#8211; Create MKS cluster and Node pools</h3>



<p class="wp-block-paragraph">From the&nbsp;<a href="https://www.ovh.com/manager/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">OVHcloud Control Panel</a>, create a Kubernetes cluster using&nbsp;<strong>MKS</strong>.</p>



<p class="wp-block-paragraph">Navigate to: <code>Public Cloud</code> → <code>Managed Kubernetes Service</code> → <code>Create a cluster</code></p>



<h4 class="wp-block-heading">1. Configure cluster</h4>



<p class="wp-block-paragraph">Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Name:</strong> <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code></li>



<li><strong>Location</strong>: 1-AZ Region &#8211; Gravelines (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">GRA11</mark></strong></code>)</li>



<li><strong>Plan:</strong> Free (or Standard)</li>



<li><strong>Network</strong>: attach a <strong>Private network </strong>(e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">0000 - AI Private Network</mark></strong></code>)</li>



<li><strong>Version:</strong> Latest stable (e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">1.34</mark></strong></code>)</li>
</ul>



<h4 class="wp-block-heading">2. Create GPU Node pool</h4>



<p class="wp-block-paragraph">During the cluster creation, configure the vLLM Node pool for GPUs:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></code></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">L40S-90</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">2</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<p class="wp-block-paragraph"><strong>Why L40S-90?</strong></p>



<ul class="wp-block-list">
<li>Cost-effective for single-model deployment (1 GPU per node)</li>



<li>Sufficient RAM (90GB) for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Qwen3-VL-8B</mark></code></strong> model</li>
</ul>



<p class="wp-block-paragraph">You should see your cluster (e.g.&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code>) in the list, along with the following information:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="930" height="588" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png" alt="" class="wp-image-30745" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png 930w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-300x190.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-768x486.png 768w" sizes="(max-width: 930px) 100vw, 930px" /></figure>



<p class="wp-block-paragraph">You can now set up the node pool dedicated to monitoring.</p>



<h4 class="wp-block-heading">3. Create CPU Node pool</h4>



<p class="wp-block-paragraph">From your cluster, click on <code><strong><mark class="has-inline-color has-ast-global-color-0-color">Add a node pool</mark></strong></code> and configure it as follows:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">B2-15</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">1</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">✅ <strong>Note</strong></p>



<p class="wp-block-paragraph"><strong><em>The monitoring stack can run on the GPU nodes if cost is a concern. A dedicated CPU node provides better isolation and resource management.</em></strong></p>
</blockquote>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="365" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png" alt="" class="wp-image-30743" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-300x107.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-768x274.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation.png 1283w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">If the status is green with the&nbsp;<strong><code><mark class="has-inline-color has-ast-global-color-0-color">OK</mark></code></strong>&nbsp;label, you can proceed to the next step.</p>



<h4 class="wp-block-heading">4. Configure Kubernetes access</h4>



<p class="wp-block-paragraph">Once your nodes have been provisioned, you can download the <strong>Kubeconfig file</strong> and configure kubectl with your MKS cluster.</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p class="wp-block-paragraph">Returning:</p>



<p class="wp-block-paragraph"><code>NAME &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; STATUS &nbsp; ROLES&nbsp; &nbsp; AGE &nbsp; VERSION<br>monitoring-node-xxxxxx &nbsp; Ready&nbsp; &nbsp; &lt;none&gt; &nbsp; 1d &nbsp; v1.34.2<br>vllm-node-yyyyyy &nbsp; &nbsp; &nbsp; &nbsp; Ready&nbsp; &nbsp; &lt;none&gt; &nbsp; 1d &nbsp; v1.34.2<br>vllm-node-zzzzzz &nbsp; &nbsp; &nbsp; &nbsp; Ready&nbsp; &nbsp; &lt;none&gt; &nbsp; 1d &nbsp; v1.34.2</code></p>



<p class="wp-block-paragraph">Before going further, add a label to the CPU node for monitoring workloads.</p>



<pre class="wp-block-code"><code class="">CPU_NODE=$(kubectl get nodes -o json | \<br>  jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" == null) | .metadata.name')<br>kubectl label node $CPU_NODE node-role=monitoring</code></pre>
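


<p class="wp-block-paragraph">The filter above selects every node that does not advertise the <code>nvidia.com/gpu</code> resource. If it is run before the GPU operator is installed (Step 2), the GPU nodes may not advertise that resource yet, in which case every node would match. A minimal sketch of an alternative, assuming that node names start with the node pool name as in the <code>kubectl get nodes</code> output above:</p>



<pre class="wp-block-code"><code class=""># Assumption: nodes of the "monitoring" pool are named monitoring-node-...<br>for NODE in $(kubectl get nodes -o name | grep '^node/monitoring-'); do<br>  # $NODE has the form node/&lt;name&gt;, which kubectl label accepts directly<br>  kubectl label "$NODE" node-role=monitoring --overwrite<br>done</code></pre>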



<p class="wp-block-paragraph">Finally, check that the label has been applied. You should see output similar to the following:</p>



<pre class="wp-block-code"><code class="">NAME                     GPU      ROLE<br>monitoring-node-xxxxxx   &lt;none&gt;   monitoring<br>vllm-node-yyyyyy         1        &lt;none&gt;<br>vllm-node-zzzzzz         1        &lt;none&gt;</code></pre>
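


<p class="wp-block-paragraph">A listing in this shape can be produced with a <code>custom-columns</code> query. The following is a sketch, not necessarily the exact command used for this article; the column names are chosen to match the output above, and the dot in <code>nvidia.com/gpu</code> has to be escaped:</p>



<pre class="wp-block-code"><code class=""># One row per node: name, allocatable GPUs, and the node-role label<br>kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,ROLE:.metadata.labels.node-role'</code></pre>



<p class="wp-block-paragraph">Missing fields are printed as <code>&lt;none&gt;</code>, so the <code>GPU</code> column stays empty until the GPU operator from Step 2 is running.</p>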



<p class="wp-block-paragraph">Once all three nodes are in <strong>Ready</strong> status, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 2 &#8211; Install GPU operator</h3>



<p class="wp-block-paragraph">To start, consider setting up the GPU operator.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>✅ Note</strong></p>



<p class="wp-block-paragraph"><em><strong>This step is based on this OVHcloud documentation: <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-kubernetes-deploy-gpu-application?id=kb_article_view&amp;sysparm_article=KB0049707" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deploying a GPU application on OVHcloud Managed Kubernetes Service</a></strong></em></p>
</blockquote>



<h4 class="wp-block-heading">1. Add NVIDIA helm repository and create namespace</h4>



<p class="wp-block-paragraph">Add NVIDIA helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add nvidia https://helm.ngc.nvidia.com/nvidia<br>helm repo update</code></pre>



<p class="wp-block-paragraph">Then create the namespace as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace gpu-operator</code></pre>



<h4 class="wp-block-heading">2. Install GPU operator with correct configuration</h4>



<p class="wp-block-paragraph">The GPU Operator must be configured with specific driver versions to ensure compatibility with vLLM containers.</p>



<p class="wp-block-paragraph">By default, the installation uses recent drivers (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">580.x</mark></strong></code> with <strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 13.x</mark></code></strong>), which are incompatible with vLLM containers (<strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.x</mark></code></strong>).</p>



<p class="wp-block-paragraph"><strong>Solution:</strong> Force driver version <strong><code><mark class="has-inline-color has-ast-global-color-0-color">535.183.01</mark></code></strong> (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.2</mark></strong></code>).</p>



<pre class="wp-block-code"><code class="">helm install gpu-operator nvidia/gpu-operator \<br>  -n gpu-operator \<br>  --set driver.enabled=true \<br>  --set driver.version="535.183.01" \<br>  --set toolkit.enabled=true \<br>  --set operator.defaultRuntime=containerd \<br>  --set devicePlugin.enabled=true \<br>  --set dcgmExporter.enabled=true \<br>  --set dcgmExporter.image="dcgm-exporter" \<br>  --set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04" \<br>  --set gfd.enabled=true \<br>  --set migManager.enabled=false \<br>  --set nodeStatusExporter.enabled=true \<br>  --set validator.driver.enable=false \<br>  --set validator.toolkit.enable=false \<br>  --set validator.plugin.enable=false \<br>  --timeout 20m</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">✅ <strong>Note </strong></p>



<p class="wp-block-paragraph"><em><strong>Specifying the DCGM version may only be necessary if you encounter problems with the default image (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color">ImagePullBackOff</mark></code>). If this is the case, add the following parameters:<br><code><mark class="has-inline-color has-ast-global-color-0-color">--set dcgmExporter.repository="nvcr.io/nvidia/k8s"<br>--set dcgmExporter.image="dcgm-exporter"<br>--set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04"</mark></code></strong></em></p>
</blockquote>



<pre class="wp-block-code"><code class="">kubectl get pods -n gpu-operator</code></pre>



<p class="wp-block-paragraph">All pods should reach the <strong>Running</strong> state within 5-10 minutes.</p>



<p class="wp-block-paragraph">You can also check the GPU availability:</p>



<pre class="wp-block-code"><code class="">kubectl get nodes -o json | jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | "\(.metadata.name): \(.status.allocatable."nvidia.com/gpu") GPU(s)"'</code></pre>



<p class="wp-block-paragraph">Returning:</p>



<p class="wp-block-paragraph"><code>vllm-node-yyyyyy: 1 GPU(s)<br>vllm-node-zzzzzz: 1 GPU(s)</code></p>



<p class="wp-block-paragraph">You can also run <code><strong><mark class="has-inline-color has-ast-global-color-0-color">nvidia-smi</mark></strong></code> as a test:</p>



<pre class="wp-block-code"><code class="">DRIVER_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -1)<br>kubectl exec -n gpu-operator $DRIVER_POD -- nvidia-smi</code></pre>
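


<p class="wp-block-paragraph">To confirm that the pinned driver version is the one actually loaded, you can ask <code>nvidia-smi</code> for specific fields only. This is a small sketch that reuses the <code>DRIVER_POD</code> variable set above:</p>



<pre class="wp-block-code"><code class=""># Print GPU model, driver version and total memory as CSV<br>kubectl exec -n gpu-operator $DRIVER_POD -- \<br>  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv</code></pre>



<p class="wp-block-paragraph">With the installation above, the <code>driver_version</code> column is expected to show the <code>535.183.01</code> release that was forced through <code>driver.version</code>.</p>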



<p class="wp-block-paragraph">If the GPU tests work properly, you can move on to the DCGM service configuration.</p>



<h4 class="wp-block-heading">3. Configure DCGM service</h4>



<p class="wp-block-paragraph"><strong>Why is DCGM Exporter required?</strong></p>



<p class="wp-block-paragraph">DCGM (Data Center GPU Manager) is NVIDIA&#8217;s official tool for monitoring GPUs in production. The goal is to collect and display metrics from both GPU nodes.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="571" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg" alt="" class="wp-image-30746" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-300x167.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-768x428.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1536x856.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1.jpg 1733w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>GPU monitoring with DCGM</em></figcaption></figure>



<p class="wp-block-paragraph">The metrics provided are:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_UTIL</mark></strong></code> &#8211; GPU utilisation (%)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_TEMP</mark></code></strong> &#8211; GPU temperature (°C)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_USED</mark></code></strong> &#8211; VRAM used (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_FREE</mark></code></strong> &#8211; Free VRAM (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_POWER_USAGE</mark></code></strong> &#8211; Power consumption (W)</li>



<li>And 50+ other GPU metrics</li>
</ul>



<p class="wp-block-paragraph">Next, ensure DCGM service has the correct labels and port configuration:</p>



<pre class="wp-block-code"><code class="">kubectl patch svc nvidia-dcgm-exporter -n gpu-operator --type merge -p '{<br>  "metadata": {<br>    "labels": {<br>      "app": "nvidia-dcgm-exporter"<br>    }<br>  },<br>  "spec": {<br>    "ports": [<br>      {<br>        "name": "metrics",<br>        "port": 9400,<br>        "targetPort": 9400,<br>        "protocol": "TCP"<br>      }<br>    ]<br>  }<br>}'</code></pre>



<p class="wp-block-paragraph">Verify the endpoints (should show 2 IPs, one per GPU node).</p>



<pre class="wp-block-code"><code class="">kubectl get endpoints nvidia-dcgm-exporter -n gpu-operator</code></pre>



<p class="wp-block-paragraph"><code>NAME &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ENDPOINTS &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; AGE<br>nvidia-dcgm-exporter &nbsp; x.x.x.x:9400,x.x.x.x:9400 &nbsp; 17d</code></p>
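


<p class="wp-block-paragraph">Before wiring Prometheus to this service, it can be useful to read the metrics by hand. The sketch below assumes the exporter serves its metrics under <code>/metrics</code> on port <code>9400</code>, and uses a temporary port-forward from your workstation:</p>



<pre class="wp-block-code"><code class=""># Forward the DCGM exporter service to localhost (runs in the background)<br>kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &amp;<br>PF_PID=$!<br>sleep 3  # give the tunnel a moment to come up<br><br># Print the metrics listed above<br>curl -s http://localhost:9400/metrics | grep -E '^DCGM_FI_DEV_(GPU_UTIL|GPU_TEMP|FB_USED|FB_FREE|POWER_USAGE)'<br><br># Stop the port-forward<br>kill "$PF_PID"</code></pre>



<p class="wp-block-paragraph">A port-forward to a service reaches a single backing pod, so this shows the GPU of one node only. Prometheus scrapes every endpoint listed above.</p>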



<h3 class="wp-block-heading">Step 3 &#8211; Deploy Qwen3 VL 8B with vLLM inference server</h3>



<p class="wp-block-paragraph">The deployment of the <strong>Qwen3-VL-8B</strong> model on two L40S GPU nodes is carried out in several stages.</p>



<h4 class="wp-block-heading">1. Create namespace and Hugging Face secret</h4>



<p class="wp-block-paragraph">Start by creating the namespace:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace vllm</code></pre>



<p class="wp-block-paragraph">Next, retrieve your Hugging Face token and replace the&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">HF_TOKEN</mark></strong></code>&nbsp;value with your own:</p>



<pre class="wp-block-code"><code class="">export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"</code></pre>



<p class="wp-block-paragraph">Create your secret as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create secret generic huggingface-secret \<br>  --from-literal=token="$HF_TOKEN" \<br>  --namespace=vllm</code></pre>



<p class="wp-block-paragraph">Verify that the secret was created:</p>



<pre class="wp-block-code"><code class="">kubectl get secret huggingface-secret -n vllm</code></pre>



<p class="wp-block-paragraph"><code>NAME &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; TYPE &nbsp; &nbsp; DATA &nbsp; AGE<br>huggingface-secret &nbsp; Opaque &nbsp; 1&nbsp; &nbsp; &nbsp; 14d</code></p>



<h4 class="wp-block-heading">2. Create vLLM deployment configuration</h4>



<p class="wp-block-paragraph">First, create the <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-deployment-2nodes.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-deployment-2nodes.yaml</a></strong></code> file.</p>



<p class="wp-block-paragraph">Deploy vLLM:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-deployment-2nodes.yaml</code></pre>



<p class="wp-block-paragraph">You can monitor the deployment (it should take 8-10 minutes for model download and loading).</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n vllm -o wide -w</code></pre>



<p class="wp-block-paragraph">Expected output after 10 minutes:</p>



<pre class="wp-block-code"><code class="">NAME               READY  STATUS   RESTARTS  AGE  IP       NODE  <br>qwen3-vl-xxxx-yyy  1/1    Running  0         1d   X.X.X.X  vllm-node-yyyyyy<br>qwen3-vl-xxxx-zzz  1/1    Running  0         1d   X.X.X.X  vllm-node-zzzzzz</code></pre>



<p class="wp-block-paragraph">You can also check the container logs:</p>



<pre class="wp-block-code"><code class="">kubectl logs -f -n vllm &lt;pod-name&gt;</code></pre>



<p class="wp-block-paragraph">The logs should contain: &#8220;<code>Uvicorn running on http://0.0.0.0:8000</code>&#8221;</p>



<p class="wp-block-paragraph">Is everything installed correctly? Then let&#8217;s move on to the next step.</p>



<h4 class="wp-block-heading">3. Add service label</h4>



<p class="wp-block-paragraph">Ensure the service has the correct label for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">ServiceMonitor</mark></code></strong> discovery.</p>



<pre class="wp-block-code"><code class="">kubectl label svc qwen3-vl-service -n vllm app=qwen3-vl --overwrite</code></pre>



<p class="wp-block-paragraph">You can now verify by launching the following command.</p>



<pre class="wp-block-code"><code class="">kubectl get svc qwen3-vl-service -n vllm --show-labels | grep "app=qwen3-vl"</code></pre>



<p class="wp-block-paragraph">Returning:</p>



<p class="wp-block-paragraph"><code>qwen3-vl-service&nbsp; ClusterIP&nbsp; X.X.X.X &nbsp;&lt;none&gt;  8000/TCP  1d &nbsp;app=qwen3-vl</code></p>
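


<p class="wp-block-paragraph">The load distribution settings described in the architecture overview (one pod per GPU node, <code>ClientIP</code> session affinity on the service) can be cross-checked from the command line. This is a minimal sketch using only the resource names shown above:</p>



<pre class="wp-block-code"><code class=""># Each vLLM pod should be listed on a different GPU node<br>kubectl get pods -n vllm -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName'<br><br># Print the session affinity configured on the service<br>kubectl get svc qwen3-vl-service -n vllm -o jsonpath='{.spec.sessionAffinity}{"\n"}'</code></pre>



<p class="wp-block-paragraph">If the second command prints <code>None</code> instead of <code>ClientIP</code>, the service is not pinning clients to a pod.</p>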



<h3 class="wp-block-heading">Step 4 &#8211; Install NGINX ingress controller</h3>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><mark style="color:#cf2e2e" class="has-inline-color">⚠️ <strong>Moving beyond Ingress</strong></mark></p>



<p class="wp-block-paragraph"><strong><em><mark style="color:#cf2e2e" class="has-inline-color">Follow this <a href="https://blog.ovhcloud.com/moving-beyond-ingress-why-should-ovhcloud-managed-kubernetes-service-mks-users-start-looking-at-the-gateway-api/" data-wpel-link="internal">tutorial</a> if you want to use Gateway instead of Ingress.</mark></em></strong></p>
</blockquote>



<h4 class="wp-block-heading">1. Add helm repository and configure Ingress</h4>



<p class="wp-block-paragraph">First, add the Helm repository:</p>



<pre class="wp-block-code"><code class="">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx<br>helm repo update</code></pre>



<p class="wp-block-paragraph">Create the configuration file <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/ingress/ingress-nginx-values.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ingress-nginx-values.yaml</a></strong></code>.</p>



<p class="wp-block-paragraph">Then, install NGINX Ingress:</p>



<pre class="wp-block-code"><code class="">helm install ingress-nginx ingress-nginx/ingress-nginx \<br>  --namespace ingress-nginx \<br>  --create-namespace \<br>  -f ingress-nginx-values.yaml \<br>  --wait</code></pre>



<p class="wp-block-paragraph">Wait for the LoadBalancer IP to be assigned. This should take 1-2 minutes.</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n ingress-nginx ingress-nginx-controller -w</code></pre>



<p class="wp-block-paragraph">Once <code><strong><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></strong></code> is no longer <code>&lt;pending&gt;</code>, press Ctrl+C and export it:</p>



<pre class="wp-block-code"><code class="">export EXTERNAL_IP=&lt;EXTERNAL-IP&gt;<br>echo "API URL: http://$EXTERNAL_IP"</code></pre>
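<p class="wp-block-paragraph">If you prefer not to copy the address by hand, it can be read from the Service status. The sketch below parses the JSON form of the Service (for example the output of <code>kubectl get svc -n ingress-nginx ingress-nginx-controller -o json</code>) and returns the first load balancer address. The function name and the sample document are illustrative assumptions, and the IP is a made-up documentation address:</p>

```python
import json


def external_ip(service_json: str):
    """Return the first load balancer IP (or hostname) of a Service, else None."""
    service = json.loads(service_json)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        address = entry.get("ip") or entry.get("hostname")
        if address:
            return address
    return None  # nothing assigned yet: kubectl would display <pending>


# Trimmed, made-up samples of a Service in JSON form.
assigned = '{"status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}}}'
pending = '{"status": {"loadBalancer": {}}}'

print(external_ip(assigned))
print(external_ip(pending))
```

<p class="wp-block-paragraph">Returning <code>None</code> while the address is pending makes it easy to poll in a loop instead of watching the command output.</p>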



<h4 class="wp-block-heading">2. Create vLLM Ingress resource</h4>



<p class="wp-block-paragraph">Next, create vLLM Ingress using <strong><code><a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-ingress.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-ingress.yaml</a></code></strong>.</p>



<p class="wp-block-paragraph">Apply it as follows:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-ingress.yaml</code></pre>



<p class="wp-block-paragraph">You can now test different API calls to verify that your deployment is functional.</p>



<h4 class="wp-block-heading">3. Test API</h4>



<p class="wp-block-paragraph">First, check that the model is available:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/models | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "object": "list",<br>  "data": [<br>    {<br>      "id": "qwen3-vl-8b",<br>      "object": "model",<br>      "created": 1772472143,<br>      "owned_by": "vllm",<br>      "root": "Qwen/Qwen3-VL-8B-Instruct",<br>      "parent": null,<br>      "max_model_len": 8192,<br>      "permission": [<br>        {<br>          "id": "modelperm-8fb35cdd3208b068",<br>          "object": "model_permission",<br>          "created": 1772472143,<br>          "allow_create_engine": false,<br>          "allow_sampling": true,<br>          "allow_logprobs": true,<br>          "allow_search_indices": false,<br>          "allow_view": true,<br>          "allow_fine_tuning": false,<br>          "organization": "*",<br>          "group": null,<br>          "is_blocking": false<br>        }<br>      ]<br>    }<br>  ]<br>}</code></pre>
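<p class="wp-block-paragraph">The <code>max_model_len</code> field in this response is worth noting before sending larger requests. The sketch below extracts it per model and checks whether a request fits. It assumes that the prompt tokens plus <code>max_tokens</code> must stay within <code>max_model_len</code>; both helper names are illustrative:</p>

```python
def model_limits(models_response: dict) -> dict:
    """Map each served model id to its max_model_len (None when absent)."""
    return {m["id"]: m.get("max_model_len") for m in models_response.get("data", [])}


def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """Assumed rule: prompt plus requested completion must fit in the context window."""
    return prompt_tokens + max_tokens <= max_model_len


# Trimmed version of the /v1/models response shown above.
response = {
    "object": "list",
    "data": [{"id": "qwen3-vl-8b", "object": "model", "max_model_len": 8192}],
}

limits = model_limits(response)
print(limits)
print(fits_context(prompt_tokens=8000, max_tokens=100, max_model_len=limits["qwen3-vl-8b"]))
print(fits_context(prompt_tokens=8100, max_tokens=100, max_model_len=limits["qwen3-vl-8b"]))
```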



<p class="wp-block-paragraph">Next, test inference using the following request:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/chat/completions \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "qwen3-vl-8b",<br>    "messages": [{"role": "user", "content": "Count from 1 to 10."}],<br>    "max_tokens": 100<br>  }' | jq '.choices[0].message.content'</code></pre>



<p class="wp-block-paragraph"><code>"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"</code></p>
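<p class="wp-block-paragraph">The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the <code>curl</code> call above; the helper names are illustrative, and <code>chat()</code> is defined but not called here because it needs a reachable endpoint:</p>

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Build the JSON body used by the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response: dict) -> str:
    """Equivalent of the jq filter '.choices[0].message.content'."""
    return response["choices"][0]["message"]["content"]


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload to <base_url>/v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_reply(json.load(resp))


# Offline check with a made-up response of the expected shape.
fake = {"choices": [{"message": {"role": "assistant", "content": "1, 2, 3"}}]}
print(extract_reply(fake))
```

<p class="wp-block-paragraph">With the ingress in place, <code>chat(f"http://{external_ip}", "qwen3-vl-8b", "Count from 1 to 10.")</code> would perform the same call as the command above, where <code>external_ip</code> is the address exported earlier.</p>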



<p class="wp-block-paragraph">Great! You&#8217;re almost there…</p>



<h3 class="wp-block-heading">Step 5 &#8211; Install Prometheus stack</h3>



<p class="wp-block-paragraph">Now, set up the monitoring stack that provides complete observability for&nbsp;<strong>application-level&nbsp;</strong>(vLLM) and&nbsp;<strong>hardware-level</strong>&nbsp;(GPU) metrics:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="763" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg" alt="" class="wp-image-30871" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-300x223.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-768x572.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1536x1144.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture.jpg 1673w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Monitoring architecture</em></figcaption></figure>



<h4 class="wp-block-heading">1. Add helm repository and create namespace</h4>



<p class="wp-block-paragraph">Add Prometheus helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts<br>helm repo update</code></pre>



<p class="wp-block-paragraph">Then, create the <code><strong><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></strong></code> Namespace.</p>



<pre class="wp-block-code"><code class="">kubectl create namespace monitoring</code></pre>



<h4 class="wp-block-heading">2. Create Prometheus deployment configuration and installation</h4>



<p class="wp-block-paragraph">First, create the <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/prometheus.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">prometheus.yaml</a></strong></code> file.</p>



<p class="wp-block-paragraph">Install Prometheus stack:</p>



<pre class="wp-block-code"><code class="">helm install prometheus prometheus-community/kube-prometheus-stack \<br>  -n monitoring \<br>  -f prometheus.yaml \<br>  --timeout 10m \<br>  --wait</code></pre>



<p class="wp-block-paragraph">Now,&nbsp;monitor its installation and wait until the pods are ready:</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n monitoring -w</code></pre>



<p class="wp-block-paragraph">If all pods are running successfully, you can proceed to the next step.</p>



<h4 class="wp-block-heading">3. Check that the installation is operational</h4>



<p class="wp-block-paragraph">First, expose Grafana locally with a port-forward running in the background:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &amp;</code></pre>



<p class="wp-block-paragraph">Test Grafana health:</p>



<pre class="wp-block-code"><code class="">curl -s http://localhost:3000/api/health | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "database": "ok",<br>  "version": "12.3.3",<br>  "commit": "2a14494b2d6ab60f860d8b27603d0ccb264336f6"<br>}</code></pre>



<p class="wp-block-paragraph">You can now access Grafana locally via <strong><a href="http://localhost:3000" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>http://localhost:3000</code></a></strong>. Log in with:</p>



<ul class="wp-block-list">
<li>Login: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">admin</mark></strong></code></li>



<li>Password: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">Admin123!vLLM</mark></strong></code></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="518" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png" alt="" class="wp-image-30804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-300x152.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-768x389.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2.png 1322w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">Well done! You can now proceed to the configuration step.</p>



<h3 class="wp-block-heading">Step 6 &#8211; Configure ServiceMonitors</h3>



<p class="wp-block-paragraph">ServiceMonitors tell Prometheus which endpoints to scrape for metrics.</p>



<h4 class="wp-block-heading">1. Create vLLM ServiceMonitor</h4>



<p class="wp-block-paragraph">Retrieve the file from the GitHub repository: <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/vllm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-servicemonitor.yaml</a></strong></code>.</p>



<p class="wp-block-paragraph">Next, apply and check that the ServiceMonitor <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-metrics</mark></strong></code> exists:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-servicemonitor.yaml<br>kubectl get servicemonitor -n vllm</code></pre>



<h4 class="wp-block-heading">2. Create DCGM ServiceMonitor</h4>



<p class="wp-block-paragraph">First, create the <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/dcgm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dcgm-servicemonitor.yaml</a></strong></code> file.</p>



<p class="wp-block-paragraph">Once again, apply and verify:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f dcgm-servicemonitor.yaml<br>kubectl get servicemonitor -n gpu-operator</code></pre>



<pre class="wp-block-preformatted"><code>gpu-operator                  1d<br>nvidia-dcgm-exporter          1d<br>nvidia-node-status-exporter   1d</code></pre>



<h4 class="wp-block-heading">3. Configure Prometheus for Cross-Namespace discovery</h4>



<p class="wp-block-paragraph">Apply a patch to allow Prometheus to discover ServiceMonitors in all namespaces.</p>



<pre class="wp-block-code"><code class="">kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type merge -p '{<br>  "spec": {<br>    "serviceMonitorNamespaceSelector": {},<br>    "podMonitorNamespaceSelector": {}<br>  }<br>}'</code></pre>
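<p class="wp-block-paragraph">With <code>--type merge</code>, <code>kubectl patch</code> applies a JSON Merge Patch (RFC 7386). The sketch below reimplements those merge rules in a few lines to show what the patch above does to the <code>spec</code>: a missing selector becomes <code>{}</code>, but an existing non-empty selector is left untouched because merging an empty object changes nothing. The function name and the sample specs are illustrative assumptions, not the actual state of your cluster:</p>

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return the patched copy."""
    if not isinstance(patch, dict):
        return patch  # scalars and lists replace the target wholesale
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null removes the key
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


patch = {"serviceMonitorNamespaceSelector": {}, "podMonitorNamespaceSelector": {}}

# Case 1: selectors absent, so the patch sets them to empty objects.
print(merge_patch({"replicas": 1}, patch))

# Case 2: a selector already restricts namespaces, so the empty object does not clear it.
restricted = {"serviceMonitorNamespaceSelector": {"matchLabels": {"team": "ml"}}}
print(merge_patch(restricted, patch))

# Clearing it needs an explicit null on the nested key.
print(merge_patch(restricted, {"serviceMonitorNamespaceSelector": {"matchLabels": None}}))
```

<p class="wp-block-paragraph">In the Prometheus Operator, an empty namespace selector matches all namespaces, while an unset one matches only the namespace of the Prometheus object. If the targets do not appear after the restart, it is worth checking that no restrictive selector survived the patch.</p>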



<p class="wp-block-paragraph">Now you have to restart Prometheus.</p>



<ol class="wp-block-list">
<li>Delete Prometheus pod to force configuration reload</li>



<li>Wait for Prometheus to restart</li>
</ol>



<pre class="wp-block-code"><code class="">kubectl delete pod prometheus-prometheus-kube-prometheus-prometheus-0 -n monitoring<br><br>kubectl wait --for=condition=Ready \<br>  pod/prometheus-prometheus-kube-prometheus-prometheus-0 \<br>  -n monitoring \<br>  --timeout=180s</code></pre>



<p class="wp-block-paragraph">Wait about 2 minutes for discovery, then verify the targets:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring \<br>  prometheus-prometheus-kube-prometheus-prometheus-0 9090:9090 &amp;</code></pre>



<p class="wp-block-paragraph">Open <a href="http://localhost:9090/targets" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://localhost:9090/targets</mark></strong></code></a> in your browser and search for:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></strong></code></li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">dcgm</mark></code></strong></li>
</ul>



<p class="wp-block-paragraph">Note that the expected targets are: </p>



<ul class="wp-block-list">
<li>serviceMonitor/vllm/vllm-metrics/0   (2/2 UP)</li>



<li>serviceMonitor/gpu-operator/nvidia-dcgm-exporter/0 (2/2 UP)</li>
</ul>
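<p class="wp-block-paragraph">The same check can be scripted against the Prometheus HTTP API, which exposes the target list at <code>/api/v1/targets</code>. The sketch below counts healthy targets per scrape pool from such a response. The function name and the sample payload are illustrative; the sample deliberately shows one DCGM target down to make the output easier to read:</p>

```python
from collections import defaultdict


def summarise_targets(targets_response: dict) -> dict:
    """Map each scrape pool to a (targets up, targets total) tuple."""
    summary = defaultdict(lambda: [0, 0])
    for target in targets_response["data"]["activeTargets"]:
        counts = summary[target["scrapePool"]]
        counts[1] += 1
        if target["health"] == "up":
            counts[0] += 1
    return {pool: tuple(counts) for pool, counts in summary.items()}


# Made-up, trimmed response in the shape returned by /api/v1/targets.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapePool": "serviceMonitor/vllm/vllm-metrics/0", "health": "up"},
            {"scrapePool": "serviceMonitor/vllm/vllm-metrics/0", "health": "up"},
            {"scrapePool": "serviceMonitor/gpu-operator/nvidia-dcgm-exporter/0", "health": "up"},
            {"scrapePool": "serviceMonitor/gpu-operator/nvidia-dcgm-exporter/0", "health": "down"},
        ]
    },
}

for pool, (up, total) in summarise_targets(sample).items():
    print(f"{pool}: {up}/{total} UP")
```

<p class="wp-block-paragraph">With the port-forward above still running, the input can be fetched with <code>curl -s http://localhost:9090/api/v1/targets</code>.</p>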



<h3 class="wp-block-heading">Step 7 &#8211; Create Grafana dashboards</h3>



<p class="wp-block-paragraph">In this final setup step, you create two Grafana dashboards: one for the software side (vLLM metrics) and one for the hardware side (GPU consumption and system metrics).</p>



<h4 class="wp-block-heading">1. vLLM application metrics</h4>



<p class="wp-block-paragraph">The dashboard provides insights into vLLM application performance, request handling, and resource utilization based on the following metrics:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>vllm:request_success_total</code></td><td>Counter</td><td>Total successful requests</td><td>count</td><td>Request Rate, Total Requests</td></tr><tr><td><code>vllm:num_requests_running</code></td><td>Gauge</td><td>Requests currently being processed</td><td>count</td><td>Queue Depth, Active Requests</td></tr><tr><td><code>vllm:num_requests_waiting</code></td><td>Gauge</td><td>Requests waiting in queue</td><td>count</td><td>Queue Depth, Queued Requests</td></tr><tr><td><code>vllm:time_to_first_token_seconds</code></td><td>Histogram</td><td>Latency until first token generated</td><td>seconds</td><td>TTFT P50/P95/P99</td></tr><tr><td><code>vllm:e2e_request_latency_seconds</code></td><td>Histogram</td><td>Total end-to-end latency</td><td>seconds</td><td>E2E Latency P50/P95/P99</td></tr><tr><td><code>vllm:generation_tokens_total</code></td><td>Counter</td><td>Total tokens generated (output)</td><td>count</td><td>Token Generation Rate, Throughput</td></tr><tr><td><code>vllm:prompt_tokens_total</code></td><td>Counter</td><td>Total prompt tokens (input)</td><td>count</td><td>Token Generation Rate, Avg Tokens</td></tr><tr><td><code>vllm:kv_cache_usage_perc</code></td><td>Gauge</td><td>GPU KV cache utilization</td><td>0-1 (0-100%)</td><td>KV Cache Usage</td></tr><tr><td><code>vllm:prefix_cache_hits_total</code></td><td>Counter</td><td>Number of prefix cache hits</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:prefix_cache_queries_total</code></td><td>Counter</td><td>Number of prefix cache queries</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:request_queue_time_seconds</code></td><td>Histogram</td><td>Time spent waiting in queue</td><td>seconds</td><td>Request Queue 
Time</td></tr><tr><td><code>vllm:request_prefill_time_seconds</code></td><td>Histogram</td><td>Prefill phase time</td><td>seconds</td><td>Prefill Time</td></tr><tr><td><code>vllm:request_decode_time_seconds</code></td><td>Histogram</td><td>Decode phase time</td><td>seconds</td><td>Decode Time</td></tr><tr><td><code>vllm:inter_token_latency_seconds</code></td><td>Histogram</td><td>Latency between each token</td><td>seconds</td><td>Inter-Token Latency</td></tr><tr><td><code>vllm:num_preemptions_total</code></td><td>Counter</td><td>Number of preemptions (OOM)</td><td>count</td><td>Preemptions</td></tr><tr><td><code>vllm:prompt_tokens_cached_total</code></td><td>Counter</td><td>Prompt tokens cached</td><td>count</td><td>Cached Tokens</td></tr><tr><td><code>vllm:request_prompt_tokens</code></td><td>Histogram</td><td>Prompt size distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:request_generation_tokens</code></td><td>Histogram</td><td>Generated tokens distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:iteration_tokens_total</code></td><td>Histogram</td><td>Tokens per iteration</td><td>count</td><td>(Advanced analysis)</td></tr></tbody></table></figure>
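<p class="wp-block-paragraph">The P50/P95/P99 panels are derived from the histogram metrics in this table. Prometheus estimates a quantile from cumulative buckets by linear interpolation (its <code>histogram_quantile</code> function). The sketch below is a simplified reimplementation of that idea, with made-up bucket bounds and counts, and without all the edge cases of the real function:</p>

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == lower_count:
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical time-to-first-token buckets: (upper bound in seconds, cumulative count).
ttft_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]

print(f"P50 = {histogram_quantile(0.50, ttft_buckets):.2f}s")
print(f"P95 = {histogram_quantile(0.95, ttft_buckets):.2f}s")
```

<p class="wp-block-paragraph">In Grafana, the equivalent is a query of the form <code>histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))</code>. Note that the estimate can never exceed the highest finite bucket bound, so a P99 that sits exactly on that bound usually means the real value is higher.</p>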



<p class="wp-block-paragraph">This <strong>vLLM Grafana dashboard</strong> is composed of 23 panels built on the metrics above:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Count</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>12</td><td>Request Rate, Queue Depth, TTFT, E2E Latency, Token Gen, Cache Usage, Cache Hit, Queue Time, Prefill/Decode, Inter-Token, Preemptions, Avg Tokens</td></tr><tr><td><strong>Stat</strong></td><td>10</td><td>Throughput, TTFT P95, Active Req, Queued Req, Cache Hit Rate, Cache Usage, Total Req, Total Tokens, Cached Tokens, Preemptions</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Pod Performance</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">Now create the dashboard using <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/vllm-app-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-app-dashboard.json</a></strong></code>. Then, launch:</p>



<pre class="wp-block-code"><code class="">echo "Importing vLLM application dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @vllm-app-dashboard.json | jq '.status, .url'</code></pre>
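<p class="wp-block-paragraph">The Grafana endpoint <code>/api/dashboards/db</code> expects the dashboard model wrapped in an envelope, under a <code>dashboard</code> key, optionally with an <code>overwrite</code> flag. Whether the repository file already uses this envelope is not verified here, so the sketch below handles both cases; the function name and the sample dashboard are illustrative assumptions:</p>

```python
def as_import_payload(dashboard_json: dict, overwrite: bool = True) -> dict:
    """Return a body suitable for POST /api/dashboards/db."""
    if "dashboard" in dashboard_json:
        return dashboard_json  # already wrapped, leave untouched
    dashboard = dict(dashboard_json)
    # A null id asks Grafana to create the dashboard rather than update by numeric id.
    dashboard["id"] = None
    return {"dashboard": dashboard, "overwrite": overwrite}


# Made-up bare dashboard model, as exported from the Grafana UI.
exported = {"id": 42, "uid": "vllm-app", "title": "vLLM application", "panels": []}

payload = as_import_payload(exported)
print(sorted(payload))
print(payload["dashboard"]["id"])
```

<p class="wp-block-paragraph">Setting <code>overwrite</code> makes the import repeatable: running it twice replaces the existing dashboard instead of failing on a name or UID conflict.</p>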



<p class="wp-block-paragraph">Next, you can access the vLLM dashboard and follow the metrics in real time:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png" alt="" class="wp-image-30858" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">For comprehensive monitoring, it is also essential to track hardware consumption.</p>



<h4 class="wp-block-heading">2. GPU hardware metrics</h4>



<p class="wp-block-paragraph">Take advantage of the most useful DCGM metrics to check both the functioning and consumption of your hardware resources:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Normal Thresholds</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>DCGM_FI_DEV_GPU_UTIL</code></td><td>Gauge</td><td>GPU utilization (compute)</td><td>% (0-100)</td><td>70-95% optimal</td><td>GPU Utilization</td></tr><tr><td><code>DCGM_FI_DEV_GPU_TEMP</code></td><td>Gauge</td><td>GPU temperature</td><td>°C</td><td>&lt; 85°C normal</td><td>GPU Temperature</td></tr><tr><td><code>DCGM_FI_DEV_FB_USED</code></td><td>Gauge</td><td>VRAM used</td><td>MB</td><td>Variable by model</td><td>GPU Memory Used</td></tr><tr><td><code>DCGM_FI_DEV_FB_FREE</code></td><td>Gauge</td><td>VRAM free</td><td>MB</td><td>&gt; 2GB recommended</td><td>GPU Memory Free</td></tr><tr><td><code>DCGM_FI_DEV_POWER_USAGE</code></td><td>Gauge</td><td>Power consumption</td><td>Watts</td><td>&lt; 300W (L40S)</td><td>GPU Power Usage</td></tr><tr><td><code>DCGM_FI_DEV_SM_CLOCK</code></td><td>Gauge</td><td>GPU clock speed (compute)</td><td>MHz</td><td>Variable</td><td>GPU Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_MEM_CLOCK</code></td><td>Gauge</td><td>Memory clock speed</td><td>MHz</td><td>Variable</td><td>Memory Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL</code></td><td>Counter</td><td>Total NVLink bandwidth</td><td>bytes/s</td><td>(If multi-GPU)</td><td>NVLink Bandwidth</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_TX_BYTES</code></td><td>Counter</td><td>PCIe data transmitted</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe TX</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_RX_BYTES</code></td><td>Counter</td><td>PCIe data received</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe RX</td></tr><tr><td><code>DCGM_FI_DEV_ECC_DBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC double-bit errors</td><td>count</td><td>0 ideal</td><td>(Health check)</td></tr><tr><td><code>DCGM_FI_DEV_ECC_SBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC 
single-bit errors</td><td>count</td><td>&lt; 10/day acceptable</td><td>(Health check)</td></tr></tbody></table></figure>
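<p class="wp-block-paragraph">The "Normal Thresholds" column can be turned into a simple automated check. The sketch below encodes a few of those thresholds and flags the metrics of one sample that fall outside them. The sample values are made up, 2 GB is taken as 2048 MB, and the function name is an illustrative assumption:</p>

```python
# Thresholds taken from the table above (power limit as listed for the L40S).
THRESHOLDS = {
    "DCGM_FI_DEV_GPU_TEMP": lambda value: value < 85,        # degrees Celsius
    "DCGM_FI_DEV_FB_FREE": lambda value: value > 2048,       # MB, more than 2 GB free
    "DCGM_FI_DEV_POWER_USAGE": lambda value: value < 300,    # Watts
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL": lambda value: value == 0,
}


def out_of_range(sample: dict) -> list:
    """Return the sorted names of metrics in `sample` that violate their threshold."""
    return sorted(
        name
        for name, is_ok in THRESHOLDS.items()
        if name in sample and not is_ok(sample[name])
    )


# Hypothetical reading for one GPU.
reading = {
    "DCGM_FI_DEV_GPU_TEMP": 91,
    "DCGM_FI_DEV_FB_FREE": 1024,
    "DCGM_FI_DEV_POWER_USAGE": 280,
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL": 0,
}

print(out_of_range(reading))
```

<p class="wp-block-paragraph">The same conditions are natural candidates for Prometheus alerting rules once the dashboards are in place.</p>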



<p class="wp-block-paragraph">This&nbsp;<strong>hardware Grafana dashboard</strong>&nbsp;is composed of 13 panels with GPU hardware and system metrics. A detailed view is also available for GPU utilisation (%), temperature (°C), VRAM (GB) and power (W).</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Count</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>8</td><td>GPU Util, GPU Mem, GPU Temp, GPU Power, CPU Usage, RAM Usage, Network I/O, Disk I/O</td></tr><tr><td><strong>Stat</strong></td><td>4</td><td>Avg GPU Util, Avg GPU Temp, Total GPU Mem, Total GPU Power</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Hardware Status</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">Retrieve <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/hardware-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">hardware-dashboard.json</a></strong></code> and import it as follows:</p>



<pre class="wp-block-code"><code class="">echo "Importing hardware dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @hardware-dashboard.json | jq '.status, .url'</code></pre>



<p class="wp-block-paragraph">Finally, track resource consumption using this hardware dashboard:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png" alt="" class="wp-image-30859" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">Congratulations! Everything is working. You can now test your model and track the various metrics in real time.</p>



<h3 class="wp-block-heading">Step 8 &#8211; LLM testing and performance tracking</h3>



<p class="wp-block-paragraph">Start by installing Python dependencies:</p>



<pre class="wp-block-code"><code class="">pip3 install openai tqdm</code></pre>



<p class="wp-block-paragraph">Replace <strong><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL_IP&gt;</mark></strong> with the vLLM service external IP and launch the performance test using the following <a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/llm-inference-performance-test.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong>Python code</strong></code></a>:</p>



<pre class="wp-block-code"><code class="">import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br>from tqdm import tqdm<br><br># replace &lt;EXTERNAL_IP&gt; with the ingress external IP exported earlier<br>APP_URL = "http://&lt;EXTERNAL_IP&gt;/v1"<br>MODEL = "qwen3-vl-8b"<br><br>CONCURRENT_WORKERS = 500          # concurrency<br>REQUESTS_PER_WORKER = 10<br>MAX_TOKENS = 200                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># openai compliance: one client shared by all worker threads<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key="foo"<br>)<br><br># basic metrics, protected by a lock because workers update them concurrently<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker: sends REQUESTS_PER_WORKER requests sequentially<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception:<br>            # any failure (timeout, HTTP error, ...) counts as one error<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("\n-&gt; STARTING PERFORMANCE TEST:")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n-&gt; BENCH RESULTS:")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>



<p class="wp-block-paragraph">Returning:</p>



<pre class="wp-block-preformatted"><code>-&gt; STARTING PERFORMANCE TEST:</code><br><code>Concurrency: 500<br>Total requests: 5000</code><br><code><br>-&gt; BENCH RESULTS:<br>Total requests sent: 5000<br>Successful requests: 5000<br>Errors: 0<br>Total wall time: 225.54s<br>Avg latency: 21.45s<br>Min latency: 6.06s<br>Max latency: 25.19s<br>Throughput: 22.17 req/s</code></pre>
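<p class="wp-block-paragraph">The script reports average, minimum and maximum latency, while the Grafana panels show P50/P95/P99. To compare both views, percentiles can be computed from the collected latencies with the standard library. This is a minimal sketch: the function name is illustrative and the sample data is synthetic, not taken from the run above:</p>

```python
from statistics import quantiles


def latency_percentiles(latencies):
    """Return P50, P95 and P99 of a list of latencies (needs at least two values)."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic latencies: 1.0s, 2.0s, ..., 101.0s.
synthetic = [float(value) for value in range(1, 102)]

for name, value in latency_percentiles(synthetic).items():
    print(f"{name}: {value:.2f}s")
```

<p class="wp-block-paragraph">In the benchmark script, calling <code>latency_percentiles(latencies)</code> after the threads have joined would give client-side percentiles. These include network and ingress time, so they are expected to sit slightly above the server-side values reported by vLLM.</p>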



<p class="wp-block-paragraph">Don&#8217;t forget to track GPU and vLLM metrics in your Grafana dashboards!</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p class="wp-block-paragraph">This reference architecture demonstrates a<strong>&nbsp;vLLM deployment on OVHcloud Managed Kubernetes Service (MKS)</strong>&nbsp;with comprehensive GPU monitoring. Benefits include:</p>



<ul class="wp-block-list">
<li><strong>High Performance</strong>: GPU-accelerated inference with L40S</li>



<li><strong>Scalability</strong>: Kubernetes-native, horizontal scaling-ready</li>



<li><strong>Reliability</strong>: Health checks, auto-restart, monitoring</li>



<li><strong>API Compatibility</strong>: OpenAI-compatible endpoints</li>



<li><strong>Multimodality</strong>: Vision &amp; text capabilities</li>



<li><strong>Full stack monitoring</strong>: Complete vLLM application and hardware dashboards</li>
</ul>



<h2 class="wp-block-heading">Going Further</h2>



<p class="wp-block-paragraph">Your current architecture is&nbsp;<strong>functional.&nbsp;</strong>However, if desired,&nbsp;<strong>it could be improved into a full production-ready&nbsp;solution.</strong></p>



<p class="wp-block-paragraph"><strong>Wish to take production hardening a step further?</strong></p>



<p class="wp-block-paragraph">Consider the following enhancements:</p>



<ol class="wp-block-list">
<li><strong>Authentication &amp; authorization</strong>
<ul class="wp-block-list">
<li>vLLM API authentication</li>



<li>Grafana authentication</li>



<li>Prometheus security</li>
</ul>
</li>



<li><strong>High availability &amp; load balancing</strong>
<ul class="wp-block-list">
<li>Grafana high availability with multiple replicas and shared storage</li>



<li>Prometheus high availability</li>



<li>vLLM Horizontal Pod Autoscaling (HPA) based on custom metrics</li>
</ul>
</li>



<li><strong>Data persistence &amp; backup</strong>
<ul class="wp-block-list">
<li>Prometheus long-term retention backed by persistent storage</li>



<li>Grafana Dashboard Backup</li>
</ul>
</li>



<li><strong>Observability enhancements</strong>
<ul class="wp-block-list">
<li>Distributed tracing by adding OpenTelemetry for request tracing</li>



<li>Production-ready alerting rules</li>
</ul>
</li>
</ol>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: Custom metric autoscaling for LLM inference with vLLM on OVHcloud AI Deploy and observability using MKS</title>
		<link>https://blog.ovhcloud.com/reference-architecture-custom-metric-autoscaling-for-llm-inference-with-vllm-on-ovhcloud-ai-deploy-and-observability-using-mks/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 10 Feb 2026 08:51:11 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[MKS]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30203</guid>

					<description><![CDATA[Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability. This reference architecture describes a comprehensive solution for deploying, autoscaling and monitoring vLLM-based LLM workloads on OVHcloud infrastructure. It combines AI Deploy, used for model serving with custom metric autoscaling, and Managed Kubernetes Service (MKS), which [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em><strong>Take your LLM (Large Language Model) deployment to production level with comprehensive custom autoscaling configuration and advanced vLLM metrics observability.</strong></em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg" alt="" class="wp-image-30579" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/3.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability based on OVHcloud infrastructure</em></figcaption></figure>



<p class="wp-block-paragraph">This reference architecture describes a comprehensive solution for <strong>deploying, autoscaling and monitoring vLLM-based LLM workloads</strong> on OVHcloud infrastructure. It combines <strong>AI Deploy</strong>, used for <strong>model serving with custom metric autoscaling</strong>, and <strong>Managed Kubernetes Service (MKS)</strong>, which hosts the monitoring and observability stack.</p>



<p class="wp-block-paragraph">By leveraging <strong>application-level Prometheus metrics exposed by vLLM</strong>, AI Deploy can automatically scale inference replicas based on real workload demand, ensuring <strong>high availability, consistent performance under load and efficient GPU utilisation</strong>. This autoscaling mechanism allows the platform to react dynamically to traffic spikes while maintaining predictable latency for end users.</p>



<p class="wp-block-paragraph">On top of this scalable inference layer, the monitoring architecture provides <strong>observability</strong> through <strong>Prometheus</strong>, <strong>Grafana</strong> and Alertmanager. It enables real-time performance monitoring, capacity planning, and operational insights, while ensuring <strong>full data sovereignty</strong> for organisations running Large Language Models (LLMs) in production environments.</p>



<p class="wp-block-paragraph"><strong>What are the key benefits</strong>?</p>



<ul class="wp-block-list">
<li><strong>Cost-effective</strong>: Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability</strong>: Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure</strong>: All metrics and data remain within European datacentres</li>



<li><strong>Production-ready</strong>: Persistent storage, high availability, and automated monitoring</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">AI Deploy</h3>



<p class="wp-block-paragraph">OVHcloud AI Deploy is a<strong>&nbsp;Container as a Service</strong>&nbsp;(CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications/APIs based on Machine Learning (ML), Deep Learning (DL) or Large Language Models (LLMs).</p>



<p class="wp-block-paragraph"><strong>Key points to keep in mind</strong>:</p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong>&nbsp;Bring your own custom Docker image and deploy it with a single command line or in a few clicks</li>



<li><strong>High-performance computing:</strong>&nbsp;A complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong>&nbsp;Supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong>&nbsp;Billing per minute, no surcharges</li>
</ul>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p class="wp-block-paragraph"><strong>OVHcloud MKS</strong> is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p class="wp-block-paragraph"><strong>What should you keep in mind?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Only pay for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, upgrades and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking and persistent storage</li>



<li><strong>Scalability and flexibility</strong>: Easily scale workloads and node pools to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p class="wp-block-paragraph">In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Overview of the architecture</h2>



<p class="wp-block-paragraph">This reference architecture describes a <strong>complete, secure and scalable solution</strong> to:</p>



<ul class="wp-block-list">
<li>Deploy an LLM with vLLM and <strong>AI Deploy</strong>, benefiting from automatic scaling based on custom metrics to ensure high service availability &#8211; vLLM exposes <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>/metrics</strong></mark></code> via its public HTTPS endpoint on AI Deploy</li>



<li>Collect, store and visualise these vLLM metrics using Prometheus and Grafana on <strong>MKS</strong></li>
</ul>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="1200" height="630" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg" alt="" class="wp-image-30578" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/1.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/1-768x403.jpg 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /><figcaption class="wp-element-caption"><em>vLLM metrics monitoring and observability architecture overview</em></figcaption></figure>



<p class="wp-block-paragraph">The solution comprises three main layers:</p>



<ol class="wp-block-list">
<li><strong>Model serving layer</strong> with AI Deploy
<ul class="wp-block-list">
<li>vLLM containers running on top of GPUs for LLM inference</li>



<li>vLLM inference server exposing Prometheus metrics</li>



<li>Automatic scaling based on custom metrics to ensure high availability</li>



<li>HTTPS endpoints with Bearer token authentication</li>
</ul>
</li>



<li><strong>Monitoring and observability infrastructure</strong> using Kubernetes
<ul class="wp-block-list">
<li>Prometheus for metrics collection and storage</li>



<li>Grafana for visualisation and dashboards</li>



<li>Persistent volume storage for long-term retention</li>
</ul>
</li>



<li><strong>Network layer</strong>
<ul class="wp-block-list">
<li>Secure HTTPS communication between components</li>



<li>OVHcloud LoadBalancer for external access</li>
</ul>
</li>
</ol>



<p class="wp-block-paragraph">To go further, some prerequisites must be checked!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p class="wp-block-paragraph">Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark class="has-inline-color has-ast-global-color-0-color">Administrator</mark></code></strong></a> role</li>



<li><strong>ovhai CLI available</strong> &#8211;&nbsp;<em>install the&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a></em></li>



<li><strong>Hugging Face access</strong> &#8211; <em>create a&nbsp;<a href="https://huggingface.co/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">access token</a></em></li>



<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></strong></code> installed and <code><strong><mark class="has-inline-color has-ast-global-color-0-color">helm</mark></strong></code> installed (at least version 3.x)</li>
</ul>



<p class="wp-block-paragraph"><strong>🚀 Now that you have all the ingredients for the recipe, it’s time to deploy Ministral 3 14B using AI Deploy and the vLLM Docker container!</strong></p>



<h2 class="wp-block-heading">Architecture guide: From autoscaling to observability for LLMs served by vLLM</h2>



<p class="wp-block-paragraph">Let’s set up and deploy this architecture!</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg" alt="" class="wp-image-30580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/02/2.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Overview of the deployment workflow</em></figcaption></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>✅ <em>Note</em></strong></p>



<p class="wp-block-paragraph"><strong><em>In this example, <a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a> is used. You can pick any open-source model and follow the same steps, adapting the model slug (from Hugging Face), the versions and the GPU flavour.</em></strong></p>
</blockquote>



<p class="wp-block-paragraph"><em>Remember that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Manage access tokens</h3>






<p class="wp-block-paragraph">Export your&nbsp;<a href="https://huggingface.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face token</a>.</p>



<pre class="wp-block-code"><code class="">export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx</code></pre>



<p class="wp-block-paragraph"><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Create a Bearer token</a>&nbsp;to access your AI Deploy app once it&#8217;s been deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p class="wp-block-paragraph">This returns the following output:</p>



<p class="wp-block-paragraph"><code><strong>Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-01-26 11:53:05<br>Updated At: 20-01-26 11:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</strong></code></p>



<p class="wp-block-paragraph">You can now store and export your access token:</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>
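
<p class="wp-block-paragraph">Before moving on, it can help to confirm that both variables are actually set in the current shell. The helper below is only a minimal sketch: the function name <code>check_tokens</code> is illustrative and is not part of the <code>ovhai</code> CLI.</p>

<pre class="wp-block-code"><code class=""># Illustrative helper (hypothetical name): report the first missing token, if any
check_tokens() {
  if [ -z "$MY_HF_TOKEN" ]; then
    echo "Missing: MY_HF_TOKEN"
    return 1
  fi
  if [ -z "$MY_OVHAI_ACCESS_TOKEN" ]; then
    echo "Missing: MY_OVHAI_ACCESS_TOKEN"
    return 1
  fi
  echo "Both tokens are set"
}</code></pre>

<p class="wp-block-paragraph">Run <code>check_tokens</code> after the two <code>export</code> commands; it returns a non-zero status when a token is missing or empty.</p>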



<h3 class="wp-block-heading">Step 2 &#8211; LLM deployment using AI Deploy</h3>



<p class="wp-block-paragraph">Before introducing the monitoring stack, this architecture starts with the <strong>deployment of Ministral 3 14B on OVHcloud AI Deploy</strong>, configured to <strong>autoscale based on custom Prometheus metrics exposed by vLLM itself</strong>.</p>



<h4 class="wp-block-heading">1. Define the targeted vLLM metric for autoscaling</h4>



<p class="wp-block-paragraph">Before proceeding with the deployment of the <strong>Ministral 3 14B</strong> endpoint, you have to choose the metric you want to use as the trigger for scaling.</p>



<p class="wp-block-paragraph">Instead of relying solely on CPU/RAM utilisation, AI Deploy allows autoscaling decisions to be driven by <strong>application-level signals</strong>.</p>



<p class="wp-block-paragraph">To do this, you can consult the <a href="https://docs.vllm.ai/en/latest/design/metrics/#v1-metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">metrics exposed by vLLM</a>.</p>



<p class="wp-block-paragraph">In this example, you can use a basic metric such as <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>vllm:num_requests_running</strong></mark></code> to scale the number of replicas based on <strong>real inference load</strong>.</p>



<p class="wp-block-paragraph">This enables:</p>



<ul class="wp-block-list">
<li>Faster reaction to traffic spikes</li>



<li>Better GPU utilisation</li>



<li>Reduced inference latency under load</li>



<li>Cost-efficient scaling</li>
</ul>



<p class="wp-block-paragraph">Finally, the configuration chosen for scaling this application is as follows:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Value</th><th>Description</th></tr></thead><tbody><tr><td>Metric source</td><td><code>/metrics</code></td><td>vLLM Prometheus endpoint</td></tr><tr><td>Metric name</td><td><code>vllm:num_requests_running</code></td><td>Number of in-flight requests</td></tr><tr><td>Aggregation</td><td><code>AVERAGE</code></td><td>Mean across replicas</td></tr><tr><td>Target value</td><td><code>50</code></td><td>Desired load per replica</td></tr><tr><td>Min replicas</td><td><code>1</code></td><td>Baseline capacity</td></tr><tr><td>Max replicas</td><td><code>3</code></td><td>Burst capacity</td></tr></tbody></table></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>✅ <em>Note</em></strong></p>



<p class="wp-block-paragraph"><em><strong>You can choose the metric that best suits your use case. You can also apply a patch to your AI Deploy deployment at any time to change the target metric for scaling</strong></em>.</p>
</blockquote>



<p class="wp-block-paragraph">When the <strong>average number of running requests exceeds 50</strong>, AI Deploy automatically provisions <strong>additional GPU-backed replicas</strong>.</p>
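
<p class="wp-block-paragraph">The exact rule AI Deploy applies is not detailed here. Assuming a proportional rule similar to the one used by the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × average / target), clamped between the minimum and maximum), the values from the table can be worked through as follows. This is a sketch under that assumption only, and <code>desired_replicas</code> is a hypothetical helper name.</p>

<pre class="wp-block-code"><code class=""># Hypothetical helper: HPA-style proportional rule (an assumption, not confirmed for AI Deploy)
# usage: desired_replicas CURRENT AVERAGE TARGET MIN MAX
desired_replicas() {
  current=$1; average=$2; target=$3; min=$4; max=$5
  # integer ceiling of current * average / target
  desired=$(( (current * average + target - 1) / target ))
  # clamp the result to the configured replica bounds
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

desired_replicas 1 80 50 1 3     # 1 replica averaging 80 in-flight requests
desired_replicas 3 10 50 1 3     # 3 replicas averaging 10 in-flight requests
desired_replicas 2 200 50 1 3    # 2 replicas averaging 200 in-flight requests</code></pre>

<p class="wp-block-paragraph">Under this assumption the three calls print <code>2</code>, <code>1</code> and <code>3</code>: a scale-up, a scale-down to the baseline, and a request for more capacity that is capped at the maximum of 3 replicas.</p>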



<h4 class="wp-block-heading">2. Deploy Ministral 3 14B using AI Deploy</h4>



<p class="wp-block-paragraph">Now you can deploy the LLM using the <strong><code>ovhai</code> CLI</strong>.</p>



<p class="wp-block-paragraph">Key elements necessary for proper functioning:</p>



<ul class="wp-block-list">
<li>GPU-based inference: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">1 x H100</mark></code></strong></li>



<li>vLLM OpenAI-compatible Docker image: <a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.13.0</mark></code></strong></a></li>



<li>Custom autoscaling rules based on Prometheus metrics: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm:num_requests_running</mark></strong></code></li>
</ul>



<p class="wp-block-paragraph">Below is the reference command used to deploy the <strong><a href="https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mistralai/Ministral-3-14B-Instruct-2512</a></strong>:</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>  --name vllm-ministral-14B-autoscaling-custom-metric \<br>  --default-http-port 8000 \<br>  --label ai_deploy_token=my_operator_token \<br>  --gpu 1 \<br>  --flavor h100-1-gpu \<br>  -e OUTLINES_CACHE_DIR=/tmp/.outlines \<br>  -e HF_TOKEN=$MY_HF_TOKEN \<br>  -e HF_HOME=/hub \<br>  -e HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>  -e HF_HUB_ENABLE_HF_TRANSFER=0 \<br>  -v standalone:/hub:rw \<br>  -v standalone:/workspace:rw \<br>  --liveness-probe-path /health \<br>  --liveness-probe-port 8000 \<br>  --liveness-initial-delay-seconds 300 \<br>  --probe-path /v1/models \<br>  --probe-port 8000 \<br>  --initial-delay-seconds 300 \<br>  --auto-min-replicas 1 \<br>  --auto-max-replicas 3 \<br>  --auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics" \<br>  --auto-custom-metric-format PROMETHEUS \<br>  --auto-custom-value-location vllm:num_requests_running \<br>  --auto-custom-target-value 50 \<br>  --auto-custom-metric-aggregation-type AVERAGE \<br>  vllm/vllm-openai:v0.13.0 \<br>  -- bash -c "python3 -m vllm.entrypoints.openai.api_server \<br>    --model mistralai/Ministral-3-14B-Instruct-2512 \<br>    --tokenizer_mode mistral \<br>    --load_format mistral \<br>    --config_format mistral \<br>    --enable-auto-tool-choice \<br>    --tool-call-parser mistral \<br>    --enable-prefix-caching"</code></pre>



<p class="wp-block-paragraph">What do the different parameters of this command mean?</p>



<h5 class="wp-block-heading"><strong>a. Start your AI Deploy app</strong></h5>



<p class="wp-block-paragraph">Launch a new app using&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ovhai CLI</a>&nbsp;and name it.</p>



<p class="wp-block-paragraph"><code><strong>ovhai app run --name vllm-ministral-14B-autoscaling-custom-metric</strong></code></p>



<h5 class="wp-block-heading"><strong>b. Define access</strong></h5>



<p class="wp-block-paragraph">Define the HTTP API port and restrict access to your token.</p>



<p class="wp-block-paragraph"><strong><code>--default-http-port 8000</code><br><code>--label ai_deploy_token=my_operator_token</code></strong></p>



<h5 class="wp-block-heading"><strong>c. Configure GPU resources</strong></h5>



<p class="wp-block-paragraph">Specify the hardware type (<code><strong>h100-1-gpu</strong></code>), which refers to an&nbsp;<strong>NVIDIA H100 GPU</strong>, and the number of GPUs (<strong>1</strong>).</p>



<p class="wp-block-paragraph"><code><strong>--gpu 1<br>--flavor h100-1-gpu</strong></code></p>



<p class="wp-block-paragraph"><strong><mark>⚠️WARNING!</mark></strong>&nbsp;For this model, one H100 is sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access L40S and A100 GPUs for your LLM deployment.</p>



<h5 class="wp-block-heading"><strong>d. Set up environment variables</strong></h5>



<p class="wp-block-paragraph">Configure the cache directory for the&nbsp;<strong>Outlines library</strong>&nbsp;(used for structured text generation):</p>



<p class="wp-block-paragraph"><code><strong>-e OUTLINES_CACHE_DIR=/tmp/.outlines</strong></code></p>



<p class="wp-block-paragraph">Pass the&nbsp;<strong>Hugging Face token</strong>&nbsp;(<code>$MY_HF_TOKEN</code>) for model authentication and download:</p>



<p class="wp-block-paragraph"><code><strong>-e HF_TOKEN=$MY_HF_TOKEN</strong></code></p>



<p class="wp-block-paragraph">Set the&nbsp;<strong>Hugging Face cache directory</strong>&nbsp;to&nbsp;<code>/hub</code>&nbsp;(where models will be stored):</p>



<p class="wp-block-paragraph"><code><strong>-e HF_HOME=/hub</strong></code></p>



<p class="wp-block-paragraph">Allow execution of&nbsp;<strong>custom remote code</strong>&nbsp;from Hugging Face datasets (required for some model behaviours):</p>



<p class="wp-block-paragraph"><code><strong>-e HF_DATASETS_TRUST_REMOTE_CODE=1</strong></code></p>



<p class="wp-block-paragraph">Disable&nbsp;<strong>Hugging Face Hub transfer acceleration</strong>&nbsp;(to use standard model downloading):</p>



<p class="wp-block-paragraph"><code><strong>-e HF_HUB_ENABLE_HF_TRANSFER=0</strong></code></p>



<h5 class="wp-block-heading"><strong>e. Mount persistent volumes</strong></h5>



<p class="wp-block-paragraph">Mount&nbsp;<strong>two persistent storage volumes</strong>:</p>



<ol class="wp-block-list">
<li><code>/hub</code>&nbsp;→ Stores Hugging Face model files</li>



<li><code>/workspace</code>&nbsp;→ Main working directory</li>
</ol>



<p class="wp-block-paragraph">The&nbsp;<code>rw</code>&nbsp;flag means&nbsp;<strong>read-write access</strong>.</p>



<p class="wp-block-paragraph"><code><strong>-v standalone:/hub:rw<br>-v standalone:/workspace:rw</strong></code></p>



<h5 class="wp-block-heading"><strong>f. Health checks and readiness</strong></h5>



<p class="wp-block-paragraph">Configure <strong>liveness and readiness probes</strong>:</p>



<ol class="wp-block-list">
<li><code>/health</code> verifies the container is alive</li>



<li><code>/v1/models</code> confirms the model is loaded and ready to serve requests</li>
</ol>



<p class="wp-block-paragraph">The long initial delays (300 seconds) can be reduced; they correspond to the startup time of vLLM and the loading of the model on the GPU.</p>



<p class="wp-block-paragraph"><code><strong>--liveness-probe-path /health<br>--liveness-probe-port 8000<br>--liveness-initial-delay-seconds 300<br><br>--probe-path /v1/models<br>--probe-port 8000<br>--initial-delay-seconds 300</strong></code></p>



<h5 class="wp-block-heading"><strong>g. Autoscaling configuration (custom metrics)</strong></h5>



<p class="wp-block-paragraph">First set the minimum and maximum number of replicas.</p>



<p class="wp-block-paragraph"><strong><code>--auto-min-replicas 1<br>--auto-max-replicas 3</code></strong></p>



<p class="wp-block-paragraph">This guarantees basic availability (one replica always up) while allowing for peak capacity.</p>



<p class="wp-block-paragraph">Then enable autoscaling based on application-level metrics exposed by vLLM.</p>



<p class="wp-block-paragraph"><strong><code>--auto-custom-api-url "http://&lt;SELF&gt;:8000/metrics"<br>--auto-custom-metric-format PROMETHEUS<br>--auto-custom-value-location vllm:num_requests_running<br>--auto-custom-target-value 50<br>--auto-custom-metric-aggregation-type AVERAGE</code></strong></p>



<p class="wp-block-paragraph">AI Deploy:</p>



<ul class="wp-block-list">
<li>Scrapes the local <mark class="has-inline-color has-ast-global-color-0-color"><strong><code>/metrics</code></strong></mark> endpoint</li>



<li>Parses Prometheus-formatted metrics</li>



<li>Extracts the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>vllm:num_requests_running</code></mark></strong> gauge</li>



<li>Computes the average value across replicas</li>
</ul>



<p class="wp-block-paragraph">Scaling behaviour:</p>



<ul class="wp-block-list">
<li>When the average number of in-flight requests exceeds <strong><code><mark class="has-inline-color has-ast-global-color-0-color">50</mark></code></strong>, AI Deploy adds replicas</li>



<li>When load decreases, replicas are scaled down</li>
</ul>



<p class="wp-block-paragraph">This approach ensures high availability and predictable latency under fluctuating traffic.</p>



<h5 class="wp-block-heading"><strong>h. Choose the target Docker image and the startup command</strong></h5>



<p class="wp-block-paragraph">Use the official <strong><a href="https://hub.docker.com/r/vllm/vllm-openai/tags" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM OpenAI-compatible Docker image</a></strong>.</p>



<p class="wp-block-paragraph"><strong><code>vllm/vllm-openai:v0.13.0</code></strong></p>



<p class="wp-block-paragraph">Finally, run the model inside the container using a Python command to launch the vLLM API server:</p>



<ul class="wp-block-list">
<li><strong><code>python3 -m vllm.entrypoints.openai.api_server</code></strong>&nbsp;→ Starts the OpenAI-compatible vLLM API server</li>



<li><strong><code>--model mistralai/Ministral-3-14B-Instruct-2512</code></strong>&nbsp;→ Loads the&nbsp;<strong>Ministral 3 14B</strong>&nbsp;model from Hugging Face</li>



<li><strong><code>--tokenizer_mode mistral</code></strong>&nbsp;→ Uses the&nbsp;<strong>Mistral tokenizer</strong></li>



<li><strong><code>--load_format mistral</code></strong>&nbsp;→ Uses Mistral’s model loading format</li>



<li><strong><code>--config_format mistral</code></strong>&nbsp;→ Ensures the model configuration follows Mistral’s standard</li>



<li><code><strong>--enable-auto-tool-choice </strong></code>→ Lets the model decide on its own when to call a tool (function/tool calling)</li>



<li><strong><code>--tool-call-parser mistral </code></strong>→ Parses tool calls in the Mistral format</li>



<li><strong><code>--enable-prefix-caching</code></strong> → Prefix caching for improved throughput and reduced latency</li>
</ul>



<p class="wp-block-paragraph">You can now launch this command using <strong>ovhai CLI</strong>.</p>



<h4 class="wp-block-heading">3. Check AI Deploy app status</h4>



<p class="wp-block-paragraph">You can now check if your&nbsp;<strong>AI Deploy</strong>&nbsp;app is alive:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;your_vllm_app_id&gt;</code></pre>



<p class="wp-block-paragraph"><strong>Is your app in&nbsp;<code>RUNNING</code>&nbsp;status?</strong>&nbsp;Perfect! You can check in the logs that the server has started:</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;your_vllm_app_id&gt;</code></pre>



<p class="wp-block-paragraph"><strong><mark>⚠️WARNING!</mark></strong>&nbsp;This step may take a little time as the LLM must be loaded.</p>



<h4 class="wp-block-heading">4. Test that the deployment is functional</h4>



<p class="wp-block-paragraph">First, send a prompt to the LLM. Run the following query with the question of your choice:</p>



<pre class="wp-block-code"><code class="">curl https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \<br>  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "mistralai/Ministral-3-14B-Instruct-2512",<br>    "messages": [<br>      {"role": "system", "content": "You are a helpful assistant."},<br>      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}<br>    ],<br>    "stream": false<br>  }'</code></pre>
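
<p class="wp-block-paragraph">The server answers with an OpenAI-style JSON body, in which the generated text sits under <code>choices[0].message.content</code>. Assuming <code>jq</code> is installed locally (it is not listed in the prerequisites), the same kind of request can be piped through it to print only the reply. This is a minimal sketch, not a required step.</p>

<pre class="wp-block-code"><code class=""># -s silences the curl progress output so that only JSON reaches jq
curl -s https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \
  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Ministral-3-14B-Instruct-2512",
    "messages": [
      {"role": "user", "content": "Give me the name of the founder of OVHcloud."}
    ],
    "stream": false
  }' | jq -r '.choices[0].message.content'</code></pre>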



<p class="wp-block-paragraph">You can also verify access to vLLM metrics.</p>



<pre class="wp-block-code"><code class="">curl -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \<br>  https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/metrics</code></pre>



<p class="wp-block-paragraph">If both tests show that the model deployment is functional and you receive 200 HTTP responses, you are ready to move on to the next step!</p>
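
<p class="wp-block-paragraph">The raw <code>/metrics</code> output is long. To focus on the gauge that drives autoscaling, the Prometheus text can be filtered with <code>awk</code>. The sketch below assumes each sample line has the form <code>name{labels} value</code> with no spaces inside the labels; <code>sum_running_requests</code> is a hypothetical helper name, and the sample values are made up.</p>

<pre class="wp-block-code"><code class=""># Hypothetical helper: read Prometheus text on stdin and sum the vllm:num_requests_running samples
sum_running_requests() {
  # comment lines start with "#", so their first field never matches the metric name
  awk '$1 ~ /^vllm:num_requests_running/ { total += $NF } END { print total + 0 }'
}

# Offline check with a hand-written sample
printf '%s\n' \
  '# TYPE vllm:num_requests_running gauge' \
  'vllm:num_requests_running{model_name="demo"} 3.0' \
  'vllm:num_requests_waiting{model_name="demo"} 7.0' \
  | sum_running_requests</code></pre>

<p class="wp-block-paragraph">Against the live app, pipe the metrics request into the helper: <code>curl -s -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/metrics | sum_running_requests</code>.</p>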



<p class="wp-block-paragraph">The next step is to set up the observability and monitoring stack. This autoscaling mechanism is <strong>fully independent</strong> from Prometheus used for observability:</p>



<ul class="wp-block-list">
<li>AI Deploy queries the local <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code></mark></strong> endpoint internally</li>



<li>Prometheus scrapes the <strong>same metrics endpoint</strong> externally for monitoring, dashboards and potentially alerting</li>
</ul>



<p class="wp-block-paragraph">This ensures:</p>



<ul class="wp-block-list">
<li>A single source of truth for metrics</li>



<li>No duplication of exporters</li>



<li>Consistent signals for scaling and observability</li>
</ul>



<h3 class="wp-block-heading">Step 3 &#8211; Create an MKS cluster</h3>



<p class="wp-block-paragraph">From the <a href="https://manager.eu.ovhcloud.com/#/hub/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Control Panel</a>, create a Kubernetes cluster using <strong>MKS</strong>.</p>



<p class="wp-block-paragraph">Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Location</strong>: GRA (Gravelines) &#8211; <em>you can select the same region as for AI Deploy</em></li>



<li><strong>Network</strong>: Public</li>



<li><strong>Node pool</strong>:
<ul class="wp-block-list">
<li>Flavour: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">b2-15</mark></strong></code> (or something similar)</li>



<li>Number of nodes: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">3</mark></code></strong></li>



<li>Autoscaling: <strong><code><mark class="has-inline-color has-ast-global-color-0-color">OFF</mark></code></strong></li>
</ul>
</li>



<li><strong>Name your node pool:</strong> <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></strong></li>
</ul>



<p class="wp-block-paragraph">You should see your cluster (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>prometheus-vllm-metrics-ai-deploy</strong></mark></code>) in the list, along with the following information:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="632" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png" alt="" class="wp-image-30242" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1024x632.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-300x185.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-768x474.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-1536x948.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-3-2048x1264.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">If the status is green with the <strong><mark style="color:#00d084" class="has-inline-color"><code>OK</code></mark></strong> label, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 4 &#8211; Configure Kubernetes access</h3>



<p class="wp-block-paragraph">Download your <strong>kubeconfig file</strong> from the OVHcloud Control Panel and configure <strong><code><mark class="has-inline-color has-ast-global-color-0-color">kubectl</mark></code></strong>:</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>
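
<p class="wp-block-paragraph">If you prefer to script this check, the sketch below reads the output of <code>kubectl get nodes -o json</code> and lists the nodes whose <code>Ready</code> condition is <code>True</code>. It is only a minimal illustration: the helper names <code>ready_nodes</code> and <code>fetch_node_list</code> are hypothetical, and it assumes <code>kubectl</code> is installed and <code>KUBECONFIG</code> is exported as shown above.</p>

```python
import json
import subprocess


def ready_nodes(node_list: dict) -> list:
    """Return the names of the nodes whose 'Ready' condition is 'True'."""
    ready = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions):
            ready.append(node["metadata"]["name"])
    return ready


def fetch_node_list() -> dict:
    """Run kubectl (which honours KUBECONFIG) and parse its JSON output."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

<p class="wp-block-paragraph">Calling <code>ready_nodes(fetch_node_list())</code> should return three node names for the <code>monitoring</code> node pool created earlier, once all nodes are up.</p>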



<p class="wp-block-paragraph">Now you can create the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>values-prometheus.yaml</code></mark></strong> file:</p>



<pre class="wp-block-code"><code class=""># general configuration<br>nameOverride: "monitoring"<br>fullnameOverride: "monitoring"<br><br># Prometheus configuration<br>prometheus:<br>  prometheusSpec:<br>    # data retention (15d)<br>    retention: 15d<br>    <br>    # scrape interval (15s)<br>    scrapeInterval: 15s<br>    <br>    # persistent storage (required for production deployment)<br>    storageSpec:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed  # OVHcloud storage<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 50Gi  # (can be modified according to your needs)<br>    <br>    # scrape vLLM metrics from your AI Deploy instance (Ministral 3 14B)<br>    additionalScrapeConfigs:<br>      - job_name: 'vllm-ministral'<br>        scheme: https<br>        metrics_path: '/metrics'<br>        scrape_interval: 15s<br>        scrape_timeout: 10s<br>        <br>        # authentication using the AI Deploy Bearer token stored in a Kubernetes Secret<br>        bearer_token_file: /etc/prometheus/secrets/vllm-auth-token/token<br>        static_configs:<br>          - targets:<br>              - '&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net'  # /!\ REPLACE &lt;APP_ID&gt; WITH YOURS /!\<br>            labels:<br>              service: 'vllm'<br>              model: 'ministral'<br>              environment: 'production'<br>        <br>        # TLS configuration<br>        tls_config:<br>          insecure_skip_verify: false<br>    <br>    # kube-prometheus-stack mounts the secret under /etc/prometheus/secrets/ and makes it accessible to Prometheus<br>    secrets:<br>      - vllm-auth-token<br><br># Grafana configuration (visualization layer)<br>grafana:<br>  enabled: true<br>  <br>  # disable automatic datasource provisioning<br>  sidecar:<br>    datasources:<br>      enabled: false<br>  <br>  # persistent dashboards<br>  persistence:<br>    enabled: true<br>    storageClassName: csi-cinder-high-speed<br>    size: 10Gi<br>  <br>  # /!\ DEFINE ADMIN PASSWORD - REPLACE "test" WITH YOURS /!\<br>  adminPassword: "test"<br>  <br>  # access via OVHcloud LoadBalancer (public IP and managed LB)<br>  service:<br>    type: LoadBalancer<br>    port: 80<br>    annotations:<br>      # optional: restrict access to specific IPs<br>      # service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources: "1.2.3.4/32"<br>  <br># alertmanager (optional but recommended for production)<br>alertmanager:<br>  enabled: true<br>  <br>  alertmanagerSpec:<br>    storage:<br>      volumeClaimTemplate:<br>        spec:<br>          storageClassName: csi-cinder-high-speed<br>          accessModes: ["ReadWriteOnce"]<br>          resources:<br>            requests:<br>              storage: 10Gi<br><br># cluster observability components<br>nodeExporter:<br>  enabled: true<br>  <br>kubeStateMetrics:<br>  enabled: true</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>✅ <em>Note</em></strong></p>



<p class="wp-block-paragraph"><strong><em>On OVHcloud MKS, persistent storage is handled automatically through the Cinder CSI driver. When a PersistentVolumeClaim (PVC) references a supported <code>storageClassName</code> such as <code>csi-cinder-high-speed</code>, OVHcloud dynamically provisions the underlying Block Storage volume and attaches it to the node running the pod. This enables stateful components like Prometheus, Alertmanager and Grafana to persist data reliably without any manual volume management, making the architecture fully cloud-native and operationally simple.</em></strong></p>
</blockquote>
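
<p class="wp-block-paragraph">To pick a volume size for Prometheus, a common rule of thumb from the Prometheus storage documentation is: needed disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies it to a 15-day retention and a 15-second scrape interval. The figure of 100,000 active series is purely an assumption for illustration, and the function name is hypothetical; measure your own series count before sizing.</p>

```python
def estimate_prometheus_disk_bytes(active_series, scrape_interval_s, retention_days,
                                   bytes_per_sample=2.0):
    """Rough disk estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumption for illustration only: 100,000 active series in the cluster.
estimate = estimate_prometheus_disk_bytes(
    active_series=100_000,
    scrape_interval_s=15,
    retention_days=15,
)
print(f"Estimated TSDB size: {estimate / 2**30:.1f} GiB")
```

<p class="wp-block-paragraph">With these assumed inputs the estimate is roughly 16 GiB, which leaves some headroom within a 50Gi volume. The rule does not account for the write-ahead log or compaction overhead, so treat it as a lower bound.</p>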



<p class="wp-block-paragraph">Then create the <strong><code><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></code></strong> namespace:</p>



<pre class="wp-block-code"><code class=""># create namespace<br>kubectl create namespace monitoring<br><br># verify creation<br>kubectl get namespaces | grep monitoring</code></pre>



<p class="wp-block-paragraph">Finally, configure the Bearer token secret to access vLLM metrics.</p>



<pre class="wp-block-code"><code class=""># create bearer token secret (double quotes so the shell expands the variable)<br>kubectl create secret generic vllm-auth-token \<br>  --from-literal=token="$MY_OVHAI_ACCESS_TOKEN" \<br>  -n monitoring<br><br># verify secret creation<br>kubectl get secret vllm-auth-token -n monitoring<br><br># print the decoded token to check it (optional)<br>kubectl get secret vllm-auth-token -n monitoring \<br>  -o jsonpath='{.data.token}' | base64 -d</code></pre>
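
<p class="wp-block-paragraph">Before deploying the stack, you may want to check that the token really grants access to the vLLM <code>/metrics</code> endpoint. The sketch below mirrors what the scrape job does: an HTTPS <code>GET</code> on <code>/metrics</code> with an <code>Authorization: Bearer</code> header. The helper names are hypothetical, and the application ID is a placeholder to replace with yours.</p>

```python
import urllib.request


def build_metrics_request(app_id, token):
    """Build the same request Prometheus sends: GET /metrics with a Bearer token."""
    url = f"https://{app_id}.app.gra.ai.cloud.ovh.net/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(app_id, token, timeout_s=10.0):
    """Return the raw metrics text; raises urllib.error.HTTPError on 401/403."""
    request = build_metrics_request(app_id, token)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8")
```

<p class="wp-block-paragraph">For example, <code>fetch_metrics("your-app-id", os.environ["MY_OVHAI_ACCESS_TOKEN"])</code> (after <code>import os</code>) should return text containing lines that start with <code>vllm:</code> if the token is valid.</p>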



<p class="wp-block-paragraph">Right, if everything is working, let&#8217;s move on to deployment.</p>



<h3 class="wp-block-heading">Step 5 &#8211; Deploy Prometheus stack</h3>



<p class="wp-block-paragraph">Add the Prometheus Helm repository and install the monitoring stack. The deployment creates:</p>



<ul class="wp-block-list">
<li>Prometheus StatefulSet with persistent storage</li>



<li>Grafana deployment with LoadBalancer access</li>



<li>Alertmanager for future alert configuration (optional)</li>



<li>Supporting components (node exporters, kube-state-metrics)</li>
</ul>



<pre class="wp-block-code"><code class=""># add Helm repository<br>helm repo add prometheus-community \<br>  https://prometheus-community.github.io/helm-charts<br>helm repo update<br><br># install monitoring stack<br>helm install monitoring prometheus-community/kube-prometheus-stack \<br>  --namespace monitoring \<br>  --values values-prometheus.yaml \<br>  --wait</code></pre>



<p class="wp-block-paragraph">Then you can retrieve the LoadBalancer IP address to access Grafana:</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n monitoring monitoring-grafana</code></pre>
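
<p class="wp-block-paragraph">The external IP can take a minute or two to appear while the load balancer is provisioned. As a small sketch, the function below extracts it from the JSON form of the same command (<code>kubectl get svc -n monitoring monitoring-grafana -o json</code>) and returns <code>None</code> while it is still pending. The function name is hypothetical.</p>

```python
from typing import Optional


def external_ip(service: dict) -> Optional[str]:
    """Return the first LoadBalancer ingress IP of a Service, or None if pending."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        if entry.get("ip"):
            return entry["ip"]
    return None
```

<p class="wp-block-paragraph">You can feed it the parsed output of <code>kubectl</code> (for instance with <code>json.loads</code>) and retry until it returns an address.</p>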



<p class="wp-block-paragraph">Finally, open <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://&lt;EXTERNAL-IP&gt;</mark></strong></code> in your browser and log in with:</p>



<ul class="wp-block-list">
<li><strong>Username</strong>: <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>admin</strong></mark></code></li>



<li><strong>Password</strong>: as configured in your <code><strong><mark class="has-inline-color has-ast-global-color-0-color">values-prometheus.yaml</mark></strong></code> file</li>
</ul>



<h3 class="wp-block-heading">Step 6 &#8211; Create Grafana dashboards</h3>



<p class="wp-block-paragraph">In this step, you will access the Grafana interface, add Prometheus as a new data source, then create a complete dashboard with various vLLM metrics.</p>



<h4 class="wp-block-heading">1. Add a new data source in Grafana</h4>



<p class="wp-block-paragraph">First of all, create a new Prometheus connection inside Grafana:</p>



<ul class="wp-block-list">
<li>Navigate to <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Connections</code></mark></strong> → <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>Data sources</code></mark></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Add data source</mark></code></strong></li>



<li>Select <strong>Prometheus</strong></li>



<li>Configure URL: <code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://monitoring-prometheus:9090</mark></strong></code></li>



<li>Click <strong>Save &amp; test</strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="609" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png" alt="" class="wp-image-30247" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1024x609.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-300x178.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-768x457.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-1536x913.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-4-2048x1218.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
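
<p class="wp-block-paragraph">The same data source can also be created without the UI. The sketch below builds a request for Grafana&#8217;s HTTP API (<code>POST /api/datasources</code>) using basic authentication with the <code>admin</code> user. The function name is hypothetical, and the payload fields should be checked against the API documentation of your Grafana version.</p>

```python
import base64
import json
import urllib.request


def datasource_request(grafana_url, admin_password,
                       prometheus_url="http://monitoring-prometheus:9090"):
    """Build a POST /api/datasources request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",      # Grafana's backend queries Prometheus in-cluster
        "isDefault": True,
    }
    credentials = base64.b64encode(f"admin:{admin_password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )
```

<p class="wp-block-paragraph">Passing the result to <code>urllib.request.urlopen</code> sends the request; the in-cluster URL works because Grafana and Prometheus run in the same <code>monitoring</code> namespace.</p>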



<p class="wp-block-paragraph">Now that your Prometheus has been configured as a new data source, you can create your Grafana dashboard.</p>



<h4 class="wp-block-heading">2. Create your monitoring dashboard</h4>



<p class="wp-block-paragraph">To begin with, you can use the following pre-configured Grafana dashboard by downloading this JSON file locally:</p>





<p class="wp-block-paragraph">In the left-hand menu, select <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboards</mark></code></strong>:</p>



<ol class="wp-block-list">
<li>Navigate to <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Dashboards</mark></code></strong> → <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Import</mark></code></strong></li>



<li>Upload the provided dashboard JSON file (<strong><code><mark class="has-inline-color has-ast-global-color-0-color">vLLM-metrics-grafana-monitoring.json</mark></code></strong>)</li>



<li>Select <strong>Prometheus</strong> as the data source</li>



<li>Click <strong>Import</strong></li>
</ol>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="449" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png" alt="" class="wp-image-30250" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1024x449.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-300x131.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-768x337.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-1536x673.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-6-2048x897.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">The dashboard provides real-time visibility into <strong>Ministral 3 14B</strong>, served by the vLLM container on OVHcloud AI Deploy.</p>



<p class="wp-block-paragraph">You can now track:</p>



<ul class="wp-block-list">
<li><strong>Performance metrics</strong>: TTFT, inter-token latency, end-to-end latency</li>



<li><strong>Throughput indicators</strong>: Requests per second, token generation rates</li>



<li><strong>Resource utilisation</strong>: KV cache usage, active/waiting requests</li>



<li><strong>Capacity indicators</strong>: Queue depth, preemption rates</li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png" alt="" class="wp-image-30253" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-7-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">Here are the key metrics tracked and displayed in the Grafana dashboard:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric Category</th><th>Prometheus Metric</th><th>Description</th><th>Use case</th></tr></thead><tbody><tr><td><strong>Latency</strong></td><td><code>vllm:time_to_first_token_seconds</code></td><td>Time until first token generation</td><td>User experience monitoring</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:inter_token_latency_seconds</code></td><td>Time between tokens</td><td>Throughput optimisation</td></tr><tr><td><strong>Latency</strong></td><td><code>vllm:e2e_request_latency_seconds</code></td><td>End-to-end request time</td><td>SLA monitoring</td></tr><tr><td><strong>Throughput</strong></td><td><code>vllm:request_success_total</code></td><td>Successful requests counter</td><td>Capacity planning</td></tr><tr><td><strong>Resource</strong></td><td><code>vllm:kv_cache_usage_perc</code></td><td>KV cache memory usage</td><td>Memory management</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_running</code></td><td>Active requests</td><td>Load monitoring</td></tr><tr><td><strong>Queue</strong></td><td><code>vllm:num_requests_waiting</code></td><td>Queued requests</td><td>Overload detection</td></tr><tr><td><strong>Capacity</strong></td><td><code>vllm:num_preemptions_total</code></td><td>Request preemptions</td><td>Peak load indicator</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:prompt_tokens_total</code></td><td>Input tokens processed</td><td>Usage analytics</td></tr><tr><td><strong>Tokens</strong></td><td><code>vllm:generation_tokens_total</code></td><td>Output tokens generated</td><td>Cost tracking</td></tr></tbody></table></figure>
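
<p class="wp-block-paragraph">The three latency metrics are Prometheus histograms, so dashboards usually display them as percentiles computed with <code>histogram_quantile</code>. The sketch below shows one possible PromQL query for the p95 TTFT and a simplified Python version of the interpolation Prometheus performs on cumulative buckets. The bucket values in the usage note are invented for illustration, and the function name is hypothetical.</p>

```python
import math

# One possible p95 TTFT query, built from the histogram's _bucket series.
TTFT_P95_QUERY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))"
)


def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    with at least two entries and math.inf as the last upper bound.
    """
    if q > 1.0 or 0.0 > q:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket
```

<p class="wp-block-paragraph">For example, with the invented buckets <code>[(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]</code>, the median falls in the 0.1 to 0.5 s bucket and is interpolated to about 0.42 s. This also shows why the estimate can never be finer than the bucket boundaries.</p>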



<p class="wp-block-paragraph">Well done, you now have at your disposal:</p>



<ul class="wp-block-list">
<li>An endpoint of the Ministral 3 14B model deployed with vLLM thanks to <strong>OVHcloud AI Deploy</strong> and its autoscaling strategies based on custom metrics</li>



<li>Prometheus for metrics collection and Grafana for visualisation/dashboards thanks to <strong>OVHcloud MKS</strong></li>
</ul>



<p class="wp-block-paragraph"><strong>But how can you check that everything will work when the load increases?</strong></p>



<h3 class="wp-block-heading">Step 7 &#8211; Test autoscaling and real-time visualisation</h3>



<p class="wp-block-paragraph">The first objective here is to put the AI Deploy application under enough load to:</p>



<ul class="wp-block-list">
<li>Increase <code>vllm:num_requests_running</code></li>



<li>&#8216;Saturate&#8217; a single replica</li>



<li>Trigger the <strong>scale up</strong></li>



<li>Observe replica increase + latency drop</li>
</ul>



<h4 class="wp-block-heading">1. Autoscaling testing strategy</h4>



<p class="wp-block-paragraph">The goal is to combine:</p>



<ul class="wp-block-list">
<li><strong>High concurrency</strong></li>



<li><strong>Long prompts</strong> (KV cache heavy)</li>



<li><strong>Long generations</strong></li>



<li><strong>Bursty load</strong></li>
</ul>



<p class="wp-block-paragraph">This is what vLLM autoscaling actually reacts to.</p>



<p class="wp-block-paragraph">To do so, a Python script can simulate the expected load:</p>



<pre class="wp-block-code"><code class="">import os<br>import random<br>import threading<br>import time<br>from statistics import mean<br><br>from openai import OpenAI<br><br>APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/v1" # /!\ REPLACE &lt;APP_ID&gt; WITH YOURS /!\<br>MODEL = "mistralai/Ministral-3-14B-Instruct-2512"<br>API_KEY = os.environ["MY_OVHAI_ACCESS_TOKEN"]  # AI Deploy token read from the environment<br><br>CONCURRENT_WORKERS = 500          # concurrency (main scaling trigger)<br>REQUESTS_PER_WORKER = 25<br>MAX_TOKENS = 768                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># OpenAI-compatible client pointing at the vLLM endpoint<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key=API_KEY,<br>)<br><br># basic metrics, shared between threads and protected by a lock<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker: sends REQUESTS_PER_WORKER requests one after another and records each latency<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception:<br>            # count any failure (timeout, HTTP error, etc.)<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("Starting autoscaling stress test...")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n=== AUTOSCALING BENCH RESULTS ===")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>
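
<p class="wp-block-paragraph">Average, minimum and maximum latency hide the tail, which is usually what users notice under load. As an optional addition, the sketch below summarises a list of latencies (such as the <code>latencies</code> list collected by the script) into p50, p95 and p99 with the standard library. The function name is hypothetical.</p>

```python
from statistics import quantiles


def latency_percentiles(latencies):
    """Return p50/p95/p99 (in the same unit as the input) for a list of latencies."""
    if len(latencies) >= 2:
        # n=100 yields 99 cut points: cuts[k - 1] is the k-th percentile.
        cuts = quantiles(latencies, n=100)
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    raise ValueError("at least two latency samples are required")
```

<p class="wp-block-paragraph">Comparing the p95 before and after a scale-up is a simple way to confirm that the additional replicas actually absorbed the load.</p>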



<p class="wp-block-paragraph"><strong>How can you verify that autoscaling is working and that the load is being handled correctly without latency skyrocketing?</strong></p>



<h4 class="wp-block-heading">2. Hardware and platform-level monitoring</h4>



<p class="wp-block-paragraph">First, <strong>AI Deploy Grafana</strong> answers <strong>&#8216;What resources are being used and how many replicas exist?&#8217;</strong></p>



<p class="wp-block-paragraph">GPU utilisation, GPU memory, CPU, RAM and replica count are monitored through <strong>OVHcloud AI Deploy Grafana</strong> (monitoring URL), which exposes infrastructure and runtime metrics for the AI Deploy application. This layer provides visibility into <strong>resource saturation and scaling events</strong> managed by the AI Deploy platform itself.</p>



<p class="wp-block-paragraph">Access it using the following URL (do not forget to replace <code><mark class="has-inline-color has-ast-global-color-0-color"><strong>&lt;APP_ID&gt;</strong></mark></code> with yours): <strong><code>https://monitoring.gra.ai.cloud.ovh.net/d/app/app-monitoring?var-app=</code><mark class="has-inline-color has-ast-global-color-0-color"><code>&lt;APP_ID&gt;</code></mark><code>&amp;orgId=1</code></strong></p>



<p class="wp-block-paragraph">For example, check GPU/RAM metrics:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png" alt="" class="wp-image-30260" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-8-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">You can also monitor scale ups and downs in real time, as well as information on HTTP calls and much more!</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png" alt="" class="wp-image-30261" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-9-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h4 class="wp-block-heading">3. Software and application-level monitoring</h4>



<p class="wp-block-paragraph">Next, the combination of MKS + Prometheus + Grafana answers <strong>&#8216;How does the inference engine behave internally?&#8217;</strong></p>



<p class="wp-block-paragraph">In fact, vLLM internal metrics (request concurrency, token throughput, latency indicators, KV cache pressure, etc.) are collected via the <strong>vLLM <code>/metrics</code> endpoint</strong> and scraped by <strong>Prometheus running on OVHcloud MKS</strong>, then visualised in a <strong>dedicated Grafana instance</strong>. This layer focuses on <strong>model behaviour and inference performance</strong>.</p>
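
<p class="wp-block-paragraph">To get a feel for what Prometheus scrapes, the sketch below parses text in the Prometheus exposition format and keeps only the <code>vllm:</code> series. It is deliberately simplified (it ignores optional timestamps and does not split labels), the function name is hypothetical, and the sample values are invented.</p>

```python
def parse_vllm_samples(text, prefix="vllm:"):
    """Map each 'name{labels}' series starting with prefix to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        if series.startswith(prefix):
            samples[series] = float(value)
    return samples


# Invented sample, shaped like a /metrics response.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="ministral"} 3.0
vllm:num_requests_waiting{model_name="ministral"} 12.0
process_cpu_seconds_total 42.5
"""

for series, value in parse_vllm_samples(SAMPLE).items():
    print(series, value)
```

<p class="wp-block-paragraph">A growing <code>vllm:num_requests_waiting</code> value alongside a flat <code>vllm:num_requests_running</code> is the kind of signal that indicates a saturated replica.</p>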



<p class="wp-block-paragraph">Find all these metrics via (just replace <strong><code><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></code></strong>): <strong><code>http://<mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark>/d/vllm-ministral-monitoring/ministral-14b-vllm-metrics-monitoring?orgId=1</code></strong></p>



<p class="wp-block-paragraph">Find key metrics such as TTFT:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png" alt="" class="wp-image-30263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-10-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">You can also find some information about <strong>&#8216;Model load and throughput&#8217;</strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png" alt="" class="wp-image-30264" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-1536x811.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/image-11-2048x1081.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">To go further and add even more metrics, you can refer to the vLLM documentation on &#8216;<a href="https://docs.vllm.ai/en/v0.7.2/getting_started/examples/prometheus_grafana.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus and Grafana</a>&#8217;.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p class="wp-block-paragraph">This reference architecture provides a scalable and production-ready approach for deploying LLM inference on OVHcloud using <strong>AI Deploy</strong> and its <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-apps-deployments?id=kb_article_view&amp;sysparm_article=KB0047997#advanced-custom-metrics-for-autoscaling" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">autoscaling on custom metrics feature</a>.</p>



<p class="wp-block-paragraph">OVHcloud <strong>MKS</strong> is dedicated to running Prometheus and Grafana, enabling secure scraping and visualisation of <strong>vLLM internal metrics</strong> exposed via the <strong><mark class="has-inline-color has-ast-global-color-0-color"><code>/metrics</code> </mark></strong>endpoint.</p>



<p class="wp-block-paragraph">By scraping vLLM metrics securely from AI Deploy into Prometheus and exposing them through Grafana, the architecture provides full visibility into model behaviour, performance and load, enabling informed scaling analysis, troubleshooting and capacity planning in production environments.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Picking our Prometheus&#8217; remote storage</title>
		<link>https://blog.ovhcloud.com/picking-our-prometheus-remote-storage/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Mon, 17 Apr 2023 14:43:34 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24588</guid>

					<description><![CDATA[If you are running an IT system, you are most likely using an observability stack alongside it. Nowadays, the question is no longer whether you need observability, but rather how you will compose your stack. At OVHcloud, we have been running a scalable timeseries backend for years now. Over the last year, we [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph">If you are running an IT system, you are most likely using an observability stack alongside it. Nowadays, the question is no longer whether you need observability, but rather how you will compose your stack. At OVHcloud, we have been running a scalable timeseries backend for years now.</p>



<p class="wp-block-paragraph">Over the last year, we have had the opportunity to reassess our technical choices. Prometheus is the <em>de facto</em> standard, but choosing it is only the beginning of the process. Thanks to open source communities, there are a lot of possible choices.</p>



<p class="wp-block-paragraph">The <a href="https://blog.ovhcloud.com/tag/prometheus/" data-wpel-link="internal">previous posts</a> covered the process we followed to select our new backend; this one concludes the series and shares what we have chosen and why. In case you missed them, the series covers an <a href="https://blog.ovhcloud.com/welcome-to-prometheus-world-of-remote-storage/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">introduction to Prometheus remote storage</a> and how to benchmark such a solution from both the <a href="https://blog.ovhcloud.com/prometheus-remote-storage-playground/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">write</a> and <a href="https://blog.ovhcloud.com/benchmarking-prometheus-promql-performance/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">read</a> perspectives, either the hard way or <a href="https://blog.ovhcloud.com/benchmarking-prometheus-like-a-pro-with-k6/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">like a pro</a>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-1024x538.jpg" alt="Picking our Prometheus' remote storage" class="wp-image-25069" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<h3 class="wp-block-heading">And the winner is&#8230; Grafana Mimir!</h3>



<p class="wp-block-paragraph">After all our experiments, we have chosen Grafana Mimir. The first reason this solution is a good fit for us is that its read/write performance is excellent, as is its scalability. The main mission of my team, core-observability, is to provide a resilient and feature-rich observability infrastructure. Many teams rely on us, and each has its own particularities. Multitenancy is a must-have for us, and with it we must be able to prevent side effects such as the &#8220;noisy neighbour&#8221; problem. This is why rate limiting was on our wish list. Mimir provides a lot of settings, at both the cluster level and the tenant level, to make sure one tenant does not impact the others or degrade the overall quality of service.</p>
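
<p class="wp-block-paragraph">To illustrate why per-tenant rate limiting contains a noisy neighbour, here is a minimal token bucket sketch with one bucket per tenant. It is a generic illustration of the idea, not Mimir&#8217;s implementation: the class names, the rate and the burst values are all hypothetical, and time is passed in explicitly to keep the example deterministic.</p>

```python
class TokenBucket:
    """Allows `rate` units per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst       # start with a full bucket
        self.updated_at = 0.0

    def allow(self, cost, now):
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """Keeps an independent bucket per tenant, so limits never leak across tenants."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, tenant, cost, now):
        if tenant not in self.buckets:
            self.buckets[tenant] = TokenBucket(self.rate, self.burst)
        return self.buckets[tenant].allow(cost, now)


limiter = TenantLimiter(rate=10, burst=20)
print(limiter.allow("noisy", 20, now=0.0))   # burst fully consumed
print(limiter.allow("noisy", 1, now=0.0))    # rejected: bucket is empty
print(limiter.allow("quiet", 5, now=0.0))    # unaffected by the noisy tenant
```

<p class="wp-block-paragraph">Because each tenant draws from its own bucket, a tenant that exhausts its budget is rejected while the others keep writing normally.</p>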



<figure class="wp-block-image alignright size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1489.png" alt="Grafana Mimir" class="wp-image-25072" width="265" height="96" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1489.png 529w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1489-300x108.png 300w" sizes="auto, (max-width: 265px) 100vw, 265px" /></figure>



<p class="wp-block-paragraph">Like many cloud native technologies, Mimir relies on <a href="https://www.ovhcloud.com/en/public-cloud/object-storage/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">object storage</a> to store the timeseries. Doing so decouples compute from storage, so you do not need to add more computing power or bigger disks just to offer the retention your users need. Data is compacted to keep the storage footprint as small as possible and therefore achieve cost efficiency.</p>



<p class="wp-block-paragraph">As we said earlier, Prometheus is today the <em>de facto</em> standard when it comes to timeseries. We wanted to offer our users the full experience: 100% compliance with <a href="https://promlabs.com/promql-compliance-tests/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">PromQL</a>, plus recording and alerting rules. Mimir is fully featured on this side, and it is even part of a bigger picture with more integrations, which is the icing on the cake. Let&#8217;s start with Grafana, which is of course fully compatible with Mimir: you can also manage your recording and alerting rules directly from the UI. Then comes <a href="https://grafana.com/oss/loki/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Loki</a>, which is like Prometheus but for logs: it allows you to query your logs just like your metrics. And finally there is <a href="https://grafana.com/oss/tempo/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Tempo</a>, which covers the last observability pillar: distributed tracing.</p>



<p class="wp-block-paragraph">On the operational side, there is no doubt that Mimir has been built with production stability and resiliency in mind. The default settings are production ready, the documentation is crystal clear, and you also get the material you need for the day-to-day care of Mimir in production. As SREs running Mimir, you can rely on the maintainers&#8217; knowledge base: ready-to-use <a href="https://github.com/grafana/mimir/tree/main/operations/mimir-mixin-compiled/dashboards" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">dashboards</a>, <a href="https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/rules.yaml" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">recording</a> &amp; <a href="https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/alerts.yaml" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">alerting</a> rules and <a href="https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">runbooks</a> are at your disposal. Of course, deployments differ from one another. This is a very good opportunity to contribute back to the vibrant open source community around Grafana Labs. No matter the size of the contribution, it is always welcomed and reviewed in a timely manner. 
Whether you need to adjust the <a href="https://github.com/grafana/mimir/pull/2657" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">dashboards</a>, add a <a href="https://github.com/grafana/mimir/pull/2864" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">feature</a>&nbsp;or <a href="https://github.com/grafana/mimir/pull/1803" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">build deb/rpm packages</a>, you can always <a href="https://github.com/grafana/mimir/tree/main/docs/internal/contributing" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">contribute</a>.</p>



<p class="wp-block-paragraph">The decisive reason why we chose Mimir is the core values of its maintainers. Kudos to them. They are welcoming, easy going and, more importantly, they take <a href="https://grafana.com/blog/2022/03/30/announcing-grafana-mimir/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">open source seriously</a>, just like us at OVHcloud. If you want a glimpse of that, come by their Slack to see how fast they answer.</p>



<p class="wp-block-paragraph">My team can&#8217;t wait to see all the beautiful things our users will do with this backend. One thing&#8217;s for sure, we&#8217;ll contribute back and make sure Mimir thrives. Let&#8217;s keep that part for a future blog post.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Benchmarking Prometheus like a pro with k6</title>
		<link>https://blog.ovhcloud.com/benchmarking-prometheus-like-a-pro-with-k6/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Tue, 04 Apr 2023 12:19:05 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24585</guid>

					<description><![CDATA[In our previous posts about choosing a Prometheus remote storage we saw how to&#160;set up a benchmarking infrastructure and how to benchmark PromQL performance. We were able to obtain results, but the whole benchmark is flawed in many ways: This blog post discusses how we should have benchmarked our remote storage. How to [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph">In our previous posts about choosing a Prometheus remote storage we saw how to&nbsp;<a href="https://blog.ovhcloud.com/prometheus-remote-storage-playground" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">set up a benchmarking infrastructure</a> and <a href="https://blog.ovhcloud.com/benchmarking-prometheus-promql-performance" data-wpel-link="internal">how to benchmark PromQL performance</a>. We were able to obtain results, but the whole benchmark is flawed in many ways:</p>



<ul class="wp-block-list">
<li>it&#8217;s expensive, as you need to spawn more infrastructure than is necessary to assess a particular aspect of your remote storage.</li>



<li>it&#8217;s hard to reproduce exactly the same setup: even with the same configuration and software versions, you will get similar results but not identical ones.</li>



<li>you&#8217;re not always benchmarking what you think you are. We spent quite some time troubleshooting performance issues which were actually in <a href="https://github.com/prometheus/prometheus/issues/9807" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> or in the HAProxy configuration.</li>



<li>it focuses mainly on the write path, without any stress on the read path, which is not realistic.</li>
</ul>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-1024x538.jpg" alt="Benchmarking Prometheus like a pro with k6" class="wp-image-24943" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p class="wp-block-paragraph">This blog post discusses how we should have benchmarked our remote storage.</p>



<h3 class="wp-block-heading">How to do a good benchmark? K6 to the rescue</h3>



<p class="wp-block-paragraph">A good benchmark needs to be accurate and reproducible. Moreover, for our use case we want a tool that takes into account both the read and the write path of Prometheus. Finally, we need to be able to remove all unnecessary pieces, so that we can focus on the remote storage only.</p>



<p class="wp-block-paragraph">Such a tool could be a project of its own, but fortunately for us there is an open source solution for that: <a href="https://k6.io/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">k6</a>.</p>



<p class="wp-block-paragraph">k6 is a modern, general-purpose load testing tool which can be extended with modules to support Prometheus remote storage. Sounds interesting, don&#8217;t you think?</p>



<p class="wp-block-paragraph">In our previous blog post we explained how we built our benchmarking infrastructure, which was rather complex, to be accurate.</p>



<figure class="wp-block-image aligncenter"><img decoding="async" src="https://github.com/wilfriedroset/remote-storage-wars/blob/master/assets/generic-infrastructure.png?raw=true" alt="generic-infrastructure.png"/></figure>



<p class="wp-block-paragraph">With k6 as the benchmarking tool, the infrastructure can be greatly simplified:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-1024x436.png" alt="With k6 as benchmarking tool the infrastructure can be greatly simplified" class="wp-image-24941" width="512" height="218" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-1024x436.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-300x128.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-768x327.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6.png 1127w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p class="wp-block-paragraph">k6 is quite flexible and configurable. Its input is a load testing script: you can either write your own or reuse an <a href="https://github.com/grafana/mimir/tree/main/operations/k6" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">open source one</a>. As the whole logic lives in the load testing script, the benchmark becomes easily reproducible, which is exactly what we need.</p>



<p class="wp-block-paragraph">To launch a benchmark you need two pieces of infrastructure:</p>



<ul class="wp-block-list">
<li>Somewhere to run k6, which could be a <a href="https://www.ovhcloud.com/en-ie/public-cloud/prices/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">c2-120 instance on our public cloud</a></li>



<li>A remote storage to benchmark. For a quick start, users are one Helm command away from running one on <a href="https://www.ovhcloud.com/en-ie/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">k8s</a></li>
</ul>



<p class="wp-block-paragraph">For our use case we chose to reuse the load testing script from Grafana, which does exactly what we are looking for. All the information needed to tune and assess your remote storage is output by k6.</p>



<pre class="wp-block-code"><code class="">     ✓ write worked

     █ instant query high cardinality

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field to equal 'success'
       ✓ expected data.resultType field to equal 'vector'

     █ range query

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field is 'success' to equal 'success'
       ✓ expected resultType is 'matrix' to equal 'matrix'

     █ instant query low cardinality

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field to equal 'success'
       ✓ expected data.resultType field to equal 'vector'

     checks............................................................................: 100.00% ✓ 1454     ✗ 0
     ✓ { type:read }...................................................................: 0.00%   ✓ 0        ✗ 0
     ✓ { type:write }..................................................................: 100.00% ✓ 6        ✗ 0
     data_received.....................................................................: 1.0 MB  8.4 kB/s
     data_sent.........................................................................: 277 kB  2.3 kB/s
     group_duration....................................................................: avg=64.61ms min=39.94ms med=60.43ms max=230.05ms p(90)=80.39ms p(95)=107.93ms
     http_req_blocked..................................................................: avg=4.65ms  min=2µs     med=6µs     max=96.84ms  p(90)=11µs    p(95)=58.42ms
     http_req_connecting...............................................................: avg=1.31ms  min=0s      med=0s      max=21.87ms  p(90)=0s      p(95)=16.99ms
     http_req_duration.................................................................: avg=53.7ms  min=34.23ms med=52.71ms max=164.1ms  p(90)=67.02ms p(95)=71.82ms
       { expected_response:true }......................................................: avg=53.7ms  min=34.23ms med=52.71ms max=164.1ms  p(90)=67.02ms p(95)=71.82ms
     ✓ { type:read }...................................................................: avg=53.8ms  min=34.23ms med=52.76ms max=164.1ms  p(90)=66.85ms p(95)=71.62ms
     ✓ { url:https://admin:security-matters@remote-storage.poc.ovh.net/api/v1/push }...: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_failed...................................................................: 0.00%   ✓ 0        ✗ 368
     http_req_receiving................................................................: avg=92.34µs min=32µs    med=89µs    max=301µs    p(90)=125.3µs p(95)=150µs
     http_req_sending..................................................................: avg=49.05µs min=12µs    med=40µs    max=566µs    p(90)=68µs    p(95)=94.59µs
     http_req_tls_handshaking..........................................................: avg=3.11ms  min=0s      med=0s      max=54.28ms  p(90)=0s      p(95)=39.39ms
     http_req_waiting..................................................................: avg=53.56ms min=33.94ms med=52.56ms max=163.93ms p(90)=66.88ms p(95)=71.66ms
     http_reqs.........................................................................: 368     3.064697/s
     iteration_duration................................................................: avg=64.88ms min=40.34ms med=60.78ms max=230.27ms p(90)=80.87ms p(95)=108.47ms
     iterations........................................................................: 368     3.064697/s
     vus...............................................................................: 26      min=26     max=26
     vus_max...........................................................................: 26      min=26     max=26
</code></pre>



<p class="wp-block-paragraph">What a time saver! With k6 we have been able to efficiently assess all remote storage solutions. This is a <strong>significant</strong> improvement compared to our previous benchmarking plan.</p>



<p class="wp-block-paragraph">The next and final post will be about which remote storage we have chosen to be our internal solution.</p>



<p class="wp-block-paragraph">Stay tuned.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Benchmarking Prometheus promql performance</title>
		<link>https://blog.ovhcloud.com/benchmarking-prometheus-promql-performance/</link>
		
		<dc:creator><![CDATA[Julien Girard]]></dc:creator>
		<pubDate>Fri, 17 Mar 2023 12:00:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24598</guid>

					<description><![CDATA[Here at OVHcloud, we are replacing our legacy metrics-oriented infrastructure. This infrastructure matters a lot to us, as internal teams use it to supervise the core services of OVH, so before making any choices, we wanted to apply a bulletproof test to the challengers. You can do two main things with a storage [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph">Here at OVHcloud, we are replacing our legacy metrics-oriented infrastructure. This infrastructure matters a lot to us, as internal teams use it to supervise the core services of OVH, so before making any choices, we wanted to apply a bulletproof test to the challengers.</p>



<p class="wp-block-paragraph">You can do two main things with a storage backend: you can write to it or you can read from it. Today we are focusing on testing the latter. We wanted our test to reproduce a production-oriented scenario, so let’s see how we built it.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-1024x538.jpg" alt="Benchmarking Prometheus promql performance" class="wp-image-24878" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p class="wp-block-paragraph">In this blog post we won’t cover building the underlying TSDB, as the test applies to any of them as long as it ensures PromQL compatibility. We will also assume that you can write to the TSDB using the Prometheus remote write protocol.</p>



<p class="wp-block-paragraph">Now that we have our bench cluster up and running, we need to fill it up with data and this is the subject of the following part.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Let’sfindsome“real”data">Let’s find some <em>&#8220;real&#8221;</em> data</h3>



<p class="wp-block-paragraph">As a cloud provider, all our solutions use compute instances, whether they are virtual or bare metal. One of our most common use cases is to <em>“look”</em> at server system metrics through automatic <a href="https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">recording rules</a> or through Grafana dashboards. All these queries are PromQL expressions.</p>



<p class="wp-block-paragraph">To emulate our ingestion workflow, we deployed nodes exposing their metrics through <a href="https://github.com/prometheus/node_exporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">node exporter</a>. We also tasked a couple of Prometheus instances with scraping them several times to emulate a large number of hosts (several thousand). Those Prometheus instances are in charge of writing the scraped metrics, using the remote write protocol, to the various backends we are benchmarking.</p>



<p class="wp-block-paragraph">After waiting several hours or days, our backends are full of data and we can move on. If you need more info on this subject, we have written another <a href="https://blog.ovhcloud.com/prometheus-remote-storage-playground/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">blog post</a> about it.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-It’stimetobench">It’s time to bench</h3>



<p class="wp-block-paragraph">As we said earlier, our production read workload has two components: automatic recording rules and Grafana dashboards. As our alerting system is not widely distributed, we won’t discuss it here, so let’s focus on the Grafana part. A dashboard is a collection of requests to execute against a backend, which is why we extracted both the range and the instant queries from one.</p>



<p class="wp-block-paragraph">Once we have these queries, we need a way to execute them against the backend. As a PromQL request is mainly an HTTP call, we can use an HTTP benchmark tool to support our test. One of the most widely used is <a href="https://jmeter.apache.org" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Apache JMeter</a> and this is the one we used.</p>



<figure class="wp-block-image alignright size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1336.png" alt="Grafana dashboard" class="wp-image-24880" width="235" height="184" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1336.png 469w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1336-300x235.png 300w" sizes="auto, (max-width: 235px) 100vw, 235px" /></figure>



<p class="wp-block-paragraph">Apache JMeter cannot directly execute PromQL requests against a Prometheus-compliant backend, so the previously extracted queries have been converted to a test plan. This test plan takes various parameters, but three of them are quite important: the timestamp, the interval and the step. They apply to every query forwarded to the backend, just like when you pick a time frame for a dashboard in Grafana.</p>
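<p class="wp-block-paragraph">To make these three parameters concrete, here is a minimal sketch of how one extracted query could be turned into a call to the Prometheus HTTP API range-query endpoint (<code>/api/v1/query_range</code>, with its <code>query</code>, <code>start</code>, <code>end</code> and <code>step</code> parameters). It is not our JMeter test plan: the helper name, the backend address and the example query are assumptions made for illustration.</p>

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, end_ts, interval_s, step_s):
    """Build a range-query URL covering `interval_s` seconds up to `end_ts`."""
    params = {
        "query": promql,
        "start": end_ts - interval_s,  # the window opens `interval_s` seconds before the timestamp
        "end": end_ts,                 # the timestamp parameter of the test plan
        "step": step_s,                # resolution of the returned series, in seconds
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


# Hypothetical backend address and query, for illustration only.
url = build_range_query_url(
    "http://tsdb.example:9090",
    'rate(node_cpu_seconds_total{mode="idle"}[5m])',
    end_ts=1_700_000_000,
    interval_s=3600,  # a "last 1h" time frame
    step_s=15,
)
print(url)
```

<p class="wp-block-paragraph">Changing <code>interval_s</code> is what emulates picking another time frame in Grafana, while the query itself stays the same.</p>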



<p class="wp-block-paragraph">We are now able to emulate the load of a dashboard with various time frames and extract meaningful information from each run, as Apache JMeter is quite a powerful tool. It allows us to use a warm-up period to benefit from caches, or a ramp-up to study how our cluster responds when the load increases, while always loading the same data or loading data from new nodes.</p>



<p class="wp-block-paragraph">For our first bench, we decided to go with <a href="https://grafana.com/grafana/dashboards/1860-node-exporter-full" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">the most widely used node exporter dashboard</a>. We also identified widely used time frames (5m, 15m, 30m, 1h, 6h, 12h, 24h, 2d, 3d, 4d, 5d, 6d, 7d). Those are mainly the default time frames proposed by Grafana.</p>



<p class="wp-block-paragraph">With the set of tools defined above, we identified three tests we wanted to run against each one of those time frames.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Firsttest&quot;Hotandcoldstorage&quot;">First test &#8220;Hot and cold storage&#8221;</h3>



<p class="wp-block-paragraph">A lot of solutions use hot and cold storage, sometimes also called short-term and long-term storage. With this test we want to measure the performance of those various layers.</p>



<p class="wp-block-paragraph">As the purpose of this test is to check the response time of the various underlying storage layers, make sure to disable any cache that may alter the results.</p>



<p class="wp-block-paragraph">Moreover, we do not want to test the saturation of the platform so we will emulate ten clients.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Secondtest&quot;Cachingperformances&quot;">Second test &#8220;Caching performances&#8221;</h3>



<p class="wp-block-paragraph">This test is quite the opposite of the previous one. Here we want to test the response time of the TSDB in the best possible scenario (data cached).</p>



<p class="wp-block-paragraph">To get the best results from this test, we will use a warm-up period that will populate the various caches and then measure the response time of the TSDB.</p>



<p class="wp-block-paragraph">Once again, in this test, we do not want to test the saturation of the platform so we will emulate ten clients.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Thirdtest&quot;Fillingupthecache&quot;">Third test &#8220;Filling up the cache&#8221;</h3>



<p class="wp-block-paragraph">The purpose of this last bench is to test the saturation of the platform. Here we will use a ramp-up, adding more and more clients to the test over a defined period of time, and check the resulting errors and response times of the underlying platform.</p>



<p class="wp-block-paragraph">At a certain point, we should see that the platform is not able to handle any more clients. We assume this number of clients will differ with the lookup time frame.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Conclusion">Conclusion</h3>



<p class="wp-block-paragraph">The benchmark led to two expected conclusions.</p>



<ul class="wp-block-list">
<li>Some storage media are much faster than others (memory is faster than local disk, which is faster than remote object storage).</li>



<li>The use of the various caches proposed is a game changer.</li>
</ul>



<h4 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-It’stimeforasecondconclusion">It’s time for a second conclusion</h4>



<p class="wp-block-paragraph">Our approach to the benchmark is quite interesting, as it aims to emulate our production workload as precisely as possible. You may be wondering where we store this wonderful collection of tools. Well, here is the truth: maybe those tools don&#8217;t need to be shared, for several reasons:</p>



<ul class="wp-block-list">
<li>The result of the test depends heavily on the data stored inside the TSDB, which is the result of another procedure and is difficult to reproduce. That leads to a result that is subject to interpretation</li>



<li>The tooling is difficult to use and time consuming</li>



<li>Time flies: today&#8217;s truth is not tomorrow&#8217;s, and your production reality of today will probably be quite different from the one to come</li>



<li>We like to fight against the not-invented-here syndrome</li>
</ul>



<p class="wp-block-paragraph">As a consequence, we need a tool that is more convenient to use, ideally used by others, and that offers a more reproducible way to bench. We will discuss how we should have benchmarked our remote storage in the next blog post.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Prometheus&#8217; remote storage playground</title>
		<link>https://blog.ovhcloud.com/prometheus-remote-storage-playground/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Sun, 05 Mar 2023 23:49:35 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24583</guid>

					<description><![CDATA[Introduction In the previous post we discussed how important remote storage is for Prometheus. We also covered several points that deserve attention. In this post we cover remote write storage and how to bench it. Context After you have identified one (or more) remote storage that might suit your needs, you must bench it. However [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading" id="Remotestorageplayground-Introduction">Introduction</h3>



<p class="wp-block-paragraph">In the <a href="https://blog.ovhcloud.com/welcome-to-prometheus-world-of-remote-storage/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">previous post</a> we discussed how important remote storage is for Prometheus. We also covered several points that deserve attention. In this post we cover remote <strong>write</strong> storage and how to bench it.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-1024x538.jpg" alt="Prometheus' remote storage playground" class="wp-image-24835" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<h4 class="wp-block-heading" id="Remotestorageplayground-Context">Context</h4>



<p class="wp-block-paragraph">After you have identified one (or more) remote storage that might suit your needs, you must bench it. However, it is not as straightforward as it seems. Let&#8217;s review what we will need for this experiment:</p>



<ul class="wp-block-list">
<li>A (scalable) remote storage, in our case one which supports remote write</li>



<li>One or more data generators</li>
</ul>



<h3 class="wp-block-heading" id="Remotestorageplayground-IntroducingHachimon">Introducing Hachimon</h3>



<figure class="wp-block-image alignright size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1322.png" alt="Hachimon path" class="wp-image-24832" width="277" height="213" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1322.png 554w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1322-300x231.png 300w" sizes="auto, (max-width: 277px) 100vw, 277px" /></figure>



<p class="wp-block-paragraph">Benchmarking is always fun, but you know what is even more fun? Gamification! With my teammates we created a short benchmark plan which we called the <a href="https://narutofanon.fandom.com/wiki/Hachimon" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Hachimon path</a>:</p>



<ul class="wp-block-list">
<li>Gate of Opening
<ul class="wp-block-list">
<li>1k targets</li>



<li>1000 series/target</li>



<li>~ 66k datapoints/sec</li>
</ul>
</li>



<li>Gate of Healing
<ul class="wp-block-list">
<li>2k targets</li>



<li>1000 series/target</li>



<li>~133k datapoints/sec</li>
</ul>
</li>



<li>Gate of Life
<ul class="wp-block-list">
<li>4k targets</li>



<li>1000 series/target</li>



<li>~266k datapoints/sec</li>
</ul>
</li>



<li>Gate of Pain
<ul class="wp-block-list">
<li>4k targets</li>



<li>1000 series/target</li>



<li>~266k datapoints/sec after deduplication</li>



<li>dual Prometheus to increase pressure on deduplication features</li>
</ul>
</li>



<li>Gate of Limit
<ul class="wp-block-list">
<li>4k targets</li>



<li>2500 series/target to increase pressure on storage</li>



<li>~660k datapoints/sec</li>



<li>dual Prometheus</li>
</ul>
</li>



<li>Gate of View
<ul class="wp-block-list">
<li>8k targets</li>



<li>2500 series/target</li>



<li>~1.3M datapoints/sec</li>



<li>dual Prometheus</li>
</ul>
</li>



<li>Gate of Wonder
<ul class="wp-block-list">
<li>10k targets</li>



<li>2500 series/target</li>



<li>~1.6M datapoints/sec</li>



<li>dual Prometheus</li>
</ul>
</li>



<li>Gate of Death
<ul class="wp-block-list">
<li>Add as many targets as you can until the backend is almost on fire</li>
</ul>
</li>
</ul>
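<p class="wp-block-paragraph">The datapoints/sec figures of each gate follow from the number of targets and the number of series per target. The short sketch below reproduces them, assuming a 15 second scrape interval. That interval is our assumption for this example: it is not stated in the plan, but it is consistent with every figure listed. The rates are counted once per series, that is after deduplication for the gates that use dual Prometheus.</p>

```python
SCRAPE_INTERVAL_S = 15  # assumption: not stated in the plan, but consistent with its figures

# (gate name, number of targets, series per target), taken from the list above
GATES = [
    ("Opening", 1_000, 1_000),
    ("Healing", 2_000, 1_000),
    ("Life", 4_000, 1_000),
    ("Pain", 4_000, 1_000),
    ("Limit", 4_000, 2_500),
    ("View", 8_000, 2_500),
    ("Wonder", 10_000, 2_500),
]


def datapoints_per_second(targets, series_per_target, scrape_interval_s=SCRAPE_INTERVAL_S):
    """Each series yields one datapoint per scrape, so rate = active series / interval."""
    return targets * series_per_target / scrape_interval_s


for name, targets, series in GATES:
    rate = datapoints_per_second(targets, series)
    print(f"Gate of {name}: ~{int(rate) // 1000}k datapoints/sec")
```

<p class="wp-block-paragraph">With dual Prometheus, the backend receives roughly twice that rate on the wire before deduplication, which is precisely the extra pressure those gates aim for.</p>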



<p class="wp-block-paragraph">To walk the Hachimon path we&#8217;ve built an infrastructure where only the central piece, the remote storage, changes. Doing so helps us compare results.</p>



<p class="wp-block-paragraph">The write path is stressed by one or more Prometheus clusters, which scrape the same <a href="https://github.com/prometheus/node_exporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">node_exporter</a> many times under different sets of labels. Doing so allows us to emulate an infrastructure bigger than it really is. To change the cardinality, we can tweak the node_exporter configuration to expose more or fewer series. By deploying one or more Prometheus clusters we can both stress the deduplication feature of the backend and work around the hardware limitations of a single Prometheus.</p>
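<p class="wp-block-paragraph">One way to scrape the same exporter many times is to list it repeatedly in <code>static_configs</code>, each time with a different label value. The following is a minimal sketch of a generator for such a configuration, not the exact setup we used: the <code>fake_host</code> label, the address and the scrape interval are hypothetical.</p>

```python
import json


def fake_targets(address, count, label="fake_host"):
    """Build `count` static_configs entries that all point at the same exporter.

    `label` and its values are hypothetical; any label that differs per entry
    makes Prometheus treat each entry as a distinct target.
    """
    return [
        {"targets": [address], "labels": {label: f"host-{i:05d}"}}
        for i in range(count)
    ]


scrape_config = {
    "job_name": "node",
    "scrape_interval": "15s",  # assumption, not stated in the post
    "static_configs": fake_targets("10.0.0.1:9100", 1000),
}

# JSON is printed here for brevity; a real prometheus.yml would carry the same
# structure as YAML under `scrape_configs`.
print(json.dumps({"scrape_configs": [scrape_config]}, indent=2)[:400])
```

<p class="wp-block-paragraph">Raising <code>count</code> is then enough to move from one gate to the next without adding exporter instances.</p>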



<p class="wp-block-paragraph">This approach is very similar to the one used by <a href="https://valyala.medium.com/promscale-vs-victoriametrics-resource-usage-on-production-workload-91c8e3786c03" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">VictoriaMetrics</a>, which inspired us. Kudos!</p>



<p class="wp-block-paragraph">By the time we reached the end of our tests, the infrastructure we had built looked like the following:</p>



<figure class="wp-block-image"><img decoding="async" src="https://raw.githubusercontent.com/wilfriedroset/remote-storage-wars/master/assets/generic-infrastructure.png" alt=""/></figure>



<p class="wp-block-paragraph">This is the infrastructure we used to benchmark both the read and the write paths of the remote storages. There is load balancing on both sides, and multiple pairs of Prometheus put more or less pressure on the write path and on deduplication. Finally, the data comes from small instances exposing node_exporter metrics.</p>
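<p class="wp-block-paragraph">To illustrate what deduplication means for a pair of Prometheus servers scraping the same targets, here is a deliberately simplified sketch. It elects the first replica seen and drops samples from the other one. The <code>replica</code> label name and the sample layout are assumptions made for the example, and real remote storages each implement this differently.</p>

```python
def deduplicate(samples, replica_label="replica"):
    """Keep samples from the first replica seen, drop the others.

    Each sample is (labels: dict, timestamp: int, value: float). The replica
    label is removed so that both replicas map to the same series.
    """
    elected = None
    kept = []
    for labels, timestamp, value in samples:
        replica = labels.get(replica_label)
        if elected is None:
            elected = replica
        if replica != elected:
            continue  # duplicate sent by the non-elected replica
        series = {k: v for k, v in labels.items() if k != replica_label}
        kept.append((series, timestamp, value))
    return kept


incoming = [
    ({"__name__": "node_load1", "instance": "a", "replica": "prom-0"}, 1000, 0.5),
    ({"__name__": "node_load1", "instance": "a", "replica": "prom-1"}, 1002, 0.5),
    ({"__name__": "node_load1", "instance": "a", "replica": "prom-0"}, 16000, 0.7),
    ({"__name__": "node_load1", "instance": "a", "replica": "prom-1"}, 16002, 0.7),
]
print(len(incoming), "received,", len(deduplicate(incoming)), "stored")
```

<p class="wp-block-paragraph">The backend still has to receive and inspect every sample, which is the extra pressure the dual Prometheus gates are meant to create.</p>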



<h3 class="wp-block-heading" id="Remotestorageplayground-Expectation">Expectation</h3>



<p class="wp-block-paragraph">Thanks to this benchmarking plan we have been able to differentiate the remote storages from a performance perspective. We&#8217;ve been able to get a first understanding of how each remote storage works, how to tune it, and what you can and cannot do with it. It seems to us that ease of operation matters as much as good performance. But most importantly, we learnt a lot of things while having fun.</p>



<h3 class="wp-block-heading" id="Remotestorageplayground-Conclusion">Conclusion</h3>



<p class="wp-block-paragraph">This benchmarking plan is obviously flawed in many ways:</p>



<ul class="wp-block-list">
<li>it&#8217;s expensive, as you need to spawn more resources than necessary to assess a particular point of your remote storage.</li>



<li>it&#8217;s hard to reproduce exactly the same setup: even with the same configuration and software versions you will get similar results, but not identical ones.</li>



<li>you&#8217;re not always benchmarking what you think you are. We spent some time troubleshooting performance issues that turned out to be in <a href="https://github.com/prometheus/prometheus/issues/9807" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> or in the HAProxy configuration.</li>



<li>it focuses mainly on the write path without any stress from the read path, which is not realistic.</li>
</ul>



<p class="wp-block-paragraph">The next two posts of this series continue to focus on benchmarking. The first one focuses on read performance.</p>



<p class="wp-block-paragraph">The second one focuses on how we should have benchmarked our solution from the beginning.</p>



<p class="wp-block-paragraph">Stay tuned</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Welcome to Prometheus world of remote storage</title>
		<link>https://blog.ovhcloud.com/welcome-to-prometheus-world-of-remote-storage/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Thu, 16 Feb 2023 16:29:25 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24579</guid>

					<description><![CDATA[At OVHcloud, we recently made a change to our internal Observability stack. After testing and comparing the different solutions on the market, we opted for an open source solution. With this blog post, we&#8217;re starting a series of articles to provide feedback on our selection process and what we&#8217;ve learned along the way. Our mission [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>At OVHcloud, we recently made a change to our internal Observability stack. After testing and comparing the different solutions on the market, we opted for an open source solution. With this blog post, we&#8217;re starting a series of articles to provide feedback on our selection process and what we&#8217;ve learned along the way. Our mission was to find a horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. We begin this series with an introduction to Prometheus remote storage…</em></p>



<p class="wp-block-paragraph">Over the last decade Prometheus has become one of the standards for Observability. Its core concept is well suited to today&#8217;s technological use cases, and it makes sense that the open source community loves it. While Prometheus does a lot of things really well, when it comes to long-term storage users must find a solution. This blog post series discusses Prometheus&#8217;s remote storages, the technical challenges they aim to solve and, more importantly, how to pick the right one for <strong>you</strong>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-1024x542.jpg" alt="Prometheus love remote storage" class="wp-image-24617" width="640" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-1024x542.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-300x159.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-768x407.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2.jpg 1194w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">What is a remote storage?</h2>



<p class="wp-block-paragraph">Prometheus can be configured to read from or write to a remote storage on top of its local storage. This allows it to support long-term storage of users&#8217; data. The two features are called&nbsp;<a href="https://prometheus.io/docs/operating/configuration/#remote_read" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">remote_read</a> and <a href="https://prometheus.io/docs/operating/configuration/#remote_write" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">remote_write</a>.</p>



<p class="wp-block-paragraph">With remote_read configured, Prometheus will answer read queries with data from the remote storage. The remote_write is responsible for shipping samples to the remote storage. Both of them are extremely useful and highly configurable. For the rest of this blog post let&#8217;s focus on remote write.</p>
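<p class="wp-block-paragraph">As an illustration, the two sections of a Prometheus configuration could look like the structure below, written here as a Python dictionary that mirrors the YAML layout. The endpoint URLs are hypothetical and the queue values are placeholders rather than recommendations. The field names come from the Prometheus configuration reference.</p>

```python
import json

# Hypothetical endpoints; replace with the URLs exposed by your remote storage.
REMOTE_WRITE_URL = "https://metrics.example.com/api/v1/write"
REMOTE_READ_URL = "https://metrics.example.com/api/v1/read"

config = {
    "remote_write": [
        {
            "url": REMOTE_WRITE_URL,
            # Queue tuning knobs from the Prometheus configuration reference;
            # the values are placeholders, not recommendations.
            "queue_config": {
                "capacity": 10000,
                "max_shards": 50,
                "max_samples_per_send": 2000,
            },
        }
    ],
    "remote_read": [{"url": REMOTE_READ_URL, "read_recent": False}],
}

print(json.dumps(config, indent=2))
```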



<p class="wp-block-paragraph">Whether you are a cloud provider or building an in-house Observability platform, it is not always appropriate or possible to connect to your customers&#8217; infrastructure to extract data.</p>



<p class="wp-block-paragraph">With a remote write approach customers keep strict control over what comes in and out of their infrastructure. We could argue that iptables coupled with authentication is secure enough, but this is still one more door to keep an eye on. With tight security taken into account, we understand that remote write makes a lot of sense from a service provider&#8217;s point of view.</p>



<p class="wp-block-paragraph">Now that we know that we want a remote write compatible storage, we must take into account that not all remote storages are equal. The list of solutions keeps growing every day; let&#8217;s see if we can differentiate them.</p>



<p class="wp-block-paragraph">When we write metrics to a remote storage, it is because we want to read them back later. Most Observability use cases imply writing down tons of data that will be queried afterwards. PromQL is the query language used to query Prometheus and therefore the associated remote storage. It would make sense to check how PromQL compliant the solutions are. Fear not, the Prometheus community is already tackling this question for us with <a href="https://promlabs.com/promql-compliance-tests/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">PromQL Compliance</a>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="621" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-1024x621.png" alt="" class="wp-image-24580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-1024x621.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-300x182.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-768x466.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-1536x932.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-2048x1243.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">PromQL Compliance results as of 2021-10-14</figcaption></figure>



<p class="wp-block-paragraph">As you can see, most remote storages are 100% compliant with Prometheus results. Good news. This means users have a plethora of options.</p>



<p class="wp-block-paragraph">However, readers must not underestimate this point. Indeed, compliance impacts what you can query from the backend, how you can query it and the accuracy of the result. It might not be trivial to reach full compliance and to stay compliant. Maintainers might also choose not to be compliant and explain <a href="https://medium.com/@romanhavronenko/victoriametrics-promql-compliance-d4318203f51e" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">why</a>.</p>
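<p class="wp-block-paragraph">Conceptually, a compliance check runs the same queries against a reference Prometheus and against the candidate backend, then compares the results. The sketch below shows that idea on hand-written results. It is not the logic of the PromQL Compliance tool itself, and the result layout, the tolerance and the sample values are assumptions made for illustration.</p>

```python
import math


def same_result(reference, candidate, rel_tol=1e-9):
    """Compare two instant-vector results given as {series_key: value}."""
    if reference.keys() != candidate.keys():
        return False  # a series is missing or unexpected
    return all(
        math.isclose(reference[key], candidate[key], rel_tol=rel_tol)
        for key in reference
    )


def compliance_score(cases):
    """cases: list of (query, reference_result, candidate_result)."""
    passed = sum(same_result(ref, cand) for _, ref, cand in cases)
    return passed / len(cases)


cases = [
    ("up", {"instance=a": 1.0}, {"instance=a": 1.0}),
    ("rate(http_requests_total[5m])", {"instance=a": 2.5}, {"instance=a": 2.4}),
]
print(f"{compliance_score(cases):.0%} of queries match the reference")
```

<p class="wp-block-paragraph">A backend can fail such a check either because a series is missing or because a value differs, which is why compliance affects both what you can query and how accurate the answer is.</p>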



<p class="wp-block-paragraph">The Prometheus world is growing in adoption and is under active development. If a solution is compatible today, there is no guarantee it&#8217;ll stay compatible tomorrow.</p>



<p class="wp-block-paragraph">Which brings us to the second point, the community. How healthy, large and active is the community behind each software?<br>Is it easy to contact them? Discuss issues? Propose features and PRs? We tend to take for granted the fact that PRs will be reviewed and that we&#8217;ll find someone to help us troubleshoot a bug, but this is not necessarily the case.</p>



<h3 class="wp-block-heading">Features set</h3>



<p class="wp-block-paragraph">To better address the technical challenges that are your own, you must pick the solution that has the features you need. If you need multi tenancy, check that point. If you need to downsample your data, add this to your checklist. Don&#8217;t be shy, dig a little deeper. Test the feature, look for its limitations. Tests are the only way to make an informed decision.</p>



<p class="wp-block-paragraph">To give you an idea you might want to have a look at the following features:   </p>



<ul class="wp-block-list">
<li>multi tenancy</li>



<li>rate limiting</li>



<li>deduplication</li>



<li>deletion</li>



<li>downsampling</li>
</ul>



<h3 class="wp-block-heading">Scalability</h3>



<p class="wp-block-paragraph">Nowadays the word scalability is present almost everywhere. How well does each remote storage scale? Can you write 2M samples/sec? Can you answer 1M queries/sec? Can you have 200M active series in total? <a href="https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">1B active series</a>? Per tenant?</p>



<p class="wp-block-paragraph">You can get a rough understanding of the bottlenecks by looking at the architecture diagram. But to have a crystal clear answer there is only one way: you need to build a proof of concept.</p>



<p class="wp-block-paragraph">By the way, you can even try a remote storage right now on our <a href="https://www.ovhcloud.com/en/public-cloud/kubernetes" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">managed k8s</a>. Most of the open source remote storages offer Helm charts or operators to do so: <a href="https://github.com/VictoriaMetrics/helm-charts" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">VictoriaMetrics</a>, <a href="https://github.com/timescale/helm-charts" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Timescale</a>, <a href="https://github.com/grafana/mimir/tree/main/operations/helm/charts/mimir-distributed" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Mimir</a>.</p>



<h3 class="wp-block-heading">Cost</h3>



<p class="wp-block-paragraph">Along with scalability comes <em>TCO</em>, which stands for <em>Total Cost of Ownership</em>. This boils down to how expensive a solution or an infrastructure can be when you take all costs into account. For remote storage, on top of the team operating the infrastructure we must take into account the aforementioned infrastructure. All technical solutions rely on 4 categories: trained engineers, compute resources, network and&#8230; storage. It is therefore critical to take into account all aspects of the target solution. Otherwise be ready for a surprise at the end of the month.</p>
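<p class="wp-block-paragraph">For the storage part, a first estimate can be derived from the sizing formula given in the Prometheus storage documentation: retention time multiplied by ingested samples per second and by bytes per sample. The sketch below applies it with assumed inputs: 2 bytes per sample is the upper end of what Prometheus documents for its own local storage, and a given remote storage may behave differently. The one-year retention is hypothetical.</p>

```python
def disk_bytes(samples_per_sec, retention_days, bytes_per_sample=2.0):
    """needed_disk_space = retention_seconds * samples_per_sec * bytes_per_sample."""
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_sec * bytes_per_sample


# 2M samples/sec as in the scalability questions, with a hypothetical
# retention of one year.
needed = disk_bytes(samples_per_sec=2_000_000, retention_days=365)
print(f"{needed / 1e12:.1f} TB before replication")
```

<p class="wp-block-paragraph">Replication, indexes and compaction overhead come on top of this figure, so it should be read as a lower bound to refine during a proof of concept.</p>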



<h2 class="wp-block-heading">Conclusion</h2>



<p class="wp-block-paragraph">As we have demonstrated, there are a lot of technical solutions to address long-term storage. However, before putting one solution in production we need to thoroughly identify and assess all trade-offs. In the next posts we will have a look at how to get to know your remote storage, bench it, break it.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
