<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>vLLM Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/vllm/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/vllm/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 10 Apr 2026 09:23:41 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>vLLM Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/vllm/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Reference Architecture: Deploying a vision-language model with vLLM on OVHcloud MKS for high performance inference and full observability</title>
		<link>https://blog.ovhcloud.com/reference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 10 Apr 2026 07:48:53 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30455</guid>

					<description><![CDATA[Ensure complete&#160;digital sovereignty&#160;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&#160;Managed Kubernetes Service. This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&#160;OVHcloud Managed Kubernetes Service&#160;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&#160;Qwen3-VL-8B-Instruct&#160;multimodal model (vision + text) with OpenAI-compatible API endpoints. This comprehensive [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability%2F&amp;action_name=Reference%20Architecture%3A%20Deploying%20a%20vision-language%20model%20with%20vLLM%20on%20OVHcloud%20MKS%20for%20high%20performance%20inference%20and%20full%20observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em><em>Ensure complete&nbsp;<strong>digital sovereignty</strong>&nbsp;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&nbsp;<strong>Managed Kubernetes Service</strong>.</em></em></p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" width="703" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg" alt="vLLM on OVHcloud MKS for high availability and full observability" class="wp-image-31153" style="width:710px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg 703w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-206x300.jpg 206w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-768x1118.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-1055x1536.jpg 1055w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm.jpg 1260w" sizes="(max-width: 703px) 100vw, 703px" /><figcaption class="wp-element-caption"><em><em>vLLM on OVHcloud MKS for high availability and full observability</em></em></figcaption></figure>



<p class="wp-block-paragraph">This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&nbsp;<a href="https://www.ovhcloud.com/fr/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Kubernetes Service</a>&nbsp;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen3-VL-8B-Instruct</a>&nbsp;multimodal model (vision + text) with OpenAI-compatible API endpoints.</p>



<p class="wp-block-paragraph">This comprehensive guide shows you how to deploy, to scale automatically, and how to monitor vLLM-based LLM workloads on the OVHcloud infrastructure.</p>



<p class="wp-block-paragraph"><strong>What are the key benefits?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-effectiveness:</strong>&nbsp;Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability:</strong>&nbsp;Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure:</strong>&nbsp;Keep all metrics and data within European datacentres</li>



<li><strong>Scalable by design:</strong>&nbsp;Automatically scale GPU inference replicas based on real workload demand</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p class="wp-block-paragraph"><strong>OVHcloud MKS</strong>&nbsp;is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p class="wp-block-paragraph"><strong>How does this benefit you?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Pay only for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, provided upgrades and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking, and persistent storage</li>



<li><strong>Scalable and flexible</strong>: Scale workloads easily, node pools to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enable seamless integration with open-source ecosystems and avoid vendor lock-in</li>
</ul>



<p class="wp-block-paragraph">In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Architecture overview</h2>



<p class="wp-block-paragraph">This reference architecture demonstrates a basic deployment of vLLM for vision-language model inference on OVHcloud Managed Kubernetes Service, featuring:</p>



<ul class="wp-block-list">
<li><strong>High-availability deployment</strong>&nbsp;with 2 GPU nodes (NVIDIA L40S)</li>



<li><strong>Optimised GPU utilisation</strong>&nbsp;with proper driver configuration</li>



<li><strong>Scalable infrastructure</strong>&nbsp;supporting vision-language models</li>



<li><strong>Comprehensive monitoring</strong>&nbsp;using Prometheus, Grafana, and DCGM</li>



<li><strong>Full observability</strong>&nbsp;for both application and hardware metrics</li>
</ul>



<p class="wp-block-paragraph"><strong>Data flow</strong>:</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg" alt="" class="wp-image-30985" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1536x806.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-2048x1075.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Data Flow</em></figcaption></figure>



<ol class="wp-block-list">
<li><strong>Inference request:</strong>
<ul class="wp-block-list">
<li>User → LoadBalancer → Gateway → NGINX Ingress → &#8220;Qwen3 VL&#8221; Service → vLLM Pod → GPU</li>



<li>Response follows reverse path with streaming support</li>
</ul>
</li>



<li><strong>Metrics collection:</strong>
<ul class="wp-block-list">
<li>vLLM Pods expose <code>/metrics</code> endpoint (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">8000</mark></strong></code>)</li>



<li>DCGM Exporters expose GPU metrics (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">9400</mark></strong></code>)</li>



<li>Prometheus scrapes both endpoints every 30 seconds</li>



<li>Grafana queries Prometheus for visualization</li>
</ul>
</li>



<li><strong>Load distribution</strong>
<ul class="wp-block-list">
<li>NGINX Ingress uses cookie-based session affinity</li>



<li>vLLM Service uses ClientIP session affinity</li>



<li>Anti-affinity ensures 1 pod per GPU node</li>
</ul>
</li>
</ol>



<h2 class="wp-block-heading">Prerequisites</h2>



<p class="wp-block-paragraph">Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">&nbsp;</a><strong><code>Administrator</code></strong>&nbsp;role</li>



<li><strong>Hugging Face access</strong>&nbsp;–&nbsp;<em>create a&nbsp;<a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></em></li>



<li><code><strong>kubectl</strong></code>&nbsp;already installed and&nbsp;<code><strong>helm</strong></code>&nbsp;installed (at least version 3.x)</li>
</ul>



<p class="wp-block-paragraph"><strong>🚀 Now you have all the ingredients, it’s time to deploy the recipe for&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen/Qwen3-VL-8B-Instruct</a>&nbsp;using vLLM and MKS!</strong></p>



<h2 class="wp-block-heading">Architecture guide: Native GPU deployment of vLLM on MKS with full stack observability</h2>



<p class="wp-block-paragraph">This reference architecture describes a<strong>&nbsp;Large Language Model</strong>&nbsp;deployment using&nbsp;<strong>vLLM inference server&nbsp;</strong>and&nbsp;<strong>Kubernetes</strong>, to enjoy the&nbsp;benefits of a service that&#8217;s both highly available and monitorable in real time.</p>



<h3 class="wp-block-heading">Step 1 &#8211; Create MKS cluster and Node pools</h3>



<p class="wp-block-paragraph">From&nbsp;<a href="https://www.ovh.com/manager/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">OVHcloud Control Panel</a>, create a Kubernetes cluster using the&nbsp;<strong>MKS</strong>. </p>



<p class="wp-block-paragraph">Navigate to: <code>Public Cloud</code> → <code>Managed Kubernetes Service</code> → <code>Create a cluster</code></p>



<h4 class="wp-block-heading">1. Configure cluster</h4>



<p class="wp-block-paragraph">Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Name:</strong> <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code></li>



<li><strong>Location</strong>: 1-AZ Region &#8211; Gravelines (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">GRA11</mark></strong></code>)</li>



<li><strong>Plan:</strong> Free (or Standard)</li>



<li><strong>Network</strong>: attach a <strong>Private network </strong>(e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">0000 - AI Private Network</mark></strong></code>)</li>



<li><strong>Version:</strong> Latest stable (e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">1.34</mark></strong></code>)</li>
</ul>



<h4 class="wp-block-heading">2. Create GPU Node pool</h4>



<p class="wp-block-paragraph">During the cluster creation, configure the vLLM Node pool for GPUs:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></code></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">L40S-90</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">2</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<p class="wp-block-paragraph"><strong>Why L40S-90?</strong></p>



<ul class="wp-block-list">
<li>Cost-effective for single-model deployment (1 GPU per node)</li>



<li>Sufficient RAM (90GB) for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Qwen3-VL-8B</mark></code></strong> model</li>
</ul>



<p class="wp-block-paragraph">You should see your cluster (e.g.&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code>) in the list, along with the following information:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="930" height="588" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png" alt="" class="wp-image-30745" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png 930w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-300x190.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-768x486.png 768w" sizes="(max-width: 930px) 100vw, 930px" /></figure>



<p class="wp-block-paragraph">You can now set up the node pool dedicated to monitoring.</p>



<h4 class="wp-block-heading">3. Create CPU Node pool</h4>



<p class="wp-block-paragraph">From your cluster, click on <code><strong><mark class="has-inline-color has-ast-global-color-0-color">Add a node pool</mark></strong></code> and configure it as follow:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">B2-15</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">1</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">✅ <strong>Note</strong></p>



<p class="wp-block-paragraph"><strong><em>Monitoring stack can run on GPU nodes if cost is a concern. Dedicated CPU node provides better isolation and resource management.</em></strong></p>
</blockquote>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="365" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png" alt="" class="wp-image-30743" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-300x107.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-768x274.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation.png 1283w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">If the status is green with the&nbsp;<strong><code><mark class="has-inline-color has-ast-global-color-0-color">OK</mark></code></strong>&nbsp;label, you can proceed to the next step.</p>



<h4 class="wp-block-heading">4. Configure Kubernetes access</h4>



<p class="wp-block-paragraph">Once your nodes have been provisioned, you can download the <strong>Kubeconfig file</strong> and configure kubectl with your MKS cluster.</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p class="wp-block-paragraph">Returning:</p>



<p class="wp-block-paragraph"><code>NAME &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; STATUS &nbsp; ROLES&nbsp; &nbsp; AGE &nbsp; VERSION<br>monitoring-node-xxxxxx &nbsp; Ready&nbsp; &nbsp; &lt;none&gt; &nbsp; 1d &nbsp; v1.34.2<br>vllm-node-yyyyyy &nbsp; &nbsp; &nbsp; &nbsp; Ready&nbsp; &nbsp; &lt;none&gt; &nbsp; 1d &nbsp; v1.34.2<br>vllm-node-zzzzzz &nbsp; &nbsp; &nbsp; &nbsp; Ready&nbsp; &nbsp; &lt;none&gt; &nbsp; 1d &nbsp; v1.34.2</code></p>



<p class="wp-block-paragraph">Before going further, add a label to the CPU node for monitoring workloads.</p>



<pre class="wp-block-code"><code class="">CPU_NODE=$(kubectl get nodes -o json | \<br>  jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" == null) | .metadata.name')<br>kubectl label node $CPU_NODE node-role=monitoring</code></pre>



<p class="wp-block-paragraph">Finally, check with the following command:</p>



<pre class="wp-block-code"><code class="">NAME                     GPU      ROLE<br>monitoring-node-xxxxxx   &lt;none&gt;   monitoring<br>vllm-node-yyyyyy         1        &lt;none&gt;<br>vllm-node-zzzzzz         1        &lt;none&gt;</code></pre>



<p class="wp-block-paragraph">Once both nodes are in <strong>Ready</strong> status, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 2 &#8211; Install GPU operator</h3>



<p class="wp-block-paragraph">To start, consider setting up the GPU operator.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>✅ Note</strong></p>



<p class="wp-block-paragraph"><em><strong>This step is based on this OVHcloud documentation: <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-kubernetes-deploy-gpu-application?id=kb_article_view&amp;sysparm_article=KB0049707" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deploying a GPU application on OVHcloud Managed Kubernetes Service</a></strong></em></p>
</blockquote>



<h4 class="wp-block-heading">1. Add NVIDIA helm repository and create namespace</h4>



<p class="wp-block-paragraph">Add NVIDIA helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add nvidia https://helm.ngc.nvidia.com/nvidia<br>helm repo update</code></pre>



<p class="wp-block-paragraph">And create Namespace as follow.</p>



<pre class="wp-block-code"><code class="">kubectl create namespace gpu-operator</code></pre>



<h4 class="wp-block-heading">2. Install GPU operator with correct configuration</h4>



<p class="wp-block-paragraph">The GPU Operator must be configured with specific driver versions to ensure compatibility with vLLM containers.</p>



<p class="wp-block-paragraph">However, the default installation uses recent drivers (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">580.x</mark></strong></code> with <strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 13.x</mark></code></strong>) which are incompatible with vLLM containers (<strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.x</mark></code></strong>).</p>



<p class="wp-block-paragraph"><strong>Solution:</strong> Force driver version <strong><code><mark class="has-inline-color has-ast-global-color-0-color">535.183.01</mark></code></strong> (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.2</mark></strong></code>).</p>



<pre class="wp-block-code"><code class="">helm install gpu-operator nvidia/gpu-operator \<br>  -n gpu-operator \<br>  --set driver.enabled=true \<br>  --set driver.version="535.183.01" \<br>  --set toolkit.enabled=true \<br>  --set operator.defaultRuntime=containerd \<br>  --set devicePlugin.enabled=true \<br>  --set dcgmExporter.enabled=true \<br>  --set dcgmExporter.image="dcgm-exporter" \<br>  --set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04" \<br>  --set gfd.enabled=true \<br>  --set migManager.enabled=false \<br>  --set nodeStatusExporter.enabled=true \<br>  --set validator.driver.enable=false \<br>  --set validator.toolkit.enable=false \<br>  --set validator.plugin.enable=false \<br>  --timeout 20m</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">✅ <strong>Note </strong></p>



<p class="wp-block-paragraph"><em><strong>Specifying the DCGM version may only be necessary if you encounter problems with the default image (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color">‘ImagePullBackOff’</mark></code>). If this is the case, add the following parameters:<br><code><mark class="has-inline-color has-ast-global-color-0-color">--set dcgmExporter.repository="nvcr.io/nvidia/k8s"<br>--set dcgmExporter.image="dcgm-exporter"<br>--set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04"</mark></code></strong></em></p>
</blockquote>



<pre class="wp-block-code"><code class="">kubectl get pods -n gpu-operator</code></pre>



<p class="wp-block-paragraph">Note that all pods should reach <strong>Running</strong> state in 5-10 minutes.</p>



<p class="wp-block-paragraph">You can also check the GPU availability:</p>



<pre class="wp-block-code"><code class="">kubectl get nodes -o json | jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | "\(.metadata.name): \(.status.allocatable."nvidia.com/gpu") GPU(s)"'</code></pre>



<p class="wp-block-paragraph">Returning:</p>



<p class="wp-block-paragraph"><code>vllm-node-<code>yyyyyy</code>: 1 GPU(s)<br>vllm-node-zzzzzz: 1 GPU(s)</code></p>



<p class="wp-block-paragraph">And you can test to run <code><strong><mark class="has-inline-color has-ast-global-color-0-color">nvidia-smi</mark></strong></code>:</p>



<pre class="wp-block-code"><code class="">DRIVER_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -1)<br>kubectl exec -n gpu-operator $DRIVER_POD -- nvidia-smi</code></pre>



<p class="wp-block-paragraph">If GPU tests are working properly, you can move on DCGM service configuration.</p>



<h4 class="wp-block-heading">3. Configure DCGM service</h4>



<p class="wp-block-paragraph"><strong>Why is DCGM Exporter required?</strong></p>



<p class="wp-block-paragraph">DCGM (Data Centre GPU Manager) is NVIDIA&#8217;s official tool for monitoring GPUs in production. The goal is to be able to collect and display metrics from both GPU nodes.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="571" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg" alt="" class="wp-image-30746" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-300x167.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-768x428.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1536x856.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1.jpg 1733w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>GPU monitoring with DCGM</em></figcaption></figure>



<p class="wp-block-paragraph">The metrics provided are:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_UTIL</mark></strong></code> &#8211; GPU utilisation (%)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_TEMP</mark></code></strong> &#8211; GPU temperature (°C)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_USED</mark></code></strong> &#8211; VRAM used (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_FREE</mark></code></strong> &#8211; Free VRAM (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_POWER_USAGE</mark></code></strong> &#8211; Power consumption (W)</li>



<li>And 50+ other GPU metrics</li>
</ul>



<p class="wp-block-paragraph">Next, ensure DCGM service has the correct labels and port configuration:</p>



<pre class="wp-block-code"><code class="">kubectl patch svc nvidia-dcgm-exporter -n gpu-operator --type merge -p '{<br>  "metadata": {<br>    "labels": {<br>      "app": "nvidia-dcgm-exporter"<br>    }<br>  },<br>  "spec": {<br>    "ports": [<br>      {<br>        "name": "metrics",<br>        "port": 9400,<br>        "targetPort": 9400,<br>        "protocol": "TCP"<br>      }<br>    ]<br>  }<br>}'</code></pre>



<p class="wp-block-paragraph">Verify the endpoints (should show 2 IPs, one per GPU node).</p>



<pre class="wp-block-code"><code class="">kubectl get endpoints nvidia-dcgm-exporter -n gpu-operator</code></pre>



<p class="wp-block-paragraph"><code>NAME &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ENDPOINTS &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; AGE<br>nvidia-dcgm-exporter &nbsp; x.x.x.x:9400,x.x.x.x:9400 &nbsp; 17d</code></p>



<h3 class="wp-block-heading">Step 3 &#8211; Deploy Qwen3 VL 8B with vLLM inference server</h3>



<p class="wp-block-paragraph">The deployment of the <strong>Qwen 3 VL 8B</strong> model on two L40S GPU nodes is carried out in several stages.</p>



<h4 class="wp-block-heading">1. Create namespace and Hugging Face secret</h4>



<p class="wp-block-paragraph">Start by creating Namespace:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace vllm</code></pre>



<p class="wp-block-paragraph">Next, you must retrieve your Hugging Face token and replace the&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">HF_TOKEN</mark></strong></code>&nbsp;value by your own:</p>



<pre class="wp-block-code"><code class="">export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"</code></pre>



<p class="wp-block-paragraph">Create your secret as follow:</p>



<pre class="wp-block-code"><code class="">kubectl create secret generic huggingface-secret \<br>  --from-literal=token=$HF_TOKEN \<br>  --namespace=vllm</code></pre>



<p class="wp-block-paragraph">Verify you obtain the following output by launching:</p>



<pre class="wp-block-code"><code class="">kubectl get secret huggingface-secret -n vllm</code></pre>



<p class="wp-block-paragraph"><code>NAME &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; TYPE &nbsp; &nbsp; DATA &nbsp; AGE<br>huggingface-secret &nbsp; Opaque &nbsp; 1&nbsp; &nbsp; &nbsp; 14d</code></p>



<h4 class="wp-block-heading">2. Create vLLM deployment configuration</h4>



<p class="wp-block-paragraph">First, you can create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-deployment-2nodes.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-deployment-2nodes.yaml</a></strong></code> file.</p>



<p class="wp-block-paragraph">Deploy vLLM:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-deployment-2nodes.yaml</code></pre>



<p class="wp-block-paragraph">You can monitor the deployment (it should take 8-10 minutes for model download and loading).</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n vllm -o wide -w</code></pre>



<p class="wp-block-paragraph">Expected output after 10 minutes:</p>



<pre class="wp-block-code"><code class="">NAME               READY  STATUS   RESTARTS  AGE  IP       NODE  <br>qwen3-vl-xxxx-yyy  1/1    Running  0         1d   X.X.X.X  vllm-node-yyyyyy<br>qwen3-vl-xxxx-zzz  1/1    Running  0         1d   X.X.X.X  vllm-node-zzzzzz</code></pre>



<p class="wp-block-paragraph">You can also check the container logs:</p>



<pre class="wp-block-code"><code class="">kubectl logs -f -n vllm &lt;pod-name&gt;</code></pre>



<p class="wp-block-paragraph">You should find in the logs: &#8220;<code>Uvicorn running on http://0.0.0.0:8000</code>&#8220;</p>



<p class="wp-block-paragraph">Is everything installed correctly? Then let&#8217;s move on to the next step.</p>



<h4 class="wp-block-heading">3. Add service label</h4>



<p class="wp-block-paragraph">Ensure service has the correct label for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">ServiceMonitor</mark></code></strong> discovery.</p>



<pre class="wp-block-code"><code class="">kubectl label svc qwen3-vl-service -n vllm app=qwen3-vl --overwrite</code></pre>



<p class="wp-block-paragraph">You can now verify by launching the following command.</p>



<pre class="wp-block-code"><code class="">kubectl get svc qwen3-vl-service -n vllm --show-labels | grep "app=qwen3-vl"</code></pre>



<p class="wp-block-paragraph">Returning:</p>



<p class="wp-block-paragraph"><code>qwen3-vl-service&nbsp; ClusterIP&nbsp; X.X.X.X &nbsp;&lt;none&gt;  8000/TCP  1d &nbsp;app=qwen3-vl</code></p>



<h3 class="wp-block-heading">Step 4 &#8211; Install NGINX ingress controller</h3>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><mark style="color:#cf2e2e" class="has-inline-color">⚠️ <strong>Moving beyond Ingress</strong></mark></p>



<p class="wp-block-paragraph"><strong><em><mark style="color:#cf2e2e" class="has-inline-color">Follow this <a href="https://blog.ovhcloud.com/moving-beyond-ingress-why-should-ovhcloud-managed-kubernetes-service-mks-users-start-looking-at-the-gateway-api/" data-wpel-link="internal">tutorial</a> if you want to use Gateway instead of Ingress.</mark></em></strong></p>
</blockquote>



<h4 class="wp-block-heading">1. Add helm repository and configure Ingress</h4>



<p class="wp-block-paragraph">First of all, add helm repository:</p>



<pre class="wp-block-code"><code class="">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx<br>helm repo update</code></pre>



<p class="wp-block-paragraph">Create configuration file with <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/ingress/ingress-nginx-values.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ingress-nginx-values.yaml</a></strong></code>.</p>



<p class="wp-block-paragraph">Then, install NGINX Ingress:</p>



<pre class="wp-block-code"><code class="">helm install ingress-nginx ingress-nginx/ingress-nginx \<br>  --namespace ingress-nginx \<br>  --create-namespace \<br>  -f ingress-nginx-values.yaml \<br>  --wait</code></pre>



<p class="wp-block-paragraph">Wait for LoadBalancer IP. The external IP assignment should take 1-2 minutes.</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n ingress-nginx ingress-nginx-controller -w</code></pre>



<p class="wp-block-paragraph">Once <code><strong><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></strong></code> is no longer , Ctrl+C and export it:</p>



<pre class="wp-block-code"><code class="">export EXTERNAL_IP=&lt;EXTERNAL-IP&gt;<br>echo "API URL: http://$EXTERNAL_IP"</code></pre>



<h4 class="wp-block-heading">2. Create vLLM Ingress resource</h4>



<p class="wp-block-paragraph">Next, create vLLM Ingress using <strong><code><a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-ingress.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-ingress.yaml</a></code></strong>.</p>



<p class="wp-block-paragraph">Apply it as follow:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-ingress.yaml</code></pre>



<p class="wp-block-paragraph">You can now test different API calls to verify that your deployment is functional.</p>



<h4 class="wp-block-heading">3. Test API</h4>



<p class="wp-block-paragraph">Firstly, check if the model is available:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/models | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "object": "list",<br>  "data": [<br>    {<br>      "id": "qwen3-vl-8b",<br>      "object": "model",<br>      "created": 1772472143,<br>      "owned_by": "vllm",<br>      "root": "Qwen/Qwen3-VL-8B-Instruct",<br>      "parent": null,<br>      "max_model_len": 8192,<br>      "permission": [<br>        {<br>          "id": "modelperm-8fb35cdd3208b068",<br>          "object": "model_permission",<br>          "created": 1772472143,<br>          "allow_create_engine": false,<br>          "allow_sampling": true,<br>          "allow_logprobs": true,<br>          "allow_search_indices": false,<br>          "allow_view": true,<br>          "allow_fine_tuning": false,<br>          "organization": "*",<br>          "group": null,<br>          "is_blocking": false<br>        }<br>      ]<br>    }<br>  ]<br>}</code></pre>



<p class="wp-block-paragraph">Next, test inference using the following request:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/chat/completions \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "qwen3-vl-8b",<br>    "messages": [{"role": "user", "content": "Count from 1 to 10."}],<br>    "max_tokens": 100<br>  }' | jq '.choices[0].message.content'</code></pre>



<p class="wp-block-paragraph"><code>"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"</code></p>



<p class="wp-block-paragraph">Great! You&#8217;re almost there…</p>



<h3 class="wp-block-heading">Step 5 &#8211; Install Prometheus stack</h3>



<p class="wp-block-paragraph">Now, set up the monitoring stack that provides complete observability for&nbsp;<strong>application-level&nbsp;</strong>(vLLM) and&nbsp;<strong>hardware-level</strong>&nbsp;(GPU) metrics:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="763" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg" alt="" class="wp-image-30871" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-300x223.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-768x572.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1536x1144.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture.jpg 1673w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Monitoring architecture</em></figcaption></figure>



<h4 class="wp-block-heading">1. Add helm repository and create namespace</h4>



<p class="wp-block-paragraph">Add Prometheus helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts<br>helm repo update</code></pre>



<p class="wp-block-paragraph">Then, create the <code><strong><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></strong></code> Namespace.</p>



<pre class="wp-block-code"><code class="">kubectl create namespace monitoring</code></pre>



<h4 class="wp-block-heading">2. Create Prometheus deployment configuration and installation</h4>



<p class="wp-block-paragraph">First, create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/prometheus.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">prometheus.yaml</a></strong></code> file.</p>



<p class="wp-block-paragraph">Install Prometheus stack:</p>



<pre class="wp-block-code"><code class="">helm install prometheus prometheus-community/kube-prometheus-stack \<br>  -n monitoring \<br>  -f prometheus.yaml \<br>  --timeout 10m \<br>  --wait</code></pre>



<p class="wp-block-paragraph">Now,&nbsp;monitor its installation and wait until the pods are ready:</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n monitoring -w</code></pre>



<p class="wp-block-paragraph">If all pods are running successfully, you can proceed to the next step.</p>



<h4 class="wp-block-heading">3. Check that the installation is operational</h4>



<p class="wp-block-paragraph">First access Grafana in background:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &amp;</code></pre>



<p class="wp-block-paragraph">Test Grafana health:</p>



<pre class="wp-block-code"><code class="">curl -s http://localhost:3000/api/health | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "database": "ok",<br>  "version": "12.3.3",<br>  "commit": "2a14494b2d6ab60f860d8b27603d0ccb264336f6"<br>}</code></pre>



<p class="wp-block-paragraph">You can now access to Grafana locally via <strong><a href="http://localhost:3000" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>http://localhost:3000</code></a></strong>. You will have to use:</p>



<ul class="wp-block-list">
<li>Login: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">admin</mark></strong></code></li>



<li>Password: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">Admin123!vLLM</mark></strong></code></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="518" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png" alt="" class="wp-image-30804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-300x152.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-768x389.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2.png 1322w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">Well done! You can now proceed to the configuration step.</p>



<h3 class="wp-block-heading">Step 6 &#8211; Configure ServiceMonitors</h3>



<p class="wp-block-paragraph">The ServiceMonitors is used to tell Prometheus which endpoints to scrape for metrics.</p>



<h4 class="wp-block-heading">1. Create vLLM ServiceMonitor</h4>



<p class="wp-block-paragraph">Retrieve the file from the GitHub repository: <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/vllm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-servicemonitor.yaml</a></strong></code>.</p>



<p class="wp-block-paragraph">Next, apply and check that the ServiceMonitor <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-metrics</mark></strong></code> exists:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-servicemonitor.yaml<br>kubectl get servicemonitor -n vllm</code></pre>



<h4 class="wp-block-heading">2. Create DCGM ServiceMonitor</h4>



<p class="wp-block-paragraph">First, create the <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/dcgm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dcgm-servicemonitor.yaml</a></strong></code> file.</p>



<p class="wp-block-paragraph">Once again, apply and verify:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f dcgm-servicemonitor.yaml<br>kubectl get servicemonitor -n gpu-operator</code></pre>



<pre class="wp-block-preformatted"><code>gpu-operator                  1d<br>nvidia-dcgm-exporter          1d<br>nvidia-node-status-exporter   1d</code></pre>



<h4 class="wp-block-heading">3. Configure Prometheus for Cross-Namespace discovery</h4>



<p class="wp-block-paragraph">Apply a patch to allow Prometheus to discover ServiceMonitors in all namespaces.</p>



<pre class="wp-block-code"><code class="">kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type merge -p '{<br>  "spec": {<br>    "serviceMonitorNamespaceSelector": {},<br>    "podMonitorNamespaceSelector": {}<br>  }<br>}'</code></pre>



<p class="wp-block-paragraph">Now you have to restart Prometheus.</p>



<ol class="wp-block-list">
<li>Delete Prometheus pod to force configuration reload</li>



<li>Wait for Prometheus to restart</li>
</ol>



<pre class="wp-block-code"><code class="">kubectl delete pod prometheus-prometheus-kube-prometheus-prometheus-0 -n monitoring<br><br>kubectl wait --for=condition=Ready \<br>  pod/prometheus-prometheus-kube-prometheus-prometheus-0 \<br>  -n monitoring \<br>  --timeout=180s</code></pre>



<p class="wp-block-paragraph">Wait about 2 minutes for discovery and finally, verify targets:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring \<br>  prometheus-prometheus-kube-prometheus-prometheus-0 9090:9090 &amp;</code></pre>



<p class="wp-block-paragraph">You can open in browser: <a href="http://localhost:9090/targets" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://localhost:9090/targets</mark></strong></code></a> and search for:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></strong></code></li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">dcgm</mark></code></strong></li>
</ul>



<p class="wp-block-paragraph">Note that the expected targets are: </p>



<ul class="wp-block-list">
<li>serviceMonitor/vllm/vllm-metrics/0   (2/2 UP)</li>



<li>serviceMonitor/gpu-operator/nvidia-dcgm-exporter/0 (2/2 UP)</li>
</ul>



<h3 class="wp-block-heading">Step 7 &#8211; Create Grafana dashboards</h3>



<p class="wp-block-paragraph">In this final step, the goal is to create two Grafana dashboards to track both the software side with vLLM metrics and the hardware metrics that will monitor the GPU consumption and system.</p>



<h4 class="wp-block-heading">1. vLLM application metrics</h4>



<p class="wp-block-paragraph">The dashboard provides insights into vLLM application performance, request handling, and resource utilization based on the following metrics:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>vllm:request_success_total</code></td><td>Counter</td><td>Total successful requests</td><td>count</td><td>Request Rate, Total Requests</td></tr><tr><td><code>vllm:num_requests_running</code></td><td>Gauge</td><td>Requests currently being processed</td><td>count</td><td>Queue Depth, Active Requests</td></tr><tr><td><code>vllm:num_requests_waiting</code></td><td>Gauge</td><td>Requests waiting in queue</td><td>count</td><td>Queue Depth, Queued Requests</td></tr><tr><td><code>vllm:time_to_first_token_seconds</code></td><td>Histogram</td><td>Latency until first token generated</td><td>seconds</td><td>TTFT P50/P95/P99</td></tr><tr><td><code>vllm:e2e_request_latency_seconds</code></td><td>Histogram</td><td>Total end-to-end latency</td><td>seconds</td><td>E2E Latency P50/P95/P99</td></tr><tr><td><code>vllm:generation_tokens_total</code></td><td>Counter</td><td>Total tokens generated (output)</td><td>count</td><td>Token Generation Rate, Throughput</td></tr><tr><td><code>vllm:prompt_tokens_total</code></td><td>Counter</td><td>Total prompt tokens (input)</td><td>count</td><td>Token Generation Rate, Avg Tokens</td></tr><tr><td><code>vllm:kv_cache_usage_perc</code></td><td>Gauge</td><td>GPU KV cache utilization</td><td>0-1 (0-100%)</td><td>KV Cache Usage</td></tr><tr><td><code>vllm:prefix_cache_hits_total</code></td><td>Counter</td><td>Number of prefix cache hits</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:prefix_cache_queries_total</code></td><td>Counter</td><td>Number of prefix cache queries</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:request_queue_time_seconds</code></td><td>Histogram</td><td>Time spent waiting in queue</td><td>seconds</td><td>Request Queue Time</td></tr><tr><td><code>vllm:request_prefill_time_seconds</code></td><td>Histogram</td><td>Prefill phase time</td><td>seconds</td><td>Prefill Time</td></tr><tr><td><code>vllm:request_decode_time_seconds</code></td><td>Histogram</td><td>Decode phase time</td><td>seconds</td><td>Decode Time</td></tr><tr><td><code>vllm:inter_token_latency_seconds</code></td><td>Histogram</td><td>Latency between each token</td><td>seconds</td><td>Inter-Token Latency</td></tr><tr><td><code>vllm:num_preemptions_total</code></td><td>Counter</td><td>Number of preemptions (OOM)</td><td>count</td><td>Preemptions</td></tr><tr><td><code>vllm:prompt_tokens_cached_total</code></td><td>Counter</td><td>Prompt tokens cached</td><td>count</td><td>Cached Tokens</td></tr><tr><td><code>vllm:request_prompt_tokens</code></td><td>Histogram</td><td>Prompt size distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:request_generation_tokens</code></td><td>Histogram</td><td>Generated tokens distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:iteration_tokens_total</code></td><td>Histogram</td><td>Tokens per iteration</td><td>count</td><td>(Advanced analysis)</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">This <strong>vLLM Grafana dashboard</strong> is composed of 23 panels:</p>



<p class="wp-block-paragraph">The dashboard provides insights into LLM application performance, request handling, and resource utilisation based on the previous metrics.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Nombre</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>12</td><td>Request Rate, Queue Depth, TTFT, E2E Latency, Token Gen, Cache Usage, Cache Hit, Queue Time, Prefill/Decode, Inter-Token, Preemptions, Avg Tokens</td></tr><tr><td><strong>Stat</strong></td><td>10</td><td>Throughput, TTFT P95, Active Req, Queued Req, Cache Hit Rate, Cache Usage, Total Req, Total Tokens, Cached Tokens, Preemptions</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Pod Performance</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">Now create the dashboard using <a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/vllm-app-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"></a><code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/vllm-app-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-app-dashboard.json</a></strong></code>. Then, launch:</p>



<pre class="wp-block-code"><code class="">echo "Importing vLLM application dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @vllm-app-dashboard.json | jq '.status, .url'</code></pre>



<p class="wp-block-paragraph">Next, you an access the vLLM dashboard and follow metrics in real time:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png" alt="" class="wp-image-30858" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">This dashboard is also essential to track hardware consumption for comprehensive monitoring.</p>



<h4 class="wp-block-heading">2. GPU hardware metrics</h4>



<p class="wp-block-paragraph">Take advantage of the most useful DCGM metrics to check both the functioning and consumption of your hardware resources:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Normal Thresholds</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>DCGM_FI_DEV_GPU_UTIL</code></td><td>Gauge</td><td>GPU utilization (compute)</td><td>% (0-100)</td><td>70-95% optimal</td><td>GPU Utilization</td></tr><tr><td><code>DCGM_FI_DEV_GPU_TEMP</code></td><td>Gauge</td><td>GPU temperature</td><td>°C</td><td>&lt; 85°C normal</td><td>GPU Temperature</td></tr><tr><td><code>DCGM_FI_DEV_FB_USED</code></td><td>Gauge</td><td>VRAM used</td><td>MB</td><td>Variable by model</td><td>GPU Memory Used</td></tr><tr><td><code>DCGM_FI_DEV_FB_FREE</code></td><td>Gauge</td><td>VRAM free</td><td>MB</td><td>&gt; 2GB recommended</td><td>GPU Memory Free</td></tr><tr><td><code>DCGM_FI_DEV_POWER_USAGE</code></td><td>Gauge</td><td>Power consumption</td><td>Watts</td><td>&lt; 300W (L40S)</td><td>GPU Power Usage</td></tr><tr><td><code>DCGM_FI_DEV_SM_CLOCK</code></td><td>Gauge</td><td>GPU clock speed (compute)</td><td>MHz</td><td>Variable</td><td>GPU Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_MEM_CLOCK</code></td><td>Gauge</td><td>Memory clock speed</td><td>MHz</td><td>Variable</td><td>Memory Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL</code></td><td>Counter</td><td>Total NVLink bandwidth</td><td>bytes/s</td><td>(If multi-GPU)</td><td>NVLink Bandwidth</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_TX_BYTES</code></td><td>Counter</td><td>PCIe data transmitted</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe TX</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_RX_BYTES</code></td><td>Counter</td><td>PCIe data received</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe RX</td></tr><tr><td><code>DCGM_FI_DEV_ECC_DBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC double-bit errors</td><td>count</td><td>0 ideal</td><td>(Health check)</td></tr><tr><td><code>DCGM_FI_DEV_ECC_SBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC single-bit errors</td><td>count</td><td>&lt; 10/day acceptable</td><td>(Health check)</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">This&nbsp;<strong>hardware Grafana dashboard</strong>&nbsp;is composed of 13 panels with GPU hardware and system metrics. A detailed view is also available GPU util (%), temperature (°C), vRAM (GB) and power (Watt).</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Count</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>8</td><td>GPU Util, GPU Mem, GPU Temp, GPU Power, CPU Usage, RAM Usage, Network I/O, Disk I/O</td></tr><tr><td><strong>Stat</strong></td><td>4</td><td>Avg GPU Util, Avg GPU Temp, Total GPU Mem, Total GPU Power</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Hardware Status</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">Please refer to <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/hardware-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">hardware-dashboard.json</a></strong></code> by loading it as follows:</p>



<pre class="wp-block-code"><code class="">echo "Importing hardware dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @hardware-dashboard.json | jq '.status, .url'</code></pre>



<p class="wp-block-paragraph">Finally, track resource consumption using this hardware dashboard:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png" alt="" class="wp-image-30859" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">Congratulations! Everything is working. You can now test your model and track the various metrics in real time.</p>



<h3 class="wp-block-heading">Step 8 &#8211; LLM testing and performance tracking</h3>



<p class="wp-block-paragraph">Start by installing Python dependencies:</p>



<pre class="wp-block-code"><code class="">pip3 install openai tqdm</code></pre>



<p class="wp-block-paragraph">Replace the <strong><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL_IP&gt;</mark></strong> by the vLLM service external IP and launch the performance test thanks to the following <a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/llm-inference-performance-test.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong>Python code</strong></code></a>:</p>



<pre class="wp-block-code"><code class="">import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br>from tqdm import tqdm<br><br>APP_URL = "http://94.23.185.22/v1"<br>MODEL = "qwen3-vl-8b"<br><br>CONCURRENT_WORKERS = 500          # concurrency<br>REQUESTS_PER_WORKER = 10<br>MAX_TOKENS = 200                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># openai compliance<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key="foo"<br>)<br><br># basic metrics<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception as e:<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("\n-&gt; STARTING PERFORMANCE TEST:")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n-&gt; BENCH RESULTS:")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>



<p class="wp-block-paragraph">Returning:</p>



<pre class="wp-block-preformatted"><code>-&gt; STARTING PERFORMANCE TEST:</code><br><code>Concurrency: 500<br>Total requests: 5000</code><br><code><br>-&gt; BENCH RESULTS:<br>Total requests sent: 5000<br>Successful requests: 5000<br>Errors: 0<br>Total wall time: 225.54s<br>Avg latency: 21.45s<br>Min latency: 6.06s<br>Max latency: 25.19s<br>Throughput: 22.17 req/s</code></pre>



<p class="wp-block-paragraph">Don&#8217;t forget to track GPU and vLLM metrics in your Grafana dashboards!</p>



<h2 class="wp-block-heading">Conslusion</h2>



<p class="wp-block-paragraph">This reference architecture demonstrates a<strong>&nbsp;vLLM deployment on OVHcloud Managed Kubernetes Service (MKS)</strong>&nbsp;with comprehensive GPU monitoring. Benefits include:</p>



<ul class="wp-block-list">
<li><strong>High Performance</strong>: GPU-accelerated inference with L40S</li>



<li><strong>Scalability</strong>: Kubernetes-native, horizontal scaling-ready</li>



<li><strong>Reliability</strong>: Health checks, auto-restart, monitoring</li>



<li><strong>API Compatibility</strong>: OpenAI-compatible endpoints</li>



<li><strong>Multimodality</strong>: Vision &amp; text capabilities</li>



<li><strong>Full stack monitoring</strong>: Complete vLLM application and hardware dashboards</li>
</ul>



<h2 class="wp-block-heading">Going Further</h2>



<p class="wp-block-paragraph">Your current architecture is&nbsp;<strong>functional.&nbsp;</strong>However, if desired,&nbsp;<strong>it could be improved into a full production-ready&nbsp;solution.</strong></p>



<p class="wp-block-paragraph"><strong>Wish to take production hardening a step further?</strong></p>



<p class="wp-block-paragraph">Go further with the following enhancements:</p>



<ol class="wp-block-list">
<li><strong>Authentication &amp; authorization</strong>
<ul class="wp-block-list">
<li>vLLM API authentication</li>



<li>Grafana authentication</li>



<li>Prometheus security</li>
</ul>
</li>



<li><strong>High availability &amp; load balancing</strong>
<ul class="wp-block-list">
<li>Grafana high availability with multiple replicas and shared storage</li>



<li>Prometheus high availability</li>



<li>vLLM Horizontal Pod Autoscaling (HPA) based on custom metrics</li>
</ul>
</li>



<li><strong>Data persistence &amp; backup</strong>
<ul class="wp-block-list">
<li>Prometheus long-term storage with persistent storage</li>



<li>Grafana Dashboard Backup</li>
</ul>
</li>



<li><strong>Observability enhancements</strong>
<ul class="wp-block-list">
<li>Distributed tracing by adding OpenTelemetry for request tracing</li>



<li>Alerting rules with production-ready alert rules</li>
</ul>
</li>
</ol>



<p class="wp-block-paragraph"></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability%2F&amp;action_name=Reference%20Architecture%3A%20Deploying%20a%20vision-language%20model%20with%20vLLM%20on%20OVHcloud%20MKS%20for%20high%20performance%20inference%20and%20full%20observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to serve LLMs with vLLM and OVHcloud AI Deploy</title>
		<link>https://blog.ovhcloud.com/how-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 29 May 2024 12:22:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMA]]></category>
		<category><![CDATA[LLaMA 3]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Mixtral]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26762</guid>

					<description><![CDATA[In this tutorial, we will learn how to serve Large Language Models (LLMs) using vLLM and the OVHcloud AI Products.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy%2F&amp;action_name=How%20to%20serve%20LLMs%20with%20vLLM%20and%20OVHcloud%20AI%20Deploy&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>In this tutorial, we will walk you through the process of serving large language models (LLMs), providing step-by-step instruction</em>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="345" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png" alt="" class="wp-image-25615" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-768x259.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1536x518.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-2048x690.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading">Introduction</h3>



<p class="wp-block-paragraph">In recent years, <strong>large language models</strong> (LLMs) have become increasingly <strong>popular</strong>, with <strong>open-source</strong> models like <em>Mistral</em> and <em>LLaMA</em> gaining widespread attention. In particular, the <em>LLaMA 3</em> model was released on <em>April 18, 2024</em>, is one of today&#8217;s most powerful open-source LLMs.</p>



<p class="wp-block-paragraph">However, <strong>serving these LLMs can be challenging</strong>, particularly on hardware with limited resources. Indeed, even on expensive hardware, LLMs can be surprisingly slow, with high VRAM utilization and throughput limitations.</p>



<p class="wp-block-paragraph">This is where<strong><em> </em></strong><em><a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>vLLM</strong></a></em> comes in. <em><strong>vLLM</strong></em> is an <strong>open-source project</strong> that enables <strong>fast and easy-to-use LLM inference and serving</strong>. Designed for optimal performance and resource utilization, <em>vLLM</em> supports a range of <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLM architectures</a> and offers <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">flexible customization options</a>. That&#8217;s why we are going to use it to efficiently deploy and scale our LLMs.</p>



<h3 class="wp-block-heading">Objective</h3>



<p class="wp-block-paragraph">In this guide, you will discover how to deploy a LLM thanks to <a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em></a> and the <strong><em>AI Deploy</em></strong> <em>OVHcloud</em> solution. This will enable you to benefit from <em>vLLM</em>&#8216;s optimisations and <em>OVHcloud</em>&#8216;s GPU computing resources. Your LLM will then be exposed by a secured API.</p>



<p class="wp-block-paragraph">🎁 And for those who do not want to bother with the deployment process, <strong>a surprise awaits you at the <a href="#AI-ENDPOINTS">end of the article</a></strong>. We are going to introduce you to our new solution for using LLMs, called <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Endpoints</strong></a>. This product makes it easy to integrate AI capabilities into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it&#8217;s in alpha, it&#8217;s <strong>free</strong>!</p>



<h3 class="wp-block-heading">Requirements</h3>



<p class="wp-block-paragraph">To deploy your <em>vLLM</em> server, you need:</p>



<ul class="wp-block-list">
<li>An <em>OVHcloud</em> account to access the <a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude"><em>OVHcloud Control Panel</em></a></li>



<li>A <em>Public Cloud</em> project</li>



<li>A <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for the AI Products</a>, related to this <em>Public Cloud</em> project</li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The <em>OVHcloud AI CLI</em></a> installed on your local computer (to interact with the AI products by running commands). </li>



<li><a href="https://www.docker.com/get-started" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a> installed on your local computer, <strong>or</strong> access to a Debian Docker Instance, which is available on the <a href="https://www.ovh.com/manager/public-cloud/" data-wpel-link="exclude"><em>Public Cloud</em></a></li>
</ul>



<p class="wp-block-paragraph">Once these conditions have been met, you are ready to serve your LLMs.</p>



<h3 class="wp-block-heading">Building a Docker image</h3>



<p class="wp-block-paragraph">Since the <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>OVHcloud AI Deploy</em></a> solution is based on <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Docker</em></a> images, we will be using a <em>Docker</em> image to deploy our <em>vLLM</em> inference server. </p>



<p class="wp-block-paragraph">As a reminder, <em>Docker</em> is a platform that allows you to create, deploy, and run applications in containers. <em>Docker</em> containers are standalone and executable packages that include everything needed to run an application (code, libraries, system tools).</p>



<p class="wp-block-paragraph">To create this <em>Docker</em> image, we will need to write the following <em><strong>Dockerfile</strong></em> into a new folder:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">mkdir my_vllm_image
nano Dockerfile</code></pre>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 🐳 Base image
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 👱 Set the working directory inside the container
WORKDIR /workspace

# 📚 Install missing system packages (git) so we can clone the vLLM project repository
RUN apt-get update &amp;&amp; apt-get install -y git
RUN git clone https://github.com/vllm-project/vllm/

# 📚 Install the Python dependencies
RUN pip3 install --upgrade pip
RUN pip3 install vllm 

# 🔑 Give correct access rights to the OVHcloud user
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace</code></pre>



<p class="wp-block-paragraph">Let&#8217;s take a closer look at this <em>Dockerfile</em> to understand it:</p>



<ul class="wp-block-list">
<li><strong>FROM</strong>: Specify the base image for our <em>Docker</em> Image. We choose the <em>PyTorch</em> image since it comes with <em>CUDA</em>, <em>CuDNN</em> and <em>torch</em>, which is needed by <em>vLLM</em>. </li>



<li><strong>WORKDIR /workspace</strong>: We set the working directory for the <em>Docker</em> container to <em>/workspace</em>, which is the default folder when we use <em>AI Deploy</em>.</li>



<li><strong>RUN</strong>: It allows us to upgrade <em>pip</em> to the latest version to make sure we have access to the latest libraries and dependencies. We will install <em>vLLM</em> library, and <em>git</em>, which will enable to clone the <a href="https://github.com/vllm-project/vllm/tree/main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em> repository</a> into th<em>e /workspace</em> directory.</li>



<li><strong>ENV</strong> HOME=/workspace: This sets the <em>HOME</em> environment variable to <em>/workspace</em>. This is a requirement to use the <em>OVHcloud</em> AI Products.</li>



<li><strong>RUN chown -R 42420:42420 /workspace</strong>: This changes the owner of the <em>/workspace</em> directory to the user and group with IDs of <em>42420</em> (<em>OVHcloud</em> user). This is also a requirement to use the <em>OVHcloud</em> AI Products.</li>
</ul>



<p class="wp-block-paragraph">This <em>Dockerfile</em> does not contain a <strong>CMD</strong> instruction and therefore does not launch our <em>VLLM</em> server. Do not worry about that, we will do it directly from <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a>&nbsp;to have more flexibility.</p>



<p class="wp-block-paragraph">Once your Dockerfile is written, launch the following command to build your image:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker build . -t vllm_image:latest</code></pre>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p class="wp-block-paragraph">Once you have built the Docker image, you will need to push it to a <strong>registry</strong> to make it accessible from <em>AI Deploy</em>. A <strong>registry</strong> is a service that allows you to store and distribute <em>Docker</em> images, making it easy to deploy them in different environments.</p>



<p class="wp-block-paragraph">Several registries can be used (<em><a href="https://www.ovhcloud.com/en-gb/public-cloud/managed-private-registry/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Private Registry</a>, <a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a>, <a href="https://github.com/features/packages" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub packages</a>, &#8230;</em>). In this tutorial, we will use the <strong><em>OVHcloud</em> <em>shared registry</em></strong>. More information are available in the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-manage-registries?id=kb_article_view&amp;sysparm_article=KB0057949" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registries documentation</a>.</p>



<p class="wp-block-paragraph">To find the address of your shared registry, use the following command (<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a> needs to be installed on your computer):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">ovhai registry list</code></pre>



<p class="wp-block-paragraph">Then, log in on your <em>shared registry</em> with your usual <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Platform user</em></a> credentials:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p class="wp-block-paragraph">Once you are logged in to the registry, tag the compiled image and push it into your shared registry:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker tag vllm_image:latest &lt;shared-registry-address&gt;/vllm_image:latest
docker push &lt;shared-registry-address&gt;/vllm_image:latest</code></pre>



<h3 class="wp-block-heading">vLLM inference server deployment</h3>



<p class="wp-block-paragraph">Once your image has been pushed, it can be used with <em>AI Deploy</em>, using either the <em>ovhai CLI</em> or the <em>OVHcloud Control Panel (UI)</em>.</p>



<h5 class="wp-block-heading">Creating an access token </h5>



<p class="wp-block-paragraph">Tokens are used as unique authenticators to securely access the <em>AI Deploy</em> apps. By creating a token, you can ensure that only authorized requests are allowed to interact with the <em>vLLM</em> endpoint. You can create this token by using the <em>OVHcloud Control Panel (UI)</em> or by running the following command:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai token create vllm --role operator --label-selector name=vllm</code></pre>



<p class="wp-block-paragraph">This will give you a token that you will need to keep.</p>



<h5 class="wp-block-heading">Creating a Hugging Face token (optionnal)</h5>



<p class="wp-block-paragraph">Note that some models, such as <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 3</a> require you to accept their license, hence, you need to create a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">HuggingFace account</a>, accept the model’s license, and generate a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">token</a> by accessing your account settings, that will allow you to access the model.</p>



<p class="wp-block-paragraph">For example, when visiting the HugginFace <a href="https://huggingface.co/google/gemma-2b" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gemma model page</a>, you’ll see this (if you are logged in):</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="716" height="312" src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png" alt="accept_model_conditions_hugging_face" class="wp-image-26768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png 716w, https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21-300x131.png 300w" sizes="auto, (max-width: 716px) 100vw, 716px" /></figure>



<p class="wp-block-paragraph">If you want to use this model, you will have to Acknowledge the license, and then make sure to create a token in the <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tokens section</a>.</p>



<p class="wp-block-paragraph">In the next step, we will set this token as an environment variable (named  <code>HF_TOKEN</code>). Doing this will enable us to use any LLM whose conditions of use we have accepted.</p>



<h5 class="wp-block-heading">Run the AI Deploy application</h5>



<p class="wp-block-paragraph">Run the following command to deploy your <em>vLLM</em> server by running your customized <em>Docker</em> image:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai app run &lt;shared-registry-address&gt;/vllm_image:latest \
  --name vllm_app \
  --flavor h100-1-gpu \
  --gpu 1 \
  --env HF_TOKEN="&lt;YOUR_HUGGING_FACE_TOKEN&gt;" \
  --label name=vllm \
  --default-http-port 8080 \
  -- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code></pre>



<p class="wp-block-paragraph"><em>You just need to change the address of your registry to the one you used, and the name of the LLM you want to use. Also pay attention to the name of the image, its tag, and the label selector of your label if you haven&#8217;t used the same ones as those given in this tutorial.</em></p>



<p class="wp-block-paragraph"><strong>Parameters explanation</strong></p>



<ul class="wp-block-list">
<li><code>&lt;shared-registry-address&gt;/vllm_image:latest</code> is the image on which the app is based.</li>



<li><code>--name vllm_app</code> is an optional argument that allows you to give your app a custom name, making it easier to manage all your apps.</li>



<li><code>--flavor h100-1-gpu</code> indicates that we want to run our app on H100 GPU(s). You can access the full list of GPUs available by <code>running ovhai capabilities flavor list</code></li>



<li><code>--gpu 1</code> indicates that we request 1 GPU for that app.</li>



<li><code>--env HF_TOKEN</code> is an optional argument that allows us to set our Hugging Face token as an environment variable. This gives us access to models for which we have accepted the conditions.</li>



<li><code>--label name=vllm</code> allows to privatize our LLM by adding the token corresponding to the label selector <code>name=vllm</code>.</li>



<li><code>--default-http-port 8080</code> indicates that the port to reach on the app URL is the <code>8080</code>.</li>



<li><code>--python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt;</code> allows to start the vLLM API server. The specified &lt;model&gt; will be downloaded from Hugging Face. Here is a list of those that are <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">supported by vLLM</a>. <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Many arguments</a> can be used to optimize your inference.</li>
</ul>



<p class="wp-block-paragraph">When this <code>ovhai app run</code> command is executed, several pieces of information will appear in your terminal. Get the ID of your application, and open the Info URL in a new tab. Wait a few minutes for your application to launch. When it is <strong>RUNNING</strong>, you can stream its logs by executing:</p>



<pre class="wp-block-code"><code class="">ovhai app logs -f &lt;APP_ID&gt;</code></pre>



<p class="wp-block-paragraph">This will allow you to track the server launch, the model download and any errors you may encounter if you have used a model for which you have not accepted the user contract. </p>



<p class="wp-block-paragraph">If all goes well, you should see the following output, which means that your server is up and running:</p>



<pre class="wp-block-code"><code class="">Started server process [11]
Waiting for application startup.
Application startup complete.
Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre>



<h3 class="wp-block-heading">Interacting with your LLM</h3>



<p class="wp-block-paragraph">Once the server is up and running, we can interact with our LLM by hitting the <code>/generate</code> endpoint.</p>



<p class="wp-block-paragraph"><strong>Using cURL</strong></p>



<p class="wp-block-paragraph"><em>Make sure you change the ID to that of your application so that you target the right endpoint. In order for the request to be accepted, also specify the token that you generated previously by executing</em> <code>ovhai token create</code>. Feel free to adapt the parameters of the request (<em>prompt</em>, <em>max_tokens</em>, <em>temperature</em>, &#8230;)</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">curl --request POST \                                             
  --url https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/generate \
  --header 'Authorization: Bearer &lt;AI_TOKEN_generated_with_CLI&gt;' \
  --header 'Content-Type: application/json' \
  --data '{
        "prompt": "&lt;YOUR_PROMPT&gt;",
        "max_tokens": 50,
        "n": 1,
        "stream": false
}'</code></pre>



<p class="wp-block-paragraph"><strong>Using Python</strong></p>



<p class="wp-block-paragraph"><em>Here too, you need to add your personal token and the correct link for your application.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests
import json

# change for your host
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "AI_TOKEN_generated_with_CLI"

url = f"{APP_URL}/generate"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}
data = {
    "prompt": "What a LLM is in AI?",
    "max_tokens": 100,
    "temperature": 0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json()["text"][0])</code></pre>



<h3 class="wp-block-heading" id="AI-ENDPOINTS">OVHcloud AI Endpoints</h3>



<p class="wp-block-paragraph">If you are not interested in building your own image and deploying your own LLM inference server, you can use OVHcloud&#8217;s new <em><strong><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></strong> </em>product which will make your life definitely easier!</p>



<p class="wp-block-paragraph"><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is a serverless solution that provides AI APIs, enabling you to easily use pre-trained and optimized AI models in your applications. </p>



<figure class="wp-block-video"><video height="1400" style="aspect-ratio: 2560 / 1400;" width="2560" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4"></video></figure>



<p class="has-text-align-center wp-block-paragraph"><em>Overview of AI Endpoints</em></p>



<p class="wp-block-paragraph">You can use LLM as a Service, choosing the desired model (such as <em>LLaMA</em>, <em>Mistral</em>, or <em>Mixtral</em>) and making an API call to use it in your application. This will allow you to interact with these models without even having to deploy them!</p>



<p class="wp-block-paragraph">In addition to LLM capabilities, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> also offers a range of other AI models, including speech-to-text, translation, summarization, embeddings and computer vision. </p>



<p class="wp-block-paragraph">Best of all, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is currently in alpha phase and is <strong>free to use</strong>, making it an accessible and affordable solution for developers seeking to explore the possibilities of AI. Check <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">this article</a> and try it out today to discover the power of AI!</p>



<p class="wp-block-paragraph">Join our <a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Discord server</a> to interact with the community and send us your feedbacks (#<em>ai-endpoints</em> channel)!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy%2F&amp;action_name=How%20to%20serve%20LLMs%20with%20vLLM%20and%20OVHcloud%20AI%20Deploy&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4" length="14424826" type="video/mp4" />

			</item>
	</channel>
</rss>
