<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GPU Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/gpu/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/gpu/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 10 Apr 2026 09:23:41 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>GPU Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/gpu/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Reference Architecture: Deploying a vision-language model with vLLM on OVHcloud MKS for high performance inference and full observability</title>
		<link>https://blog.ovhcloud.com/reference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 10 Apr 2026 07:48:53 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[prometheus]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30455</guid>

					<description><![CDATA[Ensure complete&#160;digital sovereignty&#160;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&#160;Managed Kubernetes Service. This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&#160;OVHcloud Managed Kubernetes Service&#160;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&#160;Qwen3-VL-8B-Instruct&#160;multimodal model (vision + text) with OpenAI-compatible API endpoints. This comprehensive [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em><em>Ensure complete&nbsp;<strong>digital sovereignty</strong>&nbsp;of your AI models with end-to-end control through open-source solutions on OVHcloud’s&nbsp;<strong>Managed Kubernetes Service</strong>.</em></em></p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" width="703" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg" alt="vLLM on OVHcloud MKS for high availability and full observability" class="wp-image-31153" style="width:710px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-703x1024.jpg 703w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-206x300.jpg 206w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-768x1118.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm-1055x1536.jpg 1055w, https://blog.ovhcloud.com/wp-content/uploads/2026/04/ref-archi-mks-vllm.jpg 1260w" sizes="(max-width: 703px) 100vw, 703px" /><figcaption class="wp-element-caption"><em><em>vLLM on OVHcloud MKS for high availability and full observability</em></em></figcaption></figure>



<p>This reference architecture demonstrates how to deploy a Large Language Model (LLM) inference system using vLLM on&nbsp;<a href="https://www.ovhcloud.com/fr/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Kubernetes Service</a>&nbsp;(MKS). The solution leverages NVIDIA L40S GPUs to serve the&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen3-VL-8B-Instruct</a>&nbsp;multimodal model (vision + text) with OpenAI-compatible API endpoints.</p>



<p>This comprehensive guide shows you how to deploy, automatically scale, and monitor vLLM-based LLM workloads on OVHcloud infrastructure.</p>



<p><strong>What are the key benefits?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-effectiveness:</strong>&nbsp;Leverage managed services to minimise operational overhead</li>



<li><strong>Real-time observability:</strong>&nbsp;Track Time-to-First-Token (TTFT), throughput, and resource utilisation</li>



<li><strong>Sovereign infrastructure:</strong>&nbsp;Keep all metrics and data within European datacentres</li>



<li><strong>Scalable by design:</strong>&nbsp;Automatically scale GPU inference replicas based on real workload demand</li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">Managed Kubernetes Service</h3>



<p><strong>OVHcloud MKS</strong>&nbsp;is a fully managed Kubernetes platform designed to help you deploy, operate, and scale containerised applications in production. It provides a secure and reliable Kubernetes environment without the operational overhead of managing the control plane.</p>



<p><strong>How does this benefit you?</strong></p>



<ul class="wp-block-list">
<li><strong>Cost-efficient</strong>: Pay only for worker nodes and consumed resources, with no additional charge for the Kubernetes control plane</li>



<li><strong>Fully managed Kubernetes</strong>: Certified upstream Kubernetes with automated control plane management, managed upgrades, and high availability</li>



<li><strong>Production-ready by design</strong>: Built-in integrations with OVHcloud Load Balancers, networking, and persistent storage</li>



<li><strong>Scalable and flexible</strong>: Easily scale workloads and node pools to match application demand</li>



<li><strong>Open and portable</strong>: Based on standard Kubernetes APIs, enabling seamless integration with open-source ecosystems and avoiding vendor lock-in</li>
</ul>



<p>In the following guide, all services are deployed within the&nbsp;<strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Architecture overview</h2>



<p>This reference architecture demonstrates a basic deployment of vLLM for vision-language model inference on OVHcloud Managed Kubernetes Service, featuring:</p>



<ul class="wp-block-list">
<li><strong>High-availability deployment</strong>&nbsp;with 2 GPU nodes (NVIDIA L40S)</li>



<li><strong>Optimised GPU utilisation</strong>&nbsp;with proper driver configuration</li>



<li><strong>Scalable infrastructure</strong>&nbsp;supporting vision-language models</li>



<li><strong>Comprehensive monitoring</strong>&nbsp;using Prometheus, Grafana, and DCGM</li>



<li><strong>Full observability</strong>&nbsp;for both application and hardware metrics</li>
</ul>



<p><strong>Data flow</strong>:</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="538" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg" alt="" class="wp-image-30985" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-1536x806.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-3-1-2048x1075.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Data Flow</em></figcaption></figure>



<ol class="wp-block-list">
<li><strong>Inference request:</strong>
<ul class="wp-block-list">
<li>User → LoadBalancer → Gateway → NGINX Ingress → &#8220;Qwen3 VL&#8221; Service → vLLM Pod → GPU</li>



<li>Response follows reverse path with streaming support</li>
</ul>
</li>



<li><strong>Metrics collection:</strong>
<ul class="wp-block-list">
<li>vLLM Pods expose <code>/metrics</code> endpoint (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">8000</mark></strong></code>)</li>



<li>DCGM Exporters expose GPU metrics (port <code><strong><mark class="has-inline-color has-ast-global-color-0-color">9400</mark></strong></code>)</li>



<li>Prometheus scrapes both endpoints every 30 seconds</li>



<li>Grafana queries Prometheus for visualization</li>
</ul>
</li>



<li><strong>Load distribution:</strong>
<ul class="wp-block-list">
<li>NGINX Ingress uses cookie-based session affinity</li>



<li>vLLM Service uses ClientIP session affinity (see the check after this list)</li>



<li>Anti-affinity ensures 1 pod per GPU node</li>
</ul>
</li>
</ol>
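


<p>Once the stack from the steps below is deployed, you can confirm the affinity settings from point 3 directly on the cluster (a quick sketch; the service and Ingress names come from the later steps):</p>



<pre class="wp-block-code"><code class=""># service-level ClientIP affinity (expected value: ClientIP)<br>kubectl get svc qwen3-vl-service -n vllm -o jsonpath='{.spec.sessionAffinity}'<br><br># ingress-level cookie affinity annotations<br>kubectl get ingress -n vllm -o jsonpath='{.items[0].metadata.annotations}'</code></pre>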



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code>Administrator</code></strong></a>&nbsp;role</li>



<li><strong>Hugging Face access</strong>&nbsp;–&nbsp;<em>create a&nbsp;<a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>&nbsp;and generate an&nbsp;<a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></em></li>



<li><code><strong>kubectl</strong></code>&nbsp;and&nbsp;<code><strong>helm</strong></code>&nbsp;(version 3.x or later) already installed &#8211; <em>a quick check follows this list</em></li>
</ul>
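


<p>As a quick sanity check before starting, you can confirm the client tools are available (your reported versions will differ):</p>



<pre class="wp-block-code"><code class="">kubectl version --client<br>helm version --short</code></pre>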



<p><strong>🚀 Now you have all the ingredients, it’s time to deploy the recipe for&nbsp;<a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Qwen/Qwen3-VL-8B-Instruct</a>&nbsp;using vLLM and MKS!</strong></p>



<h2 class="wp-block-heading">Architecture guide: Native GPU deployment of vLLM on MKS with full stack observability</h2>



<p>This reference architecture describes a&nbsp;<strong>Large Language Model</strong>&nbsp;deployment using the&nbsp;<strong>vLLM inference server</strong>&nbsp;and&nbsp;<strong>Kubernetes</strong>, delivering a service that is both highly available and monitorable in real time.</p>



<h3 class="wp-block-heading">Step 1 &#8211; Create MKS cluster and Node pools</h3>



<p>From the&nbsp;<a href="https://www.ovh.com/manager/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">OVHcloud Control Panel</a>, create a Kubernetes cluster using <strong>MKS</strong>.</p>



<p>Navigate to: <code>Public Cloud</code> → <code>Managed Kubernetes Service</code> → <code>Create a cluster</code></p>



<h4 class="wp-block-heading">1. Configure cluster</h4>



<p>Consider using the following configuration for the current use case:</p>



<ul class="wp-block-list">
<li><strong>Name:</strong> <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code></li>



<li><strong>Location</strong>: 1-AZ Region &#8211; Gravelines (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">GRA11</mark></strong></code>)</li>



<li><strong>Plan:</strong> Free (or Standard)</li>



<li><strong>Network</strong>: attach a <strong>Private network </strong>(e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">0000 - AI Private Network</mark></strong></code>)</li>



<li><strong>Version:</strong> Latest stable (e.g. <code><strong><mark class="has-inline-color has-ast-global-color-0-color">1.34</mark></strong></code>)</li>
</ul>



<h4 class="wp-block-heading">2. Create GPU Node pool</h4>



<p>During the cluster creation, configure the vLLM Node pool for GPUs:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></code></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">L40S-90</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">2</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<p><strong>Why L40S-90?</strong></p>



<ul class="wp-block-list">
<li>Cost-effective for single-model deployment (1 GPU per node)</li>



<li>Sufficient RAM (90GB) for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">Qwen3-VL-8B</mark></code></strong> model</li>
</ul>



<p>You should see your cluster (e.g.&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-deployment-l40s-qwen3-8b</mark></strong></code>) in the list, along with the following information:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="930" height="588" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png" alt="" class="wp-image-30745" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1.png 930w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-300x190.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-1-768x486.png 768w" sizes="(max-width: 930px) 100vw, 930px" /></figure>



<p>You can now set up the node pool dedicated to monitoring.</p>



<h4 class="wp-block-heading">3. Create CPU Node pool</h4>



<p>From your cluster, click on <code><strong><mark class="has-inline-color has-ast-global-color-0-color">Add a node pool</mark></strong></code> and configure it as follows:</p>



<ul class="wp-block-list">
<li><strong>Node pool name:</strong> <mark class="has-inline-color has-ast-global-color-0-color"><code>monitoring</code></mark></li>



<li><strong>Flavor:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">B2-15</mark></code></li>



<li><strong>Number of nodes:</strong> <code><mark class="has-inline-color has-ast-global-color-0-color">1</mark></code></li>



<li><strong>Autoscaling:</strong> Disabled (OFF)</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>✅ <strong>Note</strong></p>



<p><strong><em>The monitoring stack can run on GPU nodes if cost is a concern, but a dedicated CPU node provides better isolation and resource management.</em></strong></p>
</blockquote>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="365" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png" alt="" class="wp-image-30743" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-1024x365.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-300x107.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation-768x274.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-node-pool-creation.png 1283w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the status is green with the&nbsp;<strong><code><mark class="has-inline-color has-ast-global-color-0-color">OK</mark></code></strong>&nbsp;label, you can proceed to the next step.</p>



<h4 class="wp-block-heading">4. Configure Kubernetes access</h4>



<p>Once your nodes have been provisioned, you can download the <strong>Kubeconfig file</strong> and configure kubectl with your MKS cluster.</p>



<pre class="wp-block-code"><code class=""># configure kubectl with your MKS cluster<br>export KUBECONFIG=/path/to/your/kubeconfig-xxxxxx.yml<br><br># verify cluster connectivity<br>kubectl cluster-info<br>kubectl get nodes</code></pre>



<p>Returning:</p>



<pre class="wp-block-code"><code class="">NAME                     STATUS   ROLES    AGE   VERSION<br>monitoring-node-xxxxxx   Ready    &lt;none&gt;   1d    v1.34.2<br>vllm-node-yyyyyy         Ready    &lt;none&gt;   1d    v1.34.2<br>vllm-node-zzzzzz         Ready    &lt;none&gt;   1d    v1.34.2</code></pre>



<p>Before going further, add a label to the CPU node for monitoring workloads.</p>



<pre class="wp-block-code"><code class="">CPU_NODE=$(kubectl get nodes -o json | \<br>  jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" == null) | .metadata.name')<br>kubectl label node $CPU_NODE node-role=monitoring</code></pre>



<p>Finally, check the node layout; one way is a <code>custom-columns</code> query like the following sketch, which produces the view below:</p>
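


<pre class="wp-block-code"><code class="">kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,ROLE:.metadata.labels.node-role'</code></pre>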



<pre class="wp-block-code"><code class="">NAME                     GPU      ROLE<br>monitoring-node-xxxxxx   &lt;none&gt;   monitoring<br>vllm-node-yyyyyy         1        &lt;none&gt;<br>vllm-node-zzzzzz         1        &lt;none&gt;</code></pre>



<p>Once both nodes are in <strong>Ready</strong> status, you can proceed to the next step.</p>



<h3 class="wp-block-heading">Step 2 &#8211; Install GPU operator</h3>



<p>To start, consider setting up the GPU operator.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>✅ Note</strong></p>



<p><em><strong>This step is based on this OVHcloud documentation: <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-kubernetes-deploy-gpu-application?id=kb_article_view&amp;sysparm_article=KB0049707" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deploying a GPU application on OVHcloud Managed Kubernetes Service</a></strong></em></p>
</blockquote>



<h4 class="wp-block-heading">1. Add NVIDIA helm repository and create namespace</h4>



<p>Add NVIDIA helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add nvidia https://helm.ngc.nvidia.com/nvidia<br>helm repo update</code></pre>



<p>Then create the namespace as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace gpu-operator</code></pre>



<h4 class="wp-block-heading">2. Install GPU operator with correct configuration</h4>



<p>The GPU Operator must be configured with specific driver versions to ensure compatibility with vLLM containers.</p>



<p>However, the default installation uses recent drivers (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">580.x</mark></strong></code> with <strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 13.x</mark></code></strong>) which are incompatible with vLLM containers (<strong><code><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.x</mark></code></strong>).</p>



<p><strong>Solution:</strong> Force driver version <strong><code><mark class="has-inline-color has-ast-global-color-0-color">535.183.01</mark></code></strong> (<code><strong><mark class="has-inline-color has-ast-global-color-0-color">CUDA 12.2</mark></strong></code>).</p>



<pre class="wp-block-code"><code class="">helm install gpu-operator nvidia/gpu-operator \<br>  -n gpu-operator \<br>  --set driver.enabled=true \<br>  --set driver.version="535.183.01" \<br>  --set toolkit.enabled=true \<br>  --set operator.defaultRuntime=containerd \<br>  --set devicePlugin.enabled=true \<br>  --set dcgmExporter.enabled=true \<br>  --set dcgmExporter.image="dcgm-exporter" \<br>  --set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04" \<br>  --set gfd.enabled=true \<br>  --set migManager.enabled=false \<br>  --set nodeStatusExporter.enabled=true \<br>  --set validator.driver.enable=false \<br>  --set validator.toolkit.enable=false \<br>  --set validator.plugin.enable=false \<br>  --timeout 20m</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>✅ <strong>Note </strong></p>



<p><em><strong>Specifying the DCGM version may only be necessary if you encounter problems with the default image (e.g. <code><mark class="has-inline-color has-ast-global-color-0-color">‘ImagePullBackOff’</mark></code>). If this is the case, add the following parameters:<br><code><mark class="has-inline-color has-ast-global-color-0-color">--set dcgmExporter.repository="nvcr.io/nvidia/k8s"<br>--set dcgmExporter.image="dcgm-exporter"<br>--set dcgmExporter.version="3.1.7-3.1.4-ubuntu20.04"</mark></code></strong></em></p>
</blockquote>



<pre class="wp-block-code"><code class="">kubectl get pods -n gpu-operator</code></pre>



<p>Note that all pods should reach <strong>Running</strong> state in 5-10 minutes.</p>



<p>You can also check the GPU availability:</p>



<pre class="wp-block-code"><code class="">kubectl get nodes -o json | jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | "\(.metadata.name): \(.status.allocatable."nvidia.com/gpu") GPU(s)"'</code></pre>



<p>Returning:</p>



<pre class="wp-block-code"><code class="">vllm-node-yyyyyy: 1 GPU(s)<br>vllm-node-zzzzzz: 1 GPU(s)</code></pre>



<p>And you can test to run <code><strong><mark class="has-inline-color has-ast-global-color-0-color">nvidia-smi</mark></strong></code>:</p>



<pre class="wp-block-code"><code class="">DRIVER_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -1)<br>kubectl exec -n gpu-operator $DRIVER_POD -- nvidia-smi</code></pre>



<p>If the GPU tests are working properly, you can move on to the DCGM service configuration.</p>



<h4 class="wp-block-heading">3. Configure DCGM service</h4>



<p><strong>Why is DCGM Exporter required?</strong></p>



<p>DCGM (Data Centre GPU Manager) is NVIDIA&#8217;s official tool for monitoring GPUs in production. The goal is to be able to collect and display metrics from both GPU nodes.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="571" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg" alt="" class="wp-image-30746" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1024x571.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-300x167.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-768x428.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1-1536x856.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/data_ia_archi-1.jpg 1733w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>GPU monitoring with DCGM</em></figcaption></figure>



<p>The metrics provided are:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_UTIL</mark></strong></code> &#8211; GPU utilisation (%)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_GPU_TEMP</mark></code></strong> &#8211; GPU temperature (°C)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_USED</mark></code></strong> &#8211; VRAM used (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_FB_FREE</mark></code></strong> &#8211; Free VRAM (MB)</li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">DCGM_FI_DEV_POWER_USAGE</mark></code></strong> &#8211; Power consumption (W)</li>



<li>And 50+ other GPU metrics</li>
</ul>
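


<p>To see these metrics in raw form, you can port-forward the exporter service created by the GPU Operator and query its <code>/metrics</code> endpoint directly (a quick check; run it after the operator installation above):</p>



<pre class="wp-block-code"><code class=""># expose the DCGM exporter locally<br>kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &amp;<br><br># sample one metric (one line per GPU)<br>curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL</code></pre>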



<p>Next, ensure the DCGM service has the correct labels and port configuration:</p>



<pre class="wp-block-code"><code class="">kubectl patch svc nvidia-dcgm-exporter -n gpu-operator --type merge -p '{<br>  "metadata": {<br>    "labels": {<br>      "app": "nvidia-dcgm-exporter"<br>    }<br>  },<br>  "spec": {<br>    "ports": [<br>      {<br>        "name": "metrics",<br>        "port": 9400,<br>        "targetPort": 9400,<br>        "protocol": "TCP"<br>      }<br>    ]<br>  }<br>}'</code></pre>



<p>Verify the endpoints (should show 2 IPs, one per GPU node).</p>



<pre class="wp-block-code"><code class="">kubectl get endpoints nvidia-dcgm-exporter -n gpu-operator</code></pre>



<pre class="wp-block-code"><code class="">NAME                   ENDPOINTS                   AGE<br>nvidia-dcgm-exporter   x.x.x.x:9400,x.x.x.x:9400   17d</code></pre>



<h3 class="wp-block-heading">Step 3 &#8211; Deploy Qwen3 VL 8B with vLLM inference server</h3>



<p>The deployment of the <strong>Qwen 3 VL 8B</strong> model on two L40S GPU nodes is carried out in several stages.</p>



<h4 class="wp-block-heading">1. Create namespace and Hugging Face secret</h4>



<p>Start by creating the namespace:</p>



<pre class="wp-block-code"><code class="">kubectl create namespace vllm</code></pre>



<p>Next, you must retrieve your Hugging Face token and replace the&nbsp;<code><strong><mark class="has-inline-color has-ast-global-color-0-color">HF_TOKEN</mark></strong></code>&nbsp;value with your own:</p>



<pre class="wp-block-code"><code class="">export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"</code></pre>



<p>Create your secret as follows:</p>



<pre class="wp-block-code"><code class="">kubectl create secret generic huggingface-secret \<br>  --from-literal=token=$HF_TOKEN \<br>  --namespace=vllm</code></pre>



<p>Verify that you obtain the following output by launching:</p>



<pre class="wp-block-code"><code class="">kubectl get secret huggingface-secret -n vllm</code></pre>



<pre class="wp-block-code"><code class="">NAME                 TYPE     DATA   AGE<br>huggingface-secret   Opaque   1      14d</code></pre>



<h4 class="wp-block-heading">2. Create vLLM deployment configuration</h4>



<p>First, you can create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-deployment-2nodes.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-deployment-2nodes.yaml</a></strong></code> file.</p>



<p>Deploy vLLM:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-deployment-2nodes.yaml</code></pre>



<p>You can monitor the deployment (it should take 8-10 minutes for model download and loading).</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n vllm -o wide -w</code></pre>



<p>Expected output after 10 minutes:</p>



<pre class="wp-block-code"><code class="">NAME               READY  STATUS   RESTARTS  AGE  IP       NODE  <br>qwen3-vl-xxxx-yyy  1/1    Running  0         1d   X.X.X.X  vllm-node-yyyyyy<br>qwen3-vl-xxxx-zzz  1/1    Running  0         1d   X.X.X.X  vllm-node-zzzzzz</code></pre>



<p>You can also check the container logs:</p>



<pre class="wp-block-code"><code class="">kubectl logs -f -n vllm &lt;pod-name&gt;</code></pre>



<p>You should find in the logs: &#8220;<code>Uvicorn running on http://0.0.0.0:8000</code>&#8221;</p>
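


<p>Before exposing the service through an Ingress, you can already send a request through a port-forward (a minimal smoke test; the <code>qwen3-vl-service</code> name comes from the linked manifest):</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n vllm svc/qwen3-vl-service 8000:8000 &amp;<br>curl -s http://localhost:8000/v1/models | jq '.data[0].id'</code></pre>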



<p>Is everything installed correctly? Then let&#8217;s move on to the next step.</p>



<h4 class="wp-block-heading">3. Add service label</h4>



<p>Ensure the service has the correct label for <strong><code><mark class="has-inline-color has-ast-global-color-0-color">ServiceMonitor</mark></code></strong> discovery.</p>



<pre class="wp-block-code"><code class="">kubectl label svc qwen3-vl-service -n vllm app=qwen3-vl --overwrite</code></pre>



<p>You can now verify by launching the following command.</p>



<pre class="wp-block-code"><code class="">kubectl get svc qwen3-vl-service -n vllm --show-labels | grep "app=qwen3-vl"</code></pre>



<p>Returning:</p>



<pre class="wp-block-code"><code class="">qwen3-vl-service   ClusterIP   X.X.X.X   &lt;none&gt;   8000/TCP   1d   app=qwen3-vl</code></pre>



<h3 class="wp-block-heading">Step 4 &#8211; Install NGINX ingress controller</h3>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><mark style="color:#cf2e2e" class="has-inline-color">⚠️ <strong>Moving beyond Ingress</strong></mark></p>



<p><strong><em><mark style="color:#cf2e2e" class="has-inline-color">Follow this <a href="https://blog.ovhcloud.com/moving-beyond-ingress-why-should-ovhcloud-managed-kubernetes-service-mks-users-start-looking-at-the-gateway-api/" data-wpel-link="internal">tutorial</a> if you want to use Gateway instead of Ingress.</mark></em></strong></p>
</blockquote>



<h4 class="wp-block-heading">1. Add helm repository and configure Ingress</h4>



<p>First of all, add helm repository:</p>



<pre class="wp-block-code"><code class="">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx<br>helm repo update</code></pre>



<p>Create configuration file with <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/ingress/ingress-nginx-values.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ingress-nginx-values.yaml</a></strong></code>.</p>



<p>Then, install NGINX Ingress:</p>



<pre class="wp-block-code"><code class="">helm install ingress-nginx ingress-nginx/ingress-nginx \<br>  --namespace ingress-nginx \<br>  --create-namespace \<br>  -f ingress-nginx-values.yaml \<br>  --wait</code></pre>



<p>Wait for the LoadBalancer IP; the external IP assignment should take 1-2 minutes.</p>



<pre class="wp-block-code"><code class="">kubectl get svc -n ingress-nginx ingress-nginx-controller -w</code></pre>



<p>Once <code><strong><mark class="has-inline-color has-ast-global-color-0-color">&lt;EXTERNAL-IP&gt;</mark></strong></code> is no longer <code>&lt;pending&gt;</code>, press Ctrl+C and export it:</p>



<pre class="wp-block-code"><code class="">export EXTERNAL_IP=&lt;EXTERNAL-IP&gt;<br>echo "API URL: http://$EXTERNAL_IP"</code></pre>



<h4 class="wp-block-heading">2. Create vLLM Ingress resource</h4>



<p>Next, create vLLM Ingress using <strong><code><a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/vllm/vllm-ingress.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-ingress.yaml</a></code></strong>.</p>



<p>Apply it as follows:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-ingress.yaml</code></pre>



<p>You can now test different API calls to verify that your deployment is functional.</p>



<h4 class="wp-block-heading">3. Test API</h4>



<p>First, check that the model is available:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/models | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "object": "list",<br>  "data": [<br>    {<br>      "id": "qwen3-vl-8b",<br>      "object": "model",<br>      "created": 1772472143,<br>      "owned_by": "vllm",<br>      "root": "Qwen/Qwen3-VL-8B-Instruct",<br>      "parent": null,<br>      "max_model_len": 8192,<br>      "permission": [<br>        {<br>          "id": "modelperm-8fb35cdd3208b068",<br>          "object": "model_permission",<br>          "created": 1772472143,<br>          "allow_create_engine": false,<br>          "allow_sampling": true,<br>          "allow_logprobs": true,<br>          "allow_search_indices": false,<br>          "allow_view": true,<br>          "allow_fine_tuning": false,<br>          "organization": "*",<br>          "group": null,<br>          "is_blocking": false<br>        }<br>      ]<br>    }<br>  ]<br>}</code></pre>



<p>Next, test inference using the following request:</p>



<pre class="wp-block-code"><code class="">curl http://$EXTERNAL_IP/v1/chat/completions \<br>  -H "Content-Type: application/json" \<br>  -d '{<br>    "model": "qwen3-vl-8b",<br>    "messages": [{"role": "user", "content": "Count from 1 to 10."}],<br>    "max_tokens": 100<br>  }' | jq '.choices[0].message.content'</code></pre>



<p><code>"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"</code></p>



<p>Great! You&#8217;re almost there…</p>



<h3 class="wp-block-heading">Step 5 &#8211; Install Prometheus stack</h3>



<p>Now, set up the monitoring stack that provides complete observability for&nbsp;<strong>application-level&nbsp;</strong>(vLLM) and&nbsp;<strong>hardware-level</strong>&nbsp;(GPU) metrics:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="763" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg" alt="" class="wp-image-30871" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1024x763.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-300x223.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-768x572.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture-1536x1144.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/monitoring-architecture.jpg 1673w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Monitoring architecture</em></figcaption></figure>



<h4 class="wp-block-heading">1. Add helm repository and create namespace</h4>



<p>Add Prometheus helm repo:</p>



<pre class="wp-block-code"><code class="">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts<br>helm repo update</code></pre>



<p>Then, create the <code><strong><mark class="has-inline-color has-ast-global-color-0-color">monitoring</mark></strong></code> Namespace.</p>



<pre class="wp-block-code"><code class="">kubectl create namespace monitoring</code></pre>



<h4 class="wp-block-heading">2. Create Prometheus deployment configuration and installation</h4>



<p>First, create <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/prometheus.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">prometheus.yaml</a></strong></code> file.</p>



<p>Install Prometheus stack:</p>



<pre class="wp-block-code"><code class="">helm install prometheus prometheus-community/kube-prometheus-stack \<br>  -n monitoring \<br>  -f prometheus.yaml \<br>  --timeout 10m \<br>  --wait</code></pre>



<p>Now,&nbsp;monitor its installation and wait until the pods are ready:</p>



<pre class="wp-block-code"><code class="">kubectl get pods -n monitoring -w</code></pre>



<p>If all pods are running successfully, you can proceed to the next step.</p>



<h4 class="wp-block-heading">3. Check that the installation is operational</h4>



<p>First, access Grafana in the background:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &amp;</code></pre>



<p>Test Grafana health:</p>



<pre class="wp-block-code"><code class="">curl -s http://localhost:3000/api/health | jq</code></pre>



<pre class="wp-block-preformatted"><code>{<br>  "database": "ok",<br>  "version": "12.3.3",<br>  "commit": "2a14494b2d6ab60f860d8b27603d0ccb264336f6"<br>}</code></pre>



<p>You can now access Grafana locally via <strong><a href="http://localhost:3000" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>http://localhost:3000</code></a></strong>. Log in with:</p>



<ul class="wp-block-list">
<li>Login: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">admin</mark></strong></code></li>



<li>Password: <code><strong><mark style="color:#cf2e2e" class="has-inline-color">Admin123!vLLM</mark></strong></code></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="518" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png" alt="" class="wp-image-30804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-1024x518.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-300x152.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2-768x389.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-2.png 1322w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Well done! You can now proceed to the configuration step.</p>



<h3 class="wp-block-heading">Step 6 &#8211; Configure ServiceMonitors</h3>



<p>ServiceMonitors tell Prometheus which endpoints to scrape for metrics.</p>



<h4 class="wp-block-heading">1. Create vLLM ServiceMonitor</h4>



<p>Retrieve the file from the GitHub repository: <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/vllm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-servicemonitor.yaml</a></strong></code>.</p>



<p>Next, apply and check that the ServiceMonitor <code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm-metrics</mark></strong></code> exists:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f vllm-servicemonitor.yaml<br>kubectl get servicemonitor -n vllm</code></pre>



<h4 class="wp-block-heading">2. Create DCGM ServiceMonitor</h4>



<p>First, create the <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/monitoring/dcgm-servicemonitor.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dcgm-servicemonitor.yaml</a></strong></code> file.</p>



<p>Once again, apply and verify:</p>



<pre class="wp-block-code"><code class="">kubectl apply -f dcgm-servicemonitor.yaml<br>kubectl get servicemonitor -n gpu-operator</code></pre>



<pre class="wp-block-preformatted"><code>gpu-operator                  1d<br>nvidia-dcgm-exporter          1d<br>nvidia-node-status-exporter   1d</code></pre>



<h4 class="wp-block-heading">3. Configure Prometheus for Cross-Namespace discovery</h4>



<p>Apply a patch to allow Prometheus to discover ServiceMonitors in all namespaces.</p>



<pre class="wp-block-code"><code class="">kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type merge -p '{<br>  "spec": {<br>    "serviceMonitorNamespaceSelector": {},<br>    "podMonitorNamespaceSelector": {}<br>  }<br>}'</code></pre>



<p>Now you have to restart Prometheus.</p>



<ol class="wp-block-list">
<li>Delete Prometheus pod to force configuration reload</li>



<li>Wait for Prometheus to restart</li>
</ol>



<pre class="wp-block-code"><code class="">kubectl delete pod prometheus-prometheus-kube-prometheus-prometheus-0 -n monitoring<br><br>kubectl wait --for=condition=Ready \<br>  pod/prometheus-prometheus-kube-prometheus-prometheus-0 \<br>  -n monitoring \<br>  --timeout=180s</code></pre>



<p>Wait about 2 minutes for discovery and finally, verify targets:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n monitoring \<br>  prometheus-prometheus-kube-prometheus-prometheus-0 9090:9090 &amp;</code></pre>



<p>You can open <a href="http://localhost:9090/targets" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong><mark class="has-inline-color has-ast-global-color-0-color">http://localhost:9090/targets</mark></strong></code></a> in your browser and search for:</p>



<ul class="wp-block-list">
<li><code><strong><mark class="has-inline-color has-ast-global-color-0-color">vllm</mark></strong></code></li>



<li><strong><code><mark class="has-inline-color has-ast-global-color-0-color">dcgm</mark></code></strong></li>
</ul>



<p>Note that the expected targets are: </p>



<ul class="wp-block-list">
<li>serviceMonitor/vllm/vllm-metrics/0   (2/2 UP)</li>



<li>serviceMonitor/gpu-operator/nvidia-dcgm-exporter/0 (2/2 UP)</li>
</ul>
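


<p>You can also confirm that data is flowing with a direct query against the port-forwarded Prometheus API, using one of the DCGM metrics listed earlier:</p>



<pre class="wp-block-code"><code class=""># expect one result per GPU node<br>curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result'</code></pre>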



<h3 class="wp-block-heading">Step 7 &#8211; Create Grafana dashboards</h3>



<p>In this final step, the goal is to create two Grafana dashboards: one tracking the software side with vLLM metrics, and one tracking the hardware metrics that monitor GPU consumption and the system.</p>



<h4 class="wp-block-heading">1. vLLM application metrics</h4>



<p>The dashboard provides insights into vLLM application performance, request handling, and resource utilisation based on the following metrics:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>vllm:request_success_total</code></td><td>Counter</td><td>Total successful requests</td><td>count</td><td>Request Rate, Total Requests</td></tr><tr><td><code>vllm:num_requests_running</code></td><td>Gauge</td><td>Requests currently being processed</td><td>count</td><td>Queue Depth, Active Requests</td></tr><tr><td><code>vllm:num_requests_waiting</code></td><td>Gauge</td><td>Requests waiting in queue</td><td>count</td><td>Queue Depth, Queued Requests</td></tr><tr><td><code>vllm:time_to_first_token_seconds</code></td><td>Histogram</td><td>Latency until first token generated</td><td>seconds</td><td>TTFT P50/P95/P99</td></tr><tr><td><code>vllm:e2e_request_latency_seconds</code></td><td>Histogram</td><td>Total end-to-end latency</td><td>seconds</td><td>E2E Latency P50/P95/P99</td></tr><tr><td><code>vllm:generation_tokens_total</code></td><td>Counter</td><td>Total tokens generated (output)</td><td>count</td><td>Token Generation Rate, Throughput</td></tr><tr><td><code>vllm:prompt_tokens_total</code></td><td>Counter</td><td>Total prompt tokens (input)</td><td>count</td><td>Token Generation Rate, Avg Tokens</td></tr><tr><td><code>vllm:kv_cache_usage_perc</code></td><td>Gauge</td><td>GPU KV cache utilization</td><td>0-1 (0-100%)</td><td>KV Cache Usage</td></tr><tr><td><code>vllm:prefix_cache_hits_total</code></td><td>Counter</td><td>Number of prefix cache hits</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:prefix_cache_queries_total</code></td><td>Counter</td><td>Number of prefix cache queries</td><td>count</td><td>Cache Hit Rate</td></tr><tr><td><code>vllm:request_queue_time_seconds</code></td><td>Histogram</td><td>Time spent waiting in queue</td><td>seconds</td><td>Request Queue Time</td></tr><tr><td><code>vllm:request_prefill_time_seconds</code></td><td>Histogram</td><td>Prefill phase time</td><td>seconds</td><td>Prefill Time</td></tr><tr><td><code>vllm:request_decode_time_seconds</code></td><td>Histogram</td><td>Decode phase time</td><td>seconds</td><td>Decode Time</td></tr><tr><td><code>vllm:inter_token_latency_seconds</code></td><td>Histogram</td><td>Latency between each token</td><td>seconds</td><td>Inter-Token Latency</td></tr><tr><td><code>vllm:num_preemptions_total</code></td><td>Counter</td><td>Number of preemptions (OOM)</td><td>count</td><td>Preemptions</td></tr><tr><td><code>vllm:prompt_tokens_cached_total</code></td><td>Counter</td><td>Prompt tokens cached</td><td>count</td><td>Cached Tokens</td></tr><tr><td><code>vllm:request_prompt_tokens</code></td><td>Histogram</td><td>Prompt size distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:request_generation_tokens</code></td><td>Histogram</td><td>Generated tokens distribution</td><td>count</td><td>(Table)</td></tr><tr><td><code>vllm:iteration_tokens_total</code></td><td>Histogram</td><td>Tokens per iteration</td><td>count</td><td>(Advanced analysis)</td></tr></tbody></table></figure>



<p>This <strong>vLLM Grafana dashboard</strong> is composed of 23 panels:</p>






<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Nombre</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>12</td><td>Request Rate, Queue Depth, TTFT, E2E Latency, Token Gen, Cache Usage, Cache Hit, Queue Time, Prefill/Decode, Inter-Token, Preemptions, Avg Tokens</td></tr><tr><td><strong>Stat</strong></td><td>10</td><td>Throughput, TTFT P95, Active Req, Queued Req, Cache Hit Rate, Cache Usage, Total Req, Total Tokens, Cached Tokens, Preemptions</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Pod Performance</td></tr></tbody></table></figure>



<p>Now create the dashboard using <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/vllm-app-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vllm-app-dashboard.json</a></strong></code>. Then, launch:</p>



<pre class="wp-block-code"><code class="">echo "Importing vLLM application dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @vllm-app-dashboard.json | jq '.status, .url'</code></pre>



<p>Next, you can access the vLLM dashboard and follow metrics in real time:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png" alt="" class="wp-image-30858" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-3.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Tracking hardware consumption is just as essential for comprehensive monitoring.</p>



<h4 class="wp-block-heading">2. GPU hardware metrics</h4>



<p>Take advantage of the most useful DCGM metrics to check both the functioning and consumption of your hardware resources:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Type</th><th>Description</th><th>Unit</th><th>Normal Thresholds</th><th>Dashboard Usage</th></tr></thead><tbody><tr><td><code>DCGM_FI_DEV_GPU_UTIL</code></td><td>Gauge</td><td>GPU utilization (compute)</td><td>% (0-100)</td><td>70-95% optimal</td><td>GPU Utilization</td></tr><tr><td><code>DCGM_FI_DEV_GPU_TEMP</code></td><td>Gauge</td><td>GPU temperature</td><td>°C</td><td>&lt; 85°C normal</td><td>GPU Temperature</td></tr><tr><td><code>DCGM_FI_DEV_FB_USED</code></td><td>Gauge</td><td>VRAM used</td><td>MB</td><td>Variable by model</td><td>GPU Memory Used</td></tr><tr><td><code>DCGM_FI_DEV_FB_FREE</code></td><td>Gauge</td><td>VRAM free</td><td>MB</td><td>&gt; 2GB recommended</td><td>GPU Memory Free</td></tr><tr><td><code>DCGM_FI_DEV_POWER_USAGE</code></td><td>Gauge</td><td>Power consumption</td><td>Watts</td><td>&lt; 300W (L40S)</td><td>GPU Power Usage</td></tr><tr><td><code>DCGM_FI_DEV_SM_CLOCK</code></td><td>Gauge</td><td>GPU clock speed (compute)</td><td>MHz</td><td>Variable</td><td>GPU Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_MEM_CLOCK</code></td><td>Gauge</td><td>Memory clock speed</td><td>MHz</td><td>Variable</td><td>Memory Clock Speed</td></tr><tr><td><code>DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL</code></td><td>Counter</td><td>Total NVLink bandwidth</td><td>bytes/s</td><td>(If multi-GPU)</td><td>NVLink Bandwidth</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_TX_BYTES</code></td><td>Counter</td><td>PCIe data transmitted</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe TX</td></tr><tr><td><code>DCGM_FI_DEV_PCIE_RX_BYTES</code></td><td>Counter</td><td>PCIe data received</td><td>bytes</td><td>(I/O monitoring)</td><td>PCIe RX</td></tr><tr><td><code>DCGM_FI_DEV_ECC_DBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC double-bit errors</td><td>count</td><td>0 ideal</td><td>(Health check)</td></tr><tr><td><code>DCGM_FI_DEV_ECC_SBE_VOL_TOTAL</code></td><td>Counter</td><td>ECC single-bit errors</td><td>count</td><td>&lt; 10/day acceptable</td><td>(Health check)</td></tr></tbody></table></figure>



<p>This&nbsp;<strong>hardware Grafana dashboard</strong>&nbsp;is composed of 13 panels with GPU hardware and system metrics. A detailed view is also available for GPU utilisation (%), temperature (°C), VRAM (GB) and power (W).</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Count</th><th>Panels</th></tr></thead><tbody><tr><td><strong>Timeseries</strong></td><td>8</td><td>GPU Util, GPU Mem, GPU Temp, GPU Power, CPU Usage, RAM Usage, Network I/O, Disk I/O</td></tr><tr><td><strong>Stat</strong></td><td>4</td><td>Avg GPU Util, Avg GPU Temp, Total GPU Mem, Total GPU Power</td></tr><tr><td><strong>Table</strong></td><td>1</td><td>Hardware Status</td></tr></tbody></table></figure>



<p>Retrieve <code><strong><a href="https://github.com/ovh/public-cloud-examples/blob/main/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/grafana-dashboards/hardware-dashboard.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">hardware-dashboard.json</a></strong></code> and load it as follows:</p>



<pre class="wp-block-code"><code class="">echo "Importing hardware dashboard..."<br>curl -X POST \<br>  'http://localhost:3000/api/dashboards/db' \<br>  -H 'Content-Type: application/json' \<br>  -u 'admin:Admin123!vLLM' \<br>  -d @hardware-dashboard.json | jq '.status, .url'</code></pre>



<p>Finally, track resource consumption using this hardware dashboard:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="686" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png" alt="" class="wp-image-30859" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-1024x686.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4-768x514.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/image-4.png 1230w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congratulations! Everything is working. You can now test your model and track the various metrics in real time.</p>



<h3 class="wp-block-heading">Step 8 &#8211; LLM testing and performance tracking</h3>



<p>Start by installing Python dependencies:</p>



<pre class="wp-block-code"><code class="">pip3 install openai tqdm</code></pre>



<p>Set <code><strong><mark class="has-inline-color has-ast-global-color-0-color">APP_URL</mark></strong></code> to your vLLM service external IP (the IP shown in the script is only an example) and launch the performance test using the following <a href="https://github.com/ovh/public-cloud-examples/blob/ep-vllm-deployment-observability-mks/containers-orchestration/managed-kubernetes/gpu-cluster-for-vllm-deployment-and-observability/llm-inference-performance-test.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code><strong>Python code</strong></code></a>:</p>



<pre class="wp-block-code"><code class="">import time<br>import threading<br>import random<br>from statistics import mean<br>from openai import OpenAI<br>from tqdm import tqdm<br><br>APP_URL = "http://94.23.185.22/v1"<br>MODEL = "qwen3-vl-8b"<br><br>CONCURRENT_WORKERS = 500          # concurrency<br>REQUESTS_PER_WORKER = 10<br>MAX_TOKENS = 200                  # generation pressure<br><br># some random prompts<br>SHORT_PROMPTS = [<br>    "Summarize the theory of relativity.",<br>    "Explain what a transformer model is.",<br>    "What is Kubernetes autoscaling?"<br>]<br><br>MEDIUM_PROMPTS = [<br>    "Explain how attention mechanisms work in transformer-based models, including self-attention and multi-head attention.",<br>    "Describe how vLLM manages KV cache and why it impacts inference performance."<br>]<br><br>LONG_PROMPTS = [<br>    "Write a very detailed technical explanation of how large language models perform inference, "<br>    "including tokenization, embedding lookup, transformer layers, attention computation, KV cache usage, "<br>    "GPU memory management, and how batching affects latency and throughput. Use examples.",<br>]<br><br>PROMPT_POOL = (<br>    SHORT_PROMPTS * 2 +<br>    MEDIUM_PROMPTS * 4 +<br>    LONG_PROMPTS * 6    # bias toward long prompts<br>)<br><br># openai compliance<br>client = OpenAI(<br>    base_url=APP_URL,<br>    api_key="foo"<br>)<br><br># basic metrics<br>latencies = []<br>errors = 0<br>lock = threading.Lock()<br><br># worker<br>def worker(worker_id):<br>    global errors<br>    for _ in range(REQUESTS_PER_WORKER):<br>        prompt = random.choice(PROMPT_POOL)<br><br>        start = time.time()<br>        try:<br>            client.chat.completions.create(<br>                model=MODEL,<br>                messages=[{"role": "user", "content": prompt}],<br>                max_tokens=MAX_TOKENS,<br>                temperature=0.7,<br>            )<br>            elapsed = time.time() - start<br><br>            with lock:<br>                latencies.append(elapsed)<br><br>        except Exception as e:<br>            with lock:<br>                errors += 1<br><br># run<br>threads = []<br>start_time = time.time()<br><br>print("\n-&gt; STARTING PERFORMANCE TEST:")<br>print(f"Concurrency: {CONCURRENT_WORKERS}")<br>print(f"Total requests: {CONCURRENT_WORKERS * REQUESTS_PER_WORKER}")<br><br>for i in range(CONCURRENT_WORKERS):<br>    t = threading.Thread(target=worker, args=(i,))<br>    t.start()<br>    threads.append(t)<br><br>for t in threads:<br>    t.join()<br><br>total_time = time.time() - start_time<br><br># results<br>print("\n-&gt; BENCH RESULTS:")<br>print(f"Total requests sent: {len(latencies) + errors}")<br>print(f"Successful requests: {len(latencies)}")<br>print(f"Errors: {errors}")<br>print(f"Total wall time: {total_time:.2f}s")<br><br>if latencies:<br>    print(f"Avg latency: {mean(latencies):.2f}s")<br>    print(f"Min latency: {min(latencies):.2f}s")<br>    print(f"Max latency: {max(latencies):.2f}s")<br>    print(f"Throughput: {len(latencies)/total_time:.2f} req/s")</code></pre>



<p>Returning:</p>



<pre class="wp-block-preformatted"><code>-&gt; STARTING PERFORMANCE TEST:</code><br><code>Concurrency: 500<br>Total requests: 5000</code><br><code><br>-&gt; BENCH RESULTS:<br>Total requests sent: 5000<br>Successful requests: 5000<br>Errors: 0<br>Total wall time: 225.54s<br>Avg latency: 21.45s<br>Min latency: 6.06s<br>Max latency: 25.19s<br>Throughput: 22.17 req/s</code></pre>



<p>Don&#8217;t forget to track GPU and vLLM metrics in your Grafana dashboards!</p>



<h2 class="wp-block-heading">Conslusion</h2>



<p>This reference architecture demonstrates a<strong>&nbsp;vLLM deployment on OVHcloud Managed Kubernetes Service (MKS)</strong>&nbsp;with comprehensive GPU monitoring. Benefits include:</p>



<ul class="wp-block-list">
<li><strong>High Performance</strong>: GPU-accelerated inference with L40S</li>



<li><strong>Scalability</strong>: Kubernetes-native, horizontal scaling-ready</li>



<li><strong>Reliability</strong>: Health checks, auto-restart, monitoring</li>



<li><strong>API Compatibility</strong>: OpenAI-compatible endpoints</li>



<li><strong>Multimodality</strong>: Vision &amp; text capabilities</li>



<li><strong>Full stack monitoring</strong>: Complete vLLM application and hardware dashboards</li>
</ul>



<h2 class="wp-block-heading">Going Further</h2>



<p>Your current architecture is&nbsp;<strong>functional</strong>. However, if desired, <strong>it can be hardened into a full production-ready&nbsp;solution.</strong></p>



<p><strong>Wish to take production hardening a step further?</strong> Consider the following enhancements:</p>



<ol class="wp-block-list">
<li><strong>Authentication &amp; authorization</strong>
<ul class="wp-block-list">
<li>vLLM API authentication (see the sketch after this list)</li>



<li>Grafana authentication</li>



<li>Prometheus security</li>
</ul>
</li>



<li><strong>High availability &amp; load balancing</strong>
<ul class="wp-block-list">
<li>Grafana high availability with multiple replicas and shared storage</li>



<li>Prometheus high availability</li>



<li>vLLM Horizontal Pod Autoscaling (HPA) based on custom metrics</li>
</ul>
</li>



<li><strong>Data persistence &amp; backup</strong>
<ul class="wp-block-list">
<li>Prometheus long-term storage with persistent storage</li>



<li>Grafana Dashboard Backup</li>
</ul>
</li>



<li><strong>Observability enhancements</strong>
<ul class="wp-block-list">
<li>Distributed tracing by adding OpenTelemetry for request tracing</li>



<li>Alerting rules with production-ready alert rules</li>
</ul>
</li>
</ol>
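

<p>As an illustration of the first point, vLLM&#8217;s OpenAI-compatible server can require an API key (for instance through its <code>--api-key</code> option or the <code>VLLM_API_KEY</code> environment variable). A minimal sketch of an authenticated client call, with placeholder values instead of the IP and dummy key used in the benchmark above, could look like this:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">from openai import OpenAI

# Placeholders: adapt the URL, key and model name to your own deployment.
client = OpenAI(
    base_url="http://&lt;YOUR_LOAD_BALANCER_IP&gt;/v1",
    api_key="&lt;YOUR_VLLM_API_KEY&gt;",  # must match the key configured server-side
)

response = client.chat.completions.create(
    model="qwen3-vl-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(response.choices[0].message.content)</code></pre>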



<p></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-deploying-a-vision-language-model-with-vllm-on-ovhcloud-mks-for-high-performance-inference-and-full-observability%2F&amp;action_name=Reference%20Architecture%3A%20Deploying%20a%20vision-language%20model%20with%20vLLM%20on%20OVHcloud%20MKS%20for%20high%20performance%20inference%20and%20full%20observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Pushing beyond the limits of embedded real-time AI for edge devices</title>
		<link>https://blog.ovhcloud.com/pushing-beyond-the-limits-of-embedded-real-time-ai-for-edge-devices/</link>
		
		<dc:creator><![CDATA[Katya Guez]]></dc:creator>
		<pubDate>Thu, 03 Apr 2025 22:44:40 +0000</pubDate>
				<category><![CDATA[OVHcloud Startup Program]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28560</guid>

					<description><![CDATA[Startup highlight: Interview with Kevin Conley, CEO at Applied Brain Research (ABR) Applied Brain Research (ABR) is a fabless semiconductor company founded by a team from the University of Waterloo’s Centre for Theoretical Neuroscience, under the leadership of Dr. Chris Eliasmith, the Centre’s founding chair, to commercialize brain-inspired AI inference solutions. Can you introduce Applied [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fpushing-beyond-the-limits-of-embedded-real-time-ai-for-edge-devices%2F&amp;action_name=Pushing%20beyond%20the%20limits%20of%20embedded%20real-time%20AI%20for%20edge%20devices&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h4 class="wp-block-heading"><strong><em>Startup highlight:</em> Interview with Kevin Conley, CEO at Applied Brain Research (ABR)</strong></h4>



<p><a href="https://www.appliedbrainresearch.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Applied Brain Research (ABR)</a> is a fabless semiconductor company founded by a team from the University of Waterloo’s Centre for Theoretical Neuroscience, under the leadership of <a href="https://www.linkedin.com/in/chris-eliasmith/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dr. Chris Eliasmith</a>, the Centre’s founding chair, to commercialize brain-inspired AI inference solutions.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><a href="https://www.appliedbrainresearch.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="144" height="52" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/PastedGraphic-4.jpg" alt="" class="wp-image-28553" style="width:222px;height:auto"/></a></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Can you introduce Applied Brain Research and its mission?</strong></p>



<p>ABR&#8217;s mission is to bring advanced time series inference out of data centers by empowering edge devices with advanced time series AI capability. Out of its research, ABR invented the Legendre Memory Unit which has established a new chapter of state space models for advanced time series processing. To enable low power edge devices with advanced capabilities like low latency natural language interfaces, ABR has developed an <a href="https://www.appliedbrainresearch.com/technology" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LMU powered time series processor AI accelerator ASIC </a>that will enter the market this year. Our CEO, <a href="https://www.linkedin.com/in/kconley/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kevin Conley</a>, is a semiconductor vet who has built billion dollar businesses in the past and plans to do the same with ABR’s cutting edge technology.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="885" height="441" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/Screenshot-2025-04-03-at-5.12.56-PM.png" alt="" class="wp-image-28552" style="width:634px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/Screenshot-2025-04-03-at-5.12.56-PM.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Screenshot-2025-04-03-at-5.12.56-PM-300x149.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Screenshot-2025-04-03-at-5.12.56-PM-768x383.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>What challenges did ABR face before partnering with OVHcloud?</strong></p>



<p>Our business depends on our ability to train advanced AI models for edge device applications, which creates technical and financial challenges. This requires the <strong>availability of <a href="https://www.ovhcloud.com/en-ca/public-cloud/gpu/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">advanced GPUs</a> for network training</strong>. Successfully optimizing neural networks for our chip is critical to our success. Our main challenges have been budgetary and scaling our R&amp;D efforts. We decided to explore cloud solutions because investment in our own training capability would be prohibitive both from a cost and management perspective.</p>



<p><strong>How did OVHcloud and the Startup Program help you overcome these challenges?</strong></p>



<p>The <a href="https://startup.ovhcloud.com/en-ca/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Startup Program</a> has let us explore different ways of leveraging OVHcloud’s resources in order to train next generation models, and fine-tune them in ways that would be very difficult to do in house. It lets us quickly expand or <strong>focus our efforts without having all of the infrastructure headaches</strong> that come along with that typically.</p>



<p><strong>Which OVHcloud services or features do you use, and how do they stand out from other solutions?</strong></p>



<p>We use the <a href="https://www.ovhcloud.com/en-ca/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Public Cloud service from OVHcloud</a>. Compared to other services from Google, Amazon, etc., OVHCloud provides the <strong>most cost-effective solution per FLOP</strong>.</p>



<p><strong>How has OVHcloud’s support helped you evolve your infrastructure to meet the demands of your business?</strong></p>



<p>The nature of our AI development cycle means that our usage of AI training hardware fluctuates over time. OVHcloud&#8217;s <a href="https://www.ovhcloud.com/en-ca/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">public cloud</a> allows us to <strong>dynamically scale up and down our AI training hardware </strong>in a cost effective manner.</p>



<p><strong>What tangible results have you achieved since collaborating with OVHcloud?</strong></p>



<p>We have pushed our networks to be the smallest possible ASR networks with the highest accuracy. This was possible because we could do hyperparameter searching using OVHcloud’s infrastructure. Specifically we’ve gotten <strong>less than 5% word error rates on full vocabulary speech transcription with a tiny 8 million parameter quantized network</strong>. We have built other state-of-the-art networks for TTS and Voice Control using the same infrastructure, but we haven’t announced their general availability yet on our platform.</p>



<p>Providing these networks as starting points for customers to use our chip greatly reduces the barrier to entry for taking advantage of our technologies. Broadening the pre-trained library available to customers will only improve that going forward, and this will be much more efficient using OVHcloud than doing it in house.</p>



<p><strong>What are your ambitions for the future of your startup, and how do you see it evolving within the cloud ecosystem? What future challenges do you foresee?</strong></p>



<p>We plan to offer our no code environment to all of our customers, which will effectively scale as we grow, and allow customers to build, train, and deploy all manner of models on our chip. This SaaS offering will be crucial as we deploy our chip to many markets. Our fundamental advantage is for <strong>time series processing</strong>, i.e. problems where the order of the data in time is important for making decisions. This includes everything from <a href="https://www.appliedbrainresearch.com/applications" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">speech and language processing</a> to heartbeat monitoring and fall detection. As a result, we are building a versatile, cloud-hosted development environment that will require rapid scalability.</p>



<p><strong>What advice would you give to other growth-stage startups considering the cloud or joining a support program?</strong></p>



<p>Probably the most important piece of advice is to <strong>take advantage of everything that’s provided</strong>. This requires some commitment on the side of the startup, but going to the meetings, asking questions, and leveraging the resources is the only way to get the most out of the program.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<figure class="wp-block-image size-large"><a href="https://startup.ovhcloud.com/en-ca/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><img loading="lazy" decoding="async" width="1024" height="341" src="https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1-1024x341.jpg" alt="" class="wp-image-28562" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1-1024x341.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1-300x100.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1-768x256.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/04/Startup-Program-V3-–-1.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p><a href="https://www.appliedbrainresearch.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Applied Brain Research</a>’s journey with OVHcloud, joining the Startup Program then AI Accelerator, highlights how a startup can make the most of available resources to overcome challenges, achieve sustainable growth, and scale. If you’re a startup looking to transform your business, we encourage you to join the <strong><a href="https://startup.ovhcloud.com/en/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud Startup Program</a></strong> or contact OVHcloud to discover how our solutions can support your journey!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fpushing-beyond-the-limits-of-embedded-real-time-ai-for-edge-devices%2F&amp;action_name=Pushing%20beyond%20the%20limits%20of%20embedded%20real-time%20AI%20for%20edge%20devices&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Enhancing Customer Service with Interactive Avatars</title>
		<link>https://blog.ovhcloud.com/enhancing-customer-service-with-interactive-avatars/</link>
		
		<dc:creator><![CDATA[Leonard Pommereau]]></dc:creator>
		<pubDate>Thu, 20 Mar 2025 14:24:19 +0000</pubDate>
				<category><![CDATA[OVHcloud Startup Program]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[Scaleups]]></category>
		<category><![CDATA[Startup Program]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28147</guid>

					<description><![CDATA[Startup highlight: Interview with Fatma Chelly, Marketing Manager at Jumbo Mana Jumbo Mana is a deep-tech startup founded in 2022. Specializing in Agentic AI, the company creates conversational solutions, including avatars and digital assistants that provide precise, fast, reliable and engaging answers. Can you introduce Jumbo Mana and its mission? Jumbo Mana’s solution distinguishes by [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fenhancing-customer-service-with-interactive-avatars%2F&amp;action_name=Enhancing%20Customer%20Service%20with%20Interactive%20Avatars&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h4 class="wp-block-heading"><strong><em>Startup highlight:</em> Interview with Fatma Chelly, Marketing Manager at Jumbo Mana</strong></h4>



<p><a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jumbo Mana</a> is a deep-tech startup founded in 2022. Specializing in Agentic AI, the company creates conversational solutions, including avatars and digital assistants that provide precise, fast, reliable and engaging answers.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Can you introduce Jumbo Mana and its mission?</strong></p>



<p><a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jumbo Mana</a>’s solution distinguishes by a high level of adaptability making it easy to integrate for any customer, as well as a strong degree of accuracy in its answers. Jumbo Mana aims to offer (i) alternative solutions to sectors suffering from low-skilled labour recruitment issues (airports, subways, hospitality, education, etc.) and (ii) powerful digital assistants providing true and efficient answers to users with a high level of comprehension.</p>



<p>Jumbo Mana is at the forefront of the rapidly growing Agentic AI market and already delivers high-accuracy solutions that drive engagement across a wide range of industries.</p>



<p></p>



<figure class="wp-block-image aligncenter size-full is-resized"><a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="750" height="328" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/JumboMana-assistant.png" alt="" class="wp-image-28151" style="width:544px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/JumboMana-assistant.png 750w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/JumboMana-assistant-300x131.png 300w" sizes="auto, (max-width: 750px) 100vw, 750px" /></a></figure>



<p></p>



<p>Our guiding philosophy is to create <strong>technology that serves humanity</strong>. We envision developing a <strong>companion assistant</strong> that adapts to individual needs, simplifies daily life, and enhances human interaction. Our ultimate goal is to become the <strong>global leader in interactive AI avatars</strong>, combining technological innovation with a commitment to improving quality of life.</p>



<p><strong>What challenges did Jumbo Mana face before partnering with OVHcloud?</strong></p>



<p>Jumbo Mana primarily faced technical and operational challenges. One of the key issues was <strong>scalability</strong>: the rapid growth of our SaaS platform&#8217;s user base required an infrastructure capable of handling high traffic volumes while maintaining consistent performance and ensuring a seamless user experience. Additionally, <strong>security</strong> was a critical concern, as we needed to protect sensitive client data from cyber threats and preserve customer trust. Finally, <strong>data privacy and compliance</strong> posed a significant challenge, as strict adherence to the GDPR (General Data Protection Regulation) was mandatory, necessitating solutions that guaranteed data privacy and ensured full compliance with European regulations.</p>



<p>Jumbo Mana chose OVHcloud solutions as the ideal way to address our challenges. <strong>Scalability</strong> was ensured through cloud platforms that offer on-demand resources, allowing us to easily adjust to traffic fluctuations without requiring significant upfront hardware investments. <strong>Enhanced security</strong> was another key advantage, with OVHcloud providing advanced built-in security protocols and tools to protect the platform from evolving threats. In terms of <strong>regulatory compliance</strong>, OVHcloud’s adherence to GDPR standards and expertise in data privacy gave Jumbo Mana confidence in meeting legal obligations without compromise. Lastly, the <strong>cost-effectiveness</strong> of cloud solutions offered a more predictable and manageable cost structure compared to maintaining physical infrastructure, enabling better resource allocation.</p>



<p><strong>How did OVHcloud and the Startup Program help you overcome these challenges?</strong></p>



<p>OVHcloud’s wide array of services provided a strong foundation for Jumbo Mana’s technical architecture. The <a href="https://www.ovhcloud.com/en/public-cloud/gpu/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GPU cluster</a> allowed us to deploy our Triton server, which hosts and serves our AI models, enabling real-time performance and efficient inference. With OVHcloud’s <a href="https://www.ovhcloud.com/en/public-cloud/prices/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">competitive pricing</a>, we were able to deploy enough replicas of our AI services to ensure a robust infrastructure capable of handling high workloads. This not only increased the reliability of our platform but also allowed us to scale seamlessly as our user base grew. Moreover, the ease of obtaining additional quotas from OVHcloud made it simple to scale our infrastructure quickly, supporting the platform’s growth without operational bottlenecks.</p>



<p><strong>Which OVHcloud services or features do you use, and how do they stand out from other solutions?</strong></p>



<p>For the backend, we deployed the entire stack of backend microservices, APIs, and databases on OVHcloud&#8217;s<a href="https://www.ovhcloud.com/en/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> Kubernetes-managed clusters</a>. This streamlined the deployment process and allowed us to efficiently manage scaling, updates, and monitoring of all services. Kubernetes also provided the flexibility needed to adapt our deployments to fluctuating workloads, ensuring consistent performance for our users.</p>



<p><strong>How has OVHcloud&#8217;s support helped you evolve your infrastructure to meet the demands of your business?</strong></p>



<p>One of the key highlights of our collaboration was OVHcloud’s active and responsive customer support. Whenever challenges arose, their team provided quick and effective solutions, ensuring that any issues were resolved with minimal downtime. The <a href="https://startup.ovhcloud.com/en/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Startup Program</a> further complemented this with infrastructure credits and access to a vibrant ecosystem of partners and startups, helping us optimize our architecture for growth and performance.</p>



<figure class="wp-block-image aligncenter size-large"><a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="1024" height="465" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2-1024x465.png" alt="" class="wp-image-28154" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2-1024x465.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2-300x136.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2-768x349.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/diagram2.png 1187w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<p>With OVHcloud’s cost-effective solutions, scalability, and excellent support, <a href="https://jumbomana.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jumbo Mana</a> was able to overcome its challenges and establish a high-performing, reliable infrastructure. This partnership has been instrumental in enabling us to focus on our mission of delivering innovative Agentic AI solutions, while OVHcloud handles the complexities of infrastructure management.</p>



<p><strong>What tangible results have you achieved since collaborating with OVHcloud?</strong></p>



<p>Since partnering with OVHcloud, we have achieved significant advancements in the performance and scalability of our platform. Thanks to OVHcloud’s reliable infrastructure and robust <a href="https://www.ovhcloud.com/en/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes-managed clusters</a>, we have maintained zero downtime, even during high traffic periods or unexpected surges in demand. This reliability is critical for delivering a seamless user experience and meeting the expectations of our clients in industries like airports and education.</p>



<p>Our platform’s capacity has <strong>tripled</strong>, now supporting three times the number of concurrent users. This scalability is driven by the ability to dynamically allocate resources and deploy additional replicas as needed, enabling us to handle rapid growth across sectors like airports and education.</p>



<p>We’ve also <strong>doubled our GPU capacity while staying within the same budget</strong>, thanks to OVHcloud’s competitive GPU pricing. This enhancement has significantly boosted our ability to handle high-performance AI workloads, ensuring fast and efficient service delivery for our clients.</p>



<p><strong>What are your ambitions for the future of your startup, and how do you see it evolving within the cloud ecosystem? What future challenges do you foresee?</strong></p>



<p>Our future is focused on <strong>international expansion</strong> and significantly growing our client base. By leveraging OVHcloud’s scalable infrastructure and global presence, we aim to adapt to the needs of diverse markets while maintaining high performance and compliance. </p>



<p>As we grow, we anticipate challenges like <strong>global scalability</strong>, <strong>regulatory compliance across regions</strong>, and <strong>cost management at scale</strong>. The flexibility and innovation provided by OVHcloud’s ecosystem will play a critical role in helping us overcome these obstacles and achieve sustainable success.</p>



<p><strong>What advice would you give to other growth-stage startups considering the cloud or joining a support program?</strong></p>



<p>Pick a cloud that supports your growth with <strong>scalability, security, and compliance</strong>. Make the most of support programs to grow faster and stay focused on your business.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<figure class="wp-block-image size-large"><a href="https://startup.ovhcloud.com/en/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="1024" height="253" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-1024x253.png" alt="" class="wp-image-28155" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-1024x253.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-300x74.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-768x190.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1-1536x379.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/FF-banner-1.png 1870w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<p>Jumbo Mana&#8217;s journey with OVHcloud highlights how the right cloud partnership can help startups overcome challenges, achieve sustainable growth, and scale globally. If you’re a startup looking to transform your business, we encourage you to join the <strong><a href="https://startup.ovhcloud.com/en/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Startup Program</a></strong> or contact OVHcloud to discover how our solutions can support your journey!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fenhancing-customer-service-with-interactive-avatars%2F&amp;action_name=Enhancing%20Customer%20Service%20with%20Interactive%20Avatars&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Five ways to develop sovereign, sustainable AI solutions</title>
		<link>https://blog.ovhcloud.com/five-ways-to-develop-sovereign-sustainable-ai-solutions/</link>
		
		<dc:creator><![CDATA[Cezary Skarzynski]]></dc:creator>
		<pubDate>Mon, 27 Jan 2025 15:07:21 +0000</pubDate>
				<category><![CDATA[OVHcloud Startup Program]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Data Sovereignty]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[Startup Program]]></category>
		<category><![CDATA[Sustainability]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28039</guid>

					<description><![CDATA[Now that organisations understand AI and what it can achieve, businesses around the world are focusing on how to build it responsibly. Three of the five main themes at the Paris AI Action Summit examine the need for responsible AI, with separate streams on trust, public interest and good governance. These themes are not simple. [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffive-ways-to-develop-sovereign-sustainable-ai-solutions%2F&amp;action_name=Five%20ways%20to%20develop%20sovereign%2C%20sustainable%20AI%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Now that organisations understand AI and what it can achieve, businesses around the world are focusing on how to build it responsibly. Three of the five main themes at the <a href="https://www.elysee.fr/en/sommet-pour-l-action-sur-l-ia" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris AI Action Summit</a> examine the need for responsible AI, with separate streams on trust, public interest and good governance.</p>



<p>These themes are not simple. In addition to the core function of AI tools – for example, considering what an AI app does, how it does it, and whether bias is present – most businesses are starting to realise that they need to consider the deeper ‘AI supply chain’.</p>



<p>This is not just altruistic. A number of LLM tools are currently facing the risk of lawsuits for copyright infringement, because they may have been trained without due content permission. AI tools that present biased results are quickly exposed in the press, leading to reputational damage and a loss of customer trust. Some countries also have legislation permitting data usage for economic intelligence purposes – but in another region, this may represent a data breach. AI has also received negative publicity for ‘running hot’ and consuming large amounts of energy and water in datacenters.</p>



<p>However, AI can also be a tremendous force for good – if handled correctly. So, what should businesses be thinking about so that they get the most from AI, without incurring undue commercial or reputational risk?</p>



<h5 class="wp-block-heading"><strong>1- Consider Sovereignty from the Start</strong></h5>



<p>Understand your data ‘supply chain’ from the very beginning of the process. For example, if you’re using an external LLM for a chatbot, where was this developed? Which data was it trained on, and was this data acquired ethically?</p>



<p>“AI can often be a black box when it comes to processing data,” says Lex Avstreikh, Strategy Lead for Stockholm-based AI firm Hopsworks. “It’s far too complex to show how the system arrived at any one decision. But if you can show people the inputs and the outputs, then that goes a long way to building transparency and trust.”</p>



<h5 class="wp-block-heading"><strong>2- Plan for a Sovereign Future</strong></h5>



<p>It’s important to think about where data will be during its future lifecycle – will you be running in an external datacenter, and where will data be in transit and at rest? Where are the headquarters of the datacenter company in question and what does this mean from a regulatory and handling perspective? Perhaps most importantly, will your customers be happy with all of these arrangements?</p>



<p>This was the decision journey faced by Swedish AI firm Ebbot. In July 2020, the Data Protection Commissioner v. Facebook Ireland case, commonly referred to as Schrems II, resulted in the Court of Justice of the European Union (CJEU) issuing a decision that tightened the requirements around data protection and processing. Ebbot recognised the importance of data security and compliance and thus made it a priority to store and process all data within the EU.</p>



<h5 class="wp-block-heading"><strong>3- Location, location, location</strong></h5>



<p>Location isn’t just an important sovereignty concern – it’s also crucial to sustainability. Although Scandinavia may have very green energy, it’s easy to forget that many cloud providers will offer geographical ‘computing zones’ rather than defined locations, which can result in a less green footprint. CPU- and GPU-intensive tasks like model training should be run in green energy zones wherever possible, and are rarely latency-dependent; consequently, you can locate them far away if necessary.</p>



<p>When your AI app goes into production, also remember that backup and redundancy are a necessity – but will also increase your carbon footprint. Consider having a ‘low power’ or passive backup if commercially feasible – it will take longer to bring online in the case of emergency, but you’ll be consuming less power.</p>



<h5 class="wp-block-heading"><strong>4- Always Consider Necessity</strong></h5>



<p>A lot of organisations only consider hardware efficiency and power consumption during the development process, but green software is rapidly gaining popularity. Having efficient code which is still fit for purpose can have a huge impact on power consumption, particularly if you’re building an app for very broad use. “We’ll definitely see more efficient and specific LLMs, because they’re absolutely needed,” added Avstreikh.</p>



<p>Although organisations often consider the cost of development through FinOps initiatives, we are also seeing the dawn of GreenOps, which ensures that technology is as green as possible from end to end. To that effect, consider benchmarking the CPU and memory usage of your application, because less hardware-intensive apps are usually less power-hungry, as shown in the sketch below.</p>
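

<p>Such benchmarking does not have to be elaborate. As a rough sketch, a few lines of Python with the <code>psutil</code> library (one option among many profiling tools) can already give you an idea of an application&#8217;s CPU and memory appetite:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import psutil

process = psutil.Process()  # the current application process

# Sample CPU and resident memory usage once per second for a short window.
for _ in range(10):
    cpu = process.cpu_percent(interval=1)         # % of one core over the interval
    rss_mb = process.memory_info().rss / 1024**2  # resident memory in MiB
    print(f"CPU: {cpu:5.1f}%  RSS: {rss_mb:.1f} MiB")</code></pre>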



<h5 class="wp-block-heading"><strong>5- Re-use, recycle</strong></h5>



<p>Developing bespoke code can make sure that it’s as lean and efficient as possible, but it can also use needless computing power to develop. Many technology organisations will offer PaaS offerings that can automate common parts of the application development and deployment process. For example, consider our <a href="https://ovh.commander1.com/c3/?tcs=3810&amp;chn=display&amp;src=partnership&amp;cty_ads=multi&amp;lang_ads=en&amp;cty=US&amp;unvrse=multi&amp;pcat=multi&amp;subtpc=undefinite&amp;tactic=awrns&amp;objv=impressions&amp;site_domain=https://labs.ovhcloud.com&amp;cmp=display_PR_multi_en_US_multi_multi_undefinite_awrns_impressions&amp;crtive=dimg_image_728x90_STN-NE&amp;url=https%3A%2F%2Flabs.ovhcloud.com%2Fen%2Fai-endpoints%2F%3Fat_medium%3Ddisplay%26at_campaign%3Dpartnership%26at_creation%3Ddisplay_PR_multi_en_US_multi_multi_undefinite_awrns_impressions%26at_variant%3Ddimg_image_728x90_STN-NE" data-wpel-link="exclude">AI Endpoints solution</a>, which helps developers to access other AI models, from Bert to Mistral to Llama, all using a simple API.</p>



<p>This is not an easy process, but establishing responsible AI conduct in your organisation’s DNA will avoid complications further down the road, and also show customers that you are handling data – including theirs – in a responsible, secure way. With increasing numbers of organisations tracking not only their Scope 3 emissions, but also their data supply chains in a more comprehensive fashion, sovereignty and sustainability are two clear ‘musts’ for any modern AI company.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<figure class="wp-block-image aligncenter size-large is-resized"><a href="https://startup.ovhcloud.com/en/accelerator/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="1024" height="253" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-1024x253.png" alt="" class="wp-image-28042" style="width:626px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-1024x253.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-300x74.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-768x190.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-1536x379.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner.png 1870w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<p><em>If you’re a startup or scale-up building an AI solution, and would like to work with a sovereign, sustainable cloud provider in turn, you can find more information about OVHcloud – including our cloud credit scheme – on our <a href="https://ovh.commander1.com/c3/?tcs=3810&amp;chn=display&amp;src=partnership&amp;cty_ads=multi&amp;lang_ads=en&amp;cty=US&amp;unvrse=multi&amp;pcat=multi&amp;subtpc=undefinite&amp;tactic=awrns&amp;objv=impressions&amp;site_domain=https://startup.ovhcloud.com&amp;cmp=display_PR_multi_en_US_multi_multi_undefinite_awrns_impressions&amp;crtive=dimg_image_728x90_STN-NE&amp;url=https%3A%2F%2Fstartup.ovhcloud.com%2Fen%2F%3Fat_medium%3Ddisplay%26at_campaign%3Dpartnership%26at_creation%3Ddisplay_PR_multi_en_US_multi_multi_undefinite_awrns_impressions%26at_variant%3Ddimg_image_728x90_STN-NE" data-wpel-link="exclude">startup hub</a>.</em></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffive-ways-to-develop-sovereign-sustainable-ai-solutions%2F&amp;action_name=Five%20ways%20to%20develop%20sovereign%2C%20sustainable%20AI%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to serve LLMs with vLLM and OVHcloud AI Deploy</title>
		<link>https://blog.ovhcloud.com/how-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 29 May 2024 12:22:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMA]]></category>
		<category><![CDATA[LLaMA 3]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Mixtral]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26762</guid>

					<description><![CDATA[In this tutorial, we will learn how to serve Large Language Models (LLMs) using vLLM and the OVHcloud AI Products.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy%2F&amp;action_name=How%20to%20serve%20LLMs%20with%20vLLM%20and%20OVHcloud%20AI%20Deploy&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of serving large language models (LLMs), providing step-by-step instructions</em>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="345" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png" alt="" class="wp-image-25615" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-768x259.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1536x518.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-2048x690.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>



<h3 class="wp-block-heading">Introduction</h3>



<p>In recent years, <strong>large language models</strong> (LLMs) have become increasingly <strong>popular</strong>, with <strong>open-source</strong> models like <em>Mistral</em> and <em>LLaMA</em> gaining widespread attention. In particular, the <em>LLaMA 3</em> model, released on <em>April 18, 2024</em>, is one of today&#8217;s most powerful open-source LLMs.</p>



<p>However, <strong>serving these LLMs can be challenging</strong>, particularly on hardware with limited resources. Indeed, even on expensive hardware, LLMs can be surprisingly slow, with high VRAM utilization and throughput limitations.</p>



<p>This is where<strong><em> </em></strong><em><a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>vLLM</strong></a></em> comes in. <em><strong>vLLM</strong></em> is an <strong>open-source project</strong> that enables <strong>fast and easy-to-use LLM inference and serving</strong>. Designed for optimal performance and resource utilization, <em>vLLM</em> supports a range of <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLM architectures</a> and offers <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">flexible customization options</a>. That&#8217;s why we are going to use it to efficiently deploy and scale our LLMs.</p>



<h3 class="wp-block-heading">Objective</h3>



<p>In this guide, you will discover how to deploy an LLM thanks to <a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em></a> and the <strong><em>AI Deploy</em></strong> <em>OVHcloud</em> solution. This will enable you to benefit from <em>vLLM</em>&#8216;s optimisations and <em>OVHcloud</em>&#8216;s GPU computing resources. Your LLM will then be exposed through a secured API.</p>



<p>🎁 And for those who do not want to bother with the deployment process, <strong>a surprise awaits you at the <a href="#AI-ENDPOINTS">end of the article</a></strong>. We are going to introduce you to our new solution for using LLMs, called <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Endpoints</strong></a>. This product makes it easy to integrate AI capabilities into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it&#8217;s in alpha, it&#8217;s <strong>free</strong>!</p>



<h3 class="wp-block-heading">Requirements</h3>



<p>To deploy your <em>vLLM</em> server, you need:</p>



<ul class="wp-block-list">
<li>An <em>OVHcloud</em> account to access the <a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude"><em>OVHcloud Control Panel</em></a></li>



<li>A <em>Public Cloud</em> project</li>



<li>A <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for the AI Products</a>, related to this <em>Public Cloud</em> project</li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The <em>OVHcloud AI CLI</em></a> installed on your local computer (to interact with the AI products by running commands). </li>



<li><a href="https://www.docker.com/get-started" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a> installed on your local computer, <strong>or</strong> access to a Debian Docker Instance, which is available on the <a href="https://www.ovh.com/manager/public-cloud/" data-wpel-link="exclude"><em>Public Cloud</em></a></li>
</ul>



<p>Once these conditions have been met, you are ready to serve your LLMs.</p>



<h3 class="wp-block-heading">Building a Docker image</h3>



<p>Since the <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>OVHcloud AI Deploy</em></a> solution is based on <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Docker</em></a> images, we will be using a <em>Docker</em> image to deploy our <em>vLLM</em> inference server. </p>



<p>As a reminder, <em>Docker</em> is a platform that allows you to create, deploy, and run applications in containers. <em>Docker</em> containers are standalone and executable packages that include everything needed to run an application (code, libraries, system tools).</p>



<p>To create this <em>Docker</em> image, we will need to write the following <em><strong>Dockerfile</strong></em> into a new folder:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">mkdir my_vllm_image
nano Dockerfile</code></pre>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 🐳 Base image
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 👱 Set the working directory inside the container
WORKDIR /workspace

# 📚 Install missing system packages (git) so we can clone the vLLM project repository
RUN apt-get update &amp;&amp; apt-get install -y git
RUN git clone https://github.com/vllm-project/vllm/

# 📚 Install the Python dependencies
RUN pip3 install --upgrade pip
RUN pip3 install vllm 

# 🔑 Give correct access rights to the OVHcloud user
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace</code></pre>



<p>Let&#8217;s take a closer look at this <em>Dockerfile</em> to understand it:</p>



<ul class="wp-block-list">
<li><strong>FROM</strong>: Specify the base image for our <em>Docker</em> Image. We choose the <em>PyTorch</em> image since it comes with <em>CUDA</em>, <em>CuDNN</em> and <em>torch</em>, which is needed by <em>vLLM</em>. </li>



<li><strong>WORKDIR /workspace</strong>: We set the working directory for the <em>Docker</em> container to <em>/workspace</em>, which is the default folder when we use <em>AI Deploy</em>.</li>



<li><strong>RUN</strong>: Executes commands at build time. Here, we install <em>git</em> so we can clone the <a href="https://github.com/vllm-project/vllm/tree/main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em> repository</a> into the <em>/workspace</em> directory, then upgrade <em>pip</em> to the latest version (to make sure we have access to the latest libraries and dependencies) and install the <em>vLLM</em> library.</li>



<li><strong>ENV</strong> HOME=/workspace: This sets the <em>HOME</em> environment variable to <em>/workspace</em>. This is a requirement to use the <em>OVHcloud</em> AI Products.</li>



<li><strong>RUN chown -R 42420:42420 /workspace</strong>: This changes the owner of the <em>/workspace</em> directory to the user and group with IDs of <em>42420</em> (<em>OVHcloud</em> user). This is also a requirement to use the <em>OVHcloud</em> AI Products.</li>
</ul>



<p>This <em>Dockerfile</em> does not contain a <strong>CMD</strong> instruction and therefore does not launch our <em>vLLM</em> server. Do not worry about that, we will do it directly from <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a>&nbsp;to have more flexibility.</p>



<p>Once your Dockerfile is written, launch the following command to build your image:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker build . -t vllm_image:latest</code></pre>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>Once you have built the Docker image, you will need to push it to a <strong>registry</strong> to make it accessible from <em>AI Deploy</em>. A <strong>registry</strong> is a service that allows you to store and distribute <em>Docker</em> images, making it easy to deploy them in different environments.</p>



<p>Several registries can be used (<em><a href="https://www.ovhcloud.com/en-gb/public-cloud/managed-private-registry/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Private Registry</a>, <a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a>, <a href="https://github.com/features/packages" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub packages</a>, &#8230;</em>). In this tutorial, we will use the <strong><em>OVHcloud</em> <em>shared registry</em></strong>. More information is available in the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-manage-registries?id=kb_article_view&amp;sysparm_article=KB0057949" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registries documentation</a>.</p>



<p>To find the address of your shared registry, use the following command (<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a> needs to be installed on your computer):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">ovhai registry list</code></pre>



<p>Then, log in on your <em>shared registry</em> with your usual <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Platform user</em></a> credentials:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>Once you are logged in to the registry, tag the compiled image and push it into your shared registry:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker tag vllm_image:latest &lt;shared-registry-address&gt;/vllm_image:latest
docker push &lt;shared-registry-address&gt;/vllm_image:latest</code></pre>



<h3 class="wp-block-heading">vLLM inference server deployment</h3>



<p>Once your image has been pushed, it can be used with <em>AI Deploy</em>, using either the <em>ovhai CLI</em> or the <em>OVHcloud Control Panel (UI)</em>.</p>



<h5 class="wp-block-heading">Creating an access token </h5>



<p>Tokens are used as unique authenticators to securely access the <em>AI Deploy</em> apps. By creating a token, you can ensure that only authorized requests are allowed to interact with the <em>vLLM</em> endpoint. You can create this token by using the <em>OVHcloud Control Panel (UI)</em> or by running the following command:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai token create vllm --role operator --label-selector name=vllm</code></pre>



<p>This will give you a token that you will need to keep.</p>



<h5 class="wp-block-heading">Creating a Hugging Face token (optionnal)</h5>



<p>Note that some models, such as <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 3</a>, require you to accept their license. Hence, you need to create a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>, accept the model’s license, and generate a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">token</a> from your account settings; this token will allow you to access the model.</p>



<p>For example, when visiting the Hugging Face <a href="https://huggingface.co/google/gemma-2b" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gemma model page</a>, you’ll see this (if you are logged in):</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="716" height="312" src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png" alt="accept_model_conditions_hugging_face" class="wp-image-26768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png 716w, https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21-300x131.png 300w" sizes="auto, (max-width: 716px) 100vw, 716px" /></figure>



<p>If you want to use this model, you will have to acknowledge the license, and then make sure to create a token in the <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tokens section</a>.</p>



<p>In the next step, we will set this token as an environment variable (named <code>HF_TOKEN</code>). Doing this will enable us to use any LLM whose conditions of use we have accepted.</p>



<h5 class="wp-block-heading">Run the AI Deploy application</h5>



<p>Run the following command to deploy your <em>vLLM</em> server by running your customized <em>Docker</em> image:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai app run &lt;shared-registry-address&gt;/vllm_image:latest \
  --name vllm_app \
  --flavor h100-1-gpu \
  --gpu 1 \
  --env HF_TOKEN="&lt;YOUR_HUGGING_FACE_TOKEN&gt;" \
  --label name=vllm \
  --default-http-port 8080 \
  -- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code></pre>



<p><em>Just replace the registry address with the one you used, and set the name of the LLM you want to serve. Also pay attention to the image name, its tag, and the label selector if you haven&#8217;t used the same ones as those given in this tutorial.</em></p>



<p><strong>Parameters explanation</strong></p>



<ul class="wp-block-list">
<li><code>&lt;shared-registry-address&gt;/vllm_image:latest</code> is the image on which the app is based.</li>



<li><code>--name vllm_app</code> is an optional argument that allows you to give your app a custom name, making it easier to manage all your apps.</li>



<li><code>--flavor h100-1-gpu</code> indicates that we want to run our app on H100 GPU(s). You can access the full list of available GPUs by running <code>ovhai capabilities flavor list</code>.</li>



<li><code>--gpu 1</code> indicates that we request 1 GPU for that app.</li>



<li><code>--env HF_TOKEN</code> is an optional argument that allows us to set our Hugging Face token as an environment variable. This gives us access to models for which we have accepted the conditions.</li>



<li><code>--label name=vllm</code> allows us to privatize our LLM: only requests carrying the token created for the label selector <code>name=vllm</code> will be accepted.</li>



<li><code>--default-http-port 8080</code> indicates that the port exposed on the app URL is <code>8080</code>.</li>



<li><code>-- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt;</code> starts the vLLM API server (everything after the <code>--</code> separator is the command executed in the container). The specified &lt;model&gt; will be downloaded from Hugging Face. Here is a list of those that are <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">supported by vLLM</a>. <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Many arguments</a> can be used to optimize your inference.</li>
</ul>



<p>When this <code>ovhai app run</code> command is executed, several pieces of information will appear in your terminal. Get the ID of your application, and open the Info URL in a new tab. Wait a few minutes for your application to launch. When it is <strong>RUNNING</strong>, you can stream its logs by executing:</p>



<pre class="wp-block-code"><code class="">ovhai app logs -f &lt;APP_ID&gt;</code></pre>



<p>This will allow you to track the server launch, the model download, and any errors that may occur if you have used a model whose conditions of use you have not accepted.</p>



<p>If all goes well, you should see the following output, which means that your server is up and running:</p>



<pre class="wp-block-code"><code class="">Started server process [11]
Waiting for application startup.
Application startup complete.
Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre>



<h3 class="wp-block-heading">Interacting with your LLM</h3>



<p>Once the server is up and running, we can interact with our LLM by hitting the <code>/generate</code> endpoint.</p>



<p><strong>Using cURL</strong></p>



<p><em>Make sure you change the ID to that of your application so that you target the right endpoint. In order for the request to be accepted, also specify the token that you generated previously by executing</em> <code>ovhai token create</code>. Feel free to adapt the parameters of the request (<em>prompt</em>, <em>max_tokens</em>, <em>temperature</em>, &#8230;).</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">curl --request POST \                                             
  --url https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/generate \
  --header 'Authorization: Bearer &lt;AI_TOKEN_generated_with_CLI&gt;' \
  --header 'Content-Type: application/json' \
  --data '{
        "prompt": "&lt;YOUR_PROMPT&gt;",
        "max_tokens": 50,
        "n": 1,
        "stream": false
}'</code></pre>



<p><strong>Using Python</strong></p>



<p><em>Here too, you need to add your personal token and the correct link for your application.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests
import json

# change for your host
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "AI_TOKEN_generated_with_CLI"

url = f"{APP_URL}/generate"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}
data = {
    "prompt": "What a LLM is in AI?",
    "max_tokens": 100,
    "temperature": 0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json()["text"][0])</code></pre>



<h3 class="wp-block-heading" id="AI-ENDPOINTS">OVHcloud AI Endpoints</h3>



<p>If you are not interested in building your own image and deploying your own LLM inference server, you can use OVHcloud&#8217;s new <em><strong><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></strong></em> product, which will definitely make your life easier!</p>



<p><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is a serverless solution that provides AI APIs, enabling you to easily use pre-trained and optimized AI models in your applications. </p>



<figure class="wp-block-video"><video height="1400" style="aspect-ratio: 2560 / 1400;" width="2560" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4"></video></figure>



<p class="has-text-align-center"><em>Overview of AI Endpoints</em></p>



<p>You can use LLM as a Service, choosing the desired model (such as <em>LLaMA</em>, <em>Mistral</em>, or <em>Mixtral</em>) and making an API call to use it in your application. This will allow you to interact with these models without even having to deploy them!</p>
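

<p><em>As an illustration, here is a minimal Python sketch of such an API call, assuming the endpoint you pick is OpenAI-compatible; the base URL, model name and API key below are placeholders to replace with the values shown in the AI Endpoints catalog.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests

# All values below are placeholders: take the real endpoint URL, model
# name and API key from the AI Endpoints catalog.
ENDPOINT_URL = "&lt;YOUR_AI_ENDPOINT_URL&gt;/v1/chat/completions"
API_KEY = "&lt;YOUR_AI_ENDPOINTS_API_KEY&gt;"

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    json={
        "model": "&lt;MODEL_NAME&gt;",  # e.g. a Mistral or Mixtral model from the catalog
        "messages": [{"role": "user", "content": "What is an LLM?"}],
        "max_tokens": 100,
    },
)

print(response.json()["choices"][0]["message"]["content"])</code></pre>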



<p>In addition to LLM capabilities, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> also offers a range of other AI models, including speech-to-text, translation, summarization, embeddings and computer vision. </p>



<p>Best of all, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is currently in alpha phase and is <strong>free to use</strong>, making it an accessible and affordable solution for developers seeking to explore the possibilities of AI. Check <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">this article</a> and try it out today to discover the power of AI!</p>



<p>Join our <a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Discord server</a> to interact with the community and send us your feedbacks (#<em>ai-endpoints</em> channel)!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy%2F&amp;action_name=How%20to%20serve%20LLMs%20with%20vLLM%20and%20OVHcloud%20AI%20Deploy&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4" length="14424826" type="video/mp4" />

			</item>
		<item>
		<title>Fine-Tuning LLaMA 2 Models using a single GPU, QLoRA and AI Notebooks</title>
		<link>https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Fri, 21 Jul 2023 15:04:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebooks]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Fine-tuning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMa 2]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[QLoRA]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=25613</guid>

					<description><![CDATA[In this tutorial, we will walk you through the process of fine-tuning LLaMA 2 models, providing step-by-step instructions. All the code related to this article is available in our dedicated GitHub repository. You can reproduce all the experiments with OVHcloud AI Notebooks. Introduction On July 18, 2023, Meta released LLaMA 2, the latest version of [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks%2F&amp;action_name=Fine-Tuning%20LLaMA%202%20Models%20using%20a%20single%20GPU%2C%20QLoRA%20and%20AI%20Notebooks&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of fine-tuning <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a> models, providing step-by-step instructions.</em> </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg" alt="Fine-Tuning LLaMA 2 Models with a single GPU and OVHcloud" class="wp-image-25629" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p class="has-text-align-center"><em>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/llm/miniconda/llama2-fine-tuning/llama_2_finetuning.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a><a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/speech-to-text" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">.</a> You can reproduce all the experiments with</em> <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</p>



<h3 class="wp-block-heading">Introduction</h3>



<p>On July 18, 2023, <a href="https://about.meta.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Meta</a> released <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a>, the latest version of their <strong>Large Language Model </strong>(LLM).</p>



<p>Trained between January 2023 and July 2023 on 2 trillion tokens, these new models outperform other LLMs on many benchmarks, including reasoning, coding, proficiency, and knowledge tests. This release comes in different flavors, with parameter sizes of <strong>7B</strong>, <strong>13B</strong>, and a mind-blowing <strong>70B</strong>. The models are free for both commercial and research use in English.</p>



<p>To suit specific text generation needs and fine-tune these models, we will use <a href="https://arxiv.org/abs/2305.14314" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">QLoRA (Efficient Finetuning of Quantized LLMs)</a>, a highly efficient fine-tuning technique that involves quantizing a pretrained LLM to just 4 bits and adding small &#8220;Low-Rank Adapters&#8221;. This unique approach allows for fine-tuning LLMs <strong>using just a single GPU</strong>! This technique is supported by the <a href="https://huggingface.co/docs/peft/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PEFT library</a>.</p>



<p>To fine-tune our model, we will create an <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebook</a> with only 1 GPU.</p>



<h3 class="wp-block-heading">Mandatory requirements</h3>



<p>To successfully fine-tune LLaMA 2 models, you will need the following:</p>



<ul class="wp-block-list">
<li>Fill in Meta&#8217;s form to <a href="https://ai.meta.com/resources/models-and-libraries/llama-downloads/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">request access to the next version of Llama</a>. Indeed, the use of Llama 2 is governed by the Meta license, which you must accept in order to download the model weights and tokenizer.<br></li>



<li>Have a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a> account (with the same email address you entered in Meta&#8217;s form).<br></li>



<li>Have a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face token</a>.<br></li>



<li>Visit the page of one of the LLaMA 2 available models (version <a href="https://huggingface.co/meta-llama/Llama-2-7b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">7B</a>, <a href="https://huggingface.co/meta-llama/Llama-2-13b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">13B</a> or <a href="https://huggingface.co/meta-llama/Llama-2-70b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">70B</a>), and accept Hugging Face&#8217;s license terms and acceptable use policy.<br></li>



<li>Log in to the Hugging Face model Hub from your notebook&#8217;s terminal by running the <code>huggingface-cli login</code> command, and enter your token. You will not need to add your token as a git credential (a programmatic alternative is sketched just after this list).<br></li>



<li>Powerful Computing Resources: Fine-tuning the Llama 2 model requires substantial computational power. Ensure you are running code on GPU(s) when using <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Notebooks</a> or <a href="https://www.ovhcloud.com/en/public-cloud/ai-training/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Training</a>.</li>
</ul>
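

<p><em>If you prefer to log in programmatically rather than through the CLI prompt, the <code>huggingface_hub</code> library exposes a <code>login()</code> helper; a minimal sketch, assuming your token is exported as the <code>HF_TOKEN</code> environment variable:</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import os

from huggingface_hub import login

# Reads the token from the environment instead of an interactive prompt
login(token=os.environ["HF_TOKEN"])</code></pre>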



<h3 class="wp-block-heading">Set up your Python environment</h3>



<p>Create the following <code>requirements.txt</code> file:</p>



<pre class="wp-block-code"><code lang="" class="">torch
accelerate @ git+https://github.com/huggingface/accelerate.git
bitsandbytes
datasets==2.13.1
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
trl @ git+https://github.com/lvwerra/trl.git
scipy</code></pre>



<p>Then install the required libraries and import them:</p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">import argparse
import bitsandbytes as bnb
from datasets import load_dataset
from functools import partial
import os
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, BitsAndBytesConfig, \
    DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset</code></pre>



<h3 class="wp-block-heading">Download LLaMA 2 model</h3>



<p>As mentioned before, LLaMA 2 models come in different flavors which are 7B, 13B, and 70B. Your choice can be influenced by your computational resources. Indeed, larger models require more resources, memory, processing power, and training time.</p>



<p>To download the model you have been granted access to, <strong>make sure you are logged in to the Hugging Face model hub</strong>. As mentioned in the requirements step, you need to use the <code>huggingface-cli login</code> command.</p>



<p>The following function will help us to download the model and its tokenizer. It requires a bitsandbytes configuration that we will define later.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = '40960MB'  # memory budget per GPU; adjust to your hardware

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto", # dispatch efficiently the model on the available ressources
        max_memory = {i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # Needed for LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer</code></pre>



<h3 class="wp-block-heading">Download a Dataset</h3>



<p>There are many datasets that can help you fine-tune your model. You can even use your own dataset!</p>



<p>In this tutorial, we are going to download and use the <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks Dolly 15k dataset</a>, which contains <strong>15,000 prompt/response pairs</strong>. It was crafted by over 5,000 <a href="https://www.databricks.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks</a> employees during March and April of 2023.</p>



<p>This dataset is designed specifically for fine-tuning large language models. Released under the <a href="https://creativecommons.org/licenses/by-sa/3.0/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">CC BY-SA 3.0 license</a>, it can be used, modified, and extended by any individual or company, even for commercial applications. So it&#8217;s a perfect fit for our use case!</p>



<p>However, like most datasets, this one has <strong>its limitations</strong>. Indeed, pay attention to the following points:</p>



<ul class="wp-block-list">
<li>It consists of content collected from the public internet, which means it may contain objectionable, incorrect or biased content, as well as typos, all of which could influence the behavior of models fine-tuned using this dataset.<br></li>



<li>Since the dataset has been created for Databricks by their own employees, it&#8217;s worth noting that the dataset reflects the interests and semantic choices of Databricks employees, which may not be representative of the global population at large.<br></li>



<li>We only have access to the <code>train</code> split of the dataset, which is its largest subset.</li>
</ul>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load the databricks dataset from Hugging Face
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")</code></pre>



<h3 class="wp-block-heading">Explore dataset</h3>



<p>Once the dataset is downloaded, we can take a look at it to understand what it contains:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

*** OUTPUT ***
Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']</code></pre>



<p>As we can see, each sample is a dictionary that contains:</p>



<ul class="wp-block-list">
<li><strong>An instruction</strong>: What could be entered by the user, such as a question</li>



<li><strong>A context</strong>: Help to interpret the sample</li>



<li><strong>A response</strong>: Answer to the instruction</li>



<li><strong>A category</strong>: Classify the sample between Open Q&amp;A, Closed Q&amp;A, Extract information from Wikipedia, Summarize information from Wikipedia, Brainstorming, Classification, Creative writing</li>
</ul>
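

<p><em>A quick way to see these four fields on a concrete record is to print the first sample (the exact text will vary from one dataset version to another):</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Each record is a plain dictionary with the four fields described above
sample = dataset[0]
for key in dataset.column_names:
    print(f"{key}: {sample[key]}")</code></pre>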



<h3 class="wp-block-heading">Pre-processing dataset</h3>



<p><strong>Instruction fine-tuning</strong> is a common technique used to fine-tune a base LLM for a specific downstream use-case.</p>



<p>It will help us to format our prompts as follows: </p>



<pre class="wp-block-code"><code class="">Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Sea or Mountain

### Response:
I believe Mountain are more attractive but Ocean has it's own beauty and this tropical weather definitely turn you on! SO 50% 50%

### End</code></pre>



<p>To delimit each prompt part by hashtags, we can use the following function:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_prompt_formats(sample):
    """
    Format various fields of the sample ('instruction', 'context', 'response')
    Then concatenate them using two newline characters 
    :param sample: Sample dictionary
    """

    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"
    
    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}\n{sample['instruction']}"
    input_context = f"{INPUT_KEY}\n{sample['context']}" if sample["context"] else None
    response = f"{RESPONSE_KEY}\n{sample['response']}"
    end = f"{END_KEY}"
    
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    
    sample["text"] = formatted_prompt

    return sample</code></pre>



<p>Now, we will use our <strong>model tokenizer to process these prompts into tokenized ones</strong>. </p>



<p>The goal is to create input sequences of uniform length (which is suitable for fine-tuning the language model because it maximizes efficiency and minimizes computational overhead) that must not exceed the model&#8217;s maximum token limit.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """Format &amp; tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)
    
    # Apply preprocessing to each batch of the dataset &amp; and remove 'instruction', 'context', 'response', 'category' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["instruction", "context", "response", "text", "category"],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) &lt; max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset</code></pre>



<p>With these functions, our dataset will be ready for fine-tuning!</p>



<h3 class="wp-block-heading">Create a bitsandbytes configuration</h3>



<p>This will allow us to load our LLM in 4 bits. This way, we can divide the memory used by 4 and load the model on smaller devices. We choose to apply the bfloat16 compute data type and nested quantization for memory-saving purposes.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    return bnb_config</code></pre>



<p>To leverage the LoRA method, we need to wrap the model as a PeftModel.</p>



<p>To do this, we need to implement a <a href="https://huggingface.co/docs/peft/conceptual_guides/lora" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LoRa configuration</a>:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply Lora to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config</code></pre>



<p>The previous function needs the <strong>target modules</strong> in order to update the necessary matrices. The following function will retrieve them for our model:</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)</code></pre>



<p>Once everything is set up and the base model is prepared, we can use the <em>print_trainable_parameters()</em> helper function to see how many trainable parameters are in the model. </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params /= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )</code></pre>



<p>We expect the LoRA model to have far fewer trainable parameters than the original one, since LoRA freezes the base weights and only trains the small adapter matrices.</p>



<h3 class="wp-block-heading">Train</h3>



<p>Now that everything is ready, we can pre-process our dataset and load our model using the set configurations: </p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load model from HF with user's token and with bitsandbytes config

model_name = "meta-llama/Llama-2-7b-hf" 

bnb_config = create_bnb_config()

model, tokenizer = load_model(model_name, bnb_config)</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">## Preprocess dataset

max_length = get_max_length(model)

dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)</code></pre>



<p>Then, we can run our fine-tuning process: </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)
    
    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)
    
    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=20,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs
    
    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)
     
    do_train = True
    
    # Launch training
    print("Training...")
    
    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)    
    
    ###
    
    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)
    
    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()
    
    
output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)</code></pre>



<p><em>If you prefer to set a number of epochs (the entire training dataset will be passed through the model) instead of a number of training steps (forward and backward passes through the model with one batch of data), you can replace the <code>max_steps</code> argument with <code>num_train_epochs</code>, as sketched below.</em></p>
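

<p><em>For example, a sketch of the same <code>TrainingArguments</code> expressed in epochs (the values are the ones used above, with the step limit swapped out):</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    num_train_epochs=1,  # replaces max_steps=20: one full pass over the dataset
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
)</code></pre>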



<p>To later load and use the model for inference, we have used the <code>trainer.model.save_pretrained(output_dir)</code> function, which saves the fine-tuned adapter weights and their configuration.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png" alt="" class="wp-image-25619" width="870" height="422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-300x146.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-768x374.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results.png 1320w" sizes="auto, (max-width: 870px) 100vw, 870px" /></figure>



<p class="has-text-align-center">Fine-tuning llama2 results on <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">databricks-dolly-15k</a> dataset</p>



<p>Unfortunately, it is possible that the latest weights are not the best. To solve this problem, you can implement an <code>EarlyStoppingCallback</code>, from transformers, during your fine-tuning. This will enable you to evaluate your model regularly on a validation set, if you have one, and keep only the best weights, as sketched below.</p>
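

<p><em>A minimal sketch of such a setup, assuming you have split off a validation set named <code>eval_dataset</code> (the evaluation cadence and patience values below are illustrative, not prescriptive):</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=eval_dataset,  # assumption: a held-out validation split
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        evaluation_strategy="steps",   # evaluate regularly during training
        eval_steps=10,
        save_strategy="steps",         # must match the evaluation strategy
        save_steps=10,
        load_best_model_at_end=True,   # required by EarlyStoppingCallback
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    # Stop if eval_loss does not improve for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)</code></pre>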



<h3 class="wp-block-heading">Merge weights</h3>



<p>Once we have our fine-tuned weights, we can build our fine-tuned model and save it to a new directory, with its associated tokenizer. By performing these steps, we can have a memory-efficient fine-tuned model and tokenizer ready for inference!</p>



<pre class="wp-block-code"><code lang="python" class="language-python">model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)

# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)</code></pre>
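

<p><em>As a quick sanity check, here is a minimal generation sketch using the merged model saved above (the prompt and generation parameters are illustrative):</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the merged model and its tokenizer from the directory saved above
model = AutoModelForCausalLM.from_pretrained(output_merged_dir, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(output_merged_dir)

# Use the same prompt template as during fine-tuning
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat is a GPU?\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))</code></pre>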



<h3 class="wp-block-heading">Conclusion</h3>



<p>We hope you have enjoyed this article!</p>



<p>You are now able to fine-tune LLaMA 2 models on your own datasets!</p>



<p>In our next tutorial, you will discover how to <strong>Deploy your Fine-tuned LLM on <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Deploy</a> for inference</strong>!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks%2F&amp;action_name=Fine-Tuning%20LLaMA%202%20Models%20using%20a%20single%20GPU%2C%20QLoRA%20and%20AI%20Notebooks&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using GPU on Managed Kubernetes Service with NVIDIA GPU operator</title>
		<link>https://blog.ovhcloud.com/using-gpu-on-managed-kubernetes-service-with-nvidia-gpu-operator/</link>
		
		<dc:creator><![CDATA[Maxime Hurtrel]]></dc:creator>
		<pubDate>Wed, 19 Jan 2022 15:53:13 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[OVHcloud Managed Kubernetes]]></category>
		<category><![CDATA[Partnership]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=21533</guid>

					<description><![CDATA[Two years after launching our Managed Kubernetes service, we&#8217;re seeing a lot of diversity in the workloads that run in production. We have been challenged by some customers looking for GPU acceleration, and have teamed up with our partner NVIDIA to deliver high performance GPUs on Kubernetes. We&#8217;ve done it in a way that combines [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fusing-gpu-on-managed-kubernetes-service-with-nvidia-gpu-operator%2F&amp;action_name=Using%20GPU%20on%20Managed%20Kubernetes%20Service%20with%20NVIDIA%20GPU%20operator&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Two years after launching our Managed Kubernetes service, we&#8217;re seeing a lot of diversity in the workloads that run in production. We have been challenged by some customers looking for GPU acceleration, and have teamed up with our partner NVIDIA to deliver high performance GPUs on Kubernetes. We&#8217;ve done it in a way that combines <strong>simplicity, day-2-maintainability and total flexibility</strong>. The solution<strong> is now available in all OVHcloud regions where we offer Kubernetes and GPUs.</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="574" src="https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1-1024x574.png" alt="" class="wp-image-21539" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1-1024x574.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1-300x168.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1-768x431.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/Capture-décran-2021-12-29-à-10.50.52-1.png 1416w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">The challenge behind a fully managed service</h2>



<p>Readers unfamiliar with our orchestration service and/or GPUs may be surprised that we did not yet offer this integration in general availability. This lies in the fact that our team is focused on providing a <strong>totally managed experience, including patching the OS (Operating System) and Kubelet of each Node each time it is required</strong>. To achieve this goal, we have built and maintained a single hardened image for the dozens of flavors, in each of the 10+ regions.<br>Based on the experience of selected beta users, we found that this approach doesn&#8217;t always work for use cases that require a very specific NVIDIA driver configuration. Working with our technical partners at NVIDIA, we found a solution that leverages GPUs in a simple way while still allowing fine tuning, such as the <a href="https://en.wikipedia.org/wiki/CUDA" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">CUDA</a> configuration for example.</p>



<h2 class="wp-block-heading">NVIDIA to the rescue </h2>



<p>This Keep-It-Simple-Stupid (KISS) solution relies on the great work of NVIDIA in building and maintaining an <strong>official <a href="https://github.com/NVIDIA/gpu-operator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">NVIDIA GPU operator</a></strong>. This Apache 2.0 licensed software uses the operator framework within Kubernetes to <strong>automate the management of all the NVIDIA software components needed to use GPUs</strong>, such as the NVIDIA drivers, the Kubernetes device plugin for GPUs, and others.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="895" src="https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator-1024x895.png" alt="" class="wp-image-21547" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator-1024x895.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator-300x262.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator-768x671.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/gpu-operator.png 1394w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>We ensured it was compliant with our fully maintained Operating System (OS), based on a recent Ubuntu LTS version. After testing it, <a href="https://docs.ovh.com/gb/en/kubernetes/deploying-gpu-application" data-wpel-link="exclude">we documented how to use it on our Managed Kubernetes Service.</a> We appreciate that this solution leverages an open source software that you can use on any compatible NVIDIA hardware. This allows you to guarantee consistent behavior in hybrid or multicloud scenarios, aligned with our <a href="https://www.ovhcloud.com/en/lp/manifesto/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">SMART </a>motto.</p>



<p>Here is an illustration describing the <strong>shared responsibility model </strong>of the stack:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="772" src="https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-1024x772.png" alt="" class="wp-image-21560" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-1024x772.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-300x226.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-768x579.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-1536x1157.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/ovh-nvidia3-2048x1543.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>All our OVHcloud Public Cloud customers can now <a href="https://docs.ovh.com/gb/en/kubernetes/deploying-gpu-application" data-wpel-link="exclude">leverage the feature</a>, adding a GPU node pool to any of their existing or new clusters. This can be done in the regions where both Kubernetes and T1 or T2 instances are available at the date this blog post is published: GRA5, GRA7 and GRA9 (France), DE1 (Germany) (available in the upcoming weeks) and BHS5 (Canada).<br>Note that GPU worker nodes are <strong>compatible with all released features, including <a href="https://docs.ovh.com/gb/en/kubernetes/using_vrack/" data-wpel-link="exclude">vRack technology</a> and <a href="https://docs.ovh.com/gb/en/kubernetes/using-cluster-autoscaler/" data-wpel-link="exclude">cluster autoscaling</a></strong>.</p>



<p>Having Kubernetes clusters with GPU options means that deploying typical AI/ML applications, such as Kubeflow, MLflow, JupyterHub or NVIDIA NGC, is easy and flexible. Do not hesitate to discuss this feature with other Kubernetes users on our <a href="https://gitter.im/ovh/kubernetes" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gitter Channel</a>. You may also have a look at our fully managed <a href="https://www.ovhcloud.com/en/public-cloud/ai-notebook/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Notebook</a> or <a href="https://www.ovhcloud.com/en/public-cloud/ai-training/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Training</a> services for an even simpler out-of-the-box experience and per-minute pricing!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fusing-gpu-on-managed-kubernetes-service-with-nvidia-gpu-operator%2F&amp;action_name=Using%20GPU%20on%20Managed%20Kubernetes%20Service%20with%20NVIDIA%20GPU%20operator&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Managing GPU pools efficiently in AI pipelines</title>
		<link>https://blog.ovhcloud.com/managing-gpu-pools-efficiently-in-ai-pipelines/</link>
		
		<dc:creator><![CDATA[Bastien Verdebout]]></dc:creator>
		<pubDate>Tue, 22 Dec 2020 16:18:36 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=20146</guid>

					<description><![CDATA[A growing number of companies are using artificial intelligence on a daily basis — and dealing with the back-end architecture can reveal some unexpected challenges. Whether the machine learning workload involves fraud detection, forecasts, chatbots, computer vision or NLP, it will need frequent access to computing power for training and fine-tuning. GPUs have proven to [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmanaging-gpu-pools-efficiently-in-ai-pipelines%2F&amp;action_name=Managing%20GPU%20pools%20efficiently%20in%20AI%20pipelines&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>A growing number of companies are using artificial intelligence on a daily basis — and dealing with the back-end architecture can reveal some unexpected challenges.</p>



<p>Whether the machine learning workload involves fraud detection, forecasts, chatbots, computer vision or NLP, it will need frequent access to computing power for training and fine-tuning.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0420-1024x537.png" alt="Managing GPU pools efficiently in AI pipelines" class="wp-image-20449" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420.png 1200w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>GPUs have proven to be a game-changer for deep learning. If you&#8217;re wondering why, you can find out more by reading our blog post about <a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">GPU architectures</a>. A few years ago, manufacturers such as NVIDIA began to develop specific ranges for cloud datacentres. You may be familiar with the NVIDIA TITAN RTX for gaming — and in our datacentres, we use NVIDIA A100, V100, Tesla and DGX GPUs for enterprise-grade workloads.</p>



<p>In short, GPUs are perfect for tasks that can be solved or improved by AI, and require a lot of processing power.<br>They offer optimal compute, and are widely used in deep learning. A growing number of companies are using AI, and GPUs seem to be the best choice for them.</p>



<p>However, when dealing with pools of GPUs, the back-end architecture can be really tricky.  </p>



<p><strong>So how do we use them to benefit a company with minimal hassle and headaches?</strong> <strong>On-premise or in the cloud?</strong></p>



<p>These are good questions that I&#8217;m keen to discuss here, from both a business and technical perspective.</p>






<h3 class="wp-block-heading">Dealing with GPU pools&#8230; The struggle is real.</h3>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0419.png" alt="One does not simply set up GPUs for Deep Learning" class="wp-image-20443" width="603" height="430" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0419.png 804w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0419-300x214.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0419-768x547.png 768w" sizes="auto, (max-width: 603px) 100vw, 603px" /></figure></div>



<p>For anyone who has had to deploy and manage more than 1 GPU for a data-AI team, I&#8217;m sure this topic will bring tears to your eyes, and make your voice tremble. Yes, it is indeed complicated.</p>



<p>I can talk about it on our blog, because our team of data scientists here at OVHcloud had to deal with the exact same annoying issues. Thankfully, we solved all of them — stay tuned!</p>



<p><strong>GPU sharing is hard</strong>. Even if one GPU is better than none, in most cases it will not be sufficient, and a GPU pool will be far more effective. From a tech perspective, dealing with a GPU pool — or worse, allowing your team to use this pool simultaneously — is very tricky. The market is really mature for CPU sharing (via hypervisors), but by design, a GPU has to be attached to a VM or container. This means that quite often, it needs to be &#8220;booked&#8221; for a specific workload. To get around this issue, you&#8217;ll need to provide a scale-out architecture with orchestration, so that you can dynamically assign GPUs to jobs over time. Whenever you tell yourself &#8220;<em>I want to launch this task with 4 GPUs for 2 days</em>&#8221;, you should simply be able to ask, and the back-end should work its magic for you.</p>



<p><strong>Setting up and maintaining an architecture is time-consuming.&nbsp; </strong>So you&#8217;ve deployed servers with GPUs, updated and upgraded your Linux distros, installed your main AI packages and CUDA drivers, and now you want to move on to something else. But wait — a new TensorFlow version has been released, and you also have a security patch to apply. What you initially thought would be a one-off task is now taking up 4-5 hours of your time per week.</p>



<p><strong>Diagnosing is quite complex</strong>. If, for whatever reason, something isn&#8217;t working as it should — good luck. You barely know who is doing what, and you can&#8217;t track jobs or usage unless you connect to the platform yourself and set up monitoring tools. Remember to grab your snorkel set, because you&#8217;ll need to deep-dive.</p>



<p><strong>Bottlenecks are almost inevitable</strong>. Imagine setting up a pool of GPUs based on your current AI project workloads. Your infrastructure is not really designed to scale automatically, and as soon as the AI workloads increase, your jobs have to be scheduled while the GPU fleet is being updated constantly. A backlog starts to accumulate, and a bottleneck is created as a result.</p>



<p><strong>Providing tools for teams to work collaboratively on code is mandatory.</strong> Usually, your team will need to share their data experiments — and the best way to do this for now is with <strong>JupyterLab Notebooks</strong> (we love them) or <strong>VSCode</strong>. But you&#8217;ll need to keep in mind that this is more software to set up and maintain.</p>



<p><strong>Securing data access is essential. </strong>The required data must be easily accessible, and sensitive data must be covered by security guarantees.</p>



<p><strong>Cost control is difficult. </strong>Even worse, for one reason or another (who said holidays?), you might need to stop almost all your GPU servers for a week or two — but to do this, you would need to wait for any ongoing jobs to be completed.</p>



<p>All jokes aside, while we may be passionate about tech and hardware, we have other things to do. Data engineers cannot achieve their full potential and talent in maintenance-based or billing-based tasks.</p>



<h3 class="wp-block-heading">Kubeflow to the rescue?</h3>



<p>Kubernetes 1.0 was launched 5 years ago. Whatever your opinion of it, in five years it has become the de facto standard for container orchestration in enterprise environments.</p>



<p>Data scientists use containers for portability, agility, and community — but Kubernetes was made to orchestrate services, not data experiments.</p>



<p>Kubernetes alone is not tailored for a data team. It presents too much complexity, with the sole benefit of solving the orchestration issue.</p>



<p><strong>We need something that not only improves orchestration, but also code contribution, tests and deployments.</strong></p>



<p>Luckily, <a href="https://www.kubeflow.org/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><strong>Kubeflow</strong> </a>appeared 2 years ago, and was open-sourced by Google at the time. Its main promise is to simplify complex ML workflows, for example <code>data processing =&gt; data labeling =&gt; training =&gt; serving</code>, and complete it with notebooks.</p>



<p>I do really love the promise, and the way they simplify ML pipelines. Kubeflow can be run over K8s clusters on-premise or in the cloud, and can also be set up on a single VM or even on a workstation (Linux/Mac/Windows).</p>



<p>Students can easily have their own ML environment. However, for the most advanced uses, a workstation or a single VM might be out of the question, and you would need a K8s cluster with Kubeflow installed on top of that. You&#8217;ll have a nice UI for starting notebooks and creating ML pipelines (processing/training/inference), <strong>but still zero GPU support by default</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.kubeflow.org/docs/images/central-ui.png" alt="" width="480" height="319"/><figcaption>Central Dashboard / Image property of Kubeflow.org</figcaption></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.kubeflow.org/docs/images/pipelines-xgboost-graph.png" alt="" width="480" height="270"/><figcaption>XGBoost pipeline / Image property of Kubeflow.org</figcaption></figure></div>



<p>Your GPU support will depend on your setup. It may differ if you host it on GCP, AWS, Azure, OVHcloud, on-premise, MicroK8s, or anything else.</p>



<p>For example, on AWS EKS, you need to declare GPU pools in your Kubeflow manifest:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># Official doc: https://www.kubeflow.org/docs/aws/customizing-aws/

# NodeGroup holds all configuration attributes that are specific to a node group
# You can have several node groups in your cluster.
nodeGroups:
  - name: eks-gpu
    instanceType: p2.xlarge
    availabilityZones: ["us-west-2b"]
    desiredCapacity: 2
    minSize: 0
    maxSize: 2
    volumeSize: 30
    ssh:
      allow: true
      publicKeyPath: '~/.ssh/id_rsa.pub'</code></pre>



<p>On GCP GKE, you will need to run this command to export a GPU pool:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Official doc: https://www.kubeflow.org/docs/gke/customizing-gke/#common-customizations
 
export GPU_POOL_NAME=&lt;name of the new GPU pool&gt;
 
gcloud container node-pools create ${GPU_POOL_NAME} \
--accelerator type=nvidia-tesla-k80,count=1 \
--zone us-central1-a --cluster ${KF_NAME} \
--num-nodes=1 --machine-type=n1-standard-4 --min-nodes=0 --max-nodes=5 --enable-autoscaling</code></pre>



<p>You will then need to install NVIDIA drivers on all the GPU nodes. NVIDIA maintains a <em>daemonset</em> which enables you to install them easily:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Official doc: https://www.kubeflow.org/docs/gke/customizing-gke/#common-customizations
 
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml</code></pre>



<p>Once you have done this, you will be able to create GPU pools (don&#8217;t forget to check your quotas beforehand — with a basic account, you are restricted by default, and you will need to contact their support).</p>



<h3 class="wp-block-heading">Okay, but do things get easier from here?</h3>



<p>As we say in France, especially in Normandy, yes but no.</p>



<p>Yes, Kubeflow does resolve some of the challenges we&#8217;ve mentioned — but some of the biggest challenges are yet to come, and they will take up a lot of your daily routine. Many manual operations will still require you to dig into specific K8s documentation, or guides published by cloud providers.</p>



<p>Below is a summary of <strong>Kubeflow vs GPU pool challenges</strong>.</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Challenges</th><th>Status</th></tr></thead><tbody><tr><td><strong>GPU pool with sharing option</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> but will require manual configuration (declaration in manifest, driver installation, etc.).</td></tr><tr><td><strong>Collaborative tools</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> definitely. Notebooks are provided via Kubeflow.</td></tr><tr><td><strong>Infrastructure maintenance</strong></td><td>Definitely <strong><span class="has-inline-color has-vivid-red-color">NO</span></strong>.<br>Now you have a Kubeflow cluster to maintain and operate.</td></tr><tr><td><strong>Infrastructure diagnosis</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span> BUT <span class="has-inline-color has-vivid-red-color">NO</span></strong>. Activity Dashboard and reporting tools based on SpartaKus, Logs, etc.<br>But provided to the data engineers, not data scientists themselves. They may come back to you.</td></tr><tr><td><strong>Infrastructure agility/flexibility</strong></td><td><strong>TRICKY</strong>. It will depend on your hosting implementation. If it&#8217;s on-premise, definitely no. You&#8217;ll need to buy hardware components (an NVIDIA V100 costs approximately $10K without chassis, electricity usage, etc.)<br>Some cloud providers can provide &#8220;auto-scaling GPU pools&#8221; from 0 to n, which is nice.</td></tr><tr><td><strong>Secured data access</strong></td><td><strong>TRICKY</strong>. It will depend on how you locate your data, and the technology used. It&#8217;s not a ready-to-use solution.</td></tr><tr><td><strong>Cost control</strong></td><td><strong>TRICKY.</strong> Again, it will depend on your hosting implementation. It&#8217;s not easy, since you need to take care of the infrastructure. Some hidden costs can appear, too (network traffic, monitoring, etc.).</td></tr></tbody></table><figcaption>Kubeflow vs Challenges</figcaption></figure>



<h3 class="wp-block-heading">Forget infrastructure, welcome to GPU platforms made for AI</h3>



<p>You can now find various third-party solutions on the market that go one step further. Instead of dealing with the architecture and the Kubernetes cluster, what if you simply focused on your machine learning or deep learning code?</p>



<p>There are well-known solutions such as <strong>Paperspace Gradient</strong> — or smaller ones, like <strong>Run:AI</strong> — and we&#8217;re pleased to offer another option on the market: <strong>AI Training</strong>. We&#8217;re using this post as a self-promotion opportunity (it&#8217;s our blog after all), but the logic remains the same for competitors.</p>



<p>What are the concepts behind it?</p>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Noinfrastructuretomanage">No infrastructure to manage</h4>



<p>You don&#8217;t need to set up and manage a K8s cluster, or a Kubeflow cluster.</p>



<p>You don&#8217;t need to declare GPU pools in your manifest.</p>



<p>You don&#8217;t need to install NVIDIA drivers on the nodes.</p>



<p>With GPU Platforms like OVHcloud AI Training, your neural network training is as simple as this:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"><code># Upload data directly to Object Storage</code>
<code>ovhai</code> <code>data upload myBucket@GRA train.zip</code>
&nbsp;
<code># Launch a job with 4 GPUs on a Pytorch environment, with Object Storage bucket directly linked to it</code>
<code>ovhai</code> <code>job run \</code>
 <code>    --gpu 4 \</code>
<code>    --volume myBucket@GRA:/data:RW \</code>
<code>    ovhcom/ai-training-pytorch:1.6.0</code></code></pre>



<p>This command will provide you with a JupyterLab notebook directly plugged into a pool of 4x NVIDIA GPUs, with the Pytorch environment installed. This is all you need to do, and the entire process takes around 15 seconds.</p>
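

<p>From there, you can follow the job without leaving your terminal. The exact subcommands below are a hedged sketch: check <code>ovhai --help</code> for what your CLI version provides:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># List your jobs, then stream the logs of one of them
# (&lt;job-id&gt; is a placeholder for the identifier returned by "ovhai job run")
ovhai job list
ovhai job logs &lt;job-id&gt;</code></pre>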



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Parralelizationforthewin">Parallel computing — a great advantage</h4>



<p>One of the most significant benefits is that since the infrastructure is not on your premises, you can count on the provider to scale it.</p>



<p>So you can run dozens of jobs simultaneously. A classic use case is to fine-tune all of your models once a week or once a month, with a few lines of bash:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"><code># Start a basic loop</code>
<code>for</code> <code>model in</code> <code>my_models_listing</code>
<code>do</code>
&nbsp;
<code># Launch a job with 4 GPUs on a Pytorch environment, with Object Storage bucket directly linked to it</code>
<code>echo</code> <code>"starting training of $model"</code>
<code>ovhai job run \</code>
<code>--gpu 3 \</code>
<code>--volume myBucket@GRA:/data:RW \</code>
<code>my_docker_repository/$model</code>
&nbsp;
<code>done</code></code></pre>



<p>If you have 10 models, this will launch 10 jobs of 3 GPUs each in a few seconds, and release them once each job is complete: from sequential to parallel work in one loop.</p>



<h5 class="wp-block-heading">Collaboration out of the box</h5>



<p>All of these platforms natively include notebooks, directly plugged into GPU power. With OVHcloud AI Training, we also provide pre-installed environments for TensorFlow, Hugging Face, Pytorch, MXNet and Fast.AI — and others will be added to this list soon.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="571" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/nbook-1024x571.png" alt="" class="wp-image-20259" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-1024x571.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-300x167.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-768x429.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-1536x857.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook.png 1672w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>JupyterLab Notebook</figcaption></figure></div>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Datasetaccessmadeeasy">Data set access made easy</h4>



<p>I haven&#8217;t tested all the GPU platforms on the market, but usually they provide some useful ways to access data. We aim to provide the best work environment for data science teams, so we are also offering an easy way for them to access their data — by enabling them to attach object storage containers during the job launch.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="271" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/container-1024x271.png" alt="" class="wp-image-20260" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/container-1024x271.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/container-300x79.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/container-768x204.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/container.png 1536w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>OVHcloud AI Training : attach Object Storage containers to notebooks</figcaption></figure></div>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Lastbutnotleast...CostControlsisareality">Cost control for users</h4>



<p>Third-party GPU platforms quite often provide clear pricing. This is the case for Paperspace, but not for Run:AI (I was unable to find their price list). This is also the case for OVHcloud AI Training.</p>



<ul class="wp-block-list"><li><strong>GPU power</strong>: You pay £1.58/hour/NVIDIA V100s GPU</li><li><strong>Storage</strong>: Standard price of OVHcloud Object Storage (compliant with AWS S3 protocol)</li><li><strong>Notebooks</strong>: Included</li><li><strong>Observability tools</strong>: Logs and metrics included</li><li><strong>Subscription</strong>: No, it&#8217;s pay-as-you-go, per minute</li></ul>



<p>And there we go — cost and budget estimation is now simple. Try it out for yourself!</p>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Missioncomplete?">Mission complete?</h4>



<p>Below is a summary addressing the major challenges to resolve when dealing with GPU pool sharing. It&#8217;s a big yes!</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Challenges</th><th>Status</th></tr></thead><tbody><tr><td><strong>GPU pool with sharing option</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> definitely. In fact, even many GPU pools in parallel, if you want to.</td></tr><tr><td><strong>Collaborative tools</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> definitely. Notebooks are always provided, as far as I know.</td></tr><tr><td><strong>Infrastructure maintenance</strong></td><td><span class="has-inline-color has-vivid-green-cyan-color"><strong>YES</strong> </span>definitely. Infrastructure is managed by the provider. You will need need to connect via SSH to debug.</td></tr><tr><td><strong>Infrastructure diagnosis</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span>. </strong>Logs and metrics provided on our side, at least.</td></tr><tr><td><strong>Infrastructure agility/flexibility</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span> </strong>definitely. Scale up or down one or more GPU pools, use them for 10 minutes or a full month, etc.</td></tr><tr><td><strong>Secured data access</strong></td><td>Depends on the solution you choose, but usually it&#8217;s a <strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> via simplified object storage access.</td></tr><tr><td><strong>Cost control</strong></td><td>Depends on the solution you choose, but usually is a <strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> with packaged prices and zero investments to make (zero CAPEX).</td></tr></tbody></table></figure>






<h3 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Conclusion">Conclusion</h3>



<p>If we go back to the main challenges faced by a company that requires shared GPU pools, we can say without a doubt that <strong>Kubernetes is a market-standard for AI pipeline orchestration</strong>.</p>



<p>An <strong>on-premise K8s cluster with Kubeflow</strong> is really interesting if the data cannot be processed into the cloud (e.g. banking, hospitals, any kind of sensitive data) or if your team has flat (and lower-level) GPU requirements. You can invest in a few GPUs and manage the fleet yourself with software on top. But if you need more power, <strong>very soon the cloud will become the only viable option</strong>. Hardware investments, hardware obsolescence, electricity usage and scaling will give you some headaches.</p>



<p>Then, depending on the situation, <strong>Kubeflow in the cloud might be really useful</strong>. It delivers powerful pipeline features, notebooks, and enables users to manage virtual GPU pools. </p>



<p>But if you want to avoid infrastructure tasks, control your spending, and focus on your added value and code, <strong>you might consider GPU platforms as your first choice</strong>.</p>



<p>However, there is no such thing as magic — and without knowing exactly what you want, even the best platform won&#8217;t be able to meet your needs. Yet some start-ups, not listed here, offer a combination of platforms and expertise to help you with your projects, infrastructure and use cases.</p>



<p>Thank you for reading, and don&#8217;t forget that we also offer inference at scale with ML Serving. This is the next logical step after training.</p>



<h5 class="wp-block-heading">Want to find out more?</h5>



<ul class="wp-block-list"><li>Solution page: <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-training/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.ovhcloud.com/en-gb/public-cloud/ai-training/</a></li><li>Public documentation: <a href="https://docs.ovh.com/gb/en/ai-training/" data-wpel-link="exclude">https://docs.ovh.com/gb/en/ai-training/</a></li><li>Community: <a href="http://community.ovh.com/en/" data-wpel-link="exclude">community.ovh.com/en/</a></li></ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmanaging-gpu-pools-efficiently-in-ai-pipelines%2F&amp;action_name=Managing%20GPU%20pools%20efficiently%20in%20AI%20pipelines&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How PCI-Express works and why you should care? #GPU</title>
		<link>https://blog.ovhcloud.com/how-pci-express-works-and-why-you-should-care-gpu/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Thu, 09 Jul 2020 10:16:00 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PCIe]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14485</guid>

					<description><![CDATA[What is PCI-Express ? Everyone, and I mean everyone, should pay attention when they do intensive Machine Learning / Deep Learning Training. As I explained in a previous blog post, GPUs have accelerated Artificial Intelligence evolution massively. However, building a GPUs server is not that easy. And failing to create an appropriate infrastructure can have [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-pci-express-works-and-why-you-should-care-gpu%2F&amp;action_name=How%20PCI-Express%20works%20and%20why%20you%20should%20care%3F%20%23GPU&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-1024x538.jpeg" alt="How PCI-Express works and why you should care? #GPU" class="wp-image-18783" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-1024x538.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h2 class="wp-block-heading">What is PCI-Express ?</h2>



<p>Everyone, and I mean everyone, should pay attention when they do intensive Machine Learning / Deep Learning Training. </p>



<p>As I explained in a previous blog post, GPUs have accelerated Artificial Intelligence evolution massively.  </p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png" alt="" class="wp-image-18103" width="334" height="254" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png 668w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED-300x228.png 300w" sizes="auto, (max-width: 334px) 100vw, 334px" /></a></figure></div>



<p>However, building a GPU server is not that easy. And failing to create an appropriate infrastructure can have consequences on training time.</p>



<p>If you use GPUs, you should know that there are 2 ways to connect them to the motherboard, so that they can communicate with the other components (network, CPU, storage devices). Solution 1 is through <strong>PCI Express</strong> and solution 2 through <strong>SXM2</strong>. We will talk about <strong>SXM2</strong> in the future. Today, we will focus on <strong>PCI Express</strong>, because it has a strong dependency on the choice of adjacent hardware, such as the PCI bus or the CPU.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>                     NVIDIA V100 with SXM2 design</th><th class="has-text-align-center" data-align="center">                          NVIDIA V100 with PCI express design</th></tr></thead><tbody><tr><td><img loading="lazy" decoding="async" width="609" height="644" class="wp-image-18763" style="width: 500px" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-01.jpg" alt="NVIDIA V100 with SXM2 design" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-01.jpg 609w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-01-284x300.jpg 284w" sizes="auto, (max-width: 609px) 100vw, 609px" /><br>Source : <a aria-label="undefined (opens in a new tab)" href="https://www.ebizpc.com/NVIDIA-Tesla-V100-900-2G502-0300-000-16GB-GPU-p/900-2g503-0310-000.htm" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.ebizpc.com/NVIDIA-Tesla-V100-900-2G502-0300-000-16GB-GPU-p/900-2g503-0310-000.htm</a></td><td class="has-text-align-center" data-align="center"><img loading="lazy" decoding="async" width="450" height="450" class="wp-image-18764" style="width: 500px" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-02.jpg" alt="NVIDIA V100 with PCI express design" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02.jpg 450w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02-300x300.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02-150x150.jpg 150w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02-70x70.jpg 70w" sizes="auto, (max-width: 450px) 100vw, 450px" /><br>Source : <a aria-label="undefined (opens in a new tab)" href="https://nvidiastore.com.br/nvidia-tesla-v100-16gb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://nvidiastore.com.br/nvidia-tesla-v100-16gb</a></td></tr></tbody></table><figcaption>SXM2 design VS PCI Express Design</figcaption></figure>



<p>This is a major element to consider when talking about deep learning: the data loading phase wastes compute time, so bandwidth between components and GPUs is a key bottleneck in most deep learning training contexts.</p>



<h2 class="wp-block-heading">How does PCI-Express work and why you should care about the number of PCIe lanes?</h2>



<h3 class="wp-block-heading">What is a PCI-Express Lanes and are there any associated CPU limitations?</h3>



<p>Each V100 GPU uses 16 PCIe lanes. What does that mean exactly?</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="618" height="442" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-03.png" alt="Extract from NVidia V100 product specification sheet" class="wp-image-18767" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03.png 618w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03-300x215.png 300w" sizes="auto, (max-width: 618px) 100vw, 618px" /><figcaption>Extract from NVidia V100 product specification <a href="https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf" target="_blank" aria-label="undefined (opens in a new tab)" rel="noreferrer noopener nofollow external" data-wpel-link="external">sheet</a></figcaption></figure></div>



<p>The <strong><em>&#8220;x16&#8221;</em></strong> means that the PCIe link has 16 dedicated lanes. So&#8230; next question: what is a PCI Express lane?</p>



<h4 class="wp-block-heading">What&#8217;s a PCI Express lane?</h4>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3.jpeg" alt="2 PCI Express Devices with its interconnexion" class="wp-image-18779" width="424" height="299" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3.jpeg 848w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3-300x211.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3-768x541.jpeg 768w" sizes="auto, (max-width: 424px) 100vw, 424px" /><figcaption>2 PCI Express Devices with its interconnexion : figure inspired of the awesome <a aria-label="undefined (opens in a new tab)" href="https://www.phhsnews.com/what-is-chipset-and-why-should-i-care3538" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">article</a> &#8211; what is chipset and why should I care</figcaption></figure></div>



<p>PCIe lanes are used to communicate between PCIe devices, or between a PCIe device and the CPU. A lane is composed of two differential wire pairs: one pair for inbound communications and one for outbound, so traffic flows in both directions at full speed (full duplex).</p>



<p>Lane communications are similar to network Layer 1 communications &#8211; it&#8217;s all about transferring bits as fast as possible through electrical wires! However, the technique used for the PCIe link is a bit different, as a PCIe device is composed of xN lanes. In our previous example N=16, but it could be any power of 2 from 1 to 16 (1/2/4/8/16).</p>



<h3 class="wp-block-heading">So… if PCIe is similar to network architecture it means that PCIe layers exist, doesn&#8217;t it?</h3>



<p>Yes! You are right: PCIe has 4 layers:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="724" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-1024x724.jpeg" alt="" class="wp-image-18723" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-1024x724.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-300x212.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-768x543.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02.jpeg 1280w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>






<h4 class="wp-block-heading"><strong>The Physical Layer (aka <em>the Big Negotiation Layer</em>)</strong></h4>



<p>The <strong><em>Physical Layer (PL)</em></strong> is responsible for negotiating with the other device the terms and conditions for receiving the raw packets (PLP, for Physical Layer Packets), i.e. the lane width and the frequency.</p>



<p>You should be aware that only the smallest number of lanes of the two devices will be used. This is why choosing the appropriate CPU is so important. CPUs can only manage a limited number of lanes, so <strong>having a nice GPU with 16 PCIe lanes and a CPU with an 8-lane PCIe bus will be as efficient as throwing away half your money because it doesn&#8217;t fit in your wallet.</strong></p>
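

<p>On a Linux host, you can observe the result of this negotiation directly with <code>lspci</code>. The sketch below is hedged: the bus address <code>01:00.0</code> is only an example, so locate your GPU first with <code>lspci | grep -i nvidia</code>:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># LnkCap = what the device supports; LnkSta = the width/speed actually negotiated
sudo lspci -vv -s 01:00.0 | grep -E "LnkCap:|LnkSta:"</code></pre>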



<p>Packets received at the <strong><em>Physical Layer (aka PHY)</em></strong> come from other PCIe devices or from the system (via <strong><em>Direct Memory Access — DMA</em></strong>, or from the CPU for instance) and are encapsulated in a frame.</p>



<p>The purpose of a Start-of-Frame is to say: “I am sending you data, this is the beginning,” and it takes just 1 byte to say that!</p>



<p>The <strong><em>End-of-Frame</em> </strong>word is also 1 byte to say “goodbye I’m done with it”.</p>



<p>This layer implements an <strong><em>8b/10b or 128b/130b encoding</em></strong>, which we will explain later, and which is mainly used for <strong><em>clock recovery</em></strong>.</p>



<h4 class="wp-block-heading"><strong>The Data Link Layer Packet (aka <em>Let’s put this mess in the right&nbsp;order</em>)</strong></h4>



<p>The <strong><em>Data Link Layer Packet (DLLP)</em></strong> starts with a <strong><em>Packet Sequence Number</em></strong>. This is really important, as a packet might get corrupted at some point, so it may need to be uniquely identified for retry purposes. The <strong><em>Sequence Number</em></strong> is coded on 2 bytes.</p>



<p>The <strong><em>Data Link Layer Packet</em></strong> then carries the <strong><em>Transaction Layer Packet</em></strong> and is closed with the <strong><em>LCRC (Link Cyclic Redundancy Check)</em></strong>, which is used to check the integrity of the <strong><em>Transaction Layer Packet (meaning the actual payload)</em></strong>.</p>



<p>If the <strong><em>LCRC</em></strong> is validated, then the <em><strong>Data Link Layer</strong></em> sends an <strong><em>ACK (ACKnowledge)</em></strong> signal to the <em><strong>emitter</strong></em> through the <strong><em>Physical Layer</em></strong>. Otherwise it sends a <strong><em>NAK (Not AcKnowledge)</em></strong> signal to the emitter, which will resend the frame associated with the <strong><em>sequence number</em></strong>; this retry is handled by the replay buffer on the <em><strong>emitter</strong></em> side.</p>



<h4 class="wp-block-heading"><strong>The Transaction Layer</strong></h4>



<p>The <strong><em>Transaction Layer</em></strong> is responsible for <strong>managing the actual payload (Header + Data)</strong>, as well as the (optional) message digest, the <strong><em>ECRC (End-to-End Cyclic Redundancy Check)</em></strong>. The <strong><em>Transaction Layer Packet</em></strong> comes from the <strong><em>Data Link Layer</em></strong>, where it has been <strong>decapsulated</strong>.</p>



<p>An <strong>integrity check</strong> is performed if needed/requested. This step checks the integrity of the business logic and ensures no packet corruption when passing data from the <strong><em>Data Link Layer</em></strong> to the <em><strong>Transaction Layer</strong></em>.</p>



<p>The header describes the type of transaction, such as:</p>



<ul class="wp-block-list"><li>Memory Transaction</li><li>I/O Transaction</li><li>Configuration Transaction</li><li>or Message Transaction</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-1024x600.jpeg" alt="PCIe Layers" class="wp-image-18781" width="512" height="300" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-1024x600.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-300x176.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-768x450.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E.jpeg 1368w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h4 class="wp-block-heading"><strong>The Application Layer</strong></h4>



<p>The role of the <em><strong>application layer</strong></em> is to handle the <strong><em>User Logic</em></strong>. This layer sends the <strong><em>Header and the data payload</em></strong> to the <strong><em>Transaction Layer</em></strong>. The magic happens in this layer, where data is routed to the different hardware components.</p>



<h3 class="wp-block-heading">How PCIe is communicating with the rest of the&nbsp;world?</h3>



<p>The PCIe link uses the <strong>packet-switching concept found in networking, in full-duplex mode.</strong></p>



<p>PCIe devices have an <strong>internal clock to orchestrate PCIe</strong> <em><strong>Data Transfer Cycles</strong></em>. This <strong><em>Data Transfer Cycle</em></strong> is also orchestrated thanks to the <strong><em>Reference Clock</em></strong>. The latter sends a signal through a <strong><em>dedicated lane</em> (which is not part of the x1/2/4/8/16/32 mentioned above)</strong>. This clock helps both the receiving and the emitting devices synchronize for packet communications.</p>



<p><strong>Each PCIe lane is used to send bytes in parallel with the other lanes</strong>. The <strong><em>Clock Synchronization</em></strong> mentioned above helps the receiver put those bytes back in the right order.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="618" height="442" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-03.png" alt="" class="wp-image-18767" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03.png 618w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03-300x215.png 300w" sizes="auto, (max-width: 618px) 100vw, 618px" /><figcaption>x16 means 16 lanes of parallel communication on generation 3 of PCIe&nbsp;protocol</figcaption></figure></div>



<h3 class="wp-block-heading">You may have the bytes in order but do you have the data integrity at the physical layer&nbsp;?</h3>



<p>To ensure <strong>integrity</strong>, PCIe devices use the <strong>8b/10b encoding scheme for PCIe generations 1 and 2</strong>, or the <strong>128b/130b encoding scheme for generations 3 and 4.</strong></p>



<p>These encodings are used to prevent the loss of temporal landmarks, especially when transmitting consecutive identical bits. This process is called &#8220;<strong><em>Clock Recovery</em></strong>&#8221;.</p>



<p>In the 128b/130b case, 128 bits of payload data are sent with 2 control bits added to them.</p>



<h4 class="wp-block-heading">Quick examples</h4>



<p><em>Let’s simplify it with an 8b/10b example:</em> according to IEEE 802.3 clause 36, table 36–1a, based on the Ethernet specifications, here is the 8b/10b encoding table:</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="600" height="546" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-04.png" alt="IEEE 802.3 clause 36, table 36–1a - 8b/10b encoding table" class="wp-image-18770" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-04.png 600w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-04-300x273.png 300w" sizes="auto, (max-width: 600px) 100vw, 600px" /><figcaption>IEEE 802.3 clause 36, table 36–1a &#8211; 8b/10b encoding table</figcaption></figure></div>



<p>So how can the receiver tell the difference between all those repeating 0s (code group name D0.0)?</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-1024x819.png" alt="Repeating bits everywhere" class="wp-image-18777" width="512" height="410" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-1024x819.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-300x240.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-768x615.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4.png 1381w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>8b/10b encoding is composed of 5b/6b + 3b/4b encodings.</p>



<p>Therefore, with a negative running disparity (<strong>rd-</strong>), <strong>00000 000</strong> will be encoded into <strong>100111 0100</strong>: the first 5 bits of the original data, <strong>00000</strong>, are encoded to <strong>100111</strong> using 5b/6b encoding, and the second group of 3 bits, <strong>000</strong>, is encoded into <strong>0100</strong> using 3b/4b encoding.</p>



<p>With a positive running disparity (<strong>rd+</strong>), the same byte <strong>00000 000</strong> turns into <strong>011000 1011</strong> instead.</p>



<p><strong>Therefore the original data, which was 8 bits, is now 10 bits, due to the control bits (1 control bit for 5b/6b and 1 for 3b/4b).</strong></p>
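

<p>A tiny shell loop makes the benefit visible: both 10-bit encodings of D0.0 contain exactly five ones and five zeros, so the line stays DC-balanced and always gives the receiver transitions to lock onto:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Count the ones in each 10-bit code word of D0.0 (both running-disparity variants)
for w in 1001110100 0110001011; do
  ones=$(printf '%s' "$w" | tr -cd 1 | wc -c)
  echo "$w has $ones ones and $((10 - ones)) zeros"
done</code></pre>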



<p>But don&#8217;t worry, I will draft a dedicated blog post about encoding later.</p>



<p><strong>PCIe generations 1 and 2 were designed with 8b/10b encoding</strong>, meaning that the <strong>actual data transmitted was only 80% of the total load</strong> (as 20%, i.e. 2 bits out of every 10, are used for clock synchronization).</p>



<p><strong>PCIe Gen3&amp;4 were designed with 128b/130b</strong>, meaning that the <strong>control bits now represent only 1.56% of the payload.</strong> Quite good, isn’t it?</p>
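

<p>Both percentages are easy to verify. Note that the post counts the 8b/10b overhead against the 10-bit code word, and the 128b/130b overhead against the 128-bit payload:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 8b/10b: 2 control bits per 10-bit word; 128b/130b: 2 control bits per 128-bit payload
awk 'BEGIN {
  printf "8b/10b control overhead:    %.2f%%\n", 100 * 2 / 10
  printf "128b/130b control overhead: %.2f%%\n", 100 * 2 / 128
}'</code></pre>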



<h3 class="wp-block-heading">Let’s calculate the PCIe bandwidth together</h3>



<p>Here is the table of PCIe version specifications:</p>



<figure class="wp-block-table"><table><thead><tr><th>Number of Lanes</th><th>PCIe 1.0 (2003)</th><th>PCIe 2.0 (2007)</th><th><strong>PCIe 3.0 (2010)</strong></th><th><strong>PCIe 4.0 (2017)</strong></th><th>PCIe 5.0 (2019)</th><th>PCIe 6.0 (2021)</th></tr></thead><tbody><tr><td>x1</td><td>250 MB/s</td><td>500 MB/s</td><td>1 GB/s</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td></tr><tr><td>x2</td><td>500 MB/s</td><td>1 GB/s</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td><td>16 GB/s</td></tr><tr><td>x4</td><td>1 GB/s</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td><td>16 GB/s</td><td>32 GB/s</td></tr><tr><td>x8</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td><td>16 GB/s</td><td>32 GB/s</td><td>64 GB/s</td></tr><tr><td><strong>x16</strong></td><td>4 GB/s</td><td>8 GB/s</td><td><strong>16 GB/s</strong></td><td>32 GB/s</td><td>64 GB/s</td><td>128 GB/s</td></tr></tbody></table><figcaption>consortium PCI-SIG PCIe theoretical bandwidth/Lane/Way specification sheet</figcaption></figure>



<figure class="wp-block-table"><table><thead><tr><th>                                </th><th>PCIe 1.0 (2003)</th><th>PCIe 2.0 (2007)</th><th>PCIe 3.0 (2010)</th><th>PCIe 4.0 (2017)</th><th>PCIe 5.0 (2019)</th><th>PCIe 6.0 (2021)</th></tr></thead><tbody><tr><td><strong>Frequency</strong></td><td>2.5 GT/s</td><td>5.0 GT/s</td><td>8.0 GT/s</td><td>16 GT/s</td><td>32 GT/s</td><td>64 GT/s</td></tr></tbody></table><figcaption>consortium PCI-SIG PCIe theoretical raw bit rate specification sheet</figcaption></figure>



<p>To obtain such numbers, let&#8217;s look at the general bandwidth formula:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-1024x155.jpeg" alt="" class="wp-image-18793" width="512" height="78" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-1024x155.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-300x46.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-768x117.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB.jpeg 1298w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<ul class="wp-block-list"><li>BW stands for Bandwidth</li><li>MT/s&nbsp;: Mega Transfers per second</li><li>Encoding could be 4b/5b/, 8b/10b, 128b/130b,&nbsp;…</li></ul>



<h4 class="wp-block-heading">For PCIe v1.0:</h4>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-1024x170.jpeg" alt="BW/lane\ (MB/s) = \ 2\ 500\ (MT/s)\ *\ \frac{8\ bits}{10\ bits} * \frac{1\ Byte}{8\ bits" class="wp-image-18785" width="512" height="85" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-1024x170.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-300x50.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-768x127.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227.jpeg 1231w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB.jpeg" alt="BW/lane\ (MB/s) = \ 250\ (MB/s)" class="wp-image-18788" width="347" height="79" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB.jpeg 806w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB-300x67.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB-768x172.jpeg 768w" sizes="auto, (max-width: 347px) 100vw, 347px" /></figure></div>



<h4 class="wp-block-heading">For PCIe v3.0 (the one that interest us for NVIDIA V100):</h4>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-1024x154.jpeg" alt="BW/lane\ (MB/s) = \ 8\ 000\ (MT/s)\ *\ \frac{128\ bits}{130\ bits} * \frac{1\ Byte}{8\ bits}" class="wp-image-18795" width="512" height="77" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-1024x154.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-300x45.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-768x115.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A.jpeg 1292w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F.jpeg" alt="BW/lane\ (MB/s) = \ 984.6\ (MB/s)" class="wp-image-18796" width="355" height="63" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F.jpeg 802w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F-300x53.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F-768x136.jpeg 768w" sizes="auto, (max-width: 355px) 100vw, 355px" /></figure></div>



<p>Therefore, with <strong>16 lanes for an NVIDIA V100 connected in PCIe v3.0</strong>, we have an effective data transfer rate (data bandwidth) <strong>of nearly 16GB/s/way</strong> (<strong>the actual bandwidth is 15.75GB/s/way</strong>).</p>



<p>You need to be careful not to get confused, as total bandwidth can also be interpreted as two-way bandwidth; in that case, the total x16 bandwidth is around 32GB/s.</p>



<p><em><strong>Note:</strong></em> Another element that we haven&#8217;t considered is that the maximum theoretical bandwidth needs to be reduced by around 1 Gb/s for the error correction protocols (<strong><em>ECRC</em></strong> and <strong><em>LCRC</em></strong>), as well as for the <strong><em>Header</em></strong> (<strong><em>Start tag, Sequence tag, Header</em></strong>) and <strong><em>Footer</em></strong> (<em><strong>End tag</strong></em>) overheads explained earlier in this blog post.</p>



<h3 class="wp-block-heading">In conclusion</h3>



<p>We have seen that PCI Express has evolved a lot, and that it&#8217;s based on the same concepts as networking. To get the best from PCIe devices, it is necessary to understand the fundamentals of the underlying infrastructure.</p>



<p>Failing to choose the right underlying motherboard, CPU or bus can lead to major performance bottlenecks and GPU under-performance.</p>



<p>To sum up:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Friends don&#8217;t let friends build their own GPUs hosts 😉</p><cite>Jean-Louis Quéguiner July 1<sup>st</sup>, 2020</cite></blockquote>



<p>If you liked this post but want to drill down a bit into the Deep Learning and AI aspect of things, don&#8217;t hesitate to check out my other blog posts:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png" alt="" class="wp-image-18099" width="515" height="376" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png 748w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028-300x219.png 300w" sizes="auto, (max-width: 515px) 100vw, 515px" /></a></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/what-does-training-neural-networks-mean/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png" alt="What does training neural networks mean?" class="wp-image-17932" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1.png 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/distributed-training-in-a-deep-learning-context/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png" alt="Distributed Learning in a Deep Learning context" class="wp-image-18106" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D.png 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure></div>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-pci-express-works-and-why-you-should-care-gpu%2F&amp;action_name=How%20PCI-Express%20works%20and%20why%20you%20should%20care%3F%20%23GPU&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Distributed Training in a Deep Learning Context</title>
		<link>https://blog.ovhcloud.com/distributed-training-in-a-deep-learning-context/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Tue, 05 May 2020 10:14:07 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Neural networks]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17871</guid>

					<description><![CDATA[Previously on OVHcloud Blog &#8230; In previous blog posts we have discussed a high level approach to deep learning as well as what is meant by &#8216;training&#8217; in relation to Deep Learning. Following the article, I had lots of questions entering my twitter inbox, especially regarding how GPUs actually works. I decided, therefore, to write [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdistributed-training-in-a-deep-learning-context%2F&amp;action_name=Distributed%20Training%20in%20a%20Deep%20Learning%20Context&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png" alt="Distributed Learning in a Deep Learning context" class="wp-image-18106" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Previously on OVHcloud Blog &#8230;</h3>



<p>In previous blog posts we have discussed a <a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude">high level approach to deep learning</a> as well as what is meant by &#8216;training&#8217; in relation to Deep Learning.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png" alt="" class="wp-image-18099" width="374" height="273" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png 748w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028-300x219.png 300w" sizes="auto, (max-width: 374px) 100vw, 374px" /></a></figure></div>



<p>Following the article, I had lots of questions entering my twitter inbox, especially regarding how GPUs actually work.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="410" height="157" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/image.png" alt="" class="wp-image-17882" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/image.png 410w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/image-300x115.png 300w" sizes="auto, (max-width: 410px) 100vw, 410px" /><figcaption>Don&#8217;t worry it&#8217;s a friend, he is ok with me sharing the DM 😉</figcaption></figure></div>



<p>I decided, therefore, to write an article on how GPUs work:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png" alt="" class="wp-image-18103" width="334" height="254" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png 668w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED-300x228.png 300w" sizes="auto, (max-width: 334px) 100vw, 334px" /></a></figure></div>



<p>During our R&amp;D process around hardware and AI models, the question of distributed training came up (quickly). But before looking in-depth at distributed training, I invite you to read the following article to understand how Deep Learning training actually works:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/what-does-training-neural-networks-mean/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png" alt="What does training neural networks mean?" class="wp-image-17932" width="476" height="249" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1.png 1200w" sizes="auto, (max-width: 476px) 100vw, 476px" /></a></figure></div>



<p>As previously discussed, Neural Network training depends on:</p>



<ul class="wp-block-list"><li>Input Data</li><li>Neural Network architecture composed of &#8216;Layers&#8217;</li><li>Weights</li><li>Learning Rate (step used to adjust neural network weights)</li></ul>



<h2 class="wp-block-heading">Why do we need distributed Learning</h2>



<p>Deep Learning is mainly used for pattern learning on unstructured data. <strong>Unstructured data &#8211; such as text corpora, images, video or sound &#8211; can represent a huge amount of data to train on.</strong></p>



<p>Training on such a corpus can take days or even weeks, because of the size of the data and/or the size of the network.</p>



<p>Multiple distributed learning approaches can be considered.</p>



<h2 class="wp-block-heading">The different Distributed Learning approaches</h2>



<p>There are two main categories of distributed training when it comes to Deep Learning, and both of them are based on the <strong><a rel="noreferrer noopener nofollow external" href="https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm" target="_blank" data-wpel-link="external">divide and conquer paradigm</a></strong>.</p>



<p>The first category is named <strong>&#8220;Distributed Data Parallelism&#8221;</strong>, where the <strong>data is split across multiple GPUs</strong>.</p>



<p>The second category is called <strong>&#8220;Model Parallelism&#8221;</strong>, where the deep learning <strong>model is split across multiple GPUs</strong>.</p>



<p>However, <strong>Distributed Data Parallelism</strong> is the most common approach, as it <strong>fits almost any problem</strong>. The second approach has some serious technical limitations in relation to model splitting. Splitting a model is a highly technical exercise, as you need to know the space used by each part of the network in the <strong>DRAM</strong> of the GPU. Once you have the <strong>DRAM usage per slice</strong>, you need to enforce the computation by <strong>hard-coding the placement of Neural Network layers onto the desired GPU</strong>. <strong>This approach makes it hardware-related</strong>, as the DRAM may vary from one GPU to the other, while <strong>Distributed Data Parallelism</strong> just requires <strong>data size adjustments (usually batch size), which is relatively simple</strong>.</p>



<p>The <strong>Distributed Data Parallelism</strong> model has two variants, each of which has its advantages and disadvantages. The first variant lets you train a model with <strong>synchronous weight adjustment</strong>: <strong>each training batch on each GPU returns the corrections</strong> that need to be made to the model, and <strong>every worker has to wait until all the others have finished their task before receiving the new set of weights</strong> to use in the next training batch.</p>



<p>The second variant lets you work in an <strong>asynchronous way</strong>: each batch on each GPU reports the corrections that need to be made to the neural network, and the <strong>weights coordinator</strong> sends a <strong>new set of weights without waiting for the other GPUs to finish training their own</strong> batch.</p>
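

<p>To make Distributed Data Parallelism concrete, here is a hedged launch sketch using PyTorch&#8217;s <code>torchrun</code> launcher (a tool more recent than this post; <code>train.py</code> and its <code>--batch-size</code> flag are hypothetical stand-ins for a script that wraps its model in <code>DistributedDataParallel</code>):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># One worker process per GPU; DistributedDataParallel all-reduces the gradients
# synchronously, so every worker waits for the others before the next batch
torchrun --nproc_per_node=4 train.py --batch-size 256</code></pre>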



<h2 class="wp-block-heading">3 cheat sheets to better understand Distributed Deep Learning</h2>



<p>In these cheat sheets, let&#8217;s assume you&#8217;re using Docker with a volume attached.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="942" height="1024" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/type1-942x1024.png" alt="" class="wp-image-18048" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-942x1024.png 942w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-276x300.png 276w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-768x835.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1.png 1004w" sizes="auto, (max-width: 942px) 100vw, 942px" /></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="904" height="1024" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/type2-904x1024.png" alt="" class="wp-image-18049" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-904x1024.png 904w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-265x300.png 265w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-768x870.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2.png 945w" sizes="auto, (max-width: 904px) 100vw, 904px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1543" height="2182" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/distrib-training1.jpeg" alt="" class="wp-image-18036" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1.jpeg 1543w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-212x300.jpeg 212w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-724x1024.jpeg 724w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-768x1086.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-1086x1536.jpeg 1086w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-1448x2048.jpeg 1448w" sizes="auto, (max-width: 1543px) 100vw, 1543px" /></figure>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78.png" alt="" class="wp-image-18096" width="320" height="322" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78.png 640w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78-150x150.png 150w" sizes="auto, (max-width: 320px) 100vw, 320px" /><figcaption>Now you need to choose your Distributed Training strategy (wisely)</figcaption></figure></div>






<h2 class="wp-block-heading">Further Readings</h2>



<p>While we have covered a lot in this blog post, we haven&#8217;t covered nearly all the aspects of Deep Learning distributed training &#8211; including prior work, history and the associated mathematics.</p>



<p>I highly suggest that you read the great paper <em><a href="https://stanford.edu/~rezab/classes/cme323/S16/projects_reports/hedge_usmani.pdf" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Parallel and Distributed Deep Learning</a></em> by <strong>Vishakh Hegde</strong> and <strong>Sheema Usmani</strong> (both from Stanford University).</p>



<p>As well as the article <em><a href="https://arxiv.org/pdf/1802.09941.pdf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis</a></em>, written by <strong>Tal Ben-Nun</strong> and <strong>Torsten Hoefler</strong> (ETH Zurich, Switzerland). I suggest that you start by jumping to <strong>section 6.3</strong>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdistributed-training-in-a-deep-learning-context%2F&amp;action_name=Distributed%20Training%20in%20a%20Deep%20Learning%20Context&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
