<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>guide Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/guide/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/guide/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Tue, 02 Sep 2025 14:35:59 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>guide Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/guide/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>GPU for LLM Inferencing Guide</title>
		<link>https://blog.ovhcloud.com/gpu-for-llm-inferencing-guide/</link>
		
		<dc:creator><![CDATA[David Tonda]]></dc:creator>
		<pubDate>Thu, 24 Jul 2025 07:14:16 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[guide]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29418</guid>

					<description><![CDATA[A guide to which GPU to use for LLM inference, and in which setup.]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph">A few days ago, we were discussing the GPU strategy for AI at OVHcloud. I realised after a few hours of calls that our finance colleagues were still finding the technical parts of this topic difficult to get their heads round, so I decided to write them a guide. Then someone joked that many of our customers were also confused, so the guide is now a blog post 😉</p>



<p class="wp-block-paragraph">This guide focuses on <strong>GPU-based inference for Large Language Models (LLMs)</strong>. When we refer to “performance” we mean <strong>tokens per second</strong>. It’s not a technical deep dive, but it should help you choose the right GPU setup for your use case. Many of the details have been simplified to keep the information practical and accessible.</p>



<h2 class="wp-block-heading"><a>TL;DR – Best LLM inference options at OVHcloud (as of July 2025)</a></h2>



<p class="wp-block-paragraph">These are the best deployment options currently available at OVHcloud for LLM inference. The offering will evolve over time as new GPUs are released.</p>



<figure class="wp-block-image aligncenter size-full"><img fetchpriority="high" decoding="async" width="802" height="251" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-16.png" alt="" class="wp-image-29420" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-16.png 802w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-16-300x94.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-16-768x240.png 768w" sizes="(max-width: 802px) 100vw, 802px" /></figure>






<h2 class="wp-block-heading"><a>1 – Scope your requirements</a></h2>



<p class="wp-block-paragraph">Before going any further, scope your requirements. The answers to the following questions will help you choose the best solution.</p>



<ul class="wp-block-list">
<li><strong>What model do you want to deploy?</strong> (e.g. Llama 3 70B)</li>



<li><strong>How many parameters does it have?</strong> (e.g. 7B, 70B, 120B)</li>



<li><strong>What context length do you need?</strong> (e.g. 32K, 128K)</li>



<li><strong>What precision or quantization level?</strong> (FP16, FP8, etc.)</li>



<li><strong>How many concurrent users?</strong> (a single user? 10? 500? 10,000?)</li>



<li><strong>What inference server?</strong> (e.g. vLLM, TensorRT, Ollama&#8230;)</li>



<li><strong>Throughput needs?</strong> (e.g. latency per user, total TPS)</li>



<li><strong>Is the usage stable or bursty? Predictable or not?</strong></li>
</ul>



<p class="wp-block-paragraph"><em>Note: this guide assumes you are interested in inference on an 8B+ LLM on GPU (we won’t cover small LLMs using CPU compute).</em></p>



<h2 class="wp-block-heading"><a>2 – Choosing the GPU model – Discriminating criteria</a></h2>



<h3 class="wp-block-heading"><a>a) Quantization / Precision support</a></h3>



<p class="wp-block-paragraph"><strong>What is quantization?</strong> The idea is to lower the precision of the model weights (e.g., FP16 → FP8 → FP4) in order to reduce the memory and compute required, usually at the cost of a small decrease in model quality. <strong>It’s a trade-off</strong>.</p>






<p class="wp-block-paragraph"><strong>Currently, LLMs are most often published in FP16 but are often deployed in FP8, as the loss in quality is far outweighed by the gain in speed.</strong></p>
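


<p class="wp-block-paragraph">To make the trade-off concrete, here is a minimal sketch of quantization in plain Python. It is an illustration only: it uses simple 8-bit integer (“absmax”) quantization rather than the FP8/FP4 formats discussed in this guide, and the weights are made-up values. The principle is the same: each weight takes less space, and a small rounding error appears.</p>



<pre class="wp-block-code"><code class=""># Illustration only: symmetric 8-bit integer ("absmax") quantization.
# Real FP8/FP4 formats work differently, but the trade-off is the same:
# fewer bits per weight, plus a small rounding error.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.003, 0.45, -0.09]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)   # integers that each fit in 1 byte (FP16 needs 2 bytes)
print(max_error)   # rounding error introduced by quantization
</code></pre>



<p class="wp-block-paragraph">The rounding error is bounded by half the scale factor, which is why quality usually degrades only slightly while memory use is halved.</p>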



<p class="wp-block-paragraph"><strong>GPU Quantization Support</strong></p>



<figure class="wp-block-image size-full"><img decoding="async" width="327" height="363" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-17.png" alt="" class="wp-image-29421" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-17.png 327w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-17-270x300.png 270w" sizes="(max-width: 327px) 100vw, 327px" /></figure>



<p class="wp-block-paragraph">Most GPUs don&#8217;t support every precision/quantization format, so this is a <strong>discriminating criterion.</strong> Choose a GPU that supports the quantization format you plan to use.</p>



<h3 class="wp-block-heading"><a>b) Minimum number of GPUs to run your model</a></h3>



<p class="wp-block-paragraph">For inference, you need to load all the model weights (**) into memory (GPU memory, not RAM) and keep room for the context/cache. Either you have enough memory or it will simply <strong>not work</strong>.</p>



<p class="wp-block-paragraph">Here is a <strong>rule of thumb</strong> for calculating the required GPU memory for LLMs:</p>



<pre class="wp-block-code"><code class=""><strong>Total GPU memory (GB) = (Parameters in billions × Precision Factor) + (Context Size in tokens × 0.0005)</strong></code></pre>



<p class="wp-block-paragraph">With the following Precision Factor (bytes per parameter):</p>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Precision</strong></td><td><strong>Precision Factor</strong></td></tr><tr><td>FP32</td><td>4</td></tr><tr><td>FP16</td><td>2</td></tr><tr><td>FP8</td><td>1</td></tr><tr><td>FP4</td><td>0.5</td></tr></tbody></table></figure>



<p class="wp-block-paragraph"><em>Example</em>: Llama 3.3 70B, with a 128k context, in FP8 would need 70 GB for the model weights + about 62.5 GB for the context.</p>
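


<p class="wp-block-paragraph">The rule of thumb can be written as a few lines of Python. This is a minimal sketch: the function and variable names are my own, parameters are expressed in billions and the context in tokens, and the result is in GB. Note that the formula as written gives 64 GB for a 128,000-token context, slightly more than the 62.5 GB quoted in the example; the gap is well within the precision of a rule of thumb.</p>



<pre class="wp-block-code"><code class=""># Rule of thumb from this guide. All results are in GB.
PRECISION_FACTOR = {"FP32": 4, "FP16": 2, "FP8": 1, "FP4": 0.5}  # bytes per parameter
GB_PER_CONTEXT_TOKEN = 0.0005

def estimate_gpu_memory_gb(params_billions, precision, context_tokens):
    """Return (weights_gb, context_gb, total_gb) for a model."""
    weights_gb = params_billions * PRECISION_FACTOR[precision]
    context_gb = context_tokens * GB_PER_CONTEXT_TOKEN
    return weights_gb, context_gb, weights_gb + context_gb

# Llama 3.3 70B in FP8 with a 128,000-token context
weights_gb, context_gb, total_gb = estimate_gpu_memory_gb(70, "FP8", 128_000)
print(f"weights: {weights_gb} GB, context: {context_gb} GB, total: {total_gb} GB")
</code></pre>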



<p class="wp-block-paragraph"><strong>If we apply this formula to a few standard LLM sizes and context lengths</strong>, we get the following:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="647" height="193" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-18.png" alt="" class="wp-image-29422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-18.png 647w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-18-300x89.png 300w" sizes="(max-width: 647px) 100vw, 647px" /></figure>



<p class="wp-block-paragraph">Now <strong>we&#8217;ll apply this to the most common GPUs</strong> you can find, to get the minimum number of GPUs you need:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="827" height="350" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-24.png" alt="" class="wp-image-29494" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-24.png 827w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-24-300x127.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-24-768x325.png 768w" sizes="auto, (max-width: 827px) 100vw, 827px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="877" height="120" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-20.png" alt="" class="wp-image-29435" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-20.png 877w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-20-300x41.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-20-768x105.png 768w" sizes="auto, (max-width: 877px) 100vw, 877px" /></figure>



<p class="wp-block-paragraph">Colour legend, given that servers usually come with 4 or 8 GPUs (16 GPUs soon).</p>
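


<p class="wp-block-paragraph">The minimum GPU count is simply the required memory divided by the memory of one GPU, rounded up. Here is a minimal sketch; the per-GPU memory sizes are assumed values for common variants of each card (check the exact variant you are offered), and the names are my own.</p>



<pre class="wp-block-code"><code class="">import math

# Memory per GPU in GB. Assumed values for common variants of each card;
# check the exact variant offered before relying on them.
GPU_MEMORY_GB = {"H100": 80, "A100": 80, "L40S": 48, "V100S": 32, "L4": 24}

def min_gpus(required_gb, gpu_model):
    """Smallest number of GPUs whose combined memory covers the requirement."""
    return math.ceil(required_gb / GPU_MEMORY_GB[gpu_model])

# 70B parameters in FP8 with a 128,000-token context, using the rule of thumb above.
required_gb = 70 * 1 + 128_000 * 0.0005

for gpu in GPU_MEMORY_GB:
    print(gpu, min_gpus(required_gb, gpu))
</code></pre>



<p class="wp-block-paragraph">This is a lower bound. Inference servers often expect a GPU count that splits the model evenly (commonly 1, 2, 4 or 8), so in practice you may need to round up further.</p>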



<p class="wp-block-paragraph">See also the requirements for two common fine-tuning methods:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="306" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-25-1024x306.png" alt="" class="wp-image-29495" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-25-1024x306.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-25-300x90.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-25-768x229.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-25.png 1169w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph"><em>Note: it’s possible to run LLM inference on CPU (see </em><a href="https://github.com/ggml-org/llama.cpp" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>llama.cpp</em></a><em>), but only for small models (or with heavy quantization, at lower quality).</em></p>



<p class="wp-block-paragraph">** <em>Note: it’s possible to reduce the memory needs by “offloading” part of the model layers to RAM, but I won’t cover that (check out the </em><a href="https://www.reddit.com/r/LocalLLaMA/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>LocalLLaMA subreddit</em></a><em>, where some make a sport out of it), as the performance is poor and I guess if you are coming to the cloud, it’s for the real experience 😉</em></p>



<h3 class="wp-block-heading"><a>c) Feature x Hardware compatibility</a></h3>



<p class="wp-block-paragraph">The last discriminating criterion for choosing a GPU is hardware compatibility with certain inference server features.</p>



<p class="wp-block-paragraph">Inference servers (the software that runs the model) may have features that are not compatible with certain GPUs (brand or generation).</p>



<p class="wp-block-paragraph">These change often so I won’t list them, but here is an example for vLLM: <a href="https://docs.vllm.ai/en/latest/features/compatibility_matrix.html#feature-x-hardware_1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://docs.vllm.ai/en/latest/features/compatibility_matrix.html#feature-x-hardware_1</a></p>



<p class="wp-block-paragraph">The most common example we see is that <strong>the “Flash Attention” mechanism is not supported on Volta-generation Nvidia cards (sold under the Tesla brand), like the V100 and V100S</strong>.</p>



<h2 class="wp-block-heading"><a>3 – Choosing the GPU setup and deployment – Performance criterion</a></h2>



<h3 class="wp-block-heading"><a>a) What impacts inference performance?</a></h3>



<h4 class="wp-block-heading">Overview</h4>



<p class="wp-block-paragraph">Several elements impact the overall performance (i.e. the tokens/second). In approximate order of importance:</p>



<ol class="wp-block-list">
<li>The GPU performance</li>



<li>The network performance (between GPUs and between servers)</li>



<li>The software (inference server, drivers, OS)</li>
</ol>



<p class="wp-block-paragraph">Below is an explanation of each and the options you have to choose from.</p>



<h4 class="wp-block-heading">The GPU performance</h4>



<p class="wp-block-paragraph">This is mostly linked to the compute power (FLOPS) of the GPU and the bandwidth of its memory, both of which depend on the GPU generation.</p>



<p class="wp-block-paragraph">The theoretical performance figures (as published by Nvidia and AMD) are listed below:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="455" height="375" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-21.png" alt="" class="wp-image-29436" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-21.png 455w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-21-300x247.png 300w" sizes="auto, (max-width: 455px) 100vw, 455px" /></figure>



<h4 class="wp-block-heading">The Network performance</h4>



<p class="wp-block-paragraph">When inference takes place, your data flows in several ways:</p>



<ul class="wp-block-list">
<li><strong>GPU to motherboard</strong>: the speed depends on the type and version of the connection. Usually it’s PCIe or SXM (a proprietary Nvidia connection).</li>
</ul>



<p class="wp-block-paragraph"><strong>In a nutshell: overall, SXM is faster than PCIe, and the higher the version, the better.</strong></p>



<ul class="wp-block-list">
<li><strong>GPU to GPU</strong>: either the communication goes through the motherboard (so PCIe/SXM) or you have a direct GPU-to-GPU interconnect. NVLink is the Nvidia solution.</li>
</ul>



<p class="wp-block-paragraph"><strong>In a nutshell: if you are using several Nvidia GPUs, choose servers with NVLink.</strong></p>



<ul class="wp-block-list">
<li><strong>The network between servers</strong> (if using multiple servers): Ethernet, InfiniBand…</li>
</ul>



<p class="wp-block-paragraph"><strong>In a nutshell: if you are distributing your inference over several servers, choose InfiniBand over Ethernet.</strong></p>



<h4 class="wp-block-heading">The software performance (Inference Server, Drivers)</h4>



<p class="wp-block-paragraph">Performance will vary widely based on the inference server (vLLM, Ollama, TensorRT…), the underlying libraries used (PyTorch…), and the underlying drivers and compute stack (CUDA, ROCm).</p>



<p class="wp-block-paragraph"><strong>In a nutshell: use the latest versions!</strong></p>



<p class="wp-block-paragraph">Not all inference servers offer the same performance or the same features. I won&#8217;t go into the details, but here are some tips:</p>



<ul class="wp-block-list">
<li><strong>Ollama: simple to set up and use. Best option for a single user.</strong></li>



<li><strong>vLLM: best for getting the latest models and features fast, but complex to configure well.</strong></li>



<li><strong>TensorRT: best throughput, but lags in support for new models and features, and only works on Nvidia GPUs.</strong></li>
</ul>
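


<p class="wp-block-paragraph">Since performance varies this much between inference servers and versions, it is worth measuring tokens per second on your own workload. Here is a minimal sketch of such a measurement. The <code>generate</code> callable is hypothetical: it stands for whatever client call your inference server offers and is assumed to return the number of tokens produced. A fake implementation is included so the sketch runs on its own.</p>



<pre class="wp-block-code"><code class="">import time

def tokens_per_second(generate, prompt, runs=3):
    """Average generation throughput of `generate` over a few runs.

    `generate` is a hypothetical callable wrapping your inference server
    client; it must return the number of tokens it produced.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

def fake_generate(prompt):
    """Stand-in for a real client call, so the sketch runs on its own."""
    time.sleep(0.05)   # pretend the server took 50 ms
    return 20          # pretend it returned 20 tokens

print(round(tokens_per_second(fake_generate, "Hello")), "tokens/s (simulated)")
</code></pre>



<p class="wp-block-paragraph">This measures a single sequential user. With many concurrent users, total throughput and per-user latency can behave very differently, so test with a load that resembles your real usage.</p>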



<h3 class="wp-block-heading"><a>b) Different deployment options</a></h3>



<p class="wp-block-paragraph">Now that you know which GPU and server to choose, you also have several options for the architecture setup.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="421" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-22-1024x421.png" alt="" class="wp-image-29437" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-22-1024x421.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-22-300x123.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-22-768x316.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-22-1536x632.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/image-22-2048x842.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph"><strong>Option A &#8211; Single GPU</strong></p>



<p class="wp-block-paragraph">If the model is small enough to fit into a single GPU, then it’s the best option!</p>



<p class="wp-block-paragraph"><strong>Option B and C &#8211; Single instance, multiple GPUs (with or without interconnect)</strong></p>



<p class="wp-block-paragraph">If it’s too big for a single GPU, then the best option is a single server with multiple GPUs, either with NVLink (<strong>Option C</strong>) or without (<strong>Option B</strong>). In both cases the model weights are spread over the different GPUs, but there is a cost: two GPUs will not give you 2x the performance of one!</p>



<p class="wp-block-paragraph"><strong>Option D &#8211; Single instance, multiple replicas with load balancing</strong></p>



<p class="wp-block-paragraph">If the model fits on 1 server (1+ GPU) but the performance is not enough, or you need to scale dynamically based on current needs, then your best option is to use multiple replicas and add a load balancer (<strong>Option D</strong>) – this is what AI Deploy provides off the shelf.</p>



<p class="wp-block-paragraph"><strong>Option E &#8211; Distributed inference over several servers</strong></p>



<p class="wp-block-paragraph">If the model is too large to fit even on a single server, then you must distribute the inference over several servers (<strong>Option E</strong>). This is by far the most complex option (you need to set up the network and software for clustering) and the one with the highest loss in performance (due to server-to-server network bottlenecks, on top of GPU-to-GPU ones).</p>
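


<p class="wp-block-paragraph">The decision logic behind options A to E can be summarised in a short function. This is a sketch of the reasoning above, not a sizing tool: the function name and parameters are my own, and <code>model_gb</code> is the total memory requirement (weights plus context) estimated earlier.</p>



<pre class="wp-block-code"><code class="">def choose_deployment(model_gb, gpu_gb, gpus_per_server,
                      has_nvlink=False, needs_scaling=False):
    """Return the deployment option (A to E) suggested by the rules above."""
    fits_one_gpu = gpu_gb >= model_gb
    fits_one_server = gpu_gb * gpus_per_server >= model_gb
    if not fits_one_server:
        return "E: distributed inference over several servers"
    if needs_scaling:
        return "D: multiple replicas behind a load balancer"
    if fits_one_gpu:
        return "A: single GPU"
    if has_nvlink:
        return "C: single server, multiple GPUs with NVLink"
    return "B: single server, multiple GPUs without NVLink"

# A 134 GB requirement on a server with four 80 GB GPUs linked by NVLink
print(choose_deployment(model_gb=134, gpu_gb=80, gpus_per_server=4, has_nvlink=True))
</code></pre>



<p class="wp-block-paragraph">Option D is checked before A, B and C because replicas can be added on top of any single-server setup, whether each replica uses one GPU or several.</p>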



<h3 class="wp-block-heading"><a>c) Which OVHcloud product to use?</a></h3>



<p class="wp-block-paragraph">For inference, you currently have six options to choose from:</p>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Product</strong></td><td><strong>Type</strong></td><td><strong>GPU Available</strong></td></tr><tr><td><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></td><td>Inference API</td><td>Serverless</td></tr><tr><td><a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a></td><td>Container as a Service</td><td>H100, L40S, A100, L4, A10, V100S</td></tr><tr><td><a href="https://www.ovhcloud.com/en-ie/public-cloud/compute/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Cloud GPU</a></td><td>Virtual Machine</td><td>H100, L40S, A100, L4, A10, V100S, V100, RTX5000</td></tr><tr><td><a href="https://www.ovhcloud.com/en/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Managed Kubernetes Service</a></td><td>Kubernetes</td><td>H100, L40S, A100, L4, A10, V100S, V100, RTX5000</td></tr><tr><td><a href="https://www.ovhcloud.com/en-ie/bare-metal/prices/?display=list&amp;gpu_brand=nvidia" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dedicated Servers</a></td><td>Bare Metal</td><td>L40S, L4</td></tr><tr><td><a href="https://www.ovhcloud.com/en/dc-as-a-service/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">On Prem Cloud Platform</a></td><td>DC as a Service</td><td>Any</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">If you want fully managed inference, then AI Endpoints is clearly the best option: it&#8217;s a serverless service where you pay per token consumed. You don&#8217;t need to deploy or manage the model.<br>The caveat is that you need to choose from the models we offer (you cannot add your own). That said, we invite you to ask for new models on our <a href="https://discord.com/invite/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Discord</a>!</p>



<p class="wp-block-paragraph"><strong>AI Deploy is a product specially tailored to run inference servers, with a few key features:</strong></p>



<ul class="wp-block-list">
<li>It’s a container as a service: you bring your own container, we run it.</li>



<li>Simple configuration: you can launch the container several times from a single command line and change its parameters directly in that command.</li>



<li>Scalability by design: add replicas at any time and we manage the load balancing.</li>



<li>Autoscaling: you can set up autoscaling on CPU/RAM thresholds, and soon you will be able to use custom metrics too (e.g. inference latency).</li>



<li>Scaling to 0: you will soon be able to scale to 0, meaning that if no request has been sent to your server for some time, we stop the machine.</li>



<li>Pay by the minute of compute, no commitment.</li>
</ul>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
