<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mistral Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/mistral/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/mistral/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Thu, 19 Jun 2025 07:18:14 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Mistral Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/mistral/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Reference Architecture: deploying the Mistral Large 123B model in a sovereign environment with OVHcloud</title>
		<link>https://blog.ovhcloud.com/reference-architecture-deploy-mistral-large-model-in-sovereign-environment-ovhcloud/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Wed, 18 Jun 2025 12:45:51 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Training]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29186</guid>

					<description><![CDATA[Are you ready to think bigger with the Mistral Large model 🚀 ? As Artificial Intelligence (AI) becomes a strategic pillar for both enterprises and public institutions, data sovereignty and infrastructure control have become essential. Deploying advanced large language models (LLMs) like Mistral Large, under a commercial license, requires a secure, high-performance environment that complies [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em><strong>Are you ready to think bigger with the Mistral Large model 🚀 ?</strong></em></p>



<figure class="wp-block-image aligncenter size-large"><img fetchpriority="high" decoding="async" width="1024" height="461" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-1024x461.png" alt="" class="wp-image-29249" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-1024x461.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-300x135.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-768x346.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref-1536x691.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_ref.png 1920w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Mistral Large model deployed on OVHcloud infrastructure<br></em></figcaption></figure>



<p class="wp-block-paragraph">As Artificial Intelligence (<strong>AI</strong>) becomes a strategic pillar for both enterprises and public institutions, <strong>data sovereignty</strong> and <strong>infrastructure control</strong> have become essential. Deploying advanced large language models (LLMs) like <strong>Mistral Large</strong>, under a commercial license, requires a secure, high-performance environment that complies with <strong>European data regulations</strong>.</p>



<p class="wp-block-paragraph"><strong>OVHcloud Machine Learning Services</strong> offer a trusted solution for deploying AI models in a <strong>fully sovereign cloud environment</strong> — hosted in Europe, under <strong>EU jurisdiction</strong>, and fully <strong>GDPR-compliant</strong>.</p>



<p class="wp-block-paragraph">This <strong>Reference Architecture</strong> will show you how to:</p>



<ul class="wp-block-list">
<li>Access Mistral AI registry using your own license</li>



<li>Download the Mistral Large 123B model automatically using <strong>AI Training</strong></li>



<li>Store the model into a dedicated bucket with <strong>OVHcloud Object Storage</strong></li>



<li>Deploy a production-ready inference API for <strong>Mistral Large</strong> using <strong>AI Deploy</strong> </li>
</ul>



<h2 class="wp-block-heading">Context</h2>



<h3 class="wp-block-heading">Mistral Large model</h3>



<p class="wp-block-paragraph">The <strong>Mistral Large</strong> model is a <strong>state-of-the-art large language model (LLM)</strong> developed by <strong><a href="https://mistral.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral AI</a></strong>, a French AI company. It&#8217;s designed to compete with top-tier models such as GPT-4 and Claude, while emphasizing performance and efficiency.</p>



<p class="wp-block-paragraph">This is a model with <strong>123 billion</strong> parameters. <strong>Mistral AI</strong> recommends deploying this model in FP8 with 4 H100 GPUs. For more information, refer to <a href="https://help.mistral.ai/en/articles/235545-mistral-models" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral documentation</a>.</p>
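


<p class="wp-block-paragraph">To see why 4 H100 GPUs leave room for FP8 weights, here is a rough back-of-the-envelope sketch. It assumes 1 byte per parameter in FP8 and 80 GB of memory per H100. Both figures are assumptions made for illustration, not values taken from Mistral documentation, and the estimate ignores the KV cache and activations.</p>

```python
# Back-of-the-envelope check of the weight memory for a 123B model.
# Assumptions (not from Mistral documentation): 1 byte per parameter in FP8,
# 80 GB of memory per H100, and no allowance for KV cache or activations.
PARAMS = 123e9          # 123 billion parameters
BYTES_PER_PARAM = 1     # FP8 stores one byte per weight
GPU_COUNT = 4
GPU_MEMORY_GB = 80      # assumed H100 memory size

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
total_gb = GPU_COUNT * GPU_MEMORY_GB
headroom_gb = total_gb - weights_gb

print(f"weights: {weights_gb:.0f} GB, available: {total_gb} GB, headroom: {headroom_gb:.0f} GB")
```

<p class="wp-block-paragraph">Under these assumptions the weights take about 123 GB out of 320 GB, and the remainder is what the inference server can use for the KV cache and runtime buffers.</p>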



<p class="wp-block-paragraph">This model requires the use of a <strong>commercial licence</strong>. To do this, you need to create an account on <a href="https://console.mistral.ai/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">La Plateforme</a> via the Mistral AI console (<strong>console.mistral.ai</strong>).</p>



<h3 class="wp-block-heading">AI Training </h3>



<p class="wp-block-paragraph"><strong>OVHcloud AI Training</strong> is a fully managed platform designed to help you <strong>train and tune</strong> Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs) efficiently. Whether you&#8217;re working on computer vision, NLP, or tabular data, this solution lets you launch training jobs on high-performance GPUs in seconds.</p>



<p class="wp-block-paragraph"><strong>What are the key benefits?</strong></p>



<ul class="wp-block-list">
<li><strong>Easy to use</strong>: launch processing or training jobs in one CLI command or a few clicks using your own Docker image</li>



<li><strong>High-performance computing</strong>: access GPUs like H100, A100, V100S, L40S, and L4 as of June 2025 &#8211; new references are added regularly</li>



<li><strong>Cost-efficient</strong>: pay-per-minute billing with no upfront commitment. You only pay for compute time used, with precise control over resources thanks to automatic job stop and synchronisation</li>
</ul>



<p class="wp-block-paragraph"><strong>💡 Why do we need AI Training? </strong>To download the Mistral Large model automatically and efficiently, using a single command to launch the job.</p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p class="wp-block-paragraph">OVHcloud AI Deploy is a<strong>&nbsp;Container as a Service</strong>&nbsp;(CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications / APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.</p>



<p class="wp-block-paragraph"><strong>The key benefits are:</strong></p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong>&nbsp;bring your own custom Docker image and deploy it with a single command line or a few clicks</li>



<li><strong>High-performance computing:</strong>&nbsp;a complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong>&nbsp;supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong>&nbsp;billing per minute, no surcharges</li>
</ul>



<p class="wp-block-paragraph">✅ Before going further, a few prerequisites must be met!</p>



<h2 class="wp-block-heading">Overview of the Mistral Large deployment architecture</h2>



<p class="wp-block-paragraph">Here is how <strong>Mistral Large 123B</strong> will be deployed:</p>



<ol class="wp-block-list">
<li>Install the <strong>ovhai CLI</strong></li>



<li>Create a bucket for <strong>model storage</strong></li>



<li>Retrieve the <strong>license information</strong> from <a href="https://console.mistral.ai/on-premise/licenses" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral Console</a></li>



<li>Configure and set up the<strong> environment</strong></li>



<li>Download the <strong>Mistral Large model weights</strong></li>



<li>Deploy the <strong>Mistral Large service</strong></li>



<li>Test it with a simple request, then explore <strong>advanced usage</strong> with LangChain</li>
</ol>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="173" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-1024x173.png" alt="" class="wp-image-29251" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-1024x173.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-300x51.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-768x130.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process-1536x259.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/mistral_large_archi_process.png 1920w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">Let’s go for the set up and deployment of your own Mistral Large service!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p class="wp-block-paragraph">Before you begin, ensure you have:</p>



<ul class="wp-block-list">
<li>A <strong><a href="https://console.mistral.ai/on-premise/licenses" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral AI license</a></strong> to access the <strong>Mistral Large model</strong></li>



<li>An&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;account</li>



<li>An&nbsp;<strong>OpenStack user</strong>&nbsp;with the following roles:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Training Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>
</ul>



<p class="wp-block-paragraph"><strong>🚀 Now that you have all the ingredients for the recipe, it’s time to </strong>deploy the Mistral Large model on 4 H100<strong>!</strong></p>



<h2 class="wp-block-heading">Architecture guide:&nbsp;Mistral Large on OVHcloud infrastructure</h2>



<p class="wp-block-paragraph">Let’s go for the set up and deployment of the <strong>Mistral Large</strong> model!</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>✅ Note</strong></p>
<cite><strong>In this example, the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Mistral Large 25.02</code></mark> model is used. To use another Mistral model covered by your licence, repeat the same steps, adapting the model name and version.</strong></cite></blockquote>



<p class="wp-block-paragraph">⚙️<em>&nbsp;Also consider that all of the following steps can be automated using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Install&nbsp;<code>ovhai</code>&nbsp;CLI</h3>



<p class="wp-block-paragraph">If the <code><strong>ovhai</strong></code> CLI is not installed, start by setting up your CLI environment.</p>



<pre class="wp-block-code"><code class="">curl https://cli.gra.ai.cloud.ovh.net/install.sh | bash</code></pre>



<p class="wp-block-paragraph">Next, log in using your&nbsp;<strong>OpenStack credentials</strong>.</p>



<pre class="wp-block-code"><code class="">ovhai login -u &lt;openstack-username&gt; -p &lt;openstack-password&gt;</code></pre>



<p class="wp-block-paragraph">Now, it’s time to create your bucket inside OVHcloud Object Storage!</p>



<h3 class="wp-block-heading">Step 2 – Provision Object Storage</h3>



<ol class="wp-block-list">
<li>Go to&nbsp;<strong>Public Cloud &gt; Storage &gt; Object Storage</strong>&nbsp;in the OVHcloud Control Panel.</li>



<li>Create a&nbsp;<strong>datastore</strong>&nbsp;and a new&nbsp;<strong>S3 bucket</strong>&nbsp;(e.g.,&nbsp;<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>s3-mistral-large-model</code></mark></strong>).</li>



<li>Register the datastore with the&nbsp;<code>ovhai</code>&nbsp;CLI:</li>
</ol>



<pre class="wp-block-code"><code class="">ovhai datastore add s3 &lt;ALIAS&gt; https://s3.gra.perf.cloud.ovh.net/ gra &lt;my-access-key&gt; &lt;my-secret-key&gt; --store-credentials-locally</code></pre>



<p class="wp-block-paragraph">💡 <em>Note that, for this use case, we recommend the <strong>High Performance Object Storage</strong> range using <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>https://s3.gra.perf.cloud.ovh.net/</strong></mark></code> instead of <code>https://s3.gra.io.cloud.ovh.net/</code></em></p>



<h3 class="wp-block-heading">Step 3 &#8211; Access the Mistral AI registry</h3>



<p class="wp-block-paragraph"><em>⚠️ Please note that you must have a <strong>licence for the Mistral Large model </strong>to be able to carry out the following steps.</em></p>



<ul class="wp-block-list">
<li>Go to the Mistral AI platform: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://console.mistral.ai/home</mark></strong></li>



<li>Retrieve <strong>credentials</strong> and the <strong>license key</strong> from the Mistral console:<strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"> https://console.mistral.ai/on-premise/licenses</mark></strong></li>



<li>Authenticate to the Mistral AI Docker registry:</li>
</ul>



<pre class="wp-block-code"><code class="">docker login &lt;mistral-ai-registry&gt; --username $DOCKER_USERNAME --password $DOCKER_PASSWORD</code></pre>



<ul class="wp-block-list">
<li>Add the private registry to the config using the <code><strong>ovhai</strong></code> CLI:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai registry add &lt;mistral-ai-registry&gt;</code></pre>



<ul class="wp-block-list">
<li>Check that it is present in the list:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai registry list</code></pre>



<h3 class="wp-block-heading">Step 4 &#8211; Define environment variables</h3>



<p class="wp-block-paragraph">The next step is to define an<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"> <strong><code>.env</code></strong></mark> file that will list all the environment variables required to download and deploy the Mistral Large model.</p>



<ul class="wp-block-list">
<li>Create the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong><code>.env</code></strong></mark> file, enter the following information:</li>
</ul>



<pre class="wp-block-code"><code class="">SERVED_MODEL=mistral-large-2502
RECIPES_VERSION=v0.0.76
TP_SIZE=4
LICENSE_KEY=&lt;your-mistral-license-key&gt;
DOCKER_IMAGE_INFERENCE_ENGINE=&lt;mistral-inference-server-docker-image&gt;
DOCKER_IMAGE_MISTRAL_UTILS=&lt;mistral-utils-docker-image&gt;</code></pre>



<ul class="wp-block-list">
<li>Then, create a script to load these environment variables easily. Name it <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">load_env.sh</mark></strong></code>:</li>
</ul>



<pre class="wp-block-code"><code class="">#!/bin/bash

# Check that the .env file exists
if [ ! -f .env ]; then
  echo "Error: .env not found"
  # 'return' works when the script is sourced; 'exit' covers direct execution
  return 1 2&gt;/dev/null || exit 1
fi

# Export every variable from .env, skipping comment lines
export $(grep -v '^#' .env | xargs)

echo "Environment variables are loaded from .env"</code></pre>



<ul class="wp-block-list">
<li>Now, launch this script:</li>
</ul>



<pre class="wp-block-code"><code class="">source load_env.sh</code></pre>
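


<p class="wp-block-paragraph">Before launching any job, it can help to confirm that every variable is actually set. The following is a minimal sketch of such a check. The variable names come from the <code>.env</code> file above, while the helper <code>missing_vars</code> is a hypothetical name introduced here for illustration.</p>

```python
import os

# Variable names come from the .env file above.
REQUIRED_VARS = [
    "SERVED_MODEL",
    "RECIPES_VERSION",
    "TP_SIZE",
    "LICENSE_KEY",
    "DOCKER_IMAGE_INFERENCE_ENGINE",
    "DOCKER_IMAGE_MISTRAL_UTILS",
]


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Missing variables: " + ", ".join(missing))
    else:
        print("All required variables are set")
```

<p class="wp-block-paragraph">Run it in the same shell session in which the variables were loaded, since exported variables are only visible to processes started from that shell.</p>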



<p class="wp-block-paragraph">✅ You have everything you need to start the implementation!</p>



<h3 class="wp-block-heading">Step 5 &#8211; Download Mistral Large model weights</h3>



<p class="wp-block-paragraph">The aim here is to download the model and its artefacts into the S3 bucket created earlier.</p>



<p class="wp-block-paragraph">To achieve this, you can launch a download job that will run automatically with AI Training.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong> 💡 Here&#8217;s a tip! </strong></p>
<cite><strong>Note that here you are not using AI Training to train models, but as an easy-to-use Container as a Service solution. With a single command line, you can launch a one-shot download of the Mistral Large model with automatic synchronisation to Object Storage.</strong></cite></blockquote>



<ul class="wp-block-list">
<li>Launch the <strong>AI Training</strong> download job by attaching the object container:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai job run --name DOWNLOAD_MISTRAL_LARGE_123B \
              --cpu 12 \
              --volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW \
              -e RECIPES_VERSION=$RECIPES_VERSION \
              $DOCKER_IMAGE_MISTRAL_UTILS \
                -- bash -c "cd /app/mistral-rclone &amp;&amp; \
                  poetry run python mistral-rclone.py \
                  --license-key $LICENSE_KEY \
                  --download-model $SERVED_MODEL"</code></pre>



<p class="wp-block-paragraph"><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai job run</code></li>
</ul>



<p class="wp-block-paragraph">This is the core command to&nbsp;<strong>run a job</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Training</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code>--name DOWNLOAD_MISTRAL_LARGE_123B</code></li>
</ul>



<p class="wp-block-paragraph">Sets a&nbsp;<strong>custom name</strong>&nbsp;for the job. For example,&nbsp;<code><code>DOWNLOAD_MISTRAL_LARGE_123B</code></code>.</p>



<ul class="wp-block-list">
<li><code>--cpu&nbsp;12</code></li>
</ul>



<p class="wp-block-paragraph">Allocates&nbsp;<strong>12 CPU</strong>&nbsp;for the job.</p>



<ul class="wp-block-list">
<li><code>--volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW</code></li>
</ul>



<p class="wp-block-paragraph">This mounts your&nbsp;<strong>OVHcloud Object Storage volume</strong>&nbsp;into the job’s file system:<br>–&nbsp;<code>s3-mistral-large-model@&lt;ALIAS&gt;/</code>: refers to your&nbsp;<strong>S3 bucket volume</strong>&nbsp;from the OVHcloud Object Storage<br>–&nbsp;<code>:<code>/opt/ml/model</code></code>: mounts the volume into the container under&nbsp;<code><code>/opt/ml/model</code></code><br>–&nbsp;<code>RW</code>: enables&nbsp;<strong>Read/Write</strong>&nbsp;permissions</p>



<ul class="wp-block-list">
<li><code>-e RECIPES_VERSION=$RECIPES_VERSION</code></li>
</ul>



<p class="wp-block-paragraph">Passes the <code>RECIPES_VERSION</code> <strong>environment variable</strong>&nbsp;defined previously to the job.</p>



<ul class="wp-block-list">
<li><code>$DOCKER_IMAGE_MISTRAL_UTILS</code></li>
</ul>



<p class="wp-block-paragraph">This is the<strong>&nbsp;Mistral Large utils Docker image</strong>&nbsp;you are running inside the job.</p>



<ul class="wp-block-list">
<li><code>-- bash -c "cd /app/mistral-rclone &amp;&amp; \</code><br><code>               poetry run python mistral-rclone.py \</code><br><code>                   --license-key $LICENSE_KEY \</code><br><code>                   --download-model $SERVED_MODEL"</code></li>
</ul>



<p class="wp-block-paragraph">Refers to the specific command to <strong>launch the model download</strong>.</p>
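


<p class="wp-block-paragraph">The <code>--volume</code> argument described above packs several fields into one string. As an illustration, here is a small sketch that splits it into its parts. It only handles the three-field <code>container@alias/prefix:mount_path:permission</code> form used in this article. The names <code>VolumeSpec</code> and <code>parse_volume</code> and the alias <code>my-alias</code> are hypothetical, and the CLI may accept other forms that this sketch rejects.</p>

```python
from typing import NamedTuple


class VolumeSpec(NamedTuple):
    container: str   # bucket name
    alias: str       # datastore alias registered with the CLI
    prefix: str      # path inside the bucket ("" means the bucket root)
    mount_path: str  # where the volume appears inside the container
    permission: str  # RW in this article


def parse_volume(spec):
    """Split 'container@alias/prefix:mount_path:permission' into its parts."""
    # Raises ValueError if the spec does not have exactly three ':' fields.
    source, mount_path, permission = spec.split(":")
    container, _, location = source.partition("@")
    alias, _, prefix = location.partition("/")
    if not container or not alias or not mount_path.startswith("/"):
        raise ValueError(f"unexpected volume spec: {spec!r}")
    return VolumeSpec(container, alias, prefix, mount_path, permission)


# 'my-alias' stands in for the datastore alias chosen in Step 2.
print(parse_volume("s3-mistral-large-model@my-alias/:/opt/ml/model:RW"))
```

<p class="wp-block-paragraph">An empty prefix, as in this article, means the whole bucket is mounted rather than a sub-path.</p>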



<p class="wp-block-paragraph"><em>Note that synchronisation with Object Storage will be <strong>automatic at the end of the AI Training job</strong>.</em></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">⚠️ <strong>WARNING!</strong></p>
<cite><strong>Wait for the job to go to <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">DONE</mark></code> before proceeding to the next step</strong>.</cite></blockquote>
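


<p class="wp-block-paragraph">If you script this step, the waiting can be expressed as a simple polling loop. The sketch below is deliberately generic: <code>get_state</code> is a hypothetical callable that you would back with your own status lookup (for example by reading the output of <code>ovhai job get</code>), and the state names in the demonstration other than <code>DONE</code> are illustrative.</p>

```python
import time


def wait_for_state(get_state, target, failure_states=(), interval=30.0, max_checks=120):
    """Poll get_state() until it returns target; stop early on a failure state."""
    for _ in range(max_checks):
        state = get_state()
        if state == target:
            return state
        if state in failure_states:
            raise RuntimeError(f"reached state {state!r} before {target!r}")
        time.sleep(interval)
    raise TimeoutError(f"state {target!r} not reached after {max_checks} checks")


# Demonstration with a fake status source instead of a real CLI call.
fake_states = iter(["QUEUED", "RUNNING", "DONE"])  # illustrative values
print(wait_for_state(lambda: next(fake_states), "DONE", interval=0))
```

<p class="wp-block-paragraph">Passing the status lookup as a callable keeps the loop independent of how the state is retrieved, so the same helper can later wait for an app to reach <code>RUNNING</code>.</p>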



<ul class="wp-block-list">
<li>Check that the various elements are present in the bucket:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai bucket object list s3-mistral-large-model@&lt;ALIAS&gt;</code></pre>



<p class="wp-block-paragraph">The bucket must contain four folders:</p>



<ul class="wp-block-list">
<li>grammars</li>



<li>recipes</li>



<li>tokenizers</li>



<li>weights</li>
</ul>



<p class="wp-block-paragraph">Note that a total of 6 elements must be present.</p>
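


<p class="wp-block-paragraph">This layout check can also be automated. The sketch below takes a list of object keys (however you obtain them) and reports which of the four expected folders is missing. The helper name <code>missing_folders</code> and the sample keys are hypothetical; real file names depend on the model and recipe version.</p>

```python
EXPECTED_FOLDERS = {"grammars", "recipes", "tokenizers", "weights"}


def missing_folders(object_keys, expected=EXPECTED_FOLDERS):
    """Return the expected top-level folders that no object key falls under."""
    found = {key.split("/", 1)[0] for key in object_keys if "/" in key}
    return sorted(expected - found)


# Hypothetical object keys; real file names depend on the model and recipe version.
sample_keys = [
    "grammars/example.bin",
    "recipes/example.yaml",
    "tokenizers/example.json",
]
print(missing_folders(sample_keys))
```

<p class="wp-block-paragraph">An empty result means all four folders are present. A non-empty result, as with the sample keys here, suggests the download job did not finish synchronising.</p>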



<p class="wp-block-paragraph">🚀 It&#8217;s all there? So let&#8217;s move on to the <strong>deployment of the Mistral Large model</strong>!</p>



<h3 class="wp-block-heading">Step 6 &#8211; Deploy Mistral Large service</h3>



<p class="wp-block-paragraph">To deploy the Mistral Large 123B model using the previously downloaded weights, you will use OVHcloud&#8217;s <strong>AI Deploy </strong>product.</p>



<p class="wp-block-paragraph">But first, you need to create an access token that will allow you to query the model, in particular through its OpenAI-compatible API.</p>



<ul class="wp-block-list">
<li>Create an access token:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai token create --role read mistral_large=api_key_reader</code></pre>



<ul class="wp-block-list">
<li>Export this token as an environment variable:</li>
</ul>



<pre class="wp-block-code"><code class="">export MY_OVHAI_MISTRAL_LARGE_TOKEN=&lt;your_ovh_access_token_value&gt;</code></pre>



<ul class="wp-block-list">
<li>Launch the <strong>Mistral Large service</strong> with <strong>AI Deploy </strong>by running the following command:</li>
</ul>



<pre class="wp-block-code"><code class="">ovhai app run --name DEPLOY_MISTRAL_LARGE_123B \
              --gpu 4 \
              --flavor h100-1-gpu \
              --default-http-port 5000 \
              --label mistral_large=api_key_reader \
              -e SERVED_MODEL=$SERVED_MODEL \
              -e RECIPES_VERSION=$RECIPES_VERSION \
              -e TP_SIZE=$TP_SIZE \
              --volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW \
              --volume standalone:/tmp:RW \
              --volume standalone:/workspace:RW \
              $DOCKER_IMAGE_INFERENCE_ENGINE</code></pre>



<p class="wp-block-paragraph"><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code>ovhai app run</code></li>
</ul>



<p class="wp-block-paragraph">This is the core command to&nbsp;<strong>run an app / API</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code>--name DEPLOY_MISTRAL_LARGE_123B</code></li>
</ul>



<p class="wp-block-paragraph">Sets a&nbsp;<strong>custom name</strong>&nbsp;for the app. For example,&nbsp;<code>DEPLOY_MISTRAL_LARGE_123B</code>.</p>



<ul class="wp-block-list">
<li><code>--default-http-port 5000</code></li>
</ul>



<p class="wp-block-paragraph">Exposes&nbsp;<strong>port 5000</strong>&nbsp;as the default HTTP endpoint.</p>



<ul class="wp-block-list">
<li><code>--gpu 4</code></li>
</ul>



<p class="wp-block-paragraph">Allocates&nbsp;<strong>4 GPUs</strong>&nbsp;for the app.</p>



<ul class="wp-block-list">
<li><code>--flavor h100-1-gpu</code></li>
</ul>



<p class="wp-block-paragraph">Chooses&nbsp;<strong>H100 GPUs</strong>&nbsp;for the app.</p>



<ul class="wp-block-list">
<li><code>--volume s3-mistral-large-model@&lt;ALIAS&gt;/:/opt/ml/model:RW</code></li>
</ul>



<p class="wp-block-paragraph">This mounts your&nbsp;<strong>OVHcloud Object Storage volume</strong>&nbsp;into the job’s file system:<br>–&nbsp;<code>s3-mistral-large-model@&lt;ALIAS&gt;/</code>: refers to your&nbsp;<strong>S3 bucket volume</strong>&nbsp;from the OVHcloud Object Storage<br>–&nbsp;<code>:<code>/opt/ml/model</code></code>: mounts the volume into the container under&nbsp;<code><code>/opt/ml/model</code></code><br>–&nbsp;<code>RW</code>: enables&nbsp;<strong>Read/Write</strong>&nbsp;permissions</p>



<ul class="wp-block-list">
<li><code>--label mistral_large=api_key_reader</code></li>
</ul>



<p class="wp-block-paragraph">Labels the app so that access is restricted to your token.</p>



<ul class="wp-block-list">
<li><code>-e SERVED_MODEL=$SERVED_MODEL</code></li>



<li><code>-e RECIPES_VERSION=$RECIPES_VERSION</code></li>



<li><code>-e TP_SIZE=$TP_SIZE</code></li>
</ul>



<p class="wp-block-paragraph">These are&nbsp;<strong>environment variables</strong>&nbsp;defined previously.</p>



<ul class="wp-block-list">
<li><code>--volume standalone:/tmp:RW</code></li>



<li><code>--volume standalone:/workspace:RW</code></li>
</ul>



<p class="wp-block-paragraph">Mounts&nbsp;<strong>two persistent storage volumes</strong>:<br>&#8211; <code>/tmp</code><br>&#8211; <code>/workspace</code>&nbsp;→ Main working directory</p>



<ul class="wp-block-list">
<li><code>$DOCKER_IMAGE_INFERENCE_ENGINE</code></li>
</ul>



<p class="wp-block-paragraph">This is the<strong>&nbsp;Mistral Large inference Docker image</strong>&nbsp;you are running inside the app.</p>



<p class="wp-block-paragraph"><em>It&nbsp;may&nbsp;take&nbsp;a&nbsp;few&nbsp;minutes&nbsp;for&nbsp;the&nbsp;resources&nbsp;to&nbsp;be&nbsp;allocated&nbsp;and&nbsp;for&nbsp;the&nbsp;<strong>Docker image</strong>&nbsp;to&nbsp;be&nbsp;pulled.&nbsp;</em></p>



<p class="wp-block-paragraph">To&nbsp;check&nbsp;the&nbsp;progress&nbsp;and&nbsp;get&nbsp;additional&nbsp;information&nbsp;about&nbsp;the&nbsp;<strong>AI&nbsp;deploy&nbsp;app</strong>,&nbsp;run&nbsp;the&nbsp;following&nbsp;command:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;ai_deploy_mistral_app_id&gt;</code></pre>



<p class="wp-block-paragraph">Once the app reaches <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></code></strong> status, the model starts loading. To confirm that it loaded successfully, check the container logs:</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;ai_deploy_mistral_app_id&gt;</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">⚠️ <strong>WARNING!</strong></p>
<cite><strong>To consume the service, you must wait for the app to go into <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">RUNNING</mark></code> status, AND for the model to finish loading.</strong></cite></blockquote>



<p class="wp-block-paragraph">🎉 Everything ready? You can now start playing with the model!</p>



<h3 class="wp-block-heading">Step 7 &#8211; Test the Mistral Large model by sending your first requests</h3>



<ul class="wp-block-list">
<li>Access the API doc via your app URL:</li>
</ul>



<p class="wp-block-paragraph"><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code><strong>https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/docs</strong></code></mark></p>



<p class="wp-block-paragraph">To find the information, please refer to <a href="https://console.mistral.ai/on-premise/licenses" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://console.mistral.ai/on-premise/licenses</mark></strong></a></p>



<ul class="wp-block-list">
<li>Test with a basic cURL:</li>
</ul>



<pre class="wp-block-code"><code class="">curl -X 'POST' \
'https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $MY_OVHAI_MISTRAL_LARGE_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "mistral-large-&lt;version&gt;",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant!"
    },
    {
      "role": "user",
      "content": "What is the capital of France?"     
    }
  ]
}'</code></pre>



<p class="wp-block-paragraph"><strong>⚠️ Note that you must also replace <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>&lt;version&gt;</code></mark> in the model name with the version you are using: </strong><br><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code><strong>"model": "mistral-large-&lt;version&gt;"</strong></code></mark></p>
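


<p class="wp-block-paragraph">The same request can be prepared from Python using only the standard library. The sketch below mirrors the cURL call above and builds the request without sending it. The function name <code>build_chat_request</code>, the URL and the token value are placeholders, to be replaced with your app URL and access token.</p>

```python
import json
import urllib.request


def build_chat_request(base_url, token, model, question):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant!"},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder URL and token; substitute your app URL and access token.
request = build_chat_request(
    "https://app-id.example.invalid", "my-token",
    "mistral-large-2502", "What is the capital of France?",
)
print(request.get_method(), request.full_url)
# To send: urllib.request.urlopen(request), then json.load() on the response.
```

<p class="wp-block-paragraph">Keeping the request construction separate from the network call makes it easy to inspect the payload and headers before anything is sent to the endpoint.</p>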



<p class="wp-block-paragraph">To take the implementation a step further and take advantage of all the features of this endpoint, you can also integrate it with <strong>LangChain</strong> thanks to its full OpenAI compatibility.</p>



<ul class="wp-block-list">
<li>LangChain integration:</li>
</ul>



<pre class="wp-block-code"><code class="">import os
import time

# Requires the langchain-openai package (pip install langchain-openai)
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

def chat_completion_basic(new_message: str):

  model = ChatOpenAI(model_name="mistral-large-&lt;version&gt;",
                        # Read the token exported earlier from the environment
                        openai_api_key=os.environ["MY_OVHAI_MISTRAL_LARGE_TOKEN"],
                        openai_api_base='https://&lt;ai_deploy_mistral_app_id&gt;.app.gra.ai.cloud.ovh.net/v1',
                       )

  prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant!"),
    ("human", "{question}"),
  ])

  chain = prompt | model

  print("🤖: ")
  # stream() expects a dict mapping template variables to values
  for r in chain.stream({"question": new_message}):
    print(r.content, end="", flush=True)
    time.sleep(0.150)

chat_completion_basic("What is the capital of France?")</code></pre>



<p class="wp-block-paragraph">🥹 Congratulations! You have successfully completed the deployment!</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p class="wp-block-paragraph">You can now consume your <strong>Mistral Large 123B</strong> in a secure environment!</p>



<p class="wp-block-paragraph">The result of your implementation? The deployment of a sovereign, scalable, production-quality 123B LLM, powered by <strong>OVHcloud AI Deploy</strong>.</p>



<p class="wp-block-paragraph">➡️ <strong>Want to go further? </strong></p>



<ul class="wp-block-list">
<li>Update your model in a single command line and without interruption following this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-update-custom-docker-image?id=kb_article_view&amp;sysparm_article=KB0057968" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a></li>



<li>Scale out to additional replicas under heavy load to ensure high availability using this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-apps-deployments?id=kb_article_view&amp;sysparm_article=KB0047997" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">method</a></li>
</ul>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Mistral Small 24B served with vLLM and AI Deploy &#8211; a single command to deploy an LLM (Part 1)</title>
		<link>https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Mon, 24 Feb 2025 10:08:37 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28212</guid>

					<description><![CDATA[You are not dreaming! You can deploy an open-source LLM in a single command line. Deploying advanced language models can be a challenge! But this sometimes arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications. In this guide, we will walk through deploying the Mistral-Small-24B-Instruct-2501 model using vLLM [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><strong><em>You are not dreaming! You can deploy an open-source LLM in a single command line</em>.</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="724" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1024x724.png" alt="Rocket in MistralAI colors in a data center with a French rooster showing rapid LLM deployment" class="wp-image-28219" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1024x724.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-300x212.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-768x543.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy-1536x1086.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/image_blog_post_mistral_small_ai_deploy.png 2000w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">Deploying advanced language models can be a challenge! But this sometimes arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications.</p>



<p class="wp-block-paragraph">In this guide, we will walk through deploying the <strong><a href="https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Instruct-2501</a></strong> model using <strong>vLLM</strong> on OVHcloud&#8217;s <a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy platform</a>. This combination offers a powerful solution for efficient and scalable AI model serving.</p>



<p class="wp-block-paragraph">Deploying a model is great, but doing it quickly is even better!</p>



<p class="wp-block-paragraph">🤯 <strong>What if a single command line was enough?</strong> That&#8217;s the challenge we&#8217;re tackling today!</p>



<h2 class="wp-block-heading">Context</h2>



<p class="wp-block-paragraph">Before deployment, let’s take a closer look at our key technologies!</p>



<h3 class="wp-block-heading">Mistral Small</h3>



<p class="wp-block-paragraph">The <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code> is a 24-billion-parameter instruction-fine-tuned model, renowned for its compact size and performance comparable to larger models.</p>



<p class="wp-block-paragraph">This model, from <a href="https://mistral.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MistralAI</a>, is an instruction-fine-tuned version of the base model:&nbsp;<a href="https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Base-2501</a>.</p>



<p class="wp-block-paragraph">To serve this model efficiently, we will utilize vLLM, an open-source library for <strong>LLM inference</strong>.</p>



<h3 class="wp-block-heading">vLLM</h3>



<p class="wp-block-paragraph"><a href="https://docs.vllm.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM</a> (<strong>Virtual LLM</strong>) is a highly optimized serving engine designed to run large language models efficiently. It takes advantage of several key optimizations, such as:</p>



<ul class="wp-block-list">
<li><strong>PagedAttention:</strong> an attention mechanism that reduces memory fragmentation and enables more efficient use of GPU memory</li>



<li><strong>Continuous Batching:</strong> vLLM dynamically adjusts batch sizes in real time, ensuring that the GPU is always used efficiently, even with multiple simultaneous requests</li>



<li><strong>Tensor parallelism:</strong> enables model inference across multiple GPUs to boost performance</li>



<li><strong>Optimized kernel implementations:</strong> vLLM uses custom CUDA kernels for faster execution, reducing latency compared to traditional inference frameworks</li>
</ul>
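<p class="wp-block-paragraph">To make the PagedAttention idea more concrete, here is a toy Python sketch of block-based KV-cache allocation. It is not vLLM code: the <code>ToyBlockAllocator</code> class, the block size of 16 tokens and the pool of 8 blocks are assumptions chosen for illustration. It only shows why handing out small fixed-size blocks on demand avoids reserving one large contiguous region per request.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Toy illustration only: this is NOT vLLM code, all names are hypothetical.
import math


class ToyBlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared pool, on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused blocks
        self.block_tables = {}  # request id -> list of block ids

    def reserve(self, request_id, num_tokens):
        """Make sure the request owns enough blocks to hold num_tokens."""
        table = self.block_tables.setdefault(request_id, [])
        needed = math.ceil(num_tokens / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("no free KV-cache block left")
        for _ in range(max(needed, 0)):
            table.append(self.free_blocks.pop())
        return table

    def release(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))


allocator = ToyBlockAllocator(num_blocks=8)
allocator.reserve("req-a", 20)   # 20 tokens -> 2 blocks of 16 tokens
allocator.reserve("req-b", 5)    # 5 tokens  -> 1 block
allocator.reserve("req-a", 40)   # the sequence grew -> 3 blocks in total
print(len(allocator.block_tables["req-a"]), len(allocator.free_blocks))</code></pre>



<p class="wp-block-paragraph">In this sketch, a request only holds the blocks it actually needs, and the blocks of a finished request go straight back to the shared pool.</p>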



<p class="wp-block-paragraph">These features make vLLM one of the best choices for large models such as Mistral Small 24B, enabling low-latency, high-throughput inference on the latest GPUs.</p>



<p class="wp-block-paragraph">By deploying on OVHcloud&#8217;s AI Deploy platform, you can deploy this model in a single command line.</p>



<h3 class="wp-block-heading">AI Deploy </h3>



<p class="wp-block-paragraph">OVHcloud AI Deploy is a <strong>Container as a Service</strong> (CaaS) platform designed to help you deploy, manage and scale AI models. It provides a solution that allows you to optimally deploy your applications and APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.</p>



<p class="wp-block-paragraph">The key benefits are:</p>



<ul class="wp-block-list">
<li><strong>Easy to use:</strong> bring your own custom Docker image and deploy it with a single command line or in a few clicks</li>



<li><strong>High-performance computing:</strong> a complete range of GPUs available (H100, A100, V100S, L40S and L4)</li>



<li><strong>Scalability and flexibility:</strong> supports automatic scaling, allowing your model to effectively handle fluctuating workloads</li>



<li><strong>Cost-efficient:</strong> billing per minute, no surcharges</li>
</ul>



<p class="wp-block-paragraph">✅ Before going further, a few prerequisites must be met!</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p class="wp-block-paragraph">Before you begin, ensure that you have:</p>



<ul class="wp-block-list">
<li><strong>OVHcloud account</strong>: access to the&nbsp;<a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude">OVHcloud Control Panel</a></li>



<li><strong>ovhai CLI available:</strong> install the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ovhai CLI</a></li>



<li><strong>AI Deploy access</strong>: ensure you have a <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for AI Deploy</a></li>



<li><strong>Hugging Face access</strong>: create an <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a> and generate an <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">access token</a></li>



<li><strong>Gated model authorization</strong>: make sure you have been granted access to the <a href="https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Mistral-Small-24B-Instruct-2501</a> model</li>
</ul>



<p class="wp-block-paragraph"><strong>🚀 Having all the ingredients for our recipe, it&#8217;s time to deploy!</strong></p>



<h2 class="wp-block-heading">Deployment of the Mistral Small 24B LLM</h2>



<p class="wp-block-paragraph">Let&#8217;s go for the deployment of the model <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code></p>



<h3 class="wp-block-heading">Manage access tokens</h3>



<p class="wp-block-paragraph">Export your <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face token</a>.</p>



<pre class="wp-block-code"><code class="">export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx</code></pre>



<p class="wp-block-paragraph"><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a token</a> to access your AI Deploy app once it is deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p class="wp-block-paragraph">Returning the following output:</p>



<p class="wp-block-paragraph"><code><strong>Id:         47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-02-25 11:53:05<br>Updated At: 20-02-25 11:53:05<br>Spec:<br>  Name:           ai_deploy_token=my_operator_token<br>  Role:           AiTrainingOperator<br>  Label Selector: <br>Status:<br>  Value:   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>  Version: 1</strong></code></p>



<p class="wp-block-paragraph">You can now store and export your access token:</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h3 class="wp-block-heading">Launch Mistral Small LLM with AI Deploy</h3>



<p class="wp-block-paragraph">You are ready to start <strong>Mistral-Small-24B</strong> using vLLM and AI Deploy:</p>



<pre class="wp-block-code"><code class="">ovhai app run --name vllm-mistral-small \
              --default-http-port 8000 \
              --label ai_deploy_token=my_operator_token \
              --gpu 2 \
              --flavor l40s-1-gpu \
              -e OUTLINES_CACHE_DIR=/tmp/.outlines \
              -e HF_TOKEN=$MY_HF_TOKEN \
              -e HF_HOME=/hub \
              -e HF_DATASETS_TRUST_REMOTE_CODE=1 \
              -e HF_HUB_ENABLE_HF_TRANSFER=0 \
              -v standalone:/hub:rw \
              -v standalone:/workspace:rw \
              vllm/vllm-openai:v0.8.2 \
              -- bash -c "python3 -m vllm.entrypoints.openai.api_server \
                        --model mistralai/Mistral-Small-24B-Instruct-2501 \
                        --tensor-parallel-size 2 \
                        --tokenizer_mode mistral \
                        --load_format mistral \
                        --config_format mistral \
                        --dtype half"</code></pre>



<p class="wp-block-paragraph"><strong>How to understand the different parameters of this command?</strong></p>



<h5 class="wp-block-heading">1. Start your AI Deploy app</h5>



<p class="wp-block-paragraph">Launch a new app using <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ovhai CLI</a> and name it.</p>



<p class="wp-block-paragraph"><code><strong>ovhai app run --name vllm-mistral-small</strong></code></p>



<h5 class="wp-block-heading">2. Define access</h5>



<p class="wp-block-paragraph">Define the HTTP API port and restrict access to your token.</p>



<p class="wp-block-paragraph"><strong><code>--default-http-port 8000</code><br><code>--label ai_deploy_token=my_operator_token</code></strong></p>



<h5 class="wp-block-heading">3. Configure GPU resources</h5>



<p class="wp-block-paragraph">Specify the hardware type (<code><strong>l40s-1-gpu</strong></code>), which refers to an <strong>NVIDIA L40S GPU</strong>, and the number of GPUs (<code><strong>2</strong></code>).</p>



<p class="wp-block-paragraph"><code><strong>--gpu 2<br>--flavor l40s-1-gpu</strong></code></p>



<p class="wp-block-paragraph"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️WARNING!</mark></strong> For this model, two L40S GPUs are sufficient, but if you want to deploy another model, you will need to check which GPU you need. Note that you can also access A100 and H100 GPUs for your larger models.</p>
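<p class="wp-block-paragraph">As a rough way to check which GPU you need, you can estimate the memory taken by the model weights alone. The sketch below is a back-of-the-envelope calculation, not an official sizing tool: the <code>weights_gb_per_gpu</code> helper is hypothetical, the 48 GB figure is the memory of one NVIDIA L40S, and the estimate ignores the KV cache and activations, which also need room.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Back-of-the-envelope sizing sketch (hypothetical helper, weights only).

def weights_gb_per_gpu(params_billions, bytes_per_param, num_gpus):
    """Approximate weight memory per GPU, in GB, with tensor parallelism."""
    total_gb = params_billions * bytes_per_param  # 1e9 params * bytes / 1e9
    return total_gb / num_gpus


# Mistral Small: about 24 billion parameters, FP16 (--dtype half) = 2 bytes each,
# split across 2 GPUs (--tensor-parallel-size 2).
per_gpu = weights_gb_per_gpu(24, 2, 2)
L40S_MEMORY_GB = 48  # memory of a single NVIDIA L40S

print(f"{per_gpu:.0f} GB of weights per GPU")
print(f"{L40S_MEMORY_GB - per_gpu:.0f} GB left per GPU for KV cache and activations")</code></pre>



<p class="wp-block-paragraph">Under these assumptions, a single L40S would be filled by the FP16 weights alone (about 48 GB), which is consistent with using two GPUs here.</p>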



<h5 class="wp-block-heading">4. Set up environment variables</h5>



<p class="wp-block-paragraph">Configure caching for the <strong>Outlines library</strong> (used for structured text generation):</p>



<p class="wp-block-paragraph"><code><strong>-e OUTLINES_CACHE_DIR=/tmp/.outlines</strong></code></p>



<p class="wp-block-paragraph">Pass the <strong>Hugging Face token</strong> (<code>$MY_HF_TOKEN</code>) for model authentication and download:</p>



<p class="wp-block-paragraph"><code><strong>-e HF_TOKEN=$MY_HF_TOKEN</strong></code></p>



<p class="wp-block-paragraph">Set the <strong>Hugging Face cache directory</strong> to <code>/hub</code> (where models will be stored):</p>



<p class="wp-block-paragraph"><code><strong>-e HF_HOME=/hub</strong></code></p>



<p class="wp-block-paragraph">Allow execution of <strong>custom remote code</strong> from Hugging Face datasets (required for some model behaviors):</p>



<p class="wp-block-paragraph"><code><strong>-e HF_DATASETS_TRUST_REMOTE_CODE=1</strong></code></p>



<p class="wp-block-paragraph">Disable <strong>Hugging Face Hub transfer acceleration</strong> (to use standard model downloading):</p>



<p class="wp-block-paragraph"><code><strong>-e HF_HUB_ENABLE_HF_TRANSFER=0</strong></code></p>



<h5 class="wp-block-heading">5. Mount persistent volumes</h5>



<p class="wp-block-paragraph">Mounts <strong>two persistent storage volumes</strong>:</p>



<ul class="wp-block-list">
<li><code>/hub</code> → Stores Hugging Face model files</li>



<li><code>/workspace</code> → Main working directory</li>
</ul>



<p class="wp-block-paragraph">The <code>rw</code> flag means <strong>read-write access</strong>.</p>



<p class="wp-block-paragraph"><code><strong>-v standalone:/hub:rw<br>-v standalone:/workspace:rw</strong></code></p>



<h5 class="wp-block-heading">6. Choose the target Docker image</h5>



<p class="wp-block-paragraph">Uses the <strong><code>vllm/vllm-openai:v0.8.2</code></strong> Docker image (a pre-configured vLLM OpenAI API server).</p>



<p class="wp-block-paragraph"><strong><code>vllm/vllm-openai:v0.8.2</code></strong></p>



<h5 class="wp-block-heading">7. Running the model inside the container</h5>



<p class="wp-block-paragraph">Runs a <strong>bash shell</strong> inside the container and executes a Python command to launch the vLLM API server:</p>



<ul class="wp-block-list">
<li><strong><code>python3 -m vllm.entrypoints.openai.api_server</code></strong> → Starts the OpenAI-compatible vLLM API server</li>



<li><strong><code>--model mistralai/Mistral-Small-24B-Instruct-2501</code></strong> → Loads the <strong>Mistral Small 24B</strong> model from Hugging Face</li>



<li><strong><code>--tensor-parallel-size 2</code></strong> → Distributes the model across <strong>2 GPUs</strong></li>



<li><strong><code>--tokenizer_mode mistral</code></strong> → Uses the <strong>Mistral tokenizer</strong></li>



<li><strong><code>--load_format mistral</code></strong> → Uses Mistral’s model loading format</li>



<li><strong><code>--config_format mistral</code></strong> → Ensures the model configuration follows Mistral&#8217;s standard</li>



<li><strong><code>--dtype half</code></strong> → Uses <strong>FP16 (half-precision floating point)</strong> for optimized GPU performance</li>
</ul>
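<p class="wp-block-paragraph">If you plan to reuse this command for other models or GPU counts, it can help to keep the flags in one place. The following Python sketch assembles the same server command from a few arguments. The <code>build_vllm_command</code> helper is hypothetical; the flags themselves are the ones listed above.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Sketch: assemble the vLLM server command from a dictionary of options.
# build_vllm_command is a hypothetical helper; the flags are the ones above.
import shlex


def build_vllm_command(model, num_gpus, dtype="half"):
    """Return the shell command that starts the OpenAI-compatible server."""
    options = {
        "--model": model,
        "--tensor-parallel-size": str(num_gpus),
        "--tokenizer_mode": "mistral",
        "--load_format": "mistral",
        "--config_format": "mistral",
        "--dtype": dtype,
    }
    argv = ["python3", "-m", "vllm.entrypoints.openai.api_server"]
    for flag, value in options.items():
        argv += [flag, value]
    # shlex.join quotes each argument safely for a POSIX shell
    return shlex.join(argv)


print(build_vllm_command("mistralai/Mistral-Small-24B-Instruct-2501", 2))</code></pre>



<p class="wp-block-paragraph">The resulting string is what you would pass to <code>bash -c</code> at the end of the <code>ovhai app run</code> command.</p>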



<p class="wp-block-paragraph">You can now check if your <strong>AI Deploy</strong> app is alive:</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;your_vllm_app_id&gt;</code></pre>



<p class="wp-block-paragraph">💡<strong>Is your app in <code>RUNNING</code> status?</strong> Perfect! You can check in the logs that the server is started&#8230;</p>



<pre class="wp-block-code"><code class="">ovhai app logs &lt;your_vllm_app_id&gt;</code></pre>



<p class="wp-block-paragraph"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">⚠️WARNING!</mark></strong> This step may take a little time as the model must be loaded&#8230;<br>After a few minutes, you should get the following information in the logs:</p>



<p class="wp-block-paragraph"><code><strong>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Started server process [13]<br>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Waiting for application startup.<br>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Application startup complete.<br>2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)</strong></code></p>
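<p class="wp-block-paragraph">If you prefer to script this check, the sketch below looks for the startup message in the log text. It is a minimal example: the <code>server_is_ready</code> helper is hypothetical, and it assumes you have already captured the output of <code>ovhai app logs</code> as a string.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Minimal sketch: detect that the vLLM server has finished starting.
# server_is_ready is a hypothetical helper, not part of the ovhai CLI.

READY_MARKER = "Application startup complete."


def server_is_ready(log_text):
    """Return True if any log line contains the Uvicorn startup message."""
    return any(READY_MARKER in line for line in log_text.splitlines())


sample_logs = "\n".join([
    "2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Started server process [13]",
    "2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Waiting for application startup.",
    "2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Application startup complete.",
])
print(server_is_ready(sample_logs))</code></pre>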



<p class="wp-block-paragraph">🚦 <strong>Are all the indicators green? </strong>Then it&#8217;s off to inference!</p>



<h3 class="wp-block-heading">Request and send prompt to the LLM</h3>



<p class="wp-block-paragraph">Launch the following query by asking the question of your choice:</p>



<pre class="wp-block-code"><code class="">curl https://&lt;your_vllm_app_id&gt;.app.gra.ai.cloud.ovh.net/v1/chat/completions \
  -H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Give me the name of OVHcloud’s founder."}
    ],
    "stream": false
  }'</code></pre>
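<p class="wp-block-paragraph">The same request can be sent from Python with the standard library only. This is a sketch under assumptions: <code>build_chat_request</code> and <code>ask</code> are hypothetical helpers, the app ID is passed as an argument, and the token is read from the <code>MY_OVHAI_ACCESS_TOKEN</code> environment variable exported earlier.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Sketch of the cURL call above using only the Python standard library.
# build_chat_request and ask are hypothetical helper names.
import json
import os
import urllib.request

MODEL = "mistralai/Mistral-Small-24B-Instruct-2501"


def build_chat_request(app_id, token, question):
    """Build (without sending) the POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"https://{app_id}.app.gra.ai.cloud.ovh.net/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask(app_id, question):
    """Send the request and return the assistant answer as a string."""
    token = os.environ["MY_OVHAI_ACCESS_TOKEN"]
    request = build_chat_request(app_id, token, question)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]</code></pre>



<p class="wp-block-paragraph">Nothing is sent until you call <code>ask</code> with your own app ID and a question of your choice.</p>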



<p class="wp-block-paragraph">Returning the following result:</p>



<pre class="wp-block-code"><code class="">{
  "id":"chatcmpl-d6ea734b524bd851668e71d4111ba496",
  "object":"chat.completion",
  "created":1740059807,
  "model":"mistralai/Mistral-Small-24B-Instruct-2501",
  "choices":[
    {
      "index":0,
      "message":{
        "role":"assistant",
        "reasoning_content":null, 
        "content":"The founder of OVHcloud is Octave Klaba.",
        "tool_calls":[]
      },
      "logprobs":null,
      "finish_reason":"stop",
      "stop_reason":null
    }
  ],
  "usage":{
    "prompt_tokens":22,
    "total_tokens":35,
    "completion_tokens":13,
    "prompt_tokens_details":null
  },
  "prompt_logprobs":null
}</code></pre>
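<p class="wp-block-paragraph">In an application, you usually only need the answer text and the token counts from this payload. Here is a minimal sketch, assuming the response shape shown above (the <code>summarize_response</code> helper is hypothetical):</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Minimal sketch: extract the useful fields from a chat completion response.
# summarize_response is a hypothetical helper name.
import json


def summarize_response(raw_json):
    """Return (answer, prompt_tokens, completion_tokens) from the JSON text."""
    body = json.loads(raw_json)
    answer = body["choices"][0]["message"]["content"]
    usage = body["usage"]
    return answer, usage["prompt_tokens"], usage["completion_tokens"]


# Trimmed-down copy of the response shown above
sample = json.dumps({
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "The founder of OVHcloud is Octave Klaba.",
        },
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 22, "total_tokens": 35, "completion_tokens": 13},
})

answer, prompt_tokens, completion_tokens = summarize_response(sample)
print(answer)
print(prompt_tokens + completion_tokens)</code></pre>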



<h2 class="wp-block-heading">Conclusion</h2>



<p class="wp-block-paragraph">By following these steps, you have successfully deployed the <code><strong>mistralai/Mistral-Small-24B-Instruct-2501</strong></code> model using <strong>vLLM</strong> on OVHcloud&#8217;s AI Deploy platform. This setup provides a scalable and efficient solution for serving advanced language models in production environments.</p>



<p class="wp-block-paragraph">For further customization and optimization, refer to the <a href="https://docs.vllm.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM documentation</a> and <a href="https://help.ovhcloud.com/csm/en-ie-documentation-public-cloud-ai-and-machine-learning-ai-deploy?id=kb_browse_cat&amp;kb_id=574a8325551974502d4c6e78b7421938&amp;kb_category=3241efc6a052d910f078d4b4ef43651f&amp;spa=1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Deploy resources</a>.</p>



<p class="wp-block-paragraph">💪 <strong>Challenge completed!</strong> You can now enjoy the power of your LLM deployed in a single command line!</p>



<p class="wp-block-paragraph">Want even more simplicity? You can also use ready-to-use APIs with <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>!</p>



<p class="wp-block-paragraph"><strong><em>But… what’s next?</em></strong></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to serve LLMs with vLLM and OVHcloud AI Deploy</title>
		<link>https://blog.ovhcloud.com/how-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 29 May 2024 12:22:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMA]]></category>
		<category><![CDATA[LLaMA 3]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Mixtral]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26762</guid>

					<description><![CDATA[In this tutorial, we will learn how to serve Large Language Models (LLMs) using vLLM and the OVHcloud AI Products.]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>In this tutorial, we will walk you through the process of serving large language models (LLMs), providing step-by-step instructions</em>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="345" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png" alt="" class="wp-image-25615" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-768x259.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1536x518.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-2048x690.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>






<h3 class="wp-block-heading">Introduction</h3>



<p class="wp-block-paragraph">In recent years, <strong>large language models</strong> (LLMs) have become increasingly <strong>popular</strong>, with <strong>open-source</strong> models like <em>Mistral</em> and <em>LLaMA</em> gaining widespread attention. In particular, the <em>LLaMA 3</em> model, released on <em>April 18, 2024</em>, is one of today&#8217;s most powerful open-source LLMs.</p>



<p class="wp-block-paragraph">However, <strong>serving these LLMs can be challenging</strong>, particularly on hardware with limited resources. Indeed, even on expensive hardware, LLMs can be surprisingly slow, with high VRAM utilization and throughput limitations.</p>



<p class="wp-block-paragraph">This is where <em><a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>vLLM</strong></a></em> comes in. <em><strong>vLLM</strong></em> is an <strong>open-source project</strong> that enables <strong>fast and easy-to-use LLM inference and serving</strong>. Designed for optimal performance and resource utilization, <em>vLLM</em> supports a range of <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLM architectures</a> and offers <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">flexible customization options</a>. That&#8217;s why we are going to use it to efficiently deploy and scale our LLMs.</p>



<h3 class="wp-block-heading">Objective</h3>



<p class="wp-block-paragraph">In this guide, you will discover how to deploy an LLM using <a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em></a> and the <em>OVHcloud</em> <strong><em>AI Deploy</em></strong> solution. This will enable you to benefit from <em>vLLM</em>&#8217;s optimisations and <em>OVHcloud</em>&#8217;s GPU computing resources. Your LLM will then be exposed through a secured API.</p>



<p class="wp-block-paragraph">🎁 And for those who do not want to bother with the deployment process, <strong>a surprise awaits you at the <a href="#AI-ENDPOINTS">end of the article</a></strong>. We are going to introduce you to our new solution for using LLMs, called <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Endpoints</strong></a>. This product makes it easy to integrate AI capabilities into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it&#8217;s in alpha, it&#8217;s <strong>free</strong>!</p>



<h3 class="wp-block-heading">Requirements</h3>



<p class="wp-block-paragraph">To deploy your <em>vLLM</em> server, you need:</p>



<ul class="wp-block-list">
<li>An <em>OVHcloud</em> account to access the <a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude"><em>OVHcloud Control Panel</em></a></li>



<li>A <em>Public Cloud</em> project</li>



<li>A <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for the AI Products</a>, related to this <em>Public Cloud</em> project</li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The <em>OVHcloud AI CLI</em></a> installed on your local computer (to interact with the AI products by running commands). </li>



<li><a href="https://www.docker.com/get-started" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a> installed on your local computer, <strong>or</strong> access to a Debian Docker Instance, which is available on the <a href="https://www.ovh.com/manager/public-cloud/" data-wpel-link="exclude"><em>Public Cloud</em></a></li>
</ul>



<p class="wp-block-paragraph">Once these conditions have been met, you are ready to serve your LLMs.</p>



<h3 class="wp-block-heading">Building a Docker image</h3>



<p class="wp-block-paragraph">Since the <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>OVHcloud AI Deploy</em></a> solution is based on <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Docker</em></a> images, we will be using a <em>Docker</em> image to deploy our <em>vLLM</em> inference server. </p>



<p class="wp-block-paragraph">As a reminder, <em>Docker</em> is a platform that allows you to create, deploy, and run applications in containers. <em>Docker</em> containers are standalone and executable packages that include everything needed to run an application (code, libraries, system tools).</p>



<p class="wp-block-paragraph">To create this <em>Docker</em> image, we will need to write the following <em><strong>Dockerfile</strong></em> into a new folder:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">mkdir my_vllm_image
cd my_vllm_image
nano Dockerfile</code></pre>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 🐳 Base image
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 👱 Set the working directory inside the container
WORKDIR /workspace

# 📚 Install missing system packages (git) so we can clone the vLLM project repository
RUN apt-get update &amp;&amp; apt-get install -y git
RUN git clone https://github.com/vllm-project/vllm/

# 📚 Install the Python dependencies
RUN pip3 install --upgrade pip
RUN pip3 install vllm 

# 🔑 Give correct access rights to the OVHcloud user
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace</code></pre>



<p class="wp-block-paragraph">Let&#8217;s take a closer look at this <em>Dockerfile</em> to understand it:</p>



<ul class="wp-block-list">
<li><strong>FROM</strong>: Specifies the base image for our <em>Docker</em> image. We choose the <em>PyTorch</em> image since it comes with <em>CUDA</em>, <em>cuDNN</em> and <em>torch</em>, which are needed by <em>vLLM</em>.</li>



<li><strong>WORKDIR /workspace</strong>: We set the working directory for the <em>Docker</em> container to <em>/workspace</em>, which is the default folder when we use <em>AI Deploy</em>.</li>



<li><strong>RUN</strong>: These instructions install <em>git</em>, clone the <a href="https://github.com/vllm-project/vllm/tree/main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em> repository</a> into the <em>/workspace</em> directory, upgrade <em>pip</em> to the latest version, and install the <em>vLLM</em> library.</li>



<li><strong>ENV HOME=/workspace</strong>: This sets the <em>HOME</em> environment variable to <em>/workspace</em>. This is a requirement to use the <em>OVHcloud</em> AI Products.</li>



<li><strong>RUN chown -R 42420:42420 /workspace</strong>: This changes the owner of the <em>/workspace</em> directory to the user and group with IDs of <em>42420</em> (<em>OVHcloud</em> user). This is also a requirement to use the <em>OVHcloud</em> AI Products.</li>
</ul>



<p class="wp-block-paragraph">This <em>Dockerfile</em> does not contain a <strong>CMD</strong> instruction and therefore does not launch our <em>vLLM</em> server. This is intentional: we will pass the start command directly to <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a>, which gives us more flexibility.</p>



<p class="wp-block-paragraph">Once your Dockerfile is written, launch the following command to build your image:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker build . -t vllm_image:latest</code></pre>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p class="wp-block-paragraph">Once you have built the Docker image, you will need to push it to a <strong>registry</strong> to make it accessible from <em>AI Deploy</em>. A <strong>registry</strong> is a service that allows you to store and distribute <em>Docker</em> images, making it easy to deploy them in different environments.</p>



<p class="wp-block-paragraph">Several registries can be used (<em><a href="https://www.ovhcloud.com/en-gb/public-cloud/managed-private-registry/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Private Registry</a>, <a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a>, <a href="https://github.com/features/packages" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub Packages</a>, &#8230;</em>). In this tutorial, we will use the <strong><em>OVHcloud</em> <em>shared registry</em></strong>. More information is available in the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-manage-registries?id=kb_article_view&amp;sysparm_article=KB0057949" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registries documentation</a>.</p>



<p class="wp-block-paragraph">To find the address of your shared registry, use the following command (<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a> needs to be installed on your computer):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">ovhai registry list</code></pre>



<p class="wp-block-paragraph">Then, log in to your <em>shared registry</em> with your usual <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Platform user</em></a> credentials:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p class="wp-block-paragraph">Once you are logged in to the registry, tag the image you built and push it to your shared registry:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker tag vllm_image:latest &lt;shared-registry-address&gt;/vllm_image:latest
docker push &lt;shared-registry-address&gt;/vllm_image:latest</code></pre>
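
<p class="wp-block-paragraph">If you prefer to script this step, the two commands above can be generated from the registry address. The following is only a minimal sketch: the function name <code>docker_push_commands</code> and the registry address are hypothetical, and the script prints the commands rather than running them.</p>

<pre class="wp-block-code"><code lang="python" class="language-python">import shlex


def docker_push_commands(registry_address, image="vllm_image", tag="latest"):
    """Return the docker tag and docker push commands for the shared registry."""
    # Drop a trailing slash so the remote name never contains "//".
    remote = f"{registry_address.rstrip('/')}/{image}:{tag}"
    local = f"{image}:{tag}"
    return [
        ["docker", "tag", local, remote],
        ["docker", "push", remote],
    ]


# Hypothetical registry address, for illustration only.
commands = docker_push_commands("registry.example.net/my-project/")
for command in commands:
    print(shlex.join(command))</code></pre>

<p class="wp-block-paragraph">Each command is kept as a list of arguments, so it could be passed to <code>subprocess.run</code> without any shell quoting issues.</p>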



<h3 class="wp-block-heading">vLLM inference server deployment</h3>



<p class="wp-block-paragraph">Once your image has been pushed, it can be used with <em>AI Deploy</em>, using either the <em>ovhai CLI</em> or the <em>OVHcloud Control Panel (UI)</em>.</p>



<h5 class="wp-block-heading">Creating an access token </h5>



<p class="wp-block-paragraph">Tokens are used as unique authenticators to securely access the <em>AI Deploy</em> apps. By creating a token, you can ensure that only authorized requests are allowed to interact with the <em>vLLM</em> endpoint. You can create this token by using the <em>OVHcloud Control Panel (UI)</em> or by running the following command:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai token create vllm --role operator --label-selector name=vllm</code></pre>



<p class="wp-block-paragraph">This command returns a token. Keep it somewhere safe, as you will need it later to query your app.</p>



<h5 class="wp-block-heading">Creating a Hugging Face token (optional)</h5>



<p class="wp-block-paragraph">Note that some models, such as <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 3</a>, require you to accept their license. To use them, create a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>, accept the model’s license, and generate a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">token</a> in your account settings. This token will allow you to access the model.</p>



<p class="wp-block-paragraph">For example, when visiting the Hugging Face <a href="https://huggingface.co/google/gemma-2b" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gemma model page</a>, you’ll see this (if you are logged in):</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="716" height="312" src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png" alt="accept_model_conditions_hugging_face" class="wp-image-26768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png 716w, https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21-300x131.png 300w" sizes="auto, (max-width: 716px) 100vw, 716px" /></figure>



<p class="wp-block-paragraph">If you want to use this model, you will have to acknowledge the license, and then create a token in the <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tokens section</a>.</p>



<p class="wp-block-paragraph">In the next step, we will set this token as an environment variable (named <code>HF_TOKEN</code>). Doing this will enable us to use any LLM whose conditions of use we have accepted.</p>



<h5 class="wp-block-heading">Run the AI Deploy application</h5>



<p class="wp-block-paragraph">Run the following command to deploy your <em>vLLM</em> server by running your customized <em>Docker</em> image:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai app run &lt;shared-registry-address&gt;/vllm_image:latest \
  --name vllm_app \
  --flavor h100-1-gpu \
  --gpu 1 \
  --env HF_TOKEN="&lt;YOUR_HUGGING_FACE_TOKEN&gt;" \
  --label name=vllm \
  --default-http-port 8080 \
  -- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code></pre>



<p class="wp-block-paragraph"><em>You only need to replace the registry address with your own and set the name of the LLM you want to use. If you chose a different image name, tag or label than those used in this tutorial, adapt them as well.</em></p>



<p class="wp-block-paragraph"><strong>Parameters explanation</strong></p>



<ul class="wp-block-list">
<li><code>&lt;shared-registry-address&gt;/vllm_image:latest</code> is the image on which the app is based.</li>



<li><code>--name vllm_app</code> is an optional argument that allows you to give your app a custom name, making it easier to manage all your apps.</li>



<li><code>--flavor h100-1-gpu</code> indicates that we want to run our app on H100 GPU(s). You can access the full list of available flavors by running <code>ovhai capabilities flavor list</code>.</li>



<li><code>--gpu 1</code> indicates that we request 1 GPU for that app.</li>



<li><code>--env HF_TOKEN</code> is an optional argument that allows us to set our Hugging Face token as an environment variable. This gives us access to models for which we have accepted the conditions.</li>



<li><code>--label name=vllm</code> restricts access to our LLM: only tokens created with the matching label selector <code>name=vllm</code> can reach the app.</li>



<li><code>--default-http-port 8080</code> indicates that the app URL forwards requests to port <code>8080</code>. It must match the <code>--port</code> value passed to the vLLM server.</li>



<li><code>-- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code> starts the vLLM API server: everything after the bare <code>--</code> is the command executed inside the container. The specified <code>&lt;model&gt;</code> will be downloaded from Hugging Face, and <code>--dtype half</code> loads it in half precision (FP16). Here is a list of the models <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">supported by vLLM</a>. <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Many arguments</a> can be used to optimize your inference.</li>
</ul>
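
<p class="wp-block-paragraph">To see how these parameters fit together, the command can also be assembled programmatically. The sketch below is illustrative only: <code>build_app_run_command</code> is a hypothetical helper, the image address is made up, and the model name is just an example. It only uses the flags described above and prints the resulting command instead of running it.</p>

<pre class="wp-block-code"><code lang="python" class="language-python">import shlex


def build_app_run_command(image, model, hf_token=None, flavor="h100-1-gpu",
                          gpus=1, port=8080, label="name=vllm", name="vllm_app"):
    """Assemble the ovhai app run argument list described above."""
    command = [
        "ovhai", "app", "run", image,
        "--name", name,
        "--flavor", flavor,
        "--gpu", str(gpus),
        "--label", label,
        "--default-http-port", str(port),
    ]
    # The Hugging Face token is only needed for models with gated access.
    if hf_token:
        command += ["--env", f"HF_TOKEN={hf_token}"]
    # Everything after the bare "--" is executed inside the container.
    # The same port is reused so the app URL and the server always agree.
    command += [
        "--", "python", "-m", "vllm.entrypoints.api_server",
        "--host", "0.0.0.0", "--port", str(port),
        "--model", model, "--dtype", "half",
    ]
    return command


# Hypothetical image address and example model name, for illustration only.
cmd = build_app_run_command(
    image="registry.example.net/my-project/vllm_image:latest",
    model="mistralai/Mistral-7B-Instruct-v0.2",
)
print(shlex.join(cmd))</code></pre>

<p class="wp-block-paragraph">Deriving both <code>--default-http-port</code> and <code>--port</code> from a single <code>port</code> argument avoids a common mistake, where the app is exposed on one port while the server listens on another.</p>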



<p class="wp-block-paragraph">When this <code>ovhai app run</code> command is executed, several pieces of information will appear in your terminal. Get the ID of your application, and open the Info URL in a new tab. Wait a few minutes for your application to launch. When it is <strong>RUNNING</strong>, you can stream its logs by executing:</p>



<pre class="wp-block-code"><code class="">ovhai app logs -f &lt;APP_ID&gt;</code></pre>



<p class="wp-block-paragraph">This will allow you to track the server launch and the model download, and to spot any errors, for example if you requested a model whose license you have not accepted.</p>



<p class="wp-block-paragraph">If all goes well, you should see the following output, which means that your server is up and running:</p>



<pre class="wp-block-code"><code class="">Started server process [11]
Waiting for application startup.
Application startup complete.
Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre>
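
<p class="wp-block-paragraph">If you want to detect this state automatically, for example in a deployment script, you can look for the startup message in the log lines. This is a minimal sketch that assumes the log output matches the sample above; <code>server_is_ready</code> and <code>sample_logs</code> are hypothetical names.</p>

<pre class="wp-block-code"><code lang="python" class="language-python">READY_MARKER = "Application startup complete."


def server_is_ready(log_lines):
    """Return True once the startup marker appears in the given log lines."""
    return any(READY_MARKER in line for line in log_lines)


# Sample lines copied from the expected output above.
sample_logs = [
    "Started server process [11]",
    "Waiting for application startup.",
    "Application startup complete.",
    "Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)",
]
print(server_is_ready(sample_logs))</code></pre>

<p class="wp-block-paragraph">The function accepts any iterable of strings, so it could be fed with lines read from the output of <code>ovhai app logs</code>.</p>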



<h3 class="wp-block-heading">Interacting with your LLM</h3>



<p class="wp-block-paragraph">Once the server is up and running, we can interact with our LLM by hitting the <code>/generate</code> endpoint.</p>



<p class="wp-block-paragraph"><strong>Using cURL</strong></p>



<p class="wp-block-paragraph"><em>Make sure you change the ID to that of your application so that you target the right endpoint. In order for the request to be accepted, also specify the token that you generated previously by executing</em> <code>ovhai token create</code>. Feel free to adapt the parameters of the request (<em>prompt</em>, <em>max_tokens</em>, <em>temperature</em>, &#8230;).</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">curl --request POST \
  --url https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/generate \
  --header 'Authorization: Bearer &lt;AI_TOKEN_generated_with_CLI&gt;' \
  --header 'Content-Type: application/json' \
  --data '{
        "prompt": "&lt;YOUR_PROMPT&gt;",
        "max_tokens": 50,
        "n": 1,
        "stream": false
}'</code></pre>



<p class="wp-block-paragraph"><strong>Using Python</strong></p>



<p class="wp-block-paragraph"><em>Here too, you need to add your personal token and the correct link for your application.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests

# Replace with your own app URL and the token created with `ovhai token create`
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "&lt;AI_TOKEN_generated_with_CLI&gt;"

url = f"{APP_URL}/generate"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}",
}
data = {
    "prompt": "What is an LLM in AI?",
    "max_tokens": 100,
    "temperature": 0,
}

# `json=` serializes the payload to JSON for us
response = requests.post(url, headers=headers, json=data)
# Fail early on HTTP errors, such as a rejected token
response.raise_for_status()

# The generated texts are returned in the "text" list
print(response.json()["text"][0])</code></pre>
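
<p class="wp-block-paragraph">If you call the endpoint from several places in your code, it can help to separate building the request from sending it. The sketch below uses only the standard library and hypothetical helper names (<code>build_generate_request</code>, <code>extract_completions</code>). It assumes the response has the same shape as in the example above, that is, a JSON object with a <code>text</code> list. The URL and token are placeholders.</p>

<pre class="wp-block-code"><code lang="python" class="language-python">import json


def build_generate_request(app_url, token, prompt, max_tokens=100, temperature=0):
    """Return the URL, headers and JSON body for a call to /generate."""
    # Tolerate a trailing slash in the app URL.
    url = app_url.rstrip("/") + "/generate"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    })
    return url, headers, body


def extract_completions(response_payload):
    """Return the generated texts, or an empty list if the key is missing."""
    return list(response_payload.get("text", []))


# Placeholder values, for illustration only.
url, headers, body = build_generate_request(
    "https://my-app.example.net/", "my-token", "What is an LLM in AI?"
)
print(url)
print(extract_completions({"text": ["An LLM is a large language model."]}))</code></pre>

<p class="wp-block-paragraph">The three returned values can then be passed to <code>requests.post(url, headers=headers, data=body)</code>, and the decoded JSON response to <code>extract_completions</code>.</p>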



<h3 class="wp-block-heading" id="AI-ENDPOINTS">OVHcloud AI Endpoints</h3>



<p class="wp-block-paragraph">If you are not interested in building your own image and deploying your own LLM inference server, you can use OVHcloud&#8217;s new <em><strong><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></strong> </em>product, which will definitely make your life easier!</p>



<p class="wp-block-paragraph"><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is a serverless solution that provides AI APIs, enabling you to easily use pre-trained and optimized AI models in your applications. </p>



<figure class="wp-block-video"><video height="1400" style="aspect-ratio: 2560 / 1400;" width="2560" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4"></video></figure>



<p class="has-text-align-center wp-block-paragraph"><em>Overview of AI Endpoints</em></p>



<p class="wp-block-paragraph">You can use LLM as a Service, choosing the desired model (such as <em>LLaMA</em>, <em>Mistral</em>, or <em>Mixtral</em>) and making an API call to use it in your application. This will allow you to interact with these models without even having to deploy them!</p>



<p class="wp-block-paragraph">In addition to LLM capabilities, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> also offers a range of other AI models, including speech-to-text, translation, summarization, embeddings and computer vision. </p>



<p class="wp-block-paragraph">Best of all, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is currently in its alpha phase and is <strong>free to use</strong>, making it an accessible and affordable solution for developers seeking to explore the possibilities of AI. Read <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">this article</a> and try it out today to discover the power of AI!</p>



<p class="wp-block-paragraph">Join our <a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Discord server</a> to interact with the community and send us your feedback (#<em>ai-endpoints</em> channel)!</p>
]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4" length="14424826" type="video/mp4" />

			</item>
	</channel>
</rss>
