<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>LLM Serving Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/llm-serving/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/llm-serving/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Wed, 14 Aug 2024 07:57:26 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>LLM Serving Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/llm-serving/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Adopting AI in SaaS: how can we move quickly without losing control?</title>
		<link>https://blog.ovhcloud.com/ai-saas-ovhcloud/</link>
		
		<dc:creator><![CDATA[Germain Masse]]></dc:creator>
		<pubDate>Wed, 14 Aug 2024 06:04:09 +0000</pubDate>
				<category><![CDATA[OVHcloud Product News]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Saas]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=27229</guid>

					<description><![CDATA[The widespread use of AI poses numerous challenges: the risks of data leakage, the need for explainable results, how to handle it in SaaS, and the growing dependence on Big Tech. Not to mention the environmental toll linked to AI. No doubt the eco-design of digital services is becoming increasingly popular. Still, the efforts to [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>The widespread use of AI poses numerous challenges: the risks of data leakage, the need for explainable results, how to handle it in SaaS, and the growing dependence on Big Tech. Not to mention the environmental toll linked to AI.</p>



<p>No doubt the eco-design of digital services is becoming increasingly popular. Still, the efforts to achieve digital sobriety seem marginal compared to the energy consumed by training general-purpose LLMs. Is there a way to make AI greener? And what would a more “trusted AI” mean?</p>



<p>Here’s a roundup of challenges and solutions.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-1024x576.jpeg" alt="" class="wp-image-27235" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-1024x576.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-300x169.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-768x432.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-1536x864.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-2048x1152.jpeg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>Efficiency of specialised LLMs compared to general-purpose LLMs</strong></p>



<p>General-purpose LLMs, such as GPT-4 (OpenAI), LLaMA (Meta) and Gemini (Google), are currently in the spotlight. Versatile, seemingly omniscient, and able to handle a wide variety of scenarios, they appear to meet every need: generating text and code, answering questions, translating content, and even composing poems.</p>



<p>However, these general-purpose models have not yet eclipsed specialised LLMs,<a id="_ftnref1" href="#_ftn1"><sup>[1]</sup></a> which target a narrower range of situations but perform much better in them. Techniques such as Retrieval-Augmented Generation (RAG) and fine-tuning certainly make it possible to specialise a general-purpose LLM, with or without retraining the model. However, the use of general-purpose LLMs continues to pose a range of challenges, starting with their generic results, unreliable quality and lack of reproducibility. This will prove even more challenging as the available sources of quality data may become scarce, due to legal actions<a id="_ftnref2" href="#_ftn2"><sup>[2]</sup></a> brought for unauthorised use of content and copyright infringement. Additionally, the use of general-purpose LLMs leads to operator dependency and reinforces monopolies,<a id="_ftnref3" href="#_ftn3"><sup>[3]</sup></a> which is unfavourable for long-term users.</p>



<p><strong>The impact of AI on the environment<br></strong>Researchers from Hugging Face and the Allen Institute<a href="#_ftn4" id="_ftnref4"><sup>[4]</sup></a> <strong>have shown that, for servers with GPUs, the carbon emissions linked to machine use far exceed those linked to manufacturing the components, unlike in traditional cloud computing. Generating an image using an AI model is one of the most energy-intensive uses, and requires as much electricity as fully charging a smartphone.</strong><a href="#_ftn5" id="_ftnref5"><sup>[5]</sup></a> Reversing the distribution of carbon emissions throughout the lifecycle of servers in this way means that the power usage effectiveness (PUE) of the datacentres in which AI models are trained and served, as well as the energy mix of the countries in which they are located, are very significant selection criteria when calculating your application’s global carbon footprint.</p>



<p>This is a bonus for OVHcloud. Indeed, the Group has long been committed to reducing the carbon footprint of its datacentres.<a id="_ftnref6" href="#_ftn6"><sup>[6]</sup></a></p>



<p>As might be expected, general-purpose LLMs are more environmentally damaging than specialised models designed for specific tasks, as revealed in a series of comparative tests carried out by the same researchers.<a id="_ftnref7" href="#_ftn7"><sup>[7]</sup></a> With thousands of billions of parameters, the largest LLMs are getting ever larger and more data-intensive.<a id="_ftnref8" href="#_ftn8"><sup>[8]</sup></a> An article in the <em>New Scientist</em> recently explained that algorithmic advances are outpacing Moore’s Law: after eight months, a large language model needs only half the computing power to achieve the same level of performance.<a id="_ftnref9" href="#_ftn9"><sup>[9]</sup></a> However, running a model like OpenAI’s today costs Microsoft around $700,000 per day,<a id="_ftnref10" href="#_ftn10">[10]</a> or an average of 36 cents per query. That remains unreasonable, from both an economic and environmental point of view, for meeting needs that are often precise and well-defined.</p>



<p>Specialised models, which can be chained to perform complex tasks (an approach referred to as agentisation), are therefore a more environmentally responsible alternative to general-purpose LLMs. On top of that, specialised models, which are more widely available in open source, are also easier to understand and to fine-tune. They seem more suitable for building innovations reversibly while the ROI is still very uncertain.</p>



<p><strong>Maintaining control: working towards developing a trusted AI</strong></p>



<p>While large companies quickly became aware of the risks of leaking confidential data when using digital services (like online translation, which they are beginning to ban), AI has intensified the temptation to send a company’s data outside and submit it to an algorithm: here to write a report more quickly, there to generate an image that will illustrate a presentation on a confidential project. Samsung learned this the hard way, as a victim of three consecutive data leaks related to the use of ChatGPT by its employees, who notably copied/pasted source code to solve or optimise a problem.<br>You don’t need to disclose a lot of information to say a lot about your intentions. What insights would your rival gather about your strategy from reading your ChatGPT prompts? After all, it is possible for AI to accidentally “expose” data submitted by users, due to a bug<a href="#_ftn11" id="_ftnref11"><sup>[11]</sup></a> causing security issues. The same goes for datasets you might submit on AI platforms: will your data be used to train and refine the model? Could they benefit potential rivals?</p>



<p>Beyond this, there is also the question of the transparency of AI models, and with it the risk of outsourcing increasingly important tasks to sophisticated AI models. Indeed, they can become &#8220;black boxes&#8221;, making incomprehensible decisions, or producing skewed results because of the data they are trained on.</p>



<p>Let&#8217;s face it: even if you have no problem with the results, would you run the risk of relying on a service whose workings you can’t explain, even in broad terms? And one that you couldn’t stop using without losing everything? Here, we encounter another problem – reversibility.</p>



<p>Suppose, for example, that the AI service decides the party is over and that the infrastructure it has long financed at a loss must now be made profitable, so it takes advantage of its monopoly and your dependency to raise its rates unreasonably. You could certainly cancel the service, but you would then lose the results of your data training and/or model specialisation, and you would have to start from scratch. In the current absence of standards for portability and interoperability between different AI services, this issue is crucial – all the more so given that, for the moment, while open source is popular, proprietary models are very much in the majority.</p>



<p>There is no simple answer to the questions that have been raised. That’s because AI development is currently very empirical, based on a trial-and-error model, with no traceability of training data or model modifications.</p>



<p>This, incidentally, makes the “explainability”<a href="#_ftn12" id="_ftnref12"><sup>[12]</sup></a> of an AI system’s results a real challenge, even though the AI Act establishes a duty to do so (see below).</p>



<p>The development of a “trustworthy AI”, as it was termed in a 2019 paper<a href="#_ftn13" id="_ftnref13"><sup>[13]</sup></a><sup> </sup>by the Independent High-Level Expert Group on Artificial Intelligence (AI HLEG), is perhaps a direction to keep in mind. It defines a trustworthy AI with three main objectives, which OVHcloud aims to help you achieve: AI must be lawful (legislative or regulatory aspect), ethical (respect for ethical norms) and robust (from both a technical and social perspective).</p>



<p>In the meantime, ensuring swift regulatory compliance at the national, European, and international levels is a powerful lever for promoting greener business practices, without compromising future prospects in the pursuit of innovation.</p>



<p><strong>1/ Complying with current and future regulations</strong></p>



<p>The EU was quick to respond to the democratisation of AI, proposing a draft European regulation on the subject on 21 April 2021. In March 2024, the AI Act was officially adopted. Now, it applies to all services used in the EU, regardless of whether the providers are foreign or not.</p>



<p>The law divides AI systems into four categories, taking into account their impact on fundamental rights in the EU and the security of individuals, groups, societies, and civilisation. Each risk category has associated prohibitions<a href="#_ftn14" id="_ftnref14"><sup>[14]</sup></a> and obligations, ranging from environmental sustainability to security, and including marking content that has been AI-generated.</p>



<p>An online “compliance checker” allows you to quickly find out the extent to which this European AI law applies to your projects: <a href="https://artificialintelligenceact.eu/assessment/eu-ai-act-compliance-checker/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://artificialintelligenceact.eu/</a></p>



<p>Other national and European regulations on personal data protection, such as the GDPR, already apply to your AI projects, holding companies accountable for hosting and transferring personal data outside the EU.</p>



<p>Incidentally, those who complain about regulation being too burdensome in comparison to the American laissez-faire attitude or the Chinese spirit of conquest have the wrong end of the stick: the absence of a genuine European single market is a much bigger factor<a href="#_ftn15" id="_ftnref15"><sup>[15]</sup></a> behind Europe’s innovation gap. So, too, are the weak support for public procurement and the incomprehensible message sent by governments that claim to want to develop sovereign solutions by relying on investments from foreign stakeholders.<a href="#_ftn16" id="_ftnref16"><sup>[16]</sup></a></p>



<p>It’s also worth noting that the AI Act provides for the possibility for national competent authorities (the ICO in the UK) to set up “regulatory sandboxes”, i.e. controlled environments for testing innovative technologies for a limited time, in order to ensure the compliance of the AI system without delaying any potential placing on the market, with priority access to these sandboxes for SMEs and startups.</p>



<p>In short, regulations today do not hinder the development of projects that take advantage of the possibilities offered by AI, but rather strengthen companies’ obligations regarding the protection of personal data due to increased risks. These obligations will help to reassure users, once this brief period of carelessness and frivolity with AI has passed, and the inevitable first scandals start to surface. As Yoshua Bengio, researcher and founder of MILA (the Quebec Artificial Intelligence Institute), summed up: “We’re going too fast in an unfamiliar direction, and that could change the world in a very positive, or very dangerous, way.”<a href="#_ftn17" id="_ftnref17"><sup>[17]</sup></a><sup> </sup>Countries should therefore seek to regulate AI so that its development does not feel like the Wild West.</p>



<p>In this context, the preference for sovereign solutions will make it easier for your projects to comply with regulations, in addition to establishing a clear medium- and long-term vision. COVID and the current geopolitical instability have shown the cost of relying on foreign entities for essential services, and AI-based services will quickly follow the same path as they are integrated into the software we use every day, in such critical areas as health, education, transport, or logistics.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><a href="#_ftnref1" id="_ftn1"><sup>[1]</sup></a> General-purpose and specialised LLMs can be distinguished by the number of parameters in their neural network: tens, hundreds or even thousands of billions of parameters for a general-purpose model versus “a few billion” for a specialised model.</p>



<p><a href="#_ftnref2" id="_ftn2"><sup>[2]</sup></a> <a href="https://www.usine-digitale.fr/article/openai-cible-par-deux-class-actions-aux-etats-unis.N2148412" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.usine-digitale.fr/article/openai-cible-par-deux-class-actions-aux-etats-unis.N2148412</a>; <a href="https://www.lefigaro.fr/secteur/high-tech/des-journaux-americains-poursuivent-openai-et-microsoft-en-justice-pour-violation-de-leurs-droits-d-auteur-20240430" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.lefigaro.fr/secteur/high-tech/des-journaux-americains-poursuivent-openai-et-microsoft-en-justice-pour-violation-de-leurs-droits-d-auteur-20240430</a></p>



<p><a id="_ftn3" href="#_ftnref3"><sup>[3]</sup></a> <a href="https://www.nytimes.com/2024/06/05/technology/nvidia-microsoft-openai-antitrust-doj-ftc.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.nytimes.com/2024/06/05/technology/nvidia-microsoft-openai-antitrust-doj-ftc.html</a></p>



<p><a id="_ftn4" href="#_ftnref4"><sup>[4]</sup></a> <a href="http://arxiv.org/pdf/2311.16863" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">http://arxiv.org/pdf/2311.16863</a></p>



<p><a id="_ftn5" href="#_ftnref5"><sup>[5]</sup></a> <a href="https://www.technologyreview.com/2023/12/01/1084189/making-an-image-with-generative-ai-uses-as-much-energy-as-charging-your-phone/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.technologyreview.com/2023/12/01/1084189/making-an-image-with-generative-ai-uses-as-much-energy-as-charging-your-phone/</a></p>



<p><a id="_ftn6" href="#_ftnref6"><sup>[6]</sup></a> <a href="https://corporate.ovhcloud.com/en/sustainability/environment/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://corporate.ovhcloud.com/en-gb/sustainability/environment/</a><br>For our <strong>PUE calculation methodology,</strong> refer to <a href="https://corporate.ovhcloud.com/sites/default/files/2024-01/methodo_carboncalc_0.pdf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://corporate.ovhcloud.com/sites/default/files/2024-01/methodo_carboncalc_0.pdf</a></p>



<p><a id="_ftn7" href="#_ftnref7"><sup>[7]</sup></a> <a href="https://www.silicon.fr/llm-generaliste-specialise-angle-environnemental-473911.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.silicon.fr/llm-generaliste-specialise-angle-environnemental-473911.html</a></p>



<p><a id="_ftn8" href="#_ftnref8"><sup>[8]</sup></a> <a href="https://www.radiofrance.fr/franceculture/podcasts/le-journal-de-l-eco/le-cout-environnemental-de-l-ia-est-colossal-et-sous-evalue-3781962" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.radiofrance.fr/franceculture/podcasts/le-journal-de-l-eco/le-cout-environnemental-de-l-ia-est-colossal-et-sous-evalue-3781962</a></p>



<p><a id="_ftn9" href="#_ftnref9"><sup>[9]</sup></a> <a href="https://www.newscientist.com/article/2424179-ai-chatbots-are-improving-at-an-even-faster-rate-than-computer-chips/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.newscientist.com/article/2424179-ai-chatbots-are-improving-at-an-even-faster-rate-than-computer-chips/</a></p>



<p><a id="_ftn10" href="#_ftnref10"><sup>[10]</sup></a> <a href="https://usbeketrica.com/fr/article/chatgpt-coute-t-il-vraiment-700-000-dollars-par-jour-a-openai" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://usbeketrica.com/fr/article/chatgpt-coute-t-il-vraiment-700-000-dollars-par-jour-a-openai</a></p>



<p><a id="_ftn11" href="#_ftnref11"><sup>[11]</sup></a> <a href="https://arstechnica.com/information-technology/2023/02/chatgpt-is-a-data-privacy-nightmare-and-you-ought-to-be-concerned/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://arstechnica.com/information-technology/2023/02/chatgpt-is-a-data-privacy-nightmare-and-you-ought-to-be-concerned/</a></p>



<p><a id="_ftn12" href="#_ftnref12"><sup>[12]</sup></a> <a href="https://www.cnil.fr/fr/definition/explicabilite-ia" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.cnil.fr/fr/definition/explicabilite-ia</a></p>



<p><a id="_ftn13" href="#_ftnref13"><sup>[13]</sup></a> <a href="https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1</a></p>



<p><a href="#_ftnref14" id="_ftn14"><sup>[14]</sup></a> AI systems are prohibited if they violate EU values by infringing on fundamental rights, such as:</p>



<ul class="wp-block-list">
<li>Subliminally manipulating behaviours</li>



<li>Exploiting individuals’ vulnerabilities in order to influence their behaviour</li>



<li>AI-based social scoring used by governments for general purposes</li>



<li>The use of “real-time” remote biometric identification systems in publicly accessible spaces for law enforcement purposes (with exceptions).</li>
</ul>



<p><a id="_ftn15" href="#_ftnref15"><sup>[15]</sup></a> <a href="https://twitter.com/hubertguillaud/status/1795001082843713968" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://twitter.com/hubertguillaud/status/1795001082843713968</a></p>



<p><a id="_ftn16" href="#_ftnref16"><sup>[16]</sup></a> <a href="https://twitter.com/canardenchaine/status/1795862230782640367" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://twitter.com/canardenchaine/status/1795862230782640367</a></p>



<p><a id="_ftn17" href="#_ftnref17"><sup>[17]</sup></a> <a href="https://ici.radio-canada.ca/ohdio/premiere/emissions/ils-ont-fait-annee/segments/entrevue/469120/robot-chatgpt-lois-securite-ordinateurs" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://ici.radio-canada.ca/ohdio/premiere/emissions/ils-ont-fait-annee/segments/entrevue/469120/robot-chatgpt-lois-securite-ordinateurs</a></p>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to serve LLMs with vLLM and OVHcloud AI Deploy</title>
		<link>https://blog.ovhcloud.com/how-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 29 May 2024 12:22:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMA]]></category>
		<category><![CDATA[LLaMA 3]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Mixtral]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26762</guid>

					<description><![CDATA[In this tutorial, we will learn how to serve Large Language Models (LLMs) using vLLM and the OVHcloud AI Products.]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of serving large language models (LLMs), providing step-by-step instructions</em>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" width="1024" height="345" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png" alt="" class="wp-image-25615" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-768x259.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1536x518.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-2048x690.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>



<h3 class="wp-block-heading">Introduction</h3>



<p>In recent years, <strong>large language models</strong> (LLMs) have become increasingly <strong>popular</strong>, with <strong>open-source</strong> models like <em>Mistral</em> and <em>LLaMA</em> gaining widespread attention. In particular, the <em>LLaMA 3</em> model, released on <em>April 18, 2024</em>, is one of today&#8217;s most powerful open-source LLMs.</p>



<p>However, <strong>serving these LLMs can be challenging</strong>, particularly on hardware with limited resources. Indeed, even on expensive hardware, LLMs can be surprisingly slow, with high VRAM utilization and throughput limitations.</p>



<p>This is where<strong><em> </em></strong><em><a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>vLLM</strong></a></em> comes in. <em><strong>vLLM</strong></em> is an <strong>open-source project</strong> that enables <strong>fast and easy-to-use LLM inference and serving</strong>. Designed for optimal performance and resource utilization, <em>vLLM</em> supports a range of <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLM architectures</a> and offers <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">flexible customization options</a>. That&#8217;s why we are going to use it to efficiently deploy and scale our LLMs.</p>
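


<p>If you want to get a quick feel for <em>vLLM</em> before deploying anything, you can try its offline inference API locally. Below is a minimal sketch, assuming <em>vLLM</em> is installed (<code>pip install vllm</code>) and a GPU is available; <code>facebook/opt-125m</code> is simply a small model chosen for a fast smoke test, not a recommendation:</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Minimal offline vLLM sketch: load a small model and generate a completion
from vllm import LLM, SamplingParams

# Any Hugging Face model supported by vLLM can be used here
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=50)

outputs = llm.generate(["What is LLM serving?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)</code></pre>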



<h3 class="wp-block-heading">Objective</h3>



<p>In this guide, you will discover how to deploy an LLM thanks to <a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em></a> and the <em>OVHcloud</em> <strong><em>AI Deploy</em></strong> solution. This will enable you to benefit from <em>vLLM</em>&#8216;s optimisations and <em>OVHcloud</em>&#8216;s GPU computing resources. Your LLM will then be exposed through a secured API.</p>



<p>🎁 And for those who do not want to bother with the deployment process, <strong>a surprise awaits you at the <a href="#AI-ENDPOINTS">end of the article</a></strong>. We are going to introduce you to our new solution for using LLMs, called <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Endpoints</strong></a>. This product makes it easy to integrate AI capabilities into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it&#8217;s in alpha, it&#8217;s <strong>free</strong>!</p>



<h3 class="wp-block-heading">Requirements</h3>



<p>To deploy your <em>vLLM</em> server, you need:</p>



<ul class="wp-block-list">
<li>An <em>OVHcloud</em> account to access the <a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude"><em>OVHcloud Control Panel</em></a></li>



<li>A <em>Public Cloud</em> project</li>



<li>A <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for the AI Products</a>, related to this <em>Public Cloud</em> project</li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The <em>OVHcloud AI CLI</em></a> installed on your local computer (to interact with the AI products by running commands). </li>



<li><a href="https://www.docker.com/get-started" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a> installed on your local computer, <strong>or</strong> access to a Debian Docker Instance, which is available on the <a href="https://www.ovh.com/manager/public-cloud/" data-wpel-link="exclude"><em>Public Cloud</em></a></li>
</ul>



<p>Once these conditions have been met, you are ready to serve your LLMs.</p>



<h3 class="wp-block-heading">Building a Docker image</h3>



<p>Since the <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>OVHcloud AI Deploy</em></a> solution is based on <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Docker</em></a> images, we will be using a <em>Docker</em> image to deploy our <em>vLLM</em> inference server. </p>



<p>As a reminder, <em>Docker</em> is a platform that allows you to create, deploy, and run applications in containers. <em>Docker</em> containers are standalone and executable packages that include everything needed to run an application (code, libraries, system tools).</p>



<p>To create this <em>Docker</em> image, we will need to write the following <em><strong>Dockerfile</strong></em> into a new folder:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">mkdir my_vllm_image
cd my_vllm_image
nano Dockerfile</code></pre>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 🐳 Base image
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 👱 Set the working directory inside the container
WORKDIR /workspace

# 📚 Install missing system packages (git) so we can clone the vLLM project repository
RUN apt-get update &amp;&amp; apt-get install -y git
RUN git clone https://github.com/vllm-project/vllm/

# 📚 Install the Python dependencies
RUN pip3 install --upgrade pip
RUN pip3 install vllm 

# 🔑 Give correct access rights to the OVHcloud user
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace</code></pre>



<p>Let&#8217;s take a closer look at this <em>Dockerfile</em> to understand it:</p>



<ul class="wp-block-list">
<li><strong>FROM</strong>: Specifies the base image for our <em>Docker</em> image. We choose the <em>PyTorch</em> image since it comes with <em>CUDA</em>, <em>CuDNN</em> and <em>torch</em>, which are needed by <em>vLLM</em>. </li>



<li><strong>WORKDIR /workspace</strong>: We set the working directory for the <em>Docker</em> container to <em>/workspace</em>, which is the default folder when we use <em>AI Deploy</em>.</li>



<li><strong>RUN</strong>: Each <strong>RUN</strong> instruction executes a command at build time. Here, we install <em>git</em> and clone the <a href="https://github.com/vllm-project/vllm/tree/main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em> repository</a> into the <em>/workspace</em> directory, then upgrade <em>pip</em> to the latest version and install the <em>vLLM</em> library.</li>



<li><strong>ENV</strong> HOME=/workspace: This sets the <em>HOME</em> environment variable to <em>/workspace</em>. This is a requirement to use the <em>OVHcloud</em> AI Products.</li>



<li><strong>RUN chown -R 42420:42420 /workspace</strong>: This changes the owner of the <em>/workspace</em> directory to the user and group with IDs of <em>42420</em> (<em>OVHcloud</em> user). This is also a requirement to use the <em>OVHcloud</em> AI Products.</li>
</ul>



<p>This <em>Dockerfile</em> does not contain a <strong>CMD</strong> instruction and therefore does not launch our <em>vLLM</em> server. Do not worry about that: we will do it directly from <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a> to have more flexibility.</p>



<p>Once your Dockerfile is written, launch the following command to build your image:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker build . -t vllm_image:latest</code></pre>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>Once you have built the Docker image, you will need to push it to a <strong>registry</strong> to make it accessible from <em>AI Deploy</em>. A <strong>registry</strong> is a service that allows you to store and distribute <em>Docker</em> images, making it easy to deploy them in different environments.</p>



<p>Several registries can be used (<em><a href="https://www.ovhcloud.com/en-gb/public-cloud/managed-private-registry/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Private Registry</a>, <a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a>, <a href="https://github.com/features/packages" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub packages</a>, &#8230;</em>). In this tutorial, we will use the <strong><em>OVHcloud</em> <em>shared registry</em></strong>. More information is available in the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-manage-registries?id=kb_article_view&amp;sysparm_article=KB0057949" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registries documentation</a>.</p>



<p>To find the address of your shared registry, use the following command (<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a> needs to be installed on your computer):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">ovhai registry list</code></pre>



<p>Then, log in on your <em>shared registry</em> with your usual <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Platform user</em></a> credentials:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>Once you are logged in to the registry, tag the built image and push it into your shared registry:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker tag vllm_image:latest &lt;shared-registry-address&gt;/vllm_image:latest
docker push &lt;shared-registry-address&gt;/vllm_image:latest</code></pre>



<h3 class="wp-block-heading">vLLM inference server deployment</h3>



<p>Once your image has been pushed, it can be used with <em>AI Deploy</em>, using either the <em>ovhai CLI</em> or the <em>OVHcloud Control Panel (UI)</em>.</p>



<h5 class="wp-block-heading">Creating an access token </h5>



<p>Tokens are used as unique authenticators to securely access the <em>AI Deploy</em> apps. By creating a token, you can ensure that only authorized requests are allowed to interact with the <em>vLLM</em> endpoint. You can create this token by using the <em>OVHcloud Control Panel (UI)</em> or by running the following command:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai token create vllm --role operator --label-selector name=vllm</code></pre>



<p>This will give you a token that you will need to keep.</p>



<h5 class="wp-block-heading">Creating a Hugging Face token (optional)</h5>



<p>Note that some models, such as <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 3</a>, require you to accept their license. In that case, you need to create a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>, accept the model’s license, and generate a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">token</a> from your account settings; this token will allow you to access the model.</p>



<p>For example, when visiting the Hugging Face <a href="https://huggingface.co/google/gemma-2b" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gemma model page</a>, you’ll see this (if you are logged in):</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="716" height="312" src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png" alt="accept_model_conditions_hugging_face" class="wp-image-26768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png 716w, https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21-300x131.png 300w" sizes="auto, (max-width: 716px) 100vw, 716px" /></figure>



<p>If you want to use this model, you will have to acknowledge the license, and then make sure to create a token in the <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tokens section</a>.</p>



<p>In the next step, we will set this token as an environment variable (named <code>HF_TOKEN</code>). This will enable us to use any LLM whose conditions of use we have accepted.</p>
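


<p>Before deploying, you can optionally check that your token is valid with a few lines of Python. This is a minimal sketch, assuming the <code>huggingface_hub</code> library is installed (<code>pip install huggingface_hub</code>):</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Sanity check: confirm the Hugging Face token is valid before deployment
from huggingface_hub import HfApi

HF_TOKEN = "&lt;YOUR_HUGGING_FACE_TOKEN&gt;"  # the token created above

api = HfApi(token=HF_TOKEN)
# whoami() raises an error if the token is invalid or expired
print(api.whoami()["name"])</code></pre>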



<h5 class="wp-block-heading">Run the AI Deploy application</h5>



<p>Run the following command to deploy your <em>vLLM</em> server by running your customized <em>Docker</em> image:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai app run &lt;shared-registry-address&gt;/vllm_image:latest \
  --name vllm_app \
  --flavor h100-1-gpu \
  --gpu 1 \
  --env HF_TOKEN="&lt;YOUR_HUGGING_FACE_TOKEN&gt;" \
  --label name=vllm \
  --default-http-port 8080 \
  -- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code></pre>



<p><em>You just need to change the address of your registry to the one you used, and the name of the LLM you want to use. Also pay attention to the name of the image, its tag, and the label selector of your token if you haven&#8217;t used the same ones as those given in this tutorial.</em></p>



<p><strong>Parameters explanation</strong></p>



<ul class="wp-block-list">
<li><code>&lt;shared-registry-address&gt;/vllm_image:latest</code> is the image on which the app is based.</li>



<li><code>--name vllm_app</code> is an optional argument that allows you to give your app a custom name, making it easier to manage all your apps.</li>



<li><code>--flavor h100-1-gpu</code> indicates that we want to run our app on H100 GPU(s). You can access the full list of GPUs available by running <code>ovhai capabilities flavor list</code></li>



<li><code>--gpu 1</code> indicates that we request 1 GPU for that app.</li>



<li><code>--env HF_TOKEN</code> is an optional argument that allows us to set our Hugging Face token as an environment variable. This gives us access to models for which we have accepted the conditions.</li>



<li><code>--label name=vllm</code> allows us to restrict access to our LLM to requests carrying the token created for the label selector <code>name=vllm</code>.</li>



<li><code>--default-http-port 8080</code> indicates that the port to reach on the app URL is <code>8080</code>.</li>



<li><code>-- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt;</code> allows us to start the vLLM API server. The specified &lt;model&gt; will be downloaded from Hugging Face. Here is a list of those that are <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">supported by vLLM</a>. <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Many arguments</a> can be used to optimize your inference.</li>
</ul>



<p>When this <code>ovhai app run</code> command is executed, several pieces of information will appear in your terminal. Get the ID of your application, and open the Info URL in a new tab. Wait a few minutes for your application to launch. When it is <strong>RUNNING</strong>, you can stream its logs by executing:</p>



<pre class="wp-block-code"><code class="">ovhai app logs -f &lt;APP_ID&gt;</code></pre>



<p>This will allow you to track the server launch, the model download and any errors you may encounter if you have used a model for which you have not accepted the user contract. </p>



<p>If all goes well, you should see the following output, which means that your server is up and running:</p>



<pre class="wp-block-code"><code class="">Started server process [11]
Waiting for application startup.
Application startup complete.
Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre>
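


<p>If you prefer to check readiness from a script rather than streaming the logs, a small polling sketch in Python also works. It only verifies that the app URL answers at all (any HTTP status code, even an authentication error, means the gateway is up); it is not an official health check:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import time
import requests

# change for your host
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"

while True:
    try:
        # any HTTP response means the app is reachable;
        # connection errors mean it is still starting
        requests.get(APP_URL, timeout=5)
        print("App is reachable")
        break
    except requests.exceptions.RequestException:
        print("Still starting...")
        time.sleep(10)</code></pre>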



<h3 class="wp-block-heading">Interacting with your LLM</h3>



<p>Once the server is up and running, we can interact with our LLM by hitting the <code>/generate</code> endpoint.</p>



<p><strong>Using cURL</strong></p>



<p><em>Make sure you change the ID to that of your application so that you target the right endpoint. In order for the request to be accepted, also specify the token that you generated previously by executing</em> <code>ovhai token create</code>. Feel free to adapt the parameters of the request (<em>prompt</em>, <em>max_tokens</em>, <em>temperature</em>, &#8230;)</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">curl --request POST \
  --url https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/generate \
  --header 'Authorization: Bearer &lt;AI_TOKEN_generated_with_CLI&gt;' \
  --header 'Content-Type: application/json' \
  --data '{
        "prompt": "&lt;YOUR_PROMPT&gt;",
        "max_tokens": 50,
        "n": 1,
        "stream": false
}'</code></pre>



<p><strong>Using Python</strong></p>



<p><em>Here too, you need to add your personal token and the correct link for your application.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests
import json

# change for your host
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "AI_TOKEN_generated_with_CLI"

url = f"{APP_URL}/generate"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}
data = {
    "prompt": "What a LLM is in AI?",
    "max_tokens": 100,
    "temperature": 0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json()["text"][0])</code></pre>
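


<p>The server can also stream tokens as they are generated, by setting <code>"stream": true</code> in the request. The sketch below assumes the demo <code>api_server</code>&#8217;s streaming format (JSON chunks separated by null bytes, each carrying the full text generated so far); this format may vary between vLLM versions, so adjust the delimiter if needed.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import json
import requests

# change for your host and token
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "AI_TOKEN_generated_with_CLI"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}
data = {
    "prompt": "What is an LLM in AI?",
    "max_tokens": 100,
    "temperature": 0,
    "stream": True
}

# stream=True tells requests not to buffer the whole response body
with requests.post(f"{APP_URL}/generate", headers=headers, json=data, stream=True) as response:
    # assumption: chunks are JSON objects delimited by null bytes ("\0")
    for chunk in response.iter_lines(delimiter=b"\0"):
        if chunk:
            # each chunk contains the full text generated so far
            print(json.loads(chunk)["text"][0])</code></pre>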



<h3 class="wp-block-heading" id="AI-ENDPOINTS">OVHcloud AI Endpoints</h3>



<p>If you are not interested in building your own image and deploying your own LLM inference server, you can use OVHcloud&#8217;s new <em><strong><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></strong> </em>product, which will definitely make your life easier!</p>



<p><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is a serverless solution that provides AI APIs, enabling you to easily use pre-trained and optimized AI models in your applications. </p>



<figure class="wp-block-video"><video height="1400" style="aspect-ratio: 2560 / 1400;" width="2560" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4"></video></figure>



<p class="has-text-align-center"><em>Overview of AI Endpoints</em></p>



<p>You can use LLM as a Service, choosing the desired model (such as <em>LLaMA</em>, <em>Mistral</em>, or <em>Mixtral</em>) and making an API call to use it in your application. This will allow you to interact with these models without even having to deploy them!</p>
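


<p>As an illustration, calling a model then boils down to a single authenticated HTTP request. In the sketch below, the endpoint URL, the token and the payload schema are placeholders: copy the real values from the <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> catalog, which documents the exact API of each model.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests

# placeholders: take the real endpoint URL and token from the AI Endpoints catalog
ENDPOINT_URL = "&lt;MODEL_ENDPOINT_URL_from_the_catalog&gt;"
API_TOKEN = "&lt;YOUR_AI_ENDPOINTS_TOKEN&gt;"

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    # the exact payload schema depends on the chosen model; see its catalog docs
    json={"prompt": "What is an LLM in AI?", "max_tokens": 100},
)
print(response.json())</code></pre>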



<p>In addition to LLM capabilities, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> also offers a range of other AI models, including speech-to-text, translation, summarization, embeddings and computer vision. </p>



<p>Best of all, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is currently in alpha phase and is <strong>free to use</strong>, making it an accessible and affordable solution for developers seeking to explore the possibilities of AI. Check <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">this article</a> and try it out today to discover the power of AI!</p>



<p>Join our <a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Discord server</a> to interact with the community and send us your feedback (#<em>ai-endpoints</em> channel)!</p>
]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4" length="14424826" type="video/mp4" />

			</item>
	</channel>
</rss>
