<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>LLM Serving Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/llm-serving/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/llm-serving/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Wed, 14 Aug 2024 07:57:26 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>LLM Serving Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/llm-serving/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Adopting AI in SaaS: how can we move quickly without losing control?</title>
		<link>https://blog.ovhcloud.com/ai-saas-ovhcloud/</link>
		
		<dc:creator><![CDATA[Germain Masse]]></dc:creator>
		<pubDate>Wed, 14 Aug 2024 06:04:09 +0000</pubDate>
				<category><![CDATA[OVHcloud Product News]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Saas]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=27229</guid>

					<description><![CDATA[The widespread use of AI poses numerous challenges: the risks of data leakage, the need for explainable results, how to handle it in SaaS, and the growing dependence on Big Tech. Not to mention the environmental toll linked to AI. No doubt the eco-design of digital services is becoming increasingly popular. Still, the efforts to [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>The widespread use of AI poses numerous challenges: the risks of data leakage, the need for explainable results, how to handle it in SaaS, and the growing dependence on Big Tech. Not to mention the environmental toll linked to AI.</p>



<p>No doubt the eco-design of digital services is becoming increasingly popular. Still, the efforts to achieve digital sobriety seem marginal compared to the energy consumed by training general-purpose LLMs. Is there a way to make AI greener? And what would a more “trusted AI” mean?</p>



<p>Here’s a roundup of challenges and solutions.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-1024x576.jpeg" alt="" class="wp-image-27235" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-1024x576.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-300x169.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-768x432.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-1536x864.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2024/08/AdobeStock_8327344081-2048x1152.jpeg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>Efficiency of specialised LLMs compared to general-purpose LLMs</strong></p>



<p>General-purpose LLMs, such as GPT-4 (OpenAI), LLaMA (Meta) and Gemini (Google), are currently in the spotlight. Versatile, seemingly omniscient, and able to handle a wide variety of scenarios, they appear to meet every need: generating text and code, answering questions, translating content, and even composing poems.</p>



<p>However, these general-purpose models have not yet eclipsed specialised LLMs,<a id="_ftnref1" href="#_ftn1"><sup>[1]</sup></a> which target a narrower range of situations but perform much better in them. Techniques such as Retrieval-Augmented Generation (RAG) and fine-tuning certainly make it possible to specialise a general-purpose LLM, with or without retraining the model. However, the use of general-purpose LLMs continues to pose a range of challenges, starting with their generic results, unreliable quality and lack of reproducibility. This will prove even more challenging as the available sources of quality data may become scarce, due to legal actions<a id="_ftnref2" href="#_ftn2"><sup>[2]</sup></a> brought for unauthorised use of content and copyright infringement. Additionally, the use of general-purpose LLMs leads to operator dependency and reinforces monopolies,<a id="_ftnref3" href="#_ftn3"><sup>[3]</sup></a> which is unfavourable for long-term users.</p>



<p><strong>The impact of AI on the environment<br></strong>Researchers from Hugging Face and the Allen Institute<a href="#_ftn4" id="_ftnref4"><sup>[4]</sup></a> <strong>have shown that, for servers with GPUs, the carbon emissions linked to machine use far exceed those linked to manufacturing the components, unlike in traditional cloud computing. Generating an image using an AI model is one of the most energy-intensive uses, and requires as much electricity as fully charging a smartphone.</strong><a href="#_ftn5" id="_ftnref5"><sup>[5]</sup></a> Reversing the distribution of carbon emissions throughout the lifecycle of servers in this way means that the power usage effectiveness (PUE) of the datacentres in which AI models are trained and served, as well as the energy mix of the countries in which they are located, are very significant selection criteria when calculating your application’s global carbon footprint.</p>



<p>This is a bonus for OVHcloud. Indeed, the Group has long been committed to reducing the carbon footprint of its datacentres.<a id="_ftnref6" href="#_ftn6"><sup>[6]</sup></a></p>



<p>As might be expected, general-purpose LLMs are more environmentally damaging than specialised models designed for specific tasks, as revealed in a series of comparative tests carried out by the same researchers.<a id="_ftnref7" href="#_ftn7"><sup>[7]</sup></a> With thousands of billions of parameters, the largest LLMs are getting ever larger and more data-intensive.<a id="_ftnref8" href="#_ftn8"><sup>[8]</sup></a> An article in the <em>New Scientist</em> recently explained that algorithmic advances are outpacing Moore’s Law: after eight months, a large language model needs only half the computing power to achieve the same level of performance.<a id="_ftnref9" href="#_ftn9"><sup>[9]</sup></a> However, running a model like OpenAI’s today costs Microsoft around $700,000 per day,<a id="_ftnref10" href="#_ftn10">[10]</a> or an average of 36 cents per query. That remains unreasonable, from both an economic and environmental point of view, for meeting needs that are often precise and well-defined.</p>



<p>Specialised models, which can be chained to perform complex tasks (an approach referred to as agentisation), are therefore a more environmentally responsible alternative to general-purpose LLMs. On top of that, specialised models, which are more widely available in open source, are also easier to understand and to fine-tune. They seem more suitable for building innovations reversibly while the ROI is still very uncertain.</p>



<p><strong>Maintaining control: working towards developing a trusted AI</strong></p>



<p>While large companies quickly became aware of the risks of leaking confidential data when using digital services (like online translation, which they are beginning to ban), AI has intensified the temptation to send a company’s data outside and submit it to an algorithm: here to write a report more quickly, there to generate an image that will illustrate a presentation on a confidential project. Samsung learned this the hard way, as a victim of three consecutive data leaks related to the use of ChatGPT by its employees, who notably copied/pasted source code to solve or optimise a problem.<br>You don’t need to disclose a lot of information to say a lot about your intentions. What insights would your rival gather about your strategy from reading your ChatGPT prompts? After all, it is possible for AI to accidentally “expose” data submitted by users, due to a bug<a href="#_ftn11" id="_ftnref11"><sup>[11]</sup></a> causing security issues. The same goes for datasets you might submit on AI platforms: will your data be used to train and refine the model? Could they benefit potential rivals?</p>



<p>Beyond this, there is also the question of the transparency of AI models, and with it the risk of outsourcing increasingly important tasks to sophisticated AI models. Indeed, they can become &#8220;black boxes&#8221;, making incomprehensible decisions, or producing skewed results because of the data they are trained on.</p>



<p>Let&#8217;s face it: even if you have no problem with the results, would you run the risk of relying on a service whose workings you can’t explain, even in broad terms? And one that you couldn’t stop using without losing everything? Here, we encounter another problem – reversibility.</p>



<p>Suppose, for example, that the AI service decides the party is over and that the infrastructure it has long financed at a loss must now be made profitable, so it takes advantage of its monopoly and your dependency to raise its rates unreasonably. You could certainly cancel the service, but you would then lose the results of your data training and/or model specialisation, and you would have to start from scratch. In the current absence of standards for portability and interoperability between different AI services, this issue is crucial – all the more so given that, for the moment, while open source is popular, proprietary models are very much in the majority.</p>



<p>There is no simple answer to the questions that have been raised. That’s because AI development is currently very empirical, based on a trial-and-error model, with no traceability of training data or model modifications.</p>



<p>This, incidentally, makes the “explainability”<a href="#_ftn12" id="_ftnref12"><sup>[12]</sup></a> of an AI system’s results a real challenge, even though the AI Act establishes a duty to do so (see below).</p>



<p>The development of a “trustworthy AI”, as it was termed in a 2019 paper<a href="#_ftn13" id="_ftnref13"><sup>[13]</sup></a><sup> </sup>by the Independent High-Level Expert Group on Artificial Intelligence (AI HLEG), is perhaps a direction to keep in mind. It defines a trustworthy AI with three main objectives, which OVHcloud aims to help you achieve: AI must be lawful (legislative or regulatory aspect), ethical (respect for ethical norms) and robust (from both a technical and social perspective).</p>



<p>In the meantime, ensuring swift regulatory compliance at the national, European, and international levels is a powerful lever for promoting greener business practices, without compromising future prospects in the pursuit of innovation.</p>



<p><strong>1/ Complying with current and future regulations</strong></p>



<p>The EU was quick to respond to the democratisation of AI, proposing a draft European regulation on the subject on 21 April 2021. In March 2024, the AI Act was officially adopted. Now, it applies to all services used in the EU, regardless of whether the providers are foreign or not.</p>



<p>The law divides AI systems into four categories, taking into account their impact on fundamental rights in the EU and the security of individuals, groups, societies, and civilisation. Each risk category has associated prohibitions<a href="#_ftn14" id="_ftnref14"><sup>[14]</sup></a> and obligations, ranging from environmental sustainability to security, and including marking content that has been AI-generated.</p>



<p>An online “compliance checker” allows you to quickly find out the extent to which this European AI law applies to your projects: <a href="https://artificialintelligenceact.eu/assessment/eu-ai-act-compliance-checker/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://artificialintelligenceact.eu/</a></p>



<p>Other national and European regulations on personal data protection, such as the GDPR, already apply to your AI projects, holding companies accountable for hosting and transferring personal data outside the EU.</p>



<p>Incidentally, those who complain about regulation being too burdensome in comparison to the American laissez-faire attitude or the Chinese spirit of conquest have the wrong end of the stick: the absence of a genuine European single market is a much bigger factor<a href="#_ftn15" id="_ftnref15"><sup>[15]</sup></a> behind Europe’s innovation gap. So, too, are the weak support for public procurement and the incomprehensible message sent by governments that claim to want to develop sovereign solutions by relying on investments from foreign stakeholders.<a href="#_ftn16" id="_ftnref16"><sup>[16]</sup></a></p>



<p>It’s also worth noting that the AI Act provides for the possibility for national competent authorities (the ICO in the UK) to set up “regulatory sandboxes”, i.e. controlled environments for testing innovative technologies for a limited time, in order to ensure the compliance of the AI system without delaying any potential placing on the market, with priority access to these sandboxes for SMEs and startups.</p>



<p>In short, regulations today do not hinder the development of projects that take advantage of the possibilities offered by AI, but rather strengthen companies’ obligations regarding the protection of personal data due to increased risks. These obligations will help to reassure users, once this brief period of carelessness and frivolity with AI has passed, and the inevitable first scandals start to surface. As Yoshua Bengio, researcher and founder of MILA (the Quebec Artificial Intelligence Institute), summed up: “We’re going too fast in an unfamiliar direction, and that could change the world in a very positive, or very dangerous, way.”<a href="#_ftn17" id="_ftnref17"><sup>[17]</sup></a><sup> </sup>Countries should therefore seek to regulate AI so that its development does not feel like the Wild West.</p>



<p>In this context, the preference for sovereign solutions will make it easier for your projects to comply with regulations, in addition to establishing a clear medium- and long-term vision. COVID and the current geopolitical instability have shown the cost of relying on foreign entities for essential services, and AI-based services will quickly follow the same path as they are integrated into the software we use every day, in such critical areas as health, education, transport, or logistics.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><a href="#_ftnref1" id="_ftn1"><sup>[1]</sup></a> General-purpose and specialised LLMs can be distinguished by the number of parameters in their neural network: tens, hundreds or even thousands of billions of parameters for a general-purpose model versus “a few billion” for a specialised model.</p>



<p><a href="#_ftnref2" id="_ftn2"><sup>[2]</sup></a> <a href="https://www.usine-digitale.fr/article/openai-cible-par-deux-class-actions-aux-etats-unis.N2148412" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.usine-digitale.fr/article/openai-cible-par-deux-class-actions-aux-etats-unis.N2148412</a>; <a href="https://www.lefigaro.fr/secteur/high-tech/des-journaux-americains-poursuivent-openai-et-microsoft-en-justice-pour-violation-de-leurs-droits-d-auteur-20240430" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.lefigaro.fr/secteur/high-tech/des-journaux-americains-poursuivent-openai-et-microsoft-en-justice-pour-violation-de-leurs-droits-d-auteur-20240430</a></p>



<p><a id="_ftn3" href="#_ftnref3"><sup>[3]</sup></a> <a href="https://www.nytimes.com/2024/06/05/technology/nvidia-microsoft-openai-antitrust-doj-ftc.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.nytimes.com/2024/06/05/technology/nvidia-microsoft-openai-antitrust-doj-ftc.html</a></p>



<p><a id="_ftn4" href="#_ftnref4"><sup>[4]</sup></a> <a href="http://arxiv.org/pdf/2311.16863" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">http://arxiv.org/pdf/2311.16863</a></p>



<p><a id="_ftn5" href="#_ftnref5"><sup>[5]</sup></a> <a href="https://www.technologyreview.com/2023/12/01/1084189/making-an-image-with-generative-ai-uses-as-much-energy-as-charging-your-phone/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.technologyreview.com/2023/12/01/1084189/making-an-image-with-generative-ai-uses-as-much-energy-as-charging-your-phone/</a></p>



<p><a id="_ftn6" href="#_ftnref6"><sup>[6]</sup></a> <a href="https://corporate.ovhcloud.com/en/sustainability/environment/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://corporate.ovhcloud.com/en-gb/sustainability/environment/</a><br>For our <strong>PUE calculation methodology,</strong> refer to <a href="https://corporate.ovhcloud.com/sites/default/files/2024-01/methodo_carboncalc_0.pdf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://corporate.ovhcloud.com/sites/default/files/2024-01/methodo_carboncalc_0.pdf</a></p>



<p><a id="_ftn7" href="#_ftnref7"><sup>[7]</sup></a> <a href="https://www.silicon.fr/llm-generaliste-specialise-angle-environnemental-473911.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.silicon.fr/llm-generaliste-specialise-angle-environnemental-473911.html</a></p>



<p><a id="_ftn8" href="#_ftnref8"><sup>[8]</sup></a> <a href="https://www.radiofrance.fr/franceculture/podcasts/le-journal-de-l-eco/le-cout-environnemental-de-l-ia-est-colossal-et-sous-evalue-3781962" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.radiofrance.fr/franceculture/podcasts/le-journal-de-l-eco/le-cout-environnemental-de-l-ia-est-colossal-et-sous-evalue-3781962</a></p>



<p><a id="_ftn9" href="#_ftnref9"><sup>[9]</sup></a> <a href="https://www.newscientist.com/article/2424179-ai-chatbots-are-improving-at-an-even-faster-rate-than-computer-chips/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.newscientist.com/article/2424179-ai-chatbots-are-improving-at-an-even-faster-rate-than-computer-chips/</a></p>



<p><a id="_ftn10" href="#_ftnref10"><sup>[10]</sup></a> <a href="https://usbeketrica.com/fr/article/chatgpt-coute-t-il-vraiment-700-000-dollars-par-jour-a-openai" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://usbeketrica.com/fr/article/chatgpt-coute-t-il-vraiment-700-000-dollars-par-jour-a-openai</a></p>



<p><a id="_ftn11" href="#_ftnref11"><sup>[11]</sup></a> <a href="https://arstechnica.com/information-technology/2023/02/chatgpt-is-a-data-privacy-nightmare-and-you-ought-to-be-concerned/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://arstechnica.com/information-technology/2023/02/chatgpt-is-a-data-privacy-nightmare-and-you-ought-to-be-concerned/</a></p>



<p><a id="_ftn12" href="#_ftnref12"><sup>[12]</sup></a> <a href="https://www.cnil.fr/fr/definition/explicabilite-ia" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.cnil.fr/fr/definition/explicabilite-ia</a></p>



<p><a id="_ftn13" href="#_ftnref13"><sup>[13]</sup></a> <a href="https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1</a></p>



<p><a href="#_ftnref14" id="_ftn14"><sup>[14]</sup></a> AI systems are prohibited if they violate EU values by infringing on fundamental rights, such as:</p>



<ul class="wp-block-list">
<li>Subliminally manipulating behaviours</li>



<li>Exploiting individuals’ vulnerabilities in order to influence their behaviour</li>



<li>AI-based social scoring used by governments for general purposes</li>



<li>The use of “real-time” remote biometric identification systems in publicly accessible spaces for law enforcement purposes (with exceptions).</li>
</ul>



<p><a id="_ftn15" href="#_ftnref15"><sup>[15]</sup></a> <a href="https://twitter.com/hubertguillaud/status/1795001082843713968" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://twitter.com/hubertguillaud/status/1795001082843713968</a></p>



<p><a id="_ftn16" href="#_ftnref16"><sup>[16]</sup></a> <a href="https://twitter.com/canardenchaine/status/1795862230782640367" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://twitter.com/canardenchaine/status/1795862230782640367</a></p>



<p><a id="_ftn17" href="#_ftnref17"><sup>[17]</sup></a> <a href="https://ici.radio-canada.ca/ohdio/premiere/emissions/ils-ont-fait-annee/segments/entrevue/469120/robot-chatgpt-lois-securite-ordinateurs" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://ici.radio-canada.ca/ohdio/premiere/emissions/ils-ont-fait-annee/segments/entrevue/469120/robot-chatgpt-lois-securite-ordinateurs</a></p>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to serve LLMs with vLLM and OVHcloud AI Deploy</title>
		<link>https://blog.ovhcloud.com/how-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 29 May 2024 12:22:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMA]]></category>
		<category><![CDATA[LLaMA 3]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Mixtral]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26762</guid>

					<description><![CDATA[In this tutorial, we will learn how to serve Large Language Models (LLMs) using vLLM and the OVHcloud AI Products.]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of serving large language models (LLMs), providing step-by-step instructions</em>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" width="1024" height="345" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png" alt="" class="wp-image-25615" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-768x259.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1536x518.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-2048x690.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>



<h3 class="wp-block-heading">Introduction</h3>



<p>In recent years, <strong>large language models</strong> (LLMs) have become increasingly <strong>popular</strong>, with <strong>open-source</strong> models like <em>Mistral</em> and <em>LLaMA</em> gaining widespread attention. In particular, the <em>LLaMA 3</em> model, released on <em>April 18, 2024</em>, is one of today&#8217;s most powerful open-source LLMs.</p>



<p>However, <strong>serving these LLMs can be challenging</strong>, particularly on hardware with limited resources. Indeed, even on expensive hardware, LLMs can be surprisingly slow, with high VRAM utilization and throughput limitations.</p>



<p>This is where<strong><em> </em></strong><em><a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>vLLM</strong></a></em> comes in. <em><strong>vLLM</strong></em> is an <strong>open-source project</strong> that enables <strong>fast and easy-to-use LLM inference and serving</strong>. Designed for optimal performance and resource utilization, <em>vLLM</em> supports a range of <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLM architectures</a> and offers <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">flexible customization options</a>. That&#8217;s why we are going to use it to efficiently deploy and scale our LLMs.</p>
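


<p>If you want to get a quick feel for <em>vLLM</em> before deploying anything, you can try its offline inference API locally. Below is a minimal sketch, assuming <em>vLLM</em> is installed (<code>pip install vllm</code>) and a GPU is available; <code>facebook/opt-125m</code> is simply a small model chosen for a fast smoke test, not a recommendation:</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Minimal offline vLLM sketch: load a small model and generate a completion
from vllm import LLM, SamplingParams

# Any Hugging Face model supported by vLLM can be used here
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=50)

outputs = llm.generate(["What is LLM serving?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)</code></pre>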



<h3 class="wp-block-heading">Objective</h3>



<p>In this guide, you will discover how to deploy an LLM thanks to <a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em></a> and the <em>OVHcloud</em> <strong><em>AI Deploy</em></strong> solution. This will enable you to benefit from <em>vLLM</em>&#8216;s optimisations and <em>OVHcloud</em>&#8216;s GPU computing resources. Your LLM will then be exposed through a secured API.</p>



<p>🎁 And for those who do not want to bother with the deployment process, <strong>a surprise awaits you at the <a href="#AI-ENDPOINTS">end of the article</a></strong>. We are going to introduce you to our new solution for using LLMs, called <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Endpoints</strong></a>. This product makes it easy to integrate AI capabilities into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it&#8217;s in alpha, it&#8217;s <strong>free</strong>!</p>



<h3 class="wp-block-heading">Requirements</h3>



<p>To deploy your <em>vLLM</em> server, you need:</p>



<ul class="wp-block-list">
<li>An <em>OVHcloud</em> account to access the <a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude"><em>OVHcloud Control Panel</em></a></li>



<li>A <em>Public Cloud</em> project</li>



<li>A <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for the AI Products</a>, related to this <em>Public Cloud</em> project</li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The <em>OVHcloud AI CLI</em></a> installed on your local computer (to interact with the AI products by running commands). </li>



<li><a href="https://www.docker.com/get-started" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a> installed on your local computer, <strong>or</strong> access to a Debian Docker Instance, which is available on the <a href="https://www.ovh.com/manager/public-cloud/" data-wpel-link="exclude"><em>Public Cloud</em></a></li>
</ul>



<p>Once these conditions have been met, you are ready to serve your LLMs.</p>



<h3 class="wp-block-heading">Building a Docker image</h3>



<p>Since the <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>OVHcloud AI Deploy</em></a> solution is based on <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Docker</em></a> images, we will be using a <em>Docker</em> image to deploy our <em>vLLM</em> inference server. </p>



<p>As a reminder, <em>Docker</em> is a platform that allows you to create, deploy, and run applications in containers. <em>Docker</em> containers are standalone and executable packages that include everything needed to run an application (code, libraries, system tools).</p>



<p>To create this <em>Docker</em> image, we will need to write the following <em><strong>Dockerfile</strong></em> into a new folder:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">mkdir my_vllm_image
cd my_vllm_image
nano Dockerfile</code></pre>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 🐳 Base image
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 👱 Set the working directory inside the container
WORKDIR /workspace

# 📚 Install missing system packages (git) so we can clone the vLLM project repository
RUN apt-get update &amp;&amp; apt-get install -y git
RUN git clone https://github.com/vllm-project/vllm/

# 📚 Install the Python dependencies
RUN pip3 install --upgrade pip
RUN pip3 install vllm 

# 🔑 Give correct access rights to the OVHcloud user
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace</code></pre>



<p>Let&#8217;s take a closer look at this <em>Dockerfile</em> to understand it:</p>



<ul class="wp-block-list">
<li><strong>FROM</strong>: Specifies the base image for our <em>Docker</em> image. We choose the <em>PyTorch</em> image since it comes with <em>CUDA</em>, <em>CuDNN</em> and <em>torch</em>, which are needed by <em>vLLM</em>. </li>



<li><strong>WORKDIR /workspace</strong>: We set the working directory for the <em>Docker</em> container to <em>/workspace</em>, which is the default folder when we use <em>AI Deploy</em>.</li>



<li><strong>RUN</strong>: Each <strong>RUN</strong> instruction executes a command at build time. Here, we install <em>git</em> and clone the <a href="https://github.com/vllm-project/vllm/tree/main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em> repository</a> into the <em>/workspace</em> directory, then upgrade <em>pip</em> to the latest version and install the <em>vLLM</em> library.</li>



<li><strong>ENV</strong> HOME=/workspace: This sets the <em>HOME</em> environment variable to <em>/workspace</em>. This is a requirement to use the <em>OVHcloud</em> AI Products.</li>



<li><strong>RUN chown -R 42420:42420 /workspace</strong>: This changes the owner of the <em>/workspace</em> directory to the user and group with IDs of <em>42420</em> (<em>OVHcloud</em> user). This is also a requirement to use the <em>OVHcloud</em> AI Products.</li>
</ul>



<p>This <em>Dockerfile</em> does not contain a <strong>CMD</strong> instruction and therefore does not launch our <em>vLLM</em> server. Do not worry about that: we will do it directly from <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a> to have more flexibility.</p>



<p>Once your Dockerfile is written, launch the following command to build your image:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker build . -t vllm_image:latest</code></pre>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>Once you have built the Docker image, you will need to push it to a <strong>registry</strong> to make it accessible from <em>AI Deploy</em>. A <strong>registry</strong> is a service that allows you to store and distribute <em>Docker</em> images, making it easy to deploy them in different environments.</p>



<p>Several registries can be used (<em><a href="https://www.ovhcloud.com/en-gb/public-cloud/managed-private-registry/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Private Registry</a>, <a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a>, <a href="https://github.com/features/packages" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub packages</a>, &#8230;</em>). In this tutorial, we will use the <strong><em>OVHcloud</em> <em>shared registry</em></strong>. More information is available in the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-manage-registries?id=kb_article_view&amp;sysparm_article=KB0057949" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registries documentation</a>.</p>



<p>To find the address of your shared registry, use the following command (<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a> needs to be installed on your computer):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">ovhai registry list</code></pre>



<p>Then, log in on your <em>shared registry</em> with your usual <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Platform user</em></a> credentials:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>Once you are logged in to the registry, tag the built image and push it into your shared registry:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker tag vllm_image:latest &lt;shared-registry-address&gt;/vllm_image:latest
docker push &lt;shared-registry-address&gt;/vllm_image:latest</code></pre>



<h3 class="wp-block-heading">vLLM inference server deployment</h3>



<p>Once your image has been pushed, it can be used with <em>AI Deploy</em>, using either the <em>ovhai CLI</em> or the <em>OVHcloud Control Panel (UI)</em>.</p>



<h5 class="wp-block-heading">Creating an access token </h5>



<p>Tokens are used as unique authenticators to securely access the <em>AI Deploy</em> apps. By creating a token, you can ensure that only authorized requests are allowed to interact with the <em>vLLM</em> endpoint. You can create this token by using the <em>OVHcloud Control Panel (UI)</em> or by running the following command:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai token create vllm --role operator --label-selector name=vllm</code></pre>



<p>This will give you a token that you will need to keep.</p>



<h5 class="wp-block-heading">Creating a Hugging Face token (optional)</h5>



<p>Note that some models, such as <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 3</a>, require you to accept their license. In that case, you need to create a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>, accept the model’s license, and generate a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">token</a> from your account settings; this token will allow you to access the model.</p>



<p>For example, when visiting the Hugging Face <a href="https://huggingface.co/google/gemma-2b" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gemma model page</a>, you’ll see this (if you are logged in):</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="716" height="312" src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png" alt="accept_model_conditions_hugging_face" class="wp-image-26768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png 716w, https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21-300x131.png 300w" sizes="auto, (max-width: 716px) 100vw, 716px" /></figure>



<p>If you want to use this model, you will have to acknowledge the license, and then make sure to create a token in the <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tokens section</a>.</p>



<p>In the next step, we will set this token as an environment variable (named <code>HF_TOKEN</code>). This will enable us to use any LLM whose conditions of use we have accepted.</p>
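


<p>Before deploying, you can optionally check that your token is valid with a few lines of Python. This is a minimal sketch, assuming the <code>huggingface_hub</code> library is installed (<code>pip install huggingface_hub</code>):</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Sanity check: confirm the Hugging Face token is valid before deployment
from huggingface_hub import HfApi

HF_TOKEN = "&lt;YOUR_HUGGING_FACE_TOKEN&gt;"  # the token created above

api = HfApi(token=HF_TOKEN)
# whoami() raises an error if the token is invalid or expired
print(api.whoami()["name"])</code></pre>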



<h5 class="wp-block-heading">Run the AI Deploy application</h5>



<p>Run the following command to deploy your <em>vLLM</em> server by running your customized <em>Docker</em> image:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai app run &lt;shared-registry-address&gt;/vllm_image:latest \
  --name vllm_app \
  --flavor h100-1-gpu \
  --gpu 1 \
  --env HF_TOKEN="&lt;YOUR_HUGGING_FACE_TOKEN&gt;" \
  --label name=vllm \
  --default-http-port 8080 \
  -- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code></pre>



<p><em>You just need to change the address of your registry to the one you used, and the name of the LLM you want to use. Also pay attention to the name of the image, its tag, and the label selector of your token if you haven&#8217;t used the same ones as those given in this tutorial.</em></p>



<p><strong>Parameters explanation</strong></p>



<ul class="wp-block-list">
<li><code>&lt;shared-registry-address&gt;/vllm_image:latest</code> is the image on which the app is based.</li>



<li><code>--name vllm_app</code> is an optional argument that allows you to give your app a custom name, making it easier to manage all your apps.</li>



<li><code>--flavor h100-1-gpu</code> indicates that we want to run our app on H100 GPU(s). You can access the full list of GPUs available by running <code>ovhai capabilities flavor list</code></li>



<li><code>--gpu 1</code> indicates that we request 1 GPU for that app.</li>



<li><code>--env HF_TOKEN</code> is an optional argument that allows us to set our Hugging Face token as an environment variable. This gives us access to models for which we have accepted the conditions.</li>



<li><code>--label name=vllm</code> allows us to restrict access to our LLM to requests carrying the token created for the label selector <code>name=vllm</code>.</li>



<li><code>--default-http-port 8080</code> indicates that the port to reach on the app URL is <code>8080</code>.</li>



<li><code>-- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt;</code> allows us to start the vLLM API server. The specified &lt;model&gt; will be downloaded from Hugging Face. Here is a list of those that are <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">supported by vLLM</a>. <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Many arguments</a> can be used to optimize your inference.</li>
</ul>



<p>When this <code>ovhai app run</code> command is executed, several pieces of information will appear in your terminal. Get the ID of your application, and open the Info URL in a new tab. Wait a few minutes for your application to launch. When it is <strong>RUNNING</strong>, you can stream its logs by executing:</p>



<pre class="wp-block-code"><code class="">ovhai app logs -f &lt;APP_ID&gt;</code></pre>



<p>This will allow you to track the server launch, the model download and any errors you may encounter if you have used a model for which you have not accepted the user contract. </p>



<p>If all goes well, you should see the following output, which means that your server is up and running:</p>



<pre class="wp-block-code"><code class="">Started server process [11]
Waiting for application startup.
Application startup complete.
Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre>
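


<p>If you prefer to check readiness from a script rather than streaming the logs, a small polling sketch in Python also works. It only verifies that the app URL answers at all (any HTTP status code, even an authentication error, means the gateway is up); it is not an official health check:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import time
import requests

# change for your host
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"

while True:
    try:
        # any HTTP response means the app is reachable;
        # connection errors mean it is still starting
        requests.get(APP_URL, timeout=5)
        print("App is reachable")
        break
    except requests.exceptions.RequestException:
        print("Still starting...")
        time.sleep(10)</code></pre>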



<h3 class="wp-block-heading">Interacting with your LLM</h3>



<p>Once the server is up and running, we can interact with our LLM by hitting the <code>/generate</code> endpoint.</p>



<p><strong>Using cURL</strong></p>



<p><em>Make sure you change the ID to that of your application so that you target the right endpoint. In order for the request to be accepted, also specify the token that you generated previously by executing</em> <code>ovhai token create</code>. Feel free to adapt the parameters of the request (<em>prompt</em>, <em>max_tokens</em>, <em>temperature</em>, &#8230;)</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">curl --request POST \
  --url https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/generate \
  --header 'Authorization: Bearer &lt;AI_TOKEN_generated_with_CLI&gt;' \
  --header 'Content-Type: application/json' \
  --data '{
        "prompt": "&lt;YOUR_PROMPT&gt;",
        "max_tokens": 50,
        "n": 1,
        "stream": false
}'</code></pre>



<p><strong>Using Python</strong></p>



<p><em>Here too, you need to add your personal token and the correct link for your application.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests
import json

# change for your host
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "AI_TOKEN_generated_with_CLI"

url = f"{APP_URL}/generate"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}
data = {
    "prompt": "What a LLM is in AI?",
    "max_tokens": 100,
    "temperature": 0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json()["text"][0])</code></pre>
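


<p>The server can also stream tokens as they are generated, by setting <code>"stream": true</code> in the request. The sketch below assumes the demo <code>api_server</code>&#8217;s streaming format (JSON chunks separated by null bytes, each carrying the full text generated so far); this format may vary between vLLM versions, so adjust the delimiter if needed.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import json
import requests

# change for your host and token
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "AI_TOKEN_generated_with_CLI"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}
data = {
    "prompt": "What is an LLM in AI?",
    "max_tokens": 100,
    "temperature": 0,
    "stream": True
}

# stream=True tells requests not to buffer the whole response body
with requests.post(f"{APP_URL}/generate", headers=headers, json=data, stream=True) as response:
    # assumption: chunks are JSON objects delimited by null bytes ("\0")
    for chunk in response.iter_lines(delimiter=b"\0"):
        if chunk:
            # each chunk contains the full text generated so far
            print(json.loads(chunk)["text"][0])</code></pre>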



<h3 class="wp-block-heading" id="AI-ENDPOINTS">OVHcloud AI Endpoints</h3>



<p>If you are not interested in building your own image and deploying your own LLM inference server, you can use OVHcloud&#8217;s new <em><strong><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></strong> </em>product, which will definitely make your life easier!</p>



<p><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is a serverless solution that provides AI APIs, enabling you to easily use pre-trained and optimized AI models in your applications. </p>



<figure class="wp-block-video"><video height="1400" style="aspect-ratio: 2560 / 1400;" width="2560" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4"></video></figure>



<p class="has-text-align-center"><em>Overview of AI Endpoints</em></p>



<p>You can use LLM as a Service, choosing the desired model (such as <em>LLaMA</em>, <em>Mistral</em>, or <em>Mixtral</em>) and making an API call to use it in your application. This will allow you to interact with these models without even having to deploy them!</p>
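


<p>As an illustration, calling a model then boils down to a single authenticated HTTP request. In the sketch below, the endpoint URL, the token and the payload schema are placeholders: copy the real values from the <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> catalog, which documents the exact API of each model.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests

# placeholders: take the real endpoint URL and token from the AI Endpoints catalog
ENDPOINT_URL = "&lt;MODEL_ENDPOINT_URL_from_the_catalog&gt;"
API_TOKEN = "&lt;YOUR_AI_ENDPOINTS_TOKEN&gt;"

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    # the exact payload schema depends on the chosen model; see its catalog docs
    json={"prompt": "What is an LLM in AI?", "max_tokens": 100},
)
print(response.json())</code></pre>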



<p>In addition to LLM capabilities, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> also offers a range of other AI models, including speech-to-text, translation, summarization, embeddings and computer vision. </p>



<p>Best of all, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is currently in alpha phase and is <strong>free to use</strong>, making it an accessible and affordable solution for developers seeking to explore the possibilities of AI. Check <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">this article</a> and try it out today to discover the power of AI!</p>



<p>Join our <a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Discord server</a> to interact with the community and send us your feedback (#<em>ai-endpoints</em> channel)!</p>
]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4" length="14424826" type="video/mp4" />

			</item>
	</channel>
</rss>
