You are not dreaming! You can deploy an open-source LLM in a single command line.

Deploying advanced language models can be a challenge! But this once arduous task is becoming increasingly accessible, enabling developers to integrate sophisticated AI capabilities into their applications.
In this guide, we will walk through deploying the Mistral-Small-24B-Instruct-2501 model using vLLM on OVHcloud’s AI Deploy platform. This combination offers a powerful solution for efficient and scalable AI model serving.
Deploying a model is great, but doing it quickly is even better!
🤯 What if a single command line was enough? That’s the challenge we’re tackling today!
Context
Before deployment, let’s take a closer look at our key technologies!
Mistral Small
The mistralai/Mistral-Small-24B-Instruct-2501
model, from MistralAI, is a 24-billion-parameter instruction-fine-tuned version of the base model Mistral-Small-24B-Base-2501, renowned for its compact size and for performance comparable to that of larger models.
To serve this model efficiently, we will utilize vLLM, an open-source library for LLM inference.
vLLM
vLLM is a highly optimized serving engine designed to efficiently run large language models. It takes advantage of several key optimizations, such as:
- PagedAttention: an attention mechanism that reduces memory fragmentation and enables more efficient use of GPU memory
- Continuous Batching: vLLM dynamically adjusts batch sizes in real time, ensuring that the GPU is always used efficiently, even with multiple simultaneous requests
- Tensor parallelism: enables model inference across multiple GPUs to boost performance
- Optimized kernel implementations: vLLM uses custom CUDA kernels for faster execution, reducing latency compared to traditional inference frameworks
These features make vLLM one of the best choices for large models such as Mistral Small 24B, enabling low-latency, high-throughput inference on the latest GPUs.
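To get a feel for the library itself, here is a minimal sketch of vLLM's offline inference API (independent of the deployment below, which uses vLLM's OpenAI-compatible server instead). It assumes vLLM is installed locally (pip install vllm) and uses a small, non-gated model so it fits on a single GPU:
# Minimal sketch of vLLM's offline inference API (assumes `pip install vllm`).
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")  # replace with any Hugging Face model id you have access to
params = SamplingParams(temperature=0.7, max_tokens=64)
# Continuous batching: the engine schedules both prompts together on the GPU.
outputs = llm.generate(["What is PagedAttention?", "Explain tensor parallelism."], params)
for out in outputs:
    print(out.outputs[0].text)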
Combined with OVHcloud’s AI Deploy platform, this lets you deploy the model in a single command line.
AI Deploy
OVHcloud AI Deploy is a Container as a Service (CaaS) platform designed to help you deploy, manage and scale AI models. It allows you to optimally deploy your applications and APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs.
The key benefits are:
- Easy to use: bring your own custom Docker image and deploy it with a single command line or a few clicks
- High-performance computing: a complete range of GPUs available (H100, A100, V100S, L40S and L4)
- Scalability and flexibility: supports automatic scaling, allowing your model to effectively handle fluctuating workloads
- Cost-efficient: billing per minute, no surcharges
✅ To go further, some prerequisites must be checked!
Prerequisites
Before you begin, ensure that you have:
- OVHcloud account: access to the OVHcloud Control Panel
- ovhai CLI available: install the ovhai CLI
- AI Deploy access: ensure you have a user for AI Deploy
- Hugging Face access: create a Hugging Face account and generate an access token
- Gated model authorization: make sure you have been granted access to the gated Mistral-Small-24B-Instruct-2501 model (you can check this with the snippet just below)
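Not sure whether your token can actually reach the gated repository? Here is a minimal sketch, assuming the huggingface_hub Python package is installed (pip install huggingface_hub) and that your Hugging Face token is exported as MY_HF_TOKEN (as done in the next section):
# Check the Hugging Face token and the access to the gated model (assumes MY_HF_TOKEN is exported).
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ["MY_HF_TOKEN"])
print("Logged in as:", api.whoami()["name"])  # confirms the token is valid
try:
    # Fetching the model metadata fails if you were not granted access to the gated repository.
    api.model_info("mistralai/Mistral-Small-24B-Instruct-2501")
    print("Access to Mistral-Small-24B-Instruct-2501 is granted.")
except Exception as err:
    print("Access check failed:", err)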
🚀 Having all the ingredients for our recipe, it’s time to deploy!
Deployment of the Mistral Small 24B LLM
Let’s go and deploy the mistralai/Mistral-Small-24B-Instruct-2501 model.
Manage access tokens
Export your Hugging Face token.
export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
Create a token to access your AI Deploy app once it is deployed.
ovhai token create --role operator ai_deploy_token=my_operator_token
This returns the following output:
Id: 47292486-fb98-4a5b-8451-600895597a2b
Created At: 20-02-25 11:53:05
Updated At: 20-02-25 11:53:05
Spec:
Name: ai_deploy_token=my_operator_token
Role: AiTrainingOperator
Label Selector:
Status:
Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Version: 1
You can now store and export your access token:
export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Launch Mistral Small LLM with AI Deploy
You are ready to start Mistral-Small-24B using vLLM and AI Deploy:
ovhai app run --name vllm-mistral-small \
--default-http-port 8000 \
--label ai_deploy_token=my_operator_token \
--gpu 2 \
--flavor l40s-1-gpu \
-e OUTLINES_CACHE_DIR=/tmp/.outlines \
-e HF_TOKEN=$MY_HF_TOKEN \
-e HF_HOME=/hub \
-e HF_DATASETS_TRUST_REMOTE_CODE=1 \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
-v standalone:/hub:rw \
-v standalone:/workspace:rw \
vllm/vllm-openai:latest \
-- bash -c "python3 -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Small-24B-Instruct-2501 \
--tensor-parallel-size 2 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral \
--dtype half"
Let’s break down the different parameters of this command.
1. Start your AI Deploy app
Launch a new app using the ovhai CLI and give it a name.
ovhai app run --name vllm-mistral-small
2. Define access
Define the HTTP API port and restrict access to your token.
--default-http-port 8000
--label ai_deploy_token=my_operator_token
3. Configure GPU resources
Specifies the hardware flavor (l40s-1-gpu), which refers to an NVIDIA L40S GPU, and the number of GPUs (2).
--gpu 2
--flavor l40s-1-gpu
⚠️WARNING! For this model, two L40S are sufficient, but if you want to deploy another model, you will need to check which GPUs it requires. Note that you can also access A100 and H100 GPUs for your larger models.
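Why two GPUs? A rough, back-of-the-envelope estimate (an approximation, not an exact figure) shows that the FP16 weights alone nearly fill a single 48 GB L40S, leaving no room for the KV cache:
# Rough estimate of the memory footprint of the weights (approximation only).
params = 24e9          # ~24 billion parameters
bytes_per_param = 2    # FP16 (--dtype half) = 2 bytes per parameter
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")  # ~48 GB, i.e. a whole L40S (48 GB)
# With --tensor-parallel-size 2, the weights are split across the two L40S GPUs (~24 GB each),
# leaving headroom for the KV cache and activations.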
4. Set up environment variables
Configure caching for the Outlines library (used for efficient text generation):
-e OUTLINES_CACHE_DIR=/tmp/.outlines
Pass the Hugging Face token ($MY_HF_TOKEN) for model authentication and download:
-e HF_TOKEN=$MY_HF_TOKEN
Set the Hugging Face cache directory to /hub (where models will be stored):
-e HF_HOME=/hub
Allow execution of custom remote code from Hugging Face datasets (required for some model behaviors):
-e HF_DATASETS_TRUST_REMOTE_CODE=1
Disable Hugging Face Hub transfer acceleration (to use standard model downloading):
-e HF_HUB_ENABLE_HF_TRANSFER=0
5. Mount persistent volumes
Mounts two persistent storage volumes:
- /hub → stores Hugging Face model files
- /workspace → main working directory
The rw flag means read-write access.
-v standalone:/hub:rw
-v standalone:/workspace:rw
6. Choose the target Docker image
Uses the vllm/vllm-openai:latest Docker image (a pre-configured vLLM OpenAI API server).
vllm/vllm-openai:latest
7. Run the model inside the container
Runs a bash shell inside the container and executes a Python command to launch the vLLM API server:
python3 -m vllm.entrypoints.openai.api_server
→ Starts the OpenAI-compatible vLLM API server
--model mistralai/Mistral-Small-24B-Instruct-2501 → loads the Mistral Small 24B model from Hugging Face
--tensor-parallel-size 2 → distributes the model across the 2 GPUs
--tokenizer_mode mistral → uses the Mistral tokenizer
--load_format mistral → uses Mistral’s model loading format
--config_format mistral → ensures the model configuration follows Mistral’s standard
--dtype half → uses FP16 (half-precision floating point) for optimized GPU performance
You can now check if your AI Deploy app is alive:
ovhai app get <your_vllm_app_id>
💡 Is your app in RUNNING status? Perfect! You can check in the logs that the server has started…
ovhai app logs <your_vllm_app_id>
⚠️WARNING! This step may take a little time, as the model weights must be downloaded and loaded…
After a few minutes, you should get the following information in the logs:
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Started server process [13]
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Waiting for application startup.
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Application startup complete.
2025-02-20T13:48:07Z [app] [tcmzt] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
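You can also poll the OpenAI-compatible API directly to confirm the model is being served. Below is a minimal sketch, assuming the requests package is installed and the environment variables exported earlier; replace <your_vllm_app_id> with your own app id:
# Minimal readiness check against the deployed vLLM server (assumes `pip install requests`).
import os
import requests
base_url = "https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net"  # replace with your app URL
headers = {"Authorization": f"Bearer {os.environ['MY_OVHAI_ACCESS_TOKEN']}"}
# /v1/models lists the models served by the OpenAI-compatible vLLM server.
resp = requests.get(f"{base_url}/v1/models", headers=headers, timeout=30)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # expect the Mistral Small model id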
🚦 Are all the indicators green? Then it’s off to inference!
Request and send prompt to the LLM
Send the following request, asking the question of your choice:
curl https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net/v1/chat/completions \
-H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Small-24B-Instruct-2501",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Give me the name of OVHcloud’s founder."}
],
"stream": false
}'
This returns the following result:
{
"id":"chatcmpl-d6ea734b524bd851668e71d4111ba496",
"object":"chat.completion",
"created":1740059807,
"model":"mistralai/Mistral-Small-24B-Instruct-2501",
"choices":[
{
"index":0,
"message":{
"role":"assistant",
"reasoning_content":null,
"content":"The founder of OVHcloud is Octave Klaba.",
"tool_calls":[]
},
"logprobs":null,
"finish_reason":"stop",
"stop_reason":null
}
],
"usage":{
"prompt_tokens":22,
"total_tokens":35,
"completion_tokens":13,
"prompt_tokens_details":null
},
"prompt_logprobs":null
}
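Since vLLM exposes an OpenAI-compatible API, you can also query your deployment from Python. Here is a minimal sketch assuming the official openai package (pip install openai); the AI Deploy access token is passed as the API key, which the client sends as a Bearer header:
# Query the deployed model with the OpenAI Python client (assumes `pip install openai`).
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://<your_vllm_app_id>.app.gra.ai.cloud.ovh.net/v1",  # replace with your app URL
    api_key=os.environ["MY_OVHAI_ACCESS_TOKEN"],  # sent as "Authorization: Bearer ..."
)
response = client.chat.completions.create(
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me the name of OVHcloud's founder."},
    ],
)
print(response.choices[0].message.content)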
Conclusion
By following these steps, you have successfully deployed the mistralai/Mistral-Small-24B-Instruct-2501
model using vLLM on OVHcloud’s AI Deploy platform. This setup provides a scalable and efficient solution for serving advanced language models in production environments.
For further customization and optimization, refer to the vLLM documentation and OVHcloud AI Deploy resources.
💪 Challenge met! You can now enjoy the power of your LLM, deployed in a single command line!
Want even more simplicity? You can also use ready-to-use APIs with AI Endpoints!
But… what’s next?