<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI Endpoints Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/ai-endpoints/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/ai-endpoints/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Wed, 01 Apr 2026 12:56:38 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>AI Endpoints Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/ai-endpoints/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Extract Text from Images with OCR using Python and OVHcloud AI Endpoints</title>
		<link>https://blog.ovhcloud.com/extract-text-from-images-with-ocr-using-python-and-ovhcloud-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Wed, 01 Apr 2026 12:55:19 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30992</guid>

					<description><![CDATA[If you want to have more information on&#160;AI Endpoints, please read the&#160;following blog post.&#160;You can also have a look at our&#160;previous blog posts&#160;on how to use AI Endpoints. You can find the full code example in the GitHub repository. In this article,&#160;we will explore how to perform OCR&#160;(Optical Character Recognition)&#160;on images using a vision-capable LLM,&#160;the&#160;OpenAI Python library,&#160;and [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>If you want to have more information on&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, please read the&nbsp;<a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">following blog post</a>.</em>&nbsp;<em>You can also have a look at our&nbsp;<a href="https://blog.ovhcloud.com/tag/ai-endpoints/" data-wpel-link="internal">previous blog posts</a>&nbsp;on how to use AI Endpoints.</em></p>



<p><em>You can find the full code example in the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/python-ocr" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</em></p>



<p>In this article,&nbsp;we will explore how to perform OCR&nbsp;(Optical Character Recognition)&nbsp;on images using a vision-capable LLM,&nbsp;the&nbsp;<a href="https://github.com/openai/openai-python" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenAI Python library</a>,&nbsp;and OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.</p>



<h3 class="wp-block-heading">Introduction to OCR with Vision Models</h3>



<p>Optical Character Recognition has been around for decades,&nbsp;but traditional OCR engines often struggle with complex layouts,&nbsp;handwritten text,&nbsp;or noisy images.&nbsp;Vision-capable Large Language Models bring a new approach:&nbsp;instead of relying on specialized OCR pipelines,&nbsp;you can simply send an image to a model that understands both visual and textual content.</p>



<p>In this example,&nbsp;we use the&nbsp;<a href="https://github.com/openai/openai-python" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenAI Python library</a>&nbsp;to create a simple OCR script powered by a vision model hosted on OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.</p>



<p>The whole application is a single Python file:  no complex setup, just <code><strong>pip install openai</strong></code> and you&#8217;re ready to go.</p>



<h3 class="wp-block-heading">Setting up the Environment Variables</h3>



<p>Before running the script, you need to set the following environment variables:</p>



<pre title="Environment variablesexport OVH_AI_ENDPOINTS_ACCESS_TOKEN=&quot;your-access-token&quot; export OVH_AI_ENDPOINTS_MODEL_URL=&quot;https://your-model-url&quot; export OVH_AI_ENDPOINTS_VLLM_MODEL=&quot;your-vision-model-name&quot;" class="wp-block-code"><code lang="" class=" line-numbers">export OVH_AI_ENDPOINTS_ACCESS_TOKEN="your-access-token"<br>export OVH_AI_ENDPOINTS_MODEL_URL="https://your-model-url"<br>export OVH_AI_ENDPOINTS_VLLM_MODEL="your-vision-model-name"</code></pre>



<p>You can find how to create your access token, and how to pick the model URL and model name, in the <a href="https://endpoints.ai.cloud.ovh.net/catalog" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints catalog</a>. Make sure to choose a <strong>vision-capable model</strong>.</p>
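


<p>Optionally, a small sanity check at the top of your script can save a confusing API error later if one of these variables is missing. This check is not part of the original example, just a suggestion:</p>



<pre title="Optional environment variable check" class="wp-block-code"><code lang="" class=" line-numbers">import os<br>import sys<br><br># Fail fast if one of the required variables is missing (optional safeguard).<br>REQUIRED_VARS = (<br>    "OVH_AI_ENDPOINTS_ACCESS_TOKEN",<br>    "OVH_AI_ENDPOINTS_MODEL_URL",<br>    "OVH_AI_ENDPOINTS_VLLM_MODEL",<br>)<br>missing = [name for name in REQUIRED_VARS if not os.getenv(name)]<br>if missing:<br>    sys.exit(f"Missing environment variables: {', '.join(missing)}")</code></pre>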



<h3 class="wp-block-heading">Installing Dependencies</h3>



<p>The only dependency is the OpenAI Python library:</p>



<pre title="OpenAI dependency" class="wp-block-code"><code lang="bash" class="language-bash">pip install openai</code></pre>



<h3 class="wp-block-heading">Define the System Prompt</h3>



<p>The first step is to define a system prompt that describes what our OCR service does.&nbsp;This prompt tells the model how to behave:</p>



<pre title="System prompt" class="wp-block-code"><code lang="" class=" line-numbers">SYSTEM_PROMPT = """You are an expert OCR engine.<br>Extract every piece of text visible in the provided image.<br>Preserve the original layout as faithfully as possible (line breaks, columns, tables).<br>Do NOT interpret, summarise, or translate the content.<br>Use markdown formatting to represent the layout (e.g. tables, lists).<br>If the image contains no text, reply with: "No text found."<br>"""</code></pre>



<p>We tell it to behave as an expert OCR engine, to preserve the original layout, and to use markdown formatting for structured content like tables or lists.<br></p>



<h3 class="wp-block-heading">Load the Image</h3>



<p>Before sending the image to the model,&nbsp;we need to encode it as a base64 string.&nbsp;Here is a simple helper function that reads a local PNG file and returns a base64-encoded string:</p>



<pre title="Image loading" class="wp-block-code"><code lang="" class=" line-numbers">import base64<br>from pathlib import Path<br><br>def load_image_as_base64(path: Path) -&gt; str:<br>    """Load a local image and encode it as base64."""<br>    with open(path, "rb") as f:<br>        return base64.b64encode(f.read()).decode("utf-8")</code></pre>



<p>The base64-encoded data is what gets sent to the vision model as part of the prompt.</p>



<h3 class="wp-block-heading">Extract Text from the Image</h3>



<p>The <code><strong>extract_text</strong></code> function sends the image to the vision model and returns the extracted text:</p>



<pre title="Extract text from image" class="wp-block-code"><code lang="" class=" line-numbers">def extract_text(client: OpenAI, image_base64: str, model: str) -&gt; str:<br>    """Extract text from an image using the vision model."""<br>    response = client.chat.completions.create(<br>        model=model,<br>        temperature=0.0,<br>        messages=[<br>            {"role": "system", "content": SYSTEM_PROMPT},<br>            {<br>                "role": "user",<br>                "content": [<br>                    {<br>                        "type": "image_url",<br>                        "image_url": {<br>                            "url": f"data:image/png;base64,{image_base64}"<br>                        }<br>                    }<br>                ]<br>            }<br>        ]<br>    )<br>    return response.choices[0].message.content</code></pre>



<p>The image is passed as a data URL inside the <code><strong>image_url</strong></code> field, following the OpenAI Vision API format. The temperature is set to <code>0.0</code> because we want deterministic, faithful text extraction and not creative output.</p>
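


<p>The example assumes a PNG file. If your images may also be JPEGs or WebP, a small helper (hypothetical, not part of the original script) can derive the MIME type from the file extension before building the data URL:</p>



<pre title="Hypothetical MIME type helper" class="wp-block-code"><code lang="" class=" line-numbers">from pathlib import Path<br><br># Hypothetical helper: pick the MIME type from the file extension.<br>MIME_TYPES = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".webp": "image/webp"}<br><br>def data_url_for(path: Path, image_base64: str) -&gt; str:<br>    """Build a data URL that matches the image's actual format."""<br>    mime = MIME_TYPES.get(path.suffix.lower(), "image/png")<br>    return f"data:{mime};base64,{image_base64}"</code></pre>



<p>You would then pass the result of this helper instead of the hard-coded <code>data:image/png;base64,…</code> prefix shown above.</p>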



<h3 class="wp-block-heading">Configure the Client</h3>



<p>This example uses a vision-capable model hosted on OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>. Since AI Endpoints exposes an OpenAI-compatible API, we use the <code>OpenAI</code> client and just point it to the OVHcloud endpoint:</p>



<pre title="Open AI client configuration" class="wp-block-code"><code lang="" class=" line-numbers">import os<br>from openai import OpenAI<br><br>client = OpenAI(<br>    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),<br>    base_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),<br>)<br><br>model_name = os.getenv("OVH_AI_ENDPOINTS_VLLM_MODEL")</code></pre>



<p>A few things to note:</p>



<ul class="wp-block-list">
<li>The <strong>API key</strong>, <strong>base URL</strong>, and <strong>model name</strong> are read from environment variables. </li>



<li>The OpenAI library works with any OpenAI-compatible API, which makes it a perfect fit for AI Endpoints.</li>
</ul>



<h3 class="wp-block-heading">Assemble and Run</h3>



<p>With the client configured, extracting text from an image is straightforward:</p>



<pre title="Run the OCR" class="wp-block-code"><code lang="" class=" line-numbers">image_base64 = load_image_as_base64(Path("./doc.png"))<br>result = extract_text(client, image_base64, model_name)<br>print(result)</code></pre>



<p>And that&#8217;s it!</p>



<p>Here is the image used for this example:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img fetchpriority="high" decoding="async" width="946" height="693" src="https://blog.ovhcloud.com/wp-content/uploads/2026/03/doc-1.png" alt="Used image for OCR example" class="wp-image-31002" style="width:600px" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/03/doc-1.png 946w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/doc-1-300x220.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/03/doc-1-768x563.png 768w" sizes="(max-width: 946px) 100vw, 946px" /></figure>



<p>And the result:</p>



<pre title="Run the OCR" class="wp-block-code"><code lang="" class=" line-numbers">$ python ocr_demo.py<br>📄 Loading image: doc.png<br>🔍 Running OCR with Qwen2.5-VL-72B-Instruct via OVHcloud AI Endpoints...<br><br>📝 Extracted text 📝<br>Every month, the OVHcloud Developer Advocate team creates content, shares knowledge, and connects with the tech community. Here’s a look at what we did in March 2026. 🚀<br><br>🎙️ “Tranches de Tech” – Our monthly podcast<br><br>A new episode of our French-language podcast Tranches de Tech🥑 just dropped!<br><br>🎧 Episode 102: Tranches de Tech #26 – Architecte, c’est une bonne situation ça ?<br><br>This month we sat down with Alexandre Touret, Architect at Worldline to discuss the evolving role of software architects and the growing impact of AI on development practices. From Spotify’s claim that their devs no longer code, to agentic tools like OpenClaw and Claude Code reshaping workflows. We also cover ANSSI’s revised open-source policy, IBM tripling junior hires, and the critical responsibility of mentoring the next generation of developers in an AI-driven world.<br><br>📺 Live on Twitch<br><br>We streamed live on Twitch this month! Here’s what we covered:<br><br>🎥 Rémy Vandepoel discussed with Hugo Allabert and François Loiseau about our Public VCFaaS. Catch the replay on YouTube ▶️.<br><br>🎤 Conference Talks<br><br>The team hit the road (and the stage) at several conferences this month:<br><br>🇳🇱 KubeCon Amsterdam – Amsterdam, Netherlands 🇳🇱<br><br>Aurélie Vache gave a talk: The Ultimate Kubernetes Challenge: An Interactive Trivia Game</code></pre>



<h3 class="wp-block-heading">Conclusion</h3>



<p>In this article,&nbsp;we have seen how to use a vision-capable LLM to perform OCR on images using the&nbsp;<a href="https://github.com/openai/openai-python" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenAI Python library</a>&nbsp;and OVHcloud&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.&nbsp;The OpenAI library makes it very easy to send images to a vision model and extract text,&nbsp;and Python allows us to run the whole thing as a simple script.</p>



<p>You have a dedicated Discord channel&nbsp;(#<em>ai-endpoints</em>)&nbsp;on our Discord server&nbsp;(<em><a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://discord.gg/ovhcloud</a></em>),&nbsp;see you there!</p>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reference Architecture: build a sovereign n8n RAG workflow for AI agent using OVHcloud Public Cloud solutions</title>
		<link>https://blog.ovhcloud.com/reference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 27 Jan 2026 13:12:03 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Managed Database]]></category>
		<category><![CDATA[n8n]]></category>
		<category><![CDATA[Object Storage]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[S3]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29694</guid>

					<description><![CDATA[What if an n8n workflow, deployed in a&#160;sovereign environment, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection. In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&#160;Large Language Models&#160;(LLMs) [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>What if an n8n workflow, deployed in a&nbsp;<strong>sovereign environment</strong>, saved you time while giving you peace of mind? From document ingestion to targeted response generation, n8n acts as the conductor of your RAG pipeline without compromising data protection.</em></p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg" alt="" class="wp-image-30002" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1024x576.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-300x169.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-768x432.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag-1536x864.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/ref-archi-n8n-rag.jpg 1920w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>n8n workflow overview</em></figcaption></figure>



<p>In the current landscape of AI agents and knowledge assistants, connecting your internal documentation with&nbsp;<strong>Large Language Models</strong>&nbsp;(LLMs) is becoming a strategic differentiator.</p>



<p><strong>How?</strong>&nbsp;By building&nbsp;<strong>Agentic RAG systems</strong>&nbsp;capable of retrieving, reasoning, and acting autonomously based on external knowledge.</p>



<p>To make this possible, engineers need a way to connect&nbsp;<strong>retrieval pipelines (RAG)</strong>&nbsp;with&nbsp;<strong>tool-based orchestration</strong>.</p>



<p>This article outlines a&nbsp;<strong>reference architecture</strong>&nbsp;for building a&nbsp;<strong>fully automated RAG pipeline orchestrated by n8n</strong>, leveraging&nbsp;<strong>OVHcloud AI Endpoints</strong>&nbsp;and&nbsp;<strong>PostgreSQL with pgvector</strong>&nbsp;as core components.</p>



<p>The final result will be a system that automatically ingests Markdown documentation from&nbsp;<strong>Object Storage</strong>, creates embeddings with OVHcloud’s&nbsp;<strong>BGE-M3</strong>&nbsp;model available on AI Endpoints, and stores them in a&nbsp;<strong>Managed Database PostgreSQL</strong>&nbsp;with pgvector extension.</p>



<p>Lastly, you’ll be able to build an AI Agent that lets you chat with an LLM (<strong>GPT-OSS-120B</strong>&nbsp;on AI Endpoints). This agent, utilising the RAG implementation carried out upstream, will be an expert on OVHcloud products.</p>



<p>You can further improve the process by using an&nbsp;<strong>LLM guard</strong>&nbsp;to protect the questions sent to the LLM, and set up a chat memory to use conversation history for higher response quality.</p>



<p><strong>But what about n8n?</strong></p>



<p><a href="https://n8n.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>n8n</strong></a>, the open-source workflow automation tool,&nbsp;offers many benefits and connects seamlessly with over&nbsp;<strong>300</strong>&nbsp;APIs, apps, and services:</p>



<ul class="wp-block-list">
<li><strong>Open-source</strong>: n8n is a 100% self-hostable solution, which means you retain full data control;</li>



<li><strong>Flexible</strong>: combines low-code nodes and custom JavaScript/Python logic;</li>



<li><strong>AI-ready</strong>: includes useful integrations for LangChain, OpenAI, and embedding models;</li>



<li><strong>Composable</strong>: enables simple connections between data, APIs, and models in minutes;</li>



<li><strong>Sovereign by design</strong>: suited to privacy-sensitive or regulated sectors.</li>
</ul>



<p>This reference architecture serves as a blueprint for building a sovereign, scalable Retrieval Augmented Generation (<strong>RAG</strong>) platform using&nbsp;<strong>n8n</strong>&nbsp;and&nbsp;<strong>OVHcloud Public Cloud</strong>&nbsp;solutions.</p>



<p>This setup shows how to orchestrate data ingestion, generate embeddings, and enable conversational AI by combining&nbsp;<strong>OVHcloud Object Storage</strong>,&nbsp;<strong>Managed Databases with PostgreSQL</strong>,&nbsp;<strong>AI Endpoints</strong>&nbsp;and&nbsp;<strong>AI Deploy</strong>. <strong>The result?</strong>&nbsp;An AI environment that is fully integrated, protects privacy, and is exclusively hosted on <strong>OVHcloud’s European infrastructure</strong>.</p>



<h2 class="wp-block-heading">Overview of the n8n workflow architecture for RAG </h2>



<p>The workflow involves the following steps:</p>



<ul class="wp-block-list">
<li><strong>Ingestion:</strong>&nbsp;documentation in markdown format is fetched from <strong>OVHcloud Object Storage (S3);</strong></li>



<li><strong>Preprocessing:</strong> n8n cleans and normalises the text, removing YAML front-matter and encoding noise;</li>



<li><strong>Vectorisation:</strong>&nbsp;Each document is embedded using the <strong>BGE-M3</strong> model, which is available via <strong>OVHcloud AI Endpoints;</strong></li>



<li><strong>Persistence:</strong> vectors and metadata are stored in <strong>OVHcloud PostgreSQL Managed Database</strong> using pgvector;</li>



<li><strong>Retrieval:</strong> when a user sends a query, n8n triggers a <strong>LangChain Agent</strong> that retrieves relevant chunks from the database;</li>



<li><strong>Reasoning and actions:</strong>&nbsp;The <strong>AI Agent node</strong> combines LLM reasoning, memory, and tool usage to generate a contextual response or trigger downstream actions (Slack reply, Notion update, API call, etc.).</li>
</ul>



<p>In this tutorial, all services are deployed within the <strong>OVHcloud Public Cloud</strong>.</p>



<h2 class="wp-block-heading">Prerequisites</h2>



<p>Before you start, double-check that you have:</p>



<ul class="wp-block-list">
<li>an <strong>OVHcloud Public Cloud</strong> account</li>



<li>an <strong>OpenStack user</strong> with the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">&nbsp;following roles</a>:
<ul class="wp-block-list">
<li>Administrator</li>



<li>AI Operator</li>



<li>Object Storage Operator</li>
</ul>
</li>



<li>An <strong>API key</strong> for <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-endpoints-getting-started?id=kb_article_view&amp;sysparm_article=KB0065401" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></li>



<li><strong>ovhai CLI available</strong> – <em>install the </em><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a></li>



<li><strong>Hugging Face access</strong> – <em>create a </em><a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Hugging Face account</em></a><em> and generate an </em><a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>access token</em></a></li>
</ul>



<p><strong>🚀 Now that you have everything you need, you can start building your n8n workflow!</strong></p>



<h2 class="wp-block-heading">Architecture guide: n8n agentic RAG workflow</h2>



<p>You’re all set to configure and deploy your n8n workflow!</p>



<p>⚙️<em> Keep in mind that the following steps can be completed using OVHcloud APIs!</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Build the RAG data ingestion pipeline</h3>



<p>This first step involves building the foundation of the entire RAG workflow by preparing the elements you need:</p>



<ul class="wp-block-list">
<li>n8n deployment</li>



<li>Object Storage bucket creation</li>



<li>PostgreSQL database creation</li>



<li>and more</li>
</ul>



<p>Remember to set up the proper credentials in n8n so the different elements can connect and function.</p>



<h4 class="wp-block-heading">1. Deploy n8n on OVHcloud VPS</h4>



<p>OVHcloud provides <a href="https://www.ovhcloud.com/en-gb/vps/vps-n8n/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>VPS solutions compatible with n8n</strong></a><strong>.</strong> Get a ready-to-use virtual server with <strong>pre-installed n8n </strong>and start building automation workflows without manual setup. With plans ranging from <strong>6 vCores&nbsp;/&nbsp;12 GB RAM</strong> to <strong>24 vCores&nbsp;/&nbsp;96 GB RAM</strong>, you can choose the capacity that suits your workload.</p>



<p><strong>How to set up n8n on a VPS?</strong></p>



<p>Setting up n8n on an OVHcloud VPS generally involves:</p>



<ul class="wp-block-list">
<li>Choosing and provisioning your OVHcloud VPS plan;</li>



<li>Connecting to your server via SSH and carrying out the initial server configuration, which includes updating the OS;</li>



<li>Installing n8n, typically with Docker (recommended for ease of management and updates), or npm by following this <a href="https://help.ovhcloud.com/csm/en-gb-vps-install-n8n?id=kb_article_view&amp;sysparm_article=KB0072179" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>;</li>



<li>Configuring n8n with a domain name, SSL certificate for HTTPS, and any necessary environment variables for databases or settings.</li>
</ul>



<p>While OVHcloud provides a robust VPS platform, you can find detailed n8n installation guides in the <a href="https://docs.n8n.io/hosting/installation/docker/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official n8n documentation</a>.</p>



<p>Once the configuration is complete, you can configure the database and bucket in Object Storage.</p>



<h4 class="wp-block-heading">2. Create Object Storage bucket</h4>



<p>First, you have to set up your data source. Here you can store all your documentation in an S3-compatible <a href="https://www.ovhcloud.com/en-gb/public-cloud/object-storage/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Object Storage</a> bucket.</p>



<p>Here, assume that all the documentation files are in Markdown format.</p>



<p>From the <strong>OVHcloud Control Panel</strong>, create a new Object Storage container with the <strong>S3-compatible API </strong>solution by following this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-storage-s3-getting-started-object-storage?id=kb_article_view&amp;sysparm_article=KB0034674" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">guide</a>.</p>



<p>When the bucket is ready, add your Markdown documentation to it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png" alt="" class="wp-image-29733" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Note:</strong>&nbsp;For this tutorial, we’re using the various OVHcloud product documentation available in Open-Source on the GitHub repository maintained by OVHcloud members.</p>



<p><em>Click this </em><a href="https://github.com/ovh/docs.git" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>link</em></a><em> to access the repository.</em></p>
</blockquote>
</blockquote>



<p>How do you do that? Extract all the <strong>guide.en-gb.md</strong> files from the GitHub repository and rename each one to match its parent folder.</p>



<p>Example: the documentation about ovhai CLI installation, <code><strong>docs/pages/public_cloud/ai_machine_learning/cli_10_howto_install_cli/guide.en-gb.md</strong></code>, is stored in the <strong>ovhcloud-products-documentation-md</strong> bucket as <strong>cli_10_howto_install_cli.md</strong>.</p>
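


<p>As an illustration only (not part of the n8n workflow itself), here is a minimal Python sketch of this renaming and upload step. It assumes you have cloned the <a href="https://github.com/ovh/docs.git" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ovh/docs</a> repository locally, installed <strong>boto3</strong>, and already created the bucket; the endpoint and credentials are the ones described just below:</p>



<pre class="wp-block-code"><code class="">from pathlib import Path<br><br>import boto3  # pip install boto3<br><br># Placeholders: reuse the values from your Object Storage user and bucket.<br>s3 = boto3.client(<br>    "s3",<br>    endpoint_url="https://s3.gra.io.cloud.ovh.net/",<br>    region_name="gra",<br>    aws_access_key_id="&lt;your_object_storage_user_access_key&gt;",<br>    aws_secret_access_key="&lt;your_object_storage_user_secret_key&gt;",<br>)<br><br>bucket = "ovhcloud-products-documentation-md"<br><br># Find every guide.en-gb.md file in the cloned docs repository,<br># rename it after its parent folder and upload it to the bucket.<br>for guide in Path("docs/pages").rglob("guide.en-gb.md"):<br>    key = f"{guide.parent.name}.md"<br>    s3.upload_file(str(guide), bucket, key)<br>    print(f"uploaded {guide} as {key}")</code></pre>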



<p>You should get an overview that looks like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png" alt="" class="wp-image-29735" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Keep the following elements and create a new credential in n8n named <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">OVHcloud S3 gra credentials</mark></strong></code>:</p>



<ul class="wp-block-list">
<li>S3 Endpoint: <a href="https://s3.gra.io.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">https://s3.gra.io.cloud.ovh.net/</mark></code></strong></a></li>



<li>Region: <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">gra</mark></code></strong></li>



<li>Access Key ID: <strong><code>&lt;your_object_storage_user_access_key&gt;</code></strong></li>



<li>Secret Access Key: <strong><code>&lt;your_object_storage_user_secret_key&gt;</code></strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png" alt="" class="wp-image-29736" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, create a new n8n node by selecting&nbsp;<strong>S3</strong>, then&nbsp;<strong>Get Multiple Files</strong>.<br>Configure this node as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png" alt="" class="wp-image-29740" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.20.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Connect the node to the previous one before moving on to the next step.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png" alt="" class="wp-image-29741" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-15-a-16.18.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>With the first phase done, you can now configure the vector DB.</p>



<h4 class="wp-block-heading">3. Configure PostgreSQL Managed DB (pgvector)</h4>



<p>In this step, you can set up the vector database that lets you store the embeddings generated from your documents.</p>



<p>How? By using OVHcloud’s Managed Databases for&nbsp;<a href="https://www.ovhcloud.com/en-gb/public-cloud/postgresql/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PostgreSQL</a> with the pgvector extension. Go to your OVHcloud Control Panel and follow the steps.</p>



<p>1. Navigate to&nbsp;<strong>Databases &amp; Analytics &gt; Databases</strong></p>



<p><strong>2. Create a new database and select&nbsp;<em>PostgreSQL</em>&nbsp;and a datacenter location</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png" alt="" class="wp-image-29758" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>3. Select&nbsp;<em>Production</em>&nbsp;plan and&nbsp;<em>Instance type</em></strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png" alt="" class="wp-image-29759" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>4. Reset the user password and save it</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png" alt="" class="wp-image-29762" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-1-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>5. Whitelist the IP of your n8n instance as follows</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png" alt="" class="wp-image-29761" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>6. Take note of the following parameters</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png" alt="" class="wp-image-29760" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make a note of this information and create a new credential in n8n named&nbsp;<strong>OVHcloud PGvector credentials</strong>:</p>



<ul class="wp-block-list">
<li>Host:<strong>&nbsp;<code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_hostname&gt;</mark></code></strong></li>



<li>Database:&nbsp;<strong>defaultdb</strong></li>



<li>User:&nbsp;<code>avnadmin</code></li>



<li>Password:&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;db_password&gt;</mark></strong></code></li>



<li>Port:&nbsp;<strong>20184</strong></li>
</ul>



<p>Consider enabling the&nbsp;<strong>Ignore SSL Issues (Insecure)</strong>&nbsp;button as needed and setting the&nbsp;<strong>Maximum Number of Connections</strong>&nbsp;value to&nbsp;<strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1000</mark></code></strong>.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png" alt="" class="wp-image-29763" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>✅ You’re now connected to the database! But what about the PGvector extension?</p>



<p>Add a PostgreSQL&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code>&nbsp;node to your n8n workflow and create the extension through an SQL query, which should look like this:</p>



<pre class="wp-block-code"><code class="">-- drop table as needed<br>DROP TABLE IF EXISTS md_embeddings;<br><br>-- activate pgvector<br>CREATE EXTENSION IF NOT EXISTS vector;<br><br>-- create table<br>CREATE TABLE md_embeddings (<br>    id SERIAL PRIMARY KEY,<br>    text TEXT,<br>    embedding vector(1024),<br>    metadata JSONB<br>);</code></pre>



<p>You should get this n8n node:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png" alt="" class="wp-image-29752" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.43.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This node also creates the new table named&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">md_embeddings</mark></strong></code>. You can add a&nbsp;<code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Stop and Error</mark></strong></code>&nbsp;node to halt the workflow if setting up the table fails.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png" alt="" class="wp-image-29753" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-16-a-14.51.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>All set! Your vector DB is prepped and ready for data! Keep in mind, you still need an&nbsp;<strong>embeddings model</strong> for the RAG data ingestion pipeline.</p>



<h4 class="wp-block-heading">4. Access to OVHcloud AI Endpoints</h4>



<p><strong>OVHcloud AI Endpoints</strong>&nbsp;is a managed service that provides&nbsp;<strong>ready-to-use APIs for AI models</strong>, including&nbsp;<strong>LLM, CodeLLM, embeddings, Speech-to-Text, and image models</strong>&nbsp;hosted within OVHcloud’s European infrastructure.</p>



<p>To vectorise the various documents in Markdown format, you have to select an embedding model:&nbsp;<a href="https://endpoints.ai.cloud.ovh.net/models/bge-m3" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>BGE-M3</strong></a>.</p>



<p>Your AI Endpoints API key should already be created if you followed the prerequisites. If not, head to the AI Endpoints menu in your OVHcloud Control Panel to generate a new API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png" alt="" class="wp-image-29775" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/ref-archi-n8n-3-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Once this is done, you can create new OpenAI credentials in your n8n.</p>



<p>Why do I need OpenAI credentials? Because the <strong>AI Endpoints API&nbsp;</strong>is fully compatible with OpenAI’s, integration is simple and the&nbsp;<strong>sovereignty of your data</strong> is preserved.</p>



<p>How? Thanks to a single endpoint&nbsp;<a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a>, you can request the different AI Endpoints models.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png" alt="" class="wp-image-29776" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.45.33-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This means you can create a new n8n node by selecting&nbsp;<strong>Postgres PGVector Store</strong>&nbsp;and&nbsp;<strong>Add documents to Vector Store</strong>.<br>Set up this node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png" alt="" class="wp-image-29781" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then configure the <strong>Data Loader</strong> with custom text splitting and the JSON data type.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png" alt="" class="wp-image-29780" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.38-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>For the text splitter, here are some options:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png" alt="" class="wp-image-29786" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-12.02.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To finish, select the&nbsp;<strong>BGE-M3</strong> embedding model from the model list and set the&nbsp;<strong>Dimensions</strong> to 1024.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png" alt="" class="wp-image-29784" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.50.51-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You now have everything you need to build the ingestion pipeline.</p>



<h4 class="wp-block-heading">5. Set up the ingestion pipeline loop</h4>



<p>To build a fully automated document ingestion and vectorisation pipeline, you have to integrate some specific nodes, mainly:</p>



<ul class="wp-block-list">
<li>a <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></code></strong> that downloads each markdown file one by one so that it can be vectorised;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> that counts the number of files processed, which subsequently determines the number of requests sent to the embedding model;</li>



<li>an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition that allows you to check when the 400 requests have been reached;</li>



<li>a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node that pauses after every 400 requests to avoid getting rate-limited;</li>



<li>an S3 block <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> to download each markdown;</li>



<li>another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> to extract and process text from Markdown files by cleaning and removing special characters before sending it to the embeddings model;</li>



<li>a PostgreSQL node to <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL</mark></strong></code> query to check that the table contains vectors after the process (loop) is complete.</li>
</ul>



<h5 class="wp-block-heading">5.1. Create a loop to process each documentation file</h5>



<p>Begin by creating a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop Over Items</mark></strong></code> to process all the Markdown files one at a time. Set the <strong>batch size</strong> to <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">1</mark></code></strong> in this loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png" alt="" class="wp-image-29788" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-10.50.13-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Add the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Loop</code></mark></strong> statement right after the S3 <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Get Many Files</mark></code></strong> node as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png" alt="" class="wp-image-29797" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.30.00-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Time to put the loop’s content into action!</p>



<h5 class="wp-block-heading">5.2. Count the number of files using a code snippet</h5>



<p>Next, choose the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> node from the list to count how many files have been processed. Set the <code><strong>Mode</strong></code> to “Run Once for Each Item” and the <strong>Language</strong> to “JavaScript”, then add the following code snippet to the designated block.</p>



<pre class="wp-block-code"><code class="">// simple counter per item<br>const counter = $runIndex + 1;<br><br>return {<br>  counter<br>};</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png" alt="" class="wp-image-29792" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.05.47-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Make sure this code snippet is included in the loop.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png" alt="" class="wp-image-29798" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.33.57-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can start adding the <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong><code>if</code></strong></mark> part to the loop now.</p>



<h5 class="wp-block-heading">5.3. Add a condition that applies a rule every 400 requests</h5>



<p>Here, you need to create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node and add the following condition, set as an expression.</p>



<pre class="wp-block-code"><code class="">{{ (Number($json["counter"]) % 400) === 0 }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png" alt="" class="wp-image-29794" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.11.42-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Add it immediately after counting the files:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png" alt="" class="wp-image-29800" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.44.10-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If this condition <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">is true</mark></strong></code>, trigger the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node.</p>



<h5 class="wp-block-heading">5.4. Insert a pause after each set of 400 requests</h5>



<p>Then insert a <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node to pause before resuming. Set <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Resume</mark></strong></code> to “After Time Interval” and the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait Amount</mark></strong></code> to “60:00” seconds.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png" alt="" class="wp-image-29796" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.23.39-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Link it to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> condition when this is <strong>True</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png" alt="" class="wp-image-29801" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-11.45.08-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, you can go ahead and download the Markdown file, and then process it.</p>



<h5 class="wp-block-heading">5.5. Launch documentation download</h5>



<p>To do this, create a new <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Download a file</mark></strong></code> S3 node and configure it with this File Key expression:</p>



<pre class="wp-block-code"><code class="">{{ $('Process each documentation file').item.json.Key }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png" alt="" class="wp-image-29804" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.42.12-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Want to connect it? That’s easy: link it to the output of the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Wait</mark></strong></code> node and to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node’s <strong>False</strong> branch; this way, the file is processed only when the rate limit has not been exceeded.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png" alt="" class="wp-image-29805" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-16.49.05-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You’re almost done! Now you need to extract and process the text from the Markdown files – clean and remove any special characters before sending it to the embedding model.</p>



<h5 class="wp-block-heading">5.6 Clean Markdown text content</h5>



<p>Next, create another <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Code in JavaScript</mark></strong></code> node to process the text from the Markdown files:</p>



<pre class="wp-block-code"><code class="">// extract binary content<br>const binary = $input.item.binary.data;<br><br>// decoding into clean UTF-8 text<br>let text = Buffer.from(binary.data, 'base64').toString('utf8');<br><br>// cleaning - remove non-printable characters<br>text = text<br>  .replace(/[^\x09\x0A\x0D\x20-\x7EÀ-ÿ€£¥•–—‘’“”«»©®™°±§¶÷×]/g, ' ')<br>  .replace(/\s{2,}/g, ' ')<br>  .trim();<br><br>// check lenght<br>if (text.length &gt; 14000) {<br>  text = text.slice(0, 14000);<br>}<br><br>return [{<br>  text,<br>  fileName: binary.fileName,<br>  mimeType: binary.mimeType<br>}];</code></pre>



<p>Select the <em>“Run Once for Each Item”</em> <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Mode</mark></strong></code> and place the previous code in the dedicated JavaScript block.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png" alt="" class="wp-image-29806" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.02.04-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To finish, check that the output text has been sent to the document vectorisation system, which was set up in <strong>Step 3 – Configure PostgreSQL Managed DB (pgvector)</strong>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png" alt="" class="wp-image-29808" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-17.15.45-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>How do I confirm that the table contains all elements after vectorisation?</p>



<h5 class="wp-block-heading">5.7 Double-check that the documents are in the table</h5>



<p>To confirm that your RAG system is working, make sure your vector database actually contains vectors; use a PostgreSQL node with <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute a SQL query</mark></strong></code> in your n8n workflow.</p>



<p>Then, run the following query:</p>



<pre class="wp-block-code"><code class="">-- count the number of elements<br>SELECT COUNT(*) FROM md_embeddings;</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png" alt="" class="wp-image-29818" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-20-a-20.28.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Done</mark></strong></code> section of your <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Loop</mark></strong>, so the elements are counted when the process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png" alt="" class="wp-image-29773" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-17-a-11.14.41-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congrats! You can now run the workflow to begin ingesting documents.</p>



<p>Click the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Execute workflow</mark></strong></code> button and wait until the vectorization process is complete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png" alt="" class="wp-image-29823" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-11.41.52-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember, everything should be green when it’s finished ✅.</p>



<h3 class="wp-block-heading">Step 2 – RAG chatbot</h3>



<p>With the data ingestion and vectorisation steps completed, you can now begin implementing your AI agent.</p>



<p>This involves building a <strong>RAG-based AI Agent</strong>&nbsp;by simply starting a chat with an LLM.</p>



<h4 class="wp-block-heading">1. Set up the chat box to start a conversation</h4>



<p>First, configure your AI Agent based on the RAG system, and add a new node in the same n8n workflow: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Trigger</mark></strong></code>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png" alt="" class="wp-image-29834" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.31.24-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This node will allow you to interact directly with your AI agent! But before that, you need to check that your message is safe.</p>



<h4 class="wp-block-heading">2. Set up your LLM Guard with AI Deploy</h4>



<p>To check whether a message is secure or not, use an LLM Guard.</p>



<p><strong>What’s an LLM Guard?</strong>&nbsp;This is a safety and control layer that sits between users and an LLM, or between the LLM and an external connection. Its main goal is to filter, monitor, and enforce rules on what goes into or comes out of the model 🔐.</p>



<p>You can use <a href="file:///Users/jdutse/Downloads/www.ovhcloud.com/en-gb/public-cloud/ai-deploy" data-wpel-link="internal">AI Deploy</a> from OVHcloud to deploy your desired LLM guard. With a single command line, this AI solution lets you deploy a Hugging Face model using vLLM Docker containers.</p>



<p>For more details, please refer to this <a href="https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/" data-wpel-link="internal">blog</a>.</p>



<p>For the use case covered in this article, you can use the open-source model <strong>meta-llama/Llama-Guard-3-8B</strong> available on <a href="https://huggingface.co/meta-llama/Llama-Guard-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a>.</p>



<h5 class="wp-block-heading">2.1 Create a Bearer token to request your custom AI Deploy endpoint</h5>



<p><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-app-token?id=kb_article_view&amp;sysparm_article=KB0035280" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a token</a> to access your AI Deploy app once it’s deployed.</p>



<pre class="wp-block-code"><code class="">ovhai token create --role operator ai_deploy_token=my_operator_token</code></pre>



<p>The following output is returned:</p>



<p><code><strong>Id: 47292486-fb98-4a5b-8451-600895597a2b<br>Created At: 20-10-25 8:53:05<br>Updated At: 20-10-25 8:53:05<br>Spec:<br>Name: ai_deploy_token=my_operator_token<br>Role: AiTrainingOperator<br>Label Selector:<br>Status:<br>Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX<br>Version: 1</strong></code></p>



<p>You can now store and export your access token to add it as a new credential in n8n.</p>



<pre class="wp-block-code"><code class="">export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</code></pre>



<h5 class="wp-block-heading">2.1 Start Llama Guard 3 model with AI Deploy</h5>



<p>Using the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai</mark></strong></code> CLI, launch the following command to start the vLLM inference server.</p>



<pre class="wp-block-code"><code class="">ovhai app run \<br>	--name vllm-llama-guard3 \<br>        --default-http-port 8000 \<br>        --gpu 1 \<br>	--flavor l40s-1-gpu \<br>        --label ai_deploy_token=my_operator_token \<br>	--env OUTLINES_CACHE_DIR=/tmp/.outlines \<br>	--env HF_TOKEN=$MY_HF_TOKEN \<br>	--env HF_HOME=/hub \<br>	--env HF_DATASETS_TRUST_REMOTE_CODE=1 \<br>	--env HF_HUB_ENABLE_HF_TRANSFER=0 \<br>	--volume standalone:/workspace:RW \<br>	--volume standalone:/hub:RW \<br>	vllm/vllm-openai:v0.10.1.1 \<br>	-- bash -c python3 -m vllm.entrypoints.openai.api_server                       <br>                           --model meta-llama/Llama-Guard-3-8B \                     <br>                           --tensor-parallel-size 1 \                     <br>                           --dtype bfloat16</code></pre>



<p><em>Full command explained:</em></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">ovhai app run</mark></strong></code></li>
</ul>



<p>This is the core command to&nbsp;<strong>run an app</strong>&nbsp;using the&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;platform.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--name vllm-llama-guard3</mark></strong></code></li>
</ul>



<p>Sets a&nbsp;<strong>custom name</strong>&nbsp;for the job. For example,&nbsp;<code>vllm-llama-guard3</code>.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--default-http-port 8000</mark></strong></code></li>
</ul>



<p>Exposes&nbsp;<strong>port 8000</strong>&nbsp;as the default HTTP endpoint. vLLM server typically runs on port 8000.</p>



<ul class="wp-block-list">
<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--gpu 1</code></mark></strong></li>



<li><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>--flavor l40s-1-gpu</code></mark></strong></li>
</ul>



<p>Allocates&nbsp;<strong>1 L40S GPU</strong>&nbsp;for the app. You can adjust the GPU type and count depending on the model you want to deploy.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/workspace:RW</mark></strong></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--volume standalone:/hub:RW</mark></strong></code></li>
</ul>



<p>Mounts&nbsp;<strong>two persistent storage volumes</strong>: <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/workspace</code></mark></strong> which is the main working directory and <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">/hub</mark></strong></code>&nbsp;to store Hugging Face model files.</p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env OUTLINES_CACHE_DIR=/tmp/.outlines</mark></strong></code></li>



<li><strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_TOKEN=$MY_HF_TOKEN</mark></code></strong></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HOME=/hub</mark></strong></code></li>



<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>--env HF_DATASETS_TRUST_REMOTE_CODE=1</strong></mark></code></li>



<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">--env HF_HUB_ENABLE_HF_TRANSFER=0</mark></strong></code></li>
</ul>



<p>These are Hugging Face&nbsp;<strong>environment variables</strong> you have to set. Please export your Hugging Face access token as an environment variable before starting the app: <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">export MY_HF_TOKEN=***********</mark></strong></code></p>



<ul class="wp-block-list">
<li><code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">vllm/vllm-openai:v0.10.1.1</mark></strong></code></li>
</ul>



<p>Use the&nbsp;<strong><code>vllm/vllm-openai</code></strong>&nbsp;Docker image (a pre-configured vLLM OpenAI API server).</p>



<ul class="wp-block-list">
<li><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>-- bash -c "python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-Guard-3-8B --tensor-parallel-size 1 --dtype bfloat16"</strong></mark></code></li>
</ul>



<p>Finally, this runs a&nbsp;<strong>bash shell</strong>&nbsp;inside the container and executes the Python command that launches the vLLM API server.</p>



<h5 class="wp-block-heading">2.2 Check to confirm your AI Deploy app is RUNNING</h5>



<p>Replace <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></strong></code> with your own.</p>



<pre class="wp-block-code"><code class="">ovhai app get &lt;app_id&gt;</code></pre>



<p>You should get:</p>



<p><code>History:<br>DATE STATE<br>20-10-25 09:58:00 QUEUED<br>20-10-25 09:58:01 INITIALIZING<br>20-10-25 09:58:07 PENDING<br>20-10-25 10:03:10&nbsp;<strong>RUNNING</strong><br>Info:<br>Message: App is running</code></p>



<h5 class="wp-block-heading">2.3 Create a new n8n credential with AI Deploy app URL and Bearer access token</h5>



<p>First, using your <code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><strong>&lt;app_id></strong></mark></code>, retrieve your AI Deploy app URL.</p>



<pre class="wp-block-code"><code class="">ovhai app get <span style="background-color: initial; font-family: inherit; font-size: inherit; text-align: initial; font-weight: inherit;">&lt;app_id&gt;</span> -o json | jq '.status.url' -r</code></pre>



<p>Then, create a new OpenAI credential from your n8n workflow, using your AI Deploy URL and the Bearer token as an API key.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png" alt="" class="wp-image-29837" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-16.49.14-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Don&#8217;t forget to replace <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>6e10e6a5-2862-4c82-8c08-26c458ca12c7</code></mark></strong> with your <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">&lt;app_id></mark></code></strong>.</p>



<h5 class="wp-block-heading">2.4 Create the LLM Guard node in n8n workflow</h5>



<p>Create a new <strong>OpenAI node</strong> to <strong>Message a model</strong> and select the new AI Deploy credential for LLM Guard usage.</p>



<p>Next, create the prompt as follows:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png" alt="" class="wp-image-29840" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.09.43-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, use an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">If</mark></strong></code> node to determine if the scenario is <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>safe</code></mark></strong> or <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>unsafe</code></mark></strong>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png" alt="" class="wp-image-29842" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.25.29-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>If the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">unsafe</mark></strong></code>, send an error message right away to stop the workflow.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png" alt="" class="wp-image-29843" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/10/Capture-decran-2025-10-21-a-18.26.49-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>But if the message is <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">safe</mark></strong></code>, you can send the request to the AI Agent without issues 🔐.</p>
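

<p>If you want to sanity-check your guard outside n8n, here is a minimal JavaScript sketch of the same verification. It assumes an OpenAI-compatible endpoint serving Llama Guard (your AI Deploy app URL plus <code>/v1</code>) and the Bearer token created earlier; the environment variable names are placeholders, and the check relies on Llama Guard answering with text that starts with <code>safe</code> or <code>unsafe</code>.</p>



<pre class="wp-block-code"><code class="">// Minimal LLM Guard check (Node.js 18+, fetch is built in)<br>// GUARD_URL and GUARD_TOKEN are placeholders for your AI Deploy URL and Bearer token<br>const GUARD_URL = process.env.GUARD_URL;     // your AI Deploy app URL + /v1<br>const GUARD_TOKEN = process.env.GUARD_TOKEN; // the Bearer token created earlier<br><br>async function isMessageSafe(userMessage) {<br>  const response = await fetch(GUARD_URL + "/chat/completions", {<br>    method: "POST",<br>    headers: {<br>      "Content-Type": "application/json",<br>      "Authorization": "Bearer " + GUARD_TOKEN,<br>    },<br>    body: JSON.stringify({<br>      model: "meta-llama/Llama-Guard-3-8B",<br>      messages: [{ role: "user", content: userMessage }],<br>    }),<br>  });<br>  const data = await response.json();<br>  // Llama Guard replies "safe", or "unsafe" followed by the violated category<br>  const verdict = data.choices[0].message.content.trim().toLowerCase();<br>  return verdict.startsWith("safe");<br>}<br><br>isMessageSafe("How do I create an Object Storage bucket?").then(console.log);</code></pre>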



<h4 class="wp-block-heading">3. Set up AI Agent</h4>



<p>The&nbsp;<strong>AI Agent</strong>&nbsp;node in&nbsp;<strong>n8n</strong>&nbsp;acts as an intelligent orchestration layer that combines&nbsp;<strong>LLMs, memory, and external tools</strong>&nbsp;within an automated workflow.</p>



<p>It allows you to:</p>



<ul class="wp-block-list">
<li>Connect a <strong>Large Language Model</strong> using APIs (e.g., LLMs from AI Endpoints);</li>



<li>Use <strong>tools</strong> such as HTTP requests, databases, or RAG retrievers so the agent can take actions or fetch real information;</li>



<li>Maintain <strong>conversational memory</strong> via PostgreSQL databases;</li>



<li>Integrate directly with chat platforms (e.g., Slack, Teams) for interactive assistants (optional).</li>
</ul>



<p>Simply put, n8n becomes an&nbsp;<strong>agentic automation framework</strong>, enabling LLMs to not only provide answers, but also think, choose, and perform actions.</p>



<p>Please note that you can change and customise this n8n <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node to fit your use cases, using features like function calling or structured output. This is the most basic configuration for the given use case. You can go even further with different agents.</p>



<p>🧑‍💻&nbsp;<strong>How do I implement this RAG?</strong></p>



<p>First, create an <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">AI Agent</mark></strong></code> node in <strong>n8n</strong> as follows:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png" alt="" class="wp-image-29933" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, a series of steps is required, the first of which is creating the prompts.</p>



<h5 class="wp-block-heading">3.1 Create prompts</h5>



<p>In the AI Agent node on your n8n workflow, edit the user and system prompts.</p>



<p>Begin by creating the&nbsp;<strong>prompt</strong>,&nbsp;which is also the&nbsp;<strong>user message</strong>:</p>



<pre class="wp-block-code"><code class="">{{ $('Chat with the OVHcloud product expert').item.json.chatInput }}</code></pre>



<p>Then create the <strong>System Message</strong> as shown below:</p>



<pre class="wp-block-code"><code class="">You have access to a retriever tool connected to a knowledge base.  <br>Before answering, always search for relevant documents using the retriever tool.  <br>Use the retrieved context to answer accurately.  <br>If no relevant documents are found, say that you have no information about it.</code></pre>



<p>You should get a configuration like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png" alt="" class="wp-image-29935" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-1-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>🤔 Well, an LLM is now needed for this to work!</p>



<h5 class="wp-block-heading">3.2 Select LLM using AI Endpoints API</h5>



<p>First, add an <strong>OpenAI Chat Model</strong> node, and then set it as the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Chat Model</mark></strong></code> for your agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png" alt="" class="wp-image-29939" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-3-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, select one of the&nbsp;<a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Endpoints</a>&nbsp;models from the list provided, as they are compatible with the OpenAI APIs.</p>



<p>✅ <strong>How?</strong> By using the right API base URL: <a href="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>https://oai.endpoints.kepler.ai.cloud.ovh.net/v1</code></mark></strong></a></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png" alt="" class="wp-image-29936" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-2-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The <a href="https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog/gpt-oss-120b/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>GPT OSS 120B</strong></a> model has been selected for this use case. Other models, such as Llama, Mistral, and Qwen, are also available.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><mark style="background-color:#fcb900" class="has-inline-color">⚠️ <strong>WARNING</strong> ⚠️</mark></p>



<p>If you are using a recent version of n8n, you will likely encounter the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>/responses</code></mark></strong> issue (linked to OpenAI compatibility). To resolve this, disable the <strong><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Use Responses API</mark></code></strong> toggle and everything will work correctly.</p>
</blockquote>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="829" height="675" src="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg" alt="" class="wp-image-30352" style="aspect-ratio:1.2281554640124863;width:409px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1.jpg 829w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-300x244.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2026/01/02_44_08-1-768x625.jpg 768w" sizes="auto, (max-width: 829px) 100vw, 829px" /><figcaption class="wp-element-caption"><em>Tips to fix /responses issue</em></figcaption></figure>



<p>Your LLM is now set to answer your questions! Don’t forget, it needs access to the knowledge base.</p>



<h5 class="wp-block-heading">3.3 Connect the knowledge base to the RAG retriever</h5>



<p>As usual, the first step is to create an n8n node called <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">PGVector Vector Store node</mark></strong></code> and enter your PGvector credentials.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png" alt="" class="wp-image-29943" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-4-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, link this element to the <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Tools</code></mark></strong> section of the AI Agent node.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png" alt="" class="wp-image-29944" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-5-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Remember to connect your pgvector database so that the retriever can access the previously generated embeddings. Here’s an overview of what you’ll get.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png" alt="" class="wp-image-29945" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-6-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>⏳Nearly done! The final step is to add the database memory.</p>



<h5 class="wp-block-heading">3.4 Manage conversation history with database memory</h5>



<p>Creating a&nbsp;<strong>Database Memory</strong>&nbsp;node in n8n (PostgreSQL) lets you link it to your AI Agent so that it can store and retrieve past conversation history. This enables the model to remember and use context across multiple interactions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png" alt="" class="wp-image-29946" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-7-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>So link this PostgreSQL database to the <code><strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color">Memory</mark></strong></code> section of your AI agent.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="580" src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png" alt="" class="wp-image-29947" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1024x580.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-300x170.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-768x435.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-1536x870.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/11/image-8-2048x1160.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Congrats! 🥳 Your&nbsp;<strong>n8n RAG workflow</strong>&nbsp;is now complete. Ready to test it?</p>



<h4 class="wp-block-heading">4. Make the most of your automated workflow</h4>



<p>Want to try it? It’s easy!</p>



<p>By clicking the orange <strong><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-ast-global-color-0-color"><code>Open chat</code></mark></strong> button, you can ask the AI agent questions about OVHcloud products, particularly where you need technical assistance.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4"></video></figure>



<p>For example, you can ask the LLM about rate limits in OVHcloud AI Endpoints and get the information in seconds.</p>



<figure class="wp-block-video"><video height="1660" style="aspect-ratio: 2930 / 1660;" width="2930" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4"></video></figure>



<p>You can now build your own autonomous RAG system using OVHcloud Public Cloud, suited for a wide range of applications.</p>



<h2 class="wp-block-heading">What’s next?</h2>



<p>To sum up, this reference architecture provides a guide on using&nbsp;<strong>n8n</strong> with&nbsp;<strong>OVHcloud AI Endpoints</strong>,&nbsp;<strong>AI Deploy</strong>,&nbsp;<strong>Object Storage</strong>, and&nbsp;<strong>PostgreSQL + pgvector</strong> to build a fully controlled, autonomous&nbsp;<strong>RAG AI system</strong>.</p>



<p>Teams can build scalable AI assistants that work securely and independently in their cloud environment by orchestrating ingestion, embedding generation, vector storage, retrieval, LLM safety checks, and reasoning within a single workflow.</p>



<p>With the core architecture in place, you can add more features to improve the capabilities and robustness of your agentic RAG system:</p>



<ul class="wp-block-list">
<li>Web search</li>



<li>Images with OCR</li>



<li>Audio files transcribed using the Whisper model</li>
</ul>



<p>This delivers an extensive knowledge base and a wider variety of use cases!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Freference-architecture-build-a-sovereign-n8n-rag-workflow-for-ai-agent-using-ovhcloud-public-cloud-solutions%2F&amp;action_name=Reference%20Architecture%3A%20build%20a%20sovereign%20n8n%20RAG%20workflow%20for%20AI%20agent%20using%20OVHcloud%20Public%20Cloud%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n1.mp4" length="11190376" type="video/mp4" />
<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/11/video-n8n2.mp4" length="9881210" type="video/mp4" />

			</item>
		<item>
		<title>Create a podcast transcript with Whisper by AI Endpoints</title>
		<link>https://blog.ovhcloud.com/create-a-podcast-transcript-with-whisper-by-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Thu, 28 Aug 2025 07:03:04 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Audio]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29389</guid>

					<description><![CDATA[Check out this blog post if you want to know more about AI Endpoints.You can also find more info on AI Endpoints in our previous blog posts. This blog post explains how to create a podcast transcript using Whisper, a powerful automatic speech recognition (ASR) system developed by OpenAI. Whisper integrates with AI Endpoints and [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-a-podcast-transcript-with-whisper-by-ai-endpoints%2F&amp;action_name=Create%20a%20podcast%20transcript%20with%20Whisper%20by%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02.png" alt="A robot listening a podcast" class="wp-image-29401" style="width:640px" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-70x70.png 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Check out this<a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal"> blog post</a> if you want to know more about AI Endpoints.<br>You can also find more info on <a href="https://endpoints.ai.cloud.ovh.net" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> in our <a href="https://blog.ovhcloud.com/tag/ai-endpoints/" data-wpel-link="internal">previous blog posts</a>.</p>



<p>This blog post explains how to create a podcast transcript using Whisper, a powerful automatic speech recognition (ASR) system developed by OpenAI. Whisper integrates with <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> and makes it easy to transcribe audio files and add features, like speaker diarization.</p>



<p><em>ℹ️ You can find the full code on <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/podcast-transcript-whisper/python" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Github</a> ℹ️</em></p>



<h3 class="wp-block-heading">Environment Setup</h3>



<p>Define your environment variables for accessing <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>:</p>



<pre title="AI Endpoints environment variables" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">$ export OVH_AI_ENDPOINTS_WHISPER_URL=&lt;whisper model URL&gt;
$ export OVH_AI_ENDPOINTS_ACCESS_TOKEN=&lt;your_access_token&gt;
$ export OVH_AI_ENDPOINTS_WHISPER_MODEL=whisper-large-v3</code></pre>



<p>Install dependencies:</p>



<pre title="Dependencies installation" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">$ pip install -r requirements.txt</code></pre>



<h3 class="wp-block-heading">Audio transcription</h3>



<p>With Whisper and the OpenAI client, transcribing audio is as simple as writing a few lines of code:</p>



<pre title="Audio transcription" class="wp-block-code"><code lang="python" class="language-python line-numbers">import os
import json
from openai import OpenAI

# 🛠️ OpenAI client initialisation
client = OpenAI(base_url=os.environ.get('OVH_AI_ENDPOINTS_WHISPER_URL'), 
                api_key=os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN'))

# 🎼 Audio file loading
with open("../resources/TdT20-trimed-2.mp3", "rb") as audio_file:
    # 📝 Call Whisper transcription API
    transcript = client.audio.transcriptions.create(
        model=os.environ.get('OVH_AI_ENDPOINTS_WHISPER_MODEL'),
        file=audio_file,
        temperature=0.0,
        response_format="verbose_json",
        extra_body={"diarize": True},
    )</code></pre>



<p>FYI:<br>&#8211; we use ‘<em>diarize</em>’ (not a Whisper parameter) to enable diarization, because the OpenAI client lets us add extra body parameters.<br>&#8211; you need <em>verbose_json</em> for diarization (which also means <em>segmentation</em> mode)</p>



<p>Once you have the full transcript, format it in a way that’s easy for humans to read.</p>



<h3 class="wp-block-heading">Create the script</h3>



<p>The JSON field ‘<em>diarization</em>’ contains all of the transcribed, diarized content.</p>



<pre title="JSON response for diarization" class="wp-block-code"><code lang="json" class="language-json line-numbers">"diarization": [
    {
      "speaker": 0,
      "text": "bla bla bla",
      "start": 16.5,
      "end": 26.38
    },
    {
      "speaker": 1,
      "text": "bla bla",
      "start": 26.38,
      "end": 32.6
    },
    {
      "speaker": 1,
      "text": "bla bla",
      "start": 32.6,
      "end": 40.6
    },
    {
      "speaker": 2,
      "text": "bla bla",
      "start": 40.6,
      "end": 42
    }
]</code></pre>



<p>Because the output is segmented, you can merge consecutive segments from the same speaker, as detailed below for speaker 1.</p>



<p>Here’s a sample code for creating the script of a <a href="https://smartlink.ausha.co/tranches-de-tech" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">French podcast</a> featuring 3 speakers:</p>



<pre title="Merge sentences for same speaker" class="wp-block-code"><code lang="python" class="language-python line-numbers"># 🔀 Merge the dialog said by the same speaker     
diarizedTranscript = ''
speakers = ["Aurélie", "Guillaume", "Stéphane"]
previousSpeaker = -1
jsonTranscript = json.loads(transcript.model_dump_json())

# 💬 Only the diarization field is useful
for dialog in jsonTranscript["diarization"]:
    speaker = dialog.get("speaker")
    text = dialog.get("text")
    if (previousSpeaker == speaker):
        diarizedTranscript += f" {text}"
    else:
        diarizedTranscript += f"\n\n{speakers[speaker]}: {text}"
    previousSpeaker = speaker

print(f"\n📝 Diarized Transcript 📝:\n{diarizedTranscript}")
</code></pre>



<p>Lastly, run the Python script:</p>



<pre class="wp-block-code"><code lang="" class=" line-numbers">$ python PodcastTranscriptWithWhisper.py

📝 Diarized Transcript 📝:

Stéphane: Bonjour tout le monde, ravi de vous retrouver pour l'enregistrement de ce dernier épisode de la saison avant de prendre des vacances bien méritées et de vous retrouver à la rentrée pour la troisième saison. Nous enregistrons cet épisode le 30 juin à la fraîche, enfin si on peut dire au vu des températures déjà présentes en cette matinée. Justement, elle revient chaudement de Sunnytech et c'est avec plaisir que je la retrouve pour l'enregistrement de cet épisode. Bonjour Aurélie, comment vas-tu ?

Aurélie: Salut, alors ça va très bien. Alors j'avoue, j'ai également très chaud. J'ai le ventilateur qui est juste à côté de moi donc ça va aller pour l'enregistrement du podcast.

Stéphane: Oui, c'est vrai qu'il fait un peu chaud. Et pour ce dernier épisode de la saison, c'est avec un mélange de joie mais aussi d'intimidation que je reçois notre invité. Si je fais ce métier de la façon dont je le fais, c'est grandement grâce à lui. Ce podcast, quelque part, a bien entendu des inspirations de ce que fait notre invité. Je suis donc très content de te recevoir Guillaume. Bonjour Guillaume, comment vas-tu et souhaites-tu te présenter à nos auditrices et auditeurs ? Bonjour à

Guillaume: tous et bien merci déjà de m'avoir invité. Je suis très content de rejoindre votre podcast pour cet épisode. Je m'appelle Guillaume Laforge, je suis un développeur Java depuis la première heure depuis très très longtemps. Je travaille chez Google, en particulier dans la partie Google Cloud. Je me focalise beaucoup sur tout ce qui est Generative AI vu que c'est à la mode évidemment. Les gens me connaissent peut-être ou peut-être ma voix d'ailleurs parce que je fais partie du podcast Les Cascodeurs qu'on a commencé il y a 15 ans ou quelque chose comme ça. Il y a trop longtemps. Ou alors ils me connaissent parce que je suis un des co-fondateurs du langage Groovy, Apache Groovy.</code></pre>



<p>Feel free to try out our new product, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, and share your thoughts.</p>



<p>Hang out with us on Discord at #<em>ai-endpoints</em> or <em><a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://discord.gg/ovhcloud</a></em>. See you soon!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-a-podcast-transcript-with-whisper-by-ai-endpoints%2F&amp;action_name=Create%20a%20podcast%20transcript%20with%20Whisper%20by%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Use Kilo Code with AI Endpoints and VSCode</title>
		<link>https://blog.ovhcloud.com/use-kilo-code-with-ai-endpoints-and-vscode/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Mon, 30 Jun 2025 08:09:03 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29289</guid>

					<description><![CDATA[If you want to have more information on AI Endpoints, please read the following blog post.You can, also, have a look at our previous blog posts on how use AI Endpoints. In a previous blog post we explained how to use Continue with VSCode to create a code assistant with AI Endpoints. In this blog [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fuse-kilo-code-with-ai-endpoints-and-vscode%2F&amp;action_name=Use%20Kilo%20Code%20with%20AI%20Endpoints%20and%20VSCode&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/kilo-code.png" alt="a robot doing development stuff" class="wp-image-29293" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/kilo-code.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/kilo-code-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/kilo-code-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/kilo-code-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/kilo-code-70x70.png 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>If you want to have more information on AI Endpoints, please read the following<a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal"> blog post</a>.<br>You can also have a look at our <a href="https://blog.ovhcloud.com/tag/ai-endpoints/" data-wpel-link="internal">previous blog posts</a> on how to use <a href="https://endpoints.ai.cloud.ovh.net" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.</p>



<p>In a <a href="https://blog.ovhcloud.com/create-a-code-assistant-with-continue-and-ai-endpoints/" data-wpel-link="internal">previous blog post</a> we explained how to use <a href="https://www.continue.dev/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Continue</a> with <a href="https://code.visualstudio.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">VSCode</a> to create a code assistant with AI Endpoints.</p>



<p>In this blog post, we will explain how to use <a href="https://kilocode.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kilo Code</a> with VSCode to create a powerful coder companion! If you need more information about Kilo Code, please check out the <a href="https://kilocode.ai/docs" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official Kilo Code documentation</a>.</p>



<h3 class="wp-block-heading">How to use AI Endpoints with Kilo Code?</h3>



<p>The first thing to do is to install the extension in VSCode. See the <a href="https://kilocode.ai/install" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official documentation</a> for how to do that.</p>



<p>Once the extension is installed, you need to configure an external provider. To do this choose <strong>OVHcloud AI Endpoints</strong> in the Providers tab.</p>



<p>Here are the values to set for the Kilo Code parameters to use it with AI Endpoints:<br> &#8211; API Provider: OVHcloud AI Endpoints<br> &#8211; API Key: &#8230; your API Key 😇<br> &#8211; Model: one of the available models, for instance Qwen2.5-Coder-32B-Instruct (one of our coder models at the time of writing, feel free to pick another available model)<br></p>



<p>And that&#8217;s all, you can enjoy the power of Kilo Code with AI Endpoints! 🚀</p>



<figure class="wp-block-video aligncenter"><video height="996" style="aspect-ratio: 1866 / 996;" width="1866" autoplay controls loop src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/Kilocode-demo.mov"></video></figure>



<p>Don’t hesitate to test our new product, <a href="https://endpoints.ai.cloud.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Endpoints</a>, and give us your feedback.</p>



<p>You have a dedicated Discord channel (#<em>ai-endpoints</em>) on our Discord server (<em><a href="https://discord.gg/ovhcloud" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://discord.gg/ovhcloud</a></em>), see you there!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fuse-kilo-code-with-ai-endpoints-and-vscode%2F&amp;action_name=Use%20Kilo%20Code%20with%20AI%20Endpoints%20and%20VSCode&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/06/Kilocode-demo.mov" length="3725430" type="video/quicktime" />

			</item>
		<item>
		<title>Model Context Protocol (MCP) with OVHcloud AI Endpoints</title>
		<link>https://blog.ovhcloud.com/model-context-protocol-mcp-with-ovhcloud-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Fri, 27 Jun 2025 08:01:19 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29158</guid>

					<description><![CDATA[If you want to have more information on AI Endpoints, please read the following blog post.You can, also, have a look at our previous blog posts on how use AI Endpoints. OVHcloud AI Endpoints allows developers to easily add AI features to there day to day developments. In this article, we will explore how to [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmodel-context-protocol-mcp-with-ovhcloud-ai-endpoints%2F&amp;action_name=Model%20Context%20Protocol%20%28MCP%29%20with%20OVHcloud%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-3.png" alt="A robot using a laptop" class="wp-image-29168" style="width:640px" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-3.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-3-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-3-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-3-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-3-70x70.png 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>If you want to have more information on AI Endpoints, please read the following<a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal"> blog post</a>.<br>You can also have a look at our <a href="https://blog.ovhcloud.com/tag/ai-endpoints/" data-wpel-link="internal">previous blog posts</a> on how to use AI Endpoints.</p>



<p>OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> allows developers to easily add AI features to their day-to-day developments.</p>



<p>In this article, we will explore how to create a Model Context Protocol (MCP) server and client using <a href="https://quarkus.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Quarkus</a> and <a href="https://docs.langchain4j.dev/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LangChain4J</a> to interact with OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.</p>



<p><em>ℹ️ You can find the full code on <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/mcp-quarkus-langchain4j" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Github</a> ℹ️</em></p>



<h3 class="wp-block-heading"><em>Introduction to Model Context Protocol (MCP)</em></h3>



<p>In a few words, MCP is a protocol that allows your LLM to ask for additional context or data from external sources during the generation process.<br><strong>⚠️ It is not the LLM that calls the external source; the client handles the call and returns the result to the LLM. ⚠️</strong></p>



<p>If you want more information about MCP, please refer to the <a href="https://modelcontextprotocol.io/introduction" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official documentation</a>.</p>



<p>In this blog post, we&#8217;ll explore how to easily create, in Java, an MCP server using Quarkus and a client using LangChain4J.</p>



<h3 class="wp-block-heading"><em>Creating a Server with Quarkus</em></h3>



<p>The goal of this MCP server is to allow the LLM to ask for information about OVHcloud public cloud projects.</p>



<p><em>ℹ️ The code used to call the <a href="https://eu.api.ovh.com/" data-wpel-link="exclude">OVHcloud API</a> is in the GitHub repository and will not be detailed here.</em></p>



<p>Thanks to Quarkus, the only thing you need to do to create an MCP server is to define the tools you want to expose to the LLM.</p>



<pre title="Quarkus code for MCP Server" class="wp-block-code"><code lang="java" class="language-java line-numbers">public class PublicCloudUserTool {

    @RestClient
    OVHcloudMe ovhcloudMe;

    @Tool(description = "Tool to manage the OVHcloud public cloud user.")
    ToolResponse getUserDetails() {
        Long ovhTimestamp = System.currentTimeMillis() / 1000;
        return ToolResponse.success(
                new TextContent(ovhcloudMe.getMe(OVHcloudSignatureHelper.signature("me", ovhTimestamp),
                        Long.toString(ovhTimestamp)).toString()));
    }
}</code></pre>



<p><strong>⚠️ The description is very important as it will be used by the LLM to choose the right tool for the task. ⚠️</strong></p>



<p>At the time of writing, there are two types of MCP server transports: <a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports#stdio" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">stdio</a> and <a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports#streamable-http" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Streamable HTTP</a>.<br>This blog post uses the Streamable mode, thanks to Quarkus with the <em>quarkus-mcp-server-sse</em> extension.</p>



<p>Run your server with the <em>quarkus dev</em> command. Your MCP server will be available at <em>http://localhost:8080</em>.</p>



<h3 class="wp-block-heading">Using the MCP server with LangChain4J</h3>



<p>You can now use the MCP server with LangChain4J to create a powerful chatbot that interacts with your OVHcloud account!</p>



<pre title="MCP client with LangChain4J" class="wp-block-code"><code lang="java" class="language-java line-numbers">///usr/bin/env jbang "$0" "$@" ; exit $?
//JAVA 24+
//PREVIEW
//DEPS dev.langchain4j:langchain4j-mcp:1.0.1-beta6 dev.langchain4j:langchain4j:1.0.1 dev.langchain4j:langchain4j-mistral-ai:1.0.1-beta6 


import dev.langchain4j.mcp.McpToolProvider;
import dev.langchain4j.mcp.client.DefaultMcpClient;
import dev.langchain4j.mcp.client.McpClient;
import dev.langchain4j.mcp.client.transport.McpTransport;
import dev.langchain4j.mcp.client.transport.http.HttpMcpTransport;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.mistralai.MistralAiChatModel;
import dev.langchain4j.service.AiServices;

// Simple chatbot definition with AI Services from LangChain4J
public interface Bot {
    String chat(String prompt);
}

void main() {
    // Mistral model from OVHcloud AI Endpoints
    ChatModel chatModel = MistralAiChatModel.builder()
            .apiKey(System.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"))
            .baseUrl(System.getenv("OVH_AI_ENDPOINTS_MODEL_URL"))
            .modelName(System.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"))
            .logRequests(false)
            .logResponses(false)
            .build();

    // Configure the MCP server to use
    McpTransport transport = new HttpMcpTransport.Builder()
            // https://xxxx/mcp/sse
            .sseUrl(System.getenv("MCP_SERVER_URL"))
            .logRequests(false)
            .logResponses(false)
            .build();

    // Create the MCP client for the given MCP server
    McpClient mcpClient = new DefaultMcpClient.Builder()
            .transport(transport)
            .build();

    // Configure the tools list for the LLM
    McpToolProvider toolProvider = McpToolProvider.builder()
            .mcpClients(mcpClient)
            .build();

    // Create the chatbot with the given LLM and tools list
    Bot bot = AiServices.builder(Bot.class)
            .chatModel(chatModel)
            .toolProvider(toolProvider)
            .build();

    // Play with the chatbot 🤖
    String response = bot.chat("Can I have some details about my OVHcloud account?");
    System.out.println("RESPONSE: " + response);

}
</code></pre>



<p>If you run the code, you can see your MCP server and client in action:</p>



<pre title="Call model with MCP Server tools" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">$ jbang SimpleMCPClient.java

DEBUG -- Connected to SSE channel at http://127.0.0.1:8080/mcp/sse
DEBUG -- Received the server's POST URL: http://127.0.0.1:8080/mcp/messages/ZDdkZTEyYWMtNzczMC00NDVkLWFhMjktZWI1MGI0YjVjNzFh
DEBUG -- MCP server capabilities: 
{"capabilities":
  {"resources":
    {"listChanged":true},
    "completions":{},
    "logging":{},
    "tools":
      {"listChanged":true},
      "prompts":
        {"listChanged":true}
  },
  "serverInfo":
    {"version":"1.0.0-SNAPSHOT",
    "name":"ovh-mcp"
    },
  "protocolVersion":"2024-11-05"
}

RESPONSE:  Here are the details for your OVHcloud account:
- First name: Stéphane
- Last name: Philippart
- City: XXX
- Country: FR
- Language: fr_FR

You can refer to these details when interacting with the OVHcloud platform or support.
</code></pre>






<p>You have a dedicated Discord channel (#<em>ai-endpoints</em>) on our Discord server (<em><a href="https://discord.gg/ovhcloud" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://discord.gg/ovhcloud</a></em>), see you there!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmodel-context-protocol-mcp-with-ovhcloud-ai-endpoints%2F&amp;action_name=Model%20Context%20Protocol%20%28MCP%29%20with%20OVHcloud%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using Function Calling with OVHcloud AI Endpoints</title>
		<link>https://blog.ovhcloud.com/using-function-calling-with-ovhcloud-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Tue, 24 Jun 2025 07:03:45 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29145</guid>

					<description><![CDATA[If you want to have more information on AI Endpoints, please read the following blog post.You can, also, have a look at our previous blog posts on how use AI Endpoints. OVHcloud AI Endpoints allows developers to easily add AI features to there day to day developments. Stable Diffusion is a powerful artificial intelligence model [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fusing-function-calling-with-ovhcloud-ai-endpoints%2F&amp;action_name=Using%20Function%20Calling%20with%20OVHcloud%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-2.png" alt="" class="wp-image-29156" style="width:640px" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-2.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-2-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-2-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-2-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-2-70x70.png 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>If you want to have more information on AI Endpoints, please read the following<a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal"> blog post</a>.<br>You can also have a look at our <a href="https://blog.ovhcloud.com/tag/ai-endpoints/" data-wpel-link="internal">previous blog posts</a> on how to use AI Endpoints.</p>



<p>OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> allows developers to easily add AI features to their day-to-day developments.</p>



<p>Stable Diffusion is a powerful artificial intelligence model that generates images from text descriptions.<br>You can use it, thanks to AI Endpoints, simply by calling <a href="https://endpoints.ai.cloud.ovh.net/models/a363a190-ff7b-4c38-a1b9-147f9aae9328" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">the endpoint</a> with a prompt.</p>



<p>However, creating a good prompt for Stable Diffusion can be challenging.</p>



<p>In this blog post, we will show you how to optimize your prompts using Function Calling and AI Endpoints.</p>



<p>OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> provides a <a href="https://endpoints.ai.cloud.ovh.net/catalog" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">lot of models</a>, but for this example we will use models from the Large Language Models (LLM) and Image Generation families.</p>



<p>The following examples use <a href="https://docs.langchain4j.dev/intro/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LangChain4J</a> as the framework for the LLM calls.</p>



<p><em>ℹ️ You can find the full code on <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/function-calling-langchain4j" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Github</a> ℹ️</em></p>



<h3 class="wp-block-heading"><em>Introduction to Function Calling</em></h3>



<p>Function calling refers to the ability of a language model or AI system to request the invocation and execution of pre-defined functions or tasks, such as data processing, calculations, or external API calls, in response to user input or prompts.<br>This enables the AI system to perform more complex and dynamic tasks, and to leverage external knowledge and services to generate more accurate and informative responses.</p>



<p>In the context of image generation, function calling can be used to enhance the quality of the prompts by optimizing them with an external tool based on an LLM.</p>



<p>To create our application we will use <a href="https://docs.langchain4j.dev/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LangChain4J</a> to simplify the integration of the AI models and the function calling mechanism.</p>



<h3 class="wp-block-heading"><em><em>Tool creation</em></em></h3>



<p>To use the function calling mechanism, we need to define a tool.<br>In our example, the goal of the tool is to call the Stable Diffusion API to generate an image.</p>



<p><strong>⚠️ The model itself does not call the tool; the client that invokes the model does. ⚠️</strong></p>



<pre title="Image generation tool" class="wp-block-code"><code lang="java" class="language-java line-numbers">    @Tool("""
    Tool to create an image with Stable Diffusion XL given a prompt and a negative prompt.
    """)
    void generateImage(@P("Prompt that explains the image") String prompt, @P("Negative prompt that explains what the image must not contains") String negativePrompt) throws IOException, InterruptedException {
        System.out.println("Prompt: " + prompt);
        System.out.println("Negative prompt: " + negativePrompt);

        HttpRequest httpRequest = HttpRequest.newBuilder()
                .uri(URI.create(System.getenv("OVH_AI_ENDPOINTS_SD_URL")))
                .POST(HttpRequest.BodyPublishers.ofString("""
                        {"prompt": "%s", 
                         "negative_prompt": "%s"}
                        """.formatted(prompt, negativePrompt)))
                .header("accept", "application/octet-stream")
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + System.getenv("OVH_AI_ENDPOINTS_SDXL_ACCESS_TOKEN"))
                .build();

        HttpResponse&lt;byte[]&gt; response = HttpClient.newHttpClient()
                .send(httpRequest, HttpResponse.BodyHandlers.ofByteArray());

        System.out.println("SDXL status code: " + response.statusCode());
        Files.write(Path.of("generated-image.jpeg"), response.body());
    }</code></pre>



<p>⚠️ One of the main ways to help the LLM choose the right tool is to provide a clear and comprehensive description. ⚠️</p>



<p>Once the tool is ready, let&#8217;s tell the model that it can use it!</p>



<h3 class="wp-block-heading"><em>Optimizing the model with a tool</em></h3>



<p>First we create a simple chatbot.</p>



<pre title="Chatbot with AI Services" class="wp-block-code"><code lang="java" class="language-java line-numbers">/// Chatbot definition.
/// The goal of the chatbot is to build a powerful prompt for Stable Diffusion XL.
interface ChatBot {
    @SystemMessage("""
            You are an expert in using the Stable Diffusion XL model.
            The user explains in natural language what kind of image he wants.
            You must do the following steps:
              - Understand the user's request.
              - Generate the two kinds of prompts for Stable Diffusion: the prompt and the negative prompt
              - the prompts must be in English and detailed and optimized for the Stable Diffusion XL model. 
              - once and only once you have these two prompts, call the tool with the two prompts.
            If asked about to create an image, you MUST call the `generateImage` function.
            """)
    @UserMessage("Create an image with stable diffusion XLK following this description: {{userMessage}}")
    String chat(String userMessage);
}</code></pre>



<p>It&#8217;s not mandatory to create such a detailed system message, but it helps the model choose the tool when needed.</p>



<p>After this, we assemble all the pieces.</p>



<pre title="Chatbot with tool calling" class="wp-block-code"><code lang="java" class="language-java line-numbers">void main() throws Exception {

    // Main chatbot configuration, choose one of the available models on the AI Endpoints catalog (https://endpoints.ai.cloud.ovh.net/catalog)
    ChatModel chatModel = MistralAiChatModel.builder()
            .apiKey(System.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"))
            .baseUrl(System.getenv("OVH_AI_ENDPOINTS_MODEL_URL"))
            .modelName(System.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"))
            .logRequests(false)
            .logResponses(false)
            // To have more deterministic outputs, set temperature to 0.
            .temperature(0.0)
            .build();

    // Add memory to fine tune the SDXL prompt.
    ChatMemory chatMemory = MessageWindowChatMemory.withMaxMessages(10);

    // Build the chatbot thanks to LangChain4J AI Services mode
    ChatBot chatBot = AiServices.builder(ChatBot.class)
            .chatModel(chatModel)
            .tools(new ImageGenTools())
            .chatMemory(chatMemory)
            .build();

    // Start the conversation loop (enter "exit" to quit)
    String userInput = "";
    Scanner scanner = new Scanner(System.in);
    while (true) {
        System.out.print("Enter your message: ");
        userInput = scanner.nextLine();
        if (userInput.equalsIgnoreCase("exit")) break;
        System.out.println("Response: " + chatBot.chat(userInput));
    }
    scanner.close();
}</code></pre>



<p>ℹ️ We use a loop to be able to ask the model to optimize the image generation parameters based on the previous response. ℹ️</p>



<p>And that&#8217;s it!<br>It&#8217;s time to test our Stable Diffusion optimizer.</p>



<pre title="Output for chatbot calling" class="wp-block-code"><code lang="bash" class="language-bash">$ jbang ImageGeneration.java


Enter your message: Un chat roux mignon photo réaliste

Prompt: A high-quality, realistic image of a cute red cat, with expressive eyes, soft fur, and a playful pose. 
The cat should be well-lit, with a warm and inviting atmosphere.

Negative prompt: No text, no watermarks, no low-quality images, no cartoon-style, no blurry or pixelated images, 
no cats with missing body parts, no cats with unnatural colors, no cats in unrealistic settings, no cats with human features, 
no cats with inappropriate content.

Response: I have successfully generated the image for you. The image should be a high-quality, 
realistic image of a cute red cat, with expressive eyes, soft fur, and a playful pose. The cat should be well-lit, 
with a warm and inviting atmosphere. If you have any issues or need further assistance, please let me know.

Enter your message: exit</code></pre>



<p>ℹ️ As you can see, the model translates the prompt 😊</p>



<p>Here is the result of the prompt:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image.png" alt="A cut red cat generated by Stable Diffusion" class="wp-image-29149" style="width:640px" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/06/generated-image-70x70.png 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>






<p>You have a dedicated Discord channel (#<em>ai-endpoints</em>) on our Discord server (<em><a href="https://discord.gg/ovhcloud" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://discord.gg/ovhcloud</a></em>), see you there!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fusing-function-calling-with-ovhcloud-ai-endpoints%2F&amp;action_name=Using%20Function%20Calling%20with%20OVHcloud%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using Structured Output with OVHcloud AI Endpoints</title>
		<link>https://blog.ovhcloud.com/using-structured-output-with-ovhcloud-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Fri, 23 May 2025 12:14:54 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28985</guid>

					<description><![CDATA[If you want to have more information on AI Endpoints, please read the following blog post.You can, also, have a look at our previous blog posts on how use AI Endpoints. You can find the full code example in the GitHub repository. In this article, we will explore how to use structured output with OVHcloud [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fusing-structured-output-with-ovhcloud-ai-endpoints%2F&amp;action_name=Using%20Structured%20Output%20with%20OVHcloud%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/05/image-2.webp" alt="A parrot on a computer screen
" class="wp-image-28999" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/05/image-2.webp 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/05/image-2-300x300.webp 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/05/image-2-150x150.webp 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/05/image-2-768x768.webp 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/05/image-2-70x70.webp 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>If you want to have more information on <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, please read the <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">following blog post</a>.<br>You can also have a look at our <a href="https://blog.ovhcloud.com/tag/ai-endpoints/" data-wpel-link="internal">previous blog posts</a> on how to use AI Endpoints.</em></p>



<p><em>You can find the full code example in the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/ai-endpoints/structured-output-langchain4j" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</em></p>



<p>In this article, we will explore how to use structured output with OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.</p>



<h3 class="wp-block-heading"><em>Introduction to Structured Output</em></h3>



<p>Structured output allows you to format output data in a way that makes it easier for machines to interpret and process.<br>We use the langchain4j library to interact with OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.<br>Here is an excerpt of code that shows how to define a structured output format for the responses of the language model:</p>



<pre title="Json schema definition" class="wp-block-code"><code lang="java" class="language-java line-numbers">ResponseFormat responseFormat = ResponseFormat.builder()
         .type(ResponseFormatType.JSON)
         .jsonSchema(JsonSchema.builder()
            .name("Person")
            .rootElement(JsonObjectSchema.builder()
               .addStringProperty("name")
               .addIntegerProperty("age")
               .addNumberProperty("height")
               .addBooleanProperty("married")
               .required("name", "age", "height", "married")
            .build())
         .build())
.build();</code></pre>



<p>In this example, we define a JSON output format with a schema that specifies the name, age, height, and married properties as required.</p>



<h3 class="wp-block-heading"><em>Configure the model to use</em></h3>



<p>This example uses the Mistral AI model hosted on OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>.<br>To configure the model, you need to set up the API key, base URL, and model name as environment variables.<br>Feel free to use another model; see the <a href="https://endpoints.ai.cloud.ovh.net/catalog" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints catalog</a>.</p>



<p><em>You can find your access token, model URL, and model name in the OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/models/8b5793fb-89a1-484f-b691-ae45793d6ade" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints model dashboard</a>.</em></p>



<pre title="Model definition" class="wp-block-code"><code lang="java" class="language-java line-numbers">ChatModel chatModel = MistralAiChatModel.builder()
        .apiKey(System.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"))
        .baseUrl(System.getenv("OVH_AI_ENDPOINTS_MODEL_URL"))
        .modelName(System.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"))
        .logRequests(false)
        .logResponses(false)
.build();
</code></pre>






<h3 class="wp-block-heading"><em>Calling the language model</em></h3>



<p>Thanks to the JSON mode of the LLM, the response from the language model is received as a JSON string:</p>



<pre title="Model call with JSON mode" class="wp-block-code"><code lang="java" class="language-java line-numbers">UserMessage userMessage = UserMessage.from("""
        John is 42 years old.
        He stands 1.75 meters tall.
        Currently unmarried.
        """);

ChatRequest chatRequest = ChatRequest.builder()
        .responseFormat(responseFormat)
        .messages(userMessage)
        .build();

ChatResponse chatResponse = chatModel.chat(chatRequest);

String output = chatResponse.aiMessage().text();
System.out.println("Response: \n" + output); 


// Person is a simple record: record Person(String name, int age, double height, boolean married) {}
Person person = new ObjectMapper().readValue(output, Person.class);
System.out.println(person); 
</code></pre>






<h3 class="wp-block-heading">The full source code</h3>



<pre title="Full code source" class="wp-block-code"><code lang="java" class="language-java line-numbers">///usr/bin/env jbang "$0" "$@" ; exit $?
//JAVA 21+
//PREVIEW
//DEPS dev.langchain4j:langchain4j:1.0.1 dev.langchain4j:langchain4j-mistral-ai:1.0.1-beta6

import com.fasterxml.jackson.databind.ObjectMapper;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.request.ResponseFormat;
import dev.langchain4j.model.chat.request.ResponseFormatType;
import dev.langchain4j.model.chat.request.json.JsonObjectSchema;
import dev.langchain4j.model.chat.request.json.JsonSchema;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.mistralai.MistralAiChatModel;
import dev.langchain4j.model.chat.ChatModel;

record Person(String name, int age, double height, boolean married) {
}

void main() throws Exception {
    ResponseFormat responseFormat = ResponseFormat.builder()
            .type(ResponseFormatType.JSON)
            .jsonSchema(JsonSchema.builder()
                    .name("Person")
                    .rootElement(JsonObjectSchema.builder()
                            .addStringProperty("name")
                            .addIntegerProperty("age")
                            .addNumberProperty("height")
                            .addBooleanProperty("married")
                            .required("name", "age", "height", "married")
                            .build())
                    .build())
            .build();

    UserMessage userMessage = UserMessage.from("""
            John is 42 years old.
            He stands 1.75 meters tall.
            Currently unmarried.
            """);

    ChatRequest chatRequest = ChatRequest.builder()
            .responseFormat(responseFormat)
            .messages(userMessage)
            .build();

    ChatModel chatModel = MistralAiChatModel.builder()
            .apiKey(System.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"))
            .baseUrl(System.getenv("OVH_AI_ENDPOINTS_MODEL_URL"))
            .modelName(System.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"))
            .logRequests(false)
            .logResponses(false)
            .build();

    ChatResponse chatResponse = chatModel.chat(chatRequest);

    System.out.println("Prompt: \n" + userMessage.singleText());
    String output = chatResponse.aiMessage().text();
    System.out.println("Response: \n" + output); 

    Person person = new ObjectMapper().readValue(output, Person.class);
    System.out.println(person); 
}</code></pre>






<h3 class="wp-block-heading"><em>Running the application</em></h3>



<pre title="Runnung the application" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">jbang HelloWorld.java
[jbang] Building jar for HelloWorld.java...

Prompt: 
John is 42 years old.
He stands 1.75 meters tall.
Currently unmarried.

Response: 
{"age": 42, "height": 1.75, "married": false, "name": "John"}
Person[name=John, age=42, height=1.75, married=false]</code></pre>



<p><em>This code example uses <a href="https://www.jbang.dev/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">JBang</a>, a Java-based tool for creating and running Java programs as scripts.<br>For more information on JBang, please refer to the <a href="https://www.jbang.dev/documentation/guide/latest/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">JBang documentation</a>.</em></p>



<p>In this article, we have seen how to use structured output with OVHcloud <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a> with <a href="https://docs.langchain4j.dev/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LangChain4J</a>.</p>



<p>You have a <a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dedicated Discord </a>channel (#ai-endpoints) on our Discord server, see you there!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fusing-structured-output-with-ovhcloud-ai-endpoints%2F&amp;action_name=Using%20Structured%20Output%20with%20OVHcloud%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deep Dive into DeepSeek-R1 &#8211; Part 1</title>
		<link>https://blog.ovhcloud.com/deep-dive-into-deepseek-r1-part-1/</link>
		
		<dc:creator><![CDATA[Fabien Ric]]></dc:creator>
		<pubDate>Thu, 06 Mar 2025 09:56:20 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[DeepSeek]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28199</guid>

					<description><![CDATA[Introduction A few weeks ago, the release of the open-source large language model DeepSeek-R1 has taken the AI world by storm. The Chinese research team claimed their new reasoning model was on par with OpenAI&#8217;s flagship model o1, open-sourced the model and gave details about the work behind it. In this blog post series, we [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeep-dive-into-deepseek-r1-part-1%2F&amp;action_name=Deep%20Dive%20into%20DeepSeek-R1%20%26%238211%3B%20Part%201&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="512" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-1024x512.png" alt="A cute whale with a baseball cap, using a computer, representing DeepSeek." class="wp-image-28353" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-1024x512.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-300x150.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-768x384.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16-1536x768.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-16.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Introduction</h2>



<p>A few weeks ago, the release of the open-source large language model DeepSeek-R1 took the AI world by storm. The Chinese research team claimed their new reasoning model was on par with OpenAI&#8217;s flagship model o1, open-sourced the model and gave details about the work behind it.</p>



<p>In this blog post series, we will dive into the DeepSeek-R1 model family and see how you can run it on OVHcloud to build a simple chatbot that handles reasoning.</p>



<p>The &#8220;R&#8221; in DeepSeek-R1 stands for &#8220;Reasoning&#8221;, so let&#8217;s start by defining what a reasoning model is.</p>



<h2 class="wp-block-heading">What are reasoning models?</h2>



<p>Reasoning models are large language models (LLM) capable of reflecting on a problem before generating an answer. Traditionally, LLMs have been improved by spending more compute (more data, more parameters, more training iterations) at training time: this is <strong>training-time compute</strong>. Reasoning models, however, differ from standard LLMs in the way they use <strong>test-time compute</strong>, which means that during inference, they spend more time and resources to generate and refine a better answer.</p>



<p>Reasoning models excel at tasks that require understanding and working through a problem step-by-step, such as mathematics, riddles, puzzles, coding, planning tasks and agentic workflows. They may be counterproductive for use cases that don&#8217;t require reasoning capabilities, such as knowledge facts (for example, <em>who discovered penicillin)</em>.</p>



<p>In a classroom, a reasoning model would be a student that takes time to understand the question, split the problem into manageable steps and detail the resolution process before rushing to write the answer.</p>



<p>Here is a comparison between the outputs of a standard LLM and a reasoning LLM, on an example prompt:</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="1029" height="492" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14.png" alt="A diagram showing the differences between standard LLM and reasoning LLM outputs for a given prompt." class="wp-image-28318" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14.png 1029w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14-300x143.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14-1024x490.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-14-768x367.png 768w" sizes="auto, (max-width: 1029px) 100vw, 1029px" /></figure>



<p>The reasoning model has generated more tokens, showing how it plans to solve the problem, before the actual answer. In the case of DeepSeek-R1, you can see that it generates the reasoning content inside <code>&lt;think&gt;...&lt;/think&gt;</code> tags.</p>
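


<p>If you consume such a model through an API, you will usually want to separate the reasoning trace from the final answer. Here is a minimal, illustrative Python sketch (our own example, not tied to any specific client library) that splits a raw completion on these tags:</p>



<pre title="Split reasoning content from the answer" class="wp-block-code"><code lang="python" class="language-python line-numbers">import re

# Minimal sketch: split a DeepSeek-R1 style completion into the reasoning
# trace and the final answer, assuming the &lt;think&gt;...&lt;/think&gt; convention above.
def split_reasoning(completion):
    match = re.search(r"&lt;think&gt;(.*?)&lt;/think&gt;", completion, flags=re.DOTALL)
    if match is None:
        # No reasoning block found: return the whole completion as the answer
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

# Hand-written example completion
raw = "&lt;think&gt;The user asks for 2 + 2, which is 4.&lt;/think&gt;The answer is 4."
reasoning, answer = split_reasoning(raw)
print("Reasoning:", reasoning)
print("Answer:", answer)</code></pre>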



<p>A standard LLM can also show reasoning abilities, which are often more visible when using a technique called <a href="https://arxiv.org/abs/2201.11903" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Chain-of-Thought prompting (CoT)</a>, i.e. by adding phrases such as &#8220;let&#8217;s think step-by-step&#8221; to the prompt.</p>



<p>However, a reasoning LLM has been trained to behave this way. Its reasoning skill is internalized, so it doesn&#8217;t require specific prompting techniques to trigger the chain of thoughts process.</p>



<p>It&#8217;s important to note that DeepSeek-R1 is not the first reasoning model; OpenAI led the way by releasing their o1 model in September 2024.</p>



<p>The two main reasons why DeepSeek-R1 made the headlines are its open-source nature, and the paper released by the research team, which gives many details on how they trained the model, with valuable insights for the open-source community to create reasoning models. In particular, the key highlight of their paper is the observation that reasoning behavior can emerge through Reinforcement Learning (RL) alone, without fine-tuning.</p>



<h2 class="wp-block-heading">The DeepSeek-R1 model family</h2>



<p>You may have heard about DeepSeek-R1, but it&#8217;s not the only model in the DeepSeek family: DeepSeek-V3, DeepSeek-R1-Zero, and distilled models are also available. So what are the differences between those models?</p>



<p>First, let&#8217;s go through some definitions and an overview of how language models are trained.</p>



<h3 class="wp-block-heading">Language model training overview</h3>



<p>The large language models available in apps and playgrounds are usually trained in 3 steps:</p>



<ol class="wp-block-list">
<li>A <strong>base model</strong> is trained on an unsupervised language modeling task (for instance, next token prediction) with a dataset of trillions of tokens (also called <em>pre-training</em>),</li>



<li>An <strong>instruct model </strong>is trained from the base model, by fine-tuning it on a massive dataset of instructions, conversations, questions and answers, to improve the performance of the model on the prompts frequently encountered in a chat,</li>



<li>The <strong>final model</strong> is the instruct model trained to better handle human preferences, avoid the generation of harmful content, etc. with techniques such as RLHF (reinforcement learning from human feedback) and DPO (direct preference optimization).</li>
</ol>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="1459" height="239" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image.png" alt="A diagram showing the 3 training steps of a LLM." class="wp-image-28268" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image.png 1459w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-300x49.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-1024x168.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-768x126.png 768w" sizes="auto, (max-width: 1459px) 100vw, 1459px" /></figure>






<h3 class="wp-block-heading">DeepSeek-V3 training</h3>



<p>According to the <a href="https://arxiv.org/pdf/2412.19437" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">technical report provided by DeepSeek</a>, DeepSeek-V3 is a mixture-of-experts (MoE) language model trained with the same kind of process, which is described in the image below:</p>



<ul class="wp-block-list">
<li><strong>DeepSeek-V3-Base</strong> is trained with 14.8 trillion tokens,</li>



<li>A dataset of 1.5 million instructions examples is used to fine-tune the base model,</li>



<li>This instruct model goes through reinforcement learning with several reward models. The final model is <strong>DeepSeek-V3</strong>.</li>
</ul>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="1453" height="242" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8.png" alt="A diagram showing the 3 training steps of DeepSeek-V3." class="wp-image-28288" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8.png 1453w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8-300x50.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8-1024x171.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-8-768x128.png 768w" sizes="auto, (max-width: 1453px) 100vw, 1453px" /></figure>



<p>For the reinforcement learning step, DeepSeek uses their algorithm called <strong>GRPO</strong> (<a href="https://arxiv.org/pdf/2402.03300" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">group relative policy optimization</a>), which uses several reward models to assess the quality of the content generated by the model. The score given by each reward model is combined into a final score, used to update the model so that it maximizes its global score the next time.</p>
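


<p>To make the idea of combining several reward scores more concrete, here is a small, simplified Python illustration (our own sketch, not DeepSeek&#8217;s actual implementation; the reward names and weights are hypothetical). Each generated answer gets a weighted sum of reward scores, and the scores of a group of answers to the same prompt are then compared with each other, which is the &#8220;group relative&#8221; part of GRPO:</p>



<pre title="Sketch of combined, group-relative rewards" class="wp-block-code"><code lang="python" class="language-python line-numbers"># Simplified illustration of combining reward scores, as described above.
# The reward names and weights are hypothetical placeholders.

def combined_reward(scores, weights):
    # Weighted sum of the scores given by each reward model / rule
    return sum(weights[name] * score for name, score in scores.items())

# Scores for a group of 4 candidate answers to the same prompt
group_scores = [
    {"accuracy": 0.9, "format": 1.0},
    {"accuracy": 0.4, "format": 1.0},
    {"accuracy": 0.7, "format": 0.0},
    {"accuracy": 0.2, "format": 0.0},
]
weights = {"accuracy": 0.8, "format": 0.2}

rewards = [combined_reward(s, weights) for s in group_scores]
mean_reward = sum(rewards) / len(rewards)

# "Group relative": each answer is compared with the average of its group,
# so answers better than the group average get a positive learning signal
advantages = [r - mean_reward for r in rewards]
print(advantages)</code></pre>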



<h3 class="wp-block-heading">DeepSeek-R1 model series training</h3>



<p><strong>DeepSeek-R1</strong> models are built with a different training pipeline, using the base model of DeepSeek-V3. The diagram below shows the main steps of the process designed by DeepSeek to create several reasoning models mentioned in their <a href="https://arxiv.org/pdf/2501.12948" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">technical report</a>:</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="1262" height="1323" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-12.png" alt="A diagram showing the training process of DeepSeek-R1, DeepSeek-R1-Zero and DeepSeek-Distill models." class="wp-image-28301" /></figure>



<p>Let&#8217;s walk through it step-by-step (no pun intended):</p>



<p>1. The main breakthrough described in DeepSeek&#8217;s paper: they managed to train the DeepSeek-V3-Base 671B model to learn the reasoning capability with reinforcement learning only, which doesn&#8217;t require labeled data, as opposed to supervised fine-tuning. They use the same GRPO algorithm as before, with two rewards. The first one assesses the accuracy of the generated content, using &#8220;rule-based&#8221; checks instead of full reward models (which would themselves have to be trained and would require significant resources). For example, to assess whether the model generated correct Python code, one check could compile the generated code and give a score based on the number of errors, while another could generate test cases and verify that the generated code passes them. The second reward concerns the format of the model&#8217;s responses, which must enclose the reasoning content in <code>&lt;think&gt;...&lt;/think&gt;</code> tags, as sketched below. The resulting model is <strong>DeepSeek-R1-Zero.</strong> However, it has limitations that make it unsuitable for direct use, such as language mixing and poor readability.</p>
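

<p>As an illustration only (this is our own toy example, not DeepSeek&#8217;s code), a rule-based reward for a Python coding task could combine a format check on the <code>&lt;think&gt;</code> tags with a syntax check on the generated code:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import re

# Format reward: the response must enclose its reasoning in &lt;think&gt;...&lt;/think&gt; tags.
THINK_PATTERN = re.compile(r"&lt;think&gt;.*?&lt;/think&gt;", re.DOTALL)

def format_reward(response):
    return 1.0 if THINK_PATTERN.search(response) else 0.0

def code_syntax_reward(generated_code):
    # Toy accuracy check: does the generated Python code at least compile?
    try:
        compile(generated_code, "&lt;generated&gt;", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

def rule_based_reward(response, generated_code):
    # Equal weighting is an arbitrary choice made for this example.
    return 0.5 * format_reward(response) + 0.5 * code_syntax_reward(generated_code)

if __name__ == "__main__":
    response = "&lt;think&gt;Loop from 0 to 9 and print each value.&lt;/think&gt;Here is the code."
    code = "for i in range(10):\n    print(i)"
    print(rule_based_reward(response, code))  # 1.0</code></pre>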



<p>2. To overcome these limitations, DeepSeek uses DeepSeek-R1-Zero to create a cold-start reasoning dataset, augmented with other data from sources not explicitly mentioned. DeepSeek-V3-Base is trained with this cold-start data before a new round of reinforcement learning is applied.</p>



<p>3. They use the same RL approach to get a new reasoning model that generates better-quality output. Using this model, they build a reasoning dataset roughly 100x bigger, growing from 5k to 600k samples, with DeepSeek-V3 acting as a quality judge (illustrated in the sketch below). This dataset is then complemented with 200k samples generated by DeepSeek-V3 on non-reasoning tasks.</p>
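

<p>The &#8220;quality judge&#8221; idea can also be illustrated with a few lines of code. The sketch below is our own simplification: it asks a judge model, reachable through any OpenAI-compatible endpoint (the URL, model name and environment variable are placeholders), whether a candidate sample should be kept:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import os

from openai import OpenAI

# Placeholders: point the client at any OpenAI-compatible endpoint serving your judge model.
judge = OpenAI(
    base_url="https://&lt;your-judge-endpoint&gt;/v1",
    api_key=os.environ.get("JUDGE_API_KEY", "none"),
)

def keep_sample(question, candidate_answer):
    """Ask the judge model whether a generated sample is good enough to keep."""
    verdict = judge.chat.completions.create(
        model="&lt;judge-model-name&gt;",
        messages=[
            {"role": "system", "content": "You are a strict reviewer. Answer only YES or NO."},
            {"role": "user", "content": f"Question: {question}\nAnswer: {candidate_answer}\nIs this answer correct and readable?"},
        ],
        max_tokens=3,
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

candidate_samples = [
    {"question": "What is 2 + 2?", "answer": "&lt;think&gt;2 + 2 = 4.&lt;/think&gt;The answer is 4."},
]
kept = [s for s in candidate_samples if keep_sample(s["question"], s["answer"])]
print(len(kept), "samples kept")</code></pre>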



<p>4. A second stage of supervised fine-tuning is done with the dataset built earlier.</p>



<p>5. The model is then aligned with human preferences through a final round of reinforcement learning that uses a dedicated human-preference reward. The resulting model is <strong>DeepSeek-R1</strong>.</p>



<p>6. Finally, DeepSeek experimented with fine-tuning much smaller models than DeepSeek-V3 (LLaMa 3.3 70B, Qwen 2.5 32B&#8230;) with the dataset built at step 3. In the paper, they call this process <strong>distillation</strong>. However, it must not be confused with the <em>knowledge distillation</em> technique frequently used in deep learning, where a student model learns from the probability distributions of a teacher model. Here, the term &#8220;distillation&#8221; refers to the fact that the reasoning skill is &#8220;distilled&#8221; into the base model, but it&#8217;s plain old supervised fine-tuning (illustrated by the sketch after the benchmark below). This is how the <strong>DeepSeek-R1-Distill </strong>model series is trained. The quality of the dataset enables the resulting distilled models to beat much larger models on reasoning tasks, as shown in the benchmark below:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="770" height="312" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/image-13.png" alt="A screen capture of benchmark data table." class="wp-image-28310" style="width:750px;height:auto" /><figcaption class="wp-element-caption"><em>Benchmark of distilled models on several reasoning tasks (source: DeepSeek R1 technical paper)</em></figcaption></figure>
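

<p>To picture what this kind of &#8220;distillation&#8221; data can look like, here is a purely illustrative sketch of a single training record written to a JSONL file. The field names and content are hypothetical; the point is simply that the student model is fine-tuned, with plain supervised learning, to reproduce the teacher&#8217;s full output, reasoning included:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import json

# Hypothetical shape of one "distillation" training example: the target text is
# the teacher's complete response, with the reasoning kept inside &lt;think&gt; tags.
example = {
    "prompt": "How many prime numbers are there below 20?",
    "completion": (
        "&lt;think&gt;List them: 2, 3, 5, 7, 11, 13, 17, 19. That makes 8.&lt;/think&gt;\n"
        "There are 8 prime numbers below 20."
    ),
}

with open("distillation_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")</code></pre>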



<h3 class="wp-block-heading">Recap</h3>



<p>The table below summarizes the differences between the models of the DeepSeek-R1 series:</p>



<figure class="wp-block-table"><table><tbody><tr><td>Model</td><td>Description</td></tr><tr><td>DeepSeek-R1-Zero</td><td>Intermediate 671B reasoning model trained from DeepSeek-V3-Base exclusively with reinforcement learning, and used to bootstrap DeepSeek-R1 training.</td></tr><tr><td>DeepSeek-R1</td><td>671B reasoning model trained from DeepSeek-V3-Base with the multi-stage pipeline described above.</td></tr><tr><td>DeepSeek-R1-Distill</td><td>Smaller models fine-tuned for reasoning with a dataset generated by an intermediate version of DeepSeek-R1.</td></tr></tbody></table></figure>



<h2 class="wp-block-heading">Run DeepSeek-R1 on OVHcloud</h2>



<p>Now that we&#8217;ve seen the differences between all DeepSeek models, let&#8217;s try to use them!</p>



<h3 class="wp-block-heading">AI Endpoints</h3>



<p>The fastest way to test DeepSeek-R1 is to use OVHcloud<strong> AI Endpoints</strong>.</p>



<p><strong>DeepSeek-R1-Distill-Llama-70B</strong> is already available, ready to use and optimized for inference speed. Check it out here: <a href="https://endpoints.ai.cloud.ovh.net/models/a011515c-0042-41b2-9a00-ec8b5d34462d" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://endpoints.ai.cloud.ovh.net/models/a011515c-0042-41b2-9a00-ec8b5d34462d</a></p>



<p>AI Endpoints makes it easy to integrate AI into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it’s in beta, it’s <strong>free</strong>!</p>



<p>Here is an example cURL command to use DeepSeek-R1-Distill-Llama-70B on the OpenAI-compatible endpoint provided by OVHcloud AI Endpoints:</p>



<pre class="wp-block-code"><code class="">curl -X 'POST' \
  'https://deepseek-r1-distill-llama-70b.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "max_tokens": 4096,
  "messages": [
    {
      "content": "How can I calculate an approximation of Pi in Python?",
      "role": "user"
    }
  ],
  "model": null,
  "seed": null,
  "stream": false,
  "temperature": 0.7,
  "top_p": 1
}'</code></pre>



<p>We can see in the output the thinking process followed by the answer; both have been truncated here for clarity.</p>



<pre class="wp-block-code"><code class="">{
    "id": "chatcmpl-8c21b2e3fac44d43b63c06fa25e58091",
    "object": "chat.completion",
    "created": 1741199564,
    "model": "DeepSeek-R1-Distill-Llama-70B",
    "choices":
    [
        {
            "index": 0,
            "message":
            {
                "role": "assistant",
                "content": "&lt;think&gt;\nOkay, the user is asking how to approximate Pi using Python. I need to think about different methods they can use. Let's see, there are a few common approaches. \n\nFirst, there's the Monte Carlo method. ... Let me structure the response with each method as a separate section, explaining what it is, how it works, and providing the code. Then, the user can pick which one they prefer based on their situation.\n&lt;/think&gt;\n\nThere are several ways to approximate the value of Pi (π) using Python. Below are a few methods:\n\n### 1. Using the Monte Carlo Method..."
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage":
    {
        "prompt_tokens": 14,
        "completion_tokens": 1377,
        "total_tokens": 1391
    }
}</code></pre>



<p>Stéphane Philippart, Developer Relation Advocate at OVHcloud, has written a blog post covering everything you need to know to get up to speed with AI Endpoints and run this model: <a href="https://blog.ovhcloud.com/release-of-deepseek-r1-on-ovhcloud-ai-endpoints/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">Release of DeepSeek-R1 on OVHcloud AI Endpoints</a></p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p>What if you want to run another version of DeepSeek-R1, such as the Qwen 7B distilled version?</p>



<p>You can use another OVHcloud AI product, <strong>AI Deploy</strong>, to create your own serving endpoint, with <a href="https://docs.vllm.ai/en/stable/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">vLLM</a> as the inference engine. It is open-source, fast and well maintained, ensuring maximal compatibility with even the most recent AI models.</p>



<p>Eléa Petton, Solution Architect at OVHcloud, has written a blog post explaining in detail how to serve an open-source model with vLLM on AI Deploy. Just replace the Mistral Small model with the DeepSeek distilled version you want to use (e.g. <strong>deepseek-ai/DeepSeek-R1-Distill-Qwen-7B</strong>) and adapt the number of L40S cards needed (1 is enough for the 7B version): <a href="https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">Mistral Small 24B served with vLLM and AI Deploy – a single command to deploy an LLM (Part 1)</a></p>
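

<p>Once your AI Deploy app is up and running, vLLM exposes an OpenAI-compatible API, so querying your own endpoint looks very much like querying AI Endpoints. The sketch below uses the OpenAI Python client; the <code>base_url</code> is a placeholder you must replace with the URL of your own AI Deploy app, and the API key environment variable is an arbitrary name chosen for the example:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import os

from openai import OpenAI

# Placeholder URL: replace it with the URL of your own AI Deploy app serving vLLM.
client = OpenAI(
    base_url="https://&lt;your-ai-deploy-app-url&gt;/v1",
    api_key=os.environ.get("VLLM_API_KEY", "none"),
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "user", "content": "How can I calculate an approximation of Pi in Python?"}
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)</code></pre>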



<h3 class="wp-block-heading">Next up, creating a reasoning chatbot with DeepSeek-R1</h3>



<p>In part 2 of this blog post series, we will use a DeepSeek-R1-Distill model to create a chatbot that will handle reasoning gracefully, by showing the thinking process of the model.</p>



<p>We will develop our chatbot with OVHcloud AI Endpoints and the Python library <a href="https://www.gradio.app/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Gradio</a>, which makes it possible to quickly create simple chat interfaces.</p>
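

<p>As a small preview, here is a deliberately simplified sketch of such a chatbot, without streaming or display of the reasoning (both covered in part 2). It assumes your AI Endpoints token is available in the <code>OVH_AI_ENDPOINTS_TOKEN</code> environment variable and uses the <code>openai</code> and <code>gradio</code> Python packages:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import os

import gradio as gr
from openai import OpenAI

# Client for the DeepSeek-R1-Distill-Llama-70B endpoint on AI Endpoints.
client = OpenAI(
    base_url="https://deepseek-r1-distill-llama-70b.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1",
    api_key=os.environ.get("OVH_AI_ENDPOINTS_TOKEN"),
)

def respond(message, history):
    # Send only the latest user message; handling the history is left for part 2.
    completion = client.chat.completions.create(
        model="DeepSeek-R1-Distill-Llama-70B",
        messages=[{"role": "user", "content": message}],
        max_tokens=1024,
    )
    return completion.choices[0].message.content

gr.ChatInterface(fn=respond, title="DeepSeek-R1 chatbot").launch()</code></pre>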



<p>Here is a screenshot of the finalized chatbot we will build:</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="723" height="1173" src="https://blog.ovhcloud.com/wp-content/uploads/2025/03/chatbot.png" alt="A screenshot of a chatbot application developed with DeepSeek-R1 and Gradio in Python." class="wp-image-28328" /></figure>



<p>Stay tuned for the next article in this DeepSeek-R1 series. In the meantime, try out DeepSeek-R1 on AI Endpoints and AI Deploy and let us know what you &lt;think&gt;!</p>



<h3 class="wp-block-heading">Resources</h3>



<p>If you want to learn more about DeepSeek-R1 and the topics we covered in this blog post, such as test-time compute, GRPO, reinforcement learning and reasoning models, we suggest having a look at these resources:</p>



<ul class="wp-block-list">
<li><a href="https://arxiv.org/pdf/2501.12948" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">DeepSeek-R1 technical report</a>, by the DeepSeek team</li>



<li><a href="https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">The Illustrated DeepSeek-R1</a>, by Jay Alammar</li>



<li><a href="https://magazine.sebastianraschka.com/p/understanding-reasoning-llms" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Understanding Reasoning LLMs</a>, by Sebastian Raschka</li>



<li><a href="https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">A Visual Guide to Reasoning LLMs</a>, by Maarten Grootendorst</li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeep-dive-into-deepseek-r1-part-1%2F&amp;action_name=Deep%20Dive%20into%20DeepSeek-R1%20%26%238211%3B%20Part%201&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Release of DeepSeek-R1 on OVHcloud AI Endpoints</title>
		<link>https://blog.ovhcloud.com/release-of-deepseek-r1-on-ovhcloud-ai-endpoints/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Fri, 31 Jan 2025 08:11:44 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28106</guid>

					<description><![CDATA[🚀 We are thrilled to announce the release of Deepseek-R1-Distill-Llama-70B on AI Endpoints! Distilled from Deepseek-R1, a powerful model excels in math, coding, and reasoning tasks. With AI Endpoints, you can integrate this model into your applications without needing extensive AI expertise. Our platform is designed with simplicity, security, and data privacy in mind, ensuring [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Frelease-of-deepseek-r1-on-ovhcloud-ai-endpoints%2F&amp;action_name=Release%20of%20DeepSeek-R1%20on%20OVHcloud%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-14.png" alt="" class="wp-image-28114" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-14.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-14-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-14-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-14-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/image-14-70x70.png 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>🚀 We are thrilled to announce the release of <a href="https://endpoints.ai.cloud.ovh.net/models/a011515c-0042-41b2-9a00-ec8b5d34462d" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deepseek-R1-Distill-Llama-70B</a> on <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>!</p>



<p>Distilled from DeepSeek-R1, this powerful model excels in math, coding, and reasoning tasks.</p>



<p>With AI Endpoints, you can integrate this model into your applications without needing extensive AI expertise. Our platform is designed with simplicity, security, and data privacy in mind, ensuring your projects are both innovative and safe.</p>



<p>As you will see in the demo below, DeepSeek-R1 will allow you to create AIs based on the &#8220;chain of thought&#8221; mechanism.</p>



<p>In short, the model will build its answer by breaking down the question into several blocks, as a human would break down a problem into several steps before answering.</p>



<p>You can see the model&#8217;s reasoning at the beginning of the response, between the <strong>&lt;think&gt;</strong> tags.</p>
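

<p>If you want to handle the reasoning separately from the final answer in your own code, a small helper like the sketch below can split the two, assuming the response follows the <strong>&lt;think&gt;</strong>...<strong>&lt;/think&gt;</strong> convention:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">import re

def split_reasoning(response):
    """Split a DeepSeek-R1 response into a (reasoning, answer) pair."""
    match = re.search(r"&lt;think&gt;(.*?)&lt;/think&gt;", response, re.DOTALL)
    if match is None:
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()

reasoning, answer = split_reasoning("&lt;think&gt;The user greets me.&lt;/think&gt;Hello! How can I help you?")
print("Reasoning:", reasoning)
print("Answer:", answer)</code></pre>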



<p>Let&#8217;s see an example of using DeepSeek-R1!</p>



<h3 class="wp-block-heading">Chatbot with DeepSeek-R1 model</h3>



<p>The first step is to get the necessary dependencies. To do this, create a <em>requirements.txt</em> file:</p>



<pre class="wp-block-code"><code lang="python" class="language-python line-numbers">langchain-core==0.3.33
argparse==1.4.0
langchainhub==0.1.21
langchain-openai==0.3.3</code></pre>



<p>And run the following command:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">pip3 install -r requirements.txt</code></pre>



<p>At this step you are ready to develop your chatbot:</p>



<pre class="wp-block-code"><code lang="python" class="language-python line-numbers">import argparse
import time
import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

## Set the OVHcloud AI Endpoints token used to authenticate the calls to the model
_OVH_AI_ENDPOINTS_ACCESS_TOKEN = os.environ.get('OVH_AI_ENDPOINTS_TOKEN')

# Function in charge of calling the LLM model.
# The question parameter is the user's question.
# The function prints the LLM answer as it is streamed.
def chat_completion(new_message: str):
  # Configure the model served on AI Endpoints (OpenAI-compatible API)
  model = ChatOpenAI(model="DeepSeek-R1-Distill-Llama-70B",
                     api_key=_OVH_AI_ENDPOINTS_ACCESS_TOKEN,
                     base_url='https://deepseek-r1-distill-llama-70b.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1',
                     streaming=True)

  prompt = ChatPromptTemplate.from_messages([
    ("system", "You are Nestor, a virtual assistant. Answer the question."),
    ("human", "{question}"),
  ])

  chain = prompt | model

  print("🤖: ")
  # Stream the answer chunk by chunk (note: a dict is passed to stream(), not a set)
  for r in chain.stream({"question": new_message}):
    print(r.content, end="", flush=True)
    time.sleep(0.150)

# Main entrypoint
def main():
  # User input
  parser = argparse.ArgumentParser()
  parser.add_argument('--question', type=str, default="What is the meaning of life?")
  args = parser.parse_args()
  chat_completion(args.question)

if __name__ == '__main__':
    main()</code></pre>



<p>You can try your new assistant with the following command:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash line-numbers">python3 chat-bot-streaming.py --question "What is OVHcloud?"</code></pre>



<figure class="wp-block-video aligncenter"><video height="652" style="aspect-ratio: 1532 / 652;" width="1532" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/Screen-Recording-2025-01-30-at-17.12.14.mov"></video></figure>



<p>And that&#8217;s it!</p>



<p>Don’t hesitate to test our new product, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a>, and give us your feedback.</p>



<p>You have a dedicated Discord channel (#<em>ai-endpoints</em>) on our Discord server (<em><a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://discord.gg/ovhcloud</a></em>), see you there!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Frelease-of-deepseek-r1-on-ovhcloud-ai-endpoints%2F&amp;action_name=Release%20of%20DeepSeek-R1%20on%20OVHcloud%20AI%20Endpoints&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/01/Screen-Recording-2025-01-30-at-17.12.14.mov" length="44697375" type="video/quicktime" />

			</item>
		<item>
		<title>Five ways to develop sovereign, sustainable AI solutions</title>
		<link>https://blog.ovhcloud.com/five-ways-to-develop-sovereign-sustainable-ai-solutions/</link>
		
		<dc:creator><![CDATA[Cezary Skarzynski]]></dc:creator>
		<pubDate>Mon, 27 Jan 2025 15:07:21 +0000</pubDate>
				<category><![CDATA[OVHcloud Startup Program]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Data Sovereignty]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[Startup Program]]></category>
		<category><![CDATA[Sustainability]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=28039</guid>

					<description><![CDATA[Now that organisations understand AI and what it can achieve, businesses around the world are focusing on how to build it responsibly. Three of the five main themes at the Paris AI Action Summit examine the need for responsible AI, with separate streams on trust, public interest and good governance. These themes are not simple. [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffive-ways-to-develop-sovereign-sustainable-ai-solutions%2F&amp;action_name=Five%20ways%20to%20develop%20sovereign%2C%20sustainable%20AI%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Now that organisations understand AI and what it can achieve, businesses around the world are focusing on how to build it responsibly. Three of the five main themes at the <a href="https://www.elysee.fr/en/sommet-pour-l-action-sur-l-ia" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris AI Action Summit</a> examine the need for responsible AI, with separate streams on trust, public interest and good governance.</p>



<p>These themes are not simple. In addition to the core function of AI tools – for example, considering what an AI app does, how it does it, and whether bias is present – most businesses are starting to realise that they need to consider the deeper ‘AI supply chain’.</p>



<p>This is not just altruistic. A number of LLM tools are currently facing the risk of lawsuits for copyright infringement, because they may have been trained without due content permission. AI tools that present biased results are quickly exposed in the press, leading to reputational damage and a loss of customer trust. Some countries also have legislation permitting data usage for economic intelligence purposes – but in another region, this may represent a data breach. AI has also received negative publicity for ‘running hot’ and consuming large amounts of energy and water in datacenters.</p>



<p>However, AI can also be a tremendous force for good – if handled correctly. So, what should businesses be thinking about so that they get the most from AI, without incurring undue commercial or reputational risk?</p>



<h5 class="wp-block-heading"><strong>1- Consider Sovereignty from the Start</strong></h5>



<p>Understand your data ‘supply chain’ from the very beginning of the process. For example, if you’re using an external LLM for a chatbot, where was this developed? Which data was it trained on, and was this data acquired ethically?</p>



<p>“AI can often be a black box when it comes to processing data,” says Lex Avstreikh, Strategy Lead for Stockholm-based AI firm Hopsworks. “It’s far too complex to show how the system arrived at any one decision. But if you can show people the inputs and the outputs, then that goes a long way to building transparency and trust.”</p>



<h5 class="wp-block-heading"><strong>2- Plan for a Sovereign Future</strong></h5>



<p>It’s important to think about where data will be during its future lifecycle – will you be running in an external datacenter, and where will data be in transit and at rest? Where are the headquarters of the datacenter company in question and what does this mean from a regulatory and handling perspective? Perhaps most importantly, will your customers be happy with all of these arrangements?</p>



<p>This was the decision journey faced by Swedish AI firm Ebbot. In July 2020, the Data Protection Commission v. Facebook Ireland case, commonly referred to as Schrems II, resulted in the Court of Justice of the European Union (CJEU) issuing a decision that added more regulations to data protection and processing principles. Ebbot recognised the importance of data security and compliance and thus made it a priority to store and process all data within the EU.</p>



<h5 class="wp-block-heading"><strong>3- Location, location, location</strong></h5>



<p>Location isn’t just an important sovereignty concern – it’s also crucial to sustainability. Although Scandinavia may have very green energy, it’s easy to forget that many cloud providers will offer geographical ‘computing zones’ rather than defined locations, which can result in a less green footprint. CPU- and GPU-intensive tasks like model training should be run in green energy zones wherever possible, and are rarely latency-dependent; consequently, you can locate them far away if necessary.</p>



<p>When your AI app goes into production, also remember that backup and redundancy are a necessity – but will also increase your carbon footprint. Consider having a ‘low power’ or passive backup if commercially feasible – it will take longer to bring online in the case of emergency, but you’ll be consuming less power.</p>



<h5 class="wp-block-heading"><strong>4- Always Consider Necessity</strong></h5>



<p>A lot of organisations only consider hardware efficiency and power consumption during the development process, but green software is rapidly gaining popularity. Having efficient code which is still fit for purpose can have a huge impact on power consumption, particularly if you’re building an app for very broad use. “We’ll definitely see more efficient and specific LLMs, because they’re absolutely needed,” added Avstreikh.</p>



<p>Although organisations often consider the cost of development through FinOps initiatives, we are also seeing the dawn of GreenOps, which ensures that technology is as green as possible from end to end. To that effect, consider benchmarking the CPU and memory usage of your application, because less hardware-intensive apps are usually less power-hungry.</p>



<h5 class="wp-block-heading"><strong>5- Re-use, recycle</strong></h5>



<p>Developing bespoke code can make sure that it’s as lean and efficient as possible, but it can also use needless computing power to develop. Many technology organisations will offer PaaS offerings that can automate common parts of the application development and deployment process. For example, consider our <a href="https://ovh.commander1.com/c3/?tcs=3810&amp;chn=display&amp;src=partnership&amp;cty_ads=multi&amp;lang_ads=en&amp;cty=US&amp;unvrse=multi&amp;pcat=multi&amp;subtpc=undefinite&amp;tactic=awrns&amp;objv=impressions&amp;site_domain=https://labs.ovhcloud.com&amp;cmp=display_PR_multi_en_US_multi_multi_undefinite_awrns_impressions&amp;crtive=dimg_image_728x90_STN-NE&amp;url=https%3A%2F%2Flabs.ovhcloud.com%2Fen%2Fai-endpoints%2F%3Fat_medium%3Ddisplay%26at_campaign%3Dpartnership%26at_creation%3Ddisplay_PR_multi_en_US_multi_multi_undefinite_awrns_impressions%26at_variant%3Ddimg_image_728x90_STN-NE" data-wpel-link="exclude">AI Endpoints solution</a>, which helps developers to access other AI models, from Bert to Mistral to Llama, all using a simple API.</p>



<p>This is not an easy process, but establishing responsible AI conduct in your organisation’s DNA will avoid complications further down the road, and also show customers that you are considering data – including theirs – in a responsible, secure way. With increasing numbers of organisations tracking not only their scope 3 emissions, but also their data supply chains in a more comprehensive fashion, sovereignty and sustainability are two clear ‘musts’ for any modern AI company.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<figure class="wp-block-image aligncenter size-large is-resized"><a href="https://startup.ovhcloud.com/en/accelerator/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" width="1024" height="253" src="https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-1024x253.png" alt="" class="wp-image-28042" style="width:626px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-1024x253.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-300x74.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-768x190.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner-1536x379.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/01/FF-banner.png 1870w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>






<p><em>If you’re a startup or scale-up building an AI solution, and would like to work with a sovereign, sustainable cloud provider in turn, you can find more information about OVHcloud – including our cloud credit scheme – on our <a href="https://ovh.commander1.com/c3/?tcs=3810&amp;chn=display&amp;src=partnership&amp;cty_ads=multi&amp;lang_ads=en&amp;cty=US&amp;unvrse=multi&amp;pcat=multi&amp;subtpc=undefinite&amp;tactic=awrns&amp;objv=impressions&amp;site_domain=https://startup.ovhcloud.com&amp;cmp=display_PR_multi_en_US_multi_multi_undefinite_awrns_impressions&amp;crtive=dimg_image_728x90_STN-NE&amp;url=https%3A%2F%2Fstartup.ovhcloud.com%2Fen%2F%3Fat_medium%3Ddisplay%26at_campaign%3Dpartnership%26at_creation%3Ddisplay_PR_multi_en_US_multi_multi_undefinite_awrns_impressions%26at_variant%3Ddimg_image_728x90_STN-NE" data-wpel-link="exclude">startup hub</a>.</em></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffive-ways-to-develop-sovereign-sustainable-ai-solutions%2F&amp;action_name=Five%20ways%20to%20develop%20sovereign%2C%20sustainable%20AI%20solutions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
