OVHcloud AI Endpoints: Batch Mode guide

Let’s say you have 20,000 support tickets to classify before tomorrow morning, or a full product catalog to translate without manually sending each request one by one. That kind of workload can quickly become slow, repetitive and difficult to manage.

Batch Mode is designed to help in exactly this type of scenario.

What is Batch Mode?

When working with LLMs, you often send requests one by one through synchronous endpoints like /v1/chat/completions or /v1/responses. This works fine for real-time use cases, but what can you do if you need to process hundreds or thousands of prompts? Sending them individually is slow, and you quickly run into rate limits.

Batch mode solves this problem. Instead of sending requests one at a time, you upload a file containing all your requests, submit a batch job, and get the results back asynchronously, within a maximum of 24 hours. And here’s the cherry on top: batch mode is 50% cheaper than synchronous requests. Since the platform can schedule your workload more efficiently, you benefit from a significant cost reduction.

This is ideal for:

📊 Bulk classification or summarization tasks
🌍 Large-scale translation jobs
📝 Generating descriptions for a product catalog
🧪 Evaluating model outputs on a test dataset

ℹ️ The Batch API is compatible with the OpenAI Batch API format, so you can use the official OpenAI SDK to interact with it.

When not to use Batch Mode!

Batch Mode is designed for large workloads that do not need an immediate response. It is therefore not the right choice for real-time use cases such as chatbots, live customer support, interactive assistants, or any application where users expect an answer within seconds. For those scenarios, synchronous endpoints remain more appropriate. Use Batch Mode when your requests can be processed asynchronously and retrieved later.

ℹ️ The Batch API is currently in beta. You can find more information about the beta on the dedicated page.

Prerequisites for using Batch Mode

Before getting started, you’ll need:

An AI Endpoints API key
Python 3.10+ installed
The openai Python package

⚠️ You can generate your API key from the AI Endpoints console.

Install the dependency:

pip install openai

Set up your environment variables:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN='your_api_key'
export OVH_AI_ENDPOINTS_BASE_URL='https://oai.endpoints.kepler.ai.cloud.ovh.net/v1'
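Before running the script in Step 2, it can help to fail fast when one of these variables is missing. The snippet below is a minimal sketch of such a check; the helper name load_settings is hypothetical and is not part of the OpenAI SDK or of AI Endpoints.

```python
import os


def load_settings(env=os.environ):
    """Return (base_url, api_key), or raise if a variable is missing or empty."""
    names = ("OVH_AI_ENDPOINTS_BASE_URL", "OVH_AI_ENDPOINTS_ACCESS_TOKEN")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

Calling base_url, api_key = load_settings() at the top of a script gives a readable error message instead of a bare KeyError.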

Step 1: Prepare the Input File

The input file uses the JSON Lines format (.jsonl). Each line is a self-contained request with a unique custom_id that lets you match results to their original requests.

Here’s an example requests.jsonl:

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Summarise the plot of Hamlet in two sentences."}]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Translate 'Good morning' into French, Spanish and German."}]}}
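For thousands of prompts you will probably want to generate this file rather than write it by hand. Here is a minimal sketch that does so with the standard library only; the function names build_request_lines and write_requests are hypothetical, and the request-N numbering scheme simply mirrors the example above.

```python
import json


def build_request_lines(prompts, model="gpt-oss-20b"):
    """Turn a list of prompts into JSONL lines, one request per line."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"request-{index}",  # unique within the batch
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        # json.dumps escapes newlines inside prompts, so each request
        # stays on a single line; ensure_ascii=False keeps accents readable.
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


def write_requests(path, prompts, model="gpt-oss-20b"):
    """Write the generated requests to a .jsonl file."""
    with open(path, "w", encoding="utf-8") as f:
        for line in build_request_lines(prompts, model):
            f.write(line + "\n")
```

For example, write_requests("requests.jsonl", ticket_texts) would produce one request per ticket, where ticket_texts is your own list of strings.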

Key points:

Each custom_id must be unique within a batch
The model field must reference a model available in the AI Endpoints catalog
The url field indicates which endpoint to call
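These points can be checked locally before uploading anything. The following is a sketch of such a pre-flight check, not an official validator: the function name validate_request_lines is hypothetical, and it assumes every request in the file targets the same endpoint as the batch. It cannot tell whether a model really exists in the catalog, only whether the field is present.

```python
import json


def validate_request_lines(lines, expected_url="/v1/chat/completions"):
    """Return a list of human-readable problems found in JSONL request lines."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            request = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(request, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        custom_id = request.get("custom_id")
        if not custom_id:
            problems.append(f"line {number}: missing custom_id")
        elif custom_id in seen_ids:
            problems.append(f"line {number}: duplicate custom_id {custom_id!r}")
        else:
            seen_ids.add(custom_id)
        if request.get("url") != expected_url:
            problems.append(f"line {number}: url is not {expected_url}")
        body = request.get("body")
        if not isinstance(body, dict) or not body.get("model"):
            problems.append(f"line {number}: body.model is missing")
    return problems
```

An empty list means the file passed these basic checks; anything else is worth fixing before you submit the batch.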

Step 2: Upload the File and Create the Batch

Here’s the complete Python code for the full workflow (upload, create, poll, and download):

import os
import time

from openai import OpenAI

# Load environment variables
_OVH_AI_ENDPOINTS_ACCESS_TOKEN = os.environ["OVH_AI_ENDPOINTS_ACCESS_TOKEN"]
_OVH_AI_ENDPOINTS_BASE_URL = os.environ["OVH_AI_ENDPOINTS_BASE_URL"]

# Initialize the OpenAI-compatible client targeting OVHcloud AI Endpoints
client = OpenAI(
    base_url=_OVH_AI_ENDPOINTS_BASE_URL,
    api_key=_OVH_AI_ENDPOINTS_ACCESS_TOKEN,
)

# 1. Upload the input JSONL file with purpose="batch"
print("📤 Uploading input file...")
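# The context manager closes the file handle once the upload is done
with open("requests.jsonl", "rb") as input_file:
    batch_input_file = client.files.create(
        file=input_file,
        purpose="batch",
    )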
print(f"✅ Uploaded file id: {batch_input_file.id}")

# 2. Create the batch referencing the uploaded file
print("🚀 Creating batch...")
batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={"description": "Batch mode example - OVHcloud AI Endpoints"},
)
print(f"✅ Batch created: {batch.id} (status: {batch.status})")

# 3. Poll the batch status until it reaches a terminal state
print("⏳ Polling batch status...")
while True:
    current = client.batches.retrieve(batch.id)
    print(f"   status={current.status} counts={current.request_counts}")
    if current.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(30)

# 4. Download the results (and errors if any)
final = client.batches.retrieve(batch.id)

if final.output_file_id:
    print("📥 Downloading results.jsonl...")
    output = client.files.content(final.output_file_id)
    with open("results.jsonl", "wb") as f:
        f.write(output.read())
    print("✅ Results written to results.jsonl")

if final.error_file_id:
    print("🐛 Downloading errors.jsonl...")
    errors = client.files.content(final.error_file_id)
    with open("errors.jsonl", "wb") as f:
        f.write(errors.read())
    print("🐛 Errors written to errors.jsonl")

print(f"🏁 Final batch status: {final.status}")

Let’s break down the key steps:

Upload the input file

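# The context manager closes the file handle once the upload is done
with open("requests.jsonl", "rb") as input_file:
    batch_input_file = client.files.create(
        file=input_file,
        purpose="batch",
    )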

The purpose="batch" parameter tells the API that this file will be used as batch input.

Create the batch

batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

The completion_window="24h" setting means the batch expires if it has not completed within 24 hours.

Poll the batch status

while True:
    current = client.batches.retrieve(batch.id)
    print(f"   status={current.status} counts={current.request_counts}")
    if current.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(30)

The client.batches.retrieve(batch.id) call returns the current state of the batch. The request_counts field gives you a breakdown of how many requests are completed, failed, or still in progress, which is useful for monitoring large batches.

The possible terminal states are:

completed: the batch has finished processing; individual requests may still have failed, in which case they are reported in the error file
failed: the batch encountered a fatal error
expired: the batch exceeded the completion_window duration
cancelled: the batch was manually cancelled via the API

We poll every 30 seconds here, but you can adjust this interval depending on your use case. For very large batches, a longer interval (e.g., 60–120 seconds) is more appropriate.
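If you prefer not to pick a fixed interval, one option is to start short and let the wait grow up to a ceiling, with an overall timeout as a safety net. The sketch below illustrates that idea; wait_for_batch, its parameters and the 1.5 growth factor are all hypothetical choices, not something prescribed by the API.

```python
import time

TERMINAL_STATES = ("completed", "failed", "expired", "cancelled")


def wait_for_batch(retrieve, batch_id, interval=30.0, max_interval=120.0,
                   timeout=24 * 3600, sleep=time.sleep, clock=time.monotonic):
    """Poll until the batch reaches a terminal state or the timeout elapses.

    `retrieve` is any callable that takes a batch id and returns an object
    with a `status` attribute, for example `client.batches.retrieve`.
    """
    deadline = clock() + timeout
    while True:
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATES:
            return batch
        if clock() >= deadline:
            raise TimeoutError(
                f"batch {batch_id} still {batch.status} after {timeout}s"
            )
        sleep(interval)
        # Grow the wait by 50% each round, capped at max_interval
        interval = min(interval * 1.5, max_interval)
```

With the client from the script above, final = wait_for_batch(client.batches.retrieve, batch.id) would replace the while loop. Passing the retrieve function in as an argument also makes the helper easy to test without any network access.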

Download the results

final = client.batches.retrieve(batch.id)

if final.output_file_id:
    output = client.files.content(final.output_file_id)
    with open("results.jsonl", "wb") as f:
        f.write(output.read())

Once the batch is complete, the output_file_id field contains the ID of the results file. You can download it with client.files.content(), which returns the raw file content.

Download the errors (if any)

if final.error_file_id:
    errors = client.files.content(final.error_file_id)
    with open("errors.jsonl", "wb") as f:
        f.write(errors.read())

If some requests in your batch failed (e.g., invalid model name, malformed input, token limit exceeded), their details will be available in a separate error file. The error_file_id will be None if all requests succeeded. Each line in errors.jsonl contains the custom_id of the failed request along with the error details, making it easy to identify and retry only the failed ones.
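To retry only the failed requests, you can match the custom_id values from errors.jsonl against your original input file. Here is a minimal sketch of that matching step. The function names are hypothetical, and the code relies only on each error line carrying a custom_id, as described above; it makes no assumption about the rest of the error object.

```python
import json


def failed_custom_ids(error_lines):
    """Collect the custom_id of every entry in the error file."""
    ids = set()
    for line in error_lines:
        if line.strip():
            ids.add(json.loads(line)["custom_id"])
    return ids


def build_retry_lines(request_lines, error_lines):
    """Keep only the original requests whose custom_id appears in the errors."""
    failed = failed_custom_ids(error_lines)
    return [
        line for line in request_lines
        if line.strip() and json.loads(line)["custom_id"] in failed
    ]
```

The returned lines can be written to a new .jsonl file and submitted as a fresh batch, ideally after fixing whatever caused the failures in the first place.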

Step 3: Read the Results

The output file (results.jsonl) contains one JSON object per line. Each object includes:

The custom_id matching your original request
The full response body (same format as a synchronous /v1/chat/completions response)

Here’s what a result looks like:

{
  "id": "964e007472a557240221910ba143bb03",
  "custom_id": "request-1",
  "response": {
    "status_code": 200,
    "body": {
      "id": "chatcmpl-9879ebff777795a3",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Hamlet, the Prince of Denmark, is driven to madness and vengeance after learning that his father was murdered by his uncle Claudius..."
          },
          "finish_reason": "stop"
        }
      ],
      "model": "gpt-oss-20b",
      "usage": {
        "prompt_tokens": 78,
        "completion_tokens": 297,
        "total_tokens": 375
      }
    }
  },
  "error": null
}

If some requests fail, the errors.jsonl file will contain details about what went wrong for each failed request.
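In practice you will usually want the answers keyed by custom_id rather than as raw JSON lines. The sketch below shows one way to do that, based on the result structure shown above; parse_results and total_tokens are hypothetical helper names, and treating any status_code other than 200 as a failure is an assumption on my part.

```python
import json


def parse_results(result_lines):
    """Map each custom_id to the assistant's text; collect failures separately."""
    answers, failures = {}, {}
    for line in result_lines:
        if not line.strip():
            continue
        item = json.loads(line)
        response = item.get("response") or {}
        if item.get("error") or response.get("status_code") != 200:
            failures[item["custom_id"]] = item.get("error") or response
            continue
        message = response["body"]["choices"][0]["message"]
        answers[item["custom_id"]] = message["content"]
    return answers, failures


def total_tokens(result_lines):
    """Sum usage.total_tokens over all results that report usage."""
    total = 0
    for line in result_lines:
        if line.strip():
            body = (json.loads(line).get("response") or {}).get("body") or {}
            total += (body.get("usage") or {}).get("total_tokens", 0)
    return total
```

Reading the file with open("results.jsonl", encoding="utf-8") and passing the file object to parse_results works as is, since iterating over a text file yields one line at a time.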

Other Examples Available

The AI Endpoints – Batch mode guide also contains examples in:

JavaScript: using the OpenAI Node.js SDK
Pure HTTP requests: using curl without any framework, if you prefer a language-agnostic approach

These examples demonstrate that you can use the Batch API from any language or tool that can make HTTP requests, since it follows the standard OpenAI-compatible API format.
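To give an idea of what happens at the HTTP level, here is a sketch in Python that builds the batch creation request with the standard library only, without sending it. It assumes the route is {base_url}/batches with a Bearer token, which is what the OpenAI SDK calls under the hood; the function name build_create_batch_request is hypothetical.

```python
import json
import urllib.request


def build_create_batch_request(base_url, api_key, input_file_id):
    """Build (but do not send) the HTTP request that creates a batch."""
    payload = {
        "input_file_id": input_file_id,
        "endpoint": "/v1/chat/completions",
        "completion_window": "24h",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() would actually send the request. Any HTTP client in any language can produce the same call.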

Conclusion

Batch mode is a powerful feature when you need to process large volumes of repetitive, non-time-sensitive inference requests without worrying about rate limits or timeout issues. Upload your file, submit the batch, and come back later for the results: it’s as simple as that.

The OpenAI-compatible API makes it straightforward to integrate into existing workflows, and with examples available in Python, JavaScript, and raw HTTP, you can use whichever approach fits your stack best.

A dedicated channel (#ai-endpoints) is available on our Discord server. See you there!

For more information on AI Endpoints, see our previous blog posts.

Find the full code example in the GitHub repository: public-cloud-examples/ai/ai-endpoints/batch-mode.

Stéphane Philippart
