Fine tune an LLM with Axolotl and OVHcloud Machine Learning Services

A robot with a car tuning style

You can add knowledge to a model in several ways 📚: detailed prompts, system prompts, Retrieval Augmented Generation, function calling, …

The one we are interested in for this blog post is fine tuning ✨!

We already wrote a blog post on how to fine tune a Llama model two years ago, but as you will see, it is much easier today 😉. This time, we will use a framework, Axolotl, so there is less to manage ourselves.

What do we want to do?

For this blog post, I want to specialize a small model, for example a Llama-3.2-1B-Instruct model. My goal is to be able to ask it a few questions about our OVHcloud AI Endpoints product 📝.

Before fine tuning the model, let’s try it! Deploying a Hugging Face model is very simple, thanks to AI Deploy from AI Machine Learning Services 🥳.

And thanks to a previous blog post, we know how to use vLLM and AI Deploy.

ovhai app run --name $1 \
	--flavor l40s-1-gpu \
	--gpu 2 \
	--default-http-port 8000 \
	--env OUTLINES_CACHE_DIR=/tmp/.outlines \
	--env HF_TOKEN=$MY_HUGGING_FACE_TOKEN \
	--env HF_HOME=/hub \
	--env HF_DATASETS_TRUST_REMOTE_CODE=1 \
	--env HF_HUB_ENABLE_HF_TRANSFER=0 \
	--volume standalone:/hub:rw \
	--volume standalone:/workspace:rw \
	vllm/vllm-openai:v0.8.2 \
	-- bash -c "vllm serve meta-llama/Llama-3.2-1B-Instruct"

⚠️ Make sure you have accepted the license of the model you want to use on Hugging Face ⚠️

If you need more information, please refer to the mentioned blog post which explains in detail all the parameters of the command.
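
Before moving on to a UI, you can quickly check that the deployed model answers. Here is a minimal sketch with curl, assuming <your-app-url> is the URL of your AI Deploy app (vLLM exposes an OpenAI-compatible API on port 8000; depending on your app's access settings you may also need an Authorization header):

# Quick sanity check of the vLLM server deployed with AI Deploy.
# <your-app-url> is a placeholder: replace it with the URL of your AI Deploy app.
curl https://<your-app-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [
          {"role": "user", "content": "How many requests by minutes can I do with AI Endpoints?"}
        ]
      }'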

To test our different chatbots, we will use a simple Gradio application:

# Application to compare answers generation from OVHcloud AI Endpoints exposed model and fine tuned model.
# ⚠️ Do not use in production!! ⚠️

import gradio as gr
import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# 📜 Prompts templates 📜
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "{system_prompt}"),
        ("human", "{user_prompt}"),
    ]
)

def chat(prompt, system_prompt, temperature, top_p, model_name, model_url, api_key):
    """
    Function to generate a chat response using the provided prompt, system prompt, temperature, top_p, model name, model URL and API key.
    """

    # ⚙️ Initialize the OpenAI model ⚙️
    llm = ChatOpenAI(api_key=api_key, 
                 model=model_name, 
                 base_url=model_url,
                 temperature=temperature,
                 top_p=top_p
                 )

    # 📜 Apply the prompt to the model 📜
    chain = prompt_template | llm
    ai_msg = chain.invoke(
        {
            "system_prompt": system_prompt,
            "user_prompt": prompt
        }
    )

    # 🤖 Return answer in a compatible format for Gradio component.
    return [{"role": "user", "content": prompt}, {"role": "assistant", "content": ai_msg.content}]

# 🖥️ Main application 🖥️
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            system_prompt = gr.Textbox(value="""You are a specialist on OVHcloud products.
If you can't find any sure and relevant information about the product asked, answer with "This product doesn't exist in OVHcloud".""",
                label="🧑‍🏫 System Prompt 🧑‍🏫")
            temperature = gr.Slider(minimum=0.0, maximum=2.0, step=0.01, label="Temperature", value=0.5)
            top_p = gr.Slider(minimum=0.0, maximum=1.0, step=0.01, label="Top P", value=0.0)
            model_name = gr.Textbox(label="🧠 Model Name 🧠", value='Llama-3.1-8B-Instruct')
            model_url = gr.Textbox(label="🔗 Model URL 🔗", value='https://oai.endpoints.kepler.ai.cloud.ovh.net/v1')
            api_key = gr.Textbox(label="🔑 OVH AI Endpoints Access Token 🔑", value=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"), type="password")

        with gr.Column():
            chatbot = gr.Chatbot(type="messages", label="🤖 Response 🤖")
            prompt = gr.Textbox(label="📝 Prompt 📝", value='How many requests by minutes can I do with AI Endpoints?')
            submit = gr.Button("Submit")

    submit.click(chat, inputs=[prompt, system_prompt, temperature, top_p, model_name, model_url, api_key], outputs=chatbot)

demo.launch()

ℹ️ You can find all resources to build and run this application in the dedicated folder in the GitHub repository.
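
If you want to run the comparison application locally, here is a minimal sketch of the setup (the file name compare_app.py is hypothetical, use the one from the GitHub repository; the app reads the AI Endpoints token from the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable):

# Install the dependencies used by the application (Gradio + LangChain OpenAI integration).
pip install gradio langchain-openai

# The application reads the AI Endpoints token from this environment variable.
export OVH_AI_ENDPOINTS_ACCESS_TOKEN="<your-token>"

# Hypothetical file name: use the one from the GitHub repository.
python compare_app.py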

Let’s test with a simple question: “How many requests by minutes can I do with AI Endpoints?”.
The first test is with Llama-3.2-1B-Instruct from Hugging Face deployed with vLLM and OVHcloud AI Deploy.

Ask for AI Endpoints rate limit with a Llama-3.2-1B-Instruct model

As you can see, the result is not the expected one 😅.

For your information, the correct answer, from the official OVHcloud guide, is:
Anonymous: 2 requests per minute, per IP and per model.
Authenticated with an API access key: 400 requests per minute, per Public Cloud project and per model.

How to add fresh data to the model?

As you know, you can bring additional data in at the inference step using Retrieval Augmented Generation (RAG). To find out how to implement RAG, take a look at a previous blog post 📗.

Another way to add fresh data to a model is fine tuning ✨.

In a few words, fine tuning is the process of taking a pre-trained machine learning model and adapting it to a specific task or dataset by training it on additional data. It saves time and complexity compared to creating a model from scratch 😉.

In our case, I’ll take the Llama-3.2-1B-Instruct model from Hugging Face as the base model.

ℹ️ The more parameters your base model has, the more computing power you need. In this case, the model requires between 3GB and 4GB of memory. That’s why I chose to use a single L4 GPU (we need an Ampere compatible architecture).

When data is your gold

Adding knowledge to a model needs good quality data in sufficient quantity.

The first part is easy: I take the official OVHcloud AI Endpoints documentation, in Markdown format, from our public cloud documentation repository (did you know that you can participate?) 📚.

The first step is to create a dataset in the right format; Axolotl offers several dataset formats. I chose the conversation format (I find it easier for my use case 😉).

{
   "messages": [
     {"role": "...", "content": "..."}, 
     {"role": "...", "content": "..."}, 
     ...]
}

And to avoid creating it manually while still adding relevant information, I use an LLM to transform the Markdown data into a well formed dataset 🤖.

Here is the Python script used 🐍:

import os
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

# 🗺️ Define the JSON schema for the response 🗺️
message_schema = {
    "type": "object",
    "properties": {
        "role": {"type": "string"},
        "content": {"type": "string"}
    },
    "required": ["role", "content"]
}

response_format = {
    "type": "json_object",
    "json_schema": {
        "name": "Messages",
        "description": "A list of messages with role and content",
        "properties": {
            "messages": {
                "type": "array",
                "items": message_schema
            }
        }
    }
}

# ⚙️ Initialize the chat model with AI Endpoints configuration ⚙️
chat_model = ChatOpenAI(
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
    base_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),
    model_name=os.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"),
    temperature=0.0
)

# 📂 Define the directory path 📂
directory_path = "docs/pages/public_cloud/ai_machine_learning"
directory = Path(directory_path)

# 🗃️ Walk through the directory and its subdirectories 🗃️
for path in directory.rglob("*"):
    # Check if the current path is a directory
    if path.is_dir():
        # Get the name of the subdirectory
        sub_directory = path.name

        # Construct the path to the "guide.en-gb.md" file in the subdirectory
        guide_file_path = path / "guide.en-gb.md"

        # Check if the "guide.en-gb.md" file exists in the subdirectory
        if "endpoints" in sub_directory and guide_file_path.exists():
            print(f"📗 Guide processed: {sub_directory}")
            with open(guide_file_path, 'r', encoding='utf-8') as file:
                raw_data = file.read()

            user_message = HumanMessage(content=f"""
With the markdown following, generate a JSON file composed as follows: a list named "messages" composed of tuples with a key "role" which can have the value "user" when it's the question and "assistant" when it's the response. To split the document, base it on the markdown chapter titles to create the question, seems like a good idea.
Keep the language English.
I don't need to know the code to do it but I want the JSON result file.
For the "user" field, don't just repeat the title but make a real question, for example "What are the requirements for OVHcloud AI Endpoints?"
Be sure to add OVHcloud with AI Endpoints so that it's clear that OVHcloud creates AI Endpoints.
Generate the entire JSON file.
An example of what it should look like: messages [{{"role":"user", "content":"What is AI Endpoints?"}}]
There must always be a question followed by an answer, never two questions or two answers in a row.
The source markdown file:
{raw_data}
""")
            chat_response = chat_model.invoke([user_message], response_format=response_format)
            
            with open(f"./generated/{sub_directory}.json", 'w', encoding='utf-8') as output_file:
                output_file.write(chat_response.content)
                print(f"✅ Dataset generated: ./generated/{sub_directory}.json")

ℹ️ You can find all resources to build and run this application in the dedicated folder in the GitHub repository.
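
To run this script, the AI Endpoints environment variables used above must be set and the output folders must exist. A minimal sketch, assuming the script is saved as generate_dataset.py (hypothetical name) and that you use, for example, the same model and URL as in the Gradio application:

# Hypothetical values: use your own AI Endpoints access token.
export OVH_AI_ENDPOINTS_ACCESS_TOKEN="<your-token>"
export OVH_AI_ENDPOINTS_MODEL_URL="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1"
export OVH_AI_ENDPOINTS_MODEL_NAME="Llama-3.1-8B-Instruct"

# The generation and augmentation scripts write their output in these folders.
mkdir -p generated generated/synthetic

# Hypothetical file name: use the one from the GitHub repository.
python generate_dataset.py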

Here is an extract of the kind of file created as dataset:

{
  "messages": [
    {
      "role": "user",
      "content": "What are the requirements for using OVHcloud AI Endpoints?"
    },
    {
      "role": "assistant",
      "content": "To use OVHcloud AI Endpoints, you need the following: \n1. A Public Cloud project in your OVHcloud account \n2. A payment method defined on your Public Cloud project. Access keys created from Public Cloud projects in Discovery mode (without a payment method) cannot use the service."
    },
    {
      "role": "user",
      "content": "What are the rate limits for using OVHcloud AI Endpoints?"
    },
    {
      "role": "assistant",
      "content": "The rate limits for OVHcloud AI Endpoints are as follows:\n- Anonymous: 2 requests per minute, per IP and per model.\n- Authenticated with an API access key: 400 requests per minute, per PCI project and per model."
    },
    ...
  ]
}

Concerning the quantity, it’s more complicated: how do you generate enough relevant data for training without degrading its quality?

To do this, I created synthetic data, using an LLM to generate it from the original data. The “trick” is to produce more data on the same subject by rewording it without losing meaning.

Here is the Python script 🐍 to do the data augmentation:

import os
import json
import uuid
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage
from jsonschema import validate, ValidationError

# 🗺️ Define the JSON schema for the response 🗺️
message_schema = {
    "type": "object",
    "properties": {
        "role": {"type": "string"},
        "content": {"type": "string"}
    },
    "required": ["role", "content"]
}

response_format = {
    "type": "json_object",
    "json_schema": {
        "name": "Messages",
        "description": "A list of messages with role and content",
        "properties": {
            "messages": {
                "type": "array",
                "items": message_schema
            }
        }
    }
}

# ✅ JSON validity verification ❌
def is_valid(json_data):
    """
    Test the validity of the JSON data against the schema.
    Argument:
        json_data (dict): The JSON data to validate.  
    Raises:
        ValidationError: If the JSON data does not conform to the specified schema.  
    """
    try:
        validate(instance=json_data, schema=response_format["json_schema"])
        return True
    except ValidationError as e:
        print(f"❌ Validation error: {e}")
        return False

# ⚙️ Initialize the chat model with AI Endpoints configuration ⚙️
chat_model = ChatOpenAI(
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
    base_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),
    model_name=os.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"),
    temperature=0.0
)

# 📂 Define the directory path 📂
directory_path = "generated"
print(f"📂 Directory path: {directory_path}")
directory = Path(directory_path)

# 🗃️ Walk through the directory and its subdirectories 🗃️
for path in directory.rglob("*"):
    print(f"📜 Processing file: {path}")
    # Check if the current path is a valid file
    if path.is_file() and "endpoints" in path.name:
        # Read the raw data from the file
        with open(path, 'r', encoding='utf-8') as file:
            raw_data = file.read()

        try:
            json_data = json.loads(raw_data)
        except json.JSONDecodeError:
            print(f"❌ Failed to decode JSON from file: {path.name}")
            continue

        if not is_valid(json_data):
            print(f"❌ Dataset non valide: {path.name}")
            continue
        print(f"✅ Input dataset valide: {path.name}")

        user_message = HumanMessage(content=f"""
        Given the following JSON, generate a similar JSON file where you paraphrase each question in the content attribute
        (when the role attribute is user) and also paraphrase the value of the response to the question stored in the content attribute
        when the role attribute is assistant.
        The objective is to create synthetic datasets based on existing datasets.
        I do not need to know the code to do this, but I want the resulting JSON file.
        It is important that the term OVHcloud is present as much as possible, especially when the terms AI Endpoints are mentioned
        either in the question or in the response.
        There must always be a question followed by an answer, never two questions or two answers in a row.
        It is IMPERATIVE to keep the language in English.
        The source JSON file:
        {raw_data}
        """)

        chat_response = chat_model.invoke([user_message], response_format=response_format)

        output = chat_response.content

        # Replace unauthorized characters
        output = output.replace("\\t", " ")

        generated_file_name = f"{uuid.uuid4()}_{path.name}"
        with open(f"./generated/synthetic/{generated_file_name}", 'w', encoding='utf-8') as output_file:
            output_file.write(output)

        if not is_valid(json.loads(output)):
            print(f"❌ ERROR: File {generated_file_name} is not valid")
        else:
            print(f"✅ Successfully generated file: {generated_file_name}")

ℹ️ Again, you can find all resources to build and run this application in the dedicated folder in the GitHub repository.

Fine tune the model

Yes, it’s time to do the fine tuning: we now have enough data to train the model!

ℹ️ There is no real metric for how much data is needed to train a model well. It depends on the model, the data itself, the subject, …
The only thing to do is to test and adapt 🔁.
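
If you want a rough idea of how many question/answer pairs you have before launching a training run, you can count them. A quick sketch, assuming jq is installed and that every generated file follows the {"messages": [...]} format shown above:

# Count the "user" messages (i.e. the questions) across all generated datasets.
jq -s '[.[].messages[] | select(.role == "user")] | length' generated/*.json generated/synthetic/*.json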

To fine tune my model, I use a Jupyter notebook created with OVHcloud AI Notebooks.

ovhai notebook run conda jupyterlab \
	--name axolto-llm-fine-tune \
	--framework-version 25.3.1-py312-cudadevel128-gpu \
	--flavor l4-1-gpu \
	--gpu 1 \
	--envvar HF_TOKEN=$MY_HF_TOKEN \
	--envvar WANDB_TOKEN=$MY_WANDB_TOKEN \
	--unsecure-http

ℹ️ If you want more information on how to create a Jupyter notebook with AI Notebooks, please read the dedicated documentation.

⚙️ The HF_TOKEN environment variable is used to pull the base model from, and push the trained model to, Hugging Face
⚙️ The WANDB_TOKEN environment variable is used to follow the training quality in Weights & Biases

Once the notebook is created, it’s time to write the code to train the model, thanks to Axolotl.

The first step is to install the Axolotl CLI and its dependencies 🧰.

# Axolotl needs these dependencies
!pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# Axolotl CLI installation
!pip install --no-build-isolation axolotl[flash-attn,deepspeed]

# Verify Axolotl version and installation
!axolotl --version

The next step is the Hugging Face CLI configuration 🤗.

!pip install -U "huggingface_hub[cli]"

!huggingface-cli --version
import os
from huggingface_hub import login

login(os.getenv("HF_TOKEN"))

Then, you can configure your Weights & Biases access.

!pip install wandb

!wandb login $WANDB_TOKEN

Now it’s time to train the model.

!axolotl train /workspace/instruct-lora-1b-ai-endpoints.yml

This is the only line to type to train the model, pretty cool 😎.

ℹ️ With one L4 card, 10 epochs and ~2000 questions and answers in the dataset, it took around an hour and a half.

As you can see, the command line takes one parameter: the Axolotl configuration file. You can find full details on how to configure Axolotl in the official documentation 📜.
This is the one used to train the model:

base_model: meta-llama/Llama-3.2-1B-Instruct
# optionally might have model_type or tokenizer_type
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

load_in_8bit: true
load_in_4bit: false

datasets:
  - path: /workspace/ai-endpoints-doc/
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

dataset_prepared_path:
val_set_size: 0.01
output_dir: /workspace/out/llama-3.2-1b-ai-endpoints

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

wandb_project: ai_endpoints_training
wandb_entity: <user id>
wandb_mode: 
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 10
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: false

gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:
   pad_token: <|end_of_text|>

🔎 A few key points (only the fields modified from the provided templates):
base_model: meta-llama/Llama-3.2-1B-Instruct: the base model downloaded from Hugging Face; be sure to have accepted its license before downloading it
path: /workspace/ai-endpoints-doc/: the folder where the generated dataset is uploaded
wandb_project: ai_endpoints_training & wandb_entity: <user id>: the Weights & Biases configuration
num_epochs: 10: the number of epochs for the training

After the training you can test the new model 🤖:

!echo "What is OVHcloud AI Endpoints and how to use it?" | axolotl inference /workspace/instruct-lora-1b-ai-endpoints.yml --lora-model-dir="/workspace/out/llama-3.2-1b-ai-endpoints" 

Once you have the desired behavior, you can merge the weights and push the new model to Hugging Face:

!axolotl merge-lora /workspace/instruct-lora-1b-ai-endpoints.yml

%cd /workspace/out/llama-3.2-1b-ai-endpoints/merged

!huggingface-cli upload wildagsx/Llama-3.2-1B-Instruct-AI-Endpoints-v0.6 .

ℹ️ You can find all resources to create and run the notebook in the dedicated folder in the GitHub repository.

Test the new model

Once you have pushed your model to Hugging Face, you can, once again, deploy it with vLLM and AI Deploy to test it ⚡️.
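
Here is a sketch adapted from the deployment command at the beginning of this post, with the pushed model name and a hypothetical app name (a single GPU is plenty for this 1B model):

ovhai app run --name llama-ai-endpoints-tuned \
	--flavor l40s-1-gpu \
	--gpu 1 \
	--default-http-port 8000 \
	--env HF_TOKEN=$MY_HUGGING_FACE_TOKEN \
	--env HF_HOME=/hub \
	--volume standalone:/hub:rw \
	vllm/vllm-openai:v0.8.2 \
	-- bash -c "vllm serve wildagsx/Llama-3.2-1B-Instruct-AI-Endpoints-v0.6"

You can then point the Model URL field of the Gradio application at this new app to compare the answers side by side.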

And tada 🥳, our small Llama model is an expert of OVHcloud AI Endpoints!

Don’t hesitate to use the OVHcloud Machine Learning Services products and give your feedback on our Discord server (https://discord.gg/ovhcloud), see you there 👋!
