<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI Notebook Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/ai-notebook/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/ai-notebook/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 06 Feb 2026 15:18:43 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>AI Notebook Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/ai-notebook/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Fine tune an LLM with Axolotl and OVHcloud Machine Learning Services</title>
		<link>https://blog.ovhcloud.com/fine-tune-an-llm-with-axolotl-and-ovhcloud-machine-learning-services/</link>
		
		<dc:creator><![CDATA[Stéphane Philippart]]></dc:creator>
		<pubDate>Fri, 25 Jul 2025 13:07:40 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Tranches de Tech & co]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Notebook]]></category>
		<category><![CDATA[Fine Tuning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=29408</guid>

					<description><![CDATA[There are many ways to train a model 📚: using detailed instructions, system prompts, Retrieval Augmented Generation, or function calling. One way is fine-tuning, which is what this blog is about! ✨ Two years back we posted a blog on fine-tuning Llama models—it’s not nearly as complicated as it was before 😉. This time we’re using the [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffine-tune-an-llm-with-axolotl-and-ovhcloud-machine-learning-services%2F&amp;action_name=Fine%20tune%20an%20LLM%20with%20Axolotl%20and%20OVHcloud%20Machine%20Learning%20Services&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-full is-resized"><img fetchpriority="high" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-1.png" alt="A robot with a car tuning style" class="wp-image-29462" style="width:600px" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-1.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-1-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-1-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-1-768x768.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/red-cat-02-1-70x70.png 70w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>There are many ways to train a model 📚: using detailed instructions, system prompts, Retrieval Augmented Generation, or function calling.</p>



<p>One way is fine-tuning, which is what this blog is about! ✨</p>



<p>Two years back we posted a <a href="https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/" data-wpel-link="internal">blog</a> on fine-tuning Llama models—it’s not nearly as complicated as it was before 😉. This time we’re using the <a href="https://docs.axolotl.ai/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Axolotl</a> framework, so hopefully there’s less to manage.</p>



<h3 class="wp-block-heading">So what’s the plan?</h3>



<p>For this blog, I’d like to fine-tune a small model, <a href="https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Llama-3.2-1B-Instruct</a>, and then test it out on a few questions about our <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Endpoints</a> product 📝.</p>



<p>Before we fine-tune, let’s try it out! Deploying a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a> model is super easy with <a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a> from <a href="https://www.ovhcloud.com/fr/public-cloud/ai-machine-learning/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Machine Learning Services</a> 🥳.</p>



<p>And thanks to a <a href="https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/" data-wpel-link="internal">previous blog post</a>, we know how to use <a href="https://docs.vllm.ai/en/v0.7.3/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM</a> and <a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a>.</p>



<pre title="Deploy a model thanks to vLLM and AI Deploy" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">ovhai app run --name $1 \
	--flavor l40s-1-gpu \
	--gpu 2 \
	--default-http-port 8000 \
	--env OUTLINES_CACHE_DIR=/tmp/.outlines \
	--env HF_TOKEN=$MY_HUGGING_FACE_TOKEN \
	--env HF_HOME=/hub \
	--env HF_DATASETS_TRUST_REMOTE_CODE=1 \
	--env HF_HUB_ENABLE_HF_TRANSFER=0 \
	--volume standalone:/hub:rw \
	--volume standalone:/workspace:rw \
	vllm/vllm-openai:v0.8.2 \
	-- bash -c "vllm serve meta-llama/Llama-3.2-1B-Instruct"</code></pre>



<p class="has-text-align-center"><strong>⚠️ Make sure you’ve agreed to the terms of use for the model’s license from Hugging Face ⚠️</strong></p>



<p>Check out the <a href="https://blog.ovhcloud.com/mistral-small-24b-served-with-vllm-and-ai-deploy-one-command-to-deploy-llm/" data-wpel-link="internal">blog</a> I mentioned earlier for all the details you need on the command and its parameters.</p>
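<p><em>ℹ️ As a quick smoke test (my own addition, not from the original post): vLLM serves the standard OpenAI-compatible REST API, so once the app is up you can query it with plain HTTP. The sketch below only builds the request payload; the actual call is left commented out so you can plug in your own AI Deploy app URL (the one shown is a placeholder).</em></p>

```python
# Hypothetical smoke test for the deployed model. vLLM exposes the standard
# OpenAI-compatible /v1/chat/completions route; the URL below is a placeholder.
import json

def build_chat_payload(model: str, prompt: str, temperature: float = 0.5) -> dict:
    """Build a chat-completions payload for an OpenAI-compatible server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

payload = build_chat_payload(
    "meta-llama/Llama-3.2-1B-Instruct",
    "How many requests by minutes can I do with AI Endpoints?",
)
body = json.dumps(payload).encode("utf-8")  # request body, ready to POST

# Uncomment and set your AI Deploy app URL to send the request:
# import urllib.request
# req = urllib.request.Request(
#     "https://<your-app-url>/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```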



<p>To test our different chatbots, we will use a simple <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/llm-fine-tune/chatbot/chatbot.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gradio application</a>:</p>



<pre title="Chatbot" class="wp-block-code"><code lang="python" class="language-python line-numbers"># Application to compare answer generation between an OVHcloud AI Endpoints exposed model and a fine-tuned model.
# ⚠️ Do not use in production!! ⚠️

import gradio as gr
import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# 📜 Prompts templates 📜
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "{system_prompt}"),
        ("human", "{user_prompt}"),
    ]
)

def chat(prompt, system_prompt, temperature, top_p, model_name, model_url, api_key):
    """
    Function to generate a chat response using the provided prompt, system prompt, temperature, top_p, model name, model URL and API key.
    """

    # ⚙️ Initialize the OpenAI model ⚙️
    llm = ChatOpenAI(api_key=api_key, 
                 model=model_name, 
                 base_url=model_url,
                 temperature=temperature,
                 top_p=top_p
                 )

    # 📜 Apply the prompt to the model 📜
    chain = prompt_template | llm
    ai_msg = chain.invoke(
        {
            "system_prompt": system_prompt,
            "user_prompt": prompt
        }
    )

    # 🤖 Return answer in a compatible format for Gradio component.
    return [{"role": "user", "content": prompt}, {"role": "assistant", "content": ai_msg.content}]

# 🖥️ Main application 🖥️
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            system_prompt = gr.Textbox(value="""You are a specialist on OVHcloud products.
If you can't find any sure and relevant information about the product asked, answer with "This product doesn't exist in OVHcloud""", 
                label="🧑‍🏫 System Prompt 🧑‍🏫")
            temperature = gr.Slider(minimum=0.0, maximum=2.0, step=0.01, label="Temperature", value=0.5)
            top_p = gr.Slider(minimum=0.0, maximum=1.0, step=0.01, label="Top P", value=0.0)
            model_name = gr.Textbox(label="🧠 Model Name 🧠", value='Llama-3.1-8B-Instruct')
            model_url = gr.Textbox(label="🔗 Model URL 🔗", value='https://oai.endpoints.kepler.ai.cloud.ovh.net/v1')
            api_key = gr.Textbox(label="🔑 OVH AI Endpoints Access Token 🔑", value=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"), type="password")

        with gr.Column():
            chatbot = gr.Chatbot(type="messages", label="🤖 Response 🤖")
            prompt = gr.Textbox(label="📝 Prompt 📝", value='How many requests by minutes can I do with AI Endpoints?')
            submit = gr.Button("Submit")

    submit.click(chat, inputs=[prompt, system_prompt, temperature, top_p, model_name, model_url, api_key], outputs=chatbot)

demo.launch()</code></pre>



<p>ℹ️ You can find all resources to build and run this application in the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/llm-fine-tune/chatbot/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dedicated folder</a> in the GitHub repository.</p>



<p>Let&#8217;s test with a simple question: &#8220;How many requests by minutes can I do with AI Endpoints?&#8221;.<br>The first test is with <a href="https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Llama-3.2-1B-Instruct</a> from <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a> deployed with <a href="https://docs.vllm.ai/en/v0.7.3/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">vLLM</a> and <a href="https://www.ovhcloud.com/fr/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Deploy</a>.</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="474" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-13.19.16-1024x474.png" alt="Ask for AI Endpoints rate limit with a Llama-3.2-1B-Instruct model" class="wp-image-29448" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-13.19.16-1024x474.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-13.19.16-300x139.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-13.19.16-768x356.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-13.19.16-1536x712.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-13.19.16-2048x949.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The response isn’t exactly what we expected. 😅</p>



<p>FYI, according to the official <a href="https://help.ovhcloud.com/csm/fr-public-cloud-ai-endpoints-capabilities?id=kb_article_view&amp;sysparm_article=KB0065424#limitations" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud guide</a>, the correct answer is:<br> &#8211; <strong>Anonymous</strong>: 2 requests per minute, per IP and per model.<br> &#8211; <strong>Authenticated with an API access key</strong>: 400 requests per minute, per Public Cloud project and per model.</p>



<h3 class="wp-block-heading"><strong>What’s the best way to feed the model fresh data?</strong></h3>



<p>I bet you already know this—you can use some data during the inference step, using Retrieval Augmented Generation (RAG). You can learn how to set up RAG by reading our <a href="https://blog.ovhcloud.com/rag-chatbot-using-ai-endpoints-and-langchain/" data-wpel-link="internal">past blog post</a>. 📗</p>



<p>Another way to feed a model fresh data is fine-tuning. ✨</p>



<p>In a nutshell, fine-tuning is when you take a pre-trained machine learning model and train it further on additional data, so it can do a specific job. It’s quicker and easier than building a model from scratch yourself. 😉</p>



<p>For this, I’m picking <a href="https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Llama-3.2-1B-Instruct</a> from Hugging Face as the base model.</p>



<p><em>ℹ️ The more parameters your base model has, the more computing power you need. In this case, the model needs between 3GB and 4GB of memory, which is why we’ll be using a single <a href="https://www.ovhcloud.com/fr/public-cloud/prices/#5260" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">L4 GPU</a> (we need an <a href="https://www.nvidia.com/en-us/data-center/ampere-architecture/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Ampere-compatible architecture</a> or newer).</em></p>



<h3 class="wp-block-heading">When data is your gold</h3>



<p>To train a model, you need enough good-quality data.</p>



<p>The first part is easy; I get the OVHcloud AI Endpoints official documentation in a markdown format from our <a href="https://github.com/ovh/docs/tree/develop/pages/public_cloud/ai_machine_learning" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">public cloud documentation repository</a> (by the way, would you like to contribute?). 📚</p>



<p>First, create a dataset in the right format; Axolotl supports several <a href="https://docs.axolotl.ai/docs/dataset-formats/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dataset formats</a>. I’m going with the <a href="https://docs.axolotl.ai/docs/dataset-formats/conversation.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">conversation format</a> because it’s the easiest for my use case. 😉</p>



<pre title="Conversation format dataset" class="wp-block-code"><code lang="json" class="language-json line-numbers">{
   "messages": [
     {"role": "...", "content": "..."}, 
     {"role": "...", "content": "..."}, 
     ...]
}</code></pre>
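<p><em>ℹ️ As a sanity check (my own helper, not part of Axolotl), you can verify that each record respects this shape: a <code>messages</code> list that strictly alternates user and assistant turns, with every question followed by an answer.</em></p>

```python
# Small validation helper (my addition): check a conversation-format record.
def is_valid_conversation(record: dict) -> bool:
    """True if "messages" strictly alternates user/assistant and ends on an answer."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected or "content" not in msg:
            return False
    return len(messages) % 2 == 0  # every question must have an answer

good = {"messages": [
    {"role": "user", "content": "What is AI Endpoints?"},
    {"role": "assistant", "content": "A serverless inference service from OVHcloud."},
]}
bad = {"messages": [{"role": "user", "content": "A question with no answer"}]}
print(is_valid_conversation(good), is_valid_conversation(bad))  # True False
```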



<p>Rather than creating it manually and adding the relevant information by hand, I use an LLM to convert the markdown data into a well-formed dataset. 🤖</p>



<p>Here we use the following <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/llm-fine-tune/dataset/DatasetCreation.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Python script</a> 🐍:</p>



<pre title="Dataset creation with LLM" class="wp-block-code"><code lang="python" class="language-python line-numbers">import os
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

# 🗺️ Define the JSON schema for the response 🗺️
message_schema = {
    "type": "object",
    "properties": {
        "role": {"type": "string"},
        "content": {"type": "string"}
    },
    "required": ["role", "content"]
}

response_format = {
    "type": "json_object",
    "json_schema": {
        "name": "Messages",
        "description": "A list of messages with role and content",
        "properties": {
            "messages": {
                "type": "array",
                "items": message_schema
            }
        }
    }
}

# ⚙️ Initialize the chat model with AI Endpoints configuration ⚙️
chat_model = ChatOpenAI(
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
    base_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),
    model_name=os.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"),
    temperature=0.0
)

# 📂 Define the directory path 📂
directory_path = "docs/pages/public_cloud/ai_machine_learning"
directory = Path(directory_path)

# 🗃️ Walk through the directory and its subdirectories 🗃️
for path in directory.rglob("*"):
    # Check if the current path is a directory
    if path.is_dir():
        # Get the name of the subdirectory
        sub_directory = path.name

        # Construct the path to the "guide.en-gb.md" file in the subdirectory
        guide_file_path = path / "guide.en-gb.md"

        # Check if the "guide.en-gb.md" file exists in the subdirectory
        if "endpoints" in sub_directory and guide_file_path.exists():
            print(f"📗 Guide processed: {sub_directory}")
            with open(guide_file_path, 'r', encoding='utf-8') as file:
                raw_data = file.read()

            user_message = HumanMessage(content=f"""
With the markdown following, generate a JSON file composed as follows: a list named "messages" composed of tuples with a key "role" which can have the value "user" when it's the question and "assistant" when it's the response. To split the document, base it on the markdown chapter titles to create the question, seems like a good idea.
Keep the language English.
I don't need to know the code to do it but I want the JSON result file.
For the "user" field, don't just repeat the title but make a real question, for example "What are the requirements for OVHcloud AI Endpoints?"
Be sure to add OVHcloud with AI Endpoints so that it's clear that OVHcloud creates AI Endpoints.
Generate the entire JSON file.
An example of what it should look like: messages [{{"role":"user", "content":"What is AI Endpoints?"}}]
There must always be a question followed by an answer, never two questions or two answers in a row.
The source markdown file:
{raw_data}
""")
            chat_response = chat_model.invoke([user_message], response_format=response_format)
            
            with open(f"./generated/{sub_directory}.json", 'w', encoding='utf-8') as output_file:
                output_file.write(chat_response.content)
                print(f"✅ Dataset generated: ./generated/{sub_directory}.json")

</code></pre>



<p><em>ℹ️ You can find all resources to build and run this application in the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/llm-fine-tune/dataset/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dedicated folder</a> in the GitHub repository.</em></p>



<p>Here’s a sample of the file created as the dataset:</p>



<pre title="Dataset example" class="wp-block-code"><code lang="json" class="language-json line-numbers">{
  "messages": [
    {
      "role": "user",
      "content": "What are the requirements for using OVHcloud AI Endpoints?"
    },
    {
      "role": "assistant",
      "content": "To use OVHcloud AI Endpoints, you need the following: \n1. A Public Cloud project in your OVHcloud account \n2. A payment method defined on your Public Cloud project. Access keys created from Public Cloud projects in Discovery mode (without a payment method) cannot use the service."
    },
    {
      "role": "user",
      "content": "What are the rate limits for using OVHcloud AI Endpoints?"
    },
    {
      "role": "assistant",
      "content": "The rate limits for OVHcloud AI Endpoints are as follows:\n- Anonymous: 2 requests per minute, per IP and per model.\n- Authenticated with an API access key: 400 requests per minute, per PCI project and per model."
    },
    ...
  ]
}</code></pre>



<p>As for quantity, it’s a bit tricky. How can we generate the right data for training without lowering data quality?</p>



<p>To address this, I create synthetic data: an LLM rephrases the original question-and-answer pairs in different words while keeping the same meaning, which multiplies the data on each topic without drifting from the source.</p>



<p>Here is the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/llm-fine-tune/dataset/DatasetAugmentation.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Python script</a> 🐍 to do the data augmentation:</p>



<pre title="Data augmentation" class="wp-block-code"><code lang="python" class="language-python line-numbers">import os
import json
import uuid
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage
from jsonschema import validate, ValidationError

# 🗺️ Define the JSON schema for the response 🗺️
message_schema = {
    "type": "object",
    "properties": {
        "role": {"type": "string"},
        "content": {"type": "string"}
    },
    "required": ["role", "content"]
}

response_format = {
    "type": "json_object",
    "json_schema": {
        "name": "Messages",
        "description": "A list of messages with role and content",
        "properties": {
            "messages": {
                "type": "array",
                "items": message_schema
            }
        }
    }
}

# ✅ JSON validity verification ❌
def is_valid(json_data):
    """
    Test the validity of the JSON data against the schema.
    Argument:
        json_data (dict): The JSON data to validate.  
    Raises:
        ValidationError: If the JSON data does not conform to the specified schema.  
    """
    try:
        validate(instance=json_data, schema=response_format["json_schema"])
        return True
    except ValidationError as e:
        print(f"❌ Validation error: {e}")
        return False

# ⚙️ Initialize the chat model with AI Endpoints configuration ⚙️
chat_model = ChatOpenAI(
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
    base_url=os.getenv("OVH_AI_ENDPOINTS_MODEL_URL"),
    model_name=os.getenv("OVH_AI_ENDPOINTS_MODEL_NAME"),
    temperature=0.0
)

# 📂 Define the directory path 📂
directory_path = "generated"
print(f"📂 Directory path: {directory_path}")
directory = Path(directory_path)

# 🗃️ Walk through the directory and its subdirectories 🗃️
for path in directory.rglob("*"):
    print(f"📜 Processing file: {path}")
    # Check if the current path is a valid file
    if path.is_file() and "endpoints" in path.name:
        # Read the raw data from the file
        with open(path, 'r', encoding='utf-8') as file:
            raw_data = file.read()

        try:
            json_data = json.loads(raw_data)
        except json.JSONDecodeError:
            print(f"❌ Failed to decode JSON from file: {path.name}")
            continue

        if not is_valid(json_data):
            print(f"❌ Invalid dataset: {path.name}")
            continue
        print(f"✅ Valid input dataset: {path.name}")

        user_message = HumanMessage(content=f"""
        Given the following JSON, generate a similar JSON file where you paraphrase each question in the content attribute
        (when the role attribute is user) and also paraphrase the value of the response to the question stored in the content attribute
        when the role attribute is assistant.
        The objective is to create synthetic datasets based on existing datasets.
        I do not need to know the code to do this, but I want the resulting JSON file.
        It is important that the term OVHcloud is present as much as possible, especially when the terms AI Endpoints are mentioned
        either in the question or in the response.
        There must always be a question followed by an answer, never two questions or two answers in a row.
        It is IMPERATIVE to keep the language in English.
        The source JSON file:
        {raw_data}
        """)

        chat_response = chat_model.invoke([user_message], response_format=response_format)

        output = chat_response.content

        # Replace unauthorized characters
        output = output.replace("\\t", " ")

        generated_file_name = f"{uuid.uuid4()}_{path.name}"
        with open(f"./generated/synthetic/{generated_file_name}", 'w', encoding='utf-8') as output_file:
            output_file.write(output)

        if not is_valid(json.loads(output)):
            print(f"❌ ERROR: File {generated_file_name} is not valid")
        else:
            print(f"✅ Successfully generated file: {generated_file_name}")</code></pre>



<p><em>ℹ️ Again, you can find all resources to build and run this application in the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/llm-fine-tune/dataset/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dedicated folder</a> in the GitHub repository.</em></p>
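<p><em>ℹ️ One pitfall of paraphrase-based augmentation (my own note, not from the original scripts): the LLM sometimes returns a "paraphrase" identical to the source. A quick check for exact duplicates, sketched below with hypothetical helper names, keeps copies from inflating the dataset without adding information.</em></p>

```python
# Hypothetical quality check: flag synthetic questions that merely copy a
# source question, since exact duplicates add volume without new phrasing.
def duplicate_questions(source_msgs, synthetic_msgs):
    """Return synthetic user questions that already exist in the source set."""
    seen = {m["content"].strip().lower() for m in source_msgs if m["role"] == "user"}
    return [m["content"] for m in synthetic_msgs
            if m["role"] == "user" and m["content"].strip().lower() in seen]

source = [{"role": "user", "content": "What is AI Endpoints?"}]
synthetic = [
    {"role": "user", "content": "What is AI Endpoints?"},  # unchanged: flagged
    {"role": "user", "content": "Can you describe OVHcloud AI Endpoints?"},
]
dupes = duplicate_questions(source, synthetic)
print(dupes)  # ['What is AI Endpoints?']
```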



<h3 class="wp-block-heading">Fine-tune the model</h3>



<p>We now have enough training data, let’s fine-tune!</p>



<p><em>ℹ️ It’s hard to say exactly how much data is needed to train a model properly. It all depends on the model, the data, the topic, and so on.<br>The only option is to test and adapt. 🔁</em></p>



<p>I use <a href="https://jupyter.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jupyter notebook</a>, created with <a href="https://www.ovhcloud.com/fr/public-cloud/ai-notebooks/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI Notebooks</a>, to fine-tune my models.</p>



<pre title="Jupyter notebook creation" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">ovhai notebook run conda jupyterlab \
	--name axolotl-llm-fine-tune \
	--framework-version 25.3.1-py312-cudadevel128-gpu \
	--flavor l4-1-gpu \
	--gpu 1 \
	--envvar HF_TOKEN=$MY_HF_TOKEN \
	--envvar WANDB_TOKEN=$MY_WANDB_TOKEN \
	--unsecure-http</code></pre>



<p><em>ℹ️ For more details on how to create a Jupyter notebook with <a href="https://www.ovhcloud.com/fr/public-cloud/ai-notebooks/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Notebooks</a>, read the <a href="https://help.ovhcloud.com/csm/fr-documentation-public-cloud-ai-and-machine-learning-ai-notebooks?id=kb_browse_cat&amp;kb_id=574a8325551974502d4c6e78b7421938&amp;kb_category=c8441955f49801102d4ca4d466a7fd58&amp;spa=1" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</em></p>



<p class="has-text-align-left">⚙️ The <strong>HF_TOKEN</strong> environment variable is used to pull and push the trained model to <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a>.<br>⚙️ The <strong>WANDB_TOKEN</strong> environment variable helps you track training quality in <a href="https://wandb.ai" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Weights &amp; Biases</a>.</p>
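<p><em>ℹ️ A small pre-flight check (my own convenience snippet) avoids launching a long training run only to fail later on a missing token; it simply lists any required environment variables that are unset or empty.</em></p>

```python
# Convenience pre-flight check (my addition, not from the original notebook):
# list required environment variables that are unset or empty before training.
import os

def missing_tokens(required=("HF_TOKEN", "WANDB_TOKEN"), env=None):
    """Return the names of required variables that are missing or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with a fake environment where only HF_TOKEN is set:
missing = missing_tokens(env={"HF_TOKEN": "hf_xxx"})
print(missing)  # ['WANDB_TOKEN']
```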



<p>Once the notebook is set up, you can start coding the model’s training with Axolotl.</p>



<p>To start, install Axolotl CLI and its dependencies. 🧰</p>



<pre title="Axolotl installation" class="wp-block-code"><code lang="bash" class="language-bash"># Axolotl needs these dependencies
!pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# Axolotl CLI installation
!pip install --no-build-isolation axolotl[flash-attn,deepspeed]

# Verify Axolotl version and installation
!axolotl --version</code></pre>






<p>The next step is to configure the Hugging Face CLI. 🤗</p>



<pre title="Hugging Face configuration" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">!pip install -U "huggingface_hub[cli]"

!huggingface-cli --version</code></pre>



<pre title="Hugging Face Hub authentication" class="wp-block-code"><code lang="python" class="language-python line-numbers">import os
from huggingface_hub import login

login(os.getenv("HF_TOKEN"))</code></pre>






<p>Then, configure your Weights &amp; Biases access.</p>



<pre title="Weights &amp; Biases configuration" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">!pip install wandb

!wandb login $WANDB_TOKEN</code></pre>






<p>Once all that’s done, it’s time to train the model.</p>



<pre title="Train the model" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">!axolotl train /workspace/instruct-lora-1b-ai-endpoints.yml</code></pre>



<p>You only need to type this one line to train it, how cool is that? 😎</p>



<p><em>ℹ️ With one L4 card, 10 epochs, and roughly 2,000 question-and-answer pairs in the dataset, training ran for about 90 minutes.</em></p>



<p>Basically, the command line needs just one parameter: the Axolotl config file. You can find everything you need to set up Axolotl in the <a href="https://docs.axolotl.ai/docs/config-reference.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official documentation</a>.📜<br>Here’s what the model was trained on:</p>



<pre title="Axolotl configuration" class="wp-block-code"><code lang="yaml" class="language-yaml">base_model: meta-llama/Llama-3.2-1B-Instruct
# optionally might have model_type or tokenizer_type
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

load_in_8bit: true
load_in_4bit: false

datasets:
  - path: /workspace/ai-endpoints-doc/
    type: chat_template
      
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

dataset_prepared_path:
val_set_size: 0.01
output_dir: /workspace/out/llama-3.2-1b-ai-endpoints

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

wandb_project: ai_endpoints_training
wandb_entity: &lt;user id&gt;
wandb_mode: 
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 10
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: false

gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:
   pad_token: &lt;|end_of_text|&gt;
</code></pre>



<p>🔎 Some key points (only the fields modified from the <a href="https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/llama-3/instruct-lora-8b.yml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">given templates</a>):<br>&#8211; <strong>base_model: meta-llama/Llama-3.2-1B-Instruct</strong>: before you download the base model from Hugging Face, be sure to accept the licence’s terms of use<br>&#8211; <strong>path: /workspace/ai-endpoints-doc/</strong>: folder where the generated dataset is uploaded<br>&#8211; <strong>wandb_project: ai_endpoints_training</strong> &amp; <strong>wandb_entity: &lt;user id></strong>: to configure Weights &amp; Biases<br>&#8211; <strong>num_epochs: 10</strong>: number of epochs for the training</p>
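<p><em>ℹ️ About the <strong>adapter: lora</strong>, <strong>lora_r</strong>, and <strong>lora_alpha</strong> fields: LoRA freezes the base weights and trains only two small low-rank matrices per target layer, and the later <code>merge-lora</code> step folds them back into the base model. A rough numpy sketch of that arithmetic (illustrative only, not Axolotl&#8217;s actual code):</em></p>

```python
# Illustrative LoRA arithmetic (not Axolotl's implementation): the learned
# update is a low-rank product scaled by alpha/r, added to the frozen weight.
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 32, 16              # hidden size, lora_r, lora_alpha

W = rng.normal(size=(d, d))           # frozen base weight (never updated)
A = rng.normal(size=(r, d)) * 0.01    # trainable low-rank factor
B = np.zeros((d, r))                  # B starts at zero, so the delta starts at zero

delta = (alpha / r) * (B @ A)         # low-rank update learned during fine-tuning
W_merged = W + delta                  # what merging the LoRA weights bakes in

print(delta.shape, np.allclose(W_merged, W))  # (64, 64) True (B is still zero here)
```

<p><em>Storing only A and B (2 × 64 × 32 values here instead of 64 × 64 per layer) is what makes LoRA fine-tuning fit on a single L4 GPU.</em></p>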



<p>After the training, you can test the new model 🤖:</p>



<pre title="New model testing" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">!echo "What is OVHcloud AI Endpoints and how to use it?" | axolotl inference /workspace/instruct-lora-1b-ai-endpoints.yml --lora-model-dir="/workspace/out/llama-3.2-1b-ai-endpoints" </code></pre>






<p>When you’re satisfied with the result, merge the weights and upload the new model to Hugging Face:</p>



<pre title="Push the model" class="wp-block-code"><code lang="bash" class="language-bash line-numbers">!axolotl merge-lora /workspace/instruct-lora-1b-ai-endpoints.yml

%cd /workspace/out/llama-3.2-1b-ai-endpoints/merged

!huggingface-cli upload wildagsx/Llama-3.2-1B-Instruct-AI-Endpoints-v0.6 .</code></pre>



<p>ℹ️ <em>You can find all resources to create and run the notebook in the <a href="https://github.com/ovh/public-cloud-examples/tree/main/ai/llm-fine-tune/notebook/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dedicated folder</a> in the GitHub repository.</em></p>



<h3 class="wp-block-heading">Test the new model</h3>



<p>Once you have pushed your model to Hugging Face, you can once again deploy it with vLLM and AI Deploy to test it ⚡️.</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="474" src="https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-14.58.02-1024x474.png" alt="" class="wp-image-29459" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-14.58.02-1024x474.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-14.58.02-300x139.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-14.58.02-768x356.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-14.58.02-1536x712.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2025/07/Screenshot-2025-07-23-at-14.58.02-2048x949.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Ta-da! 🥳 Our little Llama model is now an OVHcloud AI Endpoints pro!</p>






<p>Feel free to try out OVHcloud Machine Learning products, and share your thoughts on our Discord server (<em><a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://discord.gg/ovhcloud</a></em>), see you soon! 👋</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffine-tune-an-llm-with-axolotl-and-ovhcloud-machine-learning-services%2F&amp;action_name=Fine%20tune%20an%20LLM%20with%20Axolotl%20and%20OVHcloud%20Machine%20Learning%20Services&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deploy a custom Docker image for Data Science project – A spam classifier with FastAPI (Part 3)</title>
		<link>https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-a-spam-classifier-with-fastapi-part-3/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 30 Dec 2022 10:39:54 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Notebook]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[Scikit Learn]]></category>
		<category><![CDATA[spam classification]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24202</guid>

					<description><![CDATA[A guide to deploy a custom Docker image for an API with FastAPI and AI Deploy. Welcome to the third article concerning custom Docker image deployment. If you haven&#8217;t read the previous ones, you can check it: &#8211; Gradio sketch recognition app&#8211; Streamlit app for EDA and interactive prediction When creating code for a Data [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeploy-a-custom-docker-image-for-data-science-project-a-spam-classifier-with-fastapi-part-3%2F&amp;action_name=Deploy%20a%20custom%20Docker%20image%20for%20Data%20Science%20project%20%E2%80%93%20A%20spam%20classifier%20with%20FastAPI%20%28Part%203%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to deploy a custom Docker image for an API with <a href="https://fastapi.tiangolo.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">FastAPI</a> and <strong>AI Deploy</strong>.</em></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="815" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-1024x815.jpg" alt="fastapi for spam classification" class="wp-image-24226" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-1024x815.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-300x239.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-768x612.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-1536x1223.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier.jpg 1620w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><em>Welcome to the third article concerning <strong>custom Docker image deployment</strong>. If you haven&#8217;t read the previous ones, you can check them:</em></p>



<p><em>&#8211; </em><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal">Gradio sketch recognition app</a><br><em>&#8211; </em><a href="https://docs.ovh.com/fr/publiccloud/ai/deploy/tuto-streamlit-eda-iris/" data-wpel-link="exclude">Streamlit app for EDA and interactive prediction</a></p>



<p>When creating code for a <strong>Data Science project</strong>, you probably want it to be as portable as possible. In other words, it can be run as many times as you like, even on different machines.</p>



<p>Unfortunately, it is often the case that Data Science code works fine locally but fails at runtime on another machine. This can be due to different versions of the libraries installed on the host machine.</p>



<p>To deal with this problem, you can use <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a>.</p>



<p><strong>The article is organized as follows:</strong></p>



<ul class="wp-block-list">
<li>Objectives</li>



<li>Concepts</li>



<li>Define a model for spam classification</li>



<li>Build the FastAPI app with Python</li>



<li>Containerize your app with Docker</li>



<li>Launch the app with AI Deploy</li>
</ul>



<p><em>All the code for this blogpost is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/fastapi/spam-classifier-api" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>. You can test it with OVHcloud <strong>AI Deploy</strong> tool, please refer to the <a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-fastapi-spam-classifier/" data-wpel-link="exclude">documentation</a> to boot it up.</em></p>



<h2 class="wp-block-heading">Objectives</h2>



<p>In this article, you will learn how to develop a <strong>FastAPI</strong> API for spam classification.</p>



<p>Once your app is up and running locally, it will be a matter of containerizing it, then deploying the custom Docker image with AI Deploy.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="2160" height="1215" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited.jpg" alt="objective of api deployment" class="wp-image-24228" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited.jpg 2160w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-300x169.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-1024x576.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-768x432.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-1536x864.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-2048x1152.jpg 2048w" sizes="auto, (max-width: 2160px) 100vw, 2160px" /></figure>



<h2 class="wp-block-heading">Concepts</h2>



<p>In Artificial Intelligence, you have probably heard of <strong>Natural Language Processing</strong> (NLP). <strong>NLP</strong> gathers several tasks related to language processing such as <strong>text classification</strong>.</p>



<p>This technique is ideal for distinguishing spam from other messages.</p>



<h3 class="wp-block-heading">Spam Ham Collection&nbsp;Dataset</h3>



<p>The <a href="https://archive.ics.uci.edu/ml/datasets/sms+spam+collection" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">SMS Spam Collection</a> is a public set of labeled SMS messages that have been collected for mobile phone spam research.</p>



<p>The dataset contains <strong>5,574 messages</strong> in English. The SMS messages are tagged as follows:</p>



<ul class="wp-block-list">
<li><strong>HAM</strong> if the message is legitimate</li>



<li><strong>SPAM</strong> if it is not</li>
</ul>



<p>The collection is a <strong>text file</strong>, where each line has the correct <strong>class</strong> followed by the raw <strong>message</strong>.</p>
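<p><em>To make the layout concrete, here is a minimal pure-Python sketch of parsing such tab-separated lines (the sample lines below are illustrative; the model.py shown later loads the real file with pandas instead):</em></p>

```python
# Illustrative sample mimicking the SMSSpamCollection layout:
# each line is "<class>\t<raw message>"
sample = (
    "ham\tSee you at the meeting tomorrow\n"
    "spam\tFree entry in a weekly competition, reply to win"
)

def parse_collection(text: str) -> list:
    """Split each line into a (class, message) pair on the first tab only."""
    pairs = []
    for line in text.splitlines():
        label, message = line.split("\t", 1)
        pairs.append((label, message))
    return pairs

for label, message in parse_collection(sample):
    print(label, "->", message)
```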



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-1024x576.png" alt="spam ham dataset" class="wp-image-24219" width="773" height="435" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-1536x864.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5.png 1920w" sizes="auto, (max-width: 773px) 100vw, 773px" /></figure>



<h3 class="wp-block-heading">Logistic regression</h3>



<p><strong>What is a Logistic Regression?</strong></p>



<p><a href="https://fr.wikipedia.org/wiki/R%C3%A9gression_logistique" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Logistic regression</a> is a statistical model. It allows you to study the relationships between a set of <code>i</code> <strong>explanatory variables</strong> (<code>Xi</code>) and a <strong>qualitative variable</strong> (<code>Y</code>).</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-1024x779.jpg" alt="logistic regression" class="wp-image-24229" width="467" height="355" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-1024x779.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-300x228.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-768x584.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-1536x1168.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression.jpg 1620w" sizes="auto, (max-width: 467px) 100vw, 467px" /></figure>



<p>It is a generalized linear model using a logistic function as a link function.</p>



<p>A logistic regression model can also predict the <strong>probability</strong> of an event occurring (value close to <code><strong>1</strong></code>) or not (value close to <strong><code>0</code></strong>) from the optimization of the <strong>regression coefficients</strong>. This result always varies between <strong><code>0</code></strong> and <strong><code>1</code></strong>.</p>



<p>For the spam classification use case, <strong>words</strong> are inputs and <strong>class</strong> (spam or ham) is output.</p>
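<p><em>The probability behaviour described above comes from the logistic (sigmoid) link function. A minimal pure-Python sketch of it (the actual model in this article uses Scikit-Learn&#8217;s LogisticRegression):</em></p>

```python
import math

def sigmoid(z: float) -> float:
    # Logistic link function: maps any real-valued score into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# A large positive score gives a probability close to 1 (e.g. spam),
# a large negative score a probability close to 0 (e.g. ham),
# and a score of 0 sits exactly on the decision boundary.
print(sigmoid(2.5))   # close to 1
print(sigmoid(-2.5))  # close to 0
print(sigmoid(0.0))   # 0.5
```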



<h3 class="wp-block-heading">FastAPI</h3>



<p><strong>What is FastAPI?</strong></p>



<p><a href="https://fastapi.tiangolo.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">FastAPI</a> is a web framework for building <strong>RESTful APIs</strong> with Python.</p>



<p>FastAPI is based on <a href="https://docs.pydantic.dev/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Pydantic</a> and Python type hints to <em>validate</em>, <em>serialize</em> and <em>deserialize</em> data, and automatically generate OpenAPI documentation.</p>



<h3 class="wp-block-heading">Docker</h3>



<p>The <a href="https://www.docker.com/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Docker</a>&nbsp;platform allows you to build, run and manage isolated applications. The principle is to build an application that contains not only the written code but also all the context needed to run it: the libraries and their versions, for example.</p>



<p>When you wrap your application with all its context, you build a Docker image, which can be saved in your local repository or in the Docker Hub.</p>



<p>To get started with Docker, please, check this&nbsp;<a href="https://www.docker.com/get-started" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">documentation</a>.</p>



<p>To build a Docker image, you will define 2 elements:</p>



<ul class="wp-block-list">
<li>the application code (<em>FastAPI app</em>)</li>



<li>the&nbsp;<a href="https://docs.docker.com/engine/reference/builder/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Dockerfile</a></li>
</ul>



<p>In the next steps, you will see how to develop the Python code for your app, but also how to write the Dockerfile.</p>



<p>Finally, you will see how to deploy your custom docker image with&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;tool.</p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p><strong>AI Deploy</strong>&nbsp;enables AI models and managed applications to be started via Docker containers.</p>



<p>To know more about AI Deploy, please refer to this&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/getting-started/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">Define a model for spam classification</h2>



<p>❗ <strong><code>To develop an API that uses a Machine Learning model, you have to load the model in the correct format. For this tutorial, a Logistic Regression is used, defined in the Python file model.py</code></strong>.<br><br><code><strong>To better understand the model.py code, refer to the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a> which details all the steps</strong></code>.</p>



<p>First of all, you have to import the&nbsp;<strong>Python libraries</strong>&nbsp;needed to create the Logistic Regression in the <code>model.py</code> file.</p>



<pre class="wp-block-code"><code class="">import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression</code></pre>



<p>Now, you can create the Logistic Regression based on the <strong>Spam Ham Collection&nbsp;Dataset</strong>. The Python framework named <strong>Scikit-Learn</strong> is used to define this model.</p>



<p>Firstly, you can load the dataset and transform your input file into a <code>dataframe</code>.</p>



<p>You will also be able to define the <code>input</code> and the <code>output</code> of the model.</p>



<pre class="wp-block-code"><code class="">def load_data():

    PATH = 'SMSSpamCollection'
    df = pd.read_csv(PATH, delimiter = "\t", names=["classe", "message"])

    X = df['message']
    y = df['classe']

    return X, y</code></pre>



<p>In a second step, you split the data into a training and a test set.</p>



<p>To <strong>separate the dataset fairly</strong> and to have a <code>test_size</code> between 0 and 1, you can calculate <code>ntest</code> as follows.</p>



<pre class="wp-block-code"><code class="">def split_data(X, y):

    # test_size must be a float between 0 and 1
    ntest = 2000/(3572+2000)

    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=ntest, random_state=0)

    # only the training split is needed here: the final model was already evaluated in the notebook
    return X_train, y_train</code></pre>



<p>Now you can concentrate on creating the <strong>Machine Learning model</strong>. To do this, create a <code>spam_classifier_model</code> function.</p>



<p>To fully understand the code, refer to <strong>Steps 6 to 9</strong> of this <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a>. In these steps you will learn how to:</p>



<ul class="wp-block-list">
<li>create the model using <strong>Logistic Regression</strong></li>



<li>evaluate on the test set</li>



<li>do <strong>dimension reduction</strong> with stop words and term frequency</li>



<li>do <strong>dimension reduction</strong> to post-processing of the model</li>
</ul>



<pre class="wp-block-code"><code class="">def spam_classifier_model(Xtrain, ytrain):

    model_logistic_regression = LogisticRegression()
    model_logistic_regression = model_logistic_regression.fit(Xtrain, ytrain)

    coeff = model_logistic_regression.coef_
    coef_abs = np.abs(coeff)

    quantiles = np.quantile(coef_abs,[0, 0.25, 0.5, 0.75, 0.9, 1])

    index = np.where(coeff[0] &gt; quantiles[1])
    newXtrain = Xtrain[:, index[0]]

    model_logistic_regression = LogisticRegression()
    model_logistic_regression.fit(newXtrain, ytrain)

    return model_logistic_regression, index</code></pre>



<p>Once these Python functions are defined, you can call and apply them as follows.</p>



<p>Firstly, extract input and output data with <code>load_data()</code>:</p>



<pre class="wp-block-code"><code class="">data_input, data_output = load_data()</code></pre>



<p>Secondly, split the data using <code>split_data(data_input, data_output)</code>:</p>



<pre class="wp-block-code"><code class="">X_train, ytrain = split_data(data_input, data_output)</code></pre>



<p>❗ <code><strong>Here, there is no need to use the test set. Indeed, the evaluation of the final model has already been done in <em>Step 9 - Dimensionality reduction: post processing of the model</em> of the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a>.</strong></code></p>



<p>Thirdly, <strong>transform</strong> and <strong>fit</strong> the training set. In order to prepare the data, you can use <code>CountVectorizer</code> from Scikit-Learn to remove <strong>stop-words</strong> and then <code>fit_transform</code> to fit the inputs.</p>



<pre class="wp-block-code"><code class="">vectorizer = CountVectorizer(stop_words='english', binary=True, min_df=10)
Xtrain = vectorizer.fit_transform(X_train.tolist())
Xtrain = Xtrain.toarray()</code></pre>



<p>Fourthly, use the model and index for prediction by calling the <code>spam_classifier_model</code> function.</p>



<pre class="wp-block-code"><code class="">model_logistic_regression, index = spam_classifier_model(Xtrain, ytrain)</code></pre>



<p>Find out the full Python code <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/model.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</p>



<p>Have you successfully defined your model? Good job 🥳 !</p>



<p>Let&#8217;s go for the creation of the API!</p>



<h2 class="wp-block-heading">Build the FastAPI app with Python</h2>



<p>❗ <code><strong>All the codes below are available in the <em>app.py</em> file. You can find the complete Python code of the <em>app.py</em> file <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></code></p>



<p>To begin, you can import dependencies for FastAPI app.</p>



<ul class="wp-block-list">
<li>uvicorn</li>



<li>fastapi</li>



<li>pydantic</li>
</ul>



<pre class="wp-block-code"><code class="">import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from model import model_logistic_regression, index, vectorizer</code></pre>



<p>In the first place, you can initialize an instance of FastAPI.</p>



<pre class="wp-block-code"><code class="">app = FastAPI()</code></pre>



<p>Next, you can define the data format by creating the Python class named <code>request_body</code>. Here, the <strong>string</strong> (<code>str</code>) format is required.</p>



<pre class="wp-block-code"><code class="">class request_body(BaseModel):
    message : str</code></pre>



<p>Now, you can create the process function in order to prepare the sent message to be used by the model.</p>



<pre class="wp-block-code"><code class="">def process_message(message):

    desc = vectorizer.transform(message)
    dense_desc = desc.toarray()
    dense_select = dense_desc[:, index[0]]

    return dense_select</code></pre>



<p>At the exit of this function, the message no longer contains any <strong>stop words</strong>: it is put in the right format for the model thanks to the <code>transform</code> call, and is then represented as an <code>array</code>.</p>
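<p><em>To illustrate what happens inside this step, here is a hand-rolled sketch of a binary bag-of-words transform over a fixed vocabulary (the stop-word list and vocabulary below are tiny hypothetical examples; the real app reuses the CountVectorizer fitted in model.py):</em></p>

```python
# Hypothetical stop words and vocabulary, for illustration only.
STOP_WORDS = {"a", "for", "you", "only", "new"}
VOCABULARY = ["free", "service", "meeting", "tomorrow"]  # learned at fit time

def to_binary_vector(message: str) -> list:
    # Keep the tokens that are not stop words, then mark which
    # vocabulary words appear in the message (binary behaviour).
    tokens = {w for w in message.lower().split() if w not in STOP_WORDS}
    return [1 if word in tokens else 0 for word in VOCABULARY]

print(to_binary_vector("A new free service for you only"))  # [1, 1, 0, 0]
```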



<p>Now that the function for processing the input data is defined, you can pass the <code>GET</code> and <code>POST</code> methods.</p>



<p>First, let&#8217;s go for the <code>GET</code> method!</p>



<pre class="wp-block-code"><code class="">@app.get('/')
def root():
    return {'message': 'Welcome to the SPAM classifier API'}</code></pre>



<p>Here you can see the <em>welcome message</em> when you arrive on your API.</p>



<pre class="wp-block-preformatted"><code><strong>{"message":"Welcome to the SPAM classifier API"}</strong></code></pre>



<p>Now it&#8217;s the turn of the <code>POST</code> method. In this part of the code, you will be able to:</p>



<ul class="wp-block-list">
<li>define the message format</li>



<li>check if a message has been sent or not</li>



<li>process the message to fit with the model</li>



<li>extract the probabilities</li>



<li>return the results</li>
</ul>



<pre class="wp-block-code"><code class="">@app.post('/spam_detection_path')
def classify_message(data : request_body):

    message = [
        data.message
    ]

    if not data.message:
        raise HTTPException(status_code=400, detail="Please provide a valid text message")

    dense_select = process_message(message)

    label = model_logistic_regression.predict(dense_select)
    proba = model_logistic_regression.predict_proba(dense_select)

    if label[0]=='ham':
        label_proba = proba[0][0]
    else:
        label_proba = proba[0][1]

    return {'label': label[0], 'label_probability': label_proba}</code></pre>



<p><code><strong>❗ Again, you can find the full code <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></strong></code>.</p>



<p>Before deploying your API, you can test it locally using the following command:</p>



<pre class="wp-block-code"><code class="">uvicorn app:app --reload</code></pre>



<p>Then, you can test your app locally at the following address:&nbsp;<strong><code>http://localhost:8000/</code></strong></p>



<p>You will arrive on the following page:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-4.png" alt="" class="wp-image-24217" width="590" height="721" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-4.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-4-245x300.png 245w" sizes="auto, (max-width: 590px) 100vw, 590px" /></figure>



<p><strong>How to interact with your&nbsp;API?</strong></p>



<p>You can add&nbsp;<code>/docs</code>&nbsp;at the end of the url of your&nbsp;app: <strong><code>http://localhost:8000/</code></strong><code><strong>docs</strong></code></p>



<p>A new page opens to you. It provides a complete dashboard for interacting with the&nbsp;API!</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image.png" alt="" class="wp-image-24213" width="590" height="722" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-245x300.png 245w" sizes="auto, (max-width: 590px) 100vw, 590px" /></figure>



<p>To be able to send a message for classification, select&nbsp;<code><strong>/spam_detection_path</strong></code>&nbsp;in the green box. Click on<strong>&nbsp;<code>Try</code></strong><code><strong> it out</strong></code>&nbsp;and type the message of your choice in the dedicated&nbsp;zone.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-2.png" alt="" class="wp-image-24215" width="596" height="729" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-2.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-2-245x300.png 245w" sizes="auto, (max-width: 596px) 100vw, 596px" /></figure>



<p>Enter the message of your choice. It must be in the form of a <code><strong>string</strong></code>. </p>



<p><em>Example:</em> <code><strong>"A new free service for you only"</strong></code></p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-1.png" alt="" class="wp-image-24214" width="599" height="733" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-1.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-1-245x300.png 245w" sizes="auto, (max-width: 599px) 100vw, 599px" /></figure>



<p>To get the result of the prediction, click on the&nbsp;<code><strong>Execute</strong></code>&nbsp;button.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-3.png" alt="" class="wp-image-24216" width="611" height="748" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-3.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-3-245x300.png 245w" sizes="auto, (max-width: 611px) 100vw, 611px" /></figure>



<p>Finally, you obtain the result of the prediction with the&nbsp;<strong>label</strong>&nbsp;and the&nbsp;<strong>confidence&nbsp;score</strong>.</p>
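<p><em>You can also query the endpoint programmatically instead of going through the interactive documentation. A minimal sketch using only the Python standard library, assuming the app is running locally on port 8000:</em></p>

```python
import json
import urllib.request

API_URL = "http://localhost:8000/spam_detection_path"

def build_payload(message: str) -> bytes:
    # The request_body model expects a single "message" string field
    return json.dumps({"message": message}).encode("utf-8")

def classify(message: str) -> dict:
    req = urllib.request.Request(
        API_URL,
        data=build_payload(message),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # The response contains the "label" and "label_probability" fields
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    print(classify("A new free service for you only"))
```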



<p>Your app works locally? Congratulations&nbsp;🎉 !</p>



<p>Now it’s time to move on to containerization!</p>



<h2 class="wp-block-heading">Containerize your app with Docker</h2>



<p>First of all, you have to build the file that will contain the different Python modules to be installed with their corresponding version.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="574" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-1024x574.jpg" alt="docker image datascience" class="wp-image-24230" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-1024x574.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-300x168.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-768x430.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-1536x861.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker.jpg 1620w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Create the requirements.txt file</h3>



<p>The&nbsp;<code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/requirements.txt" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">requirements.txt</a></code>&nbsp;file lists all the modules needed to make our application work.</p>



<pre class="wp-block-code"><code class="">fastapi==0.87.0
pydantic==1.10.2
uvicorn==0.20.0
pandas==1.5.1
scikit-learn==1.1.3</code></pre>



<p>This file will be useful when writing the&nbsp;<code>Dockerfile</code>.</p>



<h3 class="wp-block-heading">Write the Dockerfile</h3>



<p>Your&nbsp;<code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/Dockerfile" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dockerfile</a></code>&nbsp;should start with the&nbsp;<code>FROM</code>&nbsp;instruction indicating the parent image to use. In our case we choose to start from a classic Python image.</p>



<p>For this FastAPI app, you can use version&nbsp;<strong><code>3.8</code></strong>&nbsp;of Python.</p>



<pre class="wp-block-code"><code class="">FROM python:3.8</code></pre>



<p>Next, you have to set the working directory and add all the needed&nbsp;files into it.</p>



<p><code><strong>❗&nbsp;Here you must be in the /workspace directory. This is the base directory for launching an app with OVHcloud AI Deploy.</strong></code></p>



<pre class="wp-block-code"><code class="">WORKDIR /workspace
ADD . /workspace</code></pre>



<p>Install the&nbsp;<code>requirements.txt</code>&nbsp;file which contains your needed Python modules using a&nbsp;<code>pip install…</code>&nbsp;command.</p>



<pre class="wp-block-code"><code class="">RUN pip install -r requirements.txt</code></pre>



<p>Set the listening port of the&nbsp;container. For <strong>FastAPI</strong>, you can use the port <code>8000</code>.</p>



<pre class="wp-block-code"><code class="">EXPOSE 8000</code></pre>



<p>Then, you have to define the <strong>entrypoint</strong> and the <strong>default launching command</strong> to start the application.</p>



<pre class="wp-block-code"><code class="">ENTRYPOINT ["uvicorn"]
CMD [ "app:app", "--host", "0.0.0.0", "--port", "8000" ]</code></pre>



<p>Finally, you can give the correct access rights to the OVHcloud user (<code>42420:42420</code>).</p>



<pre class="wp-block-code"><code class="">RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace</code></pre>
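Putting the steps above together, the complete <code>Dockerfile</code> can be sketched as follows (assuming the FastAPI application object lives in <code>app.py</code>, as in the linked example repository):

```dockerfile
# Parent image
FROM python:3.8

# Working directory expected by AI Deploy
WORKDIR /workspace
ADD . /workspace

# Install the Python dependencies
RUN pip install -r requirements.txt

# FastAPI listening port
EXPOSE 8000

# Launch the API with uvicorn
ENTRYPOINT ["uvicorn"]
CMD [ "app:app", "--host", "0.0.0.0", "--port", "8000" ]

# Give access rights to the OVHcloud user
RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace
```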



<p>Once your&nbsp;<code>Dockerfile</code>&nbsp;is defined, you will be able to build your custom docker image.</p>



<h3 class="wp-block-heading">Build the Docker image from the Dockerfile</h3>



<p>First, you can launch the following command from the&nbsp;<code>Dockerfile</code>&nbsp;directory to build your application image.</p>



<pre class="wp-block-code"><code class="">docker build . -t fastapi-spam-classification:latest</code></pre>



<p>⚠️&nbsp;<strong><code>The dot . argument indicates that your build context (place of the Dockerfile and other needed files) is the current directory.</code></strong></p>



<p>⚠️&nbsp;<code><strong>The -t argument allows you to choose the identifier to give to your image. Usually image identifiers are composed of a name and a version tag &lt;name&gt;:&lt;version&gt;. For this example we chose fastapi-spam-classification:latest.</strong></code></p>



<h3 class="wp-block-heading">Test it locally</h3>



<p>Now, you can run the following&nbsp;<strong>Docker command</strong>&nbsp;to launch your application locally on your computer.</p>



<pre class="wp-block-code"><code class="">docker run --rm -it -p 8000:8000 --user=42420:42420 fastapi-spam-classification:latest</code></pre>



<p>⚠️&nbsp;<code><strong>The -p 8000:8000 argument indicates that you want to execute a port redirection from the port 8000 of your local machine into the port 8000 of the Docker container.</strong></code></p>



<p>⚠️<code><strong>&nbsp;Don't forget the --user=42420:42420 argument if you want to simulate the exact same behaviour that will occur on AI Deploy. It executes the Docker container as the specific OVHcloud user (user 42420:42420).</strong></code></p>



<p>Once started, your application should be available on&nbsp;<strong>http://localhost:8000</strong>.<br><br>Your Docker image seems to work? Good job&nbsp;👍 !<br><br>It’s time to push it and deploy it!</p>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>❗&nbsp;The shared registry of AI Deploy should only be used for testing purposes. Please consider attaching your own Docker registry. More information about this can be found&nbsp;<a href="https://docs.ovh.com/asia/en/publiccloud/ai/training/add-private-registry/" data-wpel-link="exclude">here</a>.</p>



<p>Then, you have to find the address of your&nbsp;<code>shared registry</code>&nbsp;by launching this command.</p>



<pre class="wp-block-code"><code class="">ovhai registry list</code></pre>



<p>Next, log in to the shared registry with your usual&nbsp;<code>OpenStack</code>&nbsp;credentials.</p>



<pre class="wp-block-code"><code class="">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>To finish, you need to push the created image into the shared registry.</p>



<pre class="wp-block-code"><code class="">docker tag fastapi-spam-classification:latest &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>



<pre class="wp-block-code"><code class="">docker push &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>



<p>Once you have pushed your custom Docker image into the shared registry, you are ready to launch your app 🚀 !</p>



<h2 class="wp-block-heading">Launch the AI Deploy app</h2>



<p>The following command starts a new job running your <strong>FastAPI</strong> application.</p>



<pre class="wp-block-code"><code class="">ovhai app run \
      --default-http-port 8000 \
      --cpu 4 \
      &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>



<h3 class="wp-block-heading">Choose the compute resources</h3>



<p>First, you can choose either the number of GPUs or the number of CPUs for your app.</p>



<p><code><strong>--cpu 4</strong></code>&nbsp;indicates that we request 4 CPUs for that app.</p>



<h3 class="wp-block-heading">Make the app public</h3>



<p>Finally, if you want your app to be reachable without any authentication, add the&nbsp;<code><strong>--unsecure-http</strong></code>&nbsp;flag to the <code>ovhai app run</code> command.</p>






<h2 class="wp-block-heading">Conclusion</h2>



<p>Well done 🎉&nbsp;! You have learned how to build your&nbsp;<strong>own Docker image</strong>&nbsp;for a dedicated&nbsp;<strong>spam classification API</strong>!</p>



<p>You have also been able to deploy this app thanks to&nbsp;<strong>OVHcloud’s AI Deploy</strong>&nbsp;tool.</p>



<h3 class="wp-block-heading" id="want-to-find-out-more">Want to find out more?</h3>



<h5 class="wp-block-heading"><strong>Notebook</strong></h5>



<p>You want to access the notebook? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</p>



<h5 class="wp-block-heading"><strong>App</strong></h5>



<p>You want to access the full code of the <strong>FastAPI</strong> API? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/tree/main/apps/fastapi/spam-classifier-api" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.<br><br>To launch and test this app with&nbsp;<strong>AI Deploy</strong>, please refer to&nbsp;our&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-fastapi-spam-classifier/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">References</h2>



<ul class="wp-block-list">
<li><a href="https://towardsdatascience.com/how-to-run-a-data-science-project-in-a-docker-container-2ab1a3baa889" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">How to Run a Data Science Project in a Docker Container</a></li>



<li><a href="https://towardsdatascience.com/step-by-step-approach-to-build-your-machine-learning-api-using-fast-api-21bd32f2bbdb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Step-by-step Approach to Build Your Machine Learning API Using Fast API</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeploy-a-custom-docker-image-for-data-science-project-a-spam-classifier-with-fastapi-part-3%2F&amp;action_name=Deploy%20a%20custom%20Docker%20image%20for%20Data%20Science%20project%20%E2%80%93%20A%20spam%20classifier%20with%20FastAPI%20%28Part%203%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Object detection: train YOLOv5 on a custom dataset</title>
		<link>https://blog.ovhcloud.com/object-detection-train-yolov5-on-a-custom-dataset/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Thu, 17 Mar 2022 15:21:22 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebook]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=21622</guid>

					<description><![CDATA[A guide to train a YOLO object detection algorithm on your dataset. It&#8217;s based on the YOLOv5 open source repository by&#160;Ultralytics. All the code for this blogpost is available in our dedicated GitHub repository. And you can test it in our AI Training, please refer to our documentation to boot it up. Introduction Computer Vision [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fobject-detection-train-yolov5-on-a-custom-dataset%2F&amp;action_name=Object%20detection%3A%20train%20YOLOv5%20on%20a%20custom%20dataset&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to train a <strong>YOLO object detection algorithm</strong>  on your dataset.</em> It&#8217;s based on the YOLOv5 open source repository by&nbsp;<a href="https://github.com/ultralytics/yolov5" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Ultralytics</a>.</p>



<p>All the code for this blogpost is available in our dedicated <a href="https://github.com/ovh/ai-training-examples" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. And you can test it in our <strong>AI Training</strong>, please refer to <a href="https://docs.ovh.com/us/en/publiccloud/ai/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">our documentation</a> to boot it up.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0880-1024x537.jpeg" alt="Object detection: train YOLOv5 on a custom dataset" class="wp-image-22728" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0880-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0880-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0880-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0880.jpeg 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading" id="introduction">Introduction</h3>



<h4 class="wp-block-heading" id="computer-vision">Computer Vision</h4>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>&#8221;&nbsp;<em>Computer vision is a specific field that deals with how computers can gain high-level understanding from digital images or videos.</em>&nbsp;&#8220;</p><p>&#8221;&nbsp;<em>From the perspective of engineering, it seeks to understand and automate tasks that the human visual system can do.</em>&nbsp;&#8220;</p><cite>Wikipedia</cite></blockquote>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0872-1024x361.png" alt="Computer Vision" class="wp-image-22710" width="768" height="271" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0872-1024x361.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0872-300x106.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0872-768x271.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0872.png 1409w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p><strong>The use cases are numerous &#8230;</strong></p>



<ul class="wp-block-list"><li>Automotive: autonomous car</li><li>Medical: cell detection</li><li>Retailing: automatic basket content detection</li><li>&#8230;</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="993" height="743" src="https://blog.ovhcloud.com/wp-content/uploads/2022/05/image-yolov5.png" alt="" class="wp-image-22984" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/05/image-yolov5.png 993w, https://blog.ovhcloud.com/wp-content/uploads/2022/05/image-yolov5-300x224.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/05/image-yolov5-768x575.png 768w" sizes="auto, (max-width: 993px) 100vw, 993px" /></figure></div>



<h4 class="wp-block-heading" id="object-detection">Object Detection</h4>



<p><strong>Object detection</strong> is a branch of computer vision that identifies and locates objects in an image or video stream. This technique allows objects to be labelled accurately. Object detection can be used to determine and count objects in a scene or to track their movement.</p>



<h3 class="wp-block-heading" id="objective">Objective</h3>



<p>The purpose of this article is to show how it is possible to train YOLOv5 to recognise objects. YOLOv5 is an object detection algorithm. Although closely related to image classification, object detection performs image classification on a more precise scale. Object detection locates and categorises features in images.</p>



<p>It is based on the YOLOv5 repository by&nbsp;<a href="https://github.com/ultralytics/yolov5" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Ultralytics</a>.</p>



<h4 class="wp-block-heading">Use case: <a href="https://cocodataset.org/#home" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">COCO dataset</a></h4>



<p>COCO is a large-scale object detection, segmentation, and also captioning dataset. It has several features:</p>



<ul class="wp-block-list"><li>Object segmentation</li><li>Recognition in context</li><li>Superpixel stuff segmentation</li><li>330K images</li><li>1.5 million object instances</li><li>80 object categories</li><li>91 stuff categories</li><li>5 captions per image</li><li>250 000 people with keypoints</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><a href="https://cocodataset.org/#home" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0873.png" alt="COCO dataset" class="wp-image-22712" width="413" height="220" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0873.png 550w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0873-300x160.png 300w" sizes="auto, (max-width: 413px) 100vw, 413px" /></a></figure></div>



<p class="has-ast-global-color-5-color has-ast-global-color-0-background-color has-text-color has-background">⚠️ Next, we will see how to build your <strong>own dataset</strong> to train a YOLOv5 model. But for this tutorial, we will use the <strong>COCO dataset</strong>.</p>



<p style="font-size:12px"><em>OVHcloud disclaims to the fullest extent authorized by law all warranties, whether express or implied, including any implied warranties of title, non-infringement, quiet enjoyment, integration, merchantability or fitness for a particular purpose regarding the use of the COCO dataset in the context of this notebook. The user shall fully comply with the terms of use that appears on the database website (https://cocodataset.org/).</em></p>



<h3 class="wp-block-heading" id="create-your-own-dataset">Create your own dataset</h3>



<p>To build our own dataset, we can follow these steps:</p>



<ul class="wp-block-list"><li><strong>Collect your training images</strong>: to get our object detector off the ground, we first need to collect training images.</li></ul>



<p class="has-ast-global-color-5-color has-ast-global-color-0-background-color has-text-color has-background">⚠️ You must pay attention to the format of the images in your dataset. Think of putting your images in <strong>.jpg</strong> format!</p>



<ul class="wp-block-list"><li><strong>Define the number of classes</strong>: we have to make sure that the number of objects in each class is uniformly distributed.</li></ul>



<ul class="wp-block-list"><li><strong>Annotation of your training images</strong>: to train our object detector, we need to supervise its training using bounding box annotations. We have to draw a box around each object we want the detector to see and label each box with the object class we want the detector to predict.</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-1024x635.png" alt="" class="wp-image-21645" width="667" height="414" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-1024x635.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-300x186.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-768x477.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-1536x953.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation.png 1713w" sizes="auto, (max-width: 667px) 100vw, 667px" /></figure></div>



<p>↪️  Labels should be written as follows:</p>



<ol class="wp-block-list"><li><em>num_label</em>: label (or class) number. If you have <em>n</em> classes, the label number will be between <em>0</em> and <em>n-1</em></li><li><em>X</em> and <em>Y</em>: correspond to the coordinates of the centre of the box</li><li><em>width</em>: width of the box</li><li><em>height</em>: height of the box</li></ol>



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0878.png" alt="Label format" class="wp-image-22726" width="445" height="217" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0878.png 889w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0878-300x146.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0878-768x375.png 768w" sizes="auto, (max-width: 445px) 100vw, 445px" /></figure></div>



<p>If an image contains several labels, we write a line for each label in the same <strong>.txt</strong> file.</p>
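The conversion from an absolute pixel box to this label format can be sketched in a few lines of Python (an illustrative helper, not code from the YOLOv5 repository):

```python
def to_yolo_label(num_label, x_min, y_min, box_w, box_h, img_w, img_h):
    """Convert an absolute pixel box (top-left corner + size) into a
    YOLO label line: class number, then centre x/y and width/height,
    all normalised to [0, 1] by the image dimensions."""
    x_center = (x_min + box_w / 2) / img_w
    y_center = (y_min + box_h / 2) / img_h
    return f"{num_label} {x_center:.6f} {y_center:.6f} {box_w / img_w:.6f} {box_h / img_h:.6f}"

# A 100x50 box whose top-left corner is at (200, 150) in a 640x480 image
print(to_yolo_label(0, 200, 150, 100, 50, 640, 480))
# → 0 0.390625 0.364583 0.156250 0.104167
```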



<ul class="wp-block-list"><li><strong>Split your dataset</strong>: we choose how to disperse our data (for example, keep 80% data in the training set and 20% in the validation set).</li></ul>
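The split itself can be done with a short Python sketch (file names and the 80/20 ratio here are illustrative assumptions):

```python
import random

def split_dataset(image_names, train_ratio=0.8, seed=42):
    """Shuffle the image names and split them into train/valid lists."""
    names = list(image_names)
    random.Random(seed).shuffle(names)
    cut = int(len(names) * train_ratio)
    return names[:cut], names[cut:]

images = [f"img{i}.jpg" for i in range(100)]
train, valid = split_dataset(images)
print(len(train), len(valid))  # → 80 20
```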



<p class="has-ast-global-color-5-color has-ast-global-color-0-background-color has-text-color has-background">⚠️ Images and labels must have the same name.</p>



<p><em>Example:</em></p>



<p><code>data/train/images/img0.jpg     # image </code><br><code>data/train/labels/img0.txt     # label</code></p>



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0875.png" alt="Splitting the dataset" class="wp-image-22718" width="348" height="322" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0875.png 696w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0875-300x277.png 300w" sizes="auto, (max-width: 348px) 100vw, 348px" /></figure></div>



<ul class="wp-block-list"><li><strong>Set up files and directory structure</strong>: to train the YOLOv5 model, we need to add a <strong>.yaml</strong> file to describe the parameters of our dataset.</li></ul>



<p>We have to specify the train and validation files.</p>



<p class="has-ast-global-color-3-color has-text-color"><code>train: /workspace/data/train/images</code><br><code>val: /workspace/data/valid/images</code></p>



<p>After that, we define the number and the names of the classes.</p>



<p class="has-ast-global-color-3-color has-text-color"><code>nc: 80</code><br><code>names: ['aeroplane', 'apple', 'backpack', 'banana', 'baseball bat', 'baseball glove', 'bear', 'bed', 'bench', 'bicycle', 'bird', 'boat', 'book', 'bottle', 'bowl', 'broccoli', 'bus', 'cake', 'car', 'carrot', 'cat', 'cell phone', 'chair', 'clock', 'cow', 'cup', 'diningtable', 'dog', 'donut', 'elephant', 'fire hydrant', 'fork', 'frisbee', 'giraffe', 'hair drier', 'handbag', 'horse', 'hot dog', 'keyboard', 'kite', 'knife', 'laptop', 'microwave', 'motorbike', 'mouse', 'orange', 'oven', 'parking meter', 'person', 'pizza', 'pottedplant', 'refrigerator', 'remote', 'sandwich', 'scissors', 'sheep', 'sink', 'skateboard', 'skis', 'snowboard', 'sofa', 'spoon', 'sports ball', 'stop sign', 'suitcase', 'surfboard', 'teddy bear', 'tennis racket', 'tie', 'toaster', 'toilet', 'toothbrush', 'traffic light', 'train', 'truck', 'tvmonitor', 'umbrella', 'vase', 'wine glass', 'zebra']</code></p>



<p>Let&#8217;s follow the different steps!</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="187" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0874-1024x187.png" alt="Object detection steps" class="wp-image-22716" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0874-1024x187.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0874-300x55.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0874-768x141.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0874.png 1350w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h2 class="wp-block-heading" id="install-yolov5-dependences"><strong>Install YOLOv5 dependences</strong></h2>



<h3 class="wp-block-heading" id="1-clone-yolov5-repository">1. Clone YOLOv5 repository</h3>



<pre class="wp-block-code"><code class="">git clone https://github.com/ultralytics/yolov5 /workspace/yolov5</code></pre>



<h3 class="wp-block-heading" id="2-install-dependencies-as-necessary">2. Install dependencies as necessary</h3>



<p>Now, we have to go to the /<em>yolov5</em> folder and install the &#8220;<em>requirements.txt</em>&#8221; file containing all the necessary dependencies.</p>



<pre class="wp-block-code"><code class="">cd /workspace/yolov5</code></pre>



<p class="has-ast-global-color-5-color has-ast-global-color-0-background-color has-text-color has-background">⚠️&nbsp;Before installing the &#8220;<em>requirements.txt</em>&#8221; file, you have to&nbsp;<strong>modify</strong>&nbsp;it.</p>



<p>To access it, follow this path:<br><code>workspace </code>-&gt; <code>yolov5</code> -&gt; <code>requirements.txt</code><br><br>Then we have to replace the line <code>opencv-python&gt;=4.1.2</code> with <code>opencv-python-headless</code>.<br><br>Now we can save the &#8220;<em>requirements.txt</em>&#8221; file by selecting <code>File</code> in the Jupyter toolbar, then <code>Save File</code>.<br><br>Then, we can start the installation!</p>
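As a shortcut, the same edit can be scripted. The snippet below demonstrates the substitution on a sample file; in the notebook you would run the <code>sed</code> command directly on <code>/workspace/yolov5/requirements.txt</code>:

```shell
# Demonstrate the edit on a sample requirements file
printf 'matplotlib>=3.2.2\nopencv-python>=4.1.2\n' > requirements.txt
# Swap opencv-python for the headless variant
sed -i 's/opencv-python>=4.1.2/opencv-python-headless/' requirements.txt
cat requirements.txt  # the opencv line now reads opencv-python-headless
```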



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<p><strong>It&#8217;s almost over!</strong><br><br>The last step is to open a new terminal:<br><code>File</code> -&gt; <code>New</code> -&gt; <code>Terminal</code><br><br>Once in our new terminal, we run the following command: <code>pip uninstall setuptools</code><br><br>We confirm our action by selecting <code>Y</code>.<br><br>And finally, run the command: <code>pip install setuptools==59.5.0</code><br><br>The installations are now complete.</p>



<p class="has-ast-global-color-5-color has-ast-global-color-0-background-color has-text-color has-background">⚠️ <strong>Reboot the notebook kernel</strong> and follow the next steps!</p>



<h2 class="wp-block-heading" id="import-dependencies"><strong>Import dependencies</strong></h2>



<pre class="wp-block-code"><code class="">import torch
import yaml
from IPython.display import Image, clear_output
from utils.plots import plot_results</code></pre>



<p>We check GPU availability.</p>



<pre class="wp-block-code"><code class="">print('Setup complete. Using torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))</code></pre>



<h2 class="wp-block-heading" id="define-and-train-yolov5-model"><strong>Define and train YOLOv5 model</strong></h2>



<h3 class="wp-block-heading" id="1-define-yolov5-model">1. Define YOLOv5 model</h3>



<p>We go to the directory where the &#8220;<em>data.yaml</em>&#8221; file is located.</p>



<pre class="wp-block-code"><code class="">cd /workspace/data

with open("data.yaml", 'r') as stream:
    num_classes = str(yaml.safe_load(stream)['nc'])</code></pre>



<p>The model configuration used for the tutorial is <strong>YOLOv5s</strong>.</p>



<pre class="wp-block-code"><code class="">cat /workspace/yolov5/models/yolov5s.yaml</code></pre>



<p class="has-ast-global-color-3-color has-text-color" id="yolov5-by-ultralytics-gpl-3-0-license"># <code>YOLOv5 🚀 by Ultralytics, GPL-3.0 license</code><br><br><code># Parameters</code><br><code>nc: 80 # number of classes<br>depth_multiple: 0.33 # model depth multiple<br>width_multiple: 0.50 # layer channel multiple<br>anchors:<br>     - [10,13, 16,30, 33,23] # P3/8<br>     - [30,61, 62,45, 59,119] # P4/16<br>     - [116,90, 156,198, 373,326] # P5/32</code><br><br><code># YOLOv5 backbone<br>backbone:<br>   # [from, number, module, args]<br>   [[-1, 1, Focus, [64, 3]], # 0-P1/2<br>    [-1, 1, Conv, [128, 3, 2]], # 1-P2/4<br>    [-1, 3, C3, [128]],<br>    [-1, 1, Conv, [256, 3, 2]], # 3-P3/8<br>    [-1, 9, C3, [256]],<br>    [-1, 1, Conv, [512, 3, 2]], # 5-P4/16<br>    [-1, 9, C3, [512]],<br>    [-1, 1, Conv, [1024, 3, 2]], # 7-P5/32<br>    [-1, 1, SPP, [1024, [5, 9, 13]]],<br>    [-1, 3, C3, [1024, False]], # 9<br>   ]</code><br><br><code># YOLOv5 head<br>head:</code><br>   <code>[[-1, 1, Conv, [512, 1, 1]],<br>    [-1, 1, nn.Upsample, [None, 2, 'nearest']],<br>    [[-1, 6], 1, Concat, [1]], # cat backbone P4<br>    [-1, 3, C3, [512, False]], # 13<br><br>    [-1, 1, Conv, [256, 1, 1]],<br>    [-1, 1, nn.Upsample, [None, 2, 'nearest']],<br>    [[-1, 4], 1, Concat, [1]], # cat backbone P3<br>    [-1, 3, C3, [256, False]], # 17 (P3/8-small)<br><br>    [-1, 1, Conv, [256, 3, 2]],<br>    [[-1, 14], 1, Concat, [1]], # cat head P4<br>    [-1, 3, C3, [512, False]], # 20 (P4/16-medium)<br><br>    [-1, 1, Conv, [512, 3, 2]],<br>    [[-1, 10], 1, Concat, [1]], # cat head P5<br>    [-1, 3, C3, [1024, False]], # 23 (P5/32-large)<br><br>    [[17, 20, 23], 1, Detect, [nc, anchors]], # Detect(P3, P4, P5)<br>   ]</code></p>



<h3 class="wp-block-heading" id="2-run-yolov5-training">2. Run YOLOv5 training</h3>



<p><strong>Parameters definitions:</strong></p>



<ul class="wp-block-list"><li><em>img</em>: refers to the input image size.</li><li><em>batch</em>: refers to the batch size (number of training examples utilized in one iteration).</li><li><em>epochs</em>: refers to the number of training epochs. An epoch corresponds to one cycle through the full training dataset.</li><li><em>data</em>: refers to the path to the yaml file.</li><li><em>cfg</em>: defines the model configuration.</li></ul>



<p>We will train the YOLOv5s model on the custom dataset for 100 epochs.</p>
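With these parameters, the training command can be sketched as follows (based on the YOLOv5 repository's <code>train.py</code> interface; adjust the paths and batch size to your setup):

```shell
cd /workspace/yolov5
python train.py --img 416 --batch 16 --epochs 100 \
    --data /workspace/data/data.yaml --cfg models/yolov5s.yaml \
    --name yolov5s_results
```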



<h2 class="wp-block-heading" id="evaluate-yolov5-performance-on-coco-dataset"><strong>Evaluate YOLOv5 performance on COCO dataset</strong></h2>



<pre class="wp-block-code"><code class="">Image(filename='/workspace/yolov5/runs/train/yolov5s_results/results.png', width=1000)  # view results.png</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="512" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/image-1-1024x512.png" alt="" class="wp-image-21947" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/image-1-1024x512.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/image-1-300x150.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/image-1-768x384.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/image-1-1536x768.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/image-1-2048x1024.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading" id="graphs-and-functions-explanation"><strong>Graphs and functions explanation</strong></h2>



<p><strong>Loss functions:</strong></p>



<p><em>For the training set:</em></p>



<ul class="wp-block-list"><li>Box: loss due to a box prediction not exactly covering an object.</li><li>Objectness: loss due to a wrong box-object IoU&nbsp;<strong>[1]</strong>&nbsp;prediction.</li><li>Classification: loss due to deviations from predicting ‘1’ for the correct classes and ‘0’ for all the other classes for the object in that box.</li></ul>



<p><em>For the valid set (the same loss functions as for the training data):</em></p>



<ul class="wp-block-list"><li>val Box</li><li>val Objectness</li><li>val Classification</li></ul>



<p><strong>Precision &amp; Recall:</strong></p>



<ul class="wp-block-list"><li>Precision: measures how accurate are the predictions. It is the percentage of your correct predictions</li><li>Recall: measures how good it finds all the positives</li></ul>
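These two metrics follow directly from the counts of true positives (TP), false positives (FP) and false negatives (FN); a quick illustrative sketch:

```python
def precision(tp, fp):
    """Fraction of predicted boxes that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of ground-truth boxes that are found: TP / (TP + FN)."""
    return tp / (tp + fn)

# 8 correct detections, 2 spurious detections, 2 missed objects
print(precision(8, 2))  # → 0.8
print(recall(8, 2))     # → 0.8
```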



<p><em>How to calculate Precision and Recall ?</em></p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="370" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0876-1024x370.png" alt="Precision &amp; Recall" class="wp-image-22720" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0876-1024x370.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0876-300x109.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0876-768x278.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0876-1536x556.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0876.png 2010w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p><strong>Accuracy functions:</strong></p>



<p>mAP (mean Average Precision) compares the ground-truth bounding box to the detected box and returns a score. The higher the score, the more accurate the model is in its detections.</p>



<ul class="wp-block-list"><li>mAP@ 0.5：when IoU is set to 0.5, the AP&nbsp;<strong>[2]</strong>&nbsp;of all pictures of each category is calculated, and then all categories are averaged : mAP</li><li>mAP@ 0.5:0.95：represents the average mAP at different IoU thresholds (from 0.5 to 0.95 in steps of 0.05)</li></ul>



<p><strong>[1] IoU (Intersection over Union)</strong>&nbsp;= measures the overlap between two boundaries. It is used to measure how much the predicted boundary overlaps with the ground truth</p>



<p><em>How to calculate IoU ?</em></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0877-1024x321.png" alt="How to calculate IoU ?" class="wp-image-22724" width="768" height="241" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0877-1024x321.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0877-300x94.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0877-768x241.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0877.png 1110w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>
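For boxes given as (x_min, y_min, x_max, y_max) corners, IoU can be computed as follows (an illustrative sketch, not code from the YOLOv5 repository):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Intersection rectangle (empty if the boxes do not overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    # Union = sum of the two areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # overlap 25, union 175, IoU ≈ 0.143
```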



<p><strong>[2] AP (Average precision)</strong>&nbsp;= popular metric in measuring the accuracy of object detectors. It computes the average precision value for recall value over 0 to 1</p>



<h2 class="wp-block-heading" id="inference"><strong>Inference</strong></h2>



<h4 class="wp-block-heading" id="1-run-yolov5-inference-on-test-images">1. Run YOLOv5 inference on test images</h4>



<p>We can perform inference on the contents of the <strong>/data/images</strong> folder. You can add images of your choice to the same folder in order to perform tests.</p>



<p>First, our trained weights are saved in the weights folder. We can use the best weights to run detection on the test images.</p>



<pre class="wp-block-code"><code class="">cd /workspace/yolov5/
python detect.py --weights runs/train/yolov5s_results/weights/best.pt --img 416 --conf 0.4 --source data/images --name yolov5s_results</code></pre>



<p class="has-ast-global-color-3-color has-text-color"><code>/workspace/yolov5 <br><strong>detect: </strong>weights=['runs/train/yolov5s_results/weights/best.pt'], source=data/images, imgsz=416, conf_thres=0.4, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=yolov5s_results, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False YOLOv5 🚀 v5.0-306-g4495e00 torch 1.8.1 CUDA:0 (Tesla V100S-PCIE-32GB, 32510.5MB) </code><br><br><code>Fusing layers...  </code><br><code>Model Summary: 308 layers, 21356877 parameters, 0 gradients, 51.3 GFLOPs </code></p>



<p>Then, we get the number of occurrences of each class detected in the image.</p>



<p class="has-ast-global-color-3-color has-text-color"><code>image 1/3 /workspace/yolov5/data/images/dog_street.jpg: 416x416 1 bicycle, 1 dog, 5 persons, Done. (0.017s)</code></p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chien_velo-1024x1024.jpeg" alt="" class="wp-image-21951" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chien_velo-1024x1024.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chien_velo-300x300.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chien_velo-150x150.jpeg 150w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chien_velo-768x768.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chien_velo-1536x1536.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chien_velo-2048x2048.jpeg 2048w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chien_velo-70x70.jpeg 70w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p class="has-ast-global-color-3-color has-text-color" id="block-88704cd0-e791-491a-a914-51a4c46f73ba"><code>image 2/3 /workspace/yolov5/data/images/lunch_computer.jpg: 288x416 1 broccoli, 1 cell phone, 1 cup, 1 diningtable, 1 fork, 1 keyboard, 1 knife, Done. (0.021s)</code></p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="1024" height="683" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_lunch_box.jpeg" alt="" class="wp-image-21952" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_lunch_box.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_lunch_box-300x200.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_lunch_box-768x512.jpeg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p class="has-ast-global-color-3-color has-text-color" id="block-5daf33a9-ac73-403f-b24a-10b310429476"><code>image 3/3 /workspace/yolov5/data/images/policeman_horse.jpg: 320x416 6 cars, 2 horses, 2 persons, 1 traffic light, Done. (0.020s)</code></p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="638" height="450" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chevaux_policier.jpeg" alt="" class="wp-image-21953" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chevaux_policier.jpeg 638w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/test_chevaux_policier-300x212.jpeg 300w" sizes="auto, (max-width: 638px) 100vw, 638px" /></figure></div>



<p class="has-ast-global-color-3-color has-text-color"><code>Results saved to runs/detect/yolov5s_results<br>Done. (0.322s)</code></p>



<h3 class="wp-block-heading" id="2-export-trained-weights-for-future-inference">2. Export trained weights for future inference</h3>



<p>Our weights are saved after training our model over 100 epochs.<br><br>Two weight files exist:<br>&#8211; the best one: <strong>best.pt</strong><br>&#8211; the last one: <strong>last.pt</strong><br><br>We choose the <strong>best</strong> one and start by renaming it:</p>



<pre class="wp-block-preformatted">import os

os.chdir("/workspace/yolov5/runs/train/yolov5s_results/weights/")
os.rename("best.pt", "yolov5s_100epochs.pt")</pre>



<p>Then, we copy it into a new folder where we can store all the weights generated during our trainings.</p>



<pre class="wp-block-preformatted">cp /workspace/yolov5/runs/train/yolov5s_results/weights/yolov5s_100epochs.pt /workspace/models_train/yolov5s_100epochs.pt</pre>



<h2 class="wp-block-heading" id="conclusion"><strong>Conclusion</strong></h2>



<p>The accuracy of the model can be improved by increasing the number of epochs, but beyond a certain point the gains plateau, so the number of epochs should be chosen accordingly.<br><br>The accuracy obtained for the test set is&nbsp;<strong>93.71 %</strong>, which is a satisfactory result.</p>



<h3 class="wp-block-heading" id="want-to-find-out-more">Want to find out more?</h3>



<ul class="wp-block-list"><li><strong>Notebook</strong></li></ul>



<p>You want to access the notebook? Refer to the GitHub repository.<br><br>To launch and test this notebook with&nbsp;<strong>AI Notebooks</strong>, please refer to our documentation.</p>



<ul class="wp-block-list"><li><strong>App</strong></li></ul>



<p>You want to access the tutorial to create a simple app? Refer to the <a href="https://github.com/ovh/ai-training-examples" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.<br><br>To launch and test this app with&nbsp;<strong>AI Training</strong>, please refer to <a href="https://docs.ovh.com/us/en/publiccloud/ai/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">our documentation</a>.</p>



<h2 class="wp-block-heading" id="references"><strong>References</strong></h2>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fobject-detection-train-yolov5-on-a-custom-dataset%2F&amp;action_name=Object%20detection%3A%20train%20YOLOv5%20on%20a%20custom%20dataset&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>AI Notebooks: analyze and classify sounds with AI</title>
		<link>https://blog.ovhcloud.com/ai-notebooks-analyze-and-classify-sounds-with-ai/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 04 Mar 2022 08:57:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebook]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=21594</guid>

					<description><![CDATA[A guide to analyze and classify marine mammal sounds. Since you&#8217;re reading a blog post from a technology company, I bet you&#8217;ve heard about AI, Machine and Deep Learning many times before. Audio or sound classification is a technique with multiple applications in the field of AI and data science. Use cases Acoustic data classification: [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fai-notebooks-analyze-and-classify-sounds-with-ai%2F&amp;action_name=AI%20Notebooks%3A%20analyze%20and%20classify%20sounds%20with%20AI&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to analyze and classify <strong>marine mammal sounds</strong>.</em></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834-1024x537.jpeg" alt="AI Notebooks: analyze and classify sounds with AI" class="wp-image-22610" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834.jpeg 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>Since you&#8217;re reading a blog post from a technology company, I bet you&#8217;ve heard about AI, Machine and Deep Learning many times before.</p>



<p>Audio or sound classification is a technique with multiple applications in the field of AI and data science.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835-1024x467.png" alt="AI Notebooks: analyze and classify sounds with AI" class="wp-image-22611" width="768" height="350" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835-1024x467.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835-300x137.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835-768x350.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835.png 1322w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading" id="Use-cases:">Use cases</h3>



<ul class="wp-block-list"><li><strong>Acoustic data classification:</strong></li></ul>



<p>&#8211; identifies location<br>&#8211; differentiates environments<br>&#8211; has a role in ecosystem monitoring</p>



<ul class="wp-block-list"><li><strong>Environmental sound classification:</strong></li></ul>



<p>&#8211; recognition of urban sounds<br>&#8211; used in security systems<br>&#8211; used in predictive maintenance<br>&#8211; used to differentiate animal sounds</p>



<ul class="wp-block-list"><li><strong>Music classification:</strong></li></ul>



<p>&#8211; classify music<br>&#8211; <em>key role in:</em> organising audio libraries by genre, improving recommendation algorithms, discovering trends and listener preferences through data analysis, &#8230;</p>



<ul class="wp-block-list"><li><strong>Natural language classification:</strong></li></ul>



<p>&#8211; human speech classification<br>&#8211; <em>common in:</em> chatbots, virtual assistants, text-to-speech applications, &#8230;</p>



<p>In this article we will look at the <strong>classification of marine mammal sounds</strong>.</p>



<h3 class="wp-block-heading" id="objective">Objective</h3>



<p>The purpose of this article is to explain how to train a model to classify audio files using <em>AI Notebooks</em>.<br><br>In this tutorial, the sounds in the dataset are in <em>.wav</em> format. To be able to use them and obtain results, it is necessary to pre-process this data by following several steps.</p>



<ul class="wp-block-list" id="block-c53a8333-8cfa-4558-81f3-827e57035439"><li>Analyse one of these audio recordings</li><li>Transform each sound file into a <em>.csv</em> file</li><li>Train your model from the <em>.csv</em> file</li></ul>



<p><strong>USE CASE:</strong> <a href="https://www.kaggle.com/shreyj1729/best-of-watkins-marine-mammal-sound-database/version/3" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Best of Watkins Marine Mammal Sound Database</a></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36-1024x617.png" alt="" class="wp-image-21968" width="781" height="471" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36-1024x617.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36-300x181.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36-768x463.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36.png 1031w" sizes="auto, (max-width: 781px) 100vw, 781px" /></figure></div>



<p>This dataset is composed of <strong>55 different folders </strong>corresponding to the marine mammals. In each folder are stored several sound files of each animal.<br><br>You can get more information about this dataset on this <a href="https://cis.whoi.edu/science/B/whalesounds/index.cfm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">website</a>.<br><br>The data distribution is as follows:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837-1024x681.png" alt="The data distribution " class="wp-image-22615" width="512" height="341" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837-1024x681.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837-300x200.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837-768x511.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837.png 1114w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>
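<p>The distribution shown above can be reproduced by counting the sound files in each class folder. A minimal sketch, assuming the dataset has been extracted under <em>/workspace/data</em>:</p>

```python
import os

def count_files_per_class(root):
    """Return {folder_name: number_of_entries} for each class folder under root."""
    return {
        animal: len(os.listdir(os.path.join(root, animal)))
        for animal in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, animal))
    }

data_dir = "/workspace/data"  # assumed extraction path
if os.path.isdir(data_dir):
    for animal, n in count_files_per_class(data_dir).items():
        print(f"{animal}: {n} files")
```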



<p class="has-ast-global-color-5-color has-ast-global-color-0-background-color has-text-color has-background">⚠️ <em>For this example, we choose only the </em><strong>first 45 classes</strong><em> (or folders).</em></p>



<p>Let&#8217;s follow the different steps!</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="188" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-1024x188.png" alt="Data analysis and classification" class="wp-image-22613" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-1024x188.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-300x55.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-768x141.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-1536x282.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-2048x376.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="audio-libraries">Audio libraries</h3>



<h4 class="wp-block-heading" id="1-loading-an-audio-file-with-librosa">1. Loading an audio file with Librosa</h4>



<p><em>Librosa</em> is a Python module for audio signal analysis. By using <em>Librosa</em>, you can extract key features from the audio samples such as Tempo, Chroma Energy Normalized, Mel-Frequency Cepstral Coefficients, Spectral Centroid, Spectral Contrast, Spectral Rolloff, and Zero Crossing Rate. If you want to know more about this library, refer to the <a href="https://librosa.org/doc/latest/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</p>



<pre class="wp-block-code"><code class="">import os

import librosa
import librosa.display as lplt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd</code></pre>



<p>You can start by looking at your data by displaying different parameters using the <em>Librosa</em> library.<br><br>First, you can do a test on a file.</p>



<pre class="wp-block-code"><code class="">test_sound = "data/AtlanticSpottedDolphin/61025001.wav"</code></pre>



<p>Load and decode the audio.</p>



<pre class="wp-block-code"><code class="">data, sr = librosa.load(test_sound)
print(type(data), type(sr))</code></pre>



<p class="has-ast-global-color-3-color has-text-color"><code>&lt;class 'numpy.ndarray'&gt; &lt;class 'int'&gt;</code></p>



<pre class="wp-block-code"><code class="">librosa.load(test_sound, sr = 45600)</code></pre>



<p class="has-ast-global-color-3-color has-text-color"><code>(array([-0.0739522 , -0.06588229, -0.06673266, ..., 0.03021295, 0.05592792, 0. ], dtype=float32), 45600)</code></p>



<h4 class="wp-block-heading" id="2-playing-audio-with-ipython-display-audio">2. Playing Audio with IPython.display.Audio</h4>



<p><a href="https://ipython.org/ipython-doc/stable/api/generated/IPython.display.html#IPython.display.Audio" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">IPython.display.Audio</a> lets you play audio directly in a Jupyter notebook.<br><br>Use <em>IPython.display.Audio</em> to play the audio.</p>



<pre class="wp-block-code"><code class="">import IPython

IPython.display.Audio(data, rate = sr)</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0838.png" alt=" Playing the audio" class="wp-image-22618" width="518" height="130" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0838.png 690w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0838-300x75.png 300w" sizes="auto, (max-width: 518px) 100vw, 518px" /></figure></div>



<h3 class="wp-block-heading" id="visualizing-audio">Visualizing Audio</h3>



<h4 class="wp-block-heading" id="1-waveforms">1. Waveforms</h4>



<p><strong>Waveforms</strong> are visual representations of sound, with time on the x-axis and amplitude on the y-axis. They allow for quick analysis of audio data.<br><br>We can plot the audio array using <em>librosa.display.waveplot</em>.</p>



<pre class="wp-block-code"><code class="">plt.show(librosa.display.waveplot(data))</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/waveforms.png" alt="" class="wp-image-21601" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/waveforms.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/waveforms-300x199.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<h4 class="wp-block-heading" id="2-spectrograms">2. Spectrograms</h4>



<p>A <strong>spectrogram</strong> is a visual way of representing the intensity of a signal over time at various frequencies present in a particular waveform.</p>



<pre class="wp-block-code"><code class="">stft = librosa.stft(data)
plt.colorbar(librosa.display.specshow(stft, sr = sr, x_axis = 'time', y_axis = 'hz'))</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms1.png" alt="" class="wp-image-21602" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms1.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms1-300x199.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<pre class="wp-block-code"><code class="">stft_db = librosa.amplitude_to_db(abs(stft))
plt.colorbar(librosa.display.specshow(stft_db, sr = sr, x_axis = 'time', y_axis = 'hz'))</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms2.png" alt="" class="wp-image-21603" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms2.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms2-300x199.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<h4 class="wp-block-heading" id="3-spectral-rolloff">3. Spectral Rolloff</h4>



<p><strong>Spectral Rolloff</strong> is the frequency below which a specified percentage of the total spectral energy lies.<br><br><em>librosa.feature.spectral_rolloff</em> calculates the attenuation frequency for each frame of a signal.</p>



<pre class="wp-block-code"><code class="">spectral_rolloff = librosa.feature.spectral_rolloff(data + 0.01, sr = sr)[0]
plt.show(librosa.display.waveplot(data, sr = sr, alpha = 0.4))</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectral_rolloff.png" alt="" class="wp-image-21604" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectral_rolloff.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectral_rolloff-300x199.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<h4 class="wp-block-heading" id="4-chroma-feature">4. Chroma Feature</h4>



<p>This tool is perfect for analyzing musical features whose pitches can be meaningfully categorized and whose tuning is close to the equal temperament scale.</p>



<pre class="wp-block-code"><code class="">chroma = librosa.feature.chroma_stft(data, sr = sr)
lplt.specshow(chroma, sr = sr, x_axis = "time" ,y_axis = "chroma", cmap = "coolwarm")
plt.colorbar()
plt.title("Chroma Features")
plt.show()</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="280" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/chroma_feature.png" alt="" class="wp-image-21605" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/chroma_feature.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/chroma_feature-300x207.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<h4 class="wp-block-heading" id="5-zero-crossing-rate">5. Zero Crossing Rate</h4>



<p>A <strong>zero crossing</strong> occurs if successive samples have different algebraic signs.</p>



<ul class="wp-block-list"><li>The rate at which zero crossings occur is a simple measure of the frequency content of a signal.</li><li>The number of zero-crossings measures the number of times in a time interval that the amplitude of speech signals passes through a zero value.</li></ul>



<pre class="wp-block-code"><code class="">start = 1000
end = 1200
plt.plot(data[start:end])
plt.grid()</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="255" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/zero_crossing_rate.png" alt="" class="wp-image-21606" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/zero_crossing_rate.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/zero_crossing_rate-300x188.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>
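<p>The number of zero crossings in the excerpt plotted above can be counted directly. A minimal NumPy sketch (in the notebook, <em>librosa.zero_crossings</em> provides the same information):</p>

```python
import numpy as np

def count_zero_crossings(signal):
    """Count the sign changes between successive samples."""
    signal = np.asarray(signal)
    return int(np.sum(np.abs(np.diff(np.sign(signal))) > 0))

# Illustrative excerpt; in the notebook this would be data[start:end]
excerpt = np.array([0.3, -0.1, -0.2, 0.4, 0.5, -0.6])
print(count_zero_crossings(excerpt))  # 3
```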



<h2 class="wp-block-heading" id="data-preprocessing"><strong>Data preprocessing</strong></h2>



<h4 class="wp-block-heading" id="1-data-transformation">1. Data transformation</h4>



<p>To train your model, preprocessing of data is required. First of all, you have to convert the <em>.wav</em> files into a <em>.csv</em> file.</p>



<ul class="wp-block-list"><li>Define columns name:</li></ul>



<pre class="wp-block-code"><code class="">header = "filename length chroma_stft_mean chroma_stft_var rms_mean rms_var spectral_centroid_mean spectral_centroid_var spectral_bandwidth_mean spectral_bandwidth_var rolloff_mean rolloff_var zero_crossing_rate_mean zero_crossing_rate_var harmony_mean harmony_var perceptr_mean perceptr_var tempo mfcc1_mean mfcc1_var mfcc2_mean mfcc2_var mfcc3_mean mfcc3_var mfcc4_mean mfcc4_var label".split()</code></pre>



<ul class="wp-block-list"><li>Create the <em>data.csv</em> file:</li></ul>



<pre class="wp-block-code"><code class="">import csv

file = open('data.csv', 'w', newline = '')
with file:
    writer = csv.writer(file)
    writer.writerow(header)</code></pre>



<ul class="wp-block-list"><li>Define character string of marine mammals (45):</li></ul>



<p>There are 45 different marine animals, or 45 classes.</p>



<pre class="wp-block-code"><code class="">marine_mammals = "AtlanticSpottedDolphin BeardedSeal Beluga_WhiteWhale BlueWhale BottlenoseDolphin Boutu_AmazonRiverDolphin BowheadWhale ClymeneDolphin Commerson'sDolphin CommonDolphin Dall'sPorpoise DuskyDolphin FalseKillerWhale Fin_FinbackWhale FinlessPorpoise Fraser'sDolphin Grampus_Risso'sDolphin GraySeal GrayWhale HarborPorpoise HarbourSeal HarpSeal Heaviside'sDolphin HoodedSeal HumpbackWhale IrawaddyDolphin JuanFernandezFurSeal KillerWhale LeopardSeal Long_FinnedPilotWhale LongBeaked(Pacific)CommonDolphin MelonHeadedWhale MinkeWhale Narwhal NewZealandFurSeal NorthernRightWhale PantropicalSpottedDolphin RibbonSeal RingedSeal RossSeal Rough_ToothedDolphin SeaOtter Short_Finned(Pacific)PilotWhale SouthernRightWhale SpermWhale".split()</code></pre>



<ul class="wp-block-list"><li>Transform each <em>.wav</em> file into a <em>.csv</em> row:</li></ul>



<pre class="wp-block-code"><code class="">for animal in marine_mammals:

  for filename in os.listdir(f"/workspace/data/{animal}/"):

    sound_name = f"/workspace/data/{animal}/{filename}"
    y, sr = librosa.load(sound_name, mono = True, duration = 30)
    chroma_stft = librosa.feature.chroma_stft(y = y, sr = sr)
    rmse = librosa.feature.rms(y = y)
    spec_cent = librosa.feature.spectral_centroid(y = y, sr = sr)
    spec_bw = librosa.feature.spectral_bandwidth(y = y, sr = sr)
    rolloff = librosa.feature.spectral_rolloff(y = y, sr = sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    mfcc = librosa.feature.mfcc(y = y, sr = sr)
    to_append = f'{filename} {np.mean(chroma_stft)} {np.mean(rmse)} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(rolloff)} {np.mean(zcr)}'
    
    for e in mfcc:
        to_append += f' {np.mean(e)}'

    to_append += f' {animal}'
    file = open('data.csv', 'a', newline = '')
    
    with file:
        writer = csv.writer(file)
        writer.writerow(to_append.split())</code></pre>



<ul class="wp-block-list"><li>Display the <em>data.csv</em> file:</li></ul>



<pre class="wp-block-code"><code class="">df = pd.read_csv('data.csv')</code></pre>



<h4 class="wp-block-heading" id="2-features-extraction">2. Features extraction</h4>



<p>In the preprocessing of the data, <em>feature extraction</em> is necessary before running the training. The purpose is to define the <strong>inputs</strong> and <strong>outputs </strong>of the neural network.</p>



<ul class="wp-block-list"><li><strong>OUTPUT</strong> (y): last column which is the <strong><em>label</em></strong>.</li></ul>



<p>You cannot use text directly for training. You will encode these labels with the <strong>LabelEncoder()</strong> function of <em>sklearn.preprocessing</em>.<br><br>Before running a model, you need to convert this type of categorical text data into numerical data that the model can understand.</p>



<pre class="wp-block-code"><code class="">from sklearn.preprocessing import LabelEncoder

class_list = df.iloc[:,-1]
converter = LabelEncoder()
y = converter.fit_transform(class_list)
print("y: ", y)</code></pre>



<p class="has-ast-global-color-3-color has-text-color"><code>y : [ 0 0 0 ... 44 44 44]</code></p>
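<p>The encoding can be inverted later to map the model&#8217;s numerical predictions back to animal names. A small self-contained example (using an illustrative subset of the 45 classes):</p>

```python
from sklearn.preprocessing import LabelEncoder

converter = LabelEncoder()
labels = ["BeardedSeal", "AtlanticSpottedDolphin", "BeardedSeal"]
y = converter.fit_transform(labels)          # classes are sorted alphabetically
print(list(y))                               # [1, 0, 1]
print(list(converter.inverse_transform(y)))  # back to the animal names
```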



<ul class="wp-block-list"><li><strong>INPUTS</strong> (X): all other columns are input parameters to the neural network.</li></ul>



<p>Remove the first column which does not provide any information for the training (the filename) and the last one which corresponds to the output.</p>



<pre class="wp-block-code"><code class="">from sklearn.preprocessing import StandardScaler

fit = StandardScaler()
X = fit.fit_transform(np.array(df.iloc[:, 1:26], dtype=float))</code></pre>



<h4 class="wp-block-heading" id="3-split-dataset-for-training">3. Split dataset for training</h4>



<pre class="wp-block-code"><code class="">from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)</code></pre>



<h2 class="wp-block-heading" id="building-the-model"><strong>Building the model</strong></h2>



<p id="block-08360fbe-4253-417f-ab02-9a4ee8b0d753">The first step is to build the model and display the summary.<br><br>All hidden layers use a <strong>ReLU</strong> activation function, the output layer uses a <strong>Softmax</strong> function, and <strong>Dropout </strong>is used to avoid overfitting.</p>



<pre class="wp-block-code"><code class="">import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(512, activation = 'relu', input_shape = (X_train.shape[1],)),
    tf.keras.layers.Dropout(0.2),
    
    tf.keras.layers.Dense(256, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),
    
    tf.keras.layers.Dense(128, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),
    
    tf.keras.layers.Dense(64, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),
    
    tf.keras.layers.Dense(45, activation = 'softmax'),
])

print(model.summary())</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="560" height="427" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/model_summary.png" alt="" class="wp-image-21598" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/model_summary.png 560w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/model_summary-300x229.png 300w" sizes="auto, (max-width: 560px) 100vw, 560px" /></figure></div>



<h2 class="wp-block-heading" id="model-training-and-evaluation"><strong>Model training and evaluation</strong></h2>



<p>The <strong>Adam</strong> optimizer is used to train the model over <em>100 epochs</em>, as it gave us better results than the alternatives.<br><br>The loss is calculated with the <strong>sparse_categorical_crossentropy</strong> function.</p>



<pre class="wp-block-code"><code class="">def trainModel(model, epochs, optimizer):
    batch_size = 128
    model.compile(optimizer = optimizer, loss = 'sparse_categorical_crossentropy', metrics = 'accuracy')
    return model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = epochs, batch_size = batch_size)</code></pre>



<p>Now, launch the training!</p>



<pre class="wp-block-code"><code class="">model_history = trainModel(model = model, epochs = 100, optimizer = 'adam')</code></pre>



<ul class="wp-block-list"><li>Display <strong>loss</strong> curves</li></ul>



<pre class="wp-block-code"><code class="">loss_train_curve = model_history.history["loss"]
loss_val_curve = model_history.history["val_loss"]
plt.plot(loss_train_curve, label = "Train")
plt.plot(loss_val_curve, label = "Validation")
plt.legend(loc = 'upper right')
plt.title("Loss")
plt.show()</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="390" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/02/loss.png" alt="" class="wp-image-22523" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/02/loss.png 390w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/loss-300x207.png 300w" sizes="auto, (max-width: 390px) 100vw, 390px" /></figure></div>



<ul class="wp-block-list"><li>Display <strong>accuracy</strong> curves</li></ul>



<pre class="wp-block-code"><code class="">acc_train_curve = model_history.history["accuracy"]
acc_val_curve = model_history.history["val_accuracy"]
plt.plot(acc_train_curve, label = "Train")
plt.plot(acc_val_curve, label = "Validation")
plt.legend(loc = 'lower right')
plt.title("Accuracy")
plt.show()</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="390" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/02/accuracy.png" alt="" class="wp-image-22524" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/02/accuracy.png 390w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/accuracy-300x207.png 300w" sizes="auto, (max-width: 390px) 100vw, 390px" /></figure></div>



<pre class="wp-block-code"><code class="">test_loss, test_acc = model.evaluate(X_test, y_test, batch_size = 128)
print("The test loss is: ", test_loss)
print("The best accuracy is: ", test_acc*100)</code></pre>



<p class="has-ast-global-color-3-color has-text-color"><code>20/20 [==============================] - 0s 3ms/step - loss: 0.2854 - accuracy: 0.9371 </code><br><code>The test loss is: </code>0.24700121581554413<br><code>The best accuracy is: </code>93.71269345283508</p>
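<p>To use the trained model for inference, the softmax output is mapped back to a class name with an argmax. A minimal sketch of that step (the class names and probabilities below are hypothetical placeholders, not the dataset&#8217;s actual 45 labels):</p>

```python
# Map a softmax probability vector to a class label via argmax.
# The class names here are hypothetical placeholders, not the
# dataset's actual 45 labels.
def predict_label(probabilities, class_names):
    """Return the class name with the highest probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best_index]

# e.g. probabilities = model.predict(X_test[:1])[0]
class_names = ["BlueWhale", "Dolphin", "KillerWhale"]  # illustrative subset
probabilities = [0.05, 0.80, 0.15]
print(predict_label(probabilities, class_names))  # Dolphin
```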



<h2 class="wp-block-heading"><strong>Save the model for future inference</strong></h2>



<h4 class="wp-block-heading">1. Save and store the model in an OVHcloud Object Container</h4>



<pre class="wp-block-code"><code class="">model.save('/workspace/model-marine-mammal-sounds/saved_model/my_model')</code></pre>



<p>You can check your model directory.</p>



<pre class="wp-block-code"><code class="">%ls /workspace/model-marine-mammal-sounds/saved_model</code></pre>



<p>The <strong><em>my_model</em></strong> directory contains an <em>assets</em> folder, a <em>saved_model.pb</em> file, and a <em>variables</em> folder.</p>



<pre class="wp-block-code"><code class="">%ls /workspace/model-marine-mammal-sounds/saved_model/my_model</code></pre>



<p>Then, you can load this model.</p>



<pre class="wp-block-code"><code class="">model = tf.keras.models.load_model('/workspace/model-marine-mammal-sounds/saved_model/my_model')</code></pre>



<p><strong>Do you want to use this model in a Streamlit app?</strong> Refer to our <a href="https://github.com/ovh/ai-training-examples/tree/main/jobs/streamlit/marine_sounds_classification_app" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="509" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-1024x509.png" alt="" class="wp-image-22553" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-1024x509.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-300x149.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-768x382.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-1536x763.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00.png 1906w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Streamlit app overview</figcaption></figure></div>



<h3 class="wp-block-heading" id="conclusion">Conclusion</h3>



<p>The accuracy of the model can be improved by increasing the number of epochs, but beyond a certain point it plateaus, so the number of epochs should be chosen accordingly.<br><br>The accuracy obtained on the test set is <strong>93.71 %</strong>, which is a satisfactory result.</p>
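<p>Keras offers the <em>tf.keras.callbacks.EarlyStopping</em> callback to pick this stopping point automatically; the underlying rule can be sketched in plain Python (the patience value of 5 is an illustrative choice, not taken from the tutorial):</p>

```python
# Early-stopping rule: stop once the validation loss has not improved
# for `patience` consecutive epochs. Illustrative sketch, not Keras code.
def should_stop(val_losses, patience=5):
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # Stop if none of the last `patience` epochs beat the earlier best.
    return min(val_losses[-patience:]) >= best_before

# The loss plateaus after epoch 3, so training would stop:
print(should_stop([0.9, 0.5, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45]))  # True
```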



<h4 class="wp-block-heading" id="want-to-find-out-more">Want to find out more?</h4>



<p>If you want to access the notebook, refer to the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/tensorflow/tuto/notebook-marine-sound-classification.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.<br><br>To launch and test this notebook with <strong>AI Notebooks</strong>, please refer to our <a href="https://docs.ovh.com/gb/en/publiccloud/ai/notebooks/" data-wpel-link="exclude">documentation</a>.</p>



<p>You can also look at this presentation given at an <a href="https://startup.ovhcloud.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Startup Program</a> event at <a href="https://stationf.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Station F</a>:</p>


<div class="lazyblock-you-tube-gdpr-compliant-Z2oJhGR wp-block-lazyblock-you-tube-gdpr-compliant"><script type="module">
  import 'https://blog.ovhcloud.com/wp-content/assets/ovhcloud-gdrp-compliant-embedding-widgets/src/ovhcloud-gdrp-compliant-youtube.js';
</script>
      
      <ovhcloud-gdrp-compliant-youtube
          video="EN7XKmPpi78"
          debug></ovhcloud-gdrp-compliant-youtube>

</div>


<p class="has-text-align-center"><em><strong>I hope you have enjoyed this article. Try for yourself!</strong></em></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839-1024x386.png" alt="AI Notebooks: analyze and classify sounds with AI" class="wp-image-22620" width="768" height="290" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839-1024x386.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839-300x113.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839-768x290.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839.png 1219w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading" id="references">References</h3>



<p><a href="https://blog.clairvoyantsoft.com/music-genre-classification-using-cnn-ef9461553726" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://blog.clairvoyantsoft.com/music-genre-classification-using-cnn-ef9461553726</a></p>



<p><a href="https://towardsdatascience.com/music-genre-classification-with-python-c714d032f0d8" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://towardsdatascience.com/music-genre-classification-with-python-c714d032f0d8</a></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OVHcloud AI Notebooks: the power of Jupyter without any compromise</title>
		<link>https://blog.ovhcloud.com/ovhcloud-ai-notebooks-the-power-of-jupyter-without-any-compromise/</link>
		
		<dc:creator><![CDATA[Bastien Verdebout]]></dc:creator>
		<pubDate>Fri, 18 Feb 2022 10:01:22 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebook]]></category>
		<category><![CDATA[Jupyter]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=21430</guid>

					<description><![CDATA[Are you using notebooks, such as Google Colab, for your business, studies or own usage? Are you reaching the maximum capabilities of this service and are looking for a simple yet powerful alternative? This blog post is for you! We will explore our own solution. Let&#8217;s be honest &#8211; Google Colab is solving many challenges. [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fovhcloud-ai-notebooks-the-power-of-jupyter-without-any-compromise%2F&amp;action_name=OVHcloud%20AI%20Notebooks%3A%20the%20power%20of%20Jupyter%20without%20any%20compromise&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><strong>Are you using notebooks, such as Google Colab</strong>, for your business, studies or personal use? Are you <strong>reaching the limits of this service</strong> and looking for a <strong>simple yet powerful alternative</strong>? This blog post is for you! We will explore our own solution.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0817-1024x537.jpeg" alt="OVHcloud AI Notebooks, the power of Jupyter without any compromise" class="wp-image-22495" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0817-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0817-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0817-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0817.jpeg 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p>Let&#8217;s be honest &#8211; Google Colab solves many challenges. It enables millions of people around the world to learn and use Python, and to experiment live with hundreds of libraries for free &#8211; most often for data-oriented use cases, but not only. To be fair, I discovered the power of notebooks through Google Colab a long time ago.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="307" src="https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.header-1024x307.png" alt="" class="wp-image-21431" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.header-1024x307.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.header-300x90.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.header-768x230.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.header.png 1280w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="notebook-you-said-notebook">Notebook? You said notebook?</h3>



<p>A notebook is a document that contains code (e.g. Python code), text and images, which can be read by us humans but also executed inside the notebook. With the Jupyter app, it can be launched inside your web browser, allowing you to explore and experiment easily and to share your work with many others.</p>



<p>Think of notebooks as cooking recipes that you can follow live, step-by-step and see the result directly. It&#8217;s truly wonderful.</p>



<p>Now imagine notebooks that can be linked to remote power and storage to perform your use cases at scale. That&#8217;s it 😀</p>






<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="660" src="https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.0-1024x660.png" alt="" class="wp-image-21432" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.0-1024x660.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.0-300x193.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.0-768x495.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.0.png 1466w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption><em>Example of Jupyter notebook</em></figcaption></figure>



<h3 class="wp-block-heading" id="offers-like-google-colab-are-nice-but-limited">Solutions like Google Colab are nice, but limited</h3>



<p>There are plenty of managed notebook solutions on the market today. The market is split in two &#8211; complicated solutions inside cloud providers (AWS Sagemaker, Google AI Platform Notebooks, Azure ML Notebooks&#8230;) and pure players trying to bring notebooks to the masses. The main actor in this field is Google Colab.</p>



<p>Historically, Google Colab is based on JupyterLab, and forked from this awesome open source project a few years ago (source: <a href="https://research.google.com/colaboratory/faq.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">their FAQ</a>).</p>



<p>Since mid-2021, they provide three plans in nine countries, ranging from free to paid (Pro and Pro+).</p>



<p>After browsing their website and FAQ, I drafted this comparative table:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="578" src="https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.1-1024x578.png" alt="" class="wp-image-21433" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.1-1024x578.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.1-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.1-768x433.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/12/post.1.png 1127w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption><em>Comparative table between Google Colab </em>and OVHcloud AI Notebooks</figcaption></figure>



<p>If you are a student or an individual, Google Colab is ideal for you, as most pricing plans contain the basic features you need for exploring this magic world of data. That&#8217;s cool.</p>



<p><strong>But once you are working on more intensive projects</strong>, you may reach Colab limitations:</p>



<ul class="wp-block-list"><li><strong>Compute resources are not guaranteed</strong>: for example, you don&#8217;t know exactly how long, or how much, GPU power or RAM you will have. This is critical when timing (and reproducibility) matters</li><li><strong>You cannot choose which GPU model will be used</strong>: it might be a good one, or an old one like the K80</li><li><strong>No background execution except in Pro+</strong>: you cannot close your internet browser, or it will automatically stop your work</li><li><strong>Maximum 24 hours of execution</strong>: if you&#8217;re running intensive trainings, it&#8217;s a serious limitation</li><li><strong>Not the official JupyterLab version</strong>: the live code editor is based on JupyterLab, but it is not the exact open source one, so some features are missing</li><li><strong>Unavailable in many countries</strong>: quoting their FAQ, <em>&#8220;For now, both Colab Pro and Pro+ are only available in the following countries: United States, Canada, Japan, Brazil, Germany, France, India, United Kingdom, and Thailand&#8221;</em></li><li><strong>Requires acceptance of Google ToS</strong>: when you use Google Colab, you need to fully accept the <a href="https://colab.research.google.com/pro/terms" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Google terms of service</a> and privacy policy. Read them carefully 😃</li></ul>



<p>Given their pricing, such limits are fully understandable.</p>



<p>But what can you do if you want more?</p>



<h3 class="wp-block-heading" id="good-news-everyone-you-have-some-simple-alternatives">Good news everyone! You have some (simple) alternatives</h3>



<p>We are a <strong>European company</strong>, and if I&#8217;m correct, <strong>the only one to provide managed notebooks in the cloud with GPU power at scale. </strong>We released OVHcloud AI Notebooks a few months ago and it&#8217;s really exciting to solve people&#8217;s challenges. Based on the current usage, it&#8217;s clearly a success (thumbs up to the team behind this new product!).</p>



<p><strong>AI Notebooks = European legislation, guaranteed resources, back-end execution, native Jupyterlab or VSCode editor, no maximum running time, available everywhere&#8230;</strong> <strong>and yet it&#8217;s also simple to use.</strong></p>



<p>Go back to the comparative table. You&#8217;ll see that we solve many blockers 😃.</p>



<p>To explain further, <strong>I&#8217;ve made a short video</strong> where I start in Google Colab and migrate my work to OVHcloud AI Notebooks. I took my time and explained everything, so it lasts 8 minutes. If I wanted to automate it, it would take around 10 seconds.</p>



<figure class="wp-block-embed is-type-video is-provider-vimeo wp-block-embed-vimeo wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Alternative to Google Colab : learn how to move to OVHcloud  AI Notebooks" src="https://player.vimeo.com/video/651935109?h=cbc3d0ba32&amp;dnt=1&amp;app_id=122963" width="1200" height="675" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen></iframe>
</div><figcaption><em>Video tutorial to migrate from Google Colab to AI Notebooks</em></figcaption></figure>



<h3 class="wp-block-heading" id="want-to-give-a-try-fearing-it-s-too-expensive">Want to give it a try? Fearing it&#8217;s too expensive?</h3>



<p>O<a href="https://www.ovhcloud.com/fr/public-cloud/prices/#5260" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">ur pricing is quite simple</a>: you don&#8217;t pay per month, you pay what you consume.</p>



<p>1 CPU costs €0.03 per hour, and 1 NVIDIA V100S GPU costs €1.75 per hour.</p>



<p>If you use a notebook with 2 CPUs for 24 hours, it will cost €1.44 (0.03 × 2 × 24h).</p>
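<p>The pay-per-use model makes cost estimation a one-line multiplication; a small sketch (rates as quoted above, and of course subject to change):</p>

```python
# Pay-per-use cost estimate for an AI Notebook.
# Rates in EUR per hour, as quoted above; actual prices may change.
CPU_RATE = 0.03   # per CPU per hour
GPU_RATE = 1.75   # per NVIDIA V100S GPU per hour

def notebook_cost(cpus=0, gpus=0, hours=0.0):
    """Total cost in EUR for the given resources and duration."""
    return (cpus * CPU_RATE + gpus * GPU_RATE) * hours

print(round(notebook_cost(cpus=2, hours=24), 2))  # 1.44, the example above
```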



<p>It&#8217;s more expensive than the Google Colab plans, but it&#8217;s not exactly the same product. OVHcloud AI Notebooks offers managed notebooks in the cloud with fewer limitations (our real competitors are AWS SageMaker or Google AI Platform).</p>



<p>And we support startups and research! If you are interested, <a href="https://startup.ovhcloud.com/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">join our startup program</a>, where the whole <strong>OVHcloud Public Cloud ecosystem is included (AI tools, K8s, storage, &#8230;) up to €100,000.</strong></p>



<p>We also do some philanthropy for<strong> schools</strong>, <strong>open source projects</strong>&#8230; contact us!</p>



<p>Thanks for reading 😃</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
