<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Deep learning Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/deep-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/deep-learning/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Wed, 29 May 2024 12:36:13 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Deep learning Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/deep-learning/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How to serve LLMs with vLLM and OVHcloud AI Deploy</title>
		<link>https://blog.ovhcloud.com/how-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 29 May 2024 12:22:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMA]]></category>
		<category><![CDATA[LLaMA 3]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Mixtral]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26762</guid>

					<description><![CDATA[In this tutorial, we will learn how to serve Large Language Models (LLMs) using vLLM and the OVHcloud AI Products.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy%2F&amp;action_name=How%20to%20serve%20LLMs%20with%20vLLM%20and%20OVHcloud%20AI%20Deploy&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of serving large language models (LLMs), providing step-by-step instructions</em>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" width="1024" height="345" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png" alt="" class="wp-image-25615" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-768x259.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1536x518.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-2048x690.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>






<h3 class="wp-block-heading">Introduction</h3>



<p>In recent years, <strong>large language models</strong> (LLMs) have become increasingly <strong>popular</strong>, with <strong>open-source</strong> models like <em>Mistral</em> and <em>LLaMA</em> gaining widespread attention. In particular, the <em>LLaMA 3</em> model, released on <em>April 18, 2024</em>, is one of today&#8217;s most powerful open-source LLMs.</p>



<p>However, <strong>serving these LLMs can be challenging</strong>, particularly on hardware with limited resources. Indeed, even on expensive hardware, LLMs can be surprisingly slow, with high VRAM utilization and throughput limitations.</p>



<p>This is where<strong><em> </em></strong><em><a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>vLLM</strong></a></em> comes in. <em><strong>vLLM</strong></em> is an <strong>open-source project</strong> that enables <strong>fast and easy-to-use LLM inference and serving</strong>. Designed for optimal performance and resource utilization, <em>vLLM</em> supports a range of <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLM architectures</a> and offers <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">flexible customization options</a>. That&#8217;s why we are going to use it to efficiently deploy and scale our LLMs.</p>



<h3 class="wp-block-heading">Objective</h3>



<p>In this guide, you will discover how to deploy an LLM thanks to <a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em></a> and the <strong><em>AI Deploy</em></strong> <em>OVHcloud</em> solution. This will enable you to benefit from <em>vLLM</em>&#8216;s optimisations and <em>OVHcloud</em>&#8216;s GPU computing resources. Your LLM will then be exposed through a secure API.</p>



<p>🎁 And for those who do not want to bother with the deployment process, <strong>a surprise awaits you at the <a href="#AI-ENDPOINTS">end of the article</a></strong>. We are going to introduce you to our new solution for using LLMs, called <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Endpoints</strong></a>. This product makes it easy to integrate AI capabilities into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it&#8217;s in alpha, it&#8217;s <strong>free</strong>!</p>



<h3 class="wp-block-heading">Requirements</h3>



<p>To deploy your <em>vLLM</em> server, you need:</p>



<ul class="wp-block-list">
<li>An <em>OVHcloud</em> account to access the <a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude"><em>OVHcloud Control Panel</em></a></li>



<li>A <em>Public Cloud</em> project</li>



<li>A <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for the AI Products</a>, related to this <em>Public Cloud</em> project</li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The <em>OVHcloud AI CLI</em></a> installed on your local computer (to interact with the AI products by running commands). </li>



<li><a href="https://www.docker.com/get-started" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a> installed on your local computer, <strong>or</strong> access to a Debian Docker Instance, which is available on the <a href="https://www.ovh.com/manager/public-cloud/" data-wpel-link="exclude"><em>Public Cloud</em></a></li>
</ul>



<p>Once these conditions have been met, you are ready to serve your LLMs.</p>



<h3 class="wp-block-heading">Building a Docker image</h3>



<p>Since the <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>OVHcloud AI Deploy</em></a> solution is based on <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Docker</em></a> images, we will be using a <em>Docker</em> image to deploy our <em>vLLM</em> inference server. </p>



<p>As a reminder, <em>Docker</em> is a platform that allows you to create, deploy, and run applications in containers. <em>Docker</em> containers are standalone and executable packages that include everything needed to run an application (code, libraries, system tools).</p>



<p>To create this <em>Docker</em> image, we will need to write the following <em><strong>Dockerfile</strong></em> into a new folder:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">mkdir my_vllm_image
cd my_vllm_image
nano Dockerfile</code></pre>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 🐳 Base image
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 👱 Set the working directory inside the container
WORKDIR /workspace

# 📚 Install missing system packages (git) so we can clone the vLLM project repository
RUN apt-get update &amp;&amp; apt-get install -y git
RUN git clone https://github.com/vllm-project/vllm/

# 📚 Install the Python dependencies
RUN pip3 install --upgrade pip
RUN pip3 install vllm 

# 🔑 Give correct access rights to the OVHcloud user
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace</code></pre>



<p>Let&#8217;s take a closer look at this <em>Dockerfile</em> to understand it:</p>



<ul class="wp-block-list">
<li><strong>FROM</strong>: Specify the base image for our <em>Docker</em> Image. We choose the <em>PyTorch</em> image since it comes with <em>CUDA</em>, <em>CuDNN</em> and <em>torch</em>, which is needed by <em>vLLM</em>. </li>



<li><strong>WORKDIR /workspace</strong>: We set the working directory for the <em>Docker</em> container to <em>/workspace</em>, which is the default folder when we use <em>AI Deploy</em>.</li>



<li><strong>RUN</strong>: Allows us to execute commands while building the image. Here, we install <em>git</em> and use it to clone the <a href="https://github.com/vllm-project/vllm/tree/main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em> repository</a> into the <em>/workspace</em> directory, then upgrade <em>pip</em> to the latest version and install the <em>vLLM</em> library.</li>



<li><strong>ENV</strong> HOME=/workspace: This sets the <em>HOME</em> environment variable to <em>/workspace</em>. This is a requirement to use the <em>OVHcloud</em> AI Products.</li>



<li><strong>RUN chown -R 42420:42420 /workspace</strong>: This changes the owner of the <em>/workspace</em> directory to the user and group with IDs of <em>42420</em> (<em>OVHcloud</em> user). This is also a requirement to use the <em>OVHcloud</em> AI Products.</li>
</ul>



<p>This <em>Dockerfile</em> does not contain a <strong>CMD</strong> instruction and therefore does not launch our <em>vLLM</em> server. Do not worry about that: we will do it directly from <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a>&nbsp;to have more flexibility.</p>



<p>Once your Dockerfile is written, launch the following command to build your image:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker build . -t vllm_image:latest</code></pre>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>Once you have built the Docker image, you will need to push it to a <strong>registry</strong> to make it accessible from <em>AI Deploy</em>. A <strong>registry</strong> is a service that allows you to store and distribute <em>Docker</em> images, making it easy to deploy them in different environments.</p>



<p>Several registries can be used (<em><a href="https://www.ovhcloud.com/en-gb/public-cloud/managed-private-registry/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Private Registry</a>, <a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a>, <a href="https://github.com/features/packages" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub packages</a>, &#8230;</em>). In this tutorial, we will use the <strong><em>OVHcloud</em> <em>shared registry</em></strong>. More information is available in the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-manage-registries?id=kb_article_view&amp;sysparm_article=KB0057949" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registries documentation</a>.</p>



<p>To find the address of your shared registry, use the following command (<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a> needs to be installed on your computer):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">ovhai registry list</code></pre>



<p>Then, log in on your <em>shared registry</em> with your usual <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Platform user</em></a> credentials:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>Once you are logged in to the registry, tag the built image and push it to your shared registry:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker tag vllm_image:latest &lt;shared-registry-address&gt;/vllm_image:latest
docker push &lt;shared-registry-address&gt;/vllm_image:latest</code></pre>



<h3 class="wp-block-heading">vLLM inference server deployment</h3>



<p>Once your image has been pushed, it can be used with <em>AI Deploy</em>, using either the <em>ovhai CLI</em> or the <em>OVHcloud Control Panel (UI)</em>.</p>



<h5 class="wp-block-heading">Creating an access token </h5>



<p>Tokens are used as unique authenticators to securely access the <em>AI Deploy</em> apps. By creating a token, you can ensure that only authorized requests are allowed to interact with the <em>vLLM</em> endpoint. You can create this token by using the <em>OVHcloud Control Panel (UI)</em> or by running the following command:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai token create vllm --role operator --label-selector name=vllm</code></pre>



<p>This will give you a token that you will need to keep.</p>



<h5 class="wp-block-heading">Creating a Hugging Face token (optional)</h5>



<p>Note that some models, such as <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 3</a>, require you to accept their license. In that case, you need to create a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>, accept the model&#8217;s license, and generate a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">token</a> from your account settings; this token will allow you to access the model.</p>



<p>For example, when visiting the Hugging Face <a href="https://huggingface.co/google/gemma-2b" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gemma model page</a>, you&#8217;ll see this (if you are logged in):</p>



<figure class="wp-block-image size-full"><img decoding="async" width="716" height="312" src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png" alt="accept_model_conditions_hugging_face" class="wp-image-26768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png 716w, https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21-300x131.png 300w" sizes="(max-width: 716px) 100vw, 716px" /></figure>



<p>If you want to use this model, you will have to acknowledge the license, and then make sure to create a token in the <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tokens section</a>.</p>



<p>In the next step, we will set this token as an environment variable (named <code>HF_TOKEN</code>). Doing this will enable us to use any LLM whose conditions of use we have accepted.</p>
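<p>Before deploying, it can save a failed launch to check locally that the token is set and looks plausible. The helper below is a sketch of our own (<code>require_hf_token</code> is a hypothetical name, not part of any library); it only assumes that current Hugging Face user tokens start with the <code>hf_</code> prefix.</p>

```python
import os

# Hypothetical helper: fail fast when HF_TOKEN is missing or malformed,
# instead of discovering the problem later in the AI Deploy logs.
def require_hf_token(env=None):
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    # Current Hugging Face user tokens start with the "hf_" prefix.
    if not token.startswith("hf_"):
        raise RuntimeError("HF_TOKEN is missing or does not look like a Hugging Face token")
    return token

# Validate a dummy token without touching the real environment.
print(require_hf_token({"HF_TOKEN": "hf_example"}))
```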



<h5 class="wp-block-heading">Run the AI Deploy application</h5>



<p>Run the following command to deploy your <em>vLLM</em> server by running your customized <em>Docker</em> image:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai app run &lt;shared-registry-address&gt;/vllm_image:latest \
  --name vllm_app \
  --flavor h100-1-gpu \
  --gpu 1 \
  --env HF_TOKEN="&lt;YOUR_HUGGING_FACE_TOKEN&gt;" \
  --label name=vllm \
  --default-http-port 8080 \
  -- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code></pre>



<p><em>You just need to change the registry address to the one you used, and the model name to the LLM you want to serve. Also pay attention to the name of the image, its tag, and the label selector if you haven&#8217;t used the same ones as those given in this tutorial.</em></p>



<p><strong>Parameters explanation</strong></p>



<ul class="wp-block-list">
<li><code>&lt;shared-registry-address&gt;/vllm_image:latest</code> is the image on which the app is based.</li>



<li><code>--name vllm_app</code> is an optional argument that allows you to give your app a custom name, making it easier to manage all your apps.</li>



<li><code>--flavor h100-1-gpu</code> indicates that we want to run our app on H100 GPU(s). You can access the full list of available GPUs by running <code>ovhai capabilities flavor list</code>.</li>



<li><code>--gpu 1</code> indicates that we request 1 GPU for that app.</li>



<li><code>--env HF_TOKEN</code> is an optional argument that allows us to set our Hugging Face token as an environment variable. This gives us access to models for which we have accepted the conditions.</li>



<li><code>--label name=vllm</code> restricts access to our LLM: only requests carrying a token created for the label selector <code>name=vllm</code> will be accepted.</li>



<li><code>--default-http-port 8080</code> indicates that the port to reach on the app URL is <code>8080</code>.</li>



<li><code>python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code> (everything after the <code>--</code> separator) is the command that starts the vLLM API server inside the container. The specified &lt;model&gt; will be downloaded from Hugging Face. Here is a list of those that are <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">supported by vLLM</a>. <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Many arguments</a> can be used to optimize your inference.</li>
</ul>



<p>When this <code>ovhai app run</code> command is executed, several pieces of information will appear in your terminal. Get the ID of your application, and open the Info URL in a new tab. Wait a few minutes for your application to launch. When it is <strong>RUNNING</strong>, you can stream its logs by executing:</p>



<pre class="wp-block-code"><code class="">ovhai app logs -f &lt;APP_ID&gt;</code></pre>



<p>This will allow you to track the server launch, the model download and any errors you may encounter if you have used a model for which you have not accepted the user contract. </p>



<p>If all goes well, you should see the following output, which means that your server is up and running:</p>



<pre class="wp-block-code"><code class="">Started server process [11]
Waiting for application startup.
Application startup complete.
Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre>



<h3 class="wp-block-heading">Interacting with your LLM</h3>



<p>Once the server is up and running, we can interact with our LLM by hitting the <code>/generate</code> endpoint.</p>



<p><strong>Using cURL</strong></p>



<p><em>Make sure you change the ID to that of your application so that you target the right endpoint. In order for the request to be accepted, also specify the token that you generated previously by executing</em> <code>ovhai token create</code>. Feel free to adapt the parameters of the request (<em>prompt</em>, <em>max_tokens</em>, <em>temperature</em>, &#8230;)</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">curl --request POST \
  --url https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/generate \
  --header 'Authorization: Bearer &lt;AI_TOKEN_generated_with_CLI&gt;' \
  --header 'Content-Type: application/json' \
  --data '{
        "prompt": "&lt;YOUR_PROMPT&gt;",
        "max_tokens": 50,
        "n": 1,
        "stream": false
}'</code></pre>



<p><strong>Using Python</strong></p>



<p><em>Here too, you need to add your personal token and the correct link for your application.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests
import json

# change for your host
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "AI_TOKEN_generated_with_CLI"

url = f"{APP_URL}/generate"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}
data = {
    "prompt": "What a LLM is in AI?",
    "max_tokens": 100,
    "temperature": 0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json()["text"][0])</code></pre>
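<p>If you plan to query the model repeatedly, it can help to wrap the call in a small function. The sketch below is our own convenience wrapper (<code>build_payload</code>, <code>extract_text</code> and <code>query_llm</code> are hypothetical names, not part of vLLM); it only assumes the <code>{"text": [...]}</code> response shape used in the example above.</p>

```python
import json
import requests

def build_payload(prompt, max_tokens=100, temperature=0.0):
    # Mirror the request parameters used in the examples above.
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}

def extract_text(response_json):
    # The /generate endpoint answers with {"text": ["completion", ...]}.
    return response_json["text"][0]

def query_llm(app_url, token, prompt, **params):
    # Hypothetical wrapper around the POST request shown above.
    response = requests.post(
        f"{app_url}/generate",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        data=json.dumps(build_payload(prompt, **params)),
    )
    response.raise_for_status()
    return extract_text(response.json())
```

<p>For example, <code>query_llm(APP_URL, TOKEN, "What is an LLM?", max_tokens=50)</code> would return only the generated text.</p>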



<h3 class="wp-block-heading" id="AI-ENDPOINTS">OVHcloud AI Endpoints</h3>



<p>If you are not interested in building your own image and deploying your own LLM inference server, you can use OVHcloud&#8217;s new <em><strong><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></strong> </em>product which will make your life definitely easier!</p>



<p><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is a serverless solution that provides AI APIs, enabling you to easily use pre-trained and optimized AI models in your applications. </p>



<figure class="wp-block-video"><video height="1400" style="aspect-ratio: 2560 / 1400;" width="2560" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4"></video></figure>



<p class="has-text-align-center"><em>Overview of AI Endpoints</em></p>



<p>You can use LLMs as a Service, choosing the desired model (such as <em>LLaMA</em>, <em>Mistral</em>, or <em>Mixtral</em>) and making an API call to use it in your application. This allows you to interact with these models without even having to deploy them!</p>



<p>In addition to LLM capabilities, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> also offers a range of other AI models, including speech-to-text, translation, summarization, embeddings and computer vision. </p>



<p>Best of all, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is currently in alpha phase and is <strong>free to use</strong>, making it an accessible and affordable solution for developers seeking to explore the possibilities of AI. Check <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">this article</a> and try it out today to discover the power of AI!</p>



<p>Join our <a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Discord server</a> to interact with the community and send us your feedback (#<em>ai-endpoints</em> channel)!</p>
<img decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy%2F&amp;action_name=How%20to%20serve%20LLMs%20with%20vLLM%20and%20OVHcloud%20AI%20Deploy&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4" length="14424826" type="video/mp4" />

			</item>
		<item>
		<title>Fine-Tuning LLaMA 2 Models using a single GPU, QLoRA and AI Notebooks</title>
		<link>https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Fri, 21 Jul 2023 15:04:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebooks]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Fine-tuning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMa 2]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[QLoRA]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=25613</guid>

					<description><![CDATA[In this tutorial, we will walk you through the process of fine-tuning LLaMA 2 models, providing step-by-step instructions. All the code related to this article is available in our dedicated GitHub repository. You can reproduce all the experiments with OVHcloud AI Notebooks. Introduction On July 18, 2023, Meta released LLaMA 2, the latest version of [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks%2F&amp;action_name=Fine-Tuning%20LLaMA%202%20Models%20using%20a%20single%20GPU%2C%20QLoRA%20and%20AI%20Notebooks&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of fine-tuning <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a> models, providing step-by-step instructions.</em> </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg" alt="Fine-Tuning LLaMA 2 Models with a single GPU and OVHcloud" class="wp-image-25629" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p class="has-text-align-center"><em>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/llm/miniconda/llama2-fine-tuning/llama_2_finetuning.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with</em> <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</p>



<h3 class="wp-block-heading">Introduction</h3>



<p>On July 18, 2023, <a href="https://about.meta.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Meta</a> released <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a>, the latest version of their <strong>Large Language Model </strong>(LLM).</p>



<p>Trained between January 2023 and July 2023 on 2 trillion tokens, these new models outperform other LLMs on many benchmarks, including reasoning, coding, proficiency, and knowledge tests. This release comes in different flavors, with parameter sizes of <strong>7B</strong>, <strong>13B</strong>, and a mind-blowing <strong>70B</strong>. The models are free for both commercial and research use in English.</p>



<p>To adapt these models to any text generation need and fine-tune them, we will use <a href="https://arxiv.org/abs/2305.14314" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">QLoRA (Efficient Finetuning of Quantized LLMs)</a>, a highly efficient fine-tuning technique that quantizes a pretrained LLM to just 4 bits and adds small &#8220;Low-Rank Adapters&#8221;. This unique approach allows for fine-tuning LLMs <strong>using just a single GPU</strong>! This technique is supported by the <a href="https://huggingface.co/docs/peft/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PEFT library</a>.</p>



<p>To fine-tune our model, we will create an <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebook</a> with only 1 GPU.</p>



<h3 class="wp-block-heading">Mandatory requirements</h3>



<p>To successfully fine-tune LLaMA 2 models, you will need the following:</p>



<ul class="wp-block-list">
<li>Fill Meta&#8217;s form to <a href="https://ai.meta.com/resources/models-and-libraries/llama-downloads/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">request access to the next version of Llama</a>. Indeed, the use of Llama 2 is governed by the Meta license, which you must accept in order to download the model weights and tokenizer.<br></li>



<li>Have a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a> account (with the same email address you entered in Meta&#8217;s form).<br></li>



<li>Have a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face token</a>.<br></li>



<li>Visit the page of one of the LLaMA 2 available models (version <a href="https://huggingface.co/meta-llama/Llama-2-7b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">7B</a>, <a href="https://huggingface.co/meta-llama/Llama-2-13b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">13B</a> or <a href="https://huggingface.co/meta-llama/Llama-2-70b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">70B</a>), and accept Hugging Face&#8217;s license terms and acceptable use policy.<br></li>



<li>Log in to the Hugging Face model Hub from your notebook&#8217;s terminal by running the <code>huggingface-cli login</code> command, and enter your token. You will not need to add your token as a git credential.<br></li>



<li>Powerful Computing Resources: Fine-tuning the Llama 2 model requires substantial computational power. Ensure you are running code on GPU(s) when using <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Notebooks</a> or <a href="https://www.ovhcloud.com/en/public-cloud/ai-training/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Training</a>.</li>
</ul>



<h3 class="wp-block-heading">Set up your Python environment</h3>



<p>Create the following <code>requirements.txt</code> file:</p>



<pre class="wp-block-code"><code lang="" class="">torch
accelerate @ git+https://github.com/huggingface/accelerate.git
bitsandbytes
datasets==2.13.1
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
trl @ git+https://github.com/lvwerra/trl.git
scipy</code></pre>



<p>Then install and import the installed libraries:</p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">import argparse
import bitsandbytes as bnb
from datasets import load_dataset
from functools import partial
import os
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, BitsAndBytesConfig, \
    DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset</code></pre>



<h3 class="wp-block-heading">Download LLaMA 2 model</h3>



<p>As mentioned before, LLaMA 2 models come in three sizes: 7B, 13B, and 70B parameters. Your choice can be influenced by your computational resources: larger models require more memory, processing power, and training time.</p>



<p>To download the model you have been granted access to, <strong>make sure you are logged in to the Hugging Face model hub</strong>. As mentioned in the requirements step, you need to use the <code>huggingface-cli login</code> command.</p>



<p>The following function will help us to download the model and its tokenizer. It requires a bitsandbytes configuration that we will define later.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = f'{40960}MB'

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto", # efficiently dispatch the model across the available resources
        max_memory = {i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # Needed for LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer</code></pre>



<h3 class="wp-block-heading">Download a Dataset</h3>



<p>There are many datasets that can help you fine-tune your model. You can even use your own dataset!</p>



<p>In this tutorial, we are going to download and use the <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks Dolly 15k dataset</a>, which contains <strong>15,000 prompt/response pairs</strong>. It was crafted by over 5,000 <a href="https://www.databricks.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks</a> employees during March and April of 2023.</p>



<p>This dataset is designed specifically for fine-tuning large language models. Released under the <a href="https://creativecommons.org/licenses/by-sa/3.0/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">CC BY-SA 3.0 license</a>, it can be used, modified, and extended by any individual or company, even for commercial applications. So it&#8217;s a perfect fit for our use case!</p>



<p>However, like most datasets, this one has <strong>its limitations</strong>. Pay attention to the following points:</p>



<ul class="wp-block-list">
<li>It consists of content collected from the public internet, which means it may contain objectionable, incorrect or biased content and typos, which could influence the behavior of models fine-tuned on this dataset.<br></li>



<li>Since the dataset was created by Databricks&#8217; own employees, it reflects the interests and semantic choices of those employees, which may not be representative of the global population at large.<br></li>



<li>We only have access to the <code>train</code> split of the dataset, which is its largest subset.</li>
</ul>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load the databricks dataset from Hugging Face
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")</code></pre>



<h3 class="wp-block-heading">Explore dataset</h3>



<p>Once the dataset is downloaded, we can take a look at it to understand what it contains:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

*** OUTPUT ***
Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']</code></pre>



<p>As we can see, each sample is a dictionary that contains:</p>



<ul class="wp-block-list">
<li><strong>An instruction</strong>: what the user could enter, such as a question</li>



<li><strong>A context</strong>: additional information that helps to interpret the sample</li>



<li><strong>A response</strong>: the answer to the instruction</li>



<li><strong>A category</strong>: classifies the sample as Open Q&amp;A, Closed Q&amp;A, Extract information from Wikipedia, Summarize information from Wikipedia, Brainstorming, Classification, or Creative writing</li>
</ul>
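<p>To make this structure concrete, here is what one record looks like. The field values below are shortened and purely illustrative, not an exact record from the dataset; in practice you would simply inspect <code>dataset[0]</code>:</p>

```python
# Illustrative record mimicking the structure of databricks-dolly-15k samples
# (values are made up for this example; inspect dataset[0] for real data)
sample = {
    "instruction": "When did Virgin Australia start operating?",
    "context": "Virgin Australia commenced services on 31 August 2000.",
    "response": "Virgin Australia started operating on 31 August 2000.",
    "category": "closed_qa",
}

# Every sample exposes the same four fields
print(sorted(sample.keys()))
# → ['category', 'context', 'instruction', 'response']
```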



<h3 class="wp-block-heading">Pre-processing dataset</h3>



<p><strong>Instruction fine-tuning</strong> is a common technique used to fine-tune a base LLM for a specific downstream use-case.</p>



<p>It will help us to format our prompts as follows: </p>



<pre class="wp-block-code"><code class="">Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Sea or Mountain

### Response:
I believe Mountain are more attractive but Ocean has it's own beauty and this tropical weather definitely turn you on! SO 50% 50%

### End</code></pre>



<p>To delimit each part of the prompt with these markers, we can use the following function:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_prompt_formats(sample):
    """
    Format various fields of the sample ('instruction', 'context', 'response')
    Then concatenate them using two newline characters 
    :param sample: Sample dictionary
    """

    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"
    
    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}\n{sample['instruction']}"
    input_context = f"{INPUT_KEY}\n{sample['context']}" if sample["context"] else None
    response = f"{RESPONSE_KEY}\n{sample['response']}"
    end = f"{END_KEY}"
    
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    
    sample["text"] = formatted_prompt

    return sample</code></pre>



<p>Now, we will use our <strong>model tokenizer to process these prompts into tokenized ones</strong>. </p>



<p>The goal is to create input sequences of uniform length, which are suitable for fine-tuning the language model because they maximize efficiency and minimize computational overhead. These sequences must not exceed the model&#8217;s maximum token limit.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """Format &amp; tokenize the dataset so it is ready for training
    :param tokenizer (AutoTokenizer): Model tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param seed: Seed used to shuffle the dataset
    :param dataset: Dataset to preprocess
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)
    
    # Apply preprocessing to each batch of the dataset &amp; and remove 'instruction', 'context', 'response', 'category' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["instruction", "context", "response", "text", "category"],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) &lt; max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset</code></pre>



<p>With these functions, our dataset will be ready for fine-tuning!</p>



<h3 class="wp-block-heading">Create a bitsandbytes configuration</h3>



<p>This will allow us to load our LLM in 4 bits. This way, we can divide the memory used by 4 and load the model on smaller devices. We choose the bfloat16 compute data type and nested quantization for memory-saving purposes.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    return bnb_config</code></pre>
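<p>To get a feel for the savings, here is a back-of-the-envelope estimate of the weight memory for a 7B-parameter model. The figures are illustrative and cover the weights only, ignoring activations, optimizer state and quantization constants:</p>

```python
n_params = 7_000_000_000  # order of magnitude of LLaMA 2 7B

# 16-bit weights take 2 bytes per parameter, 4-bit NF4 weights take 0.5 byte
fp16_gb = n_params * 2 / 1024**3
nf4_gb = n_params * 0.5 / 1024**3

print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

<p>This rough ratio of 4 is why the quantized model fits on a single GPU.</p>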



<p>To leverage the LoRA method, we need to wrap the model as a PeftModel.</p>



<p>To do this, we need to implement a <a href="https://huggingface.co/docs/peft/conceptual_guides/lora" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LoRA configuration</a>:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply Lora to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config</code></pre>



<p>The previous function needs the <strong>target modules</strong> on which to apply the LoRA updates. The following function will get them for our model:</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)</code></pre>



<p>Once everything is set up and the base model is prepared, we can use the <em>print_trainable_parameters()</em> helper function to see how many trainable parameters are in the model. </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params /= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )</code></pre>



<p>We expect the LoRA model to have far fewer trainable parameters than the original one, since only the low-rank adapters are updated during fine-tuning.</p>
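<p>As a rough illustration (with a hypothetical layer size, not figures measured on LLaMA 2), a rank-16 adapter on a single 4096&#215;4096 projection only adds r &#215; (d_in + d_out) trainable parameters on top of the frozen matrix:</p>

```python
d_in, d_out = 4096, 4096  # hypothetical projection dimensions
r = 16                    # LoRA rank, as set in create_peft_config

frozen = d_in * d_out              # original weight matrix, kept frozen
trainable = r * d_in + r * d_out   # LoRA factors A (r x d_in) and B (d_out x r)

print(f"trainable: {trainable:,} / frozen: {frozen:,} "
      f"({100 * trainable / frozen:.2f}%)")
# → trainable: 131,072 / frozen: 16,777,216 (0.78%)
```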



<h3 class="wp-block-heading">Train</h3>



<p>Now that everything is ready, we can pre-process our dataset and load our model using the set configurations: </p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load model from HF with user's token and with bitsandbytes config

model_name = "meta-llama/Llama-2-7b-hf" 

bnb_config = create_bnb_config()

model, tokenizer = load_model(model_name, bnb_config)</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">## Preprocess dataset

max_length = get_max_length(model)

seed = 42  # fixed seed so the dataset shuffle is reproducible
dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)</code></pre>



<p>Then, we can run our fine-tuning process: </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)
    
    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)
    
    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=20,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs
    
    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)
     
    do_train = True
    
    # Launch training
    print("Training...")
    
    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)    
    
    ###
    
    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)
    
    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()
    
    
output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)</code></pre>



<p><em>If you prefer to specify a number of epochs (passes of the entire training dataset through the model) instead of a number of training steps (forward and backward passes through the model with one batch of data), you can replace the <code>max_steps</code> argument with <code>num_train_epochs</code>.</em></p>
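<p>For reference, the conversion between the two is straightforward. With the 15,011 prompts of this dataset, a per-device batch size of 1 and gradient accumulation over 4 batches, one epoch corresponds to roughly 3,750 optimizer steps (a sketch assuming a single GPU and no filtered-out samples):</p>

```python
import math

n_samples = 15_011   # databricks-dolly-15k, before length filtering
batch_size = 1       # per_device_train_batch_size
grad_accum = 4       # gradient_accumulation_steps

steps_per_epoch = math.ceil(n_samples / (batch_size * grad_accum))
print(steps_per_epoch)  # → 3753
```

<p>So a short run of 20 steps only sees a tiny fraction of the dataset; increase <code>max_steps</code> (or switch to <code>num_train_epochs</code>) for a more thorough fine-tuning.</p>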



<p>To later load and use the model for inference, we have used the <code>trainer.model.save_pretrained(output_dir)</code> function, which saves the fine-tuned adapter weights and their configuration.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png" alt="" class="wp-image-25619" width="870" height="422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-300x146.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-768x374.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results.png 1320w" sizes="auto, (max-width: 870px) 100vw, 870px" /></figure>



<p class="has-text-align-center">Fine-tuning llama2 results on <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">databricks-dolly-15k</a> dataset</p>



<p>Unfortunately, the latest weights are not necessarily the best ones. To solve this problem, you can add an <code>EarlyStoppingCallback</code>, from transformers, during your fine-tuning. This will enable you to regularly evaluate your model on a validation set, if you have one, and keep only the best weights.</p>



<h3 class="wp-block-heading">Merge weights</h3>



<p>Once we have our fine-tuned weights, we can build our fine-tuned model and save it to a new directory, with its associated tokenizer. By performing these steps, we can have a memory-efficient fine-tuned model and tokenizer ready for inference!</p>



<pre class="wp-block-code"><code lang="python" class="language-python">model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)

# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)</code></pre>



<h3 class="wp-block-heading">Conclusion</h3>



<p>We hope you have enjoyed this article!</p>



<p>You are now able to fine-tune LLaMA 2 models on your own datasets!</p>



<p>In our next tutorial, you will discover how to <strong>Deploy your Fine-tuned LLM on <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Deploy</a> for inference</strong>!</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>AI Notebooks: analyze and classify sounds with AI</title>
		<link>https://blog.ovhcloud.com/ai-notebooks-analyze-and-classify-sounds-with-ai/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 04 Mar 2022 08:57:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebook]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=21594</guid>

					<description><![CDATA[A guide to analyze and classify marine mammal sounds. Since you&#8217;re reading a blog post from a technology company, I bet you&#8217;ve heard about AI, Machine and Deep Learning many times before. Audio or sound classification is a technique with multiple applications in the field of AI and data science. Use cases Acoustic data classification: [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to analyze and classify <strong>marine mammal sounds</strong>.</em></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834-1024x537.jpeg" alt="AI Notebooks: analyze and classify sounds with AI" class="wp-image-22610" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0834.jpeg 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>Since you&#8217;re reading a blog post from a technology company, I bet you&#8217;ve heard about AI, Machine and Deep Learning many times before.</p>



<p>Audio or sound classification is a technique with multiple applications in the field of AI and data science.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835-1024x467.png" alt="AI Notebooks: analyze and classify sounds with AI" class="wp-image-22611" width="768" height="350" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835-1024x467.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835-300x137.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835-768x350.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0835.png 1322w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading" id="Use-cases:">Use cases</h3>



<ul class="wp-block-list"><li><strong>Acoustic data classification:</strong></li></ul>



<p>&#8211; identifies location<br>&#8211; differentiates environments<br>&#8211; has a role in ecosystem monitoring</p>



<ul class="wp-block-list"><li><strong>Environmental sound classification:</strong></li></ul>



<p>&#8211; recognition of urban sounds<br>&#8211; used in security system<br>&#8211; used in predictive maintenance<br>&#8211; used to differentiate animal sounds</p>



<ul class="wp-block-list"><li><strong>Music classification:</strong></li></ul>



<p>&#8211; classify music<br>&#8211; <em>key role in:</em> audio libraries organisation by genre, improvement of recommandation algorithms, discovery of trends, listener preferences through data analysis, &#8230;</p>



<ul class="wp-block-list"><li><strong>Natural language classification:</strong></li></ul>



<p>&#8211; human speech classification<br>&#8211; <em>common in:</em> chatbots, virtual assistants, tech-to-speech application, &#8230;</p>



<p>In this article we will look at the <strong>classification of marine mammal sounds</strong>.</p>



<h3 class="wp-block-heading" id="objective">Objective</h3>



<p>The purpose of this article is to explain how to train a model to classify audio files using <em>AI Notebooks</em>.<br><br>In this tutorial, the sounds in the dataset are in <em>.wav</em> format. To be able to use them and obtain results, it is necessary to pre-process this data by following different steps.</p>



<ul class="wp-block-list" id="block-c53a8333-8cfa-4558-81f3-827e57035439"><li>Analyse one of these audio recordings</li><li>Transform each sound file into a <em>.csv</em> file</li><li>Train your model from the <em>.csv</em> file</li></ul>



<p><strong>USE CASE:</strong> <a href="https://www.kaggle.com/shreyj1729/best-of-watkins-marine-mammal-sound-database/version/3" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Best of Watkins Marine Mammal Sound Database</a></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36-1024x617.png" alt="" class="wp-image-21968" width="781" height="471" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36-1024x617.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36-300x181.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36-768x463.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/Capture-décran-2022-02-01-à-14.40.36.png 1031w" sizes="auto, (max-width: 781px) 100vw, 781px" /></figure></div>



<p>This dataset is composed of <strong>55 different folders </strong>corresponding to the marine mammals. In each folder are stored several sound files of each animal.<br><br>You can get more information about this dataset on this <a href="https://cis.whoi.edu/science/B/whalesounds/index.cfm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">website</a>.<br><br>The data distribution is as follows:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837-1024x681.png" alt="The data distribution " class="wp-image-22615" width="512" height="341" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837-1024x681.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837-300x200.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837-768x511.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0837.png 1114w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p class="has-ast-global-color-5-color has-ast-global-color-0-background-color has-text-color has-background">⚠️ <em>For this example, we choose only the </em><strong>first 45 classes</strong><em> (or folders).</em></p>



<p>Let&#8217;s follow the different steps!</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="188" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-1024x188.png" alt="Data analysis and classification" class="wp-image-22613" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-1024x188.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-300x55.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-768x141.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-1536x282.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0836-2048x376.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="audio-libraries">Audio libraries</h3>



<h4 class="wp-block-heading" id="1-loading-an-audio-file-with-librosa">1. Loading an audio file with Librosa</h4>



<p><em>Librosa</em> is a Python module for audio signal analysis. By using <em>Librosa</em>, you can extract key features from the audio samples such as Tempo, Chroma Energy Normalized, Mel-Frequency Cepstral Coefficients, Spectral Centroid, Spectral Contrast, Spectral Rolloff, and Zero Crossing Rate. If you want to know more about this library, refer to the <a href="https://librosa.org/doc/latest/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</p>



<pre class="wp-block-code"><code class="">import librosa
import librosa.display as lplt
import matplotlib.pyplot as plt</code></pre>



<p>You can start by looking at your data by displaying different parameters using the <em>Librosa</em> library.<br><br>First, you can do a test on a file.</p>



<pre class="wp-block-code"><code class="">test_sound = "data/AtlanticSpottedDolphin/61025001.wav"</code></pre>



<p>Load and decode the audio:</p>



<pre class="wp-block-code"><code class="">data, sr = librosa.load(test_sound)
print(type(data), type(sr))</code></pre>



<p class="has-ast-global-color-3-color has-text-color"><code>&lt;class 'numpy.ndarray'&gt; &lt;class 'int'&gt;</code></p>



<pre class="wp-block-code"><code class="">librosa.load(test_sound ,sr = 45600)</code></pre>



<p class="has-ast-global-color-3-color has-text-color"><code>(array([-0.0739522 , -0.06588229, -0.06673266, ..., 0.03021295, 0.05592792, 0. ], dtype=float32), 45600)</code></p>
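<p>The second value returned by <em>librosa.load</em> is the sampling rate, in samples per second. From it, you can recover the duration of the clip; a quick sketch with made-up numbers:</p>

```python
n_samples = 91_200  # hypothetical length of the decoded array, i.e. len(data)
sr = 45_600         # sampling rate requested from librosa.load above

duration_s = n_samples / sr  # duration in seconds
print(duration_s)  # → 2.0
```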



<h4 class="wp-block-heading" id="2-playing-audio-with-ipython-display-audio">2. Playing Audio with IPython.display.Audio</h4>



<p><a href="https://ipython.org/ipython-doc/stable/api/generated/IPython.display.html#IPython.display.Audio" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">IPython.display.Audio</a> lets you play audio directly in a Jupyter notebook.<br><br>Use <em>IPython.display.Audio</em> to play the audio:</p>



<pre class="wp-block-code"><code class="">import IPython

IPython.display.Audio(data, rate = sr)</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0838.png" alt=" Playing the audio" class="wp-image-22618" width="518" height="130" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0838.png 690w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0838-300x75.png 300w" sizes="auto, (max-width: 518px) 100vw, 518px" /></figure></div>



<h3 class="wp-block-heading" id="visualizing-audio">Visualizing Audio</h3>



<h4 class="wp-block-heading" id="1-waveforms">1. Waveforms</h4>



<p><strong>Waveforms</strong> are visual representations of sound, with time on the x-axis and amplitude on the y-axis. They allow for a quick first analysis of audio data.<br><br>You can plot the audio array using <em>librosa.display.waveplot</em>.</p>



<pre class="wp-block-code"><code class="">librosa.display.waveplot(data)
plt.show()</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/waveforms.png" alt="" class="wp-image-21601" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/waveforms.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/waveforms-300x199.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<h4 class="wp-block-heading" id="2-spectrograms">2. Spectrograms</h4>



<p>A <strong>spectrogram</strong> is a visual way of representing the intensity of a signal over time at various frequencies present in a particular waveform.</p>



<pre class="wp-block-code"><code class="">stft = librosa.stft(data)
plt.colorbar(librosa.display.specshow(abs(stft), sr = sr, x_axis = 'time', y_axis = 'hz'))</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms1.png" alt="" class="wp-image-21602" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms1.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms1-300x199.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<pre class="wp-block-code"><code class="">stft_db = librosa.amplitude_to_db(abs(stft))
plt.colorbar(librosa.display.specshow(stft_db, sr = sr, x_axis = 'time', y_axis = 'hz'))</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms2.png" alt="" class="wp-image-21603" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms2.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectrograms2-300x199.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>
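<p>Under the hood, <em>librosa.stft</em> slides a window over the signal and takes the Fourier transform of each frame. A minimal numpy sketch of the idea (using librosa&#8217;s default <code>n_fft = 2048</code> and <code>hop_length = 512</code>):</p>



<pre class="wp-block-code"><code class="">import numpy as np

def stft(signal, n_fft = 2048, hop = 512):
    # slide a Hann window over the signal and FFT each frame
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins
    return np.fft.rfft(frames, axis = 1).T

sr = 22050
t = np.arange(sr) / sr                 # one second of audio
tone = np.sin(2 * np.pi * 440 * t)     # a 440 Hz tone
S = stft(tone)
print(S.shape)  # (1025, 40): frequency bins x frames</code></pre>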



<h4 class="wp-block-heading" id="3-spectral-rolloff">3. Spectral Rolloff</h4>



<p><strong>Spectral Rolloff</strong> is the frequency below which a specified percentage of the total spectral energy lies.<br><br><em>librosa.feature.spectral_rolloff</em> computes the roll-off frequency for each frame of a signal.</p>



<pre class="wp-block-code"><code class="">spectral_rolloff = librosa.feature.spectral_rolloff(y = data + 0.01, sr = sr)[0]
librosa.display.waveplot(data, sr = sr, alpha = 0.4)
plt.show()</code></pre>
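<p>The roll-off frequency itself is easy to compute from a magnitude spectrum: accumulate the energy over the frequency bins and find the bin where the chosen percentage (85% by default in <em>librosa</em>) is reached. A numpy sketch, run here on a hypothetical flat spectrum:</p>



<pre class="wp-block-code"><code class="">import numpy as np

def spectral_rolloff(magnitudes, freqs, roll_percent = 0.85):
    # frequency below which roll_percent of the spectral energy lies
    cumulative = np.cumsum(magnitudes)
    threshold = roll_percent * cumulative[-1]
    return freqs[np.searchsorted(cumulative, threshold)]

freqs = np.linspace(0, 11025, 1025)   # bins up to the Nyquist frequency
magnitudes = np.ones(1025)            # hypothetical flat spectrum
print(spectral_rolloff(magnitudes, freqs))  # ~9378, i.e. 85% of Nyquist</code></pre>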



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectral_rolloff.png" alt="" class="wp-image-21604" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectral_rolloff.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/spectral_rolloff-300x199.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<h4 class="wp-block-heading" id="4-chroma-feature">4. Chroma Feature</h4>



<p>A <strong>chroma feature</strong> projects the spectrum onto the twelve pitch classes. This tool is well suited to analyzing musical signals whose pitches can be meaningfully categorized and whose tuning is close to the equal-tempered scale.</p>



<pre class="wp-block-code"><code class="">chroma = librosa.feature.chroma_stft(y = data, sr = sr)
lplt.specshow(chroma, sr = sr, x_axis = "time", y_axis = "chroma", cmap = "coolwarm")
plt.colorbar()
plt.title("Chroma Features")
plt.show()</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="280" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/chroma_feature.png" alt="" class="wp-image-21605" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/chroma_feature.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/chroma_feature-300x207.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<h4 class="wp-block-heading" id="5-zero-crossing-rate">5. Zero Crossing Rate</h4>



<p>A <strong>zero crossing</strong> occurs if successive samples have different algebraic signs.</p>



<ul class="wp-block-list"><li>The rate at which zero crossings occur is a simple measure of the frequency content of a signal.</li><li>The number of zero-crossings measures the number of times in a time interval that the amplitude of speech signals passes through a zero value.</li></ul>
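<p>The zero-crossing rate can be computed directly with numpy by comparing the signs of consecutive samples; this is a simplified sketch of what <em>librosa.feature.zero_crossing_rate</em> measures per frame:</p>



<pre class="wp-block-code"><code class="">import numpy as np

def zero_crossing_rate(x):
    # fraction of consecutive sample pairs whose signs differ
    signs = np.signbit(x)
    return np.mean(signs[1:] != signs[:-1])

# a signal that changes sign at every sample crosses zero at every step
alternating = np.array([1.0, -1.0] * 100)
print(zero_crossing_rate(alternating))  # 1.0</code></pre>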



<pre class="wp-block-code"><code class="">start = 1000
end = 1200
plt.plot(data[start:end])
plt.grid()</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="406" height="255" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/zero_crossing_rate.png" alt="" class="wp-image-21606" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/zero_crossing_rate.png 406w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/zero_crossing_rate-300x188.png 300w" sizes="auto, (max-width: 406px) 100vw, 406px" /></figure></div>



<h2 class="wp-block-heading" id="data-preprocessing"><strong>Data preprocessing</strong></h2>



<h4 class="wp-block-heading" id="1-data-transformation">1. Data transformation</h4>



<p>To train your model, the data must first be preprocessed: convert the <em>.wav</em> files into a single <em>.csv</em> file of features.</p>



<ul class="wp-block-list"><li>Define the column names:</li></ul>



<pre class="wp-block-code"><code class="">header = "filename length chroma_stft_mean chroma_stft_var rms_mean rms_var spectral_centroid_mean spectral_centroid_var spectral_bandwidth_mean spectral_bandwidth_var rolloff_mean rolloff_var zero_crossing_rate_mean zero_crossing_rate_var harmony_mean harmony_var perceptr_mean perceptr_var tempo mfcc1_mean mfcc1_var mfcc2_mean mfcc2_var mfcc3_mean mfcc3_var mfcc4_mean mfcc4_var label".split()</code></pre>



<ul class="wp-block-list"><li>Create the <em>data.csv</em> file:</li></ul>



<pre class="wp-block-code"><code class="">import csv

file = open('data.csv', 'w', newline = '')
with file:
    writer = csv.writer(file)
    writer.writerow(header)</code></pre>



<ul class="wp-block-list"><li>Define the list of the 45 marine mammal species:</li></ul>



<p>There are 45 different marine mammals, i.e. 45 classes.</p>



<pre class="wp-block-code"><code class="">marine_mammals = "AtlanticSpottedDolphin BeardedSeal Beluga_WhiteWhale BlueWhale BottlenoseDolphin Boutu_AmazonRiverDolphin BowheadWhale ClymeneDolphin Commerson'sDolphin CommonDolphin Dall'sPorpoise DuskyDolphin FalseKillerWhale Fin_FinbackWhale FinlessPorpoise Fraser'sDolphin Grampus_Risso'sDolphin GraySeal GrayWhale HarborPorpoise HarbourSeal HarpSeal Heaviside'sDolphin HoodedSeal HumpbackWhale IrawaddyDolphin JuanFernandezFurSeal KillerWhale LeopardSeal Long_FinnedPilotWhale LongBeaked(Pacific)CommonDolphin MelonHeadedWhale MinkeWhale Narwhal NewZealandFurSeal NorthernRightWhale PantropicalSpottedDolphin RibbonSeal RingedSeal RossSeal Rough_ToothedDolphin SeaOtter Short_Finned(Pacific)PilotWhale SouthernRightWhale SpermWhale".split()</code></pre>



<ul class="wp-block-list"><li>Transform each <em>.wav</em> file into a <em>.csv</em> row:</li></ul>



<pre class="wp-block-code"><code class="">for animal in marine_mammals:

  for filename in os.listdir(f"/workspace/data/{animal}/"):

    sound_name = f"/workspace/data/{animal}/{filename}"
    y, sr = librosa.load(sound_name, mono = True, duration = 30)
    chroma_stft = librosa.feature.chroma_stft(y = y, sr = sr)
    rmse = librosa.feature.rms(y = y)
    spec_cent = librosa.feature.spectral_centroid(y = y, sr = sr)
    spec_bw = librosa.feature.spectral_bandwidth(y = y, sr = sr)
    rolloff = librosa.feature.spectral_rolloff(y = y, sr = sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    mfcc = librosa.feature.mfcc(y = y, sr = sr)
    to_append = f'{filename} {np.mean(chroma_stft)} {np.mean(rmse)} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(rolloff)} {np.mean(zcr)}'
    
    for e in mfcc:
        to_append += f' {np.mean(e)}'

    to_append += f' {animal}'
    file = open('data.csv', 'a', newline = '')
    
    with file:
        writer = csv.writer(file)
        writer.writerow(to_append.split())</code></pre>



<ul class="wp-block-list"><li>Display the <em>data.csv</em> file:</li></ul>



<pre class="wp-block-code"><code class="">df = pd.read_csv('data.csv')
df.head()</code></pre>



<h4 class="wp-block-heading" id="2-features-extraction">2. Features extraction</h4>



<p>In the preprocessing of the data, <em>feature extraction</em> is necessary before running the training. The purpose is to define the <strong>inputs</strong> and <strong>outputs </strong>of the neural network.</p>



<ul class="wp-block-list"><li><strong>OUTPUT</strong> (y): last column which is the <strong><em>label</em></strong>.</li></ul>



<p>You cannot use text directly for training: before running a model, this kind of categorical text data must be converted into numerical data that the model can understand.<br><br>You will encode these labels with the <strong>LabelEncoder()</strong> function of <em>sklearn.preprocessing</em>.</p>



<pre class="wp-block-code"><code class="">from sklearn.preprocessing import LabelEncoder

class_list = df.iloc[:,-1]
converter = LabelEncoder()
y = converter.fit_transform(class_list)
print("y: ", y)</code></pre>



<p class="has-ast-global-color-3-color has-text-color"><code>y : [ 0 0 0 ... 44 44 44]</code></p>



<ul class="wp-block-list"><li><strong>INPUTS</strong> (X): all other columns are input parameters to the neural network.</li></ul>



<p>Remove the first column, which does not provide any useful information for training (the filename), and the last one, which corresponds to the output.</p>



<pre class="wp-block-code"><code class="">from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(np.array(df.iloc[:, 1:26], dtype = float))</code></pre>
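<p><em>StandardScaler</em> standardises each feature column to zero mean and unit variance. On a hypothetical toy matrix, it is equivalent to:</p>



<pre class="wp-block-code"><code class="">import numpy as np

X_raw = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

# subtract the per-column mean and divide by the per-column std
X_scaled = (X_raw - X_raw.mean(axis = 0)) / X_raw.std(axis = 0)

print(X_scaled.mean(axis = 0))  # ~[0. 0.]
print(X_scaled.std(axis = 0))   # ~[1. 1.]</code></pre>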



<h4 class="wp-block-heading" id="3-split-dataset-for-training">3. Split dataset for training</h4>



<pre class="wp-block-code"><code class="">from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)</code></pre>



<h2 class="wp-block-heading" id="building-the-model"><strong>Building the model</strong></h2>



<p id="block-08360fbe-4253-417f-ab02-9a4ee8b0d753">The first step is to build the model and display its summary.<br><br>In this neural network, all hidden layers use a <strong>ReLU</strong> activation function, the output layer a <strong>Softmax</strong> function, and <strong>Dropout</strong> is used to avoid overfitting.</p>



<pre class="wp-block-code"><code class="">import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(512, activation = 'relu', input_shape = (X_train.shape[1],)),
    tf.keras.layers.Dropout(0.2),

    tf.keras.layers.Dense(256, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),

    tf.keras.layers.Dense(128, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),

    tf.keras.layers.Dense(64, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),

    tf.keras.layers.Dense(45, activation = 'softmax'),
])

model.summary()</code></pre>
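<p>The <strong>ReLU</strong> and <strong>Softmax</strong> activations mentioned above are simple functions; a numpy sketch:</p>



<pre class="wp-block-code"><code class="">import numpy as np

def relu(x):
    # ReLU zeroes out negative activations
    return np.maximum(0.0, x)

def softmax(x):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(relu(np.array([-3.0, 0.5])))          # [0.  0.5]
probs = softmax(np.array([2.0, 1.0, -1.0]))
print(probs.sum())                          # ~1.0: a distribution over classes</code></pre>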



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="560" height="427" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/model_summary.png" alt="" class="wp-image-21598" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/model_summary.png 560w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/model_summary-300x229.png 300w" sizes="auto, (max-width: 560px) 100vw, 560px" /></figure></div>



<h2 class="wp-block-heading" id="model-training-and-evaluation"><strong>Model training and evaluation</strong></h2>



<p>The <strong>Adam</strong> optimizer is used to train the model over <em>100 epochs</em>, as it gave the best results here.<br><br>The loss is calculated with the <strong>sparse_categorical_crossentropy</strong> function.</p>



<pre class="wp-block-code"><code class="">def trainModel(model, epochs, optimizer):
    batch_size = 128
    model.compile(optimizer = optimizer, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
    return model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = epochs, batch_size = batch_size)</code></pre>
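<p>The sparse categorical cross-entropy is the mean negative log-probability that the model assigns to the true class (labels are integer class ids, not one-hot vectors). A numpy sketch with hypothetical predictions:</p>



<pre class="wp-block-code"><code class="">import numpy as np

def sparse_categorical_crossentropy(y_true, y_pred):
    # mean negative log-probability of the true class
    return -np.mean(np.log(y_pred[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 2])                  # integer class ids
y_pred = np.array([[0.7, 0.2, 0.1],        # softmax outputs
                   [0.1, 0.1, 0.8]])
print(sparse_categorical_crossentropy(y_true, y_pred))  # ~0.29</code></pre>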



<p>Now, launch the training!</p>



<pre class="wp-block-code"><code class="">model_history = trainModel(model = model, epochs = 100, optimizer = 'adam')</code></pre>



<ul class="wp-block-list"><li>Display <strong>loss</strong> curves</li></ul>



<pre class="wp-block-code"><code class="">loss_train_curve = model_history.history["loss"]
loss_val_curve = model_history.history["val_loss"]
plt.plot(loss_train_curve, label = "Train")
plt.plot(loss_val_curve, label = "Validation")
plt.legend(loc = 'upper right')
plt.title("Loss")
plt.show()</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="390" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/02/loss.png" alt="" class="wp-image-22523" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/02/loss.png 390w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/loss-300x207.png 300w" sizes="auto, (max-width: 390px) 100vw, 390px" /></figure></div>



<ul class="wp-block-list"><li>Display <strong>accuracy</strong> curves</li></ul>



<pre class="wp-block-code"><code class="">acc_train_curve = model_history.history["accuracy"]
acc_val_curve = model_history.history["val_accuracy"]
plt.plot(acc_train_curve, label = "Train")
plt.plot(acc_val_curve, label = "Validation")
plt.legend(loc = 'lower right')
plt.title("Accuracy")
plt.show()</code></pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="390" height="269" src="https://blog.ovhcloud.com/wp-content/uploads/2022/02/accuracy.png" alt="" class="wp-image-22524" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/02/accuracy.png 390w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/accuracy-300x207.png 300w" sizes="auto, (max-width: 390px) 100vw, 390px" /></figure></div>



<pre class="wp-block-code"><code class="">test_loss, test_acc = model.evaluate(X_test, y_test, batch_size = 128)
print("The test loss is: ", test_loss)
print("The best accuracy is: ", test_acc * 100)</code></pre>



<p class="has-ast-global-color-3-color has-text-color"><code>20/20 [==============================] - 0s 3ms/step - loss: 0.2854 - accuracy: 0.9371 </code><br><code>The test loss is: </code>0.24700121581554413<br><code>The best accuracy is: </code>93.71269345283508</p>



<h2 class="wp-block-heading"><strong>Save the model for future inference</strong></h2>



<h4 class="wp-block-heading">1. Save and store the model in an OVHcloud Object Container</h4>



<pre class="wp-block-code"><code class="">model.save('/workspace/model-marine-mammal-sounds/saved_model/my_model')</code></pre>



<p>You can check your model directory.</p>



<pre class="wp-block-code"><code class="">%ls /workspace/model-marine-mammal-sounds/saved_model</code></pre>



<p>The <strong><em>saved_model</em></strong> directory contains an <em>assets</em> folder, a <em>saved_model.pb</em> file and a <em>variables</em> folder.</p>



<pre class="wp-block-code"><code class="">%ls /workspace/model-marine-mammal-sounds/saved_model/my_model</code></pre>



<p>Then, you are able to load this model.</p>



<pre class="wp-block-code"><code class="">model = tf.keras.models.load_model('/workspace/model-marine-mammal-sounds/saved_model/my_model')</code></pre>



<p><strong>Do you want to use this model in a Streamlit app?</strong> Refer to our <a href="https://github.com/ovh/ai-training-examples/tree/main/jobs/streamlit/marine_sounds_classification_app" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="509" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-1024x509.png" alt="" class="wp-image-22553" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-1024x509.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-300x149.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-768x382.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00-1536x763.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/Capture-décran-2022-02-28-à-21.01.00.png 1906w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Streamlit app overview</figcaption></figure></div>



<h3 class="wp-block-heading" id="conclusion">Conclusion</h3>



<p>The accuracy of the model can be improved by increasing the number of epochs, but beyond a certain point the accuracy plateaus, so the number of epochs should be chosen accordingly.<br><br>The accuracy obtained on the test set is <strong>93.71 %</strong>, which is a satisfactory result.</p>



<h4 class="wp-block-heading" id="want-to-find-out-more">Want to find out more?</h4>



<p>If you want to access the notebook, refer to the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/tensorflow/tuto/notebook-marine-sound-classification.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.<br><br>To launch and test this notebook with <strong>AI Notebooks</strong>, please refer to our <a href="https://docs.ovh.com/gb/en/publiccloud/ai/notebooks/" data-wpel-link="exclude">documentation</a>.</p>



<p>You can also look at this presentation done at a <a href="https://startup.ovhcloud.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Startup Program</a> event at <a href="https://stationf.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Station F</a>:</p>


<div class="lazyblock-you-tube-gdpr-compliant-Z2oJhGR wp-block-lazyblock-you-tube-gdpr-compliant"><script type="module">
  import 'https://blog.ovhcloud.com/wp-content/assets/ovhcloud-gdrp-compliant-embedding-widgets/src/ovhcloud-gdrp-compliant-youtube.js';
</script>
      
      <ovhcloud-gdrp-compliant-youtube
          video="EN7XKmPpi78"
          debug></ovhcloud-gdrp-compliant-youtube>

</div>


<p class="has-text-align-center"><em><strong>I hope you have enjoyed this article. Try for yourself!</strong></em></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839-1024x386.png" alt="AI Notebooks: analyze and classify sounds with AI" class="wp-image-22620" width="768" height="290" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839-1024x386.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839-300x113.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839-768x290.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/IMG_0839.png 1219w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading" id="references">References</h3>



<p><a href="https://blog.clairvoyantsoft.com/music-genre-classification-using-cnn-ef9461553726" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://blog.clairvoyantsoft.com/music-genre-classification-using-cnn-ef9461553726</a></p>



<p><a href="https://towardsdatascience.com/music-genre-classification-with-python-c714d032f0d8" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://towardsdatascience.com/music-genre-classification-with-python-c714d032f0d8</a></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Managing GPU pools efficiently in AI pipelines</title>
		<link>https://blog.ovhcloud.com/managing-gpu-pools-efficiently-in-ai-pipelines/</link>
		
		<dc:creator><![CDATA[Bastien Verdebout]]></dc:creator>
		<pubDate>Tue, 22 Dec 2020 16:18:36 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=20146</guid>

					<description><![CDATA[A growing number of companies are using artificial intelligence on a daily basis — and dealing with the back-end architecture can reveal some unexpected challenges. Whether the machine learning workload involves fraud detection, forecasts, chatbots, computer vision or NLP, it will need frequent access to computing power for training and fine-tuning. GPUs have proven to [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>A growing number of companies are using artificial intelligence on a daily basis — and dealing with the back-end architecture can reveal some unexpected challenges.</p>



<p>Whether the machine learning workload involves fraud detection, forecasts, chatbots, computer vision or NLP, it will need frequent access to computing power for training and fine-tuning.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0420-1024x537.png" alt="Managing GPU pools efficiently in AI pipelines" class="wp-image-20449" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0420.png 1200w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>GPUs have proven to be a game-changer for deep learning. If you&#8217;re wondering why, you can find out more by reading our blog post about <a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">GPU architectures</a>. A few years ago, manufacturers such as NVIDIA began to develop specific ranges for cloud datacentres. You may be familiar with the NVIDIA TITAN RTX for gaming — and in our datacentres, we use NVIDIA A100, V100, Tesla and DGX GPUs for enterprise-grade workloads.</p>



<p>In short, GPUs are perfect for tasks that can be solved or improved by AI and require a lot of processing power. They offer excellent compute performance, which is why they are so widely used in deep learning, and why they seem to be the natural choice for the growing number of companies adopting AI.</p>



<p>However, when dealing with pools of GPUs, the back-end architecture can be really tricky.  </p>



<p><strong>So how do we use them to benefit a company with minimal hassle and headaches?</strong> <strong>On-premise or in the cloud?</strong></p>



<p>These are good questions that I&#8217;m keen to discuss here, from both a business and technical perspective.</p>



<p></p>



<h3 class="wp-block-heading">Dealing with GPU pools&#8230; The struggle is real.</h3>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0419.png" alt="One does not simply set up GPUs for Deep Learning" class="wp-image-20443" width="603" height="430" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0419.png 804w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0419-300x214.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0419-768x547.png 768w" sizes="auto, (max-width: 603px) 100vw, 603px" /></figure></div>



<p>For anyone who has had to deploy and manage more than 1 GPU for a data-AI team, I&#8217;m sure this topic will bring tears to your eyes, and make your voice tremble. Yes, it is indeed complicated.</p>



<p>I can talk about it on our blog, because our team of data scientists here at OVHcloud had to deal with the exact same annoying issues. Thankfully, we solved all of them — stay tuned!</p>



<p><strong>GPU sharing is hard</strong>. Even if one GPU is better than none, in most cases it will not be sufficient, and a GPU pool will be far more effective. From a tech perspective, dealing with a GPU pool — or worse, allowing your team to use this pool simultaneously — is very tricky. The market is really mature for CPU sharing (via hypervisors), but by design, a GPU has to be attached to a VM or container. This means that quite often, it needs to be &#8220;booked&#8221; for a specific workload. To get around this issue, you&#8217;ll need to provide a scale-out with orchestration, so that you can dynamically assign GPUs to jobs over time. Whenever you tell yourself &#8220;<em>I want to launch this task with 4 GPUs for 2 days</em>&#8220;, you should simply be able to ask, and the back-end should work its magic for you.</p>
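<p>Concretely, with a container orchestrator such as Kubernetes, asking for 4 GPUs boils down to a declarative resource request that the scheduler satisfies for you. A minimal, hypothetical sketch (the pod name and image are placeholders):</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># A hypothetical training job requesting 4 GPUs; the scheduler
# finds a node with enough free GPUs and books them for the pod.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  restartPolicy: Never
  containers:
    - name: train
      image: my-registry/train:latest
      resources:
        limits:
          nvidia.com/gpu: 4   # whole GPUs are booked, not shared
</code></pre>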



<p><strong>Setting up and maintaining an architecture is time-consuming.&nbsp; </strong>So you&#8217;ve deployed servers with GPU, updated and upgraded your Linux distros, installed your main AI packages, CUDA drivers, and now you want to move on to something else. But wait — a new TensorFlow version has been released, and you also have a security patch to apply. What you initially thought to be a single task is now taking up 4-5 hours of your time per week.</p>



<p><strong>Diagnosing is quite complex</strong>. If, for whatever reason, something isn&#8217;t working as it should — good luck. You barely know who is doing what, and you can&#8217;t track jobs or usage unless you connect to the platform yourself and set up monitoring tools. Remember to grab your snorkel set, because you&#8217;ll need to deep-dive.</p>



<p><strong>Bottlenecks are almost inevitable</strong>. Imagine setting up a pool of GPUs based on your current AI project workloads. Your infrastructure is not really designed to scale automatically, and as soon as the AI workloads increase, your jobs have to be scheduled while the GPU fleet is being updated constantly. A backlog starts to accumulate, and a bottleneck is created as a result.</p>



<p><strong>Providing tools for teams to work collaboratively on code is mandatory.</strong> Usually, your team will need to share their data experimentations — and the best way to do this for now is with <strong>JupyterLab Notebooks</strong> (we love them) or <strong>VSCode. </strong>But you&#8217;ll need to keep in mind that this is more software to set up and maintain.</p>



<p><strong>Securing data access is essential. </strong>The required data must be easily accessible, and sensitive data must be covered by security guarantees.</p>



<p><strong>Cost control is difficult. </strong>Even worse, for one reason or another (who said holidays?), you might need to stop almost all your GPU servers for a week or two — but to do this, you would need to wait for any ongoing jobs to be completed.</p>



<p>All jokes aside, while we may be passionate about tech and hardware, we have other things to do. Data engineers cannot achieve their full potential and talent in maintenance-based or billing-based tasks.</p>



<h3 class="wp-block-heading">Kubeflow to the rescue?</h3>



<p>Kubernetes 1.0 was launched 5 years ago. Whatever your opinion of it, in five years it has become the de facto standard for container orchestration in enterprise environments.</p>



<p>Data scientists use containers for portability, agility, and community — but Kubernetes was made to orchestrate services, not data experimentations.</p>



<p>Kubernetes alone is not tailored for a data team. It presents too much complexity, with the sole benefit of solving the orchestration issue.</p>



<p><strong>We need something that not only improves orchestration, but also code contribution, tests and deployments.</strong></p>



<p>Luckily, <a href="https://www.kubeflow.org/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><strong>Kubeflow</strong> </a>appeared 2 years ago, open-sourced by Google. Its main promise is to simplify complex ML workflows, for example <code>data processing =&gt; data labeling =&gt; training =&gt; serving</code>, and to complement them with notebooks.</p>



<p>I do really love the promise, and the way they simplify ML pipelines. Kubeflow can be run over K8s clusters on-premise or in the cloud, and can also be set up on a single VM or even on a workstation (Linux/Mac/Windows).</p>



<p>Students can easily have their own ML environment. However, for the most advanced uses, a workstation or a single VM might be out of the question, and you would need a K8s cluster with Kubeflow installed on top of that. You&#8217;ll have a nice UI for starting notebooks and creating ML pipelines (processing/training/inference), <strong>but still zero GPU support by default</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.kubeflow.org/docs/images/central-ui.png" alt="" width="480" height="319"/><figcaption>Central Dashboard / Image property of Kubeflow.org</figcaption></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.kubeflow.org/docs/images/pipelines-xgboost-graph.png" alt="" width="480" height="270"/><figcaption>XGBoost pipeline / Image property of Kubeflow.org</figcaption></figure></div>



<p>Your GPU support will depend on your setup. It may differ if you host it on GCP, AWS, Azure, OVHcloud, on-premise, MicroK8s, or anything else.</p>



<p>For example, on AWS EKS, you need to declare GPU pools in your Kubeflow manifest:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># Official doc: https://www.kubeflow.org/docs/aws/customizing-aws/

# NodeGroup holds all configuration attributes that are specific to a node group
# You can have several node groups in your cluster.
nodeGroups:
  - name: eks-gpu
    instanceType: p2.xlarge
    availabilityZones: ["us-west-2b"]
    desiredCapacity: 2
    minSize: 0
    maxSize: 2
    volumeSize: 30
    ssh:
      allow: true
      publicKeyPath: '~/.ssh/id_rsa.pub'</code></pre>



<p>On GCP GKE, you will need to run this command to create a GPU pool:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Official doc: https://www.kubeflow.org/docs/gke/customizing-gke/#common-customizations
 
export GPU_POOL_NAME=&lt;name of the new GPU pool&gt;
 
gcloud container node-pools create ${GPU_POOL_NAME} \
--accelerator type=nvidia-tesla-k80,count=1 \
--zone us-central1-a --cluster ${KF_NAME} \
--num-nodes=1 --machine-type=n1-standard-4 --min-nodes=0 --max-nodes=5 --enable-autoscaling</code></pre>



<p>You will then need to install NVIDIA drivers on all the GPU nodes. NVIDIA maintains a <em>DaemonSet</em> which enables you to install them easily:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Official doc: https://www.kubeflow.org/docs/gke/customizing-gke/#common-customizations
 
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml</code></pre>



<p>Once you have done this, you will be able to create GPU pools (don&#8217;t forget to check your quotas beforehand — with a basic account, you are restricted by default, and you will need to contact their support to raise the limits).</p>



<h3 class="wp-block-heading">Okay, but do things get easier from here?</h3>



<p>As we say in France, especially in Normandy, yes but no.</p>



<p>Yes, Kubeflow does resolve some of the challenges we&#8217;ve mentioned — but some of the biggest challenges are yet to come, and they will take up a lot of your daily routine. Many manual operations will still require you to dig into specific K8s documentation, or guides published by cloud providers.</p>



<p>Below is a summary of <strong>Kubeflow vs GPU pool challenges</strong>.</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Challenges</th><th>Status</th></tr></thead><tbody><tr><td><strong>GPU pool with sharing option</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> but will require manual configuration (declaration in manifest, driver installation, etc.).</td></tr><tr><td><strong>Collaborative tools</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> definitely. Notebooks are provided via Kubeflow.</td></tr><tr><td><strong>Infrastructure maintenance</strong></td><td>Definitely <strong><span class="has-inline-color has-vivid-red-color">NO</span></strong>.<br>Now you have a Kubeflow cluster to maintain and operate.</td></tr><tr><td><strong>Infrastructure diagnosis</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span> BUT <span class="has-inline-color has-vivid-red-color">NO</span></strong>. Activity dashboard and reporting tools (based on Spartakus), logs, etc.<br>But these tools are aimed at data engineers, not data scientists, who may still come back to you.</td></tr><tr><td><strong>Infrastructure agility/flexibility</strong></td><td><strong>TRICKY</strong>. It will depend on your hosting implementation. If it&#8217;s on-premise, definitely no: you&#8217;ll need to buy hardware (an NVIDIA V100 costs approximately $10K, before chassis, electricity usage, etc.).<br>Some cloud providers can provide &#8220;auto-scaling GPU pools&#8221; from 0 to n, which is nice.</td></tr><tr><td><strong>Secured data access</strong></td><td><strong>TRICKY</strong>. It will depend on where your data is located and the technology used. It&#8217;s not a ready-to-use solution.</td></tr><tr><td><strong>Cost control</strong></td><td><strong>TRICKY.</strong> Again, it will depend on your hosting implementation. It&#8217;s not easy, since you need to take care of the infrastructure. Some hidden costs can appear, too (network traffic, monitoring, etc.).</td></tr></tbody></table><figcaption>Kubeflow vs Challenges</figcaption></figure>



<h3 class="wp-block-heading">Forget infrastructure, welcome to GPU platforms made for AI</h3>



<p>You can now find various third-party solutions on the market that go one step further. Instead of dealing with the architecture and the Kubernetes cluster, what if you simply focused on your machine learning or deep learning code?</p>



<p>There are well-known solutions such as <strong>Paperspace Gradient</strong> — or smaller ones, like <strong>Run:AI</strong> — and we&#8217;re pleased to offer another option on the market: <strong>AI Training</strong>. We&#8217;re using this post as a self-promotion opportunity (it&#8217;s our blog after all), but the logic remains the same for competitors.</p>



<p>What are the concepts behind it?</p>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Noinfrastructuretomanage">No infrastructure to manage</h4>



<p>You don&#8217;t need to set up and manage a K8s cluster, or a Kubeflow cluster.</p>



<p>You don&#8217;t need to declare GPU pools in your manifest.</p>



<p>You don&#8217;t need to install NVIDIA drivers on the nodes.</p>



<p>With GPU Platforms like OVHcloud AI Training, your neural network training is as simple as this:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Upload data directly to Object Storage
ovhai data upload myBucket@GRA train.zip

# Launch a job with 4 GPUs on a Pytorch environment, with an Object Storage bucket directly linked to it
ovhai job run \
    --gpu 4 \
    --volume myBucket@GRA:/data:RW \
    ovhcom/ai-training-pytorch:1.6.0</code></pre>



<p>These commands will provide you with a JupyterLab notebook directly plugged into a pool of 4x NVIDIA GPUs, with the Pytorch environment installed. This is all you need to do, and the entire process takes around 15 seconds.</p>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Parralelizationforthewin">Parallel computing — a great advantage</h4>



<p>One of the most significant benefits is that since the infrastructure is not on your premises, you can count on the provider to scale it.</p>



<p>So you can run dozens of jobs simultaneously. A classic use case is to fine-tune all of your models once a week or once a month, with a few lines of bash:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Start a basic loop
for model in my_models_listing
do

# Launch a job with 3 GPUs on a Pytorch environment, with an Object Storage bucket directly linked to it
echo "starting training of $model"
ovhai job run \
--gpu 3 \
--volume myBucket@GRA:/data:RW \
my_docker_repository/$model

done</code></pre>



<p>If you have 10 models, it will launch 10 jobs of 3 GPUs each in a few seconds, and stop each one once its job completes. You move from sequential to parallel work.</p>



<h5 class="wp-block-heading">Collaboration out of the box</h5>



<p>All of these platforms natively include notebooks, directly plugged into GPU power. With OVHcloud AI Training, we also provide pre-installed environments for TensorFlow, Hugging Face, Pytorch, MXnet, Fast.AI — and others will be added to this list soon.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="571" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/nbook-1024x571.png" alt="" class="wp-image-20259" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-1024x571.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-300x167.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-768x429.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook-1536x857.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/nbook.png 1672w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>JupyterLab Notebook</figcaption></figure></div>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Datasetaccessmadeeasy">Data set access made easy</h4>



<p>I haven&#8217;t tested all the GPU platforms on the market, but usually they provide some useful ways to access data. We aim to provide the best work environment for data science teams, so we are also offering an easy way for them to access their data — by enabling them to attach object storage containers during the job launch.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="271" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/container-1024x271.png" alt="" class="wp-image-20260" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/container-1024x271.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/container-300x79.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/container-768x204.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/container.png 1536w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>OVHcloud AI Training : attach Object Storage containers to notebooks</figcaption></figure></div>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Lastbutnotleast...CostControlsisareality">Cost control for users</h4>



<p>Third-party GPU platforms quite often provide clear pricing. This is the case for Paperspace, but not for Run:AI (I was unable to find their price list). This is also the case for OVHcloud AI Training.</p>



<ul class="wp-block-list"><li><strong>GPU power</strong>: You pay £1.58/hour/NVIDIA V100s GPU</li><li><strong>Storage</strong>: Standard price of OVHcloud Object Storage (compliant with AWS S3 protocol)</li><li><strong>Notebooks</strong>: Included</li><li><strong>Observability tools</strong>: Logs and metrics included</li><li><strong>Subscription</strong>: No, it&#8217;s pay-as-you-go, per minute</li></ul>
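<p>Since the pricing is linear, a back-of-the-envelope budget is easy to script. Below is a minimal sketch, reusing the £1.58/hour/GPU price quoted above; the job sizes are hypothetical examples, not real workloads:</p>

```shell
# Hedged sketch: estimate the GPU cost of a training campaign,
# using the pay-as-you-go price of 1.58 pounds per GPU per hour
# quoted above. The job sizes below are made-up examples.

PRICE_PER_GPU_HOUR=1.58

# cost <gpus> <hours> -> prints the GPU cost in pounds
cost() {
  awk -v g="$1" -v h="$2" -v p="$PRICE_PER_GPU_HOUR" \
      'BEGIN { printf "%.2f\n", g * h * p }'
}

cost 4 2      # one job, 4 GPUs for 2 hours    -> 12.64
cost 3 0.5    # one job, 3 GPUs for 30 minutes -> 2.37
```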



<p>And there we go — cost and budget estimation is now simple. Try it out for yourself!</p>



<h4 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Missioncomplete?">Mission complete?</h4>



<p>Below is a summary addressing the major challenges to resolve when dealing with GPU pool sharing. It&#8217;s a big yes!</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Challenges</th><th>Status</th></tr></thead><tbody><tr><td><strong>GPU pool with sharing option</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> definitely. In fact, even many GPU pools in parallel, if you want to.</td></tr><tr><td><strong>Collaborative tools</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> definitely. Notebooks are always provided, as far as I know.</td></tr><tr><td><strong>Infrastructure maintenance</strong></td><td><span class="has-inline-color has-vivid-green-cyan-color"><strong>YES</strong> </span>definitely. Infrastructure is managed by the provider. You will not even need to connect via SSH to debug.</td></tr><tr><td><strong>Infrastructure diagnosis</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span>. </strong>Logs and metrics provided on our side, at least.</td></tr><tr><td><strong>Infrastructure agility/flexibility</strong></td><td><strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span> </strong>definitely. Scale up or down one or more GPU pools, use them for 10 minutes or a full month, etc.</td></tr><tr><td><strong>Secured data access</strong></td><td>Depends on the solution you choose, but usually it&#8217;s a <strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> via simplified object storage access.</td></tr><tr><td><strong>Cost control</strong></td><td>Depends on the solution you choose, but usually it&#8217;s a <strong><span class="has-inline-color has-vivid-green-cyan-color">YES</span></strong> with packaged prices and zero investments to make (zero CAPEX).</td></tr></tbody></table></figure>






<h3 class="wp-block-heading" id="id-[Blogpost]ManagingGPUspoolsefficentlyinAIpipelines-Conclusion">Conclusion</h3>



<p>If we go back to the main challenges faced by a company that requires shared GPU pools, we can say without a doubt that <strong>Kubernetes is the market standard for AI pipeline orchestration</strong>.</p>



<p>An <strong>on-premise K8s cluster with Kubeflow</strong> is really interesting if the data cannot be processed in the cloud (e.g. banking, hospitals, any kind of sensitive data) or if your team has steady (and modest) GPU requirements. You can invest in a few GPUs and manage the fleet yourself with software on top. But if you need more power, <strong>very soon the cloud will become the only viable option</strong>. Hardware investments, hardware obsolescence, electricity usage and scaling will give you some headaches.</p>



<p>Then, depending on the situation, <strong>Kubeflow in the cloud might be really useful</strong>. It delivers powerful pipeline features, notebooks, and enables users to manage virtual GPU pools. </p>



<p>But if you want to avoid infrastructure tasks, control your spending, and focus on your added value and code, <strong>you might consider GPU platforms as your first choice</strong>.</p>



<p>However, there is no such thing as magic — and without knowing exactly what you want, even the best platform won&#8217;t be able to meet your needs. Yet some start-ups, not listed here, can offer a combination of platforms and expertise to help you with your projects, infrastructure and use cases.</p>



<p>Thank you for reading, and don&#8217;t forget that we also offer inference at scale with ML Serving. This is the next logical step after training.</p>



<h5 class="wp-block-heading">Want to find out more?</h5>



<ul class="wp-block-list"><li>Solution page: <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-training/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.ovhcloud.com/en-gb/public-cloud/ai-training/</a></li><li>Public documentation: <a href="https://docs.ovh.com/gb/en/ai-training/" data-wpel-link="exclude">https://docs.ovh.com/gb/en/ai-training/</a></li><li>Community: <a href="http://community.ovh.com/en/" data-wpel-link="exclude">community.ovh.com/en/</a></li></ul>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>A journey through the wondrous land of Machine Learning or &#8220;Can I really buy a palace in Paris for 100,000€?&#8221; (Part 2)</title>
		<link>https://blog.ovhcloud.com/a-journey-through-the-wondrous-land-of-machine-learning-or-can-i-really-buy-a-palace-in-paris-for-100000e-part-2/</link>
		
		<dc:creator><![CDATA[Guillaume Ruty]]></dc:creator>
		<pubDate>Thu, 03 Sep 2020 14:52:09 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[A journey into the wondrous land of Machine Learning]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=19078</guid>

<description><![CDATA[Spoiler alert, no you can&#8217;t. A few months ago, I explained how to use Dataiku &#8211; a well-known interactive AI studio &#8211; and how to use data, made available by the French government, to build a Machine Learning model predicting the market value of real estate. Sadly, it failed miserably: when I tried it on [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Spoiler alert, no you can&#8217;t.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-1024x537.png" alt="A journey through the wondrous land of Machine Learning or &quot;Can I really buy a palace in Paris for 100,000€?&quot; (Part 2)" class="wp-image-19147" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F.png 1200w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure>



<p><a href="https://www.ovh.com/blog/a-journey-into-the-wondrous-land-of-machine-learning-or-did-i-get-ripped-off-part-1/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">A few months ago</a>, I explained how to use Dataiku &#8211; a well-known interactive AI studio &#8211; and how to use data, made available by the French government, to build a Machine Learning model predicting the market value of real estate. Sadly, it failed miserably: when I tried it on the transactions made in my street, the same year I bought my flat, the model predicted that all of them had the same market value. </p>



<p>In this blog post, I will point out several reasons why our experiment failed, and then I will try to train a new model, taking into account what we will have learned.</p>



<h3 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Whyourmodelfailed">Why our model failed</h3>



<p>There are several reasons why our model failed. Three of them stand out:</p>



<ul class="wp-block-list"><li>The open data format layout</li><li>The data variety</li><li>Dataiku&#8217;s default model parameters</li></ul>



<h4 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-OpenDataFormatLayout">Open Data Format Layout</h4>



<p>You can find a description of the data layout on the dedicated <a href="https://www.data.gouv.fr/fr/datasets/5cc1b94a634f4165e96436c1/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">webpage</a>. I will not list all the columns of the schema (there are 40 of them), but the most important one is the first one: id_mutation. This information is a unique transaction number, and not an unusual column to find. </p>



<p>However, if you look at the dataset itself, you will see that some transactions are spread over multiple lines. These correspond to transactions covering multiple parcels. In the example of my own transaction, there are two lines: one for the flat itself, and one for a separate basement under the building.</p>



<p>The problem is that the full price is found on every such line. From the point of view of my AI studio, which only sees a set of lines it interprets as data points, it looks like my basement and my flat are two different properties that cost an equal amount! This gets worse for properties that have land and several buildings attached to them. How can we expect our algorithm to learn appropriately under these conditions?</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B-1024x766.png" alt="Open Data Format Layout" class="wp-image-19145" width="768" height="575" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B-1024x766.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B-300x225.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B-768x575.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B.png 1248w" sizes="auto, (max-width: 768px) 100vw, 768px" /><figcaption>Dataiku doesn&#8217;t naturally understand that a data point can consist of multiple lines!</figcaption></figure></div>



<h4 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-DataVariety">Data Variety</h4>



<p>In this case, we are trying to predict the price of a flat in Paris. However, the data we gave the algorithm covers every real estate transaction made in France over the last few years. While you might think that more data is always better, this is not necessarily the case.</p>



<p>The real estate market changes according to where you are, and Paris is a very specific case in France, with prices being much higher than in other big cities and the rest of France. Of course, this can be seen in the data, but the training algorithm does not know that in advance, and it is very hard for it to learn how to price a small flat in Paris and a farm with acres of land in Lozère at the same time.&nbsp;</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED-1024x563.png" alt="Data variety" class="wp-image-19141" width="768" height="422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED-1024x563.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED-300x165.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED-768x422.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED.png 1253w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure>



<h4 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Modeltrainingparameters">Model training parameters</h4>



<p>In the last blog post, you saw how easy it is to use Dataiku. But this ease comes at a price: the default settings work for very simple use-cases, and are not suited for complex tasks &#8211; like predicting real-estate prices. I myself do not have much experience with Dataiku. However, by digging deeper into the details, I was able to correct a few obvious mistakes:</p>



<ul class="wp-block-list"><li>Data types: A lot of the columns in the dataset have specific types: integers, geographic coordinates, dates etc. Most of them are correctly identified by Dataiku, but some of them &#8211; such as geographic coordinates, or dates &#8211; are not.<br></li><li>Data analysis: If you remember the previous post, at one point we were looking at different models trained by the algorithm. We didn&#8217;t take the time to look at the model design automated by the studio. This section allows us to tweak several elements, such as the types of algorithms we run, the learning parameters, the choice of the dataset etc&#8230;<br><br>With so many features present in the dataset, Dataiku tried to reduce the number of features it would analyze, in order to simplify the learning algorithm. But it made poor choices. For example, it considers the street number but not the street itself. Even worse, it doesn&#8217;t even look at the date, or the parcels&#8217; surface area (but it does consider land surface when present&#8230;), which is by far the most important factor in most cities!</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0255-1024x418.png" alt="Auto Machine Learning did bad choices" class="wp-image-19234" width="768" height="314" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0255-1024x418.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0255-300x122.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0255-768x313.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0255.png 1476w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Howtofixallofthat">How to fix all of that</h3>



<p>Fortunately, there are ways to solve these issues. Dataiku integrates tools to transform and filter your datasets before running your algorithms. It also allows you to change training parameters. Rather than walking you through all the steps, I&#8217;m going to summarize what I did for each of the issues we identified earlier:</p>



<h4 class="wp-block-heading">Data layout</h4>



<ul class="wp-block-list"><li>First, I grouped the lines that corresponded with the same transactions. Depending on the fields, I either summed them up (when it was a living area surface, for example), kept one of them (address), or concatenated them (when it was the identifier for an outbuilding, for example).</li><li>Second, I removed several unnecessary or redundant fields that add noise to the algorithm; such as street name (there are already per-city-unique street codes), street number suffix (&#8220;Bis&#8221; or &#8220;Ter&#8221; commonly found in an address after a street number) or other administration-related information.</li><li>Finally, some transactions contain not only several parcels (on several lines) but also several subparcels per parcel, each with its own surface and subparcel number. This subdivision is mostly administrative, and subparcels are often previously adjoining flats that have been reunited. To simplify the data, I cut the subparcel numbers and summed their respective surfaces, before regrouping the lines.</li></ul>
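<p>As an illustration of that grouping step, the snippet below collapses a made-up three-column extract (transaction id, price, surface) into one line per <code>id_mutation</code>, summing the surfaces and keeping the duplicated full price once. It is a sketch of the idea only, not the actual Dataiku recipe:</p>

```shell
# Minimal sketch of the grouping step (illustration only, not the real
# Dataiku recipe): one output line per id_mutation, surfaces summed,
# the duplicated full price kept once.
cat > transactions.csv <<'EOF'
id_mutation,valeur_fonciere,surface
2018-001,300000,60
2018-001,300000,8
2018-002,150000,35
EOF

awk -F, 'NR > 1 { price[$1] = $2; surface[$1] += $3 }
         END { for (id in price) print id "," price[id] "," surface[id] }' \
    transactions.csv | sort
# -> 2018-001,300000,68
#    2018-002,150000,35
```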



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0254-935x1024.png" alt="Cleaning data layout" class="wp-image-19232" width="701" height="768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0254-935x1024.png 935w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0254-274x300.png 274w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0254-768x841.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0254.png 1248w" sizes="auto, (max-width: 701px) 100vw, 701px" /></figure></div>



<h4 class="wp-block-heading">Data variety</h4>



<ul class="wp-block-list"><li>First, as we are trying to train a model to estimate the price of Parisian flats, I filtered out all the transactions that didn&#8217;t happen in Paris (which as you can expect is most of it).<br></li><li>Second, I removed all the transactions that had incomplete data for important fields (such as surface or address).<br></li><li>Finally, I removed outliers: transactions corresponding to properties that don&#8217;t correspond to standard flats; such as houses, commercial land, very high-end flats etc&#8230;</li></ul>
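<p>The filtering steps can be sketched the same way. Below is a minimal illustration on a made-up extract with simplified columns (in the real dataset the department lives in a <code>code_departement</code> column, and Paris is department 75): keep only Parisian flats with a non-empty surface.</p>

```shell
# Minimal sketch of the filtering step (illustration only): keep only
# Parisian flats ("Appartement") with a non-empty surface, on a made-up
# extract with columns: id, code_departement, type_local, surface, price.
cat > dvf_extract.csv <<'EOF'
id,code_departement,type_local,surface,price
1,75,Appartement,42,450000
2,14,Maison,120,230000
3,75,Appartement,,380000
4,75,Local commercial,60,900000
EOF

awk -F, 'NR > 1 && $2 == "75" && $3 == "Appartement" && $4 != ""' dvf_extract.csv
# -> 1,75,Appartement,42,450000
```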



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B-1024x701.png" alt="Tackling data variety" class="wp-image-19146" width="768" height="526" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B-1024x701.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B-300x205.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B-768x526.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B.png 1527w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h4 class="wp-block-heading">Model training parameters</h4>



<ul class="wp-block-list"><li>First, I made sure that the model considered all the features. Note: rather than removing unnecessary fields from the dataset, I could have just told the algorithm to ignore the corresponding features. However, my preference is to increase the readability of the dataset to make it easier to explore. Moreover, Dataiku loads data in RAM to process it, so making it run on a clean dataset makes it more RAM-efficient.<br></li><li>Second, I trained the algorithm on different sets of features: in some cases I kept the district but not the street. As there are a lot of different streets in Paris, this is a categorical feature with high cardinality (lots of different possibilities that can&#8217;t easily be turned into numbers).<br></li><li>Finally, I tried different families of Machine Learning algorithms: Random Forest &#8211; basically building decision trees; XGBoost &#8211; gradient boosting; SVM (Support Vector Machine) &#8211; a generalization of linear classifiers; and KNN (K-Nearest-Neighbours) &#8211; which tries to categorize data points by looking at their neighbors according to different metrics.</li></ul>



<h3 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Diditwork?">Did it work?</h3>



<p>So, after all that, how did we fare? Well, first off, let us look at the R2 score of our models. Depending on the training session, our best models have an R2 score between 0.8 and 0.85. As a reminder, an R2 score of 1 would mean that the model perfectly predicts the price of every data point used in the training evaluation phase. The best models in our previous tries had an R2 score between 0.1 and 0.2, so we are already clearly doing better here. Let us now look at a few predictions from this model.</p>
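<p>For reference, the R2 score can be computed by hand: it is 1 minus the ratio of the residual sum of squares to the total sum of squares. A tiny sketch on made-up (actual, predicted) price pairs, purely for illustration (Dataiku computes this for you during evaluation):</p>

```shell
# R2 = 1 - SS_res / SS_tot, on made-up (actual, predicted) pairs.
# Illustration only; Dataiku reports this score during evaluation.
cat > preds.csv <<'EOF'
actual,predicted
100,110
200,190
300,310
400,380
EOF

awk -F, 'NR > 1 { a[NR] = $1; p[NR] = $2; sum += $1; n++ }
         END {
           mean = sum / n
           for (i in a) { ss_res += (a[i]-p[i])^2; ss_tot += (a[i]-mean)^2 }
           printf "R2 = %.3f\n", 1 - ss_res / ss_tot
         }' preds.csv
# -> R2 = 0.986
```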



<p>First, I re-checked all the transactions from my street. This time, the prediction for my flat is ~16% lower than the price I paid. But unlike last time, every flat has a different estimate and these estimates are all in the correct order of magnitude. Most values have less than 20% error when compared to the real price, and the worst estimates have ~50% error. Obviously, this margin of error is unacceptable when investing in a flat. However, when compared to the previous iteration of our model &#8211; that returned the same estimate for all the flats in my street &#8211; we are making significant progress.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0250-1024x893.png" alt="A journey through the wondrous land of Machine Learning  - Did it work?" class="wp-image-19228" width="768" height="670" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0250-1024x893.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0250-300x262.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0250-768x669.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0250.png 1224w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>So, now that we at least have the correct order of magnitude, let&#8217;s try and tweak some values in our input dataset to see if the model reacts predictably. To do this, I took the data point of my own transaction and created new data points, each time by changing one of the features of the original data point:</p>



<ol class="wp-block-list"><li>the surface, reducing it</li><li>the coordinates (street name, street code, geographic coordinates, etc.), to put it in a cheaper district</li><li>the date of transaction, set to year 2015 (3 years prior to the real date)</li></ol>



<p>With each of these modifications, we would expect the new estimate to be lower than the original one (real estate prices in Paris have been rising continuously). Let us look at the results:</p>



<figure class="wp-block-table"><table><thead><tr><th class="has-text-align-center" data-align="center">Real Price</th><th class="has-text-align-center" data-align="center">Original Estimate</th><th class="has-text-align-center" data-align="center">Reduced Surface Estimate</th><th class="has-text-align-center" data-align="center">Other District Estimate</th><th class="has-text-align-center" data-align="center">Older Estimate</th></tr></thead><tbody><tr><td class="has-text-align-center" data-align="center">100%</td><td class="has-text-align-center" data-align="center">84%</td><td class="has-text-align-center" data-align="center">45%</td><td class="has-text-align-center" data-align="center">61%</td><td class="has-text-align-center" data-align="center">76%</td></tr></tbody></table></figure>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0251-1024x562.png" alt="Tweaking values in the data-set" class="wp-image-19230" width="768" height="422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251-1024x562.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251-300x165.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251-768x422.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251-1536x843.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251.png 1554w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>At least the model behaves in an appropriate way, qualitatively speaking.</p>
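<p>This kind of sanity check is easy to automate: perturb one feature at a time and verify that each new estimate is lower than the original. The sketch below uses an invented toy linear pricer as a stand-in for the real Dataiku model, so every name and coefficient here is hypothetical:</p>

```python
def toy_price_model(flat):
    """Hypothetical pricer: surface * district price/m2, discounted ~3%/year before 2018."""
    base = flat["surface"] * flat["district_price_per_m2"]
    return base * (1 - 0.03 * (2018 - flat["year"]))

original = {"surface": 50, "district_price_per_m2": 10_000, "year": 2018}

variants = {
    "reduced surface":   {**original, "surface": 30},
    "cheaper district":  {**original, "district_price_per_m2": 7_000},
    "older transaction": {**original, "year": 2015},
}

ref = toy_price_model(original)
for name, flat in variants.items():
    estimate = toy_price_model(flat)
    assert estimate < ref, name          # every perturbation should lower the price
    print(f"{name}: {estimate / ref:.0%} of the original estimate")
```
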



<h1 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Howcouldwedobetter?">How could we do better?</h1>



<p>At this point, we have used common sense to significantly improve our previous results and build a model that gives predictions in the right order of magnitude and that behaves as we expect when we tweak the features of data points. However, the remaining margin of error makes it unsuitable for real-world applications. But why, and what could we do to keep improving our model? Well, there are several reasons:</p>



<p>Data complexity: I am going to contradict myself a little. While complex data is harder to digest for a Machine Learning algorithm, it is necessary to preserve this complexity if it reflects a complexity in the final result. In this case, we might not only have oversimplified the data, but the original data itself may lack a lot of relevant information.</p>



<p>We trained our algorithm on general location and surface, which admittedly are the most important criteria, but our dataset lacks very important information such as floor, exposure, construction year, insulation diagnostics, condition, accessibility, view, general state of the flats, etc.</p>



<p>There are private datasets built by notarial offices that are more complete than our open dataset, but while those might have features such as floor or construction year, they would probably lack more subjective information, such as general state or view.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95-1024x951.png" alt="Data complexity" class="wp-image-19143" width="768" height="713" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95-1024x951.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95-300x279.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95-768x713.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95.png 1395w" sizes="auto, (max-width: 768px) 100vw, 768px" /><figcaption>The dataset lacks very important information about the flats.</figcaption></figure>



<ul class="wp-block-list"><li>Data amount: Even if we had very complete data, we would need a vast amount of it. The more features we include in our training, the more data we need. And for such a complex task, the ~150K transactions per year we have in Paris are probably not enough. A solution could be to create artificial data points: flats that don&#8217;t really exist, but that human experts would still be able to evaluate. <br><br>But there are three issues with that: first, any bias in the experts would inevitably be passed on to the model. Second, we would have to generate a huge number of artificial, but realistic, data points and then would need the help of multiple human experts to label them. Finally, the aforementioned experts would label this artificial data based on their current experience. It would be very hard for them to remember the market prices from a few years ago. This means that, to have a homogeneous dataset over the years, we would have to create this artificial data over time, at the same pace as the real transactions happen.<br></li><li>Skills: Finally, being a data scientist is a full-time job that requires experience and skill. A real data scientist would probably be able to reach better results than I obtained, by adjusting the learning parameters and choosing the most appropriate algorithms.<br><br>Furthermore, even good data scientists would have to know their way around real estate and its pricing. It&#8217;s very hard to build advanced Machine Learning models without a good comprehension of the topic at hand.</li></ul>



<h1 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Summary">Summary</h1>



<p>In this blog post, we discussed why our previous attempt at training a model to predict the price of flats in Paris failed. The data we used was not cleaned enough and we used Dataiku&#8217;s default training parameters rather than verifying that they made sense. </p>



<p>After that, we corrected our mistakes, cleaned the data and tweaked the training parameters. This improved the result of our model a lot, but not enough to use it realistically. There are ways to improve the model further, but the available datasets lack some information and the amount of data itself may not be sufficient to build a robust model. </p>



<p>Fortunately, the intent of this series was never to predict the price of flats in Paris perfectly. If it were possible, there would be no more real estate agencies. Instead, it serves as an illustration of how anyone can take raw data, find a problem related to the data and train a model to tackle this problem.</p>



<p>However, the dataset that we used in this example was quite small: only a few gigabytes. Everything happened on a single VM and we had to do everything manually, on a fixed dataset. What would I do if I wanted to handle petabytes of data? If I wanted to handle continuously streaming data? If I wanted to expose my model so that external applications could query it? </p>



<p>That is what we are going to look at next time, in the final blog post of the series.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How PCI-Express works and why you should care? #GPU</title>
		<link>https://blog.ovhcloud.com/how-pci-express-works-and-why-you-should-care-gpu/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Thu, 09 Jul 2020 10:16:00 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PCIe]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14485</guid>

<description><![CDATA[What is PCI-Express ? Everyone, and I mean everyone, should pay attention when they do intensive Machine Learning / Deep Learning Training. As I explained in a previous blog post, GPUs have accelerated Artificial Intelligence evolution massively. However, building a GPUs server is not that easy. And failing to create an appropriate infrastructure can have [&#8230;]]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-1024x538.jpeg" alt="How PCI-Express works and why you should care? #GPU" class="wp-image-18783" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-1024x538.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/69659375-3553-40C9-A201-73C4CDED2461.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h2 class="wp-block-heading">What is PCI-Express?</h2>



<p>Everyone, and I mean everyone, should pay attention when they do intensive Machine Learning / Deep Learning Training. </p>



<p>As I explained in a previous blog post, GPUs have accelerated Artificial Intelligence evolution massively.  </p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png" alt="" class="wp-image-18103" width="334" height="254" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png 668w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED-300x228.png 300w" sizes="auto, (max-width: 334px) 100vw, 334px" /></a></figure></div>



<p>However, building a GPU server is not that easy, and failing to create an appropriate infrastructure can have consequences on training time.</p>



<p>If you use GPUs, you should know that there are 2 ways to connect them to the motherboard, so that they can communicate with the other components (network, CPU, storage device). Solution 1 is through <strong>PCI Express </strong>and solution 2 through <strong>SXM2</strong>. We will talk about <strong>SXM2</strong> in the future. Today, we will focus on <strong>PCI Express</strong>, because it has a strong dependency on the choice of adjacent hardware, such as the PCI bus or the CPU.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>                     NVIDIA V100 with SXM2 design</th><th class="has-text-align-center" data-align="center">                          NVIDIA V100 with PCI express design</th></tr></thead><tbody><tr><td><img loading="lazy" decoding="async" width="609" height="644" class="wp-image-18763" style="width: 500px" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-01.jpg" alt="NVIDIA V100 with SXM2 design" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-01.jpg 609w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-01-284x300.jpg 284w" sizes="auto, (max-width: 609px) 100vw, 609px" /><br>Source : <a aria-label="undefined (opens in a new tab)" href="https://www.ebizpc.com/NVIDIA-Tesla-V100-900-2G502-0300-000-16GB-GPU-p/900-2g503-0310-000.htm" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://www.ebizpc.com/NVIDIA-Tesla-V100-900-2G502-0300-000-16GB-GPU-p/900-2g503-0310-000.htm</a></td><td class="has-text-align-center" data-align="center"><img loading="lazy" decoding="async" width="450" height="450" class="wp-image-18764" style="width: 500px" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-02.jpg" alt="NVIDIA V100 with PCI express design" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02.jpg 450w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02-300x300.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02-150x150.jpg 150w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-02-70x70.jpg 70w" sizes="auto, (max-width: 450px) 100vw, 450px" /><br>Source : <a aria-label="undefined (opens in a new tab)" href="https://nvidiastore.com.br/nvidia-tesla-v100-16gb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">https://nvidiastore.com.br/nvidia-tesla-v100-16gb</a></td></tr></tbody></table><figcaption>SXM2 design VS PCI Express 
Design</figcaption></figure>



<p>This is a major element to consider when talking about deep learning: the data loading phase is wasted compute time, so bandwidth between components and GPUs is a key bottleneck in most deep learning training contexts.</p>



<h2 class="wp-block-heading">How does PCI-Express work, and why should you care about the number of PCIe lanes?</h2>



<h3 class="wp-block-heading">What is a PCI-Express lane and are there any associated CPU limitations?</h3>



<p>Each V100 GPU uses 16 PCIe lanes. What does that mean exactly?</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="618" height="442" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-03.png" alt="Extract from NVidia V100 product specification sheet" class="wp-image-18767" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03.png 618w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03-300x215.png 300w" sizes="auto, (max-width: 618px) 100vw, 618px" /><figcaption>Extract from NVidia V100 product specification <a href="https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf" target="_blank" aria-label="undefined (opens in a new tab)" rel="noreferrer noopener nofollow external" data-wpel-link="external">sheet</a></figcaption></figure></div>



<p>The <strong><em>&#8220;x16&#8221;</em></strong> means that the PCIe device has 16 dedicated lanes. So&#8230; next question: what is a PCI Express lane?</p>



<h4 class="wp-block-heading">What&#8217;s a PCI Express lane?</h4>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3.jpeg" alt="2 PCI Express Devices with its interconnexion" class="wp-image-18779" width="424" height="299" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3.jpeg 848w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3-300x211.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/72DFDF80-DC39-4253-BAB3-CEB351B627D3-768x541.jpeg 768w" sizes="auto, (max-width: 424px) 100vw, 424px" /><figcaption>2 PCI Express Devices with its interconnexion : figure inspired of the awesome <a aria-label="undefined (opens in a new tab)" href="https://www.phhsnews.com/what-is-chipset-and-why-should-i-care3538" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">article</a> &#8211; what is chipset and why should I care</figcaption></figure></div>



<p>PCIe lanes are used to communicate between PCIe devices, or between a PCIe device and the CPU. A lane is composed of two pairs of wires: one pair for inbound communications and one pair for outbound, so each lane can send and receive at full speed simultaneously.</p>



<p>Lane communications are similar to network Layer 1 communications &#8211; it’s all about transferring bits as fast as possible through electrical wires! However, the technique used for a PCIe link is a bit different, as a PCIe device is composed of xN lanes. In our previous example N=16, but it could be any power of 2 from 1 to 16 (1/2/4/8/16).</p>



<h3 class="wp-block-heading">So… if PCIe is similar to a network architecture, it means that PCIe layers exist, doesn&#8217;t it?</h3>



<p>Yes! You are right, PCIe has 4 layers:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="724" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-1024x724.jpeg" alt="" class="wp-image-18723" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-1024x724.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-300x212.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02-768x543.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/photo_2020-07-02-15.08.02.jpeg 1280w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>



<h4 class="wp-block-heading"><strong>The Physical Layer (aka <em>the Big Negotiation Layer</em>)</strong></h4>



<p>The<strong><em> Physical Layer (PL)</em></strong> is responsible for negotiating the terms and conditions for receiving the raw packets (PLPs, for Physical Layer Packets), i.e. the lane width and the frequency, with the other device.</p>



<p>You should be aware that only the smallest number of lanes of the two devices will be used. This is why choosing the appropriate CPU is so important. CPUs have a limited number of lanes that they can manage so <strong>having a nice GPU with 16 PCIe Lanes and having a CPU with 8 PCIe Bus lanes will be as efficient as throwing away half your money because it doesn’t fit in your wallet.</strong></p>
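<p>In other words, the link trains to the narrower of the two devices. A trivial sketch of that rule (the ~1 GB/s per-lane figure for PCIe 3.0 is used just for illustration):</p>

```python
def usable_bandwidth_gb_s(gpu_lanes, cpu_lanes, per_lane_gb_s=1.0):
    """Effective link bandwidth is bounded by the narrower device."""
    return min(gpu_lanes, cpu_lanes) * per_lane_gb_s

print(usable_bandwidth_gb_s(16, 16))  # 16.0 -> the GPU runs at full width
print(usable_bandwidth_gb_s(16, 8))   # 8.0  -> half the GPU's capacity is wasted
```
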



<p>Packets received at the <strong><em>Physical Layer (aka PHY)</em></strong> come from other PCIe devices or from the system (via <strong><em>Direct Memory Access (DMA)</em></strong>, or from the CPU for instance) and are encapsulated in a frame. </p>



<p>The purpose of a Start-of-Frame is to say: “I am sending you data, this is the beginning,” and it takes just 1 byte to say that!</p>



<p>The <strong><em>End-of-Frame</em> </strong>word is also 1 byte to say “goodbye I’m done with it”.</p>



<p>This layer implements an <strong><em>8b/10b or 128b/130b encoding</em></strong> that we will explain later, mainly used for <strong><em>clock recovery</em></strong>.</p>



<h4 class="wp-block-heading"><strong>The Data Link Layer Packet (aka <em>Let’s put this mess in the right&nbsp;order</em>)</strong></h4>



<p>The <strong><em>Data Link Layer Packet (DLLP)</em></strong> starts with a <strong><em>Packet Sequence Number</em></strong>. This is really important, as a packet might get corrupted at some point, so it may need to be uniquely identified for retry purposes. The <strong><em>Sequence Number</em></strong> is coded on 2 bytes.</p>



<p>The <strong><em>Data Link Layer Packet</em></strong> then wraps the <strong><em>Transaction Layer Packet</em></strong> and is closed with the <strong><em>LCRC (Link Cyclic Redundancy Check)</em></strong>, which is used to check the integrity of the <strong><em>Transaction Layer Packet</em> (meaning the actual payload)</strong>.</p>



<p>If the <strong><em>LCRC</em></strong> is validated, then the <em><strong>Data Link Layer</strong></em> sends an <strong><em>ACK (ACKnowledge)</em></strong> signal to the <em><strong>emitter</strong></em> through the <strong><em>Physical Layer</em></strong>. Otherwise it sends a <strong><em>NAK (Not AcKnowledge)</em></strong> signal to the emitter, which will resend the frame associated with the <strong><em>sequence number</em></strong>; this retry is served from a replay buffer kept on the <em><strong>emitter</strong></em> side.</p>
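<p>The ACK/NAK retry loop can be sketched as follows. This toy simulation uses CRC32 as a stand-in for the real LCRC and greatly simplifies the framing; it only illustrates the replay mechanism, not the actual protocol:</p>

```python
import zlib

def make_frame(seq, payload: bytes):
    """Build a toy frame: sequence number, payload, and a CRC over the payload."""
    return {"seq": seq, "payload": payload, "lcrc": zlib.crc32(payload)}

def receive(frame):
    """Receiver side: ACK when the CRC matches the payload, NAK otherwise."""
    ok = zlib.crc32(frame["payload"]) == frame["lcrc"]
    return "ACK" if ok else "NAK"

# Emitter side: the replay buffer keeps frames until they are acknowledged.
replay_buffer = {}
frame = make_frame(seq=1, payload=b"transaction layer packet")
replay_buffer[frame["seq"]] = frame

corrupted = dict(frame, payload=b"transaction layer XXXXXX")  # bit errors on the wire
assert receive(corrupted) == "NAK"   # receiver rejects the corrupted frame
resent = replay_buffer[1]            # emitter replays the frame from its buffer
assert receive(resent) == "ACK"      # clean copy goes through
del replay_buffer[1]                 # the ACK frees the replay slot
```
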



<h4 class="wp-block-heading"><strong>The Transaction Layer</strong></h4>



<p>The<strong><em> Transaction Layer</em></strong> is responsible for <strong>managing the actual payload (Header + Data)</strong>, as well as the (optional) message digest, the <strong><em>ECRC (End-to-End Cyclic Redundancy Check)</em></strong>. This <strong><em>Transaction Layer Packet</em></strong> comes from the <strong><em>Data Link Layer</em></strong>, where it has been <strong>decapsulated</strong>.</p>



<p>An <strong>integrity check</strong> is performed if needed/requested. This step checks the integrity of the business logic and ensures no packet corruption when passing data from the<strong><em> Data Link Layer</em></strong> to the <em><strong>Transaction Layer</strong></em>.</p>



<p>The header is describing the type of transaction such as:</p>



<ul class="wp-block-list"><li>Memory Transaction</li><li>I/O Transaction</li><li>Configuration Transaction</li><li>or Message Transaction</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-1024x600.jpeg" alt="PCIe Layers" class="wp-image-18781" width="512" height="300" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-1024x600.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-300x176.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E-768x450.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/5E282911-B63F-410D-A2CD-AD52B928C62E.jpeg 1368w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h4 class="wp-block-heading"><strong>The Application Layer</strong></h4>



<p>The role of the <em><strong>application layer</strong></em> is to handle the <strong><em>User Logic</em></strong>. This layer sends the <strong><em>Header and the data payload</em></strong> to the <strong><em>Transaction Layer</em></strong>. The magic happens in this layer, where data is routed to the different hardware components.</p>



<h3 class="wp-block-heading">How does PCIe communicate with the rest of the&nbsp;world?</h3>



<p>A PCIe link uses the <strong>packet-switching concept found in networking, in full duplex mode.</strong></p>



<p>PCIe devices have an <strong>internal clock to orchestrate PCIe </strong><em><strong>Data Transfer Cycles</strong></em>. This <strong><em>Data Transfer Cycle</em></strong> is also orchestrated thanks to the <strong><em>Reference Clock</em></strong>. The latter sends a signal through a <strong><em>Dedicated Lane</em> (which is not part of the x1/2/4/8/16/32 mentioned above)</strong>. This clock helps both the receiving and emitting devices to synchronise for packet communications.</p>



<p><strong>Each PCIe lane is used to send bytes in parallel with the other lanes</strong>. The<strong><em> Clock Synchronization </em></strong>mentioned above helps the receiver to put those bytes back in the right order.</p>
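<p>Byte striping can be illustrated with a toy sketch: the transmitter deals consecutive bytes round-robin across the lanes, and the receiver interleaves them back in lane order (which is exactly what the clock synchronization guarantees). This is a conceptual illustration only, not the actual framing:</p>

```python
def stripe(data: bytes, lanes: int):
    """Deal consecutive bytes round-robin over `lanes` per-lane streams."""
    return [data[i::lanes] for i in range(lanes)]

def unstripe(parts):
    """Reassemble the original byte stream from the per-lane streams."""
    lanes = len(parts)
    out = bytearray(sum(len(p) for p in parts))
    for lane, part in enumerate(parts):
        out[lane::lanes] = part   # lane i carries bytes i, i+N, i+2N, ...
    return bytes(out)

data = b"PCI Express stripes bytes across lanes"
parts = stripe(data, 16)
assert unstripe(parts) == data    # round-trips regardless of lane count
```
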



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="618" height="442" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-03.png" alt="" class="wp-image-18767" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03.png 618w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-03-300x215.png 300w" sizes="auto, (max-width: 618px) 100vw, 618px" /><figcaption>x16 means 16 lanes of parallel communication on generation 3 of PCIe&nbsp;protocol</figcaption></figure></div>



<h3 class="wp-block-heading">You may have the bytes in order, but do you have data integrity at the physical layer?</h3>



<p>To ensure <strong>integrity</strong>, PCIe devices use an <strong>8b/10b encoding scheme for PCIe generations 1 and 2</strong>, or a <strong>128b/130b encoding scheme for generations 3 and 4.</strong></p>



<p>These encodings are used to prevent the loss of temporal landmarks, especially when transmitting consecutive similar bits. This process is called “<strong><em>Clock Recovery</em></strong>”.</p>



<p>With 128b/130b, 128 bits of payload data are sent with 2 control bits prepended to them.</p>



<h4 class="wp-block-heading">Quick examples</h4>



<p><em>Let’s simplify it with an 8b/10b example:</em> according to IEEE 802.3 clause 36, table 36–1a, based on Ethernet specifications, here is the 8b/10b encoding table:</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="600" height="546" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/PCIe-04.png" alt="IEEE 802.3 clause 36, table 36–1a - 8b/10b encoding table" class="wp-image-18770" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-04.png 600w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/PCIe-04-300x273.png 300w" sizes="auto, (max-width: 600px) 100vw, 600px" /><figcaption>IEEE 802.3 clause 36, table 36–1a &#8211; 8b/10b encoding table</figcaption></figure></div>



<p>So how can the receiver tell the difference between all those repeating 0s (Code Group Name D0.0)?</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-1024x819.png" alt="Repeating bits everywhere" class="wp-image-18777" width="512" height="410" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-1024x819.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-300x240.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4-768x615.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/2B41AC73-59D2-4230-B8F4-73327F3991E4.png 1381w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>8b/10b encoding is composed of 5b/6b + 3b/4b encodings.</p>



<p>Therefore <strong>00000 000</strong> will be encoded into <strong>100111 0100</strong>: the first 5 bits of the original data, <strong>00000</strong>, are encoded to <strong>100111</strong> using 5b/6b encoding (<strong>rd-</strong> column); the second group of 3 bits of original data, <strong>000</strong>, is encoded into <strong>0100</strong> using 3b/4b encoding (<strong>rd+</strong> column, since the unbalanced 6-bit group flipped the running disparity).</p>



<p>It could also have been <strong>5b/6b encoding rd+ </strong>and<strong> 3b/4b encoding rd-</strong>, turning <strong>00000 000</strong> into <strong>011000 1011</strong>.</p>



<p><strong>Therefore the original data, which was 8 bits, is now 10 bits due to the control bits (1 control bit for 5b/6b and 1 for 3b/4b).</strong></p>



<p>But don&#8217;t worry, I will draft a dedicated blog post about encoding later.</p>
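<p>In the meantime, here is a minimal sketch of that D0.0 example. It tabulates only this one code group (a real encoder covers all 256 data codes plus the control codes) and tracks the running disparity (RD), which flips after each unbalanced sub-group:</p>

```python
# Minimal 8b/10b sketch for a single code group (D0.0 = data byte 0x00).
FIVE_SIX = {  # EDCBA=00000 -> abcdei, per current running disparity
    ('00000', '-'): ('100111', '+'),  # four 1s vs two 0s: RD flips to +
    ('00000', '+'): ('011000', '-'),  # two 1s vs four 0s: RD flips to -
}
THREE_FOUR = {  # HGF=000 -> fghj
    ('000', '-'): ('1011', '+'),
    ('000', '+'): ('0100', '-'),
}

def encode_d00(rd):
    """Encode data byte 0x00 (D0.0) given the current running disparity."""
    six, rd = FIVE_SIX[('00000', rd)]
    four, rd = THREE_FOUR[('000', rd)]
    return six + ' ' + four, rd

code, rd = encode_d00('-')   # start with negative running disparity
print(code)  # 100111 0100
```

<p>Starting from the opposite disparity yields the complementary code group, 011000 1011, exactly as in the example above.</p>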



<p><strong>PCIe Generations 1 and 2 were designed with 8b/10b encoding</strong>, meaning that the <strong>actual data transmitted was only 80% of the total load</strong> (as 20%, i.e. 2 bits out of every 10, are used for clock synchronization).</p>



<p><strong>PCIe Gen 3 &amp; 4 were designed with 128b/130b</strong>, meaning that the <strong>control bits now represent only about 1.5% of the transmitted bits (2 out of 130)</strong>. Quite good, isn’t it?</p>



<h3 class="wp-block-heading">Let’s calculate the PCIe bandwidth together</h3>



<p>Here is the table of PCIe version specifications:</p>



<figure class="wp-block-table"><table><thead><tr><th>Number of Lanes</th><th>PCIe 1.0 (2003)</th><th>PCIe 2.0 (2007)</th><th><strong>PCIe 3.0 (2010)</strong></th><th><strong>PCIe 4.0 (2017)</strong></th><th>PCIe 5.0 (2019)</th><th>PCIe 6.0 (2021)</th></tr></thead><tbody><tr><td>x1</td><td>250 MB/s</td><td>500 MB/s</td><td>1 GB/s</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td></tr><tr><td>x2</td><td>500 MB/s</td><td>1 GB/s</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td><td>16 GB/s</td></tr><tr><td>x4</td><td>1 GB/s</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td><td>16 GB/s</td><td>32 GB/s</td></tr><tr><td>x8</td><td>2 GB/s</td><td>4 GB/s</td><td>8 GB/s</td><td>16 GB/s</td><td>32 GB/s</td><td>64 GB/s</td></tr><tr><td><strong>x16</strong></td><td>4 GB/s</td><td>8 GB/s</td><td><strong>16 GB/s</strong></td><td>32 GB/s</td><td>64 GB/s</td><td>128 GB/s</td></tr></tbody></table><figcaption>consortium PCI-SIG PCIe theoretical bandwidth/Lane/Way specification sheet</figcaption></figure>



<figure class="wp-block-table"><table><thead><tr><th>                                </th><th>PCIe 1.0 (2003)</th><th>PCIe 2.0 (2007)</th><th>PCIe 3.0 (2010)</th><th>PCIe 4.0 (2017)</th><th>PCIe 5.0 (2019)</th><th>PCIe 6.0 (2021)</th></tr></thead><tbody><tr><td><strong>Frequency</strong></td><td>2.5 GT/s</td><td>5.0 GT/s</td><td>8.0 GT/s</td><td>16 GT/s</td><td>32 GT/s</td><td>64 GT/s</td></tr></tbody></table><figcaption>consortium PCI-SIG PCIe theoretical raw bit rate specification sheet</figcaption></figure>



<p>To obtain these numbers, let&#8217;s look at the general bandwidth formula:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-1024x155.jpeg" alt="" class="wp-image-18793" width="512" height="78" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-1024x155.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-300x46.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB-768x117.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/B529B3E3-419B-49DE-9544-8B7BF190D3BB.jpeg 1298w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<ul class="wp-block-list"><li>BW stands for Bandwidth</li><li>MT/s&nbsp;: Mega Transfers per second</li><li>Encoding could be 4b/5b, 8b/10b, 128b/130b,&nbsp;…</li></ul>
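<p>To make the formula concrete, here is a small Python sketch (again, just an illustration) that reproduces the per-lane figures from the tables above:</p>

```python
# Effective bandwidth per lane = transfer rate * encoding efficiency / 8 bits per byte.
def pcie_bw_per_lane_mb_s(rate_mt_s: float, payload_bits: int, total_bits: int) -> float:
    """Per-lane bandwidth in MB/s, from a rate in MT/s and the line encoding."""
    return rate_mt_s * (payload_bits / total_bits) / 8

# PCIe 1.0: 2500 MT/s with 8b/10b encoding -> 250 MB/s per lane.
gen1 = pcie_bw_per_lane_mb_s(2500, 8, 10)

# PCIe 3.0: 8000 MT/s with 128b/130b encoding -> ~984.6 MB/s per lane.
gen3 = pcie_bw_per_lane_mb_s(8000, 128, 130)

# An x16 slot (e.g. for a GPU) multiplies the per-lane figure by 16.
gen3_x16_gb = gen3 * 16 / 1000  # ~15.75 GB/s per direction
print(gen1, gen3, gen3_x16_gb)
```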



<h4 class="wp-block-heading">For PCIe v1.0:</h4>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-1024x170.jpeg" alt="BW/lane\ (MB/s) = \ 2\ 500\ (MT/s)\ *\ \frac{8\ bits}{10\ bits} * \frac{1\ Byte}{8\ bits" class="wp-image-18785" width="512" height="85" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-1024x170.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-300x50.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227-768x127.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/A99597E2-4117-43B1-9048-1CE24EFAE227.jpeg 1231w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB.jpeg" alt="BW/lane\ (MB/s) = \ 250\ (MB/s)" class="wp-image-18788" width="347" height="79" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB.jpeg 806w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB-300x67.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/BC5F6C70-2FCF-4CD4-9040-848C8EB654CB-768x172.jpeg 768w" sizes="auto, (max-width: 347px) 100vw, 347px" /></figure></div>



<h4 class="wp-block-heading">For PCIe v3.0 (the one that interests us for the NVIDIA V100):</h4>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-1024x154.jpeg" alt="BW/lane\ (MB/s) = \ 8\ 000\ (MT/s)\ *\ \frac{128\ bits}{130\ bits} * \frac{1\ Byte}{8\ bits}" class="wp-image-18795" width="512" height="77" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-1024x154.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-300x45.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A-768x115.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/6EFDAF22-C7FC-44FC-B5BE-D8C4D291B71A.jpeg 1292w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F.jpeg" alt="BW/lane\ (MB/s) = \ 984.6\ (MB/s)" class="wp-image-18796" width="355" height="63" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F.jpeg 802w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F-300x53.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/3B7E1754-67C8-4EF1-88BE-3A5D8985803F-768x136.jpeg 768w" sizes="auto, (max-width: 355px) 100vw, 355px" /></figure></div>



<p>Therefore, with <strong>16 lanes for an NVIDIA V100 connected in PCIe v3.0</strong>, we have an effective data transfer rate (data bandwidth)<strong> of nearly 16 GB/s per direction </strong>(<strong>the actual bandwidth is 15.75 GB/s per direction</strong>).</p>



<p>Be careful not to get confused: total bandwidth can also be quoted as the bidirectional (two-way) figure, in which case the total x16 bandwidth is around 32 GB/s.</p>



<p><em><strong>Note :</strong></em> Another element that we haven&#8217;t considered is that the maximum theoretical bandwidth needs to be reduced by around 1 Gb/s for error correction protocols (<strong><em>ECRC</em></strong> and <strong><em>LCRC</em></strong>) as well as the <strong><em>Headers</em></strong> (<strong><em>Start tag, Sequence tag, Header</em></strong>) and <strong><em>Footer</em></strong> (<em><strong>End</strong></em> tag) overheads explained earlier in this blog post.</p>



<h3 class="wp-block-heading">In conclusion</h3>



<p>We have seen that PCI Express has evolved a lot, and that it&#8217;s based on the same concepts as networking. To get the best from PCIe devices, it is necessary to understand the fundamentals of the underlying infrastructure.</p>



<p>Failing to choose the right underlying motherboard, CPU or bus can lead to major performance bottlenecks and GPU under-performance.</p>



<p>To sum up:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Friends don&#8217;t let friends build their own GPUs hosts 😉</p><cite>Jean-Louis Quéguiner July 1<sup>st</sup>, 2020</cite></blockquote>



<p>If you liked this post but you want to drill down a bit into the Deep Learning and AI aspect of things don&#8217;t hesitate to check out my other blog posts:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png" alt="" class="wp-image-18099" width="515" height="376" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png 748w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028-300x219.png 300w" sizes="auto, (max-width: 515px) 100vw, 515px" /></a></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/what-does-training-neural-networks-mean/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png" alt="What does training neural networks mean?" class="wp-image-17932" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1.png 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/distributed-training-in-a-deep-learning-context/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png" alt="Distributed Learning in a Deep Learning context" class="wp-image-18106" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D.png 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure></div>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-pci-express-works-and-why-you-should-care-gpu%2F&amp;action_name=How%20PCI-Express%20works%20and%20why%20you%20should%20care%3F%20%23GPU&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Distributed Training in a Deep Learning Context</title>
		<link>https://blog.ovhcloud.com/distributed-training-in-a-deep-learning-context/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Tue, 05 May 2020 10:14:07 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Neural networks]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17871</guid>

					<description><![CDATA[Previously on OVHcloud Blog &#8230; In previous blog posts we have discussed a high level approach to deep learning as well as what is meant by &#8216;training&#8217; in relation to Deep Learning. Following the article, I had lots of questions entering my twitter inbox, especially regarding how GPUs actually works. I decided, therefore, to write [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdistributed-training-in-a-deep-learning-context%2F&amp;action_name=Distributed%20Training%20in%20a%20Deep%20Learning%20Context&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png" alt="Distributed Learning in a Deep Learning context" class="wp-image-18106" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Previously on OVHcloud Blog &#8230;</h3>



<p>In previous blog posts we have discussed a <a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude">high level approach to deep learning</a> as well as what is meant by &#8216;training&#8217; in relation to Deep Learning.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png" alt="" class="wp-image-18099" width="374" height="273" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png 748w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028-300x219.png 300w" sizes="auto, (max-width: 374px) 100vw, 374px" /></a></figure></div>



<p>Following that article, I had lots of questions arriving in my Twitter inbox, especially regarding how GPUs actually work.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="410" height="157" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/image.png" alt="" class="wp-image-17882" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/image.png 410w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/image-300x115.png 300w" sizes="auto, (max-width: 410px) 100vw, 410px" /><figcaption>Don&#8217;t worry it&#8217;s a friend, he is ok with me sharing the DM 😉</figcaption></figure></div>



<p>I decided, therefore, to write an article on how GPUs work:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png" alt="" class="wp-image-18103" width="334" height="254" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png 668w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED-300x228.png 300w" sizes="auto, (max-width: 334px) 100vw, 334px" /></a></figure></div>



<p>During our R&amp;D process around hardware and AI models, the question of distributed training came up (quickly). But before looking in-depth at distributed training, I invite you to read the following article to understand how Deep Learning training actually works:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/what-does-training-neural-networks-mean/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png" alt="What does training neural networks mean?" class="wp-image-17932" width="476" height="249" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1.png 1200w" sizes="auto, (max-width: 476px) 100vw, 476px" /></a></figure></div>



<p>As previously discussed, Neural Network training depends on:</p>



<ul class="wp-block-list"><li>Input Data</li><li>Neural Network architecture composed of &#8216;Layers&#8217;</li><li>Weights</li><li>Learning Rate (step used to adjust neural network weights)</li></ul>



<h2 class="wp-block-heading">Why do we need distributed Learning</h2>



<p>Deep Learning is mainly used for learning patterns in unstructured data. <strong>Unstructured data &#8211; such as text corpora, images, video or sound &#8211; can represent a huge amount of data to train on.</strong></p>



<p>Training on such data can take days or even weeks because of the size of the data and/or the size of the network.</p>



<p>Multiple distributed learning approaches can be considered.</p>



<h2 class="wp-block-heading">The different Distributed Learning approaches</h2>



<p>There are two main categories for distributed training when it comes to Deep Learning and both of them are based on the <strong><a rel="noreferrer noopener nofollow external" href="https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm" target="_blank" data-wpel-link="external">divide and conquer paradigm.</a></strong></p>



<p>The first category is named <strong>&#8220;Distributed Data Parallelism&#8221;</strong>, where the <strong>data is split across multiple GPUs</strong>.</p>



<p>The second category is called <strong>&#8220;Model Parallelism&#8221;</strong>, where the deep learning <strong>model is split across multiple GPUs</strong>.</p>



<p>However, <strong>Distributed Data Parallelism</strong> is the most common approach, as it <strong>fits almost any problem</strong>. The second approach has some serious technical limitations when it comes to splitting a model. Splitting a model is a highly technical exercise, as you need to know the space used by each part of the network in the <strong>DRAM</strong> of the GPU. Once you have the <strong>DRAM usage per slice</strong>, you need to enforce the computation by <strong>hard coding the placement of the Neural Network Layers onto the desired GPUs</strong>. <strong>This approach makes it hardware-related</strong>, as the DRAM may vary from one GPU to another, while <strong>Distributed Data Parallelism</strong> just requires <strong>data size adjustments (usually the batch size), which is relatively simple</strong>.</p>



<p>The <strong>Distributed Data Parallelism</strong> model has two variants, each of which has its advantages and disadvantages. The first variant lets you train a model with <strong>synchronous weight adjustment</strong>: <strong>each GPU processes its training batch and returns the corrections</strong> that need to be made to the model, and <strong>the system waits until all the workers have finished their task before producing the new set of weights</strong> that will be used for the next training batch.</p>



<p>The second variant lets you work in an <strong>asynchronous way</strong>: each batch on each GPU reports the corrections that need to be made to the neural network, and the <strong>weights coordinator</strong> sends back a <strong>new set of weights without waiting for the other GPUs to finish training their own batches</strong>.</p>
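<p>The difference between the two variants can be sketched in a few lines of Python (a toy single-process simulation on one scalar weight, not a real distributed framework): in the synchronous variant the update waits for every worker&#8217;s gradient, while in the asynchronous one each gradient is applied as soon as it arrives.</p>

```python
# Toy illustration of synchronous vs asynchronous data parallelism
# on a single scalar weight (no real GPUs or networking involved).

def synchronous_step(weight, worker_gradients, lr=0.1):
    """All workers finish, gradients are averaged, then one update is applied."""
    avg_grad = sum(worker_gradients) / len(worker_gradients)
    return weight - lr * avg_grad

def asynchronous_steps(weight, worker_gradients, lr=0.1):
    """Each worker's gradient is applied as soon as it arrives,
    possibly against weights already changed by other workers."""
    for grad in worker_gradients:
        weight = weight - lr * grad
    return weight

grads = [0.5, 1.0, 1.5]                   # gradients reported by 3 workers
w_sync = synchronous_step(1.0, grads)     # one averaged update
w_async = asynchronous_steps(1.0, grads)  # three independent updates
print(w_sync, w_async)  # 0.9  0.7
```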



<h2 class="wp-block-heading">3 cheat sheets to better understand Distributed Deep Learning</h2>



<p>In these cheat sheets, let&#8217;s assume you&#8217;re using Docker with a volume attached.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="942" height="1024" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/type1-942x1024.png" alt="" class="wp-image-18048" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-942x1024.png 942w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-276x300.png 276w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-768x835.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1.png 1004w" sizes="auto, (max-width: 942px) 100vw, 942px" /></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="904" height="1024" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/type2-904x1024.png" alt="" class="wp-image-18049" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-904x1024.png 904w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-265x300.png 265w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-768x870.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2.png 945w" sizes="auto, (max-width: 904px) 100vw, 904px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1543" height="2182" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/distrib-training1.jpeg" alt="" class="wp-image-18036" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1.jpeg 1543w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-212x300.jpeg 212w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-724x1024.jpeg 724w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-768x1086.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-1086x1536.jpeg 1086w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-1448x2048.jpeg 1448w" sizes="auto, (max-width: 1543px) 100vw, 1543px" /></figure>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78.png" alt="" class="wp-image-18096" width="320" height="322" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78.png 640w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78-150x150.png 150w" sizes="auto, (max-width: 320px) 100vw, 320px" /><figcaption>Now you need to choose your Distributed Training strategy (wisely)</figcaption></figure></div>



<p></p>



<h2 class="wp-block-heading">Further Readings</h2>



<p>While we have covered a lot in this blog post, we haven&#8217;t covered nearly all the aspects of Deep Learning distributed training &#8211; including prior work, history and the associated mathematics.</p>



<p>I highly suggest that you read the great paper<em> <a href="https://stanford.edu/~rezab/classes/cme323/S16/projects_reports/hedge_usmani.pdf" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Parallel and Distributed Deep Learning</a></em> by <strong>Vishakh Hegde</strong> and <strong>Sheema</strong> <strong>Usmani</strong> (both from Stanford University).</p>



<p>As well as the article <em><a href="https://arxiv.org/pdf/1802.09941.pdf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis</a></em>, written by <strong>Tal Ben-Nun </strong>and <strong>Torsten Hoefler</strong> of ETH Zurich, Switzerland. I suggest that you start by jumping to <strong>section 6.3</strong>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdistributed-training-in-a-deep-learning-context%2F&amp;action_name=Distributed%20Training%20in%20a%20Deep%20Learning%20Context&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What does Training Neural Networks mean?</title>
		<link>https://blog.ovhcloud.com/what-does-training-neural-networks-mean/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Wed, 22 Apr 2020 16:37:25 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Neural networks]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17859</guid>

					<description><![CDATA[In a previous blog post we discussed general concepts surrounding Deep Learning. In this blog post, we will go deeper into the basic concepts of training a (deep) Neural Network. Where does &#8220;Neural&#8221; comes from ? As you should know, a biological neuron is composed of multiple dendrites, a nucleus and a axon (if only [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwhat-does-training-neural-networks-mean%2F&amp;action_name=What%20does%20Training%20Neural%20Networks%20mean%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>In a previous<a rel="noreferrer noopener" href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" target="_blank" data-wpel-link="exclude"> blog post</a> we discussed general concepts surrounding Deep Learning. In this blog post, we will go deeper into the basic concepts of training a (deep) Neural Network.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png" alt="" class="wp-image-17932" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Where does &#8220;Neural&#8221; come from?</h2>



<p>As you should know, a <strong>biological neuron</strong> is composed of multiple <strong>dendrites</strong>, a <strong>nucleus</strong> and an <strong>axon</strong> (if only you had paid attention in your biology classes). When a stimulus is sent to the brain, it is received through the <strong>synapses</strong> located at the extremities of the dendrites.</p>



<p>When a <strong>stimulus</strong> arrives at the brain, it is transmitted to the neuron via the <strong>synaptic receptors</strong>, which<strong> adjust the strength of the signal sent to the nucleus</strong>. This message is <strong>transported</strong> by the <strong>dendrites</strong> to the <strong>nucleus</strong>, to then be <strong>processed</strong> in <strong>combination</strong> with other signals emanating from receptors on the other dendrites. <strong>Thus the combination of all these signals takes place in the nucleus.</strong> After processing all these signals, <strong>the nucleus emits an output signal through its single axon</strong>. The axon then streams this signal to several other downstream neurons via its <strong>axon terminations</strong>. Thus a neuron&#8217;s analysis is pushed to the subsequent layers of neurons. When you are confronted with the complexity and efficiency of this system, you can only imagine the millennia of biological evolution that brought us here.</p>



<p>On the other hand, <strong>artificial neural networks </strong>are built on the principle of bio-mimicry. <strong>External stimuli (the data), </strong>whose signal strength is adjusted by the <strong>neuronal weights </strong>(remember the <strong>synapse</strong>?), <strong>circulate to the neuron</strong> (the place where the mathematical calculation happens) via the dendrites. The result of the calculation &#8211; called the <strong>output</strong> &#8211; is then re-transmitted (via the axon) to several other neurons in subsequent layers, where it is combined with other outputs, and so on.</p>



<p>Therefore, there is a clear parallel between biological neurons and artificial neural networks, as presented in the figure below.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/9A5100D7-A350-46FA-B1EB-190CFE0E9AF6-699x1024.png" alt="" class="wp-image-17933" width="350" height="512" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/9A5100D7-A350-46FA-B1EB-190CFE0E9AF6-699x1024.png 699w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/9A5100D7-A350-46FA-B1EB-190CFE0E9AF6-205x300.png 205w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/9A5100D7-A350-46FA-B1EB-190CFE0E9AF6.png 747w" sizes="auto, (max-width: 350px) 100vw, 350px" /><figcaption>Based on https://medium.com/swlh/learning-paradigms-in-neural-networks-30854975aa8d</figcaption></figure></div>



<h2 class="wp-block-heading">The Artificial Neural Network Recipe</h2>



<p>To build a good Artificial Neural Network (ANN), you will need the following ingredients:</p>



<h4 class="wp-block-heading"> Ingredients:</h4>



<ul class="wp-block-list"><li><strong>Artificial Neurons</strong> (processing node) composed of:<ul><li>(many) <strong>input </strong>neuron connection(s) (dendrites)</li><li>a <strong>computation unit </strong>(nucleus) composed of:<ul><li>a <strong>linear function</strong> (ax+b)</li><li>an <strong>activation function</strong> (equivalent to the <strong>synapse</strong>)</li></ul></li><li>an <strong>output</strong> (axon)</li></ul></li></ul>
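<p>These ingredients map directly to a few lines of code. Here is a minimal single-neuron sketch in plain Python (the inputs, weights and choice of a sigmoid activation are made up for illustration):</p>

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a linear function (a*x + b) followed by
    an activation function (here a sigmoid, playing the synapse's role)."""
    # Linear part: weighted sum of the inputs plus a bias.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Activation part: squashes the result into (0, 1).
    return 1 / (1 + math.exp(-z))

# Example: 3 dendrite-like inputs feeding a single nucleus.
output = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.8, 0.2, 0.1], bias=0.0)
print(output)
```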



<h2 class="wp-block-heading">Preparation to get an ANN for image classification training:</h2>



<ol class="wp-block-list"><li>Decide on the<strong> number of output classes </strong>(meaning the number of image classes &#8211; for example, two for cat vs dog)</li><li>Draw as many computation units as the <strong>number of output classes</strong> (congrats, you just created the <strong>Output Layer</strong> of the ANN)</li><li>Add as many <strong>Hidden Layers</strong> as needed within the defined <strong>architecture</strong> (for instance <a rel="noreferrer noopener nofollow external" href="https://neurohive.io/en/popular-networks/vgg16/" target="_blank" data-wpel-link="external">vgg16</a> or <a rel="noreferrer noopener nofollow external" href="https://neurohive.io/en/popular-networks/" target="_blank" data-wpel-link="external">any other popular architecture</a>). Tip &#8211; <strong>Hidden Layers</strong> are just sets of neighbouring <strong>Compute Units</strong>; they are not linked together.</li><li>Stack those<strong> Hidden Layers </strong>onto the <strong>Output Layer</strong> using <strong>Neural Connections</strong></li><li>It is important to understand that the <strong>Input Layer</strong> is basically a layer of data ingestion</li><li>Add an <strong>Input Layer</strong> that is adapted to ingest your data (or adapt your data format to the pre-defined architecture)</li><li>Assemble many Artificial Neurons together in such a way that the <strong>output</strong> (axon) of a <strong>Neuron</strong> on a given <strong>Layer</strong> is (one of) the <strong>input</strong>(s) of another <strong>Neuron</strong> on a subsequent <strong>Layer</strong>. As a consequence, the <strong>Input Layer </strong>is linked to the <strong>Hidden Layers</strong>, which are then linked to the <strong>Output Layer</strong> (as shown in the picture below) using <strong>Neural Connections</strong> (also shown in the picture below).</li><li>Enjoy your meal</li></ol>
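<p>The recipe above can be sketched as code: stack hidden layers onto an output layer, with each neuron&#8217;s output feeding the next layer. This is a toy fully-connected network with made-up sizes and random weights, nothing like a real architecture such as vgg16:</p>

```python
import math
import random

random.seed(0)

def layer(n_inputs, n_neurons):
    """A fully-connected layer: one weight per input connection plus a bias
    for each neuron. Neurons within a layer are not linked together."""
    return [([random.uniform(-1, 1) for _ in range(n_inputs)], 0.0)
            for _ in range(n_neurons)]

def forward(inputs, layers):
    """Feed the ingested data through each hidden layer to the output layer."""
    for lyr in layers:
        inputs = [1 / (1 + math.exp(-(sum(w * x for w, x in zip(weights, inputs)) + bias)))
                  for weights, bias in lyr]
    return inputs

# Input of size 4, one hidden layer of 3 neurons, output layer with 2 classes.
network = [layer(4, 3), layer(3, 2)]
scores = forward([0.1, 0.2, 0.3, 0.4], network)
print(scores)  # two values, one per output class
```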



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/9C2D3CB9-4385-4348-963A-3DF79E3C3C62.png" alt="" class="wp-image-17934" width="476" height="371" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/9C2D3CB9-4385-4348-963A-3DF79E3C3C62.png 951w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/9C2D3CB9-4385-4348-963A-3DF79E3C3C62-300x234.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/9C2D3CB9-4385-4348-963A-3DF79E3C3C62-768x598.png 768w" sizes="auto, (max-width: 476px) 100vw, 476px" /><figcaption>simplified schema of a neural network architecture</figcaption></figure></div>



<h2 class="wp-block-heading">What does it mean to train an Artificial Neural Network ?</h2>



<p>All <strong>Neurons</strong> of a given <strong>Layer</strong> generate an <strong>Output</strong>, but they don&#8217;t all have the same <strong>Weight</strong> for the next<strong> Neurons Layer</strong>. This means that if a Neuron on a layer observes a given pattern, it might mean less for the overall picture, and will be partially or completely muted. This is what we call <strong>Weighting</strong>: a <strong>big weight means that the Input is important</strong>, and of course <strong>a small weight means that we should ignore it</strong>. Every <strong>Neural Connection</strong> between <strong>Neurons</strong> has <strong>an associated Weight</strong>.</p>



<p>And this is the magic of<strong> Neural Network Adaptability</strong>: <strong>Weights</strong> will be adjusted over the training to fit the <strong>objectives</strong> we have set (recognize that a dog is a dog and that a cat is a cat). <strong>In simple terms: Training a Neural Network means finding the appropriate Weights of the Neural Connections thanks to a feedback loop called Gradient Backpropagation&#8230; and that&#8217;s it, folks.</strong></p>



<h2 class="wp-block-heading">Parallel between Control Theory and Deep Learning Training</h2>



<p>The engineering field of <strong>control theory</strong> defines similar principles to the mechanism used for training neural networks.</p>



<h3 class="wp-block-heading">Control Theory general concepts</h3>



<p>In control systems, a <strong>setpoint</strong> is the target value<strong> for the system.</strong><br><br>A <strong>setpoint</strong> (<strong>input</strong>) is defined and then processed by a controller, which adjusts the setpoint&#8217;s value according to the feedback loop (<strong>Manipulated Variable</strong>). Once the <strong>setpoint</strong> has been <strong>adjusted</strong> it is then sent to the <strong>controlled system</strong> which will <strong>produce an output.</strong> This output is monitored using an appropriate metric which is then <strong>compared (comparator) to the original input </strong>via a <strong>feedback loop</strong>. This allows the <strong>controller</strong> to define the <strong>level of adjustment (Manipulated Variable) </strong>of the original setpoint.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-1024x381.jpeg" alt="" class="wp-image-17889" width="512" height="191" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-1024x381.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-300x112.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-768x286.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-1536x572.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-2048x762.jpeg 2048w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<h3 class="wp-block-heading">Control Theory applied to a radiator</h3>



<p>Let&#8217;s take the example of a <strong>resistance (controlled system)</strong> in a radiator. Imagine you decide to <strong>set the room temperature to 20°C (setpoint)</strong>. The radiator starts up and supplies the <strong>resistance</strong> with a <strong>certain intensity</strong> defined by the <strong>controller</strong>. A <strong>probe (thermometer)</strong> then takes the ambient temperature (<strong>feedback elements</strong>), which is <strong>compared (comparator)</strong> <strong>to the setpoint </strong>(desired temperature); the <strong>controller</strong> then adjusts the electric intensity sent to the resistance. The adjustment of the new intensity is deployed via an <strong>incremental adjustment step</strong>.</p>
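<p>The radiator loop can be simulated in a few lines. This is a hypothetical proportional controller; the gain and heat-loss figures below are invented for illustration:</p>

```python
setpoint = 20.0  # desired room temperature in °C
temp = 10.0      # ambient temperature, read by the probe (feedback element)
gain = 0.3       # controller: how aggressively to correct the error

for _ in range(50):
    error = setpoint - temp               # comparator
    power = gain * error                  # manipulated variable (intensity)
    temp += power - 0.05 * (temp - 10.0)  # controlled system + heat loss

print(round(temp, 1))  # settles near, but slightly below, 20°C
```

<p>A purely proportional controller like this one settles slightly below the setpoint (a steady-state error); real thermostats add an integral term to close that gap.</p>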



<h3 class="wp-block-heading">Control Theory applied to Neural Network Training</h3>



<p>The training of a neural network is similar to the radiator insofar as the controlled system is the cat or dog detection model.<br><br>The objective is no longer to minimize the difference between the setpoint temperature and the actual temperature, but to <strong>minimize the error (Loss) between the classification of the incoming data (a cat is a cat) and the one made by the neural network.</strong><br><br>In order to achieve this, the system will have to look at the <strong>input</strong> (<strong>setpoint</strong>) and <strong>compute an output </strong>(<strong>controlled system</strong>) based on the parameters defined in the algorithm. This phase is called the<strong> forward pass.</strong></p>



<p>Once the <strong>output</strong> has been calculated, the system will <strong>re-propagate the evaluation error</strong> using <strong>Gradient Backpropagation </strong>(<strong>Feedback Elements</strong>). While the temperature difference between the setpoint and the thermometer was converted into electrical intensity for the radiator, here <strong>the system will adjust the weights of the different inputs of each neuron with a given step (learning rate)</strong>.</p>
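<p>Stripped of everything else, one weight adjustment is: compute the gradient of the error, then move the weight against it by the learning rate. A one-parameter sketch (the loss function here is arbitrary):</p>

```python
# Minimise loss(w) = (w - 3)**2, whose minimum sits at w = 3.
w = 0.0
learning_rate = 0.1

for _ in range(100):
    grad = 2 * (w - 3)         # backward pass: d(loss)/dw
    w -= learning_rate * grad  # adjust the weight with the given step

print(round(w, 3))  # converges towards 3.0
```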



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-1024x383.jpeg" alt="" class="wp-image-17888" width="512" height="192" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-1024x383.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-300x112.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-768x288.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-1536x575.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-2048x767.jpeg 2048w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption>Parallel between electrical engineering controlled system and neural network training process</figcaption></figure>



<h2 class="wp-block-heading">One thing to consider: The Valley Problem</h2>



<p>When training the system, the backward propagation will lead the system to reduce the error it&#8217;s making to best fit the objectives you have set (finding that a dog is a dog&#8230;).</p>



<p>Choosing the learning rate at which you will adjust your weights (what is called the<strong> adjustment step</strong> in <a rel="noreferrer noopener nofollow external" href="https://en.wikipedia.org/wiki/Control_theory" target="_blank" data-wpel-link="external">Control Theory</a>) is therefore critical.</p>



<p>Just as is the case in control theory, the control system can face several issues if it is not designed correctly:</p>



<ul class="wp-block-list"><li>If the <strong>correction step (learning rate)</strong> is too small, it will lead to very slow convergence (i.e. it will take a very long time to get your room to 20°C&#8230;).</li><li>Too small a <strong>learning rate</strong> can also leave you <strong>stuck in a local minimum</strong></li><li>If the <strong>correction step (learning rate)</strong> is too high, the system will never converge (it will beat around the bush) &#8211; i.e. the radiator will oscillate between being too hot and too cold</li><li>Worse, the system could enter a resonance state (<strong>divergence</strong>).</li></ul>
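<p>These regimes are easy to reproduce on a one-dimensional toy loss. The function and the rates below are chosen purely to expose the three behaviours:</p>

```python
def final_w(learning_rate, steps=100):
    """Gradient descent on loss(w) = w**2; the true minimum is w = 0."""
    w = 1.0
    for _ in range(steps):
        w -= learning_rate * 2 * w  # gradient of w**2 is 2*w
    return w

print(final_w(0.001))  # too small: w has barely moved after 100 steps
print(final_w(0.5))    # well chosen: lands exactly on the minimum
print(final_w(1.5))    # too high: w overshoots on every step and diverges
```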



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-787x1024.jpeg" alt="Why Training a Neural Network Is Hard" class="wp-image-17868" width="394" height="512" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-787x1024.jpeg 787w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-230x300.jpeg 230w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-768x1000.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-1180x1536.jpeg 1180w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-1574x2048.jpeg 1574w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-scaled.jpeg 1967w" sizes="auto, (max-width: 394px) 100vw, 394px" /></figure>



<h2 class="wp-block-heading">In the end, training an Artificial Neural Network (ANN) requires just a few steps:</h2>



<ol class="wp-block-list"><li>First an ANN will require a <strong>random weight initialization</strong></li><li>Split the dataset in <strong>batches</strong> <strong>(batch size)</strong></li><li>Send the batches one by one to the GPU</li><li>Calculate the<strong> forward pass</strong> (what would be the output with the current weights)</li><li>Compare the calculated output to the expected output <strong>(loss)</strong></li><li>Adjust the <strong>weights</strong> (using the <strong>learning rate</strong> increment or decrement) according to the <strong>backward pass (backward gradient propagation)</strong>.</li><li>Go back to step 2</li></ol>
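<p>The seven steps map almost one-to-one onto a training loop. Below is a minimal NumPy sketch on a toy regression task (the data, sizes and learning rate are illustrative, and the &#8220;send to GPU&#8221; of step 3 is of course omitted):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: learn y = 2*x from 100 examples.
X = rng.normal(size=(100, 1))
y = 2 * X

w = rng.normal(size=(1, 1))  # 1. random weight initialization
learning_rate = 0.1
batch_size = 10              # 2. split the dataset in batches

for epoch in range(20):                          # 7. back to step 2
    for start in range(0, len(X), batch_size):   # 3. take batches one by one
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        pred = xb @ w                            # 4. forward pass
        loss = ((pred - yb) ** 2).mean()         # 5. compare to expected (loss)
        grad = 2 * xb.T @ (pred - yb) / len(xb)  # 6. backward gradient
        w -= learning_rate * grad                #    adjust the weights

print(round(float(w[0, 0]), 2))  # close to the true value 2.0
```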



<h2 class="wp-block-heading">Further notice</h2>



<p>That’s all folks, you are now all set to read our future blog post which focuses on <strong>Distributed Training in a Deep Learning Context.</strong></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwhat-does-training-neural-networks-mean%2F&amp;action_name=What%20does%20Training%20Neural%20Networks%20mean%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Understanding the anatomy of GPUs using Pokémon</title>
		<link>https://blog.ovhcloud.com/understanding-the-anatomy-of-gpus-using-pokemon/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Wed, 13 Mar 2019 16:25:32 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14482</guid>

					<description><![CDATA[Please welcome this beautiful new born in GPGPU Nvidia Family Ampere BLOG UPDATE FROM MAY 14, 2020 In the previous episode&#8230; In our previous blog post about&#160;Deep Learning, we explained that this technology is all about massive parallel matrix computation, and that these computations are simplistic operations: + and x. Fact 1:&#160; GPUs are good [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Funderstanding-the-anatomy-of-gpus-using-pokemon%2F&amp;action_name=Understanding%20the%20anatomy%20of%20GPUs%20using%20Pok%C3%A9mon&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Please welcome this beautiful new born in GPGPU Nvidia Family <strong>Ampere</strong></p><cite>BLOG UPDATE FROM MAY 14, 2020</cite></blockquote>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="581" height="854" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/Capture-d’écran-2020-05-15-à-17.25.01.png" alt="" class="wp-image-18271" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/Capture-d’écran-2020-05-15-à-17.25.01.png 581w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/Capture-d’écran-2020-05-15-à-17.25.01-204x300.png 204w" sizes="auto, (max-width: 581px) 100vw, 581px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/Capture-d’écran-2020-05-15-à-17.36.57-1024x750.png" alt="" class="wp-image-18277" width="768" height="563" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/Capture-d’écran-2020-05-15-à-17.36.57-1024x750.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/Capture-d’écran-2020-05-15-à-17.36.57-300x220.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/Capture-d’écran-2020-05-15-à-17.36.57-768x562.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/Capture-d’écran-2020-05-15-à-17.36.57.png 1146w" sizes="auto, (max-width: 768px) 100vw, 768px" /><figcaption>Congratulations</figcaption></figure></div>



<h3 class="wp-block-heading">In the previous <a href="https://www.ovh.com/fr/blog/deep-learning-explained-to-my-8-year-old-daughter/" rel="nofollow" data-wpel-link="exclude">episode</a>&#8230;</h3>



<p>In our previous blog post about&nbsp;<a href="https://www.ovh.com/fr/blog/deep-learning-explained-to-my-8-year-old-daughter/" rel="nofollow" data-wpel-link="exclude">Deep Learning,</a> we explained that this technology is all about massive parallel matrix computation, and that these computations are simplistic operations: + and x.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/46A55CAC-42D2-4782-B6D8-03F9A8C49C40-1024x537.png" alt="" class="wp-image-18283" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/46A55CAC-42D2-4782-B6D8-03F9A8C49C40-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/46A55CAC-42D2-4782-B6D8-03F9A8C49C40-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/46A55CAC-42D2-4782-B6D8-03F9A8C49C40-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/46A55CAC-42D2-4782-B6D8-03F9A8C49C40.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Fact 1:&nbsp; GPUs are good for (drum roll)&#8230;</h3>



<p>Once you get that Deep Learning is just massive parallel matrix multiplications and additions, the magic happens. General Purpose Graphic Processing Units (GPGPU) (i.e. GPUs, or variants of GPUs, designed for something other than graphic processing) are perfect for&#8230;</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="475" height="263" src="/blog/wp-content/uploads/2019/02/ComplicatedOldIcterinewarbler-size_restricted.gif" alt="" class="wp-image-14672"/></figure></div>



<p>matrix multiplications and additions!</p>
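<p>The reason is that every cell of a matrix product is an independent multiply-and-add, so thousands of them can be computed simultaneously. A toy product makes this visible:</p>

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Each output cell is a small dot product, e.g. C[0][0] = 1*5 + 2*7 = 19,
# and no cell depends on any other cell's result: perfect for parallel
# hardware.
C = A @ B
print(C)
```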



<p>Perfect, isn&#8217;t it? But why? Let me tell you a little story&#8230;</p>



<h3 class="wp-block-heading">Fact 2: There was a time when GPUs were just GPUs</h3>



<p>Yes, you read that correctly&#8230;</p>



<p>The first GPUs in the 90s were designed in a very linear way. Engineers took the process used for graphical rendering and implemented it directly in the hardware.</p>



<p>To keep it simple, this is what a graphical rendering process looks like:</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0148-1024x841.png" alt="Graphical rendering process" class="wp-image-15125" width="768" height="631" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0148-1024x841.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0148-300x246.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0148-768x630.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0148-1200x985.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0148.png 1871w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>Uses for GPUs included transformation, building lighting effects, building triangle setups and clipping, and integrating rendering engines at a scale that was not achievable at the time (tens of millions of polygons per second).</p>



<p>The first GPUs integrated the various steps of image processing and rendering in a linear way. Each part of the process had predefined hardware components associated with vertex shaders, tessellation modules, geometry shaders, etc.</p>



<p>In short, graphics cards were initially designed to perform graphical processing. What a surprise!</p>



<h3 class="wp-block-heading">Fact 3: CPUs are sports cars, GPUs are massive trucks</h3>



<p>As explained earlier, for image processing and rendering, you don&#8217;t want your image to be generated pixel by pixel &#8211; you want it in a single shot. That means that every pixel of the image &#8211; representing every object in the camera&#8217;s view, at a given time, in a given position &#8211; needs to be calculated at once.</p>



<p>This is in complete contrast to <strong>CPU</strong> logic, where operations are meant to be achieved in a sequential way. As a result, <strong>GPGPUs</strong> needed a massively parallel general-purpose architecture to be able to process all the points (vertex), build all the meshes (tessellation), build the lighting, perform the object transformation from the absolute referential, apply texture, and perform shading (I&#8217;m still probably missing some parts!). However, the purpose of this blog post is not to look in-depth at image processing and rendering, as we will do that in another blog post in the future.</p>



<p>As explained in our previous post, CPUs are like sports cars, able to calculate a chunk of data really fast with minimal latency, while GPUs are trucks, moving lots of data at once, but suffering from latency as a result.</p>



<p>Mythbusters made a nice video in which the two concepts of CPU and GPU are demonstrated side by side.</p>


<h3 class="wp-block-heading">Fact 4: 2006&nbsp;–&nbsp;NVIDIA killed the image processing Taylorism</h3>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://thumbs.gfycat.com/DirtyHastyAustrianpinscher-size_restricted.gif" alt="Image search result for &quot;modern times gif&quot;"/></figure></div>



<p>The previous method for performing image processing was done using specialised manpower (hardware) at every stage of the production line in the image factory.</p>



<p>This all changed in 2006, when NVIDIA decided to introduce General Purpose Graphical Processing Units using Arithmetic Logical Units (ALUs), aka CUDA cores, which were able to run multi-purpose computations (a bit like a Jean-Claude Van Damme of GPU computation units!).</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://media.lelombrik.net/t/64037621a78f86abb4c7c4e53a6c2b89/p/01.gif" alt=""/><figcaption>GoDaddy Commercial (2013) featuring Jean-Claude Van Damme Source : https://imgur.com/r/gifs/PvuZxBZ</figcaption></figure></div>



<p>Even today,&nbsp;<a href="https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">modern GPU architectures</a> (such as Fermi, Kepler or Volta) are composed of non-general cores, named Special Function Units (SFUs), &nbsp;to run high-performance mathematical graphical operations, such as sin, cosine, reciprocal, and square root, as well as Texture Mapping Units (TMUs) for the high-dimension matrix operations involved in image texture mapping.</p>



<h3 class="wp-block-heading">Fact 5: GPGPUs can be explained simply with Pokémon!</h3>



<p>GPU architectures can seem difficult to understand at first, but trust me&#8230; they are not!</p>



<p>Here is my gift to you: a <a href="https://bulbapedia.bulbagarden.net/wiki/Pok%C3%A9dex" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Pokédex</a> to help you understand GPUs in simple terms.</p>



<h3 class="wp-block-heading">The&nbsp;<em>Micro-Architecture </em>Family</h3>



<h4 class="wp-block-heading">Here&#8217;s how you use it&#8230;</h4>



<p>You basically have four families of cards:</p>



<p>This family will already be known to many of you. We are, of course, talking about Fermi, Maxwell, Kepler, Volta, Ampere, etc.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/gpu-families-387x1024.jpg" alt="" class="wp-image-18293" width="387" height="1024" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/gpu-families-387x1024.jpg 387w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/gpu-families-113x300.jpg 113w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/gpu-families-768x2032.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/gpu-families-581x1536.jpg 581w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/gpu-families-774x2048.jpg 774w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/gpu-families-scaled.jpg 968w" sizes="auto, (max-width: 387px) 100vw, 387px" /><figcaption>A beautiful picture of the newborn with all the other families</figcaption></figure></div>



<h4 class="wp-block-heading">The <em>Architecture</em> Family</h4>



<p>This is the center, where the magic happens: orchestration, cache, workload scheduling&#8230; It&#8217;s the brain of the GPU.</p>



<figure class="wp-block-table"><table><tbody><tr><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15084" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.56.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.56.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.56-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15083" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.58.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.58.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.58-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15082" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.00.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.00.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.00-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td></tr></tbody></table></figure>



<h4 class="wp-block-heading">The <em>Multi-Core Units</em>&nbsp;<i>(aka CUDA Cores)&nbsp;</i>Family</h4>



<p>This represents the physical core, where the maths computations actually happen.</p>



<figure class="wp-block-table"><table><tbody><tr><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15081" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.02.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.02.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.02-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15080" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.04.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.04.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.04-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15079" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.06.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.06.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.06-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td></tr><tr><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15078" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.09.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.09.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.09-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter 
wp-image-15077" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.12.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.12.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.12-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15076" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.14.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.14.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.14-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td></tr><tr><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15075" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.16.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.16.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.16-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15074" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.26.png" alt="" width="200" height="263" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.26.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.30.26-228x300.png 228w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></td><td>&nbsp;</td></tr></tbody></table></figure>



<h4 class="wp-block-heading">The<em>&nbsp;Programming Model</em> Family</h4>



<p>The different layers of the programming model are used to abstract the GPU&#8217;s parallel computation for the programmer. They also make the code portable to any GPU architecture.</p>



<figure class="wp-block-table"><table><tbody><tr><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15089" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.46.png" alt="" width="342" height="450" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.46.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.46-228x300.png 228w" sizes="auto, (max-width: 342px) 100vw, 342px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15088" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.48.png" alt="" width="352" height="464" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.48.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.48-228x300.png 228w" sizes="auto, (max-width: 352px) 100vw, 352px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15087" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.49.png" alt="" width="346" height="456" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.49.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.49-228x300.png 228w" sizes="auto, (max-width: 346px) 100vw, 346px" /></figure></td></tr><tr><td><figure><img loading="lazy" decoding="async" class="aligncenter wp-image-15086" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.51.png" alt="" width="352" height="464" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.51.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.51-228x300.png 228w" sizes="auto, (max-width: 352px) 100vw, 352px" /></figure></td><td><figure><img loading="lazy" decoding="async" class="aligncenter 
wp-image-15085" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.53.png" alt="" width="344" height="453" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.53.png 764w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-08-at-14.29.53-228x300.png 228w" sizes="auto, (max-width: 344px) 100vw, 344px" /></figure></td></tr></tbody></table></figure>
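<p>To make the <em>Programming Model</em> cards less abstract: a CUDA kernel is launched over a grid of blocks, each block containing threads, and each thread computes &#8220;its&#8221; element from those indices. The snippet below simulates that indexing sequentially in plain Python (sizes are illustrative):</p>

```python
# CUDA-style indexing, simulated: a grid of blocks, each block holding
# threads; every "thread" handles exactly one element of the array.
block_dim = 4        # threads per block
grid_dim = 3         # blocks in the grid
a = list(range(12))  # grid_dim * block_dim elements
out = [0] * len(a)

for block_idx in range(grid_dim):        # on a GPU these all run...
    for thread_idx in range(block_dim):  # ...at the same time
        i = block_idx * block_dim + thread_idx  # global index, as in CUDA
        out[i] = a[i] * 2                       # per-thread work

print(out)  # every element doubled
```

<p>On real hardware the two loops do not exist: all twelve threads execute the body concurrently, which is exactly the massive parallelism discussed above.</p>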



<h3 class="wp-block-heading">How to play</h3>



<ol class="wp-block-list"><li>Start by choosing a card from the <em>Micro-Architecture</em> family</li><li>Look at the components, and choose the appropriate card from the <em>Architecture</em>&nbsp;family</li><li>Look at the components within the<em> Micro-Architecture</em> family and pick them from the <i>Multi-Core Units </i>family, then place them under the <em>Architecture</em>&nbsp;card</li><li>Now, if you want to know how to program a GPU, place the <i>Programming Model &#8211; Multi-Core Units</i>&nbsp;special card on top of the&nbsp;<em>Multi-Core Units&nbsp;</em>cards</li><li>Finally, on top of the <i>Programming Model &#8211; Multi-Core Units </i>special card, place all the <i>Programming Model&nbsp;</i>cards near the <em>SM</em></li><li>You should then have something that looks like this:</li></ol>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="4032" height="3024" src="/blog/wp-content/uploads/2019/03/IMG_2139.jpg" alt="" class="wp-image-15108" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2139.jpg 4032w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2139-300x225.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2139-768x576.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2139-1024x768.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2139-1200x900.jpg 1200w" sizes="auto, (max-width: 4032px) 100vw, 4032px" /></figure></div>



<h3 class="wp-block-heading">Examples of card configurations:</h3>



<h4 class="wp-block-heading">Fermi</h4>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="4032" height="3024" src="/blog/wp-content/uploads/2019/03/IMG_2148.jpg" alt="" class="wp-image-15098" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2148.jpg 4032w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2148-300x225.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2148-768x576.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2148-1024x768.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2148-1200x900.jpg 1200w" sizes="auto, (max-width: 4032px) 100vw, 4032px" /></figure></div>



<h4 class="wp-block-heading">Kepler</h4>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="4032" height="3024" src="/blog/wp-content/uploads/2019/03/IMG_2147.jpg" alt="" class="wp-image-15100" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2147.jpg 4032w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2147-300x225.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2147-768x576.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2147-1024x768.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2147-1200x900.jpg 1200w" sizes="auto, (max-width: 4032px) 100vw, 4032px" /></figure></div>



<h4 class="wp-block-heading">Maxwell</h4>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="4032" height="3024" src="/blog/wp-content/uploads/2019/03/IMG_2146.jpg" alt="" class="wp-image-15101" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2146.jpg 4032w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2146-300x225.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2146-768x576.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2146-1024x768.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2146-1200x900.jpg 1200w" sizes="auto, (max-width: 4032px) 100vw, 4032px" /></figure></div>



<h4 class="wp-block-heading">Pascal</h4>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="4032" height="3024" src="/blog/wp-content/uploads/2019/03/IMG_2143.jpg" alt="" class="wp-image-15104" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2143.jpg 4032w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2143-300x225.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2143-768x576.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2143-1024x768.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2143-1200x900.jpg 1200w" sizes="auto, (max-width: 4032px) 100vw, 4032px" /></figure></div>



<h4 class="wp-block-heading">Volta</h4>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="4032" height="3024" src="/blog/wp-content/uploads/2019/03/IMG_2140.jpg" alt="" class="wp-image-15107" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2140.jpg 4032w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2140-300x225.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2140-768x576.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2140-1024x768.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2140-1200x900.jpg 1200w" sizes="auto, (max-width: 4032px) 100vw, 4032px" /></figure></div>



<h4 class="wp-block-heading">Turing</h4>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="4032" height="3024" src="/blog/wp-content/uploads/2019/03/IMG_2145.jpg" alt="" class="wp-image-15102" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2145.jpg 4032w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2145-300x225.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2145-768x576.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2145-1024x768.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_2145-1200x900.jpg 1200w" sizes="auto, (max-width: 4032px) 100vw, 4032px" /></figure></div>



<p><br>After playing around with different&nbsp;<em>Micro-Architectures</em>, <em>Architectures</em> and <em>Multi-Core Units</em> for a bit, you should see that GPUs are just as simple as Pokémon!</p>



<p>Enjoy the attached PDF, which will allow you to print your own GPU Pokédex.&nbsp;You can download it here: <a href="https://www.ovh.com/blog/wp-content/uploads/2020/05/GPU-Cards-1.pdf" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">GPU Cards Game</a></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deep Learning explained to my 8-year-old daughter</title>
		<link>https://blog.ovhcloud.com/deep-learning-explained-to-my-8-year-old-daughter/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Fri, 15 Feb 2019 14:56:56 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Neural networks]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14481</guid>

					<description><![CDATA[Machine Learning and especially Deep Learning&#160;are hot topics and you are sure to have come across the buzzword &#8220;Artificial Intelligence&#8221; in the media. Yet these are not new concepts. The first Artificial Neural Network (ANN) was introduced in the 40s. So why all the recent interest around neural networks&#160;and Deep Learning?&#160; We will explore this [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><strong>Machine Learning</strong> and especially <strong><a href="https://www.kdnuggets.com/2016/01/seven-steps-deep-learning.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deep Learning</a></strong>&nbsp;are hot topics and you are sure to have come across the buzzword &#8220;Artificial Intelligence&#8221; in the media.</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="885" height="508" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/IMG_0057.jpg" alt="Deep Learning: A new hype" class="wp-image-14620" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0057.jpg 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0057-300x172.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0057-768x441.jpg 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>Yet these are not new concepts. The first <strong>Artificial Neural Network</strong> (ANN) was introduced in the 40s. So why all the recent interest around neural networks&nbsp;and Deep Learning?<strong>&nbsp;</strong></p>



<p>We will explore this and other concepts in a series of blog posts on&nbsp;<strong>GPUs and Machine Learning</strong>.</p>



<h2 class="wp-block-heading"><strong>YABAIR &#8211; Yet Another Blog About Image Recognition</strong></h2>



<p>In the 80s, I remember my father building character recognition for bank checks. He used primitives and derivatives based on pixel darkness levels. Examining so many different types of handwriting was a real pain, because he needed a single equation that applied to all the variations.</p>



<p>In the last few years, it has become clear that the best way to deal with this type of problem is through Convolutional Neural Networks. Equations designed by humans can no longer handle the infinite variety of handwriting patterns.</p>



<p>Let&#8217;s take a look at one of the most classic examples: building a number recognition system, a neural network to recognise handwritten digits.</p>



<h3 class="wp-block-heading">Fact 1: It&#8217;s as simple as counting</h3>



<p>We&#8217;ll start by counting how many times the small red shapes in the top row can be seen in each of the black, hand-written digits (in the left-hand column).</p>



<div class="wp-block-image wp-image-14651"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/IMG_0067.jpg" alt="Simplified matrix for handwritten numbers" class="wp-image-14651" width="337" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0067.jpg 674w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0067-253x300.jpg 253w" sizes="auto, (max-width: 337px) 100vw, 337px" /><figcaption>Simplified matrix for handwritten numbers</figcaption></figure></div>



<p>Now let&#8217;s try to recognise (infer) a new hand-written digit, by counting the number of matches with the same red shapes. We&#8217;ll then compare this to our previous table, in order to identify which number has the most correspondences:</p>



<div class="wp-block-image size-medium wp-image-14652"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/IMG_0069.jpg" alt="Matching shapes for handwritten numbers " class="wp-image-14652" width="443" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0069.jpg 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0069-300x271.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0069-768x693.jpg 768w" sizes="auto, (max-width: 443px) 100vw, 443px" /><figcaption>Matching shapes for handwritten numbers</figcaption></figure></div>



<p>Congratulations! You&#8217;ve just built the world&#8217;s simplest neural network system for recognising hand-written digits.</p>
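<p>As a sketch, this counting game fits in a few lines of Python. The shape counts below are invented for illustration (they are not the exact table from the figures above), but the logic is the same: count correspondences, then pick the digit with the most.</p>

```python
# Hypothetical table: for each digit, how many times each of three
# reference shapes (say: a loop, a vertical bar, a horizontal bar)
# appears. The numbers are invented for illustration.
REFERENCE_COUNTS = {
    0: (1, 0, 0),  # one loop
    1: (0, 1, 0),  # one vertical bar
    7: (0, 1, 1),  # a vertical bar plus a horizontal bar
    8: (2, 0, 0),  # two stacked loops
}

def recognise(shape_counts):
    """Return the digit whose reference counts best match the input:
    the score is the number of shape categories that agree."""
    def score(digit):
        return sum(1 for a, b in zip(REFERENCE_COUNTS[digit], shape_counts)
                   if a == b)
    return max(REFERENCE_COUNTS, key=score)

print(recognise((2, 0, 0)))  # two loops seen -> 8
```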



<h3 class="wp-block-heading">Fact 2: An image is just a matrix</h3>



<p class="graf graf--p">A computer views an image as a&nbsp;<strong>matrix</strong>. A black and white image is a 2D matrix.</p>



<p>Let&#8217;s consider an image. To keep it simple, let&#8217;s take a small, square, black and white image of an 8, 28 pixels on each side.</p>



<p>Every cell of the matrix represents the intensity of the pixel, from 0 (a black pixel) to 255 (a pure white pixel).</p>



<p>The image will therefore be represented as the following 28 x 28 pixel matrix.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="723" height="504" src="https://www.ovh.com/blog/wp-content/uploads/2020/06/1_cLsTCWtUL1GYBUv8vnbOxw.jpeg" alt="Image of a handwritten 8 and the associated intensity matrix" class="wp-image-18492" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/06/1_cLsTCWtUL1GYBUv8vnbOxw.jpeg 723w, https://blog.ovhcloud.com/wp-content/uploads/2020/06/1_cLsTCWtUL1GYBUv8vnbOxw-300x209.jpeg 300w" sizes="auto, (max-width: 723px) 100vw, 723px" /><figcaption>Image of a handwritten 8 and the associated intensity matrix</figcaption></figure></div>
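<p>In Python, such a matrix is just a list of rows. Here is a toy 4 x 4 crop rather than the full 28 x 28 (the values are made up):</p>

```python
# A toy 4x4 grayscale crop: each cell is a pixel intensity,
# 0 = black, 255 = pure white (values invented for illustration).
image = [
    [  0,  50, 200, 255],
    [ 30, 120, 180, 255],
    [ 10,  80, 220, 255],
    [  0,  40, 190, 255],
]

height, width = len(image), len(image[0])
print(height, width)  # -> 4 4

# Every pixel is within the 8-bit intensity range.
assert all(0 <= px <= 255 for row in image for px in row)
```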



<h3 class="wp-block-heading">Fact 3: Convolutional layers are just bat-signals</h3>



<p class="graf graf--p">To work out which pattern is displayed in a picture (in this case the handwritten 8) we will use a kind of bat-signal/flashlight. In machine learning, the flashlight is called a filter. The filter is used to perform a classic convolution matrix calculation, as found in common image-processing software such as&nbsp;<a href="https://docs.gimp.org/2.8/en/plug-in-convmatrix.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gimp</a>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://media2.giphy.com/media/l0NwGpoOVLTAyUJSo/giphy.gif" alt="Batman bat-signal lighting up the sky"/></figure></div>



<p>The filter will <strong>scan the picture</strong>&nbsp;in order to <strong>find the pattern</strong> in the image, and will trigger <strong>positive feedback</strong> if a match is found. It works a bit like a toddler&#8217;s shape-sorting box: the triangle filter matches the triangle hole, the square filter matches the square hole, and so on.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://multimedia.bbycastatic.ca/multimedia/products/500x500/103/10319/10319838.jpg" alt="Image filters work like a child's shape-sorting box"/><figcaption>Image filters work like a child&#8217;s shape-sorting box</figcaption></figure></div>



<h3 class="wp-block-heading">Fact 4: Filter matching is an embarrassingly&nbsp;parallel task</h3>



<p class="graf graf--p">To be more scientific the image filtering process looks a bit like the animation below. As you can see, <strong>every step</strong> of the filter scanning is <strong>independent</strong>, which means that this task can be <strong>highly parallelised</strong>.</p>



<p>It&#8217;s important to note that <strong>tens of filters</strong>&nbsp;will be applied at the same time,&nbsp;<strong>in parallel</strong>, as none of them depend on each other.</p>



<div class="wp-block-image"><figure class="aligncenter"><a href="https://cdn-images-1.medium.com/max/800/0*rKUDc--RZg1v66wq" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img decoding="async" src="https://cdn-images-1.medium.com/max/800/0*rKUDc--RZg1v66wq" alt="Convolution Filter over an input image"/></a><figcaption>https://github.com/vdumoulin</figcaption></figure></div>



<h3 class="wp-block-heading">Fact 5: Just repeat the filtering operation (matrix convolution) as many times as possible</h3>



<p>We just saw that the input image/matrix is filtered using multiple matrix convolutions.</p>



<p>To improve the accuracy of the image recognition just take the filtered image from the previous operation and filter again and again and again&#8230;</p>



<p>Of course, we are oversimplifying things somewhat, but generally the more filters you apply, and the more you repeat this operation in sequence, the more precise your results will be.</p>



<p>It&#8217;s like creating new abstraction layers to get a clearer and clearer description of the object, going from primitive filters to filters that respond to edges, wheels, squares, cubes, and so on.</p>



<h3 class="wp-block-heading">Fact 6: Matrix convolutions are just <em>x</em>&nbsp;and <em>+</em></h3>



<p>An image is worth a thousand words: the following picture is a simplistic view of a source image (8×8) filtered with a convolution filter (3×3). The projection of the torch light (in this example a Sobel Gx Filter) provides one value.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://i.stack.imgur.com/YDusp.png" alt=""/><figcaption>Example of a convolution filter (Sobel Gx) applied to an input matrix (Source : https://datascience.stackexchange.com/questions/23183/why-convolutions-always-use-odd-numbers-as-filter-size/23186)</figcaption></figure></div>



<p>This is where the magic happens: these simple matrix operations are highly parallelisable, which fits perfectly with the use case for a General-Purpose Graphics Processing Unit.</p>
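<p>To make the multiply-and-add concrete, here is a minimal plain-Python sketch of that filtering step, using the Sobel Gx kernel from the figure and a small image with a vertical edge (note that, like deep learning frameworks, it slides the kernel without flipping it):</p>

```python
# Sobel Gx filter, as in the figure above.
SOBEL_GX = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

def convolve(image, kernel):
    """Valid (no padding) 2D convolution: slide the kernel over the
    image; each output cell is a sum of element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A vertical edge (dark left half, bright right half) lights up under Gx.
img = [[0, 0, 255, 255]] * 4
print(convolve(img, SOBEL_GX))  # -> [[1020, 1020], [1020, 1020]]
```

<p>Every output cell is computed from its own window, independently of all the others, which is exactly why the operation parallelises so well.</p>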



<h3 class="wp-block-heading">Fact 7: Need to simplify and summarise what&#8217;s been detected? Just use max()</h3>



<p class="graf graf--figure">We need to <strong>summarise</strong>&nbsp;what&#8217;s been detected by the filters in order to <strong>generalise the knowledge</strong>.</p>



<p class="graf graf--figure">To do so, we will sample the output of the previous filtering operation.</p>



<p class="graf graf--figure">This operation is called&nbsp;<strong>pooling</strong>&nbsp;or <strong>downsampling</strong>; in essence, it is about reducing the size of the matrix.</p>



<p class="graf graf--figure">You can use any reduction operation, such as max, min, average, count, median or sum.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://qph.fs.quoracdn.net/main-qimg-8afedfb2f82f279781bfefa269bc6a90.webp" alt=""/><figcaption>Example of a max pooling layer (Source : Stanford&#8217;s CS231n)</figcaption></figure></div>
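<p>A minimal sketch of max pooling in plain Python: split the matrix into non-overlapping 2 x 2 blocks and keep only the largest value of each block (the input values here are just an example):</p>

```python
def max_pool(matrix, size=2):
    """Downsample: keep the max of each non-overlapping size x size block."""
    return [
        [
            max(matrix[i + di][j + dj]
                for di in range(size)
                for dj in range(size))
            for j in range(0, len(matrix[0]), size)
        ]
        for i in range(0, len(matrix), size)
    ]

feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool(feature_map))  # -> [[6, 8], [3, 4]]
```

<p>Swapping <code>max</code> for <code>min</code>, <code>sum</code> or an average gives the other reduction variants.</p>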



<h3 class="wp-block-heading">Fact 8: Flatten everything to get on your feet</h3>



<p>Let&#8217;s not forget the main purpose of the neural network we are working on: building an image recognition system, also called <strong>image classification</strong>.</p>



<p>If the purpose of the neural network is to detect hand-written digits, there will be 10 classes at the end to map the input image to: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]</p>



<p>To map this input to a class after passing through all those filters and downsampling layers, we will have just 10 neurons (each representing one class), and each will connect to the last subsampled layer.</p>



<p>Below is an overview of the original LeNet-5 Convolutional Neural Network, designed by <a href="https://en.wikipedia.org/wiki/Yann_LeCun" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Yann LeCun</a>, one of the early adopters of this technology for image recognition.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="713" height="213" src="https://www.ovh.com/blog/wp-content/uploads/2020/06/Architecture-of-CNN-by-LeCun-et-al-LeNet5.png" alt="" class="wp-image-18491" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/06/Architecture-of-CNN-by-LeCun-et-al-LeNet5.png 713w, https://blog.ovhcloud.com/wp-content/uploads/2020/06/Architecture-of-CNN-by-LeCun-et-al-LeNet5-300x90.png 300w" sizes="auto, (max-width: 713px) 100vw, 713px" /><figcaption>LeNet-5 architecture published in the original paper (source : http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf).</figcaption></figure></div>
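<p>The flattening step and the final 10 class neurons can be sketched like this. The weights below are random placeholders, just to show the shape of the computation; in a trained network like LeNet-5 they would have been learned:</p>

```python
import random

random.seed(0)  # fixed seed so the placeholder weights are reproducible

def flatten(matrix):
    """Turn a 2D feature map into a flat list of values."""
    return [value for row in matrix for value in row]

# Four values left after the filtering and pooling stages (made up).
features = flatten([[0.1, 0.9], [0.4, 0.7]])

# One weight vector per class 0..9. These are random placeholders;
# a trained network would have learned them via backpropagation.
weights = [[random.uniform(-1, 1) for _ in features] for _ in range(10)]

# Each class neuron's score is just a weighted sum of the features.
scores = [sum(w * f for w, f in zip(ws, features)) for ws in weights]
predicted_digit = scores.index(max(scores))
print(predicted_digit)
```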



<h3 class="graf graf--figure wp-block-heading"><b>Fact 9: Deep Learning is just LEAN &#8211; continuous&nbsp;improvement based on a feedback loop</b></h3>



<p class="graf graf--figure">The beauty of the technology does not come from the convolution alone, but from the network&#8217;s capacity to learn and adapt by itself. By implementing a feedback loop called&nbsp;<em><strong><a href="https://en.wikipedia.org/wiki/Backpropagation" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">backpropagation</a></strong></em>, the network will amplify or inhibit some &#8220;neurons&#8221; in the different layers using&nbsp;<span style="text-decoration: underline;"><em><strong><a href="https://www.quora.com/What-does-weight-mean-in-terms-of-neural-networks" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">weights</a></strong></em></span>.</p>



<p class="graf graf--figure">Let&#8217;s KISS (keep it simple): we look at the output of the network, if the guess (the output 0,1,2,3,4,5,6,7,8 or 9) is wrong, we look at which filter(s) &#8220;made a mistake&#8221;, we give this filter or filters a small weight so they will not make the same mistake next time. And voila! The system learns and keeps improving itself.</p>



<h3 class="wp-block-heading"><b>Fact 10: It all amounts to the fact that Deep Learning is embarrassingly&nbsp;parallel</b></h3>



<p>Ingesting thousands of images, running tens of filters, applying downsampling, flattening the output&#8230; all of these steps can be done in parallel, which makes the system <a href="https://en.wikipedia.org/wiki/Embarrassingly_parallel" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">embarrassingly parallel</a>. &#8220;Embarrassingly&#8221; here really means <strong><em>perfectly parallel</em></strong>, and it is an ideal use case for <em><strong>GPGPUs (General-Purpose Graphics Processing Units)</strong></em>, which&nbsp;are built for massively parallel computing.</p>



<h3 class="wp-block-heading"><strong>Fact 11: Need more precision? Just go deeper</strong></h3>



<p>Of course, this is a bit of an oversimplification, but if we look at the main &#8220;image recognition competition&#8221;, known as the <a href="https://en.wikipedia.org/wiki/ImageNet#ImageNet_Challenge" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ImageNet challenge</a>, we can see that the error rate has decreased as networks have got deeper. It is generally acknowledged that, among other factors, the depth of the network leads to a better capacity for generalisation and precision.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://cdn-images-1.medium.com/max/800/1*DBXf6dzNB78QPHGDofHA4Q.png" alt=""/><figcaption>Imagenet competition winner error rates VS number of layers in the network (source : https://medium.com/@sidereal/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5)</figcaption></figure></div>



<h3 class="wp-block-heading"><strong>In conclusion&nbsp;&nbsp;</strong></h3>



<p>We have taken a brief look at the concept of Deep Learning as applied to image recognition. It&#8217;s worth noting that almost every new architecture for image recognition (medical, satellite, autonomous driving, &#8230;) uses these same principles, with a different number of layers, different types of filters, different initialisation points, different matrix sizes and different tricks (such as image augmentation, dropout, weight compression, &#8230;). The concepts remain the same:</p>



<div class="wp-block-image wp-image-14654"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="885" height="469" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/IMG_0070.jpg" alt="Number detection process" class="wp-image-14654" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0070.jpg 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0070-300x159.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0070-768x407.jpg 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption>Number detection process</figcaption></figure></div>



<p>In other words, we saw that the training and inference of deep learning models come down to lots and lots of basic matrix operations that can be done in parallel, and this is exactly what our good old graphics processors (GPUs) are made for.</p>



<p>In the next post, we will look at exactly how a GPU works and how deep learning is implemented on it.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
