Distributed Training in a Deep Learning Context


Previously on OVHcloud Blog …

In previous blog posts we have discussed a high-level approach to Deep Learning, as well as what is meant by ‘training’ in relation to Deep Learning.

Following that article, I received lots of questions in my Twitter inbox, especially regarding how GPUs actually work.

Don’t worry, it’s a friend, and he is OK with me sharing the DM 😉

I decided, therefore, to write an article on how GPUs work:

During our R&D process around hardware and AI models, the question of distributed training came up (quickly). But before looking in-depth at distributed training, I invite you to read the following article to understand how Deep Learning training actually works:

What does training neural networks mean?

As previously discussed, Neural Network training depends on:

  • Input Data
  • Neural Network architecture composed of ‘Layers’
  • Weights
  • Learning Rate (the step size used to adjust the neural network weights)
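
To make these ingredients concrete, here is a minimal, single-GPU training step sketched in PyTorch. The layer sizes, batch size, loss function and learning rate below are illustrative assumptions, not values from this article:

```python
import torch
import torch.nn as nn

# Illustrative single-GPU example: input data, a small layered network,
# its weights, and a learning rate used to adjust those weights.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 784)          # a batch of input data
targets = torch.randint(0, 10, (32,))  # matching labels

outputs = model(inputs)                # forward pass through the layers
loss = loss_fn(outputs, targets)       # how wrong the current weights are
loss.backward()                        # compute the corrections (gradients)
optimizer.step()                       # adjust the weights by the learning rate
optimizer.zero_grad()
```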

Why do we need Distributed Learning?

Deep Learning is mainly used to learn patterns in unstructured data. Unstructured data, such as text corpora, images, videos or sound, can represent a huge amount of data to train on.

Training on such a dataset can take days or even weeks because of the size of the data and/or the size of the network.

Multiple distributed learning approaches can be considered.

The different Distributed Learning approaches

There are two main categories for distributed training when it comes to Deep Learning and both of them are based on the divide and conquer paradigm.

The first category is named “Distributed Data Parallelism”, where the data is split across multiple GPUs.

The second category is called “Model Parallelism”, where the deep learning model is split across multiple GPUs.

However, Distributed Data Parallelism is the most common approach, as it fits almost any problem. The second approach has some serious technical limitations related to model splitting. Splitting a model is a highly technical exercise, as you need to know how much GPU DRAM each part of the network occupies. Once you have the DRAM usage per slice, you need to enforce that split by hard-coding the placement of the neural network layers onto the desired GPUs. This makes the approach hardware-dependent, as DRAM may vary from one GPU to another, whereas Distributed Data Parallelism only requires data size adjustments (usually the batch size), which is relatively simple.
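
To illustrate what that hard-coded placement looks like, here is a hypothetical PyTorch sketch that pins two slices of a small network onto two assumed devices, “cuda:0” and “cuda:1”; in a real setup each slice would be sized against the DRAM of its target GPU:

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Model-parallel sketch: each half of the network is pinned to a
    specific GPU, so each device must hold its own slice in DRAM."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(784, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))     # first slice runs on GPU 0
        return self.part2(x.to("cuda:1"))  # activations are moved to GPU 1

model = SplitModel()
out = model(torch.randn(32, 784))  # output tensor lives on cuda:1
```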

Distributed Data Parallelism has two variants, each with its own advantages and disadvantages. The first variant trains the model with synchronous weight adjustment: each training batch on each GPU returns the corrections that need to be made to the model, and every worker has to wait until all the other workers have finished their task before receiving the new set of weights it will use for its next training batch.
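
As a sketch of this synchronous variant, here is how it might look with PyTorch’s DistributedDataParallel wrapper, which averages gradients across all workers before any weight update. The launch method (torchrun, one process per GPU) and the NCCL backend are assumptions for the example, not prescriptions from this article:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Synchronous data parallelism sketch: every process (one per GPU) trains on
# its own shard of the data, and gradients are averaged across all workers
# before any of them updates its weights.
dist.init_process_group(backend="nccl")      # assumes launch via torchrun
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).cuda()
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 784).cuda()         # this worker's shard of the batch
targets = torch.randint(0, 10, (32,)).cuda()

loss = loss_fn(model(inputs), targets)
loss.backward()     # DDP all-reduces gradients here: every worker waits
optimizer.step()    # all workers apply the same averaged update
```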

The second variant, by contrast, lets you work asynchronously: each batch on each GPU reports the corrections that need to be made to the neural network, and the weights coordinator sends back a new set of weights without waiting for the other GPUs to finish training their own batches.
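
Here is a deliberately simplified, framework-free sketch of that asynchronous scheme: a weights coordinator applies each worker’s corrections as they arrive and hands back fresh weights immediately, without waiting for the other workers. The gradients are faked with random numbers purely for illustration:

```python
import threading
import numpy as np

# Conceptual (not production) sketch of asynchronous data parallelism.
class WeightCoordinator:
    def __init__(self, n_weights, lr=0.01):
        self.weights = np.zeros(n_weights)
        self.lr = lr
        self.lock = threading.Lock()

    def push_and_pull(self, gradients):
        with self.lock:
            self.weights -= self.lr * gradients  # apply this worker's corrections
            return self.weights.copy()           # hand back fresh weights at once

def worker(coordinator, steps=5):
    for _ in range(steps):
        fake_gradients = np.random.randn(10)     # stand-in for a real backward pass
        weights = coordinator.push_and_pull(fake_gradients)

coordinator = WeightCoordinator(n_weights=10)
threads = [threading.Thread(target=worker, args=(coordinator,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```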

3 cheat sheets to better understand Distributed Deep Learning

In these cheat sheets, let’s assume you’re using Docker with a volume attached.

Now you need to choose your Distributed Training strategy (wisely)

Further Readings

While we have covered a lot in this blog post, we have by no means covered all the aspects of distributed Deep Learning training, including prior work, history and the associated mathematics.

I highly suggest that you read the great paper Parallel and Distributed Deep Learning by Vishakh Hegde and Sheema Usmani (both from Stanford University),

as well as the article Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, written by Tal Ben-Nun and Torsten Hoefler (ETH Zurich, Switzerland). I suggest that you start by jumping to Section 6.3.
