<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Neural networks Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/neural-networks/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/neural-networks/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Wed, 10 Jun 2020 12:42:23 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Neural networks Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/neural-networks/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Distributed Training in a Deep Learning Context</title>
		<link>https://blog.ovhcloud.com/distributed-training-in-a-deep-learning-context/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Tue, 05 May 2020 10:14:07 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Neural networks]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17871</guid>

					<description><![CDATA[Previously on OVHcloud Blog &#8230; In previous blog posts we have discussed a high level approach to deep learning as well as what is meant by &#8216;training&#8217; in relation to Deep Learning. Following the article, I had lots of questions entering my twitter inbox, especially regarding how GPUs actually works. I decided, therefore, to write [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdistributed-training-in-a-deep-learning-context%2F&amp;action_name=Distributed%20Training%20in%20a%20Deep%20Learning%20Context&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png" alt="Distributed Learning in a Deep Learning context" class="wp-image-18106" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20C35ECE-4738-4967-951E-6BC863342D5D.png 1200w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Previously on OVHcloud Blog &#8230;</h3>



<p>In previous blog posts we have discussed a <a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude">high level approach to deep learning</a> as well as what is meant by &#8216;training&#8217; in relation to Deep Learning.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" data-wpel-link="exclude"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png" alt="" class="wp-image-18099" width="374" height="273" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028.png 748w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/BC0E1AC1-6593-4395-9844-A7D2CB457028-300x219.png 300w" sizes="(max-width: 374px) 100vw, 374px" /></a></figure></div>



<p>Following that article, I received lots of questions in my Twitter inbox, especially regarding how GPUs actually work.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" width="410" height="157" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/image.png" alt="" class="wp-image-17882" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/image.png 410w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/image-300x115.png 300w" sizes="(max-width: 410px) 100vw, 410px" /><figcaption>Don&#8217;t worry it&#8217;s a friend, he is ok with me sharing the DM 😉</figcaption></figure></div>



<p>I decided, therefore, to write an article on how GPUs work:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-using-pokemon/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png" alt="" class="wp-image-18103" width="334" height="254" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED.png 668w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/EEAD0A02-DFCA-4745-802B-E36BC517EFED-300x228.png 300w" sizes="auto, (max-width: 334px) 100vw, 334px" /></a></figure></div>



<p>During our R&amp;D process around hardware and AI models, the question of distributed training came up (quickly). But before looking in-depth at distributed training, I invite you to read the following article to understand how Deep Learning training actually works:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><a href="https://www.ovh.com/blog/what-does-training-neural-networks-mean/" data-wpel-link="exclude"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png" alt="What does training neural networks mean?" class="wp-image-17932" width="476" height="249" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1.png 1200w" sizes="auto, (max-width: 476px) 100vw, 476px" /></a></figure></div>



<p>As previously discussed, Neural Network training depends on:</p>



<ul class="wp-block-list"><li>Input Data</li><li>Neural Network architecture composed of &#8216;Layers&#8217;</li><li>Weights</li><li>Learning Rate (step used to adjust neural network weights)</li></ul>
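<p>As a toy illustration (plain Python, not tied to any framework; all names here are made up for the example), those ingredients boil down to a loop that nudges a weight by a learning-rate-sized step:</p>

```python
# Toy illustration of the ingredients: input data, a minimal "network"
# (a single weight acting as one layer), and a learning rate.
def train_step(w, x, target, learning_rate):
    prediction = w * x                   # forward pass through the one-weight network
    error = prediction - target          # how far off we are
    gradient = 2 * error * x             # derivative of the squared error w.r.t. w
    return w - learning_rate * gradient  # adjust the weight by a learning-rate step

w = 0.0  # initial weight
for _ in range(100):
    w = train_step(w, x=2.0, target=6.0, learning_rate=0.05)
print(round(w, 3))  # the weight converges to 3.0, since 3.0 * 2.0 == 6.0
```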



<h2 class="wp-block-heading">Why do we need distributed learning?</h2>



<p>Deep Learning is mainly used for learning patterns in unstructured data. <strong>Unstructured data &#8211; such as text corpora, images, video or sound &#8211; can represent a huge amount of data to train on.</strong></p>



<p>Training on such a corpus can take days or even weeks because of the size of the data and/or the size of the network.</p>



<p>Multiple distributed learning approaches can be considered.</p>



<h2 class="wp-block-heading">The different Distributed Learning approaches</h2>



<p>There are two main categories for distributed training when it comes to Deep Learning and both of them are based on the <strong><a rel="noreferrer noopener nofollow external" href="https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm" target="_blank" data-wpel-link="external">divide and conquer paradigm.</a></strong></p>



<p>The first category is named <strong>&#8220;Distributed Data Parallelism&#8221;</strong>, where the <strong>data is split across multiple GPUs</strong>.</p>



<p>The second category is called <strong>&#8220;Model Parallelism&#8221;</strong>, where the deep learning <strong>model is split across multiple GPUs</strong>.</p>



<p>However, <strong>Distributed Data Parallelism</strong> is the most common approach, as it <strong>fits almost any problem</strong>. The second approach has some serious technical limitations related to model splitting. Splitting a model is a highly technical exercise, as you need to know the space used by each part of the network in the <strong>DRAM</strong> of the GPU. Once you have the <strong>DRAM usage per slice</strong>, you need to enforce the computation by <strong>hard-coding the placement of Neural Network Layers onto the desired GPUs</strong>. <strong>This approach makes the training hardware-dependent</strong>, as the DRAM may vary from one GPU to another, while <strong>Distributed Data Parallelism</strong> just requires <strong>data size adjustments (usually the batch size), which is relatively simple</strong>.</p>
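<p>To make the more common approach concrete, here is a schematic of Distributed Data Parallelism in plain Python, with the "GPUs" reduced to plain functions and the model reduced to a single weight (an illustrative sketch, not a real multi-GPU implementation):</p>

```python
# Schematic of Distributed Data Parallelism: every "GPU" (worker) holds a full
# copy of the model (here a single weight) and sees only a shard of each batch.
def shard_gradient(w, shard):
    # mean gradient of the squared error (w*x - y)^2 over the shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
n_workers = 2
shards = [batch[i::n_workers] for i in range(n_workers)]  # the data is split

w = 0.0
for _ in range(200):
    grads = [shard_gradient(w, s) for s in shards]  # each worker computes locally
    w -= 0.05 * sum(grads) / n_workers              # average the corrections, step
print(round(w, 3))  # recovers the slope 2.0, as full-batch training would
```

<p>With equal-sized shards, averaging the per-shard gradients is exactly the full-batch gradient, which is why this approach fits almost any problem without touching the model itself.</p>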



<p>The <strong>Distributed Data Parallelism</strong> model has two variants, each of which has its advantages and disadvantages. The first variant trains the model with <strong>synchronous weight adjustment</strong>. That is to say, <strong>each training batch on each GPU returns the corrections</strong> that need to be made to the model in order for it to be trained, and the system <strong>has to wait until all the workers have finished their task before producing a new set of weights</strong> to be used for the next training batch.</p>



<p>The second variant lets you work in an <strong>asynchronous way</strong>. This means each batch on each GPU reports the corrections that need to be made to the neural network, and the <strong>weights coordinator</strong> sends a <strong>new set of weights</strong> <strong>without waiting for the other GPUs to finish training their own</strong> batch.</p>
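<p>The two variants can be contrasted with a toy simulation (plain Python again, with the workers and the weights coordinator simulated in-process; all names are illustrative):</p>

```python
import random
random.seed(0)

def grad(w):          # gradient of the toy loss (w - 5)^2
    return 2 * (w - 5)

# Synchronous variant: the weights coordinator waits for all 4 workers,
# averages their corrections, then publishes one new set of weights per round.
w_sync = 0.0
for _ in range(50):
    grads = [grad(w_sync) for _ in range(4)]  # every worker sees the same weights
    w_sync -= 0.1 * sum(grads) / len(grads)

# Asynchronous variant: each worker pushes its correction as soon as it is
# done, possibly computed from a slightly stale copy of the weights.
w_async, stale_copy = 0.0, 0.0
for _ in range(50):
    for _ in range(4):
        w_async -= 0.1 * grad(stale_copy)     # applied without waiting for others
        if random.random() < 0.5:
            stale_copy = w_async              # this worker refreshes its copy
print(round(w_sync, 2))  # converges to 5.0
print(round(w_async, 2)) # also approaches 5, despite the staleness
```

<p>The trade-off shown here is the real one: the synchronous variant wastes time waiting for the slowest worker, while the asynchronous variant applies corrections computed from stale weights.</p>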



<h2 class="wp-block-heading">3 cheat sheets to better understand Distributed Deep Learning</h2>



<p>In these cheat sheets, let&#8217;s assume you&#8217;re using Docker with a volume attached.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="942" height="1024" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/type1-942x1024.png" alt="" class="wp-image-18048" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-942x1024.png 942w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-276x300.png 276w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1-768x835.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type1.png 1004w" sizes="auto, (max-width: 942px) 100vw, 942px" /></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="904" height="1024" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/type2-904x1024.png" alt="" class="wp-image-18049" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-904x1024.png 904w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-265x300.png 265w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2-768x870.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/type2.png 945w" sizes="auto, (max-width: 904px) 100vw, 904px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1543" height="2182" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/distrib-training1.jpeg" alt="" class="wp-image-18036" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1.jpeg 1543w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-212x300.jpeg 212w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-724x1024.jpeg 724w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-768x1086.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-1086x1536.jpeg 1086w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/distrib-training1-1448x2048.jpeg 1448w" sizes="auto, (max-width: 1543px) 100vw, 1543px" /></figure>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78.png" alt="" class="wp-image-18096" width="320" height="322" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78.png 640w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/63EDA175-2E61-4AC2-9157-97C18A973B78-150x150.png 150w" sizes="auto, (max-width: 320px) 100vw, 320px" /><figcaption>Now you need to choose your Distributed Training strategy (wisely)</figcaption></figure></div>






<h2 class="wp-block-heading">Further Readings</h2>



<p>While we have covered a lot in this blog post, we have not covered nearly all the aspects of distributed Deep Learning training &#8211; including prior work, history and the associated mathematics.</p>



<p>I highly suggest that you read the great paper <em><a href="https://stanford.edu/~rezab/classes/cme323/S16/projects_reports/hedge_usmani.pdf" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Parallel and Distributed Deep Learning</a></em> by <strong>Vishakh Hegde</strong> and <strong>Sheema</strong> <strong>Usmani</strong> (both from Stanford University).</p>



<p>As well as the article <em><a href="https://arxiv.org/pdf/1802.09941.pdf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis</a></em> written by <strong>Tal Ben-Nun</strong> and <strong>Torsten Hoefler</strong> of ETH Zurich, Switzerland. I suggest that you start by jumping to <strong>section 6.3</strong>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdistributed-training-in-a-deep-learning-context%2F&amp;action_name=Distributed%20Training%20in%20a%20Deep%20Learning%20Context&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What does Training Neural Networks mean?</title>
		<link>https://blog.ovhcloud.com/what-does-training-neural-networks-mean/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Wed, 22 Apr 2020 16:37:25 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Neural networks]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17859</guid>

					<description><![CDATA[In a previous blog post we discussed general concepts surrounding Deep Learning. In this blog post, we will go deeper into the basic concepts of training a (deep) Neural Network. Where does &#8220;Neural&#8221; comes from ? As you should know, a biological neuron is composed of multiple dendrites, a nucleus and a axon (if only [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwhat-does-training-neural-networks-mean%2F&amp;action_name=What%20does%20Training%20Neural%20Networks%20mean%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>In a previous<a rel="noreferrer noopener" href="https://www.ovh.com/blog/deep-learning-explained-to-my-8-year-old-daughter/" target="_blank" data-wpel-link="exclude"> blog post</a> we discussed general concepts surrounding Deep Learning. In this blog post, we will go deeper into the basic concepts of training a (deep) Neural Network.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png" alt="" class="wp-image-17932" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/81921ABA-7642-4CA2-87BF-9B2D92278BF1.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Where does &#8220;Neural&#8221; come from?</h2>



<p>As you should know, a <strong>biological neuron</strong> is composed of multiple <strong>dendrites</strong>, a <strong>nucleus</strong> and an <strong>axon</strong> (if only you had paid attention in your biology classes). When a stimulus is sent to the brain, it is received through the <strong>synapses</strong> located at the extremities of the dendrites.</p>



<p>When a <strong>stimulus</strong> arrives at the brain, it is transmitted to the neuron via the <strong>synaptic receptors</strong>, which <strong>adjust the strength of the signal sent to the nucleus</strong>. The message is <strong>transported</strong> by the <strong>dendrites</strong> to the <strong>nucleus</strong>, to then be <strong>processed</strong> in <strong>combination</strong> with other signals emanating from receptors on the other dendrites. <strong>Thus the combination of all these signals takes place in the nucleus.</strong> After processing all these signals, <strong>the nucleus emits an output signal through its single axon</strong>. The axon then streams this signal to several other downstream neurons via its <strong>axon terminations</strong>. Thus a neuron&#8217;s analysis is pushed to the subsequent layers of neurons. When you are confronted with the complexity and efficiency of this system, you can only imagine the millennia of biological evolution that brought us here.</p>



<p>On the other hand, <strong>artificial neural networks</strong> are built on the principle of bio-mimicry. <strong>External stimuli (the data)</strong>, whose signal strength is adjusted by the <strong>neuronal weights</strong> (remember the <strong>synapse</strong>?), <strong>circulate to the neuron</strong> (the place where the mathematical calculation happens) via the dendrites. The result of the calculation &#8211; called the <strong>output</strong> &#8211; is then re-transmitted (via the axon) to several other neurons on subsequent layers, where it is combined with other signals, and so on.</p>



<p>Therefore, there is a clear parallel between biological neurons and artificial neural networks, as presented in the figure below.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/9A5100D7-A350-46FA-B1EB-190CFE0E9AF6-699x1024.png" alt="" class="wp-image-17933" width="350" height="512" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/9A5100D7-A350-46FA-B1EB-190CFE0E9AF6-699x1024.png 699w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/9A5100D7-A350-46FA-B1EB-190CFE0E9AF6-205x300.png 205w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/9A5100D7-A350-46FA-B1EB-190CFE0E9AF6.png 747w" sizes="auto, (max-width: 350px) 100vw, 350px" /><figcaption>Based on https://medium.com/swlh/learning-paradigms-in-neural-networks-30854975aa8d</figcaption></figure></div>



<h2 class="wp-block-heading">The Artificial Neural Network Recipe</h2>



<p>To build a good Artificial Neural Network (ANN) you will need the following ingredients:</p>



<h4 class="wp-block-heading">Ingredients:</h4>



<ul class="wp-block-list"><li><strong>Artificial Neurons</strong> (processing nodes) composed of:<ul><li>(many) <strong>input</strong> neuron connection(s) (dendrites)</li><li>a <strong>computation unit</strong> (nucleus) composed of:<ul><li>a <strong>linear function</strong> (ax+b)</li><li>an <strong>activation function</strong> (equivalent to the <strong>synapse</strong>)</li></ul></li><li>an <strong>output</strong> (axon)</li></ul></li></ul>
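<p>As a sketch of how these ingredients fit together (plain Python, with a sigmoid standing in for the activation function; the numbers are arbitrary):</p>

```python
import math

def neuron(inputs, weights, bias):
    # computation unit (nucleus): a linear function a.x + b over the inputs
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # activation function (the artificial "synapse"), here a sigmoid
    # squashing the output (axon) between 0 and 1
    return 1 / (1 + math.exp(-z))

# two input connections (dendrites), each with its own weight
print(neuron(inputs=[0.5, -1.0], weights=[0.8, 0.2], bias=0.1))
```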



<h2 class="wp-block-heading">Preparing an ANN for image classification training:</h2>



<ol class="wp-block-list"><li>Decide on the <strong>number of output classes</strong> (meaning the number of image classes &#8211; for example, two for cat vs dog)</li><li>Draw as many computation units as the <strong>number of output classes</strong> (congrats, you just created the <strong>Output Layer</strong> of the ANN)</li><li>Add as many <strong>Hidden Layers</strong> as needed within the defined <strong>architecture</strong> (for instance <a rel="noreferrer noopener nofollow external" href="https://neurohive.io/en/popular-networks/vgg16/" target="_blank" data-wpel-link="external">vgg16</a> or <a rel="noreferrer noopener nofollow external" href="https://neurohive.io/en/popular-networks/" target="_blank" data-wpel-link="external">any other popular architecture</a>). Tip &#8211; <strong>Hidden Layers</strong> are just sets of neighbouring <strong>Compute Units</strong>; the units within a layer are not linked together.</li><li>Stack those <strong>Hidden Layers</strong> onto the <strong>Output Layer</strong> using <strong>Neural Connections</strong></li><li>It is important to understand that the <strong>Input Layer</strong> is basically a layer of data ingestion</li><li>Add an <strong>Input Layer</strong> that is adapted to ingest your data (or adapt your data format to the pre-defined architecture)</li><li>Assemble many Artificial Neurons together in a way where the <strong>output</strong> (axon) of a <strong>Neuron</strong> on a given <strong>Layer</strong> is one of the <strong>inputs</strong> of another <strong>Neuron</strong> on a subsequent <strong>Layer</strong>. As a consequence, the <strong>Input Layer</strong> is linked to the <strong>Hidden Layers</strong>, which are then linked to the <strong>Output Layer</strong> (as shown in the picture below) using <strong>Neural Connections</strong> (also shown in the picture below).</li><li>Enjoy your meal</li></ol>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/9C2D3CB9-4385-4348-963A-3DF79E3C3C62.png" alt="" class="wp-image-17934" width="476" height="371" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/9C2D3CB9-4385-4348-963A-3DF79E3C3C62.png 951w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/9C2D3CB9-4385-4348-963A-3DF79E3C3C62-300x234.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/9C2D3CB9-4385-4348-963A-3DF79E3C3C62-768x598.png 768w" sizes="auto, (max-width: 476px) 100vw, 476px" /><figcaption>simplified schema of a neural network architecture</figcaption></figure></div>



<h2 class="wp-block-heading">What does it mean to train an Artificial Neural Network ?</h2>



<p>All <strong>Neurons</strong> of a given <strong>Layer</strong> generate an <strong>Output</strong>, but they don&#8217;t all have the same <strong>Weight</strong> for the next <strong>Neuron Layer</strong>. This means that if a Neuron on a layer observes a given pattern, it might mean less for the overall picture, and will be partially or completely muted. This is what we call <strong>Weighting</strong>: a <strong>big weight means that the Input is important</strong> and, of course, <strong>a small weight means that we should ignore it</strong>. Every <strong>Neural Connection</strong> between <strong>Neurons</strong> has <strong>an associated Weight</strong>.</p>



<p>And this is the magic of<strong> Neural Network Adaptability</strong>: <strong>Weights</strong> will be adjusted over the training to fit the <strong>objectives</strong> we have set (recognize that a dog is a dog and that a cat is a cat). <strong>In simple terms: Training a Neural Network means finding the appropriate Weights of the Neural Connections thanks to a feedback loop called Gradient Backward propagation &#8230; and that&#8217;s it</strong> <strong>folks.</strong></p>
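<p>In plain Python, that feedback loop for a single neuron with a single connection looks something like this (a toy sketch; the correction is fed backward through the sigmoid via the chain rule):</p>

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One neuron, one connection: training means finding the connection weight w
# such that the neuron's output matches the target for the given input.
w, learning_rate = 0.5, 1.0
for _ in range(1000):
    out = sigmoid(w * 1.0)                # forward pass on input x = 1.0
    error = out - 0.9                     # we want the neuron to output 0.9
    grad = error * out * (1 - out) * 1.0  # backward pass: chain rule through the sigmoid
    w -= learning_rate * grad             # weight adjustment by a learning-rate step
print(round(sigmoid(w), 2))  # the trained neuron now outputs roughly 0.9
```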



<h2 class="wp-block-heading">Parallel between Control Theory and Deep Learning Training</h2>



<p>The engineering field of <strong>control theory</strong> defines similar principles to the mechanism used for training neural networks.</p>



<h3 class="wp-block-heading">Control Theory general concepts</h3>



<p>In control systems, a <strong>setpoint</strong> is the target value<strong> for the system.</strong><br><br>A <strong>setpoint</strong> (<strong>input</strong>) is defined and then processed by a controller, which adjusts the setpoint&#8217;s value according to the feedback loop (<strong>Manipulated Variable</strong>). Once the <strong>setpoint</strong> has been <strong>adjusted</strong> it is then sent to the <strong>controlled system</strong> which will <strong>produce an output.</strong> This output is monitored using an appropriate metric which is then <strong>compared (comparator) to the original input </strong>via a <strong>feedback loop</strong>. This allows the <strong>controller</strong> to define the <strong>level of adjustment (Manipulated Variable) </strong>of the original setpoint.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-1024x381.jpeg" alt="" class="wp-image-17889" width="512" height="191" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-1024x381.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-300x112.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-768x286.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-1536x572.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/E4ADE5AC-8E87-47D2-A575-90B7D703A512_1_201_a-2048x762.jpeg 2048w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<h3 class="wp-block-heading">Control Theory applied to a radiator</h3>



<p>Let&#8217;s take the example of a <strong>resistance (controlled system)</strong> in a radiator. Imagine you decide to <strong>set the room temperature to 20°C (setpoint)</strong>. The radiator starts up and supplies the <strong>resistance</strong> with a <strong>certain intensity</strong> defined by the <strong>controller</strong>. A <strong>probe (thermometer)</strong> then takes the ambient temperature (<strong>feedback elements</strong>), which is <strong>compared (comparator)</strong> <strong>to the setpoint</strong> (desired temperature) and adjusts <strong>(controller)</strong> the electric intensity sent to the resistance. The adjustment of the new intensity is deployed via an <strong>incremental adjustment step</strong>.</p>
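<p>A minimal simulation of that loop (plain Python, with a simple proportional controller standing in for the real electronics) shows the temperature settling at the setpoint:</p>

```python
# Toy proportional controller: the feedback loop compares the thermometer
# reading to the setpoint and adjusts the heating intensity accordingly.
setpoint = 20.0       # desired room temperature
temperature = 12.0    # current room temperature (feedback element)
for _ in range(60):
    error = setpoint - temperature  # comparator
    intensity = 0.2 * error         # controller: proportional adjustment step
    temperature += intensity        # controlled system: the room heats (or cools)
print(round(temperature, 1))  # settles at 20.0, the setpoint
```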



<h3 class="wp-block-heading">Control Theory applied to Neural Network Training</h3>



<p>The training of a neural network is similar to a radiator insofar as the controlled system is the cat or dog detection model.<br><br>The objective is no longer to minimise the difference between the setpoint temperature and the actual temperature, but to <strong>minimise the error (Loss) between the classification of the incoming data (a cat is a cat) and the one made by the neural network.</strong><br><br>In order to achieve this, the system will have to look at the <strong>input</strong> (<strong>setpoint</strong>) and <strong>compute an output</strong> (<strong>controlled system</strong>) based on the parameters defined in the algorithm. This phase is called the <strong>forward pass</strong>.</p>



<p><br>Once the <strong>output</strong> has been calculated, the system will <strong>re-propagate the evaluation error</strong> using <strong>Gradient Retro-propagation </strong>(<strong>Feedback Elements</strong>). While the temperature difference between the setpoint and the thermometer was converted into electrical intensity for the radiator, here <strong>the system will adjust the weights of the different inputs into each neuron with a given step (learning rate)</strong>.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-1024x383.jpeg" alt="" class="wp-image-17888" width="512" height="192" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-1024x383.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-300x112.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-768x288.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-1536x575.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/BFB7FD92-22D8-4600-A211-14A29E191A70_1_201_a-2048x767.jpeg 2048w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption>Parallel between electrical engineering controlled system and neural network training process</figcaption></figure>



<h2 class="wp-block-heading">One thing to consider: The Valley Problem</h2>



<p>When training the system, the backward propagation will lead the system to reduce the error it&#8217;s making to best fit the objectives you have set (finding that a dog is a dog&#8230;).</p>



<p>Choosing the learning rate at which you adjust your weights (what one calls the<strong> adjustment step</strong> in <a rel="noreferrer noopener nofollow external" href="https://en.wikipedia.org/wiki/Control_theory" target="_blank" data-wpel-link="external">Control Theory</a>) is therefore critical.</p>



<p>Just as is the case in control theory, the control system can face several issues if it is not designed correctly:</p>



<ul class="wp-block-list"><li>If the <strong>correction step (learning rate)</strong> is too small, it will lead to very slow convergence (i.e. it will take a very long time to get your room to 20°C&#8230;).</li><li>Too small a <strong>learning rate</strong> can also leave you <strong>stuck in a local minimum</strong></li><li>If the <strong>correction step (learning rate)</strong> is too high, the system will never converge (it will beat around the bush) or worse (i.e. the radiator will oscillate between being either too hot or too cold)</li><li>The system could enter a resonance state (<strong>divergence</strong>).</li></ul>
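<p>You can observe these failure modes on the simplest possible system: gradient descent on f(x) = x&#178; (a toy sketch in plain Python, with arbitrary learning rates chosen for the demonstration):</p>

```python
def descend(learning_rate, steps=50):
    """Gradient descent on f(x) = x^2 (gradient 2x), starting from x = 10."""
    x = 10.0
    for _ in range(steps):
        x -= learning_rate * 2 * x  # correction step scaled by the learning rate
    return x

print(descend(0.001))  # too small: after 50 steps, still far from the minimum at 0
print(descend(0.4))    # reasonable: converges very close to the minimum
print(descend(1.1))    # too high: each step overshoots and the system diverges
```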



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-787x1024.jpeg" alt="Why Training a Neural Network Is Hard" class="wp-image-17868" width="394" height="512" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-787x1024.jpeg 787w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-230x300.jpeg 230w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-768x1000.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-1180x1536.jpeg 1180w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-1574x2048.jpeg 1574w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/866F67B0-DF11-485B-9CC4-0A1C50752625-scaled.jpeg 1967w" sizes="auto, (max-width: 394px) 100vw, 394px" /></figure>



<h2 class="wp-block-heading">In the end, training an Artificial Neural Network (ANN) requires just a few steps:</h2>



<ol class="wp-block-list"><li>First, the ANN requires a <strong>random weight initialization</strong></li><li>Split the dataset into <strong>batches</strong> <strong>(batch size)</strong></li><li>Send the batches one by one to the GPU</li><li>Calculate the <strong>forward pass</strong> (what the output would be with the current weights)</li><li>Compare the calculated output to the expected output <strong>(loss)</strong></li><li>Adjust the <strong>weights</strong> (using the <strong>learning rate</strong> increment or decrement) according to the <strong>backward pass (backward gradient propagation)</strong></li><li>Go back to step 2</li></ol>
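<p>The steps above can be sketched as a plain-Python training loop for a single linear "neuron" learning <em>y = 3x</em> from toy data. This is a minimal caricature, not a real framework API; the variable names and the toy dataset are made up for illustration:</p>

```python
import random

# Minimal sketch of the training steps: random init, batches, forward pass,
# loss, backward pass, weight update. Learns y = 3x from toy data.
random.seed(0)
data = [(x, 3 * x) for x in range(100)]          # toy dataset
w = random.uniform(-1, 1)                        # 1. random weight initialization
learning_rate = 1e-5
batch_size = 10

for epoch in range(20):                          # 7. go back and repeat
    batches = [data[i:i + batch_size]
               for i in range(0, len(data), batch_size)]  # 2. split into batches
    for batch in batches:                        # 3. send batches one by one
        grad = 0.0
        for x, y in batch:
            y_hat = w * x                        # 4. forward pass
            error = y_hat - y                    # 5. compare to expected output (loss)
            grad += 2 * error * x                # 6. backward pass (gradient of squared loss)
        w -= learning_rate * grad / len(batch)   #    adjust weight with the learning rate

print(round(w, 3))  # close to 3.0
```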



<h2 class="wp-block-heading">Further notice</h2>



<p>That’s all folks, you are now all set to read our future blog post which focuses on <strong>Distributed Training in a Deep Learning Context.</strong></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwhat-does-training-neural-networks-mean%2F&amp;action_name=What%20does%20Training%20Neural%20Networks%20mean%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deep Learning explained to my 8-year-old daughter</title>
		<link>https://blog.ovhcloud.com/deep-learning-explained-to-my-8-year-old-daughter/</link>
		
		<dc:creator><![CDATA[Jean-Louis Queguiner]]></dc:creator>
		<pubDate>Fri, 15 Feb 2019 14:56:56 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Neural networks]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14481</guid>

					<description><![CDATA[Machine Learning and especially Deep Learning&#160;are hot topics and you are sure to have come across the buzzword &#8220;Artificial Intelligence&#8221; in the media. Yet these are not new concepts. The first Artificial Neural Network (ANN) was introduced in the 40s. So why all the recent interest around neural networks&#160;and Deep Learning?&#160; We will explore this [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeep-learning-explained-to-my-8-year-old-daughter%2F&amp;action_name=Deep%20Learning%20explained%20to%20my%208-year-old%20daughter&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><strong>Machine Learning</strong> and especially <strong><a href="https://www.kdnuggets.com/2016/01/seven-steps-deep-learning.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deep Learning</a></strong>&nbsp;are hot topics and you are sure to have come across the buzzword &#8220;Artificial Intelligence&#8221; in the media.</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="885" height="508" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/IMG_0057.jpg" alt="Deep Learning: A new hype" class="wp-image-14620" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0057.jpg 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0057-300x172.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0057-768x441.jpg 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>Yet these are not new concepts. The first <strong>Artificial Neural Network</strong> (ANN) was introduced in the 40s. So why all the recent interest around neural networks&nbsp;and Deep Learning?<strong>&nbsp;</strong></p>



<p>We will explore this and other concepts in a series of blog posts on&nbsp;<strong>GPUs and Machine Learning</strong>.</p>



<h2 class="wp-block-heading"><strong>YABAIR &#8211; Yet Another Blog About Image Recognition</strong></h2>



<p>In the 80s, I remember my father building character recognition for bank checks. He used primitives and derivatives around pixel darkness level. Examining so many different types of handwriting was a real pain because he needed one equation to apply to all the variations.</p>



<p>In the last few years, it has become clear that the best way to deal with this type of problem is through Convolutional Neural Networks. Equations designed by humans are no longer fit to handle the infinite variety of handwriting patterns.</p>



<p>Let&#8217;s take a look at one of the most classic examples: building a number recognition system, a neural network to recognise handwritten digits.</p>



<h3 class="wp-block-heading">Fact 1: It&#8217;s as simple as counting</h3>



<p>We&#8217;ll start by counting how many times the small red shapes in the top row can be seen in each of the black, hand-written digits (in the left-hand column).</p>



<div class="wp-block-image wp-image-14651"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/IMG_0067.jpg" alt="Simplified matrix for handwritten numbers" class="wp-image-14651" width="337" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0067.jpg 674w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0067-253x300.jpg 253w" sizes="auto, (max-width: 337px) 100vw, 337px" /><figcaption>Simplified matrix for handwritten numbers</figcaption></figure></div>



<p>Now let&#8217;s try to recognise (infer) a new hand-written digit, by counting the number of matches with the same red shapes. We&#8217;ll then compare this to our previous table, in order to identify which number has the most correspondences:</p>



<div class="wp-block-image size-medium wp-image-14652"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/IMG_0069.jpg" alt="Matching shapes for handwritten numbers " class="wp-image-14652" width="443" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0069.jpg 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0069-300x271.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0069-768x693.jpg 768w" sizes="auto, (max-width: 443px) 100vw, 443px" /><figcaption>Matching shapes for handwritten numbers</figcaption></figure></div>



<p>Congratulations! You&#8217;ve just built the world&#8217;s simplest neural network system for recognising hand-written digits.</p>
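<p>That counting procedure can be written down in a few lines: each digit is summarised by a vector of shape counts, and a new digit is classified by the closest count vector. The counts below are made-up illustrative numbers, not values read off the figure:</p>

```python
# Hypothetical shape-count table: for each known digit, how many times each
# of four small reference shapes appears in it (illustrative values only).
templates = {
    0: [2, 0, 2, 0],
    1: [0, 0, 0, 2],
    8: [4, 0, 4, 0],
}

def classify(counts):
    # Pick the digit whose template differs least from the observed counts.
    def distance(digit):
        return sum(abs(a - b) for a, b in zip(templates[digit], counts))
    return min(templates, key=distance)

print(classify([4, 0, 3, 0]))  # 8: closest to the template for 8
```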



<h3 class="wp-block-heading">Fact 2: An image is just a matrix</h3>



<p class="graf graf--p">A computer views an image as a&nbsp;<strong>matrix</strong>. A black and white image is a 2D matrix.</p>



<p>Let&#8217;s consider an image. To keep it simple, let&#8217;s take a small black and white image of an 8, with square dimensions of 28 pixels.</p>



<p>Every cell of the matrix represents the intensity of the pixel from 0 (which represents black), to 255 (which represents a pure white pixel).</p>



<p>The image will therefore be represented as the following 28 x 28 pixel matrix.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="723" height="504" src="https://www.ovh.com/blog/wp-content/uploads/2020/06/1_cLsTCWtUL1GYBUv8vnbOxw.jpeg" alt="Image of a handwritten 8 and the associated intensity matrix" class="wp-image-18492" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/06/1_cLsTCWtUL1GYBUv8vnbOxw.jpeg 723w, https://blog.ovhcloud.com/wp-content/uploads/2020/06/1_cLsTCWtUL1GYBUv8vnbOxw-300x209.jpeg 300w" sizes="auto, (max-width: 723px) 100vw, 723px" /><figcaption>Image of a handwritten 8 and the associated intensity matrix</figcaption></figure></div>
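<p>In code, such an image is literally just a 2D array of intensities. A tiny sketch with a toy 4×4 matrix instead of the full 28×28:</p>

```python
import numpy as np

# A black and white image is a 2D matrix of pixel intensities,
# 0 = black, 255 = pure white. A toy 4x4 example:
image = np.array([
    [  0,  64, 128, 255],
    [  0, 255, 255,   0],
    [  0, 255, 255,   0],
    [  0,  64, 128, 255],
], dtype=np.uint8)

print(image.shape)               # (4, 4)
print(image.min(), image.max())  # intensities stay within the 0-255 range
```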



<h3 class="wp-block-heading">Fact 3: Convolutional layers are just bat-signals</h3>



<p class="graf graf--p">To work out which pattern is displayed in a picture (in this case the handwritten 8) we will use a kind of bat-signal/flashlight. In machine learning, the flashlight is called a filter. The filter is used to perform a classic convolution matrix calculation used in usual image processing software such as&nbsp;<a href="https://docs.gimp.org/2.8/en/plug-in-convmatrix.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gimp.</a></p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://media2.giphy.com/media/l0NwGpoOVLTAyUJSo/giphy.gif" alt="Batman bat-signal searchlight in the sky"/></figure></div>



<p>The filter will <strong>scan the picture</strong> in order to <strong>find the pattern</strong> in the image and will trigger a <strong>positive feedback</strong> if a match is found. It works a bit like a toddler&#8217;s shape sorting box: the triangle filter matching the triangle hole, the square filter matching the square hole, and so on.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://multimedia.bbycastatic.ca/multimedia/products/500x500/103/10319/10319838.jpg" alt="Image filters work like children shape sorting boxes"/><figcaption>Image filters work like children shape sorting boxes</figcaption></figure></div>



<h3 class="wp-block-heading">Fact 4: Filter matching is an embarrassingly&nbsp;parallel task</h3>



<p class="graf graf--p">To be more scientific the image filtering process looks a bit like the animation below. As you can see, <strong>every step</strong> of the filter scanning is <strong>independent</strong>, which means that this task can be <strong>highly parallelised</strong>.</p>



<p>It&#8217;s important to note that <strong>tens of filters</strong> will be applied at the same time, <strong>in parallel</strong>, as none of them depend on each other.</p>



<div class="wp-block-image"><figure class="aligncenter"><a href="https://cdn-images-1.medium.com/max/800/0*rKUDc--RZg1v66wq" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img decoding="async" src="https://cdn-images-1.medium.com/max/800/0*rKUDc--RZg1v66wq" alt="Convolution Filter over an input image"/></a><figcaption>https://github.com/vdumoulin</figcaption></figure></div>



<h3 class="wp-block-heading">Fact 5: Just repeat the filtering operation (matrix convolution) as many times as possible</h3>



<p>We just saw that the input image/matrix is filtered using multiple matrix convolutions.</p>



<p>To improve the accuracy of the image recognition just take the filtered image from the previous operation and filter again and again and again&#8230;</p>



<p>Of course, we are oversimplifying things somewhat, but generally the more filters you apply, and the more you repeat this operation in sequence, the more precise your results will be.</p>



<p>It&#8217;s like creating new abstraction layers to get a clearer and clearer object filter description, starting from primitive filters and moving up to filters that look like edges, wheels, squares, cubes, &#8230;</p>



<h3 class="wp-block-heading">Fact 6: Matrix convolutions are just <em>x</em>&nbsp;and <em>+</em></h3>



<p>An image is worth a thousand words: the following picture is a simplistic view of a source image (8×8) filtered with a convolution filter (3×3). The projection of the torch light (in this example a Sobel Gx Filter) provides one value.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://i.stack.imgur.com/YDusp.png" alt=""/><figcaption>Example of a convolution filter (Sobel Gx) applied to an input matrix (Source : https://datascience.stackexchange.com/questions/23183/why-convolutions-always-use-odd-numbers-as-filter-size/23186)</figcaption></figure></div>



<p>This is where the magic happens: simple matrix operations that are highly parallelisable, which fits perfectly with the General Purpose Graphics Processing Unit use case.</p>
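<p>The "just <em>x</em> and <em>+</em>" claim can be checked directly. The sketch below applies the Sobel Gx filter from the figure to a toy image with a single vertical edge; each output value is nothing more than nine multiplications and a sum:</p>

```python
import numpy as np

# Sobel Gx filter, as in the figure above.
sobel_gx = np.array([[-1, 0, 1],
                     [-2, 0, 2],
                     [-1, 0, 1]])

def convolve(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):         # every output position is independent,
        for j in range(out.shape[1]):     # which is why this parallelises so well
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # elementwise x, then +
    return out

# Toy image: dark left half, bright right half (a vertical edge).
image = np.array([[10, 10, 80, 80],
                  [10, 10, 80, 80],
                  [10, 10, 80, 80],
                  [10, 10, 80, 80]])
print(convolve(image, sobel_gx))
# every 3x3 patch straddles the vertical edge, so each output value is 280
```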



<h3 class="wp-block-heading">Fact 7: Need to simplify and summarise what&#8217;s been detected? Just use max()</h3>



<p class="graf graf--figure">We need to <strong>summarise</strong>&nbsp;what&#8217;s been detected by the filters in order to <strong>generalise the knowledge</strong>.</p>



<p class="graf graf--figure">To do so, we will sample the output of the previous filtering operation.</p>



<p class="graf graf--figure">This operation is called&nbsp;<strong>pooling</strong>&nbsp;or <strong>downsampling</strong>, but in fact it&#8217;s simply about reducing the size of the matrix.</p>



<p class="graf graf--figure">You can use any reducing operation such as: max, min, average, count, median, sum and so on.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://qph.fs.quoracdn.net/main-qimg-8afedfb2f82f279781bfefa269bc6a90.webp" alt=""/><figcaption>Example of a max pooling layer (Source : Stanford&#8217;s CS231n)</figcaption></figure></div>
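<p>A 2×2 max pooling like the one in the figure takes only a few lines with NumPy; the input matrix below is the one from the CS231n example:</p>

```python
import numpy as np

def max_pool(matrix, size=2):
    # Downsample by keeping only the max of each (size x size) block.
    h, w = matrix.shape
    return matrix.reshape(h // size, size, w // size, size).max(axis=(1, 3))

features = np.array([[1, 1, 2, 4],
                     [5, 6, 7, 8],
                     [3, 2, 1, 0],
                     [1, 2, 3, 4]])
print(max_pool(features))
# [[6 8]
#  [3 4]]
```

Swapping <code>.max</code> for <code>.mean</code> or <code>.sum</code> gives average or sum pooling, as mentioned above.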



<h3 class="wp-block-heading">Fact 8: Flatten everything to get on your feet</h3>



<p>Let&#8217;s not forget the main purpose of the neural network we are working on: building an image recognition system, also called <strong>image classification</strong>.</p>



<p>If the purpose of the neural network is to detect hand-written digits, there will be 10 classes at the end to map the input image to: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]</p>



<p>To map this input to a class after passing through all those filters and downsampling layers, we will have just 10 neurons (each of them representing a class), each connected to the last subsampled layer.</p>
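<p>Picking the class is then just a matter of finding the output neuron with the strongest activation. The activation values below are made up for illustration:</p>

```python
import numpy as np

# Hypothetical activations of the 10 output neurons, one per digit class 0-9.
activations = np.array([0.01, 0.02, 0.05, 0.01, 0.03,
                        0.02, 0.01, 0.04, 0.79, 0.02])

predicted_digit = int(np.argmax(activations))  # index of the strongest response
print(predicted_digit)  # 8
```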



<p>Below is an overview of the original LeNet-5 Convolutional Neural Network designed by <a href="https://en.wikipedia.org/wiki/Yann_LeCun" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Yann LeCun</a>, one of the early adopters of this technology for image recognition.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="713" height="213" src="https://www.ovh.com/blog/wp-content/uploads/2020/06/Architecture-of-CNN-by-LeCun-et-al-LeNet5.png" alt="" class="wp-image-18491" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/06/Architecture-of-CNN-by-LeCun-et-al-LeNet5.png 713w, https://blog.ovhcloud.com/wp-content/uploads/2020/06/Architecture-of-CNN-by-LeCun-et-al-LeNet5-300x90.png 300w" sizes="auto, (max-width: 713px) 100vw, 713px" /><figcaption>LeNet-5 architecture published in the original paper (source : http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf).</figcaption></figure></div>



<h3 class="graf graf--figure wp-block-heading"><b>Fact 9: Deep Learning is just LEAN &#8211; continuous&nbsp;improvement based on a feedback loop</b></h3>



<p class="graf graf--figure">The beauty of the technology does not only come from the convolution but from the capacity of the network to learn and adapt by itself. By implementing a feedback loop called&nbsp;<em><strong><a href="https://en.wikipedia.org/wiki/Backpropagation" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">backpropagation</a>&nbsp;</strong></em>the network will mitigate and&nbsp;inhibit some &#8220;neurons&#8221; in the different layers using&nbsp;<span style="text-decoration: underline;"><em><strong><a href="https://www.quora.com/What-does-weight-mean-in-terms-of-neural-networks" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">weights</a></strong></em></span><em>.&nbsp;</em></p>



<p class="graf graf--figure">Let&#8217;s KISS (keep it simple): we look at the output of the network, if the guess (the output 0,1,2,3,4,5,6,7,8 or 9) is wrong, we look at which filter(s) &#8220;made a mistake&#8221;, we give this filter or filters a small weight so they will not make the same mistake next time. And voila! The system learns and keeps improving itself.</p>



<h3 class="wp-block-heading"><b>Fact 10: It all amounts to the fact that Deep Learning is embarrassingly&nbsp;parallel</b></h3>



<p>Ingesting thousands of images, running tens of filters, applying downsampling, flattening the output &#8230; all of these steps can be done in parallel, which makes the system <a href="https://en.wikipedia.org/wiki/Embarrassingly_parallel" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">embarrassingly parallel</a>. &#8220;Embarrassingly&#8221; in reality means a <strong><em>perfectly parallel&nbsp;</em></strong>problem, and it&#8217;s a perfect use case for <em><strong>GPGPUs (General Purpose Graphics Processing Units)</strong></em>, which&nbsp;are built for massively parallel computing.</p>



<h3 class="wp-block-heading"><strong>Fact 11: Need more precision? Just go deeper</strong></h3>



<p>Of course it is a bit of an oversimplification, but if we look at the main &#8220;image recognition competition&#8221;, known as the <a href="https://en.wikipedia.org/wiki/ImageNet#ImageNet_Challenge" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ImageNet challenge,</a> we can see that the error rate has decreased with the depth of the neural network. It is generally acknowledged that, among other elements, the depth of the network will lead to a better capacity for generalisation and precision.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://cdn-images-1.medium.com/max/800/1*DBXf6dzNB78QPHGDofHA4Q.png" alt=""/><figcaption>Imagenet competition winner error rates VS number of layers in the network (source : https://medium.com/@sidereal/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5)</figcaption></figure></div>



<h3 class="wp-block-heading"><strong>In conclusion&nbsp;&nbsp;</strong></h3>



<p>We have taken a brief look at the concept of Deep Learning as applied to image recognition. It&#8217;s worth noting that almost every new architecture for image recognition (medical, satellite, autonomous driving, &#8230;) uses these same principles with a different number of layers, different types of filters, different initialisation points, different matrix sizes and different tricks (like image augmentation, dropout, weight compression, &#8230;). The concepts remain the same:</p>



<div class="wp-block-image wp-image-14654"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="885" height="469" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/IMG_0070.jpg" alt="Number detection process" class="wp-image-14654" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0070.jpg 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0070-300x159.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0070-768x407.jpg 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption>Number detection process</figcaption></figure></div>



<p>In other words, we saw that the training and inference of deep learning models comes down to lots and lots of basic matrix operations that can be done in parallel, and this is exactly what our good old graphical processors (GPU) are made for.</p>



<p>In the next post we will discuss precisely how a GPU works and how, technically, deep learning is implemented on it.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeep-learning-explained-to-my-8-year-old-daughter%2F&amp;action_name=Deep%20Learning%20explained%20to%20my%208-year-old%20daughter&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
