Jean-Louis Queguiner, Author at OVHcloud Blog

How to virtualize PCI-e GPU in the cloud ?

Jean-Louis Queguiner — Fri, 14 Jan 2022 14:00:00 +0000

A quick introduction to virtualization

The purpose of virtualization is to isolate the user’s software environment from the hardware environment. The hypervisor orchestrates these virtual environments. It also lets a Virtual Machine (VM) execute instructions that are not directly compatible with the underlying architecture or hardware. VT-x and AMD-V are the hardware virtualization technologies from Intel and AMD respectively.

Some of you might know QEMU, NEMU, KVM and libvirt. Each of these tools has a specific role, as explained below:

Tool	Tool Family	Purpose
KVM (Kernel-based Virtual Machine)	Virtualization	– Linux kernel module that runs VMs. In contrast to QEMU, KVM leverages the virtualization extensions provided by the CPU itself (VT-x or AMD-V) without any emulation.
libvirt	Virtualization	Manages and manipulates virtualization (API, CLI and a daemon, libvirtd).
QEMU (Quick EMUlator)	Emulation	– Emulates the processor and peripherals. It basically converts guest instructions (for instance ARM) into instructions compatible with the actual hardware (for instance x86). – Supports all virtualizations/emulations. – Has a reputation for being slow.
NEMU	Emulation	– NEMU is a fork of QEMU that focuses on modern CPUs with advanced virtualization features to speed up the existing QEMU implementation. – NEMU focuses on KVM support to better leverage the CPU’s virtualization extensions and therefore reduce the need to translate instructions. – It doesn’t support all hardware. – It’s faster than QEMU.
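To check which of these CPU virtualization extensions a host actually exposes, you can look at the CPU flags. Here is a minimal sketch, assuming a Linux x86 host where /proc/cpuinfo lists the vmx (VT-x) or svm (AMD-V) flag; the function name is only an illustrative helper.

```python
from pathlib import Path


def virtualization_extensions(cpuinfo_text: str) -> set:
    """Return the hardware virtualization extensions found in /proc/cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        # x86 kernels list CPU features on lines such as "flags : fpu vme ...".
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags:
                found.add("VT-x")   # Intel
            if "svm" in flags:
                found.add("AMD-V")  # AMD
    return found


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(virtualization_extensions(cpuinfo.read_text()) or "no VT-x/AMD-V flag found")
    # The KVM kernel module exposes /dev/kvm once it is loaded.
    print("/dev/kvm present:", Path("/dev/kvm").exists())
```

If neither flag shows up, KVM cannot use hardware acceleration and QEMU falls back to pure emulation.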

The different components can also be represented in the form of a stack.

Now… Let’s talk about AI and GPUs

Why AI practitioners love virtualization and containerization

When running AI workloads, it’s often recommended to use Docker. Indeed, a big part of the pain in AI workload management is making sure every library is compiled against the right CUDA-accelerated libraries (cuBLAS, cuDNN), matched to the right CUDA version and the right driver.

Playing a bit with Deep Learning frameworks and CUDA versions guarantees you a massive headache by the end of the day.

This is why virtualization is awesome: it allows you to break and rebuild your environment. Adding a container technology on top helps you enhance reproducibility and flexibility while playing with different frameworks.

Nvidia identified this issue a while ago, when they decided to push their Nvidia GPU Cloud (NGC) platform.

Handle GPUs in a virtualized environment

When we presented the virtualization layers earlier, we didn’t mention GPUs, as they are not available on every hardware platform. There are multiple ways to handle GPUs in a virtualized environment:

Emulation, through nvemulate (Nvidia) or the OpenCL device emulator (AMD), where you emulate the instructions (CUDA or OpenCL).
Virtualization of the GPU (vGPU) or of a slice of it, where an intermediate interpreter gives you access to a slice of the GPU’s instructions. In this case the GPU is accessed through a virtualized hardware address and not directly. Multiple market solutions exist (Citrix, VMware, NVIDIA GRID for VDI, NVIDIA Quadro Virtual Data Center Workstation, NVIDIA Virtual Compute Server (vCS), …).
Passthrough-ization (I know… this word doesn’t exist…), where you have direct access to the physical hardware. In this case we use VFIO, which assigns the physical address of the GPU to the guest VM. This mode provides slightly better performance (between 1 and 2 percent).
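To give an idea of what passthrough relies on: VFIO hands devices to a guest per IOMMU group, and on Linux these groups are visible under /sys/kernel/iommu_groups. The sketch below simply lists them. The helper name is hypothetical, and it returns an empty result when the IOMMU is disabled or the host is not Linux.

```python
from pathlib import Path


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict:
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # IOMMU disabled, or not a Linux host
    for group in base.iterdir():
        devices = group / "devices"
        if group.name.isdigit() and devices.is_dir():
            groups[int(group.name)] = sorted(d.name for d in devices.iterdir())
    return groups


if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        print(f"group {number}: {', '.join(devices)}")
```

A GPU that shares its group with unrelated devices generally cannot be handed to a VM on its own, which is one reason the host hardware matters for passthrough.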

Here are 2 representations of an AI virtualized and dockerized environment.

Now let’s talk about the CPUs/GPUs virtualization trap.

Why PCI Express pass-through and associated limitations ?

It was important to define how virtualization works in order to understand its limitations. Virtualization at the KVM level allocates CPU time to each VM in CPU cycles. At each CPU cycle, the kernel may move the Virtual Machine from its current CPU to another CPU, and this is when things can get dirty.

The concept of PCI Express pass-through is that KVM does not emulate the GPU address but passes instructions directly to the physical address of the Graphics Processing Unit. The hypervisor therefore gives the Virtual Machine plain access to the GPU through the PCIe bus. As a result, the hypervisor does not control GPU usage, meaning that you, as a customer, access the raw hardware at full performance.

CPU is a sports car, GPU is a massive truck

The CPU has limited throughput (roughly 10x lower than a GPU’s), so an operation can sometimes be slower to run on the CPU than to transfer to the GPU and compute there. But the transfer itself has a cost.

Think about it: why would you take your suitcase out of the sports car and put it in the truck if it’s the only thing you have to move from Berlin to Paris? So it’s a constant trade-off between executing an operation on the CPU or on the GPU. Thankfully, once again, those crazy AI framework core developers made those trade-offs for us.
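The suitcase trade-off can be put into numbers with a back-of-the-envelope model. This is only a sketch under stated assumptions: the throughput figures are hypothetical round values (a GPU 10x faster than the CPU, a 16 GB/s PCIe link), and the model ignores kernel launch latency and caching.

```python
def faster_on_gpu(flops: float, bytes_to_move: float,
                  cpu_flops_per_s: float, gpu_flops_per_s: float,
                  pcie_bytes_per_s: float) -> bool:
    """Return True when 'transfer over PCIe, then compute on GPU' beats the CPU."""
    cpu_time = flops / cpu_flops_per_s
    gpu_time = bytes_to_move / pcie_bytes_per_s + flops / gpu_flops_per_s
    return gpu_time < cpu_time


# Hypothetical hardware figures, for illustration only.
CPU, GPU, PCIE = 1e11, 1e12, 16e9

# The suitcase: a tiny operation, the transfer dominates.
print(faster_on_gpu(1e6, 8e6, CPU, GPU, PCIE))   # False
# A full truck load: the same data volume but a lot of compute.
print(faster_on_gpu(1e12, 8e6, CPU, GPU, PCIE))  # True
```

Deep learning frameworks make this kind of decision for you, with far more detailed cost information than this toy model.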

Does CPU performance matter even when you are using a GPU?

Yes! As explained before (in the blog post How PCI express works and why you should care), CPUs are major bottlenecks when pushing data from CPU to GPU over PCIe.

Practical Case

Let’s imagine a host with 2 CPU sockets, each linked to its own PCIe slots. The diagram below shows a simplified hardware architecture for GPUs attached over PCIe.

Each CPU socket is attached to its PCIe slots through PCIe lanes (if you don’t know what PCIe lanes are, read this: https://www.ovh.com/blog/how-pci-express-works-and-why-you-should-care-gpu/). Looking back at the diagram, accessing PCIe #1 from CPU#2 is sequenced this way:

The VM requests GPU#1 from its attached CPU for the current clock cycle, meaning CPU#2 in our scenario.
CPU#2 then requests access to GPU#1 by calling CPU#1, as GPU#1 is physically attached to CPU#1 (I mean the socket of CPU#1) through the PCIe link.
This back-and-forth overhead is estimated at around 3%* according to our benchmarks.

* The estimated 3% was measured before the L1TF Intel vulnerability, which has since been patched with a countermeasure that invalidates/flushes the L1 CPU cache at the kernel level. The gap might therefore be a little smaller now, since the VM has to rebuild the L1 cache in both cases, even if it stays on the same CPU.
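On a Linux host you can find out which socket a GPU hangs off by reading the numa_node file that sysfs exposes for each PCI device. The snippet below is a small sketch of that lookup; the function names and the example PCI address are hypothetical.

```python
from pathlib import Path
from typing import Optional


def device_numa_node(pci_address: str, root: str = "/sys/bus/pci/devices") -> Optional[int]:
    """Return the NUMA node a PCI device is attached to, or None if unknown."""
    node_file = Path(root) / pci_address / "numa_node"
    if not node_file.is_file():
        return None
    node = int(node_file.read_text().strip())
    # The kernel reports -1 when it has no NUMA information for the device.
    return node if node >= 0 else None


def is_remote_access(cpu_node: int, device_node: Optional[int]) -> bool:
    """True when the CPU running the VM is not on the device's own NUMA node."""
    return device_node is not None and device_node != cpu_node


if __name__ == "__main__":
    gpu_node = device_numa_node("0000:3b:00.0")  # hypothetical GPU address
    print("GPU NUMA node:", gpu_node)
    print("remote from node 1:", is_remote_access(1, gpu_node))
```

When the access is remote, every request crosses the inter-socket link first, which is the back and forth described above.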

Example of a virtualization full life cycle with multiple GPUs and VMs

In the following section we will look at 4 sequences of clock cycles and detail the communication flow between the GPU(s), the VM they are attached to, and the CPU and RAM managed by the hypervisor to make these VMs run.

The initial setup

In the initial setup we will consider 3 VMs running on the same physical host. The physical host is composed of:

2 Physical CPUs (2 CPU sockets)
2 RAM components
4 GPUs (2 attached to each CPU socket)

The scenario is sequenced with t0 as the initial clock cycle. Each cycle is considered to last δt (delta t). Therefore:

t0 is the initial setup
t0 + δt is the system state after 1 CPU cycle
t0 + 2δt is the system state after 2 CPU cycles
t0 + 3δt is the system state after 3 CPU cycles

How to read this sequence ? It’s pretty simple; let’s take the example of VM3:

VM3 is assigned to CPU#1 at the initial setup (Yellow),
then VM3 is moved to CPU#2 for the 2nd cycle (green),
then VM3 stays on CPU#2 for the 3rd cycle (red),
Finally VM3 is moved to CPU#3 for the 4th and final cycle (purple).

Let’s look at the full sequence with the associated communication flows.

First cycle communication

At the first cycle, VM#1 has access to GPU#4 and RAM#2.

Second cycle communication

At the second cycle, VM#1 has access to GPU#4 and RAM#2 through CPU#1 and therefore incurs an overhead.

Third cycle communication : the speculative execution special case.

During the third cycle the VM#1 will still have RAM#2 and CPU#2 attributed and this access will also be requested through the CPU#1.

Speculative execution explained

During every cycle, something special may happen depending on your execution plan (though not since the L1TF Intel vulnerability, for Intel CPUs): it’s called speculative execution.

Basically, the CPU anticipates the execution plan (called the execution pipeline) and stores potential future results in its L1 cache (the cache of the CPU itself).

This anticipation of computation (compute ahead) is performed while the CPU’s computing power is at rest, meaning while it transmits the resulting data of the performed execution to the next computational stage. This means that the branch computed ahead might not be relevant if, according to the algorithm, the results of the previous stage do not require it to be evaluated (basically if the condition of the branch, like an if in your code, is not met).

This speculative execution will be relevant if and only if the 3 following conditions are met :

Speculative execution mode is activated (which is no longer the case since the L1TF Intel vulnerability, a Spectre-class flaw).
The CPU instruction of the following cycle needs to be performed (scheduled) on the same CPU as the previous stage.
The condition of the branch needs to be met.

Therefore, if speculative execution had been set up, this 3rd cycle could have been optimized because VM#1 is running on the same CPU (CPU#1) as in the previous CPU cycle (2nd cycle).

Fourth cycle

In this cycle VM#1 is scheduled back to CPU#2 and won’t have an overhead as attached hardware is directly linked to the related CPU socket.
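The four cycles above can be replayed in a few lines. This is only a toy sketch of the walkthrough: it assumes, as the text implies, that GPU#4 and RAM#2 are wired to the socket of CPU#2, and that VM#1 runs on CPU#2, CPU#1, CPU#1 and CPU#2 over the four cycles.

```python
# Socket each resource of VM#1 is wired to (assumption taken from the walkthrough).
DEVICE_SOCKET = {"GPU#4": 2, "RAM#2": 2}

# CPU socket on which VM#1 is scheduled at t0, t0+δt, t0+2δt and t0+3δt.
VM1_SCHEDULE = [2, 1, 1, 2]


def remote_cycles(schedule, device_socket):
    """Return the 1-based cycles where the VM runs away from at least one of its devices."""
    return [
        cycle
        for cycle, cpu in enumerate(schedule, start=1)
        if any(socket != cpu for socket in device_socket.values())
    ]


print(remote_cycles(VM1_SCHEDULE, DEVICE_SOCKET))  # [2, 3]
```

Cycles 2 and 3 are the ones that pay the cross-socket overhead; cycles 1 and 4 go straight to the locally attached hardware.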

CPU pinning

To avoid the back and forth explained before, a concept called CPU pinning was introduced.

The OpenStack case

OpenStack is able to handle the NUMA-node processing for you: it can instruct libvirt to statically pin each vCPU to a physical CPU so that the vCPUs no longer “move around” as described above.
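For the curious, static pinning ends up in the libvirt domain definition as a cputune element containing one vcpupin entry per vCPU. The sketch below only builds that XML fragment to show its shape; the vCPU-to-host-CPU mapping is a hypothetical example, and in practice OpenStack generates this for you.

```python
import xml.etree.ElementTree as ET


def cputune_xml(pinning: dict) -> str:
    """Build a libvirt <cputune> fragment pinning each vCPU to one host CPU."""
    cputune = ET.Element("cputune")
    for vcpu, host_cpu in sorted(pinning.items()):
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=str(host_cpu))
    return ET.tostring(cputune, encoding="unicode")


# Hypothetical mapping: both vCPUs stay on host CPUs of the socket that owns the GPU.
print(cputune_xml({0: 8, 1: 9}))
```

With such a mapping, the scheduler can no longer move the VM to the other socket between cycles.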

You can check more information regarding this topic here. This process is implemented on OVHcloud Public Cloud offers based on OpenStack.

The VMware case

Here is an example of a VMware CPU pinning configuration. This feature is available on OVHcloud Private Cloud offers as well as on OVHcloud’s new managed VMware on Bare Metal offers.

Splitting a GPU thanks to GPU virtualization : understanding the performance impacts

One might want to split a GPU across multiple VMs using GPU virtualization. It’s especially interesting in terms of cost when you have a large memory, as on the Nvidia V100S with its 32GB of VRAM versus the V100 with only 16GB (check this blog post for further details).

Translation: illustration of a GPU split in 2 that was supposed to have 32GB and is assigned only 16GB of VRAM.

We decided not to go this way for the following reasons :

The first reason is that, as described in the figure below, splitting GPU#4 for instance means that CPU#2 might be overloaded.
The second reason is that the computing power is shared (meaning the CUDA cores when it comes to Nvidia), so each VM may not get the GPU’s full advertised capacity.

In conclusion

In this article we presented how virtualization works, how the orchestration of computing cycles matters and why we decided to go with PCI Express passthrough for our VM GPU offers.

You can also check our blog post explaining how to manage GPU pools efficiently in the cloud using our new AI Training offer.

Managing GPU pools efficiently in AI pipelines

Going Further

If you want to go further I suggest that you read these 2 excellent articles.

Read my other related blogs

How PCI-Express works and why you should care? #GPU

Understanding the anatomy of GPUs using Pokémon

Deep Learning explained to my 8-year-old daughter

Distributed Training in a Deep Learning Context

2021: major technological advances to accelerate, democratize and certify AI uses

Jean-Louis Queguiner — Tue, 26 Jan 2021 14:09:53 +0000

In 2020, OVHcloud launched a portfolio of PaaS solutions dedicated to Data and AI. As an innovative cloud player for the past 20 years, our up-to-date technology coupled with our close collaboration with data science communities as well as the most cutting-edge players in the sector has enabled us to identify 10 strong AI trends in 2021.

1. Moving towards a limited number of standard AI libraries

With double digit growth, the French-American startup HuggingFace is paving the way towards the convergence of different Artificial Intelligence techniques within a single library.

Neural network techniques have considerably evolved in recent years. Like machine learning libraries such as Scikit-Learn, which had laid the foundations of standardization, the rise of HuggingFace’s Transformers library seems to usher in a new technical era: the convergence of tools that simplify and democratize the use of AI.

2. Advances in NLP spread to other areas of AI application

Natural Language Processing (NLP) grew exponentially in 2019 and 2020 due to the emergence of new technologies including Transformers. The latter was particularly noteworthy because it brought with it high-performance and highly agnostic NLP models. It mainly addresses the difficulties inherent to the temporal and spatial aspects of language. It solves the difficulty of establishing the link between the beginning and the end of a sentence and identifying key elements.

Given the performance of these new techniques on this type of problem, it becomes clear that they can also be used in areas not directly related to language, such as video, voice, or even image processing. Even though there were few research papers on this topic in 2020, they point to 2021 being a year of improvements in the state of the art on these subjects.

3. The New Era of Speech Recognition

As we have seen, NLP-related techniques will benefit numerous learning fields in which data temporality plays a very important role, including Speech Recognition. MILA (known for its eminent professor Yoshua Bengio), in collaboration with Nvidia, Samsung and Nuance, has announced the SpeechBrain project, which has not yet revealed all its secrets but could be a “game changer” in 2021.

4. I annotate, you annotate…

We all will annotate this year! The widespread use of AI by companies will lead to an explosion in data labeling solutions, and should be accompanied by an expansion of open-source tools. A few big startups should stand out this year, like Weights and Biases for experiment management, which was democratized last year.

5. AI to be taught earlier in the IT curriculum

Although bachelor’s and master’s degree programs in computer science already deal with the notions of artificial intelligence, in September 2021 the first artificial intelligence teaching programs may arrive in scientific fields upstream of the master’s or bachelor’s degrees.

6. Fake or not fake?

The rollout of Generative Adversarial Networks (GANs) over the past 3 years or so will spawn a real revolution in multimedia, especially in video games and video creation. With great power comes great responsibility: these technologies carry the risk of malicious use of generated images, deepfakes for example. Stay vigilant GAFAM, you are under the magnifying glass!

7. A major open source player?

All indications are that an open-source player centralizing several areas of artificial intelligence application, such as image, sound, text and video, might emerge in the course of 2021 or 2022.

8. Indicators to help reduce energy consumption

The associated power consumption remains significant, especially for the operation and cooling of GPUs. The ecological impact of Artificial Intelligence is a growing topic. We can expect new indicators related to ecological impact in research papers… and why not in some cloud providers’ communications 😉.

OVHcloud already started working on this topic through the Green Cloud Task Force.

9. Responsible and ethical AI certifications

Everyone is talking about ethics and responsibility; it is certain that the subject will be a priority for the major certification bodies.
New ISO certifications dedicated to AI are expected to be launched this year to address critical topics such as reversibility, transparency of algorithms, and application across multiple local contexts while avoiding biases (skin color, age, gender, language, culture, accent, …).

10. Collaborative solutions and containers to secure reproducibility and move to production

As the processes for implementing AI projects within companies become more widely accessible and structured, we are seeing the entire ecosystem looking to use or implement collaborative data science tools based on Project Jupyter’s logic. Real-time collaborative code editing seems like a promising path! Reproducible, production-proof AI implementations seem to converge toward container technology, which should arrive in force in the data scientist community.

And a last one, my personal conclusion

And here is an 11th prediction in the form of a more personal conclusion: the trend towards simplifying usage for developers and data scientists will grow… It is for this reason that we have worked to simplify as much as possible the user experience of our AI services such as AI Training and ML Serving 😉

And what a bonus if these tools are on a trusted cloud 💖

Happy cloud year 2021!

Our partnership with Project Jupyter: the value of an open-source data science community

Jean-Louis Queguiner — Tue, 10 Nov 2020 21:44:24 +0000

Between two major online events, JupyterCon and PyData – both sponsored by OVHcloud – I wanted to share my team’s feedback on our collaboration with Project Jupyter. The partnership, which began a few years back, is still growing and we continue to learn a lot from the open-source experience.

I’d like to introduce Maël Le Gal, one of our Data & AI DevOps for almost three years. In 2019, Maël was heavily invested in making OVHcloud one of the hosters of Binder Hubs.

“Joining the multi-cloud Binder Federation and becoming an infrastructure provider of MyBinder.org was a thrilling project! I experienced open collaboration on a new scale. Typically, all the discussions regarding architecture evolution are held publicly within the Binder community on GitHub and additional day-to-day communications – such as operational tasks – are held collectively on Gitter. It’s a very different approach to building trusting relationships than what we are accustomed to in the IT industry, and I really appreciate it!”, explained Maël.

In hindsight, it was a unique, co-innovation opportunity for OVHcloud to become the first partner of the Binder Federation, and it became even more interesting when the two other partners – Gesis and Turing – joined as well. Thanks to our collaborative methods, we can easily and collectively submit a new feature; once most of the community agrees on the feature, we deploy it on each of the three clouds.
Maël Le Gal, Data & AI DevOps with OVHcloud, explains the daily collaboration of the Binder Federation.

This is what I wrote about my experience with MyBinder last year: “Working with open source requires a very human-centric mindset to build consensus and deliver progress when everyone has different objectives, priorities, timelines and points of view.”

Here is the Binder Federation as of beginning of November 2020.

2020: scaling projects and accelerating

Building on this experience, Maël contributed to another hosting initiative for Project Jupyter earlier this year that quickly scaled: NBViewer – the web application behind The Jupyter Notebook Viewer, hosted by OVHcloud.

“Jupyter was quite happy about our first collaboration and, in March, asked us to replace a previous hosting provider that had disengaged from the open-source community. Because we already had the deployment experience, and methods from Binder, it was a very smooth deployment. We were happy to see that the traffic running on the OVHcloud infrastructure grew from 25% to 100%”, explained Maël.

Now I’d like to introduce another member of my team: Guillaume Salou, our AI Technical Lead, who’s been working closely with Project Jupyter from the beginning and driving the OVHcloud sponsorship of the JupyterCon 2020 digital event as an infrastructure donor.

“We kicked off this project in June with Project Jupyter and NumFOCUS Foundation as well as IBL Education, responsible for the Open EDX deployment at the conference. OVHcloud has offered the underlying infrastructure to host the global event (for the first time fully online) and its educational platform and now we had to deploy it!

The type of infrastructure we’re talking about is, of course, GPUs, coupled with Kubernetes to support all the talks, demos and online experiments that were planned.

I want to highlight three aspects of this infrastructure that meet Project Jupyter’s requirements:

Ability to scale up and down as necessary;

Ability to load balance on other cloud providers and the reversibility of our services;
Simplified rights management and user account creation based on OpenStack.”

This collaboration has steered our efforts towards simplifying features and providing ready-to-use services for the event’s organizers and users; it’s a new milestone in OVHcloud’s AI approach to eliminate all infrastructure set-up and management complexity to facilitate and spread usages.
Guillaume Salou, AI Technical Lead with OVHcloud explains how the collaboration on JupyterCon 2020 has supported internal efforts to simplify the user experience.

To conclude: this partnership with Project Jupyter is here to stay. Thanks a lot to our partner as well as internal teams for making this open and fruitful collaboration a reality!

How PCI-Express works and why you should care? #GPU

Jean-Louis Queguiner — Thu, 09 Jul 2020 10:16:00 +0000

What is PCI-Express ?

Everyone, and I mean everyone, should pay attention when they do intensive Machine Learning / Deep Learning Training.

As I explained in a previous blog post, GPUs have accelerated Artificial Intelligence evolution massively.

However, building a GPU server is not that easy. Failing to create an appropriate infrastructure can have consequences on training time.

If you use GPUs, you should know that there are 2 ways to connect them to the motherboard to allow it to connect to the other components (network, CPU, storage device). Solution 1 is through PCI Express and solution 2 through SXM2. We will talk about SXM2 in the future. Today, we will focus on PCI Express. This is because it has a strong dependency with the choice of adjacent hardware such as PCI BUS or CPU.

NVIDIA V100 with SXM2 design	NVIDIA V100 with PCI express design
Source : https://www.ebizpc.com/NVIDIA-Tesla-V100-900-2G502-0300-000-16GB-GPU-p/900-2g503-0310-000.htm	Source : https://nvidiastore.com.br/nvidia-tesla-v100-16gb

SXM2 design VS PCI Express Design

This is a major element to consider when talking about deep learning as data loading phase is a waste of compute time, so bandwidth between components and GPUs is a key bottleneck in most deep learning training contexts.

How does PCI-Express work and why you should care about the number of PCIe lanes?

What is a PCI-Express lane and are there any associated CPU limitations?

Each V100 GPU uses 16 PCIe lanes. What does that mean exactly?

Extract from NVidia V100 product specification sheet

The “x16” means that the PCIe has 16 dedicated lanes. So… next question: What is a PCI Express lane ?

What’s a PCI Express lane?

2 PCI Express devices with their interconnection: figure inspired by the awesome article – what is chipset and why should I care

PCIe lanes are used to communicate between PCIe devices or between a PCIe device and the CPU. A lane is composed of two differential signaling pairs (four wires): one pair for inbound communications and one for outbound, each carrying the same bandwidth.

Lane communications are similar to network Layer 1 communications – it’s all about transferring bits as fast as possible through electrical wires! However, the technique used for PCIe Link is a bit different as the PCIe device is composed of xN lanes. In our previous example N=16 but it could be any power of 2 from 1 to 16 (1/2/4/8/16).

So… if PCIe is similar to network architecture it means that PCIe layers exist, doesn’t it?

Yes! You are right, PCIe has 4 layers:

**The Physical Layer (aka the Big Negotiation Layer)**

The Physical Layer (PL) is responsible for negotiating the terms and conditions for receiving the raw packets (PLP for Physical Layer Packets) i.e the lane width and the frequency with the other device.

You should be aware that only the smallest number of lanes of the two devices will be used. This is why choosing the appropriate CPU is so important. CPUs can only manage a limited number of lanes, so pairing a nice GPU with 16 PCIe lanes with a CPU that offers only 8 PCIe lanes is as efficient as throwing away half your money because it doesn’t fit in your wallet.
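The negotiation rule is simple enough to write down. A minimal sketch, with hypothetical function names, of how the usable link width follows from both ends:

```python
def negotiated_lanes(device_lanes: int, host_lanes: int) -> int:
    """The link runs at the smaller of the two advertised lane counts."""
    return min(device_lanes, host_lanes)


def unused_fraction(device_lanes: int, host_lanes: int) -> float:
    """Share of the device's lanes left idle after negotiation."""
    return 1 - negotiated_lanes(device_lanes, host_lanes) / device_lanes


# An x16 GPU behind 8 available CPU lanes: half of the GPU's lanes stay unused.
print(negotiated_lanes(16, 8), unused_fraction(16, 8))  # 8 0.5
```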

Packets received at the Physical Layer (aka PHY) come from other PCIe devices or from the system (via Direct Memory Access, DMA, or from the CPU for instance) and are encapsulated in a frame.

The purpose of a Start-of-Frame is to say: “I am sending you data, this is the beginning,” and it takes just 1 byte to say that!

The End-of-Frame word is also 1 byte to say “goodbye I’m done with it”.

This layer implements 8b/10b or 128b/130b encoding, which we will explain later and which is mainly used for clock recovery.

**The Data Link Layer Packet (aka Let’s put this mess in the right order)**

The Data Link Layer Packet (DLLP) is starting with a Packet Sequence Number. This is really important as a packet might get corrupted at one point, so may need to be uniquely identified for retry purposes. The Sequence Number is coded on 2 bytes.

The sequence number is followed by the Transaction Layer Packet, and the packet is closed with the LCRC (Link Cyclic Redundancy Check), which is used to check the integrity of the Transaction Layer Packet (meaning the actual payload).

If the LCRC is validated, the Data Link Layer sends an ACK (ACKnowledge) signal to the emitter through the Physical Layer. Otherwise it sends a NAK (Negative AcKnowledge) signal to the emitter, which resends the frame associated with the sequence number from its replay buffer.
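The sequence number, payload and CRC check can be mimicked in a few lines. This is a simplified sketch, not the PCIe wire format: zlib.crc32 stands in for the LCRC, the frame layout (2-byte sequence number, payload, 4-byte CRC) is reduced to the bare minimum, and the function names are made up for the example.

```python
import struct
import zlib


def build_frame(seq: int, tlp: bytes) -> bytes:
    """Prefix the payload with a 2-byte sequence number and append a 4-byte CRC."""
    body = struct.pack(">H", seq & 0xFFFF) + tlp
    return body + struct.pack(">I", zlib.crc32(body))


def receive_frame(frame: bytes):
    """Check the CRC and answer ("ACK" or "NAK", sequence number)."""
    body = frame[:-4]
    (crc,) = struct.unpack(">I", frame[-4:])
    (seq,) = struct.unpack(">H", body[:2])
    return ("ACK" if zlib.crc32(body) == crc else "NAK", seq)


frame = build_frame(5, b"payload")
print(receive_frame(frame))             # ('ACK', 5)

corrupted = bytearray(frame)
corrupted[3] ^= 0xFF                    # flip bits inside the payload
print(receive_frame(bytes(corrupted)))  # ('NAK', 5)
```

On a NAK, the emitter looks up the frame with that sequence number and sends it again.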

The Transaction Layer

The Transaction Layer is responsible for managing the actual payload (Header + Data) as well as the (optional) message digest ECRC (End to End Cyclic Redundancy Check). This Transaction Layer Packet is coming from the Data Link Layer where it has been decapsulated.

An integrity check is performed if needed/requested. This step checks the integrity of the business logic and ensures no packet was corrupted when passing data from the Data Link Layer to the Transaction Layer.

The header is describing the type of transaction such as:

Memory Transaction
I/O Transaction
Configuration Transaction
or Message Transaction

The Application Layer

The role of the application layer is to handle the User Logic. This layer sends the Header and the data payload to the Transaction Layer. The magic happens in this layer, where data is routed to different hardware components.

How does PCIe communicate with the rest of the world?

A PCIe link uses the packet switching concept from networking, in full duplex mode.

PCIe devices have an internal clock to orchestrate PCIe data transfer cycles. This data transfer cycle is also orchestrated by the reference clock, which sends a signal through a dedicated lane (not part of the x1/2/4/8/16/32 lanes mentioned above). This clock helps both the receiving and emitting devices synchronize for packet communications.

Each PCIe lane is used to send bytes in parallel with the other lanes. The clock synchronization mentioned above helps the receiver put those bytes back in the right order.
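Sending bytes in parallel and putting them back in order can be illustrated with a toy example. It assumes a plain round-robin distribution of bytes over the lanes and leaves out framing and encoding; the function names are hypothetical.

```python
def stripe(data: bytes, lanes: int) -> list:
    """Distribute bytes round-robin: byte i travels on lane i % lanes."""
    return [data[i::lanes] for i in range(lanes)]


def unstripe(per_lane: list) -> bytes:
    """Interleave the per-lane streams back into the original byte order."""
    lanes = len(per_lane)
    out = bytearray(sum(len(chunk) for chunk in per_lane))
    for i, chunk in enumerate(per_lane):
        out[i::lanes] = chunk
    return bytes(out)


data = bytes(range(10))
lanes = stripe(data, 4)
print([list(chunk) for chunk in lanes])  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print(unstripe(lanes) == data)           # True
```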

x16 means 16 lanes of parallel communication on generation 3 of PCIe protocol

You may have the bytes in order but do you have the data integrity at the physical layer ?

To ensure integrity, PCIe devices use 8b/10b encoding for PCIe generations 1 and 2, or the 128b/130b encoding scheme for generations 3 and 4.

These encodings are used to prevent the loss of temporal landmarks, especially when transmitting consecutive similar bits. This process is called “Clock Recovery”.

With 128b/130b, 128 bits of payload data are sent with 2 control bits added to them.

Quick examples

Let’s simplify it with an 8b/10b example: according to IEEE 802.3 clause 36, table 36–1a, from the Ethernet specifications, here is the 8b/10b encoding table:

IEEE 802.3 clause 36, table 36–1a – 8b/10b encoding table

So how can the receiver tell apart all those repeating 0s (Code Group Name D0.0)?

8b/10b encoding is composed of 5b/6b + 3b/4b encodings.

Therefore 00000 000 will be encoded into 100111 0100: the first 5 bits of the original data, 00000, are encoded to 100111 using 5b/6b encoding (rd+); the same goes for the second group of 3 bits of original data, 000, encoded into 0100 using 3b/4b encoding (rd-).

It could also have been 5b/6b encoding rd- and 3b/4b encoding rd+, turning 00000 000 into 011000 1011.

Therefore the original data, which was 8 bits, is now 10 bits because of the control bits (1 control bit for 5b/6b and 1 for 3b/4b).
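The two possible encodings of 00000 000 can be reproduced with a tiny lookup. This sketch only contains the table rows for D0.0, and it keys them by the running disparity seen just before each sub-block is emitted (-1 or +1), which is how an encoder picks between the two variants; the names are illustrative.

```python
# Only the rows needed for D0.0, keyed by the running disparity before the sub-block.
FIVE_SIX_00000 = {-1: "100111", +1: "011000"}  # 5b/6b codes for 00000
THREE_FOUR_000 = {-1: "1011", +1: "0100"}      # 3b/4b codes for 000


def disparity(bits: str) -> int:
    """Number of ones minus number of zeros in a code group."""
    return bits.count("1") - bits.count("0")


def encode_d0_0(running_disparity: int):
    """Encode the byte 00000 000 and return (10-bit code, new running disparity)."""
    six = FIVE_SIX_00000[running_disparity]
    rd = running_disparity + disparity(six)  # the 6-bit group flips the disparity
    four = THREE_FOUR_000[rd]
    rd += disparity(four)                    # the 4-bit group flips it back
    return six + " " + four, rd


print(encode_d0_0(-1))  # ('100111 0100', -1)
print(encode_d0_0(+1))  # ('011000 1011', 1)
```

Either way the 10-bit word contains as many ones as zeros, so the line never stays at the same level for long and the receiver can keep its clock locked.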

But don’t worry I will draft a blog post later dedicated to encoding.

PCIe generations 1 and 2 were designed with 8b/10b encoding, meaning that the actual data transmitted was only 80% of the total load (20%, or 2 bits out of 10, is used for clock synchronization).

PCIe Gen3&4 were designed with 128b/130b meaning that the control bits are now representing only 1.56% of the payload. Quite good isn’t it?

Let’s calculate the PCIe bandwidth together

Here is the table of PCIe versions specifications

Number of Lanes	PCIe 1.0 (2003)	PCIe 2.0 (2007)	PCIe 3.0 (2010)	PCIe 4.0 (2017)	PCIe 5.0 (2019)	PCIe 6.0 (2021)
x1	250 MB/s	500 MB/s	1 GB/s	2 GB/s	4 GB/s	8 GB/s
x2	500 MB/s	1 GB/s	2 GB/s	4 GB/s	8 GB/s	16 GB/s
x4	1 GB/s	2 GB/s	4 GB/s	8 GB/s	16 GB/s	32 GB/s
x8	2 GB/s	4 GB/s	8 GB/s	16 GB/s	32 GB/s	64 GB/s
x16	4 GB/s	8 GB/s	16 GB/s	32 GB/s	64 GB/s	128 GB/s

consortium PCI-SIG PCIe theoretical bandwidth/Lane/Way specification sheet

	PCIe 1.0 (2003)	PCIe 2.0 (2007)	PCIe 3.0 (2010)	PCIe 4.0 (2017)	PCIe 5.0 (2019)	PCIe 6.0 (2021)
Frequency	2.5 GT/s	5.0 GT/s	8.0 GT/s	16 GT/s	32 GT/s	64 GT/s

consortium PCI-SIG PCIe theoretical raw bit rate specification sheet

To obtain such numbers let’s look at the general Bandwidth formula:

BW stands for Bandwidth
MT/s : Mega Transfers per second
Encoding could be 4b/5b, 8b/10b, 128b/130b, …

For PCIe v1.0:

For PCIe v3.0 (the one that interests us for the NVIDIA V100):

Therefore, with 16 lanes for an NVIDIA V100 connected over PCIe v3.0, we have an effective data transfer rate (data bandwidth) of nearly 16GB/s/way (the actual bandwidth is 15.75GB/s/way).
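These figures can be recomputed from the tables above. The function below is a sketch of the bandwidth calculation as reconstructed from the article’s numbers, assuming one-way bandwidth = transfer rate × encoding efficiency × number of lanes / 8 bits per byte; the function and parameter names are illustrative.

```python
def pcie_bandwidth_mb_s(transfer_rate_mt_s: float, payload_bits: int,
                        total_bits: int, lanes: int = 1) -> float:
    """One-way bandwidth in MB/s; each transfer carries one bit per lane."""
    return transfer_rate_mt_s * payload_bits * lanes / (total_bits * 8)


# PCIe v1.0: 2.5 GT/s with 8b/10b encoding, one lane.
print(pcie_bandwidth_mb_s(2500, 8, 10))                          # 250.0

# PCIe v3.0: 8 GT/s with 128b/130b encoding, 16 lanes (NVIDIA V100).
print(round(pcie_bandwidth_mb_s(8000, 128, 130, lanes=16), 2))   # 15753.85
```

The second result, about 15.75 GB/s, matches the one-way figure quoted above; doubling it gives the two-way total.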

You need to be careful not to get confused, as total bandwidth can also be interpreted as two ways bandwidth; in this case we consider total bandwidth x16 to be around 32GB/s.

Note : Another element that we haven’t considered is that the maximum theoretical bandwidth needs to be reduced by around 1 Gb/s for error correction protocols (ECRC and LCRC) as well as the Headers (Start tag, Sequence tag, Header) and Footer (End tag) overheads explained earlier in this blog post.

In conclusion

We have seen that PCI Express has evolved a lot and that it’s based on the same concepts as networking. To get the best from PCIe devices, it is necessary to understand the fundamentals of the underlying infrastructure.

Failing to choose the right underlying motherboard, CPU or bus can lead to major performance bottlenecks and GPU underperformance.

To sum up:

Friends don’t let friends build their own GPU hosts 😉
Jean-Louis Quéguiner, July 1st, 2020

If you liked this post but you want to drill down a bit into the Deep Learning and AI aspect of things don’t hesitate to check out my other blog posts:

Sponsorship of the JupyterCon 2020: sharing values and supporting with infrastructure

Jean-Louis Queguiner — Thu, 02 Jul 2020 08:06:51 +0000

Echoing yesterday’s announcement on the Jupyter blog, OVHcloud is proud to support JupyterCon as platinum sponsor and infrastructure donor.

As you probably know, Jupyter has been a huge enabler for the programming community, with over 140 kernels supported, such as Python, R, Julia, Spark, SAS, Haskell, Ruby, C++, Go, etc.

The Jupyter Project embodies open and collaborative dev communities, both key values in the makeup of our own ecosystem.

In recent years, Jupyter and OVHcloud have worked hand in hand on projects such as Mybinder.org and NbViewer. I would like to thank the Jupyter board and NumFocus for their openness and trust, which has enabled such a successful partnership.

In 2020, we went further, offering a full year of support for JupyterCon and contributing to the event’s success.

With the transition to a fully digital experience, COVID-19 travel restrictions are an opportunity not only to reduce travel costs and ease accessibility, but to limit the carbon footprint tied to knowledge sharing within the ecosystem.

Understanding the anatomy of GPUs using Pokémon

Jean-Louis Queguiner — Wed, 13 Mar 2019 16:25:32 +0000

Please welcome the beautiful newborn of the NVIDIA GPGPU family: Ampere
BLOG UPDATE FROM MAY 14, 2020

Congratulations

In the previous episode…

In our previous blog post about Deep Learning, we explained that this technology is all about massive parallel matrix computation, and that these computations are simplistic operations: + and x.

Fact 1: GPUs are good for (drum roll)…

Once you get that Deep Learning is just massively parallel matrix multiplications and additions, the magic happens. General-Purpose Graphics Processing Units (GPGPUs), i.e. GPUs or variants of GPUs designed for something other than graphics processing, are perfect for…

matrix multiplications and additions !
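To make this concrete, here is a minimal pure-Python sketch of one fully connected neural network layer (without its activation function), written as nothing more than a matrix multiplication followed by an addition. The function names and the tiny example values are made up for this illustration; real frameworks run the same arithmetic on much larger matrices.

```python
def matmul(a, b):
    """Multiply matrix a (rows x inner) by matrix b (inner x cols)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def add_bias(m, bias):
    """Add the bias vector to every row of matrix m."""
    return [[value + bias[j] for j, value in enumerate(row)] for row in m]


def dense_layer(x, weights, bias):
    """One dense layer without activation: only x and + operations."""
    return add_bias(matmul(x, weights), bias)


x = [[1, 2]]                    # one input sample with two features
weights = [[1, 0, 2],
           [0, 1, 3]]           # 2 inputs -> 3 outputs
bias = [10, 20, 30]

print(dense_layer(x, weights, bias))  # [[11, 22, 38]]
```

Every output cell can be computed independently of the others, which is exactly the kind of work that can be spread over thousands of cores.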

Perfect isn’t it ? But why ? Let me tell you a little story

Fact 2: There was a time when GPUs were just GPUs

Yes, you read that correctly…

The first GPUs in the 90s were designed in a very linear way. The engineer took the engineering process used for graphical rendering and implemented it into the hardware.

To keep it simple, this is what a graphical rendering process looks like:

Uses for GPUs included transformation, building lighting effects, building triangle setups and clipping, and integrating rendering engines at a scale that was not achievable at the time (tens of millions of polygons per second).

The first GPUs integrated the various steps of image processing and rendering in a linear way. Each part of the process had predefined hardware components associated with vertex shaders, tessellation modules, geometry shaders, etc.

In short, graphics cards were initially designed to perform graphical processing. What a surprise!

Fact 3: CPUs are sports cars, GPUs are massive trucks

As explained earlier, for image processing and rendering, you don’t want your image to be generated pixel by pixel – you want it in a single shot. That means that every pixel of the image – representing every object the camera is pointed at, at a given time, in a given position – needs to be calculated at once.

This is in complete contrast with CPU logic, where operations are meant to be executed sequentially. As a result, GPGPUs needed a massively parallel general-purpose architecture to be able to process all the points (vertices), build all the meshes (tessellation), build the lighting, transform objects from the absolute frame of reference, apply textures, and perform shading (I’m probably still missing some parts!). However, the purpose of this blog post is not to look in depth at image processing and rendering; we will do that in a future blog post.

As explained in our previous post, CPUs are like sports cars, able to calculate a chunk of data really fast with minimal latency, while GPUs are trucks, moving lots of data at once, but suffering from latency as a result.
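The analogy can be sketched as a toy cost model in Python. All the numbers below (capacity and trip time of each vehicle) are invented purely for illustration and do not describe any real CPU or GPU; the point is only to show where the crossover comes from.

```python
import math

# Hypothetical vehicles: (items carried per trip, time units per trip).
SPORTS_CAR = (2, 1)    # "CPU": tiny payload, very low latency
TRUCK = (1000, 10)     # "GPU": huge payload, higher latency


def delivery_time(n_items, vehicle):
    """Time for one vehicle to deliver n_items, one trip after another."""
    capacity, trip_time = vehicle
    trips = math.ceil(n_items / capacity)
    return trips * trip_time


for n_items in (2, 10_000):
    print(n_items, delivery_time(n_items, SPORTS_CAR), delivery_time(n_items, TRUCK))
# 2 1 10
# 10000 5000 100
```

For a couple of items the sports car wins easily; once there is a lot of data to move, the truck’s latency is amortised and it finishes far sooner.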

Here is a nice video from Mythbusters, where the two concepts of CPU and GPU are explained:

Fact 4: 2006 – NVIDIA killed the image processing Taylorism

Previously, image processing relied on specialised manpower (hardware) at every stage of the production line in the image factory.

This all changed in 2006, when NVIDIA decided to introduce General-Purpose Graphics Processing Units built on Arithmetic Logic Units (ALUs), aka CUDA cores, which were able to run multi-purpose computations (a bit like a Jean-Claude Van Damme of GPU computation units!).

GoDaddy Commercial (2013) featuring Jean-Claude Van Damme Source : https://imgur.com/r/gifs/PvuZxBZ

Even today, modern GPU architectures (such as Fermi, Kepler or Volta) still include non-general cores: Special Function Units (SFUs), which run high-performance mathematical and graphical operations such as sine, cosine, reciprocal and square root, and Texture Mapping Units (TMUs), which handle the high-dimension matrix operations involved in image texture mapping.

Fact 5: GPGPUs can be explained simply with Pokémon!

GPU architectures can seem difficult to understand at first, but trust me… they are not!

Here is my gift to you: a Pokédex to help you understand GPUs in simple terms.

Here’s how you use it…

You basically have four families of cards:

The Micro-Architecture Family

This family will already be known to many of you. We are, of course, talking about Fermi, Maxwell, Kepler, Volta, Ampere, etc.

A beautiful picture of the newborn with all the other families

The Architecture Family

This is the center, where the magic happens: orchestration, cache, workload scheduling… It’s the brain of the GPU.

The Multi-Core Units (aka CUDA Cores) Family

This represents the physical core, where the maths computations actually happen.

The Programming Model Family

The different layers of the programming model are used to abstract the GPU’s parallel computation for the programmer. They also make the code portable to any GPU architecture.
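To give a feel for these layers, here is a small Python simulation of the CUDA-style hierarchy, where a grid is made of blocks and each block is made of threads. The names `launch` and `vector_add` are hypothetical, and the loops run one after the other here, whereas a real GPU would schedule the blocks and threads in parallel. Only the index arithmetic (block index × block size + thread index) mirrors what a real kernel does.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Simulate a kernel launch: call the kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    """Each simulated thread handles exactly one element of the vectors."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(out):                        # surplus threads do nothing
        out[i] = a[i] + b[i]


a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
out = [0] * len(a)

# 2 blocks of 4 threads = 8 threads for 5 elements.
launch(vector_add, 2, 4, a, b, out)
print(out)  # [11, 22, 33, 44, 55]
```

Because the kernel only knows about its own indices, the same code works whether the hardware offers a handful of cores or several thousand, which is what makes it portable across architectures.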

How to play

Start by choosing a card from the Micro-Architecture family
Look at the components, and choose the appropriate card from the Architecture family
Look at the components within the Micro-Architecture family and pick them from the Multi-Core Units family, then place them under the Architecture card
Now, if you want to know how to program a GPU, place the Programming Model – Multi-Core Units special card on top of the Multi-Core Units cards
Finally, on top of the Programming Model – Multi-Core Units special card, place all the Programming Model cards near the SM
You should then have something that looks like this:

Examples of card configurations:

Fermi

Kepler

Maxwell

Pascal

Volta

Turing

After playing around with different Micro-Architectures, Architectures and Multi-Core Units for a bit, you should see that GPUs are just as simple as Pokémon!

Enjoy the attached PDF, which will allow you to print your own GPU Pokédex. You can download it here: GPU Cards Game