How to virtualize a PCI-e GPU in the cloud?

A quick introduction to virtualization

The purpose of virtualization is to isolate the user's software environment from the hardware environment. The orchestration between these virtual environments is handled by the hypervisor. The hypervisor also gives a Virtual Machine (VM) the ability to execute instructions that are not directly compatible with the underlying architecture/hardware. VT-x and AMD-V are the hardware virtualization technologies provided by Intel and AMD respectively.
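As a quick illustration, on a Linux host you can check whether these extensions are exposed by looking at the CPU flags (a minimal sketch; Intel VT-x shows up as the "vmx" flag, AMD-V as "svm"):

    # Minimal check for hardware virtualization extensions on a Linux host:
    # Intel VT-x is exposed as the "vmx" CPU flag, AMD-V as "svm".
    flags = set()
    with open("/proc/cpuinfo") as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())

    if "vmx" in flags:
        print("Intel VT-x available")
    elif "svm" in flags:
        print("AMD-V available")
    else:
        print("No hardware virtualization extension exposed")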


Some of you might know QEMU, NEMU, KVM and libvirt. All of these tools have specific roles, as explained below:

KVM (Kernel-based Virtual Machine), family: Virtualization
  • Daemon that manipulates VMs. Unlike QEMU, KVM leverages the virtualization extensions provided by the CPU itself (VT-x or AMD-V), without any emulation.

libvirt, family: Virtualization
  • Manages/manipulates virtualization through an API, a CLI and a daemon (libvirtd).

QEMU (Quick EMUlator), family: Emulation
  • Emulates the processor and peripherals: it basically converts guest instructions (for instance ARM) into instructions compatible with the actual hardware (for instance x86).
  • Supports all virtualizations/emulations.
  • Has a reputation for being slow.

NEMU, family: Emulation
  • NEMU is a fork of QEMU that focuses on modern CPUs and advanced virtualization features, in order to speed up the existing QEMU implementation.
  • It focuses on KVM support to better leverage the CPU's virtualization extensions and therefore reduce the need to translate instructions for the CPU.
  • It does not support all hardware.
  • It is faster than QEMU.

The different components can also be represented in the form of a stack.

Now… Let's talk about AI and GPUs

Why AI practitioners love virtualization and containerization

When running AI workloads, it's often recommended to use Docker. Indeed, a big part of the pain in AI workload management is making sure every library is compiled against the right CUDA-accelerated libraries (cuBLAS, cuDNN), associated with the right CUDA version and the right driver.

Playing around with deep learning frameworks and CUDA versions guarantees you a massive headache by the end of the day.

This is why virtualization is awesome: it allows you to break and rebuild your environment at will. Adding a container technology on top will further enhance reproducibility and flexibility while playing with different frameworks.
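As a minimal illustration of this version-matching pain (assuming a container image with PyTorch installed), you can print the CUDA and cuDNN versions the framework was built against and check that a GPU is actually visible:

    # Minimal sketch, assuming PyTorch is installed in the container image:
    # the CUDA/cuDNN versions the framework was built against must be
    # compatible with the NVIDIA driver exposed by the host.
    import torch

    print("PyTorch version:", torch.__version__)
    print("CUDA (build):   ", torch.version.cuda)
    print("cuDNN:          ", torch.backends.cudnn.version())
    print("GPU visible:    ", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:         ", torch.cuda.get_device_name(0))

When these versions do not line up with the host driver, you typically end up with the familiar "driver version is insufficient" class of errors, which containers help you avoid rebuilding your whole environment for.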

This issue was identified a while ago by NVIDIA, when they decided to push their NVIDIA GPU Cloud (NGC) platform.

Handle GPUs in a virtualized environment

When we presented the virtualization layers earlier, we didn't mention GPUs, as they are not available on every hardware platform. There are multiple ways to handle GPUs in a virtualized environment; the two discussed in this article are PCI Express pass-through (dedicating the physical card to a single VM) and GPU virtualization (splitting one card across several VMs).

Here are 2 representations of an AI virtualized and dockerized environment.

Now let's talk about the CPU/GPU virtualization trap.

Why PCI Express pass-through, and what are the associated limitations?

It was important to define how virtualization works in order to understand its limitations. Virtualization at the KVM level allocates CPU time to each VM in CPU cycles. At each CPU cycle, the kernel may move a virtual machine from its currently attached CPU to another one, and this is when things can get dirty.

The concept of PCI Express pass-through is that KVM does not emulate the GPU address, but passes the instructions directly to the Graphics Processing Unit at the physical address of the device. The hypervisor therefore gives the virtual machine plain access to the GPU through the PCI-e bus. As a result, GPU usage is not mediated by the hypervisor, meaning that you, as a customer, access the raw hardware at full performance.
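As a rough sketch of what this looks like at the hypervisor level (using the libvirt Python bindings, with a hypothetical PCI address and domain name, and assuming IOMMU/VFIO is already configured on the host), the GPU is handed to the VM as a PCI host device:

    # Sketch only: attach a physical GPU to a VM in PCI pass-through mode
    # via libvirt. The PCI address 0000:3b:00.0 and the domain name
    # "vm-gpu-1" are hypothetical; IOMMU/VFIO must already be enabled.
    import libvirt

    HOSTDEV_XML = """
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    """

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("vm-gpu-1")
    # Persist the device in the domain definition.
    dom.attachDeviceFlags(HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_CONFIG)
    conn.close()

The same result can be obtained declaratively by adding the <hostdev> element to the domain XML before starting the VM.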

CPU is a sports car, GPU is a massive truck

Because of the CPU's limited throughput (a CPU has roughly 10x lower throughput than a GPU), performing even a simple operation can sometimes be slower on the CPU than transferring the data to the GPU and computing it there.

Think about it: why would you move your suitcase from the sports car into the truck if it's the only thing you have to move from Berlin to Paris? So it's a constant trade-off between having an operation executed on the CPU or on the GPU. Thankfully, once again, those crazy AI framework core developers have made these arbitrations for us.
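Here is a minimal sketch of that trade-off (assuming PyTorch and a CUDA GPU are available): for a small matrix the PCIe transfer dominates and the CPU wins, while for a large one the GPU wins despite the transfer.

    # Sketch: compare "compute on CPU" vs "transfer to GPU, compute, transfer
    # back" for a small and a large matrix multiplication.
    import time
    import torch

    _ = torch.zeros(1).cuda()  # warm up the CUDA context so timings are not skewed

    def timed(fn, repeat=5):
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeat):
            fn()
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / repeat

    for n in (256, 8192):
        x = torch.randn(n, n)
        cpu_time = timed(lambda: x @ x)                        # stays in the sports car
        gpu_time = timed(lambda: (x.cuda() @ x.cuda()).cpu())  # load the truck, drive, unload
        print(f"{n}x{n}: CPU {cpu_time*1e3:.1f} ms, PCIe+GPU {gpu_time*1e3:.1f} ms")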

Does CPU performance matter even when you are using a GPU?

Yes! As explained before (in the blog post How PCI Express works and why you should care), the CPU is a major bottleneck when pushing data from the CPU to the GPU through the PCIe mechanism.
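A minimal way to see this (again assuming PyTorch and a CUDA GPU) is to measure the host-to-device copy itself; how the CPU side prepares the buffer (pageable vs pinned memory) already changes the bandwidth you actually get over PCIe:

    # Sketch: measure host-to-device bandwidth over PCIe for a 1 GB buffer,
    # with pageable memory and with pinned (page-locked) memory.
    import time
    import torch

    size_mb = 1024
    pageable = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8)
    pinned = torch.empty_like(pageable).pin_memory()

    for name, buf in (("pageable", pageable), ("pinned", pinned)):
        torch.cuda.synchronize()
        start = time.perf_counter()
        buf.to("cuda", non_blocking=True)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        print(f"{name}: {size_mb / elapsed:.0f} MB/s")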

Practical Case

Let's imagine a host with 2 CPU sockets, each of them linked to its own PCIe slots. The schema below shows a simplified hardware architecture associated with the implementation of GPUs using PCIe.

Each CPU socket is attached to its PCIe slots through PCIe lanes (if you don't know what PCIe lanes are, read this: https://www.ovh.com/blog/how-pci-express-works-and-why-you-should-care-gpu/). Looking back at the schema, accessing PCIe #1 from CPU#2 will be sequenced this way:

  • The VM requests GPU#1 from the CPU it is attached to for the current clock cycle, meaning CPU#2 in our scenario.
  • CPU#2 then requests access to GPU#1 by calling CPU#1, as GPU#1 is physically attached to CPU#1 (meaning the socket of CPU#1) through the PCIe link.
  • This back-and-forth overhead is estimated to be around 3%* according to our benchmarks.

* The estimated 3% was measured prior to the L1TF Intel vulnerability, which has since been patched with a countermeasure that invalidates/flushes the L1 CPU cache at the kernel level. The relative overhead might therefore be a little smaller now, since a VM has to rebuild its L1 cache even when it stays on the same CPU.
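On Linux you can check which NUMA node (i.e. which CPU socket) a given GPU is physically attached to, which helps you see whether a VM's vCPUs and its GPU live on the same side. A minimal sketch, with a hypothetical PCI address (list yours with lspci):

    # Sketch: read the NUMA node a (hypothetical) GPU is attached to.
    # A value of 0 or 1 maps to one of the two CPU sockets in the schema
    # above; -1 means the platform does not report NUMA affinity.
    gpu_pci_address = "0000:3b:00.0"  # hypothetical, check with `lspci`

    with open(f"/sys/bus/pci/devices/{gpu_pci_address}/numa_node") as f:
        print("GPU attached to NUMA node:", f.read().strip())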

Example of a full virtualization life cycle with multiple GPUs and VMs

In the following section, we will look at a sequence of 4 clock cycles and detail the communication flow between the GPU(s), the VMs they are attached to, and the CPU and RAM managed by the hypervisor to make these VMs run.

The initial setup

In the initial setup we will consider 3 VMs running on the same physical host. The physical host is composed of:
  • 2 Physical CPUs (2 CPU sockets)
  • 2 RAM components
  • 4 GPUs (2 attached to each CPU socket)
The scenario is sequenced with t0 as the initial clock cycle. Each cycle is considered to last δt (delta t). Therefore:
  • t0 is the initial setup
  • t0 + δt is the system state after 1 CPU cycle
  • t0 + 2δt is the system state after 2 CPU cycles
  • t0 + 3δt is the system state after 3 CPU cycles
How to read this sequence? It's pretty simple; let's take the example of VM3:
  • VM3 is assigned to CPU#1 in the initial setup (yellow),
  • then VM3 is moved to CPU#2 for the 2nd cycle (green),
  • then VM3 stays on CPU#2 for the 3rd cycle (red),
  • finally, VM3 is moved back to CPU#1 for the 4th and final cycle (purple).

Let's look at the full sequence with the associated communication flows.

First cycle communication

During the first cycle, VM#1 has access to GPU#4 and RAM#2.

Second cycle communication

During the second cycle, VM#1 accesses GPU#4 and RAM#2 through CPU#1 and therefore incurs an overhead.

Third cycle communication: the speculative execution special case

During the third cycle, VM#1 still has RAM#2 and GPU#4 attributed, and this access is again requested through CPU#1.

Speculative execution explained

During every cycle, something special may happen depending on your execution plan (though no longer since the L1TF vulnerability on Intel CPUs): it's called speculative execution.

Basically, the CPU anticipates the execution plan (called the execution pipeline) and stores potential future results in its L1 cache (the cache of the CPU itself).

This anticipation of computation (computing ahead) is performed while the CPU's computing power is otherwise at rest, i.e. while it transmits the resulting data of the current execution to the next computational stage. This means the branch computed ahead might turn out to be irrelevant if, based on the defined algorithm, the results of the previous stage do not need to be evaluated (basically if the condition of the branch, like an if in your code, is not met).

This speculative execution will be relevant if and only if the 3 following conditions are met:

  • Speculative execution mode is activated (which is no longer the case since the L1TF Intel vulnerability and Spectre).
  • The CPU instruction of the following cycle needs to be performed (scheduled) on the same CPU as the previous stage.
  • The condition of the branch needs to be met.

Therefore, if speculative execution were enabled, this 3rd cycle could have been optimized, because VM#1 is running on the same CPU (CPU#1) as during the previous CPU cycle (the 2nd cycle).

Fourth cycle

In this cycle, VM#1 is scheduled back onto CPU#2 and won't incur any overhead, as the attached hardware is directly linked to that CPU socket.

CPU pinning

To avoid the back and forth explained above, a concept called CPU pinning was introduced.
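At the libvirt level, pinning looks roughly like this (a sketch with the libvirt Python bindings, using a hypothetical domain name): each vCPU is restricted to a fixed set of physical CPUs, so the scheduler stops moving it across sockets.

    # Sketch: pin vCPU 0 of the hypothetical domain "vm-gpu-1" to physical
    # CPUs 2 and 3, so it stays on the socket its GPU is attached to.
    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("vm-gpu-1")

    host_cpus = conn.getInfo()[2]  # number of logical CPUs on the host
    cpumap = tuple(i in (2, 3) for i in range(host_cpus))
    dom.pinVcpu(0, cpumap)         # vCPU 0 -> physical CPUs 2 and 3 only
    conn.close()

In practice you rarely call this by hand; orchestrators such as OpenStack or VMware expose CPU pinning as a configuration option, as described below.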

The OpenStack case

OpenStack is able to handle NUMA-node placement for you: it can teach libvirt to statically pin vCPUs to physical CPUs, so that the vCPUs no longer “move around” as described above.

You can find more information regarding this topic here. This process is implemented on OVHcloud Public Cloud offers, which are based on OpenStack.

The VMware case

Here is an example of a VMware CPU pinning configuration. This feature is available on OVHcloud Private Cloud offers, as well as on OVHcloud's new managed VMware on Bare Metal offers.

Splitting a GPU thanks to GPU virtualization: understanding the performance impact

One might want to split a GPU across multiple VMs using GPU virtualization. It's especially interesting in terms of cost when you have a large-memory card such as the NVIDIA V100S, which has 32GB of VRAM, versus the V100, which only has 16GB (check this blog post for further details).

Illustration of a GPU split in two: a card that is supposed to offer 32GB but is assigned only 16GB of VRAM.
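From inside the VM, the split is directly visible in the amount of VRAM reported by the driver; a minimal check (assuming PyTorch) looks like this:

    # Sketch: check how much VRAM the (virtualized) GPU exposes to this VM,
    # e.g. ~16 GB for half of a 32 GB V100S.
    import torch

    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024**3:.1f} GB visible")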

We decided not to go this way for the following reasons:

  • the first reason is that, as described in the figure below, the impact of splitting GPU#4, for instance, is that CPU#2 might be overloaded.
  • the second reason is that the computing power will be shared (meaning CUDA cores, in NVIDIA's case), so the GPU potentially cannot be used at its full advertised capacity by each VM.

In conclusion

In this article, we presented how virtualization works, why the orchestration of computing cycles matters, and why we decided to go with PCI Express pass-through mode for our GPU VM offers.

You can also check our blog post explaining how to manage GPU pools efficiently in the cloud using our new AI Training offer:

Managing GPU pools efficiently in AI pipelines

Going Further

If you want to go further, I suggest you read these two excellent articles.

Read my other related blogs

How PCI-Express works and why you should care? #GPU
Understanding the anatomy of GPUs using Pokémon
Deep Learning explained to my 8-year-old daughter
Distributed Training in a Deep Learning Context
Head of #Data & #AI @OVH
Product Unit Lead and Technical Director #Data & #AI. Passionate about Cyber and reproducing AI research papers.
Leading the Databuzzword French podcast: spreaker.com/user/guignol and https://www.youtube.com/c/Databuzzword