A quick introduction to virtualization
The purpose of virtualization is to isolate the user software environment from the hardware environment. The orchestration between these virtual environments is handled by the hypervisor. The hypervisor also lets a Virtual Machine (VM) execute instructions that are not directly compatible with the underlying architecture/hardware. VT-x and AMD-V are the Intel and AMD hardware virtualization technologies, respectively.
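To make the VT-x / AMD-V point concrete, here is a minimal sketch that checks which extension a Linux host advertises. It assumes the usual `/proc/cpuinfo` layout, where Intel CPUs expose the `vmx` flag and AMD CPUs expose `svm`. The function name `virtualization_extension` is ours, for illustration only.

```python
def virtualization_extension(cpuinfo_text):
    """Return 'VT-x', 'AMD-V' or None from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            if "vmx" in flags:   # Intel virtualization extension
                return "VT-x"
            if "svm" in flags:   # AMD virtualization extension
                return "AMD-V"
    return None


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(virtualization_extension(f.read()))
    except FileNotFoundError:
        print("no /proc/cpuinfo on this system")
```

If the function returns `None` on a machine you expected to support virtualization, the extension may simply be disabled in the firmware settings.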
Some of you might know QEMU/NEMU/KVM/Lib Virt. Each of these tools has a specific role, as explained below:
|Tool||Type||Role|
|KVM (Kernel-based Virtual Machine)||Virtualization||- Kernel module that manipulates VMs. Unlike QEMU, KVM leverages the virtualization extensions provided by the CPU itself (VT-x or AMD-V), without any emulation.|
|Lib Virt||Virtualization||- Manages/manipulates virtualization (API, CLI and a daemon: libvirtd).|
|QEMU||Emulation||- Emulates the processor and peripherals. It basically converts input guest instructions (for instance ARM) into instructions compatible with the actual hardware (for instance x86).
- Supports all virtualizations/emulations.
- It has a reputation for being slow.|
|NEMU||Emulation||- NEMU is a fork of QEMU that focuses on modern CPUs with advanced virtualization features, to speed up the existing QEMU implementation.
- NEMU focuses on KVM support to better leverage the CPU’s virtualization extensions and therefore reduce the need to translate instructions for the CPU.
- It doesn’t support all hardware.
- It’s faster than QEMU.|
The different components can also be represented in the form of a stack.
Now… let’s talk about AI and GPUs
Why AI practitioners love virtualization and containerization
When running AI workloads, it’s often recommended to use Docker. Indeed, a big part of the pain of AI workload management is making sure every library is compiled against the right CUDA-accelerated libraries (cuBLAS, cuDNN), matched to the right CUDA version and the right driver.
Playing a bit with Deep Learning frameworks and CUDA versions guarantees you a massive headache by the end of the day.
This is why virtualization is an awesome tool that allows you to break and rebuild your environment. Adding a container technology on top will also help you enhance reproducibility and flexibility while playing with different frameworks.
Nvidia identified this issue a while ago, when they decided to push their Nvidia GPU Cloud (NGC) platform.
Handle GPUs in a virtualized environment
When we presented the virtualization layers earlier, we didn’t mention GPUs, as they are not available on every hardware platform. There are multiple ways to handle GPUs in a virtualized environment:
- emulation, through nvemulate (Nvidia) or the OpenCL device emulator (AMD), where you emulate the instructions (CUDA or OpenCL).
- virtualization of the GPU (vGPU) or of a slice of it, where instructions are handed to the GPU through an intermediate interpreter. In this case the GPU is accessed through a virtualized hardware address and not directly. Multiple solutions exist on the market (Citrix, VMware, NVIDIA GRID for VDI, NVIDIA Quadro Virtual Data Center Workstation, NVIDIA Virtual Compute Server (vCS), …)
- passthrough-ization (I know… this word doesn’t exist…), where you have direct access to the physical hardware. In this case we use VFIO, which assigns the physical address of the GPU to the guest VM. This mode provides slightly better performance (between 1 and 2 percent).
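For the pass-through option, VFIO hands devices to a guest per IOMMU group, so a useful first step on a Linux host is to see which PCI devices share a group with the GPU. The sketch below assumes the standard sysfs layout under `/sys/kernel/iommu_groups`. The function name `iommu_groups` and the `root` parameter are ours, added so the code can be tried against a fake directory tree.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled, or not a Linux host: nothing to report.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if devices_dir.is_dir():
            groups[group_dir.name] = sorted(p.name for p in devices_dir.iterdir())
    return groups


if __name__ == "__main__":
    for group, devices in sorted(iommu_groups().items()):
        print(group, devices)
```

An empty result usually means the IOMMU is not enabled on the host, in which case pass-through cannot be set up.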
Here are 2 representations of an AI virtualized and dockerized environment.
Now let’s talk about the CPUs/GPUs virtualization trap.
Why PCI Express pass-through, and what are the associated limitations?
It was important to define how virtualization works in order to understand its limitations. Virtualization at the KVM level allocates CPU time to each VM in CPU cycles. At each CPU cycle, the kernel can potentially move the Virtual Machine from its current CPU to another CPU, and this is when things can get dirty.
The concept of PCI Express pass-through is that KVM does not emulate the GPU address but passes instructions directly to the physical address of the Graphics Processing Unit. The hypervisor therefore gives the Virtual Machine plain access to the GPU through the PCIe bus. As a result, the hypervisor does not control GPU usage, meaning that you, as a customer, access the raw hardware at full performance.
CPU is a sports car, GPU is a massive truck
Because of the CPU’s limited throughput (a CPU has around 10x lower throughput than a GPU), performing an operation on the CPU is sometimes slower than transferring it to the GPU and computing it there.
Think about it: why would you take your suitcase out of the sports car and put it in the truck if it’s the only thing you have to move from Berlin to Paris? So it’s a constant trade-off between executing an operation on the CPU or on the GPU. Thankfully, once again, those crazy AI framework core developers made those trade-offs for us.
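The sports car vs truck trade-off can be sketched with a toy cost model: the GPU computes faster but pays a transfer cost over PCIe first. Every number below is an illustrative assumption (only the 10x throughput ratio comes from the text), and the function names are ours. Real frameworks use far more refined heuristics.

```python
def cpu_time(flops, cpu_flops_per_s):
    """Seconds to run the operation directly on the CPU."""
    return flops / cpu_flops_per_s


def gpu_time(flops, payload_bytes, gpu_flops_per_s, pcie_bytes_per_s):
    """Seconds to move the data over PCIe, then compute on the GPU."""
    return payload_bytes / pcie_bytes_per_s + flops / gpu_flops_per_s


def best_device(flops, payload_bytes,
                cpu_flops_per_s=1e11,    # assumed CPU throughput
                gpu_flops_per_s=1e12,    # assumed 10x higher GPU throughput
                pcie_bytes_per_s=1e10):  # assumed PCIe bandwidth
    """Return 'cpu' or 'gpu', whichever the toy model predicts is faster."""
    on_cpu = cpu_time(flops, cpu_flops_per_s)
    on_gpu = gpu_time(flops, payload_bytes, gpu_flops_per_s, pcie_bytes_per_s)
    return "cpu" if on_cpu <= on_gpu else "gpu"


# A single suitcase: tiny computation, the transfer dominates.
print(best_device(flops=1e6, payload_bytes=1e6))    # cpu
# A full house move: heavy computation, the truck wins.
print(best_device(flops=1e12, payload_bytes=1e8))   # gpu
```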
Does CPU performance matter even when you are using a GPU?
Yes! As explained before (in the blog post How PCI express works and why you should care), CPUs are major bottlenecks when pushing data from CPU to GPU over PCIe.
Let’s imagine a host with 2 CPU sockets, each linked to its own PCIe slots. The diagram below shows a simplified hardware architecture for attaching GPUs over PCIe.
Each CPU socket is attached to its PCIe slots through PCIe lanes (if you don’t know what PCIe lanes are, read this: https://www.ovh.com/blog/how-pci-express-works-and-why-you-should-care-gpu/); looking back at the diagram, accessing PCIe #1 from CPU#2 is sequenced this way:
- The VM requests GPU#1 from its attached CPU for the current clock cycle, meaning CPU#2 in our scenario.
- CPU#2 then requests access to GPU#1 by calling CPU#1, as GPU#1 is physically attached to CPU#1 (I mean the socket of CPU#1) through the PCIe link.
- This back-and-forth overhead is estimated at around 3%* according to our benchmarks.
* The estimated 3% was measured before the L1TF Intel vulnerability, which has since been patched with a countermeasure that invalidates/flushes the L1 CPU cache at the kernel level. The overhead might therefore be a little lower now, since VMs have to rebuild the L1 cache even if they stay on the same CPU.
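To find out which socket a GPU hangs off, Linux exposes the NUMA node of each PCI device in sysfs. The sketch below assumes the standard `/sys/bus/pci/devices/<address>/numa_node` file. The helper names and the PCI address used in the example are hypothetical.

```python
from pathlib import Path


def pci_numa_node(pci_address, root="/sys/bus/pci/devices"):
    """Return the NUMA node of a PCI device, or None when it is unknown."""
    node_file = Path(root) / pci_address / "numa_node"
    try:
        node = int(node_file.read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when it has no NUMA information for the device.
    return node if node >= 0 else None


def is_remote_access(cpu_node, pci_address, root="/sys/bus/pci/devices"):
    """True when the device sits on a different NUMA node than the CPU."""
    gpu_node = pci_numa_node(pci_address, root)
    return gpu_node is not None and gpu_node != cpu_node


if __name__ == "__main__":
    # Hypothetical address: replace it with the one lspci shows for your GPU.
    print(pci_numa_node("0000:3b:00.0"))
```

When `is_remote_access` is true, the request takes the cross-socket path described in the steps above.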
Example of a full virtualization life cycle with multiple GPUs and VMs
In the following section we look at 4 sequences of clock cycles and detail the communication flow between the GPU(s), the VMs they are attached to, and the CPU and RAM managed by the hypervisor to run these VMs.
The initial setup
In the initial setup we will consider 3 VMs running on the same physical host. The physical host is composed of:
- 2 Physical CPUs (2 CPU sockets)
- 2 RAM components
- 4 GPUs (2 attached to each CPU socket)
The scenario is sequenced with t0 as the initial clock cycle. Each cycle is considered to last δt (delta t). Therefore:
- t0 is the initial setup
- t0 + δt is the system state after 1 CPU cycle
- t0 + 2δt is the system state after 2 CPU cycles
- t0 + 3δt is the system state after 3 CPU cycles
How to read this sequence ? It’s pretty simple; let’s take the example of VM3:
- VM3 is assigned to CPU#1 at the initial setup (Yellow),
- then VM3 is moved to CPU#2 for the 2nd cycle (green),
- then VM3 stays on CPU#2 for the 3rd cycle (red),
- finally, VM3 is moved to CPU#1 for the 4th and final cycle (purple).
Let’s look at the full sequence with the associated communication flows.
First cycle communication
At the first cycle, VM#1 has access to GPU#4 and RAM#2.
Second cycle communication
At the second cycle, VM#1 accesses GPU#4 and RAM#2 through CPU#1 and therefore incurs an overhead.
Third cycle communication : the speculative execution special case.
During the third cycle, VM#1 still has RAM#2 and CPU#2 attributed, and this access is again requested through CPU#1.
Speculative execution explained
During every cycle, something special may happen depending on your execution plan (though no longer on Intel CPUs since the L1TF vulnerability): it’s called speculative execution.
Basically, the CPU anticipates the execution plan (called the execution pipeline) and stores potential future results in its L1 cache (the cache of the CPU itself).
This anticipated computation (compute ahead) is performed while the CPU’s computing power is at rest, meaning while it transmits the results of the completed execution to the next computational stage. This means that the branch computed ahead might turn out to be irrelevant if the algorithm does not need it given the results of the previous stage (basically, if the condition of the branch, like an if in your code, is not met).
This speculative execution is relevant if and only if the 3 following conditions are met:
- Speculative execution mode is activated (which is no longer the case since the L1TF Intel vulnerability, a Spectre-like flaw).
- The CPU instruction of the following cycle is performed (scheduled) on the same CPU as the previous stage.
- The condition of the branch is met.
Therefore, if speculative execution were enabled, this 3rd cycle could have been optimized, because VM#1 is running on the same CPU (CPU#1) as in the previous CPU cycle (2nd cycle).
In the fourth and final cycle, VM#1 is scheduled back to CPU#2 and has no overhead, as the attached hardware is directly linked to that CPU socket.
To avoid the back and forth explained above, a concept called CPU pinning was introduced.
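The four cycles of VM#1 can be replayed with a few lines of code. This is only a sketch of the walkthrough: the schedule is read from the text, the 3% figure is the benchmark estimate quoted earlier, and the GPU-to-socket mapping (two GPUs per socket, in numeric order) is our assumption.

```python
REMOTE_OVERHEAD = 0.03  # ~3% per remote cycle, from the benchmark estimate

# Assumption: GPUs are attached to the sockets two by two, in numeric order.
GPU_SOCKET = {"GPU#1": 1, "GPU#2": 1, "GPU#3": 2, "GPU#4": 2}

# CPU socket on which VM#1 runs during each of the 4 cycles.
VM1_SCHEDULE = [2, 1, 1, 2]


def remote_cycles(schedule, gpu):
    """For each cycle, True when the VM runs on a socket the GPU is not attached to."""
    home = GPU_SOCKET[gpu]
    return [socket != home for socket in schedule]


def relative_cost(schedule, gpu):
    """Total cost of the schedule, in units of one local (overhead-free) cycle."""
    return sum(1 + REMOTE_OVERHEAD if remote else 1
               for remote in remote_cycles(schedule, gpu))


print(remote_cycles(VM1_SCHEDULE, "GPU#4"))  # [False, True, True, False]
```

Under these assumptions, only the second and third cycles pay the cross-socket penalty, which matches the walkthrough above.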
The OpenStack case
OpenStack is able to handle NUMA node placement for you: it can instruct libvirt to statically pin each vCPU to a physical CPU so that the vCPUs no longer “move around” as described above.
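At the libvirt level, static pinning shows up in the domain XML as `vcpupin` entries inside a `cputune` element. The sketch below only generates such a fragment, to show its shape. The helper name `cputune_xml` and the CPU numbers are illustrative, and this is not how OpenStack itself builds its configuration.

```python
import xml.etree.ElementTree as ET


def cputune_xml(pinning):
    """Build a <cputune> fragment pinning each vCPU to one host CPU.

    pinning maps a vCPU index to a host CPU index, e.g. {0: 4, 1: 5}.
    """
    cputune = ET.Element("cputune")
    for vcpu, host_cpu in sorted(pinning.items()):
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=str(host_cpu))
    return ET.tostring(cputune, encoding="unicode")


# Pin vCPU 0 and vCPU 1 to host CPUs 4 and 5 (arbitrary example values).
print(cputune_xml({0: 4, 1: 5}))
```

To avoid the overhead described earlier, the host CPUs chosen here should belong to the socket the GPU is attached to.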
The VMware case
Splitting a GPU thanks to GPU virtualization : understanding the performance impacts
One might want to split a GPU across multiple VMs using GPU virtualization. It’s especially interesting in terms of cost when you have a large memory, like Nvidia V100s cards that have 32GB of VRAM vs the V100 that only has 16GB (check this blog post for further details).
We decided not to go this way for the following reasons:
- the first reason is that, as described in the figure below, splitting GPU#4 for instance could cause CPU#2 to be overloaded.
- the second reason is that the computing power is shared (meaning the CUDA cores when it comes to Nvidia), so each VM potentially cannot use the GPU at its full advertised capacity.
In this article we presented how virtualization works, how the orchestration of computing cycles matters, and why we decided to go with PCI Express pass-through mode for our VM GPU offers.
If you want to go further I suggest that you read these 2 excellent articles.