How PCI-Express works and why you should care #GPU

What is PCI-Express?

Everyone, and I mean everyone, who runs intensive Machine Learning / Deep Learning training should pay attention to it.

As I explained in a previous blog post, GPUs have accelerated Artificial Intelligence evolution massively.

However, building a GPU server is not that easy, and failing to create an appropriate infrastructure can have real consequences on training time.

If you use GPUs, you should know that there are two ways to connect them to the motherboard so that they can communicate with the other components (network, CPU, storage devices). Solution 1 is PCI Express and solution 2 is SXM2. We will talk about SXM2 in the future. Today, we will focus on PCI Express, because it strongly depends on the choice of adjacent hardware such as the PCIe bus or the CPU.

NVIDIA V100 with SXM2 design (source: https://www.ebizpc.com/NVIDIA-Tesla-V100-900-2G502-0300-000-16GB-GPU-p/900-2g503-0310-000.htm) vs NVIDIA V100 with PCI Express design (source: https://nvidiastore.com.br/nvidia-tesla-v100-16gb)

This is a major element to consider for deep learning: time spent loading data is time the GPU is not computing, so the bandwidth between the GPUs and the other components is a key bottleneck in most deep learning training contexts.

How does PCI-Express work, and why should you care about the number of PCIe lanes?

What is a PCI-Express lane, and are there any associated CPU limitations?

Each V100 GPU uses 16 PCIe lanes. What does that mean exactly?

Extract from the NVIDIA V100 product specification sheet

The “x16” means that the PCIe interface has 16 dedicated lanes. So… next question: what is a PCI Express lane?

What’s a PCI Express lane?

Two PCI Express devices and their interconnection: figure inspired by the excellent article “What is a chipset, and why should I care?”

PCIe lanes are used to communicate between PCIe devices, or between a PCIe device and the CPU. A lane is composed of two differential pairs of wires (four wires in total): one pair for inbound traffic and one pair for outbound traffic, so data can flow in both directions at the same time.

Lane communications are similar to network Layer 1 communications: it’s all about transferring bits as fast as possible over electrical wires! However, the technique used for a PCIe link is a bit different, as a PCIe device uses xN lanes. In our previous example N = 16, but it can be any power of 2 from 1 to 16 (1/2/4/8/16).

So… if PCIe is similar to a network architecture, does that mean PCIe has layers too?

Yes! You are right, PCIe has 4 layers:

The Physical Layer (aka the Big Negotiation Layer)

The Physical Layer (PL) is responsible for negotiating with the other device the terms and conditions for exchanging the raw packets (PLPs, for Physical Layer Packets), i.e. the link width and the frequency.

You should be aware that only the smaller of the two devices’ lane counts will be used. This is why choosing the appropriate CPU is so important: a CPU can only manage a limited number of lanes, so pairing a nice GPU with 16 PCIe lanes with a CPU that only offers it 8 PCIe lanes is as efficient as throwing away half your money because it doesn’t fit in your wallet.
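To make this concrete, here is a minimal Python sketch (purely illustrative, the function names and constants are made up for this post) that computes the trained link width as the minimum of what both ends support, and the resulting PCIe 3.0 bandwidth:

```python
# Minimal sketch: the negotiated PCIe link width is the smaller of what the
# two endpoints support, and that directly caps the usable bandwidth.

# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, 8 bits per byte (~0.985 GB/s)
PCIE3_BW_PER_LANE_GBS = 8.0 * (128 / 130) / 8

def negotiated_link(gpu_lanes: int, cpu_lanes_available: int) -> int:
    """Return the link width actually trained, assuming no other constraint."""
    return min(gpu_lanes, cpu_lanes_available)

def effective_bandwidth_gbs(lanes: int, bw_per_lane_gbs: float = PCIE3_BW_PER_LANE_GBS) -> float:
    return lanes * bw_per_lane_gbs

# A x16 GPU behind a CPU that can only give it 8 lanes runs at half speed.
for cpu_lanes in (16, 8):
    width = negotiated_link(gpu_lanes=16, cpu_lanes_available=cpu_lanes)
    print(f"x{width}: ~{effective_bandwidth_gbs(width):.2f} GB/s per direction")
```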

Packets received at the Physical Layer (aka PHY) come from other PCIe devices or from the system (via Direct Memory Access (DMA) or from the CPU, for instance) and are encapsulated in a frame.

The purpose of a Start-of-Frame is to say: “I am sending you data, this is the beginning,” and it takes just 1 byte to say that!

The End-of-Frame word is also 1 byte to say “goodbye I’m done with it”.

This layer also implements the 8b/10b or 128b/130b encoding, which we will explain later; it is mainly used for clock recovery.

The Data Link Layer (aka Let’s put this mess in the right order)

The Data Link Layer Packet (DLLP) starts with a Packet Sequence Number. This is really important, as a packet might get corrupted at some point and may then need to be uniquely identified for retry purposes. The Sequence Number is coded on 2 bytes.

The Sequence Number is followed by the Transaction Layer Packet, and the frame is closed with the LCRC (Link Cyclic Redundancy Check), which is used to check the integrity of the Transaction Layer Packet (meaning the actual payload).

If the LCRC is validated, then the Data Link Layer sends an ACK (ACKnowledge) signal to the emitter through the Physical Layer. Otherwise it sends a NAK (Not AcKnowledge) signal to the emitter, which will then resend the frame associated with that sequence number; this retry mechanism relies on a replay buffer kept on the transmitter side.
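Here is a rough Python sketch of the framing overhead described above. The Start, End, Sequence Number and LCRC sizes are the ones mentioned in this post; the 12-byte TLP header and the payload sizes are assumptions used only to illustrate why small payloads waste bandwidth:

```python
# Rough sketch of the per-packet framing overhead around a TLP.
# Byte counts: Start (1), Sequence (2), LCRC (4), End (1); the TLP header
# size is an assumed typical value, not taken from the PCIe specification.

def frame_bytes(payload: int, tlp_header: int = 12, with_ecrc: bool = False) -> int:
    start, sequence, lcrc, end = 1, 2, 4, 1   # framing + Data Link Layer fields
    ecrc = 4 if with_ecrc else 0              # optional end-to-end CRC
    return start + sequence + tlp_header + payload + ecrc + lcrc + end

for payload in (16, 64, 256):
    total = frame_bytes(payload)
    print(f"{payload:4d} B payload -> {total} B on the link "
          f"({payload / total:.0%} efficiency)")
```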

The Transaction Layer

The Transaction Layer is responsible for managing the actual payload (Header + Data) as well as the (optional) message digest ECRC (End-to-End Cyclic Redundancy Check). This Transaction Layer Packet comes from the Data Link Layer, where it has been decapsulated.

An integrity check is performed if needed/requested. This step checks the end-to-end integrity of the payload and ensures that no packet corruption occurred when passing data from the Data Link Layer to the Transaction Layer.

The header describes the type of transaction (a toy sketch of this follows below), such as:

  • Memory Transaction
  • I/O Transaction
  • Configuration Transaction
  • or Message Transaction
PCIe Layers
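As a toy model (this is not the real TLP header layout, just an illustration of the idea), one can think of a Transaction Layer Packet as a typed record whose header tells the receiver which kind of transaction it carries:

```python
# Toy model of a TLP: the header carries the transaction type so the
# receiver knows how to route and interpret the payload.

from dataclasses import dataclass
from enum import Enum

class TransactionType(Enum):
    MEMORY = "Memory Transaction"            # e.g. DMA read/write
    IO = "I/O Transaction"
    CONFIGURATION = "Configuration Transaction"
    MESSAGE = "Message Transaction"          # e.g. interrupts, power management

@dataclass
class TransactionLayerPacket:
    kind: TransactionType
    address: int
    data: bytes = b""

tlp = TransactionLayerPacket(TransactionType.MEMORY, address=0xF000_0000, data=b"\x00" * 64)
print(tlp.kind.value, hex(tlp.address), len(tlp.data), "bytes")
```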

The Application Layer

The role of the Application Layer is to handle the user logic. This layer sends the header and the data payload to the Transaction Layer. The magic happens in this layer, where data is routed to the different hardware components.

How does PCIe communicate with the rest of the world?

The PCIe link uses the packet-switching concept found in networking, in full-duplex mode.

PCIe devices have an internal clock to orchestrate PCIe data transfer cycles. These cycles are also orchestrated thanks to the Reference Clock, which sends its signal through a dedicated lane (not counted in the x1/2/4/8/16/32 mentioned above). This clock helps both the receiving and the emitting devices synchronize their packet communications.

Each PCIe lane sends bytes in parallel with the other lanes. The clock synchronization mentioned above helps the receiver put those bytes back in the right order, as illustrated below.

x16 means 16 lanes of parallel communication on generation 3 of the PCIe protocol
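Here is a small, purely illustrative Python sketch of this byte-striping idea: the transmitter deals bytes out round-robin across the lanes and the receiver re-interleaves them back into the original order (the real link works at the symbol level, this is only a simplified picture):

```python
# Illustrative sketch of byte striping across PCIe lanes.

def stripe(data: bytes, lanes: int = 16) -> list[list[int]]:
    """Distribute bytes over `lanes` parallel lanes (transmitter side)."""
    buffers: list[list[int]] = [[] for _ in range(lanes)]
    for i, byte in enumerate(data):
        buffers[i % lanes].append(byte)
    return buffers

def unstripe(buffers: list[list[int]]) -> bytes:
    """Re-interleave the lane buffers back into the original byte order."""
    lanes = len(buffers)
    out = bytearray(sum(len(b) for b in buffers))
    for lane, buf in enumerate(buffers):
        for j, byte in enumerate(buf):
            out[j * lanes + lane] = byte
    return bytes(out)

payload = bytes(range(64))
assert unstripe(stripe(payload, lanes=16)) == payload
print("64 bytes striped over 16 lanes -> 4 bytes per lane, order preserved")
```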

You may have the bytes in order, but do you have data integrity at the physical layer?

To ensure integrity, PCIe devices use the 8b/10b encoding scheme for PCIe generations 1 and 2, or the 128b/130b encoding scheme for generations 3 and 4.

These encodings are used to prevent the loss of temporal landmarks, especially when transmitting long runs of identical bits. This process is called “Clock Recovery”.

With 128b/130b, 128 bits of payload data are sent and 2 control bits (the sync header) are added to them.

Quick examples

Let’s simplify it with an 8b/10b example. Here is the 8b/10b encoding table from IEEE 802.3 clause 36, table 36-1a (based on the Ethernet specifications):

IEEE 802.3 clause 36, table 36-1a: 8b/10b encoding table

So how can the receiver tell apart all those repeating 0s (code group name D0.0)?

Repeating bits everywhere

8b/10b encoding is composed of 5b/6b + 3b/4b encodings.

Therefore 00000 000 will be encoded into 100111 0100: the first 5 bits of the original data (00000) are encoded into 100111 using 5b/6b encoding, and the second group of 3 bits (000) is encoded into 0100 using 3b/4b encoding. This is the code group used when the current running disparity is RD-.

With the opposite running disparity (RD+), the same 00000 000 turns into 011000 1011 instead.

Therefore the original data, which was 8 bits, is now 10 bits due to the control bits (1 extra bit for the 5b/6b part and 1 for the 3b/4b part).
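A tiny Python illustration of this single code group (values copied from the IEEE 802.3 table above; this is in no way a full 8b/10b encoder) shows that both variants are DC-balanced and full of transitions, which is exactly what the receiver needs for clock recovery:

```python
# D0.0 in 8b/10b, for both running disparities (5b/6b part + 3b/4b part).
D0_0 = {
    "RD-": ("100111", "0100"),
    "RD+": ("011000", "1011"),
}

for disparity, (six_b, four_b) in D0_0.items():
    code = six_b + four_b
    print(f"00000000 with current {disparity} -> {code} "
          f"({code.count('1')} ones, {code.count('0')} zeros)")
```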

But don’t worry, I will write a dedicated blog post about encoding later.

PCIe generations 1 and 2 were designed with 8b/10b encoding, meaning that the actual data transmitted is only 80% of the total load (as 20%, i.e. 2 bits out of every 10, is used for clock synchronization).

PCIe Gen 3 & 4 were designed with 128b/130b encoding, meaning that the control bits now represent only about 1.5% of what is transmitted (2 bits out of every 130). Quite good, isn’t it?

Let’s calculate the PCIe bandwidth together

Here is the table of PCIe version specifications:

| Number of lanes | PCIe 1.0 (2003) | PCIe 2.0 (2007) | PCIe 3.0 (2010) | PCIe 4.0 (2017) | PCIe 5.0 (2019) | PCIe 6.0 (2021) |
|---|---|---|---|---|---|---|
| x1 | 250 MB/s | 500 MB/s | 1 GB/s | 2 GB/s | 4 GB/s | 8 GB/s |
| x2 | 500 MB/s | 1 GB/s | 2 GB/s | 4 GB/s | 8 GB/s | 16 GB/s |
| x4 | 1 GB/s | 2 GB/s | 4 GB/s | 8 GB/s | 16 GB/s | 32 GB/s |
| x8 | 2 GB/s | 4 GB/s | 8 GB/s | 16 GB/s | 32 GB/s | 64 GB/s |
| x16 | 4 GB/s | 8 GB/s | 16 GB/s | 32 GB/s | 64 GB/s | 128 GB/s |

PCI-SIG consortium: PCIe theoretical bandwidth per lane, per way (specification sheet)

| | PCIe 1.0 (2003) | PCIe 2.0 (2007) | PCIe 3.0 (2010) | PCIe 4.0 (2017) | PCIe 5.0 (2019) | PCIe 6.0 (2021) |
|---|---|---|---|---|---|---|
| Frequency | 2.5 GT/s | 5.0 GT/s | 8.0 GT/s | 16 GT/s | 32 GT/s | 64 GT/s |

PCI-SIG consortium: PCIe theoretical raw bit rate (specification sheet)

To obtain these numbers, let’s look at the general bandwidth formula, where:

  • BW stands for Bandwidth
  • MT/s: Mega Transfers per second
  • Encoding could be 4b/5b, 8b/10b, 128b/130b, …
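The formula itself is not reproduced above, so here is a reconstruction in the same notation, inferred from the two worked examples below:

BW/lane\ (MB/s) = Transfer\ rate\ (MT/s)\ *\ \frac{encoded\ payload\ bits}{total\ encoded\ bits}\ *\ \frac{1\ Byte}{8\ bits}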

For PCIe v1.0:

BW/lane\ (MB/s) = \ 2\ 500\ (MT/s)\ *\ \frac{8\ bits}{10\ bits} * \frac{1\ Byte}{8\ bits}
BW/lane\ (MB/s) = \ 250\ (MB/s)

For PCIe v3.0 (the one that interests us for the NVIDIA V100):

BW/lane\ (MB/s) = \ 8\ 000\ (MT/s)\ *\ \frac{128\ bits}{130\ bits} * \frac{1\ Byte}{8\ bits}
BW/lane\ (MB/s) = \ 984.6\ (MB/s)

Therefore, with 16 lanes for an NVIDIA V100 connected in PCIe v3.0, we have an effective data transfer rate (data bandwidth) of nearly 16 GB/s per way (the exact figure is 15.75 GB/s per way).

Be careful not to get confused: total bandwidth is sometimes quoted as the two-way (bidirectional) bandwidth; in that case the total x16 bandwidth is around 32 GB/s.
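If you want to double-check these numbers yourself, here is a short Python sketch that recomputes the per-lane and x16 per-direction bandwidth for several generations from the raw transfer rate and the line encoding (the dictionary below is just the table above restated):

```python
# Quick check of the table values: effective bandwidth per lane and per
# x16 link, per direction, from the raw rate and the encoding efficiency.

GENERATIONS = {            # raw rate (GT/s), encoding (payload bits, total bits)
    "PCIe 1.0": (2.5, (8, 10)),
    "PCIe 2.0": (5.0, (8, 10)),
    "PCIe 3.0": (8.0, (128, 130)),
    "PCIe 4.0": (16.0, (128, 130)),
}

for gen, (gt_per_s, (payload_bits, total_bits)) in GENERATIONS.items():
    per_lane_mbs = gt_per_s * 1000 * payload_bits / total_bits / 8  # MB/s per lane per way
    print(f"{gen}: {per_lane_mbs:7.1f} MB/s per lane, "
          f"{per_lane_mbs * 16 / 1000:5.2f} GB/s for x16 (per direction)")
```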

Note: another element we haven’t considered is that the maximum theoretical bandwidth should be reduced by around 1 Gb/s to account for the error detection protocols (ECRC and LCRC) as well as the header (Start tag, Sequence tag, TLP header) and footer (End tag) overheads explained earlier in this blog post.

In conclusion

We have seen that PCI Express has evolved a lot and that it is based on the same concepts as networking. To get the best out of PCIe devices, it is necessary to understand the fundamentals of the underlying infrastructure.

Failing to choose the right motherboard, CPU or bus can lead to major performance bottlenecks and GPU under-performance.

To sum up:

Friends don’t let friends build their own GPU hosts 😉

Jean-Louis Quéguiner July 1st, 2020

If you liked this post but you want to drill down a bit into the Deep Learning and AI aspect of things don’t hesitate to check out my other blog posts:

What does training neural networks mean?
Distributed Learning in a Deep Learning context
Head of #Data & #AI @OVH
Product Unit Lead and Technical Director #Data & #AI. Passionate about Cyber and reproducing AI research papers.
Leading Databuzzword french podcast: spreaker.com/user/guignol

and
https://www.youtube.com/c/Databuzzword