It's a question that pops up surprisingly often in the world of high-performance computing, especially for folks diving deep into machine learning: when you're packing your system with powerful GPUs, how much of a difference does it really make if they're running on PCIe x16 lanes versus x8?
I've been asked this a lot, and the honest, albeit slightly unsatisfying, answer is always: "It depends." But dig a little deeper, and for many real-world applications, the impact might be less dramatic than you'd think. To get a clearer picture, I decided to put this to the test, focusing on some common machine learning workloads using a formidable setup: four NVIDIA Titan V GPUs.
Setting the Stage: The Hardware and Software
For this exploration, I used a robust system: a Gigabyte motherboard equipped with four x16 PCIe sockets. To accommodate all four GPUs at their full potential, a PLX PCIe switch was integrated into sockets 2 and 3. Powering this beast was an Intel Xeon W-2195 processor, an 18-core Skylake-W chip with AVX512 capabilities, paired with a generous 256GB of ECC memory. Storage was handled by a speedy Samsung 256GB NVMe M.2 drive.
On the software side, the system ran Ubuntu 16.04 within Docker containers, specifically using NVIDIA Docker V2. For the machine learning frameworks, TensorFlow 1.7 came from an NVIDIA NGC Docker image, while Keras 2.1.5 (with a local Anaconda Python installation) served as the front-end for some tests.
The "Sticky Note" Method: A Clever Workaround
Now, how do you physically force a PCIe x16 card to run at x8 without swapping motherboards or complex BIOS settings? In this case, a rather ingenious, albeit unconventional, method was employed: the "sticky note" trick. By carefully covering half the pins on the PCIe connector of each GPU with cut-down sticky notes, the connection was effectively limited to x8 lanes. This ensured that the hardware configuration was otherwise identical, allowing for a direct comparison. (Just a friendly reminder: this isn't something to try at home without understanding the risks!)
The Workloads: What We Tested
To get a comprehensive view, several different types of jobs were run:
- GoogLeNet Training: This classic convolutional neural network training task was executed using TensorFlow, leveraging synthetic data for image input. This particular job is known to show significant benefits from Tensor-cores, so it was a prime candidate for observing performance differences.
- Billion Words Benchmark LSTM Train: For natural language processing, a large-scale LSTM model was trained on the "billion words" news-feed corpus. This is a memory and compute-intensive task.
- VGG Model in Keras: To simulate a more "real-world" scenario, the VGG convolutional neural network was implemented and trained using Keras. Two variations were tested: one with data streamed from disk and another with data streamed directly from memory. The dataset consisted of 25,000 images from the popular Kaggle "dogs vs cats" competition.
- Peer-to-Peer Bandwidth and Latency: For a fundamental understanding of the direct impact of lane reduction, the p2pBandwidthLatencyTest utility from NVIDIA's CUDA samples was used. This provides a baseline measurement of how data moves between GPUs.
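The training benchmarks above all boil down to the same measurement: images processed per second. As a framework-agnostic sketch of how such a figure is computed, here is a minimal timing loop; the `train_step` callable is a hypothetical stand-in for one synchronous training iteration, not code from the actual benchmarks:

```python
import time

def measure_images_per_sec(train_step, batch_size, steps=50, warmup=5):
    """Time `steps` synchronous training iterations and report images/sec.

    `train_step` is a hypothetical stand-in for one full training
    iteration (forward pass + backward pass + weight update).
    """
    for _ in range(warmup):    # discard warm-up iterations
        train_step()           # (graph construction, cache fills, etc.)
    start = time.perf_counter()
    for _ in range(steps):
        train_step()
    elapsed = time.perf_counter() - start
    return steps * batch_size / elapsed

# Usage with a dummy step that just burns ~1 ms per iteration:
rate = measure_images_per_sec(lambda: time.sleep(0.001), batch_size=64)
```

Warm-up iterations matter here: the first few steps of a TensorFlow job include one-time costs that would otherwise drag down the reported throughput.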
The Results: Is x16 Always King?
After running all the tests, the results painted a surprisingly consistent picture: for these specific machine learning workloads, the performance difference between PCIe x16 and x8 was, in many cases, quite small. In some instances, there was virtually no discernible difference.
For instance, in the GoogLeNet training with TensorFlow (both FP32 and FP16 using Tensor-cores), the images processed per second showed only a marginal advantage for x16. Similarly, the Billion Words Benchmark LSTM training saw very little deviation. The VGG model training in Keras also exhibited minimal performance gains when running at x16 compared to x8, whether the data was streamed from disk or memory.
The peer-to-peer bandwidth and latency tests, however, did confirm the expected halving of direct bandwidth when moving from x16 to x8. This is a fundamental hardware characteristic. But the crucial takeaway is that for these particular GPU-accelerated computing tasks, the bottleneck doesn't always lie solely in the PCIe bandwidth.
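That halving follows directly from the link arithmetic. A quick sketch of PCIe 3.0 theoretical bandwidth per direction (8 GT/s per lane with 128b/130b encoding; protocol overhead is ignored here, so measured numbers always come in somewhat lower):

```python
def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical PCIe 3.0 bandwidth per direction, in GB/s.

    8 GT/s per lane, 128b/130b encoding. TLP headers and flow
    control are not accounted for, so real transfers are slower.
    """
    transfers_per_sec = 8e9                       # 8 GT/s per lane
    payload_bits = transfers_per_sec * 128 / 130  # encoding overhead
    return lanes * payload_bits / 8 / 1e9         # bits -> bytes -> GB

print(pcie3_bandwidth_gb_s(16))  # ~15.75 GB/s
print(pcie3_bandwidth_gb_s(8))   # ~7.88 GB/s
```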
So, What's the Verdict?
While the raw bandwidth of PCIe x16 is double that of x8, the real-world impact on performance for many machine learning applications, especially when using multiple powerful GPUs, can be surprisingly muted. This is often because the GPUs themselves become the primary bottleneck, or the nature of the workload doesn't saturate the x8 connection. For tasks that involve massive data transfers or very high-frequency, low-latency communication between GPUs, x16 might offer a more noticeable edge. But for many common deep learning training jobs, running your GPUs at x8 lanes is likely to yield performance that is very close to, if not indistinguishable from, running them at x16. It's a good reminder that performance is a complex interplay of hardware, software, and the specific task at hand.
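To see why an x8 link often goes unsaturated, a back-of-the-envelope estimate helps. The numbers below are illustrative assumptions (224x224x3 FP32 input tensors), not measurements from the benchmarks above:

```python
# Rough estimate: images/sec needed to saturate a PCIe 3.0 x8 link.
# Assumes 224x224x3 FP32 input tensors -- an illustrative choice,
# not the exact input format used in the benchmarks.
bytes_per_image = 224 * 224 * 3 * 4   # ~0.6 MB per image
x8_bandwidth = 7.88e9                 # bytes/s, PCIe 3.0 x8 (approx.)

images_per_sec_to_saturate = x8_bandwidth / bytes_per_image
print(f"{images_per_sec_to_saturate:,.0f} images/s")  # roughly 13,000
```

A single GPU training a convolutional network processes far fewer images per second than that, which is consistent with the small x16-versus-x8 deltas observed in these tests.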
