
NVIDIA GB300 NVL72: GPU Architecture & Performance

AI team 7 min read

NVIDIA’s Blackwell Ultra (GB300 and B300) introduces significant architectural and performance improvements over the Blackwell generation (GB200 and B200), particularly for FP4 formats.

In this post, we analyze per-GPU specifications, key differences with performance implications, GB300 NVL72 rack-level performance, Grace CPU unified memory architecture, and multi-node NVLink and NCCL all-reduce test results.

Blackwell GPU architecture overview (TL;DR)

Each Grace Blackwell superchip pairs 1 Grace CPU with 2 Blackwell GPUs. All Blackwell GPUs feature fifth-generation NVLink with 1.8 TB/s of total bandwidth (900 GB/s per direction).
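As a quick sanity check, a sketch using only the numbers quoted in this post (1.8 TB/s NVLink per GPU, 72 GPUs per NVL72 rack):

```python
# Fifth-generation NVLink figures from the text.
TOTAL_NVLINK_TB_S = 1.8   # bidirectional total per GPU
GPUS_PER_RACK = 72        # GB300 NVL72

unidirectional_gb_s = TOTAL_NVLINK_TB_S / 2 * 1000    # 900 GB/s per direction
rack_nvlink_tb_s = TOTAL_NVLINK_TB_S * GPUS_PER_RACK  # 129.6 TB/s, quoted as ~130 TB/s

print(unidirectional_gb_s, "GB/s per direction")
print(rack_nvlink_tb_s, "TB/s aggregate NVLink bandwidth")
```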

Per-GPU specifications

All figures below are dense, i.e., without structured sparsity (NVIDIA's headline sparsity-accelerated numbers divided by 2):

Technical specifications:

| Feature | GB300 | GB200 | B300 | B200 |
| --- | --- | --- | --- | --- |
| FP4 | 15 PFLOPS | 9 PFLOPS | 14 PFLOPS | 9 PFLOPS |
| FP8 | 5 PFLOPS | 4.5 PFLOPS | 4.5 PFLOPS | 4.5 PFLOPS |
| INT8 | 0.165 POPS | 4.5 POPS | 0.15 POPS | 4.5 POPS |
| FP16/BF16 | 2.5 PFLOPS | 2.25 PFLOPS | 2.25 PFLOPS | 2.25 PFLOPS |
| TF32 | 1.25 PFLOPS | 1.1 PFLOPS | 1.1 PFLOPS | 1.1 PFLOPS |
| FP32 | 0.080 PFLOPS | 0.075 PFLOPS | 0.075 PFLOPS | 0.075 PFLOPS |
| Attention acceleration (SFU EX2) | 10.7 TeraExponentials/s | 5 TeraExponentials/s | 10.7 TeraExponentials/s | 5 TeraExponentials/s |
| GPU memory | 279 GB HBM3e | 129 GB HBM3e | 270 GB HBM3e | 180 GB HBM3e |
| GPU memory bandwidth | 8 TB/s | 7.7 TB/s | 7.7 TB/s | 7.7 TB/s |
| Max thermal design power (TDP) | 1.4 kW | 1.2 kW | 1.1 kW | 1 kW |

Key differences between GB300 and B200

Theoretical peak compute for a given precision follows the standard relationship:

peak FLOPS = (number of SMs) × (FLOPs per SM per clock at that precision) × (boost clock)

The GB300's higher dense FP4 throughput and larger HBM3e capacity, at the cost of a higher 1.4 kW TDP, are the main deltas visible in the table above.
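A minimal sketch of that relationship in code. The SM count, per-SM throughput, and clock below are hypothetical placeholders for illustration, not official GB300 figures:

```python
def peak_flops(sm_count: int, flops_per_sm_per_clock: int, clock_hz: float) -> float:
    """Theoretical peak compute = SMs x FLOPs per SM per clock x clock rate."""
    return sm_count * flops_per_sm_per_clock * clock_hz

# Hypothetical inputs for illustration only (not GB300 specs):
example_pflops = peak_flops(160, 16384, 1.8e9) / 1e15
print(f"{example_pflops:.2f} PFLOPS")
```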

GB300 NVL72 rack-aggregate performance

As above, figures are dense, without structured sparsity (sparsity-accelerated numbers divided by 2):

| Metric | GB300 NVL72 rack aggregate |
| --- | --- |
| FP4 | 1,080 PFLOPS (1.08 EFLOPS) |
| FP8 | 360 PFLOPS |
| FP16/BF16 | 180 PFLOPS |
| FP32 | 3 PFLOPS |
| GPU memory | 20 TB |
| GPU memory bandwidth | 576 TB/s |
| NVLink bandwidth | 130 TB/s |
| Max GPU TDP | 100.8 kW |
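These rack totals are simply 72× the per-GPU figures; a quick check using the numbers from the two tables:

```python
GPUS = 72  # GB300 NVL72

per_gpu = {  # dense per-GPU GB300 figures from the spec table
    "FP4 (PFLOPS)": 15,
    "FP8 (PFLOPS)": 5,
    "FP16/BF16 (PFLOPS)": 2.5,
    "Max TDP (kW)": 1.4,
}

rack = {metric: value * GPUS for metric, value in per_gpu.items()}
print(rack)  # FP4 -> 1080 PFLOPS (1.08 EFLOPS), TDP -> 100.8 kW
```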

Grace CPU unified memory & fabric architecture

GB300 NVL72 includes 36 Grace CPUs and 72 GPUs (2 GPUs per Grace CPU), establishing a unified, coherent memory architecture via NVLink-C2C interconnects. This design extends the GPU’s addressable memory beyond local HBM3e, creating a capacity-optimized tier accessible across the fabric.

| Grace CPU memory metric | Rack value (GB300 NVL72) | Per Grace CPU (÷36) | Notes |
| --- | --- | --- | --- |
| LPDDR5X capacity | 17 TB | ~0.472 TB (~472 GB) | |
| LPDDR5X bandwidth | 14 TB/s | ~0.389 TB/s (~389 GB/s) | |
| Rough per-GPU share of Grace BW | – | ~194 GB/s per GPU | Assuming an even split across the 2 GPUs per Grace CPU |
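The per-CPU and per-GPU shares follow directly from dividing the rack totals, assuming an even split; a sketch:

```python
RACK_LPDDR5X_GB = 17_000       # 17 TB LPDDR5X across the rack
RACK_LPDDR5X_BW_GB_S = 14_000  # 14 TB/s aggregate LPDDR5X bandwidth
GRACE_CPUS = 36
GPUS_PER_GRACE = 2

capacity_per_cpu = RACK_LPDDR5X_GB / GRACE_CPUS    # ~472 GB per Grace CPU
bw_per_cpu = RACK_LPDDR5X_BW_GB_S / GRACE_CPUS     # ~389 GB/s per Grace CPU
bw_per_gpu = bw_per_cpu / GPUS_PER_GRACE           # ~194 GB/s per GPU (even split)

print(round(capacity_per_cpu), round(bw_per_cpu), round(bw_per_gpu))
```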

The NVIDIA GB200 Grace CPU integrates Arm's high-performance Neoverse V2 cores. However, the implementation does not use the V2 configuration with 2 MB of L2 cache per core, instead employing a reduced 1 MB L2 cache per core.

Experimental evaluations have reported a materially higher rate of L1 cache misses on GB200 relative to comparable x86-based platforms (see Megatron DeepSeek V3 Training on GB200 NVL72). In addition, the GB200's ConnectX-7 (CX7) NIC does not include an integrated PCIe switch.

As a result, scale-out RDMA traffic must traverse the full Grace network-on-chip (NoC) and then cross the NVLink-C2C interconnect to reach the Blackwell GPU. Most of these problems are resolved in the GB300, which moves to ConnectX-8 (CX8) with an external PCIe switch, so RDMA traffic can reach the GPU without traversing the Grace NoC.

[Figure: GB300 CX8 PCIe topology]

Multi-node NVLink & NCCL results

Full rack-scale NCCL all-reduce run across 18 nodes (4 GPUs per node, 72 ranks) using nccl-tests:

```shell
OMPI_ALLOW_RUN_AS_ROOT=1 OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
mpirun -H $(echo pod4-gb300-1-tray{01..18}-f3:4 | tr ' ' ,) \
  --mca btl ^openib --mca btl_openib_warn_no_device_params_found 0 \
  ./build/all_reduce_perf -e 16G -n 1000 -b 1G -f 2 -g 1
```
| Size (GB) | Count (elements) | Type | RedOp | Out-of-place time (µs) | Out-of-place algbw (GB/s) | Out-of-place busbw (GB/s) | Out-of-place #wrong | In-place time (µs) | In-place algbw (GB/s) | In-place busbw (GB/s) | In-place #wrong |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 268435456 | float | sum | 2914.04 | 368.47 | 726.71 | 0 | 2917.91 | 367.98 | 725.75 | 0 |
| 2 | 536870912 | float | sum | 5253.87 | 408.74 | 806.13 | 0 | 5256.57 | 408.53 | 805.72 | 0 |
| 4 | 1073741824 | float | sum | 9972.94 | 430.66 | 849.36 | 0 | 9982.61 | 430.25 | 848.54 | 0 |
| 8 | 2147483648 | float | sum | 19176.2 | 447.95 | 883.45 | 0 | 19151.2 | 448.53 | 884.61 | 0 |
| 16 | 4294967296 | float | sum | 36878.3 | 465.85 | 918.77 | 0 | 36873.3 | 465.92 | 918.89 | 0 |
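The busbw column follows the nccl-tests convention for all-reduce, busBW = algBW × 2(n−1)/n, with n = 72 ranks in this run; a quick check against the 1 GB row:

```python
def allreduce_busbw(algbw_gb_s: float, n_ranks: int) -> float:
    """nccl-tests bus bandwidth for all-reduce: busBW = algBW * 2*(n-1)/n."""
    return algbw_gb_s * 2 * (n_ranks - 1) / n_ranks

n_ranks = 18 * 4  # 18 trays x 4 GPUs each = 72 ranks
print(allreduce_busbw(368.47, n_ranks))  # ~726.7 GB/s, matching the reported 726.71
```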

References

  1. (Zartbot) Inside Nvidia GPU: Discussing Blackwell's Limitations and Predicting Rubin's Microarchitecture #3
  2. White Paper, NVIDIA Blackwell Architecture Technical Brief, table 3. System Specifications, Per GPU specs
  3. NVIDIA Blackwell Ultra Datasheet
  4. NVIDIA GB300 NVL72 Platform
  5. NVIDIA Grace Performance Tuning Guide
  6. NVIDIA NVLINK Multi-Node Tuning Guide, NCCL
  7. Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
  8. NVIDIA DGX B300 Datasheet
  9. NVIDIA DGX B200 Datasheet
  10. Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! (Horace He)
