NVIDIA GB300 NVL72: GPU specs, architecture & performance

NVIDIA’s Blackwell Ultra (GB300 and B300) introduces significant architectural and performance improvements over the Blackwell generation (GB200 and B200), particularly for FP4 formats.

In this post, we analyze per-GPU specifications, key differences with performance implications, GB300 NVL72 rack-level performance, Grace CPU unified memory architecture, and multi-node NVLink and NCCL all-reduce test results.

Blackwell GPU architecture overview (TL;DR)

Each Grace Blackwell superchip (GB200 or GB300) pairs 1 Grace CPU with 2 Blackwell GPUs. All Blackwell GPUs feature fifth-generation NVLink with 1.8 TB/s total bandwidth per GPU (900 GB/s in each direction).

Per-GPU specifications

Performance numbers without sparsity (divide by 2):

Technical Specifications
Feature	GB300	GB200	B300	B200
FP4	15 PFLOPS	9 PFLOPS	14 PFLOPS	9 PFLOPS
FP8	5 PFLOPS	4.5 PFLOPS	4.5 PFLOPS	4.5 PFLOPS
INT8	0.165 POPS	4.5 POPS	0.15 POPS	4.5 POPS
FP16/BF16	2.5 PFLOPS	2.25 PFLOPS	2.25 PFLOPS	2.25 PFLOPS
TF32	1.25 PFLOPS	1.1 PFLOPS	1.1 PFLOPS	1.1 PFLOPS
FP32	0.080 PFLOPS	0.075 PFLOPS	0.075 PFLOPS	0.075 PFLOPS
Attention acceleration (SFU EX2)	10.7 TeraExponentials/s	5 TeraExponentials/s	10.7 TeraExponentials/s	5 TeraExponentials/s
GPU Memory	279 GB HBM3E	129 GB HBM3E	270 GB HBM3E	180 GB HBM3E
GPU Memory Bandwidth	8 TB/s	7.7 TB/s	7.7 TB/s	7.7 TB/s
Max Thermal Design Power (TDP)	1.4 kW	1.2 kW	1.1 kW	1 kW

Key differences between GB300 and B200

Compute: GB300 dense FP4 performance is 66.67% faster compared to B200 (15 vs 9 PFLOPS), FP8 +11.1% (5 vs 4.5 PFLOPS), and FP16/BF16 +11.1% (2.5 vs 2.25 PFLOPS). TF32 is also slightly faster with a speedup of 13.64% (1.25 vs 1.1 PFLOPS).
Memory capacity and bandwidth: GB300 has 55% more HBM capacity than B200 (279 vs 180 GB) and 3.9% more HBM bandwidth (8 vs 7.7 TB/s).
Power headroom and clocks: GB300 has 40% higher max GPU TDP (1.4 kW vs 1 kW), which increases headroom to sustain higher SM/Tensor-Core clocks during long GEMM-heavy phases.
The B200 SFU problem: In B200, the number of SMs on a single die was reduced to 80, so the throughput of the SFUs (Special Function Units) paired with the CUDA cores did not improve. This tradeoff favors GEMM performance at the cost of a compute bottleneck for softmax in attention. Blackwell Ultra raises SFU throughput from 5 to 10.7 TeraExponentials/s, more than doubling B200 performance.
B200/B300/GB300 GEMM performance: Part of the practical GEMM throughput gain from B200 to B300 (1.0 to 1.1 kW) and further to GB300 (1.4 kW) comes from power headroom that lets boost clocks stay higher for longer. Under power throttling, GPUs cannot sustain their peak clock speed, so a higher power limit also translates into higher achievable performance.
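The percentage deltas quoted in this list can be recomputed from the per-GPU table. Below is a minimal Python sketch; the dictionary and function names are illustrative, and the values are copied from the table above.

```python
# Per-GPU figures copied from the specification table (GB300 vs B200).
GB300 = {
    "FP4 (PFLOPS)": 15.0,
    "FP8 (PFLOPS)": 5.0,
    "FP16/BF16 (PFLOPS)": 2.5,
    "TF32 (PFLOPS)": 1.25,
    "SFU EX2 (TeraExp/s)": 10.7,
    "HBM capacity (GB)": 279.0,
    "HBM bandwidth (TB/s)": 8.0,
    "Max TDP (kW)": 1.4,
}
B200 = {
    "FP4 (PFLOPS)": 9.0,
    "FP8 (PFLOPS)": 4.5,
    "FP16/BF16 (PFLOPS)": 2.25,
    "TF32 (PFLOPS)": 1.1,
    "SFU EX2 (TeraExp/s)": 5.0,
    "HBM capacity (GB)": 180.0,
    "HBM bandwidth (TB/s)": 7.7,
    "Max TDP (kW)": 1.0,
}


def percent_gain(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Gain of GB300 over B200 for every metric in the table.
gains = {name: percent_gain(GB300[name], B200[name]) for name in GB300}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}%")
```

The SFU row comes out at roughly +114%, which is why the increase from 5 to 10.7 TeraExponentials/s is slightly more than a doubling.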

The equation for the theoretical peak compute is as follows:

$$\mathrm{Peak\ FLOPS} = \mathrm{Tensor\ Cores} \times \mathrm{Peak\ Clock\ Speed} \times \mathrm{FLOPs\ per\ Tensor\text{-}Core\ instruction}$$
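To make the formula concrete, the sketch below evaluates it in Python. The input values are placeholders chosen for easy arithmetic, not GB300 or B200 specifications, and the function names are illustrative. It also assumes each Tensor Core issues one instruction per clock, which the formula implies but the text does not state.

```python
def peak_flops(tensor_cores: int, clock_hz: float, flops_per_instruction: int) -> float:
    """Theoretical peak: Tensor Cores x clock speed x FLOPs per instruction.

    Assumes one Tensor-Core instruction issued per core per clock.
    """
    return tensor_cores * clock_hz * flops_per_instruction


def clock_limited_flops(peak: float, sustained_clock_hz: float, peak_clock_hz: float) -> float:
    """Upper bound on throughput when the GPU runs below its peak clock."""
    return peak * sustained_clock_hz / peak_clock_hz


# Placeholder inputs (hypothetical, NOT real GPU specifications).
TENSOR_CORES = 100
PEAK_CLOCK_HZ = 2.0e9
SUSTAINED_CLOCK_HZ = 1.8e9  # e.g. a clock held under a power limit
FLOPS_PER_INSTRUCTION = 1000

peak = peak_flops(TENSOR_CORES, PEAK_CLOCK_HZ, FLOPS_PER_INSTRUCTION)
sustained = clock_limited_flops(peak, SUSTAINED_CLOCK_HZ, PEAK_CLOCK_HZ)

print(f"peak:      {peak:.3e} FLOPS")
print(f"sustained: {sustained:.3e} FLOPS ({sustained / peak:.0%} of peak)")
```

Because the formula is linear in clock speed, a GPU that throttles to 90% of its peak clock can reach at most 90% of its theoretical peak, which is the mechanism behind the power-headroom argument above.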

GB300 NVL72 rack-aggregate performance

Performance numbers without sparsity (divide by 2):

Metric	GB300 rack aggregate
FP4	1080 PFLOPS (1.08 EFLOPS)
FP8	360 PFLOPS
FP16/BF16	180 PFLOPS
FP32	5.76 PFLOPS
GPU memory	20 TB
GPU memory bandwidth	576 TB/s
NVLink bandwidth	130 TB/s
Max GPU TDP	100.8 kW
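Most rack-level rows are the per-GPU GB300 figures multiplied by the 72 GPUs in the rack. A minimal Python sketch of that arithmetic follows; the names are illustrative, the per-GPU values come from the specification table, and the 1.8 TB/s NVLink figure comes from the architecture overview.

```python
GPUS_PER_RACK = 72  # GB300 NVL72

# Per-GPU GB300 figures from the per-GPU specification table.
PER_GPU = {
    "FP4 (PFLOPS)": 15.0,
    "FP8 (PFLOPS)": 5.0,
    "FP16/BF16 (PFLOPS)": 2.5,
    "GPU memory (TB)": 0.279,        # 279 GB HBM3E
    "GPU memory bandwidth (TB/s)": 8.0,
    "NVLink bandwidth (TB/s)": 1.8,  # fifth-generation NVLink, per GPU
    "Max GPU TDP (kW)": 1.4,
}

# Rack aggregate is a straight multiplication by the GPU count.
rack = {name: value * GPUS_PER_RACK for name, value in PER_GPU.items()}

for name, value in rack.items():
    print(f"{name}: {value:.3f}")
```

The memory and NVLink rows in the table are rounded: 72 x 279 GB is about 20.09 TB and 72 x 1.8 TB/s is 129.6 TB/s.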

Grace CPU unified memory & fabric architecture

GB300 NVL72 includes 36 Grace CPUs and 72 GPUs (2 GPUs per Grace CPU), establishing a unified, coherent memory architecture via NVLink-C2C interconnects. This design extends the GPU’s addressable memory beyond local HBM3e, creating a capacity-optimized tier accessible across the fabric.

Grace CPU memory metric	Rack value (GB300 NVL72)	Per Grace CPU ( `/36` )	Notes
LPDDR5X capacity	17 TB	~0.472 TB (~472 GB)
LPDDR5X bandwidth	14 TB/s	~0.389 TB/s (~389 GB/s)
Rough per-GPU share of Grace BW	-	~194 GB/s per GPU	Assuming even split across 2 GPUs per Grace CPU

Grace LPDDR5X bandwidth depends on the capacity SKU: up to 512 GB/s for the 120 GB and 240 GB configurations, and up to 384 GB/s for the 480 GB configuration.
GB300 NVL72 averages ~472 GB of LPDDR5X per Grace CPU (17 TB / 36), which aligns with the 480 GB-class configuration.
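The per-CPU and per-GPU figures above are simple divisions of the rack totals. The Python sketch below reproduces them and cross-checks the 480 GB-class reading; variable names are illustrative, and the even split across two GPUs is the same assumption the table makes.

```python
GRACE_CPUS_PER_RACK = 36
GPUS_PER_GRACE = 2

# Rack-level LPDDR5X figures from the table above (decimal units).
RACK_CAPACITY_GB = 17.0 * 1000
RACK_BANDWIDTH_GBS = 14.0 * 1000

per_cpu_capacity_gb = RACK_CAPACITY_GB / GRACE_CPUS_PER_RACK
per_cpu_bandwidth_gbs = RACK_BANDWIDTH_GBS / GRACE_CPUS_PER_RACK
# Assumes Grace bandwidth is shared evenly by the two attached GPUs.
per_gpu_bandwidth_gbs = per_cpu_bandwidth_gbs / GPUS_PER_GRACE

# Cross-check: what a rack of 480 GB / 384 GB/s Grace CPUs would total.
sku_rack_capacity_tb = 480 * GRACE_CPUS_PER_RACK / 1000
sku_rack_bandwidth_tbs = 384 * GRACE_CPUS_PER_RACK / 1000

print(f"per Grace CPU: {per_cpu_capacity_gb:.0f} GB, {per_cpu_bandwidth_gbs:.0f} GB/s")
print(f"per GPU share: {per_gpu_bandwidth_gbs:.0f} GB/s")
print(f"36 x 480 GB SKU: {sku_rack_capacity_tb:.2f} TB, {sku_rack_bandwidth_tbs:.2f} TB/s")
```

A rack of 36 Grace CPUs at 480 GB and 384 GB/s each would total 17.28 TB and about 13.8 TB/s, which is consistent with the rounded 17 TB and 14 TB/s rack figures.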

The NVIDIA GB200 Grace CPU integrates ARM’s high-performance Neoverse V2 cores. However, the implementation does not use the V2 configuration with a 2 MB L2 cache per core, instead employing a reduced 1 MB L2 cache.

Experimental evaluations have reported a materially higher rate of L1 cache misses on GB200 relative to comparable x86-based platforms (see Megatron DeepSeek V3 Training on GB200 NVL72). In addition, the GB200 CX7 does not include an integrated PCIe switch.

As a result, scale-out RDMA traffic must traverse the full Grace network-on-chip (NoC) and then cross the NVLink-C2C interconnect to reach the Blackwell GPU. Most of these problems are solved in the GB300, which uses CX8 and an external PCIe switch.

Multi-node NVLink & NCCL results

Full rack-scale NCCL all-reduce run (18 nodes, 4 GPUs per node) from nccl-tests:

OMPI_ALLOW_RUN_AS_ROOT=1 OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 mpirun -H $(echo pod4-gb300-1-tray{01..18}-f3:4 | tr ' ' , ) --mca btl ^openib --mca btl_openib_warn_no_device_params_found 0 ./build/all_reduce_perf -e 16G -n 1000 -b 1G -f 2 -g 1

Message Size (GB)	Count (elements)	Type	RedOp	Out-of-place time (us)	Out-of-place algbw (GB/s)	Out-of-place busbw (GB/s)	In-place time (us)	In-place algbw (GB/s)	In-place busbw (GB/s)
1	268435456	float	sum	2914.04	368.47	726.71	2917.91	367.98	725.75
2	536870912	float	sum	5253.87	408.74	806.13	5256.57	408.53	805.72
4	1073741824	float	sum	9972.94	430.66	849.36	9982.61	430.25	848.54
8	2147483648	float	sum	19176.2	447.95	883.45	19151.2	448.53	884.61
16	4294967296	float	sum	36878.3	465.85	918.77	36873.3	465.92	918.89

Out-of-bounds values: 0 (OK)
Average bus bandwidth: 836.792 GB/s
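The bandwidth columns can be reproduced from the message sizes and timings. The Python sketch below expands the host list the same way the shell brace expansion does, derives the rank count, and applies the nccl-tests all-reduce convention busbw = algbw x 2(n-1)/n. Variable names are illustrative. It assumes mpirun launches one rank per host slot (the `:4` suffix) and that each rank drives one GPU (`-g 1`), which gives 72 ranks. Message sizes are powers of two (count x 4 bytes per float), while bandwidths are reported in units of 10^9 bytes per second.

```python
# Host list equivalent to: pod4-gb300-1-tray{01..18}-f3:4
hosts = [f"pod4-gb300-1-tray{i:02d}-f3:4" for i in range(1, 19)]
# One rank per slot; the slot count is the number after the colon.
n_ranks = sum(int(h.rsplit(":", 1)[1]) for h in hosts)

BYTES_PER_FLOAT = 4

# Out-of-place rows from the table: (element count, time in microseconds).
rows = [
    (268435456, 2914.04),
    (536870912, 5253.87),
    (1073741824, 9972.94),
    (2147483648, 19176.2),
    (4294967296, 36878.3),
]


def algbw_gbs(count: int, time_us: float) -> float:
    """Algorithm bandwidth: message bytes / elapsed time, in GB/s (1e9 B/s)."""
    return count * BYTES_PER_FLOAT / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbs(algbw: float, ranks: int) -> float:
    """Bus bandwidth for all-reduce as defined by nccl-tests."""
    return algbw * 2 * (ranks - 1) / ranks


results = []
for count, time_us in rows:
    alg = algbw_gbs(count, time_us)
    bus = allreduce_busbw_gbs(alg, n_ranks)
    results.append((alg, bus))
    print(f"{count * BYTES_PER_FLOAT / 2**30:4.0f} GiB  algbw {alg:7.2f}  busbw {bus:7.2f}")
```

With 72 ranks the correction factor is 2 x 71 / 72, about 1.97, which is why the bus bandwidth is nearly double the algorithm bandwidth in every row. The reported average of 836.792 GB/s is the mean over all ten out-of-place and in-place bus bandwidth values.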
