
NVFP4 Explained: How NVIDIA Blackwell Unlocks Low-Precision Floating Point

Daniel Obolensky, Riccardo Mereu 17 min read

In this mini-series, we explore NVIDIA NVFP4, a 4-bit floating point format introduced with the Blackwell architecture. In this first part, we focus on what makes NVFP4 practical. In Part 2, we will cover how to write a GEMM kernel for B200 using CuTeDSL.

From Moore's Law to Huang's Law

Moore's Law, traditionally described as the transistor count doubling about every two years, is slowing. As a result, the performance gains of a single GPU chip will eventually reach saturation. At the same time, demand for computing power is accelerating, which drives the need for more efficient GPUs. Jensen Huang, NVIDIA's CEO, notably observed that GPU capabilities have been advancing much faster than those of CPUs (Huang's Law [1]).

As the semiconductor industry approaches the physical limits of transistor scaling, engineers increasingly focus on architecture, packaging, and circuit design to reduce the energy and area required for each computation, thereby improving performance per Watt and overall efficiency.

Making Data Smaller

NVFP4 is an innovative 4-bit floating point format introduced with the NVIDIA Blackwell GPU architecture. NVFP4 builds on the concept of low-bit microscaling floating-point formats (MX Formats) [2] and enhances the OCP MXFP4 by introducing a different scaling pattern.

One crucial point to keep in mind when working with floating point is the fixed budget: the total number of bits used. To understand the trade-offs this constraint introduces, we must distinguish between three concepts that depend on how we allocate the bits of the representation: dynamic range (the span of magnitudes we can represent), precision (how many distinct values fit within that span), and accuracy (how close a stored value is to the true one).

Dynamic Range in FP32, FP16, FP8, and FP4

A binade is one power-of-2 span of the dynamic range: all values sharing the same sign and exponent bits. Counting binades tells us how many such spans fit between the smallest and largest representable values:

| Format | Exponent Bits | Binades | Notes |
|--------|---------------|---------|-------|
| FP32 | 8 | ~277 | Good for most computations |
| FP16 | 5 | ~40 | Good for most activations |
| BF16 | 8 | ~261 | Same range as FP32, but limited precision |
| FP8 E4M3 | 4 | ~18 | Suitable for forward pass |
| FP8 E5M2 | 5 | ~32 | Needed for backward pass |
| FP4 E2M1 | 2 | ~3.6 | Needs additional tricks to work |

Going to 4 bits yields only ~3.6 binades, far too few to cover typical tensor value distributions, which often span 10-20 binades. This is precisely why block scaling becomes essential at 4-bit precision.
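The binade counts above can be reproduced with a few lines of Python. The (min, max) pairs below are the standard smallest positive subnormal and largest finite value for each format:

```python
import math

def binades(max_value, min_value):
    """Binades spanned between the smallest and largest positive values."""
    return math.log2(max_value / min_value)

# (smallest positive subnormal, largest finite value) per format.
formats = {
    "FP16":     (2.0**-24, 65504.0),
    "FP8 E5M2": (2.0**-16, 57344.0),
    "FP8 E4M3": (2.0**-9, 448.0),
    "FP4 E2M1": (2.0**-1, 6.0),
}

for name, (lo, hi) in formats.items():
    print(f"{name}: ~{binades(hi, lo):.1f} binades")
```

This reproduces the ~40, ~32, ~18, and ~3.6 figures from the table.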

Figure 1. The figure summarizes the different floating point formats discussed in this post. Chronologically: FP32; BF16; FP8, which usually uses only a tensor-level scale factor; the MXFP* formats, which use a 32-element block-level E8M0 scale factor (just the exponent of an FP32); and finally NVFP4, which uses a combination of a 16-element block-level fractional E4M3 scale and a full FP32 tensor-level scale.

As an example, if we want to represent $\pi$ with a fixed number of decimal digits, we will end up with different approximations. We know that $\pi = 3.14159265\ldots$ If we were limited to a budget of three decimal digits, we could have two choices:

  1. $3.141$, which has an absolute error of about $5.9 \times 10^{-4}$,
  2. $3.142$, which has an absolute error of about $4.1 \times 10^{-4}$.

Both approximations use the same budget in terms of digits but achieve a different accuracy in representing the value we want to use in our computations.

Using more digits can give a more accurate representation, but a more precise representation is not automatically a more accurate one. Let's clarify with an example. If we use six digits we could for instance end up storing:

  1. $3.141543$, which has an absolute error of about $5.0 \times 10^{-5}$,
  2. $3.142738$, which has an absolute error of about $1.1 \times 10^{-3}$.

As we can see in Figure 2, the values try to capture a good approximation of the real value of $\pi$; varying the budget for our representation leads to different solutions and tradeoffs. This simple example shows clearly that the choice of numerical representation greatly affects the outcome of computations.
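As a quick sanity check, the absolute errors of the candidate approximations of $\pi$ shown in Figure 2 can be computed directly:

```python
import math

# Decimal approximations of pi from the example above.
for x in (3.141, 3.142, 3.141543, 3.142738):
    print(f"{x:<9} absolute error = {abs(math.pi - x):.2e}")
```

Note that 3.142738 carries six digits yet lands further from $\pi$ than the three-digit 3.141.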

Figure 2. Illustration of precision vs. accuracy using different approximations of $\pi$. A more precise representation, e.g., more decimal digits here, does not necessarily mean higher accuracy (closer to the true value). The three examples show: 3.141 (low precision, moderate accuracy), 3.141543 (higher precision and accuracy), and 3.142738 (higher precision but lower accuracy than 3.141).

The real number line allows for infinite precision, but silicon and memory are finite. Using a floating point representation, we can sample the real line using three bit fields:

  1. Sign (S): Positive or negative.
  2. Exponent (E): The dynamic range (which power of 2 is used).
  3. Mantissa (M): The precision (samples between powers of two).

The mathematical representation is defined as:

$$
x = (-1)^S \times 1.M \times 2^{E - \text{bias}}
$$

where $1.M$ denotes the significand (the implicit leading 1 followed by the mantissa bits) and the bias is 127 for FP32. Let's break down the formula:

In normalized floating point representation, the significand always starts with an implicit leading 1 (this is why it's called "normalized"). The mantissa bits, e.g., 1001001000, represent the fractional digits that come after this implicit 1, forming the complete significand 1.1001001000 in binary. Each bit position corresponds to a negative power of 2: the first bit after the binary point represents $2^{-1}$, the second $2^{-2}$, the third $2^{-3}$, and so on.
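To make this concrete, a short stdlib-only Python sketch can extract the three bit fields from an FP32 value:

```python
import struct

def fp32_fields(x):
    """Split an FP32 value into its sign, exponent, and mantissa bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # 8 exponent bits, bias 127
    mantissa = bits & 0x7FFFFF      # 23 mantissa bits
    return sign, exponent, mantissa

# 1.5 = (+1) * 1.1_2 * 2^0: biased exponent 0 + 127, mantissa 0b100...0
print(fp32_fields(1.5))  # (0, 127, 4194304)
```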

When the exponent field is all zeros (E = 0), the number becomes subnormal (also called denormalized). Instead of an implicit leading 1, subnormals use an implicit leading 0, forming significands like 0.1001001000. The subnormal floats are a linearly spaced set of values, which span the gap between the negative and positive normal floats. This provides a gradual underflow to zero rather than an abrupt jump and mitigates underflows.

For FP32, the smallest normalized number is approximately $2^{-126} \approx 1.18 \times 10^{-38}$ (with significand 1.0). Subnormals fill the gap between zero and this value, with the smallest subnormal being approximately $2^{-149} \approx 1.4 \times 10^{-45}$ (significand 0.00...01 with 23 mantissa bits).

Number Line Around Zero (FP32 example):

Exponent:    E=0 (subnormal)         E=1 (normalized)    E=2      E=3
             ├──────────────────────┼───────────────────┼────────┼─────────>
             |                      |                   |        |
             0                   2^-126              2^-125   2^-124

             * * * * * * * * *      *   *   *   *      *   *    *    *
             ^               ^      ^
             0.            2^-149  2^-126 (smallest normalized)
                      (smallest subnormal)

Without subnormals, any calculation producing a value smaller than $2^{-126}$ would immediately round to zero, causing abrupt precision loss. Subnormals smooth this transition by trading dynamic range (giving up the implicit leading 1) for finer granularity near zero.
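These FP32 boundary values can be verified by reinterpreting raw bit patterns (stdlib only; the helper name `f32` is ours):

```python
import struct

def f32(bits):
    """Reinterpret a 32-bit pattern as the FP32 value it encodes."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_subnormal = f32(0x00000001)  # E=0, M=0b0...01 -> 2**-149
largest_subnormal  = f32(0x007FFFFF)  # E=0, all mantissa bits set
smallest_normal    = f32(0x00800000)  # E=1, M=0        -> 2**-126

print(smallest_subnormal)  # ~1.4e-45
print(smallest_normal)     # ~1.18e-38

# Subnormals tile the gap below the smallest normal in uniform
# steps of 2**-149, giving gradual rather than abrupt underflow:
assert largest_subnormal == smallest_normal - smallest_subnormal
```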

Microscaling (MX) Formats

While DeepSeek-V3 demonstrates that FP8 is viable with careful engineering, the desire for efficiency pushed AI workloads toward even smaller formats like 6-bit or 4-bit. At these precisions, standard per-tensor scaling breaks down. A single large outlier in a tensor of millions of parameters can skew the quantization scale, effectively pushing all smaller values to zero.

To make these low-precision formats practical, a consortium of tech companies, including AMD, Arm, Intel, NVIDIA, and Qualcomm, aligned under the Open Compute Project (OCP) to introduce the specification of the Microscaling Formats [2].

The core idea is moving from per-tensor to per-block scaling. Instead of assigning one scaling factor to an entire tensor, the tensor is divided into small blocks (e.g., 32 elements), each with its own shared 8-bit scale exponent.
How it works:

  1. Block grouping: Elements are grouped into blocks of $k$ elements (typically $k=32$).
  2. Shared per-block scale: The hardware finds the maximum absolute value in each block to determine a shared 8-bit exponent.
  3. Local quantization: Individual elements are quantized to 4 bits relative to their block's scale.

A simple example (real blocks use 32 elements): Consider a block of 4 values: [0.001, 0.002, 100.0, 0.003]. With per-tensor scaling, the scale would be dominated by 100.0, and the small values would all round to zero. With per-block scaling, if this block gets its own scale, the outlier only affects these 4 neighbors; the rest of the tensor remains well-quantized. This compartmentalization of numerical noise is the key breakthrough enabling training at 4-bit precision.
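A toy quantizer makes the difference visible. The helper below is illustrative (not any library's API), but the grid is the real set of non-negative FP4 E2M1 values:

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 values

def quantize(values, scale):
    """Round each value/scale to the nearest FP4 grid point, then rescale."""
    out = []
    for v in values:
        q = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(q * scale if v >= 0 else -q * scale)
    return out

tensor = [0.001, 0.002, 100.0, 0.003]

# Per-tensor scaling: one scale, chosen so the global max maps to 6.
per_tensor = quantize(tensor, max(abs(v) for v in tensor) / 6.0)
print(per_tensor)  # the small values collapse to 0.0; only 100.0 survives

# Per-block scaling: the small values form their own block with their own scale.
block = [0.001, 0.002, 0.003]
per_block = quantize(block, max(abs(v) for v in block) / 6.0)
print(per_block)   # all three remain distinguishable
```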

What Is NVFP4?

Building on the MX foundation, NVIDIA developed NVFP4 for their Blackwell architecture, adding hardware-specific refinements to push the limits of low-bit training.

NVFP4 is a 4-bit floating point format (E2M1): 1 sign bit, 2 exponent bits, and 1 mantissa bit.

With only 16 unique encodings available in a 4-bit representation, careful scaling becomes critical. To put this in perspective: FP32 can represent ~4 billion distinct values, while NVFP4 offers just 8 non-negative magnitudes, spanning approximately −6 to 6 overall. The representable values in the non-negative range are 0.0, 0.5, 1.0, 1.5, 2, 3, 4, and 6 (mirrored for the negative range).

While the OCP MX specification uses 32-element blocks, NVIDIA relies on a finer granularity: 16-element blocks. By calculating the shared scale factor over fewer elements, NVFP4 confines outliers more tightly, i.e., a single spike distorts a smaller neighborhood, preserving fidelity in surrounding weights.
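Beyond the smaller block, NVFP4's scale itself is finer-grained: MXFP4 stores a power-of-two (E8M0) scale, while NVFP4 stores a fractional FP8 (E4M3) scale. The sketch below illustrates the effect; the rounding helpers are our simplification (rounded up so nothing overflows, ignoring exponent-range clamping), not the exact hardware or OCP algorithm:

```python
import math

FP4_MAX = 6.0  # largest E2M1 magnitude

def e8m0_scale(block_max):
    """Power-of-two scale (OCP MXFP4 style), rounded up so nothing overflows."""
    return 2.0 ** math.ceil(math.log2(block_max / FP4_MAX))

def e4m3_scale(block_max):
    """Fractional E4M3-style scale: 1 implicit + 3 mantissa bits, rounded up."""
    ideal = block_max / FP4_MAX
    e = math.floor(math.log2(ideal))
    frac = math.ceil(ideal / 2.0**e * 8) / 8  # keep 3 fractional bits
    return frac * 2.0**e

block_max = 5.0
# How much of the FP4 range [0, 6] the scaled block max actually uses:
print(block_max / e8m0_scale(block_max))  # 5.0   (scale 1.0: coarse)
print(block_max / e4m3_scale(block_max))  # ~5.71 (scale 0.875: near-ideal)
```

The fractional scale places the block maximum much closer to the FP4 endpoint, so the 8 available magnitudes cover the block's values more evenly.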

Figure 3. A 16×32 matrix stored in NVFP4 format. Each block contains 16 contiguous FP4 elements (gray and green) with a shared FP8 scale factor (yellow). The largest-magnitude element in each block (green) is scaled to the FP4 maximum representable value. A per-tensor FP32 scale factor is also applied (not shown). (Source [3])

Algorithmic Interventions in NVFP4

Hardware support is only half the story. Training a model in 4-bit precision without diverging into noise requires specific algorithmic interventions, as detailed in NVIDIA's paper "Pretraining Large Language Models with NVFP4" [3].

2D Block Scaling

Scaling is applied along both row-wise and column-wise dimensions for weight matrices (16×16 blocks). Why both? During forward pass, scaling happens along rows; during backward pass, tensors are transposed, so scaling happens along columns. Without 2D scaling, the same weight would have two different quantized representations, breaking the chain rule and degrading training quality.

Random Hadamard Transform (RHT)

One of the biggest enemies of quantization is "outlier features"—specific neurons that consistently fire with massive values. These outliers can wreck the quantization scale for their entire block.
The Random Hadamard Transform "smears" outlier information across the entire vector before quantization:

$$
\mathbf{x}' = \frac{1}{\sqrt{d}}\, \mathbf{H}_d\, \mathbf{D}\, \mathbf{x}
$$

where $\mathbf{H}_d$ is a $d \times d$ Hadamard matrix and $\mathbf{D}$ is a diagonal matrix of random signs. This operation redistributes energy so that no single element dominates the scale calculation.
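A stdlib-only sketch shows the smearing on a 16-element block. The helper names (`hadamard`, `rht`) and the fixed seed are ours for illustration:

```python
import random

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def rht(x, seed=0):
    """Random Hadamard transform: random sign flips, then multiply by H/sqrt(n)."""
    n = len(x)
    rng = random.Random(seed)
    flipped = [rng.choice((-1, 1)) * v for v in x]
    H = hadamard(n)
    return [sum(H[i][j] * flipped[j] for j in range(n)) / n**0.5
            for i in range(n)]

# One outlier dominates the 16-element block before the transform...
x = [0.01] * 15 + [100.0]
y = rht(x)
print(max(abs(v) for v in x))  # 100.0
print(max(abs(v) for v in y))  # ~25: outlier spread by a factor of sqrt(16) = 4
```

Because $\mathbf{H}/\sqrt{n}$ is orthogonal, the total energy of the block is preserved; only its distribution across elements changes.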

Stochastic Rounding (SR)

With very few bits, standard "round-to-nearest" creates systematic bias. Always rounding 0.4 down to 0 accumulates massive error over billions of operations.

NVFP4 uses stochastic rounding, which rounds probabilistically based on the distance to the nearest representable values:

$$
\mathrm{SR}(x) = \begin{cases} \lfloor x \rfloor & \text{with probability } 1 - p \\ \lceil x \rceil & \text{with probability } p \end{cases}
$$

where

$$
p = \frac{x - \lfloor x \rfloor}{\lceil x \rceil - \lfloor x \rfloor}
$$

with $\lfloor x \rfloor$ and $\lceil x \rceil$ denoting the nearest representable values below and above $x$.
This ensures that on average, the expected value of the rounded number equals the original. Over many operations, rounding errors cancel out rather than accumulate in one direction, allowing gradient descent to converge correctly despite the severe quantization.
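The unbiasedness is easy to verify empirically. The sketch below uses an integer grid for simplicity (the real format rounds to FP4 grid points, but the principle is identical):

```python
import math
import random

def stochastic_round(x, rng):
    """Round x to floor(x) or ceil(x) with probability proportional to
    proximity, so that E[SR(x)] == x (integer grid for simplicity)."""
    lo = math.floor(x)
    return lo + (1 if rng.random() < x - lo else 0)

rng = random.Random(0)
n = 100_000
mean = sum(stochastic_round(0.4, rng) for _ in range(n)) / n
print(mean)  # close to 0.4, while round-to-nearest would always give 0
```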

Figure 4. Illustration of the compute flow for an NVFP4 quantized linear layer. All GEMM operations quantize their inputs to NVFP4. (Source [3])

NVIDIA Blackwell GPU Architecture

A prerequisite for writing performant GPU kernels is to understand the hardware architecture inside the GPU and how data flows through it. A GPU is optimized for exactly two operations: arithmetic operations on data, and the movement/storage of that data between hierarchical memory pools. The efficiency of the former is bounded by transistor physics; the efficiency of the latter is bounded by the speed of light and wire capacitance.

NOTE: The following information is based on the Blackwell architecture of the B200 GPU, and all further mentions of 'GPU' refer to the B200. When we speak about matrix multiplication, the operation we refer to has the form: D = A * B or D = A * B + C

Movement of Data: Physical Hierarchy

The memory architecture of modern GPUs is a spatially organized response to the inverse relationship between latency and density in CMOS design. If they could, NVIDIA would place every compute unit adjacent to register-speed memory. Instead, contemporary GPUs pack threads across 148 Streaming Multiprocessors (SMs) on 2 distinct dies, connected by a 10TB/s NV-HBI interconnect. The GPU places the fastest and smallest memory closest to the compute hardware and hosts progressively larger but slower memory tiers further away, maximizing aggregate throughput while keeping average latency low.

The memory hierarchy in GPUs is organized as follows:

  1. Register File (RMEM): The fastest storage, placed adjacent to SMs. Each SM has 256KB of register files, summing to ~37.75MB across the entire GPU. Since each SM has the ability to host up to 64 warps (i.e., 2,048 threads), each thread has 128 bytes of register storage when the GPU is fully populated.

  2. Tensor Memory (TMEM): An addition introduced with the Blackwell architecture: 256KB per SM of dedicated SRAM accessible by the Tensor Cores (more on these later). TMEM plays an important role in GEMM, so visualizing it is crucial: it is a 2D array of 32-bit cells with 512 columns and 128 rows, or lanes. TMEM functions as a loading dock for matrix multiply-accumulate (MMA) tiles. Its introduction sidesteps hardware cache heuristics and gives kernels manual control over the access patterns of tensor tiles. In an MMA, matrix A may reside in TMEM or SMEM, matrix B must be in SMEM, and the accumulator must be in TMEM.

Figure 5. Tensor Memory (TMEM) layout: 512 columns × 128 rows of 32-bit cells per SM, serving as a loading dock for MMA tiles. (Source [4])

  3. Shared Memory (SMEM) and L1 Cache: A unified 256KB SRAM structure per SM.

  4. L2 Cache: 192MB of SRAM physically split across the dual-die boundary; each die maintains partitions local to its 74 SMs, with coherency traffic traversing the NV-HBI interconnect.

  5. Tensor Memory Accelerator (TMA): Introduced in Hopper to offload address generation from the register file via descriptor-based async copies; refined in Blackwell with second-generation sub-byte data handling (FP4/FP6 unpacking) and CTA-pair semantics. This change in the Blackwell architecture lets thread blocks share distributed SMEM and execute cooperative MMAs across paired SMs.

  6. VRAM (GMEM): The largest but most distant and slowest memory tier in the hierarchy. For B200, it consists of 192GB of HBM3e stacked on the same substrate as the GPU dies. The GPU implements 8 memory stacks (4 per die), delivering an aggregate bandwidth of 8 TB/s.

Computation of Data: Visualizing the Bits Flow

Excluding L2 and GMEM, the rest of the memory we describe above is placed inside the SM, which also houses the specialized compute units and optimized datapaths that allow for high-throughput matrix operations.

  1. Tensor Cores: The fundamental units of the GPU that execute MMA instructions on small tiles. NVIDIA introduced these hardware cores in 2017 to rival Google's 2016 release of their systolic-array TPUs. The earliest Tensor Core, introduced in the Volta architecture, could handle only FP16 data types and operate on matrices of size 4x4x4. Moreover, it lacked sparsity support and received data slowly from SMEM and the register file. Four generations later, each iteration has increased the computation-to-memory ratio and added support for lower-precision types. The 5th-generation Tensor Cores in Blackwell can handle low-precision FP4, use CTA pairs across paired SMs, and include TMEM to reduce pressure on SMEM.

  2. Warp Schedulers: A warp consists of 32 threads, and the scheduler issues instructions to all 32 per clock cycle.

  3. LD/ST Units: The load/store units responsible for moving data. All active threads in a warp issue the same type of instruction in a given clock cycle; if that instruction is a load or store, it is dispatched to the LD/ST units. If a thread is inactive (due to looping or conditional execution), the corresponding LD/ST unit stays idle.

Figure 6. Breakdown of the Streaming Multiprocessor (SM) in the Blackwell architecture. (Source [5])

Conclusion

The drive for efficiency continues to push AI workloads toward low-bit formats. NVFP4, introduced by NVIDIA for the Blackwell architecture, makes 4-bit floating point viable through hardware-algorithm co-design.

In Part 2 of this mini-series, we'll show how to write a GEMM kernel for B200 using CuTeDSL. If you'd like to get notified when Part 2 goes live, subscribe to our newsletter below.

References

  1. Wikipedia article on Huang's Law
  2. OCP Microscaling Formats
  3. "Pretraining Large Language Models with NVFP4", NVIDIA Research
  4. NVIDIA's Tensor Memory Addressing
  5. Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era
  6. Cutlass Tutorial: GEMM with Thread Block Clusters on NVIDIA Blackwell GPUs
  7. Blackwell TensorCore Architecture (WeChat, Chinese)
  8. TensorCore Architecture and Big O Notation (WeChat, Chinese)

Subscribe to our newsletter

Get the latest updates on GPU benchmarks and AI research