COAT: Compressing Optimizer states and Activation
for Memory-Efficient FP8 Training

Enable large model training with limited resources

Haocheng Xi1, Han Cai2, Ligeng Zhu2, Yao Lu2, Kurt Keutzer1, Jianfei Chen4, Song Han2,3

1University of California, Berkeley, 2NVIDIA, 3MIT, 4Tsinghua University


About COAT

We introduce COAT (Compressing Optimizer states and Activation for Memory-Efficient FP8 Training), a novel method designed to optimize the memory efficiency of training large models by compressing optimizer states and activations using FP8 quantization.

Key Innovations include:
    -   Dynamic Range Expansion, which aligns optimizer state distributions more closely with the FP8 representation range, thereby reducing quantization error.
    -   Mixed-Granularity Activation Quantization, which optimizes activation memory using a combination of per-tensor and per-group quantization strategies.

COAT reduces the end-to-end training memory footprint by 1.54× and speeds up end-to-end training by 1.43×, while maintaining model accuracy. It can also double the training batch size, leading to better GPU utilization.

By leveraging FP8 precision, COAT enables efficient full-parameter training of large models on fewer GPUs, and facilitates doubling the batch size in distributed training settings, providing a practical solution for scaling large-scale model training.

Part 1: FP8 Optimizer States

Difficulty of FP8 quantization for optimizer states

We find that current quantization methods cannot fully utilize the representation range of FP8, which leads to a large quantization error when optimizer states are quantized with per-group quantization. For the E4M3 format, the dynamic range of a quantization group X should ideally cover the entire span between the minimum representable value of E4M3 (0.00195) and the maximum representable value of E4M3 (448), so that its representation ability is fully used. In practice, this range is under-utilized: the dynamic range of E4M3 is about 2e5, but the dynamic range of the first-order momentum is usually about 1e3, and that of the second-order momentum is usually about 1e1. This makes the quantization error large.
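The gap described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the sample group and the helper name `dynamic_range` are hypothetical, and the log-scale "utilization" figure is just one convenient way to express how much of E4M3's range a group occupies.

```python
import math

E4M3_MAX = 448.0        # largest finite E4M3 value
E4M3_MIN = 2.0 ** -9    # smallest positive E4M3 value, about 0.00195

def dynamic_range(values):
    """Ratio of the largest to the smallest non-zero magnitude in a group."""
    mags = [abs(v) for v in values if v != 0.0]
    return max(mags) / min(mags)

# Dynamic range of the E4M3 format itself: 448 / 2^-9 = 229376, about 2e5
e4m3_range = E4M3_MAX / E4M3_MIN

# Hypothetical quantization group spanning about three orders of magnitude,
# similar to what the text reports for first-order momentum
group = [1e-6, -4e-5, 3e-4, -1e-3]
group_range = dynamic_range(group)

# Share of E4M3's range (on a log scale) that this group actually covers
utilization = math.log(group_range) / math.log(e4m3_range)
print(f"E4M3 range: {e4m3_range:.0f}, group range: {group_range:.0f}, "
      f"log-scale utilization: {utilization:.2f}")
```

With a range of about 1e3, the group covers only a little over half of E4M3's range on a log scale; a second-order-momentum-like range of 1e1 would cover far less.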

Under-utilized dynamic range of FP8

Our Solution: Dynamic Range Expansion

We introduce an expand function f(·) that is applied before quantization to expand the dynamic range of the quantization group and align it with that of E4M3. The expand function we use is:

$$ f(x) = \operatorname{sign}(x) \cdot |x|^k, $$

where k is a parameter we calculate on-the-fly. When k > 1, the dynamic range is enlarged and becomes closer to the dynamic range of E4M3. The optimal k can be calculated directly, and with it the group fully utilizes the representation range of E4M3, whereas the original quantization method utilizes only a small portion of it. As a result, dynamic range expansion greatly reduces the quantization error. We find that E4M3 is more suitable than E5M2 for the first-order momentum. For the second-order momentum, although E4M3 is better than E5M2 in the original setting, their quantization errors are nearly the same after applying our expand function. We therefore propose to use either an E4M3 + E4M3 or an E4M3 + E5M2 quantization strategy when quantizing the optimizer states.
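The text states that the optimal k can be calculated directly but does not give the formula on this page. One natural reading, assumed in the sketch below, is to pick k so that the expanded group has exactly the dynamic range of E4M3: because f raises a group's range R_X to R_X^k, this gives k = log(R_E4M3) / log(R_X). The function names and the sample group are hypothetical, and the code shows only the expand and inverse steps, not the FP8 quantization itself.

```python
import math

E4M3_RANGE = 448.0 / 2.0 ** -9   # max / min positive E4M3 value, about 2e5

def expansion_exponent(values):
    """Assumed rule: choose k so that (group range)^k equals the E4M3 range."""
    mags = [abs(v) for v in values if v != 0.0]
    group_range = max(mags) / min(mags)
    if group_range <= 1.0:       # all magnitudes equal: nothing to expand
        return 1.0
    return math.log(E4M3_RANGE) / math.log(group_range)

def expand(x, k):
    """f(x) = sign(x) * |x|^k, applied before quantization."""
    return math.copysign(abs(x) ** k, x)

def contract(y, k):
    """Inverse of expand, applied after dequantization."""
    return math.copysign(abs(y) ** (1.0 / k), y)

# Hypothetical group with a dynamic range of about 1e3
group = [1e-6, -4e-5, 3e-4, -1e-3]
k = expansion_exponent(group)
expanded = [expand(v, k) for v in group]
restored = [contract(v, k) for v in expanded]

expanded_mags = [abs(v) for v in expanded]
expanded_range = max(expanded_mags) / min(expanded_mags)
print(f"k = {k:.3f}, expanded range = {expanded_range:.0f}")
```

In this sketch k comes out greater than 1, signs are preserved, and the expanded range matches the E4M3 range up to floating-point error. A real pipeline would also store k per group so that the inverse can be applied after dequantization.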


Part 2: FP8 Activation

Motivation: Non-linear layers account for a large memory footprint

In the forward pass of neural networks, activations must be preserved for the backward pass to calculate gradients. Non-linear layers typically account for approximately 50% of the memory footprint in the Llama model series. In contrast, linear layers contribute less than 25%. Therefore, it is essential to optimize both linear and non-linear layers to reduce activation memory footprint.


Our Solution: Mixed Granularity FP8 Precision Flow

FP8 precision flow requires the inputs and outputs of all linear and non-linear layers to be in FP8. By directly saving the input tensor in FP8 format for the backward pass, we eliminate the need for an extra quantization operation and its associated overhead. FP8 precision flow naturally reduces the activation memory footprint of both non-linear and linear layers by 50%, since they only need to save FP8 activations instead of BF16. To further improve the accuracy of this method, we vary the quantization granularity across layers, balancing precision and efficiency in a mixed-granularity manner. For non-linear layers, VS-Quant or PerBlock Quant methods are well suited because they are fine-grained and precise. For linear layers, we apply per-tensor quantization to maximize the performance of Tensor Cores. We observe that quantizing the input of layernorm across multiple token axes is detrimental to accuracy, and therefore apply per-group quantization to non-linear layers.
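The precision trade-off between the two granularities can be illustrated with a small simulation. The sketch below is not the COAT kernel: `round_to_e4m3` is a hypothetical pure-Python stand-in for FP8 casting, the scale is assumed to be amax / 448, and the input row is made up so that one outlier dominates the tensor-wide maximum.

```python
import math

E4M3_MAX = 448.0

def round_to_e4m3(x):
    """Simulated cast to E4M3: 3 mantissa bits, min exponent -6, saturate at 448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(mag)          # mag = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)              # binade exponent, clamped for subnormals
    step = 2.0 ** (e - 3)             # spacing of representable values
    return math.copysign(round(mag / step) * step, x)

def scale_for(values):
    """Assumed scaling rule: map the largest magnitude to the E4M3 maximum."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0.0 else 1.0

def fake_quantize(values, scale):
    """Scale into FP8 range, round, and scale back."""
    return [round_to_e4m3(v / scale) * scale for v in values]

def per_tensor(rows):
    scale = scale_for([v for row in rows for v in row])
    return [fake_quantize(row, scale) for row in rows]

def per_group(rows, group_size):
    out = []
    for row in rows:
        q_row = []
        for start in range(0, len(row), group_size):
            group = row[start:start + group_size]
            q_row.extend(fake_quantize(group, scale_for(group)))
        out.append(q_row)
    return out

# One row: two large values followed by two small ones
rows = [[1000.0, 500.0, 0.001, 0.002]]
pt = per_tensor(rows)
pg = per_group(rows, group_size=2)
print("per-tensor:", pt[0])
print("per-group: ", pg[0])
```

In this made-up case the single per-tensor scale flushes the two small values to zero, while per-group scaling keeps them. That is the kind of precision gap that motivates finer granularity for non-linear layers, with per-tensor scaling reserved for linear layers where Tensor Core throughput matters most.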

Group Scaling: Efficient Just-in-time Scaling

To perform per-tensor quantization, the maximum absolute value of the tensor must be calculated through a max reduction, which adds significant overhead. Group Scaling addresses this problem by splitting the max reduction into two stages: (1) performing max reduction over each 1 × G group of elements and storing the results as intermediate values; (2) applying max reduction on the intermediate tensor to obtain the per-tensor max value. The first stage can be seamlessly fused with the previous operation, adding minimal overhead, while the second stage is more efficient than a max reduction over the entire tensor, because the intermediate result is G× smaller than the original tensor.
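The two stages can be sketched in plain Python as follows. This only shows the data flow; the function names and the sample tensor are hypothetical, and in a real implementation stage 1 would be fused into the GPU kernel of the preceding operation rather than run as a separate loop.

```python
def group_maxes(row, group_size):
    """Stage 1: max |x| over each 1 x G group of a row."""
    return [
        max(abs(v) for v in row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]

def per_tensor_amax(rows, group_size):
    """Two-stage max reduction that yields the per-tensor max |x|."""
    # Stage 1: per-group maxima, stored as an intermediate tensor
    intermediate = [group_maxes(row, group_size) for row in rows]
    # Stage 2: reduce the intermediate tensor, which is G times smaller
    amax = max(m for row in intermediate for m in row)
    return amax, intermediate

# Hypothetical 2 x 4 activation tensor with G = 2
rows = [[1.0, -7.0, 3.0, 2.0],
        [0.5, -0.25, 9.0, -4.0]]
amax, intermediate = per_tensor_amax(rows, group_size=2)
print(intermediate, amax)
```

The result is identical to a direct max over all elements; the only difference is where the work happens and how much data the final reduction has to read.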


Memory Saving, Speedup, and Accuracy

Strong end-to-end memory saving and speedup ability

In all multi-GPU training settings, COAT can double the micro-batch size, which leads to even higher speedup. For example, our method achieves a $2.25\times$ speedup when training Llama-2-13B on 4 GPUs, since we can effectively increase the batch size to 2.
Overall, COAT significantly reduces end-to-end memory usage by up to $1.55\times$ and speeds up the end-to-end training by nearly $1.44\times$. This facilitates full-parameter training on fewer GPUs, which is particularly beneficial for larger language models.

COAT performance

Accuracy Experiments - OLMo pretraining

We perform large language model pretraining with OLMo-1B and OLMo-7B on Dolma, following the official report. The training curves and downstream task performance are consistent with the BF16 training and TransformerEngine baselines, validating the effectiveness of COAT.


Downstream Application - Image Captioning

We validate the effectiveness of our method on real-world examples. On the image captioning task, we find that the model trained with COAT can accurately summarize the figure and identify its key points, on par with models trained in BF16.


BibTeX

@misc{xi2024coatcompressingoptimizerstates,
      title={COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training},
      author={Haocheng Xi and Han Cai and Ligeng Zhu and Yao Lu and Kurt Keutzer and Jianfei Chen and Song Han},
      year={2024},
      eprint={2410.19313},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.19313},
}