From d974a61526d9eea3945fdc4ef6ec162b045b5f41 Mon Sep 17 00:00:00 2001
From: hanqing
Date: Sun, 8 Dec 2024 18:08:32 -0600
Subject: [PATCH] update web

---
 docs/index.html | 41 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/docs/index.html b/docs/index.html
index 7e08b57..5525017 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -146,6 +146,16 @@

             APOLLO: SGD-like Memory, AdamW-level Performance
+
+
+
+
+
+
+
+            Email
+
+
@@ -254,7 +264,7 @@

Pre-training LLaMA 7B on C4

-              APOLLO optimizer significantly reduces memory usage and achieves the best performance.
+              APOLLO optimizer significantly reduces memory usage and achieves the best perplexity in pre-training.

@@ -276,10 +286,10 @@

Abstract

-              Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular Adam optimizer. This memory burden often necessitates using more GPUs, smaller batch sizes, or high-end hardware, thereby limiting scalability and training efficiency. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still substantial memory overhead of optimization states in order to maintain competitive performance (e.g., 1/4 rank in GaLore, and full-rank first momentum in Adam-mini).
+              Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden often necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput, respectively. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still substantial memory overhead of optimization states in order to maintain competitive performance (e.g., 1/4 rank in GaLore, and full-rank first momentum in Adam-mini).

-              In this work, we investigate the redundancy in Adam's learning rate adaption rule and identify that it can be coarsened as a structured learning rate update (channel-wise or tensor-wise).
+              In this work, we investigate the redundancy in Adam(W)'s learning rate adaptation rule and identify that it can be coarsened as a structured learning rate update (channel-wise or tensor-wise). Based on this insight, we propose a novel approach, Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO), which approximates the channel-wise learning rate scaling with an auxiliary low-rank optimizer state based on pure random projection. The structured learning rate update rule makes APOLLO highly tolerant to further memory reduction with lower rank, halving the rank while delivering similar pre-training performance. We further propose an extreme memory-efficient version, APOLLO-Mini, which utilizes tensor-wise scaling with only a rank-1 auxiliary sub-space, achieving SGD-level memory cost yet superior pre-training performance compared to Adam(W).
@@ -288,7 +298,7 @@

Abstract

We conduct extensive experiments across different tasks and model architectures, showing that the APOLLO series performs generally on par with, or even better than, Adam(W). Meanwhile, APOLLO achieves even greater memory savings than GaLore by almost eliminating the optimization states in AdamW. These savings translate into significant system benefits:
-    1. Enhanced Throughput: APOLLO and APOLLO-Mini achieve up to 3× throughput on a 4×A100-80GB setup compared to Adam by fully utilizing memory to support 4× larger batch sizes.
+    1. Enhanced Throughput: APOLLO and APOLLO-Mini achieve up to 3× throughput on an 8×A100-80GB setup compared to Adam by fully utilizing memory to support 4× larger batch sizes.
     2. Improved Model Scalability: APOLLO-Mini for the first time enables pre-training LLaMA-13B model with naive DDP on A100-80G without requiring other system-level optimizations.
     3. Low-End GPU Pre-training: Combined with quantization, the APOLLO series for the first time enables the training of LLaMA-7B from scratch using less than 12 GB of memory.
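
A minimal, unofficial sketch of the mechanism the updated abstract describes: Adam-style moments are kept only in a low-rank space obtained by a pure random projection, and the resulting channel-wise scaling factors are applied to the full-rank gradient. Everything below (function name, shapes, hyper-parameters, and the exact scaling formula) is an illustrative assumption, not the official APOLLO implementation.

    import torch

    def channelwise_scaled_update(grad, proj, m, v, step,
                                  beta1=0.9, beta2=0.999, eps=1e-8):
        # grad: (n, d) full-rank gradient of one weight matrix
        # proj: (r, n) fixed random projection, r << n
        # m, v: (r, d) low-rank Adam-style moments (the only optimizer states kept)
        g_lr = proj @ grad                                   # project gradient to rank r
        m.mul_(beta1).add_(g_lr, alpha=1 - beta1)            # low-rank first moment
        v.mul_(beta2).addcmul_(g_lr, g_lr, value=1 - beta2)  # low-rank second moment
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        adam_dir = m_hat / (v_hat.sqrt() + eps)              # Adam-style step in the low-rank space

        # Channel-wise scaling: compare column norms of the low-rank Adam step with
        # the raw projected gradient, then rescale the full-rank gradient column by
        # column. No full-rank moments are ever stored.
        scale = adam_dir.norm(dim=0) / (g_lr.norm(dim=0) + eps)
        return grad * scale.unsqueeze(0)                     # SGD-like, per-channel scaled update

    # Toy usage: one step on a 4096 x 1024 weight's gradient with rank 64.
    n, d, r = 4096, 1024, 64
    grad = torch.randn(n, d)
    proj = torch.randn(r, n) / r ** 0.5                      # pure random projection
    m, v = torch.zeros(r, d), torch.zeros(r, d)
    update = channelwise_scaled_update(grad, proj, m, v, step=1)

    # Tensor-wise variant in the spirit of APOLLO-Mini: with a rank-1 sub-space,
    # the per-channel factors can be collapsed into a single scalar per tensor.
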
@@ -317,7 +327,7 @@

Train an LLaMA-7B with 3× Throughput C

AdamW Optimizer with a micro batch size of 4 at 79-80 GB memory cost

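
A rough back-of-the-envelope estimate of why the memory claims above are plausible: dropping AdamW's two full-rank fp32 moments for a ~7B-parameter model frees on the order of 50 GiB, which can instead hold larger micro-batches or make low-end GPUs viable. The parameter count and precision below are assumptions; weights, gradients, and activations are deliberately not counted.

    # Back-of-the-envelope optimizer-state memory for a ~7B-parameter model.
    # Assumptions: fp32 moments, ~7e9 parameters; weights, gradients, and
    # activations are not included in this estimate.
    GIB = 1024 ** 3
    n_params = 7_000_000_000

    adamw_state_gib = 2 * n_params * 4 / GIB      # two fp32 moments per parameter
    print(f"AdamW optimizer states: ~{adamw_state_gib:.0f} GiB")   # ~52 GiB

    # A rank-1, tensor-wise auxiliary state (the APOLLO-Mini regime described
    # above) keeps roughly O(n + d) floats per (n, d) weight matrix instead of
    # O(n * d), i.e. only a few megabytes in total for a 7B model, so the freed
    # memory can go to larger micro-batches or lower-end GPUs instead.
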