Commit d974a61: update web
zhuhanqing committed Dec 9, 2024 (1 parent: cd61ade)
Showing 1 changed file with 26 additions and 15 deletions.

docs/index.html: 41 changes (26 additions, 15 deletions)
@@ -146,6 +146,16 @@ <h1 class="title is-1 publication-title">APOLLO: SGD-like Memory, AdamW-level Pe
</a>
</span>
</div>
+ <!-- Email Link -->
+ <span class="link-block">
+ <a href="mailto:hqzhu@utexas.edu"
+ class="external-link button is-normal is-rounded is-dark">
+ <span class="icon">
+ <i class="fas fa-envelope"></i>
+ </span>
+ <span>Email</span>
+ </a>
+ </span>

</div>
</div>
@@ -254,7 +264,7 @@ <h3 style="text-align: center; margin-bottom: 10px;">Pre-training LLaMA 7B on C4

<!-- Subtitle below both -->
<h2 class="subtitle has-text-centered" style="margin-top: 20px;">
- <span class="dnerf">APOLLO</span> optimizer significantly reduces memory usage and achieves the best performance.
+ <span class="dnerf">APOLLO</span> optimizer significantly reduces memory usage and achieves the best perplexity in pre-training.
</h2>
</div>
</section>
@@ -276,10 +286,10 @@ <h2 class="subtitle has-text-centered" style="margin-top: 20px;">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
- Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular Adam optimizer. This memory burden often necessitates using more GPUs, smaller batch sizes, or high-end hardware, thereby limiting scalability and training efficiency. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still substantial memory overhead of optimization states in order to maintain competitive performance (e.g., 1/4 rank in GaLore, and full-rank first momentum in Adam-mini).
+ Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden often necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput, respectively. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still substantial memory overhead of optimization states in order to maintain competitive performance (e.g., 1/4 rank in GaLore, and full-rank first momentum in Adam-mini).
</p>
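To make the scale of that overhead concrete, here is a rough back-of-the-envelope sketch in Python. It assumes FP32 (4-byte) optimizer states, as in typical mixed-precision setups, and counts only the optimizer moments, not weights, gradients, or activations; the numbers are illustrative, not taken from the page.

# Rough optimizer-state footprint for a 7B-parameter model.
# Assumption: FP32 (4-byte) moment tensors; master weights, gradients,
# and activation memory are deliberately not counted here.
params = 7e9
bytes_per_value = 4

adamw_states = 2 * params * bytes_per_value   # first + second moment, both full rank
sgd_states = 0                                # plain SGD keeps no optimizer state
print(f"AdamW moments: {adamw_states / 1e9:.0f} GB")  # ~56 GB
print(f"Plain SGD:     {sgd_states:.0f} GB")          # the memory level APOLLO-Mini targets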
<p>
- In this work, we investigate the redundancy in Adam's learning rate adaption rule and identify that it can be coarsened as a structured learning rate update (channel-wise or tensor-wise).
+ In this work, we investigate the redundancy in Adam(W)'s learning rate adaptation rule and identify that it can be coarsened into a structured learning rate update (channel-wise or tensor-wise).
Based on this insight, we propose a novel approach, <span class="has-text-weight-bold has-underline">Ap</span>proximated Gradient Scaling for Mem<span class="has-text-weight-bold has-underline">o</span>ry Efficient <span class="has-text-weight-bold has-underline">LL</span>M <span class="has-text-weight-bold has-underline">O</span>ptimization (<span class="texttt">APOLLO</span>), which approximates the channel-wise learning rate scaling with an auxiliary low-rank optimizer state based on pure <em>random projection</em>.
The structured learning rate update rule makes <span class="texttt">APOLLO</span> highly tolerant to further memory reduction with lower rank, halving the rank while delivering similar pre-training performance.
We further propose an extreme memory-efficient version, <span class="texttt">APOLLO-Mini</span>, which utilizes tensor-wise scaling with only a rank-1 auxiliary sub-space, achieving <strong>SGD-level memory cost</strong> but superior pre-training performance than Adam(W).
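Below is a minimal PyTorch sketch of the structured-scaling idea described above, written only from this description: the gradient of a 2D weight is projected with a fixed random matrix, AdamW-style moments are kept only in that low-rank space, and their effect is reduced to a per-channel scaling of the raw gradient. The function name, shapes, norm-ratio heuristic, and hyperparameters are illustrative assumptions rather than the released APOLLO implementation; bias correction and weight decay are omitted.

import torch

def apollo_like_step(grad, m, v, proj, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update for a 2D gradient `grad` of shape (out_dim, in_dim).

    `proj` is a fixed random projection of shape (in_dim, rank); `m` and `v` are
    AdamW-style moments kept only in the projected space, shape (out_dim, rank),
    so optimizer memory scales with `rank` instead of `in_dim`.
    """
    r = grad @ proj                                # project the gradient into the rank-r subspace
    m.mul_(beta1).add_(r, alpha=1 - beta1)         # low-rank first moment
    v.mul_(beta2).addcmul_(r, r, value=1 - beta2)  # low-rank second moment
    r_hat = m / (v.sqrt() + eps)                   # Adam-style scaled direction, still low rank
    # Structured (channel-wise) learning-rate scaling: estimate how much Adam would
    # rescale each output channel from norm ratios in the projected space.
    # A tensor-wise variant (one scalar for the whole matrix, rank 1) mimics APOLLO-Mini.
    scale = r_hat.norm(dim=1) / (r.norm(dim=1) + eps)
    return lr * scale.unsqueeze(1) * grad          # scaled raw gradient, to be subtracted from W

# Illustrative usage for a single weight matrix.
W = torch.randn(1024, 4096)
grad = torch.randn_like(W)
rank = 64
proj = torch.randn(W.shape[1], rank) / rank ** 0.5  # pure random projection
m = torch.zeros(W.shape[0], rank)
v = torch.zeros(W.shape[0], rank)
W -= apollo_like_step(grad, m, v, proj)

Because the projection is random rather than SVD-based, it is cheap to construct and can, in principle, be regenerated from a seed instead of stored, which is where the further savings over SVD-based projections such as GaLore would come from.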
@@ -288,7 +298,7 @@ <h2 class="title is-3">Abstract</h2>
We conduct extensive experiments across different tasks and model architectures, showing that the <span class="texttt">APOLLO</span> series performs <strong>generally on-par with, or even better than Adam(W)</strong>. Meanwhile, <span class="texttt">APOLLO</span> achieves <strong>even greater memory savings than GaLore</strong>, by almost eliminating the optimization states in AdamW.
These savings translate into significant system benefits:
<ol>
- <li><strong>Enhanced Throughput:</strong> <span class="texttt">APOLLO</span> and <span class="texttt">APOLLO-Mini</span> achieve up to 3× throughput on a 4×A100-80GB setup compared to Adam by fully utilizing memory to support 4× larger batch sizes.</li>
+ <li><strong>Enhanced Throughput:</strong> <span class="texttt">APOLLO</span> and <span class="texttt">APOLLO-Mini</span> achieve up to 3× throughput on an 8×A100-80GB setup compared to Adam by fully utilizing memory to support 4× larger batch sizes.</li>
<li><strong>Improved Model Scalability:</strong> <span class="texttt">APOLLO-Mini</span> <em>for the first time</em> enables pre-training LLaMA-13B model with naive DDP on A100-80G without requiring other system-level optimizations.</li>
<li><strong>Low-End GPU Pre-training:</strong> Combined with quantization, the <span class="texttt">APOLLO</span> series <em>for the first time</em> enables the training of LLaMA-7B from scratch using less than 12 GB of memory.</li>
</ol>
@@ -317,7 +327,7 @@ <h2 class="title is-3 has-text-centered">Train an LLaMA-7B with 3× Throughput C
<div class="column has-text-centered">
<h3 class="title is-5">AdamW Optimizer with micro batch size of 4 at 79GB-80GB memory cost</h3>
<iframe width="100%" height="315"
- src="https://youtu.be/XuNTX7VDViQ?si=-Q8-6_Ny9fTebeOe&t=52"
+ src="https://youtu.be/embed/XuNTX7VDViQ?si=-Q8-6_Ny9fTebeOe&t=52"
title="Adam Throughput"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
@@ -330,7 +340,7 @@ <h3 class="title is-5">AdamW Optimizer with micro batch size of 4 at 79GB-80GB m
<div class="column has-text-centered">
<h3 class="title is-5">APOLLO Optimizer with micro batch size of 16 at 69GB-70GB memory cost</h3>
<iframe width="100%" height="315"
- src="https://youtu.be/s2WxdBGh99k?si=t-M49pNXBiFqwPPY&t=49"
+ src="https://youtu.be/embed/s2WxdBGh99k?si=t-M49pNXBiFqwPPY&t=49"
title="APOLLO Throughput"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
@@ -344,7 +354,7 @@ <h3 class="title is-5">APOLLO Optimizer with micro batch size of 16 at 69GB-70GB
<h2 class="title is-3 has-text-centered">Successful Training for LLaMA-13B Model</h2>
<div class="content has-text-justified">
<p>
- This video showcases the first successful pre-training of the LLaMA-13B model with naive DDP using <strong>APOLLO-Mini</strong>. No other system-level optimizations are required.
+ This video showcases how <strong>APOLLO-Mini</strong> enables the first successful pre-training of the LLaMA-13B model with naive DDP and no other system-level optimizations.
</p>
</div>
<div class="content has-text-centered">
@@ -367,14 +377,15 @@ <h2 class="title is-3 has-text-centered">Pre-train a LLaMA-7B on a NVIDIA TITAN
</p>
</div>
<div class="content has-text-centered">
- <!-- Embed Google Drive Video -->
- <iframe src="https://drive.google.com/file/d/1mx04oEzvBOOmSu5PvO90s9TF4dTwphnX/preview"
- width="75%"
- height="480"
- frameborder="0"
- allow="autoplay; encrypted-media"
- allowfullscreen>
- </iframe>
+ <!-- Embed YouTube Video -->
+ <iframe width="560" height="315"
+ src="https://www.youtube.com/embed/L6Gfiz14EzA?si=WCw3FdNkbtdxYEQy&t=124"
+ title="YouTube video player"
+ frameborder="0"
+ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
+ referrerpolicy="strict-origin-when-cross-origin"
+ allowfullscreen>
+ </iframe>
</div>
</div>
</div>