From d974a61526d9eea3945fdc4ef6ec162b045b5f41 Mon Sep 17 00:00:00 2001
From: hanqing
Date: Sun, 8 Dec 2024 18:08:32 -0600
Subject: [PATCH] update web

---
 docs/index.html | 41 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/docs/index.html b/docs/index.html
index 7e08b57..5525017 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -146,6 +146,16 @@

             APOLLO: SGD-like Memory, AdamW-level Performance
+
+
+
+
+
+
+
+            Email
+
+
@@ -254,7 +264,7 @@

Pre-training LLaMA 7B on C4

-              APOLLO optimizer significantly reduces memory usage and achieves the best performance.
+              APOLLO optimizer significantly reduces memory usage and achieves the best perplexity in pre-training.

@@ -276,10 +286,10 @@

Abstract

-              Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular Adam optimizer. This memory burden often necessitates using more GPUs, smaller batch sizes, or high-end hardware, thereby limiting scalability and training efficiency. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still substantial memory overhead of optimization states in order to maintain competitive performance (e.g., 1/4 rank in GaLore, and full-rank first momentum in Adam-mini).
+              Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden often necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput, respectively. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still substantial memory overhead of optimization states in order to maintain competitive performance (e.g., 1/4 rank in GaLore, and full-rank first momentum in Adam-mini).

-              In this work, we investigate the redundancy in Adam's learning rate adaption rule and identify that it can be coarsened as a structured learning rate update (channel-wise or tensor-wise).
+              In this work, we investigate the redundancy in Adam(W)'s learning rate adaptation rule and identify that it can be coarsened as a structured learning rate update (channel-wise or tensor-wise). Based on this insight, we propose a novel approach, Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO), which approximates the channel-wise learning rate scaling with an auxiliary low-rank optimizer state based on pure random projection. The structured learning rate update rule makes APOLLO highly tolerant to further memory reduction with lower rank, halving the rank while delivering similar pre-training performance. We further propose an extreme memory-efficient version, APOLLO-Mini, which utilizes tensor-wise scaling with only a rank-1 auxiliary sub-space, achieving SGD-level memory cost yet superior pre-training performance compared to Adam(W).
@@ -288,7 +298,7 @@

Abstract

We conduct extensive experiments across different tasks and model architectures, showing that the APOLLO series performs generally on par with, or even better than, Adam(W). Meanwhile, APOLLO achieves even greater memory savings than GaLore by almost eliminating the optimization states in AdamW. These savings translate into significant system benefits:
-    1. Enhanced Throughput: APOLLO and APOLLO-Mini achieve up to 3× throughput on a 4×A100-80GB setup compared to Adam by fully utilizing memory to support 4× larger batch sizes.
+    1. Enhanced Throughput: APOLLO and APOLLO-Mini achieve up to 3× throughput on an 8×A100-80GB setup compared to Adam by fully utilizing memory to support 4× larger batch sizes.
     2. Improved Model Scalability: APOLLO-Mini for the first time enables pre-training LLaMA-13B model with naive DDP on A100-80G without requiring other system-level optimizations.
     3. Low-End GPU Pre-training: Combined with quantization, the APOLLO series for the first time enables the training of LLaMA-7B from scratch using less than 12 GB of memory.
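
A minimal, unofficial sketch of the mechanism the updated abstract describes: Adam-style moments are kept only in a low-rank space obtained by a pure random projection, and the resulting channel-wise scaling factors are applied to the full-rank gradient. Everything below (function name, shapes, hyper-parameters, and the exact scaling formula) is an illustrative assumption, not the official APOLLO implementation.

    import torch

    def channelwise_scaled_update(grad, proj, m, v, step,
                                  beta1=0.9, beta2=0.999, eps=1e-8):
        # grad: (n, d) full-rank gradient of one weight matrix
        # proj: (r, n) fixed random projection, r << n
        # m, v: (r, d) low-rank Adam-style moments (the only optimizer states kept)
        g_lr = proj @ grad                                   # project gradient to rank r
        m.mul_(beta1).add_(g_lr, alpha=1 - beta1)            # low-rank first moment
        v.mul_(beta2).addcmul_(g_lr, g_lr, value=1 - beta2)  # low-rank second moment
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        adam_dir = m_hat / (v_hat.sqrt() + eps)              # Adam-style step in the low-rank space

        # Channel-wise scaling: compare column norms of the low-rank Adam step with
        # the raw projected gradient, then rescale the full-rank gradient column by
        # column. No full-rank moments are ever stored.
        scale = adam_dir.norm(dim=0) / (g_lr.norm(dim=0) + eps)
        return grad * scale.unsqueeze(0)                     # SGD-like, per-channel scaled update

    # Toy usage: one step on a 4096 x 1024 weight's gradient with rank 64.
    n, d, r = 4096, 1024, 64
    grad = torch.randn(n, d)
    proj = torch.randn(r, n) / r ** 0.5                      # pure random projection
    m, v = torch.zeros(r, d), torch.zeros(r, d)
    update = channelwise_scaled_update(grad, proj, m, v, step=1)

    # Tensor-wise variant in the spirit of APOLLO-Mini: with a rank-1 sub-space,
    # the per-channel factors can be collapsed into a single scalar per tensor.
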
@@ -317,7 +327,7 @@

Train an LLaMA-7B with 3× Throughput C

AdamW Optimizer with a micro batch size of 4 at 79-80 GB memory cost

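
A rough back-of-the-envelope estimate of why the memory claims above are plausible: dropping AdamW's two full-rank fp32 moments for a ~7B-parameter model frees on the order of 50 GiB, which can instead hold larger micro-batches or make low-end GPUs viable. The parameter count and precision below are assumptions; weights, gradients, and activations are deliberately not counted.

    # Back-of-the-envelope optimizer-state memory for a ~7B-parameter model.
    # Assumptions: fp32 moments, ~7e9 parameters; weights, gradients, and
    # activations are not included in this estimate.
    GIB = 1024 ** 3
    n_params = 7_000_000_000

    adamw_state_gib = 2 * n_params * 4 / GIB      # two fp32 moments per parameter
    print(f"AdamW optimizer states: ~{adamw_state_gib:.0f} GiB")   # ~52 GiB

    # A rank-1, tensor-wise auxiliary state (the APOLLO-Mini regime described
    # above) keeps roughly O(n + d) floats per (n, d) weight matrix instead of
    # O(n * d), i.e. only a few megabytes in total for a 7B model, so the freed
    # memory can go to larger micro-batches or lower-end GPUs instead.
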