<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head>
<meta charset="utf-8">
<meta name="generator" content="quarto-1.6.40">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<meta name="author" content="Sam Foreman">
<meta name="dcterms.date" content="2024-10-29">
<title>Deep Learning and Foundation Models at Scale – Sam Foreman</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
div.columns{display: flex; gap: min(4vw, 1.5em);}
div.column{flex: auto; overflow-x: auto;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
ul.task-list li input[type="checkbox"] {
width: 0.8em;
margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */
vertical-align: middle;
}
/* CSS for syntax highlighting */
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
}
pre.numberSource { margin-left: 3em; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
/* CSS for citations */
div.csl-bib-body { }
div.csl-entry {
clear: both;
margin-bottom: 0em;
}
.hanging-indent div.csl-entry {
margin-left:2em;
text-indent:-2em;
}
div.csl-left-margin {
min-width:2em;
float:left;
}
div.csl-right-inline {
margin-left:2em;
padding-left:1em;
}
div.csl-indent {
margin-left: 2em;
}</style>
<script src="../../site_libs/quarto-nav/quarto-nav.js"></script>
<script src="../../site_libs/quarto-nav/headroom.min.js"></script>
<script src="../../site_libs/clipboard/clipboard.min.js"></script>
<script src="../../site_libs/quarto-search/autocomplete.umd.js"></script>
<script src="../../site_libs/quarto-search/fuse.min.js"></script>
<script src="../../site_libs/quarto-search/quarto-search.js"></script>
<meta name="quarto:offset" content="../../">
<link href="../../assets/favicon.svg" rel="icon" type="image/svg+xml">
<script src="../../site_libs/quarto-html/quarto.js"></script>
<script src="../../site_libs/quarto-html/popper.min.js"></script>
<script src="../../site_libs/quarto-html/tippy.umd.min.js"></script>
<script src="../../site_libs/quarto-html/anchor.min.js"></script>
<link href="../../site_libs/quarto-html/tippy.css" rel="stylesheet">
<script src="../../site_libs/bootstrap/bootstrap.min.js"></script>
<link href="../../site_libs/bootstrap/bootstrap-icons.css" rel="stylesheet">
<link href="../../site_libs/bootstrap/bootstrap-215c5639a07dff84cbbba0d7fa8c70c1.min.css" rel="stylesheet" append-hash="true" class="quarto-color-scheme" id="quarto-bootstrap" data-mode="light">
<link href="../../site_libs/bootstrap/bootstrap-dark-cb64c28157894247a5f972820f9765ff.min.css" rel="stylesheet" append-hash="true" class="quarto-color-scheme quarto-color-alternate" id="quarto-bootstrap" data-mode="dark">
<link href="../../site_libs/quarto-contrib/fontawesome6-0.1.0/all.css" rel="stylesheet">
<link href="../../site_libs/quarto-contrib/fontawesome6-0.1.0/latex-fontsize.css" rel="stylesheet">
<script src="../../site_libs/quarto-contrib/iconify-2.1.0/iconify-icon.min.js"></script>
<script src="../../site_libs/quarto-contrib/glightbox/glightbox.min.js"></script>
<link href="../../site_libs/quarto-contrib/glightbox/glightbox.min.css" rel="stylesheet">
<link href="../../site_libs/quarto-contrib/glightbox/lightbox.css" rel="stylesheet">
<script id="quarto-search-options" type="application/json">{
"location": "navbar",
"copy-button": false,
"collapse-after": 3,
"panel-placement": "end",
"type": "overlay",
"limit": 50,
"keyboard-shortcut": [
"?",
"H"
],
"language": {
"search-no-results-text": "No results",
"search-matching-documents-text": "matching documents",
"search-copy-link-title": "Copy link to search",
"search-hide-matches-text": "Hide additional matches",
"search-more-match-text": "more match in this document",
"search-more-matches-text": "more matches in this document",
"search-clear-button-title": "Clear",
"search-text-placeholder": "",
"search-detached-cancel-button-title": "Cancel",
"search-submit-button-title": "Submit",
"search-label": "Search"
}
}</script>
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-XVM2Y822Y1"></script>
<script type="text/javascript">
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-XVM2Y822Y1', { 'anonymize_ip': true});
</script>
<script src="../../site_libs/quarto-diagram/mermaid.min.js"></script>
<script src="../../site_libs/quarto-diagram/mermaid-init.js"></script>
<link href="../../site_libs/quarto-diagram/mermaid.css" rel="stylesheet">
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-TC329HJ');</script>
<!-- End Google Tag Manager -->
<script>window.backupDefine = window.define; window.define = undefined;</script><script src="https://cdn.jsdelivr.net/npm/katex@latest/dist/katex.min.js"></script>
<script>document.addEventListener("DOMContentLoaded", function () {
var mathElements = document.getElementsByClassName("math");
var macros = [];
for (var i = 0; i < mathElements.length; i++) {
var texText = mathElements[i].firstChild;
if (mathElements[i].tagName == "SPAN") {
katex.render(texText.data, mathElements[i], {
displayMode: mathElements[i].classList.contains('display'),
throwOnError: false,
macros: macros,
fleqn: false
});
}}});
</script>
<script>window.define = window.backupDefine; window.backupDefine = undefined;</script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@latest/dist/katex.min.css">
<script type="text/javascript">
const typesetMath = (el) => {
if (window.MathJax) {
// MathJax Typeset
window.MathJax.typeset([el]);
} else if (window.katex) {
// KaTeX Render
var mathElements = el.getElementsByClassName("math");
var macros = [];
for (var i = 0; i < mathElements.length; i++) {
var texText = mathElements[i].firstChild;
if (mathElements[i].tagName == "SPAN") {
window.katex.render(texText.data, mathElements[i], {
displayMode: mathElements[i].classList.contains('display'),
throwOnError: false,
macros: macros,
fleqn: false
});
}
}
}
}
window.Quarto = {
typesetMath
};
</script>
<link rel="stylesheet" href="../../css/custom.css">
<link rel="stylesheet" href="../../css/svgbob.css">
<link rel="stylesheet" href="../../static/fonts/IosevkaTerm/IosevkaTerm.css">
<link rel="stylesheet" href="../../static/fonts/IosevkaQP/IosevkaQP.css">
<meta property="og:title" content="Deep Learning and Foundation Models at Scale">
<meta property="og:description" content="My ramblings about science and computers">
<meta property="og:image" content="https://samforeman.me/talks/alcf-hpc-workshop-2024/assets/thumbnail.png">
<meta property="og:site_name" content="Sam Foreman">
<meta property="og:image:height" content="2572">
<meta property="og:image:width" content="4112">
<meta name="twitter:title" content="Deep Learning and Foundation Models at Scale">
<meta name="twitter:description" content="My ramblings about science and computers">
<meta name="twitter:image" content="https://samforeman.me/talks/alcf-hpc-workshop-2024/assets/thumbnail.png">
<meta name="twitter:creator" content="saforem2">
<meta name="twitter:site" content="saforem2">
<meta name="twitter:card" content="summary">
<meta name="twitter:image-height" content="2572">
<meta name="twitter:image-width" content="4112">
<meta name="citation_title" content="Deep Learning and Foundation Models at Scale">
<meta name="citation_author" content="Sam Foreman">
<meta name="citation_publication_date" content="2024-10-29">
<meta name="citation_cover_date" content="2024-10-29">
<meta name="citation_year" content="2024">
<meta name="citation_online_date" content="2024-10-29">
<meta name="citation_fulltext_html_url" content="https://samforeman.me/talks/alcf-hpc-workshop-2024/slides">
<meta name="citation_language" content="en">
<meta name="citation_reference" content="citation_title=Superconductivity of in and sn samples;,citation_author=George Deamont;,citation_author=Sam Foreman;,citation_publication_date=2014;,citation_cover_date=2014;,citation_year=2014;">
<meta name="citation_reference" content="citation_title=RG-inspired machine learning for lattice field theory;,citation_author=Sam Foreman;,citation_author=Joel Giedt;,citation_author=Yannick Meurice;,citation_author=Judah Unmuth-Yockey;,citation_publication_date=2018;,citation_cover_date=2018;,citation_year=2018;,citation_volume=175;,citation_conference_title=EPJ web of conferences;,citation_conference=EDP Sciences;">
<meta name="citation_reference" content="citation_title=Large energy density in three-plate nanocapacitors due to coulomb blockade;,citation_author=A Hubler;,citation_author=S Foreman;,citation_author=J Liu;,citation_author=L Wortsmann;,citation_publication_date=2018;,citation_cover_date=2018;,citation_year=2018;,citation_issue=10;,citation_volume=123;,citation_journal_title=Journal of Applied Physics;,citation_publisher=AIP Publishing;">
<meta name="citation_reference" content="citation_title=Examples of renormalization group transformations for image sets;,citation_author=Samuel Foreman;,citation_author=Joel Giedt;,citation_author=Yannick Meurice;,citation_author=Judah Unmuth-Yockey;,citation_publication_date=2018;,citation_cover_date=2018;,citation_year=2018;,citation_issue=5;,citation_volume=98;,citation_journal_title=Physical Review E;,citation_publisher=American Physical Society;">
<meta name="citation_reference" content="citation_title=Machine learning inspired analysis of the Ising model transition;,citation_author=Samuel Foreman;,citation_author=Joel Giedt;,citation_author=Yannick Meurice;,citation_author=Judah Unmuth-Yockey;,citation_publication_date=2018;,citation_cover_date=2018;,citation_year=2018;,citation_doi=10.22323/1.334.0245;,citation_volume=LATTICE2018;,citation_journal_title=PoS;">
<meta name="citation_reference" content="citation_title=Machine learning inspired analysis of the ising model transition;,citation_author=Samuel Foreman;,citation_author=Joel Giedt;,citation_author=Yannick Meurice;,citation_author=Judah Unmuth-Yockey;,citation_publication_date=2018;,citation_cover_date=2018;,citation_year=2018;,citation_conference_title=Lattice 2018;">
<meta name="citation_reference" content="citation_title=Learning better physics: A machine learning approach to lattice gauge theory;,citation_author=Samuel Alfred Foreman;,citation_publication_date=2019;,citation_cover_date=2019;,citation_year=2019;,citation_dissertation_institution=University of Iowa;">
<meta name="citation_reference" content="citation_title=Machine learning and neural networks for field theory;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_publication_date=2020;,citation_cover_date=2020;,citation_year=2020;">
<meta name="citation_reference" content="citation_title=HMC with normalizing flows;,citation_author=Sam Foreman;,citation_author=Taku Izubuchi;,citation_author=Luchang Jin;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_author=Akio Tomiya;,citation_publication_date=2021;,citation_cover_date=2021;,citation_year=2021;,citation_journal_title=arXiv preprint arXiv:2112.01586;">
<meta name="citation_reference" content="citation_title=LeapfrogLayers: A trainable framework for effective topological sampling;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_publication_date=2021;,citation_cover_date=2021;,citation_year=2021;,citation_journal_title=arXiv preprint arXiv:2112.01582;">
<meta name="citation_reference" content="citation_title=Energy storage in quantum resonators;,citation_author=Jiaqi Liu;,citation_author=Alfred W Hubler;,citation_author=Samuel Alfred Foreman;,citation_author=Katharina Ott;,citation_publication_date=2017;,citation_cover_date=2017;,citation_year=2017;">
<meta name="citation_reference" content="citation_title=Applications of machine learning to lattice quantum field theory;,citation_author=Denis Boyda;,citation_author=Salvatore Calı̀;,citation_author=Sam Foreman;,citation_author=Lena Funcke;,citation_author=Daniel C Hackett;,citation_author=Yin Lin;,citation_author=Gert Aarts;,citation_author=Andrei Alexandru;,citation_author=Xiao-Yong Jin;,citation_author=Biagio Lucini;,citation_author=others;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_journal_title=arXiv preprint arXiv:2202.05838;">
<meta name="citation_reference" content="citation_title=Lattice QCD and particle physics;,citation_author=Andreas S Kronfeld;,citation_author=Tanmoy Bhattacharya;,citation_author=Thomas Blum;,citation_author=Norman H Christ;,citation_author=Carleton DeTar;,citation_author=William Detmold;,citation_author=Robert Edwards;,citation_author=Anna Hasenfratz;,citation_author=Huey-Wen Lin;,citation_author=Swagato Mukherjee;,citation_author=others;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_journal_title=arXiv preprint arXiv:2207.07641;">
<meta name="citation_reference" content="citation_title=GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics;,citation_author=Maxim Zvyagin;,citation_author=Alexander Brace;,citation_author=Kyle Hippe;,citation_author=Yuntian Deng;,citation_author=Bin Zhang;,citation_author=Cindy Orozco Bohorquez;,citation_author=Austin Clyde;,citation_author=Bharat Kale;,citation_author=Danilo Perez-Rivera;,citation_author=Heng Ma;,citation_author=others;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_issue=6;,citation_volume=37;,citation_journal_title=The International Journal of High Performance Computing Applications;,citation_publisher=SAGE Publications Sage UK: London, England;">
<meta name="citation_reference" content="citation_title=MLMC: Machine learning monte carlo;,citation_author=Sam Foreman;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_conference_title=The international symposium on lattice field theory;">
<meta name="citation_reference" content="citation_title=Superconductivity of in and sn samples;,citation_author=George Deamont;,citation_author=Sam Foreman;,citation_publication_date=2014;,citation_cover_date=2014;,citation_year=2014;">
<meta name="citation_reference" content="citation_title=A comprehensive performance study of large language models on novel AI accelerators;,citation_author=Murali Emani;,citation_author=Sam Foreman;,citation_author=Varuni Sastry;,citation_author=Zhen Xie;,citation_author=Siddhisanket Raskar;,citation_author=William Arnold;,citation_author=Rajeev Thakur;,citation_author=Venkatram Vishwanath;,citation_author=Michael E Papka;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_journal_title=arXiv preprint arXiv:2310.04607;">
<meta name="citation_reference" content="citation_title=DeepSpeed4Science initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies;,citation_author=Shuaiwen Leon Song;,citation_author=Bonnie Kruft;,citation_author=Minjia Zhang;,citation_author=Conglong Li;,citation_author=Shiyang Chen;,citation_author=Chengming Zhang;,citation_author=Masahiro Tanaka;,citation_author=Xiaoxia Wu;,citation_author=Jeff Rasley;,citation_author=Ammar Ahmad Awan;,citation_author=others;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_journal_title=arXiv preprint arXiv:2310.04610;">
<meta name="citation_reference" content="citation_title=Protein generation via genome-scale language models with bio-physical scoring;,citation_author=Gautham Dharuman;,citation_author=Logan Ward;,citation_author=Heng Ma;,citation_author=Priyanka V Setty;,citation_author=Ozan Gokdemir;,citation_author=Sam Foreman;,citation_author=Murali Emani;,citation_author=Kyle Hippe;,citation_author=Alexander Brace;,citation_author=Kristopher Keipert;,citation_author=others;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_conference_title=Proceedings of the SC’23 workshops of the international conference on high performance computing, network, storage, and analysis;">
<meta name="citation_reference" content="citation_title=MLMC: Machine learning monte carlo for lattice gauge theory;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_journal_title=arXiv preprint arXiv:2312.08936;">
<meta name="citation_reference" content="citation_title=Snowmass 2021 computational frontier CompF03 topical group report: Machine learning;,citation_author=Phiala Shanahan;,citation_author=Kazuhiro Terao;,citation_author=Daniel Whiteson;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_journal_title=arXiv preprint arXiv:2209.07559;">
<meta name="citation_reference" content="citation_title=Thorough characterization and analysis of large transformer model training at-scale;,citation_author=Scott Cheng;,citation_author=Jun-Liang Lin;,citation_author=Murali Emani;,citation_author=Siddhisanket Raskar;,citation_author=Sam Foreman;,citation_author=Zhen Xie;,citation_author=Venkatram Vishwanath;,citation_author=Mahmut Taylan Kandemir;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_issue=1;,citation_volume=8;,citation_journal_title=Proceedings of the ACM on Measurement and Analysis of Computing Systems;,citation_publisher=ACM New York, NY, USA;">
<meta name="citation_reference" content="citation_title=Communities through energy justice projects;,citation_author=Mary Ann Leung;,citation_author=Katharine Cahill;,citation_author=Rebecca Hartman-Baker;,citation_author=Paige Kinsley;,citation_author=Lois Curfman McInnes;,citation_author=Suzanne Parete-Koon;,citation_author=Subil Abraham;,citation_author=Lacy Beach Barrier;,citation_author=Gladys Chen;,citation_author=Lizanne DeStefano;,citation_author=others;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_issue=1;,citation_volume=15;,citation_journal_title=Journal of Computational Science;">
<meta name="citation_reference" content="citation_title=Applications of a foundation model approach for weather and climate;,citation_author=Troy Arcomano;,citation_author=Alexander Wikner;,citation_author=Romit Maulik;,citation_author=Veerabhadra Rao Kotamarthi;,citation_author=Sam Foreman;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_volume=2023;,citation_conference_title=AGU fall meeting abstracts;">
<meta name="citation_reference" content="citation_title=Toward a holistic performance evaluation of large language models across diverse ai accelerators;,citation_author=Murali Emani;,citation_author=Sam Foreman;,citation_author=Varuni Sastry;,citation_author=Zhen Xie;,citation_author=Siddhisanket Raskar;,citation_author=William Arnold;,citation_author=Rajeev Thakur;,citation_author=Venkatram Vishwanath;,citation_author=Michael E Papka;,citation_author=Sanjif Shanmugavelu;,citation_author=others;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_conference_title=2024 IEEE international parallel and distributed processing symposium workshops (IPDPSW);,citation_conference=IEEE;">
<meta name="citation_reference" content="citation_title=Intro to HPC bootcamp: Engaging new communities through energy justice projects;,citation_author=Suzanne Parete-Koon;,citation_author=Michael Sandoval;,citation_author=Kellen Leland;,citation_author=Subil Abraham;,citation_author=Mary Ann Leung;,citation_author=Rebecca Hartman-Baker;,citation_author=Paige Kinsley;,citation_author=Lois McInnes;,citation_author=Sreeranjani Ramprakash;,citation_author=Lacy Beach Barrier;,citation_author=others;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_issue=1;,citation_volume=15;,citation_journal_title=Journal of Computational Science Education;,citation_publisher=Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States);">
<meta name="citation_reference" content="citation_title=MProt-DPO: Breaking the ExaFLOPS barrier for multimodal protein design workflows with direct preference optimization;,citation_author=Gautham Dharuman;,citation_author=Kyle Hippe;,citation_author=Alexander Brace;,citation_author=Sam Foreman;,citation_author=Väinä Hatanpää;,citation_author=Varuni K Sastry;,citation_author=Huihuo Zheng;,citation_author=Logan Ward;,citation_author=Servesh Muralidharan;,citation_author=Archit Vasan;,citation_author=others;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_conference_title=2024 SC24: International conference for high performance computing, networking, storage and analysis SC;,citation_conference=IEEE Computer Society;">
<meta name="citation_reference" content="citation_title=Emergent abilities of large language models;,citation_author=Jason Wei;,citation_author=Yi Tay;,citation_author=Rishi Bommasani;,citation_author=Colin Raffel;,citation_author=Barret Zoph;,citation_author=Sebastian Borgeaud;,citation_author=Dani Yogatama;,citation_author=Maarten Bosma;,citation_author=Denny Zhou;,citation_author=Donald Metzler;,citation_author=Ed H. Chi;,citation_author=Tatsunori Hashimoto;,citation_author=Oriol Vinyals;,citation_author=Percy Liang;,citation_author=Jeff Dean;,citation_author=William Fedus;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2206.07682;">
<meta name="citation_reference" content="citation_title=DeepSpeed4Science initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies;,citation_author=Shuaiwen Leon Song;,citation_author=Bonnie Kruft;,citation_author=Minjia Zhang;,citation_author=Conglong Li;,citation_author=Shiyang Chen;,citation_author=Chengming Zhang;,citation_author=Masahiro Tanaka;,citation_author=Xiaoxia Wu;,citation_author=Jeff Rasley;,citation_author=Ammar Ahmad Awan;,citation_author=Connor Holmes;,citation_author=Martin Cai;,citation_author=Adam Ghanem;,citation_author=Zhongzhu Zhou;,citation_author=Yuxiong He;,citation_author=Pete Luferenko;,citation_author=Divya Kumar;,citation_author=Jonathan Weyn;,citation_author=Ruixiong Zhang;,citation_author=Sylwester Klocek;,citation_author=Volodymyr Vragov;,citation_author=Mohammed AlQuraishi;,citation_author=Gustaf Ahdritz;,citation_author=Christina Floristean;,citation_author=Cristina Negri;,citation_author=Rao Kotamarthi;,citation_author=Venkatram Vishwanath;,citation_author=Arvind Ramanathan;,citation_author=Sam Foreman;,citation_author=Kyle Hippe;,citation_author=Troy Arcomano;,citation_author=Romit Maulik;,citation_author=Maxim Zvyagin;,citation_author=Alexander Brace;,citation_author=Bin Zhang;,citation_author=Cindy Orozco Bohorquez;,citation_author=Austin Clyde;,citation_author=Bharat Kale;,citation_author=Danilo Perez-Rivera;,citation_author=Heng Ma;,citation_author=Carla M. Mann;,citation_author=Michael Irvin;,citation_author=J. Gregory Pauloski;,citation_author=Logan Ward;,citation_author=Valerie Hayot;,citation_author=Murali Emani;,citation_author=Zhen Xie;,citation_author=Diangen Lin;,citation_author=Maulik Shukla;,citation_author=Ian Foster;,citation_author=James J. Davis;,citation_author=Michael E. Papka;,citation_author=Thomas Brettin;,citation_author=Prasanna Balaprakash;,citation_author=Gina Tourassi;,citation_author=John Gounley;,citation_author=Heidi Hanson;,citation_author=Thomas E Potok;,citation_author=Massimiliano Lupo Pasini;,citation_author=Kate Evans;,citation_author=Dan Lu;,citation_author=Dalton Lunga;,citation_author=Junqi Yin;,citation_author=Sajal Dash;,citation_author=Feiyi Wang;,citation_author=Mallikarjun Shankar;,citation_author=Isaac Lyngaas;,citation_author=Xiao Wang;,citation_author=Guojing Cong;,citation_author=Pei Zhang;,citation_author=Ming Fan;,citation_author=Siyan Liu;,citation_author=Adolfy Hoisie;,citation_author=Shinjae Yoo;,citation_author=Yihui Ren;,citation_author=William Tang;,citation_author=Kyle Felker;,citation_author=Alexey Svyatkovskiy;,citation_author=Hang Liu;,citation_author=Ashwin Aji;,citation_author=Angela Dalton;,citation_author=Michael Schulte;,citation_author=Karl Schulz;,citation_author=Yuntian Deng;,citation_author=Weili Nie;,citation_author=Josh Romero;,citation_author=Christian Dallago;,citation_author=Arash Vahdat;,citation_author=Chaowei Xiao;,citation_author=Thomas Gibbs;,citation_author=Anima Anandkumar;,citation_author=Rick Stevens;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2310.04610;">
<meta name="citation_reference" content="citation_title=Emergent abilities of large language models;,citation_author=Jason Wei;,citation_author=Yi Tay;,citation_author=Rishi Bommasani;,citation_author=Colin Raffel;,citation_author=Barret Zoph;,citation_author=Sebastian Borgeaud;,citation_author=Dani Yogatama;,citation_author=Maarten Bosma;,citation_author=Denny Zhou;,citation_author=Donald Metzler;,citation_author=Ed H. Chi;,citation_author=Tatsunori Hashimoto;,citation_author=Oriol Vinyals;,citation_author=Percy Liang;,citation_author=Jeff Dean;,citation_author=William Fedus;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2206.07682;">
<meta name="citation_reference" content="citation_title=The climate risk &amp;amp; resilience portal (ClimRR) metadata and data dictionary;,citation_author=C. Burdi;,citation_author=Wall. T Branham;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://dub.sh/ClimRR-Metadata;">
<meta name="citation_reference" content="citation_title=Progress on $(g-2)_\mu$ from lattice QCD;,citation_author=Hartmut Wittig;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2306.04165;">
<meta name="citation_reference" content="citation_title=Hybrid Monte Carlo;,citation_author=S. Duane;,citation_author=A. D. Kennedy;,citation_author=B. J. Pendleton;,citation_author=D. Roweth;,citation_publication_date=1987;,citation_cover_date=1987;,citation_year=1987;,citation_doi=10.1016/0370-2693(87)91197-X;,citation_volume=195;,citation_journal_title=Phys. Lett. B;">
<meta name="citation_reference" content="citation_title=Snowmass 2021 Computational Frontier CompF03 Topical Group Report: Machine Learning;,citation_author=Phiala Shanahan;,citation_author=others;,citation_publication_date=2022-09;,citation_cover_date=2022-09;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2209.07559;">
<meta name="citation_reference" content="citation_title=Applications of Machine Learning to Lattice Quantum Field Theory;,citation_author=Denis Boyda;,citation_author=others;,citation_publication_date=2022-02;,citation_cover_date=2022-02;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2202.05838;,citation_conference_title=Snowmass 2021;">
<meta name="citation_reference" content="citation_title=HMC with Normalizing Flows;,citation_author=Sam Foreman;,citation_author=Taku Izubuchi;,citation_author=Luchang Jin;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_author=Akio Tomiya;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01586;,citation_doi=10.22323/1.396.0073;,citation_volume=LATTICE2021;,citation_journal_title=PoS;">
<meta name="citation_reference" content="citation_title=LeapfrogLayers: A Trainable Framework for Effective Topological Sampling;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01582;,citation_doi=10.22323/1.396.0508;,citation_volume=LATTICE2021;,citation_journal_title=PoS;">
<meta name="citation_reference" content="citation_title=Deep Learning Hamiltonian Monte Carlo;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2021-05;,citation_cover_date=2021-05;,citation_year=2021;,citation_fulltext_html_url=https://arxiv.org/abs/2105.03418;,citation_conference_title=9th International Conference on Learning Representations;">
<meta name="citation_reference" content="citation_title=Deep learning hamiltonian monte carlo;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2021;,citation_cover_date=2021;,citation_year=2021;,citation_fulltext_html_url=https://arxiv.org/abs/2105.03418;">
<meta name="citation_reference" content="citation_title=Deep Learning Hamiltonian Monte Carlo;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2021-05;,citation_cover_date=2021-05;,citation_year=2021;,citation_fulltext_html_url=https://arxiv.org/abs/2105.03418;,citation_conference_title=9th International Conference on Learning Representations;">
<meta name="citation_reference" content="citation_title=Deep learning hamiltonian monte carlo;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_publication_date=2021;,citation_cover_date=2021;,citation_year=2021;,citation_journal_title=arXiv preprint arXiv:2105.03418;">
<meta name="citation_reference" content="citation_title=LeapfrogLayers: A trainable framework for effective topological sampling;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_publication_date=2021;,citation_cover_date=2021;,citation_year=2021;,citation_journal_title=arXiv preprint arXiv:2112.01582;">
<meta name="citation_reference" content="citation_title=Energy Justice Analysis of Climate Data with ClimRR;,citation_author=Sam Foreman;,citation_publication_date=2023-08-07;,citation_cover_date=2023-08-07;,citation_year=2023;,citation_fulltext_html_url=https://saforem2.github.io/climate-analysis;,citation_language=en;">
<meta name="citation_reference" content="citation_author=Sam Foreman;,citation_publication_date=2023-08-19;,citation_cover_date=2023-08-19;,citation_year=2023;,citation_fulltext_html_url=https://saforem2.github.io/l2hmc-qcd;,citation_language=en;">
<meta name="citation_reference" content="citation_title=MLMC: Machine learning monte carlo for lattice gauge theory;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James Osborn;,citation_publication_date=00;,citation_cover_date=00;,citation_year=0;,citation_conference_title=40th international symposium on lattice field theory (lattice 2023) (batavia, IL, united states, 07/31/2023 - 08/04/2023);">
<meta name="citation_reference" content="citation_title=Progress on $(g-2)_\mu$ from lattice QCD;,citation_author=Hartmut Wittig;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2306.04165;">
<meta name="citation_reference" content="citation_title=Hybrid Monte Carlo;,citation_author=S. Duane;,citation_author=A. D. Kennedy;,citation_author=B. J. Pendleton;,citation_author=D. Roweth;,citation_publication_date=1987;,citation_cover_date=1987;,citation_year=1987;,citation_doi=10.1016/0370-2693(87)91197-X;,citation_volume=195;,citation_journal_title=Phys. Lett. B;">
<meta name="citation_reference" content="citation_title=Snowmass 2021 Computational Frontier CompF03 Topical Group Report: Machine Learning;,citation_author=Phiala Shanahan;,citation_author=others;,citation_publication_date=2022-09;,citation_cover_date=2022-09;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2209.07559;">
<meta name="citation_reference" content="citation_title=Applications of Machine Learning to Lattice Quantum Field Theory;,citation_author=Denis Boyda;,citation_author=others;,citation_publication_date=2022-02;,citation_cover_date=2022-02;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2202.05838;,citation_conference_title=Snowmass 2021;">
<meta name="citation_reference" content="citation_title=LeapfrogLayers: A Trainable Framework for Effective Topological Sampling;,citation_author=S. Foreman;,citation_author=X. Jin;,citation_author=J. Osborn;,citation_publication_date=2022-07;,citation_cover_date=2022-07;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01582;,citation_doi=10.22323/1.396.0508;,citation_conference_title=The 38th international symposium on lattice field theory;">
<meta name="citation_reference" content="citation_title=HMC with Normalizing Flows;,citation_author=Sam Foreman;,citation_author=Taku Izubuchi;,citation_author=Luchang Jin;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_author=Akio Tomiya;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01586;,citation_doi=10.22323/1.396.0073;,citation_volume=LATTICE2021;,citation_journal_title=PoS;">
<meta name="citation_reference" content="citation_title=Mastering language models;,citation_author=Samuel Montgomery;,citation_publication_date=2023-10;,citation_cover_date=2023-10;,citation_year=2023;,citation_fulltext_html_url=https://towardsdatascience.com/mastering-language-models-32e1d891511a
;,citation_journal_title=Medium;,citation_publisher=Towards Data Science;">
<meta name="citation_reference" content="citation_title=Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond;,citation_author=Jingfeng Yang;,citation_author=Hongye Jin;,citation_author=Ruixiang Tang;,citation_author=Xiaotian Han;,citation_author=Qizhang Feng;,citation_author=Haoming Jiang;,citation_author=Bing Yin;,citation_author=Xia Hu;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2304.13712;">
<meta name="citation_reference" content="citation_title=Training tips for the transformer model;,citation_author=Martin Popel;,citation_author=Ondřej Bojar;,citation_publication_date=2018-04;,citation_cover_date=2018-04;,citation_year=2018;,citation_fulltext_html_url=https://doi.org/10.2478%2Fpralin-2018-0002;,citation_issue=1;,citation_doi=10.2478/pralin-2018-0002;,citation_volume=110;,citation_journal_title=The Prague Bulletin of Mathematical Linguistics;,citation_publisher=Charles University in Prague, Karolinum Press;">
<meta name="citation_reference" content="citation_title=Attention is all you need;,citation_author=Ashish Vaswani;,citation_author=Noam Shazeer;,citation_author=Niki Parmar;,citation_author=Jakob Uszkoreit;,citation_author=Llion Jones;,citation_author=Aidan N. Gomez;,citation_author=Lukasz Kaiser;,citation_author=Illia Polosukhin;,citation_publication_date=2017;,citation_cover_date=2017;,citation_year=2017;,citation_fulltext_html_url=https://arxiv.org/abs/1706.03762;">
<meta name="citation_reference" content="citation_title=Tree of thoughts: Deliberate problem solving with large language models;,citation_author=Shunyu Yao;,citation_author=Dian Yu;,citation_author=Jeffrey Zhao;,citation_author=Izhak Shafran;,citation_author=Thomas L. Griffiths;,citation_author=Yuan Cao;,citation_author=Karthik Narasimhan;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2305.10601;">
<meta name="citation_reference" content="citation_title=GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics;,citation_abstract=We seek to transform how new and emergent variants of pandemiccausing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences and finetuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.Competing Interest StatementThe authors have declared no competing interest.;,citation_author=Maxim Zvyagin;,citation_author=Alexander Brace;,citation_author=Kyle Hippe;,citation_author=Yuntian Deng;,citation_author=Bin Zhang;,citation_author=Cindy Orozco Bohorquez;,citation_author=Austin Clyde;,citation_author=Bharat Kale;,citation_author=Danilo Perez-Rivera;,citation_author=Heng Ma;,citation_author=Carla M. Mann;,citation_author=Michael Irvin;,citation_author=J. Gregory Pauloski;,citation_author=Logan Ward;,citation_author=Valerie Hayot-Sasson;,citation_author=Murali Emani;,citation_author=Sam Foreman;,citation_author=Zhen Xie;,citation_author=Diangen Lin;,citation_author=Maulik Shukla;,citation_author=Weili Nie;,citation_author=Josh Romero;,citation_author=Christian Dallago;,citation_author=Arash Vahdat;,citation_author=Chaowei Xiao;,citation_author=Thomas Gibbs;,citation_author=Ian Foster;,citation_author=James J. Davis;,citation_author=Michael E. Papka;,citation_author=Thomas Brettin;,citation_author=Rick Stevens;,citation_author=Anima Anandkumar;,citation_author=Venkatram Vishwanath;,citation_author=Arvind Ramanathan;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://www.biorxiv.org/content/early/2022/11/23/2022.10.10.511571;,citation_doi=10.1101/2022.10.10.511571;,citation_journal_title=bioRxiv;,citation_publisher=Cold Spring Harbor Laboratory;">
</head>
<body class="nav-fixed">
<div id="quarto-search-results"></div>
<header id="quarto-header" class="headroom fixed-top">
<nav class="navbar navbar-expand-lg " data-bs-theme="dark">
<div class="navbar-container container-fluid">
<div class="navbar-brand-container mx-auto">
<a href="../../index.html" class="navbar-brand navbar-brand-logo">
<img src="../../assets/signature12.svg" alt="Sam Foreman" class="navbar-logo">
</a>
</div>
<div id="quarto-search" class="" title="Search"></div>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarCollapse" aria-controls="navbarCollapse" role="menu" aria-expanded="false" aria-label="Toggle navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarCollapse">
<ul class="navbar-nav navbar-nav-scroll ms-auto">
<li class="nav-item dropdown ">
<a class="nav-link dropdown-toggle" href="#" id="nav-menu-talks" role="link" data-bs-toggle="dropdown" aria-expanded="false">
<span class="menu-text">talks</span>
</a>
<ul class="dropdown-menu dropdown-menu-end" aria-labelledby="nav-menu-talks">
<li>
<a class="dropdown-item" href="../../talks/index.html">
<span class="dropdown-text">📢 All Talks</span></a>
</li>
<li>
<a class="dropdown-item" href="../../talks/ai-for-science-2024/slides.html">
<span class="dropdown-text">Parallel Training Methods</span></a>
</li>
<li>
<a class="dropdown-item" href="../../talks/alcf-hpc-workshop-2024/slides.html">
<span class="dropdown-text">AuroraGPT (ALCF Hands-On HPC)</span></a>
</li>
<li>
<a class="dropdown-item" href="../../talks/alcf-hpc-workshop-2024/slides.html">
<span class="dropdown-text">ML + Foundation Models at Scale</span></a>
</li>
<li>
<a class="dropdown-item" href="../../talks/hpc-user-forum/slides.html">
<span class="dropdown-text">AuroraGPT (HPC User Forum)</span></a>
</li>
<li>
<a class="dropdown-item" href="../../talks/llms-at-scale/slides.html">
<span class="dropdown-text">Training LLMs at Scale</span></a>
</li>
<li>
<a class="dropdown-item" href="../../talks/llms-on-polaris/slides.html">
<span class="dropdown-text">Polaris Overview + LLMs</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/parallel-training-slides/">
<span class="dropdown-text">Parallel Training Techniques</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/llm-workshop-talk/">
<span class="dropdown-text">LLMs from Scratch</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/LLM-tutorial">
<span class="dropdown-text">Creating Small(-ish) LLMs</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/oneapi-talk/#0">
<span class="dropdown-text">Exascale Science on Aurora</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/llm-lunch-talk/">
<span class="dropdown-text">LLM Lunch Talk</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/scaling4science">
<span class="dropdown-text">Scaling LLMs for Science</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/lattice23">
<span class="dropdown-text">MLMC (for LQCD)</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/lqcd-pasc23">
<span class="dropdown-text">Generative Modeling</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/deep-fridays">
<span class="dropdown-text">Efficient Sampling for LGT</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/ai4sci-large-scale-training">
<span class="dropdown-text">Large Scale Training</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/hparam-management-sdl2022">
<span class="dropdown-text">Hyperparameter Management</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/ATPESC-StatisticalLearning">
<span class="dropdown-text">Statistical Learning</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/anl-job-talk/">
<span class="dropdown-text">Scientific Data Science</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/physicsSeminar">
<span class="dropdown-text">Machine Learning in HEP</span></a>
</li>
<li>
<a class="dropdown-item" href="https://bit.ly/mainz21">
<span class="dropdown-text">DLHMC for Improved Gauge Generation</span></a>
</li>
<li>
<a class="dropdown-item" href="https://slides.com/samforeman/l2hmc-qcd-93bc0c">
<span class="dropdown-text">ML for LQCD</span></a>
</li>
<li>
<a class="dropdown-item" href="https://bit.ly/mainz21_overview">
<span class="dropdown-text">ML Techniques in LQCD</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/physicsSeminar">
<span class="dropdown-text">ML for HEP</span></a>
</li>
</ul>
</li>
<li class="nav-item dropdown ">
<a class="nav-link dropdown-toggle" href="#" id="nav-menu-posts" role="link" data-bs-toggle="dropdown" aria-expanded="false">
<span class="menu-text">posts</span>
</a>
<ul class="dropdown-menu dropdown-menu-end" aria-labelledby="nav-menu-posts">
<li>
<a class="dropdown-item" href="../../posts/index.html">
<span class="dropdown-text">📬 All Posts</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/AuroraGPT/spike-skipper/index.html">
<span class="dropdown-text">🏔️ Spike Skipper</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/ezpz-at-alcf/index.html">
<span class="dropdown-text">🍋 <code>ezpz</code> at ALCF</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/AuroraGPT/determinstic-flash-attn/index.html">
<span class="dropdown-text">🎰 Deterministic Flash Attention</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/AuroraGPT/flash-attn-sunspot/index.html">
<span class="dropdown-text">📸 <code>flash-attn</code> on Sunspot</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/AuroraGPT/mpi4py-reproducer/index.html">
<span class="dropdown-text">🐛 <code>mpi4py</code> bug on Sunspot</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/ai-for-physics/diffusion/index.html">
<span class="dropdown-text">🎲 MCMC + Diffusion Sampling</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/dope-slides/index.html">
<span class="dropdown-text">💅 How to Make Dope Slides</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/AuroraGPT/startup-times/index.html">
<span class="dropdown-text">⏰ Starting Up Distributed Training</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/AuroraGPT/long-sequences/index.html">
<span class="dropdown-text">🚂 Loooooooong Sequence Lengths</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/AuroraGPT/aurora-gpt/index.html">
<span class="dropdown-text">🏎️ <code>Megatron-DeepSpeed</code> + Intel XPU</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/ai-for-physics/l2hmc-qcd/2dU1/index.html">
<span class="dropdown-text">🎢 <code>l2hmc-qcd</code> Example: 2D U(1)</span></a>
</li>
<li>
<a class="dropdown-item" href="../../posts/ai-for-physics/l2hmc-qcd/4dSU3/index.html">
<span class="dropdown-text">🔳 <code>l2hmc-qcd</code> Example: 4D SU(3)</span></a>
</li>
</ul>
</li>
<li class="nav-item dropdown ">
<a class="nav-link dropdown-toggle" href="#" id="nav-menu-projects" role="link" data-bs-toggle="dropdown" aria-expanded="false">
<span class="menu-text">projects</span>
</a>
<ul class="dropdown-menu dropdown-menu-end" aria-labelledby="nav-menu-projects">
<li>
<a class="dropdown-item" href="../.././about/_projects.qmd">
<span class="dropdown-text">📚 All Projects</span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/ezpz">
<span class="dropdown-text">🍋 <code>ezpz</code></span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/l2hmc-qcd">
<span class="dropdown-text">🟥 <code>l2hmc-qcd</code></span></a>
</li>
<li>
<a class="dropdown-item" href="https://github.com/argonne-lcf/Megatron-DeepSpeed)">
<span class="dropdown-text">🤖 <code>Megatron-DeepSpeed</code></span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/wordplay">
<span class="dropdown-text">💬 <code>wordplay</code> 🎮</span></a>
</li>
<li>
<a class="dropdown-item" href="https://www.alcf.anl.gov/alcf-ai-science-training-series?">
<span class="dropdown-text">🎓 <code>ai-science-training</code></span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/enrich">
<span class="dropdown-text">💸 <code>enrich</code></span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/ambivalent">
<span class="dropdown-text">🤷🏻♂️<code>ambivalent</code></span></a>
</li>
<li>
<a class="dropdown-item" href="https://saforem2.github.io/climate-analysis">
<span class="dropdown-text">🌍 <code>climate-analysis</code></span></a>
</li>
<li>
<a class="dropdown-item" href="https://github.com/saforem2/glitz">
<span class="dropdown-text">🎨 <code>glitz</code></span></a>
</li>
<li>
<a class="dropdown-item" href="https://github.com/saforem2/personal_site">
<span class="dropdown-text">🙋🏻<code>personal_site</code></span></a>
</li>
<li>
<a class="dropdown-item" href="https://github.com/saforem2/notes-demo">
<span class="dropdown-text">🗒️ <code>Notes-Demo</code></span></a>
</li>
</ul>
</li>
<li class="nav-item">
<a class="nav-link" href="https://github.com/saforem2/personal_site">
<span class="menu-text"><span class="icon dim-text" style="font-size: 1.25rem;"><iconify-icon role="img" inline="" icon="ph:github-logo" aria-label="Icon github-logo from ph Iconify.design set." title="Icon github-logo from ph Iconify.design set."></iconify-icon></span></span></a>
</li>
<li class="nav-item">
<a class="nav-link" href="../../index.xml">
<span class="menu-text"><span class="icon dim-text" style="font-size: 1.25rem;"><iconify-icon role="img" inline="" icon="ph:rss" aria-label="Icon rss from ph Iconify.design set." title="Icon rss from ph Iconify.design set."></iconify-icon></span></span></a>
</li>
</ul>
</div> <!-- /navcollapse -->
<div class="quarto-navbar-tools">
<a href="" class="quarto-color-scheme-toggle quarto-navigation-tool px-1" onclick="window.quartoToggleColorScheme(); return false;" title="Toggle dark mode"><i class="bi"></i></a>
</div>
</div> <!-- /container-fluid -->
</nav>
</header>
<!-- content -->
<div id="quarto-content" class="quarto-container page-columns page-rows-contents page-layout-article page-navbar">
<!-- sidebar -->
<!-- margin-sidebar -->
<div id="quarto-margin-sidebar" class="sidebar margin-sidebar">
<nav id="TOC" role="doc-toc" class="toc-active">
<h2 id="toc-title">On this page</h2>
<ul>
<li><a href="#overview" id="toc-overview" class="nav-link active" data-scroll-target="#overview">Overview</a></li>
<li><a href="#scaling-overview" id="toc-scaling-overview" class="nav-link" data-scroll-target="#scaling-overview">🚀 Scaling: Overview</a></li>
<li><a href="#why-distributed-training-speedup" id="toc-why-distributed-training-speedup" class="nav-link" data-scroll-target="#why-distributed-training-speedup">Why Distributed Training? Speedup!</a></li>
<li><a href="#large-language-models" id="toc-large-language-models" class="nav-link" data-scroll-target="#large-language-models">Large Language Models</a></li>
<li><a href="#hands-on" id="toc-hands-on" class="nav-link" data-scroll-target="#hands-on">Hands On</a></li>
<li><a href="#thank-you" id="toc-thank-you" class="nav-link" data-scroll-target="#thank-you">❤️ Thank you!</a></li>
<li><a href="#references" id="toc-references" class="nav-link" data-scroll-target="#references">References</a></li>
</ul>
<div class="toc-actions"><ul><li><a href="https://github.com/saforem2/personal_site/blob/main/talks/alcf-hpc-workshop-2024/index.qmd" class="toc-action"><i class="bi bi-github"></i>View source</a></li><li><a href="https://github.com/saforem2/personal_site/edit/main/talks/alcf-hpc-workshop-2024/index.qmd" class="toc-action"><i class="bi empty"></i>Edit this page</a></li><li><a href="https://github.com/saforem2/personal_site/issues/new" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div><div class="quarto-alternate-formats"><h2>Other Formats</h2><ul><li><a href="slides.html"><i class="bi bi-file-slides"></i>RevealJS</a></li><li><a href="alcf-hpc-workshop-2024.md"><i class="bi bi-file-code"></i>Github (GFM)</a></li></ul></div></nav>
</div>
<!-- main -->
<main class="content" id="quarto-document-content">
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-TC329HJ" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<header id="title-block-header" class="quarto-title-block default">
<div class="quarto-title">
<div class="quarto-title-block"><div><h1 class="title">Deep Learning and Foundation Models at Scale</h1><button type="button" class="btn code-tools-button" id="quarto-code-tools-source"><i class="bi"></i> Code</button></div></div>
</div>
<div class="quarto-title-meta-author">
<div class="quarto-title-meta-heading">Author</div>
<div class="quarto-title-meta-heading">Affiliation</div>
<div class="quarto-title-meta-contents">
<p class="author"><a href="https://samforeman.me">Sam Foreman</a> <a href="mailto:foremans@anl.gov" class="quarto-title-author-email"><i class="bi bi-envelope"></i></a> </p>
</div>
<div class="quarto-title-meta-contents">
<p class="affiliation">
<a href="https://alcf.anl.gov/about/people/sam-foreman">ALCF</a>
</p>
</div>
</div>
<div class="quarto-title-meta">
<div>
<div class="quarto-title-meta-heading">Published</div>
<div class="quarto-title-meta-contents">
<p class="date">October 29, 2024</p>
</div>
</div>
</div>
</header>
<section id="overview" class="level3" data-background-color="white">
<h3 data-background-color="white" class="anchored" data-anchor-id="overview">Overview</h3>
<ul>
<li><a href="https://www.alcf.anl.gov/events/2024-alcf-hands-hpc-workshop">ALCF Hands-on HPC Workshop</a>
<ul>
<li><i class="fa-brands fa-github" aria-label="github"></i> <a href="https://github.com/argonne-lcf/ALCF_Hands_on_HPC_Workshop"><code>argonne-lcf/ALCF_Hands_on_HPC_Workshop</code></a></li>
</ul></li>
<li>Slides @ <a href="https://samforeman.me/talks/alcf-hpc-workshop-2024/slides">samforeman.me/talks/alcf-hpc-workshop-2024/slides</a>
<ul>
<li>HTML Version: <a href="https://samforeman.me/talks/alcf-hpc-workshop-2024">samforeman.me/talks/alcf-hpc-workshop-2024</a></li>
</ul></li>
</ul>
</section>
<section id="scaling-overview" class="level3" data-background-color="white">
<h3 data-background-color="white" class="anchored" data-anchor-id="scaling-overview">🚀 Scaling: Overview</h3>
<ul>
<li>✅ <strong>Goal</strong>:
<ul>
<li>Minimize: <span class="highlight-red">Cost</span> (i.e. amount of time spent training)</li>
<li>Maximize: <span class="highlight-blue">Performance</span></li>
</ul>
<div class="callout callout-style-simple callout-note no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>See <a href="https://huggingface.co/docs/transformers/v4.46.0/performance">🤗 Performance and Scalability</a> for more details</p>
</div>
</div>
</div></li>
</ul>
<section id="single-gpu" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="single-gpu">Single GPU</h4>
<p>See <a href="https://huggingface.co/docs/transformers/v4.46.0/perf_train_gpu_one">🤗 Methods and tools for efficient training on a single GPU</a></p>
<div id="fig-single-gpu" class="r-stretch quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-single-gpu-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/single-gpu-step-1.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Figure 1: SLOW !! model size limited by GPU memory"><img src="./assets/single-gpu-step-1.drawio.svg" class="r-stretch img-fluid figure-img"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-single-gpu-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 1: <strong>SLOW</strong> !! model size limited by GPU memory
</figcaption>
</figure>
</div>
</section>
<section id="data-parallel-training" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="data-parallel-training">Data Parallel Training</h4>
<div>
</div>
<div class="quarto-layout-panel" data-layout="[50,50]">
<div class="quarto-layout-row">
<div class="column quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<ul>
<li>The simplest and most common parallelism technique</li>
<li>Each GPU:
<ul>
<li>has identical copy of model</li>
<li>works on a <strong>unique</strong> subset of data</li>
</ul></li>
<li>Multiple copies of <strong>the same setup</strong>
<ul>
<li>each copy gets fed <strong>unique</strong> data</li>
<li>all copies compute gradients w.r.t local model</li>
<li>everyone syncs up before updating weights</li>
</ul></li>
<li>See: <a href="https://pytorch.org/tutorials/intermediate/ddp_tutorial.html">Distributed Data Parallel — PyTorch</a> (a minimal code sketch follows below)</li>
</ul>
</div>
<div class="column quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-ddp-training" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ddp-training-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/multi-gpu-ddp.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="Figure 2: Data Parallel Training"><img src="./assets/multi-gpu-ddp.drawio.svg" class="img-fluid figure-img"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ddp-training-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 2: Data Parallel Training
</figcaption>
</figure>
</div>
</div>
</div>
</div>
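<p>As a rough sketch of what this looks like in code (a hedged example, not taken from the slides; it assumes the script is launched with <code>torchrun</code> or a similar launcher so the usual rank / world-size environment variables are already set):</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Minimal DistributedDataParallel (DDP) sketch.
# Assumes launch via e.g. `torchrun --nproc-per-node=N train.py`.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to(local_rank)   # stand-in model
    model = DDP(model, device_ids=[local_rank])           # identical copy per rank
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):                                   # toy training loop
        x = torch.randn(32, 1024, device=local_rank)      # unique data per rank
        loss = model(x).square().mean()
        loss.backward()            # gradients are all-reduced across ranks here
        optimizer.step()           # every rank applies the same averaged update
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()</code></pre></div>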
</section>
<section id="data-parallel-training-1" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="data-parallel-training-1">Data Parallel Training</h4>
<div>
</div>
<div class="quarto-layout-panel" data-layout="[50,50]">
<div class="quarto-layout-row">
<div class="column quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<ul>
<li>Relatively simple to get up and running (minor modifications to code)</li>
<li><i class="fa-brands fa-github" aria-label="github"></i> <a href="https://github.com/saforem2/ezpz"><code>saforem2/ezpz</code></a></li>
<li><a href="https://pytorch.org/docs/stable/notes/ddp.html">PyTorch – DDP</a></li>
<li><a href="https://www.deepspeed.ai/"><iconify-icon role="img" inline="" icon="logos:microsoft-icon" aria-label="Icon microsoft-icon from logos Iconify.design set." title="Icon microsoft-icon from logos Iconify.design set."></iconify-icon> DeepSpeed</a></li>
<li><a href="https://huggingface.co/docs/transformers/accelerate">Distributed training with 🤗 Accelerate</a></li>
<li><a href="https://youtu.be/930yrXjNkgM">🎬 “Parallel Training Techniques”</a></li>
</ul>
</div>
<div class="column quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-avgGrads" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-avgGrads-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/multi-gpu-ddp.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="Figure 3: Data Parallel Training"><img src="./assets/multi-gpu-ddp.drawio.svg" class="img-fluid figure-img"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-avgGrads-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 3: Data Parallel Training
</figcaption>
</figure>
</div>
</div>
</div>
</div>
</section>
<section id="communication" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="communication">Communication</h4>
<ul>
<li>Need mechanism(s) for communicating across GPUs:
<ul>
<li><a href="https://pytorch.org/docs/stable/distributed.html"><code>torch.distributed</code></a></li>
<li><a href="https://mpi4py.readthedocs.io/en/stable/tutorial.html"><code>mpi4py</code></a></li>
</ul></li>
<li>Collective Communication:
<ul>
<li><a href="https://developer.nvidia.com/nccl">Nvidia Collective Communications Library (NCCL)</a></li>
<li><a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html#gs.gouznn">Intel oneAPI Collective Communications Library (oneCCL)</a></li>
</ul>
<div class="callout callout-style-simple callout-warning no-icon callout-titled" title="⌛ Timeouts">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="true" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
⌛ Timeouts
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse show">
<div class="callout-body-container callout-body">
<ul>
<li>Collective operations must be called by <em>every</em> <code>rank</code> in the group in order to complete.
<ul>
<li>If any rank fails to participate, the other ranks will wait <strong>indefinitely</strong> (see the illustrative snippet below)</li>
</ul></li>
</ul>
</div>
</div>
</div></li>
</ul>
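<p>To make the timeout pitfall concrete, here is a small illustrative <code>torch.distributed</code> snippet (launcher-provided environment variables assumed): the collective is called unconditionally on every rank; guarding it behind a rank check would leave the other ranks blocked.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Illustrative only: every rank must enter the collective call.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # e.g. launched via torchrun / mpiexec
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

t = torch.tensor([float(rank)], device="cuda")

# Correct: called on *all* ranks, so the collective can complete.
dist.all_reduce(t, op=dist.ReduceOp.SUM)

# WRONG (would hang): only rank 0 participates, everyone else waits forever.
# if rank == 0:
#     dist.all_reduce(t, op=dist.ReduceOp.SUM)

dist.destroy_process_group()</code></pre></div>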
</section>
<section id="allreduce" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="allreduce">AllReduce</h4>
<p>Perform <em>reductions</em> on data (e.g. <code>sum</code>, <code>min</code>, <code>max</code>) across ranks, send result back to everyone.</p>
<div id="fig-all-reduce" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-all-reduce-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/collective-allreduce-sum.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-4" title="Figure 4: All-Reduce operation: each rank receives the reduction of input values across ranks."><img src="./assets/collective-allreduce-sum.drawio.svg" class="img-fluid figure-img"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-all-reduce-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 4: All-Reduce operation: each rank receives the reduction of input values across ranks.
</figcaption>
</figure>
</div>
<div class="footer">
</div>
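<p>For example (a sketch that assumes the process group has already been initialized), every rank contributes its own tensor and every rank gets back the sum:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># All-reduce sketch; assumes dist.init_process_group(...) has already run.
import torch
import torch.distributed as dist

rank = dist.get_rank()
x = torch.ones(4, device="cuda") * rank    # each rank starts with a different value
dist.all_reduce(x, op=dist.ReduceOp.SUM)   # in-place; x is now identical everywhere
# every element of x now equals 0 + 1 + ... + (world_size - 1), on every rank</code></pre></div>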
</section>
<section id="reduce" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="reduce">Reduce</h4>
<ul>
<li>Perform a <em>reduction</em> on data across ranks, send the result to a single (root) rank</li>
</ul>
<div id="fig-reduce" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-reduce-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/collective-reduce-sum.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-5" title="Figure 5: Reduce operation: one rank receives the reduction of input values across ranks"><img src="./assets/collective-reduce-sum.drawio.svg" class="img-fluid figure-img"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-reduce-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 5: Reduce operation: one rank receives the reduction of input values across ranks
</figcaption>
</figure>
</div>
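<p>A corresponding sketch (process group assumed to be initialized): all ranks call <code>reduce</code>, but only the destination rank ends up with the result:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Reduce sketch: the result lands only on dst=0, but every rank must call it.
import torch
import torch.distributed as dist

rank = dist.get_rank()
x = torch.ones(4, device="cuda") * rank
dist.reduce(x, dst=0, op=dist.ReduceOp.SUM)   # only rank 0's x holds the sum</code></pre></div>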
</section>
<section id="broadcast" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="broadcast">Broadcast</h4>
<div id="fig-broadcast" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-broadcast-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/collective-broadcast.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-6" title="Figure 6: broadcast (send) a tensor x from one rank to all ranks"><img src="./assets/collective-broadcast.drawio.svg" class="img-fluid figure-img"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-broadcast-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 6: <code>broadcast</code> (<em>send</em>) a tensor <code><span class="math inline">x</span></code> from one rank to all ranks
</figcaption>
</figure>
</div>
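<p>A sketch with <code>torch.distributed</code> (process group assumed to be initialized): rank 0's tensor is copied, in place, into every other rank's tensor:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Broadcast sketch: rank 0's tensor becomes everyone's tensor.
import torch
import torch.distributed as dist

rank = dist.get_rank()
if rank == 0:
    x = torch.arange(4, dtype=torch.float32, device="cuda")    # the data to send
else:
    x = torch.empty(4, device="cuda")                           # placeholder buffer
dist.broadcast(x, src=0)   # afterwards, x on every rank matches rank 0's x</code></pre></div>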
</section>
<section id="allgather" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="allgather">AllGather</h4>
<div id="fig-allgather" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-allgather-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/collective-allgather.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-7" title="Figure 7: Gathers tensors from the whole group in a list."><img src="./assets/collective-allgather.drawio.svg" class="img-fluid figure-img"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-allgather-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 7: Gathers tensors from the whole group in a list.
</figcaption>
</figure>
</div>
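<p>A sketch (process group assumed to be initialized): every rank ends up with the full list of per-rank tensors:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># All-gather sketch: collect one tensor from each rank, on every rank.
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()
x = torch.ones(4, device="cuda") * rank
gathered = [torch.empty_like(x) for _ in range(world_size)]
dist.all_gather(gathered, x)   # gathered[i] now holds rank i's tensor, everywhere</code></pre></div>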
</section>
<section id="scatter" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="scatter">Scatter</h4>
<div id="fig-scatter" class="r-stretch quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-scatter-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/collective-scatter.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-8" title="Figure 8: Scatters a list of tensors to the whole group"><img src="./assets/collective-scatter.drawio.svg" class="r-stretch img-fluid figure-img"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-scatter-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 8: Scatters a list of tensors to the whole group
</figcaption>
</figure>
</div>
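<p>A sketch (process group assumed to be initialized): the source rank hands out one tensor from its list to each rank:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Scatter sketch: rank 0 (src) distributes one tensor to each rank.
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()
out = torch.empty(4, device="cuda")
chunks = None
if rank == 0:
    chunks = [torch.ones(4, device="cuda") * i for i in range(world_size)]
dist.scatter(out, scatter_list=chunks, src=0)   # rank i receives chunks[i]</code></pre></div>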
</section>
<section id="why-distributed-training" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="why-distributed-training">Why Distributed Training?</h4>
<ul>
<li><code>N</code> workers each processing unique batch<a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a> of data:
<ul>
<li>[<code>micro_batch_size = 1</code>] <span class="math inline">\times</span> [<code>N</code> GPUs] <span class="math inline">\rightarrow</span> [<b><code>global_batch_size = N</code></b>]</li>
</ul></li>
<li>Smooth loss landscape</li>
<li>Improved gradient estimators</li>
<li>Fewer iterations needed for the same number of epochs
<ul>
<li>May need to train for more epochs if no other change is made</li>
<li>e.g. scaling the learning rate: <code>lr *= sqrt(N)</code> (see the short example below)</li>
</ul></li>
<li>See: <a href="https://arxiv.org/abs/1708.03888">Large Batch Training of Convolutional Networks</a></li>
</ul>
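<p>As a tiny illustration of the bookkeeping above (the square-root rule on the slide is one common heuristic; linear scaling is another):</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Illustrative batch-size / learning-rate bookkeeping for N data-parallel workers.
import math

micro_batch_size = 1
num_gpus = 64                                      # N
global_batch_size = micro_batch_size * num_gpus    # samples per optimizer step

base_lr = 1e-3
scaled_lr = base_lr * math.sqrt(num_gpus)          # lr *= sqrt(N), as on the slide
print(global_batch_size, scaled_lr)                # 64  0.008</code></pre></div>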
</section>
</section>
<section id="why-distributed-training-speedup" class="level3" data-background-color="white">
<h3 data-background-color="white" class="anchored" data-anchor-id="why-distributed-training-speedup">Why Distributed Training? Speedup!</h3>
<div id="tbl-recent-progress" class="responsive striped hover quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-tbl figure">
<figcaption class="quarto-float-caption-top quarto-float-caption quarto-float-tbl" id="tbl-recent-progress-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Table 1: Recent progress
</figcaption>
<div aria-describedby="tbl-recent-progress-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="table-responsive">
<table class="table-striped table-hover caption-top table">
<colgroup>
<col style="width: 11%">
<col style="width: 14%">
<col style="width: 8%">
<col style="width: 17%">
<col style="width: 14%">
<col style="width: 13%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;">Year</th>
<th style="text-align: center;">Author</th>
<th style="text-align: center;">GPU</th>
<th style="text-align: center;">Batch Size</th>
<th style="text-align: center;"># GPU</th>
<th style="text-align: center;">TIME (s)</th>
<th style="text-align: center;">ACC</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">2016</td>
<td style="text-align: center;">He</td>
<td style="text-align: center;">P100</td>
<td style="text-align: center;">256</td>
<td style="text-align: center;"><span class="red-bg">8</span></td>
<td style="text-align: center;"><span class="red-bg">104,400</span></td>
<td style="text-align: center;">75.30%</td>
</tr>
<tr class="even">
<td style="text-align: center;">2019</td>
<td style="text-align: center;">Yamazaki</td>
<td style="text-align: center;">V100</td>
<td style="text-align: center;">81,920</td>
<td style="text-align: center;"><span class="blue-bg">2048</span></td>
<td style="text-align: center;"><span class="blue-bg">72</span></td>
<td style="text-align: center;">75.08%</td>
</tr>
</tbody>
</table>
</div>
</div>
</figure>
</div>
<section id="dealing-with-data" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="dealing-with-data">Dealing with Data</h4>
<ul>
<li>At each training step, we want to ensure that <strong>each worker receives unique data</strong></li>
<li>This can be done in one of two ways (see the sketch after this list):
<ol type="1">
<li>Manually partition data (ahead of time)
<ul>
<li>Assign <strong>unique subsets</strong> to each worker</li>
<li>Each worker can only see their local portion of the data</li>
<li>Most common approach</li>
</ul></li>
<li>From each worker, randomly select a mini-batch
<ul>
<li>Each worker can see the full dataset</li>
<li>⚠️ When randomly selecting, it is important that each worker uses different seeds to ensure they receive unique data</li>
</ul></li>
</ol></li>
</ul>
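<p>In PyTorch, the first approach is typically handled by <code>DistributedSampler</code>, which partitions the dataset indices across ranks; a sketch (process group assumed to be initialized, with a stand-in dataset):</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># DistributedSampler sketch: each rank iterates over a disjoint shard of the data.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 32))     # stand-in dataset
sampler = DistributedSampler(
    dataset,
    num_replicas=dist.get_world_size(),
    rank=dist.get_rank(),
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)     # reshuffle (consistently across ranks) each epoch
    for (batch,) in loader:
        ...                      # forward / backward / optimizer step</code></pre></div>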
</section>
<section id="broadcast-initial-state" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="broadcast-initial-state">Broadcast Initial State</h4>
<ul>
<li>At the start of training (or when loading from a checkpoint), we want all of our workers to be initialized consistently
<ul>
<li><strong>Broadcast</strong> the model and optimizer states from the <code>rank() == 0</code> worker (see the sketch below)</li>
</ul></li>
</ul>
<div id="fig-broadcast" class="r-stretch quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-broadcast-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="cell" data-layout-align="center">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TD
0["GPU0"] --> 1["GPU 1"]
0 --> 2["GPU 2"]
0 --Model + Optim. State-->3["GPU 3"]
0 --> ...
0 --> N["GPU N"]
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-broadcast-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 9: To ensure all workers have the same copies, we load on <code>RANK==0</code> and <code>broadcast</code>
</figcaption>
</figure>
</div>
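<p>A sketch of what this looks like with plain <code>torch.distributed</code> (DDP performs an equivalent broadcast of the module state when the model is wrapped; the checkpoint path below is hypothetical):</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Broadcast-initial-state sketch: rank 0's weights become everyone's weights.
import torch
import torch.distributed as dist


def broadcast_model(model, src=0):
    """Copy parameters and buffers from rank `src` to all other ranks, in place."""
    for param in model.parameters():
        dist.broadcast(param.data, src=src)
    for buf in model.buffers():
        dist.broadcast(buf, src=src)


model = torch.nn.Linear(16, 16).cuda()
if dist.get_rank() == 0:
    # hypothetical checkpoint path; only rank 0 reads it from disk
    model.load_state_dict(torch.load("checkpoint.pt"))
broadcast_model(model, src=0)   # now every rank holds identical weights</code></pre></div>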
</section>
<section id="best-practices" class="level4" data-background-color="white">
<h4 data-background-color="white" class="anchored" data-anchor-id="best-practices">Best Practices</h4>
<div class="flex-container">
<div class="column" style="width:50%;">
<ul>
<li>Use parallel IO whenever possible
<ul>
<li>Feed each rank from different files</li>
<li>Use MPI IO to have each rank read its own batch from a file</li>
<li>Use several ranks to read data, MPI to scatter to remaining ranks
<ul>
<li>Most practical in big <em>at-scale</em> training</li>
</ul></li>
</ul></li>
</ul>
</div>
<div class="column" style="width:50%;">
<ul>
<li>Take advantage of data storage
<ul>
<li>Use <a href="https://wiki.lustre.org/Configuring_Lustre_File_Striping">striping on lustre</a></li>
</ul></li>
<li>Use the I/O optimizations appropriate for the target system (Aurora, Polaris, etc.)</li>
<li>Preload data when possible (see the sketch at the end of this section)
<ul>
<li>While the GPU is busy computing, the CPU is free to load and stage the next batch of data
<ul>
<li><strong>minimize IO latency this way</strong></li>
</ul></li>
</ul></li>
<li>Watch out for communication bottlenecks (see the callout below)</li>
</ul>
</div>
</div>
<div class="callout callout-style-simple callout-important no-icon callout-titled" title="⏰ Keeping things in Sync">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="true" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
⏰ Keeping things in Sync
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse show">
<div class="callout-body-container callout-body">
<p><strong>Computation stalls during communication !!</strong></p>
<p>Keeping the communication to computation ratio small is important for effective scaling.</p>
</div>
</div>
</div>
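<p>One way to act on the "preload data" advice above (a sketch, not from the slides): use background worker processes, pinned host memory, and asynchronous host-to-device copies so the next batch is being staged while the GPU is still busy with the current one.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Overlap data loading with compute: workers + pinned memory + non-blocking copies.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 32))   # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,       # CPU workers prepare batches in the background
    pin_memory=True,     # page-locked buffers enable asynchronous H2D copies
)

model = torch.nn.Linear(32, 32).cuda()
for (batch,) in loader:
    batch = batch.cuda(non_blocking=True)   # copy can overlap with prior GPU work
    out = model(batch)                       # GPU computes while CPU stages the next batch</code></pre></div>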
</section>
<section id="data-parallelism" class="level4" data-background-color="white">