Carlos Lizarraga-Celaya edited this page Dec 9, 2024 · 15 revisions

Build an LLM from Scratch Workshop

Description

Goal: Work through and reproduce the code from Sebastian Raschka's book, Build a Large Language Model (From Scratch).

This hands-on workshop will guide you step by step through building a GPT-style language model that can run on a laptop or on a GPU.

The workshop will begin with base model code and, without using any existing LLM libraries, guide participants through developing a text classifier and creating a chatbot.

Participants require intermediate Python skills and some knowledge of machine learning.


Instructors

Enrique Noriega

Enrique Noriega is a computational research scientist in the Department of Computer Science and the Data Science Institute at the University of Arizona. He specializes in developing AI applications for medical sciences and is passionate about working with deep learning models.

Carlos Lizárraga

Carlos Lizárraga is a Computational and Data Science Educator at the Data Science Institute at the University of Arizona. With a strong background in applied mathematics and physics, he focuses on applying machine learning and deep learning models to scientific research.


Schedule Spring 2025

Time: Thursdays 1 PM
Where:
Zoom link:

| Topic | Description | Date | Notes | URL |
| --- | --- | --- | --- | --- |
| 0. PyTorch refresh and setup | A review of the PyTorch library and a detailed setup of the computational environment for the workshop, including installing package dependencies, configuring a virtual environment, and checking hardware requirements. | Jan 28 | | |
| 1. Understanding large language models | A high-level introduction to the core concepts of large language models (LLMs). This session explores the transformer architecture, the foundation of systems such as ChatGPT. | Feb 4 | | |
| 2. Working with text data | This session outlines the steps for building an LLM from scratch and explains how text is prepared for training: tokenization (splitting text into word and subword tokens), byte pair encoding, training data creation through sliding window sampling, and conversion of tokens into vectors for LLM input. | Feb 11 | | |
| 3. Coding attention mechanisms | This session focuses on attention mechanisms in LLMs. We'll start with a basic self-attention framework before advancing to more complex implementations. The session covers building a causal attention module that enables token-by-token generation, applying dropout to randomly selected attention weights to reduce overfitting, and combining multiple causal attention modules into a multi-head attention module. | Feb 18 | | |
| 4. Implementing a GPT from scratch to generate text | This session focuses on coding a GPT-like LLM that can generate human-like text. It covers normalizing layer activations to stabilize training, adding shortcut connections so that deep networks train more effectively, implementing transformer blocks to create GPT models of various sizes, and calculating the number of parameters and the storage requirements of GPT models. | Feb 25 | | |
| 5. Pretraining on unlabeled data | This session covers the implementation of LLM pretraining. We'll explore how to compute training and validation losses to evaluate text generation quality, implement the core training function, save and load model weights for continued training, and load pretrained weights from OpenAI. | Mar 4 | | |
| Spring break | No session | Mar 11 | | |
| 6. Fine-tuning for classification | This session introduces fine-tuning a pretrained LLM for classification. We cover dataset preparation for text classification, modification of pretrained LLMs, practical fine-tuning using spam detection as an example, and methods for evaluating classifier accuracy. | Mar 18 | | |
| 7. Fine-tuning to follow instructions | This session explores how to fine-tune LLMs to follow instructions. We'll cover preparing supervised instruction datasets, organizing training batches, fine-tuning pretrained models to follow human instructions, and evaluating the results through response analysis and performance metrics. | Mar 25 | | |
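
As a preview of the sliding window sampling mentioned in session 2, here is a minimal sketch in plain Python. It is not the workshop's or the book's code: the function name `sliding_window_pairs`, its parameters, and the sample token IDs are all hypothetical and chosen only for illustration. The idea shown is that each training input is a fixed-length window of token IDs, and its target is the same window shifted one position to the right, so the model learns to predict the next token.

```python
def sliding_window_pairs(token_ids, context_length, stride):
    """Return (input, target) pairs; each target is its input shifted right by one token."""
    pairs = []
    # Stop early enough that the shifted target window still fits in the sequence.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Hypothetical token IDs; a real run would use the output of a tokenizer.
token_ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]
pairs = sliding_window_pairs(token_ids, context_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With `stride` equal to `context_length`, the windows do not overlap. A smaller stride would yield more, overlapping training examples from the same text.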

Created: 12/08/2024 (C. Lizárraga); Last update: 12/08/2024 (C. Lizárraga)

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2024.
