LUMI scripts - Mosaic/llm-foundry #190

Merged: 21 commits merged into main from lumi on Jan 15, 2024

Conversation

rlrs (Collaborator) commented Nov 16, 2023

(Continued) Pretraining setup for LUMI. These scripts now target a mosaicml/llm-foundry stack. Everything should work.

rlrs (Collaborator, Author) commented Nov 16, 2023

And yes, I know there's a Hugging Face token in there. It's invalid, but we need to find a better way to manage that.
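
One way to keep the token out of the repository is to read it from the environment at launch time. A minimal sketch, assuming the token is exported as `HF_TOKEN` before the job starts; the variable name and the helper `get_hf_token` are illustrative and not part of this PR:

```python
import os


def get_hf_token(env=None):
    """Return the Hugging Face token from the environment, or fail loudly."""
    # Assumption: the token is exported as HF_TOKEN before the job starts.
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it instead of committing it")
    return token
```

Failing early when the variable is missing avoids a job that queues for hours and then dies on the first authenticated download.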

This PR is stale because it has been open 1+ days with no activity. Feel free to either 1) remove the stale label or 2) comment. If nothing happens, this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 20, 2023

rlrs (Collaborator, Author) commented Dec 20, 2023

Please don't close my draft PR because it's not active over xmas 😅

@rlrs rlrs removed the Stale label Dec 20, 2023

rlrs (Collaborator, Author) commented Dec 20, 2023

Okay, so this whole thing works now. There are a few unused scripts containing some of my other attempts at setting things up on LUMI - I should perhaps move these somewhere else to keep things clean.

The important files are:

  • make_venv.sh - creates the Python venv that all the nodes use to run the pretraining code.
  • continue_mistral_mosaic.sh - this is the SLURM sbatch script that describes how many nodes to run on etc., and launches the following script in the correct Singularity container on each node.
  • mosaic_in_container.sh - this script runs inside the container on each node and simply sets up a few things before running the given training command.
  • continue-mistral-7b.yaml - configuration file that describes which model to train, which hyperparams, which data, which evals etc.
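
The chain described above (sbatch script, then the container, then the wrapper script, then the training command) can be sketched as a single composed command. This is only an illustration of the layering: the image path, the `composer train.py` training command, and the helper `build_launch_command` are assumptions, and the real scripts may pass extra `srun` or `singularity` options that are not shown in this thread.

```python
import shlex


def build_launch_command(image, train_cmd):
    """Compose the per-node command: srun -> singularity exec -> wrapper -> training."""
    # Hypothetical: the image path and any extra srun/singularity options
    # used in the real scripts are not shown in this PR thread.
    return ["srun", "singularity", "exec", image,
            "bash", "mosaic_in_container.sh", *train_cmd]


cmd = build_launch_command(
    "/path/to/container.sif",  # placeholder image path
    ["composer", "train.py", "continue-mistral-7b.yaml"],  # assumed training command
)
print(shlex.join(cmd))
```

Keeping the wrapper as a separate script means the sbatch file only has to describe resources and the container, while per-node setup lives in one place.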

Additionally, I've added two submodules (so you have to clone with --recurse-submodules) since these are core dependencies that we need to keep track of, and perhaps we should pin them to a certain commit instead of the head of a branch.
Everything else is unused in the current setup.
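
Since the superproject records one commit per submodule, a quick check before launching can catch a clone made without --recurse-submodules or a submodule that has drifted from the recorded commit. A minimal sketch that interprets the output of `git submodule status`; the helper name and the sample paths are hypothetical, and paths containing spaces are not handled:

```python
def find_submodule_problems(status_output):
    """Map submodule path -> problem, given the text of `git submodule status`."""
    # Prefix characters documented by git: '-' not initialized,
    # '+' checked-out commit differs from the one recorded in the superproject,
    # 'U' merge conflicts. A leading space means the submodule is in sync.
    meanings = {
        "-": "not initialized (clone with --recurse-submodules)",
        "+": "checked out at a different commit than the recorded one",
        "U": "has merge conflicts",
    }
    problems = {}
    for line in status_output.splitlines():
        if not line.strip():
            continue
        flag = line[0]
        fields = line.lstrip("-+U ").split()
        if flag in meanings and len(fields) >= 2:
            problems[fields[1]] = meanings[flag]
    return problems


# Hypothetical sample output; the paths are placeholders, not the PR's submodules.
sample = (
    " 1111111111111111111111111111111111111111 deps/in_sync (heads/main)\n"
    "-2222222222222222222222222222222222222222 deps/missing\n"
    "+3333333333333333333333333333333333333333 deps/drifted (heads/main-1-g3333333)\n"
)
for path, problem in find_submodule_problems(sample).items():
    print(f"{path}: {problem}")
```

In practice the text could come from `subprocess.run(["git", "submodule", "status"], capture_output=True, text=True).stdout`, passed in without stripping so the leading status character is preserved.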

@rlrs rlrs marked this pull request as ready for review December 20, 2023 14:34

KennethEnevoldsen (Contributor) left a comment

Looks good, but I have a few questions. I will try to make it run on LUMI as well (after Christmas), and that might lead to more questions, but that does not need to hold the PR back.

KennethEnevoldsen (Contributor) commented:

> Okay, so this whole thing works now. There are a few unused scripts containing some of my other attempts at setting things up on LUMI - I should perhaps move these somewhere else to keep things clean.

Feel free to keep anything that is still being worked on, but anything that could be deleted and recovered from the history (if needed) might as well be deleted.

> perhaps we should pin them to a certain commit instead of the head of a branch.

Definitely pin them.

@github-actions github-actions bot added the Stale label Dec 26, 2023

KennethEnevoldsen (Contributor) commented:

@rlrs I will remove the stale label (this gives it another 7 days), as I assume you might be on vacation.

@github-actions github-actions bot added the Stale label Jan 2, 2024
@github-actions github-actions bot removed the Stale label Jan 3, 2024

KennethEnevoldsen (Contributor) left a comment

Looks good!

scripts/data/convert_dataset_json.py (review thread: outdated, resolved)

@github-actions github-actions bot added the Stale label Jan 9, 2024
@rlrs rlrs enabled auto-merge January 15, 2024 13:05
@rlrs rlrs merged commit 457f847 into main Jan 15, 2024
1 check passed
@rlrs rlrs deleted the lumi branch January 15, 2024 13:06