Download SlimPajama into `/path_to_data` and put data from different domains into separate folders:

- `/path_to_data/en_arxiv`
- `/path_to_data/en_book`
- `/path_to_data/en_c4`
- `/path_to_data/en_cc`
- `/path_to_data/en_stack`
- `/path_to_data/en_wikipedia`
- `/path_to_data/github`
Each file should end with `*.jsonl`, and each line looks like:

```json
{"id": "id-info", "content": "raw text to be tokenized"}
```
Run the following command to tokenize the data in each folder:

```bash
python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
```
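Since each domain folder is tokenized separately, you may want to loop over all of them. A minimal sketch that simply repeats the command above per domain (paths are the same placeholders as before):

```python
import subprocess

domains = ["en_arxiv", "en_book", "en_c4", "en_cc", "en_stack", "en_wikipedia", "github"]

for domain in domains:
    # run the tokenizer once per domain folder
    subprocess.run(
        [
            "python", "-m", "smoe.utils.tokenize",
            "-f", "jsonl",
            "-t", "/path_to_tokenizer",
            "-i", f"/path_to_data/{domain}",
            "-o", f"/path_to_data_tokenized/{domain}",
        ],
        check=True,
    )
```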
| Description | Path |
| --- | --- |
| LLaMA-MoE 2/16 Experts | scripts/cpt/16_2/baseline_112gpus_sheared_llama_portion_fluency_sf8.sh |
| LLaMA-MoE 4/16 Experts | scripts/cpt/dynamic_data_selection/baseline_112gpus_sheared_llama_portion_fluency.sh |
| DynamicSheared | scripts/cpt/dynamic_data_selection/sheared_llama_112gpus.sh |
| Argument Name | Description |
| --- | --- |
| `--dynamic_data_selection` | For different dynamic data sampling strategies, choose one from: `sheared_llama` or `none` (static). Default: `none` |
| `--moe_calculator_score_scale_factor` | Scale factor to multiply after hidden states are processed by experts. Should be $\frac{\text{#total experts}}{\text{#selected}}$. Default: 4.0 |
| `--num_selects` | The number of selected experts. Default: 4 |
| `--gate_balance_loss_weight` | The weight of the balance loss for the gate. Default: 1e-2 |
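As a quick check of the relation between these arguments, the default scale factor of 4.0 corresponds to the 4/16 setting above:

```python
# scale factor should equal (#total experts) / (#selected experts)
num_experts = 16
num_selects = 4               # --num_selects
score_scale_factor = num_experts / num_selects
print(score_scale_factor)     # 4.0 -> --moe_calculator_score_scale_factor 4.0
```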
Other settings you may want to adjust in the training scripts include:

- balance loss weight
- scale factor
- learning rate
- warmup steps
- evaluation steps
- logging steps
- global batch size
- number of selected experts
- pretrained model
- data path
- GPUs
- comment
For the `scripts/cpt/lora.sh` and `scripts/cpt/fpt.sh` files, we could run an experiment via `sbatch`, e.g. `sbatch scripts/cpt/lora.sh`. The `sbatch` command is similar to `nohup`, and the submitted job runs in the background.
Here are some instructions for slurm configuration items:

- `--job-name`: the name of a job; it could be changed at your convenience.
- `--partition`: the slurm partition. For the MoE group, the default partition is `MoE`.
- `--output`: the logging file of `stdout`. `%x` means the job name, and `%j` is the job ID.
- `--error`: the logging file of `stderr`.
- `--ntasks-per-node`: always set to `1`.
- `--cpus-per-task`: may change according to different usage. The maximum number of CPUs for a node is `128` (according to `cinfo -p MoE`).
- `--nodes`: how many nodes you would like to use. NOTICE: the value of `num_nodes` must be the same as `--nodes`.
- `--gres`: the number of GPUs for each node (in the format `gpu:<num>`, e.g. `gpu:8`). NOTICE: the value of `num_gpu_per_node` must agree with `--gres`.
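Because `num_nodes`/`num_gpu_per_node` must stay in sync with `--nodes`/`--gres`, a small check can catch mismatches before submission. A sketch under the assumption that the `#SBATCH` lines and shell variables are written in the plain `name=value` form shown above:

```python
import re
import sys

def check_sbatch_consistency(script_path: str) -> None:
    """Check that num_nodes/num_gpu_per_node agree with --nodes/--gres."""
    text = open(script_path).read()
    nodes = int(re.search(r"#SBATCH\s+--nodes=(\d+)", text).group(1))
    gpus_per_node = int(re.search(r"#SBATCH\s+--gres=gpu:(\d+)", text).group(1))
    num_nodes = int(re.search(r"num_nodes=(\d+)", text).group(1))
    num_gpu_per_node = int(re.search(r"num_gpu_per_node=(\d+)", text).group(1))

    assert num_nodes == nodes, f"num_nodes={num_nodes} != --nodes={nodes}"
    assert num_gpu_per_node == gpus_per_node, (
        f"num_gpu_per_node={num_gpu_per_node} != --gres=gpu:{gpus_per_node}"
    )
    print(f"OK: {nodes} node(s) x {gpus_per_node} GPU(s)")

if __name__ == "__main__":
    check_sbatch_consistency(sys.argv[1])  # e.g. scripts/cpt/fpt.sh
```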
`model_type` and `pretrained_model` in `scripts/cpt/(lora|fpt).sh` help specify the foundation model.

For vanilla LLaMA, use the following settings:

```bash
model_type="llama"
pretrained_model=/mnt/petrelfs/share_data/quxiaoye/models/llama_7B
```

For LLaMA-MoE, use the following settings:

```bash
model_type="llama_moe"
pretrained_model=/mnt/petrelfs/share_data/quxiaoye/models/llama_7B_MoE_16Select4-l2_norm
```
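If you want to quickly load a checkpoint outside the training scripts (e.g. for a smoke-test forward pass), a minimal sketch, assuming the checkpoint directory is loadable through the `transformers` auto classes with `trust_remote_code=True` (as the released LLaMA-MoE weights are):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/mnt/petrelfs/share_data/quxiaoye/models/llama_7B_MoE_16Select4-l2_norm"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True
)

inputs = tokenizer("Suzhou is famous for", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```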
- llama1-7b, 16 select 4: 3.49B activated params
- llama1-13b: total params 13,015,864,320; total MLP params 8,493,465,600
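The 3.49B number is the activated parameter count: only the MLP weights are split into experts, and with 4 of 16 experts selected, only a quarter of the MLP parameters are used per token. A rough sketch of the bookkeeping (the 7B shapes below are the standard LLaMA-1 sizes, not values read from this repo; the small gate/router parameters are ignored):

```python
# LLaMA-1 7B: 32 layers, hidden 4096, MLP intermediate 11008
total_params_7b = 6_738_415_616
mlp_params_7b = 32 * 3 * 4096 * 11008            # gate/up/down projections = 4,328,521,728

# 4 of 16 experts selected -> 1/4 of the MLP params are active per token
activated = total_params_7b - mlp_params_7b * (1 - 4 / 16)
print(f"{activated / 1e9:.2f}B")                 # ~3.49B

# same bookkeeping for LLaMA-1 13B (40 layers, hidden 5120, intermediate 13824)
mlp_params_13b = 40 * 3 * 5120 * 13824
print(mlp_params_13b)                            # 8,493,465,600, matching the note above
```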
For convenient estimation of the model training speed, we provide some useful information at the very beginning of log files:
```text
max_steps: 63578
global batch size: 768
#tokens/batch: 1572864
```

- `global batch size`: per_device_train_batch_size * gradient_accumulation_steps * num_nodes * num_gpu_per_node
- `max_steps`: how many steps should be trained to reach 100B tokens: 10^11 / (block_size * per_device_train_batch_size * gradient_accumulation_steps * num_nodes * num_gpu_per_node)
- `#tokens/batch`: the number of trained tokens for one global batch
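Plugging the logged numbers back into these formulas (the block size is not printed here, but 1572864 / 768 implies a block size of 2048):

```python
block_size = 2048        # implied by 1_572_864 tokens per batch / 768 sequences per batch
global_batch_size = 768  # per_device_train_batch_size * grad_accum * num_nodes * num_gpu_per_node

tokens_per_batch = block_size * global_batch_size
print(tokens_per_batch)              # 1572864

target_tokens = 10**11               # 100B tokens
max_steps = target_tokens // tokens_per_batch
print(max_steps)                     # 63578
```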
When estimating the expected time, you may want to check the running `time/step` via tensorboard or from the logging file. Based on the above information, the expected time could be calculated.
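For example, a rough sketch of that estimate (the seconds-per-step value is a placeholder; use the actual `time/step` from tensorboard or the logs):

```python
max_steps = 63578
seconds_per_step = 5.0   # placeholder; read the real time/step from tensorboard or the log file

expected_seconds = max_steps * seconds_per_step
print(f"~{expected_seconds / 3600:.1f} hours ({expected_seconds / 86400:.1f} days)")
```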
The tensorboard `logging_dir` could be found at `outputs/<job-name>-<job-id>/runs/<logging-dir>`. For example, if my job name is `cpt-moe-fpt-bs16-48gpus` in the sbatch file, tensorboard could be started via: `tensorboard --logdir outputs/cpt-moe-fpt-bs16-48gpus-1535835/runs/Jul31_14-12-00`.
For multiple tasks with different logging directories, you could run the following command:
```bash
$ tensorboard --logdir_spec short_name:dir1,short_name2:dir2 --port 8001
```

Here, the `short_name` is an abbreviation for your task, and the port number could be changed manually if there is a port conflict. e.g.

```bash
$ tensorboard --logdir_spec moe_from_scratch:outputs/cpt-llama-moe-scratch-lora-bs16-1476932/runs/Jul26_21-53-42,moe_lora:outputs/cpt-llama-lora-bs16-1476918/runs/Jul26_21-31-09 --port 8001
```