This document describes how to convert a LLaMA model to LLaMA-MoE.
The conversion from LLaMA to LLaMA-MoE consists of two steps:
- Split. Create indices sets $S_1,S_2,\dots,S_n$ (Eq. 5 in the technical report) for each FFN layer in LLaMA. The indices sets indicate which intermediate neurons are assigned to each expert. Save the indices sets to disk.
- Convert. Create a LLaMA-MoE model from an existing LLaMA checkpoint. Reinitialize the LLaMA-MoE experts by selecting the corresponding neurons in the indices sets. Save the initialized LLaMA-MoE model to disk.
To randomly split the intermediate neurons in FFNs, you can run:
bash ./scripts/expert_construction/split/run_split_random.sh
Remember to change the following variables:
num_experts="" # number of experts in each MoE layer
model_path="" # path to the LLaMA checkpoint
save_path="" # path to save the indices sets
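As a rough illustration of what a random split produces, the sketch below partitions the intermediate neuron indices of one FFN layer into equally sized indices sets. The function name `random_split` and the `seed` parameter are hypothetical and are not taken from the script; the intermediate size of 11008 is that of LLaMA-2 7B.

```python
import random


def random_split(intermediate_size: int, num_experts: int, seed: int = 0) -> list[list[int]]:
    """Randomly partition neuron indices 0..intermediate_size-1 into equal sets."""
    if intermediate_size % num_experts != 0:
        raise ValueError("intermediate_size must be divisible by num_experts")
    indices = list(range(intermediate_size))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    expert_size = intermediate_size // num_experts
    # Consecutive chunks of the shuffled list form the indices sets S_1..S_n.
    return [
        sorted(indices[k * expert_size:(k + 1) * expert_size])
        for k in range(num_experts)
    ]


# One FFN layer of LLaMA-2 7B has 11008 intermediate neurons.
index_sets = random_split(intermediate_size=11008, num_experts=8)
```

Each neuron appears in exactly one set, so the sets form a partition of the layer's intermediate neurons.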
To split the intermediate neurons in FFNs by k-means clustering, you can run:
bash ./scripts/expert_construction/split/run_split_clustering.sh
Remember to change the following variables:
num_experts="" # number of experts in each MoE layer
model_path="" # path to the LLaMA checkpoint
save_path="" # path to save the indices sets
metric="" # metric for clustering, choices: `l2` `cos`
proj_type="" # weights to perform clustering, choices: `up_proj` `gate_proj`
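To make the clustering idea concrete, here is a minimal sketch that groups neurons by running plain Lloyd's k-means with squared L2 distance on the rows of a weight matrix, where each row stands for one intermediate neuron. This is an assumption-laden illustration, not the script's implementation: the function names are hypothetical, and plain k-means does not guarantee equally sized clusters, which this sketch makes no attempt to enforce.

```python
import random


def l2_sq(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_split(rows, num_experts, iters=10, seed=0):
    """Cluster rows (one per intermediate neuron) and return one index list per cluster."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, num_experts)]
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each neuron goes to its nearest center.
        assign = [
            min(range(num_experts), key=lambda k: l2_sq(row, centers[k]))
            for row in rows
        ]
        # Update step: each center becomes the mean of its members.
        for k in range(num_experts):
            members = [rows[i] for i, a in enumerate(assign) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return [[i for i, a in enumerate(assign) if a == k] for k in range(num_experts)]


# Toy stand-in for an up_proj or gate_proj weight with 4 neurons and hidden size 2.
toy_rows = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
toy_sets = kmeans_split(toy_rows, num_experts=2)
```

The result depends on the random initial centers, so different seeds can give different groupings.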
This part is not included in our technical report.
We do not recommend this method due to its complexity.
We also implemented the co-activation graph based method from MoEfication.
You need to install METIS first. Then you can run the following scripts to perform the split:
bash ./scripts/expert_construction/get_hidden_features/run_prepare_datasets.sh
bash ./scripts/expert_construction/get_hidden_features/run_get_hidden_features.sh
bash ./scripts/expert_construction/split/run_split_graph.sh
Remember to change the following variables:
num_experts="" # number of experts in each MoE layer
model_path="" # path to the LLaMA checkpoint
save_path="" # path to save the indices sets
metric="" # metric to measure the sparsity, choices: `l1_norm` `l2_norm` `plain`
proj_type="" # outputs to use for constructing co-activation graph, should be set to `up_proj`
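The core data structure behind this method is a co-activation graph: neurons are nodes, and an edge weight counts how often two neurons are active for the same token. The sketch below builds those counts from a small list of per-token activations. The function name, the `threshold` parameter, and the toy data are hypothetical; the graph partitioning itself is left to METIS and is not shown.

```python
from collections import Counter
from itertools import combinations


def coactivation_counts(activations, threshold=0.0):
    """Count, for every neuron pair (i, j) with i < j, the tokens where both are active."""
    counts = Counter()
    for token_acts in activations:
        active = [i for i, a in enumerate(token_acts) if a > threshold]
        for i, j in combinations(active, 2):
            counts[(i, j)] += 1
    return counts


# Three tokens, four intermediate neurons.
acts = [
    [0.9, 0.4, 0.0, 0.0],
    [0.7, 0.2, 0.0, 0.1],
    [0.0, 0.0, 0.5, 0.8],
]
edges = coactivation_counts(acts)
```

In this toy input, neurons 0 and 1 fire together on two tokens, so their edge has the largest weight and a partitioner would tend to keep them in the same expert.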
Before performing gradient-based splitting (Eq. 8 in the technical report), you need to prepare pretraining data and group it into clusters by running:
python smoe/entrypoint/text_clustering.py
Then, run the following script to get the importance vectors:
bash scripts/expert_construction/split/run_split_gradient_get_grads.sh
Remember to change the following variables:
dataset_dir="" # path to clustered data
pretrained_model="" # path to the LLaMA checkpoint
tokenizer_path="" # path to the LLaMA tokenizer
save_path="" # path to save the importance vector files
accumulate_level="" # should be set to `sample`
kernel="" # should be set to `l1_norm`
importance_type="" # should be set to `feature_change`
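The settings above (`kernel` set to `l1_norm`, `accumulate_level` set to `sample`) suggest that an absolute value is taken per sample before accumulating. As an illustration only, and assuming a first-order score of the form |h * dL/dh| per intermediate neuron (our reading, not something this document confirms), the accumulation could be sketched as follows. The function name and its inputs are hypothetical.

```python
def accumulate_importance(hidden, grads):
    """Sum |h * grad| per neuron over samples.

    hidden, grads: lists with one entry per sample, each a list with one
    value per intermediate neuron (activation and its gradient).
    """
    num_neurons = len(hidden[0])
    score = [0.0] * num_neurons
    for h, g in zip(hidden, grads):
        for i in range(num_neurons):
            # Absolute value is applied per sample, before accumulation.
            score[i] += abs(h[i] * g[i])
    return score


# Two samples, two neurons.
toy_hidden = [[0.5, 2.0], [1.0, -1.0]]
toy_grads = [[-2.0, 0.25], [0.5, 0.5]]
toy_importance = accumulate_importance(toy_hidden, toy_grads)
```

Running this once per data cluster would yield one importance vector per cluster, which matches the per-cluster directories the script writes.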
After that, the importance vector files are saved to `save_path` with the following file structure:
# this is an example with 16 data clusters
--Gradient16
    -- llama2_7B-Gradients-l1_norm-sample-feature_change
        -- 0
            layers.0.mlp.gate_proj.weight.change # importance on the output of gate_proj
            layers.0.mlp.up_proj.weight.change # importance on the output of (up_proj * gate_proj)
            layers.1.mlp.gate_proj.weight.change
            layers.1.mlp.up_proj.weight.change
            ...
        -- 1
            layers.0.mlp.gate_proj.weight.change
            layers.0.mlp.up_proj.weight.change
            layers.1.mlp.gate_proj.weight.change
            layers.1.mlp.up_proj.weight.change
            ...
        ...
        -- 15
            layers.0.mlp.gate_proj.weight.change
            layers.0.mlp.up_proj.weight.change
            layers.1.mlp.gate_proj.weight.change
            layers.1.mlp.up_proj.weight.change
            ...
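When checking that a run completed, it can help to enumerate the files expected for one cluster. The helper below is a small sketch that follows the layout shown above; it assumes the top-level directory (`Gradient16` in the example) is the `save_path`, and the function name and `run_name` default are hypothetical conveniences.

```python
from pathlib import Path


def importance_files(
    save_path,
    cluster_id,
    num_layers,
    run_name="llama2_7B-Gradients-l1_norm-sample-feature_change",
):
    """List the expected importance vector files for one data cluster."""
    root = Path(save_path) / run_name / str(cluster_id)
    return [
        root / f"layers.{layer}.mlp.{proj}.weight.change"
        for layer in range(num_layers)
        for proj in ("gate_proj", "up_proj")
    ]


# Expected files for cluster 0, restricted to the first two layers.
expected = importance_files("Gradient16", cluster_id=0, num_layers=2)
missing = [p for p in expected if not p.exists()]
```

A non-empty `missing` list indicates that the gradient script has not yet written all files for that cluster.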
This part is not included in our technical report.
You can also split the intermediate neurons in a neuron-independent manner by treating the expert split as a task assignment problem. To perform the split, you can run:
bash ./scripts/expert_construction/split/run_split_gradient.sh
Remember to change the following variables:
expert_num="" # number of experts in each MoE layer
expert_size="" # intermediate neurons in each expert
share_neurons="False" ######### SET AS FALSE TO BE NEURON-INDEPENDENT #########
model_path="" # path to the LLaMA checkpoint
score_file_path="" # path to the score files generated above
save_path="" # path to save the indices sets
visualization_path="" # path to save the visualization results
criterion="" # criterion to judge the importance of neurons, should be set to `max`
proj_type="" # importance vector to use, should be set to `up_proj`
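To illustrate the task assignment view, the sketch below gives each neuron to exactly one expert, preferring the expert for which the neuron has the highest importance and respecting a capacity of `expert_size` neurons per expert. It uses a simple greedy pass, which is an assumption made for brevity and not necessarily the solver used by the script; the function name and toy scores are hypothetical.

```python
def assign_neurons(scores, expert_size):
    """Greedy one-to-one assignment of neurons to experts.

    scores[k][i] is the importance of neuron i for expert k.
    """
    num_experts, num_neurons = len(scores), len(scores[0])
    # Visit (score, expert, neuron) triples from most to least important.
    pairs = sorted(
        ((scores[k][i], k, i) for k in range(num_experts) for i in range(num_neurons)),
        reverse=True,
    )
    owner = {}
    experts = [[] for _ in range(num_experts)]
    for _, k, i in pairs:
        if i not in owner and len(experts[k]) < expert_size:
            owner[i] = k
            experts[k].append(i)
    return [sorted(e) for e in experts]


toy_scores = [
    [0.9, 0.8, 0.1, 0.2],  # importance vector for expert 0
    [0.3, 0.7, 0.6, 0.5],  # importance vector for expert 1
]
independent_sets = assign_neurons(toy_scores, expert_size=2)
```

When `expert_num * expert_size` equals the number of intermediate neurons, every neuron ends up in exactly one expert, so no neuron is shared.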
The inner-sharing gradient split uses the same entry script as the neuron-independent strategy above:
bash ./scripts/expert_construction/split/run_split_gradient.sh
Remember to change the following variables:
expert_num="" # number of experts in each MoE layer
expert_size="" # intermediate neurons in each expert
share_neurons="True" ######### SET AS TRUE TO BE INNER-SHARING #########
model_path="" # path to the LLaMA checkpoint
score_file_path="" # path to the score files generated above
save_path="" # path to save the indices sets
visualization_path="" # path to save the visualization results
criterion="" # criterion to judge the importance of neurons, should be set to `max`
proj_type="" # importance vector to use, should be set to `up_proj`
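With `share_neurons` set to `True`, a neuron may appear in more than one expert. One simple way to picture this, assuming each expert just keeps the `expert_size` neurons that score highest in its own importance vector, is the sketch below. The function name and toy scores are hypothetical, and the script's actual selection rule may differ.

```python
def top_k_shared(scores, expert_size):
    """For each expert, keep the expert_size neurons with the highest importance."""
    experts = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        experts.append(sorted(ranked[:expert_size]))
    return experts


toy_scores_shared = [
    [0.9, 0.8, 0.1, 0.2],  # importance vector for expert 0
    [0.3, 0.7, 0.6, 0.5],  # importance vector for expert 1
]
shared_sets = top_k_shared(toy_scores_shared, expert_size=2)
```

In this toy case neuron 1 is selected by both experts and neuron 3 by neither, which is the main difference from the neuron-independent split.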
You can run the following script to perform inter-sharing split:
bash ./scripts/expert_construction/split/run_split_gradient_residual.sh
Remember to change the following variables:
expert_num_moe="" # number of non-residual experts
expert_num_residual="" # number of residual experts
expert_size="" # intermediate neurons in each expert
share_neurons="" # whether to share neurons in non-residual experts
model_path="" # path to the LLaMA checkpoint
score_file_path="" # path to the score files generated above
save_path="" # path to save the indices sets
visualization_path="" # path to save the visualization results
criterion="" # criterion to judge the importance of neurons, should be set to `max`
proj_type="" # importance vector to use, should be set to `up_proj`
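The variables above separate residual experts from non-residual ones. As a rough sketch of that separation, assuming the residual experts take the neurons with the highest importance summed over all clusters (an assumption for illustration, not a description of the script), the neurons could be divided into two pools like this. The function name and toy scores are hypothetical.

```python
def split_residual(scores, expert_size, expert_num_residual):
    """Divide neurons into a residual pool and a remaining pool.

    scores[k][i] is the importance of neuron i for data cluster k.
    """
    num_neurons = len(scores[0])
    total = [sum(row[i] for row in scores) for i in range(num_neurons)]
    ranked = sorted(range(num_neurons), key=lambda i: total[i], reverse=True)
    n_residual = expert_size * expert_num_residual
    residual = sorted(ranked[:n_residual])   # neurons for the residual experts
    remaining = sorted(ranked[n_residual:])  # left for the non-residual experts
    return residual, remaining


toy_scores_residual = [
    [0.9, 0.8, 0.1, 0.2],
    [0.3, 0.7, 0.6, 0.5],
]
residual_pool, remaining_pool = split_residual(
    toy_scores_residual, expert_size=2, expert_num_residual=1
)
```

The remaining pool would then be distributed among the `expert_num_moe` non-residual experts, with or without sharing depending on `share_neurons`.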
Run the following script:
bash ./scripts/expert_construction/convert/run_convert.sh
Run the following script:
bash ./scripts/expert_construction/convert/run_convert_gradient.sh
Run the following script:
bash ./scripts/expert_construction/convert/run_convert_gradient_residual.sh
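Conceptually, the convert step builds each expert by selecting the neurons listed in its indices set from the original FFN weights. In a LLaMA FFN, `gate_proj` and `up_proj` weights have shape [intermediate, hidden], so a neuron corresponds to a row, while `down_proj` has shape [hidden, intermediate], so a neuron corresponds to a column. The sketch below shows that slicing on nested lists; the function name is hypothetical and the real scripts operate on checkpoint tensors.

```python
def slice_expert(gate_proj, up_proj, down_proj, index_set):
    """Select the weights of one expert from the dense FFN weights.

    gate_proj, up_proj: [intermediate][hidden]; down_proj: [hidden][intermediate].
    """
    return {
        "gate_proj": [gate_proj[i] for i in index_set],  # pick rows
        "up_proj": [up_proj[i] for i in index_set],      # pick rows
        "down_proj": [[row[i] for i in index_set] for row in down_proj],  # pick columns
    }


# Toy FFN with 4 intermediate neurons and hidden size 2.
toy_gate = [[float(i), float(i) + 0.5] for i in range(4)]
toy_up = [[-float(i), -float(i)] for i in range(4)]
toy_down = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
toy_expert = slice_expert(toy_gate, toy_up, toy_down, index_set=[1, 3])
```

The resulting expert has an intermediate size equal to the length of its indices set and keeps the original hidden size.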
--smoe
    -- scripts
        -- expert_construction
            -- convert
            -- get_hidden_features (deprecated, will be removed later)
            -- prune (deprecated, will be removed later)
            -- select (deprecated, will be removed later)
            -- split
    -- smoe
        -- entrypoint
            -- expert_construction