Scripts
TencentPretrain provides a wide range of tool scripts for pre-training models. This section first summarizes the scripts and their functions, then gives usage examples for some of them.
Script | Function description |
---|---|
average_models.py | Take the average of pre-trained models |
build_vocab.py | Build vocabulary given corpus and tokenizer |
cloze_test.py | Randomly mask a word and predict it; the top n words are returned |
convert_bart_from_huggingface_to_tencentpretrain.py | Convert BART of Huggingface format (PyTorch) to TencentPretrain format |
convert_bart_from_tencentpretrain_to_huggingface.py | Convert BART of TencentPretrain format to Huggingface format (PyTorch) |
convert_albert_from_huggingface_to_tencentpretrain.py | Convert ALBERT of Huggingface format (PyTorch) to TencentPretrain format |
convert_albert_from_tencentpretrain_to_huggingface.py | Convert ALBERT of TencentPretrain format to Huggingface format (PyTorch) |
convert_bert_extractive_qa_from_huggingface_to_tencentpretrain.py | Convert extractive QA BERT of Huggingface format (PyTorch) to TencentPretrain format |
convert_bert_extractive_qa_from_tencentpretrain_to_huggingface.py | Convert extractive QA BERT of TencentPretrain format to Huggingface format (PyTorch) |
convert_bert_from_google_to_tencentpretrain.py | Convert BERT of Google format (TF) to TencentPretrain format |
convert_bert_from_huggingface_to_tencentpretrain.py | Convert BERT of Huggingface format (PyTorch) to TencentPretrain format |
convert_bert_from_tencentpretrain_to_google.py | Convert BERT of TencentPretrain format to Google format (TF) |
convert_bert_from_tencentpretrain_to_huggingface.py | Convert BERT of TencentPretrain format to Huggingface format (PyTorch) |
convert_bert_text_classification_from_huggingface_to_tencentpretrain.py | Convert text classification BERT of Huggingface format (PyTorch) to TencentPretrain format |
convert_bert_text_classification_from_tencentpretrain_to_huggingface.py | Convert text classification BERT of TencentPretrain format to Huggingface format (PyTorch) |
convert_bert_token_classification_from_huggingface_to_tencentpretrain.py | Convert sequence labeling BERT of Huggingface format (PyTorch) to TencentPretrain format |
convert_bert_token_classification_from_tencentpretrain_to_huggingface.py | Convert sequence labeling BERT of TencentPretrain format to Huggingface format (PyTorch) |
convert_gpt2_from_huggingface_to_tencentpretrain.py | Convert GPT-2 of Huggingface format (PyTorch) to TencentPretrain format |
convert_gpt2_from_tencentpretrain_to_huggingface.py | Convert GPT-2 of TencentPretrain format to Huggingface format (PyTorch) |
convert_pegasus_from_huggingface_to_tencentpretrain.py | Convert Pegasus of Huggingface format (PyTorch) to TencentPretrain format |
convert_pegasus_from_tencentpretrain_to_huggingface.py | Convert Pegasus of TencentPretrain format to Huggingface format (PyTorch) |
convert_t5_from_huggingface_to_tencentpretrain.py | Convert T5 of Huggingface format (PyTorch) to TencentPretrain format |
convert_t5_from_tencentpretrain_to_huggingface.py | Convert T5 of TencentPretrain format to Huggingface format (PyTorch) |
convert_xlmroberta_from_huggingface_to_tencentpretrain.py | Convert XLM-RoBERTa of Huggingface format (PyTorch) to TencentPretrain format |
convert_xlmroberta_from_tencentpretrain_to_huggingface.py | Convert XLM-RoBERTa of TencentPretrain format to Huggingface format (PyTorch) |
diff_vocab.py | Compare two vocabularies |
dynamic_vocab_adapter.py | Adapt the pre-trained model according to the vocabulary |
extract_embeddings.py | Extract the embedding of the pre-trained model |
extract_features.py | Obtain text representation |
generate_lm.py | Generate text with language model |
generate_seq2seq.py | Generate text with seq2seq model |
run_bayesopt.py | Search hyper-parameters for LightGBM by Bayesian optimization |
run_lgb.py | Model ensemble with LightGBM (classification) |
topn_words_dep.py | Find nearest neighbors with context-dependent word embedding |
topn_words_indep.py | Find nearest neighbors with context-independent word embedding |
cloze_test.py uses the MLM target to predict a masked word and returns the top n candidate words. Cloze test can be used for operations such as data augmentation. An example of using cloze_test.py:
python3 scripts/cloze_test.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
--config_path models/bert/base_config.json \
--test_path datasets/tencent_profile.txt --prediction_path output.txt
Note that cloze_test.py only supports pre-trained models with an MLM target.
extract_features.py encodes text into a fixed-length embedding (through the embedding, encoder, and pooling layers). An example of using extract_features.py:
python3 scripts/extract_features.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
--config_path models/bert/base_config.json \
--test_path datasets/tencent_profile.txt --prediction_path features.pt \
--pooling first
The CLS embedding (--pooling first) is commonly used as the text embedding. However, it is not a good choice when cosine similarity is used to measure the relationship between two texts. According to recent work, a whitening operation should be applied in that case:
python3 scripts/extract_features.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
--config_path models/bert/base_config.json \
--test_path datasets/tencent_profile.txt --prediction_path features.pt \
--pooling first --whitening_size 64
--whitening_size 64 enables the whitening operation and sets the dimension of the text embedding to 64.
extract_embeddings.py extracts the embedding layer from the pre-trained model. The extracted context-independent embeddings can be used to initialize the embedding layer of other models (e.g. a CNN). An example of using extract_embeddings.py:
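To make the whitening step more concrete, here is a minimal NumPy sketch of one common formulation: centre the embeddings, decompose their covariance matrix with SVD, and keep the first `out_dim` whitened directions. The function names `compute_whitening` and `apply_whitening` are hypothetical, the input is random data rather than real model features, and this is not necessarily how extract_features.py implements --whitening_size.

```python
import numpy as np

def compute_whitening(embeddings, out_dim):
    """Return (mean, kernel) such that (x - mean) @ kernel is whitened.

    Only the first `out_dim` whitened directions are kept.
    """
    mean = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings - mean, rowvar=False)
    # The covariance matrix is symmetric, so cov = u @ diag(s) @ u.T
    u, s, _ = np.linalg.svd(cov)
    kernel = u @ np.diag(1.0 / np.sqrt(s))
    return mean, kernel[:, :out_dim]

def apply_whitening(embeddings, mean, kernel):
    """Centre the embeddings and project them with the whitening kernel."""
    return (embeddings - mean) @ kernel

# Toy stand-in for sentence embeddings: correlated 16-dimensional vectors.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, 16))

mean, kernel = compute_whitening(features, out_dim=4)
whitened = apply_whitening(features, mean, kernel)
```

After this transform the retained dimensions are uncorrelated with unit variance, so cosine similarity no longer favours a few dominant directions.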
python3 scripts/extract_embeddings.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
--word_embedding_path embeddings.txt
--word_embedding_path specifies the output path of the word embedding file. The file follows the format described here, which mainstream projects can load directly.
The pre-trained model contains word embeddings. Traditional word embeddings such as word2vec and GloVe assign each word a fixed vector (context-independent word embedding). However, polysemy is pervasive in human language, and the meaning of a polysemous word depends on its context. To handle this, the hidden state of the pre-trained model can be used to represent a word (context-dependent word embedding). Note that most Chinese pre-trained models are character-based. To obtain real word embeddings (rather than character embeddings), users can download a word-based BERT model and its vocabulary. An example of using scripts/topn_words_indep.py to find nearest neighbours with context-independent word embeddings (character-based and word-based models):
python3 scripts/topn_words_indep.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
--test_path target_words.txt
python3 scripts/topn_words_indep.py --load_model_path models/wiki_bert_word_model.bin --vocab_path models/wiki_word_vocab.txt \
--test_path target_words.txt
Context-independent word embeddings come from the embedding layer. The format of target_words.txt is as follows:
word-1
word-2
...
word-n
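As a rough illustration of nearest-neighbour search over context-independent embeddings, the sketch below ranks words by cosine similarity to a target word. The toy vectors, the `topn_neighbors` helper, and the choice of cosine similarity are assumptions made for this example; the real script reads its vectors from the model's embedding layer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def topn_neighbors(target, embeddings, n=3):
    """Rank every other word by cosine similarity to `target`."""
    target_vec = embeddings[target]
    scored = [
        (word, cosine(target_vec, vec))
        for word, vec in embeddings.items()
        if word != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.9, 0.4],
    "pear": [0.1, 0.8, 0.5],
}

neighbors = topn_neighbors("king", toy_embeddings, n=2)
for word, score in neighbors:
    print(f"{word}\t{score:.3f}")
```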
An example of using scripts/topn_words_dep.py to find nearest neighbours with context-dependent word embeddings (character-based and word-based models):
python3 scripts/topn_words_dep.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
--cand_vocab_path models/google_zh_vocab.txt --test_path target_words_with_sentences.txt --config_path models/bert/base_config.json \
--batch_size 256 --seq_length 32 --tokenizer bert
python3 scripts/topn_words_dep.py --load_model_path models/bert_wiki_word_model.bin --vocab_path models/wiki_word_vocab.txt \
--cand_vocab_path models/wiki_word_vocab.txt --test_path target_words_with_sentences.txt --config_path models/bert/base_config.json \
--batch_size 256 --seq_length 32 --tokenizer space
We substitute the target word with other words in the vocabulary and feed the resulting sentences into the pre-trained model. The hidden state is used as the context-dependent embedding of a word. --cand_vocab_path specifies the path of the candidate word file; a smaller candidate vocabulary gives faster results. If a word-based model is used, users should segment the text into words themselves and use the space tokenizer. The format of target_words_with_sentences.txt is as follows:
word1 sent1
word2 sent2
...
wordn sentn
On each line, the word and the sentence are separated by a tab (\t).
average_models.py averages the weights of multiple models, which may give more robust performance. An example of using average_models.py:
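The input format and the substitution step can be sketched as follows. This is only an illustration for the word-based case with space-separated tokens: `parse_target_line` and `substitute` are hypothetical helpers, the sample sentences are invented, and the script's own preprocessing may differ.

```python
def parse_target_line(line):
    """Split one line of the input file into (word, sentence) at the first tab."""
    word, sentence = line.rstrip("\n").split("\t", 1)
    return word, sentence

def substitute(word, sentence, candidates):
    """Build one sentence per candidate by replacing the target word.

    Assumes space-separated tokens and replaces the first occurrence only;
    raises ValueError if the target word is not in the sentence.
    """
    tokens = sentence.split(" ")
    pos = tokens.index(word)
    return [" ".join(tokens[:pos] + [cand] + tokens[pos + 1:]) for cand in candidates]

sample = "bank\the sat on the bank of the river\nbank\tshe opened a bank account\n"
pairs = [parse_target_line(line) for line in sample.splitlines() if line.strip()]

word, sentence = pairs[0]
variants = substitute(word, sentence, ["shore", "office"])
```

Each variant would then be fed to the model, and the hidden state at the substituted position compared with the hidden state of the original target word.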
python3 scripts/average_models.py --model_list_path models/book_review_model.bin-4000 models/book_review_model.bin-5000 \
--output_model_path models/book_review_model.bin
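Conceptually, averaging checkpoints means taking the element-wise mean of every parameter that shares a name. The sketch below shows the idea in plain Python, with flat lists standing in for tensors; `average_checkpoints` and the toy parameter names are hypothetical, and real checkpoints would be PyTorch state dicts.

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of parameters that share a name across checkpoints."""
    averaged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(vals) / len(state_dicts) for vals in columns]
    return averaged

# Toy stand-ins for two checkpoints saved at different training steps.
ckpt_4000 = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
ckpt_5000 = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

avg = average_checkpoints([ckpt_4000, ckpt_5000])
print(avg)
```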
generate_lm.py generates text with a language model. Given a few words, it continues the text. An example of using generate_lm.py to load GPT-2-distil and continue writing:
python3 scripts/generate_lm.py --load_model_path models/gpt_model.bin --vocab_path models/google_zh_vocab.txt \
--test_path beginning.txt --prediction_path generated_text.txt \
--config_path models/gpt2/distil_config.json --seq_length 128 \
--embedding word pos --remove_embedding_layernorm \
--encoder transformer --mask causal --layernorm_positioning pre \
--target lm --tie_weights
where beginning.txt contains the beginning of a text and generated_text.txt contains the text that the model writes.
generate_seq2seq.py generates text with a seq2seq model. An example of using generate_seq2seq.py to load a Transformer translation model (zh_en) and translate from Chinese to English:
python3 scripts/generate_seq2seq.py --load_model_path models/iwslt_zh_en_model.bin-50000 \
--vocab_path models/google_zh_vocab.txt --tgt_vocab_path models/google_uncased_en_vocab.txt --tgt_tokenizer bert \
--test_path input.txt --prediction_path output.txt \
--config_path models/transformer/base_config.json --seq_length 64 --tgt_seq_length 64 \
--embedding word sinusoidalpos --tgt_embedding word sinusoidalpos \
--encoder transformer --mask fully_visible --decoder transformer
where --test_path specifies the path of text to be translated and --prediction_path specifies the path of the translated text.
We provide diff_vocab.py to compare two vocabularies. Here is an example that compares a poem vocabulary with an ancient Chinese vocabulary using diff_vocab.py:
python scripts/diff_vocab.py --vocab_1 models/google_zh_poem_vocab.txt --vocab_2 models/google_zh_ancient_vocab.txt
--vocab_1 and --vocab_2 specify the path of the compared vocabularies.
You may get the following output:
vocab_1: models/google_zh_poem_vocab.txt, size: 22556
vocab_2: models/google_zh_ancient_vocab.txt, size: 25369
vocab_1 - vocab_2 = 114
vocab_2 - vocab_1 = 2927
vocab_1 & vocab_2 = 22442
This shows the sizes of the two set differences and of the intersection.
dynamic_vocab_adapter.py produces a new model for a new vocabulary by adapting the embedding layer and the softmax layer of the old model. If a token from the new vocabulary exists in the old model, its parameters are copied directly; otherwise, they are randomly initialized. In the example below, dynamic_vocab_adapter.py adapts the old model models/google_zh_model.bin with the old vocabulary models/google_zh_vocab.txt to the new vocabulary models/google_zh_poem_vocab.txt, producing a new model models/google_zh_poem_model.bin.
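The three numbers correspond to plain set operations, and they can be checked against the sizes reported above: 114 + 22442 = 22556 and 2927 + 22442 = 25369. A minimal sketch with toy vocabularies follows; the `diff_vocab` helper and the token lists are illustrative and not taken from the script.

```python
def diff_vocab(vocab_1, vocab_2):
    """Count the tokens unique to each vocabulary and the tokens they share."""
    set_1, set_2 = set(vocab_1), set(vocab_2)
    return {
        "vocab_1 - vocab_2": len(set_1 - set_2),
        "vocab_2 - vocab_1": len(set_2 - set_1),
        "vocab_1 & vocab_2": len(set_1 & set_2),
    }

# Toy vocabularies, for illustration only.
poem_vocab = ["[PAD]", "[UNK]", "moon", "flower", "wine"]
ancient_vocab = ["[PAD]", "[UNK]", "moon", "thus", "hence", "whereof"]

stats = diff_vocab(poem_vocab, ancient_vocab)
for name, size in stats.items():
    print(f"{name} = {size}")
```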
python scripts/dynamic_vocab_adapter.py --old_model_path models/google_zh_model.bin \
--old_vocab_path models/google_zh_vocab.txt \
--new_vocab_path models/google_zh_poem_vocab.txt \
--new_model_path models/google_zh_poem_model.bin
--old_model_path and --old_vocab_path specify the paths of the old model and vocabulary, and --new_model_path and --new_vocab_path specify the paths of the new model and vocabulary.
Converting a model from TencentPretrain format to Huggingface format (PyTorch):
Usage of the TencentPretrain-to-Huggingface conversion scripts is documented in the Huggingface model repository (uer).
Converting a model from Huggingface format (PyTorch) to TencentPretrain format:
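The copy-or-initialize rule can be sketched for a single embedding table as follows. This is a simplified illustration, not the script's implementation: `adapt_embedding` is a hypothetical helper, lists of floats stand in for tensors, and the Gaussian initialization with standard deviation 0.02 is an assumption.

```python
import random

def adapt_embedding(old_vocab, old_rows, new_vocab, seed=0, std=0.02):
    """Build embedding rows for new_vocab from an old embedding table.

    Tokens present in old_vocab keep their old row; unseen tokens get a
    randomly initialized row (Gaussian with assumed std 0.02).
    """
    rng = random.Random(seed)
    index = {token: i for i, token in enumerate(old_vocab)}
    dim = len(old_rows[0])
    new_rows = []
    for token in new_vocab:
        if token in index:
            new_rows.append(list(old_rows[index[token]]))  # copy the old parameters
        else:
            new_rows.append([rng.gauss(0.0, std) for _ in range(dim)])
    return new_rows

# Toy old model: 3 tokens with 2-dimensional embeddings.
old_vocab = ["[PAD]", "moon", "river"]
old_rows = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
new_vocab = ["[PAD]", "river", "lantern"]

new_rows = adapt_embedding(old_vocab, old_rows, new_vocab)
```

The same rule would apply to the softmax (output) layer, whose rows are also indexed by vocabulary token.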
BART: Taking the bart-base-chinese-cluecorpussmall model in Huggingface as an example:
python3 scripts/convert_bart_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 6
ALBERT: Taking the albert-base-chinese-cluecorpussmall model in Huggingface as an example:
python3 scripts/convert_albert_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin
RoBERTa: Taking the chinese_roberta_L-2_H-128 model in Huggingface as an example:
python3 scripts/convert_bert_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 2 --type mlm
GPT-2: Taking the gpt2-chinese-cluecorpussmall model in Huggingface as an example:
python3 scripts/convert_gpt2_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 12
RoBERTa (BERT) for extractive QA: Taking the roberta-base-chinese-extractive-qa model in Huggingface as an example:
python3 scripts/convert_bert_extractive_qa_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 12
RoBERTa (BERT) for text classification: Taking the roberta-base-finetuned-dianping-chinese model in Huggingface as an example:
python3 scripts/convert_bert_text_classification_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 12
RoBERTa (BERT) for token classification (sequence labeling): Taking the roberta-base-finetuned-cluener2020-chinese model in Huggingface as an example:
python3 scripts/convert_bert_token_classification_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 12
T5: Taking the t5-base-chinese-cluecorpussmall model in Huggingface as an example:
python3 scripts/convert_t5_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 12 \
--type t5
T5-v1_1: Taking the t5-v1_1-small-chinese-cluecorpussmall model in Huggingface as an example:
python3 scripts/convert_t5_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 8 --decoder_layers_num 8 \
--type t5-v1_1
Pegasus: Taking the pegasus-base-chinese-cluecorpussmall model in Huggingface as an example:
python3 scripts/convert_pegasus_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 12 --decoder_layers_num 12
XLM-RoBERTa: Taking the xlm-roberta-base model in Huggingface as an example:
python3 scripts/convert_xlmroberta_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
--output_model_path tencentpretrain_pytorch_model.bin \
--layers_num 12