CLUE Classification
zhezhaoa edited this page Aug 15, 2023
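This page briefly introduces our solutions to the CLUE classification tasks. We submitted two results: a single-model result and a model-ensemble result. The single-model result is based on the pre-trained model cluecorpussmall_roberta_wwm_large_seq512_model.bin. The ensemble result combines a large number of models. This page focuses on the single model; for more details on the model ensemble, see here.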
For AFQMC, first perform multi-task learning, with LCQMC and XNLI as auxiliary tasks:
python3 finetune/run_classifier_mt.py --pretrained_model_path models/cluecorpussmall_roberta_wwm_large_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--dataset_path_list datasets/afqmc/ datasets/lcqmc/ datasets/xnli/ \
--output_model_path models/afqmc_multitask_classifier_model.bin \
--epochs_num 1 --batch_size 64
Then load afqmc_multitask_classifier_model.bin and fine-tune on AFQMC:
python3 finetune/run_classifier.py --pretrained_model_path models/afqmc_multitask_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--train_path datasets/afqmc/train.tsv \
--dev_path datasets/afqmc/dev.tsv \
--output_model_path models/afqmc_classifier_model.bin \
--epochs_num 3 --batch_size 32
Finally, run inference with afqmc_classifier_model.bin:
python3 inference/run_classifier_infer.py --load_model_path models/afqmc_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--test_path datasets/afqmc/test_nolabel.tsv \
--prediction_path datasets/afqmc/prediction.tsv \
--seq_length 128 --labels_num 2
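The multi-task step takes dataset directories rather than individual files. The following is a minimal sketch for checking the directory layout before launching training. It assumes that each directory passed to `--dataset_path_list` is expected to contain `train.tsv` and `dev.tsv`, as the AFQMC paths in the fine-tuning step suggest. The helper name `missing_files` is hypothetical.

```python
import os


def missing_files(dataset_dirs, required=("train.tsv", "dev.tsv")):
    """Return the paths of required files that are absent from the given directories."""
    missing = []
    for directory in dataset_dirs:
        for name in required:
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                missing.append(path)
    return missing


# Directories taken from the AFQMC multi-task command above.
for path in missing_files(["datasets/afqmc/", "datasets/lcqmc/", "datasets/xnli/"]):
    print("missing:", path)
```

An empty result means that every directory has both files. The check says nothing about the column layout inside them.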
For CMNLI, first perform multi-task learning, with XNLI as the auxiliary task:
python3 finetune/run_classifier_mt.py --pretrained_model_path models/cluecorpussmall_roberta_wwm_large_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--dataset_path_list datasets/cmnli/ datasets/xnli/ \
--output_model_path models/cmnli_multitask_classifier_model.bin \
--epochs_num 1 --batch_size 64
Then load cmnli_multitask_classifier_model.bin and fine-tune on CMNLI:
python3 finetune/run_classifier.py --pretrained_model_path models/cmnli_multitask_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--train_path datasets/cmnli/train.tsv \
--dev_path datasets/cmnli/dev.tsv \
--output_model_path models/cmnli_classifier_model.bin \
--epochs_num 1 --batch_size 64
Finally, run inference with cmnli_classifier_model.bin:
python3 inference/run_classifier_infer.py --load_model_path models/cmnli_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--test_path datasets/cmnli/test_nolabel.tsv \
--prediction_path datasets/cmnli/prediction.tsv \
--seq_length 128 --labels_num 3
Example of fine-tuning and inference on the IFLYTEK dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_wwm_large_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--train_path datasets/iflytek/train.tsv \
--dev_path datasets/iflytek/dev.tsv \
--output_model_path models/iflytek_classifier_model.bin \
--epochs_num 3 --batch_size 32 --seq_length 256
python3 inference/run_classifier_infer.py --load_model_path models/iflytek_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--test_path datasets/iflytek/test_nolabel.tsv \
--prediction_path datasets/iflytek/prediction.tsv \
--seq_length 256 --labels_num 119
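Before submitting, it can help to sanity-check the prediction file against `--labels_num`. The following is a minimal sketch. It assumes that `prediction.tsv` starts with a header row whose first column is `label`, followed by one integer class id per line. That layout is an assumption about the output format and is not confirmed on this page. `check_predictions` is a hypothetical helper.

```python
from collections import Counter


def check_predictions(lines, labels_num):
    """Count predicted label ids and reject ids outside [0, labels_num)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    # Assumed layout: the first column of the header row is "label".
    if header.split("\t")[0] != "label":
        raise ValueError("unexpected header: %r" % header)
    counts = Counter(int(row.split("\t")[0]) for row in body)
    out_of_range = sorted(k for k in counts if not 0 <= k < labels_num)
    if out_of_range:
        raise ValueError("label ids out of range: %s" % out_of_range)
    return counts


# Usage with the IFLYTEK paths from the commands above:
# with open("datasets/iflytek/prediction.tsv", encoding="utf-8") as f:
#     print(check_predictions(f, labels_num=119).most_common(5))
```

A distribution that collapses onto one or two classes is often a sign that the wrong model or the wrong `--labels_num` was used.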
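The Chinese Scientific Literature (CSL) task is to decide whether the given keywords are the genuine keywords of a paper. The key to good results on CSL is to use a special symbol to separate the keywords. We found that the fake keywords in CSL are usually short, and the special symbol explicitly tells the model the length of each keyword. Example of fine-tuning and inference on the CSL dataset: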
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_wwm_large_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--train_path datasets/csl/train.tsv \
--dev_path datasets/csl/dev.tsv \
--output_model_path models/csl_classifier_model.bin \
--epochs_num 3 --batch_size 32 --seq_length 384
python3 inference/run_classifier_infer.py --load_model_path models/csl_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--test_path datasets/csl/test_nolabel.tsv \
--prediction_path datasets/csl/prediction.tsv \
--seq_length 384 --labels_num 2
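The keyword-separation idea described for CSL can be sketched as a preprocessing step. This is only a sketch under stated assumptions: the raw example is a dict with `abst`, `keyword` (a list) and `label` fields, the TSV columns are `label`, `text_a` and `text_b`, and the separator `_` is a placeholder, since this page does not say which special symbol was used. `csl_to_tsv_row` is a hypothetical helper.

```python
def csl_to_tsv_row(example, sep="_"):
    """Build one TSV row: label, separator-joined keywords, abstract."""
    # Joining with an explicit symbol marks where each keyword starts and ends.
    keywords = sep.join(k.strip() for k in example["keyword"])
    # Collapse tabs and newlines in the abstract so the TSV layout stays intact.
    abstract = " ".join(example["abst"].split())
    return "\t".join([str(example["label"]), keywords, abstract])


# Illustrative example, not taken from the dataset.
example = {
    "abst": "An illustrative abstract about graph neural networks.",
    "keyword": ["graph", "neural network", "classification"],
    "label": "1",
}
print("label\ttext_a\ttext_b")  # assumed header row
print(csl_to_tsv_row(example))
```

Whichever separator is chosen, it should be a symbol that the vocabulary in `models/google_zh_vocab.txt` can tokenize, and it must not already occur inside the keywords.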
Example of fine-tuning and inference on the CLUEWSC2020 dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_wwm_large_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--train_path datasets/cluewsc2020/train.tsv \
--dev_path datasets/cluewsc2020/dev.tsv \
--output_model_path models/cluewsc2020_classifier_model.bin \
--learning_rate 5e-6 --epochs_num 20 --batch_size 8
python3 inference/run_classifier_infer.py --load_model_path models/cluewsc2020_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--test_path datasets/cluewsc2020/test_nolabel.tsv \
--prediction_path datasets/cluewsc2020/prediction.tsv \
--seq_length 128 --labels_num 2
One trick for improving CLUEWSC2020 results is to add the training set of WSC (the older version of CLUEWSC2020) as extra training samples.
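One way to apply this trick is to concatenate the two training files before fine-tuning. The following is a minimal sketch, assuming both files are TSVs with identical header rows. The WSC path and the merged file name in the usage comment are hypothetical, and `merge_tsv` is a hypothetical helper.

```python
def merge_tsv(lines_a, lines_b):
    """Concatenate two TSV line sequences that share the same header row."""
    a = [line.rstrip("\n") for line in lines_a if line.strip()]
    b = [line.rstrip("\n") for line in lines_b if line.strip()]
    if a[0] != b[0]:
        raise ValueError("headers differ: %r vs %r" % (a[0], b[0]))
    return a + b[1:]  # keep a single header row


# Usage (the WSC path and the output file name are hypothetical):
# with open("datasets/cluewsc2020/train.tsv", encoding="utf-8") as fa, \
#         open("datasets/wsc/train.tsv", encoding="utf-8") as fb:
#     merged = merge_tsv(fa, fb)
# with open("datasets/cluewsc2020/train_plus_wsc.tsv", "w", encoding="utf-8") as out:
#     out.write("\n".join(merged) + "\n")
```

The merged file would then be passed to `--train_path` in place of `datasets/cluewsc2020/train.tsv`, while `--dev_path` stays unchanged so that the evaluation remains comparable.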
Example of fine-tuning and inference on the TNEWS dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_wwm_large_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--train_path datasets/tnews/train.tsv \
--dev_path datasets/tnews/dev.tsv \
--output_model_path models/tnews_classifier_model.bin \
--epochs_num 3 --batch_size 32
python3 inference/run_classifier_infer.py --load_model_path models/tnews_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--test_path datasets/tnews/test_nolabel.tsv \
--prediction_path datasets/tnews/prediction.tsv \
--seq_length 128 --labels_num 15
For OCNLI, first perform multi-task learning, with XNLI and CMNLI as auxiliary tasks:
python3 finetune/run_classifier_mt.py --pretrained_model_path models/cluecorpussmall_roberta_wwm_large_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--dataset_path_list datasets/ocnli/ datasets/cmnli/ datasets/xnli/ \
--output_model_path models/ocnli_multitask_classifier_model.bin \
--epochs_num 1 --batch_size 64
Then load ocnli_multitask_classifier_model.bin and fine-tune on OCNLI:
python3 finetune/run_classifier.py --pretrained_model_path models/ocnli_multitask_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--train_path datasets/ocnli/train.tsv \
--dev_path datasets/ocnli/dev.tsv \
--output_model_path models/ocnli_classifier_model.bin \
--epochs_num 1 --batch_size 64
Finally, run inference with ocnli_classifier_model.bin:
python3 inference/run_classifier_infer.py --load_model_path models/ocnli_classifier_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/large_config.json \
--test_path datasets/ocnli/test_nolabel.tsv \
--prediction_path datasets/ocnli/prediction.tsv \
--seq_length 128 --labels_num 3
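AFQMC, CMNLI and OCNLI all follow the same three-stage recipe: multi-task training, task-specific fine-tuning, then inference. The following is a minimal sketch that assembles those command lines from a task name, using only the scripts, flags and paths shown on this page. The helper `build_pipeline` is hypothetical, and the commands are only printed, not executed.

```python
import shlex

PRETRAINED = "models/cluecorpussmall_roberta_wwm_large_seq512_model.bin"
# Vocabulary and config flags shared by every command on this page.
COMMON = ["--vocab_path", "models/google_zh_vocab.txt",
          "--config_path", "models/bert/large_config.json"]


def build_pipeline(task, aux_tasks, labels_num, ft_epochs, ft_batch):
    """Return the multi-task, fine-tuning and inference commands for one task."""
    data_dir = "datasets/%s/" % task
    mt_model = "models/%s_multitask_classifier_model.bin" % task
    ft_model = "models/%s_classifier_model.bin" % task

    multitask = (["python3", "finetune/run_classifier_mt.py",
                  "--pretrained_model_path", PRETRAINED] + COMMON +
                 ["--dataset_path_list", data_dir] +
                 ["datasets/%s/" % t for t in aux_tasks] +
                 ["--output_model_path", mt_model,
                  "--epochs_num", "1", "--batch_size", "64"])
    finetune = (["python3", "finetune/run_classifier.py",
                 "--pretrained_model_path", mt_model] + COMMON +
                ["--train_path", data_dir + "train.tsv",
                 "--dev_path", data_dir + "dev.tsv",
                 "--output_model_path", ft_model,
                 "--epochs_num", str(ft_epochs), "--batch_size", str(ft_batch)])
    infer = (["python3", "inference/run_classifier_infer.py",
              "--load_model_path", ft_model] + COMMON +
             ["--test_path", data_dir + "test_nolabel.tsv",
              "--prediction_path", data_dir + "prediction.tsv",
              "--seq_length", "128", "--labels_num", str(labels_num)])
    return [multitask, finetune, infer]


# Print (do not run) the OCNLI commands; the values mirror the OCNLI commands above.
for cmd in build_pipeline("ocnli", ["cmnli", "xnli"], 3, 1, 64):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping each command as a list of arguments makes it straightforward to hand to `subprocess.run` later, and the fine-tuning epochs and batch size stay explicit because they differ between tasks (3 and 32 for AFQMC, 1 and 64 for CMNLI and OCNLI).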