With the development of AI-Generated Content (AIGC), text-to-audio (T2A) models are gaining widespread attention. However, it is challenging for these models to generate audio aligned with human preference, due to the inherent information density of natural language and limited model understanding ability. To alleviate this issue, we propose BATON, the first framework specifically designed to enhance the alignment between generated audio and text prompts using human preference feedback. BATON comprises three key stages. First, we curate a dataset containing both prompts and the corresponding generated audio, annotated based on human feedback. Second, we introduce a reward model trained on the constructed dataset, which mimics human preference by assigning rewards to input text-audio pairs. Finally, we use the reward model to fine-tune an off-the-shelf text-to-audio model. Experimental results demonstrate that BATON significantly improves the generation quality of the original text-to-audio models with respect to audio integrity, temporal relationships, and alignment with human preference.
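As a rough mental model of the third stage, fine-tuning can be viewed as weighting the model's usual generation loss by the reward model's alignment score. The sketch below only illustrates that idea and is not the exact BATON objective; `t2a_model.training_loss`, the `reward_model(text, audio)` interface, and the `[0, 1]` reward range are all assumptions.

```python
import torch

def reward_weighted_step(t2a_model, reward_model, text, audio, optimizer):
    """Illustrative reward-weighted fine-tuning step (not the exact BATON objective).

    Assumes reward_model(text, audio) returns a scalar alignment score in [0, 1]
    and t2a_model.training_loss(...) is the model's usual generation loss.
    """
    with torch.no_grad():
        reward = reward_model(text, audio)                 # human-preference proxy
    loss = reward * t2a_model.training_loss(text, audio)   # emphasize well-aligned pairs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```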
- 🔥 Release on arXiv!
- 🪄 Release demo page!
- 👦 Release human preference dataset!
- 🥇 Release reward model training/inference code
- Release fine-tuning code based on Tango
We provide an example of how you can align a text-to-audio model with your own preferences using BATON. We release our implementation of fine-tuning Tango, one of the SOTA text-to-audio models, and the framework can be extended to improve other T2A models or other audio generation models.
- `configs`: Configuration files for training and evaluation.
- `data`: The audio for human annotation
  - `Audio`:
    - `2label_Integrity_HD_Audio`: Audio of 2-label prompts annotated by human
    - `2label_Integrity_RD_Audio`: Audio of 2-label prompts annotated by reward model
    - `3label_Temporal_HD_Audio`: Audio of 3-label prompts annotated by human
    - `3label_Temporal_RD_Audio`: Audio of 3-label prompts annotated by reward model
  - `HA`: Human annotation
  - `RA`: Reward model annotation
- `scripts`: Helper scripts
  - `finetune.sh`: Fine-tuning the T2A model
  - `inference_finetune.sh`: Inference using the T2A model after fine-tuning
- `src`: Source code
  - `rewardmodel`: Reward model definitions
    - `RM_inference_2label_BCE.py`: Inference using the reward model
    - `RM_train_2label_BCE.py`: Train the reward model using preference annotations
  - `t2amodel`: T2A model (coming soon)
    - `Tango-master (example)`: Any text-to-audio model
      - `finetune.py`: Move under this path
      - `models.py`: Move and replace the original one
As an example of aligning a text-to-audio model with specific preferences using BATON, we take Tango, one of the SOTA text-to-audio models.
The 2.5k pretrain data are randomly selected from the AudioCaps training set. The audio locations, corresponding captions, and scores are specified in the *.json files in ./Data.
The audio for human annotation, generated by Tango from the Tango-Full-FT-AudioCaps checkpoint by feeding 2- or 3-label prompts, is available on Google Drive.
PD, HD, and RD stand for pretrain data, human-labeled data, and reward-model-labeled data, respectively.
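A minimal sketch of reading these annotations, assuming a JSON list of records; the filename and the `location`/`caption`/`score` keys are placeholders, so check the released *.json files in ./Data for the actual names.

```python
import json

# Load one annotation file; the filename and keys below are assumptions.
with open("Data/pretrain_2label_integrity.json", "r") as f:
    records = json.load(f)

for item in records[:3]:
    print(item.get("location"),   # path to the generated audio clip (assumed key)
          item.get("caption"),    # prompt used for generation (assumed key)
          item.get("score"))      # HA / RA preference label (assumed key)
```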
For the Integrity task, download:
- 2label_Integrity_HD_Audio and put it into /data
- 2label_Integrity_RD_Audio and put it into /data

For the Temporal task, download:
- 3label_Temporal_HD_Audio and put it into /data
- 3label_Temporal_RD_Audio and put it into /data
We will release our implementation based on Tango, one of the SOTA T2A models, using the Tango-Full-FT-AudioCaps checkpoint.
Train Reward Model
```bash
cd RM
python RM_train_2label_BCE.py   # For the Integrity task
python RM_train_3label_BCE.py   # For the Temporal task (obtained by changing lines 72, 87, and 103 of RM_train_2label_BCE.py)
```
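For orientation, here is a minimal sketch of what a 2-label BCE reward-model training step could look like; the embedding dimensions, fusion head, and batch format are assumptions, not the released `RM_train_2label_BCE.py` implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: fuse text and audio embeddings, predict a preference logit."""
    def __init__(self, text_dim=512, audio_dim=512, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_emb, audio_emb):
        return self.head(torch.cat([text_emb, audio_emb], dim=-1)).squeeze(-1)

model = RewardModel()
criterion = nn.BCEWithLogitsLoss()   # 2-label (aligned / not aligned) objective
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Dummy batch standing in for pre-computed text/audio embeddings and human labels.
text_emb, audio_emb = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 2, (8,)).float()

logits = model(text_emb, audio_emb)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```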
Using Reward Model for Annotation
```bash
cd RM
python RM_inference_2label_BCE.py   # For the Integrity task
python RM_inference_3label_BCE.py   # For the Temporal task (obtained by changing lines 56, 66, and 80 of RM_inference_2label_BCE.py)
```
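Conceptually, reward-model annotation scores each generated text-audio pair and turns the score into a pseudo-label. Below is a minimal sketch, assuming pre-computed embeddings and a binary reward head; the 0.5 threshold is an illustrative choice, not necessarily what the released inference scripts use.

```python
import torch

@torch.no_grad()
def annotate(reward_model, pairs, threshold=0.5):
    """Assign reward-model labels to (text_emb, audio_emb) pairs.

    `pairs` is an iterable of pre-computed embedding tensors; the 0.5
    threshold is an illustrative choice, not necessarily the one used in BATON.
    """
    labels = []
    for text_emb, audio_emb in pairs:
        score = torch.sigmoid(reward_model(text_emb, audio_emb)).item()
        labels.append(1 if score >= threshold else 0)
    return labels
```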
☀️ We are truly indebted to the following outstanding open-source works on which we build: Tango, AudioLDM, and the metrics from audiocraft.
@inproceedings{liao2024baton,
title={BATON: Aligning Text-to-Audio Model Using Human Preference Feedback},
author={Liao, Huan and Han, Haonan and Yang, Kai and Du, Tianjiao and Yang, Rui and Xu, Qinmei and Xu, Zunnan and Liu, Jingquan and Lu, Jiasheng and Li, Xiu},
booktitle={IJCAI},
year={2024}
}