With the development of AI-Generated Content (AIGC), text-to-audio (T2A) models are gaining widespread attention. However, it is challenging for these models to generate audio aligned with human preference, due to the inherent information density of natural language and limited model understanding ability. To alleviate this issue, we propose BATON, the first framework specifically designed to enhance the alignment between generated audio and text prompts using human preference feedback. BATON comprises three key stages. First, we curate a dataset containing both prompts and the corresponding generated audio, annotated based on human feedback. Second, we introduce a reward model trained on the constructed dataset, which mimics human preference by assigning rewards to input text-audio pairs. Finally, we use the reward model to fine-tune an off-the-shelf text-to-audio model. Experimental results demonstrate that BATON significantly improves the generation quality of the original text-to-audio models with respect to audio integrity, temporal relationships, and alignment with human preference.
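As a rough mental model of the third stage, fine-tuning can be viewed as weighting the model's usual generation loss by the reward model's alignment score. The sketch below only illustrates that idea and is not the exact BATON objective; `t2a_model.training_loss`, the `reward_model(text, audio)` interface, and the `[0, 1]` reward range are all assumptions.

```python
import torch

def reward_weighted_step(t2a_model, reward_model, text, audio, optimizer):
    """Illustrative reward-weighted fine-tuning step (not the exact BATON objective).

    Assumes reward_model(text, audio) returns a scalar alignment score in [0, 1]
    and t2a_model.training_loss(...) is the model's usual generation loss.
    """
    with torch.no_grad():
        reward = reward_model(text, audio)                 # human-preference proxy
    loss = reward * t2a_model.training_loss(text, audio)   # emphasize well-aligned pairs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```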
- 🔥 Release on arXiv!
- 🪄 Release demo page!
- 👦 Release human preference dataset!
- 🥇 Release reward model training/inference code
- Release fine-tuning code based on Tango
We provide an example of how you can align a text-to-audio model with your own preferences using BATON. We release our implementation of fine-tuning Tango, one of the SOTA text-to-audio models, and the framework can be extended to improve other T2A models or other audio generation models.
- `configs`: Configuration files for training and evaluation.
- `data`: The audio for human annotation
  - `Audio`:
    - `2label_Integrity_HD_Audio`: Audio of 2-label prompts annotated by human
    - `2label_Integrity_RD_Audio`: Audio of 2-label prompts annotated by reward model
    - `3label_Temporal_HD_Audio`: Audio of 3-label prompts annotated by human
    - `3label_Temporal_RD_Audio`: Audio of 3-label prompts annotated by reward model
  - `HA`: Human annotation
  - `RA`: Reward model annotation
- `scripts`: Helper scripts
  - `finetune.sh`: Fine-tuning the T2A model
  - `inference_finetune.sh`: Inference using the T2A model after fine-tuning
- `src`: Source code
  - `rewardmodel`: Reward model definitions
    - `RM_inference_2label_BCE.py`: Inference using the reward model
    - `RM_train_2label_BCE.py`: Train the reward model using preference annotations
  - `t2amodel`: T2A model (coming soon)
    - `Tango-master (example)`: Any text-to-audio model
      - `finetune.py`: Move under this path
      - `models.py`: Move and replace the original one
As an example of aligning a text-to-audio model with specific preferences using BATON, we take Tango, one of the SOTA text-to-audio models.
The 2.5k pretrain data are randomly selected from the AudioCaps training set. The audio locations, corresponding captions, and scores are specified in the *.json files in ./Data.
The audio for human annotation, generated by Tango from the Tango-Full-FT-AudioCaps checkpoint by feeding 2- or 3-label prompts, is available on Google Drive.
PD, HD, and RD stand for pretrain data, human-labeled data, and reward-model-labeled data, respectively.
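A minimal sketch of reading these annotations, assuming a JSON list of records; the filename and the `location`/`caption`/`score` keys are placeholders, so check the released *.json files in ./Data for the actual names.

```python
import json

# Load one annotation file; the filename and keys below are assumptions.
with open("Data/pretrain_2label_integrity.json", "r") as f:
    records = json.load(f)

for item in records[:3]:
    print(item.get("location"),   # path to the generated audio clip (assumed key)
          item.get("caption"),    # prompt used for generation (assumed key)
          item.get("score"))      # HA / RA preference label (assumed key)
```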
For the Integrity task, download:
- 2label_Integrity_HD_Audio and put it into /data
- 2label_Integrity_RD_Audio and put it into /data

For the Temporal task, download:
- 3label_Temporal_HD_Audio and put it into /data
- 3label_Temporal_RD_Audio and put it into /data
We will release our implementation based on Tango, one of the SOTA T2A models, using the Tango-Full-FT-AudioCaps checkpoint.
Train Reward Model
```bash
cd RM
python RM_train_2label_BCE.py   # For the Integrity task
python RM_train_3label_BCE.py   # For the Temporal task (obtained by changing lines 72, 87, and 103 of RM_train_2label_BCE.py)
```
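For orientation, here is a minimal sketch of what a 2-label BCE reward-model training step could look like; the embedding dimensions, fusion head, and batch format are assumptions, not the released `RM_train_2label_BCE.py` implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: fuse text and audio embeddings, predict a preference logit."""
    def __init__(self, text_dim=512, audio_dim=512, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_emb, audio_emb):
        return self.head(torch.cat([text_emb, audio_emb], dim=-1)).squeeze(-1)

model = RewardModel()
criterion = nn.BCEWithLogitsLoss()   # 2-label (aligned / not aligned) objective
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Dummy batch standing in for pre-computed text/audio embeddings and human labels.
text_emb, audio_emb = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 2, (8,)).float()

logits = model(text_emb, audio_emb)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```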
Using Reward Model for Annotation
```bash
cd RM
python RM_inference_2label_BCE.py   # For the Integrity task
python RM_inference_3label_BCE.py   # For the Temporal task (obtained by changing lines 56, 66, and 80 of RM_inference_2label_BCE.py)
```
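Conceptually, reward-model annotation scores each generated text-audio pair and turns the score into a pseudo-label. Below is a minimal sketch, assuming pre-computed embeddings and a binary reward head; the 0.5 threshold is an illustrative choice, not necessarily what the released inference scripts use.

```python
import torch

@torch.no_grad()
def annotate(reward_model, pairs, threshold=0.5):
    """Assign reward-model labels to (text_emb, audio_emb) pairs.

    `pairs` is an iterable of pre-computed embedding tensors; the 0.5
    threshold is an illustrative choice, not necessarily the one used in BATON.
    """
    labels = []
    for text_emb, audio_emb in pairs:
        score = torch.sigmoid(reward_model(text_emb, audio_emb)).item()
        labels.append(1 if score >= threshold else 0)
    return labels
```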
☀️ We are truly indebted to the following outstanding open-source works on which we build: Tango, AudioLDM, and the metrics from audiocraft.
@inproceedings{liao2024baton,
title={BATON: Aligning Text-to-Audio Model Using Human Preference Feedback},
author={Liao, Huan and Han, Haonan and Yang, Kai and Du, Tianjiao and Yang, Rui and Xu, Qinmei and Xu, Zunnan and Liu, Jingquan and Lu, Jiasheng and Li, Xiu},
booktitle={IJCAI},
year={2024}
}