This work aims to make generative audio models such as AudioCraft output watermarked audio by poisoning their training datasets, which can help protect the copyright of musicians who publish their works online.
There are three main components:
- The generative AI: MusicGen from AudioCraft
- The watermark generator/detector: audioseal and wavmark
- The dataset: MusicCaps, released with MusicLM
The main idea is to fine-tune the generative model on a watermarked dataset, and then fine-tune the detector on the output of that fine-tuned model. This way, we can tell whether a piece of audio was generated by a model trained on watermarked works. The pipeline therefore has two fine-tuning stages.
The overall workflow is shown in the figure below.
This work is built on AudioCraft (specifically MusicGen), so please install its dependencies first.
To install the watermark generator/detector:

```
pip install audioseal
pip install wavmark
```
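To sanity-check the installation, you can embed and detect a watermark on a dummy clip. This is a minimal sketch following the AudioSeal README (the model card names and the `(batch, channels, samples)` input layout are taken from there); it is not part of this repo's pipeline.

```python
import torch
from audioseal import AudioSeal

# load the watermark generator and detector (16-bit message variants)
generator = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detector_16bits")

# dummy mono clip: shape (batch, channels, samples), 1 second at 16 kHz
wav = torch.randn(1, 1, 16000)
sr = 16000

# embed the watermark, then check that the detector finds it
watermark = generator.get_watermark(wav, sr)
watermarked = wav + watermark
prob, message = detector.detect_watermark(watermarked, sr)
print(f"watermark probability: {float(prob):.3f}")  # should be close to 1.0
```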
- Download the dataset
We use MusicCaps, released with MusicLM. However, it only contains metadata for the audio, not the audio itself. To get the actual audio, run:
```
cd your_path_of_the_project
python dataset/downloader.py  # download the dataset into ./dataset/musiccaps
```
Notes:
- The downloader code is from download-musiccaps-dataset, and I have fixed a bug in my code according to this issue.
- Downloading may take a while, and some clips may fail to download due to network issues.
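If you only want to inspect the metadata (YouTube ID, clip boundaries, caption) without downloading audio, it is also published on the Hugging Face Hub. The dataset id and field names below are assumptions based on the public MusicCaps card, not part of this repo:

```python
from datasets import load_dataset

# MusicCaps metadata only (no audio); dataset id assumed to be "google/MusicCaps"
meta = load_dataset("google/MusicCaps", split="train")

example = meta[0]
print(example["ytid"], example["start_s"], example["end_s"])  # YouTube clip boundaries
print(example["caption"])                                     # text description
```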
- Dataset preprocessing
Note: all audio we use is mono.

The dataset preparation follows the guide. In brief, you need to organize your dataset in the following directory structure:
```
- root_of_this_project
  - config/dset/audio
    - your_dataset.yaml        # the config of your dataset
  - dataset
    - your_dataset/            # the real data; each audio has one .json and one .wav
  - egs
    - your_dataset/data.jsonl  # metadata of your dataset
```
You can simply run the following commands to prepare the downloaded musiccaps dataset:
```
# extract mono audio from musiccaps into ./dataset/musiccaps_mono_10s
python dataset/data_processor.py --action get_mono

# build dataset without watermark (named 'musiccaps_mono_10s_nonwm')
python dataset/data_processor.py --action build_mono --model none

# build dataset with watermark using audioseal model (named 'musiccaps_mono_10s_audioseal')
python dataset/data_processor.py --action build_mono --model audioseal

# optional: build dataset with watermark using wavmark model (named 'musiccaps_mono_10s_wavmark')
python dataset/data_processor.py --action build_mono --model wavmark
```
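For reference, each line of `egs/<dataset>/data.jsonl` is one JSON object describing an audio file. The sketch below shows roughly what such a manifest looks like, with field names taken from AudioCraft's `AudioMeta`; `data_processor.py` already generates it for the musiccaps datasets, so you normally do not need this.

```python
import json
from pathlib import Path

import soundfile as sf

# hand-rolled sketch of the egs/<dataset>/data.jsonl manifest expected by AudioCraft
root = Path("dataset/your_dataset")
with open("egs/your_dataset/data.jsonl", "w") as f:
    for wav_path in sorted(root.glob("*.wav")):
        info = sf.info(str(wav_path))
        meta = {
            "path": str(wav_path),
            "duration": info.duration,
            "sample_rate": info.samplerate,
            "amplitude": None,
            "weight": None,
            "info_path": None,
        }
        f.write(json.dumps(meta) + "\n")
```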
Following MusicGen's fine-tuning guide, run:
```
dora run solver=musicgen/musicgen_base_32khz model/lm/model_scale=small continue_from=//pretrained/facebook/musicgen-small conditioner=text2music dset=audio/musiccaps_mono_10s_nonwm
```
where `dset` is the dataset prepared above (here I use the dataset without the watermark).

To change the training configuration, see `config/solver/musicgen/musicgen_base_32khz.yaml`, which corresponds to the `solver` argument in the `dora run ...` command.
- Export checkpoints
You can find the SIG (printed after "instantiating solver for XP") at the beginning of the training log; then export the best checkpoint by running:
```
python export_ft_models.py --sig your_SIG --name output_dir_name
```
The checkpoint will then be exported to `checkpoints/{output_dir_name}_{SIG}`.
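For reference, `export_ft_models.py` likely wraps AudioCraft's export utilities. The sketch below follows the MusicGen training docs and may differ from the actual script; the SIG and output paths are placeholders.

```python
from audiocraft import train
from audiocraft.utils import export

# resolve the experiment folder from its SIG and export the language-model weights
xp = train.main.get_xp_from_sig("your_SIG")
export.export_lm(xp.folder / "checkpoint.th",
                 "checkpoints/output_dir_name_your_SIG/state_dict.bin")

# bundle the pretrained 32 kHz EnCodec used by musicgen-small
export.export_pretrained_compression_model(
    "facebook/encodec_32khz",
    "checkpoints/output_dir_name_your_SIG/compression_state_dict.bin",
)
```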
- Test

Use `export_import_test.ipynb` to display the output audio of your fine-tuned model.
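A minimal standalone equivalent of what the notebook does (loading the exported checkpoint and generating from a text prompt) could look like the following; the checkpoint path and prompt are placeholders.

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# load the exported fine-tuned checkpoint (placeholder path)
model = MusicGen.get_pretrained("checkpoints/output_dir_name_your_SIG")
model.set_generation_params(duration=10)

# generate a clip from a text prompt and write it to ft_sample.wav
wav = model.generate(["a calm piano melody with soft strings"])
audio_write("ft_sample", wav[0].cpu(), model.sample_rate, strategy="loudness")
```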
To prepare the stage 2 dataset, run:

```
python dataset/stage2_data_prepare.py --pos_ckpt checkpoints/your_ckpt_path1 --neg_ckpt checkpoints/your_ckpt_path2 --set all
```
- `--pos_ckpt`: checkpoint fine-tuned on the watermarked dataset
- `--neg_ckpt`: checkpoint fine-tuned on the dataset without the watermark
- `--set`: generate the training set (`train`), the testing set (`test`), or both (`all`)
The `train_prompts` and `test_prompts` variables in `stage2_data_prepare.py` are the prompts used to generate audio; you can customize them.
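For example (the values below are purely illustrative, not the ones shipped in the script):

```python
# in stage2_data_prepare.py -- illustrative prompts only, replace with your own
train_prompts = [
    "an upbeat electronic track with a driving bassline",
    "a calm acoustic guitar melody with soft percussion",
    "a cinematic orchestral piece with sweeping strings",
]
test_prompts = [
    "a jazzy piano tune with brushed drums",
]
```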
The stage 2 dataset will then be generated in the `dataset2/` directory.
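Before fine-tuning the detector, you can check how well the pretrained AudioSeal detector already separates the generated clips. The sketch below assumes the clips end up as `.wav` files somewhere under `dataset2/`; adjust the glob to the actual layout produced by `stage2_data_prepare.py`.

```python
from pathlib import Path

import torchaudio
import torchaudio.functional as F
from audioseal import AudioSeal

# pretrained AudioSeal detector; it expects 16 kHz mono input
detector = AudioSeal.load_detector("audioseal_detector_16bits")

for wav_path in sorted(Path("dataset2").rglob("*.wav")):
    wav, sr = torchaudio.load(str(wav_path))
    wav = wav.mean(dim=0, keepdim=True)      # force mono
    wav = F.resample(wav, sr, 16000)         # resample to 16 kHz
    prob, _ = detector.detect_watermark(wav.unsqueeze(0), 16000)
    print(f"{wav_path}: watermark probability {float(prob):.3f}")
```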
To fine-tune the AudioSeal detector, please see `fine_tune_audioseal_detector`. The official training instructions may also be helpful.