This repo explores the potential of leveraging a large language model (LLM), such as BERT, trained on the vector-quantized representation of audio to enhance music generation and music separation tasks. Specifically, it provides the code to: (1) train a Vector Quantized Variational Autoencoder (VQVAE) on the Synthesized Lakh Dataset (Slakh), (2) train a simple Transformer architecture on the vector-quantized representation of the audio, and (3) fine-tune a basic BERT model on the codebook indices of the quantized audio.
Tasks (2) and (3) deliberately employ two different strategies, because a basic BERT model cannot directly handle sequences longer than 512 tokens. Its input is usually tokenized text converted into numerical representations through an embedding layer; these representations are integers that index into the model's vocabulary. Audio therefore had to be represented in a format compatible with BERT: the indices of the codebook vectors closest to the input audio, i.e. the ones the VQVAE would use to quantize the input source.
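For intuition, this BERT-compatible representation amounts to a nearest-neighbour lookup in the VQVAE codebook, followed by chunking to BERT's length limit. The sketch below is illustrative only; tensor shapes and the function name are assumptions, not the repo's actual API:

```python
import torch

def audio_to_bert_tokens(encoded: torch.Tensor, codebook: torch.Tensor, max_len: int = 512):
    """Turn VQVAE encoder outputs into integer token sequences for BERT.

    encoded:  (batch, time, dim) continuous encoder output (assumed shape)
    codebook: (num_codes, dim) learned VQVAE codebook (assumed shape)
    """
    # distance from every time step to every codebook vector
    dists = torch.cdist(encoded, codebook.expand(encoded.size(0), -1, -1))
    # index of the nearest codebook entry per time step: integer IDs that
    # play the same role as text token IDs in BERT's vocabulary
    indices = dists.argmin(dim=-1)  # (batch, time)
    # BERT-base accepts at most 512 tokens, so split long sequences into chunks
    return indices.split(max_len, dim=1)
```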
```bash
# clone project
git clone https://github.com/deborahdore/multi-source-lms-for-audio
cd multi-source-lms-for-audio

# create conda environment and install dependencies
conda env create -f environment.yaml -n myenv

# activate conda environment
conda activate myenv
```

```bash
# train on CPU
python src/main.py trainer.accelerator=cpu

# train on GPU
python src/main.py trainer.accelerator=gpu
```
To fine-tune BERT, enable the `train_bert` flag:

```bash
python src/main.py debug=default train_bert=True
```

Modify the configuration directly in the configuration folder.

To run in debug mode:

```bash
python src/main.py debug=default
```

Modify the debugging configuration in `default.yaml`.

To run a hyper-parameter search:

```bash
python src/main.py hparams_search=optuna
```

Modify the hyper-parameter search configuration in `optuna.yaml`.
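For reference, here is a minimal sketch of what `optuna.yaml` might contain, following the conventions of Hydra's Optuna sweeper plugin; the search-space keys (`model.lr`, `datamodule.batch_size`) are illustrative, not necessarily the repo's actual config keys:

```yaml
# @package _global_

defaults:
  - override /hydra/sweeper: optuna

hydra:
  sweeper:
    direction: minimize   # minimize the monitored loss
    n_trials: 20          # number of Optuna trials to run
    params:
      # illustrative search space; use the keys defined in this repo's configs
      model.lr: interval(0.0001, 0.01)
      datamodule.batch_size: choice(16, 32, 64)
```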
The dataset used in this project is the Slakh2100 dataset, available from the official Slakh project page.
It was processed to extract the four most prevalent instruments: bass, drums, guitar, and piano.
The final dataset should have this structure:
```
slakh2100
|
| - train
| |
| | - Track01
| | | - bass.wav
| | | - guitar.wav
| | | - drums.wav
| | | - piano.wav
| |
| | - Track02
| | - ...
|
| - validation
| |
| | - Track03
| | - ...
|
| - test
| |
| | - Track04
| | - ...
```
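As a quick sanity check of this layout, one track's stems can be loaded as follows (a sketch assuming torchaudio is installed; `load_track` is a hypothetical helper, not part of the repo):

```python
from pathlib import Path
import torchaudio

STEMS = ("bass", "drums", "guitar", "piano")

def load_track(track_dir: str) -> dict:
    """Load the four instrument stems of one track as waveform tensors."""
    stems = {}
    for name in STEMS:
        # each stem is a (channels, samples) tensor plus its sample rate
        waveform, sample_rate = torchaudio.load(str(Path(track_dir) / f"{name}.wav"))
        stems[name] = (waveform, sample_rate)
    return stems

track = load_track("slakh2100/train/Track01")
```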
A simple guide on how to do that can be found here.
Model checkpoints are available in the corresponding folder: `best_checkpoint`.