LLM for music separation and generation based on quantized representation of audio

deborahdore/multi-source-lms-for-audio


Multi Source Large Language Model For Audio

PyTorch Lightning Config: Hydra Template

Description

This repo explores whether a large language model (LLM), such as BERT, trained on the vector-quantized representation of audio can improve music generation and music separation. Specifically, it provides the code to: (1) train a Vector Quantized Variational Autoencoder (VQVAE) on the Synthesized Lakh Dataset (Slakh), (2) train a simple Transformer architecture on the vector-quantized representation of the audio, and (3) fine-tune a basic BERT model on the codebook indices of the quantized audio.

Tasks (2) and (3) deliberately use two different strategies, because a basic BERT model cannot directly handle sequences longer than 512 tokens. BERT's input is normally tokenized text: a sequence of integers, each an index into the model's vocabulary, which an embedding layer then maps to vectors. The audio therefore had to be represented in a BERT-compatible format, namely the indices of the codebook entries closest to the encoded audio, i.e. the entries used to quantize the input source.
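As a rough illustration of the idea above, the sketch below maps each latent vector to the index of its nearest codebook entry and then splits the resulting index sequence into windows no longer than 512. It is a toy, pure-Python version and not the repository's implementation: the function names, the squared-L2 distance, and the plain non-overlapping windowing are all assumptions, and a real pipeline would use tensor operations and may reserve positions for BERT's special tokens.

```python
from typing import List, Sequence

# Sequence limit stated above for a basic BERT model (assumed window size).
MAX_BERT_TOKENS = 512


def nearest_code_index(vector: Sequence[float],
                       codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize_to_indices(latents: Sequence[Sequence[float]],
                        codebook: Sequence[Sequence[float]]) -> List[int]:
    """Replace every latent vector by the integer index of its nearest code."""
    return [nearest_code_index(v, codebook) for v in latents]


def split_into_windows(indices: Sequence[int],
                       max_len: int = MAX_BERT_TOKENS) -> List[List[int]]:
    """Cut a long index sequence into consecutive windows of at most `max_len`."""
    return [list(indices[i:i + max_len]) for i in range(0, len(indices), max_len)]


# Toy example: a 3-entry codebook of 2-dimensional vectors (hypothetical values).
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
toy_latents = [[0.1, -0.2], [0.9, 1.2], [3.5, 4.1]]
toy_indices = quantize_to_indices(toy_latents, toy_codebook)  # [0, 1, 2]
```

The integer indices play the role that vocabulary token ids play for text, which is what makes them usable as BERT input.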

Visualization of VQVAE's Codebook

Installation

Conda

# clone project
git clone https://github.com/deborahdore/multi-source-lms-for-audio
cd multi-source-lms-for-audio

# create conda environment and install dependencies
conda env create -f environment.yaml -n myenv

# activate conda environment
conda activate myenv

How to run

Basic Training with default configurations

# train on CPU
python src/main.py trainer.accelerator=cpu

# train on GPU
python src/main.py trainer.accelerator=gpu

Choose which model to train with one of the flags train_vqvae, train_transformer, or train_bert:

python src/main.py debug=default train_bert=True

Modify configuration directly in the configuration folder.

Debug

python src/main.py debug=default

Modify the debugging configuration in default.yaml.

Hyper-parameter search

python src/main.py hparams_search=optuna

Modify the hyper-parameter search configuration in optuna.yaml.

Dataset

The dataset used in this project is the Slakh2100 dataset, which is available at the following URL.
It was processed to extract the four most prevalent instruments: bass, drums, guitar, and piano.
The final dataset should have this structure:

    slakh2100
        |
        | - train
        |     | 
        |     | - Track01
        |     |     | - bass.wav
        |     |     | - guitar.wav
        |     |     | - drums.wav
        |     |     | - piano.wav
        |     |
        |     | - Track02
        |     | - ...
        | - validation
        |     | 
        |     | - Track03
        |     | - ...
        | - test
              | 
              | - Track04
              | - ...

A simple guide on how to do that can be found here.
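To check a processed copy of the dataset against the layout above, a small script along these lines may help. It is a sketch under assumptions: the helper name `find_missing_stems` is hypothetical and not part of this repository, and it assumes every track folder should contain all four stems as `.wav` files.

```python
from pathlib import Path
from typing import List, Union

# Split and stem names taken from the directory layout shown above.
SPLITS = ("train", "validation", "test")
STEMS = ("bass.wav", "drums.wav", "guitar.wav", "piano.wav")


def find_missing_stems(root: Union[str, Path]) -> List[Path]:
    """Return the paths of missing split folders or missing stem files."""
    root = Path(root)
    missing: List[Path] = []
    for split in SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            # The whole split is absent; report it and move on.
            missing.append(split_dir)
            continue
        # Every sub-folder of a split is treated as one track.
        for track_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            for stem in STEMS:
                if not (track_dir / stem).is_file():
                    missing.append(track_dir / stem)
    return missing
```

Calling `find_missing_stems("slakh2100")` returns an empty list when the layout is complete; otherwise each returned path names a split folder or stem file that still needs to be created.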

Model Checkpoints

Model checkpoints are available in their corresponding folder: best_checkpoint
