This repo explores the potential of leveraging a large language model (LLM), such as BERT, trained on the vector-quantized representation of audio to enhance music generation and music separation tasks. Specifically, it provides the code to: (1) train a Vector Quantized Variational Autoencoder (VQVAE) on the Synthesized Lakh Dataset (Slakh), (2) train a simple Transformer architecture on the vector-quantized representation of the audio, and (3) fine-tune a basic BERT model on the codebook indices of the quantized audio.
Tasks (2) and (3) deliberately employ two different strategies, because a basic BERT model cannot directly handle sequences longer than 512 tokens. Its input is usually tokenized text converted into numerical representations through an embedding layer; these representations are integers that index into the model's vocabulary. Audio therefore had to be represented in a format compatible with BERT: the indices of the codebook vectors closest to the input audio, i.e. the ones the VQVAE would use to quantize the input source.
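For intuition, this BERT-compatible representation amounts to a nearest-neighbour lookup in the VQVAE codebook, followed by chunking to BERT's length limit. The sketch below is illustrative only; tensor shapes and the function name are assumptions, not the repo's actual API:

```python
import torch

def audio_to_bert_tokens(encoded: torch.Tensor, codebook: torch.Tensor, max_len: int = 512):
    """Turn VQVAE encoder outputs into integer token sequences for BERT.

    encoded:  (batch, time, dim) continuous encoder output (assumed shape)
    codebook: (num_codes, dim) learned VQVAE codebook (assumed shape)
    """
    # distance from every time step to every codebook vector
    dists = torch.cdist(encoded, codebook.expand(encoded.size(0), -1, -1))
    # index of the nearest codebook entry per time step: integer IDs that
    # play the same role as text token IDs in BERT's vocabulary
    indices = dists.argmin(dim=-1)  # (batch, time)
    # BERT-base accepts at most 512 tokens, so split long sequences into chunks
    return indices.split(max_len, dim=1)
```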
```bash
# clone project
git clone https://github.com/deborahdore/multi-source-lms-for-audio
cd multi-source-lms-for-audio

# create conda environment and install dependencies
conda env create -f environment.yaml -n myenv

# activate conda environment
conda activate myenv
```

```bash
# train on CPU
python src/main.py trainer.accelerator=cpu

# train on GPU
python src/main.py trainer.accelerator=gpu
```
To fine-tune BERT, enable the `train_bert` flag:

```bash
python src/main.py debug=default train_bert=True
```

Modify the configuration directly in the configuration folder.

To run in debug mode:

```bash
python src/main.py debug=default
```

Modify the debugging configuration in `default.yaml`.

To run a hyper-parameter search:

```bash
python src/main.py hparams_search=optuna
```

Modify the hyper-parameter search configuration in `optuna.yaml`.
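For reference, here is a minimal sketch of what `optuna.yaml` might contain, following the conventions of Hydra's Optuna sweeper plugin; the search-space keys (`model.lr`, `datamodule.batch_size`) are illustrative, not necessarily the repo's actual config keys:

```yaml
# @package _global_

defaults:
  - override /hydra/sweeper: optuna

hydra:
  sweeper:
    direction: minimize   # minimize the monitored loss
    n_trials: 20          # number of Optuna trials to run
    params:
      # illustrative search space; use the keys defined in this repo's configs
      model.lr: interval(0.0001, 0.01)
      datamodule.batch_size: choice(16, 32, 64)
```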
The dataset used in this project is the Slakh2100 dataset, available from the official Slakh project page.
It was processed to extract the four most prevalent instruments: bass, drums, guitar, and piano.
The final dataset should have this structure:
```
slakh2100
|
| - train
| |
| | - Track01
| | | - bass.wav
| | | - guitar.wav
| | | - drums.wav
| | | - piano.wav
| |
| | - Track02
| | - ...
|
| - validation
| |
| | - Track03
| | - ...
|
| - test
| |
| | - Track04
| | - ...
```
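As a quick sanity check of this layout, one track's stems can be loaded as follows (a sketch assuming torchaudio is installed; `load_track` is a hypothetical helper, not part of the repo):

```python
from pathlib import Path
import torchaudio

STEMS = ("bass", "drums", "guitar", "piano")

def load_track(track_dir: str) -> dict:
    """Load the four instrument stems of one track as waveform tensors."""
    stems = {}
    for name in STEMS:
        # each stem is a (channels, samples) tensor plus its sample rate
        waveform, sample_rate = torchaudio.load(str(Path(track_dir) / f"{name}.wav"))
        stems[name] = (waveform, sample_rate)
    return stems

track = load_track("slakh2100/train/Track01")
```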
A simple guide on how to do that can be found here.
Model checkpoints are available in the corresponding folder: `best_checkpoint`.