TDS Baselines

This directory provides the code to reproduce the TDS baselines reported in the paper. Use it together with wav2letter.

Data

  • Two lists of supervised training data: one with 10 hours and one with 1 hour.
  • Two sets of tokens: one for phonemes and one for characters.
  • Two lexicons that map words to phonemes and to characters.
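To show how a lexicon and a token set fit together, here is a minimal Python sketch that parses lexicon lines and spells a transcript into tokens. It assumes each lexicon line is a word followed by its whitespace-separated tokens, and that `|` is a word-boundary token in the letter set. Both are assumptions about the file contents, so check the actual files before relying on them. The sample lexicon and token set are hypothetical.

```python
from typing import Dict, List, Set

def load_lexicon(lines: List[str]) -> Dict[str, List[str]]:
    """Parse lines of the (assumed) form 'word<TAB>tok1 tok2 ...'.

    If a word has several spellings, only the first one is kept.
    """
    lexicon: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, tokens = fields[0], fields[1:]
        lexicon.setdefault(word, tokens)
    return lexicon

def spell(transcript: str, lexicon: Dict[str, List[str]], token_set: Set[str]) -> List[str]:
    """Map a transcript to tokens, failing loudly on unknown words or tokens."""
    out: List[str] = []
    for word in transcript.split():
        if word not in lexicon:
            raise KeyError(f"word not in lexicon: {word}")
        for tok in lexicon[word]:
            if tok not in token_set:
                raise ValueError(f"token {tok!r} not in token set")
            out.append(tok)
    return out

# Hypothetical letter lexicon and token set, for illustration only.
lexicon = load_lexicon(["hello\th e l l o |", "world\tw o r l d |"])
tokens = set("abcdefghijklmnopqrstuvwxyz'|")
print(spell("hello world", lexicon, tokens))
```

Checking every lexicon spelling against the token set up front catches a mismatched lexicon/token pair (for example, a phoneme lexicon used with the character tokens) before training starts.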

Experiments

Model Architectures

  • A TDS model with 20 million parameters is provided for training on the limited supervised data.
  • A TDS model with 37 million parameters is provided for training on both supervised data and pseudo labels.

Configurations

Acoustic model

We provide acoustic model training config files for each set-up. Note that each 20-million-parameter TDS model is trained on 8 GPUs, while each 37-million-parameter model is trained on 64 GPUs. See the wav2letter tutorials for how to run distributed training.

Sample command:

</path/to/your>/wav2letter/build/Train \
--flagsfile=</path/to/your>/libri-light/TDS/experiments/config/acoustic_model/10h+pseudo-label_letter_37M_TDS.cfg \
--enable_distributed=true
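The sample config name above matches the naming of the pretrained models in the table at the end of this README (`<hours>[+pseudo-label]_<unit>_<size>_TDS`). The Python sketch below assembles the sample command for any set-up, assuming every config file follows that pattern. It also returns the GPU count stated above (8 for 20M models, 64 for 37M models). The helper names and the `/path/to` placeholders are hypothetical, and launching the processes across GPUs is left to the wav2letter tutorials.

```python
from typing import List, Tuple

def setup_name(hours: int, pseudo_label: bool, unit: str) -> str:
    """Build a name such as '10h+pseudo-label_letter_37M_TDS' (assumed pattern)."""
    if hours not in (1, 10):
        raise ValueError("supervised data is either 1 or 10 hours")
    if unit not in ("letter", "phone"):
        raise ValueError("unit must be 'letter' or 'phone'")
    # In this README, the 37M models use pseudo-labels and the 20M models do not.
    size = "37M" if pseudo_label else "20M"
    prefix = f"{hours}h+pseudo-label" if pseudo_label else f"{hours}h"
    return f"{prefix}_{unit}_{size}_TDS"

def train_command(w2l_root: str, libri_light_root: str,
                  hours: int, pseudo_label: bool, unit: str) -> Tuple[int, List[str]]:
    """Return (number of GPUs, argv) for the Train binary (hypothetical helper)."""
    name = setup_name(hours, pseudo_label, unit)
    cfg = f"{libri_light_root}/TDS/experiments/config/acoustic_model/{name}.cfg"
    gpus = 64 if pseudo_label else 8
    argv = [f"{w2l_root}/build/Train", f"--flagsfile={cfg}", "--enable_distributed=true"]
    return gpus, argv

gpus, argv = train_command("/path/to/wav2letter", "/path/to/libri-light", 1, False, "phone")
print(gpus, " ".join(argv))
```

Returning an argument list rather than a single string avoids shell-quoting issues if the list is later passed to a process launcher.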

Decoding

Decoding config files with the optimal decoding parameters for each model. You can use them with the wav2letter decoder to:

  • Get the optimal WER.
  • Generate pseudo-labels.

We use the official LibriSpeech 4-gram language model for all decoding experiments. The model can be downloaded here.

Sample command:

</path/to/your>/wav2letter/build/Decode \
--flagsfile=</path/to/your>/libri-light/TDS/experiments/config/decoding/10h+pseudo-label_letter_37M_TDS.cfg \
--sclite=</path/to/your/output_folder>
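The `--sclite` flag points the decoder at a folder for sclite-format output, which is normally scored with NIST sclite. For a quick sanity check outside sclite, word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. Below is a minimal Python sketch of that definition. It is not the scoring pipeline used for the paper's numbers, and the example sentences are made up.

```python
from typing import List

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed with a word-level Levenshtein distance."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            sub = prev[j - 1] + (ref[i - 1] != hyp[j - 1])  # match or substitution
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)  # vs. deletion, insertion
        prev = cur
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Keeping only the previous row of the dynamic-programming table makes memory linear in the hypothesis length.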

Pretrained Models

Supervised data   Pseudo-labels (LibriVox)   Target unit   TDS size   Model
10 hours          Y                          letter        37M        10h+pseudo-label_letter_37M_TDS.bin
10 hours          Y                          phonemes      37M        10h+pseudo-label_phone_37M_TDS.bin
10 hours          N                          letter        20M        10h_letter_20M_TDS.bin
10 hours          N                          phonemes      20M        10h_phone_20M_TDS.bin
1 hour            Y                          letter        37M        1h+pseudo-label_letter_37M_TDS.bin
1 hour            Y                          phonemes      37M        1h+pseudo-label_phone_37M_TDS.bin
1 hour            N                          letter        20M        1h_letter_20M_TDS.bin
1 hour            N                          phonemes      20M        1h_phone_20M_TDS.bin