This directory provides code to reproduce the TDS baselines in the paper. It is meant to be used together with wav2letter. It includes:
- Two lists of supervised training data (10 hours and 1 hour).
- Two sets of tokens for phonemes and characters.
- Two lexicons to map words to phonemes and characters.
- A 20-million-parameter TDS model for training on the limited supervised data alone.
- A 37-million-parameter TDS model for training on both the supervised data and pseudo-labels.
Acoustic model training config files are provided for each set-up. Note that the 20-million-parameter TDS models are each trained on 8 GPUs, while the 37-million-parameter ones are trained on 64 GPUs. See the wav2letter tutorials on how to run distributed training (a launch sketch follows the sample command below).
Sample command:
```bash
</path/to/your>/wav2letter/build/Train \
  --flagsfile=</path/to/your>/libri-light/TDS/experiments/config/acoustic_model/10h+pseudo-label_letter_37M_TDS.cfg \
  --enable_distributed=true
```
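For multi-GPU runs, a minimal sketch of an 8-GPU launch, assuming the MPI launch pattern from the wav2letter distributed-training tutorial (the config filename is illustrative, mirroring the model names in the table below):

```bash
# Assumed launch pattern from the wav2letter distributed-training
# tutorial: one MPI process per GPU, so -n 8 for the 20M models
# (the 37M models would use -n 64 across machines).
mpirun -n 8 </path/to/your>/wav2letter/build/Train \
  --flagsfile=</path/to/your>/libri-light/TDS/experiments/config/acoustic_model/10h_letter_20M_TDS.cfg \
  --enable_distributed=true
```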
Optimal decoding parameters are provided for each model. You can use the wav2letter decoder to
- Get the optimal WER
- Generate pseudo-labels
We use the official LibriSpeech 4-gram language model for all decoding experiments. The model can be downloaded here.
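The official LibriSpeech language models are also hosted on OpenSLR (resource 11); a sketch assuming that mirror:

```bash
# Assumed mirror: OpenSLR resource 11 hosts the official LibriSpeech LMs.
wget http://www.openslr.org/resources/11/4-gram.arpa.gz
gunzip 4-gram.arpa.gz   # the decoder reads the uncompressed ARPA file
```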
Sample command:
```bash
</path/to/your>/wav2letter/build/Decode \
  --flagsfile=</path/to/your>/libri-light/TDS/experiments/config/decoding/10h+pseudo-label_letter_37M_TDS.cfg \
  --sclite=</path/to/your/output_folder>
```
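The provided decoding configs fix the tuned parameters, but wav2letter lets command-line flags override values from `--flagsfile`. A sketch, assuming the standard wav2letter decoder flags and an illustrative LM path; the numeric values are placeholders, not the tuned optima shipped with the configs:

```bash
# --lm / --lmweight / --wordscore / --beamsize are standard wav2letter
# decoder flags; the values below are placeholders for illustration.
</path/to/your>/wav2letter/build/Decode \
  --flagsfile=</path/to/your>/libri-light/TDS/experiments/config/decoding/10h+pseudo-label_letter_37M_TDS.cfg \
  --lm=</path/to/your>/4-gram.arpa \
  --lmweight=2.0 --wordscore=1.0 --beamsize=500 \
  --sclite=</path/to/your/output_folder>
```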
Supervised data | LibriVox pseudo-labels | Target unit | Parameters | Model |
---|---|---|---|---|
10 hours | Y | letter | 37M | 10h+pseudo-label_letter_37M_TDS.bin |
10 hours | Y | phonemes | 37M | 10h+pseudo-label_phone_37M_TDS.bin |
10 hours | N | letter | 20M | 10h_letter_20M_TDS.bin |
10 hours | N | phonemes | 20M | 10h_phone_20M_TDS.bin |
1 hour | Y | letter | 37M | 1h+pseudo-label_letter_37M_TDS.bin |
1 hour | Y | phonemes | 37M | 1h+pseudo-label_phone_37M_TDS.bin |
1 hour | N | letter | 20M | 1h_letter_20M_TDS.bin |
1 hour | N | phonemes | 20M | 1h_phone_20M_TDS.bin |
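To decode with one of the pretrained models directly, the acoustic-model path can point at the downloaded `.bin` file; a minimal sketch, assuming the standard wav2letter `--am` flag and an illustrative local path:

```bash
# --am is the standard wav2letter flag for the serialized acoustic model;
# the local models/ path is an assumption about where the .bin was saved.
</path/to/your>/wav2letter/build/Decode \
  --flagsfile=</path/to/your>/libri-light/TDS/experiments/config/decoding/1h_letter_20M_TDS.cfg \
  --am=</path/to/your>/models/1h_letter_20M_TDS.bin \
  --sclite=</path/to/your/output_folder>
```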