An unofficial implementation of the phoneme segmentation method described in *Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks* (INTERSPEECH 2021).
This method requires no additional training and can be easily applied to various speech representations using the S3PRL toolkit.
It works by using dynamic programming to jointly minimize the frame-wise distance to the nearest k-means cluster center and the number of phoneme-like segments in an utterance (see the paper above for details).
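A minimal sketch of that dynamic program is shown below, assuming `features` is a `(T, D)` array of frame-level representations and `centers` is a `(K, D)` k-means codebook; the function name, the `max_len` cap, and the variable names are illustrative, not the repository's actual code.

```python
import numpy as np

def segment(features, centers, lam=35.0, max_len=50):
    """Split T frames into contiguous segments, each assigned to one k-means center.

    Minimizes   sum over segments [ min_k sum_t ||x_t - c_k||^2 ]  +  lam * (#segments)
    with a simple O(T * max_len * K) dynamic program.
    """
    T = len(features)
    # Squared distance from every frame to every center: shape (T, K).
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)

    best = np.full(T + 1, np.inf)   # best[j] = optimal cost of frames [0, j)
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)

    for j in range(1, T + 1):
        for i in range(max(0, j - max_len), j):
            # Cost of one segment covering frames [i, j): pick its best single center.
            seg_cost = dists[i:j].sum(0).min() + lam
            if best[i] + seg_cost < best[j]:
                best[j] = best[i] + seg_cost
                back[j] = i

    # Recover segment start frames by backtracking.
    boundaries, j = [], T
    while j > 0:
        boundaries.append(back[j])
        j = back[j]
    return sorted(boundaries)
```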
Segmentation accuracy depends heavily on the parameter lambda, whose optimal value varies greatly with the choice of self-supervised representation.
Lambda is set to 35 by default.
Values between 20 and 50 give about 60% F1 score for the 6th layer of HuBERT.
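As a rough usage sketch, 6th-layer HuBERT features can be extracted with S3PRL and passed to a segmentation routine such as the `segment` sketch above. The file paths are placeholders, and loading the pretrained k-means checkpoint with joblib (as in the fairseq GSLM recipes) is an assumption about its format.

```python
import joblib
import torch
import torchaudio
import s3prl.hub as hub

# S3PRL upstreams expect 16 kHz mono waveforms.
wav, sr = torchaudio.load("utterance.wav")
assert sr == 16000

# Extract hidden states from HuBERT via the S3PRL hub.
model = getattr(hub, "hubert")()
model.eval()
with torch.no_grad():
    hidden_states = model([wav.squeeze(0)])["hidden_states"]

# Index 6 assumes hidden_states[0] is the CNN output followed by transformer layers.
features = hidden_states[6][0].numpy()            # (T, D)

# Cluster centers from a pretrained k-means model (e.g. a fairseq GSLM checkpoint,
# which is saved with joblib and exposes cluster_centers_).
centers = joblib.load("km.bin").cluster_centers_  # (K, D)

boundaries = segment(features, centers, lam=35.0)
print(boundaries)
```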
Also note that while each phoneme-level segment is assigned to a cluster center, the same phoneme appearing in different segments is often assigned to different cluster centers, so this method is less suitable for phoneme discovery.
Given an audio file and its text transcript, we can use forced alignment to obtain reference word/phoneme boundaries and visualize our method's accuracy.
Detailed steps are given in demo.ipynb.
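Once reference boundaries are available, the F1 score can be computed from predicted and forced-alignment boundaries. A minimal sketch, assuming both are given as sorted frame indices and that a prediction within a small tolerance window counts as correct; the function name and default tolerance are illustrative.

```python
import numpy as np

def boundary_f1(pred, ref, tol=2):
    """Precision, recall, and F1 for boundary detection.

    pred, ref: sorted frame indices of predicted / reference boundaries.
    tol: a prediction within `tol` frames of an unmatched reference counts as a hit.
    """
    pred, ref = np.asarray(pred), np.asarray(ref)
    matched = np.zeros(len(ref), dtype=bool)
    hits = 0
    for p in pred:
        if len(ref) == 0:
            break
        # Greedily match each prediction to the closest still-unmatched reference.
        diffs = np.abs(ref - p)
        diffs[matched] = tol + 1
        j = diffs.argmin()
        if diffs[j] <= tol:
            matched[j] = True
            hits += 1
    precision = hits / max(len(pred), 1)
    recall = hits / max(len(ref), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```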
- The S3PRL toolkit
- Pretrained k-means model (see FAIRSEQ GSLM)
- Add forced alignment and Praat visualization demo
- Add F1 score calculation