- tests
- documentation
- group all preprocessing files into one folder:
  - all_cha
  - select, clean -> ortholines.txt
  - phono -> phono.txt
  - syllabify -> syllabified.txt
  - auto-tags -> auto-tags.txt
- build grammars
Goal (project description still to be written): an easy-to-use package that computes the correlation between the output of segmentation algorithms and CDI reports.
Handles .cha files (or other formats? or none at all, taking just a clean ortho file, or just tags.txt?). Can (or can't) phonologize and syllabify (which languages? None for now, with English planned soon).
- Get the number of words, phones, and syllables in the corpus
- Get the number of single-word utterances
- Get stats on the corpus
- Store + stats for ortho, gold
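The corpus counts listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the package's actual code: the function name `corpus_stats` is hypothetical, and the input format (one utterance per line, phones separated by whitespace, `;esyll` closing each syllable, `;eword` closing each word) is an assumed layout for the tags file, since these notes do not specify one.

```python
def corpus_stats(tagged_lines, syll_sep=";esyll", word_sep=";eword"):
    """Count utterances, words, syllables, phones and single-word utterances.

    Assumed (hypothetical) format: one utterance per line, phones separated
    by whitespace, syll_sep after each syllable, word_sep after each word.
    """
    stats = {
        "utterances": 0,
        "words": 0,
        "syllables": 0,
        "phones": 0,
        "single_word_utterances": 0,
    }
    for line in tagged_lines:
        # Split the utterance into words, dropping empty trailing pieces.
        words = [w for w in line.split(word_sep) if w.strip()]
        if not words:
            continue  # skip blank lines
        stats["utterances"] += 1
        stats["words"] += len(words)
        if len(words) == 1:
            stats["single_word_utterances"] += 1
        for word in words:
            sylls = [s for s in word.split(syll_sep) if s.strip()]
            stats["syllables"] += len(sylls)
            # Phones are the whitespace-separated tokens inside a syllable.
            stats["phones"] += sum(len(s.split()) for s in sylls)
    return stats


example = [
    "dh ax ;esyll ;eword d ao g ;esyll ;eword",
    "l uh k ;esyll ;eword",
]
print(corpus_stats(example))
```

The separators are keyword arguments so the same sketch can be tried on a differently tagged file.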
Given segmented, ortho, gold:
- Get dict from phono to ortho (or rather build it from CDI?)
- Number/list of words, syllables, phones
- Freq_top, freq_words; write these to files
- True positives and related counts
- Evaluation (F-score etc.)
- Correct words
- Incorrect words
- POS tagging?
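The evaluation items above (true positives, F-score, correct and incorrect words) could look something like the sketch below. It is a minimal token-level scoring under assumptions that these notes do not confirm: `word_spans` and `evaluate` are hypothetical names, segmented and gold utterances are aligned line by line, words are separated by spaces, and both versions contain the same underlying symbols once spaces are removed. A segmented word counts as correct only if both of its boundaries match a gold word.

```python
def word_spans(utterance):
    """Return (start, end, word) for each space-separated word,
    with positions counted on the utterance with spaces removed."""
    spans = []
    start = 0
    for word in utterance.split():
        end = start + len(word)
        spans.append((start, end, word))
        start = end
    return spans


def evaluate(segmented, gold):
    """Token-level precision, recall and F-score, plus word lists."""
    tp = n_seg = n_gold = 0
    correct, incorrect = [], []
    for seg_utt, gold_utt in zip(segmented, gold):
        gold_set = {(s, e) for s, e, _ in word_spans(gold_utt)}
        n_gold += len(gold_set)
        for s, e, word in word_spans(seg_utt):
            n_seg += 1
            if (s, e) in gold_set:
                tp += 1  # true positive: both boundaries match the gold
                correct.append(word)
            else:
                incorrect.append(word)
    precision = tp / n_seg if n_seg else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    return {
        "true_positives": tp,
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
        "correct": correct,
        "incorrect": incorrect,
    }


result = evaluate(["thedog barks", "look"], ["the dog barks", "look"])
print(result["correct"], result["incorrect"])
```

In this toy example "thedog" is an undersegmentation, so it lands in the incorrect list while "barks" and "look" are counted as true positives.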