Word list correction plan #4
Comments
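In CanCLID/cantonese_orthography#2 (comment) I mentioned that the LSHK character list may include a frequency ranking, from which rough word frequencies could be generated, though such data would not be precise. |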
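The frequency data in the LSHK character list only distinguishes which reading is more common; it does not necessarily mark rare readings. I propose that if a dictionary lists a given monosyllabic word, the reading of that monosyllabic word can be treated as a common reading. If a character already has another reading, and there is also a reading that dictionaries do not list and that occurs only in particular collocations, then whether to include it depends on whether it appears in other character lists. For example: 「丁」 read as ding1 should be found in the word list, so ding1 should be included as a common reading. Read as zang1/zaang1, however, it will not be found in dictionaries, so that reading should occur only in a few contexts (伐木丁丁) and therefore should not be included as a single-character reading. Whether to delete the single-character reading or set its frequency to 0% can be decided according to the state of the character list. What does everyone think? |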
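The rule proposed above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `classify_readings`, the labels `common` / `keep` / `rare`, and the toy data are all hypothetical, and real dictionaries and character lists would need to be loaded from their actual sources.

```python
def classify_readings(char, readings, dictionary_words, other_lists):
    """Label each reading of `char` (hypothetical helper, not project code).

    dictionary_words: set of (char, reading) pairs attested as
        monosyllabic words in some dictionary.
    other_lists: list of sets of (char, reading) pairs taken from
        other character lists.
    """
    labels = {}
    for reading in readings:
        if (char, reading) in dictionary_words:
            # Attested as a monosyllabic word: treat as a common reading.
            labels[reading] = "common"
        elif any((char, reading) in lst for lst in other_lists):
            # Not in the dictionary, but another character list has it.
            labels[reading] = "keep"
        else:
            # Only found in particular collocations: delete it or set
            # its frequency to 0%, depending on the character list.
            labels[reading] = "rare"
    return labels


# Toy data for the 「丁」 example (assumed, not real dictionary content).
dictionary_words = {("丁", "ding1")}
other_lists = [set()]
labels = classify_readings(
    "丁", ["ding1", "zang1", "zaang1"], dictionary_words, other_lists
)
```

With this toy data, ding1 is labelled as common and zang1/zaang1 as rare; what "rare" should mean in practice (deletion or 0% frequency) is still the open question raised above.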
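I agree with @chaaklau. We probably need to establish a principle for judging rare character readings before handling them. I wonder whether there is a way to extract them separately so that everyone can discuss them and set their frequencies. |
If so, we first need to solve these two problems: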
|
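On the question of character frequency weights, see rime/home#14 |
@leimaau mentioned in #2 that some words need their frequencies changed, and that character frequencies need revising as well. So how should the next lexicon update be planned? How should we determine the proper ranking and frequency of each character and word? What are the respective data sources?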