Currently, is_kanji uses the Unicode range U+4E00-U+9FAF to recognise kanji, which covers most of the CJK Unified Ideographs block (U+4E00-U+9FFF). Unicode has additional "extension blocks" that contain less common kanji, such as CJK Unified Ideographs Extension B which contains the kanji 𬵪.
Since these characters are quite obscure, and it can be difficult to determine which of them qualify as "kanji", I think it would be useful to include such functionality in a crate.
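To make the request concrete, here is a minimal sketch of what an extended check could look like. The name `is_kanji_extended` and the `KANJI_RANGES` table are hypothetical and not part of the crate. The ranges are whole Unicode block boundaries, so they also cover unassigned code points, and the list is not exhaustive (newer extension blocks and the compatibility ideographs are left out). Which of these characters should actually count as "kanji" is exactly the open question.

```rust
/// Hypothetical table of Unicode block ranges treated as kanji in this sketch.
/// These are block boundaries, so unassigned code points inside them also match.
const KANJI_RANGES: [(u32, u32); 8] = [
    (0x4E00, 0x9FFF),   // CJK Unified Ideographs
    (0x3400, 0x4DBF),   // CJK Unified Ideographs Extension A
    (0x20000, 0x2A6DF), // Extension B
    (0x2A700, 0x2B73F), // Extension C
    (0x2B740, 0x2B81F), // Extension D
    (0x2B820, 0x2CEAF), // Extension E
    (0x2CEB0, 0x2EBEF), // Extension F
    (0x30000, 0x3134F), // Extension G
];

/// Returns true if `c` falls inside any of the blocks listed above.
/// This is an illustrative helper, not the crate's `is_kanji`.
pub fn is_kanji_extended(c: char) -> bool {
    let cp = u32::from(c);
    KANJI_RANGES
        .iter()
        .any(|&(lo, hi)| (lo..=hi).contains(&cp))
}
```

A table of ranges keeps the check easy to audit and to extend if more blocks turn out to matter.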
That's understandable; covering just the CJK Unified Ideographs block is enough for most purposes, I imagine.
For added context, I was already using the crate for its other functionality, and started using is_kanji to pick out kanji from the words in the JMdict dictionary. JMdict includes some words with kanji from the extension blocks, so those kanji were unexpectedly (to me) getting filtered out by is_kanji.
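For illustration, the filtering described above can be reproduced with a small sketch. Both `extract_kanji` and `narrow_is_kanji` are hypothetical names. `narrow_is_kanji` is only a stand-in that checks the U+4E00-U+9FAF range mentioned in the issue, and is not the crate's actual implementation.

```rust
/// Collects the characters of `word` that the given predicate accepts.
pub fn extract_kanji(word: &str, is_kanji: impl Fn(char) -> bool) -> Vec<char> {
    word.chars().filter(|&c| is_kanji(c)).collect()
}

/// Stand-in predicate limited to U+4E00-U+9FAF (an assumption for this sketch).
pub fn narrow_is_kanji(c: char) -> bool {
    ('\u{4E00}'..='\u{9FAF}').contains(&c)
}
```

With this predicate, `extract_kanji("漢字かな", narrow_is_kanji)` keeps 漢 and 字, while a character from a supplementary-plane extension block such as U+20000 is silently dropped.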
You can close the issue if this is out of scope, or leave it up if this is something that may have a place in the crate in the future. Thanks for the quick response!