Extend is_kanji to recognise kanji in the CJK Unified Ideographs Extension blocks (or provide alternate function) #15

Open
Heliozoa opened this issue Jan 12, 2024 · 2 comments

Comments

@Heliozoa

Currently, is_kanji uses the Unicode range U+4E00-U+9FAF to recognise kanji, which covers most of the CJK Unified Ideographs block (U+4E00-U+9FFF). Unicode has additional "extension blocks" that contain less common kanji, such as CJK Unified Ideographs Extension B, which contains the kanji 𬵪.

Since these characters are quite obscure, and it may be difficult to determine which of them qualify as "kanji", I think it would be useful to have this functionality in a crate.
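As a rough illustration of what such a check could look like, here is a minimal sketch. The name `is_kanji_extended` and the `KANJI_RANGES` table are hypothetical and not part of the crate; the ranges are the Unicode block boundaries for the base block and Extensions A through G. Note that block membership alone does not show that a character is actually used in Japanese, so this is deliberately over-inclusive.

```rust
/// Hypothetical table of (first, last) code points for the CJK Unified
/// Ideographs block and its Extensions A-G. Not taken from the crate.
const KANJI_RANGES: [(u32, u32); 8] = [
    (0x4E00, 0x9FFF),   // CJK Unified Ideographs
    (0x3400, 0x4DBF),   // Extension A
    (0x20000, 0x2A6DF), // Extension B
    (0x2A700, 0x2B73F), // Extension C
    (0x2B740, 0x2B81F), // Extension D
    (0x2B820, 0x2CEAF), // Extension E
    (0x2CEB0, 0x2EBEF), // Extension F
    (0x30000, 0x3134F), // Extension G
];

/// Returns true if `c` falls inside any of the blocks listed above.
/// This checks block membership only; it does not verify that the
/// character is used as a kanji in Japanese text.
fn is_kanji_extended(c: char) -> bool {
    let cp = c as u32;
    KANJI_RANGES
        .iter()
        .any(|&(first, last)| (first..=last).contains(&cp))
}
```

A caller could use this alongside the existing is_kanji, for example `word.chars().any(is_kanji_extended)`.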

@PSeitz
Owner

PSeitz commented Jan 12, 2024

Are you suggesting there should be a separate method for this?
This crate is mostly for hiragana and katakana, is_kanji is just for convenience.

@Heliozoa
Author

Heliozoa commented Jan 12, 2024

That's understandable; covering just the CJK Unified Ideographs block is enough for most purposes, I imagine.

For added context, I was already using the crate for its other functionality, and started using is_kanji to pick out kanji from the words in the JMdict dictionary. Some of those words contain kanji from the extension blocks, so those kanji were unexpectedly (to me) getting filtered out by is_kanji.
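To make the filtering behaviour concrete, here is a small sketch that does not use the crate itself. `is_kanji_base` mirrors the U+4E00-U+9FAF range described in this issue, while `is_kanji_wide` and `pick_kanji` are hypothetical helpers; the wider check only adds Extensions A and B for brevity. The example word is 𠮷野家, whose first character (U+20BB7) is in Extension B.

```rust
/// Base check, mirroring the range quoted in this issue (U+4E00..=U+9FAF).
fn is_kanji_base(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FAF}')
}

/// Hypothetical wider check: the full base block plus Extensions A and B.
fn is_kanji_wide(c: char) -> bool {
    matches!(
        c,
        '\u{3400}'..='\u{4DBF}'        // Extension A
            | '\u{4E00}'..='\u{9FFF}'  // CJK Unified Ideographs
            | '\u{20000}'..='\u{2A6DF}' // Extension B
    )
}

/// Collects the characters of `word` that satisfy `pred`, in order.
fn pick_kanji(word: &str, pred: fn(char) -> bool) -> Vec<char> {
    word.chars().filter(|&c| pred(c)).collect()
}
```

With the base check, `pick_kanji("\u{20BB7}野家", is_kanji_base)` keeps only 野 and 家, while the wider check also keeps the first character.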

You can close the issue if this is out of scope, or leave it up if this is something that may have a place in the crate in the future. Thanks for the quick response!
