Integrate Chinese Character Frequency Counter #110

Open
baimafeima opened this issue Aug 23, 2018 · 5 comments

Comments

@baimafeima

It would be great to be able to paste arbitrary Chinese text into a field in Syng and get a character frequency count at the click of a button. This would make it possible to quickly identify the most important characters to learn from a particular Chinese text, which would help any college student prepare efficiently for exams.

https://czielinski.github.io/hanzifreq/hanzifreq/output/frequencies.html
See: https://github.com/czielinski/hanzifreq

These scripts allow the analysis of character frequencies in Chinese text corpora. This might be helpful for Chinese language learners to prioritize common characters when learning how to write.

@sotch-pr35mac
Owner

That sounds like it could be a pretty helpful tool! So the feature would be to paste in some arbitrary block of Chinese text and get frequency data back from it about which characters are most frequently used?
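
As a rough illustration of the feature being discussed (not code from Syng or hanzifreq), a character frequency count could be sketched as below. The function names `is_hanzi` and `char_frequencies` are hypothetical, and the sketch assumes that only the main CJK Unified Ideographs block and Extension A count as Chinese characters, so punctuation, Latin letters, and digits are ignored.

```python
from collections import Counter


def is_hanzi(ch: str) -> bool:
    """Return True for characters in the CJK Unified Ideographs block or Extension A.

    Assumption: rarer extension blocks are ignored to keep the sketch short.
    """
    code = ord(ch)
    return 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF


def char_frequencies(text: str) -> list[tuple[str, int]]:
    """Count Chinese characters in text, most frequent first."""
    counts = Counter(ch for ch in text if is_hanzi(ch))
    return counts.most_common()


if __name__ == "__main__":
    sample = "我爱学习中文,中文很有意思。"
    for ch, n in char_frequencies(sample):
        print(ch, n)
```

A pasted block of text would be passed in as `text`, and the sorted list could then be rendered in whatever table or chart the interface uses.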

@baimafeima
Author

Yes, exactly. I think Syng would be a great choice for that, especially since Hanzifreq is a terminal-based program without a suitable frontend for it.

@sotch-pr35mac
Owner

I wouldn’t be able to include the actual hanzifreq script but I would definitely be able to build a tool that does something similar. My question is: would we want just character frequency or word frequency?

@baimafeima
Author

> My question is: would we want just character frequency or word frequency?

I think character frequency would be the feature I would most often use. How would you approach word frequency?

@sotch-pr35mac
Owner

First the text would be tokenized into words, and then the frequency of each resulting word would be counted.
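
The tokenize-then-count approach might look roughly like the sketch below. The thread does not say which tokenizer would be used, so this swaps in forward maximum matching against a tiny made-up lexicon purely for illustration; `LEXICON`, `tokenize`, and `word_frequencies` are hypothetical names, and a real tool would need a full dictionary or a proper segmentation library.

```python
from collections import Counter

# Hypothetical mini-lexicon, for illustration only.
LEXICON = {"我", "爱", "学习", "中文", "很", "有意思", "也"}


def tokenize(text: str, lexicon: set[str] = LEXICON) -> list[str]:
    """Segment text by forward maximum matching.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, emit the single character as its own token.
    """
    max_len = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def word_frequencies(text: str) -> list[tuple[str, int]]:
    """Count tokenized words, most frequent first, skipping punctuation and spaces."""
    counts = Counter(tok for tok in tokenize(text) if tok.isalpha())
    return counts.most_common()


if __name__ == "__main__":
    for word, n in word_frequencies("我爱学习中文,中文很有意思。"):
        print(word, n)
```

Greedy matching like this is simple but can mis-segment ambiguous text, which is one reason word frequency is harder to get right than character frequency.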
