Hi! Thanks for your interest in contributing to PyThaiNLP.
Please refer to our Contributor Covenant Code of Conduct.
- Discussion: https://github.com/PyThaiNLP/pythainlp/discussions
- GitHub issues (for problems and suggestions): https://github.com/PyThaiNLP/pythainlp/issues
- Facebook group (not specific to PyThaiNLP, for Thai NLP discussion in general): https://www.facebook.com/groups/thainlp
- Follow PEP8, use black with
--line-length
= 79; - Name identifiers (variables, classes, functions, module names) with meaningful
and pronounceable names (
x
is always wrong);- Please follow this naming convention. For example, global constant variables must be in
ALL_CAPS
;
- Please follow this naming convention. For example, global constant variables must be in
- Write tests for your new features. The test suite is in
tests/
directory. (see "Testing" section below); - Run all tests before pushing (just execute
tox
) so you will know if your changes broke something; - Commented out codes are dead codes;
- All
#TODO
comments should be turned into issues in GitHub; - When appropriate, use f-string
(use
f"{a} = {b}"
, instead of"{} = {}".format(a, b)
and"%s = %s' % (a, b)"
); - All text files, including source codes, must end with one empty line. This is to please git and to keep up with POSIX standard.
- We use Git as our version control system, so it may be a good idea to familiarize yourself with it.
- You can start with the Pro Git book (free!).
- How to Write a Git Commit Message
- Commit Verbs 101: why I like to use this and why you should also like it.
- We use the famous gitflow to manage our branches.
- When you create pull requests on GitHub, GitHub Actions will run tests and several checks automatically. Click the "Details" link at the end of each check to see what needs to be fixed.
- We use Sphinx to generate API document automatically from "docstring" comments in source codes. This means the comment section in the source codes is important for the quality of documentation.
- A docstring should start with one summary line, end with one line with a full stop (period), then be followed by a blank line before starting a new paragraph.
- A commit to release branches (e.g.
2.2
,2.1
) with a title "(build and deploy docs)" (without quotes) will trigger the system to rebuild the documentation files and upload them to the website https://pythainlp.org/docs.
We use standard Python unittest
. The test suite is in tests/
directory.
To run unit tests locally together with code coverage test:
(from main pythainlp/
directory)
coverage run -m unittest tests.core
See code coverage test:
coverage report
Generate code coverage test in HTML
(files will be available in htmlcov/
directory):
coverage html
Make sure the tests pass on GitHub Actions.
See more in tests/README.md
-
We use semantic versioning: MAJOR.MINOR.PATCH, with development build suffix: MAJOR.MINOR.PATCH-devBUILD
-
We use
bumpversion
to manage versioning.bumpversion [major|minor|patch|release|build]
- Example:
#current_version = 2.3.3-dev0 bumpversion build #current_version = 2.3.3-dev1 bumpversion build #current_version = 2.3.3-dev2 bumpversion release #current_version = 2.3.3-beta0 bumpversion release #current_version = 2.3.3 bumpversion patch #current_version = 2.3.6-dev0 bumpversion minor #current_version = 2.3.1-dev0 bumpversion build #current_version = 2.3.1-dev1 bumpversion major #current_version = 3.0.0-dev0 bumpversion release #current_version = 3.0.0-beta0 bumpversion release #current_version = 3.0.0
-
Read the full how to cut a new release documentation.
Thanks to all contributors. (Image made with contributors-img)
- Wannaphong Phatthiyaphaibun wannaphong@yahoo.com - foundation, distribution and maintenance
- Korakot Chaovavanich - initial tokenization and soundex codes
- Charin Polpanumas - classification and benchmarking
- Arthit Suriyawongkul - CI/infrastructure, documentation, refactoring and distribution
- Lalita Lowphansirikul - documentation
- Pattarawat Chormai - benchmarking
- Peerat Limkonchotiwat
- Thanathip Suntorntip - nlpO3 maintenance, Rust developer
- Can Udomcharoenchaikit - documentation and codes
- Arthit Suriyawongkul
- Wannaphong Phatthiyaphaibun
- Peeradej Tanruangporn - documentation
- [Maximum Matching] -- Manabu Sassano. Deterministic Word Segmentation Using Maximum Matching with Fully Lexicalized Rules. Retrieved from http://www.aclweb.org/anthology/E14-4016
- [MetaSound] -- Snae & Brückner. (2009). Novel Phonetic Name Matching Algorithm with a Statistical Ontology for Analysing Names Given in Accordance with Thai Astrology. Retrieved from https://pdfs.semanticscholar.org/3983/963e87ddc6dfdbb291099aa3927a0e3e4ea6.pdf
- [Thai Character Cluster] -- T. Teeramunkong, V. Sornlertlamvanich, T. Tanhermhong and W. Chinnan, “Character cluster based Thai information retrieval,” in IRAL '00 Proceedings of the fifth international workshop on on Information retrieval with Asian languages, 2000.
- [Enhanced Thai Character Cluster] -- Jeeragone Inrut, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.
- เพ็ญศิริ ลี้ตระกูล. การเลือกประโยคสำคัญในการสรุปความภาษาไทย โดยใช้แบบจำลองแบบลำดับชั้น (Selection of Important Sentences in Thai Text Summarization Using a Hierarchical Model). Retrieved from http://digi.library.tu.ac.th/thesis/st/0192/
- [Thai Discourse Treebank] -- Ponrawee Prasertsom, Apiwat Jaroonpol, Attapol T. Rutherford; The Thai Discourse Treebank: Annotating and Classifying Thai Discourse Connectives. Transactions of the Association for Computational Linguistics 2024; 12 613–629. doi: https://doi.org/10.1162/tacl_a_00650