Initial commit
KOLANICH committed Oct 15, 2023
0 parents commit b8fc74c
Showing 17 changed files with 808 additions and 0 deletions.
1 change: 1 addition & 0 deletions .ci/aptPackagesToInstall.txt
@@ -0,0 +1 @@
python3-cffi
5 changes: 5 additions & 0 deletions .ci/beforeBuild.sh
@@ -0,0 +1,5 @@
#!/usr/bin/env sh

python3 -m WordSplitAbs download WolfGarbe_libs;
mkdir -p ./SymSpell.FrequencyDictionary;
wget -O ./SymSpell.FrequencyDictionary/en-80k.txt https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-80k.txt;
7 changes: 7 additions & 0 deletions .ci/pythonPackagesToInstallFromGit.txt
@@ -0,0 +1,7 @@
https://github.com/grantjenks/python-wordsegment
https://github.com/keredson/wordninja

https://github.com/pythonnet/clr-loader
https://github.com/pythonnet/pythonnet

https://github.com/tomerfiliba/plumbum
12 changes: 12 additions & 0 deletions .editorconfig
@@ -0,0 +1,12 @@
root = true

[*]
charset = utf-8
indent_style = tab
indent_size = 4
insert_final_newline = true
end_of_line = lf

[*.{yml,yaml}]
indent_style = space
indent_size = 2
1 change: 1 addition & 0 deletions .github/.templateMarker
@@ -0,0 +1 @@
KOLANICH/python_project_boilerplate.py
8 changes: 8 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,8 @@
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "daily"
    allow:
      - dependency-type: "all"
15 changes: 15 additions & 0 deletions .github/workflows/CI.yml
@@ -0,0 +1,15 @@
name: CI
on:
  push:
    branches: [master]
  pull_request:
    branches: [master]

jobs:
  build:
    runs-on: ubuntu-22.04
    steps:
      - name: typical python workflow
        uses: KOLANICH-GHActions/typical-python-workflow@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
23 changes: 23 additions & 0 deletions .gitignore
@@ -0,0 +1,23 @@

# pycparser, which is a dependency of pythonnet, sometimes generates these files
lextab.py
yacctab.py

# Dicts
/SymSpell.FrequencyDictionary

# Dlls of the tool written in C#
*.dll
*.dll.so

__pycache__
*.pyc
*.pyo
/*.egg-info
*.srctrlbm
*.srctrldb
build
dist
.eggs
monkeytype.sqlite3
/.ipynb_checkpoints
51 changes: 51 additions & 0 deletions .gitlab-ci.yml
@@ -0,0 +1,51 @@
image: registry.gitlab.com/kolanich-subgroups/docker-images/fixed_python:latest

variables:
  DOCKER_DRIVER: overlay2
  SAST_ANALYZER_IMAGE_TAG: latest
  SAST_DISABLE_DIND: "true"
  SAST_CONFIDENCE_LEVEL: 5
  CODECLIMATE_VERSION: latest

include:
  - template: SAST.gitlab-ci.yml
  - template: Code-Quality.gitlab-ci.yml
  - template: License-Management.gitlab-ci.yml

build:
  tags:
    - shared
    - linux
  stage: build
  variables:
    GIT_DEPTH: "1"
    PYTHONUSERBASE: ${CI_PROJECT_DIR}/python_user_packages

  before_script:
    - export PATH="$PATH:$PYTHONUSERBASE/bin" # don't move into `variables`
    - apt-get update
    # todo:
    #- apt-get -y install
    #- pip3 install --upgrade
    #- python3 ./fix_python_modules_paths.py

  script:
    - python3 -m build -nw bdist_wheel
    - mv ./dist/*.whl ./dist/WordSplitAbs-0.CI-py3-none-any.whl
    - pip3 install --upgrade ./dist/*.whl
    - coverage run --source=WordSplitAbs -m --branch pytest --junitxml=./rspec.xml ./tests/test.py
    - coverage report -m
    - coverage xml

  coverage: "/^TOTAL(?:\\s+\\d+){4}\\s+(\\d+%).+/"

  cache:
    paths:
      - $PYTHONUSERBASE

  artifacts:
    paths:
      - dist
    reports:
      junit: ./rspec.xml
      cobertura: ./coverage.xml
1 change: 1 addition & 0 deletions Code_Of_Conduct.md
@@ -0,0 +1 @@
No codes of conduct!
4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,4 @@
include UNLICENSE
include *.md
include tests
include .editorconfig
38 changes: 38 additions & 0 deletions ReadMe.md
@@ -0,0 +1,38 @@
WordSplitAbs.py [![Unlicensed work](https://raw.githubusercontent.com/unlicense/unlicense.org/master/static/favicon.png)](https://unlicense.org/)
===============
~~[wheel (GitLab)](https://gitlab.com/KOLANICH/WordSplitAbs.py/-/jobs/artifacts/master/raw/dist/WordSplitAbs-0.CI-py3-none-any.whl?job=build)~~
[wheel (GHA via `nightly.link`)](https://nightly.link/KOLANICH-libs/WordSplitAbs.py/workflows/CI/master/WordSplitAbs-0.CI-py3-none-any.whl)
~~![GitLab Build Status](https://gitlab.com/KOLANICH/WordSplitAbs.py/badges/master/pipeline.svg)~~
~~![GitLab Coverage](https://gitlab.com/KOLANICH/WordSplitAbs.py/badges/master/coverage.svg)~~
~~[![GitHub Actions](https://github.com/KOLANICH-libs/WordSplitAbs.py/workflows/CI/badge.svg)](https://github.com/KOLANICH-libs/WordSplitAbs.py/actions/)~~
[![Libraries.io Status](https://img.shields.io/librariesio/github/KOLANICH-libs/WordSplitAbs.py.svg)](https://libraries.io/github/KOLANICH-libs/WordSplitAbs.py)
[![Code style: antiflash](https://img.shields.io/badge/code%20style-antiflash-FFF.svg)](https://codeberg.org/KOLANICH-tools/antiflash.py)

This is an abstraction layer over Python libraries that split strings of words joined without delimiters back into separate words.

This task is often called `word tokenization`, but the two differ slightly: `tokenization` applies when words are not delimited in the language itself (as in some East Asian languages), whereas `splitting` applies when words are normally delimited but the delimiters have been lost.


Tutorial
--------

```python

from WordSplitAbs import ChosenWordSplitter

s = ChosenWordSplitter()  # A resource-intensive step: most splitters load a corpus or a partially preprocessed model here and build a usable model from it, so call it as rarely as possible.

print(s("wordsplittingisinferenceofconcatenatedwordsboundaries")) # "word splitting is inference of concatenated words boundaries"
```
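
Since constructing a splitter is the expensive step, one way to make sure it runs only once per process is to hide construction behind a cached factory. The sketch below is self-contained and does not use this library: `SlowSplitter` is a hypothetical stand-in for `ChosenWordSplitter` (its vocabulary and greedy matching are assumptions made purely for illustration), and `get_splitter` is a helper name invented here.

```python
from functools import lru_cache


class SlowSplitter:
    """Hypothetical stand-in for a splitter that is expensive to construct."""

    constructions = 0  # counts how many times the costly setup ran

    def __init__(self):
        type(self).constructions += 1
        # A real backend would load a corpus or a model here.
        self.vocabulary = {"hello", "world"}

    def __call__(self, text: str) -> str:
        # Toy greedy longest-match split, for illustration only.
        words, start = [], 0
        while start < len(text):
            for end in range(len(text), start, -1):
                if text[start:end] in self.vocabulary:
                    break
            else:
                end = start + 1  # no known word starts here: emit one character
            words.append(text[start:end])
            start = end
        return " ".join(words)


@lru_cache(maxsize=None)
def get_splitter() -> SlowSplitter:
    """Build the splitter on first use and return the same object afterwards."""
    return SlowSplitter()


if __name__ == "__main__":
    print(get_splitter()("helloworld"))
    print(get_splitter()("worldhello"))
    print(SlowSplitter.constructions)  # number of constructions so far
```

Every caller then goes through `get_splitter()` instead of instantiating the class directly, so the setup cost is paid at most once.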

Backends
--------

| Backend | Has default corpus | Deps | Model | Quality | Notes |
|---------|--------------------|------|-------|---------|-------|
| [instant_segment](https://github.com/InstantDomain/instant-segment) || | Unigram + bigram | Recommended | A Rust rewrite of `wordsegment` with a large performance boost |
| [wordsegment](https://github.com/grantjenks/python-wordsegment) | ✔️ | | Unigram + bigram | Recommended | |
| [WordSegmentationDP](https://github.com/wolfgarbe/WordSegmentationDP) || [pythonnet](https://github.com/pythonnet/pythonnet) + [`WordSegmentationDP.dll`](https://github.com/KOLANICH-libs/WordSplitAbs.py/files/7161469/WordSegmentationAndSymSpell.zip) + [Corpus file](https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-80k.txt)| Unigram + Bayes | Recommended | |
| [WordSegmentationTM](https://github.com/wolfgarbe/WordSegmentationTM) || [pythonnet](https://github.com/pythonnet/pythonnet) + [`WordSegmentationTM.dll`](https://github.com/KOLANICH-libs/WordSplitAbs.py/files/7161469/WordSegmentationAndSymSpell.zip) + [Corpus file](https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-80k.txt)| Unigram + Bayes | Recommended | |
| [SymSpell](https://github.com/wolfgarbe/SymSpell) || [pythonnet](https://github.com/pythonnet/pythonnet) + [`SymSpell.dll`](https://github.com/KOLANICH-libs/WordSplitAbs.py/files/7161469/WordSegmentationAndSymSpell.zip) + [Corpus file](https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-80k.txt)| Unigram + Bigram | Not recommended, fails to split elementary phrases | |
| [wordninja](https://github.com/keredson/wordninja) | ✔️ | | Unigram order | Not the best quality | |
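
To make the `Model` column more concrete, here is a minimal sketch of unigram-based splitting by dynamic programming. It is not the code of any backend listed above: the tiny `COUNTS` table (with made-up numbers), the unknown-word penalty, and the function names are all assumptions chosen for illustration.

```python
import math

# Toy word counts (made-up numbers); real backends load a large corpus.
COUNTS = {
    "word": 50, "words": 30, "splitting": 5, "split": 10, "is": 200,
    "in": 180, "inference": 4, "of": 300, "a": 400,
}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = max(map(len, COUNTS))


def log_prob(word: str) -> float:
    """Unigram log-probability, with a length-based penalty for unknown words."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def split(text: str) -> list:
    """Return the segmentation of `text` with the highest total log-probability."""
    # best[i] = (score, start index of the last word) for the best split of text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the string to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    print(" ".join(split("wordsplittingisinferenceofwords")))
```

A bigram model would score each word conditioned on the previous one instead of independently, which is why the unigram + bigram backends need more corpus data.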
24 changes: 24 additions & 0 deletions UNLICENSE
@@ -0,0 +1,24 @@
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org/>