In this competition, we built an AI solution to extract point of interest (POI) names and street names from unformatted Indonesian addresses collected by Shopee. We are happy to share our solution, which ranked 28th out of 1,034 teams in this competition. Please check Kaggle's private leaderboard in this link.
Problem Description
Given a raw_address, the AI model should provide two prediction results, one for POI and one for street. POI and street should be separated with a "/" character without any spaces in between. There are cases where the POI/street elements in the raw_address are not complete. In these cases, the model also needs to predict the complete words before returning the result.
id | raw_address | POI/street |
---|---|---|
1 | karang mulia bengkel mandiri motor raya bosnik 21 blak kota | bengkel mandiri motor/raya bosnik |
2 | primkob pabri adiwerna | primkob pabri/ |
3 | jalan mh thamrin, sei rengas i kel. medan kota | /jalan mh thamrin |
4 | smk karya pemban, pon | smk karya pembangunan/pon |
Explanation:
- The POI is "bengkel mandiri motor" and the street name is "raya bosnik". The returned POI/street should be "bengkel mandiri motor/raya bosnik".
- The POI is "primkob pabri" and no street name is found. The returned POI/street should be "primkob pabri/".
- No POI is found and the street name is "jalan mh thamrin". The returned POI/street should be "/jalan mh thamrin".
- The word "pembangunan" in the raw_address "smk karya pemban, pon" is not complete. The correct POI is "smk karya pembangunan" and the returned result should be "smk karya pembangunan/pon".
Data cleaning:
- Drop rows whose POI/street contains a dot. These are rare (0.3% of the data) and can add noise to the model.
- Clean raw_address: collapse multiple whitespace, remove dots, correct punctuation, and remove brackets.
Please check Data-Cleaning.ipynb for the implementation.
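The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code from Data-Cleaning.ipynb: the function names `keep_row` and `clean_raw_address` are hypothetical, and the exact rules (for example, replacing dots and brackets with a space and normalizing spacing around commas) are assumptions about what "correct punctuation" means here.

```python
import re


def keep_row(poi_street: str) -> bool:
    """Keep only rows whose POI/street label contains no dot (assumed rule)."""
    return "." not in poi_street


def clean_raw_address(raw_address: str) -> str:
    """Hypothetical cleaner; the rules are assumptions, not the notebook's code."""
    text = raw_address
    # Remove brackets but keep the text inside them.
    text = re.sub(r"[()\[\]{}]", " ", text)
    # Remove dots.
    text = text.replace(".", " ")
    # Normalize commas: no space before, exactly one space after.
    text = re.sub(r"\s*,\s*", ", ", text)
    # Collapse repeated whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text


if __name__ == "__main__":
    print(clean_raw_address("jalan mh thamrin,  sei rengas i kel. medan kota"))
```

Replacing dots with a space rather than deleting them keeps tokens such as "kel." and the following word separated; whether the original notebook does the same is not confirmed.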
- Utilize a probabilistic model to repair incomplete words in the raw address. The model uses the frequency of n-gram transformations observed in the training data.
  Examples of n-gram transformation frequencies:
  - transform_occurency["cak"] = {'cakung': 15, 'cakruk': 1, "cake's": 1, 'cakery': 1, 'cakrad': 1, 'cakrab': 1}
  - transform_occurency["taman mer"] = {'taman meruya': 2}
  In the examples above, the word "cak" is transformed into "cakung" 15 times in the training data. For more accurate frequency information, we also use bigram, 3-gram, and 4-gram transform_occurency information.
Please check Data-Formatting.ipynb for the implementation.
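To make the frequency-based repair concrete, here is a small sketch under stated assumptions. It is not the code from Data-Formatting.ipynb: the helper names `build_transform_occurency` and `repair`, the input format (pairs of observed and completed n-grams), and the greedy "longest n-gram first, most frequent completion wins" rule are all illustrative choices.

```python
from collections import Counter, defaultdict


def build_transform_occurency(pairs):
    """Count how often each observed n-gram was completed to each target.

    `pairs` is an iterable of (observed_ngram, completed_ngram) strings,
    an assumed input format for this sketch.
    """
    table = defaultdict(Counter)
    for observed, completed in pairs:
        table[observed][completed] += 1
    return table


def repair(raw_address, table, max_n=4):
    """Greedily replace known n-grams with their most frequent completion."""
    tokens = raw_address.split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first, since longer context is less ambiguous.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in table:
                best, _count = table[ngram].most_common(1)[0]
                out.append(best)
                i += n
                break
        else:
            # No known transformation starts at this token; keep it unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    pairs = [("cak", "cakung")] * 15 + [("cak", "cakruk")]
    pairs += [("taman mer", "taman meruya")] * 2
    table = build_transform_occurency(pairs)
    print(repair("taman mer indah", table))
    print(repair("jalan cak 5", table))
```

A real system would likely apply a replacement only when the top completion is frequent enough relative to the alternatives; that threshold is left out here to keep the sketch short.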
- Treat POI and street as entities. Frame the problem as named entity recognition (NER), i.e. extract entities (POI and street) from texts (raw_address).
- Construct train and test data with BIO tags for the custom NER model.
Split the train data into train and validation sets, and use the test data for submission. Generate BIO tags for building the custom named entity recognition (NER) model:
python3 create_train_label.py # create train and validation data
python3 create_test_label.py # create test data
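As an illustration of what BIO tagging means for this task, the sketch below labels each whitespace token of an address. It is not taken from create_train_label.py: the function names and the tag names `POI` and `STREET` are assumptions, and the sketch assumes the address has already been cleaned and repaired so that the entity tokens appear verbatim in the address.

```python
def find_span(tokens, entity_tokens):
    """Return the start index of entity_tokens inside tokens, or -1 if absent."""
    n = len(entity_tokens)
    if n == 0:
        return -1
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == entity_tokens:
            return start
    return -1


def bio_tags(raw_address, poi, street):
    """Tag each token as B-/I- for POI or STREET, or O for everything else."""
    tokens = raw_address.split()
    tags = ["O"] * len(tokens)
    for label, entity in (("POI", poi), ("STREET", street)):
        entity_tokens = entity.split()
        start = find_span(tokens, entity_tokens)
        if start < 0:
            continue  # entity is empty or not found verbatim
        tags[start] = f"B-{label}"
        for k in range(start + 1, start + len(entity_tokens)):
            tags[k] = f"I-{label}"
    return list(zip(tokens, tags))


if __name__ == "__main__":
    address = "karang mulia bengkel mandiri motor raya bosnik 21 blak kota"
    for token, tag in bio_tags(address, "bengkel mandiri motor", "raya bosnik"):
        print(token, tag)
```

Tokens with attached punctuation (for example "thamrin,") will not match in this simplified version, which is one reason the cleaning step matters before labels are generated.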
Fine-tune and evaluate the IndoBERT model to build the custom NER model:
python3 train.py # fine-tune NER model
python3 eval.py # generate csv for submission
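Once the model predicts a BIO tag per token, the tags have to be turned back into the "POI/street" submission string. The following sketch shows one way to do that; it is an assumption about the post-processing rather than the logic in eval.py, and the names `first_span`, `to_submission`, `POI`, and `STREET` are hypothetical.

```python
def first_span(tokens, tags, label):
    """Return the first contiguous B-/I- span for `label` as a string."""
    span = []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            if span:
                break  # a second entity starts; keep only the first one
            span = [token]
        elif tag == f"I-{label}" and span:
            span.append(token)
        elif span:
            break  # the current span has ended
    return " ".join(span)


def to_submission(tokens, tags):
    """Join the predicted POI and street with '/' and no surrounding spaces."""
    return first_span(tokens, tags, "POI") + "/" + first_span(tokens, tags, "STREET")


if __name__ == "__main__":
    tokens = "primkob pabri adiwerna".split()
    tags = ["B-POI", "I-POI", "O"]
    print(to_submission(tokens, tags))
```

When no span is found for a label, the corresponding side of the "/" is left empty, which matches the expected output format described in the problem statement.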
Preparing Environment
Before replicating the result, please prepare the experiment environment. We ran our experiments in Docker, starting from the huggingface/transformers-pytorch-gpu:3.4.0 image. You can pull the image with this command:
docker pull huggingface/transformers-pytorch-gpu:3.4.0
After starting a container from the image, please install the required libraries:
bash install.sh
(Epoch 16) TRAIN LOSS:0.0020 ACC:1.00 F1:1.00 REC:1.00 PRE:1.00 LR:0.00000500
(Epoch 16) VALID LOSS:0.1394 ACC:0.98 F1:0.94 REC:0.94 PRE:0.94
save model checkpoint at models/bert-large/32_128_3e-05/
(Epoch 17) TRAIN LOSS:0.0019 ACC:1.00 F1:1.00 REC:1.00 PRE:1.00 LR:0.00000500
(Epoch 17) VALID LOSS:0.1440 ACC:0.98 F1:0.94 REC:0.94 PRE:0.94
save model checkpoint at models/bert-large/32_128_3e-05/
Thanks for reading :) Don't hesitate to contact me, mhilmiasyrofi(at)gmail(dot)com, if you need further assistance to replicate the result!!!