
Could we improve the final score by encoding each word and masking numeric entries before classification, rather than classifying at the character level? #11

Open
shreeshiv opened this issue May 3, 2020 · 1 comment

Comments

@shreeshiv

I wonder, could we improve the final score on task 3 by encoding each word, masking numeric entries, and then classifying, rather than classifying at the character level?

@patrick22414
Collaborator

Thank you @shreeshiv! Constructing a dictionary is indeed a valid approach and, I believe, common practice in NLP. And yes, there is a solid chance that it would improve performance. However, it also has disadvantages: we would not be able to recognize a word outside the constructed dictionary, and it moves more of the heavy lifting into the encoding step.

In our case, we thought it very likely that non-dictionary words, such as abbreviations, shop names, or menu entries, would appear in the test set. Characters, on the other hand, are easy to encode, can handle new words, and have yielded satisfactory results.

However, I do encourage you to explore a word-based approach if you would like!
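
For anyone who wants to try it, here is a minimal sketch of the trade-off discussed above. It is not code from this repository: the helper names (`encode_chars`, `build_word_vocab`, `encode_words`, `is_numeric`), the `<unk>`/`<num>` tokens, and the sample strings are all hypothetical and made up for illustration. It assumes whitespace tokenization and treats any token containing a digit as a numeric entry to be masked.

```python
import string

# Assumed character vocabulary: printable ASCII, with index 0 reserved
# for characters outside the vocabulary.
CHAR_VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}


def encode_chars(text):
    """Character-level encoding: one integer per character."""
    return [CHAR_VOCAB.get(ch, 0) for ch in text]


def is_numeric(token):
    """Assumption: any token containing a digit counts as a numeric entry."""
    return any(c.isdigit() for c in token)


def build_word_vocab(texts):
    """Word-level dictionary built from training text only."""
    vocab = {"<unk>": 0, "<num>": 1}
    for text in texts:
        for token in text.lower().split():
            if not is_numeric(token):
                vocab.setdefault(token, len(vocab))
    return vocab


def encode_words(text, vocab):
    """Word-level encoding: numeric tokens are masked, unseen words map to <unk>."""
    ids = []
    for token in text.lower().split():
        if is_numeric(token):
            ids.append(vocab["<num>"])
        else:
            ids.append(vocab.get(token, vocab["<unk>"]))
    return ids


# Made-up sample lines, not taken from any dataset.
train_lines = ["TOTAL 12.50", "CASH 20.00"]
vocab = build_word_vocab(train_lines)

test_line = "STARBUX TOTAL 9.90"  # "STARBUX" never appears in the training lines
word_ids = encode_words(test_line, vocab)
char_ids = encode_chars(test_line)

print(word_ids)  # the unseen shop name collapses to the <unk> id
print(char_ids)  # every character still gets its own id
```

In this sketch the unseen shop name is reduced to a single `<unk>` id at the word level, so whatever distinguishes it is lost, while the character-level encoding keeps every character. That is the out-of-dictionary concern in miniature; a word-based model would need some way around it, for example a larger dictionary or a subword scheme.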
