Strip accents #2

Open
lukavdplas opened this issue May 16, 2023 · 0 comments

Roughly speaking, the data includes 3 types of diacritics:

  1. lexical diacritics (café, übermensch)
  2. emphatic diacritics (ík, échte), which may or may not match spelling standards.
  3. mistakes (ǹpo, éoscarssowhite)

As for handling these:

  1. Should be kept in. Preserving these takes priority over any handling of (2) and (3).
  2. Should probably be stripped, since they are prosody markers. That said, if t-scan does not strip accents, it will treat "echte" and "échte" as different lemmas with different frequencies. This would stand out more if "échte" is absent from the frequency table.
  3. Should be removed. However, note that for the examples above, the correct versions would be "npo" (accent stripped) and "#oscarssowhite" or "oscarssowhite". Restoring the "#" is not really feasible to automate, and the second option means removing the entire character rather than just its accent. Due to their nature, these cases should be quite rare.
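To illustrate what naive stripping does to each type, here is a minimal Python sketch using only the standard library. The function name `strip_accents` is an assumption for illustration and not something that exists in this project. It decomposes each character and drops the combining marks, so it treats all three types the same way.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD).

    Hypothetical helper: it cannot tell lexical, emphatic and
    mistaken diacritics apart, so it strips all of them.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so the result is in a consistent normal form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for word in ["café", "échte", "ǹpo", "éoscarssowhite"]:
        print(word, "->", strip_accents(word))
```

This handles "échte" and "ǹpo" as intended, but it also flattens "café", and it turns "éoscarssowhite" into "eoscarssowhite" rather than either of the correct forms.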

My initial suggestion was that this could be handled by lemmatising the word and then cross-referencing the lemma with a vocabulary like the ANW. If the word appears in the ANW with an accent, it is a lexical accent that should be preserved.

However, this only really works if you are also tagging named entities, since those a) do not appear in a dictionary, b) are especially likely to contain accents, and c) should have their accents preserved.
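A rough Python sketch of that lookup, under loud assumptions: `LEXICAL_VOCAB` is a tiny hypothetical stand-in for a vocabulary like the ANW, `NAMED_ENTITIES` stands in for the output of a named entity tagger (the entry in it is purely illustrative), and the lemmatiser is left out, so the caller passes a lemma if it has one. None of these names come from the project.

```python
import unicodedata
from typing import Optional


def _nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# Hypothetical stand-ins, not real resources:
# a vocabulary of accented lemmas and a set of tagged named entities.
LEXICAL_VOCAB = {_nfc(w) for w in ["café", "übermensch"]}
NAMED_ENTITIES = {_nfc(w) for w in ["beyoncé"]}


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return _nfc(stripped)


def normalise_token(token: str, lemma: Optional[str] = None) -> str:
    """Keep accents on known lexical items and named entities, strip the rest.

    `lemma` is assumed to come from an external lemmatiser; if it is
    missing, the token itself is used as the lookup key.
    """
    key = _nfc(lemma or token).lower()
    if key in LEXICAL_VOCAB or key in NAMED_ENTITIES:
        return token
    return strip_accents(token)


if __name__ == "__main__":
    for word in ["café", "échte", "ík", "Beyoncé"]:
        print(word, "->", normalise_token(word))
```

Without the named entity set, a name with an accent would be stripped like an emphatic accent, which is the problem described above.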

If you do this, it would be best to factor it out into a separate service that t-scan can also use, so that both tools treat emphatic accents the same way.

All in all, this method is not impossible, but probably not worth the effort.
