Strip accents #2

lukavdplas · 2023-05-16T10:04:14Z

Roughly speaking, the data includes 3 types of diacritics:

1. lexical diacritics (café, übermensch)
2. emphatic diacritics (ík, échte), which may or may not match spelling standards.
3. mistakes (ǹpo, éoscarssowhite)

As for handling these:

1. Should be kept in. Preserving these takes priority over any handling of (2) and (3).
2. Should probably be stripped, since they are prosody markers. That said, if t-scan does not strip accents, it will treat "echte" and "échte" as different lemmas with different frequencies. This would stand out more if "échte" is absent from the frequency table.
3. Should be removed. However, note that for the examples above, the correct version would be "npo" (accent stripped) and "#oscarssowhite" / "oscarssowhite" (the first one is not really feasible to automate, the second one removes the entire character). Due to their nature, these cases should be quite rare.
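For reference, blanket accent stripping can be sketched as below. This is a minimal illustration using Python's standard `unicodedata` module, not existing project code; the function name `strip_accents` is made up for the example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (accents, umlauts) from every character."""
    # NFD splits a precomposed character such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual precomposed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("échte"))           # echte
print(strip_accents("ǹpo"))             # npo
print(strip_accents("café"))            # cafe
print(strip_accents("éoscarssowhite"))  # eoscarssowhite
```

Note that this treats all three types alike: it also turns "café" into "cafe", and it yields "eoscarssowhite" rather than "oscarssowhite", since it only removes the accent and not the character.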

My initial suggestion was that this could be handled by lemmatising the word and then cross-referencing with a vocabulary like the ANW. If the word appears in the ANW with an accent, it is a lexical accent that should be preserved.

However, this only really works if you are also tagging named entities, since those a) do not appear in a dictionary, b) are especially likely to contain accents, and c) should have their accents preserved.
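A rough sketch of what that lookup could look like, under several assumptions: the lemma is supplied by some lemmatiser, named entities come from some tagger, and a small set of strings stands in for the ANW. All names here (`normalise_token`, `vocabulary`, `named_entities`) are hypothetical, and no real ANW or tagger API is used.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks from every character."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def normalise_token(token, lemma, vocabulary, named_entities=frozenset()):
    """Keep accents on named entities and lexical accents; strip the rest.

    `lemma` is assumed to come from a lemmatiser, `vocabulary` is a toy
    stand-in for a dictionary such as the ANW, and `named_entities` is
    assumed to come from a named entity tagger.
    """
    # Named entities are not in a dictionary, so they are checked first.
    if token in named_entities:
        return token
    # The accent is lexical if the accented lemma itself is a dictionary entry.
    has_accent = lemma != strip_accents(lemma)
    if has_accent and lemma in vocabulary:
        return token
    # Otherwise treat the accent as emphatic or as a mistake.
    return strip_accents(token)


vocabulary = {"café", "echt"}  # toy data, not the real ANW
print(normalise_token("cafés", "café", vocabulary))                    # cafés
print(normalise_token("échte", "écht", vocabulary))                    # echte
print(normalise_token("René", "René", vocabulary, {"René"}))           # René
print(normalise_token("René", "René", vocabulary))                     # Rene
```

The last call shows the problem described above: without named entity tagging, the accent on a name is stripped because the name is not in the vocabulary.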

If you do this, it would be best to factor it out into a separate service that t-scan can also use, in order to avoid the discrepancy for emphatic accents.

All in all, this method is not impossible, but probably not worth the effort.
