Roughly speaking, the data includes three types of diacritics:

1. lexical diacritics (café, übermensch)
2. emphatic diacritics (ík, échte), which may or may not conform to spelling standards
3. mistakes (ǹpo, éoscarssowhite)
As for handling these:

1. These should be kept in. Preserving them takes priority over any handling of (2) and (3).
2. These should probably be stripped, since they are prosody markers. That said, if t-scan does not strip accents, it will treat "echte" and "échte" as different lemmas with different frequencies. The discrepancy would stand out even more if "échte" is absent from the frequency table altogether.
3. These should be removed. Note, however, that for the examples above, the correct versions would be "npo" (accent stripped) and "#oscarssowhite" / "oscarssowhite" (the first is not really feasible to automate; the second removes the entire character rather than just the accent). Given their nature, these cases should be quite rare.
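For reference, the basic stripping operation for cases (2) and (3) can be sketched with the standard library alone: decompose each character, drop the combining marks, and recompose. This is a generic Unicode approach, not anything t-scan specific.

```python
import unicodedata

def strip_diacritics(token: str) -> str:
    """Remove accents by NFD-decomposing and dropping combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("échte"))  # echte
print(strip_diacritics("ǹpo"))    # npo
```

Note that this strips lexical accents too ("café" becomes "cafe"), which is exactly why blanket stripping conflicts with case (1) and a smarter filter is needed.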
My initial suggestion was that this could be handled by lemmatising the word and then cross-referencing it with a vocabulary like the ANW. If the word appears in the ANW with an accent, it is a lexical accent that should be preserved.
However, this only really works if you are also tagging named entities, since those a) do not appear in a dictionary, b) are especially likely to contain accents, and c) should have their accents preserved.
If you do this, it would be best to factor it out into a separate service that can also be utilised by t-scan, in order to avoid the discrepancy for emphatic accents.
All in all, this method is not impossible, but probably not worth the effort.
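To make the suggested heuristic concrete, here is a minimal sketch. The lexicon and named-entity sets are hypothetical stand-ins: a real pipeline would query the ANW and an NER tagger, neither of which is shown here.

```python
import unicodedata

def strip_diacritics(token: str) -> str:
    """Remove accents by NFD-decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

# Hypothetical stand-ins for the real resources:
ACCENTED_LEXICON = {"café", "übermensch"}  # lemmas listed with accents in the ANW
NAMED_ENTITIES = {"Curaçao"}               # tokens tagged as named entities

def normalise(token: str, lemma: str) -> str:
    """Keep accents for lexical entries and named entities; strip otherwise."""
    if token in NAMED_ENTITIES or lemma in ACCENTED_LEXICON:
        return token
    return strip_diacritics(token)

print(normalise("échte", "echt"))  # emphatic accent stripped -> "echte"
print(normalise("café", "café"))   # lexical accent kept -> "café"
```

The sketch makes the dependency explicit: without reliable NER output, any accented named entity falls through to the stripping branch, which is the failure mode described above.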