Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding `end_char` and `start_char` fields
#1361
Comments
As you have surmised, this was an intentional breaking change. English was actually handled differently from almost every other language where multiple syntactic "words" are written as a single "token". In general they are labeled as MWT (multi-word tokens), such as in Spanish, where the direct and indirect object pronouns can be attached to certain forms of verbs. In the case of English, there are a few classes of words which fit that category: possessives …
So the first thing you can do is do your processing on the words instead of the tokens of a sentence, such as …
If you're using the json output format, the MWT are always marked as having an
As you point out, this is missing the character positions on the words. This is because, in some languages, the tokenization standard is to rewrite the word pieces to match the actual word, so we'd have …
There's another really annoying issue, which is that the NER training can be either MWT or words (generally words), whereas for some reason the NER processor uses the MWT instead of the words. As a result, it doesn't always correctly label possessives. I should mark that as another TODO.
If there are other items which would make this more compatible with your previous workflow, please let us know.
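For readers working from the dict/JSON output, here is a minimal sketch of the "process the words, skip the MWT" idea. The sentence data is hand-written for illustration (not captured from a real Stanza run), and `is_mwt` and `words_only` are hypothetical helper names. The one assumption taken from this thread is that an MWT entry carries an `id` that is a list of word ids rather than a single number.

```python
# Hand-written example data, shaped like the per-sentence output described in
# this thread. It is an assumption for illustration, not real Stanza output.
sentence = [
    {"id": [1, 2], "text": "Joe's", "start_char": 0, "end_char": 5},  # MWT entry
    {"id": 1, "text": "Joe"},
    {"id": 2, "text": "'s"},
    {"id": 3, "text": "dog", "start_char": 6, "end_char": 9},
    {"id": 4, "text": ".", "start_char": 9, "end_char": 10},
]


def is_mwt(entry):
    """Treat an entry as an MWT when its id is a list/tuple of several word ids."""
    return isinstance(entry["id"], (list, tuple)) and len(entry["id"]) > 1


def words_only(entries):
    """Drop the MWT entries and keep the syntactic words."""
    return [entry for entry in entries if not is_mwt(entry)]


words = words_only(sentence)
print([word["text"] for word in words])
```

On a live document, the analogous object-level approach would be to iterate `sentence.words` rather than `sentence.tokens`.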
Thank you for your prompt response @AngledLuffa! Adding Since you asked, our biggest architectural dependency with Stanza is that we rely on there being a one-to-one mapping of Universal Dependencies to leaf nodes of the Constituency Parse. The two inputs are mapped as linked objects in our system. E.g. so currently the word
and in the constituency tree, to two leaf nodes:
and this one-to-one relationship is linked in our system as objects: a given token can fetch its constituent node, and vice versa. So if "can't" becomes one MWT token, our system would break unless the constituency tree also maps "can't" as a single leaf node. Thankfully it looks like I can still maintain this relationship by skipping over MWT tokens, as the original tokens still have this one-to-one mapping in Stanza 1.8.1. Returning the …
Thanks again for the prompt response!
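A rough sketch of the one-to-one linking described in this comment, skipping MWT entries before pairing words with constituency leaves. Everything here is illustrative: the bracketed tree and the word list for "I can't go." are hand-written rather than taken from Stanza, and `tree_leaves` and `link_words_to_leaves` are hypothetical helpers. It also assumes an MWT entry can be recognized by an `id` that is a list.

```python
import re

# Illustrative data only: a hand-written bracketed tree and word list.
tree = "(ROOT (S (NP (PRP I)) (VP (MD ca) (RB n't) (VP (VB go))) (. .)))"
entries = [
    {"id": 1, "text": "I"},
    {"id": [2, 3], "text": "can't"},  # MWT entry, skipped below
    {"id": 2, "text": "ca"},
    {"id": 3, "text": "n't"},
    {"id": 4, "text": "go"},
    {"id": 5, "text": "."},
]

# A preterminal looks like "(TAG word)" with no nested parentheses inside it.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def tree_leaves(bracketed):
    """Return the leaf words of a bracketed tree, left to right."""
    return [match.group(2) for match in PRETERMINAL.finditer(bracketed)]


def link_words_to_leaves(entries, bracketed):
    """Pair each non-MWT word id with its leaf; fail loudly if counts differ."""
    words = [e for e in entries if not isinstance(e["id"], (list, tuple))]
    leaves = tree_leaves(bracketed)
    if len(words) != len(leaves):
        raise ValueError("words and leaves are not one-to-one")
    return [(word["id"], leaf) for word, leaf in zip(words, leaves)]


links = link_words_to_leaves(entries, tree)
print(links)
```

Raising on a length mismatch, rather than silently truncating with `zip`, makes a broken mapping visible as soon as the upstream output changes.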
…start_char and end_char on it. Note that there will still be no start_char and end_char annotations on words if the words don't add up to the token's text, so even in a language like English where the standard is to annotate the datasets so that they correspond to the pieces of the real text instead of the word being represented, there may be unusual separations in the MWT processor. #1361
…start_char and end_char on it. Note that there will still be no start_char and end_char annotations on words if the words don't add up to the token's text, so even in a language like English where the standard is to annotate the datasets so that they correspond to the pieces of the real text instead of the word being represented, there may be unusual separations in the MWT processor that result in no start/end char Fix a unit test error #1361
Alright, I separated the Word start & end chars for situations where the pieces add up to the surrounding Token (the Token, again, being the MWT representation). I should emphasize that there may be cases in English where the pieces do not actually add up to the full Token, in which case there won't be a start & end char. If'n you come across those and it isn't properly tokenizing them, we can take a look. The change is currently in the dev branch.
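The rule described above (words get their own offsets only when the pieces add up to the token's text) can be sketched roughly as follows. This is not Stanza's actual implementation: `split_word_offsets` is a hypothetical helper, and the inputs are illustrative examples.

```python
def split_word_offsets(token_text, token_start, pieces):
    """Assign (start_char, end_char) to each piece of a multi-word token.

    Returns None when the pieces do not concatenate to the token text,
    since there is then no obvious character span for each piece.
    """
    if "".join(pieces) != token_text:
        return None
    offsets = []
    cursor = token_start
    for piece in pieces:
        offsets.append((cursor, cursor + len(piece)))
        cursor += len(piece)
    return offsets


# The pieces add up to the token, so each word gets its own span.
print(split_word_offsets("Joe's", 0, ["Joe", "'s"]))

# French "du" is analysed as "de" + "le"; the pieces are rewritten,
# so no per-word spans can be assigned.
print(split_word_offsets("du", 10, ["de", "le"]))
```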
@AngledLuffa Just built a version of Stanza from the …
@AngledLuffa Do you know when the next release of Stanza will be so I can leapfrog to that one?
Depends on if any show-stopping bugs show up, I suppose. Probably a couple months if nothing critical comes up.
I'd actually prefer to leave this open until I figure out what to do with NER tags, btw.
This is now part of the 1.8.2 release.
Thanks @AngledLuffa, we have now migrated to Stanza 1.8.2!
Describe the bug
I'm updating Stanza from 1.6.1 to 1.7.x / 1.8.x and noticed a number of breaking API changes in the Stanza Token result when handling possessives.
To Reproduce
Input: `Joe's dog.`
Stanza now includes a new additional token that I'll call an "aggregate token" with the `text` field `Joe's`. This new aggregate token comes in addition to the tokens for `Joe` and `'s`. The new aggregate token returns an `id` with a list pointing to the other two tokens.
This breaks the one-to-one mapping that used to exist between tokens and word elements within the s-expression returned by the constituency tree.
But more problematically, this new aggregate token is now the only token containing the `end_char` and `start_char` data about the word.
In addition to being a breaking change, this new approach is quite hard for application developers to work with. To parse it they need to chase down the ID links of the aggregate token, whenever it intermittently appears, to map its linguistic data. Moreover, important information about where the character boundary between a word and its apostrophe falls is lost.
Expected behavior
For a possessive like `Joe's dog.`, Stanza returns four dependency tokens as before in Stanza 1.6.1.
Or, if a fifth aggregate token with an array of ids continues to be returned, the non-aggregate child tokens at least retain their own `end_char` and `start_char` information as before. This would at least allow developers to ignore these aggregate tokens, and preserve information about the character delineation between each token.
Environment (please complete the following information):