x/text/language: Match does not work as expected for artificial languages #45749
Comments
@bep Does this diff change the results?
  func TestMatchArtificialLanguages(t *testing.T) {
      // See https://en.wikipedia.org/wiki/Codes_for_constructed_languages
      klingon, err := language.Parse("art-x-klingon")
      if err != nil {
          t.Fatal(err)
      }
      elvish, err := language.Parse("art-x-elvish")
      if err != nil {
          t.Fatal(err)
      }
+     tags := []language.Tag{klingon, elvish}
-     matcher := language.NewMatcher([]language.Tag{klingon, elvish})
+     matcher := language.NewMatcher(tags)
-     m, _, _ := matcher.Match(elvish)
+     _, i, _ := matcher.Match(elvish)
+     m := tags[i]
      if m.String() != "art-x-elvish" {
          t.Errorf("got %s; want art-x-elvish", m.String())
      }
  }
(I am curious if this is related to #24211 (comment)) |
No. It still returns the first index. |
cc @mpvl |
👋 Hello, I'm no expert, but I've been poking around in /x/text recently, so I thought I would look into this. I think the root cause is related to the private use tags.
🏷️ Go is not sorting properly based on private tags
Here is an example where non-artificial languages have the same issue.
This returns:
✏️ Go should sort based on private tags
💻 The code looks like it inspects private tags... but it doesn't
✏️ I'm working on a fix to both return the private use tags properly and lower the confidence level when they do not match, which should fix this issue. |
Change https://golang.org/cl/364855 mentions this issue: |
The use of private use tags is more or less deprecated. The official tag for Klingon is tlh and for Elvish, if I’m not mistaken, qya. I can imagine preprocessors to normalize user input tags, but I don’t think this normalization should be applied to tags passed to the normalizer by the developer, for instance. The matcher does not, nor does it aim to, implement RFC 4647. Instead, the algorithm is an enhancement of a matcher used internally at Google, which is generally believed to give more reliable results. I no longer work at Google, but I doubt the move away from -x support will have been reverted. |
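The preprocessing idea mentioned above could look roughly like the following. This is only a minimal sketch using the standard library, not part of x/text: the `privateUseAliases` table and the `normalizeUserTag` helper are hypothetical names, and the mapping itself is an assumption that an application would have to maintain for its own users. It rewrites known private use tags from user input to the registered codes named in the comment (tlh, qya) and leaves everything else untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// privateUseAliases is a hypothetical, hand-maintained table that maps
// private use tags seen in user input to registered language codes.
var privateUseAliases = map[string]string{
	"art-x-klingon": "tlh", // Klingon
	"art-x-elvish":  "qya", // Quenya
}

// normalizeUserTag returns the registered code for a known private use tag.
// The lookup ignores case and surrounding whitespace. Unknown tags are
// returned unchanged so they can still be parsed and matched as usual.
func normalizeUserTag(tag string) string {
	key := strings.ToLower(strings.TrimSpace(tag))
	if official, ok := privateUseAliases[key]; ok {
		return official
	}
	return tag
}

func main() {
	for _, in := range []string{"art-x-klingon", "ART-X-Elvish", "en-US"} {
		fmt.Printf("%s -> %s\n", in, normalizeUserTag(in))
	}
}
```

The normalized string would then be handed to `language.Parse` and the matcher. In line with the comment above, such a step would apply to user input only, not to the tags the developer registers as supported.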
Note, the compliance test set is in The two are mostly the same, but there are some notable differences. The Go algorithm uses rule based matching, instead of the scoring algorithm used by Google's algorithm. This has the advantage that it
Someone on the internationalization team at Google could verify if the Golden set is up to date. But I doubt support for I would not, however, modify the algorithm based on isolated use cases. There are very specific reasons why it is the way it is, with years of tweaking and testing based on data-driven design. |
As another note. The OP mentions the desire to support This already shows the problem of the use of private use tags: their usage is not stable and changes over time. This is one of the reasons why there was a move away from supporting private use tags. Note that many artificial languages have their own language code. Here is a list: I can imagine supporting private use tags, though, if it does not interact with the rest of tag matching and if thorough consideration is given to the implications of canonicalization. But, as the use of -x is largely deprecated, the impact of considering them for matching regular languages will have to be tested against a large body of user data. My suspicion is that including this information in matches generally will make things worse, not better. This means that supporting matching on private use tags should probably be limited to specifically supporting |
Those were just two random examples. My original problem is the handling of "unknown language tags", but you can also easily make the argument for unofficial artificial languages (like Foozzy, a new language I just made up). |
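For tags the matcher does not distinguish, one possible application-side workaround is to look for an exact match among the supported tags first and only fall back to the matcher when there is none. The sketch below is an assumption about how a caller might do that with plain strings; `exactIndex` is a hypothetical helper, not an x/text API, and the tag list is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// exactIndex returns the index of the first supported tag equal to want,
// or -1 if there is none. The comparison ignores case because BCP 47
// tags are case-insensitive.
func exactIndex(supported []string, want string) int {
	for i, s := range supported {
		if strings.EqualFold(s, want) {
			return i
		}
	}
	return -1
}

func main() {
	// Illustrative list of supported tags, including a made-up language.
	supported := []string{"art-x-klingon", "art-x-elvish", "art-x-foozzy"}
	for _, want := range []string{"art-x-elvish", "ART-x-Foozzy", "art-x-unknown"} {
		fmt.Println(want, exactIndex(supported, want))
	}
}
```

When `exactIndex` returns -1, the caller would pass the request to `matcher.Match` as before. This keeps private use handling out of the matching algorithm itself, at the cost of skipping any canonicalization the matcher would otherwise apply.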
😮 Wow, thank you so much for all of this information @mpvl . It has been very hard to discover all of this. (Or maybe there are some docs I am missing?) What I am taking from this is:
🙏 Thanks again! |
The following test fails with
got art-x-klingon; want art-x-elvish.
nicksnyder/go-i18n#252