Unicode character problem #47

Open
SombraRO opened this issue May 12, 2018 · 3 comments

Comments

@SombraRO

Every message that uses the character ç next to another non-ASCII character comes through with strange characters in its place.

Encoding setting: UTF-8

çã shows up as зг (ção becomes згo)
çõ shows up as уш

This can only be reproduced when the message is sent from IRC to Discord, and only when the IRC side is not using UTF-8.

reactiflux/discord-irc#399

@Throne3d

Throne3d commented May 12, 2018

The specific issue seems to be that çã and çõ, in the windows-1252 encoding, are being detected as windows-1251 and IBM855, respectively, and so are interpreted as зг and уш. The context this problem came up in was attempting to convert IRC messages from various encodings into UTF-8, so they can be bridged to Discord.

Example strings:

  • eu não gosto de diferenciação (in the windows-1252 encoding), erroneously detected as windows-1251 and accordingly interpreted as eu nгo gosto de diferenciaзгo (notice "não" → "nгo" and "ção" → "згo")
  • informações (in the windows-1252 encoding), erroneously detected as IBM855 and accordingly interpreted as informaушes
  • ça me fait rire (in the windows-1252 encoding), correctly detected as windows-1252
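The misreadings above can be reproduced without any detector, by encoding the text as windows-1252 and decoding the same bytes with the wrongly guessed code page. This is a minimal sketch in Python rather than the JavaScript the bridge uses; the helper name `reinterpret` is made up for illustration, and `cp1252`, `cp1251` and `cp855` are Python's codec names for windows-1252, windows-1251 and IBM855.

```python
def reinterpret(text: str, actual: str, assumed: str) -> str:
    """Encode `text` with its real encoding, then decode the bytes as `assumed`.

    This mimics a bridge that trusts a wrong charset guess.
    """
    return text.encode(actual).decode(assumed)


if __name__ == "__main__":
    # ç is byte 0xE7 and ã is byte 0xE3 in windows-1252;
    # windows-1251 maps those same bytes to з and г.
    print(reinterpret("eu não gosto de diferenciação", "cp1252", "cp1251"))
    # prints: eu nгo gosto de diferenciaзгo

    # õ is byte 0xF5; IBM855 maps 0xE7 and 0xF5 to у and ш.
    print(reinterpret("informações", "cp1252", "cp855"))
    # prints: informaушes
```

Every byte sequence here is valid in all three single-byte code pages, which is why a detector has only statistics to go on when choosing between them.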

Since this is likely to be due to conflicting possible encodings, it might be hard to come up with code that distinguishes these situations? The sample languages above are Portuguese and French.

(This issue actually crops up in https://github.com/Throne3d/node-irc, which https://github.com/reactiflux/discord-irc depends on. It uses "jschardet": "^1.6.0" in its dependencies, currently resolved to 1.6.0 in version 0.9.0.)
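On the question of whether code can tell these cases apart: one possible workaround, which is not what jschardet or node-irc does, is to skip detection whenever the bytes are valid UTF-8 and otherwise fall back to a single configured encoding. The sketch below is in Python for brevity; `decode_irc_line` and its `fallback` parameter are hypothetical names, and choosing windows-1252 as the fallback is an assumption about the channel's users, not something the code can infer.

```python
def decode_irc_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode an incoming line: strict UTF-8 first, then a configured fallback."""
    try:
        # Strict decoding raises on any malformed sequence. Non-ASCII text in a
        # single-byte encoding is rarely valid UTF-8 by accident.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Hypothetical fallback; bytes undefined in the code page become U+FFFD.
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    print(decode_irc_line("kyllä".encode("utf-8")))         # UTF-8 sender
    print(decode_irc_line("informações".encode("cp1252")))  # windows-1252 sender
```

This does not distinguish windows-1252 from windows-1251; it only moves that decision from a statistical guess to an explicit setting.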

@redfellow

redfellow commented May 23, 2018

This affects Finnish as well, in the same context the commenter above describes.

A few examples that don't bug out:

  • ei välttämättä
  • vin-vin sitsyeissön
  • lämpötilakin nousee vaan vaikka iv puhisee
  • testää

And a few more that do:

  • niin kai sitä vois -> niin kai sitä vois
  • meniskö sittenkin seiskaan vasta -> meniskรถ sittenkin seiskaan vasta
  • mä en ota riskiä että tää selkä pahenee -> mה en ota riskiה ettה tהה selkה pahenee
  • testä testä
  • ätest -> ätest
  • äätest -> äätest
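For pairs like these, a small brute-force search can suggest which encoding the bytes were sent in and which one they were read as. This is a diagnostic sketch in Python, not part of jschardet; the function name `explain_mojibake` and the candidate list are assumptions chosen for illustration (`cp874` and `cp1255` are Python's codecs for the Thai and Hebrew Windows code pages).

```python
from itertools import product

# Assumed candidate set; extend it with whatever the detector can return.
CANDIDATES = ["utf-8", "cp1252", "cp1251", "cp855", "cp874", "cp1255"]


def explain_mojibake(original: str, garbled: str, encodings=CANDIDATES):
    """Return (sent_as, read_as) pairs that turn `original` into `garbled`."""
    matches = []
    for sent_as, read_as in product(encodings, repeat=2):
        if sent_as == read_as:
            continue
        try:
            if original.encode(sent_as).decode(read_as) == garbled:
                matches.append((sent_as, read_as))
        except UnicodeError:
            # The text is not encodable, or the bytes are not decodable, here.
            continue
    return matches


if __name__ == "__main__":
    print(explain_mojibake("meniskö", "meniskรถ"))
    print(explain_mojibake("mä", "mה"))
```

With this candidate list, the first pair should match only UTF-8 bytes read as a Thai code page and the second only windows-1252 bytes read as a Hebrew code page, which would mean the two messages were not sent in the same encoding. That reading is an inference from the characters shown, not something stated in the thread.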

@Mikaela

Mikaela commented May 30, 2018

I am also coming from reactiflux/discord-irc#399 and would like to add the test string kyllä ("yes" in Finnish), which turns into kyllä. All the users on my instance are using UTF-8. I am sure of this because I have disabled support for every encoding other than UTF-8 in my clients (WeeChat: I have unloaded the charset plugin; ZNC: I have selected "send and parse UTF-8 only everywhere").

EDIT: fixed reactiflux/discord-irc#399 (comment)
