EUC-JP wrongly detected in this case that contains german umlaut #29

Open
bpasero opened this issue Mar 29, 2017 · 4 comments
Contributor

bpasero commented Mar 29, 2017

The following file is detected as EUC-JP even though it is not. This seems to be caused by a single ü in the file.

File: QuietLight.tmTheme.txt
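
To illustrate why a single ü could plausibly look like a Japanese character, here is a minimal sketch. It assumes the file is UTF-8 encoded, which this thread does not state; `isEucDoubleByteRange` is a hypothetical helper, not part of jschardet. It only checks that both bytes of ü fall in the 0xA1..0xFE range used for two-byte EUC-JP (and EUC-KR) characters.

```javascript
// Assumption: the attached file is UTF-8; the thread does not confirm this.
// '\u00fc' is the character ü.
const bytes = Array.from(Buffer.from('\u00fc', 'utf8'));

// Hypothetical helper: true if a byte lies in the range used for both the
// lead and trail byte of a two-byte EUC character (0xA1..0xFE).
function isEucDoubleByteRange(byte) {
  return byte >= 0xa1 && byte <= 0xfe;
}

// Hex form of the encoded bytes, for display.
const hex = bytes.map((b) => b.toString(16).toUpperCase());

// A two-byte sequence with both bytes in range is structurally valid EUC.
const looksLikeEucPair =
  bytes.length === 2 && bytes.every(isEucDoubleByteRange);

console.log(hex, looksLikeEucPair);
```

If that assumption holds, the two bytes of ü are indistinguishable, at the byte level, from one two-byte EUC character, so a prober has only statistics to go on.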

Owner

aadsm commented Mar 30, 2017

Yeah, this is really tricky. Encoding detection relies on heuristics and is not deterministic in most cases, which is why it will never be 100% reliable.
Also, the smaller the text, the worse the result, because there is not enough data to analyze statistically, as you can see in #30.

jschardet.detect returns the encoding with the best confidence, but you can set jschardet.Constants._debug = true; to see the confidence of all the other encodings. Can you check which other encodings it detects?

Contributor Author

bpasero commented Mar 30, 2017

@aadsm here is the output:

EUC-TW prober hit error at byte 207
UTF-8 confidence = 0.505
SHIFT_JIS confidence = 0.01
EUC-JP confidence = 0.99
GB2312 confidence = 0
EUC-KR confidence = 0.99
Big5 confidence = 0
EUC-TW not active
UTF-8 confidence = 0.505
SHIFT_JIS confidence = 0.01
EUC-JP confidence = 0.99
GB2312 confidence = 0
EUC-KR confidence = 0.99
Big5 confidence = 0
EUC-TW not active
EUC-JP confidence 0.99
windows-1251 confidence = 0.01
KOI8-R confidence = 0.01
ISO-8859-5 confidence = 0
MacCyrillic confidence = 0.01
IBM866 confidence = 0.01
IBM855 confidence = 0.01
ISO-8859-7 confidence = 0
windows-1253 confidence = 0
ISO-8859-5 confidence = 0
windows-1251 confidence = 0.01
ISO-8859-2 confidence = 0.8511313029424628
windows-1250 confidence = 0.8511313029424628
TIS-620 confidence = 0
windows-1255 confidence = 0
windows-1255 confidence = 0.01
windows-1255 confidence = 0.01
windows-1251 confidence = 0.01
KOI8-R confidence = 0.01
ISO-8859-5 confidence = 0
MacCyrillic confidence = 0.01
IBM866 confidence = 0.01
IBM855 confidence = 0.01
ISO-8859-7 confidence = 0
windows-1253 confidence = 0
ISO-8859-5 confidence = 0
windows-1251 confidence = 0.01
ISO-8859-2 confidence = 0.8511313029424628
windows-1250 confidence = 0.8511313029424628
TIS-620 confidence = 0
windows-1255 confidence = 0
windows-1255 confidence = 0.01
windows-1255 confidence = 0.01
ISO-8859-2 confidence 0.8511313029424628
windows-1252 confidence 0.95
UTF-8 confidence = 0.505
SHIFT_JIS confidence = 0.01
EUC-JP confidence = 0.99
GB2312 confidence = 0
EUC-KR confidence = 0.99
Big5 confidence = 0
EUC-TW not active

{ encoding: 'EUC-JP', confidence: 0.99 }
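
For readers following along, the final result can be reproduced from the logged numbers with a small sketch. The confidences below are copied from the debug output above; `pickBest` is a hypothetical helper, and the "first candidate wins a tie" rule is an assumption that happens to match the logged result, not a confirmed description of jschardet's internals.

```javascript
// Confidences copied from the debug output, in the order they were logged.
const candidates = [
  { encoding: 'UTF-8', confidence: 0.505 },
  { encoding: 'SHIFT_JIS', confidence: 0.01 },
  { encoding: 'EUC-JP', confidence: 0.99 },
  { encoding: 'GB2312', confidence: 0 },
  { encoding: 'EUC-KR', confidence: 0.99 },
  { encoding: 'Big5', confidence: 0 },
  { encoding: 'ISO-8859-2', confidence: 0.8511313029424628 },
  { encoding: 'windows-1252', confidence: 0.95 },
];

// Hypothetical helper: keep the first candidate with the highest confidence.
// The strict ">" means an earlier candidate wins a tie (an assumption).
function pickBest(list) {
  let best = null;
  for (const candidate of list) {
    if (best === null || candidate.confidence > best.confidence) {
      best = candidate;
    }
  }
  return best;
}

// Rank a copy by descending confidence to see the runners-up.
const ranked = [...candidates].sort((a, b) => b.confidence - a.confidence);

console.log(pickBest(candidates));
console.log(ranked.slice(0, 3).map((c) => c.encoding));
```

Read this way, windows-1252 at 0.95 is a close runner-up, and EUC-JP and EUC-KR are tied at 0.99.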


workflo commented Jul 13, 2017

AFAIK Visual Studio Code uses jschardet, and some of us users are experiencing the very same problem in VS Code: microsoft/vscode#4891

Contributor Author

bpasero commented Jul 13, 2017

I was reporting this on behalf of VS Code.

aadsm added a commit that referenced this issue Jul 16, 2017
This has been fixed in https://github.com/chardet/chardet at chardet/chardet@c0f1ab5 and in the original source at https://bugzilla.mozilla.org/show_bug.cgi?id=306272

This doesn't completely solve the problem in #29, because the fix only raises the threshold for "sure detection" to 3 frequent characters found. However, it brings jschardet to parity with the original chardet.
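
A rough sketch of what such a threshold could look like, assuming a character-distribution confidence of the kind chardet-style probers use. The function name, its parameters, the `<=` comparison and the ratio value 3.0 are all assumptions for illustration; this is not the code from the referenced commit.

```javascript
const SURE_YES = 0.99;
const SURE_NO = 0.01;

// Hypothetical sketch of a character-distribution confidence.
//   totalChars:   multi-byte characters seen so far
//   freqChars:    how many of them belong to the "frequent" set
//   minFreqChars: frequent characters required before trusting the result
//   typicalRatio: language-specific constant (placeholder value below)
function distributionConfidence(totalChars, freqChars, minFreqChars, typicalRatio) {
  // Too little evidence: refuse to be confident.
  if (totalChars <= 0 || freqChars <= minFreqChars) {
    return SURE_NO;
  }
  if (totalChars !== freqChars) {
    const ratio = freqChars / ((totalChars - freqChars) * typicalRatio);
    if (ratio < SURE_YES) {
      return ratio;
    }
  }
  // Every character seen was a frequent one (or the ratio is very high).
  return SURE_YES;
}

// One frequent character, no minimum: reported as a sure match.
console.log(distributionConfidence(1, 1, 0, 3.0));
// Same input with a minimum of 3: no longer trusted.
console.log(distributionConfidence(1, 1, 3, 3.0));
// Four frequent characters still clear the raised threshold.
console.log(distributionConfidence(4, 4, 3, 3.0));
```

Under these assumptions a single stray character no longer produces a sure match, but a handful of them in a short file still could, which is why the fix is only partial.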