EUC-JP wrongly detected in this case that contains german umlaut #29

bpasero · 2017-03-29T13:36:54Z

The following file detects as EUC-JP even though it is not. Seems to be caused by a single ü inside that file.
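To illustrate why a single ü can be enough to mislead a multi-byte prober, here is a minimal sketch. It assumes the file is UTF-8 encoded (the report does not say so explicitly) and uses a hypothetical helper, `isEucJpTwoByteRange`, that is not part of jschardet. In UTF-8, ü is stored as the two bytes 0xC3 0xBC, and both fall inside the 0xA1-0xFE range that EUC-JP uses for each byte of a two-byte character, so the pair is also a structurally valid EUC-JP sequence.

```javascript
// Two-byte EUC-JP characters (JIS X 0208) use 0xA1-0xFE for both bytes.
// Hypothetical helper for illustration; not a jschardet function.
function isEucJpTwoByteRange(byte) {
  return byte >= 0xa1 && byte <= 0xfe;
}

// TextEncoder always encodes to UTF-8. "\u00FC" is the letter u-umlaut.
const bytes = Array.from(new TextEncoder().encode("\u00FC"));

// True when the UTF-8 bytes could also be read as one EUC-JP two-byte character.
const looksLikeEucJpPair =
  bytes.length === 2 && bytes.every(isEucJpTwoByteRange);

console.log(bytes.map((b) => "0x" + b.toString(16).toUpperCase()).join(" "));
console.log("fits EUC-JP two-byte range:", looksLikeEucJpPair);
```

With only one such character in the file, a byte-validity check alone cannot tell the two encodings apart, which is why the detector falls back on statistics.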


aadsm · 2017-03-30T15:47:59Z

Yeah, this is really tricky. In most cases the encoding cannot be determined with certainty from the bytes alone, so detection relies on heuristics. This is why it will never be 100% reliable.
Also, the smaller the text, the worse the result, because there is not enough data to analyze statistically, as you can see in #30.

jschardet.detect returns the encoding with the best confidence, but you can set jschardet.Constants._debug = true; to see the confidence of all the other encodings. Can you check which other encodings it detects?

bpasero · 2017-03-30T16:25:48Z

@aadsm here is the output:

EUC-TW prober hit error at byte 207

UTF-8 confidence = 0.505

SHIFT_JIS confidence = 0.01

EUC-JP confidence = 0.99

GB2312 confidence = 0

EUC-KR confidence = 0.99

Big5 confidence = 0

EUC-TW not active

UTF-8 confidence = 0.505

SHIFT_JIS confidence = 0.01

EUC-JP confidence = 0.99

GB2312 confidence = 0

EUC-KR confidence = 0.99

Big5 confidence = 0

EUC-TW not active

EUC-JP confidence 0.99
windows-1251 confidence = 0.01

KOI8-R confidence = 0.01

ISO-8859-5 confidence = 0

MacCyrillic confidence = 0.01

IBM866 confidence = 0.01

IBM855 confidence = 0.01

ISO-8859-7 confidence = 0

windows-1253 confidence = 0

ISO-8859-5 confidence = 0

windows-1251 confidence = 0.01

ISO-8859-2 confidence = 0.8511313029424628

windows-1250 confidence = 0.8511313029424628

TIS-620 confidence = 0

windows-1255 confidence = 0

windows-1255 confidence = 0.01

windows-1255 confidence = 0.01

windows-1251 confidence = 0.01

KOI8-R confidence = 0.01

ISO-8859-5 confidence = 0

MacCyrillic confidence = 0.01

IBM866 confidence = 0.01

IBM855 confidence = 0.01

ISO-8859-7 confidence = 0

windows-1253 confidence = 0

ISO-8859-5 confidence = 0

windows-1251 confidence = 0.01

ISO-8859-2 confidence = 0.8511313029424628

windows-1250 confidence = 0.8511313029424628

TIS-620 confidence = 0

windows-1255 confidence = 0

windows-1255 confidence = 0.01

windows-1255 confidence = 0.01

ISO-8859-2 confidence 0.8511313029424628
windows-1252 confidence 0.95
UTF-8 confidence = 0.505

SHIFT_JIS confidence = 0.01

EUC-JP confidence = 0.99

GB2312 confidence = 0

EUC-KR confidence = 0.99

Big5 confidence = 0

EUC-TW not active

{ encoding: 'EUC-JP', confidence: 0.99 }
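The UTF-8 confidence of 0.505 in this output is consistent with a file that contains exactly one multi-byte UTF-8 character. The sketch below reproduces that number under an assumption: the constants (`ONE_CHAR_PROB = 0.5`, a cutoff of 6 characters, a base of 0.99) are recalled from chardet's UTF-8 prober and have not been checked against the jschardet version used here. The function name `utf8Confidence` is hypothetical.

```javascript
// Assumed constant, recalled from chardet's UTF-8 prober.
const ONE_CHAR_PROB = 0.5;

// Hypothetical re-implementation of the confidence heuristic: every valid
// multi-byte character halves the chance that the text is NOT UTF-8.
function utf8Confidence(numMultiByteChars) {
  let unlike = 0.99;
  if (numMultiByteChars < 6) {
    unlike *= Math.pow(ONE_CHAR_PROB, numMultiByteChars);
    return 1 - unlike;
  }
  // With 6 or more multi-byte characters the prober is considered sure.
  return unlike;
}

for (const n of [0, 1, 2, 6]) {
  console.log(n, utf8Confidence(n).toFixed(3));
}
```

For one multi-byte character this yields approximately 0.505, which loses against the 0.99 reported by the EUC-JP and EUC-KR probers.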

workflo · 2017-07-13T15:15:09Z

AFAIK Visual Studio Code uses jschardet, and some of us users are experiencing the very same problem in VS Code: microsoft/vscode#4891

bpasero · 2017-07-13T15:59:10Z

I was reporting this on behalf of VS Code.

This has been fixed in https://github.com/chardet/chardet at chardet/chardet@c0f1ab5 and in the original source at https://bugzilla.mozilla.org/show_bug.cgi?id=306272. This doesn't solve the problem in #29 completely, because the fix only raises the limit for "sure detection" to 3 frequent characters found. However, it brings the behavior to parity with the original chardet.
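A rough sketch of what such a limit does, modeled on the character-distribution check as I recall it from chardet; it is not the actual jschardet code. The function `distributionConfidence`, its parameters, the ratio of 3.0, and the "before" limit of 0 are all assumptions made for illustration.

```javascript
const SURE_YES = 0.99;
const SURE_NO = 0.01;

// Hypothetical model of a distribution-based confidence:
// totalChars   = multi-byte characters seen
// freqChars    = how many of those are among the encoding's frequent characters
// typicalRatio = expected frequent/non-frequent ratio (assumed value)
// minFreqChars = the "sure detection" limit discussed above
function distributionConfidence(totalChars, freqChars, typicalRatio, minFreqChars) {
  // Too little data: refuse to be confident.
  if (totalChars <= 0 || freqChars <= minFreqChars) return SURE_NO;
  if (totalChars !== freqChars) {
    const r = freqChars / ((totalChars - freqChars) * typicalRatio);
    if (r < SURE_YES) return r;
  }
  return SURE_YES;
}

// One character that happens to count as "frequent", assumed old limit of 0.
console.log(distributionConfidence(1, 1, 3.0, 0)); // 0.99
// Same input with a limit of 3.
console.log(distributionConfidence(1, 1, 3.0, 3)); // 0.01
// Four frequent characters still pass a limit of 3.
console.log(distributionConfidence(4, 4, 3.0, 3)); // 0.99
```

The last call shows why the fix is only partial: a file with a handful of non-ASCII characters can still exceed the limit and be reported with 0.99 confidence.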

This was referenced Mar 29, 2017

Encoding autoguess - weird case microsoft/vscode#23508

Closed

Enable files.autoGuessEncoding by default microsoft/vscode#23417

Closed

Auto guess encoding microsoft/vscode#21416

Merged
