encoding issue in test_bodies.csv #5
Hi all. In general, the bytes-to-unicode interface in Python is really annoying, but there has been a bunch of really good stuff written about it. I'm going to comment on this issue with two functions I've written previously to handle it, plus a link to a good talk by a Python core dev.

def normalize_to_unicode(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, str):
            text = text.decode(encoding)
        assert isinstance(text, unicode)
        return text
    else:
        if isinstance(text, bytes):
            text = text.decode(encoding)
        assert isinstance(text, str)
        return text

def normalize_to_bytes(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, unicode):
            text = text.encode(encoding)
        assert isinstance(text, str)
        return text
    else:
        if isinstance(text, str):
            text = text.encode(encoding)
        assert isinstance(text, bytes)
        return text

Here is a great video by Ned Batchelder on the topic: https://www.youtube.com/watch?v=sgHbC6udIqc

The main lesson is: convert to unicode as early as possible in your system. I will run through the data to verify this works.
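For example, a quick round trip through those helpers might look like this (a minimal sketch; the byte string is just an illustration, not data from test_bodies.csv):

raw = b"caf\xc3\xa9"                      # UTF-8 encoded bytes for u"café"
text = normalize_to_unicode(raw)          # text type on both Python 2 and 3
round_tripped = normalize_to_bytes(text)  # back to UTF-8 bytes
assert round_tripped == raw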
Here is a set of functions that can convert the files. Important notes: you should always load with the 'rb' flag and convert on loading. I've written a best-practice loading function at the bottom. Alternatively, pandas can usually handle some things well (though it will still break without removing the Windows-encoded characters first).

import os
def normalize_to_unicode(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, str):
            text = unicode(text.decode(encoding))
        return text
    else:
        if isinstance(text, bytes):
            text = str(text.decode(encoding))
        return text

def normalize_to_bytes(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, unicode):
            text = str(text.encode(encoding))
        return text
    else:
        if isinstance(text, str):
            text = bytes(text.encode(encoding))
        return text
def convert_windows1252_to_utf8(text):
    # Re-interpret Windows-1252 bytes as text; the explicit "utf-8" keeps the
    # round trip from falling back to ASCII on Python 2.
    return text.decode("windows-1252").encode("utf-8").decode("utf-8")

def add_newline_to_unicode(text):
    return text + u"\n"

def single_item_process_standard(text):
    text = normalize_to_unicode(text).strip()
    text = add_newline_to_unicode(text)
    return normalize_to_bytes(text)

def single_item_process_funky(text):
    text = convert_windows1252_to_utf8(text).strip()
    text = add_newline_to_unicode(text)
    return normalize_to_bytes(text)
def process(lines):
    out_lines = []
    for line in lines:
        try:
            out_lines.append(single_item_process_standard(line))
        except UnicodeDecodeError:
            out_lines.append(single_item_process_funky(line))
    return out_lines

def open_process_rewrite(filename):
    with open(filename, "rb") as fp:
        lines = fp.readlines()
    lines = process(lines)
    path_part, ext_part = os.path.splitext(filename)
    new_filename = "{}_processed{}".format(path_part, ext_part)
    with open(new_filename, "wb") as fp:
        fp.writelines(lines)

def best_practice_load(filename):
    out = []
    with open(filename, "rb") as fp:
        for line in fp.readlines():
            try:
                out.append(normalize_to_unicode(line).strip())
            except UnicodeDecodeError:
                print("Broken line: {}".format(line))
    return out

def best_practice_load_with_pandas(filename):
    import pandas as pd
    return pd.read_csv(filename)
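Putting it together on the file from this issue would look roughly like the sketch below (open_process_rewrite writes a "_processed" copy next to the original, so that filename follows from the function above):

# Rewrite test_bodies.csv with every line re-encoded as UTF-8, then load the clean copy.
open_process_rewrite("test_bodies.csv")
clean_lines = best_practice_load("test_bodies_processed.csv")
print(len(clean_lines))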
Also, for anyone curious and wanting to learn how to handle unicode better: I sleuthed this issue by first loading the raw text in byte form (with the 'rb' flag), then trying the …
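A minimal sketch of that kind of sleuthing (my guess at the steps, not necessarily the exact calls used): read the file in binary mode and report which lines refuse to decode as UTF-8.

# Read raw bytes and flag every line that is not valid UTF-8.
with open("test_bodies.csv", "rb") as fp:
    for lineno, raw_line in enumerate(fp, start=1):
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError as exc:
            print("Line {} is not valid UTF-8: {}".format(lineno, exc))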
It seems that test_bodies.csv is not encoded entirely in UTF-8. If you open it with an editor and check the bottom of the file (I used Atom), you'll see the missing characters. If you switch to cp1252 or some other encodings, you'll see some characters fixed and others break. What encoding should we use?