
encoding issue in test_bodies.csv #5

Open
bnns opened this issue Jun 1, 2017 · 3 comments

Comments


bnns commented Jun 1, 2017

It seems that test_bodies.csv is not encoded entirely in UTF-8. If you open it with an editor and check the bottom of the file (I used Atom), you'll see the missing characters. If you switch to cp1252 or some other encoding, some characters are fixed and others break. What encoding should we use?
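It might help to pin down exactly which rows are affected first. A minimal sketch (the helper name is mine, not from this repo): read the file as raw bytes and flag every line that fails a strict UTF-8 decode.

```python
def find_non_utf8_lines(filename):
    """Return (line_number, error) pairs for lines that are not valid UTF-8."""
    bad = []
    with open(filename, "rb") as fp:
        for lineno, line in enumerate(fp, start=1):
            try:
                line.decode("utf-8")
            except UnicodeDecodeError as exc:
                bad.append((lineno, exc))
    return bad
```

Running this over test_bodies.csv should pinpoint which rows carry the non-UTF-8 bytes and what the offending byte values are.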


braingineer commented Jun 1, 2017

Hi all,

In general, the bytes-to-unicode interface is really annoying in Python, but a lot of good material has been written about it. Below are two functions I've written previously to handle this, along with a link to a good talk by a Python core dev on the topic.

import sys

def normalize_to_unicode(text, encoding="utf-8"):
    # Decode bytes to text if needed; a no-op if text is already unicode.
    if sys.version_info.major == 2:
        if isinstance(text, str):
            text = text.decode(encoding)
        assert isinstance(text, unicode)
        return text
    else:
        if isinstance(text, bytes):
            text = text.decode(encoding)
        assert isinstance(text, str)
        return text

def normalize_to_bytes(text, encoding="utf-8"):
    # Encode text to bytes if needed; a no-op if text is already bytes.
    if sys.version_info.major == 2:
        if isinstance(text, unicode):
            text = text.encode(encoding)
        assert isinstance(text, str)
        return text
    else:
        if isinstance(text, str):
            text = text.encode(encoding)
        assert isinstance(text, bytes)
        return text

Here is a great video by Ned Batchelder on the topic: https://www.youtube.com/watch?v=sgHbC6udIqc

The main lesson is: convert to unicode as early as possible in your system.
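To illustrate that lesson with the helpers above (condensed here to their Python-3 branches only), the round trip looks like:

```python
def normalize_to_unicode(text, encoding="utf-8"):
    # Python-3-only condensed form of the helper above.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    assert isinstance(text, str)
    return text

def normalize_to_bytes(text, encoding="utf-8"):
    if isinstance(text, str):
        text = text.encode(encoding)
    assert isinstance(text, bytes)
    return text

raw = b"caf\xc3\xa9"                    # UTF-8 bytes as read from disk
text = normalize_to_unicode(raw)        # decode once, at the boundary
assert text == "café"
assert normalize_to_bytes(text) == raw  # encode only when writing back out
```

Everything between the decode at load time and the encode at write time deals purely in unicode text.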

I'll run through the data to verify that this works.


braingineer commented Jun 1, 2017

Here is a set of functions that can convert the files.

Important notes: you should always open with the 'rb' flag and convert while loading. I've written a best-practice loading function at the bottom. Alternatively, pandas can usually handle things well (though it will still break unless the windows-encoded characters are removed).

import os

def normalize_to_unicode(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, str):
            text = unicode(text.decode(encoding))
        return text
    else:
        if isinstance(text, bytes):
            text = str(text.decode(encoding))
        return text

def normalize_to_bytes(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, unicode):
            text = str(text.encode(encoding))
        return text
    else:
        if isinstance(text, str):
            text = bytes(text.encode(encoding))
        return text

def convert_windows1252_to_utf8(text):
    # Decode the windows-1252 bytes, then round-trip through UTF-8.
    # The encoding must be explicit: in Python 2, .encode() with no
    # argument defaults to ASCII and would fail on the curly quotes.
    return text.decode("windows-1252").encode("utf-8").decode("utf-8")

def add_newline_to_unicode(text):
    return text + u"\n"

def single_item_process_standard(text):
    text = normalize_to_unicode(text).strip()
    text = add_newline_to_unicode(text)
    return normalize_to_bytes(text)

def single_item_process_funky(text):
    text = convert_windows1252_to_utf8(text).strip()
    text = add_newline_to_unicode(text)
    return normalize_to_bytes(text)


def process(lines):
    out_lines = []
    for line in lines:
        try:
            out_lines.append(single_item_process_standard(line))
        except UnicodeDecodeError:
            out_lines.append(single_item_process_funky(line))
    return out_lines
            
def open_process_rewrite(filename):
    with open(filename, "rb") as fp:
        lines = fp.readlines()
    lines = process(lines)
    path_part, ext_part = os.path.splitext(filename)
    new_filename = "{}_processed{}".format(path_part, ext_part)
    with open(new_filename, "wb") as fp:
        fp.writelines(lines)

def best_practice_load(filename):
    out = []
    with open(filename, "rb") as fp:
        for line in fp.readlines():
            try:
                out.append(normalize_to_unicode(line).strip())
            except UnicodeDecodeError:
                print("Broken line: {}".format(line))
    return out

def best_practice_load_with_pandas(filename):
    import pandas as pd
    return pd.read_csv(filename)
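pd.read_csv also accepts an explicit encoding keyword, so one alternative is a fallback wrapper (my own sketch, assuming windows-1252 is the right second guess for this file):

```python
import pandas as pd

def load_with_encoding_fallback(filename):
    # Try strict UTF-8 first; if the file contains windows-1252 bytes
    # (like the \x93/\x94 curly quotes), retry with that codec.
    try:
        return pd.read_csv(filename, encoding="utf-8")
    except UnicodeDecodeError:
        return pd.read_csv(filename, encoding="windows-1252")
```

This avoids rewriting the file at all, at the cost of assuming every non-UTF-8 line really is windows-1252.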

@braingineer

Also, for anyone curious and wanting to learn how to handle unicode better: I sleuthed this issue by first loading the raw text in byte form (with the 'rb' flag), then trying the normalize_to_unicode function on all lines. For the lines that broke, I inspected them visually. In byte form, Python displays non-ASCII bytes as \xNN escapes, where NN is a hexadecimal number. The problem had to be those bytes in the lines that broke, so I googled around and found a Stack Overflow post on the tricky \x93 characters.
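To make that concrete, here is a minimal reproduction (a made-up line, not one from the actual file): the \x93/\x94 bytes are windows-1252 curly quotes, and a lone \x93 is never valid UTF-8.

```python
line = b'He said \x93hello\x94 to everyone.\n'

# repr() shows the non-ASCII bytes as \xNN escapes -- this is what the
# broken lines look like when inspected in byte form.
print(repr(line))

try:
    line.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)   # 0x93 is an invalid start byte

# windows-1252 maps 0x93/0x94 to the curly quotes U+201C/U+201D.
print(line.decode("windows-1252").strip())
```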
