encoding issue in test_bodies.csv #5
Hi all. In general, the bytes-to-unicode interface in Python is really annoying, but there has been a bunch of really good stuff written about it. I'm going to comment on this issue with two functions I've written previously to handle it, plus a link to a good talk by a Python core dev.

def normalize_to_unicode(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, str):
            text = text.decode(encoding)
        assert isinstance(text, unicode)
        return text
    else:
        if isinstance(text, bytes):
            text = text.decode(encoding)
        assert isinstance(text, str)
        return text

def normalize_to_bytes(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, unicode):
            text = text.encode(encoding)
        assert isinstance(text, str)
        return text
    else:
        if isinstance(text, str):
            text = text.encode(encoding)
        assert isinstance(text, bytes)
        return text

Here is a great video by Ned Batchelder on the topic: https://www.youtube.com/watch?v=sgHbC6udIqc

The main lesson is: convert to unicode as early as possible in your system. I will run through the data to verify this works.
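For example, a quick round trip through those helpers might look like this (a minimal sketch; the byte string is just an illustration, not data from test_bodies.csv):

raw = b"caf\xc3\xa9"                      # UTF-8 encoded bytes for u"café"
text = normalize_to_unicode(raw)          # text type on both Python 2 and 3
round_tripped = normalize_to_bytes(text)  # back to UTF-8 bytes
assert round_tripped == raw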
Here is a set of functions that can convert the files. Important notes: you should always load with the 'rb' flag and convert on loading. I've written a best-practice loading function at the bottom. Alternatively, pandas can usually handle some things well (though it will still break without removing the Windows-encoded characters first).

import os
def normalize_to_unicode(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, str):
            text = unicode(text.decode(encoding))
        return text
    else:
        if isinstance(text, bytes):
            text = str(text.decode(encoding))
        return text

def normalize_to_bytes(text, encoding="utf-8"):
    import sys
    if sys.version_info.major == 2:
        if isinstance(text, unicode):
            text = str(text.encode(encoding))
        return text
    else:
        if isinstance(text, str):
            text = bytes(text.encode(encoding))
        return text
def convert_windows1252_to_utf8(text):
    # Re-interpret Windows-1252 bytes as text; the explicit "utf-8" keeps the
    # round trip from falling back to ASCII on Python 2.
    return text.decode("windows-1252").encode("utf-8").decode("utf-8")

def add_newline_to_unicode(text):
    return text + u"\n"

def single_item_process_standard(text):
    text = normalize_to_unicode(text).strip()
    text = add_newline_to_unicode(text)
    return normalize_to_bytes(text)

def single_item_process_funky(text):
    text = convert_windows1252_to_utf8(text).strip()
    text = add_newline_to_unicode(text)
    return normalize_to_bytes(text)
def process(lines):
    out_lines = []
    for line in lines:
        try:
            out_lines.append(single_item_process_standard(line))
        except UnicodeDecodeError:
            out_lines.append(single_item_process_funky(line))
    return out_lines

def open_process_rewrite(filename):
    with open(filename, "rb") as fp:
        lines = fp.readlines()
    lines = process(lines)
    path_part, ext_part = os.path.splitext(filename)
    new_filename = "{}_processed{}".format(path_part, ext_part)
    with open(new_filename, "wb") as fp:
        fp.writelines(lines)

def best_practice_load(filename):
    out = []
    with open(filename, "rb") as fp:
        for line in fp.readlines():
            try:
                out.append(normalize_to_unicode(line).strip())
            except UnicodeDecodeError:
                print("Broken line: {}".format(line))
    return out

def best_practice_load_with_pandas(filename):
    import pandas as pd
    return pd.read_csv(filename)
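Putting it together on the file from this issue would look roughly like the sketch below (open_process_rewrite writes a "_processed" copy next to the original, so that filename follows from the function above):

# Rewrite test_bodies.csv with every line re-encoded as UTF-8, then load the clean copy.
open_process_rewrite("test_bodies.csv")
clean_lines = best_practice_load("test_bodies_processed.csv")
print(len(clean_lines))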
Also, for anyone curious and wanting to learn how to handle unicode better: I sleuthed this issue by first loading the raw text in byte form (with the 'rb' flag), then trying the …
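A minimal sketch of that kind of sleuthing (my guess at the steps, not necessarily the exact calls used): read the file in binary mode and report which lines refuse to decode as UTF-8.

# Read raw bytes and flag every line that is not valid UTF-8.
with open("test_bodies.csv", "rb") as fp:
    for lineno, raw_line in enumerate(fp, start=1):
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError as exc:
            print("Line {} is not valid UTF-8: {}".format(lineno, exc))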
It seems that test_bodies.csv is not encoded entirely in UTF-8. If you open it with an editor and check the bottom of the file (I used Atom), you'll see the missing characters. If you switch to cp1252 or some other encodings, you'll see some characters fixed and others break. What encoding should we use?