Unzipping a zip directory #55

dluftspring · 2023-07-31T16:55:05Z

dluftspring
Jul 31, 2023

👋 Hey, just wondering if you have any guidance on setting an appropriate chunk size when the zip stream targets a directory with many different CSV files inside? Depending on the chunk size, some of the files unpack correctly and others do not.

michalc · 2023-08-01T08:38:32Z

michalc
Aug 1, 2023
Maintainer

The default should be fine, or indeed any value - the chunk size should only affect performance/memory, and not correctness. Are you able to post the code?
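As a rough illustration of why chunk size should not change the result, here is a minimal sketch using only the standard library's zlib module as a stand-in. It is not stream-unzip's code, and `inflate_in_chunks` is a hypothetical helper name. It feeds the same compressed bytes to a streaming decompressor in pieces of different sizes and collects the output for each size.

```python
import zlib


def inflate_in_chunks(compressed, chunk_size):
    """Decompress `compressed` by feeding zlib `chunk_size` bytes at a time."""
    decompressor = zlib.decompressobj()
    pieces = []
    for start in range(0, len(compressed), chunk_size):
        piece = compressed[start:start + chunk_size]
        pieces.append(decompressor.decompress(piece))
    # Collect anything the decompressor is still holding on to
    pieces.append(decompressor.flush())
    return b"".join(pieces)


# Sample CSV-like payload (made up for this sketch)
original = "id;name\r\n1;café\r\n".encode("utf-8") * 1000
compressed = zlib.compress(original)

# The same input split three different ways
results = {size: inflate_in_chunks(compressed, size) for size in (1, 7, 65536)}
```

A streaming decoder keeps its own state between calls, so where the input happens to be cut only changes how many calls are made, not the bytes that come out.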


dluftspring · 2023-08-01T16:03:13Z

dluftspring
Aug 1, 2023
Author

Yep! Here's a sample I've pulled out:

with requests.get(url, stream=True) as response:
    response.raise_for_status()

    for file_name, file_size, file_chunks in stream_unzip(response.iter_content()):
        for chunk in file_chunks:
            text = TextIOWrapper(BytesIO(chunk), encoding="utf-8", newline="\r\n")
            reader = csv.DictReader(text, delimiter=";")
            for row in reader:
                yield file_name, row

This is a zip archive containing 8 files. The first one is always parsed correctly, and then, depending on how I set the chunk size in iter_content, I run into Unicode decode errors and the JSON lines are missing headers. What I think is happening is that the chunks that span the boundaries between files aren't passing headers correctly to the DictReader object, but I could be way off there.

5 replies

michalc Aug 1, 2023
Maintainer

Ah, so the issue is that you're iterating over the chunks and, for each (maybe very small) chunk, parsing it as UTF-8 and then as a CSV file. A chunk could even be 1 byte long, say, depending on config, so this is sort of the wrong way round. You need to somehow pass file_chunks to TextIOWrapper. In an ideal world, you would be able to do this:

with requests.get(url, stream=True) as response:
    response.raise_for_status()

    for file_name, file_size, file_chunks in stream_unzip(response.iter_content()):
        text = TextIOWrapper(file_chunks, encoding="utf-8", newline="\r\n") # Won't work!
        reader = csv.DictReader(text, delimiter=";")
        for row in reader:
            yield file_name, row

... but as far as I know (and I would love to be wrong), TextIOWrapper doesn't accept iterables of bytes like file_chunks.

... but it does accept file-like objects. So you have to convert file_chunks to such an object (taking care to not load it all in memory - that's the tricky part):

from io import IOBase

def to_file_like_obj(iterable):
    # The current chunk from the iterable, and how far into it we have read
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        # Yield slices of the chunks until `size` bytes have been yielded,
        # or until the iterable is exhausted
        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield : offset]

    class FileLikeObj(IOBase):
        def readable(self):
            return True

        def read(self, size=-1):
            # A size of None or a negative size means "read until the end"
            return b''.join(
                up_to_iter(float('inf') if size is None or size < 0 else size)
            )

    return FileLikeObj()

with requests.get(url, stream=True) as response:
    response.raise_for_status()

    for file_name, file_size, file_chunks in stream_unzip(response.iter_content()):
        text = TextIOWrapper(to_file_like_obj(file_chunks), encoding="utf-8", newline="\r\n")
        reader = csv.DictReader(text, delimiter=";")
        for row in reader:
            yield file_name, row

I would also love for to_file_like_obj to be in Python's standard library, but I have found nothing.
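To make the difference concrete without needing a real download, here is a minimal, self-contained sketch. It is a variant of the idea above built on io.RawIOBase and io.BufferedReader rather than the exact wrapper from this thread, and `IterableReader`, `split_into_chunks` and `parse_rows` are hypothetical names invented for the example. It shows that decoding each chunk on its own breaks as soon as a multi-byte character is split, while decoding through a file-like object gives the same rows for any chunk size.

```python
import csv
from io import BufferedReader, RawIOBase, TextIOWrapper


class IterableReader(RawIOBase):
    """Hypothetical helper: exposes an iterable of bytes as a raw binary stream."""

    def __init__(self, iterable):
        super().__init__()
        self._it = iter(iterable)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull the next non-empty chunk only when the previous one is used up
        while not self._leftover:
            try:
                self._leftover = next(self._it)
            except StopIteration:
                return 0  # 0 bytes read signals end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def split_into_chunks(data, size):
    """Stand-in for file_chunks: cut `data` into pieces of `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def parse_rows(chunks):
    """Decode and parse the whole stream, not each chunk separately."""
    binary = BufferedReader(IterableReader(chunks))
    text = TextIOWrapper(binary, encoding="utf-8", newline="\r\n")
    return list(csv.DictReader(text, delimiter=";"))


# Sample data (made up); "é" and "ï" each take two bytes in UTF-8
data = "id;name\r\n1;café\r\n2;naïve\r\n".encode("utf-8")

# Decoding chunk by chunk fails once a two-byte character is cut in half
try:
    for chunk in split_into_chunks(data, 1):
        chunk.decode("utf-8")
    per_chunk_failed = False
except UnicodeDecodeError:
    per_chunk_failed = True

# Decoding through the file-like object works for every chunk size
rows_by_size = {
    size: parse_rows(split_into_chunks(data, size))
    for size in range(1, len(data) + 1)
}
```

The same reasoning covers the missing headers: csv.DictReader takes its field names from the first line it sees, so a reader created per chunk treats whatever line happens to start that chunk as the header.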

dluftspring Aug 1, 2023
Author

Thanks for the thoughtful reply - this makes sense! Do you think at this point it's easier to write the chunks to disk and then process them from there? You incur some overhead for the I/O but it still seems better than bringing the whole zip archive into memory

michalc Aug 1, 2023
Maintainer

"Do you think at this point it's easier to write the chunks to disk and then process them from there? You incur some overhead for the I/O but it still seems better than bringing the whole zip archive into memory"

Ah, to confirm, the code above doesn't bring the whole zip archive into memory at once, just ~a chunk at a time.

Although if it did load it all into memory at once, it would probably still be better to keep it in memory than to write it all to disk, unless you don't have enough memory to do that. But also, if you are putting it all into memory at once, then in many cases you won't need to use stream-unzip, and could just use Python's zipfile module (unless you need some of the other features that stream-unzip has, like decrypting AES-encrypted zips, or uncompressing Deflate64 zips).
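For the case where the whole archive does fit in memory, a minimal sketch of the zipfile route might look like the following. `rows_from_zip_bytes` is a hypothetical helper name, the sample archive is built in memory just for the example, and the `;` delimiter and UTF-8 encoding are carried over from the code earlier in the thread.

```python
import csv
import zipfile
from io import BytesIO, TextIOWrapper


def rows_from_zip_bytes(zip_bytes, delimiter=";"):
    """Yield (file name, row) for every CSV row in an in-memory zip archive."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries have no content to parse
            with archive.open(info) as member:
                # archive.open returns a binary file-like object,
                # so TextIOWrapper can wrap it directly
                text = TextIOWrapper(member, encoding="utf-8", newline="")
                for row in csv.DictReader(text, delimiter=delimiter):
                    yield info.filename, row


# Build a small sample archive in memory (made-up data)
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.csv", "id;name\r\n1;café\r\n")
    archive.writestr("b.csv", "id;name\r\n2;naïve\r\n")

rows = list(rows_from_zip_bytes(buffer.getvalue()))
```

Note that zipfile needs a seekable source, which is why the bytes are wrapped in BytesIO here; that requirement is what forces the whole archive to be held in memory or on disk.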

michalc Aug 1, 2023
Maintainer

Also, I've just noticed the documentation on iter_content at https://requests.readthedocs.io/en/latest/api/#requests.Response.iter_content. It looks like the default chunk size there is 1, which is a bit small. Something bigger is likely to be more performant, or even None, which apparently means "as it comes".

dluftspring Aug 1, 2023
Author

Yeah, that's the chunk size I was messing around with. I initially had it set to the same default value as stream_unzip (65536).

The code I have right now just uses the native zipfile library, but I'm looking to make it leaner as opposed to upping the RAM on the environment where this script is running, hence my appearing on this repo's discussion board 😛
