gensquashfs currently retains the original data if the compressed output is larger than the source. However, performing heavy-duty compression on incompressible data and then discarding it may be wasteful. I propose adding a command line option to gensquashfs that enables a quick entropy measurement before performing compression. If a block is deemed incompressible, we can simply keep the original data without wasting computational resources on compression.
We could use a fast compressor, such as zstd level 1, to gauge the entropy of each block. With the default xz level 6, a zstd level 1 pass adds less than 2% computational overhead, so the approach yields a net gain whenever at least roughly 2% of the source blocks turn out to be incompressible, which is not an unreasonable scenario.
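To make the idea concrete, here is a minimal sketch of what such a probe could look like with libzstd. This is not gensquashfs code: the function name, the fallback behaviour on errors, and the ~5% savings threshold are illustrative assumptions on my side, not measured values from this proposal.

```c
/*
 * Sketch of a zstd level-1 compressibility probe. Assumes libzstd is
 * available; block/block_size come from the caller. The 5% threshold
 * is an arbitrary example, not a tuned value.
 */
#include <stdbool.h>
#include <stdlib.h>
#include <zstd.h>

/* Return true if a fast zstd level-1 pass suggests the block is worth
 * handing to the expensive compressor (e.g. xz level 6). */
static bool block_seems_compressible(const void *block, size_t block_size)
{
	size_t bound = ZSTD_compressBound(block_size);
	void *scratch = malloc(bound);
	size_t ret;
	bool verdict = true; /* on failure, fall back to normal compression */

	if (scratch == NULL)
		return true;

	ret = ZSTD_compress(scratch, bound, block, block_size, 1);

	if (!ZSTD_isError(ret)) {
		/* Example threshold: call the block incompressible if the
		 * quick pass saved less than ~5% of its size. */
		verdict = ret < (block_size - block_size / 20);
	}

	free(scratch);
	return verdict;
}
```

The block writer would then run the configured compressor only when the probe returns true and store the block uncompressed otherwise, which is exactly the existing "keep the original data" path without the wasted heavy compression attempt.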
Alternative approaches, such as skipping whole files based on filename matching or file type detection, are likely less accurate. Files with mixed compressibility, such as PDFs containing both text (compressible) and JPEG images (incompressible), or uncompressed tar archives and VM images holding a variety of file types, would benefit from a more granular, block-based decision.
This idea is inspired by the ZFS LZ4 early abort mechanism, although the requirements and trade-offs in our context may be different. For reference, I have filed a similar issue on the squashfs-tools repository at plougher/squashfs-tools#240.
I'm happy to refine my local prototype and send a PR, but I'd like to ensure that this feature aligns with the project's direction first. Thank you for your time and consideration. I'm looking forward to hearing your thoughts on this proposal and the potential advantages it could bring to squashfs-tools-ng and the community.