gensquashfs currently retains the original data if the compressed output is larger than the source. However, performing heavy-duty compression on incompressible data and then discarding it may be wasteful. I propose adding a command line option to gensquashfs that enables a quick entropy measurement before performing compression. If a block is deemed incompressible, we can simply keep the original data without wasting computational resources on compression.
We could use a fast compressor, such as zstd level 1, to gauge the entropy of each block. With the default xz level 6, a zstd level 1 pass adds less than 2% computational overhead, so the approach yields a net gain whenever at least roughly 2% of the source blocks turn out to be incompressible, which is not an unreasonable scenario.
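To make the idea concrete, here is a minimal sketch of what such a probe could look like with libzstd. This is not gensquashfs code: the function name, the fallback behaviour on errors, and the ~5% savings threshold are illustrative assumptions on my side, not measured values from this proposal.

```c
/*
 * Sketch of a zstd level-1 compressibility probe. Assumes libzstd is
 * available; block/block_size come from the caller. The 5% threshold
 * is an arbitrary example, not a tuned value.
 */
#include <stdbool.h>
#include <stdlib.h>
#include <zstd.h>

/* Return true if a fast zstd level-1 pass suggests the block is worth
 * handing to the expensive compressor (e.g. xz level 6). */
static bool block_seems_compressible(const void *block, size_t block_size)
{
	size_t bound = ZSTD_compressBound(block_size);
	void *scratch = malloc(bound);
	size_t ret;
	bool verdict = true; /* on failure, fall back to normal compression */

	if (scratch == NULL)
		return true;

	ret = ZSTD_compress(scratch, bound, block, block_size, 1);

	if (!ZSTD_isError(ret)) {
		/* Example threshold: call the block incompressible if the
		 * quick pass saved less than ~5% of its size. */
		verdict = ret < (block_size - block_size / 20);
	}

	free(scratch);
	return verdict;
}
```

The block writer would then run the configured compressor only when the probe returns true and store the block uncompressed otherwise, which is exactly the existing "keep the original data" path without the wasted heavy compression attempt.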
Alternative approaches, such as skipping whole files based on filename matching or file type detection, are likely less accurate. Files with mixed compressibility, such as PDFs containing both text (compressible) and JPEG images (incompressible), or uncompressed tar archives and VM images holding a variety of file types, would benefit from a more granular, block-based decision.
This idea is inspired by the ZFS LZ4 early abort mechanism, although the requirements and trade-offs in our context may be different. For reference, I have filed a similar issue on the squashfs-tools repository at plougher/squashfs-tools#240.
I'm happy to refine my local prototype and send a PR, but I'd like to ensure that this feature aligns with the project's direction first. Thank you for your time and consideration. I'm looking forward to hearing your thoughts on this proposal and the potential advantages it could bring to squashfs-tools-ng and the community.