Improve compression support #72

KillingSpark · 2024-10-07T13:42:44Z

Checksum support (<- Great first issue!)
Check if/how much building custom FSE tables improves the compression rate and what the performance implications are
Check if/how much rebuilding huffman tables for each block helps/hurts performance and ratio
Dictionary support
Check other matching algorithms and window sizes to implement more compression levels
Make the Matcher trait public and allow users to provide their own implementations

The text was updated successfully, but these errors were encountered:

zleyyij · 2024-11-04T20:56:19Z

This is awesome, I've been swamped with schoolwork so haven't really had time to finish up my compression addition.

Is there anywhere that my help would be valued? Once this is merged into master I can look into restructuring everything to be more organized, or:

I want to add a lot more to the CI, ala https://github.com/jonhoo/rust-ci-conf
I can add more documentation
I want to try experimenting with making the bitreader flip chunks using SIMD, I think it would lead to a real performance speedup

KillingSpark · 2024-11-04T22:29:26Z

One thing that would be nice to have is actually streaming the input. Currently I am being cheap and just read the source into a Vec and then process it as if it was passed as an parameter. That's obviously just a crutch and needs to change so large sources can be processed with efficient memory usage.

I'd hold off with doc on the encoder part for a bit, it's likely a lot will still change and keeping doc up to date can be tricky. And wrong/outdated doc is worse than no doc in my experience.

Trying out SIMD for the bitreader sounds interesting, especially for the case where three numbers are read in one call. The bitreader is one of the hottest codepaths in the decoding, even small gains can have a huge impact. It also has the benefit of not depending on this branch being merged soon.

The really complicated thin I'm hung up on currently is designing a good algorithm for finding the matches to generate sequences. I learned about efficiently searching strings in university but this is a bit different as the string to search in constantly changes as the data window moves. I think I found something half promising but it's annoying to implement in safe rust. I might put a pin into that an deffer this to later to get this branch into a mergeable state.

Anyways do what you think sounds most fun to you :)

KillingSpark · 2024-12-14T16:16:42Z

@zleyyij I merged this and did some reorganizing myself. Feedback on that would be greatly appreciated.

I removed all re-exports for now so the common stuff is now hidden in ruzstd::encoding::xxx. These should probably come back to the top level imports? Anyways the doc page is now much cleaner after hiding all the internal modules :)

zleyyij · 2024-12-15T01:28:07Z

Awesome, I'll take a look tomorrow :)

…

On Sat, Dec 14, 2024 at 09:17 Moritz Borcherding ***@***.***> wrote: @zleyyij <https://github.com/zleyyij> I merged this and did some reorganizing myself. Feedback on that would be greatly appreciated. I removed all re-exports for now so the common stuff is now hidden in ruzstd::encoding::xxx. These should probably come back to the top level imports? Anyways the doc page is now much cleaner after hiding all the internal modules :) — Reply to this email directly, view it on GitHub <#72 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ASCMLYRXE5ZEYWEO3N5UDYT2FRKYDAVCNFSM6AAAAABPP7CCZKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNBTGE3TEMJWGY> . You are receiving this because you were mentioned.Message ID: ***@***.***>

zleyyij · 2024-12-15T18:28:46Z

@KillingSpark

Here's my scratchpad:

Externally facing feedback

I cloned the repo on my machine then ran cargo doc --open. The homepage feels readable and organized. A few areas of improvement you could consider:
- If you use #![doc = include_str!("README.md")] in your module level documentation for lib.rs, then you can include the entire README in the docs page
- If you really wanted to improve docs for the compression side, I really like rustdoc's documentation format, and I would try to use that for every major public facing component https://doc.rust-lang.org/rustdoc/how-to-write-documentation.html#documenting-components

Internal implementation feedback

`compression/mod.rs`

compression/mod.rs is an unusual place to put the matcher trait if I'm understanding what it is correctly. Do you think it would be better suited under match_generator.rs?
The Matcher trait appears to define some domain specific language (a space), but it doesn't ever really define what a space is. Is it a heap alloaction where compressed data is written?

`match_generator.rs`

Module level documentation for match_generator.rs would be helpful, just a sentence or two explaining that it's generating matches for the literals section of a zstd frame or whatever it is
Your implementation for match_generator is elegant, I like it

`frame_compressor.rs`

I actually made this mistake when I was first writing the compressor stuff, I don't like that FrameCompressor::compress is basically a massive match chain, I think it'll get worse and worse as more compression support is added. What do you think about having a levels module, and for every compression level, there's a file with a single fn compress_LEVEL(input: R, output: &mut Vec<u8>). Then the match just has a CompressionLevel::uncompressed => compress_uncompressed(input, &mut output)

`bit_writer.rs`

bit_writer has a typo on line 10, outpu should be output.
Some of the stuff in bit_writer.rs is very difficult to understand, and could benefit from comments, eg:

       if idx % 8 != 0 {
            let bits_in_first_byte = 8 - (idx % 8);
            assert!(bits_in_first_byte <= num_bits);
            self.output.as_mut()[idx / 8] &= 0xFFu8 >> bits_in_first_byte;
            let new_bits = (bits << (8 - bits_in_first_byte)) as u8;
            self.output.as_mut()[idx / 8] |= new_bits;
            num_bits -= bits_in_first_byte;
            bits >>= bits_in_first_byte;
            idx += bits_in_first_byte;
        }

The function it's under has no docstring, and there are no comments throughout the body of the function, so it's difficult to grok easily.

Overall, great work, this is a very robust job you've done here. If you'd like help with some of these, I can help.

I'm currently working on SIMD optimization of reverse_bitreader, it's headache inducing but I am still making process (I love reading about discrete galois field matrix transforms /s). My current plan is to implement x86_64 (GFNI/AVX512VL) and ARM NEON support, then write a fallback platform agnostic version. Currently, it seems this will require nightly support enabled for this (maybe we lock behind a feature flag, inline some assembly, or use a stopgap crate until that's stabilized?)

KillingSpark · 2024-12-15T19:38:52Z

Thanks for the feedback! Lots of good stuff :)

I think the Matcher trait needs some work anyways so the MatcherDriver is generic over the Matcher and doesn't itself implement Matcher. This would allow user code to provide their own Matcher implementation if they want to. But with that in mind I think it is in the right place, even if it is only pub(crate) with a single implementation for now.

The SIMD work sounds very cool! I think swapping the bitreader based on a feature flag is fine, if the feature documents that the crate is only supported on nightly with that feature enabled

KillingSpark mentioned this issue Oct 7, 2024

Encoder improvements #71

Closed

4 tasks

KillingSpark changed the title ~~Actual compression support~~ Improve compression support Dec 13, 2024

KillingSpark added the good first issue Good for newcomers label Dec 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve compression support #72

Improve compression support #72

KillingSpark commented Oct 7, 2024 •

edited

Loading

zleyyij commented Nov 4, 2024

KillingSpark commented Nov 4, 2024 •

edited

Loading

KillingSpark commented Dec 14, 2024

zleyyij commented Dec 15, 2024 via email

zleyyij commented Dec 15, 2024

KillingSpark commented Dec 15, 2024 •

edited

Loading

Improve compression support #72

Improve compression support #72

Comments

KillingSpark commented Oct 7, 2024 • edited Loading

zleyyij commented Nov 4, 2024

KillingSpark commented Nov 4, 2024 • edited Loading

KillingSpark commented Dec 14, 2024

zleyyij commented Dec 15, 2024 via email

zleyyij commented Dec 15, 2024

Externally facing feedback

Internal implementation feedback

compression/mod.rs

match_generator.rs

frame_compressor.rs

bit_writer.rs

KillingSpark commented Dec 15, 2024 • edited Loading

KillingSpark commented Oct 7, 2024 •

edited

Loading

KillingSpark commented Nov 4, 2024 •

edited

Loading

`compression/mod.rs`

`match_generator.rs`

`frame_compressor.rs`

`bit_writer.rs`

KillingSpark commented Dec 15, 2024 •

edited

Loading