Fix checksum verification #56

Merged: colincasey merged 5 commits into main from fix_intermittent_checksum_errors on Oct 10, 2024
Conversation

@colincasey (Contributor) commented on Oct 9, 2024

To investigate the intermittent checksum errors, I added some code to print out stats on the downloaded package index file whenever a checksum error is encountered. What stood out was that, whenever the checksum failed, the number of bytes written was always slightly off from the number of bytes expected. For example:

```
release_url: http://ports.ubuntu.com/ubuntu-ports/dists/noble-security/InRelease
url: http://ports.ubuntu.com/ubuntu-ports/dists/noble-security/universe/binary-arm64/by-hash/SHA256/184473504b3e97aec550d1bb687b0871e8113f60ab7e8ff136f60f4ecdca9b20
uncompressed size: 1461506
compressed size:   365359
expected size:     365376
```

These errors also often go away when the automation is re-run, which indicates that some bytes are not being consumed during the `async_copy(reader, writer)` step. After several attempts at forcing the remaining bytes to be written, two changes seemed to provide stability (a rough sketch of the resulting pipeline is shown below):

- Removing the `AsyncBufWriter`

    Since we need an `AsyncBufReader` in front of the `GzipDecoder`, the reads are already buffered into large chunks before they are handed to the writer, so why not write directly to the file instead of passing the data through a buffered writer?

- Setting `reader.multiple_members(true)`

    Multiple files can be gzip-compressed into a single stream, and many gzip readers decode these members automatically. Here, though, the gzip content is not made up of multiple members, so why would setting this flag cause all the bytes to be written? I suspect that, because of how the read process works, an extra read is performed once the end of the stream is reached to check whether more members follow, and that extra read somehow triggers a flush.

Surprisingly, explicitly calling `flush` or `shutdown` on the `AsyncBufWriter` did not provide the expected stability for correcting these checksum issues. I've modified the code to explicitly call `flush()` on the write stream anyway.

I'm still not 100% convinced the instability is completely fixed, but, so far, these changes seem like a reasonable thing to try out and observe for failures.
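
For reference, below is a rough sketch of what the adjusted copy pipeline looks like after these changes. This is illustrative only, not the buildpack's exact code: the function name, its signature, and the assumption that the compressed response body is already available as an `AsyncRead` (e.g. via `tokio_util::io::StreamReader`) are hypothetical.

```rust
// Illustrative sketch only: assumes the compressed body is already exposed as
// an `AsyncRead` (for example via `tokio_util::io::StreamReader` over the HTTP
// byte stream). Names and the function signature are hypothetical.
use async_compression::tokio::bufread::GzipDecoder;
use tokio::fs::File;
use tokio::io::{AsyncRead, AsyncWriteExt, BufReader};

async fn write_package_index(
    compressed_body: impl AsyncRead + Unpin,
    destination: &std::path::Path,
) -> std::io::Result<u64> {
    // The decoder requires an `AsyncBufRead`, so reads are already buffered
    // into chunks before they reach the writer.
    let mut decoder = GzipDecoder::new(BufReader::new(compressed_body));

    // Keep reading past the end of the first gzip member. Even though the
    // package index is a single member, the extra read at end-of-stream
    // appears to ensure the final bytes are emitted.
    decoder.multiple_members(true);

    // Write straight to the file; no `AsyncBufWriter` in between.
    let mut file = File::create(destination).await?;
    let bytes_written = tokio::io::copy(&mut decoder, &mut file).await?;

    // Explicitly flush the write stream, as described above.
    file.flush().await?;

    Ok(bytes_written)
}
```

The two relevant differences from the previous version are that no buffered writer sits between the decoder and the file, and that `multiple_members(true)` is set on the decoder.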

Fixes #53


Link to test runs showing what appears to be increased stability (all 150 integration test runs passed):

Before this change, a checksum failure would happen during almost every other CI run.

@colincasey added the skip changelog label (Skip the check-changelog check) Oct 9, 2024
@colincasey self-assigned this Oct 9, 2024
@colincasey marked this pull request as ready for review October 9, 2024 20:07
@colincasey requested a review from a team as a code owner October 9, 2024 20:07
@colincasey marked this pull request as draft October 9, 2024 21:33
@colincasey removed the request for review from a team October 9, 2024 21:33
@colincasey (this comment was marked as outdated)

@colincasey marked this pull request as ready for review October 10, 2024 18:19
@colincasey merged commit ac78be9 into main Oct 10, 2024
6 checks passed
@colincasey deleted the fix_intermittent_checksum_errors branch October 10, 2024 19:13
Labels: skip changelog (Skip the check-changelog check)
Linked issues: Auto retry on checksum verification
Participants: 2