Fix checksum verification #56

Merged: colincasey merged 5 commits into main from fix_intermittent_checksum_errors on Oct 10, 2024
Conversation

@colincasey (Contributor) commented on Oct 9, 2024

To investigate the intermittent checksum errors, I added some code to print out stats on the downloaded package index file whenever a checksum error is encountered. What stood out was that, whenever the checksum failed, the number of bytes written was always slightly off from the number of bytes expected. For example:

```
release_url: http://ports.ubuntu.com/ubuntu-ports/dists/noble-security/InRelease
url: http://ports.ubuntu.com/ubuntu-ports/dists/noble-security/universe/binary-arm64/by-hash/SHA256/184473504b3e97aec550d1bb687b0871e8113f60ab7e8ff136f60f4ecdca9b20
uncompressed size: 1461506
compressed size:   365359
expected size:     365376
```

These errors also often go away when the automation is re-run, which indicates that some bytes are not being consumed during the `async_copy(reader, writer)` step. After several attempts at forcing the remaining bytes to be written, two changes seemed to provide stability (a rough sketch of the resulting pipeline is shown below):

- Removing the `AsyncBufWriter`

    Since we need an `AsyncBufReader` in front of the `GzipDecoder`, the reads are already buffered into large chunks before they are handed to the writer, so why not write directly to the file instead of passing the data through a buffered writer?

- Setting `reader.multiple_members(true)`

    Multiple files can be gzip-compressed into a single stream, and many gzip readers decode these members automatically. Here, though, the gzip content is not made up of multiple members, so why would setting this flag cause all the bytes to be written? I suspect that, because of how the read process works, an extra read is performed once the end of the stream is reached to check whether more members follow, and that extra read somehow triggers a flush.

Surprisingly, explicitly calling `flush` or `shutdown` on the `AsyncBufWriter` did not provide the expected stability for correcting these checksum issues. I've modified the code to explicitly call `flush()` on the write stream anyway.

I'm still not 100% convinced the instability is completely fixed, but, so far, these changes seem like a reasonable thing to try out and observe for failures.
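
For reference, below is a rough sketch of what the adjusted copy pipeline looks like after these changes. This is illustrative only, not the buildpack's exact code: the function name, its signature, and the assumption that the compressed response body is already available as an `AsyncRead` (e.g. via `tokio_util::io::StreamReader`) are hypothetical.

```rust
// Illustrative sketch only: assumes the compressed body is already exposed as
// an `AsyncRead` (for example via `tokio_util::io::StreamReader` over the HTTP
// byte stream). Names and the function signature are hypothetical.
use async_compression::tokio::bufread::GzipDecoder;
use tokio::fs::File;
use tokio::io::{AsyncRead, AsyncWriteExt, BufReader};

async fn write_package_index(
    compressed_body: impl AsyncRead + Unpin,
    destination: &std::path::Path,
) -> std::io::Result<u64> {
    // The decoder requires an `AsyncBufRead`, so reads are already buffered
    // into chunks before they reach the writer.
    let mut decoder = GzipDecoder::new(BufReader::new(compressed_body));

    // Keep reading past the end of the first gzip member. Even though the
    // package index is a single member, the extra read at end-of-stream
    // appears to ensure the final bytes are emitted.
    decoder.multiple_members(true);

    // Write straight to the file; no `AsyncBufWriter` in between.
    let mut file = File::create(destination).await?;
    let bytes_written = tokio::io::copy(&mut decoder, &mut file).await?;

    // Explicitly flush the write stream, as described above.
    file.flush().await?;

    Ok(bytes_written)
}
```

The two relevant differences from the previous version are that no buffered writer sits between the decoder and the file, and that `multiple_members(true)` is set on the decoder.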

Fixes #53


Link to test runs showing what appears to be increased stability (all 150 integration test runs passed):

Before this change, a checksum failure would happen during almost every other CI run.

@colincasey added the skip changelog label (Skip the check-changelog check) Oct 9, 2024
@colincasey self-assigned this Oct 9, 2024
@colincasey marked this pull request as ready for review October 9, 2024 20:07
@colincasey requested a review from a team as a code owner October 9, 2024 20:07
@colincasey marked this pull request as draft October 9, 2024 21:33
@colincasey removed the request for review from a team October 9, 2024 21:33
@colincasey (this comment was marked as outdated)

@colincasey marked this pull request as ready for review October 10, 2024 18:19
@colincasey merged commit ac78be9 into main Oct 10, 2024
6 checks passed
@colincasey deleted the fix_intermittent_checksum_errors branch October 10, 2024 19:13
Labels: skip changelog (Skip the check-changelog check)
Linked issues: Auto retry on checksum verification
Participants: 2