Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update openDAL to address AWS/GHAC issues #21779

Merged
merged 16 commits into from
Jan 7, 2025

Conversation

DLukeNelson
Copy link
Contributor

@DLukeNelson DLukeNelson commented Dec 18, 2024

0.44.2 should fix the GHAC AWS S3 errors I'm seeing, but we might as well update to the latest version that's easy. 0.46 brings some bigger breaking changes, so that version is 0.45.1

0.44.2 should fix the GHAC AWS S3 errors I'm seeing, but might as well try updating to the latest release version for now
@DLukeNelson
Copy link
Contributor Author

I don't really know anything about rust development, so I'd appreciate guidance on if there is anything else that needs to be done in tandem with just changing the version number

@huonw
Copy link
Contributor

huonw commented Dec 18, 2024

Thanks for this, and thanks for diving into unfamiliar territory! This looks like it's along the right lines, but we'll need some adjustments:

  • release notes: can you add a brief paragraph to docs/notes/2.25.x.md, in the Remote Caching/Execution section? E.g. "The OpenDAL library powering the Github Actions cache backend has been updated, picking up some bug fixes for ..."
  • reduced updates: it looks like this updates a lot of packages: it's really easy to do this by writing ./cargo update which updates the world to their latest compatible versions. It'd be good to only update what's required here to reduce the scope. Could you revert the Cargo.lock change and then run ./cargo update opendal to constrain the update to just the required packages?

Other than that, I don't think anything else is required.

@DLukeNelson
Copy link
Contributor Author

Thanks for the quick review. I've made the requested updates.

@DLukeNelson
Copy link
Contributor Author

I see there are some build failures. Looks like the opendal upgrade does involve a few breaking changes. I'm going to be a bit busy with the holidays, but will circle back to this when I've got some time in front of me.

@huonw
Copy link
Contributor

huonw commented Dec 19, 2024

Thank you! It looks like there's been some changes to the OpenDAL APIs that break compilation. Unfortunately, it seems like their changelog doesn't call out the changes very clearly so it's a bit of fuss to resolve them.

If you're new to Rust, I suspect it won't be a fun time to work through. Are you okay if I push to this PR @DLukeNelson ?

@DLukeNelson
Copy link
Contributor Author

@huonw Sorry I didn't see this until now. Feel free to push here if you'd like. It will probably be at least until Monday before I am back to looking at anything, and then (as you mentioned) who knows how it will go as a first foray into rust haha.

@DLukeNelson
Copy link
Contributor Author

I reduced the scope of the upgrade. Upgrading opendal to 0.45.1 had a fairly limited breaking-change-blast-radius, so I think I was able to handle it (though I would appreciate an extra careful eye on the rust changes).

This far of an upgrade ought to be sufficient to address the AWS S3/GHAC issue that I've been seeing, although it does not bring openDAL to latest.

Not sure if there is a good place to put this information, but for the sake of mentioning it somewhere:
Upgrading openDAL to 0.46 has some more complicated issues. The opendal Reader and Writer no longer implement tokio::io::AsyncRead and tokio::io::AsyncWrite respectively (they now use futures::io). I'd expect there is some equivalency there and it may not be too complicated to get it working, but its definitely beyond my level of rust knowledge.

@DLukeNelson DLukeNelson changed the title Update openDAL to latest version Update openDAL to address AWS/GHAC issues Jan 1, 2025
@@ -76,9 +76,8 @@ impl Provider {
})
}

pub fn fs(path: &str, scope: String, options: RemoteStoreOptions) -> Result<Provider, String> {
let mut builder = opendal::services::Fs::default();
builder.root(path).enable_path_check();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler told me "path always checked since RFC-3243 List Prefix" and this enable_path_check call is deprecated, so it seems we can just get rid of it. This has knock-on effects of builder not needing to be mutable, and path no longer being required as a parameter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think we do still need to set root (and use path): there's a dynamic/run-time error if it isn't specified (for testing locally, ./cargo test -p store shows this, CI hasn't got to it yet due to earlier errors related to the error reporting): https://github.com/apache/opendal/blob/50791255c6ef3ed259c88dcc04a4295fa60fa443/core/src/services/fs/backend.rs#L97-L103

thread 'remote_tests::smoke_test_from_options_file_provider' panicked at fs/store/src/remote_tests.rs:94:6:
called `Result::unwrap()` on an `Err` value: "failed to initialise fs remote store provider: ConfigInvalid (permanent) at  => root is not specified"

I think it's just enable_path_check that's no longer required, and this code should remain as builder.root(path);.

Background: these calls are on the same line but operate independently: using the "builder pattern" in Rust, each of them mutates builder, and returns a mutable reference to allow chaining. A "clearer" version of the same code might've been:

let mut builder = ...;
builder.root(path);
builder.enable_path_check();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. Thanks for the info! I've put path back in, and it seems to be working fine

@@ -58,8 +58,7 @@ impl Provider {
// TODO: record Metric::RemoteStoreRequestTimeouts for timeouts
TimeoutLayer::new()
.with_timeout(options.timeout)
// TimeoutLayer requires specifying a non-zero minimum transfer speed too.
.with_speed(1),
.with_io_timeout(options.timeout),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with_speed has a deprecation warning, suggesting to use with_io_timeout instead. As I understand it, with_timeout and with_io_timeout each apply a timeout to different operations, so both are required: https://opendal.apache.org/docs/rust/opendal/layers/struct.TimeoutLayer.html#notes

Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the initiative, I've been out over the holiday period, so you got back to this before I did, despite my offer/suggestion of pushing here.

Good choice to reduce scope to 0.45.1! It seems like OpenDAL is willing to make breaking changes and not make it easy to understand what they are or how to fix them (as you observed with readers/writers in 0.46, but also other things! 😢 ).

Speaking of non-obvious breaking changes, it seems like the error reporting has changed in a way that breaks our code, causing the test failures 😢 It looks like this happened in 0.42.0 (in apache/opendal#3395 specifically). I'm not yet sure of how to resolve this. 🤔

I had to do a bisect with git dependencies in Cargo.toml: opendal = { git = "https://github.com/apache/opendal.git", rev = "268d5b8784ce9f30e9c0b70dc6201107a0d5b8f1", ...

Log:

git bisect start 'v0.42.0' 'v0.41.0'
# 9b423aef2 bad chore: Bump to v0.42.0 to start release process (Round 3) (#3531)
# 779750ab3 good chore: Bump to v0.41.0 to start release process (#3241)

git bisect good 3aad395fc8d8c894a28d05d8e6bf904d1eb38d08
# 3aad395fc good feat: Add write_total_max_size in Capability (#3309)

git bisect bad 6e44078bf38a6cea3176f4a38cb361f82ebd416e
# 6e44078bf bad fix: nodejs test adapt `OPENDAL_DISABLE_RANDOM_ROOT` (#3456)

git bisect good d77c36184d0a0b67647a1a509c7816e417a8b980
# d77c36184 good fix(services/ghac)!: Remove enable_create_simulation support for ghac (#3423)

git bisect bad 60815aa926c7c649e884b56c38df3b09123a4807
# 60815aa92 bad fix(mongo/backend): remove redundant code (#3439)

git bisect good daf64e8e5ebd9a42ede372975b6523f706915308
# daf64e8e5 good refactor(services/sftp): migrate to test planner (#3412)

git bisect bad 268d5b8784ce9f30e9c0b70dc6201107a0d5b8f1
# 268d5b878 bad feat: Implement Lazy Reader (#3395)

git bisect good 74eab842f2082b8fa09d64ee7df7939e86d4589c
# 74eab842f good fix: ASF event URL (#3431)

# 268d5b8784ce9f30e9c0b70dc6201107a0d5b8f1 is the first bad commit

@@ -76,9 +76,8 @@ impl Provider {
})
}

pub fn fs(path: &str, scope: String, options: RemoteStoreOptions) -> Result<Provider, String> {
let mut builder = opendal::services::Fs::default();
builder.root(path).enable_path_check();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think we do still need to set root (and use path): there's a dynamic/run-time error if it isn't specified (for testing locally, ./cargo test -p store shows this, CI hasn't got to it yet due to earlier errors related to the error reporting): https://github.com/apache/opendal/blob/50791255c6ef3ed259c88dcc04a4295fa60fa443/core/src/services/fs/backend.rs#L97-L103

thread 'remote_tests::smoke_test_from_options_file_provider' panicked at fs/store/src/remote_tests.rs:94:6:
called `Result::unwrap()` on an `Err` value: "failed to initialise fs remote store provider: ConfigInvalid (permanent) at  => root is not specified"

I think it's just enable_path_check that's no longer required, and this code should remain as builder.root(path);.

Background: these calls are on the same line but operate independently: using the "builder pattern" in Rust, each of them mutates builder, and returns a mutable reference to allow chaining. A "clearer" version of the same code might've been:

let mut builder = ...;
builder.root(path);
builder.enable_path_check();

@huonw
Copy link
Contributor

huonw commented Jan 6, 2025

Ah, okay, I think we need to adjust this code:

let correct_digest = async_verified_copy(digest, false, &mut reader, destination)
.await
.map_err(|e| format!("failed to read {}: {}", path, e))?;
if !correct_digest {
// TODO: include the actual digest here
return Err(format!("Remote CAS gave wrong digest: expected {digest:?}"));
}
}
LoadMode::NoValidate => {
tokio::io::copy(&mut reader, destination)
.await
.map_err(|e| format!("failed to read {}: {}", path, e))?;
}

We need both branches there to detect when the copy returns an error with kind std::io::ErrorKind::NotFound and return Ok(false) (similar to the reader(...) code further above). I'm happy to do it or provide advice about how to do it, let me know!

@DLukeNelson
Copy link
Contributor Author

DLukeNelson commented Jan 6, 2025

Ah, okay, I think we need to adjust this code:

let correct_digest = async_verified_copy(digest, false, &mut reader, destination)
.await
.map_err(|e| format!("failed to read {}: {}", path, e))?;
if !correct_digest {
// TODO: include the actual digest here
return Err(format!("Remote CAS gave wrong digest: expected {digest:?}"));
}
}
LoadMode::NoValidate => {
tokio::io::copy(&mut reader, destination)
.await
.map_err(|e| format!("failed to read {}: {}", path, e))?;
}

We need both branches there to detect when the copy returns an error with kind std::io::ErrorKind::NotFound and return Ok(false) (similar to the reader(...) code further above). I'm happy to do it or provide advice about how to do it, let me know!

I think I've implemented this correctly. The tests are passing now, at least, on my local machine with cargo test --locked --tests

@huonw
Copy link
Contributor

huonw commented Jan 6, 2025

Awesome, it does look like it's almost there. I think we need to run ./cargo fmt to fix the formatting and hopefully then this should be good to go!

All these breaking changes have turned this "just upgrade a dependency" change has turned into a bit of a saga, thanks for sticking with it.

@DLukeNelson
Copy link
Contributor Author

I can't believe I forgot linting! Pants/autohook have truly spoiled me in my regular work repo haha.

Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one!

@huonw huonw enabled auto-merge (squash) January 7, 2025 21:59
@DLukeNelson
Copy link
Contributor Author

@huonw Any thoughts on these failing tests?
I don't have a mac, so might have a hard time investigating. (Also worthy of note: there seems to be some errors in the steps that upload the test report to aws, so we have no test report to go on.)

@huonw
Copy link
Contributor

huonw commented Jan 7, 2025

(I updated the PR description after the switch to upgrading to 0.45.1)

@huonw
Copy link
Contributor

huonw commented Jan 7, 2025

Any thoughts on these failing tests?

I think it's a spurious timeout failure, so I've retried.

@huonw huonw merged commit fd3f4c9 into pantsbuild:main Jan 7, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants