-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NaN threshold for conservative method #39
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for contributing this! I'll try to give it a full review and run it next week, but here are already some comments.
Did you compare the performance to xESMF to ensure that the masking has the same results here as there? I can also try that later.
Lastly, would you be able to write some unit tests to test the nan masking? And update the changelog? If anything is unclear please let me know.
Thanks, and yes for sure, hopefully next week I will find time to add tests and benchmarks. |
Did some quick profiling on a ~4GB array of 1/4deg global data coarsening to 1deg. Dask array on a 32 CPU node. Results:
So adding skipna forces roughly one additional pass through the array with the weight renormalization. The reason this PR is faster than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally found some time to get back to this.
I did manage to come up with an implementation that should accurately track the fraction of NaN points across dimensions (see the tests added). The problem currently is that there is a major performance penalty because we are tracking the nans independently across all dimensions (such as an additional time dim) which causes the einsum ops to have much larger dimensionality.
Also did some general reworking to use xr.dot
, consolidate some things like different the DataArray
/Dataset
pathways, and knocked out some other small improvements to conservative
that have been noted in the issues.
Made the modification to take |
xr.testing.assert_allclose(da_coarsen, da_regrid) | ||
|
||
|
||
@pytest.mark.skip(reason="requires xesmf") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use pytest.mark.skipif
here, and run the test automatically if xesmf is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the changes you've made! I've invited you to this repository so you don't have to wait for me to have the CI run.
* add nan_threshold option * track nan frac across dims, use xr.dot, consolidate ds/da paths, tests * fixes and cleanup plus initial notebook cell * speed up by only tracking the max of nonnull points over non_regrid_dims * fix tests for newer dependency versions * Improve typing of call_on_dataset * Fix typing in updated conservative routines * Apply code formatting * Ensure hashable is a valid input for coordinate identifier * Make tests & typing pass * Make `create_regridding_dataset` a method of `Grid` #38 * Update notebooks and dependencies * Allow xesmf test to run if it's available * Add @slevang to the contributors list * Update changelog * Ignore linter in test * Update readme with badges and "why use..." text --------- Co-authored-by: Sam Levang <slevang@salientpredictions.com>
Merged as part of #41 |
Nice package! In testing it out where I've previously used
xesmf
, I noticed two features lacking from the conservative method:Number 1 is easy, number 2 is trickier. I added a naive implementation for the
nan_threshold
capabilities of xesmf here for discussion. As noted in the previous issue, to do this 100% correctly we would need to track the NaN fraction as we reduce over each dimension, which I'm not doing here. Thenan_threshold
value doesn't translate directly to total fraction of NaN cells due to the sequential reduction. It would also get complicated for isolated NaNs in the temporal dimension.I'm not sure any of this matters much for a dataset where you have consistent NaN's e.g. SST. Here's an example of the new functionality used on the MUR dataset. Note this is a 33TB array but we can now generate the (lazily) regridded dataset instantaneously.