Check for invalid data values before staging and integrate removal of observation or fix for observation #21
Within footprints: check for multipolygons, or multiple singular polygons. Consider splitting the multipolygons into singular polygons, and combining the multiple singular polygons into one mega-footprint that is just a singular polygon.

Within IWP detections: check for multipolygons. These are currently dealt with by just not processing that geometry. We should instead integrate a way to split them into multiple singular polygons before processing.
Check for and remove invalid geometries, using something like:
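The snippet that originally followed was cut off; as a stand-in, here is a minimal sketch using geopandas and shapely (the file path is a placeholder, and `make_valid` assumes Shapely built against GEOS >= 3.8):

```python
import geopandas as gpd
from shapely.validation import make_valid

gdf = gpd.read_file("input.gpkg")  # placeholder path

# Simplest approach: drop rows whose geometry is invalid.
valid_only = gdf[gdf.geometry.is_valid]

# Alternative: attempt to repair invalid geometries instead of
# dropping them.
repaired = gdf.copy()
repaired["geometry"] = repaired.geometry.apply(make_valid)
```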
Check for and split polygons that cross the antimeridian. If not split or removed, these polygons become distorted in the CRS of the TMS.
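For illustration only (not the workflow's actual implementation): one way to split such a polygon, assuming EPSG:4326 coordinates, no interior rings, and a polygon spanning less than 180° of longitude:

```python
from shapely.affinity import translate
from shapely.geometry import Polygon, box

def split_at_antimeridian(poly: Polygon) -> list:
    """Split a lon/lat polygon crossing the antimeridian into pieces
    that each stay within [-180, 180] longitude (sketch only)."""
    # Shift western-hemisphere vertices east by 360 degrees so the
    # polygon becomes contiguous in a 0..360 longitude frame.
    shifted = Polygon(
        [(x + 360 if x < 0 else x, y) for x, y in poly.exterior.coords]
    )
    # Cut at longitude 180, which is the antimeridian in this frame.
    west = shifted.intersection(box(0, -90, 180, 90))
    east = shifted.intersection(box(180, -90, 360, 90))
    # Move the eastern piece back to negative longitudes.
    east = translate(east, xoff=-360)
    return [g for g in (west, east) if not g.is_empty]
```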
If deduplicating, check that the config option
Note that removing the invalid values from input data is only necessary for the attributes that are being visualized. For example, when visualizing the lake size time series data from Ingmar,
In datasets that contain multipolygons, I have used geopandas `explode()`.
“Check for NaN and inf values present in input data before staging.” I’m thinking that should happen around here? Are those NaN and inf values part of the geometry? It seems like that might just be a one-line `gdf = gdf[gdf.geometry.is_valid]` unless I’m missing something? If they’re not part of the geometry, which fields should I be checking?
How would splitting a footprint multipolygon into multiple individual polygons affect the later steps? Would we still be able to clip and deduplicate as normal?
So the list of desired validations is:
It was not my intent that that PR should close the issue, because it's just one piece of the larger whole.
Thank you for diving into this issue, @westminsterabi! Answers to your questions:
There may be invalid values (some examples are NaN, inf, None, etc.) in any attribute of an observation or the geometry itself. These values are problems for the viz workflow if they are present in an attribute we are visualizing or if they are present in the geometry. A set list of attributes that may contain these values does not exist because every dataset has a unique set of attributes. Checking if a geometry is invalid is a good step in the process of checking if the geometries will work in the viz workflow. I wonder if inserting
Splitting a multipolygon is actually simple with geopandas `explode()`: it just splits each polygon within the multipolygon into its own row, and "repeats" the attributes from the original multipolygon into the new rows, so there are no missing attribute values for any of the singular polygons. I did this for the lake area time series dataset. The script where I did this can be found here in the metadata package hosted by the ADC, but because it is not published yet, you will not have access to it at this time. So I pasted the cleaning script for the lake area time series dataset below :)
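(The pasted script did not survive extraction; below is a minimal sketch of the `explode()` step described above, not the original cleaning script. The path is a placeholder, and `index_parts=False` needs geopandas >= 0.10.)

```python
import geopandas as gpd

gdf = gpd.read_file("lake_area_time_series.gpkg")  # placeholder path

# explode() gives each part of a multipolygon its own row and repeats
# the parent row's attributes, so no singular polygon is left with
# missing attribute values.
exploded = gdf.explode(index_parts=False).reset_index(drop=True)

assert (exploded.geom_type == "Polygon").all()
```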
It's unclear if the multipolygons I discovered in some of the footprint files are actually an issue, and how the workflow treats those. It is best if all the footprints we receive are already singular polygons. In my opinion, we need to hold data submitters to some standards, and we cannot do all their cleaning work for them.
Do we have some way of specifying which attributes are going to be visualized at ingestion time? Otherwise, how do we know which attributes we have to clean and which can be left with NaNs, etc.?
Do we have such a dataset available? Is there a documented way for me to create one?
I guess my question here is whether the footprint clipping function we use would still work with an exploded multipolygon, since multipolygons must be disjoint in shapely. It sounds like you're saying that the exploded multipolygon should behave the same and allow for clipping, even though it would be multiple polygons with no overlap? I'm also a little confused about this requirement re.
Is the suggestion here to explode the polygons and then merge them? From my understanding, that won't work because shapely enforces that the polygons within a multipolygon be disjoint, so we wouldn't be able to combine them.
I wrote a test for the multipolygon footprint case and it appears that multipolygons are not an issue.
What do we want to do with NaN and inf values? Replace them with 0, or delete those rows?
We cannot replace them with 0 because in some datasets, 0 is a real value.
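A minimal sketch of the delete-the-rows approach; the file path and the `viz_fields` list are hypothetical placeholders:

```python
import geopandas as gpd
import numpy as np

gdf = gpd.read_file("lake_change.gpkg")  # placeholder path
viz_fields = ["perm_water", "seasonal_water"]  # placeholder attributes

# Normalize +/-inf to NaN, then drop any row with a NaN in a
# visualized attribute; real zeros are left untouched.
gdf[viz_fields] = gdf[viz_fields].replace([np.inf, -np.inf], np.nan)
gdf = gdf.dropna(subset=viz_fields)
```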
The only ways we "know" which attributes we want to visualize are the following:
In the config file, we specify which attributes we want to visualize by defining each as a statistic. We can specify multiple statistics per dataset, and each becomes a raster layer. See here in ConfigManager.py. These need to be entered into the config before the visualization run is started.
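For illustration, a statistics entry might look like the sketch below; the key names are assumptions, so check ConfigManager.py for the authoritative schema:

```python
# Illustrative config fragment; key names are assumptions.
config = {
    "statistics": [
        {
            "name": "iwp_coverage",             # becomes one raster layer
            "property": "area_per_pixel_area",  # attribute to visualize
            "aggregation_method": "sum",
        },
        {
            "name": "change_rate",
            "property": "change_rate",
            "aggregation_method": "mean",
        },
    ]
}
```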
The Google Drive I added you to (along with the other Google fellows) contains sample datasets I like to use for testing.
I have not tested how the footprint clipping works with the relatively few footprints that were multipolygons, versus the vast majority that were singular polygons, because we did not know that Chandi's team's footprint files were a mix of singular polygons and multipolygons. I told them about this and requested they ensure all are singular polygons next time. I was saying that when we explode multipolygons that are the input geometries for the viz-workflow, rather than the footprints, they should behave the same for clipping (as in, being clipped by a singular footprint file) because there is no limit to the number of geometries in an input file to the viz-workflow. So if we receive an input file that contains ice-wedge polygon detections that are multipolygons, and we first explode those into singular polygons, then feed them into the viz-workflow and deduplicate using the footprint method, they should be clipped the same as if we had originally received singular polygons in the first place.
My suggestion was to take a footprint file with multipolygons and explode it, then create a new footprint from those: basically, find the minimum bounding box that encompasses both footprint geometries. I have not done it in practice; I was just documenting the best way I could think of to fix these rare multipolygon footprints if we choose to tackle that. Elias on Chandi's team created the ice-wedge polygon footprints, and their team should do so for any newer version of the dataset as well. So if they can ensure that all of them are singular polygons in the first place, that saves us from having to clean their data.
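A sketch of that suggestion with geopandas (untested, as noted above; the path is a placeholder):

```python
import geopandas as gpd

footprints = gpd.read_file("footprint.gpkg")  # placeholder path

# Explode any multipolygon footprints, merge every part into one
# geometry, and take its envelope: the minimum axis-aligned bounding
# box encompassing all parts becomes the new singular footprint.
parts = footprints.explode(index_parts=False)
mega_footprint = parts.geometry.unary_union.envelope
```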
Each attribute that we are visualizing already has a spot in the config: it is listed as its own statistic, and the exact name of the property is specified in the `property` field of each statistic definition. So if we pull all the properties from the config and then check only those for invalid values, that would be equivalent to defining a list of attributes like `viz_fields` as you suggested.
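A sketch of that idea, reusing the illustrative config structure from the earlier snippet (key names remain assumptions):

```python
import numpy as np

def clean_config_properties(gdf, config):
    """Drop rows with NaN/inf in any attribute that the config names
    as a statistic's `property` (sketch; config keys are assumed)."""
    viz_fields = [stat["property"] for stat in config["statistics"]]
    gdf[viz_fields] = gdf[viz_fields].replace([np.inf, -np.inf], np.nan)
    return gdf.dropna(subset=viz_fields)
```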
From my testing, a multipolygon footprint will behave as expected (i.e. clip properly and label geometries outside all polygons just as it would for a singular polygon), so I don't think we need to worry about this. I will look into exploding IWPs that are multipolygons.
How do I know which geometries are invalid and should be manually removed? How do you usually manually remove them for testing purposes?
I think developing a robust solution for archiving is outside the scope of this issue. I will make sure that the deletion is logged for now, and we can look into filing another issue for the archiving piece.
Yeah, I agree!
You know that a row in a geodataframe contains invalid values like `NaN` or `inf` by inspecting the attribute columns directly (see the sketch below). If your question was how you know which dataset to use for testing your branch to remove invalid values, you can check out the files in the Google Drive within
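One way to flag such rows, as a sketch (the path is a placeholder; only numeric columns are checked here):

```python
import geopandas as gpd
import numpy as np

gdf = gpd.read_file("test_dataset.gpkg")  # placeholder path

# Flag rows with NaN or +/-inf in any numeric attribute column.
numeric = gdf.select_dtypes(include=[np.number])
bad = numeric.isna().any(axis=1) | np.isinf(numeric).any(axis=1)
print(f"{bad.sum()} of {len(gdf)} rows contain invalid values")
```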
My work on this issue is in `feature-21-data-validation`. I did not have time to verify my solution to the antimeridian (AM) polygon issue. Juliet raised a valid concern that the CRS conversion may be best done AFTER the split, not before. If this is the case, `line_crosses_am` in TileStager.py should be updated to use metres as units instead of degrees. Whoever finishes this work should use the AM polygon data to verify that the results visually match what is expected. To the extent I was able to verify, the `clean_viz_fields` function successfully cleans the lake change data and allows it to proceed to rasterization without intervention. This should be double-checked, probably with other datasets as well. The other validations should be good to go, and are covered by test cases where feasible. I'm sorry that I have to leave the project so abruptly. I hope that at least some of the work I did here is usable.
Thank you for your contributions, Abi! Your work will certainly be useful :) A note on using geopandas
A programmatic way to split polygons that cross the antimeridian has been documented here in the ticket for processing the permafrost and ground ice dataset. As noted there, this approach was specific to that dataset's CRS, and should be generalized to work with any input CRS to integrate this step into
Splitting and buffering polygons that cross the antimeridian may be necessary to apply to the lake change dataset for UTM zone(s) that border the antimeridian. See this update to the lake change dataset for details.
Check for `NaN` and `inf` values present in input data before staging. If these values are present, staging can execute fine, but they are an issue for rasterization. These values result in invalid values in the raster summary stats written to `raster_summary.csv`, and raster downsampling is not possible, resulting in failure to write rasters at lower resolutions.