Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for invalid data values before staging and integrate removal of observation or fix for observation #21

Open
julietcohen opened this issue Apr 18, 2023 · 32 comments
Labels
good first issue Good for newcomers

Comments

@julietcohen
Copy link
Collaborator

Check for NaN and inf values present in input data before staging. If these values are present, staging can execute fine, but they are an issue for rasterization. These values result in invalid values in raster summary stats written to raster_summary.csv, and raster downsampling is not possible, resulting in failure to write rasters at lower resolutions.

@julietcohen
Copy link
Collaborator Author

Within footprints: Check for multipolygons, or multiple singular polygons. Consider splitting the multipolygons into singular polygons, and combining the multiple singular polygons into one mega-footprint that is just a singular polygon.

Within IWP detections: Check for multipolygons. These are currently dealt with by just not processing that geometry. We should instead integrate way to split them into multiple singular polygons before processing.

@julietcohen julietcohen changed the title Check for invalid data values before staging Check for invalid data values before staging and integrate removal of observation or fix for observation May 22, 2023
@julietcohen
Copy link
Collaborator Author

julietcohen commented Jul 24, 2023

Check for and remove invalid geometries, using something like: gdf = gdf[gdf.geometry.is_valid]

@julietcohen
Copy link
Collaborator Author

Check for and split polygons that cross the antimeridian. If not split or removed, these polygons become distorted in the CRS of the TMS.

@julietcohen
Copy link
Collaborator Author

If deduplicating, check that the config option deduplicate_keep_rules is set to:

  1. a property that exists in the input data for neighbors method, or footprint file for footprint method
  2. either larger or smaller for comparison operator

@julietcohen
Copy link
Collaborator Author

Note that removing the invalid values for input data is only necessary for the attributes that are being visualized. For example, when visualizing the lake size time series data from Ingmar, nan values only need to be removed from permanent_water and seasonal_water in order to execute all rasterization successfully, because those are the only attributes that become bands.

@julietcohen julietcohen added the good first issue Good for newcomers label Jan 29, 2024
@westminsterabi westminsterabi self-assigned this Feb 1, 2024
@julietcohen
Copy link
Collaborator Author

In datasets that contain multipolygons, I have used geopandas explode() to create a new row for each individual polygon.

@westminsterabi
Copy link
Contributor

“Check for NaN and inf values present in input data before staging.” I’m thinking that should happen around here? are those NaN and inf values part of the geometry? It seems like that might just be a one-line gdf = gdf[gdf.geometry.is_valid] unless I’m missing something? If they’re not part of the geometry, which fields should I be checking?

@westminsterabi
Copy link
Contributor

How would splitting a footprint multipolygon into multiple individual polygons affect the later steps? Would we still be able to clip and deduplicate as normal?

@westminsterabi
Copy link
Contributor

So the list of desired validations is:

  • remove NaN and inf values before staging
  • split polygons that cross the antimeridian
  • split footprints containing multipolygons into multiple individual polygons (?)
  • combine multiple individual polygons in footprint to one big polygon (?)
  • check that geometries are valid using the built in is_valid
  • validate config deduplication rules

@westminsterabi westminsterabi linked a pull request Feb 5, 2024 that will close this issue
@westminsterabi
Copy link
Contributor

It was not my intent that that PR should close the issue, because it's just one piece of the larger whole

@julietcohen
Copy link
Collaborator Author

Thank you for diving into this issue, @westminsterabi!

Answers to your questions:

“Check for NaN and inf values present in input data before staging.” I’m thinking that should happen around here? are those NaN and inf values part of the geometry? It seems like that might just be a one-line gdf = gdf[gdf.geometry.is_valid] unless I’m missing something? If they’re not part of the geometry, which fields should I be checking?

There may be invalid values (some examples are NaN, inf, None, etc.) in any attribute of an observation or the geometry itself. These values are problems for the viz workflow if they are present in an attribute we are visualizing or if they are present in the geometry. A set list of attributes that may contain these values does not exist because every dataset has a unique set of attributes.

Checking if a geometry is invalid is a good step in the process of checking if the geometries will work in the viz workflow. I wonder if inserting gdf = gdf[gdf.geometry.is_valid] like I suggested is sufficient. It would be good to branch from develop, insert this line to check geometries in a location that seems likely to work, then test if it catches every type of invalid geometry. For example, a geometry may be invalid by being None, by being a bowtie geometry, or it also may be invalid in another way we would not expect. An example of an unexpected geometry that may not technically be invalid by geopandas standards but still does not work in our workflow would be one that crosses the antimeridian, like I found in the permafrost and ground ice layer. Testing can be done by processing several different dataset samples with the modified branch, and if the output tiles turn out the same as they would by processing the same data with the main branch after manually removing the geometries before inputting the data into the workflow, that is a great sign the branch should be merged!

How would splitting a footprint multipolygon into multiple individual polygons affect the later steps? Would we still be able to clip and deduplicate as normal?

Splitting a multipolygon is actually simple with geopandas.explode(), it just splits each polygon within the multipolygon into its own row, and "repeats" the attributes from the original multipolygon into the subsequent new rows so there are no missing attribute values for any of the singular polygons. I did this for lake area time series dataset. The script where I did this can be found here in the metadata package hosted by the ADC, but because it is not published yet, you will not have access to it at this time. So I pasted the cleaning script below :) The way explode works can be tested out on any file that contains multipolygons. I also found them in the ARCADE layer input dataset, but I haven't had time to visualize that one yet! Splitting multipolygons is necessary for visualizing because we drop geometries that are not singular polygons. The clipping and deduplication and such should happen normally in the later steps! But we only clip to footprint if we have those, which we do have for the ice-wedge polygon data. Otherwise, we deduplicate by neighbor method, or do not deduplicate at all if we don't need to.

cleaning script for lake area time series dataset
# Prepare Lake Area Time Series data for visualization workflow:
# 1. Remove NA values from the 2 attributes of interest
# 2. Merge the two lake area datasets from Ingmar Nitze
# 3. "explode" all multipolygons into single polygons
# 4. Parse the most recent 5 years into different geopackages 
#    to process into independent data layers in a loop, while 
#    documenting the percentiles for 2 attributes of interest, 
#    to use as max values in the range for the config of the 
#    vizualization workflow

# Author: Juliet Cohen
# Date: 12/20/23
# Server: Datateam
# Conda Environment: "geospatial"

import geopandas as gpd
import xarray as xar
import pandas as pd
import numpy as np

# Lake area time series data GeoPackage
gdf = gpd.read_file("/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/INitze_Lakes_merged_v3_PDG_Testset.gpkg")
# Lake area time series data NetCDF
area = xar.open_dataset("/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/Lakes_IngmarPDG_annual_area.nc")

# convert the NetCDF file into a dataframe, with columns for:
# ID_merged, year, permanent_water, seasonal_water
area_df = area.to_dataframe().reset_index()
# drop NA values (important to do for this file only bc these permanent_water and seasonal_water are what will be visualized)
area_df.dropna(subset=['permanent_water', 'seasonal_water'], inplace = True)

# merge the dataframe, retaining only ID_merged values that exist in both files
# the spatial dataframe must be the left argument to retain the geometries
merged_data = gdf.merge(right = area_df,
                        how = 'inner', # only retain lakes that also have data for permanent water and seasonal water 
                        on = 'ID_merged')

print("Merge complete. Parsing years.")
# Save the most recent 5 years separately as gpkg's for 
# input into the viz-workflow because each year will be a layer
last_5_yrs = [2017, 2018, 2019, 2020, 2021]
for yr in last_5_yrs:
    merged_data_annual = merged_data[merged_data['year']==yr]
    merged_data_annual_exploded = merged_data_annual.explode(index_parts = True)
    # NOTE: may want to change index_part=False above, but default is True and it worked so if it aint broke dont fix it
    output_path = f"/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/exploded_annual_2017-2021/yr{yr}/merged_lakes_{yr}_exploded.gpkg"
    merged_data_annual_exploded.to_file(output_path, driver = "GPKG")
    print(f"Saved combined and exploded data for {yr} to:\n{output_path}.")
    # calculate and document the 99.99th percentile for permanent_water for viz workflow config range max val
    pw_perc = np.percentile(merged_data_annual_exploded['permanent_water'], 99.99)
    print(f"{yr} permanent water 99.99th percentile: {pw_perc}")
    # calculate and document the 95th percentile for seasonal_water for viz workflow config range max val
    sw_perc = np.percentile(merged_data_annual_exploded['seasonal_water'], 95)
    print(f"{yr} seasonal water 95th percentile: {sw_perc}")

print("Script complete.")

@julietcohen
Copy link
Collaborator Author

It's unclear if the multipolygons I discovered in some of the footprint files are actually an issue and how the workflow treats those. It is best if all the footprints we receive are already singular polygons. In my opinion, we need to hold data submitters to some standards, and we cannot do all their cleaning work for them.

@westminsterabi
Copy link
Contributor

Do we have some way of specifying which attributes are going to be visualized at ingestion time? Otherwise, how do we know which attributes we have to clean and which can be left with NaNs etc?

Testing can be done by processing several different dataset samples with the modified branch, and if the output tiles turn out the same as they would by processing the same data with the main branch after manually removing the geometries before inputting the data into the workflow, that is a great sign the branch should be merged!

Do we have such a dataset available? Is there a documented way for me to create one?

@westminsterabi
Copy link
Contributor

Splitting a multipolygon is actually simple with geopandas.explode(), it just splits each polygon within the multipolygon into its own row, and "repeats" the attributes from the original multipolygon into the subsequent new rows so there are no missing attribute values for any of the singular polygons.

I guess my question here is whether the footprint clipping function we use would still work with an exploded multipolygon, since multipolygons must be disjoint in shapely. It sounds like you're saying that the exploded multipolygon should behave the same and allow for clipping, even though it would be multiple polygons with no overlap?

I also have a little confusion on this requirement re.

Within footprints: Check for multipolygons, or multiple singular polygons. Consider splitting the multipolygons into singular polygons, and combining the multiple singular polygons into one mega-footprint that is just a singular polygon.

Is the suggestion here to explode the polygons and then merge them? From my understanding, that won't work because of the shapely enforcement on multipolygons being disjoint. So we wouldn't be able to combine them.

@westminsterabi
Copy link
Contributor

I wrote a test for the multipolygon footprint case and it appears that multipolygons are not an issue.

@westminsterabi
Copy link
Contributor

westminsterabi commented Feb 7, 2024

  • remove NaN and inf values before staging
  • split polygons that cross the antimeridian
  • split footprints containing multipolygons into multiple individual polygons (?)
  • combine multiple individual polygons in footprint to one big polygon (?)
  • check that geometries are valid using the built in is_valid
  • validate config deduplication rules

@westminsterabi
Copy link
Contributor

What do we want to do with NaN and inf values? replace them with 0, or delete those rows?

@julietcohen
Copy link
Collaborator Author

We cannot replace them with 0 because in some datasets, 0 is a real value.

  • Example: lake change dataset, we visualize the size change rate, which can be negative, 0, or positive
    So we need to delete the row that contains an invalid value for the attribute(s) of interest. When I was cleaning data for the lake change dataset, it was helpful to archive all the rows that were removed due to an invalid value. That script for this is documented in this issue comment.
  • the way I archived the rows that had missing data was complex so the output files were saved in a directory that matched the directory hierarchy of Ingmar's lake change data so it was understandable to him

@julietcohen
Copy link
Collaborator Author

Do we have some way of specifying which attributes are going to be visualized at ingestion time? Otherwise, how do we know which attributes we have to clean and which can be left with NaNs etc?

The only ways we "know" which attributes we want to visualize is one of the following:

  1. the person who produced the dataset tells us
  2. we decide out of a certain dataset whichever attribute would be possible to visualize as well as best communicate the dataset
  • (example: if we recieve a dataset of lake change data that spans 2000-2020, and one attribute is lake start size in 2000 and another is lake end size in 2020 and the last attribute is change in lake size between those years, we want to visualize the change even though we technically could visualize any of them
  1. none of the existing attributes best communicate what we want to communicate with the layer so we want to use one of the custom statistics in the viz-workflow
  • example: ice-wedge polygon dataset contains many attributes, but we want to communicate the percent coverage of a region that is detected as being ice-wedge polygons so we use coverage

@julietcohen
Copy link
Collaborator Author

In the config file, we specify which attributes we want to visualize by defining each as a statistic. We can specify multiple statistics per dataset and each becomes a raster layer. See here in the ConfigManager.py. These need to be entered into the config before the visualization run is started.

@julietcohen
Copy link
Collaborator Author

Do we have such a dataset available? Is there a documented way for me to create one?

The sample datasets in the Google drive I added you to (along with other Google fellows) contains datasets I like to use for testing.

@julietcohen
Copy link
Collaborator Author

I guess my question here is whether the footprint clipping function we use would still work with an exploded multipolygon, since multipolygons must be disjoint in shapely. It sounds like you're saying that the exploded multipolygon should behave the same and allow for clipping, even though it would be multiple polygons with no overlap?

I have not tested how the footprint clipping works with the relatively few footprints that were multipolygons, versus the vast majority that were singular polygons, because we did not know that Chandi's team's footprint files were a mix of singular and multipolygons. I told them about this and requested they ensure all are singular polygons next time.

I was saying that when we explode multipolygons that are the input geometries for the viz-worfklow, rather than the footprints, then they should behave the same for clipping (as in, being clipped by a singular footprint file) because there is no limit to the number of geometries in an input file to the viz-workflow. So if we recieve an input file that contains ice-wedge polygon detections that are multipolygons, and we first explode those into singular polygons, then feed them into the viz-worflow and deduplicate using the footprint method, they should be clipped the same as if we originally received singular polygons in the first place.

@julietcohen
Copy link
Collaborator Author

Is the suggestion here to explode the polygons and then merge them? From my understanding, that won't work because of the shapely enforcement on multipolygons being disjoint. So we wouldn't be able to combine them.

My suggestion was to take a footprint file with a multipolygons and explode it, then create a new footprint from those, so basically find the minimum bounding box that encompasses both footprint geometries. I have not done it in practice, was just documenting the best way I could think to fix these rare multipolygon footprints if we choose to tackle that. Elias on Chandi's team created the ice-wedge polygon footprints, and their team should do so for any newer version of the dataset as well. So if they can ensure that all of them are singular polygons in the first place, that saves us from having to clean their data.

@julietcohen
Copy link
Collaborator Author

One of the work items on that issue is remove NaN and inf values before staging, and Juliet clarified that this is only applicable to fields that are being visualized. What do you all think of adding a config field that is either (a) viz_fields, a list of fields that are to be visualized and those get cleaned of invalid data or (b) non_viz_fields, same thing but the inverse?

Each attribute that we are visualizing already has a spot in the config: it is listed as its own statistic and the exact name of the property is specified in the property part of defining each statistic. So if we pull all the properties from the config then check only those for invalid values, that would be equivalent to defining a list of attributes like viz_fields as you suggested.

@westminsterabi
Copy link
Contributor

My suggestion was to take a footprint file with a multipolygons and explode it, then create a new footprint from those, so basically find the minimum bounding box that encompasses both footprint geometries. I have not done it in practice, was just documenting the best way I could think to fix these rare multipolygon footprints if we choose to tackle that. Elias on Chandi's team created the ice-wedge polygon footprints, and their team should do so for any newer version of the dataset as well. So if they can ensure that all of them are singular polygons in the first place, that saves us from having to clean their data.

From my testing, a multipolygon footprint will behave as expected (i.e. clip properly and label geometries outside all polygons just as it would for a singular polygon), so I don't think we need to worry about this. I will look into exploding IWPs that are multipolygons.

The sample datasets in the Google drive I added you to (along with other Google fellows) contains datasets I like to use for testing.

How do I know which geometries are invalid and should be manually removed? How do you usually manually remove them for testing purposes?

@westminsterabi
Copy link
Contributor

the way I archived the rows that had missing data was complex so the output files were saved in a directory that matched the directory hierarchy of Ingmar's lake change data so it was understandable to him

I think developing a robust solution for archiving is outside the scope of this issue. I will make sure that the deletion is logged for now, and we can look into filing another issue for the archiving piece.

@julietcohen
Copy link
Collaborator Author

Yeah I agree!

@julietcohen
Copy link
Collaborator Author

You know that a row in a geodataframe contains invalid values like inf or None by checking with functions like I do in the cleaning script folded in this comment. For example, using isna(). Usually I will quickly check if a dataset contains invalid values just in a notebook before inputting into the viz workflow, and if I do not check and I just go ahead and process a dataset that does contain invalid values, it errors during rasterization.

If your question was how do you know what dataset to use for testing your branch to remove invalid values, you can check out the files in the Google Drive within lake_change_GD_original. I added a README to this drive that describes each dataset sample and a suggestion for what it would be helpful to test.

@julietcohen julietcohen moved this to In Progress in Visualization Workflow Feb 9, 2024
@westminsterabi
Copy link
Contributor

westminsterabi commented Feb 21, 2024

My work on this issue is in feature-21-data-validation. I did not have time to verify my solution to the AM polygon issue. Juliet raised a valid concern that the CRS conversion may be best done AFTER the split, not before. If this is the case, line_crosses_am in TileStager.py should be updated to use metres as units, instead of degrees. Whoever finishes this work should use the AM polygon data to verify that the results visually match what is expected.

To the extent I was able to verify, the clean_viz_fields function successfully cleans the lake change data and allows it to proceed to rasterization without intervention.. This should be double-checked, probably with other datasets as well.

The other validations should be good to go, and are covered by test cases where feasible. I'm sorry that I have to leave the project so abruptly. I hope that at least some of the work I did here is usable.

@julietcohen
Copy link
Collaborator Author

Thank you for your contributions, Abi! Your work will certainly be useful :)

A note on using geopandas explode to separate MultiPolygon geoms into single Polygon geoms that just occurred to me: This approach duplicates the attributes of the original row into the newly created row, which is fine for some attributes, but I suspect this may be a problem if there are attributes such as area or perimeter, because those were originally derived from the MultiPolygon geom, so when we change the geom to a subset of the original, those are no longer accurate. So when we use explode, we may then need to somehow detect which attributes are invalidated because they are specific to the original geom and would not be accurate when applied to the new geom, so the value needs to become NaN if that is the case (or whatever the no-data-val is for the dataset). Then after, we would want to remove the rows that have NaN for the attributes of interest.

@julietcohen
Copy link
Collaborator Author

A programmatic way to split polygons that cross the antimeridian has been documented here in the ticket for processing the permafrost and ground ice dataset. As noted there, this approach was specific to that dataset's CRS, and should be generalized to work with any input CRS to integrate this step into viz-staging. Proof that the splitting works is pictured in the comment below.

@julietcohen
Copy link
Collaborator Author

Splitting and buffering polygons that cross the antimeridian may be necessary to apply to the lake change dataset for UTM zone(s) that border the antimeridian. See this update to the lake change dataset for details.

@robyngit robyngit moved this from In Progress to Backlog in Visualization Workflow Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
Status: Backlog
Status: No status
Development

Successfully merging a pull request may close this issue.

2 participants