Display Ingmar Nitze's lake change dataset #28
I processed the sample data from Ingmar using the same workflow we created for the IWP polygons, creating both PNG web tiles and 3D tiles. Everything ran very smoothly. The output is currently displayed on the demo portal: Notes:
|
New data Package
|
Image visualization suggestions
This styling works nicely with a black background map (CartoDB Dark Matter, or similar) |
Image visualization suggestions part 2
RasterFile: They are similar to Webb et al., 2022 (Surface water index Trend)
Palette: RdBu |
New data added
Ingmar uploaded 5 zip files that contain lake change data to a Google Drive folder here.
Per our visualization meeting discussion on 4/3, the highest priority is to process the data in one of the 5 directories, taking Ingmar's color suggestions into consideration. Next, we will move on to processing the other 4 directories and finally Ingmar's newer data, documented in issue #37. Update: These 5 directories have been uploaded to the NCEAS datateam server: |
Large quantity of staged tiles from 1 input lake change file
Initially tried to stage all 6
script
# Process smallest directory in Ingmar's Google Drive
# issue: https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/28
# follow color scheme suggested in issue
# data: https://drive.google.com/drive/folders/1JxSBRs2nikB_eYxtwEatZ06bhusAN5nL
# intent is to process all the data files in this drive,
# then move on to process Ingmar's higher temporal resolution data
# using venv arcade_layer with local installs for:
# viz-staging, viz-raster, & viz-3dtiles
# imports -----------------------------------------------------
# input data import
from pathlib import Path
# staging
import pdgstaging
from pdgstaging import TileStager
# rasterization
import pdgraster
from pdgraster import RasterTiler
# visual checks
import geopandas as gpd
# logging
from datetime import datetime
import logging
import logging.handlers
import os
# data ---------------------------------------------------------
base_dir = Path('/home/jcohen/lake_change_GD_workflow/lake_change_GD/data_products_32635-32640/32640')
filename = 'lake_change.gpkg'
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
print(f"Input file(s): {input}")
# logging config -----------------------------------------------
handler = logging.handlers.WatchedFileHandler(
os.environ.get("LOGFILE", "/home/jcohen/lake_change_GD_workflow/log.log"))
formatter = logging.Formatter(logging.BASIC_FORMAT)
handler.setFormatter(formatter)
root = logging.getLogger()
root.setLevel(os.environ.get("LOGLEVEL", "INFO"))
root.addHandler(handler)
# Staging ------------------------------------------------------
print("Staging...")
stager = TileStager({
"deduplicate_clip_to_footprint": False,
"dir_input": "/home/jcohen/lake_change_GD_workflow/lake_change_GD/data_products_32635-32640/32640",
"ext_input": ".gpkg",
"dir_staged": "staged/",
"dir_geotiff": "geotiff/",
"dir_web_tiles": "web_tiles/",
"filename_staging_summary": "staging_summary.csv",
"filename_rasterization_events": "raster_events.csv",
"filename_rasters_summary": "raster_summary.csv",
"filename_config": "config",
"simplify_tolerance": 0.1,
"tms_id": "WGS1984Quad",
"z_range": [
0,
15
],
"geometricError": 57,
"z_coord": 0,
"statistics": [
{
"name": "coverage",
"weight_by": "area",
"property": "area_per_pixel_area",
"aggregation_method": "sum",
"resampling_method": "average",
"val_range": [
0,
1
],
"palette": [
"#ff0000", # red (check that these color codes are appropriate)
"#0000ff" # blue (source: http://web.simmons.edu/~grovesd/comm244/notes/week3/css-colors)
],
"nodata_val": 0,
"nodata_color": "#ffffff00" # change?
},
],
"deduplicate_at": [
"raster"
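],
"deduplicate_method": "neighbor",
# NOTE: "deduplicate_keep_rules" was originally defined twice in this dict.
# Python keeps only the last definition of a repeated key, so the earlier
# [["Date", "larger"]] rule never took effect and is omitted here.
"deduplicate_keep_rules": [["staging_filename", "larger"]],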
"deduplicate_overlap_tolerance": 0.1,
"deduplicate_overlap_both": False,
"deduplicate_centroid_tolerance": None
})
for file in input:
    print(f"Staging file {file}...")
    stager.stage(file)
    print(f"Completed staging file {file}.")
print("Staging complete.") |
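For context on why a single input file can produce so many staged tiles at the highest zoom level, here is a rough sketch of how tile counts grow with `z`. It assumes a quad-tree grid that starts with 2 columns and 1 row at zoom 0 (the usual layout for a WGS84 quad Tile Matrix Set) and doubles in each direction per level. The function names are hypothetical and are not part of `pdgstaging`.

```python
# Sketch: global tile count and tile size per zoom level for a WGS84 quad grid.
# Assumption: 2 columns x 1 row at zoom 0, doubling in both directions per level.

def tiles_at_zoom(z: int) -> int:
    """Total number of tiles covering the globe at zoom level z."""
    cols = 2 * 2 ** z
    rows = 2 ** z
    return cols * rows


def tile_size_degrees(z: int) -> float:
    """Width (and height) of a single tile in degrees at zoom level z."""
    return 180.0 / 2 ** z


for z in (0, 5, 10, 15):
    print(f"z={z}: {tiles_at_zoom(z)} tiles, {tile_size_degrees(z)} degrees per tile")
```

With `z_range` ending at 15, each tile spans only a few thousandths of a degree, so even one UTM zone of lake polygons touches a very large number of tiles.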
Ray workflow on Delta server for 1 UTM zone
Staging
Raster highest
Raster lower
Web tiles
Web tile visualization on local Cesium: |
To do:
and
These statements are likely resulting from errors in rows of the
|
Progress towards adapting colors to attribute of interest
Ran the workflow through web-tiling with Config is here. Notes:
|
Infinity and NaN values present in input data
I determined the source of those After discovering
check_inf_nan_values.py
Output: LC_inf_values.txt (number of inf values in each file)
Output: LC_nan_values.txt (number of NaN values in each file)
These values are present in several columns, including but not necessarily limited to:
@initze : Would you recommend that I remove rows with |
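As an illustration of the kind of check described above, here is a minimal sketch that counts `inf` and `NaN` values per column. It is not the contents of check_inf_nan_values.py. It uses a plain pandas DataFrame with made-up column names so that it runs without any input files; the assumption is that the same calls apply to the attribute columns of a GeoDataFrame.

```python
import numpy as np
import pandas as pd


def count_invalid(df: pd.DataFrame) -> pd.DataFrame:
    """Return the number of inf and NaN values in each column of df."""
    # np.isinf is only defined for numeric data, so restrict to numeric columns
    numeric = df.select_dtypes(include="number")
    n_inf = np.isinf(numeric).sum()
    # isna() does not treat inf as missing, so the two counts do not overlap
    n_nan = df.isna().sum()
    counts = pd.DataFrame({"n_inf": n_inf, "n_nan": n_nan})
    # non-numeric columns have no inf count; report them as 0
    return counts.fillna(0).astype(int)


# toy data with hypothetical column names
toy = pd.DataFrame({
    "rate_a": [1.0, np.inf, np.nan],
    "rate_b": [-np.inf, np.inf, 2.0],
    "label": ["x", None, "z"],
})
print(count_invalid(toy))
```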
Ingmar is looking into the source of the |
Hi @julietcohen.
Solution
Delete rows with NaN for now.
I hope that fixes your issue.
Cheers, Ingmar
|
Thanks for looking into this, Ingmar. I'll remove all rows with |
@tcnichol is ready to start moving this data to our datateam server. We need to discuss where he will store the data and how he will transfer it from delta (globus)? Estimated to be around 500GB, including a lot of intermediary products. |
He should store it in the same ~pdg/data staging directory that we've been using for IWP. Juliet had trouble getting globus to write directly there, which is why the ~jscohen account was created. There is also a |
@mbjones: Todd is curious if there is an update on the Globus --> Datateam data transfer situation. If we have enabled this for users without needing to give them access to |
I talked with Nick about moving the home directory of |
Great, thank you |
Update on cleaning lake change data before visualization: Cleaning the lake change data provided in November 2022, located in: Ingmar requested that the rows of each of the 46 In order to document the rows that contain
clean_lake_change_data.py
# Author: Juliet Cohen
# Overview:
# identify, document, and remove rows with an invalid value in any column (Na, inf, or -inf)
# in lake change input data from Ingmar Nitze,
# and save the cleaned files to a new directory
# using conda env perm_ground
import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path
import os
# collect all lake_change.gpkg filepaths in Ingmar's data
base_dir = Path('/home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/')
filename = 'lake_change.gpkg'
# To define each .gpkg file within each subdir as a string representation with forward slashes,
# use as_posix()
# The ** represents that any subdir string can be present between the base_dir and the filename
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
print(f"Collected {len(input)} lake_change.gpkg filepaths.")
# Overview of loop:
# 1. import each filepath as a gdf
# 2. document which rows have invalid value in any column (Na, inf, or -inf)
# as a separate csv for each input gpkg
# 3. drop any row with an invalid value
# 4. save as new lake change file
for path in input:
    print(f"Checking file {path}.")
    gdf = gpd.read_file(path)
    # first identify any rows with invalid values
    # to document which will be dropped for data visualization
    error_rows = []
    # first convert any infinite values to NA
    gdf.replace([np.inf, -np.inf], np.nan, inplace = True)
    for index, row in gdf.iterrows():
        if row.isna().any():
            error_rows.append(row)
    error_rows_df = pd.DataFrame(error_rows)
    # hard-code the start of the path to directory for the erroneous data
    filepath_start = "/home/jcohen/lake_change_GD_workflow/workflow_cleaned/error_data_documentation/"
    # next, pull the last couple parts of filepath to ID which lake_change.gpkg
    # is being processed, following Ingmar's directory hierarchy
    directory, filename = os.path.split(path)
    filepath_sections = directory.split(os.sep)
    relevant_sections = filepath_sections[-2:]
    partial_filepath = relevant_sections[0] + "/" + relevant_sections[1]
    full_filepath = filepath_start + partial_filepath + "/error_rows.csv"
    # make the subdirectories if they do not yet exist
    directory_path = os.path.dirname(full_filepath)
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    # save the df of rows with invalid values as a csv
    # save the index because it communicates to Ingmar which rows in his original data
    # contain invalid values
    error_rows_df.to_csv(full_filepath, index = True)
    print(f"Saved rows with invalid values for lake change GDF:\n{path}\nto file:\n{full_filepath}")
    # drop the rows with NA values in any column
    gdf.dropna(axis = 0, inplace = True)
    # save cleaned lake change file to new directory
    # hard-code the start of the path to directory for the cleaned data
    filepath_start = "/home/jcohen/lake_change_GD_workflow/workflow_cleaned/cleaned_files/"
    # next, pull the last couple parts of filepath to ID which lake_change.gpkg
    # is being processed, following Ingmar's directory hierarchy
    directory, filename = os.path.split(path)
    filepath_sections = directory.split(os.sep)
    relevant_sections = filepath_sections[-2:] + ['lake_change_cleaned.gpkg']
    filepath_end = relevant_sections[0] + "/" + relevant_sections[1] + "/" + relevant_sections[2]
    full_filepath = filepath_start + filepath_end
    print(f"Saving file to {full_filepath}")
    # make the subdirectories if they do not yet exist
    directory_path = os.path.dirname(full_filepath)
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    gdf.to_file(full_filepath, driver = "GPKG")
print("Cleaning complete.") |
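The row-by-row `iterrows()` loop in the script above can be slow on large files. As a possible alternative, here is a minimal sketch of the same "document, then drop" logic using a boolean mask. The function name `split_invalid_rows` and the toy columns are hypothetical, and the sketch was only reasoned through for a plain pandas DataFrame; the assumption is that it carries over to a GeoDataFrame.

```python
import numpy as np
import pandas as pd


def split_invalid_rows(df: pd.DataFrame):
    """Split df into (rows with any NaN/inf/-inf, rows without any)."""
    # treat +/-inf like missing values, without modifying the caller's frame
    checked = df.replace([np.inf, -np.inf], np.nan)
    # True for every row that has a missing value in at least one column
    invalid_mask = checked.isna().any(axis=1)
    # the original index is preserved in both outputs, so the documented
    # rows can still be traced back to the source file
    return df[invalid_mask], df[~invalid_mask]


# toy data with hypothetical column names and a non-default index
toy = pd.DataFrame(
    {"area": [1.5, np.inf, 3.0], "rate": [0.1, 0.2, np.nan]},
    index=[10, 11, 12],
)
error_rows, cleaned = split_invalid_rows(toy)
print(error_rows)
print(cleaned)
```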
Adjusting config option
|
Have you tried setting the |
Thanks for the feedback @robyngit! I'll re-start the workflow at the raster highest step rather than web-tiling, with I didn't think that the nodata value would be used in the raster highest step because searching for |
Maybe the raster highest step gets the |
Looks like I made it so that no data pixels are always zero in the
We would have to make it so that the |
Update on
|
Ingmar helpfully created a new issue to document his data cleaning to remove seawater and rivers: https://github.com/PermafrostDiscoveryGateway/landsattrend-pipeline/issues/8 |
Temporary pan-arctic dataset
Link
https://drive.google.com/drive/folders/18pC-FW9Nibmkcv7DPlzzT3YW4Aim0k7C?usp=sharing
Content
|
Thank you, @initze ! Well done. This data has been uploaded to Datateam at: |
I did an initial check of the new, filtered version of the lake change dataset: There are I focused on this dataset today because it makes sense to use this data for 4 different but related goals for the PDG project, which are of varying degrees of priority:
|
Visualized Sample of Parquet Data (filtered for seawater and rivers)
I used a subset of 10,000 polygons of the Lake Change data in parquet format as input into the visualization workflow with 3 stats:
They are up on the demo portal.
config
|
Update on deduplication and filtering seawater
Since I resolved this issue, here's the path forward for processing the lake change data:
Updates on dataset metrics
Ingmar helpfully provided some big picture dataset info:
|
Deduplication update
After some mapping in QGIS, and without doing systematic checks to confirm for sure, polygons that border or intersect the antimeridian seem to have been flagged as duplicates in Ingmar's lake change dataset when the neighbors deduplication approach was applied to 2 adjacent UTM zones, 32601-32602. UTM zone 32601 borders the antimeridian. The neighbors deduplication approach transforms the geometries from their original CRS into the one specified as This results in polygons in zone 32602 overlapping spatially with polygons from 32601 that are not actually in the region of overlap. Here's a screenshot from Jonas (a collaborator with Ingmar and Todd) who mapped the polygons that were identified as duplicates in red: Mapping the flagged duplicate polygons on top of the distorted zone 32601 shows suspicious overlap:
Next steps:
If this intersection with the antimeridian is indeed the cause of the incorrectly flagged duplicates (those that lie outside of the overlap between 2 adjacent UTM zones), then the neighbor deduplication should work well for all zones that do not intersect the antimeridian. Todd is looking into applying the approach to all UTM zones, and figuring out how to get around the issue for zone 32601. My recommendation is to identify which polygons do cross the antimeridian, split them, and buffer them slightly away from the line. Code for this was applied to the permafrost and ground ice dataset here. |
Hi, this is Jonas from AWI. I was initially asked by Todd regarding the files but i was not sure about the problem. But your post motivated me to look into this a bit further. I am not familiar with the details of the de-duplication but i understand that you reproject the UTM layers (32601, 32602) to EPSG:3857 to find the spatial overlaps, which causes the vertices on the other side of the antimeridian to wrap around? I guess you should try to project both datasets into a CRS where the coordinates do not wrap around at the antimeridian for these areas. I for example used EPSG:3832 to create the very screenshot you showed above to get rid of the polygon distortion. I didn't realize that this was actually the problem, so i didn't even mention this to Todd. I am optimistic that if you project into EPSG:3832 you'll get the correct results for the UTM zones bordering the antimeridian. Another option would be to just declare the datasets to be in a non-critical UTM zone (for example 32603 and 32604), basically translating all polygons to the east before reprojection, so the longitude coordinates do not wrap around in EPSG:3857. I guess the inaccuracy is negligible. After de-duplication, project back to 32603/04 and then translate back to 32601/02. But i guess the first option is preferable. Sorry if I misunderstood the problem or if i am totally off track here. |
Hi Jonas, thanks for your suggestions! You are correct that the CRS is configurable for the deduplication step, and I simply used the default projected CRS in my example script that I passed off to Todd. Todd and Ingmar wanted an example of how to execute the deduplication with adjacent UTM zones before the data is input into the visualization workflow. Given my example script and explicit parameters, your team can change the parameters as you see fit. The transformation to the new CRS, EPSG 3857, is indeed what causes the geometries to become deformed. However, this CRS is not the only CRS that causes this issue, and the deduplication step is not the only time we transform the data in the workflow. I have encountered similarly deformed geometries in other datasets when converting to EPSG 4326 (not projected), which we do during the initial standardization of the data during the "staging" step. That unprojected CRS is required for the Tile Matrix Set of our viz workflow's output geopackage and geotiff tilesets. This means that you may be able to deduplicate the lake change data prior to the visualization workflow with a different CRS and retain the original lake geometries that border the antimeridian, but they will likely be problematic in the same way when we create tilesets from them, requiring buffering from the antimeridian as a "cleaning"/preprocessing step before staging anyway.
Plots in python
Original CRS
import geopandas as gpd
import matplotlib.pyplot as plt
gdf = gpd.read_file("~/check_dedup_forIN/check_for_todd/check_0726/lake_change_32601.gpkg")
fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
4326 causes deformed geometries
gdf_4326 = gdf.to_crs(epsg = 4326, inplace = False)
fig, ax = plt.subplots(figsize=(10, 10))
gdf_4326.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
3857 deforms geometries
gdf_3857 = gdf.to_crs(epsg = 3857, inplace = False)
fig, ax = plt.subplots(figsize=(10, 10))
gdf_3857.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
3832
gdf_3832 = gdf.to_crs(epsg = 3832, inplace = False)
fig, ax = plt.subplots(figsize=(10, 10))
gdf_3832.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
This last plot shows the data transformed into your suggested CRS, and as you suggested it shows no wrapping around the world. However, the description of that CRS makes me wonder if the transformation to that CRS would slightly deform the geometries, since it appears that the suggested use area for that CRS does not contain UTM zone 32601. Please correct me if I am wrong. That said, this amount of deformation is likely negligible. I don't follow your last suggestion fully, but it does sound like a function I know exists in R, ST_ShiftLongitude.
Resources:
Documentation for the neighbor deduplication approach can be found here. The source code is here. |
Your concerns make sense (but i would personally argue the 3857 isn't really precise either but that's another can of worms 😉 ). I might find the time to look into this further, but for now i guess in the end it boils down to the origin of the reference system. From what i remember from my GIS lectures: In most predefined reference systems with roughly global coverage the longitude/x-origin is located at the 0° meridian with an extent of -180° to +180° in the lon-axis. If you project a polygon overlapping the antimeridian into these CRS, some vertices of those polygons will be transformed to having an x-coordinate close to -180 and others close to +180 (if in degrees). That is what happens here. However, this is just a way the coordinate extent is defined and that can be changed. One can simply modify it to set the origin close to the area of interest. You can observe those in the CRS config (I think it's most convenient and insightful to look at the PROJ4 string): EPSG:4326: note 4326 and 3857 use a So if we want to stick to EPSG:3857 we can use it as a base: For example in QGIS you can define this custom CRS (in Settings -> Custom projections), So if you transform both GeoDataFrames into that custom CRS before getting the intersection i think it might work and you are technically still in 3857, just the origin is different:
import pyproj
import geopandas as gpd
gdf = gpd.read_file("lake_change_32601.gpkg")
crs_3857_rotated = pyproj.CRS.from_proj4("+proj=merc +a=6378137 +b=6378137 "
"+lat_ts=0 +lon_0=180 +x_0=0 +y_0=0 +k=1 "
"+units=m +nadgrids=@null +wktext +no_defs")
gdf_3857r = gdf.to_crs(crs_3857_rotated, inplace = False)
gdf_3857r.plot(cmap = 'viridis', linewidth = 0.8)
Note that for production one might consider not using the proj4 string for the definition of the custom CRS, but the WKT string, which is presumably more precise: https://proj.org/en/9.4/faq.html#what-is-the-best-format-for-describing-coordinate-reference-systems . The Proj library provides conversion tools, the setting corresponding to
but i didn't test that. I looked in your code to test the deduplication with this, but it seems like the projection is done beforehand (i.e. outside of deduplicate_neighbors() ) and i haven't had the time to mock something up. But i guess if you use this custom CRS for data in the UTM zones close to the antimeridian, it will work. |
Thanks so much for your thorough explanation of your idea, that's an interesting approach! Since the plan is for this deduplication to be completed by @tcnichol prior to passing the data off to me for visualization and standardization, maybe he can read through our suggestions and choose how to move forward. I'd like to clarify when the CRS transformations take place. Since deduplication will occur before staging for this dataset, the transformation within the duplicate flagging process is the first time. Within Lastly, I'd like to highlight that flagging the duplicates and removing the duplicates are different steps. When we flag the duplicates, we create a boolean column, which I set to be called |
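To illustrate the difference between flagging and removing duplicates, here is a minimal sketch with a plain pandas DataFrame. The column name `is_duplicate` and the `lake_id` values are hypothetical placeholders, not the actual column names used in the workflow.

```python
import pandas as pd

# Step 1 (flagging): every polygon is kept, with a boolean column marking duplicates.
# "is_duplicate" is a hypothetical column name used only for this example.
flagged = pd.DataFrame({
    "lake_id": [1, 2, 3, 4],
    "is_duplicate": [False, True, False, True],
})

# Step 2 (removing): a separate, later step keeps only the rows that were not flagged
deduplicated = flagged[~flagged["is_duplicate"]].drop(columns="is_duplicate")

print(f"{len(flagged)} polygons flagged, {len(deduplicated)} kept after removal")
```

Keeping the two steps separate means the flagged output can be inspected, for example mapped in QGIS, before any polygons are dropped.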
Identify which polygons intersect the 180th longitude line and split them
Before we determine which approach to take to deal with these polygons that cross the antimeridian, either by splitting them with a buffered 180th degree longitude line or creating a custom CRS with a different meridian, I want to first answer the question: Are the polygons that intersect the antimeridian lake detections or simply seawater and river polygons that are already supposed to be filtered out before deduplication and visualization anyway? I did this exploration in R. The plots show that all but 1 of the polygons that intersect the antimeridian are seawater or rivers, which should be removed by Todd and Ingmar's filtering prior to deduplication and visualization.
explore_32601.R
# Author: Juliet Cohen
# Date: 2024-08-05
# Explore the relationship between lakes in 32601 and 180th degree longitude
# PART 1) example of how to split polygons with buffered antimeridian
# PART 2) subset and plot the polygons to just those that intersect the
# antimeridian, helping determine if most or all of these are seawater
# and river polys, rather than lake polys which will be the actual input
# data to the viz-workflow
library(sf)
library(ggplot2)
library(leaflet) # interactive mapping
library(mapview)
# PART 1 -----------------------------------------------------------------------
# Split polygons at the buffered antimeridian
# UTM zone 32601, which has NOT been filtered for seawater or rivers yet
fp = "~/check_dedup_forIN/check_for_todd/check_0726/lake_change_32601.gpkg"
gdf_32601 <- st_read(fp) %>%
st_set_crs(32601)
# transform data to EPSG:4326, WGS84
gdf_4326 <- st_transform(gdf_32601, 4326)
# define the max and min y values to limit antimeridian line to relevant area
bbox <- st_bbox(gdf_4326)
ymin = bbox["ymin"]
ymax = bbox["ymax"]
# create a line of longitude at 180 degrees (antimeridian) in EPSG:4326
AM <- st_linestring(matrix(c(180, 180, ymin, ymax), ncol = 2)) %>%
st_sfc(crs = 4326)
# plot polygons (in 32601) with AM line (in 4326)
ggplot() +
geom_sf(data = gdf_32601) +
geom_sf(data = AM, color = "red", size = 1) +
# adjust the background grid to either CRS, 4326 or 32601
# coord_sf(crs = st_crs(32601), datum = st_crs(32601)) +
labs(title = "Lakes in UTM 32601 with 180° longitude line") +
theme_minimal()
# buffer the antimeridian line
buffered_AM <- st_buffer(AM, dist = 0.1)
buffered_AM_32601 <- st_transform(buffered_AM, 32601)
# split the polygons with the buffered AM
split_polys <- st_difference(gdf_32601, buffered_AM_32601)
# convert split polygons to 4326 because this is required by leaflet
split_polys_4326 <- st_transform(split_polys, 4326)
map <- leaflet(split_polys_4326) %>%
addTiles() %>% # add default OpenStreetMap map tiles
addPolygons()
map
# PART 2 -----------------------------------------------------------------------
# Determine which polygons actually cross the antimeridian
# Both geospatial objects MUST be within the same CRS,
# so use the unbuffered antimeridian, but convert it to CRS 32601.
# NOTE: cannot go the other way (transform the polys to 4326),
# because they wrap the other way around the world,
# because they have not been split yet
AM_32601 <- st_transform(AM, 32601)
intersect_polys <- gdf_32601[st_intersects(gdf_32601, AM_32601, sparse = FALSE), ]
# split the intersecting polygons with the buffered AM defined earlier in script,
# because need the buffer to move polygon side away from the antimeridian
intersect_split_polys <- st_difference(intersect_polys, buffered_AM_32601)
intersect_split_polys_4326 <- st_transform(intersect_split_polys, 4326)
intersect_map <- leaflet(intersect_split_polys_4326) %>%
addTiles() %>%
addPolygons()
intersect_map |
Note that the same exploration should be done for the UTM zone on the other side of the 180th degree longitude. Any zones further south that also touch the antimeridian may be included in the analysis as well. |
Todd will first apply the seawater and river filtering to the lake change data (including the final step of removing the rows where the polygon has been flagged as False for |
Notes from a meeting today, Aug 16th, are in the PDG meeting notes. They outline the next steps for Ingmar and Todd to process all UTM zones with filtering, deduplication, and merging, then validate the data with the geohash IDs. Regarding the lake polygons that intersect the antimeridian, Todd is not sure how many are in the dataset. Ingmar suggested that if they stick to polar projections for all steps of data processing prior to the visualization workflow, then the intersection with the antimeridian will not be a problem at all. While this is true, this will still be a problem when the data is input into the viz workflow (as noted above here as well) because we use EPSG:4326. I emphasize this because if there are lake polygons that intersect the antimeridian, then whoever runs the viz workflow on this data will need to split those polygons prior to processing with one of the methods I documented above. |
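For whoever needs to estimate how many polygons are affected, here is a rough, dependency-free heuristic sketch. It is not the method used in explore_32601.R. It assumes the vertices are already in EPSG:4326 and treats any jump of more than 180° in longitude between consecutive vertices as a sign that the outline wraps across the antimeridian. The function name and toy rings are hypothetical, and the heuristic would misfire for a polygon that genuinely spans more than half the globe.

```python
# Heuristic sketch: detect rings whose longitudes wrap across the antimeridian.

def crosses_antimeridian(lons, threshold: float = 180.0) -> bool:
    """True if any two consecutive vertices differ by more than threshold degrees."""
    return any(abs(b - a) > threshold for a, b in zip(lons, lons[1:]))


# toy closed rings, longitudes only (first vertex repeated at the end)
wrapping_ring = [179.9, -179.9, -179.9, 179.9, 179.9]
ordinary_ring = [170.0, 171.0, 171.0, 170.0, 170.0]

print(crosses_antimeridian(wrapping_ring))
print(crosses_antimeridian(ordinary_ring))
```

Polygons flagged this way would still need to be split and buffered away from the line, as described above, before staging.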
datateam.nceas.ucsb.edu:/home/pdg/data/nitze_lake_change/data_sample_2022-09-09
Sample data info
The files are not final, but the general structure will be the same.
General structure
5_Lake_Dataset_Raster_02_final
).qml
) for a nice visualization.lake_change_rates_net_10cl_v3tight
shows absolute changes, negative values (red) for loss, positive values (blue) for growth.