Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display Ingmar Nitze's lake change dataset #28

Open
robyngit opened this issue Sep 9, 2022 · 45 comments
Open

Display Ingmar Nitze's lake change dataset #28

robyngit opened this issue Sep 9, 2022 · 45 comments
Assignees
Labels
layer Displaying a specific data product in the PDG portal pdg Permafrost Discovery Gateway Water data layer category: water

Comments

@robyngit
Copy link
Member

robyngit commented Sep 9, 2022

  • This dataset is not yet ready, but a sample has been uploaded to datateam.nceas.ucsb.edu:/home/pdg/data/nitze_lake_change/data_sample_2022-09-09

Sample data info

The files are not final, but the general structure will be the same.

General structure

* <UTM_zone>
  * 05_Lake_Dataset_Raster_02_final
    * [files]
  • The data sample directory contains 3 directories, one per UTM zone
    • Each UTM directory contains another dir (e.g. 5_Lake_Dataset_Raster_02_final)
      • Within this sub-dir there are several files (gpkg, geojson, tif)
      • The "lake_change.gpkg" are polygon vector files with a lot of attributes about shape and the spatial/temporal dynamics)
      • There are also raster images which are spatially aggregated. e.g. "lake_change_grid_3000_netchange.tif" resolution is 3000m, there are also files for 7500m resolution
      • They are in absolute values (lake area change in ha, per pixel). In a next version it might make sense to normalize that per pixel.
  • UTM zones will overlay, need deduplication between them
  • There are also 2 QGIS layer styles (.qml) for a nice visualization.
    • lake_change_rates_net_10cl_v3tight shows absolute changes, negative values (red) for loss, positive values (blue) for growth.
  • There are also styled images examples with the CartoDB Black Matter base layer.
@robyngit robyngit added pdg Permafrost Discovery Gateway layer Displaying a specific data product in the PDG portal labels Sep 9, 2022
@robyngit
Copy link
Member Author

robyngit commented Sep 21, 2022

I processed the sample data from Ingmar using the same workflow we created for the IWP polygons, creating both PNG web tiles and 3D tiles. Everything ran very smoothly. The output is currently displayed on the demo portal:

Screen Shot 2022-09-21 at 14 35 34

Screen Shot 2022-09-21 at 14 36 37

Notes:

  • Here is the config that I used for the workflow: ingmar-config.json.txt
  • The colour palette should be improved. At the least, rather than fading to light blue & white where there are few lakes, we should fade to transparency. Better, we should try to replicate the styling examples that Ingmar provided.
  • I set the max-zoom level to 11, but given that these are mostly very large polygons, we could consider changing that to 10
  • This data is comparatively much faster to process than the IWP data (fewer and larger polygons)
  • For the 3D tiles, we might want to try selecting a better geometric error, as it would be nice for some of the larger lakes to appear while more zoomed out. (Ideally, we should updated the 3D tile tree workflow to create separate tiles based on the size of the polygons, such that larger polygons start to load first, and smaller ones load once you're most zoomed in)

@initze
Copy link

initze commented Oct 12, 2022

New data Package

@initze
Copy link

initze commented Oct 12, 2022

Image visualization suggestions

  1. 3D tiles ('lake_change.gpkg')
  • Attribute: "ChangeRateNet_myr-1"
    • palette RdYlBu

grafik
QGIS - style file

This styling works nicely with a black background map (CartoDB Dark Matter, or similar)

grafik
grafik

@initze
Copy link

initze commented Oct 12, 2022

Image visualization suggestions part 2

Raster

File:
lake_change_grid_3000_netchange.tif
lake_change_grid_3000_netchange.tif

They are similar to Webb et al., 2022 (Surface water index Trend)
Please just use the raster as is, it's already designed to have aggregated statistics per pixel.

Palette: RdBu
Range: -2:+2
NoData = 0

grafik
grafik

@julietcohen
Copy link

julietcohen commented Apr 5, 2023

New data added

Ingmar uploaded 5 zip files that contain lake change data to a Google Drive folder here.

  • The data is separated by UTM zones. Within each of those 5 directories is 6-10 subfolders that represents one UTM zone each.
  • The 5 directories range in size from 1.22 GB - 10.75 GB. The smallest is data_products_32635-32640

Per our visualization meeting discussion on 4/3, the highest priority is to process the data in one of the 5 directories, taking Ingmar's color suggestions into consideration. Next, we will move onto processing the other 4 directories and finally Ingmar's newer data, documented in issue#37.


Update: These 5 directories have been uploaded to the NCEAS datateam server: /home/pdg/data/nitze_lake_change/data_2022_11_04/lake_change_GD

@julietcohen
Copy link

julietcohen commented Apr 6, 2023

Large quantity of staged tiles from 1 input lake change file

Initially tried to stage all 6 lake_change.gpkg files within data_products_32635-32640, but staging was taking hours and was producing a surprising amount of files from just one of the input gpkg file: data_products_32635-32640/32640/lake_change.gpkg, so restarted the process as a tmux script to only stage this one UTM zone.

  • log shows that all files staged are from that one file, shows no errors or unexpected output
  • it's only 77 MB, contains 58,495 polygons
  • output: >68,000 staged tiles (wow), running for >4 hours (update: tmux session continued running overnight and is at >230,000 tiles.)
  • using main branch of viz-staging
  • processing on NCEAS datateam server, not in parallel
  • note: usually a few hundred to a few thousand tiles result from 1 average sized IWP input gpkg / shp file (although the area covered for 1 IWP file is smaller than the area covered by 1 lake file)
script
# Process smallest directory in Ingmar's Google Drive
# issue: https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/28
# follow color scheme suggested in issue
# data: https://drive.google.com/drive/folders/1JxSBRs2nikB_eYxtwEatZ06bhusAN5nL
# intent is to process all the data files in this drive,
# then move on to process Ingmar's higher temporal resolution data
# using venv arcade_layer with local installs for:
# viz-staging, viz-raster, & viz-3dtiles

# imports -----------------------------------------------------

# input data import
from pathlib import Path

# staging
import pdgstaging
from pdgstaging import TileStager

# rasterization
import pdgraster
from pdgraster import RasterTiler

# visual checks
import geopandas as gpd

# logging
from datetime import datetime
import logging
import logging.handlers
import os

# data ---------------------------------------------------------

base_dir = Path('/home/jcohen/lake_change_GD_workflow/lake_change_GD/data_products_32635-32640/32640')
filename = 'lake_change.gpkg'
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]

print(f"Input file(s): {input}")

# logging config -----------------------------------------------

handler = logging.handlers.WatchedFileHandler(
    os.environ.get("LOGFILE", "/home/jcohen/lake_change_GD_workflow/log.log"))
formatter = logging.Formatter(logging.BASIC_FORMAT)
handler.setFormatter(formatter)
root = logging.getLogger()
root.setLevel(os.environ.get("LOGLEVEL", "INFO"))
root.addHandler(handler)

# Staging ------------------------------------------------------

print("Staging...")

stager = TileStager({
  "deduplicate_clip_to_footprint": False, 
  "dir_input": "/home/jcohen/lake_change_GD_workflow/lake_change_GD/data_products_32635-32640/32640", 
  "ext_input": ".gpkg", 
  "dir_staged": "staged/",
  "dir_geotiff": "geotiff/", 
  "dir_web_tiles": "web_tiles/", 
  "filename_staging_summary": "staging_summary.csv",
  "filename_rasterization_events": "raster_events.csv",
  "filename_rasters_summary": "raster_summary.csv",
  "filename_config": "config",
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [
    0,
    15
  ],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "coverage",
      "weight_by": "area",
      "property": "area_per_pixel_area",
      "aggregation_method": "sum",
      "resampling_method": "average",
      "val_range": [
        0,
        1
      ],
      "palette": [
        "#ff0000", # red (check that these color codes are appropriate)
        "#0000ff" # blue (source: http://web.simmons.edu/~grovesd/comm244/notes/week3/css-colors)
      ],
      "nodata_val": 0,
      "nodata_color": "#ffffff00" # change?
    },
  ],
  "deduplicate_at": [
    "raster"
  ],
  "deduplicate_keep_rules": [
    [
      "Date",
      "larger"
    ]
  ],
  "deduplicate_method": "neighbor",
  "deduplicate_keep_rules": [["staging_filename", "larger"]],
  "deduplicate_overlap_tolerance": 0.1,
  "deduplicate_overlap_both": False,
  "deduplicate_centroid_tolerance": None
})

for file in input:
    print(f"Staging file {file}...")
    stager.stage(file)
    print(f"Completed staging file {file}.")

print("Staging complete.")

We love a suspenseful mystery!

@julietcohen
Copy link

Ray workflow on Delta server for 1 UTM zone

  • 1 GPU node (because only 1 input file)
  • changed max z-level from 15 to 11 because this data is ~30m resolution, not sub-meter like the IWP data, so no need to go down to such high resolution
  • created new config file, but used same IN_PROGRESS_VIZ_WORKFLOW.py and other processing scripts with adjustments
  • Red & blue color palette

Staging

  • produced 6,448 files at z-level 11
  • took 17 minutes

Raster highest

  • produced 6,448 files at z-level 11
  • took 23 minutes
  • memory flux around 14%
  • cpu flux around 20-50%

Raster lower

  • total # geotiffs for all z-levels produced: 9,428
  • took 1 minute

Web tiles

  • total # pngs for all z-levels produced: 9,428
  • took 1 minute

Web tile visualization on local Cesium:

image

image

image

@julietcohen
Copy link

julietcohen commented Apr 11, 2023

To do:

  • change the property used for the statistic to ChangeRateNet_myr-1
  • changing the property will likely require us to adjust other config options as well:
    • potentially weight_by (depending on units of ChangeRateNet_myr-1), but likely keep this at area rather than changing to count when using ChangeRateNet_myr-1
    • aggregation_method to mode so the color of that pixel represents the most common rate change present in the pixel, rather than the average or sum of the rate changes in that pixel
      • updates:
        • mode is not an option. During rasterization step for highest z-level, log reports: ERROR:pdgraster.RasterTiler:'SeriesGroupBy' object has no attribute 'mode'. See options here.
        • max and 'mean' are options but while still rasterizing z-11, printed these 2 statements in terminal:
/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:232: RuntimeWarning: invalid value encountered in subtract  x = asanyarray(arr - arrmean)

and

/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:48: RuntimeWarning: invalid value encountered in reduce return umr_sum(a, axis, dtype, out, keepdims, initial, where)

These statements are likely resulting from errors in rows of the raster_summary.csv with invalid cell values of inf and -inf, represented like so:
invalid

  • potentially resampling_method to mode so that when we produce lower resolution rasters, the color for the pixel represents the most common rate change for that extent, rather than the average of the rate changes for all the lakes in that extent
  • adjusting the config for a property of interest was described in this issue
  • adjust color palette: red and blue in the data above are not meaningful in the way we intend them to be; the blue likely represents 100% coverage, and the red likely represents <100% coverage because the red is used for pixels at the edge of the lake polygons. We should use a gradient from red to yellow to blue rather than 2 discrete colors, which Ingmar suggested above here. The color palette must be accepted by colormaps.
  • visualize 2 UTM zones to allow us to check if the neighbors deduplication is working well
  • once the palette and statistic is figured out, change z-level to 12 to see if that is better for the data resolution (~30m)

@julietcohen
Copy link

Progress towards adapting colors to attribute of interest

Ran the workflow through web-tiling with statistic ran on property of the data, rather than the coverage as normal for IWP data. Used the attribute ChangeRateGrowth_myr-1 because I was testing if using an attribute with all positive values would resolve the inf and -inf values shown in the raster_summary.csv pictured above. Unfortunately, there were still many inf and -inf values produced, which resulted in failure to produce many lower resolution rasters from z-11 rasters. I plotted anyway to get an idea of how the colors appear when mapped to this attribute. I used a color palette with 7 colors that range from red to yellow to blue depending on the lake growth:
image

Config is here.

Notes:

  • I did the run on Delta because it's much faster than Datateam, especially rasterization steps
  • need to determine how to get around the inf and -inf values

@julietcohen
Copy link

julietcohen commented Apr 12, 2023

Infinity and NaN values present in input data

I determined the source of those inf and -inf values: they are present in the input data. I back traced them from the raster summary to the staged tiles and finally the 36 input lake_change.gpkg files, each from one UTM zone. There are also NaN values in each gpkg. These values might represent no data, or could be errors.

After discovering inf and NaN values within one file, I ran a script to check for them in every file, see below.

check_inf_nan_values.py
# sum up inf and NaN values in lake change input data

import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path

# collect all lake_change.gpkg filepaths in Ingmar's data
base_dir = Path('/home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/')
filename = 'lake_change.gpkg'
# To define each .gpkg file within each subdir as a string representation with forward slashes, use as_posix()
# The ** represents that any subdir string can be present between the base_dir and the filename
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
# 36 input files

print(f"Collected {len(input)} lake_change.gpkg filepaths.")

# import each filepath as a gdf
# convert to df by removing geometry col
# (necessary because numpy functions needs a df as input)
# sum up all inf values in df
# sum up all NaN values in df
# add sums to appropriate lists

# create empty list for # inf values in each file
inf_values = []
nan_values = []
for path in input:
    print(f"Checking file {path}.")
    gdf = gpd.read_file(path)
    # convert from gdf to df
    df = gdf.drop(['geometry'], axis = 1)
    # sum up inf values in df
    num_inf_values = np.isinf(df).values.sum()
    inf_values.append(num_inf_values)
    # sum up NaN values in df
    num_nan_values = df.isnull().sum().sum()
    nan_values.append(num_nan_values)

print(f"Checks complete.\nLength of inf value list is {len(inf_values)}.\nLength of Nan value list is {len(nan_values)}.")

# write lists to files
with open("/home/jcohen/lake_change_GD_workflow/LC_inf_values.txt", "w") as output:
    for inf_value in inf_values:
        output.write(f"{inf_value}\n")

with open("/home/jcohen/lake_change_GD_workflow/LC_nan_values.txt", "w") as output:
    for nan_value in nan_values:
        output.write(f"{nan_value}\n")
output: LC_inf_values.txt (number of inf values in each file)
437
422
1300
357
225093
409
1075
104
1182
1369
239
321
105
48
216
60
382
406
948
266
331
421
991
805
438
570
465
230
467
391
399
478
269
431
227
365

Output: LC_nan_values.txt (number of NaN values in each file)
33
26
68
25
16350
23
61
8
40
69
7
23
5
2
2
0
12
16
22
8
23
17
13
35
26
20
19
6
13
11
17
24
7
9
7
13

These values are present in several columns, including but not necessarily limited to:

  • NetChange_perc
  • GrossIncrease_perc
  • GrossDecrease_perc
  • ChangeRateNet_myr-1
  • ChangeRateGrowth_myr-1

@initze : Would you recommend that I remove rows with inf and NaN values, or replace these values with something else? Alternatively, would you like to re-process the data to avoid calculating these values in the first place?

@julietcohen
Copy link

Ingmar is looking into the source of the inf and NaN values, and he will get back to us about how he wants to move forward.

@julietcohen julietcohen added the question Further information is requested label Apr 17, 2023
@initze
Copy link

initze commented Apr 21, 2023

Hi @julietcohen .
I checked a few of the files and could find some of the affected polygons.
It seems that the files didn't run through a specific filter. The affected features/lakes are all very small, thus running into 0 divisions and other stuff.

Solution

delete rows with NaN for now
I will apply a filter in the next version.

I hope that fixes your issue

Cheers Ingmar

To do:

* change the `property` used for the `statistic` to `ChangeRateNet_myr-1`

* changing the `property` will likely require us to adjust other config options as well:
  
  * potentially `weight_by` (depending on units of `ChangeRateNet_myr-1`), but likely keep this at `area` rather than changing to `count` when using `ChangeRateNet_myr-1`
  * `aggregation_method` to `mode` so the color of that pixel represents the most common rate change present in the pixel, rather than the average or sum of the rate changes in that pixel
    
    * updates:
      
      * `mode` is not an option. During rasterization step for highest z-level, log reports: `ERROR:pdgraster.RasterTiler:'SeriesGroupBy' object has no attribute 'mode'`. See options [here](https://sparkbyexamples.com/pandas/pandas-aggregate-functions-with-examples/#:~:text=What%20are%20pandas%20aggregate%20functions,form%20a%20single%20summary%20value.).
      * `max` and 'mean' are options but while still rasterizing z-11, printed these 2 statements in terminal:
/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:232: RuntimeWarning: invalid value encountered in subtract  x = asanyarray(arr - arrmean)

and

/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:48: RuntimeWarning: invalid value encountered in reduce return umr_sum(a, axis, dtype, out, keepdims, initial, where)

These statements are likely resulting from errors in rows of the raster_summary.csv with invalid cell values of inf and -inf, represented like so: invalid

* potentially `resampling_method` to `mode` so that when we produce lower resolution rasters, the color for the pixel represents the most common rate change for that extent, rather than the average of the rate changes for all the lakes in that extent

* adjusting the config for a property of interest was described in [this issue](https://github.com/PermafrostDiscoveryGateway/viz-workflow/issues/9)

* adjust color palette: red and blue in the data above are not meaningful in the way we intend them to be; the blue likely represents 100% coverage, and the red likely represents <100% coverage because the red is used for pixels at the edge of the lake polygons. We should use a _gradient_ from red to yellow to blue rather than 2 discrete colors, which Ingmar suggested above [here](https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/28#issuecomment-1275936775). The color palette must be accepted by [colormaps](https://pratiman-91.github.io/colormaps).

* visualize 2 UTM zones to allow us to check if the neighbors deduplication is working well

* once the `palette` and `statistic`  is figured out, change z-level to 12 to see if that is better for the data resolution (~30m)

@julietcohen julietcohen removed the question Further information is requested label Apr 21, 2023
@julietcohen
Copy link

Thanks for looking into this, Ingmar. I'll remove all rows with NaN, or inf values and move forward with processing this data with a statistic for the ChangeRateNet_myr-1 attribute.

@robyngit
Copy link
Member Author

@tcnichol is ready to start moving this data to our datateam server. We need to discuss where he will store the data and how he will transfer it from delta (globus)? Estimated to be around 500GB, including a lot of intermediary products.

@mbjones
Copy link
Member

mbjones commented Jun 15, 2023

He should store it in the same ~pdg/data staging directory that we've been using for IWP. Juliet had trouble getting globus to write directly there, which is why the ~jscohen account was created. There is also a pdg account, and we should talk to Nick to clarify if that can be used to do a globus transfer directly into that location - I think it should work. Let's discuss on slack.

@julietcohen
Copy link

@mbjones: Todd is curious if there is an update on the Globus --> Datateam data transfer situation. If we have enabled this for users without needing to give them access to jscohen, please let us know how he can do so.

@mbjones
Copy link
Member

mbjones commented Jun 29, 2023

I talked with Nick about moving the home directory of pdg to be /home/shares/pdg, which would mean its home directory is on the same fiesystem that is used for our PDG data storage. That would enable globus logins with the pdg account shared by the project to be able to transfer data directly to where it needs to be. That should eliminate the need for the jscohen account altogether. Check with Nick on the status of those changes.

@julietcohen
Copy link

Great, thank you

@julietcohen
Copy link

Update on cleaning lake change data before visualization:

Cleaning the lake change data provided in November 2022, located in: /home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/...

Ingmar requested that the rows of each of the 46 lake_change.gpkg that contain any NA or inf value be removed prior to visualization, since I noted that this causes issues for the rasterization step in which we are calculating stats on an attribute of the input data.

In order to document the rows that contain NA and/or inf values and therefore need to be removed, the following script writes a csv file for each lake_change.gpkg with the rows (and their indices) that contain NA or inf values. These are saved within a directory with a hierarchy that matches Ingmar's hierarchy so it's clear which csv documents the invalid rows for to which originallake_change.gpkg.

clean_lake_change_data.py
# Author: Juliet Cohen

# Overview:
# identify, document, and remove rows with an invalid value in any column (Na, inf, or -inf)
# in lake change input data from Ingmar Nitze,
# and save the cleaned files to a new directory
# using conda env perm_ground

import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path
import os

# collect all lake_change.gpkg filepaths in Ingmar's data
base_dir = Path('/home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/')
filename = 'lake_change.gpkg'
# To define each .gpkg file within each subdir as a string representation with forward slashes,
# use as_posix()
# The ** represents that any subdir string can be present between the base_dir and the filename
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
print(f"Collected {len(input)} lake_change.gpkg filepaths.")

# Overview of loop:
# 1. import each filepath as a gdf
# 2. document which rows have invalid value in any column (Na, inf, or -inf)
#    as a separate csv for each input gpkg
# 3. drop any row with an invalid value
# 4. save as new lake change file

for path in input:
    print(f"Checking file {path}.")
    gdf = gpd.read_file(path)

    # first identify any rows with invalid values
    # to document which will be dropped for data visualization
    error_rows = []
    # first convert any infinite values to NA
    gdf.replace([np.inf, -np.inf], np.nan, inplace = True)
    for index, row in gdf.iterrows():
        if row.isna().any():
            error_rows.append(row)
    error_rows_df = pd.DataFrame(error_rows)

    # hard-code the start of the path to directory for the erroneous data
    filepath_start = "/home/jcohen/lake_change_GD_workflow/workflow_cleaned/error_data_documentation/"
    # next, pull the last couple parts of filepath to ID which lake_change.gpkg
    # is being processed, following Ingmar's directory hierarchy
    directory, filename = os.path.split(path)
    filepath_sections = directory.split(os.sep)
    relevant_sections = filepath_sections[-2:]
    partial_filepath = relevant_sections[0] + "/" + relevant_sections[1]
    full_filepath = filepath_start + partial_filepath + "/error_rows.csv"
    # make the subdirectories if they do not yet exist
    directory_path = os.path.dirname(full_filepath)
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    # save the df of rows with invalid values as a csv
    # save the index because communicates to Ingmar which rows in his original data
    # contain invalid values
    error_rows_df.to_csv(full_filepath, index = True)
    print(f"Saved rows with invalid values for lake change GDF:\n{path}\nto file:\n{full_filepath}")

    # drop the rows with NA values in any column
    gdf.dropna(axis = 0, inplace = True)

    # save cleaned lake change file to new directory
    # hard-code the start of the path to directory for the cleaned data
    filepath_start = "/home/jcohen/lake_change_GD_workflow/workflow_cleaned/cleaned_files/"
    # next, pull the last couple parts of filepath to ID which lake_change.gpkg
    # is being processed, following Ingmar's directory hierarchy
    directory, filename = os.path.split(path)
    filepath_sections = directory.split(os.sep)
    relevant_sections = filepath_sections[-2:] + ['lake_change_cleaned.gpkg']
    filepath_end = relevant_sections[0] + "/" + relevant_sections[1] + "/" + relevant_sections[2]
    full_filepath = filepath_start + filepath_end
    print(f"Saving file to {full_filepath}")
    # make the subdirectories if they do not yet exist
    directory_path = os.path.dirname(full_filepath)
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    gdf.to_file(full_filepath, driver = "GPKG") 

print(f"Cleaning complete.")

@julietcohen
Copy link

julietcohen commented Aug 28, 2023

Visualizing 2 cleaned adjacent UTM zones

Ideally for the config, we would set both aggregation_method and resampling_method to mode when visualizing the attribute ChangeRateNet_myr-1 but we must use something else besides mode for aggregation_method until that functionality is built into the workflow. For testing, we can try max.

This attribute ranges from approximately -10 to +5 with most values ranging from -2 to +2 (for the 2 UTM zones used for testing):

image

Therefore we should also include a val_range in the config [-2,2]. This keeps the palette consistent across z-levels, so lakes or portions of lakes do not change color as you zoom in. More extreme values that fall outside that range are assigned the color for -2 or 2.

image

To do:

  • remove lake change polygons that fall over the ocean by masking the data before inputting into workflow (Ingmar can also do this before passing data to us, he is looking for a mask that also has rivers)
  • change the no data value to a value not found in the data such as -999 or 999
  • change the aggregation to min rather than max because most of the data is negative trend for this attribute (min agregation should be paired with +999 no data value and max aggregation should be paired with -999 data value)
  • check how the no data values are treated in the workflow when resampling: we want to avoid lakes disappearing at lower z-levels because 4 pixels (one of which may be a no-data value) are combined into 1 and all 4 pixels become transparent
  • add documentation to the aggregation and resampling options in viz-staging
  • test if the strip we see between UTM zones is due to the neighbors deduplication method or due to the input data

@julietcohen
Copy link

julietcohen commented Aug 31, 2023

Adjusting config option nodata_value and removing deduplication

The last few rounds of lake change visualization for the same UTM zones tested the following two changes in the config:

  1. setting nodata_value to 999 and None (instead of 0 which is used for IWP data, because 0 is a possible value for the visualized attribute in this lake change data)
  2. not deduplicating (used same staged data with flagged duplicates using the neighbor deduplication method, but re-rasterized and web-tiled with config set to not deduplicate so flagged duplicates are not actually removed)
  3. using min instead of max for the aggregation_method
lake_change_config.py
IWP_CONFIG = {
  "deduplicate_clip_to_footprint": False,
  "dir_output": OUTPUT, 
  "dir_input": INPUT, 
  "ext_input": ".gpkg",
  "dir_geotiff_remote": GEOTIFF_REMOTE, 
  "dir_geotiff_local": GEOTIFF_LOCAL, 
  "dir_web_tiles": WEBTILE_REMOTE, 
  "dir_staged_remote": STAGING_REMOTE, 
  "dir_staged_remote_merged": STAGING_REMOTE_MERGED, 
  "dir_staged_local": STAGING_LOCAL, 
  "filename_staging_summary": STAGING_REMOTE + "staging_summary.csv",
  "filename_rasterization_events": GEOTIFF_REMOTE + "raster_events.csv",
  "filename_rasters_summary": GEOTIFF_REMOTE + "raster_summary.csv",
  "version": datetime.now().strftime("%B%d,%Y"),
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [
    0,
    11
  ],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "change_rate", 
      "weight_by": "area",
      "property": "ChangeRateNet_myr-1", 
      "aggregation_method": "min", 
      "resampling_method": "mode",  
      "val_range": [
        -2,
        2
      ],
      "palette": ["#ff0000", # red
                  "#FF8C00", # DarkOrange
                  "#FFA07A", # LightSalmon
                  "#FFFF00", # yellow
                  "#66CDAA", # MediumAquaMarine
                  "#AFEEEE", # PaleTurquoise,
                  "#0000ff"], # blue
      "nodata_val": 999,
      "nodata_color": "#ffffff00" # fully transparent white
    },
  ],
  #"deduplicate_at": ["raster"],
  "deduplicate_at": None,
  #"deduplicate_keep_rules": [["Date","larger"]],
  "deduplicate_method": None,
  #"deduplicate_overlap_tolerance": 0.1,
  #"deduplicate_overlap_both": False,
  #"deduplicate_centroid_tolerance": None
}

Results

1. setting nodata_value to 999 and None

Both of these nodata_values were problematic, resulting in a yellow layer over the tiles that spans from the prime meridian to the antimeridian:

image

This may be related to the fact that these values 999 and None are not present in the data at all, so the nodata_mask is of length 0 when we execute to_image() in viz-raster here. Because the nodata_value is not set to 0 anymore, the 0 values in the data (yellow in our palette) are not made into the mask, so they are retained. Consider the IWP config, where we do set the nodata_value to 0. The cells that have 0% coverage or 0 polygons are set to transparent. However, this still does not explain why the yellow layer expands all the way around half of the world, and is not limited to the region of the 2 UTM zones. The same dir of staged tiles is used to produce both web tile sets that do and do not have the yellow layer, so I suppose this means that the yellow layer is only in the web-tiles, and not the staged tiles themselves. More testing is needed to narrow down what's happening here.

2. Not removing flagged duplicates

In the web tiles, we still see the gap that I assume represents the border of the two UTM zones visualized:

image

The next step to figure out what's going on here is to plot all the lake data for both UTM zones and see if they overlap. Ingmar expressed that there certainly is overlap and there needs to be deduplication integrated either before the viz-workflow or during the viz-workflow.

3. using min instead of max for aggregation

This makes more sense considering that the most extreme values of the distribution for this attribute is negative, rather than positive. Hopefully this more correctly conveys the environmental trends shown in the data and shows more diverse values in the palette.

@robyngit
Copy link
Member Author

robyngit commented Sep 6, 2023

I re-processed the web tiles from the same correctly deduplicated geotiff's in the previous comment

Have you tried setting the nodata_val before rasterization? You could dig through the code to check if this is in fact how it works, but theoretically the workflow should set the pixels without any polygons to the no_dataval when first creating the highest level geotiffs.

@julietcohen
Copy link

Thanks for the feedback @robyngit! I'll re-start the workflow at the raster highest step rather than web-tiling, with no-data val set to 999 or np.Nan and see if that works.

I didn't think that the nodata value would be used in the raster highest step because searching for nodata_val within viz-raster shows that the only occurrences are when web tiles are created. Within RasterTiler.py, nodata_val occurs when we execute webtile_from_geotiff(). However, this was just a quick string search on Github rather than a deep dive into how you use nodata_val throughout the package, so it's totally possible that it's used earlier in the raster highest step and I'm missing it.

@julietcohen
Copy link

Maybe the raster highest step gets the nodata_val from get_raster_config() defined in viz-staging which is called within rasterize_vector() here

@robyngit
Copy link
Member Author

Looks like I made it so that no data pixels are always zero in the the Raster class 😕, see: https://github.com/PermafrostDiscoveryGateway/viz-raster/blob/92924077ccf6083442c9c2aaf40a8f164818f2b9/pdgraster/Raster.py#L772-L773

Grid cells without any data will be filled with 0.

We would have to make it so that the 0 in the fill_value=0 line uses the nodata_val from the config, see: https://github.com/PermafrostDiscoveryGateway/viz-raster/blob/92924077ccf6083442c9c2aaf40a8f164818f2b9/pdgraster/Raster.py#L809

@julietcohen
Copy link

julietcohen commented Sep 15, 2023

Update on nodata_val

In line with Robyn's observation, which I did not see until after I tried this, I re-rasterized the same deduplicated staged tiles for these 2 UTM zones, so started the workflow at raster highest, with the nodata_val set to 999. The resulting web tiles still have the yellow haze (see demo portal). I will open a branch to integrate that change to enable non-0 nodata_vals.

@julietcohen
Copy link

Ingmar helpfully created a new issue to document his data cleaning to remove seawater and rivers: https://github.com/PermafrostDiscoveryGateway/landsattrend-pipeline/issues/8

@initze
Copy link

initze commented Apr 3, 2024

Temporary pan-arctic dataset

  • cleaned (removed rivers + seawater)
    • minor issues may still be present
  • non-deduplicated
    • intersections between zones may contain (almost) duplicates
    • needs to be finally implemented yet#### Link

Link

https://drive.google.com/drive/folders/18pC-FW9Nibmkcv7DPlzzT3YW4Aim0k7C?usp=sharing

Content

  • Lakes_global.parquet - geoparquet file with 20yr lake change params
  • lakes_netShorelineChange.qml/sld - layer style files in sld and qml/qgis format to visualize lake shore erosion values (style that I often use)

@julietcohen
Copy link

Thank you, @initze ! Well done.

This data has been uploaded to Datateam at:
/var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03
along with a README file. I included Ingmar's notes above, as well as my notes and links to the relevant github issues.

@julietcohen
Copy link

julietcohen commented Apr 26, 2024

I did an initial check of the new, filtered version of the lake change dataset: /var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03/Lakes_global-001.parquet

There are NA values present in the attributes ChangeRateNet_myr-1 and ChangeRateGrowth_myr-1. As I did for the previous version of the lake change data that contained NA values for certain attributes, I can clean this data by removing the entire row if it contains NA values for an attribute.

I focused on this dataset today because it makes sense to use this data for 4 different but related goals for the PDG project, which are of varying degrees of priority:

  • process this dataset into a layer
  • deduplicate this dataset for Ingmar using the neighbor method
    • and enable him to execute the neighbor deduplication himself, ideally outside of the viz workflow, see viz-staging issue#36)
  • process lake data, Ingmar's geometries, with 2 custom stats in the viz workflow (coverage and number of polygons) to demonstrate what we can already do with summary data so Kat and Ian can build on that
  • explore how parquet files differ from other vector data and how we can automate the viz workflow to use them as input

@julietcohen
Copy link

julietcohen commented Apr 29, 2024

Visualized Sample of Parquet Data (filtered for seawater and rivers)

I used a subset of 10,000 polygons of the Lake Change data in parquet format as input into the visualization workflow with 3 stats:

  • ChangeRateNet_myr-1 which is an attribute in Ingmar's data
  • area_per_pixel_area, a custom stat in the viz workflow
  • centroids_per_pixel, a custom stat in the viz workflow

They are up on the demo portal.

config
workflow_config = { 
  "dir_input": "/home/jcohen/lake_change_parquet", 
  "ext_input": ".gpkg", 
  "dir_staged": "staged/",
  "dir_geotiff": "geotiff/", 
  "dir_web_tiles": "web_tiles/", 
  "filename_staging_summary": "staging_summary.csv",
  "filename_rasterization_events": "raster_events.csv",
  "filename_rasters_summary": "raster_summary.csv",
  "filename_config": "config",
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [
    0,
    13
  ],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "change_rate", 
      "weight_by": "area",
      "property": "ChangeRateNet_myr-1", 
      "aggregation_method": "mean", 
      "resampling_method": "mode", 
      "val_range": [
        0,
        1
      ],
      "palette": ["#ff0000", # red
                  "#FF8C00", # DarkOrange
                  "#FFA07A", # LightSalmon
                  "#FFFF00", # yellow
                  "#66CDAA", # MediumAquaMarine
                  "#AFEEEE", # PaleTurquoise
                  "#0000ff"], # blue
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    },
    {
      "name": "lake_coverage", 
      "weight_by": "area",
      "property": "area_per_pixel_area", 
      "aggregation_method": "sum", 
      "resampling_method": "average", 
      "val_range": [
        0,
        1
      ],
      "palette": ["#20f8c6", # light turquoise
                  "#2f20f8"], # dark blue
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    },
    {
      "name": "lake_count", 
      "weight_by": "count",
      "property": "centroids_per_pixel", 
      "aggregation_method": "sum", 
      "resampling_method": "sum", 
      "palette": ["#20f8c6", # light turquoise
                  "#2f20f8"], # dark blue
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    }
  ],
  "deduplicate_clip_to_footprint": False,
  "deduplicate_method": None
}

ChangeRateNet_myr-1

image image

area_per_pixel_area (% coverage)

image image

centroids_per_pixel (number of lakes)

image

Zooming in, the pixels that represent the center of each lake are so high res (and therefore so small) that they are hard to see.

Notes

  • For the stats ChangeRateNet_myr-1 and area_per_pixel_area, the polygons appear to be blocky and low resolution even though the max z-level is 13
    • this was not the case with the "original" lake change data, that seemed to have high res and rounded shapes
      • for reference see the layer Lake Change Rate layer that was pre-filtering for seawater and rivers pictured here:
image
  • In the config, we can specify the simplify_tolerance, which I set to 0.1. This can be much higher, such as 0.0001, and see if this resolves it. If the simplification tolerance does fix this issue, it would be worth investigating the minimum or maximum tolerance that should be allowed based on the max zoom level specified in the config. For example, considering the max zoom level set as the z_range, then the workflow should give a warning, an error, or automatically change the resolution of the tolerance based on what it should be.
  • The nodata_val is set to 0 in the viz workflow (see viz-raster issue#16)
  • The palette that Ingmar suggested for ChangeRateNet_myr-1 works well
  • As seen in other workflow runs with different datasets that use a palette with different colors rather than a range of transparency, the coverage category shows the pixels on the edges of all polygons as lighter (turquoise) and the inner polygons as darker (deep blue), because the polygons on the inside completely cover the pixels, and the polygon edges that partially cover the border pixels are <100% coverage
  • the presence of what appears to be a buffer polygon around this island makes me wonder if the filtering of seawater needs to be more comprehensive:
image

@julietcohen
Copy link

I created a larger sample of 20,000 polygons and reprocessed the data with a much smaller simplify_tolerance = 0.0001. This succeeded in retaining the lake shapes of the polygons.

image

The larger sample size highlighted that there are other polygons that appear to be seawater in addition to the island one pictured above in red in the last comment. So maybe the filtering could be refined.
seawater

@julietcohen julietcohen moved this from Ready to In Progress in Data Layers May 28, 2024
@julietcohen
Copy link

Update on deduplication and filtering seawater

Since I resolved this issue, here's the path forward for processing the lake change data:

  • Todd and Ingmar will filter the data for seawater and rivers to the best of their ability for this first version of the dataset, which they expect to have done in the next few days, and it's ok if there are leftover seawater polygons for now!
  • Todd or Ingmar will apply the duplicate flagging approach outlined here to each pair of adjacent UTM zones, and then remove the rows that are True for staging_duplicated before passing the data to me
  • It would be helpful if Todd and Ingmar could remove rows with NA values for the attribute of interest, ChangeRateNet_myr-1 as well as any other attributes that Ingmar or Anna wants visualized from this dataset
  • If there are any multipolygons in the data that is passed to me to use as input into the workflow, usually I would explode those polygons into singular ones with copied attributes for the new rows as part of the pre-viz processing. This is because if we input geometries of any kind besides single polygons then the workflow will not process those geometries. However, this dataset contains attributes related to area and perimeter of the polygons, which means that those attributes will not be accurate for those exploded polygons, so in my opinion that would not be good practice for this dataset. This is noted here.
    • If Ingmar prefers to keep these multipolygons in the tilesets output from the workflow, the best way to do that for now may be for him to explode the multipolygon geometries himself, then make the values NA for those new rows for the attributes that cannot be accurately copied over into the news rows. For example, for a multipolygon the attribute ChangeRateNet_myr-1 would be able to correctly translate to the new rows of exploded singular polygons, but the area attribute should not be copied over, so that should be converted to NA
  • If the dataset is passed onto me as a netCDF file and a geopackage file, then I will use the script outlined here to merge the data and save it as a geopackage before using it as input into the viz workflow
  • I will input it into the visualization workflow with the config set to not dedup, since that was already done, and have the no-data value set as 0 for the attribute ChangeRateNet_myr-1 for this first version of the dataset
  • To my knowledge, the data package does not exist on the ADC yet, so if it does exist, please point me to it. If not, please make a package and coordinate with @justinkadi.

Updates on dataset metrics

Ingmar helpfully provided some big picture dataset info:

Question Answer
How many lakes are mapped? about 6-6.2 million.
Total dataset size? the uncleaned, merged dataset is 4.13GB
Total area of the lakes? the uncleaned, merged dataset is ~950,000 km² but it will become smaller once deduplication is applied
For the derived lake drainage dataset, would we include a binary attribute data for simply "this lake is present or drained", or would it be an ordered categorical attribute including "partially drained", "totally drained", and "not drained"? "Thats a tricky one. drained must typically refer to some reference date, as lakes are often polycyclic. Ideally we would have a drainage year. We have an attribute "NetChange_perc" which actually provides that information. In a paper I am working on we have >25% <= 75% loss as partially drained, >75% as completely drained. However change must be > 1 ha as small lakes/changes are quite uncertain"
For the derived lake drainage dataset, would the geometry be in the form of polygons (the perimeter of the lake) or a point? Polygons. For other analysis or visualization we can still reduce to centroid/points or something similar

@julietcohen
Copy link

Deduplication update

After some mapping in QGIS, and without doing systematic checks to confirm for sure, polygons that border or intersect the antimeridian seem to have been identified in Ingmar's lake change dataset when the neighbors deduplication approach was applied to 2 adjacent UTM zones, 32601-32602. UTM zone 32601 borders the antimeridian. The neighbors deduplication approach transforms the geometries from their original CRS into the one specified as distance_crs so the anitmeridian polygons are distorted and wrap the opposite way around the world. See this screenshot of the distorted 32601 (green) and 32602 (yellow)

image

This results in polygons in zone 32602 overlapping spatially with polygons from 32601 that are not actually in the region of overlap. Here's a screenshot from Jonas (a collaborator with Ingmar and Todd) who mapped the polygons that were identified as duplicates in red:
32601_32602

Mapping the flagged duplicate polygons on top of the distorted zone 32601 shows suspicious overlap:
image (1)

Next steps:

If this is intersection with the antimeridian is indeed the cause of the incorrectly flagged duplicates (those that lie outside of the overlap between 2 adjacent UTM zones), then the neighbor deduplication should work well for all zones that do not intersect the antimeridian. Todd is looking into applying to approach to all UTM zones, and figuring out how to get around it for zone 32601. My recommendation is to identify which polygons do cross the antimeridian, split them, and buffer them slightly away from the line. Code for this was applied to the permafrost and ground ice dataset here.

@iona5
Copy link

iona5 commented Jul 24, 2024

Hi, this is Jonas from AWI.

I was initially asked by Todd regarding the files but i was not sure about the problem. But your post motivated me to look into this a bit further. I am not familiar with the details of the de-duplication but i understand that you reproject the UTM layers (32601, 32601) to EPSG:3857 to find the spatially overlaps which causes the vertices on the other side of the antimeridian to wrap around?

I guess you should try to project both datasets into a CRS where the coordinates do not wrap around at the antimeridian for these areas. I for example used EPSG:3832 to create the very screenshot you showed above to get rid of the polygon distortion. I didn't realize that this was actually the problem, so i didn't even mention this to Todd. I am optimistic if you project into EPSG:3832 you'll get the correct results for the UTM zones bordering the antimeridian.

Another option would be to just declare the datasets to be a non-critical UTM zone (for example 32603 and 32604) basically translating all polygons to the east before reprojection, so the longitude coordinates do not wrap around in EPSG:3857. I guess the inaccuracy is negligible. After de-duplication project back to 32603/04 and then translate back to 32601/02. But i guess the first option is preferable.

Sorry if I misunderstood the problem or if i am totally off track here.

@julietcohen
Copy link

Hi Jonas, thanks for your suggestions! You are correct that the CRS is configurable for the deduplication step, and I simply used the default projected CRS in my example script that I passed off to Todd. Todd and Ingmar wanted an example of how to execute the deduplication with adjacent UTM zones before the data is input into the visualization workflow. Given my example script and explicit parameters, your team can change the parameters as you see fit.

The transformation to the new CRS, EPSG 3857, is indeed what causes the geometries to become deformed. However, this CRS is not the only CRS that causes this issue, and the deduplication step is not the only time we transform the data in the workflow. I have encountered similarly deformed geometries in other datasets when converting to EPSG 4326 (not projected), which we do during the initial standardization of the data during the "staging" step. That unprojected CRS is required for the Tile Matrix Set of our viz workflow's output geopackage and geotiff tilesets. This means that you may be able to deduplicate the lake change data prior to the visualization workflow with a different CRS and retain the original lake geometries that border the antimeridian, but they will likely be problemic in the same way when we create tilesets from them, requiring buffering from the antimeridian as a "cleaning"/preprocessing step before staging anyway.

Plots in python

Original CRS

import geopandas as gpd
import matplotlib.pyplot as plt

gdf = gpd.read_file("~/check_dedup_forIN/check_for_todd/check_0726/lake_change_32601.gpkg")

fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
image

4326 causes deformed geometries

gdf_4326 = gdf.to_crs(epsg = 4326, inplace = False)

fig, ax = plt.subplots(figsize=(10, 10))
gdf_4326.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
image

3857 deforms geometries

gdf_3857 = gdf.to_crs(epsg = 3857, inplace = False)

fig, ax = plt.subplots(figsize=(10, 10))
gdf_3857.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
image

3832

gdf_3832 = gdf.to_crs(epsg = 3832, inplace = False)

fig, ax = plt.subplots(figsize=(10, 10))
gdf_3832.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
image

This last plot shows the data transformed into your suggested CRS, and as you suggested it shows no wrapping around the world, however the description of that CRS makes me wonder if the transformation to that CRS would slightly deform the geometries since it appears that the suggested use area for that CRS does not contain UTM zone 32601. But please correct me if I am wrong. But this amount of deformation may likely be negligible.

I don't follow your last suggestion fully, but it does sound like a function I know exists in R, ST_ShiftLongitude.

Resources:

Documentation for the neighbor deduplication approach can be found here. The source code is here.

@iona5
Copy link

iona5 commented Jul 27, 2024

Your concerns make sense (but i would personally argue the 3857 isn't really precise either but that's another can of worms 😉 ).

I might find the time to look into this further, but for now i guess in the end it boils down to the origin of the reference system. From what i remember from my GIS lectures:

In most predefined reference systems with roughly global coverage the longitude/x-origin is located at the 0° meridian with an extent of -180° to +180° in the lon-axis. If you project a polygon overlapping the antimeridian into these CRS, some vertices of those polygons will be transformed to having an x-coordinate close to -180 and others close to +180 (if in degrees). That is what happens here.

However, this is just a way the coordinate extent is defined and that can be changed. One can simply modify it to set the origin close to the area of interest. You can observe those in the CRS config (I think its most convenient and insightful to look at the PROJ4 string):

EPSG:4326: +proj=longlat +datum=WGS84 +no_defs +type=crs
EPSG:3857: +proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs +type=crs
EPSG:3832: +proj=merc +lon_0=150 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +type=crs

note 4326 and 3857 use a +lon_0 value of 0, while 3852 uses a +lon_0 valueof 150. That means it sets its origin to 150° east (see https://proj.org/en/9.4/usage/projections.html). So in that case the x-coordinates of the vertices of the polygons in the lake_change dataset are all set around the +30° value which corresponds to the antimeridian.

So if we want to stick to EPSG:3857 we can use it as a base:

For example in QGIS you can define this custom CRS (in Settings -> Custom projections),
+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=180 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs
which is just EPSG:3857 with the origin rotated (+lon_0=180) and it displays both files just fine.

image

So if you transform both GeoDataFrames into that custom CRS before getting the intersection i think it might work and you are technically still in 3857, just the origin is different:

import pyproj
import geopandas as gpd

gdf = gpd.read_file("lake_change_32601.gpkg")

crs_3857_rotated = pyproj.CRS.from_proj4("+proj=merc +a=6378137 +b=6378137 "
   "+lat_ts=0 +lon_0=180 +x_0=0 +y_0=0 +k=1 "
   "+units=m +nadgrids=@null +wktext +no_defs")

gdf_3857r = gdf.to_crs(crs_3857_rotated, inplace = False)

gdf_3857r.plot(cmap = 'viridis', linewidth = 0.8)

image

Note that for production one might consider not using the proj4 string for the definition of the custom CRS, but the WKT string, which is presumably more precise: https://proj.org/en/9.4/faq.html#what-is-the-best-format-for-describing-coordinate-reference-systems . the Proj library provides conversion tools, the setting corresponing to lon_0 is imho

...
        PARAMETER["Longitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
...

but i didn't test that.

I looked in your code to test the deduplication with this, but it seems like the projection is done beforehand (i.e. outside of deduplicate_neighbors() ) and i hadn't had the time to mock something up. But i guess if you use this custom CRS for data in the UTM zones close to the antimeridian, it will work.

@julietcohen
Copy link

Thanks so much for your thorough explanation of your idea, that's an interesting approach! Since the plan is for this deduplication to be completed by @tcnichol prior to passing the data off to me for visualization and standardization, maybe he can read through our suggestions and choose how to move forward.

I'd like to clarify when the CRS transformations take place. Since deduplication will occur before staging for this dataset, the transformation within the duplicate flagging process is the first time. Within deduplicate_neighbors, the parameter distance_crs defines the CRS that the data will be transformed into. The transformation occurs here, and note that it is done on the geometry points to check the distance between the centroids and identify the duplicates. The next time the data is transformed is here during staging, after you pass off the data to me.

Lastly, I'd like to highlight that flagging the duplicates and removing the duplicates are different steps. When we flag the duplicates, we create a boolean column, which I set to be called staged_duplicated for the parameter prop_duplicated in deduplicate_neighbors, to indicate if each polygons is a duplicate (True) or not (False). Your team is familiar with this, because the example script I provided outputs the file 32617-32618_dups_flagged.gpkg. So as a final step before passing the data to me, simply remove the rows that are True in each GDF.

@julietcohen
Copy link

julietcohen commented Aug 5, 2024

Identify which polygons intersect the 180th longitude line and split them

Before we determine which approach to take to deal with these polygons that cross the antimeridian, either by splitting them with a buffered the 180th degree longitude or creating a custom CRS with a different meridian, I want to first answer the question: Are the polygons that intersect the antimeridian lake detections or simply seawater and river polygons that are already supposed to be filtered out before deduplication and visualization anyway?

I did this exploration in R. The plots show that all but 1 polygons that intersect the antimeridian are seawater or rivers, which should be removed by Todd and Ingmar's filtering prior to deduplication and visualization.

explore_32601.R
# Author: Juliet Cohen
# Date: 2024-08-05

# Explore the relationship between lakes in 32601 and 180th degree longitude
# PART 1) example of how to split polygons with buffered antimeridian
# PART 2) subset and plot the polygons to just those that intersect the 
#         antimeridian, helping determine if most or all of these are seawater 
#         and river polys, rather than lake polys which will be the actual input
#         data to the viz-workflow

library(sf)
library(ggplot2)
library(leaflet) # interactive mapping
library(mapview)

# PART 1 -----------------------------------------------------------------------
# Split polygons at the buffered antimeridian

# UTM zone 32601, which has NOT been filtered for seawater or rivers yet
fp = "~/check_dedup_forIN/check_for_todd/check_0726/lake_change_32601.gpkg"
gdf_32601 <- st_read(fp) %>% 
             st_set_crs(32601)

# transform data to EPSG:4326, WGS84
gdf_4326 <- st_transform(gdf_32601, 4326)

# define the max and min y values to limit antimeridian line to relevant area
bbox <- st_bbox(gdf_4326)
ymin = bbox["ymin"]
ymax = bbox["ymax"]

# create a line of longitude at 180 degrees (antimeridian) in EPSG:4326
AM <- st_linestring(matrix(c(180, 180, ymin, ymax), ncol = 2)) %>%
      st_sfc(crs = 4326)

# plot polygons (in 32601) with AM line (in 4326)
ggplot() +
  geom_sf(data = gdf_32601) +
  geom_sf(data = AM, color = "red", size = 1) +
  # adjust the background grid to either CRS, 4326 or 32601
  # coord_sf(crs = st_crs(32601), datum = st_crs(32601)) +
  labs(title = "Lakes in UTM 32601 with 180° longitude line") +
  theme_minimal()

# buffer the antimeridian line
buffered_AM <- st_buffer(AM, dist = 0.1)
buffered_AM_32601 <- st_transform(buffered_AM, 32601)
# split the polygons with the buffered AM
split_polys <- st_difference(gdf_32601, buffered_AM_32601)

# convert split polygons to 4326 because this is required by leaflet
split_polys_4326 <- st_transform(split_polys, 4326)
map <- leaflet(split_polys_4326) %>%
  addTiles() %>%  # add default OpenStreetMap map tiles
  addPolygons()
map

# PART 2 -----------------------------------------------------------------------
# Determine which polygons actually cross the antimeridian

# Both geosaptial objects MUST be within the same CRS,
# so use the unbuffered antimeridian, but convert it to CRS 32601.
# NOTE: cannot go the other way (transform the polys to 4326),
# because they wrap the other way around the world, 
# because have not yet been split yet
AM_32601 <- st_transform(AM, 32601)
intersect_polys <- gdf_32601[st_intersects(gdf_32601, AM_32601, sparse = FALSE), ]

# split the intersecting polygons with the buffered AM defined earlier in script,
# because need the buffer to move polygon side away from the antimeridian
intersect_split_polys <- st_difference(intersect_polys, buffered_AM_32601)

intersect_split_polys_4326 <- st_transform(intersect_split_polys, 4326)
intersect_map <- leaflet(intersect_split_polys_4326) %>%
  addTiles() %>%
  addPolygons()   
intersect_map

@julietcohen
Copy link

Note that the same exploration should be done for the UTM zone on the other side of the 180th degree longitude and for any zones further south that also touch the meridian may be included in the analysis

@julietcohen
Copy link

julietcohen commented Aug 14, 2024

Todd will first apply the seawater and river filtering to the lake change data (including the final step of removing the rows where the polygon has been flagged as False for within_land), then will apply the antimeridian polygon splitting to the relevant UTM zones (for the few polygons that need it), then will apply the deduplication, then it is my understanding that the data can be passed to me for visualization.

@julietcohen
Copy link

Notes from a meeting today, Aug 16th are in the PDG meeting notes. They outline the next steps for Ingmar and Todd to process all UTM zones with filtering, deduplication, and merging, then validate the data with the geohash ID's. Regarding the lake polygons that intersect the antimeridian, Todd is not sure how many are in the dataset. Ingmar suggested that if they stick to polar projections for all steps for data processing prior to the visualization workflow, then the intersection the antimeridian will not be a problem at all. While this is true, this will still be a problem when the data is input into the viz workflow (as noted above here as well) because we use EPSG:4326. I emphasize this because if there are lake polygons that intersect the antimeridian then whoever runs the viz workflow on this data will need to split those polygons prior to processing with one of the methods I documented above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
layer Displaying a specific data product in the PDG portal pdg Permafrost Discovery Gateway Water data layer category: water
Projects
Status: In Progress
Development

No branches or pull requests

6 participants