import os
import urllib

import numpy as np
import pandas as pd
import xarray as xr
# values of a single variable at each point of the coords
temp_data = np.array([np.zeros((5,5)),
                      np.ones((5,5)),
                      np.ones((5,5))*2]).astype(int)
# names of the dimensions in the required order
dims = ('time', 'lat', 'lon')
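The cell that builds the coordinates is elided in this diff; below is a minimal sketch consistent with the coordinate values shown in the DataArray repr further down (the exact construction in the original may differ).

# coordinate values along each dimension
coords = {'time': pd.date_range("2022-09-01", "2022-09-03", freq='D'),
          'lat': np.arange(70, 20, -10),
          'lon': np.arange(60, 110, 10)}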
# attributes (metadata) of the data array
attrs = { 'title' : 'temperature across weather stations',
          'standard_name' : 'air_temperature',
          'units' : 'degree_c'}
# initialize xarray.DataArray
temp = xr.DataArray(data = temp_data,
                    dims = dims,
                    coords = coords,
                    attrs = attrs)
temp

The resulting xarray.DataArray has dimensions time: 3, lat: 5, lon: 5, holds the integer temperature values (0s, 1s, and 2s for the three days), has time (2022-09-01 to 2022-09-03), lat (70, 60, 50, 40, 30) and lon (60, 70, 80, 90, 100) coordinates backed by pandas indexes, and carries the title, standard_name, and units (degree_c) attributes.
We can also update the variable’s attributes after creating the object. Notice that each of the coordinates is also an xarray.DataArray, so we can add attributes to them.
# update attributes
temp.attrs['description'] = 'simple example of an xarray.DataArray'
# each coordinate is itself a DataArray, so it can carry its own attributes
temp.time.attrs = {'description': 'date of measurement'}
temp.lat.attrs = {'standard_name': 'grid_latitude', 'units': 'degree_N'}
temp.lon.attrs = {'standard_name': 'grid_longitude', 'units': 'degree_E'}
temp
The updated repr now shows the new description attribute on the array itself, a description ('date of measurement') on the time coordinate, standard_name/units attributes (grid_latitude/degree_N and grid_longitude/degree_E) on the lat and lon coordinates, and the original title, standard_name, and units attributes.
At this point, since we have a single variable, the dataset attributes and the variable attributes are the same.
5.4.1.2 Indexing
An xarray.DataArray allows both positional indexing (like numpy) and label-based indexing (like pandas). Positional indexing is the most basic, and it’s done using Python’s [] syntax, as in array[i,j] with i and j both integers. Label-based indexing takes advantage of dimensions in the array having names and coordinate values that we can use to access data instead of remembering the positional order of each dimension.
As an example, suppose we want to know what was the temperature recorded by the weather station located at 40°0′N 80°0′E on September 1st, 2022. By recalling all the information about how the array is setup with respect to the dimensions and coordinates, we can access this data positionally:
temp[0,3,2]
Or, we can use the dimension names and their coordinates to access the same value:
temp.sel(time='2022-09-01', lat=40, lon=80)
Notice that the result of this indexing is a 1x1 xarray.DataArray. This is because operations on an xarray.DataArray (resp. xarray.DataSet) always return another xarray.DataArray (resp. xarray.DataSet). In particular, operations returning scalar values will also produce xarray objects, so we need to cast them as numbers manually. See xarray.DataArray.item.
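As a quick illustration (a minimal sketch reusing the selection above), item() returns the underlying value as a plain Python number:

# cast the single-element DataArray to a plain number
temp.sel(time='2022-09-01', lat=40, lon=80).item()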
5.4.1.3 Reduction
xarray has implemented several methods to reduce an xarray.DataArray along any number of dimensions. One of the advantages of xarray.DataArray is that, if we choose to, it can carry over attributes when doing calculations. For example, we can calculate the average temperature at each weather station over time and obtain a new xarray.DataArray.
avg_temp = temp.mean(dim = 'time')
# to keep attributes add keep_attrs = True
avg_temp.attrs = {'title': 'average temperature over three days'}
avg_temp
The result is an xarray.DataArray with dimensions lat: 5 and lon: 5; the lat and lon coordinates keep their standard_name and units attributes, and the new array carries the title 'average temperature over three days'.
More about xarray computations.
5.4.2.1 Create an xarray.DataSet
Following our previous example, we can create an xarray.DataSet by combining the temperature data with the average temperature data. We also add some attributes that now describe the whole dataset, not only each variable.
# make dictionaries with variables and attributes
data_vars = {'avg_temp': avg_temp,
             'temp': temp}

attrs = {'description': 'simple example of an xarray.Dataset'}

# combine variables and attributes into an xarray.Dataset
temp_dataset = xr.Dataset(data_vars = data_vars,
                          attrs = attrs)
temp_dataset
5.4.2.2 Save and Reopen
Finally, we want to save our dataset as a NetCDF file. To do this, specify the file path and use the .nc extension for the file name. Then save the dataset using the to_netcdf method with your file path. Opening a NetCDF file is similarly straightforward using xarray.open_dataset().
# specify file path: don't forget the .nc extension!
fp = os.path.join(os.getcwd(), 'temp_dataset.nc')
# save file
temp_dataset.to_netcdf(fp)
# reopen the file to check its contents
xr.open_dataset(fp)

The reopened dataset shows the same variables and attributes, including description: simple example of an xarray.Dataset.
5.4.3 Exercise
For this exercise, we will use a dataset including time series of annual Arctic freshwater fluxes and storage terms. The data was produced for the publication Jahn and Laiho, 2020 about changes in the Arctic freshwater budget and is archived at the Arctic Data Center doi:10.18739/A2280504J
url = 'https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A792bfc37-416e-409e-80b1-fdef8ab60033'
msg = urllib.request.urlretrieve(url, "FW_data_CESM_LW_2006_2100.nc")
fp = os.path.join(os.getcwd(), 'FW_data_CESM_LW_2006_2100.nc')
fw_data = xr.open_dataset(fp)
fw_data
The dataset's global attributes include:
creation_date: 02-Jun-2020 15:38:31
author: Alexandra Jahn, CU Boulder, alexandra.jahn@colorado.edu
title: Annual timeseries of freshwater data from the CESM Low Warming Ensemble
description: Annual mean Freshwater (FW) fluxes and storage relative to 34.8 shown in Jahn and Laiho, GRL, 2020, calculated from the 11-member Community Earth System Model (CESM) Low Warming Ensemble output (Sanderson et al., 2017, Earth Syst. Dynam., 8, 827-847. doi: 10.5194/esd-8-827-2017). These 11 ensemble members were branched from the first 11 ensemble members of the CESM Large Ensemble (companion data file) at the end of 2005. Convention for the fluxes is that positive fluxes signify a source of FW to the Arctic and negative fluxes are a sink/export of FW for the Arctic. FW fluxes are the net fluxes through a strait over the full ocean depth, adding up any positive and negative fluxes. Liquid FW storage is calculated over the full depth of the ocean but ignoring any negative FW (resulting from salinties over 34.8). Solid FW storage includes FW stored in sea ice and FW stored in snow on sea ice. Surface fluxes and FW storage is calculated over the Arctic domain bounded by Barrow Strait, Nares Strait, Bering Strait, Fram Strait, and the Barents Sea Opeing (BSO). Davis Strait fluxes are included for reference only and are outside of the Arctic domain. A map showing the domain and the location of the straits can be found in Jahn and Laiho, GRL, 2020.
data_structure: The data structure is |Ensemble member | Time (in years)|. All data are annual means. The data covers 2006-2100. There are 11 ensemble members.
We can see in the object viewer that the runoff_annual variable has two dimensions: time and member, in that order. We can also access the dimensions by calling:
fw_data.runoff_annual.dims
The second dimension is member. Near the top of the object viewer, under coordinates, we can see that member’s coordinates are an array from 1 to 11. We can directly see this array by calling:
fw_data.member
member2 = fw_data.netPrec_annual.sel(member=2)
Based on our previous answer, this maximum is:
x_max = member2.loc[2022:2100].max()
x_max.item()
5.5.2 pandas to xarray
What does the previous example look like when working with pandas and xarray?
Let’s work with a csv file containing the previous temperature measurements. Essentially, we need to read this file as a pandas.DataFrame and then use the pandas.DataFrame.to_xarray() method, taking into account that the dimensions of the resulting xarray.DataSet will be formed using the index column(s) of the pandas.DataFrame. In this case, we know the first three columns will be our dimension columns, so we need to group them as a multi-index for the pandas.DataFrame. We can do this by using the index_col argument directly when we read in the csv file.
fp = os.path.join(os.getcwd(), 'netcdf_temp_data.csv')

# specify columns representing dimensions
dimension_columns = [0,1,2]

# read the csv using index_col so these columns become a multi-index
# (the exact read call is reconstructed; the resulting DataFrame is used as `temp` below)
temp = pd.read_csv(fp, index_col = dimension_columns)
temp
And this is our resulting xarray.DataSet:
temp.to_xarray()
For further reading and examples about switching between pandas and xarray you can visit the following:
9.3 Pre-processing raster data
First we need to load in our libraries. We’ll use geopandas for vector manipulation and rasterio for raster manipulation.
Next, we’ll use urllib to download the ship traffic raster from Kapsar et al. We grab a one-month slice (August 2020) of a coastal subset of the data at 1 km resolution. To get the URL in the code chunk below, you can right click the download button for the file of interest and select “copy link address.”
import urllib

url = 'https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A6b847ab0-9a3d-4534-bf28-3a96c5fa8d72'
msg = urllib.request.urlretrieve(url, "Coastal_2020_08.tif")
Using rasterio, open the raster file, plot it, and look at the metadata. We use with here as a context manager. This ensures that the connection to the raster file is closed and cleaned up when we are done with it.
import rasterio
import matplotlib.pyplot as plt
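# (the cell that opens the raster is elided in this diff; a minimal sketch,
#  assuming the file downloaded above, producing the ships array and
#  ships_meta profile that are inspected below)
with rasterio.open("Coastal_2020_08.tif") as ship_con:
    ships = ship_con.read(1)         # read band 1 into a numpy array
    ships_meta = ship_con.profile    # grab the raster metadata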
type(ships)
numpy.ndarray
type(ships_meta)
rasterio.profiles.Profile
9.4 Pre-processing vector data
Now download a vector dataset of commercial fishing districts in Alaska.
url = 'https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3A7c942c45-1539-4d47-b429-205499f0f3e4'
msg = urllib.request.urlretrieve(url, "Alaska_Commercial_Salmon_Boundaries.gpkg")
Read in the data using geopandas.
import geopandas as gpd

comm = gpd.read_file("Alaska_Commercial_Salmon_Boundaries.gpkg")
Note the “pandas” in the library name “geopandas.” Our comm object is really just a special type of pandas data frame called a geodataframe. This means that in addition to any geospatial stuff we need to do, we can also just do regular pandas things on this data frame.
For example, we can get a list of column names (there are a lot!)
comm.columns.values
array(['OBJECTID', 'GEOMETRY_START_DATE', 'GEOMETRY_END_DATE',
comm.head()
The diagram above shows the three different types of geometries that geospatial vector data can take: points, lines, or polygons. Whatever the geometry type, the geometry information (the x,y points) is stored in the column named geometry in the geopandas data frame. In this example, we have a dataset containing polygons of fishing districts. Each row in the dataset corresponds to a district, with unique attributes (the other columns in the dataset), and its own set of points defining the boundaries of the district, contained in the geometry column.
comm['geometry'][:5]
0 MULTIPOLYGON (((-151.32805 64.96913, -151.3150...
comm.crs
<Geographic 2D CRS: EPSG:4326>
comm['geometry'][8]
comm.plot(figsize=(9,9))
comm_3338 = comm.to_crs("EPSG:3338")

comm_3338.plot()
9.5 Crop data to area of interest
For this example, we are only interested in south central Alaska, encompassing Prince William Sound, Cook Inlet, and Kodiak. Our raster data is significantly larger than that, and the vector data is statewide. So, as a first step we might want to crop our data to the area of interest.
First, we’ll need to create a bounding box. We use the box function from shapely to create the bounding box, then create a geoDataFrame from the points, and finally convert the WGS 84 coordinates to the Alaska Albers projection.
from shapely.geometry import box

coord_box = box(-159.5, 55, -144.5, 62)

# build a GeoDataFrame in WGS 84 and reproject to Alaska Albers
# (the constructor call is partly reconstructed; only the geometry argument and
#  the to_crs call are visible in the source)
coord_box_df = gpd.GeoDataFrame(crs = "EPSG:4326",
                                geometry = [coord_box]).to_crs("EPSG:3338")
Now, we can read in the raster again, cropped to the bounding box. We use the mask function from rasterio.mask. Note that we apply this to the connection to the raster file (with rasterio.open(...)), then update the metadata associated with the raster, because the mask function requires as its first dataset argument a dataset object opened in r mode.
import rasterio.mask
import numpy as np
import numpy as np
@@ -693,7 +693,7 @@ # turn the no-data values into NaNs.
== ship_con.nodata] = np.nan shipc_arr[shipc_arr
shipc_meta.update({"driver": "GTiff",
                   "height": shipc_arr.shape[0],
                   "width": shipc_arr.shape[1],
                   "compress": "lzw"})
Now we’ll do a similar task with the vector data. In this case, we use a spatial join. The join will be an inner join, and will select only rows from the left side (our fishing districts) that are within the right side (our bounding box). I chose this method as opposed to a clipping type operation because it ensures that we don’t end up with any open polygons at the boundaries of our box, which could cause problems for us down the road.
comm_clip = gpd.sjoin(comm_3338,
                      coord_box_df,
                      how='inner',
                      predicate='within')   # the predicate argument is reconstructed from the description above
9.5.1 Check extents
Now let’s look at the two cropped datasets overlaid on each other to ensure that the extents look right.
import rasterio.plot
# set up plot
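# (the plotting code is elided in this diff; a minimal sketch of the overlay,
#  assuming shipc_arr/shipc_aff from the masking step above)
fig, ax = plt.subplots(figsize=(7, 7))
rasterio.plot.show(shipc_arr, transform=shipc_aff, ax=ax)   # cropped ship traffic raster
comm_clip.plot(ax=ax, facecolor='none', edgecolor='red')    # clipped district outlines
plt.show()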
9.6.0.1 Zonal statistics over one polygon
Let’s look at how this works over just one fishing area first. We use the rasterize method from the features module in rasterio. This takes as arguments the data to rasterize (in this case the 40th row of our dataset), and the shape and transform the output raster will take. We also set the all_touched argument to true, which means any pixel that touches a boundary of our vector will be burned into the mask.
from rasterio import features

r40 = features.rasterize(comm_clip['geometry'][40].geoms,
                         out_shape=shipc_arr.shape,   # the shape and transform arguments are
                         transform=shipc_aff,         # reconstructed; these values are assumptions
                         all_touched=True)
If we have a look at a plot of our rasterized version of the single fishing district, we can see that instead of a vector, we now have a raster showing the rasterized district (with pixel values of 1) and any area not in the district has a pixel value of 0.
# set up plot
fig, ax = plt.subplots(figsize=(7, 7))
# plot the raster
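# (the plot call is elided; a sketch -- display the 0/1 mask for district 40)
ax.imshow(r40)
plt.show()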
np.unique(r40)
array([0, 1], dtype=uint8)
Finally, we need to know the indices where the fishing district is. We can use np.where to extract this information.
r40_index = np.where(r40 == 1)
print(r40_index)
np.nansum(shipc_arr[r40_index])
14369028.0
9.6.0.2 Zonal statistics over all polygons
comm_clip['id'] = range(0, len(comm_clip))
For each district (with geometry and id), we run the features.rasterize function. Then we calculate the sum of the values of the shipping raster based on the indices in the raster where the district is located.
distance_dict = {}
for geom, idx in zip(comm_clip['geometry'], comm_clip['id']):
    rasterized = features.rasterize(geom.geoms,
                                    out_shape=shipc_arr.shape,   # as above, the shape and
                                    transform=shipc_aff,         # transform are reconstructed
                                    all_touched=True)
    r_index = np.where(rasterized == 1)   # indices of pixels inside this district
    distance_dict[idx] = np.nansum(shipc_arr[r_index])
Now we just create a data frame from that dictionary, and join it to the vector data using pandas operations.
import pandas as pd

# create a data frame from the result
# (the construction is partly reconstructed; it needs 'id' and 'distance' columns for the merge below)
distance_df = pd.DataFrame(distance_dict.items(), columns=['id', 'distance'])
distance_df['distance'] = distance_df['distance']/1000
Now we join the result to the original geodataframe.
# join the sums to the original data frame
res_full = comm_clip.merge(distance_df,
                           on = "id",
                           how = 'inner')
Finally, we can plot our result!
import matplotlib.ticker

fig, ax = plt.subplots(figsize=(7, 7))
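# (the plot call is elided; a sketch -- color the districts by summed ship distance,
#  mirroring the aggregated plot further below)
res_full.plot(column="distance", legend=True, ax=ax)
plt.show()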
reg_area = res_full.dissolve(by = "REGISTRATION_AREA_NAME",
                             aggfunc = 'sum',
                             numeric_only = True)
Let’s have a look at the same plot as before, but this time over our aggregated data.
fig, ax = plt.subplots(figsize=(7, 7))

ax = reg_area.plot(column = "distance", legend = True, ax = ax)
4.7 Task parallelization in Python
Python also provides a number of easy to use packages for concurrent processing. We will review two of these, concurrent.futures and parsl, to show just how easy it can be to parallelize your programs. concurrent.futures is built right into the python3 release, and is a great starting point for learning concurrency.
We’re going to start with a task that is a little expensive to compute, and define it in a function. All this task(x) function does is use numpy to create a fairly large range of numbers, and then sum them.
def task(x):
    import numpy as np
    result = np.arange(x*10**8).sum()
    return result
We can start by executing this task function serially ten times with varying inputs. In this case, we create a function run_serial that takes a list of inputs to be run, and it calls the task function for each of those inputs. The @timethis decorator is a simple way to wrap the function with timing code so that we can see how long it takes to execute.
import numpy as np

@timethis
def run_serial(task_list):
    # (the body is reconstructed; it simply calls task() on each input)
    return [task(x) for x in task_list]

run_serial(np.arange(10))
run_serial: 7049.6227741241455 ms
[0,
from concurrent.futures import ThreadPoolExecutor
@timethis
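# (the rest of this cell is elided in the diff; a minimal sketch of a threaded
#  runner -- executor.map applies task() across the inputs concurrently;
#  the worker count is an assumption)
def run_threaded(task_list):
    with ThreadPoolExecutor(max_workers=10) as executor:
        return list(executor.map(task, task_list))

run_threaded(np.arange(10))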
4.8.1 Serial
When you have a list of repetitive tasks, you may be able to speed it up by adding more computing power. If each task is completely independent of the others, then it is pleasingly parallel and a prime candidate for executing those tasks in parallel, each on its own core. For example, let’s build a simple loop that downloads the data files that we need for an analysis. First, we start with the serial implementation.
import urllib
def download_file(row):
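    # (body elided in this diff; a sketch, assuming each row of the file table
    #  provides a source URL and a local file name -- these field names are hypothetical)
    print("Downloading: " + row['filename'])
    msg = urllib.request.urlretrieve(row['url'], row['filename'])
    return row['filename']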
4.8.2 Multi-threaded with concurrent.futures
In this case, we’ll use the same download_file function from before, but let’s switch up and create a download_threaded() function to use concurrent.futures with a ThreadPoolExecutor just as we did earlier.
from concurrent.futures import ThreadPoolExecutor
@timethis
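def download_threaded(file_table):
    # (body elided in this diff; a sketch, assuming file_table is a pandas DataFrame
    #  of files and download_file handles one row -- the worker count is an assumption)
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(download_file, row) for _, row in file_table.iterrows()]
        return [f.result() for f in futures]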
4.8.3 Multi-process with concurrent.futures
You’ll remember from earlier that you can execute tasks concurrently by creating multiple threads within one process (multi-threaded), or by creating and executing multiple processes. The latter creates more independence, as each of the executing tasks has its own memory and process space, but it also takes longer to set up. With concurrent.futures, we can switch to a multi-process pool by using a ProcessPoolExecutor, analogously to how we used ThreadPoolExecutor previously. So, some simple changes, and we’re running multiple processes.
from concurrent.futures import ProcessPoolExecutor
@timethis
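def download_process(file_table):
    # (body elided in this diff; same pattern as the threaded version, but each
    #  download runs in its own process -- the worker count is an assumption)
    with ProcessPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(download_file, row) for _, row in file_table.iterrows()]
        return [f.result() for f in futures]

# run it and print the downloaded file names (file_table comes from the elided setup)
results = download_process(file_table)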
for result in results:
    print(result)
Downloading: SWI_2019.tifDownloading: SWI_2021.tifDownloading: SWI_2007.tif
Downloading: SWI_2020.tif
Downloading: SWI_2001.tif
download_process: 23428.152322769165 ms
SWI_2007.tif
SWI_2019.tif
SWI_2021.tif
Parsl (docs), Dask, Ray, and others come into play. They all have their strengths, but Parsl makes it particularly easy to build parallel workflows out of existing python code through its use of decorators on existing python functions.
In addition, Parsl supports a lot of different kinds of providers, allowing the same python code to be easily run multi-threaded using a ThreadPoolExecutor and via multi-processing on many different cluster computing platforms using the HighThroughputExecutor. For example, Parsl includes providers supporting local execution, and on Slurm, Condor, Kubernetes, AWS, and other platforms. And Parsl handles data staging as well across these varied environments, making sure the data is in the right place when it’s needed for computations.
Similarly to before, we start by configuring an executor in parsl, and loading it. We’ll use multiprocessing by configuring the HighThroughputExecutor to use our local resources as a cluster, and we’ll activate our virtual environment to be sure we’re executing in a consistent environment.
# Required packages
import parsl
from parsl import python_app
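# (the executor configuration is elided in this diff; a minimal sketch of a
#  HighThroughputExecutor running on local resources -- the label, worker settings,
#  and the virtual environment path are assumptions)
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

htex_local = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            cores_per_worker=1,
            provider=LocalProvider(
                worker_init="source .venv/bin/activate",   # hypothetical venv path
            ),
        )
    ],
)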
parsl.clear()
parsl.load(htex_local)
<parsl.dataflow.dflow.DataFlowKernel at 0x7f0ca1bff550>
We now have a live parsl executor (htex_local) that is waiting to execute processes. We tell it to execute processes by annotating functions with decorators that indicate which tasks should be parallelized. Parsl then handles the scheduling and execution of those tasks based on the dependencies between them. In the simplest case, we’ll decorate our previous function for downloading a file with the @python_app decorator, which tells parsl that any calls to this function should be run on the default executor (in this case, htex_local).
# Decorators seem to be ignored as the first line of a cell, so print something first
print("Create decorated function")
Calling a decorated function returns an AppFuture, and we can call the AppFuture.done() function to determine when the future result is ready without blocking. Or, we can just block on AppFuture.result(), which waits for each of the executions to complete and then returns the result.
#! eval: true
print("Define our download_futures function")
wait_for_futures(file_table[0:5])
Define our download_futures function
<AppFuture at 0x7f0c60792920 state=pending>
<AppFuture at 0x7f0c607920e0 state=pending>
<AppFuture at 0x7f0c60792b00 state=pending>
<AppFuture at 0x7f0c9fa9e890 state=pending>
<AppFuture at 0x7f0c6043e5f0 state=pending>
download_futures: 12.877464294433594 ms
['SWI_2007.tif', 'SWI_2019.tif', 'SWI_2021.tif', 'SWI_2020.tif', 'SWI_2001.tif']
wait_for_futures: 29422.613620758057 ms
When we’re done, be sure to clean up and shutdown the htex_local executor, or it will continue to persist in your environment and utilize resources. Generally, an executor should be created when setting up your environment, and then it can be used repeatedly for many different tasks.
htex_local.executors[0].shutdown()
parsl.clear()
6.4 Connect to an existing cluster
The instructor is going to start a local cluster on the server that everyone can connect to. This scenario is a realistic simulation of what it would look like for you to connect to a shared Dask resource. You can create your own local cluster running on your laptop using code at the end of this lesson.
First, the instructor (and only the instructor!) will run:
from dask.distributed import LocalCluster

cluster = LocalCluster(n_workers=70, memory_limit='auto', processes=True)
To connect to it, we will use the address that the scheduler is listening on. The port is generated randomly, so first the instructor needs to get the address for you to use in your code. In a ‘real world’ scenario, this address would be given to you by the administrator of the Dask cluster.
cluster.scheduler_address
Now, you can pass the address to the Client function, which sets up your session as a client of the Dask cluster:
from dask.distributed import LocalCluster, Client
# if you are copy pasting this from the book, make sure
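# (the rest of this cell is elided in the diff; a sketch -- paste the address
#  obtained from cluster.scheduler_address; the address shown here is hypothetical)
client = Client("tcp://127.0.0.1:8786")
client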
6.5.1 Reading a csv
To get familiar with dask.dataframes, we will use tabular data of soil moisture measurements at six forest stands in northeastern Siberia. The data has been collected since 2014 and is archived at the Arctic Data Center (Loranty & Alexander, doi:10.18739/A24B2X59C). Just as we did in the previous lesson, we will download the data using the urllib package and the data’s URL obtained from the Arctic Data Center.
import os
import urllib
import dask.dataframe as dd
url = 'https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A27e4043d-75eb-4c4f-9427-0d442526c154'
msg = urllib.request.urlretrieve(url, "dg_soil_moisture.csv")
In the Arctic Data Center metadata we can see this file is 115 MB. To import this file as a dask.dataframe with more than one partition, we need to specify the size of each partition with the blocksize parameter. In this example, we will split the data frame into six partitions, meaning a block size of approximately 20 MB.
fp = os.path.join(os.getcwd(), 'dg_soil_moisture.csv')
df = dd.read_csv(fp, blocksize = '20MB', encoding='ISO-8859-1')
df
About the encoding parameter: If we try to import the file directly, we will receive a UnicodeDecodeError. We can run the following code to find the file’s encoding and add the appropriate encoding to dask.dataframe.read_csv.
import chardet

fp = os.path.join(os.getcwd(), 'dg_soil_moisture.csv')
with open(fp, 'rb') as rawdata:
    # detect the encoding from a sample of the file (the sample size here is an assumption)
    result = chardet.detect(rawdata.read(100_000))
result
Notice that we cannot see any values in the data frame. This is because Dask has not really loaded the data. It will wait until we explicitly ask it to print or compute something to do so.
However, we can still do df.head(). It’s not costly for memory to access a few data frame rows.
df.head(3)
6.5.2 Lazy Computations
The application programming interface (API) of a dask.dataframe is a subset of the pandas.DataFrame API. So if you are familiar with pandas, many of the core pandas.DataFrame methods directly translate to dask.dataframes. For example:
averages = df.groupby('year').mean(numeric_only=True)
averages
Notice that we cannot see any values in the resulting data frame. A major difference between pandas.DataFrames and dask.dataframes is that dask.dataframes are “lazy”. This means an object will queue transformations and calculations without executing them until we explicitly ask for the result of that chain of computations using the compute method. Once we run compute, the scheduler can allocate memory and workers to execute the computations in parallel. This kind of lazy evaluation (or delayed computation) is how most Dask workloads work. This varies from eager evaluation methods and functions, which start computing results right when they are executed.
Before calling compute on an object, open the Dask dashboard to see how the parallel computation is happening.
averages.compute()
In this short example we will create a 200x500 dask.array by specifying chunk sizes of 100x100.
import numpy as np
import dask.array as da

data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
Computations for dask.arrays also work lazily. We need to call compute to trigger computations and bring the result to memory.
a.mean()

a.mean().compute()
The NDVI is calculated using the near-infrared and red bands of the satellite image. The formula is
\[NDVI = \frac{NIR - Red}{NIR + Red}.\]
First, we download the data for the near-infrared (NIR) and red bands from the Arctic Data Center:
# download red band
url = 'https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3Aac25a399-b174-41c1-b6d3-09974b161e5a'
msg = urllib.request.urlretrieve(url, "RU_ANS_TR2_FL005M_red.tif")
Because these are .tif files and have geospatial metadata, we will use rioxarray to read them. You can find more information about rioxarray here.
To indicate we will open these .tif files with dask.arrays as the underlying object to the xarray.DataArray (instead of a numpy.array), we need to specify either a shape or the size in bytes for each chunk. Both files are 76 MB, so let’s have chunks of 15 MB to have roughly six chunks.
import rioxarray as rioxr

# read in the file
fp_red = os.path.join(os.getcwd(), "RU_ANS_TR2_FL005M_red.tif")
red = rioxr.open_rasterio(fp_red, chunks = '15MB')
There is geospatial information (transformation, CRS, resolution) and no-data values. There is an unnecessary dimension: a constant value for the band. So our next step is to squeeze the array to flatten it.
# getting rid of unnecessary dimension
red = red.squeeze()
Next, we read in the NIR band and do the same pre-processing:
# open data
fp_nir = os.path.join(os.getcwd(), "RU_ANS_TR2_FL005M_nir.tif")
nir = rioxr.open_rasterio(fp_nir, chunks = '15MB')
# same pre-processing as the red band: drop the constant band dimension
nir = nir.squeeze()
6.7.2 Calculating NDVI
Now we set up the NDVI calculation. This step is easy because we can handle xarrays and Dask arrays as NumPy arrays for arithmetic operations. Also, both bands have values of type float32, so we won’t have trouble with the division.
ndvi = (nir - red) / (nir + red)
When we look at the NDVI we can see the result is another dask.array; nothing has been computed yet. Remember, Dask computations are lazy, so we need to call compute() to bring the results to memory.
ndvi_values = ndvi.compute()
And finally, we can see what these look like. Notice that xarray uses the value of the dimensions as labels along the x and y axes. We use robust=True to ignore the no-data values when plotting.
ndvi_values.plot(robust=True)
6.8.1 Setting up a Local Cluster
We can create a local cluster as follows:
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=4, memory_limit=0.1, processes=True)
cluster
And then we create a client to connect to our cluster, passing the Client function the cluster object.
client = Client(cluster)
client
3.5 Brief overview of python syntax
We’ll very briefly go over some basic python syntax and the base variable types. First, open a python script. From the File menu, select New File, type “python”, then save it as ‘python-intro.py’ in the top level of your directory.
In your file, assign a value to a variable using = and print the result.
x = 4
print(x)
str = 'Hello World!'
print(str)
list = [100, 50, -20, 'text']
print(list)
list[0] # print the first element
list[1:3] # print the 2nd and 3rd elements
list[:2] # print the first two elements
list2 = ['more', 'things']

list + list2
tuple = ('a', 'b', 'c', 'd')
tuple[0]
tuple * 3