v0.67.0
holukas authored Jan 9, 2024
2 parents ee91ad7 + 9d61f4c commit 2c18ae6
Showing 55 changed files with 26,068 additions and 4,734 deletions.
11 changes: 7 additions & 4 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,10 @@

# e.g.: /src
/.idea/
target/
__manuscripts/
__workbench/
__todo/
/__local_folders

/notebooks/_scratch/
/notebooks/Workbench/FLUXNET_CH4-N2O_Committee_WP2/data/

# Byte-compiled / optimized / DLL files
__pycache__/
Expand Down Expand Up @@ -114,3 +114,6 @@ venv.bak/
/notebooks/Manuscripts/Hörtnagl et al. (2023) - NEP Penalty/
/notebooks/Workbench/example_n2o_outlier/
/notebooks/Workbench/FLUXNET CH4 N2O Committee WP2/data/
/notebooks/Workbench/FLUXNET_CH4-N2O_Committee_WP2/data/
/__archived/
/__local_folders/
47 changes: 47 additions & 0 deletions CHANGELOG.md
Expand Up @@ -2,6 +2,53 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.67.0 | 9 Jan 2024

### Updates to flux processing chain

The flux processing chain was updated in an attempt to make processing more streamlined and easier to follow. One of the
biggest changes is the implementation of the `repeat` keyword for outlier tests. With this keyword set to `True`, the
respective test is repeated until no more outliers can be found. How the flux processing chain can be used is shown in
the updated `FluxProcessingChain` notebook (`notebooks/FluxProcessingChain/FluxProcessingChain.ipynb`).
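
The idea behind `repeat` can be illustrated with a minimal, self-contained sketch. This is not the diive implementation: the function name `repeated_zscore_flags`, the threshold, the iteration cap and the return format (one list of rejected positions per iteration) are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def repeated_zscore_flags(values, threshold=3.0, max_iterations=100):
    """Repeat a simple z-score test until no more outliers are found.

    Hypothetical helper: returns one list of rejected positions per iteration.
    """
    remaining = dict(enumerate(values))  # position -> value still considered valid
    iterations = []
    for _ in range(max_iterations):
        if len(remaining) < 2:
            break
        mu = mean(remaining.values())
        sd = pstdev(remaining.values())
        if sd == 0:
            break  # all remaining values are identical, nothing to reject
        rejected = [i for i, v in remaining.items() if abs(v - mu) / sd > threshold]
        if not rejected:
            break  # no more outliers: stop repeating
        iterations.append(rejected)
        for i in rejected:
            del remaining[i]  # rejected values are excluded from the next run
    return iterations


data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 100, 30]
print(repeated_zscore_flags(data, threshold=2.0))  # [[9], [10]]
```

In this example the value 30 is only detected in the second iteration, after the much larger value 100 has been removed and no longer inflates the standard deviation. This is the kind of case a repeated test is meant to catch.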

### New features

- Added new class `QuickFluxProcessingChain`, which quickly executes a simplified version of the flux
processing chain. This quick version runs with many default values and therefore needs only some basic settings
as user input. (`diive.pkgs.fluxprocessingchain.fluxprocessingchain.QuickFluxProcessingChain`)
- Added new repeater function for outlier detection: `repeater` is a wrapper that executes an outlier detection
method multiple times, where each iteration gets its own outlier flag. As an example: the simple z-score test is run
a first time and then repeated until no more outliers are found. Each iteration outputs a flag. This is now used in
`StepwiseOutlierDetection` and thus in the flux processing chain Level-3.2 (outlier detection) and in the meteoscreening
in `StepwiseMeteoScreeningDb` (not yet checked in this update). To repeat an outlier method, use the `repeat` keyword
arg (see the `FluxProcessingChain` notebook for examples). (`diive.pkgs.outlierdetection.repeater.repeater`)
- Added new function `filter_strings_by_elements`: Returns a list of strings from list1 that contain all of the elements
in list2. (`core.funcs.funcs.filter_strings_by_elements`)
- Added new function `flag_steadiness_horizontal_wind_eddypro_test`: Creates a flag for the steadiness of horizontal wind u
from the sonic anemometer. Makes direct use of the EddyPro output files and converts the flag to a standardized 0/1
flag. (`pkgs.qaqc.eddyproflags.flag_steadiness_horizontal_wind_eddypro_test`)

### Changes

- Added automatic calculation of daytime and nighttime flags whenever the flux processing chain is started
(`diive.pkgs.fluxprocessingchain.fluxprocessingchain.FluxProcessingChain._add_swinpot_dt_nt_flag`)

### Removed features

- Removed class `ThymeBoostOutlier` for outlier detection. It was not possible to get it to work properly at this time.

### Changes

- It appears that the kwarg `fmt` is used slightly differently for `plot_date` and `plot` in `matplotlib`. It seems it
is always defined for `plot_date`, while it is optional for `plot`. Now using the `fmt` kwarg to avoid the warning:
*UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "o" (-> marker='o').
The keyword argument will take precedence.* Therefore using `fmt="X"` instead of `marker="X"`. See also
[this answer](https://stackoverflow.com/questions/69188540/userwarning-marker-is-redundantly-defined-by-the-marker-keyword-argument-when).

### Environment

- Removed `thymeboost`

## v0.66.0 | 2 Nov 2023

### New features
Expand Down
1 change: 0 additions & 1 deletion README.md
Expand Up @@ -92,7 +92,6 @@ Fill gaps in time series with various methods
- Missing values: Simply creates a flag that indicates available and missing data in a time series
- Seasonal trend decomposition using LOESS, identify outliers based on seasonal-trend decomposition and
z-score calculations
- Thymeboost: Identify outliers based on [thymeboost](https://github.com/tblume1992/ThymeBoost)
- z-score: Identify outliers based on the z-score across all time series data
- z-score: Identify outliers based on the z-score, separately for daytime and nighttime
- z-score: Identify outliers based on max z-scores in the interquartile range data
Expand Down
22 changes: 22 additions & 0 deletions diive/configs/filetypes/FLUXNET-CH4-HH-CSV-30MIN.yml
@@ -0,0 +1,22 @@
GENERAL:
NAME: "FLUXNET-CH4-HH-CSV-30MIN"
DESCRIPTION: "The Data Product for the FLUXNET-CH4 Release."
TAGS: [ "FLUXNET" ]

FILE:
EXTENSION: "*.csv"
COMPRESSION: "None"

TIMESTAMP:
DESCRIPTION: "1 column with full timestamp with seconds"
INDEX_COLUMN: [ 'TIMESTAMP_END' ]
DATETIME_FORMAT: "%Y%m%d%H%M"
SHOWS_START_MIDDLE_OR_END_OF_RECORD: "end"

DATA:
HEADER_SECTION_ROWS: [ 0, 1, 2 ]
SKIP_ROWS: [ 0, 1 ]
HEADER_ROWS: [ 0 ]
NA_VALUES: [ -9999 ]
FREQUENCY: "30T"
DELIMITER: ","
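
To make the timestamp settings above more concrete, here is a minimal sketch of how one record could be interpreted under this configuration. It does not show how diive reads the file: the helper `parse_record` and the sample values are hypothetical, and only the format string, the 30-minute record length, the end-of-record convention and the `-9999` missing value are taken from the config.

```python
from datetime import datetime, timedelta

DATETIME_FORMAT = "%Y%m%d%H%M"         # from DATETIME_FORMAT
RECORD_LENGTH = timedelta(minutes=30)  # from FREQUENCY "30T"
NA_VALUE = -9999.0                     # from NA_VALUES


def parse_record(timestamp_end: str, value: str):
    """Hypothetical helper: return (start, end, value) for one half-hourly record."""
    end = datetime.strptime(timestamp_end, DATETIME_FORMAT)
    start = end - RECORD_LENGTH  # the timestamp marks the END of the record
    number = float(value)
    return start, end, (None if number == NA_VALUE else number)


start, end, flux = parse_record("201401010030", "-9999")
print(start, end, flux)  # 2014-01-01 00:00:00 2014-01-01 00:30:00 None
```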
22 changes: 22 additions & 0 deletions diive/configs/filetypes/ICOS_H2R_CSVZIP_1MIN.yml
@@ -0,0 +1,22 @@
GENERAL:
NAME: "ICOS_H2R_CSVZIP_1MIN"
DESCRIPTION: "Compressed (zipped) ICOS format with 2-row header (variable names and units) and ISO timestamp."
TAGS: [ "ICOS" ]

FILE:
EXTENSION: "*.csv"
COMPRESSION: "zip"

TIMESTAMP:
DESCRIPTION: "1 column with full ISO timestamp with seconds"
INDEX_COLUMN: [ 0 ]
DATETIME_FORMAT: "%Y%m%d%H%M%S"
SHOWS_START_MIDDLE_OR_END_OF_RECORD: "end"

DATA:
HEADER_SECTION_ROWS: [ 0, 1 ]
SKIP_ROWS: [ ]
HEADER_ROWS: [ 0, 1 ]
NA_VALUES: [ -9999 ]
FREQUENCY: "1MIN"
DELIMITER: ","
31 changes: 19 additions & 12 deletions diive/core/base/flagbase.py
Expand Up @@ -8,21 +8,26 @@
import numpy as np
import pandas as pd
from pandas import Series, DatetimeIndex
from diive.core.plotting.plotfuncs import default_format, default_legend

import diive.core.plotting.styles.LightTheme as theme
from diive.core.funcs.funcs import validate_id_string
from diive.core.plotting.plotfuncs import default_format, default_legend

class FlagBase():

def __init__(self, series: Series, flagid: str, levelid: str = None):
class FlagBase:

def __init__(self, series: Series, flagid: str, idstr: str = None, verbose: bool = True):
self.series = series
self._flagid = flagid
self._levelid = levelid
self._flagname = self._generate_flagname()
self._idstr = validate_id_string(idstr=idstr)
self.verbose = verbose

self.flagname = self._generate_flagname()

self._filteredseries = None
self._flag = None

print(f"Generating flag {self._flagname} for variable {self.series.name} ...")
print(f"Generating flag {self.flagname} for variable {self.series.name} ...")

@property
def flag(self) -> Series:
Expand Down Expand Up @@ -62,18 +67,20 @@ def setfiltered(self, rejected: DatetimeIndex):
def reset(self):
self._filteredseries = self.series.copy()
# Generate flag series with NaNs
self._flag = pd.Series(index=self.series.index, data=np.nan, name=self._flagname)
self._flag = pd.Series(index=self.series.index, data=np.nan, name=self.flagname)

def _generate_flagname(self) -> str:
"""Generate standardized name for flag variable"""
flagname = "FLAG"
if self._levelid: flagname += f"_L{self._levelid}"
if self._idstr:
flagname += f"{self._idstr}"
flagname += f"_{self.series.name}"
if self._flagid: flagname += f"_{self._flagid}"
if self._flagid:
flagname += f"_{self._flagid}"
flagname += f"_TEST"
return flagname

def plot(self, ok:DatetimeIndex, rejected:DatetimeIndex, plottitle:str=""):
def plot(self, ok: DatetimeIndex, rejected: DatetimeIndex, plottitle: str = ""):
"""Basic plot that shows time series with and without outliers"""
fig = plt.figure(facecolor='white', figsize=(16, 7))
gs = gridspec.GridSpec(2, 1) # rows, cols
Expand All @@ -83,8 +90,8 @@ def plot(self, ok:DatetimeIndex, rejected:DatetimeIndex, plottitle:str=""):
ax_series.plot_date(self.series.index, self.series, label=f"{self.series.name}", color="#42A5F5",
alpha=.5, markersize=2, markeredgecolor='none')
ax_series.plot_date(self.series[rejected].index, self.series[rejected],
label="outlier (rejected)", color="#F44336", marker="X", alpha=1,
markersize=8, markeredgecolor='none')
label="outlier (rejected)", color="#F44336", alpha=1,
markersize=8, markeredgecolor='none', fmt='X')
ax_ok.plot_date(self.series[ok].index, self.series[ok], label=f"OK", color="#9CCC65", alpha=.5,
markersize=2, markeredgecolor='none')
default_format(ax=ax_series)
Expand Down
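
The new flag naming scheme in `FlagBase` can be traced with a small standalone sketch. It combines `validate_id_string` and the logic of `_generate_flagname` from this commit into a free function. The function name `generate_flagname` and the example variable, flag ID and ID string are assumptions for illustration.

```python
def validate_id_string(idstr):
    """Make sure a non-empty ID string starts with an underscore."""
    if idstr:
        idstr = idstr if idstr.startswith('_') else f'_{idstr}'
    return idstr


def generate_flagname(seriesname, flagid, idstr=None):
    """Standalone version of the naming logic: FLAG[_ID]_<series>[_<flagid>]_TEST."""
    idstr = validate_id_string(idstr)
    flagname = "FLAG"
    if idstr:
        flagname += idstr
    flagname += f"_{seriesname}"
    if flagid:
        flagname += f"_{flagid}"
    flagname += "_TEST"
    return flagname


# Hypothetical inputs, for illustration only
print(generate_flagname("FC", "OUTLIER_ZSCORE", idstr="L3.2"))  # FLAG_L3.2_FC_OUTLIER_ZSCORE_TEST
print(generate_flagname("FC", "OUTLIER_ZSCORE"))                # FLAG_FC_OUTLIER_ZSCORE_TEST
```

Compared to the previous `levelid` argument, the `L` prefix is no longer added automatically, so the caller decides what the ID string looks like.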
2 changes: 1 addition & 1 deletion diive/core/dfun/frames.py
Expand Up @@ -741,7 +741,7 @@ def add_continuous_record_number(df: DataFrame) -> DataFrame:
newcol = '.RECORDNUMBER'
data = range(1, len(df) + 1)
df[newcol] = data
print(f"Added new column {newcol} with record numbers from {df[newcol][0]} to {df[newcol][-1]}.")
print(f"Added new column {newcol} with record numbers from {df[newcol].iloc[0]} to {df[newcol].iloc[-1]}.")
return df


Expand Down
32 changes: 32 additions & 0 deletions diive/core/funcs/funcs.py
Expand Up @@ -2,6 +2,38 @@
from pandas import Series


def validate_id_string(idstr: str):
if idstr:
# idstr = idstr if idstr.endswith('_') else f'{idstr}_'
idstr = idstr if idstr.startswith('_') else f'_{idstr}'
return idstr


def filter_strings_by_elements(list1: list[str], list2: list[str]) -> list[str]:
"""Returns a list of strings from list1 that contain all of the elements in list2.
The function uses a set to keep track of the elements in list2, which makes it more
efficient than iterating over the list twice with this one-liner:
result = [s1 for s1 in list1 if all(s2 in str(s1) for s2 in list2)]
Args:
list1: A list of strings.
list2: A list of elements to check for in each string in list1.
Returns:
A list of strings from list1 that contain all of the elements in list2.
"""
if not list1 or not list2:
return []

elements_in_other_list = set(list2)
result = []
for s1 in list1:
if all(s2 in str(s1) for s2 in elements_in_other_list):
result.append(s1)
return result


def zscore(series: Series) -> Series:
"""Calculate the z-score of each record in *series*"""
mean, std = np.mean(series), np.std(series)
Expand Down
3 changes: 2 additions & 1 deletion diive/core/io/filereader.py
Expand Up @@ -25,7 +25,8 @@ def search_files(searchdirs: str or list, pattern: str) -> list:
""" Search files and store their filename and the path to the file in dictionary. """
# found_files_dict = {}
foundfiles = []
if isinstance(searchdirs, str): searchdirs = [searchdirs] # Use str as list
if isinstance(searchdirs, str):
searchdirs = [searchdirs] # Use str as list
for searchdir in searchdirs:
for root, dirs, files in os.walk(searchdir):
for idx, settings_file_name in enumerate(files):
Expand Down
77 changes: 57 additions & 20 deletions diive/core/plotting/scatter.py
@@ -1,8 +1,11 @@
from typing import Literal

import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series

import diive.core.plotting.plotfuncs as pf
from diive.core.dfun.stats import q25, q75


class ScatterXY:
Expand All @@ -15,8 +18,9 @@ def __init__(
title: str = None,
ax: plt.Axes = None,
nbins: int = 0,
binagg: Literal['mean', 'median'] = 'median',
xlim: list = None,
ylim: list = None
ylim: list or Literal['auto'] = None
):
"""
Expand All @@ -29,9 +33,12 @@ def __init__(
self.yunits = yunits
self.ax = ax
self.nbins = nbins
self.binagg = binagg
self.xlim = xlim
self.ylim = ylim

self.binagg = None if self.nbins == 0 else self.binagg

self.xy_df = pd.concat([x, y], axis=1)
self.xy_df = self.xy_df.dropna()

Expand All @@ -46,7 +53,7 @@ def _databinning(self):
group, bins = pd.qcut(self.xy_df[self.xname], q=self.nbins, retbins=True, duplicates='drop')
groupcol = f'GROUP_{self.xname}'
self.xy_df[groupcol] = group
self.xy_df_binned = self.xy_df.groupby(groupcol).agg({'mean', 'std', 'count'})
self.xy_df_binned = self.xy_df.groupby(groupcol).agg({'mean', 'median', 'std', 'count', q25, q75})

def plot(self):
"""Generate plot"""
Expand All @@ -73,22 +80,31 @@ def _plot(self, nbins: int = 10):
label=label)

if self.nbins > 0:

_min = self.xy_df_binned[self.yname]['count'].min()
_max = self.xy_df_binned[self.yname]['count'].max()
self.ax.scatter(x=self.xy_df_binned[self.xname]['mean'],
y=self.xy_df_binned[self.yname]['mean'],
c='none',
s=80,
marker='o',
edgecolors='r',
lw=2,
label=f"binned data, mean±SD "
f"({_min}-{_max} values per bin)")
self.ax.errorbar(x=self.xy_df_binned[self.xname]['mean'],
y=self.xy_df_binned[self.yname]['mean'],
xerr=self.xy_df_binned[self.xname]['std'],
yerr=self.xy_df_binned[self.yname]['std'],
elinewidth=3, ecolor='red', alpha=.6, lw=0)
self.ax.plot(self.xy_df_binned[self.xname][self.binagg],
self.xy_df_binned[self.yname][self.binagg],
c='r', ms=10, marker='o', lw=2,
# c='none', ms=80, marker='o', edgecolors='r', lw=2,
label=f"binned data ({self.binagg}, {_min}-{_max} values per bin)")



if self.binagg == 'median':
self.ax.fill_between(self.xy_df_binned[self.xname][self.binagg],
self.xy_df_binned[self.yname]['q25'],
self.xy_df_binned[self.yname]['q75'],
alpha=.2, zorder=10, color='red',
label="interquartile range")

if self.binagg == 'mean':
self.ax.errorbar(x=self.xy_df_binned[self.xname][self.binagg],
y=self.xy_df_binned[self.yname][self.binagg],
xerr=self.xy_df_binned[self.xname]['std'],
yerr=self.xy_df_binned[self.yname]['std'],
elinewidth=3, ecolor='red', alpha=.6, lw=0,
label="standard deviation")

self._apply_format()
self.ax.locator_params(axis='x', nbins=nbins)
Expand All @@ -99,12 +115,31 @@ def _apply_format(self):
if self.xlim:
xmin = self.xlim[0]
xmax = self.xlim[1]
self.ax.set_xlim(xmin, xmax)

if self.ylim:
else:
xmin = self.xy_df[self.xname].quantile(0.01)
xmax = self.xy_df[self.xname].quantile(0.99)
self.ax.set_xlim(xmin, xmax)

if self.ylim == 'auto':
if self.binagg == 'median':
ymin = self.xy_df_binned[self.yname]['q25'].min()
ymax = self.xy_df_binned[self.yname]['q75'].max()
elif self.binagg == 'mean':
_lowery = self.xy_df_binned[self.yname]['mean'].sub(self.xy_df_binned[self.yname]['std'])
_uppery = self.xy_df_binned[self.yname]['mean'].add(self.xy_df_binned[self.yname]['std'])
ymin = _lowery.min()
ymax = _uppery.max()
else:
ymin = self.xy_df[self.yname].quantile(0.01)
ymax = self.xy_df[self.yname].quantile(0.99)
elif isinstance(self.ylim, list):
ymin = self.ylim[0]
ymax = self.ylim[1]
self.ax.set_ylim(ymin, ymax)
else:
ymin = self.xy_df[self.yname].min()
ymax = self.xy_df[self.yname].max()

self.ax.set_ylim(ymin, ymax)

pf.add_zeroline_y(ax=self.ax, data=self.xy_df[self.yname])

Expand All @@ -118,6 +153,8 @@ def _apply_format(self):
labelspacing=0.2,
ncol=1)

self.ax.set_title(self.title, size=20)

# pf.nice_date_ticks(ax=self.ax, minticks=3, maxticks=20, which='x', locator='auto')

# if self.showplot:
Expand Down
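
The binned median with interquartile range that `ScatterXY` now plots for `binagg='median'` can be sketched without pandas. This is a simplified stand-in, not the code from this commit: it uses equal-count bins on the sorted data instead of `pd.qcut`, and the function name `bin_median_iqr` and the dictionary keys are hypothetical.

```python
from statistics import median, quantiles


def bin_median_iqr(x, y, nbins):
    """Split (x, y) pairs into equal-count bins along x and aggregate each bin."""
    pairs = sorted(zip(x, y))
    size, rest = divmod(len(pairs), nbins)
    bins, start = [], 0
    for b in range(nbins):
        stop = start + size + (1 if b < rest else 0)
        chunk = pairs[start:stop]
        start = stop
        if len(chunk) < 2:
            continue  # quartiles need at least two values
        ys = [p[1] for p in chunk]
        q25, _, q75 = quantiles(ys, n=4, method="inclusive")
        bins.append({
            "x_median": median(p[0] for p in chunk),
            "y_median": median(ys),
            "y_q25": q25,  # lower edge of the shaded band
            "y_q75": q75,  # upper edge of the shaded band
            "count": len(chunk),
        })
    return bins


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 16]
for row in bin_median_iqr(x, y, nbins=2):
    print(row)
```

With `ylim='auto'` and `binagg='median'`, the y-axis limits in the commit are taken from the minimum of the 25th percentiles and the maximum of the 75th percentiles across bins, which corresponds to the smallest `y_q25` and largest `y_q75` in this sketch.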