Merge pull request #215 from holukas/indev
Indev
holukas authored Sep 18, 2024
2 parents 0a9f1d7 + a51f90c commit e6fc944
Showing 21 changed files with 4,201 additions and 3,336 deletions.
71 changes: 71 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,77 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.82.0 | 19 Sep 2024

## Long-term gap-filling

It is now possible to gap-fill multi-year datasets using the class `LongTermGapFillingRandomForestTS`. In this approach,
data from neighboring years are pooled together before training the random forest model for gap-filling a specific year.
This is especially useful for long-term, multi-year datasets where environmental conditions and drivers might change
over years and decades.

Why random forest? Because it has performed well and, at least at the moment, it looks to me like the first choice for gap-filling ecosystem fluxes.

Long-term gap-filling using random forest is now also built into the flux processing chain (Level-4.1). This makes it possible to quickly gap-fill the different USTAR scenarios and to create some (I hope) useful plots. [See the flux processing chain notebook for an example of what this looks like](https://github.com/holukas/diive/blob/main/notebooks/FluxProcessingChain/FluxProcessingChain.ipynb).

In a future update it will be possible to either directly switch to `XGBoost` for gap-filling, or to use it (and other
machine-learning models) in combination with random forest in the flux processing chain.

### Example

Here is an example of a dataset containing CO2 flux (`NEE`) measurements from 2005 to 2023 (a short sketch of the year-selection logic follows the list):

- for gap-filling the year 2005, the model is trained on data from 2005, 2006 and 2007 (*2005 has no previous year*)
- for gap-filling the year 2006, the model is trained on data from 2005, 2006 and 2007 (same model as for 2005)
- for gap-filling the year 2007, the model is trained on data from 2006, 2007 and 2008
- ...
- for gap-filling the year 2012, the model is trained on data from 2011, 2012 and 2013
- for gap-filling the year 2013, the model is trained on data from 2012, 2013 and 2014
- for gap-filling the year 2014, the model is trained on data from 2013, 2014 and 2015
- ...
- for gap-filling the year 2021, the model is trained on data from 2020, 2021 and 2022
- for gap-filling the year 2022, the model is trained on data from 2021, 2022 and 2023 (same model as for 2023)
- for gap-filling the year 2023, the model is trained on data from 2021, 2022 and 2023 (*2023 has no next year*)
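
The year selection above amounts to a three-year window around each target year, clamped at the edges of the dataset. Here is a minimal, library-independent sketch of that logic (the function name and structure are my own, not part of `diive`; it assumes the dataset spans at least three years):

```python
def training_years(year: int, first_year: int, last_year: int) -> list[int]:
    """Return the pool of years used to train the model that gap-fills *year*.

    Edge years have no previous/next year, so the window is shifted inward,
    which is why e.g. 2005 and 2006 (or 2022 and 2023) share the same model.
    """
    start = max(first_year, min(year - 1, last_year - 2))
    end = min(last_year, max(year + 1, first_year + 2))
    return list(range(start, end + 1))

# Reproduces the 2005-2023 example above:
assert training_years(2005, 2005, 2023) == [2005, 2006, 2007]
assert training_years(2006, 2005, 2023) == [2005, 2006, 2007]
assert training_years(2012, 2005, 2023) == [2011, 2012, 2013]
assert training_years(2023, 2005, 2023) == [2021, 2022, 2023]
```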

### New features

- Added new method for long-term (multiple years) gap-filling using random forest to the flux processing chain (`diive.pkgs.fluxprocessingchain.fluxprocessingchain.FluxProcessingChain.level41_gapfilling_longterm`)
- Added new class for long-term (multiple years) gap-filling using random forest (`diive.pkgs.gapfilling.longterm.LongTermGapFillingRandomForestTS`)
- Added class for plotting cumulative sums across all data, for multiple columns (`diive.core.plotting.cumulative.Cumulative`)
- Added class to detect a constant offset between two measurements (`diive.pkgs.corrections.measurementoffset.MeasurementOffset`)

### Changes

- Creating lagged variants creates gaps, which then lead to incomplete features in machine learning models. Gaps are now filled using simple forward and backward filling, limited to the number of values defined by the *lag*. For example, if variable TA is lagged by -2 values, this creates two missing values for this variant at the start of the time series, which are then gap-filled using a simple backwards fill with `limit=2` (`diive.core.dfun.frames.lagged_variants`); see the short pandas sketch below.
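
A minimal pandas illustration of that gap-filling mechanism (the variable, values and lag are made up for demonstration; the actual implementation is in `diive.core.dfun.frames.lagged_variants`, shown further down in this diff):

```python
import pandas as pd

ta = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="TA")

# Shifting creates NaNs at one end of the lagged variant
lagged = ta.shift(2)                 # [NaN, NaN, 10.0, 11.0, 12.0]
n_new_gaps = int(lagged.isnull().sum()) - int(ta.isnull().sum())

# Fill only the gaps introduced by the shift, using the nearest available value
lagged = lagged.bfill(limit=n_new_gaps).ffill(limit=n_new_gaps)
print(lagged.tolist())               # [10.0, 10.0, 10.0, 11.0, 12.0]
```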

### Notebooks

- Updated flux processing chain notebook to include long-term gap-filling using random forest (`notebooks/FluxProcessingChain/FluxProcessingChain.ipynb`)
- Added new notebook for plotting cumulative sums across all data, for multiple columns (`notebooks/Plotting/Cumulative.ipynb`)

### Tests

- Unittest for the flux processing chain now includes many more methods (`tests.test_fluxprocessingchain.TestFluxProcessingChain.test_fluxprocessingchain`)
- 39/39 unittests ran successfully

### Bugfixes

- Fixed a deprecation warning in `diive.core.ml.common.prediction_scores_regr`
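
For reference, a short sketch of the change behind this fix (also visible in the `diive/core/ml/common.py` diff below): recent scikit-learn versions deprecate `mean_squared_error(..., squared=False)` in favor of a dedicated function.

```python
from sklearn.metrics import root_mean_squared_error

targets = [1.0, 2.0, 3.0]
predictions = [1.1, 1.9, 3.2]

# Previously: mean_squared_error(targets, predictions, squared=False),
# which emits a deprecation warning on recent scikit-learn versions.
rmse = root_mean_squared_error(targets, predictions)
print(round(rmse, 4))  # ~0.1414
```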

## v0.81.0 | 11 Sep 2024

## Expanding Flux Processing Capabilities
9 changes: 6 additions & 3 deletions README.md
@@ -1,8 +1,8 @@
![](images/logo_diive1_256px.png)

![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)
![PyPI - Version](https://img.shields.io/pypi/v/diive?style=for-the-badge&color=%23EF6C00&link=https%3A%2F%2Fpypi.org%2Fproject%2Fdiive%2F)
![GitHub License](https://img.shields.io/github/license/holukas/diive?style=for-the-badge&color=%237CB342)
[![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://www.python.org/)
[![PyPI - Version](https://img.shields.io/pypi/v/diive?style=for-the-badge&color=%23EF6C00&link=https%3A%2F%2Fpypi.org%2Fproject%2Fdiive%2F)](https://pypi.org/project/diive/)
[![GitHub License](https://img.shields.io/github/license/holukas/diive?style=for-the-badge&color=%237CB342)](https://github.com/holukas/diive/blob/indev/LICENSE)

[![DOI](https://zenodo.org/badge/708559210.svg)](https://zenodo.org/doi/10.5281/zenodo.10884017)

@@ -84,6 +84,8 @@ _For info about the Swiss FluxNet flux levels, see [here](https://www.swissfluxn
- Level-2 quality flags
- Level-3.1 storage correction
- Level-3.2 outlier removal
- Level-3.3: USTAR filtering using constant thresholds
- Level-4.1: gap-filling using long-term random forest
- Quick flux processing chain ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/FluxProcessingChain/QuickFluxProcessingChain.ipynb))

### Formats
@@ -129,6 +131,7 @@ _Create single outlier flags where `0=OK` and `2=outlier`._

### Plotting

- **Cumulatives across all years for multiple variables** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/Cumulative.ipynb))
- **Cumulatives per year** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/CumulativesPerYear.ipynb))
- **Diel cycle per month** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/DielCycle.ipynb))
- **Heatmap date/time**: showing values (z) of time series as date (y) vs time (x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapDateTime.ipynb))
21 changes: 17 additions & 4 deletions diive/core/dfun/frames.py
@@ -88,11 +88,14 @@ def trim_frame(df: DataFrame, var: str) -> DataFrame:


def detect_new_columns(df: DataFrame, other: DataFrame) -> list:
"""Detect columns in *df* that do not exist in other."""
"""Detect columns in *df* that do not exist in *other*."""
duplicatecols = [c for c in df.columns if c in other.columns]
for col in df[duplicatecols]:
if not df[col].equals(other[col]):
raise Exception(f"Column {col} was identified as duplicate, but is not identical.")
raise Exception(
f"Column {col} was identified as duplicate, but is not identical. "
f"This error can occur e.g. when features used in machine learning models "
f"have gaps.")

newcols = [c for c in df.columns if c not in other.columns]

@@ -802,7 +805,7 @@ def lagged_variants(df: DataFrame,
lag: list[int, int],
stepsize: int = 1,
exclude_cols: list = None,
verbose: bool = True) -> DataFrame:
verbose: int = 0) -> DataFrame:
"""
Create lagged variants of variables
@@ -881,12 +884,22 @@ def lagged_variants(df: DataFrame,
else:
# skip if lagstep = 0
continue

# Shifting data creates NaNs in time series, which can cause issues
# with some machine learning algorithms
# To handle this, the new gap(s) is/are filled with the nearest value
n_missing_vals_before = int(df[col].isnull().sum())
df[stepname] = df[col].shift(_shift)
n_missing_vals_after = int(df[stepname].isnull().sum())
if n_missing_vals_before == 0 and n_missing_vals_after > 0:
df[stepname] = df[stepname].bfill(limit=n_missing_vals_after)
df[stepname] = df[stepname].ffill(limit=n_missing_vals_after)
_included.append(col)

if verbose:
print(f"++ Added new columns with lagged variants for: {_included} (lags between {lag[0]} and {lag[1]} "
f"with stepsize {stepsize}), no lagged variants for: {_excluded}.")
f"with stepsize {stepsize}), no lagged variants for: {_excluded}. "
f"Shifting the time series created gaps which were then filled with the nearest value.")
return df


26 changes: 17 additions & 9 deletions diive/core/ml/common.py
@@ -1,16 +1,14 @@
"""
kudos: https://datascience.stackexchange.com/questions/15135/train-test-validation-set-splitting-in-sklearn
"""
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import PredictionErrorDisplay, max_error, median_absolute_error, mean_absolute_error, \
mean_absolute_percentage_error, r2_score, mean_squared_error
mean_absolute_percentage_error, r2_score, mean_squared_error, root_mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from yellowbrick.regressor import PredictionError, ResidualsPlot
@@ -31,10 +29,10 @@ def __init__(
regressor,
input_df: DataFrame,
target_col: str or tuple,
verbose: bool = True,
verbose: int = 0,
perm_n_repeats: int = 10,
test_size: float = 0.20,
features_lag: list = None,
features_lag: list[int, int] = None,
features_lag_exclude_cols: list = None,
include_timestamp_as_features: bool = False,
add_continuous_record_number: bool = False,
@@ -129,6 +127,17 @@ def __init__(

self.original_input_features = self.model_df.drop(columns=self.target_col).columns.tolist()

# Check if features complete
n_vals_index = len(self.model_df.index)
fstats = self.model_df[self.original_input_features].describe()
not_complete = fstats.loc['count'] < n_vals_index
not_complete = not_complete[not_complete].index.tolist()
if len(not_complete) > 0:
print(f"(!)Some features are incomplete and have less than {n_vals_index} values:")
for nc in not_complete:
print(f" --> {nc} ({fstats[nc]['count'].astype(int)} values)")
print(f"This means that not all target values can be predicted based on the full model.")

# Create training (80%) and testing dataset (20%)
# Sort index to keep original order
_temp_df = self.model_df.copy().dropna()
@@ -245,7 +254,7 @@ def _fi_across_splits(feature_importances_splits) -> DataFrame:
fi_df = fi_df.mean(axis=1)
return fi_df

def _remove_rejected_features(self, factor: float =1) -> list:
def _remove_rejected_features(self, factor: float = 1) -> list:
"""Remove features below importance threshold from model dataframe.
The updated model dataframe will then be used for the next (final) model.
"""
@@ -428,7 +437,7 @@ def fillgaps(self,
self._fillgaps_fallback()
self._fillgaps_combinepredictions()

def reduce_features(self, factor:float=1):
def reduce_features(self, factor: float = 1):
"""Reduce number of features using permutation importance
A random variable is added to features and the permutation importances
@@ -460,7 +469,6 @@ def reduce_features(self, factor:float=1):
# todo from dtreeviz.trees import dtreeviz # will be used for tree visualization
# _ = tree.plot_tree(rf.estimators_[0], feature_names=X.columns, filled=True)


# Calculate permutation importance for all data
print(f">>> Calculating feature importances (permutation importance, {self.perm_n_repeats} repeats) ...")
X_names = df.drop(self.target_col, axis=1).columns.tolist()
@@ -1110,7 +1118,7 @@ def prediction_scores_regr(predictions: np.array,
'mae': mean_absolute_error(targets, predictions),
'medae': median_absolute_error(targets, predictions),
'mse': mean_squared_error(targets, predictions),
'rmse': mean_squared_error(targets, predictions, squared=False), # root mean squared error
'rmse': root_mean_squared_error(targets, predictions),
'mape': mean_absolute_percentage_error(targets, predictions),
'maxe': max_error(targets, predictions),
'r2': r2_score(targets, predictions)
