Merge pull request #215 from holukas/indev
Indev
holukas authored Sep 18, 2024
2 parents 0a9f1d7 + a51f90c commit e6fc944
Showing 21 changed files with 4,201 additions and 3,336 deletions.
71 changes: 71 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,77 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.82.0 | 19 Sep 2024

## Long-term gap-filling

It is now possible to gap-fill multi-year datasets using the class `LongTermGapFillingRandomForestTS`. In this approach,
data from neighboring years are pooled together before training the random forest model for gap-filling a specific year.
This is especially useful for long-term, multi-year datasets where environmental conditions and drivers might change
over years and decades.

Why random forest? Because it has performed well and, at least at the moment, it looks to me like the first choice for gap-filling ecosystem fluxes.

Long-term gap-filling using random forest is now also built into the flux processing chain (Level-4.1). This makes it possible to quickly gap-fill the different USTAR scenarios and to create some (I hope) useful plots. [See the flux processing chain notebook for an example of what this looks like](https://github.com/holukas/diive/blob/main/notebooks/FluxProcessingChain/FluxProcessingChain.ipynb).

In a future update it will be possible to either directly switch to `XGBoost` for gap-filling, or to use it (and other
machine-learning models) in combination with random forest in the flux processing chain.

### Example

Here is an example of a dataset containing CO2 flux (`NEE`) measurements from 2005 to 2023 (a short sketch of the year-selection logic follows the list):

- for gap-filling the year 2005, the model is trained on data from 2005, 2006 and 2007 (*2005 has no previous year*)
- for gap-filling the year 2006, the model is trained on data from 2005, 2006 and 2007 (same model as for 2005)
- for gap-filling the year 2007, the model is trained on data from 2006, 2007 and 2008
- ...
- for gap-filling the year 2012, the model is trained on data from 2011, 2012 and 2013
- for gap-filling the year 2013, the model is trained on data from 2012, 2013 and 2014
- for gap-filling the year 2014, the model is trained on data from 2013, 2014 and 2015
- ...
- for gap-filling the year 2021, the model is trained on data from 2020, 2021 and 2022
- for gap-filling the year 2022, the model is trained on data from 2021, 2022 and 2023 (same model as for 2023)
- for gap-filling the year 2023, the model is trained on data from 2021, 2022 and 2023 (*2023 has no next year*)
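
The year selection above amounts to a three-year window around each target year, clamped at the edges of the dataset. Here is a minimal, library-independent sketch of that logic (the function name and structure are my own, not part of `diive`; it assumes the dataset spans at least three years):

```python
def training_years(year: int, first_year: int, last_year: int) -> list[int]:
    """Return the pool of years used to train the model that gap-fills *year*.

    Edge years have no previous/next year, so the window is shifted inward,
    which is why e.g. 2005 and 2006 (or 2022 and 2023) share the same model.
    """
    start = max(first_year, min(year - 1, last_year - 2))
    end = min(last_year, max(year + 1, first_year + 2))
    return list(range(start, end + 1))

# Reproduces the 2005-2023 example above:
assert training_years(2005, 2005, 2023) == [2005, 2006, 2007]
assert training_years(2006, 2005, 2023) == [2005, 2006, 2007]
assert training_years(2012, 2005, 2023) == [2011, 2012, 2013]
assert training_years(2023, 2005, 2023) == [2021, 2022, 2023]
```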

### New features

- Added new method for long-term (multiple years) gap-filling using random forest to the flux processing chain (`diive.pkgs.fluxprocessingchain.fluxprocessingchain.FluxProcessingChain.level41_gapfilling_longterm`)
- Added new class for long-term (multiple years) gap-filling using random forest (`diive.pkgs.gapfilling.longterm.LongTermGapFillingRandomForestTS`)
- Added class for plotting cumulative sums across all data, for multiple columns (`diive.core.plotting.cumulative.Cumulative`)
- Added class to detect a constant offset between two measurements (`diive.pkgs.corrections.measurementoffset.MeasurementOffset`)

### Changes

- Creating lagged variants creates gaps, which then lead to incomplete features in machine learning models. Gaps are now filled using simple forward and backward filling, limited to the number of values defined by the *lag*. For example, if variable TA is lagged by -2 values, this creates two missing values for this variant at the start of the time series, which are then gap-filled using a simple backwards fill with `limit=2` (`diive.core.dfun.frames.lagged_variants`); see the short pandas sketch below.
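
A minimal pandas illustration of that gap-filling mechanism (the variable, values and lag are made up for demonstration; the actual implementation is in `diive.core.dfun.frames.lagged_variants`, shown further down in this diff):

```python
import pandas as pd

ta = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="TA")

# Shifting creates NaNs at one end of the lagged variant
lagged = ta.shift(2)                 # [NaN, NaN, 10.0, 11.0, 12.0]
n_new_gaps = int(lagged.isnull().sum()) - int(ta.isnull().sum())

# Fill only the gaps introduced by the shift, using the nearest available value
lagged = lagged.bfill(limit=n_new_gaps).ffill(limit=n_new_gaps)
print(lagged.tolist())               # [10.0, 10.0, 10.0, 11.0, 12.0]
```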

### Notebooks

- Updated flux processing chain notebook to include long-term gap-filling using random forest (`notebooks/FluxProcessingChain/FluxProcessingChain.ipynb`)
- Added new notebook for plotting cumulative sums across all data, for multiple columns (`notebooks/Plotting/Cumulative.ipynb`)

### Tests

- Unittest for the flux processing chain now includes many more methods (`tests.test_fluxprocessingchain.TestFluxProcessingChain.test_fluxprocessingchain`)
- 39/39 unittests ran successfully

### Bugfixes

- Fixed a deprecation warning in `diive.core.ml.common.prediction_scores_regr`
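
For reference, a short sketch of the change behind this fix (also visible in the `diive/core/ml/common.py` diff below): recent scikit-learn versions deprecate `mean_squared_error(..., squared=False)` in favor of a dedicated function.

```python
from sklearn.metrics import root_mean_squared_error

targets = [1.0, 2.0, 3.0]
predictions = [1.1, 1.9, 3.2]

# Previously: mean_squared_error(targets, predictions, squared=False),
# which emits a deprecation warning on recent scikit-learn versions.
rmse = root_mean_squared_error(targets, predictions)
print(round(rmse, 4))  # ~0.1414
```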

## v0.81.0 | 11 Sep 2024

## Expanding Flux Processing Capabilities
9 changes: 6 additions & 3 deletions README.md
@@ -1,8 +1,8 @@
![](images/logo_diive1_256px.png)

![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)
![PyPI - Version](https://img.shields.io/pypi/v/diive?style=for-the-badge&color=%23EF6C00&link=https%3A%2F%2Fpypi.org%2Fproject%2Fdiive%2F)
![GitHub License](https://img.shields.io/github/license/holukas/diive?style=for-the-badge&color=%237CB342)
[![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://www.python.org/)
[![PyPI - Version](https://img.shields.io/pypi/v/diive?style=for-the-badge&color=%23EF6C00&link=https%3A%2F%2Fpypi.org%2Fproject%2Fdiive%2F)](https://pypi.org/project/diive/)
[![GitHub License](https://img.shields.io/github/license/holukas/diive?style=for-the-badge&color=%237CB342)](https://github.com/holukas/diive/blob/indev/LICENSE)

[![DOI](https://zenodo.org/badge/708559210.svg)](https://zenodo.org/doi/10.5281/zenodo.10884017)

@@ -84,6 +84,8 @@ _For info about the Swiss FluxNet flux levels, see [here](https://www.swissfluxn
- Level-2 quality flags
- Level-3.1 storage correction
- Level-3.2 outlier removal
- Level-3.3: USTAR filtering using constant thresholds
- Level-4.1: gap-filling using long-term random forest
- Quick flux processing chain ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/FluxProcessingChain/QuickFluxProcessingChain.ipynb))

### Formats
@@ -129,6 +131,7 @@ _Create single outlier flags where `0=OK` and `2=outlier`._

### Plotting

- **Cumulatives across all years for multiple variables** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/Cumulative.ipynb))
- **Cumulatives per year** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/CumulativesPerYear.ipynb))
- **Diel cycle per month** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/DielCycle.ipynb))
- **Heatmap date/time**: showing values (z) of time series as date (y) vs time (x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapDateTime.ipynb))
21 changes: 17 additions & 4 deletions diive/core/dfun/frames.py
@@ -88,11 +88,14 @@ def trim_frame(df: DataFrame, var: str) -> DataFrame:


def detect_new_columns(df: DataFrame, other: DataFrame) -> list:
"""Detect columns in *df* that do not exist in other."""
"""Detect columns in *df* that do not exist in *other*."""
duplicatecols = [c for c in df.columns if c in other.columns]
for col in df[duplicatecols]:
if not df[col].equals(other[col]):
raise Exception(f"Column {col} was identified as duplicate, but is not identical.")
raise Exception(
f"Column {col} was identified as duplicate, but is not identical. "
f"This error can occur e.g. when features used in machine learning models "
f"have gaps.")

newcols = [c for c in df.columns if c not in other.columns]

@@ -802,7 +805,7 @@ def lagged_variants(df: DataFrame,
lag: list[int, int],
stepsize: int = 1,
exclude_cols: list = None,
verbose: bool = True) -> DataFrame:
verbose: int = 0) -> DataFrame:
"""
Create lagged variants of variables
@@ -881,12 +884,22 @@ def lagged_variants(df: DataFrame,
else:
# skip if lagstep = 0
continue

# Shifting data creates NaNs in time series, which can cause issues
# with some machine learning algorithms
# To handle this, the new gap(s) is/are filled with the nearest value
n_missing_vals_before = int(df[col].isnull().sum())
df[stepname] = df[col].shift(_shift)
n_missing_vals_after = int(df[stepname].isnull().sum())
if n_missing_vals_before == 0 and n_missing_vals_after > 0:
df[stepname] = df[stepname].bfill(limit=n_missing_vals_after)
df[stepname] = df[stepname].ffill(limit=n_missing_vals_after)
_included.append(col)

if verbose:
print(f"++ Added new columns with lagged variants for: {_included} (lags between {lag[0]} and {lag[1]} "
f"with stepsize {stepsize}), no lagged variants for: {_excluded}.")
f"with stepsize {stepsize}), no lagged variants for: {_excluded}. "
f"Shifting the time series created gaps which were then filled with the nearest value.")
return df


26 changes: 17 additions & 9 deletions diive/core/ml/common.py
@@ -1,16 +1,14 @@
"""
kudos: https://datascience.stackexchange.com/questions/15135/train-test-validation-set-splitting-in-sklearn
"""
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import PredictionErrorDisplay, max_error, median_absolute_error, mean_absolute_error, \
mean_absolute_percentage_error, r2_score, mean_squared_error
mean_absolute_percentage_error, r2_score, mean_squared_error, root_mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from yellowbrick.regressor import PredictionError, ResidualsPlot
@@ -31,10 +29,10 @@ def __init__(
regressor,
input_df: DataFrame,
target_col: str or tuple,
verbose: bool = True,
verbose: int = 0,
perm_n_repeats: int = 10,
test_size: float = 0.20,
features_lag: list = None,
features_lag: list[int, int] = None,
features_lag_exclude_cols: list = None,
include_timestamp_as_features: bool = False,
add_continuous_record_number: bool = False,
@@ -129,6 +127,17 @@ def __init__(

self.original_input_features = self.model_df.drop(columns=self.target_col).columns.tolist()

# Check if features complete
n_vals_index = len(self.model_df.index)
fstats = self.model_df[self.original_input_features].describe()
not_complete = fstats.loc['count'] < n_vals_index
not_complete = not_complete[not_complete].index.tolist()
if len(not_complete) > 0:
print(f"(!)Some features are incomplete and have less than {n_vals_index} values:")
for nc in not_complete:
print(f" --> {nc} ({fstats[nc]['count'].astype(int)} values)")
print(f"This means that not all target values can be predicted based on the full model.")

# Create training (80%) and testing dataset (20%)
# Sort index to keep original order
_temp_df = self.model_df.copy().dropna()
@@ -245,7 +254,7 @@ def _fi_across_splits(feature_importances_splits) -> DataFrame:
fi_df = fi_df.mean(axis=1)
return fi_df

def _remove_rejected_features(self, factor: float =1) -> list:
def _remove_rejected_features(self, factor: float = 1) -> list:
"""Remove features below importance threshold from model dataframe.
The updated model dataframe will then be used for the next (final) model.
"""
@@ -428,7 +437,7 @@ def fillgaps(self,
self._fillgaps_fallback()
self._fillgaps_combinepredictions()

def reduce_features(self, factor:float=1):
def reduce_features(self, factor: float = 1):
"""Reduce number of features using permutation importance
A random variable is added to features and the permutation importances
@@ -460,7 +469,6 @@ def reduce_features(self, factor:float=1):
# todo from dtreeviz.trees import dtreeviz # will be used for tree visualization
# _ = tree.plot_tree(rf.estimators_[0], feature_names=X.columns, filled=True)


# Calculate permutation importance for all data
print(f">>> Calculating feature importances (permutation importance, {self.perm_n_repeats} repeats) ...")
X_names = df.drop(self.target_col, axis=1).columns.tolist()
@@ -1110,7 +1118,7 @@ def prediction_scores_regr(predictions: np.array,
'mae': mean_absolute_error(targets, predictions),
'medae': median_absolute_error(targets, predictions),
'mse': mean_squared_error(targets, predictions),
'rmse': mean_squared_error(targets, predictions, squared=False), # root mean squared error
'rmse': root_mean_squared_error(targets, predictions),
'mape': mean_absolute_percentage_error(targets, predictions),
'maxe': max_error(targets, predictions),
'r2': r2_score(targets, predictions)
