Merge pull request #56 from databricks-industry-solutions/fix-clip-issue
fixed clipping issue
ryuta-yoshimatsu authored Jun 10, 2024
2 parents d703d26 + fff0614 commit 99201e7
Showing 23 changed files with 159 additions and 116 deletions.
45 changes: 24 additions & 21 deletions README.md
Original file line number Diff line number Diff line change
@@ -70,6 +70,7 @@ run_forecast(
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=2,
data_quality_check=True,
resample=False,
@@ -81,29 +82,30 @@ run_forecast(

#### Parameters description:

- ```train_data``` is a delta table name that stores the input dataset.
- ```scoring_data``` is a delta table name that stores the [dynamic future regressors](https://nixtlaverse.nixtla.io/neuralforecast/examples/exogenous_variables.html#3-training-with-exogenous-variables). If not provided, or if the same name as ```train_data``` is provided, the models will ignore the dynamic future regressors.
- ```scoring_output``` is a delta table where your forecasting output is written. This table will be created if it does not exist.
- ```evaluation_output``` is a delta table where the evaluation results from all backtesting trials, across all time series and all models, are written. This table will be created if it does not exist.
- ```group_id``` is a column storing the unique id that groups your dataset to each time series.
- ```date_col``` is your time column name.
- ```target``` is your target column name.
- ```freq``` is your prediction frequency. Currently, "D" for daily and "M" for monthly are supported. Note that the supported frequencies vary by model, so check each model's documentation carefully.
- ```prediction_length``` is your forecasting horizon in the number of steps.
- ```backtest_months``` specifies how many previous months you use for backtesting.
- ```stride``` is the number of steps by which you shift the start date of your backtesting trial when moving from one trial to the next.
- ```metric``` is the metric logged in the evaluation table and to MLflow. Supported metrics are ```mape``` and ```smape```. Default is ```smape```.
- ```train_predict_ratio``` specifies the minimum length of your training dataset relative to ```prediction_length```. If ```train_predict_ratio```=2, your training dataset must be at least twice as long as ```prediction_length```.
- ```data_quality_check``` checks the quality of the input data if set to True (default False). See [data_quality_checks.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/mmf_sa/data_quality_checks.py) for the full details of the checks.
- ```resample``` backfills skipped entries with 0 if set to True. Only relevant when ```data_quality_check``` is True. Default is False. If ```data_quality_check``` is True and ```resample``` is False, the check removes all time series with skipped dates.
- ```active_models``` is a list of models you want to use.
- ```experiment_path``` is the MLflow experiment path under which the metrics are logged.
- ```use_case_name``` adds a column with this value to the output delta tables, which is useful when you save multiple trials in one table.
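For intuition about what ```metric="smape"``` evaluates, here is a minimal sketch of sMAPE under one common definition. This is not MMF's actual implementation (which lives inside ```mmf_sa``` and may differ, e.g. in how zero denominators are handled):

```python
# A minimal sketch of symmetric MAPE (sMAPE), one common definition.
# Assumed for illustration only; mmf_sa's internal implementation may differ.

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    assert len(actual) == len(forecast) and len(actual) > 0
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Guard against a zero denominator when both values are 0.
        total += abs(f - a) / denom if denom else 0.0
    return 100 * total / len(actual)

# One value per backtest step in the forecast horizon.
print(smape([100, 200, 300], [110, 190, 310]))
```

A lower value is better; sMAPE is bounded, which makes it easier to average across many time series of very different scales than plain MAPE.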

To modify the model hyperparameters, change the values in [mmf_sa/models/models_conf.yaml](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/mmf_sa/models/models_conf.yaml) or overwrite these values in [mmf_sa/forecasting_conf.yaml](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/mmf_sa/forecasting_conf.yaml).

MMF is fully integrated with MLflow, so once the training kicks off, the experiments will be visible in the MLflow Tracking UI with the corresponding metrics and parameters (note that we do not log all local models in MLflow, but we store the binaries in the tables ```evaluation_output``` and ```scoring_output```). The metric you see in the MLflow Tracking UI is a simple mean of the chosen metric over all backtesting trials and all time series.
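That aggregation can be sketched as follows (an assumed illustration, not MMF's actual code): each (time series, backtest trial) pair contributes one metric value, and the logged number is their plain mean.

```python
# Assumed illustration: average the per-series, per-trial metric values
# into the single number shown in the MLflow Tracking UI.
backtest_metrics = [
    {"unique_id": "item_1", "trial": 1, "smape": 12.0},
    {"unique_id": "item_1", "trial": 2, "smape": 14.0},
    {"unique_id": "item_2", "trial": 1, "smape": 8.0},
    {"unique_id": "item_2", "trial": 2, "smape": 10.0},
]

logged_metric = sum(m["smape"] for m in backtest_metrics) / len(backtest_metrics)
print(logged_metric)  # 11.0
```

The per-trial, per-series values themselves remain available in ```evaluation_output``` if you need a finer-grained breakdown.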

We encourage you to read through the [examples/local_univariate_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/local_univariate_daily.py) notebook to better understand how local models can be applied to your time series using MMF. Other example notebooks for monthly forecasting and forecasting with exogenous regressors can be found in [examples/local_univariate_monthly.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/local_univariate_monthly.py) and [examples/local_univariate_external_regressors_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/local_univariate_external_regressors_daily.py).

### Global Models

@@ -157,6 +159,7 @@ run_forecast(
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=2,
data_quality_check=True,
resample=False,
@@ -179,7 +182,7 @@ To modify the model hyperparameters or reset the range of the hyperparameter sea

MMF is fully integrated with MLflow, so once the training kicks off, the experiments will be visible in the MLflow Tracking UI with the corresponding metrics and parameters. Once training is complete, the models will be logged to MLflow and registered to Unity Catalog.

We encourage you to read through the [examples/global_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/global_daily.py) notebook to better understand how global models can be applied to your time series using MMF. Other example notebooks for monthly forecasting and forecasting with exogenous regressors can be found in [examples/global_monthly.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/global_monthly.py) and [examples/global_external_regressors_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/global_external_regressors_daily.py) respectively.

### Foundation Models

@@ -221,7 +224,7 @@ To modify the model hyperparameters, change the values in [mmf_sa/models/models_

MMF is fully integrated with MLflow, so once the training kicks off, the experiments will be visible in the MLflow Tracking UI with the corresponding metrics and parameters. During the evaluation, the models are logged and registered to Unity Catalog.

We encourage you to read through the [examples/foundation_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/foundation_daily.py) notebook to better understand how foundation models can be applied to your time series using MMF. An example notebook for monthly forecasting can be found in [examples/foundation_monthly.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/foundation_monthly.py).

#### Using Foundation Models on Databricks

1 change: 1 addition & 0 deletions examples/local_univariate_daily.py
@@ -181,6 +181,7 @@ def transform_group(df):
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=1,
data_quality_check=False,
resample=False,
1 change: 1 addition & 0 deletions examples/local_univariate_external_regressors_daily.py
@@ -151,6 +151,7 @@
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=1,
active_models=active_models,
data_quality_check=False,
1 change: 1 addition & 0 deletions examples/local_univariate_monthly.py
@@ -179,6 +179,7 @@ def transform_group(df):
prediction_length=3,
backtest_months=12,
stride=1,
metric="smape",
train_predict_ratio=1,
data_quality_check=False,
resample=False,
20 changes: 12 additions & 8 deletions examples/m5-examples/data_preparation_m5.py
@@ -53,14 +53,26 @@
random.seed(7)

unique_ids = list(daily_train["unique_id"].unique())

unique_id_100 = sorted(random.sample(unique_ids, 100))
unique_id_1000 = sorted(random.sample(unique_ids, 1000))
unique_id_10000 = sorted(random.sample(unique_ids, 10000))

daily_train_100 = daily_train[daily_train["unique_id"].isin(unique_id_100)]
daily_train_1000 = daily_train[daily_train["unique_id"].isin(unique_id_1000)]
daily_train_10000 = daily_train[daily_train["unique_id"].isin(unique_id_10000)]

# COMMAND ----------

(
spark.createDataFrame(daily_train_100)
.write.format("delta").mode("overwrite")
.saveAsTable(f"{catalog}.{db}.daily_train_100")
)
print(f"Saved data to {catalog}.{db}.daily_train_100")

# COMMAND ----------

(
spark.createDataFrame(daily_train_1000)
.write.format("delta").mode("overwrite")
@@ -76,11 +88,3 @@
.saveAsTable(f"{catalog}.{db}.daily_train_10000")
)
print(f"Saved data to {catalog}.{db}.daily_train_10000")

# COMMAND ----------

display(spark.sql(f"select * from {catalog}.{db}.daily_train_1000"))

# COMMAND ----------


2 changes: 1 addition & 1 deletion examples/m5-examples/foundation_daily_m5.py
@@ -20,7 +20,7 @@

catalog = "mmf" # Name of the catalog we use to manage our assets
db = "m5" # Name of the schema we use to manage our assets (e.g. datasets)
n = 1000 # Number of items: choose from [100, 1000, 10000, 'full']. full is 35k
table = f"daily_train_{n}" # Training table name
user_email = spark.sql('select current_user() as user').collect()[0]['user'] # User email

2 changes: 1 addition & 1 deletion examples/m5-examples/global_daily_m5.py
@@ -21,7 +21,7 @@

catalog = "mmf" # Name of the catalog we use to manage our assets
db = "m5" # Name of the schema we use to manage our assets (e.g. datasets)
n = 1000 # Number of items: choose from [100, 1000, 10000, 'full']. full is 35k
table = f"daily_train_{n}" # Training table name
user_email = spark.sql('select current_user() as user').collect()[0]['user'] # User email

1 change: 1 addition & 0 deletions examples/m5-examples/local_univariate_daily_m5.py
@@ -77,6 +77,7 @@
prediction_length=28,
backtest_months=3,
stride=7,
metric="smape",
train_predict_ratio=1,
data_quality_check=True,
resample=False,
1 change: 1 addition & 0 deletions examples/m5-examples/run_daily_m5.py
@@ -41,6 +41,7 @@
prediction_length=28,
backtest_months=3,
stride=7,
metric="smape",
train_predict_ratio=1,
data_quality_check=True,
resample=False,
1 change: 1 addition & 0 deletions examples/run_daily.py
@@ -37,6 +37,7 @@
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=1,
data_quality_check=True,
resample=False,
1 change: 1 addition & 0 deletions examples/run_external_regressors_daily.py
@@ -38,6 +38,7 @@
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=1,
active_models=[model],
data_quality_check=True,
1 change: 1 addition & 0 deletions examples/run_monthly.py
@@ -37,6 +37,7 @@
prediction_length=3,
backtest_months=12,
stride=1,
metric="smape",
train_predict_ratio=1,
data_quality_check=True,
resample=False,