Merge pull request #56 from databricks-industry-solutions/fix-clip-issue
fixed clipping issue
ryuta-yoshimatsu authored Jun 10, 2024
2 parents d703d26 + fff0614 commit 99201e7
Showing 23 changed files with 159 additions and 116 deletions.
45 changes: 24 additions & 21 deletions README.md
Original file line number Diff line number Diff line change
@@ -70,6 +70,7 @@ run_forecast(
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=2,
data_quality_check=True,
resample=False,
@@ -81,29 +82,30 @@ run_forecast(

#### Parameters description:

- ```train_data``` is a delta table name that stores the input dataset.
- ```scoring_data``` is a delta table name that stores the [dynamic future regressors](https://nixtlaverse.nixtla.io/neuralforecast/examples/exogenous_variables.html#3-training-with-exogenous-variables). If not provided, or if the same name as ```train_data``` is provided, the models will ignore the dynamic future regressors.
- ```scoring_output``` is a delta table where your forecasting output is written. This table will be created if it does not exist.
- ```evaluation_output``` is a delta table where the evaluation results from all backtesting trials, across all time series and all models, are written. This table will be created if it does not exist.
- ```group_id``` is a column storing the unique id that groups your dataset to each time series.
- ```date_col``` is your time column name.
- ```target``` is your target column name.
- ```freq``` is your prediction frequency. Currently, "D" for daily and "M" for monthly are supported. Note that the supported frequencies vary by model, so check each model's documentation carefully.
- ```prediction_length``` is your forecasting horizon in the number of steps.
- ```backtest_months``` specifies how many previous months you use for backtesting.
- ```stride``` is the number of steps by which you shift the start date of your backtesting trial when moving from one trial to the next.
- ```metric``` is the metric logged in the evaluation table and to MLflow. Supported metrics are ```mape``` and ```smape```. Default is ```smape```.
- ```train_predict_ratio``` specifies the minimum length of your training dataset relative to ```prediction_length```. If ```train_predict_ratio```=2, your training dataset must be at least twice as long as ```prediction_length```.
- ```data_quality_check``` checks the quality of the input data if set to True (default False). See [data_quality_checks.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/mmf_sa/data_quality_checks.py) for the full details of the checks.
- ```resample``` backfills skipped entries with 0 if set to True. Only relevant when ```data_quality_check``` is True. Default is False. If ```data_quality_check``` is True and ```resample``` is False, the check removes all time series with skipped dates.
- ```active_models``` is a list of models you want to use.
- ```experiment_path``` is the MLflow experiment path under which the metrics are logged.
- ```use_case_name``` adds a column with this value to the output delta tables, which is useful when you save multiple trials in one table.
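For intuition about what ```metric="smape"``` evaluates, here is a minimal sketch of sMAPE under one common definition. This is not MMF's actual implementation (which lives inside ```mmf_sa``` and may differ, e.g. in how zero denominators are handled):

```python
# A minimal sketch of symmetric MAPE (sMAPE), one common definition.
# Assumed for illustration only; mmf_sa's internal implementation may differ.

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    assert len(actual) == len(forecast) and len(actual) > 0
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Guard against a zero denominator when both values are 0.
        total += abs(f - a) / denom if denom else 0.0
    return 100 * total / len(actual)

# One value per backtest step in the forecast horizon.
print(smape([100, 200, 300], [110, 190, 310]))
```

A lower value is better; sMAPE is bounded, which makes it easier to average across many time series of very different scales than plain MAPE.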

To modify the model hyperparameters, change the values in [mmf_sa/models/models_conf.yaml](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/mmf_sa/models/models_conf.yaml) or overwrite these values in [mmf_sa/forecasting_conf.yaml](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/mmf_sa/forecasting_conf.yaml).

MMF is fully integrated with MLflow, so once the training kicks off, the experiments will be visible in the MLflow Tracking UI with the corresponding metrics and parameters (note that we do not log all local models in MLflow, but we store the binaries in the tables ```evaluation_output``` and ```scoring_output```). The metric you see in the MLflow Tracking UI is a simple mean of the chosen metric over all backtesting trials and all time series.
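That aggregation can be sketched as follows (an assumed illustration, not MMF's actual code): each (time series, backtest trial) pair contributes one metric value, and the logged number is their plain mean.

```python
# Assumed illustration: average the per-series, per-trial metric values
# into the single number shown in the MLflow Tracking UI.
backtest_metrics = [
    {"unique_id": "item_1", "trial": 1, "smape": 12.0},
    {"unique_id": "item_1", "trial": 2, "smape": 14.0},
    {"unique_id": "item_2", "trial": 1, "smape": 8.0},
    {"unique_id": "item_2", "trial": 2, "smape": 10.0},
]

logged_metric = sum(m["smape"] for m in backtest_metrics) / len(backtest_metrics)
print(logged_metric)  # 11.0
```

The per-trial, per-series values themselves remain available in ```evaluation_output``` if you need a finer-grained breakdown.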

We encourage you to read through the [examples/local_univariate_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/local_univariate_daily.py) notebook to better understand how local models can be applied to your time series using MMF. Other example notebooks for monthly forecasting and forecasting with exogenous regressors can be found in [examples/local_univariate_monthly.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/local_univariate_monthly.py) and [examples/local_univariate_external_regressors_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/local_univariate_external_regressors_daily.py).

### Global Models

@@ -157,6 +159,7 @@ run_forecast(
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=2,
data_quality_check=True,
resample=False,
@@ -179,7 +182,7 @@ To modify the model hyperparameters or reset the range of the hyperparameter sea

MMF is fully integrated with MLflow, so once the training kicks off, the experiments will be visible in the MLflow Tracking UI with the corresponding metrics and parameters. Once training is complete, the models will be logged to MLflow and registered to Unity Catalog.

We encourage you to read through the [examples/global_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/global_daily.py) notebook to better understand how global models can be applied to your time series using MMF. Other example notebooks for monthly forecasting and forecasting with exogenous regressors can be found in [examples/global_monthly.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/global_monthly.py) and [examples/global_external_regressors_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/global_external_regressors_daily.py) respectively.

### Foundation Models

@@ -221,7 +224,7 @@ To modify the model hyperparameters, change the values in [mmf_sa/models/models_

MMF is fully integrated with MLflow, so once the training kicks off, the experiments will be visible in the MLflow Tracking UI with the corresponding metrics and parameters. During the evaluation, the models are logged and registered to Unity Catalog.

We encourage you to read through the [examples/foundation_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/foundation_daily.py) notebook to better understand how foundation models can be applied to your time series using MMF. An example notebook for monthly forecasting can be found in [examples/foundation_monthly.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/foundation_monthly.py).

#### Using Foundation Models on Databricks

1 change: 1 addition & 0 deletions examples/local_univariate_daily.py
@@ -181,6 +181,7 @@ def transform_group(df):
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=1,
data_quality_check=False,
resample=False,
1 change: 1 addition & 0 deletions examples/local_univariate_external_regressors_daily.py
@@ -151,6 +151,7 @@
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=1,
active_models=active_models,
data_quality_check=False,
1 change: 1 addition & 0 deletions examples/local_univariate_monthly.py
@@ -179,6 +179,7 @@ def transform_group(df):
prediction_length=3,
backtest_months=12,
stride=1,
metric="smape",
train_predict_ratio=1,
data_quality_check=False,
resample=False,
20 changes: 12 additions & 8 deletions examples/m5-examples/data_preparation_m5.py
@@ -53,14 +53,26 @@
random.seed(7)

unique_ids = list(daily_train["unique_id"].unique())

unique_id_100 = sorted(random.sample(unique_ids, 100))
unique_id_1000 = sorted(random.sample(unique_ids, 1000))
unique_id_10000 = sorted(random.sample(unique_ids, 10000))

daily_train_100 = daily_train[daily_train["unique_id"].isin(unique_id_100)]
daily_train_1000 = daily_train[daily_train["unique_id"].isin(unique_id_1000)]
daily_train_10000 = daily_train[daily_train["unique_id"].isin(unique_id_10000)]

# COMMAND ----------

(
spark.createDataFrame(daily_train_100)
.write.format("delta").mode("overwrite")
.saveAsTable(f"{catalog}.{db}.daily_train_100")
)
print(f"Saved data to {catalog}.{db}.daily_train_100")

# COMMAND ----------

(
spark.createDataFrame(daily_train_1000)
.write.format("delta").mode("overwrite")
@@ -76,11 +88,3 @@
.saveAsTable(f"{catalog}.{db}.daily_train_10000")
)
print(f"Saved data to {catalog}.{db}.daily_train_10000")

# COMMAND ----------

display(spark.sql(f"select * from {catalog}.{db}.daily_train_1000"))

# COMMAND ----------


2 changes: 1 addition & 1 deletion examples/m5-examples/foundation_daily_m5.py
@@ -20,7 +20,7 @@

catalog = "mmf" # Name of the catalog we use to manage our assets
db = "m5" # Name of the schema we use to manage our assets (e.g. datasets)
n = 1000 # Number of items: choose from [100, 1000, 10000, 'full']. full is 35k
table = f"daily_train_{n}" # Training table name
user_email = spark.sql('select current_user() as user').collect()[0]['user'] # User email

2 changes: 1 addition & 1 deletion examples/m5-examples/global_daily_m5.py
@@ -21,7 +21,7 @@

catalog = "mmf" # Name of the catalog we use to manage our assets
db = "m5" # Name of the schema we use to manage our assets (e.g. datasets)
n = 1000 # Number of items: choose from [100, 1000, 10000, 'full']. full is 35k
table = f"daily_train_{n}" # Training table name
user_email = spark.sql('select current_user() as user').collect()[0]['user'] # User email

1 change: 1 addition & 0 deletions examples/m5-examples/local_univariate_daily_m5.py
@@ -77,6 +77,7 @@
prediction_length=28,
backtest_months=3,
stride=7,
metric="smape",
train_predict_ratio=1,
data_quality_check=True,
resample=False,
1 change: 1 addition & 0 deletions examples/m5-examples/run_daily_m5.py
@@ -41,6 +41,7 @@
prediction_length=28,
backtest_months=3,
stride=7,
metric="smape",
train_predict_ratio=1,
data_quality_check=True,
resample=False,
1 change: 1 addition & 0 deletions examples/run_daily.py
@@ -37,6 +37,7 @@
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=1,
data_quality_check=True,
resample=False,
1 change: 1 addition & 0 deletions examples/run_external_regressors_daily.py
@@ -38,6 +38,7 @@
prediction_length=10,
backtest_months=1,
stride=10,
metric="smape",
train_predict_ratio=1,
active_models=[model],
data_quality_check=True,
1 change: 1 addition & 0 deletions examples/run_monthly.py
@@ -37,6 +37,7 @@
prediction_length=3,
backtest_months=12,
stride=1,
metric="smape",
train_predict_ratio=1,
data_quality_check=True,
resample=False,