A submission for Türkiye İş Bankası Machine Learning Challenge #2.
team: -
members: @mustafahakkoz
rank: 50/94
score (RMSLE): 0.17970
dataset: Transaction Amount. A realistic dataset for predicting the future total transaction amount per sector (a regression task). Because there are very few attributes, participants are encouraged to mine external features (inflation data, salary payment days, exchange rates, seasonal temperature ...).
- training dataset (200.16 MB, 3.53M rows, 8 cols)
- testing dataset (11.53 MB, 220K rows, 8 cols)
Implementation details can be found in notebooks.
- 1.a. EDA and preprocessing
- 1.b. EDA and preprocessing alternative (without scaling)
- 1.c. EDA and preprocessing alternative (without scaling and feature elimination)

- 3.a. RandomizedSearch with XGBoost
- 3.b. RandomizedSearch with CatBoost
- 3.c. RandomizedSearch with XGBoost (Expanded search space)
- 3.d. RandomizedSearch with CatBoost (Expanded search space)
- 3.e. RandomizedSearch with LightGBM (Expanded search space)

- 4.a. BayesianOptimization with XGBoost
- 4.b. BayesianOptimization with XGBoost (Expanded search space)
1.a.isbankasi-eda-preprocessing.ipynb
- Binning ordinal columns with KBinsDiscretizer
- Creating new features by 1-level groups
- Creating new features by 2-level groups
- Filling NaN values with the mean of each group
- Handling the date column and creating new features by seasonal, yearly and monthly groups
- Mining external data such as economic indicators (17 cols) and exchange rates (2 cols)
- TargetEncoding for categorical columns
- MinMaxScaler for normalization
- Feature elimination by Pearson correlation (52 -> 44)
- Feature elimination by PCA on external data columns (17+2 -> 3)
- Analyzing data with PPS (Predictive Power Score)
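
A few of the steps above can be sketched on a toy frame. This is only a minimal illustration, not the notebook's code: the column names (`sector`, `customer_age`, `txn_count`), the 4 quantile bins and the 0.9 correlation threshold are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Toy data; the columns are hypothetical, not the competition schema.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sector": rng.choice(["A", "B", "C"], size=200),
    "customer_age": rng.integers(18, 80, size=200).astype(float),
    "txn_count": rng.poisson(20, size=200).astype(float),
})
df.loc[rng.choice(200, size=20, replace=False), "txn_count"] = np.nan

# Binning: map a numeric/ordinal column to 4 quantile bins (bin count is an assumption).
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
df["age_bin"] = binner.fit_transform(df[["customer_age"]]).ravel().astype(int)

# NaN handling: fill missing values with the mean of the row's group.
group_mean = df.groupby("sector")["txn_count"].transform("mean")
df["txn_count"] = df["txn_count"].fillna(group_mean)

# 1-level and 2-level group features: broadcast group statistics back to rows.
df["sector_mean_txn"] = df.groupby("sector")["txn_count"].transform("mean")
df["sector_agebin_mean_txn"] = (
    df.groupby(["sector", "age_bin"])["txn_count"].transform("mean")
)

# Pearson-based elimination: for each highly correlated pair, drop the later column.
num = df.select_dtypes("number")
corr = num.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]  # threshold assumed
reduced = df.drop(columns=to_drop)
```

Looking only at the upper triangle of the correlation matrix means each pair is checked once, so one column of a correlated pair is kept.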
1.b.isbankasi-eda-preprocess-noscaling.ipynb
- An alternative preprocessing pipeline that does not scale the data.
1.c.isbankasi-eda-preprocess-noscaling-noelimination.ipynb
- An alternative preprocessing pipeline without scaling and without feature elimination.
2.a.isbankasi-overfit-xgboost.ipynb
- Overfitting an XGBoost model to test the capabilities of the preprocessing step.
2.b.isbankasi-overfit-catboost.ipynb
- Overfitting a CatBoost model to test the capabilities of the preprocessing step.
3.a.isbankasi-randomizedsearch-xgboost.ipynb
- Randomized search over a small search space of XGBoost's hyperparameters.
3.b.isbankasi-randomizedsearch-catboost.ipynb
- Randomized search over a small search space of CatBoost's hyperparameters.
3.c.isbankasi-randomizedsearch-xgboost-expanded.ipynb
- The XGBoost randomized search repeated with an expanded search space.
3.d.isbankasi-randomizedsearch-catboost-expanded.ipynb
- The CatBoost randomized search repeated with an expanded search space.
3.e.isbankasi-randomizedsearch-lightgbm-expanded.ipynb
- Randomized search for LightGBM with an expanded search space.
4.a.isbankasi-bayesianoptimization-xgboost.ipynb
- Bayesian optimization for tuning XGBoost.
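
To show the idea behind Bayesian optimization, here is a from-scratch sketch: a Gaussian-process surrogate plus an expected-improvement rule picks the next hyperparameter value to try. It does not reproduce the notebook or any particular tuning library. The `objective` function is a hypothetical stand-in for a validation RMSLE as a function of `log10(learning_rate)`, and the bounds, kernel and iteration counts are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(log_lr):
    """Hypothetical validation error; a real run would train and score a model here."""
    return (log_lr + 1.5) ** 2 + 0.18


bounds = (-3.0, 0.0)  # assumed search range for log10(learning_rate)
rng = np.random.default_rng(0)

# Start with a few random evaluations.
X_obs = rng.uniform(*bounds, size=(4, 1))
y_obs = np.array([objective(x[0]) for x in X_obs])

# Candidate grid on which the acquisition function is evaluated.
candidates = np.linspace(*bounds, 301).reshape(-1, 1)

for _ in range(10):
    # Fit the surrogate to everything observed so far.
    gp = GaussianProcessRegressor(
        kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True, random_state=0
    )
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero

    # Expected improvement for a minimization problem.
    best = y_obs.min()
    z = (best - mu) / sigma
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # Evaluate the objective where expected improvement is largest.
    x_next = candidates[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next[0]))

best_log_lr = X_obs[np.argmin(y_obs), 0]
print(best_log_lr, y_obs.min())
```

Unlike randomized search, each new trial here depends on all previous results, which is why it can need fewer model fits when every fit is expensive.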
4.b.isbankasi-bayesianoptimization-xgboost-expanded.ipynb
- The XGBoost Bayesian optimization repeated with an expanded search space.
- We didn't use a hold-out test set or cross-validation to evaluate our experiments, which kept us from diversifying them.
- We focused on tuning models. Instead, we could have implemented more extensive preprocessing (creating more features, more data mining, advanced null handling, more feature elimination etc.) to improve our scores.
- We should have tried AutoML techniques.