
This is an end to end machine learning project using my personal shopping data collected over the past three years.


ManunEbo/Home-Shopping-ML


Home Shopping project

I tend to collect my shopping receipts. In May 2020, I decided to build a database of these receipts, called Home Shopping. It has given me a useful way not only to keep an eye on my expenditure but also to gain insight into my consumption habits:
  • Where do I spend most of my money?
  • How much have I spent in each venue overall?
  • When do I spend most of my money?
  • What do I spend most of my money on?
  • What products do I buy most?
  • How much do I spend per week, per month, per year?
  • How many items do I buy per week, per month, per year?
  • etc.

Having gathered several years' worth of data, I wanted to apply machine learning to it.
The project involves several tables from the database:

  • Receipt table - This contains summary receipt data: total price, total number of items, receipt date, receipt time and shopping venue
  • Payment table - This contains payment information, e.g. payment type (cash, card, plan) and Card_Source (Contactless, Pin, 0, DB, DD, Transfer)
  • Item table - This contains item/product information, e.g. item name and item price
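As a rough illustration of how the questions above could be answered from the Receipt table, here is a minimal pandas sketch. The column names (`receipt_date`, `venue`, `total_price`, `total_items`) and the rows are hypothetical stand-ins; the real schema is not given in this README.

```python
import pandas as pd

# Hypothetical stand-in for the Receipt table (made-up rows and column names).
receipts = pd.DataFrame({
    "receipt_date": pd.to_datetime(
        ["2023-01-03", "2023-01-17", "2023-02-02", "2023-02-20"]
    ),
    "venue": ["Shop A", "Shop B", "Shop A", "Shop A"],
    "total_price": [12.50, 30.00, 8.25, 15.00],
    "total_items": [5, 12, 3, 6],
})

# Where do I spend most of my money? Total spend per venue, largest first.
spend_per_venue = (
    receipts.groupby("venue")["total_price"].sum().sort_values(ascending=False)
)

# How much do I spend, and how many items do I buy, per month?
monthly = receipts.groupby(receipts["receipt_date"].dt.to_period("M"))[
    ["total_price", "total_items"]
].sum()

print(spend_per_venue)
print(monthly)
```

The same grouping works per week or per year by changing the period code passed to `to_period` (for example `"W"` or `"Y"`).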

Problem statement

I need a set of tools that guide my expenditure such that I feel more in control of my expenditure while saving time spent shopping.

I have a small fridge, so I don't tend to buy in bulk and consume over a longer period. I buy small quantities and, as a result, make many shopping trips in a week, which consumes a very important resource: time.

In addition, I go through phases where I buy lots of things in a short period of time, whether online or in store. This reflects negatively on my budget.

With respect to the actual expenditure, I don't have a consistent pattern, i.e. there is significant variance in expenditure between similar shopping trips.

I want to smooth the shopping experience: I want to reduce the time spent shopping and the number of trips I make per week, and I want to reduce the variability in expenditure using a planning tool that provides a good estimate of expenditure given a shopping list.

These tools will be used in combination to optimize the shopping experience: reduce time spent shopping and stabilize expenditure.

A problem on the nature of time series: Data

Michael Burry did not have a time series/forecasting model that predicted an imminent stock market crash in 2008. Instead, he stumbled onto data which indicated a serious problem in subprime lending; the rest is common sense.

Back in 2006-2007, no financial forecasting model successfully predicted the 2008 crash:
they didn't have the data that indicated a serious problem in the underlying assets, subprime lending.

Forecasting models do well when there is data supporting the trend:
you can see a wave building by the seaside and follow it until it collapses, but you can't predict with certainty where and when it will collapse, nor where and when the next one will arise, unless you have multiple detectors beneath the water.

But you know in advance that one will arise, eventually.

A problem on the nature of time series: Algorithms

Tree-based algorithms such as random forest and XGBoost resample the training data (random forest bootstraps rows for each tree; XGBoost can optionally subsample rows) and are commonly used in combination with cross-validation. Resampling draws rows at random from the data the model is trained on, without regard to time order. Thus, within the training set, relatively future data is used to predict relatively past data, i.e. data leakage.

Cross-validation is a powerful model validation technique. When the data is split into k folds, each fold is used once as the test set. When an earlier fold (relatively past data) is the test set, the later folds (relatively future data) are part of the training set, so data that would not have been available at the time is used for model training. Once again, this is a form of data leakage.
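The fold behaviour described above can be checked directly. The following is a small sketch, assuming the rows are sorted by time (index 0 is the oldest), that flags every fold whose training indices come after its test indices. scikit-learn's `TimeSeriesSplit` is shown only as a contrast; it is not claimed to be what this project uses.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Ten observations, assumed to be sorted by time (index 0 is the oldest).
X = np.arange(10).reshape(-1, 1)

# Plain k-fold: flag folds where some training row is later than a test row.
kfold_trains_on_future = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    kfold_trains_on_future.append(bool(train_idx.max() > test_idx.min()))

# TimeSeriesSplit: each training window ends before its test window starts.
ts_trains_on_future = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    ts_trains_on_future.append(bool(train_idx.max() > test_idx.min()))

print("KFold folds that train on later data:", kfold_trains_on_future)
print("TimeSeriesSplit folds that train on later data:", ts_trains_on_future)
```

With unshuffled `KFold`, only the last fold trains exclusively on earlier rows; every other fold includes later rows in its training set.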

Developing time series models with these techniques is inconsistent with the intuition behind the train/test split for time series. Given the much-discussed success of these models, the argument here is that time series considerations can be ignored when splitting the data into train and test, since they are already violated while training the models.

So the question is: is time series real data science?

What's inside

1. Data extraction

Inside the Data Extraction folder the following tasks are accomplished:

  • Data extraction from database
  • New features creation (preliminary feature engineering)
  • Classifier target/Label definition
  • Exporting of the raw data to csv format

2. Data

Inside the data folder you will find all of the datasets used in the project.
Individual notebooks read in or create these datasets.

3. Exploratory data analysis

Exploring the distributions of the features.

4. Feature engineering

This folder covers the following tasks:

    Classifier feature engineering
  • Creating dummy variables with pd.get_dummies
  • Splitting the data into training and testing sets
  • Scaling the values using StandardScaler
  • Feature selection: removing low-variance features
  • Feature selection: removing correlated features
  • Illustrating the features' capacity to distinguish between the target classes
  • Exporting data for modelling
    Regressor feature engineering
  • Creating dummy variables with pd.get_dummies
  • Splitting the data into training and testing sets
  • Scaling the values using StandardScaler
  • Feature selection: removing low-variance features
  • Feature selection: removing correlated features
  • Exporting data for modelling
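The steps listed above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebooks' actual code: the column names, the 0.95 correlation cut-off and the zero variance threshold are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; every column name here is hypothetical.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
df = pd.DataFrame({
    "total_items": base,
    "total_items_copy": base * 2.0 + rng.normal(scale=0.01, size=n),  # near duplicate
    "constant": np.ones(n),                                           # zero variance
    "hour": rng.normal(size=n),
    "venue": rng.choice(["Shop A", "Shop B", "Shop C"], size=n),
})
y = rng.integers(0, 2, size=n)

# 1. Dummy variables for the categorical column.
X = pd.get_dummies(df, columns=["venue"], dtype=float)

# 2. Train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 3. Scale: fit on the training set only, then apply to both sets.
scaler = StandardScaler()
X_train_s = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_s = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

# 4. Drop low-variance features (after scaling, only constant columns qualify).
vt = VarianceThreshold(threshold=0.0).fit(X_train_s)
kept = X_train_s.columns[vt.get_support()]
X_train_s, X_test_s = X_train_s[kept], X_test_s[kept]

# 5. Drop one feature from each highly correlated pair (assumed cut-off: 0.95).
corr = X_train_s.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_train_final = X_train_s.drop(columns=to_drop)
X_test_final = X_test_s.drop(columns=to_drop)

print("Dropped as correlated:", to_drop)
print("Remaining features:", list(X_train_final.columns))
```

Fitting the scaler and both selectors on the training set alone keeps information from the test set out of the preprocessing.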

5. Modelling

This folder contains the classification and regression model training process. This includes:

    Classifier model training
  • Balancing the data with imblearn
  • Hyperparameter tuning with GridSearchCV
  • Model comparison: imbalanced vs balanced
  • Final feature selection using selectfrom
  • Final model evaluation with confusion matrix
    Regressor model training
  • Hyperparameter tuning with GridSearchCV
  • Retrieving the best parameters for the top 4 models from GridSearchCV
  • Comparing VotingRegressor with the best model using cross validation
  • Working with the best model
  • Feature importance of the best model
  • Exporting the best model for evaluation
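The regressor workflow above (tune several models, then compare a VotingRegressor with the single best model) might look roughly like this. It is a sketch under assumptions: the data is synthetic, and the three candidate models and their parameter grids are stand-ins, since the models and grids actually searched are not listed in this README.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data standing in for the shopping features.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Hypothetical candidates and grids, chosen only for illustration.
candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestRegressor(random_state=0),
           {"n_estimators": [50], "max_depth": [3, None]}),
    "gbr": (GradientBoostingRegressor(random_state=0),
            {"n_estimators": [50, 100]}),
}

# Tune each candidate and keep its best estimator, CV score and parameters.
best = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="r2")
    search.fit(X, y)
    best[name] = (search.best_estimator_, search.best_score_, search.best_params_)

# Combine the tuned models and compare against the best single model.
voter = VotingRegressor([(name, est) for name, (est, _, _) in best.items()])
voter_score = cross_val_score(voter, X, y, cv=3, scoring="r2").mean()
best_name = max(best, key=lambda k: best[k][1])

print("Best single model:", best_name, round(best[best_name][1], 3))
print("VotingRegressor mean R2:", round(voter_score, 3))
```

Using the same `cv` and `scoring` for both comparisons keeps the two scores on an equal footing.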

6. Model evaluation

This folder contains the classifier and regressor model evaluations. This includes:

    Classifier model evaluation
  • Evaluating the CatBoost model
  • Comparing the CatBoost with the Random forest model
  • Visualizing Predicted probabilities by class attribute
  • Visualizing the distribution of predicted probability of class 1
  • Sensitivity threshold tuning
  • Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC)
  • Selecting Sensitivity and Specificity from ROC using a function
  • Visualizing sensitivity vs specificity threshold ranges
    Regressor model evaluation
  • Final feature selection using selectfrom on the best model
  • Final model training
  • Validate the final model
  • Feature importance of the top-features model
  • Exporting the evaluated best model
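For the classifier steps above, reading sensitivity and specificity off the ROC curve with a function could be sketched as below. The helper `sensitivity_specificity_at` is hypothetical (not the repository's function), and a LogisticRegression on synthetic data stands in for the CatBoost model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split


def sensitivity_specificity_at(threshold, fpr, tpr, thresholds):
    """Return (sensitivity, specificity) at the ROC point whose
    threshold is closest to the requested one."""
    idx = int(np.argmin(np.abs(thresholds - threshold)))
    return float(tpr[idx]), float(1.0 - fpr[idx])


# Synthetic binary data and a stand-in classifier.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of class 1, then the ROC curve and its AUC.
proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

# Compare a lenient and a strict decision threshold.
sens_low, spec_low = sensitivity_specificity_at(0.2, fpr, tpr, thresholds)
sens_high, spec_high = sensitivity_specificity_at(0.8, fpr, tpr, thresholds)

print(f"AUC: {auc:.3f}")
print(f"threshold 0.2 -> sensitivity {sens_low:.2f}, specificity {spec_low:.2f}")
print(f"threshold 0.8 -> sensitivity {sens_high:.2f}, specificity {spec_high:.2f}")
```

Lowering the threshold can only keep or raise sensitivity and keep or lower specificity, which is the trade-off the threshold tuning step explores.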

7. Explain the models with SHAP

In here the models are explained from a global and local perspective. This includes:

    Classifier model explanation
  • Global fidelity: An explanation of the positive and negative relationships between the features and the target from a holistic model point of view
  • Local fidelity: An explanation of how the model behaves for a single prediction, i.e. the feature-by-feature contribution to the prediction
    Regressor model explanation
  • Global fidelity: An explanation of the positive and negative relationships between the features and the target from a holistic model point of view
  • Local fidelity: An explanation of how the model behaves for a single prediction, i.e. the feature-by-feature contribution to the prediction

8. Models

In here all the classifier, regressor and StandardScaler models are stored ready for use.

9. Deployment pipeline

Inside are all the modules required to generate a new classifier and regressor prediction and to explain the predictions.

This includes:

  • Import the pipeline module output_pipeline to preprocess the data and generate the predictions and SHAP values.
  • Import the fidelity module local_fidelity to generate the local explanation plots for the classifier and regressor models.
  • Display the retrieved data using print()

10. Monitoring

This is where mock monitoring scenarios are developed to mimic real situations.
Two post-deployment monitoring scenarios are considered: no data drift and data drift.
This includes:

    Generating classifier datasets
  • Create classifier reference dataset and append classifier predictions to it.
  • Create classifier current dataset no drift and append classifier predictions to it.
  • Create classifier current dataset with drift and append classifier predictions to it.
  • Run various Evidently preset reports and export them to the Reports folder
    Generating regressor datasets
  • Create regressor reference dataset and append regressor predictions to it.
  • Create regressor current dataset no drift and append regressor predictions to it.
  • Create regressor current dataset with drift and append regressor predictions to it.
  • Run various Evidently preset reports and export them to the Reports folder
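To make the reference / current (no drift) / current (with drift) setup concrete, here is a small sketch of what a drift check compares. It deliberately does not use Evidently's API; a plain two-sample Kolmogorov-Smirnov test from SciPy is swapped in, and the feature values, the size of the shift and the 0.05 cut-off are all assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# One numeric feature (think: total price per receipt) in three datasets.
reference = rng.normal(loc=20.0, scale=5.0, size=500)
current_no_drift = rng.normal(loc=20.0, scale=5.0, size=500)  # same distribution
current_drift = rng.normal(loc=28.0, scale=5.0, size=500)     # shifted mean


def drift_check(ref, cur, alpha=0.05):
    """Flag drift when the KS test rejects 'same distribution' at level alpha."""
    result = ks_2samp(ref, cur)
    return {
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "drift": bool(result.pvalue < alpha),
    }


no_drift_result = drift_check(reference, current_no_drift)
drift_result = drift_check(reference, current_drift)

print("no-drift scenario:", no_drift_result)
print("drift scenario:   ", drift_result)
```

A monitoring report runs a comparison of this kind per feature (and on the predictions) between the reference and current datasets.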

11. Reports

Evidently AI monitoring reports are stored here. This includes:

  • Data quality reports
  • Data drift reports
  • Model performance reports

Next steps
Looking forward, the objectives are to:
  • Rebuild the models to include important features that were left out by the SelectFrom algorithm
  • Consider regularization parameters for the models; none have been specified thus far, so the defaults were applied
  • Add a data dictionary
  • Develop time series models for comparison
  • Build deep learning versions of the models for comparison
  • Fully deploy the models with AWS
