Data source: https://www.kaggle.com/mirichoi0218/insurance
• Explored the dataset with Exploratory Data Analysis (EDA) to find insights
• Performed data processing and feature engineering to prepare the data for modeling
• Built a model to predict insurance cost based on the features
• The sex and region features are almost evenly balanced, while most people are non-smokers and most are obese
• People who smoke and have a BMI above 30 tend to have higher medical costs
• Older people who smoke have higher charges
• People who smoke and are obese have the highest average charges of any group
• Checked for missing values - there are none
• Checked for duplicate rows - there is 1 duplicate, which is removed
• Feature engineering - create a new column weight_status based on the BMI score
• Feature transformation
Encoding sex, region, & weight_status
Ordinal encoding smoker
• Modeling
Separating target & features
Splitting train & test data
Modeling with the Linear Regression, Random Forest, Decision Tree, Ridge, & Lasso algorithms
Finding the best algorithm
Tuning hyperparameters
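
The preprocessing steps listed above (missing-value and duplicate checks, the weight_status column, and the encodings) might be sketched as follows. This is a minimal illustration, not the project's actual code: the rows are made-up toy data that only reuse the column names of the Kaggle file, the BMI cut-offs are the usual WHO categories (an assumption, since the exact bins are not stated here), and one-hot encoding via `pd.get_dummies` is assumed for sex, region, and weight_status.

```python
import numpy as np
import pandas as pd

# Toy rows with the same columns as the Kaggle file (values are made up).
df = pd.DataFrame({
    "age": [19, 45, 33, 33, 60],
    "sex": ["female", "male", "male", "male", "female"],
    "bmi": [27.9, 31.2, 22.7, 22.7, 17.5],
    "children": [0, 2, 0, 0, 1],
    "smoker": ["yes", "no", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northwest", "northeast"],
    "charges": [16884.92, 8500.00, 21984.47, 21984.47, 30000.00],
})

# Check for missing values and fully duplicated rows.
n_missing = int(df.isna().sum().sum())
n_duplicates = int(df.duplicated().sum())

# Drop the duplicated rows, keeping the first occurrence.
df = df.drop_duplicates().reset_index(drop=True)

# Feature engineering: derive weight_status from BMI.
# Assumed bins: <18.5 underweight, 18.5-25 normal, 25-30 overweight, >=30 obese.
bins = [0, 18.5, 25, 30, np.inf]
labels = ["underweight", "normal", "overweight", "obese"]
df["weight_status"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)

# Ordinal encoding for the binary smoker column.
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})

# One-hot encoding (assumed) for the nominal columns.
encoded = pd.get_dummies(df, columns=["sex", "region", "weight_status"])
```

With `right=False` each bin includes its lower edge, so a BMI of exactly 30 is labeled obese.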
Score | LinearRegression | DecisionTree | RandomForest | Ridge |
---|---|---|---|---|
MAE | 4305.20 | 2798.83 | 2608.55 | 4311.10 |
RMSE | 6209.88 | 6067.50 | 4841.88 | 6238.13 |
R2 | 0.77 | 0.78 | 0.78 | 0.86 |
Train Accuracy | 0.74 | 1.0 | 0.97 | 0.74 |
Test Accuracy | 0.77 | 0.78 | 0.86 | 0.77 |
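
Scores like those in the table above could be produced along the following lines. This is a sketch under stated assumptions rather than the project's code: it assumes scikit-learn estimators with default settings, uses synthetic data in place of the insurance features, and assumes the "Train Accuracy" and "Test Accuracy" rows come from the estimator's `score` method, which returns R2 for regressors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared features and the charges target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Split into train and test sets (the 80/20 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        # RMSE is the square root of the mean squared error.
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
        # For regressors, score() is R2 on the given data.
        "train_score": model.score(X_train, y_train),
        "test_score": model.score(X_test, y_test),
    }

for name, scores in results.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

Comparing `train_score` with `test_score` per model is one way to spot overfitting: an unrestricted decision tree typically reaches a train score near 1.0 while scoring lower on the test set.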
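Based on the predictive modeling, Linear Regression is chosen as the best-fitting algorithm, with an MAE of 4305.20, an RMSE of 6209.88, and an R2 of 0.77. Its train and test accuracy are close (0.74 and 0.77), so the model does not overfit. Random Forest has lower errors (MAE 2608.55, RMSE 4841.88), but it and Decision Tree score much higher on the training data (0.97 and 1.0) than on the test data (0.86 and 0.78).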