Use a dataset of historical lending activity from a peer-to-peer lending services company to build a model that can identify the creditworthiness of borrowers.
Python and supervised machine learning are used to train and evaluate models and select the best one for classifying loan risk, where a healthy loan has a status of `0` and a high-risk loan has a status of `1`.
Please see the file `credit_risk_analysis_report.md` for full details: a description of the analysis, bulleted lists of the balanced accuracy, precision, and recall scores for each machine learning model, and a summary of the results with a recommendation on which model to use.
- Read the `lending_data.csv` data from the Resources folder into a Pandas DataFrame.
- Create the labels set (`y`) from the 'loan_status' column and the features (`X`) DataFrame from the remaining columns.
- Check the balance of the labels variable (`y`) with the `value_counts` function.
- Split the data into training and testing datasets by using `train_test_split`.
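The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's notebook: the small DataFrame stands in for `lending_data.csv` so the snippet runs on its own, the feature column names and values are hypothetical (only 'loan_status' is named in this README), and `random_state=1` and `stratify=y` are assumed choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# In the project the DataFrame would come from the CSV in the Resources folder:
#     df = pd.read_csv("Resources/lending_data.csv")
# A tiny stand-in frame is used here; its feature columns and values are hypothetical.
df = pd.DataFrame({
    "loan_size": [10000, 9500, 11000, 8700, 20000, 9900,
                  10400, 19500, 9100, 10800, 9700, 21000],
    "interest_rate": [7.1, 6.9, 7.4, 6.5, 11.2, 7.0,
                      7.2, 11.0, 6.7, 7.3, 6.9, 11.6],
    "loan_status": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
})

# Labels (y) come from 'loan_status'; features (X) are every other column.
y = df["loan_status"]
X = df.drop(columns="loan_status")

# Check how balanced the labels are.
print(y.value_counts())

# Split into training and testing sets (default test size is 25%).
# stratify=y keeps the class ratio similar in both splits (an assumed choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y
)
```

On an imbalanced dataset, the `value_counts` output is the first place the imbalance between healthy and high-risk loans becomes visible.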
- Fit a logistic regression model using the training data (`X_train` and `y_train`).
- Save predictions on the testing data labels using the testing feature data (`X_test`) and the fitted model.
- Evaluate the model's performance by completing the following:
  - Calculate the accuracy score of the model.
  - Generate a confusion matrix.
  - Print the classification report.
How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?
With a balanced accuracy score of 95%, the model appears to perform well; however, this is likely due to an imbalanced dataset. In the dataset, healthy (low-risk) loans far outnumber high-risk loans, so the model is better at predicting a healthy loan status than a high-risk one. Because the dataset is imbalanced, the oversampled dataset also needs to be examined.
Based on the classification report, the model predicts healthy loans (`0`) with 100% precision and 99% recall. For high-risk loans (`1`), precision is only 85% and recall is 91%: 85% of the loans flagged as high-risk are truly high-risk, and 91% of the truly high-risk loans are caught.
Overall, precision and recall are extremely important because the cost of misclassification can be high. High precision minimizes false positives, which can lead to a loss of potential customers. High recall minimizes false negatives, which can lead to significant financial losses.
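To make the metrics concrete, here is a small sketch of how precision and recall follow from confusion-matrix counts, and how balanced accuracy is the mean of the per-class recalls. The counts in the first example are hypothetical; the recalls in the second are the rounded values reported above for the model fit on the original data.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of everything flagged positive, the share that was right
    recall = tp / (tp + fn)     # of everything actually positive, the share that was found
    return precision, recall


def balanced_accuracy(recall_by_class):
    """Balanced accuracy is the unweighted mean of the per-class recalls."""
    return sum(recall_by_class) / len(recall_by_class)


# Hypothetical counts for the high-risk class (1), not taken from the project.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={precision:.2f}, recall={recall:.2f}")

# Rounded recalls reported in this README: 0.99 (healthy) and 0.91 (high-risk).
original = balanced_accuracy([0.99, 0.91])
print(f"balanced accuracy={original:.2f}")
```

The second calculation gives about 0.95, which agrees with the 95% balanced accuracy reported for the original data.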
- Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Confirm that the labels have an equal number of data points.
- Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.
- Evaluate the model's performance by completing the following:
  - Calculate the accuracy score of the model.
  - Generate a confusion matrix.
  - Print the classification report.
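To illustrate what random oversampling does, the pure-Python sketch below duplicates randomly chosen minority-class rows until both classes are the same size. It is not the imbalanced-learn implementation: in the project, the resampling is done with `RandomOverSampler`, whose `fit_resample(X, y)` method returns the resampled features and labels. The helper name `random_oversample` and the sample rows are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        # Sample with replacement to fill the gap (no extra rows for the largest class).
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical training rows: four healthy loans (0) and one high-risk loan (1).
rows = [[10000, 7.1], [9500, 6.9], [11000, 7.4], [8700, 6.5], [20000, 11.2]]
labels = [0, 0, 0, 0, 1]

X_res, y_res = random_oversample(rows, labels)
resampled_counts = Counter(y_res)
print(resampled_counts)  # both classes now have the same number of rows
```

Resampling is normally applied to the training split only, so that the testing data keeps the original class distribution when the model is evaluated.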
How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?
Overall, based on the results of the machine learning models, the oversampled model is the model of choice. It achieved a balanced accuracy score of 99%, an improvement over the 95% of the logistic regression model fit on the original data. It identifies healthy loans just as well: precision stays at 100% and recall at 99%. More importantly, the oversampled model performs better on recall for high-risk loans. Precision for high-risk loans is similar (84% versus 85% on the original data), but recall rises to 99% from 91%.
The oversampled model captures more of the true positives, meaning it is more effective at identifying high-risk loans, as its higher recall shows. Because the number of false negatives drops significantly, the oversampled model classifies healthy and high-risk loans more accurately overall.
- Dataset provided by edX UofT Data Analytics, which had been generated by Trilogy Education Services, LLC. This is intended for educational purposes only.