This repository contains a data analysis project that focuses on predicting loan approval status based on various applicant attributes. The dataset used for this analysis consists of information such as loan ID, gender, marital status, number of dependents, education level, employment status, applicant income, co-applicant income, loan amount, loan amount term, credit history, property area, and loan status.
The dataset contains the following columns:
The goal of this project is to develop a model that can accurately predict whether a loan application will be approved or not based on the given attributes. The analysis includes data preprocessing, exploratory data analysis, feature engineering, model training, and evaluation.
loan-approval-prediction.ipynb: Jupyter Notebook containing the complete analysis code loan_approval-prediction.py: Python script with the analysis code
To run this analysis on your local machine, follow these steps:
The following Python libraries are required to run the analysis:
The project involved building a model to predict loan default risk based on the available dataset. The following steps were followed:
The initial models yielded varying results, with logistic regression achieving an accuracy of 78% and an F1-score of 0.86 for default cases. However, the model showed some limitations, such as relatively low precision and recall for default cases. To address these limitations, an ensemble model based on XGBoost was developed and fine-tuned.
The final XGBoost model achieved the following performance on the testing set:
Accuracy: 100% Precision: 1.0 Recall: 1.0 F1-score: 1.0 The XGBoost model demonstrated improved performance compared to the initial logistic regression model.
In this analysis, a dataset containing loan application information was explored, preprocessed, and used to develop a model for predicting loan default risk. The final XGBoost model achieved a high level of accuracy and performed better than the initial logistic regression model.
It’s important to note that the performance of the model can be further enhanced by obtaining a larger and more diverse dataset. Additionally, ongoing monitoring and updating of the model with new data will help ensure its continued accuracy and effectiveness in predicting loan default risk.