loan-approval-prediction

Loan Approval Prediction

This repository contains a data analysis project that focuses on predicting loan approval status based on various applicant attributes. The dataset used for this analysis consists of information such as loan ID, gender, marital status, number of dependents, education level, employment status, applicant income, co-applicant income, loan amount, loan amount term, credit history, property area, and loan status.

Dataset

The dataset contains the following columns:

loan_id: Unique identifier for each loan application
gender: Gender of the applicant (Male or Female)
married: Marital status of the applicant (Yes or No)
dependents: Number of dependents the applicant has
education: Education level of the applicant (Graduate or Not Graduate)
self_employed: Employment status of the applicant (Yes or No)
applicantincome: Income of the applicant
coapplicantincome: Income of the co-applicant
loanamount: Loan amount requested by the applicant
loan_amount_term: Term of the loan in months
credit_history: Credit history of the applicant (1: Good, 0: Bad)
property_area: Area of the property (Rural, Semiurban, or Urban)
loan_status: Loan approval status (Y: Approved, N: Not Approved)

Analysis Overview

The goal of this project is to develop a model that can accurately predict whether a loan application will be approved or not based on the given attributes. The analysis includes data preprocessing, exploratory data analysis, feature engineering, model training, and evaluation.

Files in the Repository

loan-approval-prediction.ipynb: Jupyter Notebook containing the complete analysis code loan_approval-prediction.py: Python script with the analysis code

To run this analysis on your local machine, follow these steps:

Dependencies

The following Python libraries are required to run the analysis:

pandas
numpy
matplotlib
seaborn
sklearn
xgboost

Model Development

The project involved building a model to predict loan default risk based on the available dataset. The following steps were followed:

Splitting the dataset into training and testing sets.
Applying machine learning algorithms logistic regression, and XGBoost, to train the models on the training set.
Evaluating the performance of the models using various metrics, including accuracy, precision, recall, and F1-score.
Selecting the best-performing model based on the evaluation metrics.
Fine-tuning the selected model using hyperparameter optimization techniques, such as grid search or random search, to improve its performance.
Assessing the final model’s performance on the testing set to determine its effectiveness in predicting loan default risk.

Model Results

The initial models yielded varying results, with logistic regression achieving an accuracy of 78% and an F1-score of 0.86 for default cases. However, the model showed some limitations, such as relatively low precision and recall for default cases. To address these limitations, an ensemble model based on XGBoost was developed and fine-tuned.

The final XGBoost model achieved the following performance on the testing set:

Accuracy: 100% Precision: 1.0 Recall: 1.0 F1-score: 1.0 The XGBoost model demonstrated improved performance compared to the initial logistic regression model.

Conclusion

In this analysis, a dataset containing loan application information was explored, preprocessed, and used to develop a model for predicting loan default risk. The final XGBoost model achieved a high level of accuracy and performed better than the initial logistic regression model.

It’s important to note that the performance of the model can be further enhanced by obtaining a larger and more diverse dataset. Additionally, ongoing monitoring and updating of the model with new data will help ensure its continued accuracy and effectiveness in predicting loan default risk.