Authors: Kosisochukwu Henry Ukpabi, Farouk Lawan Gambo, Aminu Abdullahi, Suleiman Ibrahim
Abstract: Loan default poses a significant threat to the sustainability of financial institutions, necessitating intelligent, data-driven systems for early risk detection. This research presents a robust and interpretable machine learning framework for predicting loan default risk using a real-world financial dataset of 255,347 anonymized loan records with a pronounced class imbalance (11.6% default rate). To address the skewed class distribution, the Synthetic Minority Oversampling Technique (SMOTE) was applied to the training set, enhancing the model's sensitivity to defaulters. Four supervised learning algorithms, namely Logistic Regression, Support Vector Machine (SVM), Random Forest, and Extreme Gradient Boosting (XGBoost), were implemented and rigorously evaluated using stratified 5-fold cross-validation. Performance metrics included Accuracy, Precision, Recall, F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC), with particular emphasis on metrics sensitive to class imbalance. Among the models tested, the Random Forest classifier achieved the best overall performance, attaining a test accuracy of 96.26%, an F1-Score of 0.8014, and an AUC-ROC of 0.9215, thereby offering a balanced and reliable prediction of default risk. To ensure model transparency and support regulatory compliance, explainable AI techniques were applied; they identified HasMortgage, EmploymentType, and LoanPurpose as key drivers of default risk, aligning with domain knowledge and enhancing stakeholder trust. The study concludes that combining ensemble machine learning models with class imbalance handling and explainable AI techniques offers a practical and effective solution for credit risk assessment. Recommendations are made for financial institutions, data scientists, and policymakers to adopt interpretable, fair, and performance-optimized predictive systems.
This research contributes to the growing body of literature on responsible AI in finance and lays a foundation for future advancements in ethical and data-driven credit decision-making.
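
Illustrative sketch (not the authors' implementation): the abstract states that SMOTE was applied to the training set but does not give implementation details. The short Python example below shows the core idea of SMOTE, which is to create synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The function names `smote_sketch` and `oversample_training_set`, the toy data, and the choice to balance the classes exactly are assumptions made for illustration only. In practice a maintained library such as imbalanced-learn would normally be used.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Simplified illustration; requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def oversample_training_set(X_train, y_train, minority_label=1, k=5, seed=0):
    """Balance the TRAINING data only (assumed target: equal class counts).
    Validation and test data are left untouched."""
    minority = [x for x, y in zip(X_train, y_train) if y == minority_label]
    n_majority = len(y_train) - len(minority)
    n_new = max(0, n_majority - len(minority))
    new_points = smote_sketch(minority, n_new, k=k, seed=seed)
    X_bal = list(X_train) + new_points
    y_bal = list(y_train) + [minority_label] * len(new_points)
    return X_bal, y_bal


# Hypothetical toy training set: 6 non-defaulters (0) and 3 defaulters (1)
X_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5),
           (0.2, 0.8), (5.0, 5.0), (5.5, 5.0), (5.0, 5.5)]
y_train = [0] * 6 + [1] * 3

X_bal, y_bal = oversample_training_set(X_train, y_train, k=2)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

Oversampling only the training portion matters when combined with cross-validation: if synthetic points were generated before splitting, near-copies of validation samples could leak into training and inflate the reported scores.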