This notebook builds a credit risk model on a dataset from a Zimbabwe-based microfinance company. It covers the full data science lifecycle, from problem formulation through model optimization and evaluation, with the goal of predicting borrower default probabilities. It opens by placing credit risk in its financial context: how defaults affect lenders, and why regulatory frameworks such as IFRS 9, which mandates Expected Credit Losses (ECL) for impairment provisioning, make robust models of borrower creditworthiness a practical necessity.

The dataset comprises 100,000 loan records with 21 features covering borrower demographics, employment details, loan specifics, and repayment history. The data has been anonymized to protect client privacy. The notebook describes each feature (for example loan amount, outstanding balance, interest rate, and borrower age) and flags data quality issues such as duplicate columns (e.g., "number_of_defaults" and "number_of_defaults.1").

Data preprocessing takes up a substantial portion of the notebook. Rows with missing categorical values are dropped, since they make up only a small percentage of the data. Outliers are removed with the IQR method on three key numerical features: loan amount, outstanding balance, and salary. Duplicate columns and irrelevant ones, such as unique identifiers, are also dropped.

Feature engineering aims to improve model interpretability and performance.
Locations are aggregated into Zimbabwean provinces to reduce dimensionality, and job titles are grouped into high-paying, less-paying, and other categories. Categorical variables are label encoded, and numerical features are standardized with scikit-learn's StandardScaler; the fitted scaler is saved so the same transformation can be applied to future data.

Exploratory data analysis shows that numerical variables such as loan amount and salary are right-skewed, prompting log transformations to reduce the skew. Categorical distributions highlight demographic trends, such as a majority of married borrowers and Harare as the most common location. Correlation analysis uncovers multicollinearity between features such as age and age.1, leading to the removal of the redundant variables. The target variable, loan status, is heavily imbalanced (85% non-defaults), which calls for resampling techniques.

The modeling phase uses a stratified train-test split to preserve class proportions. Five classifiers (Random Forest, Balanced Random Forest, AdaBoost, XGBoost, and Easy Ensemble) are evaluated against eight resampling methods, including SMOTE variants, oversampling, and undersampling, to address the imbalance. Performance is assessed using accuracy, precision, recall, F1-score, AUC, MCC, balanced accuracy, geometric mean, Cohen's Kappa, Youden's Index, and specificity.

Visualizations support the comparison: confusion matrices illustrate prediction errors, radar charts compare metrics across techniques, and parallel-coordinates plots highlight trade-offs. Time-performance analyses evaluate computational efficiency, and Pareto frontiers identify optimal configurations.

Hyperparameter optimization uses GridSearchCV with 5-fold cross-validation, tuning parameters such as n_estimators and learning_rate for each classifier.
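
The stratified split and grid search described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic (generated with roughly 85% non-defaults to mimic the imbalance), AdaBoost stands in for the five classifiers, and the grid values, the `roc_auc` scoring choice, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the loan data: roughly 85% non-defaults (class 0).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid over the two parameters the text mentions.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}

search = GridSearchCV(
    estimator=AdaBoostClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # assumed metric; the notebook may optimize another
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # 5-fold CV
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_auc = search.score(X_test, y_test)  # AUC of the refit best model
print(best_params, round(test_auc, 3))
```

Using stratified folds inside the search mirrors the stratified train-test split, so each fold sees a similar share of defaults. When a resampler such as SMOTE is part of the setup, it is generally applied inside each training fold only, so that synthetic samples do not leak into the validation fold.
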
Bootstrap confidence intervals are used to assess model stability, and 50 simulations with different random states check robustness to data variability.
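
One common way to obtain a bootstrap confidence interval for a test-set metric is the percentile method, sketched below. The notebook's exact procedure is not shown here, so this is an assumption-laden illustration: the data is synthetic, AUC is chosen as the metric, and the 500 resamples and 95% level are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (about 85% class 0).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted default probability

rng = np.random.default_rng(0)
n = len(y_test)
aucs = []
for _ in range(500):  # number of bootstrap resamples (assumed)
    idx = rng.integers(0, n, size=n)  # sample test rows with replacement
    if np.unique(y_test[idx]).size < 2:
        continue  # AUC is undefined when a resample has only one class
    aucs.append(roc_auc_score(y_test[idx], scores[idx]))

# Percentile interval: the middle 95% of the bootstrap distribution.
ci_low, ci_high = np.percentile(aucs, [2.5, 97.5])
point_auc = roc_auc_score(y_test, scores)
print(f"AUC {point_auc:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

A narrow interval suggests the metric is stable under resampling of the test set. Repeating the whole train-and-evaluate loop with different `random_state` values, as the 50 simulations do, additionally captures variability from the split and the model's own randomness.
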
Tichaona123/Credit-Risk-modelling--Zimbabwe-Data