Predict the probability of default for each user ID in risk modeling.
default = 1 means the user defaulted; default = 0 otherwise.
This is an imbalanced binary classification problem.
Expected Workflow
Variables (total = 43):
uuid: text user ID
default (the target): boolean (0 or 1)
Categorical and numerical features are defined in default_modeling/default_modeling/utils/preproc.pyx (function feature_definition)
Adjustment:
To run the experiment on your own data for binary classification:
Replace the CSV files in train_data and test_data with your own CSVs. (Optional: also replace the test file test_sample_1.csv in default_modeling/default_modeling/tests/data/ for the unit tests.) Each row of your CSV should correspond to a unique user ID.
Redefine the categorical and numerical features in default_modeling/default_modeling/utils/preproc.pyx (function feature_definition) to match your data.
Change TARGET=default in the Dockerfile to TARGET={your target variable}.
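The column names below are purely illustrative (they are not taken from this repository); this is a minimal sketch, in plain Python for readability, of the shape feature_definition in preproc.pyx is expected to have:

```python
# Hypothetical sketch of feature_definition -- the real lists live in
# default_modeling/default_modeling/utils/preproc.pyx. Replace the example
# column names with the columns of your own CSV.
def feature_definition():
    categorical_features = ["account_status", "merchant_category"]  # example names
    numerical_features = ["age", "avg_payment_span", "num_active_loans"]  # example names
    return categorical_features, numerical_features

cat_cols, num_cols = feature_definition()
print(len(cat_cols) + len(num_cols))  # total modeled features in this toy sketch
```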
Found the following test data
default_modeling/tests/data/test_sample_1.csv
..
----------------------------------------------------------------------
Ran 2 tests in 0.772s
OK
Train with the selected file, e.g. train_data/TRAIN_SET_1.csv. If no hyperparameters are declared (e.g. n_estimators, max_depth, ...), training uses the default hyperparameters. Remember to mount the local train_data and model folders.
extracting arguments
Namespace(max_depth=15, min_samples_leaf=20, model_dir='./model', model_name='risk_model', n_estimators=200, random_state=1234, target='default', train_file='train_set_1.csv', train_folder='./train_data')
Training Data at ./train_data/train_set_1.csv
('Total Input Features', 39)
('class weight', {0: 0.5074062934696794, 1: 34.255076142131976})
Found existing model at: ./model/risk_model.joblib.
Overwriting ...
Congratulation! Saving model at ./model/risk_model.joblib. Finish after 3.684312582015991 s
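The "class weight" line in the log above is consistent with "balanced" class weighting, i.e. n_samples / (n_classes * count_per_class). A small self-contained sketch with made-up labels (the counts below are toy values, not the training set's):

```python
from collections import Counter

# Toy imbalanced labels: 1.5% positives, similar in spirit to the log above.
y = [0] * 985 + [1] * 15

# "Balanced" weighting: n_samples / (n_classes * count_per_class),
# so the rare class gets a proportionally larger weight.
counts = Counter(y)
n, k = len(y), len(counts)
class_weight = {c: n / (k * counts[c]) for c in sorted(counts)}
print(class_weight)  # approx {0: 0.5076, 1: 33.33}
```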
Then predict on the selected file, e.g. test_data/test_set_1.csv. This time, mount the local test_data and model folders.
extracting arguments
Namespace(model_dir='./model', model_name='risk_model', target='default', test_file='test_set_1.csv', test_folder='./test_data')
Found model at: ./model/risk_model.joblib
Predicting test_set_1.csv ....
Finish after 0.549715518951416 s
...to csv ./test_data/test_set_1.csv
The predictions are now in the local test_data folder. Evaluate with Metrics
The decision threshold on the probability of default depends on credit policy. There could be several cutoff points, or a mathematical cost function, rather than a single fixed decision threshold. Binary metrics like F1, recall, or precision are therefore not meaningful here, and the output should be a probability prediction.
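As a toy illustration of a cost-based cutoff (all scores and costs below are made up, not from this project), one can pick the threshold that minimizes expected misclassification cost instead of fixing it at 0.5:

```python
# Toy cost-based threshold selection: a false negative (missed defaulter)
# is assumed to cost 10x a false positive (rejected good customer).
def best_threshold(scores, labels, cost_fp=1.0, cost_fn=10.0):
    """Return the cutoff that minimizes total misclassification cost."""
    best_t, best_cost = 0.0, float("inf")
    for t in sorted(set(scores)) + [1.01]:  # candidate cutoffs
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

scores = [0.05, 0.1, 0.2, 0.4, 0.7, 0.9]  # made-up predicted P(default)
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(scores, labels))  # -> (0.4, 0.0)
```

Changing cost_fp or cost_fn shifts the chosen cutoff, which is exactly why a fixed threshold is not assumed here.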
The KS statistic (between P(prediction | truth = 1) and P(prediction | truth = 0), quantifying the distance between the two classes) is used to evaluate the model.
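The KS statistic here is the maximum gap between the two empirical CDFs of predicted probabilities. In practice scipy.stats.ks_2samp computes it (with a p-value), but the core idea fits in a few lines; the scores below are made up:

```python
# KS statistic: maximum gap between the empirical CDFs of predicted
# default probabilities for defaulters vs. non-defaulters.
def ks_statistic(scores_pos, scores_neg):
    cutoffs = sorted(set(scores_pos) | set(scores_neg))

    def ecdf(sample, x):  # fraction of the sample at or below x
        return sum(1 for s in sample if s <= x) / len(sample)

    return max(abs(ecdf(scores_pos, c) - ecdf(scores_neg, c)) for c in cutoffs)

defaulters = [0.6, 0.7, 0.8, 0.9]      # toy P(default) where truth = 1
non_defaulters = [0.1, 0.2, 0.3, 0.7]  # toy P(default) where truth = 0
print(ks_statistic(defaulters, non_defaulters))  # -> 0.75
```

A well-separated model pushes the two score distributions apart, driving the KS statistic toward 1.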
Left plot: ROC AUC Curve
Right plot: Normalized KS Distribution of 2 types of users:
class 0: non-default
class 1: default
Conclusions & Future Work
With a KS score of 0.66 and a small p-value, the predictor can properly distinguish between default and non-default users (the test is significant).
Visually, there is a clear gap between the two classes in the KS distribution plot.
Future work: host the model behind an AWS SageMaker endpoint.
About
Predict the probability of default for each user ID in risk modeling.