Predict the probability of default for each user ID in risk modeling.
default = 1 means the user defaulted; default = 0 otherwise.
This is an imbalanced binary classification problem.
Expected Workflow
Variables (total = 43):
uuid: text user ID
default (the target): boolean (0 or 1)
Categorical and numerical features are defined in default_modeling/default_modeling/utils/preproc.pyx (function feature_definition)
Adjustment:
To run the experiment on your own data for binary classification:
Replace the CSV files in train_data and test_data with your own CSVs. (Optional: also replace the test file test_sample_1.csv in default_modeling/default_modeling/tests/data/ for the unit tests.) Each row of your CSV should correspond to a unique user ID.
Redefine the categorical and numerical features in default_modeling/default_modeling/utils/preproc.pyx (function feature_definition) to match your data.
Change TARGET=default in the Dockerfile to TARGET={your target variable}.
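The column names below are purely illustrative (they are not taken from this repository); this is a minimal sketch, in plain Python for readability, of the shape feature_definition in preproc.pyx is expected to have:

```python
# Hypothetical sketch of feature_definition -- the real lists live in
# default_modeling/default_modeling/utils/preproc.pyx. Replace the example
# column names with the columns of your own CSV.
def feature_definition():
    categorical_features = ["account_status", "merchant_category"]  # example names
    numerical_features = ["age", "avg_payment_span", "num_active_loans"]  # example names
    return categorical_features, numerical_features

cat_cols, num_cols = feature_definition()
print(len(cat_cols) + len(num_cols))  # total modeled features in this toy sketch
```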
Found the following test data
default_modeling/tests/data/test_sample_1.csv
..
----------------------------------------------------------------------
Ran 2 tests in 0.772s
OK
Train with the selected file, e.g. train_data/TRAIN_SET_1.csv. If no hyperparameters are declared (e.g. n_estimators, max_depth, ...), training uses the default hyperparameters. Remember to mount the local train_data and model folders.
extracting arguments
Namespace(max_depth=15, min_samples_leaf=20, model_dir='./model', model_name='risk_model', n_estimators=200, random_state=1234, target='default', train_file='train_set_1.csv', train_folder='./train_data')
Training Data at ./train_data/train_set_1.csv
('Total Input Features', 39)
('class weight', {0: 0.5074062934696794, 1: 34.255076142131976})
Found existing model at: ./model/risk_model.joblib.
Overwriting ...
Congratulation! Saving model at ./model/risk_model.joblib. Finish after 3.684312582015991 s
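The "class weight" line in the log above is consistent with "balanced" class weighting, i.e. n_samples / (n_classes * count_per_class). A small self-contained sketch with made-up labels (the counts below are toy values, not the training set's):

```python
from collections import Counter

# Toy imbalanced labels: 1.5% positives, similar in spirit to the log above.
y = [0] * 985 + [1] * 15

# "Balanced" weighting: n_samples / (n_classes * count_per_class),
# so the rare class gets a proportionally larger weight.
counts = Counter(y)
n, k = len(y), len(counts)
class_weight = {c: n / (k * counts[c]) for c in sorted(counts)}
print(class_weight)  # approx {0: 0.5076, 1: 33.33}
```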
Then predict on the selected file, e.g. test_data/test_set_1.csv. This time, mount the local test_data and model folders.
extracting arguments
Namespace(model_dir='./model', model_name='risk_model', target='default', test_file='test_set_1.csv', test_folder='./test_data')
Found model at: ./model/risk_model.joblib
Predicting test_set_1.csv ....
Finish after 0.549715518951416 s
...to csv ./test_data/test_set_1.csv
The predictions are now in the local test_data folder. Evaluate with Metrics
The decision threshold on the probability of default depends on credit policy. There could be several cutoff points, or a mathematical cost function, rather than a single fixed decision threshold. Binary metrics like F1, recall, or precision are therefore not meaningful here, and the output should be a probability prediction.
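As a toy illustration of a cost-based cutoff (all scores and costs below are made up, not from this project), one can pick the threshold that minimizes expected misclassification cost instead of fixing it at 0.5:

```python
# Toy cost-based threshold selection: a false negative (missed defaulter)
# is assumed to cost 10x a false positive (rejected good customer).
def best_threshold(scores, labels, cost_fp=1.0, cost_fn=10.0):
    """Return the cutoff that minimizes total misclassification cost."""
    best_t, best_cost = 0.0, float("inf")
    for t in sorted(set(scores)) + [1.01]:  # candidate cutoffs
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

scores = [0.05, 0.1, 0.2, 0.4, 0.7, 0.9]  # made-up predicted P(default)
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(scores, labels))  # -> (0.4, 0.0)
```

Changing cost_fp or cost_fn shifts the chosen cutoff, which is exactly why a fixed threshold is not assumed here.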
The KS statistic (between P(prediction | truth = 1) and P(prediction | truth = 0), quantifying the distance between the two classes) is used to evaluate the model.
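The KS statistic here is the maximum gap between the two empirical CDFs of predicted probabilities. In practice scipy.stats.ks_2samp computes it (with a p-value), but the core idea fits in a few lines; the scores below are made up:

```python
# KS statistic: maximum gap between the empirical CDFs of predicted
# default probabilities for defaulters vs. non-defaulters.
def ks_statistic(scores_pos, scores_neg):
    cutoffs = sorted(set(scores_pos) | set(scores_neg))

    def ecdf(sample, x):  # fraction of the sample at or below x
        return sum(1 for s in sample if s <= x) / len(sample)

    return max(abs(ecdf(scores_pos, c) - ecdf(scores_neg, c)) for c in cutoffs)

defaulters = [0.6, 0.7, 0.8, 0.9]      # toy P(default) where truth = 1
non_defaulters = [0.1, 0.2, 0.3, 0.7]  # toy P(default) where truth = 0
print(ks_statistic(defaulters, non_defaulters))  # -> 0.75
```

A well-separated model pushes the two score distributions apart, driving the KS statistic toward 1.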
Left plot: ROC AUC Curve
Right plot: Normalized KS Distribution of 2 types of users:
class 0: non-default
class 1: default
Conclusions & Future Work
With a KS score of 0.66 and a small p-value, the predictor can properly distinguish between default and non-default users (the test is significant).
Visually, there is a clear gap between the two classes in the KS distribution plot.
Future work: host the model behind an AWS SageMaker endpoint.
About
Predict the probability of default for each user ID in risk modeling.