kedro-starters-sklearn

This repository provides the following starter templates, preserved from the original repository and updated for kedro==1.3.1.

  • sklearn-iris trains a Logistic Regression model using Scikit-learn.
  • sklearn-mlflow-iris adds experiment tracking using MLflow.

Pipeline visualized by Kedro-viz

sklearn-iris template

Iris dataset

The Iris dataset is included and used by default.

  • Modification: the species label is encoded as 0 for setosa and 1 for versicolor, and virginica samples are removed.
  • Split: for each species, the first 25 samples are included in train.csv, and the last 25 samples are included in test.csv.
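The modification and split described above can be sketched in plain Python. This is a minimal illustration, not the script used to produce the bundled files: the function name `encode_and_split`, the `species` key, and the toy `id` column are assumptions made for the example.

```python
from collections import defaultdict

# Label encoding described above; virginica has no entry and is dropped.
SPECIES_TO_LABEL = {"setosa": 0, "versicolor": 1}


def encode_and_split(rows, n_train=25, n_test=25):
    """Return (train, test) lists built per species from rows of dicts.

    Each row is assumed to carry a "species" key (hypothetical column name).
    """
    by_species = defaultdict(list)
    for row in rows:
        label = SPECIES_TO_LABEL.get(row["species"])
        if label is None:
            continue  # e.g. virginica: removed from the dataset
        by_species[row["species"]].append({**row, "species": label})

    train, test = [], []
    for samples in by_species.values():
        train.extend(samples[:n_train])  # first 25 samples of this species
        test.extend(samples[-n_test:])   # last 25 samples of this species
    return train, test


# Toy input: 50 rows per species, with a hypothetical "id" column.
rows = [
    {"id": i, "species": name}
    for name in ("setosa", "versicolor", "virginica")
    for i in range(50)
]
train, test = encode_and_split(rows)
```

With 50 samples per species, the first 25 and last 25 do not overlap, so every remaining sample lands in exactly one of the two files.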

How to use

  1. Install dependencies.

    pip install "kedro==1.3.1" pandas scikit-learn

  2. Generate your Kedro starter project from the sklearn-iris directory.

    kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-iris

    As explained in the Kedro documentation, enter project_name, repo_name, and python_package.

    Note: choose a unique Python package name and avoid generic names such as test or sklearn that are already used by another package. To list the importable packages, run python -c "help('modules')".

  3. Change the current directory to the generated project directory.

    cd /path/to/project/directory

  4. Install project dependencies and run the project.

    pip install -r requirements.txt
    kedro run
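The note in step 2 about choosing a unique package name can also be checked programmatically before running kedro new. The sketch below uses only the standard library; the helper name `is_name_available` and the candidate names are hypothetical, and the check only covers packages importable in the current environment.

```python
import importlib.util
import keyword


def is_name_available(name: str) -> bool:
    """Return True if `name` is a valid identifier that no importable module uses."""
    if not name.isidentifier() or keyword.iskeyword(name):
        return False
    # find_spec returns None when no top-level module of that name can be found.
    return importlib.util.find_spec(name) is None


# Hypothetical candidates: a generic name and a more specific one.
for candidate in ("test", "my_iris_project_2024"):
    print(candidate, is_name_available(candidate))
```

Run it in the same environment where the project will be installed, since the result depends on which packages are present there.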

Option to use Kaggle Titanic dataset

  1. Download the Kaggle Titanic dataset.
  2. Replace train.csv and test.csv in the /path/to/project/directory/data/01_raw directory with the downloaded files.
  3. Modify /path/to/project/directory/conf/base/parameters.yml to set parameters appropriate for the dataset (they are commented out by default).

sklearn-mlflow-iris template

This template integrates MLflow into Kedro using PipelineX. Even without writing MLflow code, you can:

  • configure MLflow Tracking
  • log the inputs and outputs of Python functions used as Kedro nodes, either as parameters (for example, the features used to train the model) or as metrics (for example, F1 score)
  • log execution time for each Kedro node and dataset loading/saving as metrics
  • log artifacts such as models, execution time Gantt charts visualized by Plotly, and parameters.yml

In this template, MLflow logging is configured in Python code at src/<python_package>/hooks.py.
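To illustrate the idea of recording execution time per node as a metric, here is a conceptual sketch using only the standard library. It is not how PipelineX implements the feature (the template relies on Kedro hooks configured in hooks.py); the `timed` decorator, the `metrics` dict, and the `<name>_time` key format are assumptions made for the example.

```python
import time
from functools import wraps

# Stand-in for a metrics store; with MLflow the values would go to the tracking server.
metrics = {}


def timed(func):
    """Record the wall-clock time of each call under '<function name>_time'."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Stored even if the function raises, so failed runs are timed too.
            metrics[f"{func.__name__}_time"] = time.perf_counter() - start
    return wrapper


@timed
def train_model(features):
    # Placeholder for a node function; returns the number of features it received.
    return len(features)


result = train_model(["sepal_length", "petal_width"])
```

After the call, `metrics` holds one entry for `train_model`, which is the kind of value the template logs for each node without any timing code in the node functions themselves.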

See here for details.

How to use

  1. Install dependencies.

    pip install "kedro==1.3.1" pandas scikit-learn mlflow "pipelinex==0.8.0" plotly

  2. Generate your Kedro starter project from the sklearn-mlflow-iris directory.

    kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-mlflow-iris

  3. Follow the same steps as the sklearn-iris template.

Access MLflow web UI

To access the MLflow web UI, launch the MLflow server.

    mlflow server --host 127.0.0.1 --port 8080 --backend-store-uri sqlite:///mlruns/sqlite.db --default-artifact-root ./mlruns

Logged metrics shown in MLflow's UI

Gantt chart for execution time, generated using Plotly, shown in MLflow's UI

Notes

  • Both starters preserve the repo's original Iris-focused examples.
  • The MLflow starter keeps the PipelineX-based hook integration, pinned to pipelinex==0.8.0.

About

Kedro starter templates using Scikit-learn and optionally MLflow
