A DeepSpeed-based framework for distributed training and inference of language models.

## Features

- DeepSpeed Integration: Full support for ZeRO stages 0-3 for memory-efficient distributed training
- PEFT/LoRA Support: Built-in support for parameter-efficient fine-tuning methods
- Multi-GPU Training: Seamless distributed training across multiple GPUs
- Flexible Pipeline Architecture: Customizable data processing, forward functions, and model building
- Logging Integration: TensorBoard and Weights & Biases support out of the box
- Checkpoint Management: Automatic model saving with configurable intervals and metric monitoring

## Requirements

- Python 3.10 or higher
- CUDA-capable GPU (for training)
- PyTorch 2.0+
## Installation

Install from the repository:

```bash
git clone https://github.com/SakanaAI/ike.git
cd ike
pip install -e .
```

For development (includes testing and linting tools):

```bash
pip install -e ".[dev]"
```

For flash attention support:

```bash
pip install -e ".[flash]"
```

## Quick Start

Define a data processor:

```python
from ike import DataProcessor

class MyDataProcessor(DataProcessor):
    def line2data(self, line: dict) -> dict:
        # Process each data sample
        text = line["text"]
        tokens = self.tokenizer.encode(text)
        return {"input_ids": tokens}
```

Define a training forward step:

```python
def train_forward_step(step, accum_idx, model, tokenizer, batch_data, config):
    outputs = model(**batch_data)
    loss = outputs.loss
    return loss, {}, {"loss": loss.item()}
```

Build and run the pipeline:

```python
from ike import TrainingPipeline, get_arguments, load_data_from_jsonl

config = get_arguments()
pipeline = TrainingPipeline(config, world_size, local_rank, global_rank)
pipeline.run(
    load_data_from_filepath_fn=load_data_from_jsonl,
    data_processor_classes=[MyDataProcessor],
    train_forward_step_fn=train_forward_step,
    valid_forward_step_fn=valid_forward_step,
)
```

Launch with DeepSpeed:

```bash
deepspeed --include localhost:0,1 --master_port 12300 training.py \
    -c cfgs/training.cfg cfgs/model.cfg \
    --save_log --save_model
```

## Architecture

The framework follows a Task → Pipeline → Modules pattern:
```
Task
  ↓ instantiates (by providing configurations and customized modules)
Pipeline
  ↓ configures pre-implemented modules (optimizer creation, LR scheduler,
    DeepSpeed, logging) with the input configurations
  ↓ fills in customizable modules (data loader/processor/source, model
    creation, forward function, metrics) with the input customizable modules
  ↓ executes the modules
```
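The pattern above can be sketched in plain Python. Everything here is illustrative wiring, not ike's actual API: the point is only that the pipeline owns the pre-implemented pieces and the task plugs in the customizable ones.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative sketch of the Task -> Pipeline -> Modules pattern.
# None of these names are ike's real API; they only show the wiring.

@dataclass
class Pipeline:
    config: dict
    # Customizable module supplied by the task.
    forward_step_fn: Callable[[dict, dict], float] = None
    # Backing store for the pre-implemented logging module.
    log: list = field(default_factory=list)

    def _build_logger(self):
        # Pre-implemented module: configured by the pipeline, not the task.
        return self.log.append

    def run(self, batches):
        logger = self._build_logger()
        results = []
        for batch in batches:
            # Fill in the customizable module at each step.
            loss = self.forward_step_fn(batch, self.config)
            logger({"loss": loss})
            results.append(loss)
        return results

# The "task" provides configuration plus its customized modules.
def my_forward_step(batch, config):
    return sum(batch["values"]) * config["scale"]

pipeline = Pipeline(config={"scale": 0.5}, forward_step_fn=my_forward_step)
losses = pipeline.run([{"values": [1, 2]}, {"values": [3]}])
print(losses)  # [1.5, 1.5]
```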
## Core Components

- **Pipelines** (`TrainingPipeline`, `InferencePipeline`): High-level abstractions that orchestrate distributed training/inference, data loading, model management, and logging.
- **Data Processing** (`DataProcessor`, `BasicDataSource`): Flexible data loading and processing with support for JSONL files and HuggingFace datasets.
- **Configuration** (`get_arguments`, `get_inference_arguments`): YAML-based configuration with command-line override support.
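For illustration, a loader in the spirit of `load_data_from_jsonl` can be written in a few lines of standard-library Python. This is a hedged sketch of the JSONL convention (one JSON object per line), not ike's actual implementation.

```python
import json
from pathlib import Path

def load_jsonl(filepath) -> list[dict]:
    """Load one JSON object per line; a sketch of what a
    load_data_from_filepath_fn might do (not ike's real code)."""
    records = []
    with open(filepath, encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if raw:  # skip blank lines
                records.append(json.loads(raw))
    return records

# Example: write a tiny JSONL file and load it back.
path = Path("sample.jsonl")
path.write_text('{"text": "hello"}\n{"text": "world"}\n')
print(load_jsonl(path))  # [{'text': 'hello'}, {'text': 'world'}]
```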
## Customizable Modules

The `TrainingPipeline.run()` method accepts these customizable modules:

| Module | Required | Description |
|---|---|---|
| `load_data_from_filepath_fn` | Yes | Function to load data from file paths |
| `data_processor_classes` | Yes | List of `DataProcessor` subclasses |
| `train_forward_step_fn` | Yes | Training forward pass implementation |
| `valid_forward_step_fn` | Yes | Validation forward pass implementation |
| `build_tokenizer_fn` | No | Custom tokenizer builder |
| `build_model_fn` | No | Custom model builder |
| `build_optimizer_fn` | No | Custom optimizer builder |
| `build_lr_scheduler_fn` | No | Custom LR scheduler builder |
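As one example of the optional builders, a custom `build_lr_scheduler_fn` commonly implements linear warmup followed by cosine decay. The schedule itself reduces to a pure function; the builder signature ike expects is not documented here, so the function below only shows the math such a builder would wrap (name and signature are ours).

```python
import math

def warmup_cosine_lr(step: int, peak_lr: float,
                     warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero.
    A common schedule a custom build_lr_scheduler_fn might wrap;
    the name and signature here are illustrative, not ike API."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Progress through the decay phase, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

print(warmup_cosine_lr(0, 3e-4, 100, 1000))    # 0.0
print(warmup_cosine_lr(100, 3e-4, 100, 1000))  # 0.0003 (peak)
```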
## Examples

See the `examples/` directory for complete working examples:

- `examples/lm/`: Language model fine-tuning example
- `examples/gsm8k/`: Supervised fine-tuning on the GSM8K math dataset

Each example includes:

- Task-specific data processors
- Training and evaluation scripts
- Configuration files in `cfgs/`
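For supervised fine-tuning tasks like the GSM8K example, a data processor typically concatenates prompt and answer tokens and masks the prompt positions in the labels with -100, so only answer tokens contribute to the loss. The following is a self-contained sketch of that technique with a toy whitespace tokenizer; it is not the actual example code.

```python
# Sketch of SFT label masking. The toy "tokenizer" and the helper
# below are illustrative, not ike's or the example's actual code.

def encode(text: str) -> list[int]:
    # Toy tokenizer: one pseudo-id per whitespace-separated word.
    return [hash(w) % 1000 for w in text.split()]

def build_sft_sample(prompt: str, answer: str) -> dict:
    prompt_ids = encode(prompt)
    answer_ids = encode(answer)
    input_ids = prompt_ids + answer_ids
    # -100 is ignored by cross-entropy in PyTorch/HF, so the model
    # is only trained to predict the answer tokens.
    labels = [-100] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}

sample = build_sft_sample("Q: 2 + 3 = ?", "A: 5")
assert len(sample["input_ids"]) == len(sample["labels"])
assert sample["labels"][:6] == [-100] * 6  # all 6 prompt tokens masked
```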
## Configuration

Configuration is managed through YAML files and command-line arguments. Key argument groups:

**Model**

- `--pretrained_model_dir`: Path to the pretrained model
- `--tokenizer_dir`: Path to the tokenizer (defaults to the model directory)
- `--attn_implementation`: Attention implementation (`eager`, `flash_attention_2`)

**Training**

- `--global_batch_size`: Total batch size across all GPUs
- `--micro_batch_size`: Batch size per GPU per step
- `--n_epochs`: Number of training epochs
- `--peak_lr`: Peak learning rate

**DeepSpeed**

- `--zero_stage`: ZeRO optimization stage (0, 1, 2, or 3)
- `--bf16`: Enable bfloat16 training
- `--activation_checkpointing_layers`: Number of layers for activation checkpointing

**PEFT**

- `--peft_type`: PEFT method (`LORA`)
- `--peft_lora_r`: LoRA rank
- `--peft_lora_alpha`: LoRA alpha parameter
- `--peft_lora_target_modules`: Modules to apply LoRA to
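The batch-size arguments above are tied together by the standard data-parallel identity: global batch size = micro batch size × gradient accumulation steps × number of data-parallel GPUs. A quick check of that arithmetic (the relationship is standard DeepSpeed behavior; the helper function itself is ours, not ike API):

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int,
                     n_gpus: int) -> int:
    """Accumulation steps needed so micro batches sum to the global batch:
    global_batch_size = micro_batch_size * grad_accum_steps * n_gpus.
    Illustrative helper, not part of ike."""
    per_step = micro_batch_size * n_gpus
    if global_batch_size % per_step != 0:
        raise ValueError(
            "global batch size must be divisible by micro_batch_size * n_gpus")
    return global_batch_size // per_step

# e.g. --global_batch_size 64 --micro_batch_size 4 on 2 GPUs:
print(grad_accum_steps(64, 4, 2))  # 8
```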
See README_args.md for the complete argument reference.
## Troubleshooting

**Out of Memory (OOM)**

- Reduce `--micro_batch_size`
- Increase `--zero_stage` (try 2 or 3)
- Enable `--activation_checkpointing_layers`
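To see why raising `--zero_stage` helps with OOM: under mixed-precision Adam, model states cost roughly 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), and each ZeRO stage partitions more of that across the data-parallel GPUs. The breakdown follows the ZeRO paper; the helper below is our illustration, and it ignores activation memory entirely.

```python
def model_state_bytes_per_gpu(n_params: float, zero_stage: int,
                              n_gpus: int) -> float:
    """Approximate per-GPU bytes for model states with fp16 + Adam,
    using the ZeRO paper's 2 (weights) + 2 (grads) + 12 (optimizer
    states) bytes/param breakdown. Activations are not included."""
    weights, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if zero_stage >= 1:
        optim /= n_gpus      # stage 1: partition optimizer states
    if zero_stage >= 2:
        grads /= n_gpus      # stage 2: also partition gradients
    if zero_stage >= 3:
        weights /= n_gpus    # stage 3: also partition weights
    return weights + grads + optim

# 7B parameters on 8 GPUs:
for stage in (0, 1, 2, 3):
    gb = model_state_bytes_per_gpu(7e9, stage, 8) / 1e9
    print(f"stage {stage}: ~{gb:.1f} GB per GPU")
```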
**Slow Training**

- Ensure `--bf16` is enabled
- Try `--attn_implementation flash_attention_2`
- Adjust `--gradient_accumulation_steps`
**Data Loading Issues**

- Use `--debug_mode` to disable multiprocessing for easier debugging
- Check `--data_processor_chunksize` for memory issues

## License

This project is licensed under the MIT License - see the LICENSE file for details.