mergeval 🧩

mergeval is a unified tool for merging and evaluating large language models in one step.
It combines mergekit for model merging with lm-evaluation-harness for standardized benchmarking, all through a single command or API call.

Features

🔄 Merge multiple finetuned models into one using any merge method supported by mergekit
🧪 Evaluate merged models on any benchmark supported by lm-evaluation-harness (MMLU, ARC, HellaSwag, etc.)
⚙️ Single CLI command to run both merge + eval

📦 Installation

Prerequisites

Python 3.8+
CUDA-compatible GPU (recommended for model merging and evaluation)

Install Dependencies

pip install -r requirements.txt

This installs:

mergekit — Model merging toolkit
lm-evaluation-harness — Evaluation framework

along with any other helper libraries they depend on.

🚀 Quick Start

Basic Usage

Run merge and evaluation in one command:

python mergeval.py examples/example.yaml

This will:

Merge models according to the mergekit configuration
Evaluate the merged model on specified benchmarks
Clean up the temporary merged model directory
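
The three steps above can be pictured roughly as follows. This is only a sketch, not mergeval's actual code: `run_pipeline`, `merge_fn`, and `eval_fn` are hypothetical names, the callbacks stand in for the real mergekit and lm-evaluation-harness calls, and it assumes that only a directory the user did not request via `output_model_dir` counts as temporary. It also uses a system temp directory instead of the documented `merged_models/merged_TIMESTAMP` default, purely to keep the example short.

```python
import shutil
import tempfile
from pathlib import Path


def run_pipeline(config, merge_fn, eval_fn):
    """Merge, evaluate, then remove the merged model if it was temporary.

    merge_fn(merge_cfg, model_dir) and eval_fn(eval_cfg, model_dir) are
    hypothetical callbacks standing in for the mergekit and
    lm-evaluation-harness calls.
    """
    merge_cfg = config.get("merge")
    eval_cfg = config.get("evaluate")
    model_dir = None
    temporary = False
    results = None
    try:
        if merge_cfg is not None:
            out = merge_cfg.get("output_model_dir")
            # Assumption: a directory the user did not ask for is temporary.
            temporary = not out
            model_dir = Path(out) if out else Path(tempfile.mkdtemp(prefix="merged_"))
            model_dir.mkdir(parents=True, exist_ok=True)
            merge_fn(merge_cfg, model_dir)
        if eval_cfg is not None:
            # model_dir is None when the config has no merge section.
            results = eval_fn(eval_cfg, model_dir)
    finally:
        # Runs even if merging or evaluation raised an exception.
        if temporary and model_dir is not None:
            shutil.rmtree(model_dir, ignore_errors=True)
    return results
```

Putting the cleanup in a `finally` block means a failed evaluation does not leave a multi-gigabyte merged checkpoint behind.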

Configuration File

The YAML configuration file supports two main sections:

merge — mergekit configuration for model merging
evaluate — lm-evaluation-harness configuration for benchmarking

You can include both sections, or just one if you only want to merge or evaluate.

📝 Configuration Format

Merge Section

The merge section configures model merging:

merge:
  # Option 1: Reference an external mergekit config file
  config_path: path/to/mergekit_config.yaml
  
  # Option 2: Use inline mergekit configuration
  config:
    models:
      - model: model1/model-name
        parameters:
          density: 0.3
      - model: model2/model-name
        parameters:
          density: 0.3
    merge_method: ties  # or task_arithmetic, slerp, etc.
    base_model: base/model-name
    parameters:
      normalize: true
      int8_mask: true
    dtype: float16
  
  # Output directory (optional, defaults to merged_models/merged_TIMESTAMP)
  output_model_dir: /path/to/output
  
  # Extra mergekit CLI arguments (optional)
  extra_args:
    - out_shard_size: 2B
    - cuda
    - allow-crimes

Merge Configuration Options:

config_path: Path to an external mergekit YAML config file
config: Inline mergekit configuration dictionary
output_model_dir: Where to save the merged model (optional)
extra_args: Additional CLI arguments passed to mergekit (optional)
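
The `extra_args` list in the example mixes bare flags (`cuda`) with single-key mappings (`out_shard_size: 2B`). One plausible way to turn such a list into command-line flags is sketched below. The function name `extra_args_to_flags` is hypothetical, and the conversion rules (underscores become hyphens, `true` becomes a bare flag, `false` is dropped) are assumptions rather than mergeval's documented behavior.

```python
def extra_args_to_flags(extra_args):
    """Convert a YAML-style extra_args list into a flat list of CLI tokens.

    Assumed rules (not confirmed by the source):
      - a bare string becomes a switch, e.g. "cuda" -> "--cuda"
      - a mapping becomes "--key value"
      - underscores in names are written as hyphens
      - a true value yields a bare switch; a false value is dropped
    """
    flags = []
    for entry in extra_args:
        if isinstance(entry, dict):
            for key, value in entry.items():
                name = "--" + str(key).replace("_", "-")
                if isinstance(value, bool):
                    if value:
                        flags.append(name)
                else:
                    flags.extend([name, str(value)])
        else:
            flags.append("--" + str(entry).replace("_", "-"))
    return flags


# The extra_args from the merge section above, as parsed YAML.
flags = extra_args_to_flags([{"out_shard_size": "2B"}, "cuda", "allow-crimes"])
```

Returning a list of tokens rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting.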

Note: Provide either config_path or config, not both.
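
A minimal sketch of how these rules could be checked before merging is shown here. `resolve_merge_section` is a hypothetical helper, not part of mergeval, and the timestamp format is an assumption inferred from the `merged_20240101_120000` path used in Example 1.

```python
from datetime import datetime
from pathlib import Path


def resolve_merge_section(merge, now=None):
    """Validate a parsed merge section and pick its output directory."""
    has_path = "config_path" in merge
    has_inline = "config" in merge
    if has_path == has_inline:  # both present, or both missing
        raise ValueError("provide exactly one of 'config_path' or 'config'")

    out = merge.get("output_model_dir")
    if out is None:
        # Assumed default: merged_models/merged_YYYYMMDD_HHMMSS
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        out = Path("merged_models") / f"merged_{stamp}"

    return {
        "source": "config_path" if has_path else "config",
        "output_model_dir": Path(out),
    }
```

The optional `now` argument exists only so the default path can be reproduced in tests.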

Evaluate Section

The evaluate section configures benchmarking:

evaluate:
  config:
    # Model configuration
    model: hf  # Model type (hf, vllm, etc.)
    model_args:
      - pretrained: /path/to/model
      - dtype: bfloat16
      - trust_remote_code: true
      - device_map: auto
      - load_in_8bit: false
    
    # Task configuration
    tasks:  # List of tasks or comma-separated string
      - arc_easy
      - arc_challenge
      - hellaswag
      - mmlu_abstract_algebra
    num_fewshot: 3  # Number of few-shot examples (0 for zero-shot)
    
    # Generation configuration
    gen_kwargs:
      - temperature: 0.7
      - top_p: 0.9
      - top_k: 40
      - max_new_tokens: 100
      - do_sample: true
    
    # Hardware settings
    device: cuda:0  # cuda, cuda:0, cpu, mps
    batch_size: auto  # Integer or 'auto'
    max_batch_size: 64
    
    # Output configuration
    output_path: /path/to/results.json
    log_samples: true  # Save per-document outputs
    limit: 0.5  # Evaluate only 50% of documents per task
    
    # Caching
    use_cache: /path/to/sqlite_cache
    cache_requests: true  # true, refresh, or delete
    
    # Debug options
    check_integrity: true
    write_out: true
    show_config: true
    
    # Custom tasks
    include_path: /path/to/custom/tasks
    
    # Chat and prompts
    system_instruction: "You are a helpful AI assistant."
    apply_chat_template: claude-v1  # Template name or true
    fewshot_as_multiturn: true
    predict_only: false
    
    # Random seed
    seed:
      - random: 0
      - numpy: 1234
      - torch: 1234
    
    # Logging
    wandb_args:
      - project: my-project
      - name: my-run
    hf_hub_log_args:
      - hub_results_org: MyOrg
      - push_results_to_hub: true
    
    # Custom metadata
    metadata:
      custom_key: custom_value

Required Fields:

model: Model type identifier
tasks: List of evaluation tasks
model_args: Model loading arguments (at minimum, pretrained path)
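
As an illustration, the required-field check and the two list-style conventions used above (single-key mappings under `model_args`, and `tasks` given as either a list or a comma-separated string) might be normalized like this. The helper names are hypothetical and this is not mergeval's own code. The `key=value,key=value` string is the form that lm-evaluation-harness accepts for its `--model_args` option.

```python
REQUIRED_FIELDS = ("model", "tasks", "model_args")


def flatten_pairs(pairs):
    """Merge a list of single-key mappings into one dict, keeping order."""
    merged = {}
    for item in pairs:
        merged.update(item)
    return merged


def normalize_eval_config(cfg):
    """Check required fields and normalize tasks and model_args."""
    missing = [name for name in REQUIRED_FIELDS if not cfg.get(name)]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")

    model_args = flatten_pairs(cfg["model_args"])
    if "pretrained" not in model_args:
        raise ValueError("model_args must include 'pretrained'")

    tasks = cfg["tasks"]
    if isinstance(tasks, str):
        # Accept "arc_challenge,hellaswag" as well as a YAML list.
        tasks = [t.strip() for t in tasks.split(",") if t.strip()]

    return {
        **cfg,
        "tasks": list(tasks),
        "model_args": ",".join(f"{k}={v}" for k, v in model_args.items()),
    }


normalized = normalize_eval_config({
    "model": "hf",
    "model_args": [{"pretrained": "/path/to/model"}, {"dtype": "bfloat16"}],
    "tasks": "arc_challenge,hellaswag",
})
```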

Common Tasks:

arc_easy, arc_challenge — AI2 Reasoning Challenge
hellaswag — Commonsense reasoning
mmlu_* — Massive Multitask Language Understanding (subject-specific)
winogrande — Commonsense reasoning
gsm8k — Grade school math
See the lm-evaluation-harness task list for the full set of available tasks.

💻 Usage Examples

Example 1: Merge and Evaluate

merge:
  config:
    models:
      - model: psmathur/orca_mini_v3_13b
        parameters:
          density: 0.3
      - model: garage-bAInd/Platypus2-13B
        parameters:
          density: 0.3
    merge_method: ties
    base_model: TheBloke/Llama-2-13B-fp16
    parameters:
      normalize: true
      int8_mask: true
    dtype: float16

evaluate:
  config:
    model: hf
    model_args:
      - pretrained: auto  # Will use merged model path
      - dtype: bfloat16
      - device_map: auto
    tasks:
      - arc_easy
      - hellaswag
      - mmlu
    num_fewshot: 0
    output_path: results.json

Example 2: Merge Only

merge:
  config_path: my_merge_config.yaml
  output_model_dir: /path/to/save/merged_model

Example 3: Evaluate Only

evaluate:
  config:
    model: hf
    model_args:
      - pretrained: /path/to/existing/model
      - device_map: auto
    tasks: mmlu
    output_path: eval_results.json

Example 4: Using External Config File

merge:
  config_path: configs/ties_merge.yaml
  subspace_boosting: true

evaluate:
  config:
    model: hf
    model_args:
      - pretrained: auto  # Will use merged model path
    tasks: arc_challenge,hellaswag
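
The `pretrained: auto` placeholder in this example is described only by its comment. A rough sketch of how such a placeholder could be resolved after merging is given below. `resolve_pretrained` is a hypothetical name and the exact substitution logic in mergeval may differ.

```python
def resolve_pretrained(model_args, merged_dir):
    """Replace 'pretrained: auto' with the merged model directory.

    model_args is the list of single-key mappings from the evaluate section;
    merged_dir is the merge output path, or None if no merge was run.
    Returns a new list and leaves the input untouched.
    """
    resolved = []
    for item in model_args:
        if item.get("pretrained") == "auto":
            if merged_dir is None:
                raise ValueError("'pretrained: auto' needs a merge section")
            item = {**item, "pretrained": str(merged_dir)}
        resolved.append(dict(item))
    return resolved


args = resolve_pretrained(
    [{"pretrained": "auto"}, {"device_map": "auto"}],
    "merged_models/merged_20240101_120000",
)
```

Only the `pretrained` key is rewritten, so other arguments that legitimately use the value `auto`, such as `device_map`, pass through unchanged.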

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

See LICENSE file for details.

🙏 Acknowledgments

mergekit — Model merging toolkit
lm-evaluation-harness — Evaluation framework
