EasyMLX

Inference-only port of EasyDeL for Apple Silicon via MLX

Installation • Quick Start • Models • eSurge • Quantization • API Server

EasyMLX brings the full EasyDeL inference stack to Apple Silicon. It provides paged attention, continuous batching, streaming, tool calling, reasoning parsers, and an OpenAI-compatible API server — all running natively on Metal via MLX.

Installation

pip install easymlx

Or from source:

git clone https://github.com/erfanzar/easymlx.git
cd easymlx
pip install -e .

Requirements: Python 3.13+, macOS with Apple Silicon, MLX >= 0.31.1

Quick Start

Basic Inference

from easymlx import AutoEasyMLXModelForCausalLM, eSurge, SamplingParams

model = AutoEasyMLXModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=mx.float16,
)

engine = eSurge(model, tokenizer="meta-llama/Llama-3.2-1B-Instruct")

output = engine.chat(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    sampling_params=SamplingParams(max_tokens=256, temperature=0.7),
)
print(output.accumulated_text)

Streaming

for chunk in engine.chat(
    messages=[{"role": "user", "content": "Write a haiku about code"}],
    sampling_params=SamplingParams(max_tokens=64),
    stream=True,
):
    print(chunk.delta_text, end="", flush=True)

Direct Generation

outputs = engine.generate(
    "Once upon a time",
    sampling_params=SamplingParams(max_tokens=128, temperature=0.9, top_p=0.95),
)
print(outputs[0].accumulated_text)

Quantization

Quantize models at load time for faster inference and lower memory:

from mlx import core as mx

# 4-bit affine quantization (works on all Apple Silicon)
model = AutoEasyMLXModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=mx.float16,
    quantization="affine",  # 4-bit, group_size=64
)

# Or with explicit config
from easymlx import QuantizationConfig

model = AutoEasyMLXModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=mx.float16,
    quantization=QuantizationConfig(mode="affine", bits=4, group_size=64),
)

# Or with ordered regex rules
from easymlx import LayerwiseQuantizationConfig, QuantizationRule

model = AutoEasyMLXModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=mx.float16,
    quantization=LayerwiseQuantizationConfig(
        default=QuantizationConfig(mode="affine", bits=4, group_size=64),
        rules=[
            QuantizationRule(
                pattern=r"^model\\.embed_tokens$",
                config=QuantizationConfig(mode="affine", bits=8, group_size=64),
            ),
            QuantizationRule(pattern=r"^lm_head$", config=None),
        ],
    ),
)

Supported modes:

Mode	Bits	Group Size	Notes
`affine`	4	64	Works on all Apple Silicon
`mxfp4`	4	32	Requires MLX with GPU arch support
`mxfp8`	8	32	Requires MLX with GPU arch support
`nvfp4`	4	16	Requires MLX with GPU arch support

eSurge Engine

eSurge is the high-performance inference engine, ported from EasyDeL. It provides:

Paged Attention — efficient KV cache management with block-level allocation
Continuous Batching — dynamic request scheduling with configurable sequence budgets
mx.compile Warmup — pre-traces model forward for all token/batch buckets
Tool Calling — built-in parsers for Llama 3, Hermes, Mistral, Qwen, and more
Reasoning Parsers — DeepSeek R1, Qwen3, and other chain-of-thought extractors
Streaming — token-by-token output with delta text and TPS metrics

Engine Configuration

engine = eSurge(
    model,
    tokenizer="meta-llama/Llama-3.2-1B-Instruct",
    max_model_len=4096,          # Maximum sequence length
    max_num_seqs=4,              # Maximum concurrent sequences
    max_num_batched_tokens=2048, # Token budget per step
    page_size=64,                # KV cache page size
    memory_utilization=0.85,     # Fraction of GPU memory for KV cache
    runner_verbose=True,         # Show warmup progress and step logs
    tool_parser="llama3_json",   # Auto-detected if not set
    reasoning_parser="auto",     # Auto-detected if not set
)

Tool Calling

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]

output = engine.chat(
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    sampling_params=SamplingParams(max_tokens=256),
    tools=tools,
)
# output.tool_calls contains parsed tool invocations

API Server

Launch an OpenAI-compatible HTTP server:

from easymlx.inference.esurge.server import eSurgeApiServer

server = eSurgeApiServer(engine)
server.run(host="0.0.0.0", port=8000)

Then use any OpenAI-compatible client:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'

Supported Models

Language Models

Family	Variants	Model Types
Llama	Llama 2/3/3.1/3.2	`llama`
Llama 4	Llama 4 Scout/Maverick	`llama4`
Qwen	Qwen 1.5	`qwen`
Qwen 2	Qwen 2/2.5	`qwen2`
Qwen 2 MoE	Qwen 2 MoE	`qwen2_moe`
Qwen 3	Qwen 3	`qwen3`
Qwen 3 MoE	Qwen 3 MoE	`qwen3_moe`
Qwen 3 Next	Qwen 3 Next (hybrid attention)	`qwen3_next`
GLM	GLM-4	`glm`
GLM-4	GLM-4	`glm4`
GLM-4 MoE	GLM-4 MoE	`glm4_moe`
GLM-4 MoE Lite	GLM-4 MoE Lite (MLA)	`glm4_moe_lite`
GPT-OSS	SeedLM / GPT-OSS	`gpt_oss`

Vision-Language Models

Family	Model Types
Qwen 2 VL	`qwen2_vl`
Qwen 3 VL	`qwen3_vl`
Qwen 3 VL MoE	`qwen3_vl_moe`
Qwen 3 Omni MoE	`qwen3_omni_moe`
GLM-4V	`glm4v`
GLM-4V MoE	`glm4v_moe`
GLM-4.6V	`glm46v`
Llama 4	`llama4`

Architecture

EasyMLX mirrors EasyDeL's architecture, adapted for MLX:

easymlx/
├── caching/           # KV cache implementations
│   ├── paged/         #   PageCacheView, PageCacheView, PageMetadata
│   ├── transformer/   #   TransformerCache (dense attention)
│   ├── recurrent/     #   RecurrentCache (Mamba/SSM)
│   └── hybrid/        #   HybridCache (mixed attention + SSM)
├── inference/
│   └── esurge/        # eSurge inference engine
│       ├── engine.py          # Core engine with paged runtime
│       ├── runners/           # Model runner, execution manager
│       ├── scheduler/         # Continuous batching scheduler
│       ├── server/            # OpenAI-compatible API server
│       ├── mixins/            # Chat, lifecycle, monitoring mixins
│       └── distributed/       # Multi-worker support
├── infra/             # Base config, module, factory, bridge
├── layers/            # Attention, RoPE, MoE, embeddings, linears
├── modules/           # Model implementations
│   ├── _base/         #   Task-specific base classes (CausalLM, VLM, etc.)
│   ├── auto/          #   Auto classes (from_pretrained)
│   └── <model>/       #   Per-model config + modeling
├── operations/        # Attention kernels (SDPA, paged, vanilla)
└── workers/           # Server auth, logging, response store

Key Differences from EasyDeL

	EasyDeL	EasyMLX
Backend	JAX/Flax on TPU/GPU	MLX on Apple Silicon Metal
Arrays	`jax.Array` (immutable)	`mx.array` (mutable, eager)
Sharding	Automatic mesh parallelism	None (single-device)
Training	Full trainer suite	Inference only
State	`EasyDeLState` checkpoints	No state management
Attention	Multiple kernels + Flash/Ring	Paged + SDPA + vanilla
Compilation	`jax.jit`	`mx.compile` with bucket warmup

License

Apache License 2.0 — see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
docs		docs
easymlx		easymlx
scripts		scripts
tests		tests
.gitignore		.gitignore
.python-version		.python-version
.readthedocs.yaml		.readthedocs.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENCE		LICENCE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EasyMLX

Installation

Quick Start

Basic Inference

Streaming

Direct Generation

Quantization

eSurge Engine

Engine Configuration

Tool Calling

API Server

Supported Models

Language Models

Vision-Language Models

Architecture

Key Differences from EasyDeL

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

EasyMLX

Installation

Quick Start

Basic Inference

Streaming

Direct Generation

Quantization

eSurge Engine

Engine Configuration

Tool Calling

API Server

Supported Models

Language Models

Vision-Language Models

Architecture

Key Differences from EasyDeL

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages