# EasyMLX

Inference-only port of EasyDeL for Apple Silicon via MLX

Installation • Quick Start • Models • eSurge • Quantization • API Server
EasyMLX brings the full EasyDeL inference stack to Apple Silicon. It provides paged attention, continuous batching, streaming, tool calling, reasoning parsers, and an OpenAI-compatible API server — all running natively on Metal via MLX.
## Installation

```bash
pip install easymlx
```

Or from source:

```bash
git clone https://github.com/erfanzar/easymlx.git
cd easymlx
pip install -e .
```

Requirements: Python 3.13+, macOS with Apple Silicon, MLX >= 0.31.1.
## Quick Start

```python
from mlx import core as mx

from easymlx import AutoEasyMLXModelForCausalLM, eSurge, SamplingParams

model = AutoEasyMLXModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=mx.float16,
)

engine = eSurge(model, tokenizer="meta-llama/Llama-3.2-1B-Instruct")

output = engine.chat(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    sampling_params=SamplingParams(max_tokens=256, temperature=0.7),
)
print(output.accumulated_text)
```

Streaming:

```python
for chunk in engine.chat(
    messages=[{"role": "user", "content": "Write a haiku about code"}],
    sampling_params=SamplingParams(max_tokens=64),
    stream=True,
):
    print(chunk.delta_text, end="", flush=True)
```

Plain text completion:

```python
outputs = engine.generate(
    "Once upon a time",
    sampling_params=SamplingParams(max_tokens=128, temperature=0.9, top_p=0.95),
)
print(outputs[0].accumulated_text)
```

## Quantization

Quantize models at load time for faster inference and lower memory:
```python
from mlx import core as mx

from easymlx import (
    AutoEasyMLXModelForCausalLM,
    LayerwiseQuantizationConfig,
    QuantizationConfig,
    QuantizationRule,
)

# 4-bit affine quantization (works on all Apple Silicon)
model = AutoEasyMLXModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=mx.float16,
    quantization="affine",  # 4-bit, group_size=64
)

# Or with an explicit config
model = AutoEasyMLXModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=mx.float16,
    quantization=QuantizationConfig(mode="affine", bits=4, group_size=64),
)

# Or with ordered regex rules
model = AutoEasyMLXModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=mx.float16,
    quantization=LayerwiseQuantizationConfig(
        default=QuantizationConfig(mode="affine", bits=4, group_size=64),
        rules=[
            QuantizationRule(
                pattern=r"^model\.embed_tokens$",
                config=QuantizationConfig(mode="affine", bits=8, group_size=64),
            ),
            QuantizationRule(pattern=r"^lm_head$", config=None),
        ],
    ),
)
```

Supported modes:
| Mode | Bits | Group Size | Notes |
|---|---|---|---|
| `affine` | 4 | 64 | Works on all Apple Silicon |
| `mxfp4` | 4 | 32 | Requires MLX with GPU arch support |
| `mxfp8` | 8 | 32 | Requires MLX with GPU arch support |
| `nvfp4` | 4 | 16 | Requires MLX with GPU arch support |
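For intuition, affine mode maps each group of weights to low-bit integers plus a per-group scale and offset. A toy pure-Python sketch of the idea (illustrative only, not EasyMLX's actual kernel):

```python
# Sketch of group-wise affine quantization (illustrative, not the MLX kernel).
def quantize_group(values, bits=4):
    """Map a group of floats to ints in [0, 2**bits - 1] plus (scale, zero_point)."""
    qmax = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # avoid div-by-zero for constant groups
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_group(q, scale, zero_point):
    return [zero_point + scale * v for v in q]

weights = [0.02 * i - 0.5 for i in range(64)]  # one group of 64 weights
q, scale, zp = quantize_group(weights, bits=4)
recon = dequantize_group(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, recon))
print(f"max reconstruction error: {max_err:.4f}")  # bounded by half a step
```

Each 4-bit group thus costs 64 × 4 bits plus one scale/offset pair, roughly a 4x memory saving over float16.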
## eSurge

eSurge is the high-performance inference engine, ported from EasyDeL. It provides:
- Paged Attention — efficient KV cache management with block-level allocation
- Continuous Batching — dynamic request scheduling with configurable sequence budgets
- `mx.compile` Warmup — pre-traces the model forward pass for all token/batch buckets
- Tool Calling — built-in parsers for Llama 3, Hermes, Mistral, Qwen, and more
- Reasoning Parsers — DeepSeek R1, Qwen3, and other chain-of-thought extractors
- Streaming — token-by-token output with delta text and TPS metrics
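The paged cache follows the vLLM-style design: each sequence keeps a block table mapping logical token positions to fixed-size physical pages, so KV memory is allocated on demand and reclaimed at page granularity. A toy sketch of that bookkeeping (hypothetical names, not eSurge's actual classes):

```python
# Toy paged-KV bookkeeping: sequences map logical positions to fixed-size pages.
PAGE_SIZE = 64

class PagePool:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))

    def alloc(self):
        return self.free.pop()

    def release(self, pages):
        self.free.extend(pages)

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []  # logical block index -> physical page id
        self.length = 0

    def append_token(self):
        if self.length % PAGE_SIZE == 0:  # current page full (or first token)
            self.block_table.append(self.pool.alloc())
        self.length += 1

    def slot(self, pos):
        """Physical (page, offset) for logical token position `pos`."""
        return self.block_table[pos // PAGE_SIZE], pos % PAGE_SIZE

pool = PagePool(num_pages=16)
seq = Sequence(pool)
for _ in range(130):  # 130 tokens -> 3 pages of 64
    seq.append_token()
print(len(seq.block_table), seq.slot(129))
```

Because pages are freed back to the pool when a sequence finishes, many concurrent sequences can share a fixed memory budget without pre-allocating worst-case KV space.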
### Configuration

```python
engine = eSurge(
    model,
    tokenizer="meta-llama/Llama-3.2-1B-Instruct",
    max_model_len=4096,           # Maximum sequence length
    max_num_seqs=4,               # Maximum concurrent sequences
    max_num_batched_tokens=2048,  # Token budget per step
    page_size=64,                 # KV cache page size
    memory_utilization=0.85,      # Fraction of GPU memory for KV cache
    runner_verbose=True,          # Show warmup progress and step logs
    tool_parser="llama3_json",    # Auto-detected if not set
    reasoning_parser="auto",      # Auto-detected if not set
)
```

### Tool Calling

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]

output = engine.chat(
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    sampling_params=SamplingParams(max_tokens=256),
    tools=tools,
)
```
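Dispatching the parsed calls is left to the caller. A minimal sketch, assuming the parser yields OpenAI-style dicts (the exact shape shown here is an assumption):

```python
import json

# Hypothetical parsed result, mirroring the OpenAI tool-call schema (assumed shape).
tool_calls = [
    {"function": {"name": "get_weather", "arguments": '{"location": "Paris"}'}}
]

def get_weather(location: str) -> str:
    return f"Sunny in {location}"  # stub; a real tool would hit a weather API

TOOL_REGISTRY = {"get_weather": get_weather}

for call in tool_calls:
    fn = TOOL_REGISTRY[call["function"]["name"]]
    kwargs = json.loads(call["function"]["arguments"])
    result = fn(**kwargs)
    print(result)  # -> Sunny in Paris
```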
`output.tool_calls` contains the parsed tool invocations.

## API Server

Launch an OpenAI-compatible HTTP server:
```python
from easymlx.inference.esurge.server import eSurgeApiServer

server = eSurgeApiServer(engine)
server.run(host="0.0.0.0", port=8000)
```

Then use any OpenAI-compatible client:
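For example, a stdlib-only Python client sending the same chat request (hypothetical payload values; the `try/except` just keeps the sketch runnable when no server is up):

```python
import json
from urllib import error, request

# Build an OpenAI-style chat completion request against the local server.
payload = {
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=30) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
except error.URLError as exc:
    print(f"request failed (is the server running?): {exc}")
```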
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```

## Models

### Text Models

| Family | Variants | Model Types |
|---|---|---|
| Llama | Llama 2/3/3.1/3.2 | llama |
| Llama 4 | Llama 4 Scout/Maverick | llama4 |
| Qwen | Qwen 1.5 | qwen |
| Qwen 2 | Qwen 2/2.5 | qwen2 |
| Qwen 2 MoE | Qwen 2 MoE | qwen2_moe |
| Qwen 3 | Qwen 3 | qwen3 |
| Qwen 3 MoE | Qwen 3 MoE | qwen3_moe |
| Qwen 3 Next | Qwen 3 Next (hybrid attention) | qwen3_next |
| GLM | GLM-4 | glm |
| GLM-4 | GLM-4 | glm4 |
| GLM-4 MoE | GLM-4 MoE | glm4_moe |
| GLM-4 MoE Lite | GLM-4 MoE Lite (MLA) | glm4_moe_lite |
| GPT-OSS | SeedLM / GPT-OSS | gpt_oss |

### Vision-Language Models

| Family | Model Types |
|---|---|
| Qwen 2 VL | qwen2_vl |
| Qwen 3 VL | qwen3_vl |
| Qwen 3 VL MoE | qwen3_vl_moe |
| Qwen 3 Omni MoE | qwen3_omni_moe |
| GLM-4V | glm4v |
| GLM-4V MoE | glm4v_moe |
| GLM-4.6V | glm46v |
| Llama 4 | llama4 |
## Architecture

EasyMLX mirrors EasyDeL's architecture, adapted for MLX:
```
easymlx/
├── caching/                 # KV cache implementations
│   ├── paged/               # PageCacheView, PageMetadata
│   ├── transformer/         # TransformerCache (dense attention)
│   ├── recurrent/           # RecurrentCache (Mamba/SSM)
│   └── hybrid/              # HybridCache (mixed attention + SSM)
├── inference/
│   └── esurge/              # eSurge inference engine
│       ├── engine.py        # Core engine with paged runtime
│       ├── runners/         # Model runner, execution manager
│       ├── scheduler/       # Continuous batching scheduler
│       ├── server/          # OpenAI-compatible API server
│       ├── mixins/          # Chat, lifecycle, monitoring mixins
│       └── distributed/     # Multi-worker support
├── infra/                   # Base config, module, factory, bridge
├── layers/                  # Attention, RoPE, MoE, embeddings, linears
├── modules/                 # Model implementations
│   ├── _base/               # Task-specific base classes (CausalLM, VLM, etc.)
│   ├── auto/                # Auto classes (from_pretrained)
│   └── <model>/             # Per-model config + modeling
├── operations/              # Attention kernels (SDPA, paged, vanilla)
└── workers/                 # Server auth, logging, response store
```

## EasyMLX vs EasyDeL

| | EasyDeL | EasyMLX |
|---|---|---|
| Backend | JAX/Flax on TPU/GPU | MLX on Apple Silicon Metal |
| Arrays | `jax.Array` (immutable) | `mx.array` (mutable, eager) |
| Sharding | Automatic mesh parallelism | None (single-device) |
| Training | Full trainer suite | Inference only |
| State | `EasyDeLState` checkpoints | No state management |
| Attention | Multiple kernels + Flash/Ring | Paged + SDPA + vanilla |
| Compilation | `jax.jit` | `mx.compile` with bucket warmup |
## License

Apache License 2.0 — see LICENSE for details.
Copyright 2026 The EASYDEL / EASYMLX Author @erfanzar (Erfan Zare Chavoshi).
