
Guidance on Zipformer configuration for low-power ARM CPU (Persian ASR, ~400h data) #2066

Description

@mohammadd-dev

Hi icefall team,

First of all, thank you for the great work on icefall and especially the Zipformer recipes. The modular design and Lhotse integration make it incredibly practical for adapting to new languages like Persian.

I am currently training a Persian (fa) ASR model using egs/commonvoice/ASR/zipformer as the base recipe, and I would really appreciate your insight on a few architectural and hyperparameter decisions.


Goal

My target is on-device inference on a low-power ARM CPU with limited memory bandwidth.

Priorities:

  • Small model footprint
  • Low latency
  • Robustness to noisy home environments

Data & Preparation

  • ~400 hours Persian speech (CommonVoice)
  • Speed perturbation (0.9× and 1.1× copies in addition to the original) → ~1200h effective audio
  • MUSAN noise/reverb augmentation enabled
  • BPE size: 500
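As a quick sanity check on the ~1200h figure, here is a rough sketch of the arithmetic. It assumes the original audio is kept next to the 0.9× and 1.1× copies; the variable names are only illustrative.

```python
# Rough arithmetic for the effective training hours after speed perturbation.
# Assumption: the original audio is kept alongside the 0.9x and 1.1x copies.
base_hours = 400
speed_factors = [1.0, 0.9, 1.1]

# Nominal count: every copy is treated as the same length as the original.
nominal_hours = base_hours * len(speed_factors)

# Duration-adjusted count: a clip played at speed s lasts 1/s of its original length.
effective_hours = sum(base_hours / s for s in speed_factors)

print(f"nominal: {nominal_hours} h, duration-adjusted: {effective_hours:.0f} h")
```

The two numbers differ slightly because the slowed-down copy is longer than the sped-up copy is shorter.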

Current “Small” Configuration (~18M parameters)

--encoder-dim "192, 256, 384, 256, 192"
--num-encoder-layers "2, 2, 3, 2, 2"
--feedforward-dim "512, 768, 1024, 768, 512"
--nhead "4, 4, 4, 4, 4"
--chunk-size 16
--left-context 64
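To put the streaming settings above in time units, here is a minimal sketch. It assumes the chunk size and left context are counted in encoder frames at 50 Hz (20 ms per frame), which is my understanding of the streaming Zipformer recipes; if the frame rate differs, only the constant changes.

```python
# Assumption: chunk size and left context are measured in frames at 50 Hz.
FRAME_RATE_HZ = 50


def frames_to_ms(num_frames: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Convert a number of encoder frames to milliseconds of audio."""
    return num_frames * 1000.0 / frame_rate_hz


chunk_ms = frames_to_ms(16)         # audio buffered before each encoder call
left_context_ms = frames_to_ms(64)  # history visible to attention

print(f"chunk: {chunk_ms:.0f} ms, left context: {left_context_ms:.0f} ms")
```

Under that assumption, the chunk size sets a lower bound on algorithmic latency, while the left context mainly affects the size of the cached state and the attention cost per chunk.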

Questions

1) Architecture choice

For a small ARM CPU target, is Zipformer’s hierarchical structure expected to be significantly more cache-friendly during ONNX/NCNN inference compared to a flatter stateless7 (Conv/Emformer-style) model?


2) Downsampling strategy

Would you recommend a more aggressive downsampling_factor early in the network (e.g., 2 or 4) to reduce the sequence length seen by the attention layers on CPU, or is it better to keep the early layers at higher resolution for phonetic fidelity?
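To illustrate why I am asking, here is a rough cost model for one full (non-chunked) self-attention layer. It is a back-of-the-envelope sketch, not a measurement: the function name, the 500-frame input (10 s at an assumed 50 Hz) and the dimension of 256 are all hypothetical.

```python
def attention_score_macs(num_frames: int, downsampling_factor: int, dim: int) -> int:
    """Approximate multiply-accumulates for QK^T plus attention-times-V
    in one full self-attention layer after downsampling the sequence."""
    t = -(-num_frames // downsampling_factor)  # ceiling division
    return 2 * t * t * dim


NUM_FRAMES = 500  # hypothetical: 10 s of audio at 50 Hz
DIM = 256         # hypothetical attention dimension

for factor in (1, 2, 4, 8):
    macs = attention_score_macs(NUM_FRAMES, factor, DIM)
    print(f"downsampling {factor}: {macs:,} MACs")
```

In this model, each doubling of the downsampling factor cuts the attention-matrix cost by roughly 4×. With chunked attention the cost grows with the sequence length times the chunk plus left context instead, so the saving would be closer to linear.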


3) BPE size

Is BPE 500 too large for this model size?
Would reducing it to ~250–300 noticeably improve Joiner/Decoder speed for CPU inference?
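For context, this is the rough estimate behind the question. It only counts the final joiner projection onto the vocabulary, and the joiner dimension of 512 is an assumption for illustration, not a value taken from my config.

```python
def joiner_output_macs(joiner_dim: int, vocab_size: int) -> int:
    """Multiply-accumulates for one joiner output projection (joiner_dim -> vocab_size)."""
    return joiner_dim * vocab_size


JOINER_DIM = 512  # hypothetical value, for illustration only

for vocab_size in (250, 300, 500):
    macs = joiner_output_macs(JOINER_DIM, vocab_size)
    print(f"vocab {vocab_size}: {macs:,} MACs per joiner call")
```

The projection cost scales linearly with vocabulary size, but a smaller vocabulary also splits words into more tokens, so the number of decoder and joiner calls per utterance goes up. I am unsure which effect dominates on this hardware.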


4) Latency vs. parameter count

If I scale the model toward ~30–40M parameters (larger encoder-dim), is there a potential “sweet spot” where the model becomes more confident and reduces decoding/beam overhead enough to offset the larger matrix multiplications?
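To compare model sizes on the device, I plan to measure the real-time factor (processing time divided by audio duration) roughly as sketched below. The `run_inference` callable is a placeholder for whatever runtime call decodes one utterance; it is not an icefall API.

```python
import time
from typing import Callable


def real_time_factor(
    run_inference: Callable[[], None],
    audio_seconds: float,
    repeats: int = 3,
) -> float:
    """Return the best-of-N wall-clock time divided by the audio duration."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_inference()  # placeholder: decode one utterance end to end
        best = min(best, time.perf_counter() - start)
    return best / audio_seconds


# Usage with a dummy workload standing in for the real decoder.
rtf = real_time_factor(lambda: sum(range(10_000)), audio_seconds=10.0)
print(f"RTF: {rtf:.6f}")
```

Timing the whole decode, including the search, should capture any saving from a more confident model as well as the extra encoder cost.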


5) Streaming vs. non-streaming

For this type of hardware, would you recommend:

  • Strict streaming (causal) Zipformer, or
  • Non-streaming Zipformer with chunked attention as a compromise?

6) Decoder choice

Since this CPU has relatively slow memory access, would a simpler stateless decoder be noticeably faster than the default RNN-T decoder in this scenario?
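To make the comparison concrete, here is a rough per-step cost estimate for the two decoder styles. It is a sketch under assumptions: the recurrent decoder is modelled as a single-layer LSTM, the stateless decoder as an embedding lookup followed by a grouped 1-D convolution over the last few tokens, and the dimensions are hypothetical.

```python
def lstm_step_macs(input_dim: int, hidden_dim: int) -> int:
    """MACs for one step of a single-layer LSTM (four gates, input plus recurrent weights)."""
    return 4 * (input_dim + hidden_dim) * hidden_dim


def stateless_step_macs(decoder_dim: int, context_size: int, channels_per_group: int = 1) -> int:
    """MACs for a grouped 1-D convolution over the last `context_size` token embeddings.

    channels_per_group=1 corresponds to a depthwise convolution (an assumption here).
    """
    return context_size * decoder_dim * channels_per_group


DECODER_DIM = 512  # hypothetical
CONTEXT_SIZE = 2   # hypothetical: the decoder sees the last two tokens

lstm = lstm_step_macs(DECODER_DIM, DECODER_DIM)
stateless = stateless_step_macs(DECODER_DIM, CONTEXT_SIZE)
print(f"LSTM step: {lstm:,} MACs, stateless step: {stateless:,} MACs")
```

The estimate ignores the embedding lookup and any output projection. It only shows that the recurrent weights dominate, which matters most when memory bandwidth is the bottleneck.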


I would also greatly appreciate any suggestions on how to scale the Zipformer stacks or adjust the layer distribution to better fit the small cache and limited memory bandwidth of such CPUs.

Thank you again for this excellent toolkit.

Best regards
