Generation Speed #13

@CarlosS7

Device & OS

  • Hardware: Raspberry Pi 3B+
  • OS: Raspberry Pi OS 64-bit, Debian 1:6.12.62-1+rpt1 (2025-12-18) aarch64 GNU/Linux
  • Compiler: gcc 14.2.0

Model

  • Model file: tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
  • Quantization: Q4_K_M

What happened?
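I am getting nowhere near the 4 tok/s expected for the Raspberry Pi 3B+. Prefill and generation both run at roughly 0.04 tok/s (about 25 to 28 seconds per token), which is around 100 times slower.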

Command you ran

picolm models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "The capital of France is" -n 10 -t 0 -j 4

Expected output
A generation speed close to 4 tok/s.

Actual output

Model config:
  n_embd=2048, n_ffn=5632, n_heads=32, n_kv_heads=4
  n_layers=22, vocab_size=32000, max_seq=2048
  head_dim=64, rope_base=10000.0
Allocating 1.17 MB for runtime state (+ 44.00 MB FP16 KV cache)
Tokenizer loaded: 32000 tokens, bos=1, eos=2
Prompt: 6 tokens, generating up to 10 (temp=0.00, top_p=0.90, threads=4)
---
 Paris.

2. B.C. The
---
Prefill: 6 tokens in 166.62s (0.0 tok/s)
Generation: 11 tokens in 278.45s (0.0 tok/s)
Total: 445.07s
Memory: 45.17 MB runtime state (FP16 KV cache)
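
Since the log rounds both rates to `0.0 tok/s`, the real figures have to be recomputed from the token counts and timings. The sketch below does that and also checks the reported KV cache size against the model config. It assumes the timing lines always follow the format seen in this one log, and that the KV cache stores one FP16 value per layer, KV head, head dimension, and position for both K and V; neither assumption is confirmed against the picolm source. The names `throughput` and `kv_cache_bytes` are hypothetical helpers, not part of picolm.

```python
import re

# Timing lines copied verbatim from the "Actual output" above.
LOG = """\
Prefill: 6 tokens in 166.62s (0.0 tok/s)
Generation: 11 tokens in 278.45s (0.0 tok/s)
"""

# Matches "<phase>: <n> tokens in <seconds>s". The format is inferred
# from this single log, so treat it as an assumption.
LINE_RE = re.compile(r"^(\w+): (\d+) tokens in ([\d.]+)s", re.MULTILINE)


def throughput(log: str) -> dict[str, float]:
    """Return tokens per second for each phase found in the log."""
    return {
        phase: int(tokens) / float(seconds)
        for phase, tokens, seconds in LINE_RE.findall(log)
    }


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, max_seq, bytes_per_value=2):
    """Size of the K and V caches together (assumed layout, FP16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * max_seq * bytes_per_value


rates = throughput(LOG)
for phase, rate in rates.items():
    print(f"{phase}: {rate:.4f} tok/s ({1 / rate:.1f} s per token)")

# Values taken from the "Model config" lines in the output above.
kv_mib = kv_cache_bytes(n_layers=22, n_kv_heads=4, head_dim=64, max_seq=2048) / 2**20
print(f"KV cache: {kv_mib:.2f} MiB")
```

On this log the script reports about 0.0360 tok/s for prefill and 0.0395 tok/s for generation, and a 44.00 MiB KV cache. The cache size matches the allocation line, so the runtime state looks as expected and the problem appears to be limited to speed.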

Labels: bug (Something isn't working)