Generation Speed #13

**Device & OS**
- Hardware: Raspberry Pi 3B+
- OS: Raspberry Pi OS 64-bit, Debian 1:6.12.62-1+rpt1 (2025-12-18) aarch64 GNU/Linux
- Compiler: gcc 14.2.0

**Model**
- Model file: tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
- Quantization: Q4_K_M

**What happened?**
I am getting nowhere near the 4 tk/s expected for the Raspberry Pi 3B+: the run below took 278.45 s to generate 11 tokens.

**Command you ran**
```bash
picolm models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "The capital of France is" -n 10 -t 0 -j 4
```

**Expected output**
Generation speed that's close to 4 tk/s

**Actual output**

```
Loading model: models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
Model config:
  n_embd=2048, n_ffn=5632, n_heads=32, n_kv_heads=4
  n_layers=22, vocab_size=32000, max_seq=2048
  head_dim=64, rope_base=10000.0
Allocating 1.17 MB for runtime state (+ 44.00 MB FP16 KV cache)
Tokenizer loaded: 32000 tokens, bos=1, eos=2
Prompt: 6 tokens, generating up to 10 (temp=0.00, top_p=0.90, threads=4)
---
 Paris.

2. B.C. The
---
Prefill: 6 tokens in 166.62s (0.0 tok/s)
Generation: 11 tokens in 278.45s (0.0 tok/s)
Total: 445.07s
Memory: 45.17 MB runtime state (FP16 KV cache)
```
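
The log rounds both rates to `0.0 tok/s`, which hides how far off they are. As a rough sketch, the actual rates can be recomputed from the token counts and timings printed above. The variable names are only illustrative, and the 4 tok/s baseline is the figure from this report, not a measured value.

```bash
# Token counts and timings copied from the log above.
prefill_tokens=6;  prefill_secs=166.62
gen_tokens=11;     gen_secs=278.45
expected_rate=4    # tok/s figure this report expected (assumed baseline)

# awk handles the floating-point division that plain bash arithmetic cannot.
prefill_rate=$(awk -v n="$prefill_tokens" -v s="$prefill_secs" 'BEGIN { printf "%.4f", n / s }')
gen_rate=$(awk -v n="$gen_tokens" -v s="$gen_secs" 'BEGIN { printf "%.4f", n / s }')
slowdown=$(awk -v e="$expected_rate" -v n="$gen_tokens" -v s="$gen_secs" 'BEGIN { printf "%.0f", e / (n / s) }')

echo "Prefill:    $prefill_rate tok/s"
echo "Generation: $gen_rate tok/s"
echo "Slowdown vs expected: ~${slowdown}x"
```

That works out to about 0.036 tok/s for prefill and 0.040 tok/s for generation, roughly 100x slower than the expected 4 tk/s.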


