llama.cpp vs MLX/oMLX: Architecture Benchmark on Apple Silicon

Tom Tang · 9 min read
llama-cpp mlx apple-silicon local-llm benchmark inference

llama.cpp vs oMLX / MLX Architecture

TL;DR: llama.cpp is 21-59% faster than MLX for LLM inference on Apple Silicon M5 Max. Both use the same Metal API to access the GPU, but llama.cpp has 4 layers vs MLX’s 6 — eliminating Python interpreter overhead, GIL contention, and language boundary crossings. For pure inference workloads, fewer layers = faster.

1. The Most Important Thing

Both stacks ultimately call the same API: Metal. Metal is Apple's low-level GPU API (similar to Vulkan/DirectX). "MLX is optimized for Apple Silicon" does not mean "MLX is the only way to reach the Apple GPU". Anyone can write Metal shaders and talk to the GPU directly, and that is exactly what llama.cpp does.

Think of Metal as a highway. MLX is a tour bus built by Apple: comfortable, versatile, and able to carry many different ML workloads. llama.cpp's ggml-metal is a race car: it does one thing (transformer inference) and does it extremely fast. Both vehicles drive the same road (the Metal API) to the same destination (GPU compute).

2. Layer-by-Layer Architecture

oMLX Stack (6 layers)

┌──────────────────────────────────────────────┐
│  Python 3.10+               [user-installed] │
│  Interpreter + GIL (Global Interpreter Lock) │
├──────────────────────┬───────────────────────┤
│                      ↓                       │
│  oMLX Server                        [Python] │
│  FastAPI HTTP wrapper, OpenAI-compatible API │
├──────────────────────┬───────────────────────┤
│                      ↓                       │
│  mlx-lm                             [Python] │
│  Model loading, tokenization, sampling,      │
│  KV cache                                    │
├──────────────────────┬───────────────────────┤
│                      ↓                       │
│  Apple MLX Framework           [C++ / Apple] │
│  General ML compute graph, lazy evaluation,  │
│  unified memory                              │
├──────────────────────┬───────────────────────┤
│                      ↓                       │
│  Metal API                          [shared] │
│  Apple's low-level GPU API                   │
│  (shader dispatch, buffer management)        │
├──────────────────────┬───────────────────────┤
│                      ↓                       │
│  Apple Silicon GPU                           │
│  M5 Max: unified memory, Neural Engine,      │
│  GPU cores                                   │
└──────────────────────────────────────────────┘

llama.cpp Stack (4 layers)

┌──────────────────────────────────────────────┐
│  llama-server                        [C/C++] │
│  HTTP server + model loading + tokenization  │
│  + KV cache + sampling, all in one native    │
│  binary                                      │
├──────────────────────┬───────────────────────┤
│                      ↓                       │
│  ggml                                    [C] │
│  Tensor library built only for               │
│  transformers; no general ML overhead        │
├──────────────────────┬───────────────────────┤
│                      ↓                       │
│  Metal API                          [shared] │
│  Same API, same highway                      │
├──────────────────────┬───────────────────────┤
│                      ↓                       │
│  Apple Silicon GPU                           │
│  Same chip, same GPU cores                   │
└──────────────────────────────────────────────┘

2 fewer layers = less overhead:
  No Python interpreter
  No GIL (see the sketch below)
  No general-purpose ML framework abstraction
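The GIL point is worth making concrete. Here is a minimal, self-contained micro-benchmark (illustrative only, not part of bench_h2h.py): on standard CPython, two threads doing CPU-bound work finish in roughly the same wall time as running the same work sequentially, because the interpreter lock lets only one thread execute Python bytecode at a time.

    # gil_demo.py: show that CPU-bound Python threads do not run in parallel
    # (standard CPython with the GIL; illustrative, not from bench_h2h.py)
    import threading
    import time

    def spin(n: int) -> None:
        # Pure-Python busy loop; holds the GIL for its entire duration
        while n:
            n -= 1

    N = 30_000_000

    t0 = time.perf_counter()
    spin(N)
    spin(N)
    print(f"sequential: {time.perf_counter() - t0:.2f}s")

    t0 = time.perf_counter()
    threads = [threading.Thread(target=spin, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Roughly the same wall time as sequential: the GIL serialized the threads
    print(f"2 threads:  {time.perf_counter() - t0:.2f}s")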

3. The Full Journey of a Request

A classify request starts at the Rust server, travels through every layer down to the GPU, and comes back.

oMLX Path (6 hops)

Rust server → HTTP POST → [Python oMLX server] → [mlx-lm (Python)] → MLX (C++) → Metal → GPU

Every request crosses the Python/C++ boundary multiple times, and the Python GIL limits concurrent request throughput.

llama.cpp Path (4 hops)

Rust server → HTTP POST → llama-server (C++) → ggml → Metal → GPU

Native code the whole way. No language boundary crossings. Full concurrency (no GIL).
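For reference, the client side of this path is just a plain HTTP call. A minimal sketch, assuming llama-server is listening on localhost:8080; the port, model name, and prompt are illustrative, not the production Rust client:

    # classify_request.py: minimal OpenAI-compatible call to llama-server
    import json
    from urllib.request import Request, urlopen

    payload = {
        "model": "qwen3.5-4b",  # illustrative; llama-server serves the model it loaded
        "messages": [
            {"role": "system", "content": "Classify the phase. Reply with JSON only."},
            {"role": "user", "content": "Refactor auth middleware to JWT. Tests failing."},
        ],
        "max_tokens": 80,
        "temperature": 0,
    }

    req = Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        body = json.load(resp)

    # e.g. {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
    print(body["choices"][0]["message"]["content"])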

4. Why "Optimized for Apple" Does Not Mean Fastest

| Factor | MLX (Apple) | ggml-metal (llama.cpp) |
|---|---|---|
| Design goal | General-purpose ML framework (training + inference + research) | Transformer inference only; no other overhead |
| Compute graph | Lazy evaluation, dynamic graph (flexible, but with overhead) | Static graph, pre-compiled (fast, less flexible) |
| Metal shaders | General kernels (support any ML operation) | Hand-tuned kernels for attention, GEMM, RoPE |
| Language overhead | Python → C++ boundary crossed on every op | C/C++ end to end, zero overhead |
| Concurrency | Python GIL: only one thread executes Python code at a time | Full multi-threading, continuous batching |
| Memory management | Unified memory (an Apple strength) + Python GC overhead | Unified memory (same hardware) + manual memory (zero GC) |
| Quantization | MLX 4-bit (Apple's own format) | GGUF Q4_K_M (more mature; community-optimized for years) |
| Model support speed | Apple partners with HuggingFace, so new models usually land there first | Community-driven, but GGUF is the de facto standard; support usually same-day |
| Training | Supports fine-tuning, LoRA, full training | Inference only; no training |

MLX's real strength is research flexibility and training, not inference speed. For pure inference (our use case), llama.cpp's "do one thing well" philosophy wins. If you need to fine-tune a model, MLX is still the right choice. But we only need to run a frozen model.
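The lazy-vs-static row is the easiest one to see in code. An illustrative two-liner against MLX's Python API (mlx.core): building the compute graph costs almost nothing, and Metal kernels dispatch only when evaluation is forced.

    # MLX lazy evaluation, illustrative
    import mlx.core as mx

    a = mx.random.normal((1024, 1024))
    b = (a @ a).sum()  # graph node recorded; nothing has run on the GPU yet
    mx.eval(b)         # evaluation forced here: Metal kernels dispatch now

That deferred bookkeeping is exactly what research code wants, and exactly what a static, pre-compiled inference graph avoids paying on every token.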

5. Experiment: Head-to-Head Benchmark

Test Conditions

| Setting | Value |
|---|---|
| Hardware | Apple M5 Max, 128GB unified memory |
| Model (oMLX) | Qwen3.5-4B-MLX-4bit (MLX safetensors, ~2.5GB) |
| Model (llama.cpp) | Qwen3.5-4B-Q4_K_M (GGUF, 2.7GB) |
| Task | Phase classification: system prompt + conversation context → 80-token JSON |
| Rounds | 12 per test |
| Fair play | Dev server stopped. Same model generation. Same prompts. Same hardware. |
| API | Both serve OpenAI-compatible /v1/chat/completions |
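The actual harness is bench_h2h.py, which is not reproduced here; as a sketch of the shape of the measurement (an assumption, not the script itself), each latency row below is computed roughly like this:

    # Sketch of per-test latency stats; bench_h2h.py itself is not shown here
    import statistics
    import time

    def bench(call, rounds: int = 12) -> dict:
        samples_ms = []
        for _ in range(rounds):
            t0 = time.perf_counter()
            call()  # one /v1/chat/completions round trip
            samples_ms.append((time.perf_counter() - t0) * 1000)
        samples_ms.sort()
        return {
            "p50": statistics.median(samples_ms),
            "p90": samples_ms[int(0.9 * (len(samples_ms) - 1))],  # approximate rank
            "avg": statistics.fmean(samples_ms),
            "min": samples_ms[0],
            "max": samples_ms[-1],
        }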

5.1 — Raw Terminal Output

========================================================================
HEAD-TO-HEAD: oMLX (MLX) vs llama-server (GGUF) — Phase Classification
========================================================================
Rounds per test: 12
Hardware: M5 Max, 128GB unified memory

  oMLX (MLX-4bit): READY (281ms warmup)
  llama.cpp (GGUF-Q4KM): READY (443ms warmup)

────────────────────────────────────────────────────────────────────────
TEST 1: Latency by payload size
────────────────────────────────────────────────────────────────────────

  Payload: small (176 chars)
    oMLX (MLX-4bit)                 p50=   237ms  p90=   247ms  avg=   238ms  min=   231ms  max=   247ms  valid=12/12
    llama.cpp (GGUF-Q4KM)           p50=   187ms  p90=   188ms  avg=   196ms  min=   177ms  max=   298ms  valid=12/12

  Payload: medium (828 chars)
    oMLX (MLX-4bit)                 p50=   256ms  p90=   266ms  avg=   260ms  min=   253ms  max=   298ms  valid=12/12
    llama.cpp (GGUF-Q4KM)           p50=   188ms  p90=   189ms  avg=   203ms  min=   178ms  max=   394ms  valid=12/12

  Payload: large (1747 chars)
    oMLX (MLX-4bit)                 p50=   412ms  p90=   418ms  avg=   395ms  min=   325ms  max=   451ms  valid=12/12
    llama.cpp (GGUF-Q4KM)           p50=   171ms  p90=   172ms  avg=   193ms  min=   161ms  max=   480ms  valid=12/12
────────────────────────────────────────────────────────────────────────
TEST 2: Sequential throughput (medium payload, 10 calls)
────────────────────────────────────────────────────────────────────────
  oMLX (MLX-4bit)                 3.84 calls/sec  p50=   257ms  p90=   297ms  avg=   260ms
  llama.cpp (GGUF-Q4KM)           5.38 calls/sec  p50=   188ms  p90=   195ms  avg=   186ms

────────────────────────────────────────────────────────────────────────
TEST 3a: Concurrent throughput (2 slots, 12 calls)
────────────────────────────────────────────────────────────────────────
  oMLX (MLX-4bit)                 5.52 calls/sec  p50=   360ms  p90=   367ms  avg=   361ms
  llama.cpp (GGUF-Q4KM)           6.21 calls/sec  p50=   267ms  p90=   576ms  avg=   321ms

────────────────────────────────────────────────────────────────────────
TEST 3b: Concurrent throughput (4 slots, 24 calls)
────────────────────────────────────────────────────────────────────────
  oMLX (MLX-4bit)                 6.21 calls/sec  p50=   644ms  p90=   655ms  avg=   642ms
  llama.cpp (GGUF-Q4KM)           7.92 calls/sec  p50=   421ms  p90=   886ms  avg=   502ms
────────────────────────────────────────────────────────────────────────
TEST 4: Output quality samples (medium payload)
────────────────────────────────────────────────────────────────────────

  oMLX (MLX-4bit):
    [OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
    [OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
    [OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}

  llama.cpp (GGUF-Q4KM):
    [OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
    [OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
    [OK] {"phase": "testing", "scope": "Refactor auth middleware to use JWT"}

========================================================================
VERDICT
========================================================================
     small: oMLX p50=237ms  llama.cpp p50=187ms  → llama.cpp is 21% faster
    medium: oMLX p50=256ms  llama.cpp p50=188ms  → llama.cpp is 27% faster
     large: oMLX p50=412ms  llama.cpp p50=171ms  → llama.cpp is 59% faster
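The verdict percentages are the p50 gap relative to oMLX, rounded to whole percent:

    # How "X% faster" is derived from the p50 numbers above
    pairs = {"small": (237, 187), "medium": (256, 188), "large": (412, 171)}
    for name, (omlx_ms, llama_ms) in pairs.items():
        gap = (omlx_ms - llama_ms) / omlx_ms
        print(f"{name}: {gap:.1%}")  # 21.1%, 26.6%, 58.5% -> rounded to 21/27/59%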

5.2 — Verdict Cards

| Payload | oMLX p50 | llama.cpp p50 | Difference | Winner |
|---|---|---|---|---|
| Small (176 chars) | 237ms | 187ms | 21% | llama.cpp faster |
| Medium (828 chars) | 256ms | 188ms | 27% | llama.cpp faster |
| Large (1747 chars) | 412ms | 171ms | 59% | llama.cpp faster |

5.3 — Visualized

p50 Latency (ms) — lower is better

| Payload | oMLX | llama.cpp | Winner |
|---|---|---|---|
| Small (176 chars) | 237ms | 187ms | llama.cpp |
| Medium (828 chars) | 256ms | 188ms | llama.cpp |
| Large (1747 chars) | 412ms | 171ms | llama.cpp |

Throughput (calls/sec) — higher is better

| Mode | oMLX | llama.cpp | Winner |
|---|---|---|---|
| Sequential | 3.84/s | 5.38/s | llama.cpp |
| Concurrent (2 slots) | 5.52/s | 6.21/s | llama.cpp |
| Concurrent (4 slots) | 6.21/s | 7.92/s | llama.cpp |

5.4 — Where oMLX Wins

Tail latency under high concurrency. At 4 concurrent slots, oMLX p90 = 655ms vs llama.cpp p90 = 886ms. oMLX is more consistent, likely because MLX's lazy evaluation smooths out contention, while llama.cpp pushes higher throughput with occasional spikes. For background classification (not user-facing), higher throughput with occasional spikes is the better tradeoff.

5.5 — The First Run Was Wrong

Failed Benchmark: Qwen3.5 "Thinking Mode" Trap

First run showed llama.cpp producing 0/8 valid outputs:

  llama.cpp (GGUF-Q4KM):
    [BAD]
    [BAD]
    [BAD]

Raw response revealed the problem:
  "content": "",
  "reasoning_content": "Okay, let's see. The user wants me to classify..."

All 80 tokens consumed by internal reasoning. Content field empty.

Root cause: Qwen3.5 defaults to Chain-of-Thought "thinking mode".
oMLX disabled it via: chat_template_kwargs: {"enable_thinking": false}
llama-server needs: --reasoning off at startup

Fix applied → re-benchmarked → 12/12 valid (100%)
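For the record, here are the two switches side by side as used in this run. The oMLX field is exactly the one from the log above; the llama-server flag is as reported for this build (b8500) and may differ across versions, and the model path is illustrative.

    # Disabling Qwen3.5 thinking mode in both stacks (as used in this run)

    # oMLX: per request, via the chat template kwargs
    omlx_payload = {
        "messages": [{"role": "user", "content": "..."}],
        "max_tokens": 80,
        "chat_template_kwargs": {"enable_thinking": False},
    }

    # llama-server: at startup (model path illustrative)
    #   llama-server -m Qwen3.5-4B-Q4_K_M.gguf --reasoning off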

6. Full Summary Table

| Metric | oMLX (MLX-4bit) | llama.cpp (GGUF-Q4KM) | Winner |
|---|---|---|---|
| p50 latency (small) | 237ms | 187ms | llama.cpp (21% faster) |
| p50 latency (medium) | 256ms | 188ms | llama.cpp (27% faster) |
| p50 latency (large) | 412ms | 171ms | llama.cpp (59% faster) |
| Sequential throughput | 3.84 calls/sec | 5.38 calls/sec | llama.cpp (1.4x) |
| Concurrent (2 slots) | 5.52 calls/sec | 6.21 calls/sec | llama.cpp (1.1x) |
| Concurrent (4 slots) | 6.21 calls/sec | 7.92 calls/sec | llama.cpp (1.3x) |
| p90 tail (4 slots) | 655ms | 886ms | oMLX (more consistent) |
| Quality (valid JSON) | 12/12 (100%) | 12/12 (100%) | Tie |
| Phase agreement | "testing" | "testing" | Same answer |
| Python required | Yes | No | llama.cpp |
| User setup steps | 5 steps | 0 steps | llama.cpp |

7. Dependency Comparison

oMLX: What the User Has to Install

  • Python 3.10+
  • pip / venv
  • mlx-lm
  • MLX framework
  • oMLX CLI
  • Manually run omlx serve

6 dependencies, 5 manual steps

llama.cpp: What the User Has to Install

  • (nothing)

0 dependencies, 0 manual steps. The binary is bundled, and the model lazy-downloads on click.

8. Summary

"Optimized for Apple Silicon" ≠ fastest.

MLX is Apple's general-purpose ML framework, something like PyTorch for Apple Silicon. It is optimized for unified memory access, training, and research flexibility.

llama.cpp / ggml is a C library that does LLM inference and nothing else: hand-tuned Metal shaders, a static compute graph, zero Python overhead.

Both call the same Metal API to drive the same GPU. But llama.cpp's path is two layers shorter, with no Python GIL and no general-purpose framework abstraction. That is why it is faster for pure inference.

Our use case (80-token classification, fire-and-forget) is a perfect match for llama.cpp's design goals.


Benchmarked 2026-03-29 — M5 Max 128GB — Qwen3.5-4B — llama.cpp b8500 — oMLX 0.2.23. Benchmark script: bench_h2h.py — 12 rounds per test — dev server stopped for fair GPU allocation.