A cloud coding agent is great until the connection disappears. I wanted a fallback that could run entirely on my Mac, expose an OpenAI-compatible API, and still understand screenshots when I needed to show it a UI.
The setup I landed on uses:
llama.cppwith Metal acceleration- Gemma 4 26B-A4B in GGUF format
- a Q8 MTP draft model for speculative decoding
- the Gemma 4 multimodal projector
- Pi as the terminal coding agent
I put this setup together on an Apple M1 MacBook Air with 8 GB of unified memory. The 26B model files described below need more memory to run comfortably, so the benchmark figures should be treated as reference results for higher-memory Apple silicon.
Choosing the model
The main model was gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf, which is roughly 16 GB. The full folder comes to around 17 GB after adding the MTP draft head and multimodal projector.
For a repeatable speed test, I used this prompt and capped each response at about 128 tokens:
Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
Baseline: llama.cpp with Metal
I started by running the main model directly:
repos/llama.cpp/build/bin/llama-cli \
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
-ngl 999 \
-fa on \
-c 4096 \
-n 128| Setup | Prompt tok/s | Generation tok/s |
|---|---|---|
| Gemma 4 26B-A4B Q4, llama.cpp Metal | 298.0 | 58.2 |
About 58 generated tokens per second is usable, but agent loops involve many responses and tool calls. Small speed gains add up quickly.
Adding MTP speculative decoding
Gemma 4 includes an MTP draft model at:
MTP/gemma-4-26B-A4B-it-Q8_0-MTP.ggufLoad it alongside the main model:
repos/llama.cpp/build/bin/llama-cli \
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
-ngl 999 \
-fa on \
-c 4096 \
-n 128The number of draft tokens is hardware-dependent, so I swept --spec-draft-n-max from 1 through 6:
| Draft tokens | Prompt tok/s | Generation tok/s |
|---|---|---|
| 1 | 295.5 | 68.4 |
| 2 | 299.1 | 72.0 |
| 3 | 295.6 | 72.2 |
| 4 | 297.3 | 70.7 |
| 5 | 297.9 | 63.7 |
| 6 | 296.3 | 61.2 |
Three draft tokens produced the best result on this machine, with two close behind. Larger values actually slowed generation down.
| Setup | Prompt tok/s | Generation tok/s | Speedup |
|---|---|---|---|
| Main model only | 298.0 | 58.2 | 1.00x |
| Main model + Q8 MTP draft | 295.6 | 72.2 | 1.24x |
Prompt processing remained effectively unchanged, while generation improved by about 24%.
llama.cpp versus MLX
I also compared several MLX builds to see whether Apple’s native framework would be faster for this model:
| Runtime | Model | Generation tok/s |
|---|---|---|
| llama.cpp Metal + MTP | Unsloth GGUF Q4 + Q8 MTP | 72.2 |
| llama.cpp Metal | Unsloth GGUF Q4 | 58.2 |
| MLX-LM | Unsloth UD MLX 4-bit | 45.8 |
| MLX-LM | mlx-community 4-bit | 43.9 |
| MLX-LM | mlx-community OptiQ 4-bit | 38.1 |
For this exact combination of model and hardware, llama.cpp won. The MTP-assisted version was comfortably ahead of every MLX checkpoint I tested.
I briefly tried gemma-4-swift-mlx as well, but the available 26B 4-bit checkpoints did not match the loader’s expected weight keys. Since the other MLX results already answered the performance question, I moved on.
Adding image support
Pi originally treated my local model as text-only because its configuration contained:
"input": ["text"]That prevents image tool output from reaching the model. Pi needs the model declared with both input types:
"input": ["text", "image"]The llama.cpp server also needs Gemma’s multimodal projector:
mmproj-BF16.ggufLoading it with --mmproj makes the server advertise multimodal support. A repeat of the text benchmark showed no meaningful generation slowdown:
| Setup | Projector | Prompt tok/s | Generation tok/s |
|---|---|---|---|
| llama.cpp Metal + MTP | none | 120.3 | 71.4 |
| llama.cpp Metal + MTP | mmproj-BF16.gguf | 297.4 | 72.2 |
Install llama.cpp
Install the build dependencies:
brew install cmake git tmux python@3.11Create a workspace, clone the repository, and build with Metal and Accelerate:
mkdir -p ~/Developer/ML-Models/Gemma4/repos
cd ~/Developer/ML-Models/Gemma4
git clone https://github.com/ggml-org/llama.cpp repos/llama.cpp
cd repos/llama.cpp
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_METAL=ON \
-DGGML_ACCELERATE=ON
cmake --build build --config Release -jThe tested build reported GGML_METAL=ON, GGML_ACCELERATE=ON, GGML_BLAS=ON, and GGML_BLAS_VENDOR=Apple.
Download the Gemma model files
Set up a small Python environment for the Hugging Face CLI:
cd ~/Developer/ML-Models/Gemma4
python3.11 -m venv .venv
source .venv/bin/activate
pip install -U huggingface_hub hf_xetDownload the model, draft model, and projector:
mkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
mmproj-BF16.gguf \
MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
--local-dir models/unsloth-gemma-4-26B-A4B-it-GGUFThe folder should look like this:
models/unsloth-gemma-4-26B-A4B-it-GGUF/
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
mmproj-BF16.gguf
MTP/gemma-4-26B-A4B-it-Q8_0-MTP.ggufStart the local server
Here is the final command:
repos/llama.cpp/build/bin/llama-server \
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
--mmproj models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
-ngl 999 \
-fa on \
-c 65536 \
--parallel 1 \
--host 127.0.0.1 \
--port 8080The OpenAI-compatible endpoint will be available at http://127.0.0.1:8080/v1.
I use a start_server.sh wrapper to keep it running in tmux:
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SESSION_NAME="${SESSION_NAME:-gemma4-server}"
HOST="${HOST:-127.0.0.1}"
PORT="${PORT:-8080}"
CTX_SIZE="${CTX_SIZE:-65536}"
PARALLEL="${PARALLEL:-1}"
LLAMA_SERVER="$ROOT_DIR/repos/llama.cpp/build/bin/llama-server"
MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf"
DRAFT_MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf"
MMPROJ="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf"
LOG_FILE="$ROOT_DIR/logs/llama-server-mtp.log"
mkdir -p "$ROOT_DIR/logs"
tmux new-session -d -s "$SESSION_NAME" -c "$ROOT_DIR" \
"$LLAMA_SERVER \
-m '$MODEL' \
--model-draft '$DRAFT_MODEL' \
--mmproj '$MMPROJ' \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
-ngl 999 \
-fa on \
-c '$CTX_SIZE' \
--parallel '$PARALLEL' \
--host '$HOST' \
--port '$PORT' \
2>&1 | tee -a '$LOG_FILE'"Start it and confirm the API is responding:
chmod +x start_server.sh
./start_server.sh
curl http://127.0.0.1:8080/v1/modelsConfigure Pi
Pi reads custom model providers from ~/.pi/agent/models.json. Add a local provider:
{
"providers": {
"gemma4-local": {
"name": "Gemma 4 Local",
"baseUrl": "http://127.0.0.1:8080/v1",
"api": "openai-completions",
"apiKey": "local",
"authHeader": false,
"compat": {
"supportsDeveloperRole": false,
"supportsReasoningEffort": false
},
"models": [
{
"id": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
"name": "Gemma 4 26B-A4B Q4 + MTP",
"reasoning": false,
"input": ["text", "image"],
"contextWindow": 65536,
"maxTokens": 8192,
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
}
}
]
}
}
}The important parts are:
baseUrlpoints to the localllama.cppserver.apiuses OpenAI-compatible completions.authHeaderis disabled because the server is local.inputincludes text and images.
You can also make it the default in ~/.pi/agent/settings.json:
{
"defaultProvider": "gemma4-local",
"defaultModel": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
"defaultThinkingLevel": "minimal"
}Check that Pi can find the model:
pi --offline --list-models gemmaThen launch it interactively:
pi --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.ggufOr run a one-off prompt:
pi -p --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
"Explain what this repository does"Screenshots work the same way:
pi -p @"/path/to/screenshot.png" \
"Describe this image and point out anything relevant to the UI"The finished stack
| Layer | Choice |
|---|---|
| Inference runtime | llama.cpp |
| macOS acceleration | Metal + Accelerate |
| Main model | gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf |
| Draft model | gemma-4-26B-A4B-it-Q8_0-MTP.gguf |
| MTP setting | --spec-draft-n-max 3 |
| Multimodal projector | mmproj-BF16.gguf |
| Server | llama-server on 127.0.0.1:8080 |
| API | OpenAI-compatible /v1 |
| Coding agent | Pi |
| Pi model input | ["text", "image"] |
The MTP draft model is the part that made this setup feel practical. On the tested machine, it moved Gemma 4 from 58.2 to 72.2 generated tokens per second without making the server configuration much more complicated.
Qwen3.6 as an alternative
Qwen3.6 35B-A3B is another strong option, especially when coding quality matters more than raw generation speed. In this test it generated about 55 tokens per second, compared with Gemma’s 72 tokens per second.
Download the Qwen files:
mkdir -p models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF
huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF \
Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
mmproj-BF16.gguf \
--local-dir models/unsloth-Qwen3.6-35B-A3B-MTP-GGUFStart a second server on port 8081:
LLAMA_SERVER="$HOME/Developer/ML-Models/Gemma4/repos/llama.cpp/build/bin/llama-server"
"$LLAMA_SERVER" \
-m models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
--mmproj models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
-ngl 999 \
-fa on \
-c 65536 \
--parallel 1 \
--host 127.0.0.1 \
--port 8081Add the corresponding Pi provider:
{
"providers": {
"qwen36-local": {
"name": "Qwen3.6 Local",
"baseUrl": "http://127.0.0.1:8081/v1",
"api": "openai-completions",
"apiKey": "local",
"authHeader": false,
"compat": {
"supportsDeveloperRole": false,
"supportsReasoningEffort": false
},
"models": [
{
"id": "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
"name": "Qwen3.6 35B-A3B Q4 + MTP",
"reasoning": true,
"input": ["text", "image"],
"contextWindow": 65536,
"maxTokens": 8192,
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
}
}
]
}
}
}Gemma was the better fit for my latency target. Qwen is worth testing when you are happy to trade some speed for stronger coding performance.