blog • Building a Fully Local Coding Agent on macOS

Running Gemma 4 and Qwen3.6 locally with llama.cpp, MTP speculative decoding, image support, and Pi.

A cloud coding agent is great until the connection disappears. I wanted a fallback that could run entirely on my Mac, expose an OpenAI-compatible API, and still understand screenshots when I needed to show it a UI.

The setup I landed on uses:

llama.cpp with Metal acceleration
Gemma 4 26B-A4B in GGUF format
a Q8 MTP draft model for speculative decoding
the Gemma 4 multimodal projector
Pi as the terminal coding agent

I put this setup together on an Apple M1 MacBook Air with 8 GB of unified memory. The 26B model files described below need more memory to run comfortably, so the benchmark figures should be treated as reference results for higher-memory Apple silicon.

Choosing the model

The main model was gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf, which is roughly 16 GB. The full folder comes to around 17 GB after adding the MTP draft head and multimodal projector.

For a repeatable speed test, I used this prompt and capped each response at about 128 tokens:

Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.

Baseline: llama.cpp with Metal

I started by running the main model directly:

repos/llama.cpp/build/bin/llama-cli \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  -ngl 999 \
  -fa on \
  -c 4096 \
  -n 128

Setup	Prompt tok/s	Generation tok/s
Gemma 4 26B-A4B Q4, llama.cpp Metal	298.0	58.2

About 58 generated tokens per second is usable, but agent loops involve many responses and tool calls. Small speed gains add up quickly.

Adding MTP speculative decoding

Gemma 4 includes an MTP draft model at:

MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf

Load it alongside the main model:

repos/llama.cpp/build/bin/llama-cli \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 \
  -fa on \
  -c 4096 \
  -n 128

The number of draft tokens is hardware-dependent, so I swept --spec-draft-n-max from 1 through 6:

Draft tokens	Prompt tok/s	Generation tok/s
1	295.5	68.4
2	299.1	72.0
3	295.6	72.2
4	297.3	70.7
5	297.9	63.7
6	296.3	61.2

Three draft tokens produced the best result on this machine, with two close behind. Larger values actually slowed generation down.

Setup	Prompt tok/s	Generation tok/s	Speedup
Main model only	298.0	58.2	1.00x
Main model + Q8 MTP draft	295.6	72.2	1.24x

Prompt processing remained effectively unchanged, while generation improved by about 24%.

llama.cpp versus MLX

I also compared several MLX builds to see whether Apple’s native framework would be faster for this model:

Runtime	Model	Generation tok/s
llama.cpp Metal + MTP	Unsloth GGUF Q4 + Q8 MTP	72.2
llama.cpp Metal	Unsloth GGUF Q4	58.2
MLX-LM	Unsloth UD MLX 4-bit	45.8
MLX-LM	mlx-community 4-bit	43.9
MLX-LM	mlx-community OptiQ 4-bit	38.1

For this exact combination of model and hardware, llama.cpp won. The MTP-assisted version was comfortably ahead of every MLX checkpoint I tested.

I briefly tried gemma-4-swift-mlx as well, but the available 26B 4-bit checkpoints did not match the loader’s expected weight keys. Since the other MLX results already answered the performance question, I moved on.

Adding image support

Pi originally treated my local model as text-only because its configuration contained:

"input": ["text"]

That prevents image tool output from reaching the model. Pi needs the model declared with both input types:

"input": ["text", "image"]

The llama.cpp server also needs Gemma’s multimodal projector:

mmproj-BF16.gguf

Loading it with --mmproj makes the server advertise multimodal support. A repeat of the text benchmark showed no meaningful generation slowdown:

Setup	Projector	Prompt tok/s	Generation tok/s
llama.cpp Metal + MTP	none	120.3	71.4
llama.cpp Metal + MTP	`mmproj-BF16.gguf`	297.4	72.2

Install llama.cpp

Install the build dependencies:

brew install cmake git tmux python@3.11

Create a workspace, clone the repository, and build with Metal and Accelerate:

mkdir -p ~/Developer/ML-Models/Gemma4/repos
cd ~/Developer/ML-Models/Gemma4
 
git clone https://github.com/ggml-org/llama.cpp repos/llama.cpp
 
cd repos/llama.cpp
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_METAL=ON \
  -DGGML_ACCELERATE=ON
 
cmake --build build --config Release -j

The tested build reported GGML_METAL=ON, GGML_ACCELERATE=ON, GGML_BLAS=ON, and GGML_BLAS_VENDOR=Apple.

Download the Gemma model files

Set up a small Python environment for the Hugging Face CLI:

cd ~/Developer/ML-Models/Gemma4
python3.11 -m venv .venv
source .venv/bin/activate
pip install -U huggingface_hub hf_xet

Download the model, draft model, and projector:

mkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF
 
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
  --local-dir models/unsloth-gemma-4-26B-A4B-it-GGUF

The folder should look like this:

models/unsloth-gemma-4-26B-A4B-it-GGUF/
  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
  mmproj-BF16.gguf
  MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf

Start the local server

Here is the final command:

repos/llama.cpp/build/bin/llama-server \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
  --mmproj models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 \
  -fa on \
  -c 65536 \
  --parallel 1 \
  --host 127.0.0.1 \
  --port 8080

The OpenAI-compatible endpoint will be available at http://127.0.0.1:8080/v1.

I use a start_server.sh wrapper to keep it running in tmux:

#!/usr/bin/env bash
set -euo pipefail
 
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SESSION_NAME="${SESSION_NAME:-gemma4-server}"
HOST="${HOST:-127.0.0.1}"
PORT="${PORT:-8080}"
CTX_SIZE="${CTX_SIZE:-65536}"
PARALLEL="${PARALLEL:-1}"
 
LLAMA_SERVER="$ROOT_DIR/repos/llama.cpp/build/bin/llama-server"
MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf"
DRAFT_MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf"
MMPROJ="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf"
LOG_FILE="$ROOT_DIR/logs/llama-server-mtp.log"
 
mkdir -p "$ROOT_DIR/logs"
 
tmux new-session -d -s "$SESSION_NAME" -c "$ROOT_DIR" \
  "$LLAMA_SERVER \
    -m '$MODEL' \
    --model-draft '$DRAFT_MODEL' \
    --mmproj '$MMPROJ' \
    --spec-type draft-mtp \
    --spec-draft-n-max 3 \
    -ngl 999 \
    -fa on \
    -c '$CTX_SIZE' \
    --parallel '$PARALLEL' \
    --host '$HOST' \
    --port '$PORT' \
    2>&1 | tee -a '$LOG_FILE'"

Start it and confirm the API is responding:

chmod +x start_server.sh
./start_server.sh
curl http://127.0.0.1:8080/v1/models

Configure Pi

Pi reads custom model providers from ~/.pi/agent/models.json. Add a local provider:

{
  "providers": {
    "gemma4-local": {
      "name": "Gemma 4 Local",
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "authHeader": false,
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
          "name": "Gemma 4 26B-A4B Q4 + MTP",
          "reasoning": false,
          "input": ["text", "image"],
          "contextWindow": 65536,
          "maxTokens": 8192,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}

The important parts are:

baseUrl points to the local llama.cpp server.
api uses OpenAI-compatible completions.
authHeader is disabled because the server is local.
input includes text and images.

You can also make it the default in ~/.pi/agent/settings.json:

{
  "defaultProvider": "gemma4-local",
  "defaultModel": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
  "defaultThinkingLevel": "minimal"
}

Check that Pi can find the model:

pi --offline --list-models gemma

Then launch it interactively:

pi --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

Or run a one-off prompt:

pi -p --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  "Explain what this repository does"

Screenshots work the same way:

pi -p @"/path/to/screenshot.png" \
  "Describe this image and point out anything relevant to the UI"

The finished stack

Layer	Choice
Inference runtime	llama.cpp
macOS acceleration	Metal + Accelerate
Main model	`gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf`
Draft model	`gemma-4-26B-A4B-it-Q8_0-MTP.gguf`
MTP setting	`--spec-draft-n-max 3`
Multimodal projector	`mmproj-BF16.gguf`
Server	`llama-server` on `127.0.0.1:8080`
API	OpenAI-compatible `/v1`
Coding agent	Pi
Pi model input	`["text", "image"]`

The MTP draft model is the part that made this setup feel practical. On the tested machine, it moved Gemma 4 from 58.2 to 72.2 generated tokens per second without making the server configuration much more complicated.

Qwen3.6 as an alternative

Qwen3.6 35B-A3B is another strong option, especially when coding quality matters more than raw generation speed. In this test it generated about 55 tokens per second, compared with Gemma’s 72 tokens per second.

Download the Qwen files:

mkdir -p models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF
 
huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF \
  Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  --local-dir models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF

Start a second server on port 8081:

LLAMA_SERVER="$HOME/Developer/ML-Models/Gemma4/repos/llama.cpp/build/bin/llama-server"
 
"$LLAMA_SERVER" \
  -m models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --mmproj models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 \
  -fa on \
  -c 65536 \
  --parallel 1 \
  --host 127.0.0.1 \
  --port 8081

Add the corresponding Pi provider:

{
  "providers": {
    "qwen36-local": {
      "name": "Qwen3.6 Local",
      "baseUrl": "http://127.0.0.1:8081/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "authHeader": false,
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
          "name": "Qwen3.6 35B-A3B Q4 + MTP",
          "reasoning": true,
          "input": ["text", "image"],
          "contextWindow": 65536,
          "maxTokens": 8192,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}

Gemma was the better fit for my latency target. Qwen is worth testing when you are happy to trade some speed for stronger coding performance.

Building a Fully Local Coding Agent on macOS