Building a Fully Local Coding Agent on macOS

Running Gemma 4 and Qwen3.6 locally with llama.cpp, MTP speculative decoding, image support, and Pi.

Jun 13, 2026

A cloud coding agent is great until the connection disappears. I wanted a fallback that could run entirely on my Mac, expose an OpenAI-compatible API, and still understand screenshots when I needed to show it a UI.

The setup I landed on uses:

  • llama.cpp with Metal acceleration
  • Gemma 4 26B-A4B in GGUF format
  • a Q8 MTP draft model for speculative decoding
  • the Gemma 4 multimodal projector
  • Pi as the terminal coding agent

I put this setup together on an Apple M1 MacBook Air with 8 GB of unified memory. The 26B model files described below need more memory to run comfortably, so the benchmark figures should be treated as reference results for higher-memory Apple silicon.

Choosing the model

The main model was gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf, which is roughly 16 GB. The full folder comes to around 17 GB after adding the MTP draft head and multimodal projector.

For a repeatable speed test, I used this prompt and capped each response at about 128 tokens:

Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.

Baseline: llama.cpp with Metal

I started by running the main model directly:

repos/llama.cpp/build/bin/llama-cli \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  -ngl 999 \
  -fa on \
  -c 4096 \
  -n 128
SetupPrompt tok/sGeneration tok/s
Gemma 4 26B-A4B Q4, llama.cpp Metal298.058.2

About 58 generated tokens per second is usable, but agent loops involve many responses and tool calls. Small speed gains add up quickly.

Adding MTP speculative decoding

Gemma 4 includes an MTP draft model at:

MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf

Load it alongside the main model:

repos/llama.cpp/build/bin/llama-cli \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 \
  -fa on \
  -c 4096 \
  -n 128

The number of draft tokens is hardware-dependent, so I swept --spec-draft-n-max from 1 through 6:

Draft tokensPrompt tok/sGeneration tok/s
1295.568.4
2299.172.0
3295.672.2
4297.370.7
5297.963.7
6296.361.2

Three draft tokens produced the best result on this machine, with two close behind. Larger values actually slowed generation down.

SetupPrompt tok/sGeneration tok/sSpeedup
Main model only298.058.21.00x
Main model + Q8 MTP draft295.672.21.24x

Prompt processing remained effectively unchanged, while generation improved by about 24%.

llama.cpp versus MLX

I also compared several MLX builds to see whether Apple’s native framework would be faster for this model:

RuntimeModelGeneration tok/s
llama.cpp Metal + MTPUnsloth GGUF Q4 + Q8 MTP72.2
llama.cpp MetalUnsloth GGUF Q458.2
MLX-LMUnsloth UD MLX 4-bit45.8
MLX-LMmlx-community 4-bit43.9
MLX-LMmlx-community OptiQ 4-bit38.1

For this exact combination of model and hardware, llama.cpp won. The MTP-assisted version was comfortably ahead of every MLX checkpoint I tested.

I briefly tried gemma-4-swift-mlx as well, but the available 26B 4-bit checkpoints did not match the loader’s expected weight keys. Since the other MLX results already answered the performance question, I moved on.

Adding image support

Pi originally treated my local model as text-only because its configuration contained:

"input": ["text"]

That prevents image tool output from reaching the model. Pi needs the model declared with both input types:

"input": ["text", "image"]

The llama.cpp server also needs Gemma’s multimodal projector:

mmproj-BF16.gguf

Loading it with --mmproj makes the server advertise multimodal support. A repeat of the text benchmark showed no meaningful generation slowdown:

SetupProjectorPrompt tok/sGeneration tok/s
llama.cpp Metal + MTPnone120.371.4
llama.cpp Metal + MTPmmproj-BF16.gguf297.472.2

Install llama.cpp

Install the build dependencies:

brew install cmake git tmux python@3.11

Create a workspace, clone the repository, and build with Metal and Accelerate:

mkdir -p ~/Developer/ML-Models/Gemma4/repos
cd ~/Developer/ML-Models/Gemma4
 
git clone https://github.com/ggml-org/llama.cpp repos/llama.cpp
 
cd repos/llama.cpp
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_METAL=ON \
  -DGGML_ACCELERATE=ON
 
cmake --build build --config Release -j

The tested build reported GGML_METAL=ON, GGML_ACCELERATE=ON, GGML_BLAS=ON, and GGML_BLAS_VENDOR=Apple.

Download the Gemma model files

Set up a small Python environment for the Hugging Face CLI:

cd ~/Developer/ML-Models/Gemma4
python3.11 -m venv .venv
source .venv/bin/activate
pip install -U huggingface_hub hf_xet

Download the model, draft model, and projector:

mkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF
 
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
  --local-dir models/unsloth-gemma-4-26B-A4B-it-GGUF

The folder should look like this:

models/unsloth-gemma-4-26B-A4B-it-GGUF/
  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
  mmproj-BF16.gguf
  MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf

Start the local server

Here is the final command:

repos/llama.cpp/build/bin/llama-server \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
  --mmproj models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 \
  -fa on \
  -c 65536 \
  --parallel 1 \
  --host 127.0.0.1 \
  --port 8080

The OpenAI-compatible endpoint will be available at http://127.0.0.1:8080/v1.

I use a start_server.sh wrapper to keep it running in tmux:

#!/usr/bin/env bash
set -euo pipefail
 
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SESSION_NAME="${SESSION_NAME:-gemma4-server}"
HOST="${HOST:-127.0.0.1}"
PORT="${PORT:-8080}"
CTX_SIZE="${CTX_SIZE:-65536}"
PARALLEL="${PARALLEL:-1}"
 
LLAMA_SERVER="$ROOT_DIR/repos/llama.cpp/build/bin/llama-server"
MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf"
DRAFT_MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf"
MMPROJ="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf"
LOG_FILE="$ROOT_DIR/logs/llama-server-mtp.log"
 
mkdir -p "$ROOT_DIR/logs"
 
tmux new-session -d -s "$SESSION_NAME" -c "$ROOT_DIR" \
  "$LLAMA_SERVER \
    -m '$MODEL' \
    --model-draft '$DRAFT_MODEL' \
    --mmproj '$MMPROJ' \
    --spec-type draft-mtp \
    --spec-draft-n-max 3 \
    -ngl 999 \
    -fa on \
    -c '$CTX_SIZE' \
    --parallel '$PARALLEL' \
    --host '$HOST' \
    --port '$PORT' \
    2>&1 | tee -a '$LOG_FILE'"

Start it and confirm the API is responding:

chmod +x start_server.sh
./start_server.sh
curl http://127.0.0.1:8080/v1/models

Configure Pi

Pi reads custom model providers from ~/.pi/agent/models.json. Add a local provider:

{
  "providers": {
    "gemma4-local": {
      "name": "Gemma 4 Local",
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "authHeader": false,
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
          "name": "Gemma 4 26B-A4B Q4 + MTP",
          "reasoning": false,
          "input": ["text", "image"],
          "contextWindow": 65536,
          "maxTokens": 8192,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}

The important parts are:

  • baseUrl points to the local llama.cpp server.
  • api uses OpenAI-compatible completions.
  • authHeader is disabled because the server is local.
  • input includes text and images.

You can also make it the default in ~/.pi/agent/settings.json:

{
  "defaultProvider": "gemma4-local",
  "defaultModel": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
  "defaultThinkingLevel": "minimal"
}

Check that Pi can find the model:

pi --offline --list-models gemma

Then launch it interactively:

pi --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

Or run a one-off prompt:

pi -p --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  "Explain what this repository does"

Screenshots work the same way:

pi -p @"/path/to/screenshot.png" \
  "Describe this image and point out anything relevant to the UI"

The finished stack

LayerChoice
Inference runtimellama.cpp
macOS accelerationMetal + Accelerate
Main modelgemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
Draft modelgemma-4-26B-A4B-it-Q8_0-MTP.gguf
MTP setting--spec-draft-n-max 3
Multimodal projectormmproj-BF16.gguf
Serverllama-server on 127.0.0.1:8080
APIOpenAI-compatible /v1
Coding agentPi
Pi model input["text", "image"]

The MTP draft model is the part that made this setup feel practical. On the tested machine, it moved Gemma 4 from 58.2 to 72.2 generated tokens per second without making the server configuration much more complicated.

Qwen3.6 as an alternative

Qwen3.6 35B-A3B is another strong option, especially when coding quality matters more than raw generation speed. In this test it generated about 55 tokens per second, compared with Gemma’s 72 tokens per second.

Download the Qwen files:

mkdir -p models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF
 
huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF \
  Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  --local-dir models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF

Start a second server on port 8081:

LLAMA_SERVER="$HOME/Developer/ML-Models/Gemma4/repos/llama.cpp/build/bin/llama-server"
 
"$LLAMA_SERVER" \
  -m models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --mmproj models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 \
  -fa on \
  -c 65536 \
  --parallel 1 \
  --host 127.0.0.1 \
  --port 8081

Add the corresponding Pi provider:

{
  "providers": {
    "qwen36-local": {
      "name": "Qwen3.6 Local",
      "baseUrl": "http://127.0.0.1:8081/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "authHeader": false,
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
          "name": "Qwen3.6 35B-A3B Q4 + MTP",
          "reasoning": true,
          "input": ["text", "image"],
          "contextWindow": 65536,
          "maxTokens": 8192,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}

Gemma was the better fit for my latency target. Qwen is worth testing when you are happy to trade some speed for stronger coding performance.

References