Set up self-hosted LLMs for local development

18 Jun 2026 | amdgpu llm

I’ve been gradually adding local LLMs to my daily workflow, and now I’ve got a setup that runs AI agents right inside Zed Editor for coding. This post walks through the whole stack — from GPU acceleration to editor integration — so you can set something similar up for yourself.

local-dev

Why local?

Running LLMs locally keeps your data private, saves you from API bills, and lets you try out any model whenever you want. With modern AMD GPUs and ROCm support in Ollama, the performance gap with cloud services is basically gone for most dev work. If you’re on NVIDIA, the setup is nearly identical — just swap the rocm tag for the standard Ollama image. I run Linux everywhere (PC and laptop alike), and this Docker/CLI setup works great.

Hardware

My setup is based on the same machine I used in my previous benchmark posts:

Motherboard: B550 AORUS ELITE AX V2
GPU: AMD Radeon AI PRO R9700 (32 GB VRAM, RDNA 4)
CPU: AMD Ryzen 9 5950X (16 cores / 32 threads)
RAM: 64 GB DDR4-2666
Storage: NVMe Samsung SSD 970 EVO Plus 1TB (x2)

Tools of the stack

Zed Editor

I switched from VS Code to Zed because it natively supports custom models for AI agents and edit predictions. The Zed Agent plays much nicer with a local Ollama backend than Copilot ever did — I’d get random errors in VS Code when Copilot tried to work with local models.

Zed also lets you bring your own edit prediction model, while VS Code locks you into their proprietary one unless you hunt down custom extensions. The coolest part is that Zed has its own Zeta model, which you can run locally. Instead of just autocomplete-style suggestions that append after your cursor (like qwen2.5-coder), it suggests actual diffs that rewrite parts of your code.

Ollama

For serving models, I went with Ollama instead of alternatives like LM Studio. The dealbreaker is that Zed’s thinking mode switcher only works with Ollama — not OpenAI-compatible APIs or LM Studio’s API. Ollama also has solid ROCm support, and I didn’t notice any performance difference compared to LM Studio. Plus, I’m comfortable with CLI tools and Docker, so managing everything from the terminal feels natural.

I did give LM Studio a shot, but it just doesn’t mesh well with Zed. Without thinking mode switcher, you can’t use one model for both chat/agents (thinking on) and inline transforms (thinking off). And their OpenAI-compatible API also had hiccups with Zed.

Playwright MCP

When I need the AI to actually use a browser — for testing or research — I hook up Playwright MCP. Config is shown below.

Models

I run a curated selection of models, each optimized for a specific purpose:

Qwen 3.6 27B (qwen3.6:27b-mtp-q4_K_M)
- Role: Main coding and AI agents
- Quantization: Q4_K_M
- Notes: Best quality among models of similar size. Uses MTP (Multi-Token Prediction) for a significant speed boost without losing quality.
Zeta 2.1 (hf.co/mradermacher/zeta-2.1-GGUF:Q2_K)
- Role: Code edit predictions in Zed
- Quantization: Q2_K
- Notes: Open-source model designed for Zed AI. Unlike simple autocompletion, it suggests diffs for your code instead of just appending text after your cursor. Lowest quantization was chosen for speed.

Docker Compose setup

Everything runs in Docker to keep things clean and reproducible. Here’s the docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:rocm
    restart: unless-stopped
    container_name: ollama
    tty: true
    devices:
      - /dev/kfd
      - /dev/dri
    ports:
      - 11434:11434
    volumes:
      - ollama:/root/.ollama
    environment:
      - HSA_OVERRIDE_GFX_VERSION=12.0.0
      - OLLAMA_KEEP_ALIVE=15m
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_CONTEXT_LENGTH=262144
      - OLLAMA_KV_CACHE_TYPE=q8_0
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_MAX_LOADED_MODELS=1
volumes:
  ollama:

Start it with docker compose up -d.

The following environment variables control Ollama’s behavior and resource usage:

Variable	Default	My Value	Purpose
`OLLAMA_CONTEXT_LENGTH`	`4096`	`262144`	Maximum context window.
`OLLAMA_FLASH_ATTENTION`	`0`	`1`	Reduces VRAM usage by 30–50% in context-heavy scenarios.
`OLLAMA_KV_CACHE_TYPE`	`f16`	`q8_0`	Key-Value cache quantization for storing context in memory.
`OLLAMA_NUM_PARALLEL`	`1`	`1`	Number of concurrent requests.
`OLLAMA_MAX_LOADED_MODELS`	`3`	`1`	Maximum distinct LLMs kept in memory simultaneously.
`OLLAMA_KEEP_ALIVE`	`5m`	`15m`	How long a model stays loaded after its last use before being unloaded.
`HSA_OVERRIDE_GFX_VERSION`	—	`12.0.0`	Required for some AMD GPUs to enable proper ROCm compatibility.

Set up models

To get the most out of these models for coding, you’ll want to tweak some defaults.

Qwen 3.6 27B

Pull the model from the Ollama library:

docker exec -it ollama ollama run qwen3.6:27b-mtp-q4_K_M

Then tweak parameters and save as qwen3.6:27b-code:

/set parameter num_ctx 131072
/set parameter temperature 0.1
/set parameter top_p 0.95
/set parameter top_k 20
/set parameter presence_penalty 0.5
/set parameter repeat_penalty 1.05
/save qwen3.6:27b-code

Zeta 2.1 8B

Pull from Hugging Face:

docker exec -it ollama ollama run hf.co/mradermacher/zeta-2.1-GGUF:Q2_K

Then save it as zeta-2.1:

/set parameter num_ctx 8192
/save zeta-2.1

Setting up Zed Editor

Next, wire up the models in Zed. Head to Menu -> Open Settings File and add these configs:

LLM Provider (Agent & Chat)

Enable Ollama provider:

{
  "language_models": {
    "ollama": {
      "api_url": "http://localhost:11434"
    }
  }
}

Inline assistant

Enable inline assistant without thinking:

{
  "agent": {
    "inline_assistant_model": {
      "provider": "ollama",
      "model": "qwen3.6:27b-code",
      "enable_thinking": false
    }
  }
}

Edit Predictions

For code edit suggestions powered by Zeta 2.1, you need to use an OpenAI Compatible API provider instead of the Ollama provider directly — this is a current limitation in Zed. This configuration enables real-time diff suggestions as you type, rather than simple completions appended after your cursor.

{
  "edit_predictions": {
    "provider": "open_ai_compatible_api",
    "mode": "eager",
    "open_ai_compatible_api": {
      "prompt_format": "zeta2_1",
      "model": "zeta-2.1",
      "api_url": "http://localhost:11434/v1/completions"
    }
  }
}

Playwright MCP

Sometimes I ask AI agents to do something using a web browser. In this case, you can let Zed access your browser directly through Playwright MCP.

{
  "context_servers": {
    "playwright": {
      "enabled": true,
      "remote": false,
      "command": "npx",
      "args": ["-y", "@playwright/mcp@latest"]
    }
  }
}

Note: Make sure you have Playwright installed — either in your project or globally.

Benchmark results

Here’s how everything benchmarks on my AMD Radeon AI PRO R9700 with ROCm 7.2:

Model	Size	Quant	MTP	Context	VRAM	Input	Output
Qwen 3.6	27B	Q4_K_M	No	131072	21 GB	168 t/s	25 t/s
Qwen 3.6	27B	Q4_K_M	Yes	131072	17 GB	127 t/s	28 t/s
Qwen 3.6	35B A3B	Q4_K_M	No	131072	26 GB	246 t/s	71 t/s
Qwen 3.6	35B A3B	Q4_K_M	Yes	131072	22 GB	195 t/s	74 t/s
Zeta 2.1	8B	Q2_K	No	8192	3.9 GB	655 t/s	95 t/s

MTP (Multi-Token Prediction) makes Qwen 3.6’s output speed slightly faster while reducing memory consumption. For day-to-day coding with the Zed Agent, the 27B MTP variant is a nice balance of quality and speed.

Conclusion

This whole setup gives me everything I need for AI-assisted dev without depending on any external service. Zed Editor + Ollama + ROCm handles coding, chat, and browsing — all running locally on one GPU. API cost? Zero. And since everything lives on localhost, latency is basically nonexistent.

If you want to go deeper into the performance side of this setup, check out my previous posts: LLM performance on AMD Radeon AI PRO R9700 and LLM performance with ROCm 7.x vs 6.x.

Meefik's Blog

Freedom and Open Source

Set up self-hosted LLMs for local development

Why local?

Hardware

Tools of the stack

Zed Editor

Ollama

Playwright MCP

Models

Docker Compose setup

Set up models

Qwen 3.6 27B

Zeta 2.1 8B

Setting up Zed Editor

LLM Provider (Agent & Chat)

Inline assistant

Edit Predictions

Playwright MCP

Benchmark results

Conclusion

Comments