Choosing a model
Pick the largest coding model that fits comfortably in your GPU's VRAM. Larger models reason better, but the weights have to fit in VRAM together with the context window. The model names below are Ollama tags.
By VRAM
| GPU VRAM | Recommended model | Notes |
|---|---|---|
| 8 GB | qwen2.5-coder:7b | Great default — fast, capable. Also llama3.1:8b for general work. |
| 12–16 GB | qwen2.5-coder:14b | Stronger reasoning. deepseek-coder-v2:16b (MoE) is a strong alternative. |
| 24 GB+ | qwen2.5-coder:32b | Best local quality. |
Rule of thumb: at the 4-bit quantisation that Ollama tags typically default to, a model needs roughly 0.6–0.7 GB of VRAM per billion parameters (a 14B ≈ 9–10 GB), plus headroom for context. Leave a couple of GB free.
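The arithmetic behind that rule of thumb can be sketched in a few lines. This is only an estimate under stated assumptions: the 0.65 GB per billion parameters constant is chosen to match the 14B ≈ 9–10 GB example, the 2 GB headroom stands in for "a couple of GB", and the function names are illustrative rather than part of Keikaku or Ollama.

```python
# Rough VRAM estimate for a quantised model. Both constants are assumptions,
# not measurements: real usage varies with quantisation and context length.
GB_PER_BILLION_PARAMS = 0.65  # assumed, consistent with 14B at about 9-10 GB
HEADROOM_GB = 2.0             # assumed "couple of GB" for context and overhead


def estimate_vram_gb(params_billion: float,
                     gb_per_billion: float = GB_PER_BILLION_PARAMS,
                     headroom_gb: float = HEADROOM_GB) -> float:
    """Approximate VRAM requirement in GB: weights plus headroom."""
    return params_billion * gb_per_billion + headroom_gb


def fits(params_billion: float, vram_gb: float) -> bool:
    """True if the estimated requirement is within the available VRAM."""
    return estimate_vram_gb(params_billion) <= vram_gb


if __name__ == "__main__":
    for size in (7, 14, 32):
        print(f"{size}B: ~{estimate_vram_gb(size):.1f} GB including headroom")
```

With these constants, `fits(14, 12)` is true and `fits(14, 8)` is false, which lines up with the table's tiers. Treat the result as a starting point and confirm against actual memory use.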
Frontier models (optional)
You can also point an agent at a hosted frontier model (e.g. Claude) for the hard parts. That route goes through the provider's API, so the API key must be stored as a project secret in Keikaku, never baked into the agent. Local models keep everything on your hardware; a frontier model sends prompts to that provider.
How the model gets pulled
Whatever model you set as MODEL, the agent ensures it's present on first use —
if Ollama doesn't have it yet, it's pulled automatically (the first task waits for the
download). You can also pull ahead of time:
ollama pull qwen2.5-coder:14b
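If you want to script the ahead-of-time pull, a check-then-pull step might look like the sketch below. It is not the agent's own implementation, which is not shown here. It shells out to the `ollama list` and `ollama pull` commands and assumes `ollama` is on your PATH. The helper names `parse_ollama_list` and `ensure_model` are hypothetical.

```python
import subprocess


def parse_ollama_list(output: str) -> set[str]:
    """Extract model tags from `ollama list` output.

    Assumes the first line is a header and the first column is the tag.
    """
    tags = set()
    for line in output.splitlines()[1:]:
        fields = line.split()
        if fields:
            tags.add(fields[0])
    return tags


def ensure_model(tag: str) -> bool:
    """Pull `tag` if it is not listed locally. Returns True if a pull ran.

    Use the full tag (for example "qwen2.5-coder:14b"): a bare name is
    listed by Ollama with a ":latest" suffix and would not match here.
    """
    listed = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    )
    if tag in parse_ollama_list(listed.stdout):
        return False
    subprocess.run(["ollama", "pull", tag], check=True)
    return True
```

Calling `ensure_model("qwen2.5-coder:14b")` during machine setup moves the download wait out of the first task.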
No GPU? CPU fallback
Ollama runs on CPU if there's no GPU. That is fine for trying Keikaku out, but expect it to be
much slower, and stick to small models such as qwen2.5-coder:7b. For real throughput,
use a GPU.
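To decide in a script whether you are on the GPU path or the CPU fallback, one option is to query `nvidia-smi` for total memory. The sketch below assumes an NVIDIA card. Other vendors need a different probe, and the function names are illustrative only.

```python
import shutil
import subprocess
from typing import Optional

CPU_FALLBACK_MODEL = "qwen2.5-coder:7b"  # the small model suggested for CPU use


def parse_vram_gb(smi_output: str) -> Optional[float]:
    """Largest GPU's total memory in GB, from nvidia-smi CSV output in MiB."""
    values = []
    for line in smi_output.splitlines():
        try:
            values.append(float(line))
        except ValueError:
            continue  # skip blank lines or non-numeric entries
    return max(values) / 1024 if values else None


def detect_vram_gb() -> Optional[float]:
    """Return VRAM in GB, or None when no NVIDIA GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return parse_vram_gb(result.stdout)
```

If `detect_vram_gb()` returns `None`, fall back to `CPU_FALLBACK_MODEL` and expect slow responses.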
Not sure what to pick? The benchmark measures real models against your GPU and recommends one, along with a setup code that prefills the create-agent wizard. Otherwise, start one size down from your VRAM ceiling and move up if there's room.
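The "one size down" advice can be written as a small lookup over the table above. This is a sketch, not the benchmark's logic. The tier boundaries are copied from the table, and returning the smallest model below 8 GB is an assumption, since the table does not cover that case.

```python
# (minimum VRAM in GB, Ollama tag), smallest first, copied from the table.
TIERS = [
    (8, "qwen2.5-coder:7b"),
    (12, "qwen2.5-coder:14b"),
    (24, "qwen2.5-coder:32b"),
]


def recommend(vram_gb: float, one_size_down: bool = True) -> str:
    """Pick a tag for the given VRAM, optionally one tier below the ceiling."""
    # Index of the largest tier whose minimum the card reaches (0 if none).
    ceiling = 0
    for i, (min_gb, _tag) in enumerate(TIERS):
        if vram_gb >= min_gb:
            ceiling = i
    index = max(ceiling - 1, 0) if one_size_down else ceiling
    return TIERS[index][1]
```

For a 24 GB card, `recommend(24)` returns the 14b tag as the cautious start, and `recommend(24, one_size_down=False)` returns the 32b tag once you have confirmed there is room.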