Choosing a model

Pick the largest coding model that fits comfortably in your GPU's VRAM — bigger reasons better, but must fit alongside its context window. These are Ollama tags.

By VRAM

GPU VRAM	Recommended model	Notes
8 GB	`qwen2.5-coder:7b`	Great default — fast, capable. Also `llama3.1:8b` for general work.
12–16 GB	`qwen2.5-coder:14b`	Stronger reasoning. `deepseek-coder-v2:16b` (MoE) is a strong alternative.
24 GB+	`qwen2.5-coder:32b`	Best local quality.

Rule of thumb: a quantised model needs roughly its parameter count in GB (a 14B ≈ 9–10 GB) plus headroom for context. Leave a couple of GB free.

Frontier models (optional)

You can also point an agent at a hosted frontier model (e.g. Claude) for the hard parts. That uses an API and needs the key stored as a project secret in Keikaku — never baked into the agent. (Local models keep everything on your hardware; a frontier model sends prompts to that provider.)

How the model gets pulled

Whatever model you set as MODEL, the agent ensures it's present on first use — if Ollama doesn't have it yet, it's pulled automatically (the first task waits for the download). You can also pull ahead of time:

ollama pull qwen2.5-coder:14b

No GPU? CPU fallback

Ollama runs on CPU if there's no GPU — fine for trying Keikaku out, but expect it to be much slower (small models like qwen2.5-coder:7b only). For real throughput, use a GPU.

Not sure what to pick? The benchmark measures real models against your GPU and recommends one — plus a setup code that prefills the create-agent wizard. Otherwise, start one size down from your VRAM ceiling and move up if there's room.

Create an agent →