local models in 5 minutes

Dispatch is local-first: your models do the cheap work for $0, frontier only on hard turns. Fastest path is Ollama; vLLM/Docker for production.

1 · install + pull (Ollama)

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:3b-instruct # tool-calling proven 2/2, $0

2 · recommended stack (per route)

code qwen2.5-coder:7b · deepseek-coder-v2 · (heavy) qwen2.5-coder:32b

summarize / direct qwen2.5:3b-instruct · llama3.2:3b · gemma2:2b

reason (harder) qwen2.5:7b · llama3.1:8b · phi-4 · gemma2:9b — else escalate

tool-calling qwen2.5:3b-instruct (validated 2/2) · command-r

search / RAG qwen2.5:3b + bge-base embeddings (same model as the edge)

Any OpenAI-compatible server works: Ollama, vLLM, LM Studio, TGI.

3 · production: vLLM (throughput) or Docker

pip install vllm
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000 --api-key $LOCAL_KEY # OpenAI-compatible at :8000/v1

docker run -d --gpus all -p 11434:11434 -v ollama:/root/.ollama ollama/ollama
docker exec -it <id> ollama pull qwen2.5:3b-instruct

4 · troubleshooting

unreachable endpoint bind 0.0.0.0 (OLLAMA_HOST=0.0.0.0:11434) + open the port behind your VPN — never keyless on the public internet.

OOM / out of memory smaller quant (q4_K_M), OLLAMA_NUM_PARALLEL=1, or a smaller model; vLLM: lower --gpu-memory-utilization.

slow first token keep it warm: OLLAMA_KEEP_ALIVE=-1 — cold loads pay the weight-load once.

CUDA not found check nvidia-smi in-container; Docker needs --gpus all + nvidia-container-toolkit.

auth set an API key on the local server, hand it to dispatch as your backend secret.

5 · run the proxy, point your agent at it

pip install wave-dispatch
WAVE_LICENSE=wv_... dispatch serve # OpenAI-compatible proxy on :8090

Then set your agent's base URL: OPENAI_BASE_URL=http://localhost:8090/v1. Easy + tool turns route to your local models ($0); hard turns escalate to your frontier key. See /integrate.

let your frontier agent set it up — paste this into Claude / Grok / Gemini / GPT

You are my AI infrastructure agent. Install Ollama (or vLLM for production), pull qwen2.5:3b-instruct plus a coder (qwen2.5-coder) and a 7B+ reasoner (qwen2.5:7b or llama3.1:8b), expose them OpenAI-compatibly bound to my VPN with an API key, then configure wave Dispatch (dispatch.wave.online, see /integrate) to route to my local pool first and escalate to my frontier key only on low confidence. Give exact copy-paste commands for my OS and a memory-safe quant for my GPU.

More: benchmarks · agent context · quickstart