Dispatch is local-first: your models do the cheap work for $0, frontier only on hard turns. Fastest path is Ollama; vLLM/Docker for production.
curl -fsSL https://ollama.com/install.sh | shollama pull qwen2.5:3b-instruct # tool-calling proven 2/2, $0Any OpenAI-compatible server works: Ollama, vLLM, LM Studio, TGI.
pip install vllmvllm serve Qwen/Qwen2.5-7B-Instruct --port 8000 --api-key $LOCAL_KEY # OpenAI-compatible at :8000/v1docker run -d --gpus all -p 11434:11434 -v ollama:/root/.ollama ollama/ollamadocker exec -it <id> ollama pull qwen2.5:3b-instructOLLAMA_HOST=0.0.0.0:11434) + open the port behind your VPN — never keyless on the public internet.OLLAMA_NUM_PARALLEL=1, or a smaller model; vLLM: lower --gpu-memory-utilization.OLLAMA_KEEP_ALIVE=-1 — cold loads pay the weight-load once.nvidia-smi in-container; Docker needs --gpus all + nvidia-container-toolkit.pip install wave-dispatchWAVE_LICENSE=wv_... dispatch serve # OpenAI-compatible proxy on :8090OPENAI_BASE_URL=http://localhost:8090/v1. Easy + tool turns route to your local models ($0); hard turns escalate to your frontier key. See /integrate.You are my AI infrastructure agent. Install Ollama (or vLLM for production), pull qwen2.5:3b-instruct plus a coder (qwen2.5-coder) and a 7B+ reasoner (qwen2.5:7b or llama3.1:8b), expose them OpenAI-compatibly bound to my VPN with an API key, then configure wave Dispatch (dispatch.wave.online, see /integrate) to route to my local pool first and escalate to my frontier key only on low confidence. Give exact copy-paste commands for my OS and a memory-safe quant for my GPU.More: benchmarks · agent context · quickstart