LLM Backends Reference
kdeps separates two concerns: which model to call (set in the resource file) and where to call it (set in ~/.kdeps/config.yaml). This lets you switch backends without touching your workflow.
The default: llamafile (file backend)
With no configuration at all, chat resources run on the file backend: the model is a llamafile - a single self-contained binary that kdeps downloads, caches in ~/.kdeps/models/, and serves as a local OpenAI-compatible server. No ollama, no GPU, no API key.
chat resource --> kdeps resolves model alias --> downloads llamafile (once)
--> serves it locally
--> request answeredKnown model aliases map to Mozilla's HuggingFace llamafiles. Quantization is part of the alias so you can trade size for quality:
| Alias | Model | Quant | Size |
|---|---|---|---|
llama3.2 / llama3.2:1b | Llama 3.2 1B Instruct | Q4_K_M | ~1.1 GB |
llama3.2:1b-q6 | Llama 3.2 1B Instruct | Q6_K | ~1.5 GB |
llama3.2:1b-q8 | Llama 3.2 1B Instruct | Q8_0 | ~2.1 GB |
llama3.2:3b | Llama 3.2 3B Instruct | Q4_K_M | ~2.2 GB |
llama3.1:8b | Llama 3.1 8B Instruct | Q4_K_M | ~5.2 GB |
kdeps llamafile list # all known aliases (the registry has 100+ models)
kdeps llamafile update # refresh the registry from HuggingFaceThe chat.model field also accepts a direct URL, an absolute/relative path to a .llamafile, or a bare filename looked up in ~/.kdeps/models/.
GGUF backend (llama.cpp)
The gguf backend serves GGUF model files via llama-server (llama.cpp). Same download-and-cache flow as file, but requires llama-server installed separately.
# ~/.kdeps/config.yaml
llm:
backend: ggufchat resource --> kdeps resolves GGUF alias --> downloads .gguf (once)
--> starts llama-server
--> request answeredKnown aliases: qwen3.5-4b, qwen3.5-8b, llama3.2-3b, llama3.1-8b, phi4-mini, gemma3-4b, mistral-7b, deepseek-r1-7b. The chat.model field also accepts a direct URL, absolute/relative path to a .gguf, or a bare filename in ~/.kdeps/models/.
Environment overrides: KDEPS_LLAMA_SERVER_BIN (binary path), KDEPS_GGUF_CTX_SIZE (context window).
Where it runs
Backend configuration applies to both workflow mode and agent mode. All chat: resources in both modes resolve their backend from ~/.kdeps/config.yaml.
Model configuration
Model is set per resource in chat.model:
# resources/my-resource.yaml
chat:
model: llama3.2:1b # which model to call
role: user
prompt: "{{ get('q') }}"Set model: router to delegate model selection to the router in ~/.kdeps/config.yaml (see Routing below).
Backend, base URL, and API keys go in ~/.kdeps/config.yaml:
# ~/.kdeps/config.yaml
llm:
backend: file # default: local llamafile, no server install
# backend: ollama # opt-in: requires the ollama server
# base_url: http://localhost:11434
# openai_api_key: sk-...
# anthropic_api_key: sk-ant-...
# groq_api_key: ...Run kdeps edit to open the config file, or edit it directly.
Unified Models List
llm.models in ~/.kdeps/config.yaml serves dual purpose: it can act as a plain allowlist (model names only) or as a router route table (with routing metadata). The llm.strategy field switches between the two modes.
Allowlist Mode (no strategy)
When strategy is absent, llm.models is a simple list of permitted model names:
# resources/example.yaml
llm:
backend: ollama
models:
- llama3.2:1b # plain string entry
- nomic-embed-textEach entry is a plain model name. Models can be specified as strings (as above) or as objects with only the model field set:
# resources/example.yaml
llm:
models:
- model: llama3.2:1b # object form (equivalent to "llama3.2:1b")Any request for a model not in this list is overridden to the first model and a warning is logged. Models listed here are pre-pulled into Docker/ISO artifacts.
Routing Mode (with strategy)
When strategy is set, the models list acts as router routes:
# resources/example.yaml
llm:
strategy: token_threshold
models:
- model: gpt-4o-mini
backend: openai
max_tokens: 500
default: true
- model: gpt-4o
backend: openai
min_tokens: 501Plain string entries in routing mode (no model: key) are still allowed — they inherit the default llm.backend:
# resources/example.yaml
llm:
backend: ollama
strategy: fallback
models:
- llama3.2:1b # plain string, uses backend: ollama
- model: gpt-4o
backend: openai
priority: 1Entry Fields
Each model entry supports these fields:
| Field | Type | Required | Description |
|---|---|---|---|
model | string | yes | Model identifier (e.g. gpt-4o, llama3.2:1b) |
backend | string | no | Backend for this model (overrides llm.backend) |
base_url | string | no | Custom API URL for this backend |
priority | int | no | Fallback order (lower = tried first) |
min_tokens | int | no | Minimum prompt tokens for token_threshold |
max_tokens | int | no | Maximum prompt tokens for token_threshold |
cost_per_input_token | float | no | Cost per 1K input tokens for cost_optimized |
cost_per_output_token | float | no | Cost per 1K output tokens for cost_optimized |
default | bool | no | Catch-all route when no other rule matches |
Routing
Routing delegates model selection from resource YAML to the config. Set a resource's model field to router:
# resources/llm.yaml
chat:
model: router # delegate to config.yaml router
role: user
prompt: "{{ get('q') }}"The router in ~/.kdeps/config.yaml selects which model to use based on the configured strategy.
Strategy: token_threshold
Routes by estimated prompt token count. The first entry where min_tokens <= tokens <= max_tokens wins. Falls through to the entry with default: true when no range matches.
# resources/example.yaml
llm:
strategy: token_threshold
models:
- model: gpt-4o-mini
backend: openai
max_tokens: 500 # short prompts use this
default: true
- model: gpt-4o
backend: openai
min_tokens: 501 # long prompts use thisToken counts are estimated using tiktoken.
Strategy: fallback
Tries routes in priority order. On error, automatically retries the next route.
# resources/example.yaml
llm:
strategy: fallback
models:
- model: claude-sonnet-4-20250514
backend: anthropic
priority: 1
- model: gpt-4o
backend: openai
priority: 2
- model: llama3.2:1b
backend: ollama
priority: 3
default: trueLower priority values are tried first. default: true marks the catch-all route.
Strategy: cost_optimized
Selects the cheapest route based on cost per 1K input tokens.
# resources/example.yaml
llm:
strategy: cost_optimized
models:
- model: gpt-4o-mini
backend: openai
cost_per_input_token: 0.00015 # $0.15/1M tokens
- model: gpt-4o
backend: openai
cost_per_input_token: 0.0025 # $2.50/1M tokens
default: trueNil cost is treated as zero. Falls to default: true on tie.
Strategy: round_robin
Distributes requests evenly across models using an atomic counter.
# resources/example.yaml
llm:
strategy: round_robin
models:
- model: gpt-4o
backend: openai
- model: claude-sonnet-4-20250514
backend: anthropicCounters are keyed by a fingerprint of the model list, so different route configs maintain independent counters.
Supported Backends
kdeps supports local backends (Llamafile, GGUF/llama.cpp, Ollama) and any OpenAI-compatible API: OpenAI, Anthropic, Google, Mistral, Groq, Together AI, Perplexity, Cohere, DeepSeek, xAI (Grok), OpenRouter, and self-hosted solutions (vLLM, TGI, LocalAI, LlamaCpp). See LLM Provider Reference for per-provider config snippets and available model names.
Streaming (Ollama)
Set streaming: true on a chat: resource to have Ollama stream the response as NDJSON chunks. KDeps accumulates all chunks internally and returns the same response shape as a non-streaming call.
# resources/example.yaml
chat:
prompt: "{{ get('q') }}"
streaming: true # Ollama onlystreaming | What happens |
|---|---|
false (default) | Single JSON response |
true | Ollama streams NDJSON; KDeps accumulates and returns merged map |
streaming: true is silently ignored for non-Ollama backends.
Feature Support
| Feature | Ollama | OpenAI | Anthropic | Mistral | Groq | |
|---|---|---|---|---|---|---|
| JSON Response | Yes | Yes | Partial | Yes | Yes | Yes |
| Tools/Functions | Yes | Yes | No | Yes | Yes | Yes |
| Vision | Yes* | Yes | Yes | Yes | Yes | Yes |
| Streaming | Yes | No** | No** | No** | No** | No** |
*Requires vision-capable model (e.g., llama3.2-vision) **Streaming is only supported for the Ollama backend.
Troubleshooting
Ollama Connection Issues
If Ollama cannot be reached:
- Check Ollama is running:
ollama list - Verify the URL in config.yaml (default:
http://localhost:11434) - Check firewall settings
API Key Issues
If you get authentication errors:
- Verify the key is set in
~/.kdeps/config.yaml - Or export the env var:
export OPENAI_API_KEY=sk-... - Check the key has the correct permissions
Model Not Found
If the model is not available:
- For Ollama: Pull the model first with
ollama pull model-name - For APIs: Verify the model name matches the provider's documentation
- Check you have access to the model in your API account
Rate Limiting
Handle rate limits with retry configuration via onError:
# resources/example.yaml
chat:
prompt: "{{ get('q') }}"
onError:
action: "retry"
maxRetries: 3
retryDelay: "5s"See Also
- LLM Provider Reference - Per-provider config snippets and model names
- LLM Resource - Complete LLM resource documentation
- Tools - LLM function calling
- Docker Deployment - Deploying with local models
