Self-Hosted Models
Run Atlas with local inference servers — Ollama, vLLM, and TGI. Model selection, hardware requirements, and troubleshooting.
Self-Hosted Only
This guide is for operators running their own Atlas instance who want to use local inference servers instead of cloud LLM providers. On app.useatlas.dev, the LLM provider is managed by the Atlas platform — no model hosting is required.
Atlas works with any OpenAI-compatible inference server. This guide covers setting up Ollama, vLLM, and TGI, choosing the right model, and troubleshooting common issues.
Atlas requires models with tool calling (function calling) support. The agent loop depends on executeSQL and explore tools — models without tool calling cannot run Atlas queries. See the compatibility matrix for tested models.
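Concretely, "tool calling support" means the model can emit an OpenAI-style function call. As a sketch of what that format looks like, here is an illustrative tool definition using the executeSQL name from this guide (the description and parameter shape are assumptions for illustration, not Atlas's actual internal schema):

```json
{
  "type": "function",
  "function": {
    "name": "executeSQL",
    "description": "Run a SQL query against the connected database (illustrative)",
    "parameters": {
      "type": "object",
      "properties": {
        "sql": { "type": "string", "description": "The SQL statement to execute" }
      },
      "required": ["sql"]
    }
  }
}
```

A model with working tool calling responds to such a definition with a structured tool_calls entry containing JSON arguments, rather than describing the query in free text.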
Quick Start
The fastest way to run Atlas with a local model:
```bash
# 1. Install and start Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b

# 2. Configure Atlas
ATLAS_PROVIDER=ollama
ATLAS_MODEL=llama3.1:8b
OLLAMA_BASE_URL=http://localhost:11434/v1

# 3. Start Atlas
bun run dev
```

Or use Docker Compose for a fully containerized setup:
```bash
# From repo root — starts Atlas + Postgres + Ollama
docker compose -f examples/docker/docker-compose.ollama.yml up
```

Providers
Atlas supports two provider modes for self-hosted models:
ollama — Ollama preset
Preconfigured for Ollama's default endpoint. No API key needed.
```bash
ATLAS_PROVIDER=ollama
ATLAS_MODEL=llama3.1:8b
# Optional: override if Ollama is on a different host
OLLAMA_BASE_URL=http://localhost:11434/v1
```

openai-compatible — Any OpenAI-compatible server
Works with vLLM, TGI, LiteLLM, LocalAI, and any server that implements the OpenAI Chat Completions API with tool calling.
```bash
ATLAS_PROVIDER=openai-compatible
ATLAS_MODEL=llama3.1                                 # Model name as served by your server
OPENAI_COMPATIBLE_BASE_URL=http://localhost:8000/v1  # Required
# Optional: API key if your server requires one
OPENAI_COMPATIBLE_API_KEY=your-key
```

ATLAS_MODEL is required for openai-compatible — there is no default. Set it to the model name as reported by your server's /v1/models endpoint.
Inference Servers
Ollama
The easiest way to run models locally. Handles model downloading, quantization, and GPU management automatically.
```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (downloads ~4.7 GB for 8B Q4)
ollama pull llama3.1:8b

# Verify it's running
curl http://localhost:11434/api/tags
```

Pros: Simple setup, automatic GPU detection, built-in model management, good for development. Cons: Lower throughput than vLLM even with continuous batching, limited serving options.
Atlas config:
```bash
ATLAS_PROVIDER=ollama
ATLAS_MODEL=llama3.1:8b
```

vLLM
High-throughput serving with continuous batching. Best for production self-hosted deployments.
```bash
# Install
pip install vllm

# Serve with tool calling enabled (required for Atlas)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name llama3.1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192

# Verify
curl http://localhost:8000/v1/models
```

Pros: Highest throughput (continuous batching, PagedAttention), production-grade, tensor parallelism for multi-GPU. Cons: Requires an NVIDIA GPU, longer startup (model loading), more complex configuration.
vLLM requires --enable-auto-tool-choice and a --tool-call-parser for Atlas to work. Without these flags, tool calls will fail silently or return malformed responses.
Atlas config:
```bash
ATLAS_PROVIDER=openai-compatible
ATLAS_MODEL=llama3.1
OPENAI_COMPATIBLE_BASE_URL=http://localhost:8000/v1
```

Text Generation Inference (TGI)
Hugging Face's inference server. Good middle ground between Ollama and vLLM.
```bash
# Run with Docker (recommended)
docker run --gpus all -p 8080:80 \
  -v tgi_data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-input-tokens 4096 \
  --max-total-tokens 8192

# Verify
curl http://localhost:8080/v1/models
```

Pros: Good throughput, Hugging Face ecosystem integration, Flash Attention support. Cons: Tool calling support varies by model — not all models work reliably. Check the compatibility matrix.
Atlas config:
```bash
ATLAS_PROVIDER=openai-compatible
ATLAS_MODEL=meta-llama/Llama-3.1-8B-Instruct
OPENAI_COMPATIBLE_BASE_URL=http://localhost:8080/v1
```

Model Selection
Which model should I use?
Atlas needs models that can:
- Call tools reliably — generate structured JSON for `executeSQL` and `explore` tool calls
- Write SQL — translate natural language to correct SQL for your schema
- Follow system prompts — respect the semantic layer context injected into the system prompt
Not all models do this well. Larger models are significantly better at tool calling and SQL generation.
Recommended Models
| Model | Parameters | Quality | Speed | Best For |
|---|---|---|---|---|
| Llama 3.1 70B | 70B | High | Moderate | Production self-hosted — best quality-to-cost ratio |
| Qwen 2.5 72B | 72B | High | Moderate | Production — strong tool calling and multilingual SQL |
| Mistral Large | 123B | Very High | Slow | Maximum quality when latency is acceptable |
| Llama 3.1 8B | 8B | Moderate | Fast | Development and testing — quick iteration |
| Qwen 2.5 7B | 7B | Moderate | Fast | Development — good tool calling for its size |
| Mistral 7B | 7B | Low | Fast | Not recommended — unreliable tool calling |
| DeepSeek V3 | 671B (MoE) | Very High | Moderate | Multi-GPU setups with ample VRAM |
Minimum viable model for text-to-SQL: 8B parameter models (Llama 3.1 8B, Qwen 2.5 7B) can handle simple queries against small schemas (< 20 tables). For complex joins, subqueries, or large schemas, use 70B+ models.
Quality Tiers
Tier 1 — Production ready (70B+): Reliable tool calling, accurate SQL generation for complex queries, handles large schemas. Comparable to GPT-4o for most text-to-SQL tasks.
Tier 2 — Development viable (7-8B): Works for simple queries (single-table SELECTs, basic aggregations). Tool calling works but may require retries. Struggles with multi-table joins and complex WHERE clauses.
Tier 3 — Not recommended (< 7B): Unreliable tool calling, frequent SQL syntax errors, poor schema comprehension. Use only for testing the pipeline, not for actual queries.
Hardware Requirements
GPU Memory (VRAM)
| Model | FP16 | Q8 | Q4 | Minimum GPU |
|---|---|---|---|---|
| Llama 3.1 8B | 16 GB | 9 GB | 5 GB | RTX 3090 / A10 |
| Qwen 2.5 7B | 14 GB | 8 GB | 5 GB | RTX 3090 / A10 |
| Mistral 7B | 14 GB | 8 GB | 5 GB | RTX 3090 / A10 |
| Llama 3.1 70B | 140 GB | 75 GB | 40 GB | 2× A100 80GB / 1× A100 (Q4) |
| Qwen 2.5 72B | 144 GB | 77 GB | 42 GB | 2× A100 80GB / 1× A100 (Q4) |
| Mistral Large (123B) | 246 GB | 131 GB | 72 GB | 4× A100 80GB |
| DeepSeek V3 (671B MoE) | ~130 GB* | ~70 GB* | ~40 GB* | 2× A100 80GB (FP8) |
* DeepSeek V3 uses Mixture-of-Experts — only active parameters are loaded, so VRAM is lower than the total parameter count suggests.
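The figures above follow from a simple rule of thumb: weight memory is roughly parameter count times bits per weight divided by 8, plus overhead for the KV cache and activations. A minimal sketch, assuming a flat 20% overhead factor (real usage varies with context length and batch size):

```shell
# Rough VRAM estimate in GB: params (billions) x bits per weight / 8,
# plus ~20% overhead for KV cache and activations (assumed factor).
estimate_vram_gb() {
  params_b=$1   # parameter count in billions
  bits=$2       # 16 (FP16), 8 (Q8), 4 (Q4)
  # weights x 1.2 overhead, integer GB
  echo $(( params_b * bits * 12 / 8 / 10 ))
}

estimate_vram_gb 8 16    # Llama 3.1 8B at FP16: 19 GB
estimate_vram_gb 70 8    # Llama 3.1 70B at Q8: 84 GB
```

The table's numbers are slightly lower because they count weights only; the overhead term is what pushes a "fits exactly" model into out-of-memory errors at serve time.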
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| RAM | Model VRAM × 1.5 | Model VRAM × 2 |
| Disk | Model size + 20 GB | SSD with 100+ GB free |
| CPU | 4 cores | 8+ cores (for vLLM continuous batching) |
| GPU | CUDA 11.8+ compatible | NVIDIA Ampere or newer (A100, H100, RTX 4090) |
CPU-only inference is possible with Ollama for 7-8B models (Q4 quantization) but is 10-50× slower than GPU. Not recommended for interactive use — the agent loop's built-in step timeout (30s per tool call) may kill requests before the model finishes generating.
Quantization Trade-offs
| Quantization | VRAM Savings | Quality Impact | Recommendation |
|---|---|---|---|
| FP16 | Baseline | None | Best quality, if you have VRAM |
| Q8 | ~45% reduction | Minimal (< 1% accuracy loss) | Good default for production |
| Q4 | ~70% reduction | Noticeable on complex queries | Acceptable for development, risky for production |
| Q2 | ~85% reduction | Significant degradation | Not recommended — tool calling becomes unreliable |
Compatibility Matrix
Tested model and inference server combinations for Atlas. Tool calling is the critical requirement — without it, Atlas cannot function.
Legend
- ✅ Works — tool calling, streaming, and SQL generation all function correctly
- ⚠️ Partial — works but with known limitations (see notes)
- ❌ Fails — tool calling broken or too unreliable for use
Ollama
| Model | Tool Calling | Streaming | Notes |
|---|---|---|---|
| Llama 3.1 70B | ✅ | ✅ | Best self-hosted option for Ollama |
| Llama 3.1 8B | ✅ | ✅ | Good for development |
| Qwen 2.5 72B | ✅ | ✅ | Strong tool calling |
| Qwen 2.5 7B | ✅ | ✅ | Good tool calling for its size |
| Mistral Large | ✅ | ✅ | Requires significant VRAM |
| Mistral 7B (v0.3) | ⚠️ | ✅ | Tool calling works but sometimes malformed — retries help |
| DeepSeek V3 | ⚠️ | ✅ | Requires Ollama 0.5+; large VRAM requirement |
| Phi-3 Medium (14B) | ⚠️ | ✅ | Tool calling inconsistent — not recommended for Atlas |
| CodeLlama 34B | ❌ | ✅ | No tool calling support |
| Llama 2 (any size) | ❌ | ✅ | No tool calling support |
vLLM
| Model | Tool Calling | Streaming | Notes |
|---|---|---|---|
| Llama 3.1 70B | ✅ | ✅ | Best production option — use --tool-call-parser hermes |
| Llama 3.1 8B | ✅ | ✅ | Use --tool-call-parser hermes |
| Qwen 2.5 72B | ✅ | ✅ | Use --tool-call-parser hermes |
| Qwen 2.5 7B | ✅ | ✅ | Use --tool-call-parser hermes |
| Mistral Large | ✅ | ✅ | Use --tool-call-parser mistral |
| Mistral 7B (v0.3) | ⚠️ | ✅ | Tool calling less reliable than 70B+ models |
| DeepSeek V3 | ✅ | ✅ | Requires FP8 or multi-GPU; use --tool-call-parser hermes |
vLLM requires --enable-auto-tool-choice and a --tool-call-parser flag. The parser must match the model's chat template. Most Llama and Qwen models use hermes; Mistral models use mistral.
TGI (Text Generation Inference)
| Model | Tool Calling | Streaming | Notes |
|---|---|---|---|
| Llama 3.1 70B | ✅ | ✅ | Requires TGI v2.0+ |
| Llama 3.1 8B | ✅ | ✅ | Requires TGI v2.0+ |
| Qwen 2.5 72B | ⚠️ | ✅ | Tool calling works but output format can vary |
| Qwen 2.5 7B | ⚠️ | ✅ | Same as 72B — format inconsistencies |
| Mistral Large | ✅ | ✅ | Good TGI support |
| Mistral 7B | ⚠️ | ✅ | Inconsistent tool calling |
Docker Compose Profiles
Pre-built Docker Compose files for common self-hosted setups. All include Atlas API + Postgres + demo data.
Ollama
```bash
# Start with default model (Llama 3.1 8B)
docker compose -f examples/docker/docker-compose.ollama.yml up

# Use a different model
OLLAMA_MODEL=qwen2.5:72b docker compose -f examples/docker/docker-compose.ollama.yml up
```

Included services: Postgres, Ollama (with GPU passthrough), model auto-pull, Atlas API.
For CPU-only: remove the deploy block from the ollama service in the compose file.
vLLM
```bash
# Start with default model (Llama 3.1 8B Instruct)
HUGGING_FACE_HUB_TOKEN=hf_... docker compose -f examples/docker/docker-compose.vllm.yml up

# Use a different model
HUGGING_FACE_HUB_TOKEN=hf_... \
  VLLM_MODEL=meta-llama/Llama-3.1-70B-Instruct \
  VLLM_SERVED_NAME=llama3.1-70b \
  docker compose -f examples/docker/docker-compose.vllm.yml up
```

Included services: Postgres, vLLM (with tool calling enabled), Atlas API.
A Hugging Face token is required for gated models (Llama, Mistral). Create one at huggingface.co/settings/tokens.
Benchmark Results
Expected performance ranges for self-hosted models with Atlas. Results vary by hardware, quantization, schema complexity, and query type.
These benchmarks reflect expected ranges based on model architecture and published benchmarks. Actual performance depends heavily on hardware, quantization, context length, and schema complexity. Run your own benchmarks against your schema for production sizing.
Latency
Estimated for a single A100 80GB GPU with Q8 quantization, 10-table demo schema.
| Model | TTFT (simple) | TTFT (complex) | Total (simple) | Total (complex) |
|---|---|---|---|---|
| Llama 3.1 8B | 0.3–0.5s | 0.5–1.0s | 2–4s | 5–10s |
| Qwen 2.5 7B | 0.3–0.5s | 0.5–1.0s | 2–4s | 5–10s |
| Llama 3.1 70B | 1–2s | 2–4s | 5–10s | 15–30s |
| Qwen 2.5 72B | 1–2s | 2–4s | 5–10s | 15–30s |
| Mistral Large | 2–3s | 3–6s | 8–15s | 20–45s |
Simple: Single-table query, 1 tool call (e.g., "How many users signed up this week?"). Complex: Multi-table join, 2–3 tool calls, aggregation (e.g., "What's the average resolution time by severity for tickets assigned to the top 5 agents?").
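These totals can be sanity-checked with a back-of-envelope model: total latency is roughly TTFT plus output tokens divided by decode speed. A sketch, where the decode rates (80 tok/s for 8B-class, 15 tok/s for 70B-class at Q8 on an A100) and the ~300-token output budget are assumptions for illustration, not measured Atlas numbers:

```shell
# Rough per-query latency: time-to-first-token plus decode time.
# All rates below are assumptions, not measured benchmarks.
estimate_latency_ms() {
  ttft_ms=$1       # time to first token, in ms
  out_tokens=$2    # total tokens generated across tool calls + answer
  tok_per_sec=$3   # decode throughput
  echo $(( ttft_ms + out_tokens * 1000 / tok_per_sec ))
}

estimate_latency_ms 500 300 80    # 8B-class: 4250 ms, within the 2-4s simple range
estimate_latency_ms 2000 300 15   # 70B-class: 22000 ms, within the 15-30s complex range
```

This is also why complex queries scale worse than simple ones: each extra tool call adds both another TTFT and more decoded tokens.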
Accuracy
Approximate success rates on representative query suites. "Success" means the generated SQL executes without error and returns correct results.
| Model | Simple Queries | Complex Queries | Tool Calling Reliability |
|---|---|---|---|
| Llama 3.1 70B | 90–95% | 70–80% | 95%+ |
| Qwen 2.5 72B | 90–95% | 70–80% | 95%+ |
| Llama 3.1 8B | 80–90% | 50–65% | 85–90% |
| Qwen 2.5 7B | 80–90% | 50–65% | 85–90% |
| Mistral 7B | 70–80% | 35–50% | 70–80% |
Token Efficiency
Average tokens consumed per successful query (system prompt + tool calls + response).
| Model | Simple Query | Complex Query |
|---|---|---|
| 70B models | 1,500–2,500 | 3,000–5,000 |
| 7-8B models | 1,800–3,000 | 4,000–7,000 |
Smaller models tend to use more tokens due to retries and less efficient tool call formatting.
Tuning Tips
Temperature
Atlas sets temperature to 0.2 by default — a good starting point for SQL generation. This is applied by the agent loop regardless of your inference server's default. If you see inconsistent SQL output, the issue is more likely model size or quantization than temperature.
Context Length
Atlas injects the semantic layer into the system prompt. Large schemas (20+ tables) can consume 4,000–8,000 tokens of context. Ensure your model's context window can accommodate this plus the conversation history.
| Schema Size | System Prompt Tokens | Recommended Min Context |
|---|---|---|
| < 10 tables | 1,000–2,000 | 4,096 |
| 10–20 tables | 2,000–4,000 | 8,192 |
| 20–50 tables | 4,000–8,000 | 16,384 |
| 50+ tables | 8,000+ | 32,768 |
For vLLM, set --max-model-len to match. For Ollama, set num_ctx in the Modelfile or via the API.
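For example, a minimal Ollama Modelfile that raises the context window for a 20-50 table schema (the base model and num_ctx value here are illustrative; pick values from the table above):

```
# Modelfile: extend context for larger schemas
FROM llama3.1:8b
PARAMETER num_ctx 16384
```

Build it with `ollama create llama3.1-16k -f Modelfile`, then point ATLAS_MODEL at the new name.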
Agent Max Steps
Smaller models may need more steps to complete complex queries (they retry more). Consider increasing the step limit:
```bash
# Default: 25 — increase for smaller models
ATLAS_AGENT_MAX_STEPS=40
```

Troubleshooting
Tool calling failures
Symptom: Atlas responds with text instead of executing SQL. The agent describes what it would query but never calls executeSQL.
Causes:
- Model doesn't support tool calling (check the compatibility matrix)
- vLLM missing `--enable-auto-tool-choice` or using the wrong `--tool-call-parser`
- Model too small — 7B models sometimes "forget" to use tools on complex queries
Fixes:
- Verify tool calling works with a direct request (the Content-Type header matters — without it, many servers reject the JSON body):

  ```bash
  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"llama3.1","messages":[{"role":"user","content":"Call the get_weather function for NYC"}],"tools":[{"type":"function","function":{"name":"get_weather","parameters":{"type":"object","properties":{"city":{"type":"string"}}}}}]}'
  ```

- For vLLM, ensure both `--enable-auto-tool-choice` and `--tool-call-parser` are set
- Try a larger model — 70B models are dramatically more reliable at tool calling than 7B
Streaming issues
Symptom: Responses appear all at once instead of streaming, or the connection times out.
Causes:
- Reverse proxy buffering (nginx, Cloudflare)
- Inference server not configured for streaming
- Connection timeout too low
Fixes:
- Check that your inference server returns `Transfer-Encoding: chunked`
- If behind nginx, add `proxy_buffering off;` and `proxy_http_version 1.1;`
- Note that the agent loop has built-in timeouts (5s per chunk, 30s per step) — very slow models may exceed these limits
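For nginx specifically, a location block along these lines disables buffering on the streaming path (the upstream address, path, and timeout value are illustrative — adjust them to your deployment):

```nginx
location /v1/ {
    proxy_pass http://localhost:8000;   # inference server upstream (adjust)
    proxy_http_version 1.1;             # required for chunked streaming
    proxy_buffering off;                # pass chunks through immediately
    proxy_read_timeout 300s;            # headroom for slow generations
}
```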
Context length exceeded
Symptom: Error messages about maximum context length, or the model produces garbage output mid-response.
Causes:
- Large semantic layer exhausting the context window
- Long conversation history
Fixes:
- Enable the semantic index (`ATLAS_SEMANTIC_INDEX_ENABLED=true`, the default) — it compresses the semantic layer summary
- Increase model context: vLLM `--max-model-len`, Ollama `num_ctx`
- For very large schemas (50+ tables), use a 70B+ model with 32K+ context
Slow first response
Symptom: First query after startup takes 30+ seconds.
Causes:
- Model loading into GPU memory (normal for large models)
- KV cache allocation (vLLM pre-allocates based on `--gpu-memory-utilization`)
Fixes:
- This is expected on cold start — subsequent queries will be fast
- For vLLM, reduce `--gpu-memory-utilization` (default 0.9) if startup is OOM-killed
- Use Ollama's keep_alive to prevent model unloading: `ollama run llama3.1 --keepalive 24h`
Quantization quality issues
Symptom: SQL has subtle errors (wrong column names, incorrect join conditions) that don't appear with larger quantizations.
Causes:
- Aggressive quantization (Q2, Q3) degrades the model's ability to follow schemas precisely
Fixes:
- Use Q8 for production — best balance of VRAM savings and quality
- Avoid Q2/Q3 for any text-to-SQL use case
- If VRAM is limited, use a smaller model at higher quantization rather than a larger model at Q4
See Also
- Environment Variables — All provider and model configuration
- Configuration — Declarative `atlas.config.ts`
- Deploy — Docker deployment guide
- Troubleshooting — General Atlas troubleshooting