Self-Hosted Models

Run Atlas with local inference servers — Ollama, vLLM, and TGI. Model selection, hardware requirements, and troubleshooting.

Self-Hosted Only

This guide is for operators running their own Atlas instance who want to use local inference servers instead of cloud LLM providers. On app.useatlas.dev, the LLM provider is managed by the Atlas platform — no model hosting is required.

Atlas works with any OpenAI-compatible inference server. This guide covers setting up Ollama, vLLM, and TGI, choosing the right model, and troubleshooting common issues.

Atlas requires models with tool calling (function calling) support. The agent loop depends on executeSQL and explore tools — models without tool calling cannot run Atlas queries. See the compatibility matrix for tested models.
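To make the requirement concrete: a compatible model must emit structured tool calls in the OpenAI Chat Completions format, like the sketch below. The exact executeSQL schema shown here is illustrative only — Atlas's internal tool definitions may differ.

```json
{
  "tool_calls": [
    {
      "id": "call_1",
      "type": "function",
      "function": {
        "name": "executeSQL",
        "arguments": "{\"sql\": \"SELECT count(*) FROM users WHERE created_at > now() - interval '7 days'\"}"
      }
    }
  ]
}
```

Models that instead describe the query in prose, or emit malformed JSON in the arguments field, cannot drive the agent loop.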


Quick Start

The fastest way to run Atlas with a local model:

# 1. Install and start Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b

# 2. Configure Atlas
ATLAS_PROVIDER=ollama
ATLAS_MODEL=llama3.1:8b
OLLAMA_BASE_URL=http://localhost:11434/v1

# 3. Start Atlas
bun run dev

Or use Docker Compose for a fully containerized setup:

# From repo root — starts Atlas + Postgres + Ollama
docker compose -f examples/docker/docker-compose.ollama.yml up

Providers

Atlas supports two provider modes for self-hosted models:

ollama — Ollama preset

Preconfigured for Ollama's default endpoint. No API key needed.

ATLAS_PROVIDER=ollama
ATLAS_MODEL=llama3.1:8b
# Optional: override if Ollama is on a different host
OLLAMA_BASE_URL=http://localhost:11434/v1

openai-compatible — Any OpenAI-compatible server

Works with vLLM, TGI, LiteLLM, LocalAI, and any server that implements the OpenAI Chat Completions API with tool calling.

ATLAS_PROVIDER=openai-compatible
ATLAS_MODEL=llama3.1                         # Model name as served by your server
OPENAI_COMPATIBLE_BASE_URL=http://localhost:8000/v1  # Required
# Optional: API key if your server requires one
OPENAI_COMPATIBLE_API_KEY=your-key

ATLAS_MODEL is required for openai-compatible — there is no default. Set it to the model name as reported by your server's /v1/models endpoint.
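To see exactly what names your server reports, query the models endpoint and extract the IDs. This sketch assumes jq is installed; the sample response is inlined so the filter is visible without a running server.

```shell
# Live check against your server:
#   curl -s "$OPENAI_COMPATIBLE_BASE_URL/models" | jq -r '.data[].id'
# Same filter applied to a sample /v1/models response:
echo '{"object":"list","data":[{"id":"llama3.1","object":"model"}]}' | jq -r '.data[].id'
```

Use one of the printed IDs verbatim as ATLAS_MODEL.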


Inference Servers

Ollama

The easiest way to run models locally. Handles model downloading, quantization, and GPU management automatically.

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (downloads ~4.7 GB for 8B Q4)
ollama pull llama3.1:8b

# Verify it's running
curl http://localhost:11434/api/tags

Pros: Simple setup, automatic GPU detection, built-in model management, good for development. Cons: Lower throughput than vLLM (no continuous batching), limited serving options.

Atlas config:

ATLAS_PROVIDER=ollama
ATLAS_MODEL=llama3.1:8b

vLLM

High-throughput serving with continuous batching. Best for production self-hosted deployments.

# Install
pip install vllm

# Serve with tool calling enabled (required for Atlas)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name llama3.1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192

# Verify
curl http://localhost:8000/v1/models

Pros: Highest throughput (continuous batching, PagedAttention), production-grade, tensor parallelism for multi-GPU. Cons: Requires NVIDIA GPU, longer startup (model loading), more complex configuration.

vLLM requires --enable-auto-tool-choice and a --tool-call-parser for Atlas to work. Without these flags, tool calls will fail silently or return malformed responses.

Atlas config:

ATLAS_PROVIDER=openai-compatible
ATLAS_MODEL=llama3.1
OPENAI_COMPATIBLE_BASE_URL=http://localhost:8000/v1

Text Generation Inference (TGI)

Hugging Face's inference server. Good middle ground between Ollama and vLLM.

# Run with Docker (recommended)
docker run --gpus all -p 8080:80 \
  -v tgi_data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-input-tokens 4096 \
  --max-total-tokens 8192

# Verify
curl http://localhost:8080/v1/models

Pros: Good throughput, Hugging Face ecosystem integration, Flash Attention support. Cons: Tool calling support varies by model — not all models work reliably. Check the compatibility matrix.

Atlas config:

ATLAS_PROVIDER=openai-compatible
ATLAS_MODEL=meta-llama/Llama-3.1-8B-Instruct
OPENAI_COMPATIBLE_BASE_URL=http://localhost:8080/v1

Model Selection

Which model should I use?

Atlas needs models that can:

  1. Call tools reliably — generate structured JSON for executeSQL and explore tool calls
  2. Write SQL — translate natural language to correct SQL for your schema
  3. Follow system prompts — respect the semantic layer context injected into the system prompt

Not all models do this well. Larger models are significantly better at tool calling and SQL generation.

| Model | Parameters | Quality | Speed | Best For |
|---|---|---|---|---|
| Llama 3.1 70B | 70B | High | Moderate | Production self-hosted — best quality-to-cost ratio |
| Qwen 2.5 72B | 72B | High | Moderate | Production — strong tool calling and multilingual SQL |
| Mistral Large | 123B | Very High | Slow | Maximum quality when latency is acceptable |
| Llama 3.1 8B | 8B | Moderate | Fast | Development and testing — quick iteration |
| Qwen 2.5 7B | 7B | Moderate | Fast | Development — good tool calling for its size |
| Mistral 7B | 7B | Low | Fast | Not recommended — unreliable tool calling |
| DeepSeek V3 | 671B (MoE) | Very High | Moderate | Multi-GPU setups with ample VRAM |

Minimum viable model for text-to-SQL: 8B parameter models (Llama 3.1 8B, Qwen 2.5 7B) can handle simple queries against small schemas (< 20 tables). For complex joins, subqueries, or large schemas, use 70B+ models.

Quality Tiers

Tier 1 — Production ready (70B+): Reliable tool calling, accurate SQL generation for complex queries, handles large schemas. Comparable to GPT-4o for most text-to-SQL tasks.

Tier 2 — Development viable (7-8B): Works for simple queries (single-table SELECTs, basic aggregations). Tool calling works but may require retries. Struggles with multi-table joins and complex WHERE clauses.

Tier 3 — Not recommended (< 7B): Unreliable tool calling, frequent SQL syntax errors, poor schema comprehension. Use only for testing the pipeline, not for actual queries.


Hardware Requirements

GPU Memory (VRAM)

| Model | FP16 | Q8 | Q4 | Minimum GPU |
|---|---|---|---|---|
| Llama 3.1 8B | 16 GB | 9 GB | 5 GB | RTX 3090 / A10 |
| Qwen 2.5 7B | 14 GB | 8 GB | 5 GB | RTX 3090 / A10 |
| Mistral 7B | 14 GB | 8 GB | 5 GB | RTX 3090 / A10 |
| Llama 3.1 70B | 140 GB | 75 GB | 40 GB | 2× A100 80GB / 1× A100 (Q4) |
| Qwen 2.5 72B | 144 GB | 77 GB | 42 GB | 2× A100 80GB / 1× A100 (Q4) |
| Mistral Large (123B) | 246 GB | 131 GB | 72 GB | 4× A100 80GB |
| DeepSeek V3 (671B MoE) | ~130 GB* | ~70 GB* | ~40 GB* | 2× A100 80GB (FP8) |

* DeepSeek V3 uses Mixture-of-Experts — only active parameters are loaded, so VRAM is lower than the total parameter count suggests.
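The table values follow roughly from parameter count times bytes per weight. A quick rule-of-thumb estimator — the 20% overhead factor for KV cache and activations is an assumption, not a measured constant:

```shell
# Rough VRAM estimate in GB: params (billions) x bits-per-weight / 8, plus ~20% overhead
# for KV cache and activations.
vram_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.0f\n", p * bits / 8 * 1.2 }'
}

vram_gb 8 16    # Llama 3.1 8B at FP16 -> 19
vram_gb 70 4    # Llama 3.1 70B at Q4  -> 42
```

Dense-model numbers only; MoE models like DeepSeek V3 load fewer active parameters, so this formula overestimates them.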

System Requirements

| Component | Minimum | Recommended |
|---|---|---|
| RAM | Model VRAM × 1.5 | Model VRAM × 2 |
| Disk | Model size + 20 GB | SSD with 100+ GB free |
| CPU | 4 cores | 8+ cores (for vLLM continuous batching) |
| GPU | CUDA 11.8+ compatible | NVIDIA Ampere or newer (A100, H100, RTX 4090) |

CPU-only inference is possible with Ollama for 7-8B models (Q4 quantization) but is 10-50× slower than GPU. Not recommended for interactive use — the agent loop's built-in step timeout (30s per tool call) may kill requests before the model finishes generating.

Quantization Trade-offs

| Quantization | VRAM Savings | Quality Impact | Recommendation |
|---|---|---|---|
| FP16 | Baseline | None | Best quality, if you have VRAM |
| Q8 | ~45% reduction | Minimal (< 1% accuracy loss) | Good default for production |
| Q4 | ~70% reduction | Noticeable on complex queries | Acceptable for development, risky for production |
| Q2 | ~85% reduction | Significant degradation | Not recommended — tool calling becomes unreliable |

Compatibility Matrix

Tested model and inference server combinations for Atlas. Tool calling is the critical requirement — without it, Atlas cannot function.

Legend

  • ✅ Works — tool calling, streaming, and SQL generation all function correctly
  • ⚠️ Partial — works but with known limitations (see notes)
  • ❌ Fails — tool calling broken or too unreliable for use

Ollama

| Model | Tool Calling | Streaming | Notes |
|---|---|---|---|
| Llama 3.1 70B | ✅ | ✅ | Best self-hosted option for Ollama |
| Llama 3.1 8B | ✅ | ✅ | Good for development |
| Qwen 2.5 72B | ✅ | ✅ | Strong tool calling |
| Qwen 2.5 7B | ✅ | ✅ | Good tool calling for its size |
| Mistral Large | ✅ | ✅ | Requires significant VRAM |
| Mistral 7B (v0.3) | ⚠️ | ✅ | Tool calling works but sometimes malformed — retries help |
| DeepSeek V3 | ⚠️ | ✅ | Requires Ollama 0.5+; large VRAM requirement |
| Phi-3 Medium (14B) | ⚠️ | ✅ | Tool calling inconsistent — not recommended for Atlas |
| CodeLlama 34B | ❌ | ✅ | No tool calling support |
| Llama 2 (any size) | ❌ | ✅ | No tool calling support |

vLLM

| Model | Tool Calling | Streaming | Notes |
|---|---|---|---|
| Llama 3.1 70B | ✅ | ✅ | Best production option — use --tool-call-parser hermes |
| Llama 3.1 8B | ✅ | ✅ | Use --tool-call-parser hermes |
| Qwen 2.5 72B | ✅ | ✅ | Use --tool-call-parser hermes |
| Qwen 2.5 7B | ✅ | ✅ | Use --tool-call-parser hermes |
| Mistral Large | ✅ | ✅ | Use --tool-call-parser mistral |
| Mistral 7B (v0.3) | ⚠️ | ✅ | Tool calling less reliable than 70B+ models |
| DeepSeek V3 | ✅ | ✅ | Requires FP8 or multi-GPU; use --tool-call-parser hermes |

vLLM requires --enable-auto-tool-choice and a --tool-call-parser flag. The parser must match the model's chat template. Most Llama and Qwen models use hermes; Mistral models use mistral.

TGI (Text Generation Inference)

| Model | Tool Calling | Streaming | Notes |
|---|---|---|---|
| Llama 3.1 70B | ✅ | ✅ | Requires TGI v2.0+ |
| Llama 3.1 8B | ✅ | ✅ | Requires TGI v2.0+ |
| Qwen 2.5 72B | ⚠️ | ✅ | Tool calling works but output format can vary |
| Qwen 2.5 7B | ⚠️ | ✅ | Same as 72B — format inconsistencies |
| Mistral Large | ✅ | ✅ | Good TGI support |
| Mistral 7B | ⚠️ | ✅ | Inconsistent tool calling |

Docker Compose Profiles

Pre-built Docker Compose files for common self-hosted setups. All include Atlas API + Postgres + demo data.

Ollama

# Start with default model (Llama 3.1 8B)
docker compose -f examples/docker/docker-compose.ollama.yml up

# Use a different model
OLLAMA_MODEL=qwen2.5:72b docker compose -f examples/docker/docker-compose.ollama.yml up

Included services: Postgres, Ollama (with GPU passthrough), model auto-pull, Atlas API.

For CPU-only: remove the deploy block from the ollama service in the compose file.
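For reference, the GPU passthrough block to remove typically looks like the following — the exact contents may differ in the compose file shipped with your Atlas version:

```yaml
services:
  ollama:
    # Delete this deploy block for CPU-only hosts:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```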

vLLM

# Start with default model (Llama 3.1 8B Instruct)
HUGGING_FACE_HUB_TOKEN=hf_... docker compose -f examples/docker/docker-compose.vllm.yml up

# Use a different model
HUGGING_FACE_HUB_TOKEN=hf_... \
VLLM_MODEL=meta-llama/Llama-3.1-70B-Instruct \
VLLM_SERVED_NAME=llama3.1-70b \
docker compose -f examples/docker/docker-compose.vllm.yml up

Included services: Postgres, vLLM (with tool calling enabled), Atlas API.

A Hugging Face token is required for gated models (Llama, Mistral). Create one at huggingface.co/settings/tokens.


Benchmark Results

Expected performance ranges for self-hosted models with Atlas. Results vary by hardware, quantization, schema complexity, and query type.

These benchmarks reflect expected ranges based on model architecture and published benchmarks. Actual performance depends heavily on hardware, quantization, context length, and schema complexity. Run your own benchmarks against your schema for production sizing.

Latency

Estimated for a single A100 80GB GPU with Q8 quantization and the 10-table demo schema. TTFT = time to first token.

| Model | TTFT (simple) | TTFT (complex) | Total (simple) | Total (complex) |
|---|---|---|---|---|
| Llama 3.1 8B | 0.3–0.5s | 0.5–1.0s | 2–4s | 5–10s |
| Qwen 2.5 7B | 0.3–0.5s | 0.5–1.0s | 2–4s | 5–10s |
| Llama 3.1 70B | 1–2s | 2–4s | 5–10s | 15–30s |
| Qwen 2.5 72B | 1–2s | 2–4s | 5–10s | 15–30s |
| Mistral Large | 2–3s | 3–6s | 8–15s | 20–45s |

Simple: Single-table query, 1 tool call (e.g., "How many users signed up this week?"). Complex: Multi-table join, 2–3 tool calls, aggregation (e.g., "What's the average resolution time by severity for tickets assigned to the top 5 agents?").

Accuracy

Approximate success rates on representative query suites. "Success" means the generated SQL executes without error and returns correct results.

| Model | Simple Queries | Complex Queries | Tool Calling Reliability |
|---|---|---|---|
| Llama 3.1 70B | 90–95% | 70–80% | 95%+ |
| Qwen 2.5 72B | 90–95% | 70–80% | 95%+ |
| Llama 3.1 8B | 80–90% | 50–65% | 85–90% |
| Qwen 2.5 7B | 80–90% | 50–65% | 85–90% |
| Mistral 7B | 70–80% | 35–50% | 70–80% |

Token Efficiency

Average tokens consumed per successful query (system prompt + tool calls + response).

| Model | Simple Query | Complex Query |
|---|---|---|
| 70B models | 1,500–2,500 | 3,000–5,000 |
| 7-8B models | 1,800–3,000 | 4,000–7,000 |

Smaller models tend to use more tokens due to retries and less efficient tool call formatting.


Tuning Tips

Temperature

Atlas sets temperature to 0.2 by default — a good starting point for SQL generation. This is applied by the agent loop regardless of your inference server's default. If you see inconsistent SQL output, the issue is more likely model size or quantization than temperature.

Context Length

Atlas injects the semantic layer into the system prompt. Large schemas (20+ tables) can consume 4,000–8,000 tokens of context. Ensure your model's context window can accommodate this plus the conversation history.

| Schema Size | System Prompt Tokens | Recommended Min Context |
|---|---|---|
| < 10 tables | 1,000–2,000 | 4,096 |
| 10–20 tables | 2,000–4,000 | 8,192 |
| 20–50 tables | 4,000–8,000 | 16,384 |
| 50+ tables | 8,000+ | 32,768 |

For vLLM, set --max-model-len to match. For Ollama, set num_ctx in the Modelfile or via the API.
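For Ollama, a minimal Modelfile that raises num_ctx might look like this — the model tag and context size are examples; pick the context size for your schema from the table above:

```shell
# Create a variant of llama3.1:8b with a 16K context window
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 16384
EOF
ollama create llama3.1:8b-16k -f Modelfile
```

Then point ATLAS_MODEL at the new tag (llama3.1:8b-16k).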

Agent Max Steps

Smaller models may need more steps to complete complex queries (they retry more). Consider increasing the step limit:

# Default: 25 — increase for smaller models
ATLAS_AGENT_MAX_STEPS=40

Troubleshooting

Tool calling failures

Symptom: Atlas responds with text instead of executing SQL. The agent describes what it would query but never calls executeSQL.

Causes:

  • Model doesn't support tool calling (check the compatibility matrix)
  • vLLM missing --enable-auto-tool-choice or wrong --tool-call-parser
  • Model too small — 7B models sometimes "forget" to use tools on complex queries

Fixes:

  1. Verify tool calling works with a minimal request:

     curl http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"model":"llama3.1","messages":[{"role":"user","content":"Call the get_weather function for NYC"}],"tools":[{"type":"function","function":{"name":"get_weather","parameters":{"type":"object","properties":{"city":{"type":"string"}}}}}]}'
  2. For vLLM, ensure both --enable-auto-tool-choice and --tool-call-parser are set
  3. Try a larger model — 70B models are dramatically more reliable at tool calling than 7B

Streaming issues

Symptom: Responses appear all at once instead of streaming, or the connection times out.

Causes:

  • Reverse proxy buffering (nginx, Cloudflare)
  • Inference server not configured for streaming
  • Connection timeout too low

Fixes:

  1. Check that your inference server returns Transfer-Encoding: chunked
  2. If behind nginx, add: proxy_buffering off; and proxy_http_version 1.1;
  3. Note that the agent loop has built-in timeouts (5s per chunk, 30s per step) — very slow models may exceed these limits
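A minimal nginx location block that keeps chunked/SSE streaming intact might look like this — the path and upstream address are placeholders for your deployment:

```nginx
location /v1/ {
    proxy_pass http://localhost:8000;   # your inference server
    proxy_http_version 1.1;
    proxy_buffering off;                # don't buffer streamed chunks
    proxy_cache off;
    proxy_read_timeout 300s;            # allow slow generations to finish
}
```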

Context length exceeded

Symptom: Error messages about maximum context length, or the model produces garbage output mid-response.

Causes:

  • Large semantic layer exhausting the context window
  • Long conversation history

Fixes:

  1. Enable the semantic index (ATLAS_SEMANTIC_INDEX_ENABLED=true, default) — it compresses the semantic layer summary
  2. Increase model context: vLLM --max-model-len, Ollama num_ctx
  3. For very large schemas (50+ tables), use a 70B+ model with 32K+ context

Slow first response

Symptom: First query after startup takes 30+ seconds.

Causes:

  • Model loading into GPU memory (normal for large models)
  • KV cache allocation (vLLM pre-allocates based on --gpu-memory-utilization)

Fixes:

  1. This is expected on cold start — subsequent queries will be fast
  2. For vLLM, reduce --gpu-memory-utilization if startup is OOM-killed (default 0.9)
  3. Use Ollama's keep_alive to prevent model unloading: ollama run llama3.1 --keepalive 24h

Quantization quality issues

Symptom: SQL has subtle errors (wrong column names, incorrect join conditions) that don't appear with larger quantizations.

Causes:

  • Aggressive quantization (Q2, Q3) degrades the model's ability to follow schemas precisely

Fixes:

  1. Use Q8 for production — best balance of VRAM savings and quality
  2. Avoid Q2/Q3 for any text-to-SQL use case
  3. If VRAM is limited, use a smaller model at higher quantization rather than a larger model at Q4
