SLA Monitoring

Track per-workspace uptime, query latency percentiles, and error rates with configurable alerting.

Atlas provides built-in SLA monitoring that tracks per-workspace query performance and reliability metrics. Platform operators can view latency percentiles (p50/p95/p99), error rates, and uptime, and can configure alerts that fire when thresholds are breached.

SaaS Feature

SLA monitoring is available on app.useatlas.dev Enterprise plans. Self-hosted deployments can use their own monitoring infrastructure.

Prerequisites

  • Active Enterprise plan on app.useatlas.dev
  • Internal database configured (DATABASE_URL)
  • Platform admin role for dashboard access

How It Works

Every query execution automatically records two data points:

  • Latency — round-trip time in milliseconds
  • Outcome — success or error

These data points are stored in the internal database and aggregated on demand into:

| Metric | Description |
|---|---|
| P50 / P95 / P99 latency | Query latency percentiles over the time window |
| Error rate | Percentage of queries that returned errors |
| Uptime | Percentage of successful queries (100% minus the error rate) |
| Total queries | Query volume per workspace |

Metrics are computed over a configurable time window (default: 24 hours). Pass ?hours=N (1–720) to the API endpoints to adjust the window.
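The aggregation described above can be sketched as follows. This is a minimal illustration, not Atlas's implementation: the `QuerySample` type, the function names, and the nearest-rank percentile method are all assumptions, and Atlas may compute percentiles differently (for example, with interpolation in SQL).

```python
import math
from dataclasses import dataclass


@dataclass
class QuerySample:
    """One recorded query execution (hypothetical shape)."""
    latency_ms: float  # round-trip time in milliseconds
    ok: bool           # True for success, False for error


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile over pre-sorted values (assumed method)."""
    if not sorted_values:
        raise ValueError("no samples in window")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples: list[QuerySample]) -> dict[str, float]:
    """Aggregate one workspace's samples for a single time window."""
    if not samples:
        raise ValueError("no samples in window")
    latencies = sorted(s.latency_ms for s in samples)
    total = len(samples)
    errors = sum(1 for s in samples if not s.ok)
    error_rate = 100 * errors / total
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "error_rate_pct": error_rate,
        "uptime_pct": 100 - error_rate,  # uptime is the complement of error rate
        "total_queries": total,
    }
```

Filtering the samples down to the last N hours before calling `summarize` would correspond to the `?hours=N` window.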

Alerting

SLA alerts fire when workspace metrics exceed configured thresholds. Two alert types are supported:

| Alert Type | Default Threshold | Description |
|---|---|---|
| P99 Latency | 5000 ms | P99 query latency exceeds threshold |
| Error Rate | 5% | Error rate exceeds threshold |
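As a rough sketch of the threshold check, the function below compares a workspace's metrics against the two default thresholds. The function name, the metric keys, and the `error_rate` type string are hypothetical; only `latency_p99` appears in the documented webhook payload. It also assumes "exceeds" means strictly greater than.

```python
# Defaults taken from the table above.
DEFAULT_THRESHOLDS = {"latency_p99_ms": 5000.0, "error_rate_pct": 5.0}


def breached_alert_types(
    p99_ms: float,
    error_rate_pct: float,
    thresholds: dict[str, float] = DEFAULT_THRESHOLDS,
) -> list[str]:
    """Return the alert types whose threshold is exceeded (strict >, assumed)."""
    breached = []
    if p99_ms > thresholds["latency_p99_ms"]:
        breached.append("latency_p99")
    if error_rate_pct > thresholds["error_rate_pct"]:
        breached.append("error_rate")  # hypothetical type name
    return breached
```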

Alert Lifecycle

Alerts progress through three states:

  1. Firing — Threshold breached. Notification sent via webhook (if configured).
  2. Acknowledged — Operator has acknowledged the alert but it remains active.
  3. Resolved — Metric returned below threshold. Auto-resolved on next evaluation.
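The lifecycle above can be modeled as a small state machine. This sketch uses the three documented status values; the function names are hypothetical, and it assumes that acknowledged alerts auto-resolve in the same way as firing ones.

```python
from enum import Enum


class AlertStatus(str, Enum):
    FIRING = "firing"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


def acknowledge(status: AlertStatus) -> AlertStatus:
    """Operator action: only a firing alert can be acknowledged."""
    if status is not AlertStatus.FIRING:
        raise ValueError(f"cannot acknowledge an alert in state {status.value}")
    return AlertStatus.ACKNOWLEDGED


def on_evaluation(status: AlertStatus, breached: bool) -> AlertStatus:
    """Evaluation pass: an active alert resolves once the metric is back under threshold."""
    if status is AlertStatus.RESOLVED:
        return status  # resolved is terminal in this sketch
    return status if breached else AlertStatus.RESOLVED
```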

Webhook Notifications

Set ATLAS_SLA_WEBHOOK_URL to receive alert notifications via HTTP POST:

{
  "type": "sla.alert.fired",
  "alert": {
    "id": "abc-123",
    "workspaceId": "ws-456",
    "workspaceName": "Acme Corp",
    "type": "latency_p99",
    "status": "firing",
    "currentValue": 6200,
    "threshold": 5000,
    "message": "Workspace \"Acme Corp\" p99 latency 6200ms exceeds threshold 5000ms"
  },
  "timestamp": "2026-03-23T10:30:00.000Z"
}
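A receiver for this payload might parse it along the following lines. This is a sketch only: `SlaAlert` and `parse_sla_webhook` are hypothetical names, the field list mirrors the example payload above, and any other event types or delivery semantics (retries, signatures, expected response codes) are not documented here and are not handled.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlaAlert:
    id: str
    workspace_id: str
    workspace_name: str
    type: str
    status: str
    current_value: float
    threshold: float
    message: str


def parse_sla_webhook(body: bytes) -> Optional[SlaAlert]:
    """Return the alert from an "sla.alert.fired" event, or None for anything else."""
    event = json.loads(body)
    if event.get("type") != "sla.alert.fired":
        return None  # ignore event types this sketch does not know about
    alert = event["alert"]
    return SlaAlert(
        id=alert["id"],
        workspace_id=alert["workspaceId"],
        workspace_name=alert["workspaceName"],
        type=alert["type"],
        status=alert["status"],
        current_value=float(alert["currentValue"]),
        threshold=float(alert["threshold"]),
        message=alert["message"],
    )
```

The function would typically be called from whatever HTTP framework serves the URL set in ATLAS_SLA_WEBHOOK_URL, passing the raw POST body.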

Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| ATLAS_SLA_LATENCY_P99_MS | 5000 | Default P99 latency alert threshold (ms) |
| ATLAS_SLA_ERROR_RATE_PCT | 5 | Default error rate alert threshold (%) |
| ATLAS_SLA_WEBHOOK_URL | (none) | Webhook URL for alert delivery |

Thresholds can also be configured through the admin UI. Values set there take precedence over the environment variables.
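The precedence order (admin UI value, then environment variable, then built-in default) can be illustrated with a small resolver. The function name and the `ui_override` parameter are hypothetical; how Atlas stores UI-configured thresholds is not described here.

```python
import os
from typing import Mapping, Optional

# Built-in defaults from the table above.
DEFAULTS = {
    "ATLAS_SLA_LATENCY_P99_MS": 5000.0,
    "ATLAS_SLA_ERROR_RATE_PCT": 5.0,
}


def resolve_threshold(
    name: str,
    ui_override: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Resolve a threshold: admin UI value > environment variable > default."""
    if ui_override is not None:
        return float(ui_override)
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw:  # treat unset and empty string the same way (an assumption)
        return float(raw)
    return DEFAULTS[name]
```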

Dashboard

The SLA monitoring dashboard is available in the admin console under Platform Admin > SLA Monitoring. It requires the platform_admin role.

Overview Tab

A table of all workspaces showing:

  • Latency percentiles (P50, P95, P99) with color-coded badges
  • Error rate and uptime percentage
  • Total query count
  • Click-through to per-workspace detail with hourly time-series charts

Alerts Tab

  • Active and recent alerts with status badges
  • One-click acknowledge for firing alerts
  • Manual "Evaluate Now" to trigger immediate alert evaluation
  • Threshold configuration dialog

API Endpoints

All endpoints require the platform_admin role and are mounted at /api/v1/platform/sla.

| Method | Path | Description |
|---|---|---|
| GET | /?hours=24 | All workspaces SLA summary (hours: 1–720) |
| GET | /:workspaceId?hours=24 | Per-workspace detail with time-series |
| GET | /alerts?status=&limit=100 | List alerts (status: firing, resolved, acknowledged) |
| GET | /thresholds | Current alert thresholds |
| PUT | /thresholds | Update alert thresholds |
| POST | /alerts/:alertId/acknowledge | Acknowledge a firing alert |
| POST | /evaluate | Trigger alert evaluation |
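To show how the paths and query parameters fit together, here is a sketch of URL builders for the two summary endpoints and the alert list. The helper names are hypothetical, the origin passed in is up to the caller, and authentication is omitted because the mechanism is not described on this page.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_PATH = "/api/v1/platform/sla"
ALERT_STATUSES = {"firing", "resolved", "acknowledged"}


def sla_url(origin: str, hours: int = 24, workspace_id: Optional[str] = None) -> str:
    """Build the summary URL, or the per-workspace detail URL if workspace_id is given."""
    if not 1 <= hours <= 720:
        raise ValueError("hours must be between 1 and 720")
    path = BASE_PATH
    if workspace_id is not None:
        path += "/" + quote(workspace_id, safe="")
    return f"{origin}{path}?{urlencode({'hours': hours})}"


def alerts_url(origin: str, status: Optional[str] = None, limit: int = 100) -> str:
    """Build the alert-list URL, optionally filtered by status."""
    params: dict[str, object] = {}
    if status is not None:
        if status not in ALERT_STATUSES:
            raise ValueError(f"unknown status: {status}")
        params["status"] = status
    params["limit"] = limit
    return f"{origin}{BASE_PATH}/alerts?{urlencode(params)}"
```

The resulting URLs can be passed to any HTTP client along with whatever credentials the deployment requires for the platform_admin role.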