SLA Monitoring

Track per-workspace uptime, query latency percentiles, and error rates with configurable alerting.

Atlas provides built-in SLA monitoring that tracks query performance and reliability for each workspace. Platform operators can view latency percentiles (p50/p95/p99), error rates, and uptime, and can configure alerts that fire when thresholds are breached.

SaaS Feature

SLA monitoring is available on app.useatlas.dev Enterprise plans. Self-hosted deployments can use their own monitoring infrastructure.

Prerequisites

Active Enterprise plan on app.useatlas.dev
Internal database configured (DATABASE_URL)
Platform admin role for dashboard access

How It Works

Every query execution automatically records two data points:

Latency — round-trip time in milliseconds
Outcome — success or error

These are stored in the internal database and aggregated on demand into:

Metric	Description
P50 / P95 / P99 latency	Query latency percentiles over the time window
Error rate	Percentage of queries that returned errors
Uptime	Percentage of queries that succeeded (100% minus the error rate)
Total queries	Query volume per workspace

Metrics are computed over a configurable time window (default: 24 hours). Pass ?hours=N to the API endpoints to adjust the window; N must be between 1 and 720 (30 days).
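The aggregation described above can be illustrated with a minimal sketch. It assumes each stored record carries a latency and a success flag, and it uses the nearest-rank percentile method; Atlas's actual percentile method and storage schema are not documented here, so `QueryRecord`, `percentile`, and `summarize` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class QueryRecord:
    """Hypothetical shape of one recorded query execution."""
    latency_ms: float
    success: bool


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an ascending list (assumed method)."""
    if not sorted_values:
        return 0.0
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(records):
    """Aggregate records from one time window into the four SLA metrics."""
    total = len(records)
    latencies = sorted(r.latency_ms for r in records)
    errors = sum(1 for r in records if not r.success)
    error_rate = errors * 100 / total if total else 0.0
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "error_rate_pct": error_rate,
        "uptime_pct": 100.0 - error_rate,  # uptime is the complement of error rate
        "total_queries": total,
    }
```

Treating an empty window as 0% errors and 100% uptime is a choice made for this sketch only.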

Alerting

SLA alerts fire when workspace metrics exceed configured thresholds. Two alert types are supported:

Alert Type	Default Threshold	Description
P99 Latency	5000ms	P99 query latency exceeds threshold
Error Rate	5%	Error rate exceeds threshold

Alert Lifecycle

Alerts progress through three states:

Firing — Threshold breached. Notification sent via webhook (if configured).
Acknowledged — Operator has acknowledged the alert but it remains active.
Resolved — Metric returned below threshold. Auto-resolved on next evaluation.
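The lifecycle above can be read as a small state machine. The sketch below is one possible interpretation, not Atlas's implementation: it assumes that "exceeds" means strictly greater than the threshold, and that both firing and acknowledged alerts auto-resolve once the metric recovers. The function names are hypothetical.

```python
from typing import Optional

FIRING, ACKNOWLEDGED, RESOLVED = "firing", "acknowledged", "resolved"


def evaluate(current_status: Optional[str], value: float, threshold: float) -> Optional[str]:
    """Return the alert status after one evaluation pass.

    current_status is None when no alert exists yet for this metric.
    """
    breached = value > threshold  # assumption: strict comparison
    active = current_status in (FIRING, ACKNOWLEDGED)
    if breached:
        # An active alert keeps its state; otherwise a new alert starts firing.
        return current_status if active else FIRING
    # Metric is back within the threshold: active alerts auto-resolve.
    return RESOLVED if active else current_status


def acknowledge(status: str) -> str:
    """Operator action: only a firing alert can be acknowledged."""
    if status != FIRING:
        raise ValueError(f"cannot acknowledge an alert in state {status!r}")
    return ACKNOWLEDGED
```

For example, `evaluate(None, 6200, 5000)` yields `"firing"`, and a later `evaluate("acknowledged", 4000, 5000)` yields `"resolved"`.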

Webhook Notifications

Set ATLAS_SLA_WEBHOOK_URL to receive alert notifications via HTTP POST:

{
  "type": "sla.alert.fired",
  "alert": {
    "id": "abc-123",
    "workspaceId": "ws-456",
    "workspaceName": "Acme Corp",
    "type": "latency_p99",
    "status": "firing",
    "currentValue": 6200,
    "threshold": 5000,
    "message": "Workspace \"Acme Corp\" p99 latency 6200ms exceeds threshold 5000ms"
  },
  "timestamp": "2026-03-23T10:30:00.000Z"
}
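A receiver could parse this payload along the following lines. This is a sketch under assumptions: only the `sla.alert.fired` event type is documented above, so any other type is ignored, and the `SlaAlert` class and `parse_sla_alert` function are hypothetical names. HTTP handling and any request verification are left out.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class SlaAlert:
    """Fields taken from the "alert" object in the sample payload."""
    id: str
    workspace_id: str
    workspace_name: str
    type: str
    status: str
    current_value: float
    threshold: float
    message: str


def parse_sla_alert(body: Union[str, bytes]) -> Optional[SlaAlert]:
    """Parse a webhook POST body; return None for undocumented event types."""
    event = json.loads(body)
    if event.get("type") != "sla.alert.fired":
        return None
    alert = event["alert"]
    return SlaAlert(
        id=alert["id"],
        workspace_id=alert["workspaceId"],
        workspace_name=alert["workspaceName"],
        type=alert["type"],
        status=alert["status"],
        current_value=float(alert["currentValue"]),
        threshold=float(alert["threshold"]),
        message=alert["message"],
    )
```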

Configuration

Environment Variables

Variable	Default	Description
`ATLAS_SLA_LATENCY_P99_MS`	`5000`	Default P99 latency alert threshold (ms)
`ATLAS_SLA_ERROR_RATE_PCT`	`5`	Default error rate alert threshold (%)
`ATLAS_SLA_WEBHOOK_URL`	—	Webhook URL for alert delivery

Thresholds can also be configured in the admin UI. Values set there take precedence over the environment variables.
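The precedence order (admin UI, then environment variable, then built-in default) can be sketched as follows. The dictionary keys `latency_p99_ms` and `error_rate_pct` and the `resolve_thresholds` function are hypothetical; only the environment variable names and default values come from the table above.

```python
import os

# Built-in defaults from the table above.
DEFAULTS = {"latency_p99_ms": 5000.0, "error_rate_pct": 5.0}

# Environment variable backing each threshold.
ENV_NAMES = {
    "latency_p99_ms": "ATLAS_SLA_LATENCY_P99_MS",
    "error_rate_pct": "ATLAS_SLA_ERROR_RATE_PCT",
}


def resolve_thresholds(ui_overrides=None, env=None):
    """Pick each threshold from the UI value, else the env var, else the default."""
    env = os.environ if env is None else env
    ui_overrides = ui_overrides or {}
    resolved = {}
    for key, default in DEFAULTS.items():
        if ui_overrides.get(key) is not None:
            resolved[key] = float(ui_overrides[key])  # admin UI wins
        elif env.get(ENV_NAMES[key]):
            resolved[key] = float(env[ENV_NAMES[key]])  # then the env var
        else:
            resolved[key] = default  # finally the built-in default
    return resolved
```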

Dashboard

The SLA monitoring dashboard is available in the admin console under Platform Admin > SLA Monitoring. It requires the platform_admin role.

Overview Tab

A table of all workspaces showing:

Latency percentiles (P50, P95, P99) with color-coded badges
Error rate and uptime percentage
Total query count
Click-through to per-workspace detail with hourly time-series charts

Alerts Tab

Active and recent alerts with status badges
One-click acknowledge for firing alerts
Manual "Evaluate Now" to trigger immediate alert evaluation
Threshold configuration dialog

API Endpoints

All endpoints require the platform_admin role and are mounted at /api/v1/platform/sla.

Method	Path	Description
`GET`	`/?hours=24`	All workspaces SLA summary (hours: 1–720)
`GET`	`/:workspaceId?hours=24`	Per-workspace detail with time-series
`GET`	`/alerts?status=&limit=100`	List alerts (status: firing, resolved, acknowledged)
`GET`	`/thresholds`	Current alert thresholds
`PUT`	`/thresholds`	Update alert thresholds
`POST`	`/alerts/:alertId/acknowledge`	Acknowledge a firing alert
`POST`	`/evaluate`	Trigger alert evaluation
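As an illustration of how the summary and per-workspace paths combine with the mount point, here is a small URL builder. It is a sketch only: the base URL is a placeholder, authentication is not shown because it is not described on this page, and the choice to omit a trailing slash on the root path is an assumption. `summary_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import quote, urlencode

SLA_PREFIX = "/api/v1/platform/sla"


def summary_url(base_url: str, hours: int = 24, workspace_id: Optional[str] = None) -> str:
    """Build the URL for the all-workspaces summary or a per-workspace detail."""
    if not 1 <= hours <= 720:
        raise ValueError("hours must be between 1 and 720")
    path = SLA_PREFIX
    if workspace_id is not None:
        # Percent-encode the ID so it is always a single path segment.
        path += "/" + quote(workspace_id, safe="")
    return f"{base_url.rstrip('/')}{path}?{urlencode({'hours': hours})}"
```

Usage, with a placeholder host: `summary_url("https://atlas.example.com", hours=48)` builds the summary URL for a 48-hour window, and passing `workspace_id="ws-456"` targets the per-workspace detail endpoint.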
