Prometheus & AlertManager

Prometheus v3.3.0 stores all infrastructure metrics pushed by Alloy agents via remote write, evaluates 39 alert rules, and routes firing alerts through AlertManager v0.28.1 to Slack.


Overview

| Property | Value |
|---|---|
| Prometheus version | v3.3.0 |
| AlertManager version | v0.28.1 |
| Host | prometheus-shared-cwiq-io (VPC 10.0.15.9) |
| Prometheus UI | https://prometheus.shared.cwiq.io |
| Remote write port | 9009 (all interfaces, for Alloy agents) |
| Metric retention | 30 days / 50 GB |
| Playbook | prometheus/deploy-prometheus.yml |
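
The retention values above map to Prometheus startup flags. A minimal sketch of the relevant container arguments (the exact flag set used by the playbook may differ):

```
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
--web.enable-remote-write-receiver   # needed so Alloy agents can push via remote write
```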

Both services run as Docker containers (observability-prometheus, observability-alertmanager) on the same host. Nginx provides SSL termination for the Prometheus UI.
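
The SSL termination can be sketched as a minimal nginx server block. This is an illustration only; the certificate paths and proxy target are assumptions, not taken from the deployed config:

```nginx
server {
    listen 443 ssl;
    server_name prometheus.shared.cwiq.io;

    # hypothetical certificate paths
    ssl_certificate     /etc/nginx/ssl/prometheus.crt;
    ssl_certificate_key /etc/nginx/ssl/prometheus.key;

    location / {
        # Prometheus UI/API listens on localhost:9090 only
        proxy_pass http://127.0.0.1:9090;
    }
}
```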


How Metrics Arrive

Prometheus does not scrape hosts directly. All metrics are pushed by Alloy agents via remote write:

```
Alloy agent (every host)
    | remote write every 15 seconds
    v
prometheus-shared-cwiq-io:9009  (all interfaces)
    |
prometheus-shared-cwiq-io:9090  (internal API, localhost only)
```

Static scrape targets are only used for co-located services (Prometheus itself and AlertManager).
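
On the agent side, the corresponding Alloy configuration is roughly as follows. The block label and the `/api/v1/write` path are assumptions (that is the path Prometheus's built-in remote write receiver uses; the 9009 listener may front it differently):

```alloy
prometheus.remote_write "shared" {
  endpoint {
    // Tailscale hostname; 9009 is the remote write listener
    url = "http://prometheus-shared-cwiq-io:9009/api/v1/write"
  }
}
```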

**Cross-VPC agents must use the Tailscale hostname**

Alloy on DEV servers writes to http://prometheus-shared-cwiq-io:9009, the Tailscale hostname. The FQDN prometheus.shared.cwiq.io resolves to the Shared VPC private IP, which is not routable from the DEV VPC.


AlertManager Routing

AlertManager routes alerts to Slack based on the environment label attached to every metric by the Alloy agent:

```
route (default: shared-alerts)
  ├── environment=shared          → #cwiq-shared-infra-alerts
  └── environment=development|demo → #cwiq-dev-infra-alerts
        ├── severity=critical  → repeat_interval: 1h
        └── severity=warning   → repeat_interval: 4h
```
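
The tree above corresponds to an alertmanager.yml route block along these lines. This is a hedged sketch: `group_by` and any timing values other than `repeat_interval` are assumptions, not copied from the deployed file:

```yaml
route:
  receiver: shared-alerts            # default: environment=shared lands here
  group_by: [alertname, host]        # assumed grouping
  routes:
    - matchers: ['environment =~ "development|demo"']
      receiver: dev-alerts
      routes:
        - matchers: ['severity = "critical"']
          repeat_interval: 1h
        - matchers: ['severity = "warning"']
          repeat_interval: 4h
```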

Receivers

| Receiver | Environment | Slack Channel |
|---|---|---|
| shared-alerts | shared | #cwiq-shared-infra-alerts |
| dev-alerts | development, demo | #cwiq-dev-infra-alerts |

Inhibition Rules

  • A critical alert suppresses all warning alerts with the same {alertname, host}.
  • A ServiceDown alert suppresses all warning alerts with the same instance — prevents alert storms when a host goes unreachable.
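
In alertmanager.yml terms, the two rules look roughly like this (a sketch, not the deployed file):

```yaml
inhibit_rules:
  # critical suppresses warnings for the same alert on the same host
  - source_matchers: ['severity = "critical"']
    target_matchers: ['severity = "warning"']
    equal: [alertname, host]
  # ServiceDown suppresses all warnings from the same instance
  - source_matchers: ['alertname = "ServiceDown"']
    target_matchers: ['severity = "warning"']
    equal: [instance]
```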

Slack Message Fields

| Field | Example |
|---|---|
| Alert name (links to Grafana) | HighDiskUsage |
| Host | prometheus-shared-cwiq-io |
| Mount | /data or n/a |
| Severity | warning |
| Environment | shared |
| Description | Root filesystem at 83% on prometheus-shared-cwiq-io |

Alert Rules Reference

39 rules total across 6 groups. Rules are in prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2.

Infrastructure Group (25 rules)

| Alert | Threshold | For | Severity |
|---|---|---|---|
| HighDiskUsage | root / > 80% | 5m | warning |
| HighDiskUsageCritical | root / > 90% | 2m | critical |
| HighDiskUsageData | /data > 80% | 5m | warning |
| HighDiskUsageDataCritical | /data > 90% | 2m | critical |
| HighDiskUsageContainerd | /var/lib/containerd > 80% | 5m | warning |
| HighDiskUsageContainerdCritical | /var/lib/containerd > 90% | 2m | critical |
| DiskWillFillIn24h | / predicted to fill within 24h | 30m | warning |
| DataDiskWillFillIn24h | /data predicted to fill within 24h | 30m | warning |
| ContainerdDiskWillFillIn24h | /var/lib/containerd predicted to fill within 24h | 30m | warning |
| FilesystemInodeExhaustion | / inodes < 10% free | 5m | warning |
| DataFilesystemInodeExhaustion | /data inodes < 10% free | 5m | warning |
| ContainerdFilesystemInodeExhaustion | /var/lib/containerd inodes < 10% free | 5m | warning |
| HighDiskUsageOther | any other mount (ext4/xfs) > 80% | 5m | warning |
| HighDiskUsageOtherCritical | any other mount > 90% | 2m | critical |
| OtherDiskWillFillIn24h | any other mount predicted to fill in 24h | 30m | warning |
| OtherFilesystemInodeExhaustion | any other mount inodes < 10% free | 5m | warning |
| HighMemoryUsage | memory > 80% | 5m | warning |
| HighMemoryUsageCritical | memory > 90% | 2m | critical |
| HighCPUUsage | CPU > 80% | 5m | warning |
| HighCPUUsageCritical | CPU > 90% | 2m | critical |
| HighSwapUsage | swap > 80% | 5m | warning |
| ServiceDown | up == 0 | 2m | critical |
| NetworkReceiveErrors | receive errors > 0 | 5m | warning |
| NetworkTransmitErrors | transmit errors > 0 | 5m | warning |
| NTPClockDrift | NTP offset > 50ms | 10m | warning |
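
The *WillFillIn24h* rules use predict_linear to extrapolate free space. A sketch of the root-filesystem variant (the 6h lookback window is an assumption):

```promql
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
```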

Catch-all disk rules

The *Other* rules use fstype=~"ext4|xfs" with a negative mountpoint filter. This covers any additional EBS volumes automatically without requiring a rule change when new volumes are added.
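
A sketch of what the catch-all expression looks like (the exact excluded-mountpoint regex is an assumption):

```promql
100 * (1 - node_filesystem_avail_bytes{fstype=~"ext4|xfs", mountpoint!~"/|/data|/var/lib/containerd"}
         / node_filesystem_size_bytes{fstype=~"ext4|xfs", mountpoint!~"/|/data|/var/lib/containerd"}) > 80
```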

Docker Group (4 rules — dormant)

Requires cadvisor metrics. Alloy does not currently collect cadvisor data, so these rules produce no alerts. They will activate when cadvisor collection is added.

| Alert | Trigger | Severity |
|---|---|---|
| ContainerRestartLoop | Container restarted > 3 times in 15m | warning |
| ContainerOOMKilled | Container killed by OOM | critical |
| ContainerHighCPU | Container CPU > 90% sustained | warning |
| ContainerHighMemory | Container near memory limit | warning |

Loki Group (2 rules)

| Alert | Trigger | Severity |
|---|---|---|
| LokiIngestionErrors | Loki dropping ingested logs | warning |
| LokiRequestErrors | Loki returning 5xx errors | warning |

Application Group (3 rules)

| Alert | Trigger | Severity |
|---|---|---|
| HighHTTPErrorRate | Application 5xx rate > 0.1/s per job | warning |
| HighHTTPLatency | p99 latency > 5s | warning |
| HighHTTPLatencyCritical | p99 latency > 15s | critical |

Requires the application to expose http_requests_total (with status label) and http_request_duration_seconds histogram metrics.
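
Given those metrics, the rule expressions are likely close to the following (window sizes are assumptions):

```promql
# HighHTTPErrorRate: per-job 5xx rate
sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) > 0.1

# HighHTTPLatency: p99 from the duration histogram
histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 5
```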

Kafka / Redpanda Group (2 rules)

| Alert | Trigger | Severity |
|---|---|---|
| RedpandaBrokerDown | Redpanda metrics absent for 2m | critical |
| KafkaConsumerLagHigh | Consumer group lag > 10,000 events | warning |

Observability Self-Monitoring Group (3 rules)

| Alert | Trigger | Severity |
|---|---|---|
| PrometheusStorageHigh | Prometheus /data volume > 80% | warning |
| PrometheusRemoteWriteFailures | Remote write failures > 0 | warning |
| AlertmanagerNotificationsFailing | AlertManager notification failures > 0 | critical |
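
The AlertManager failure rule can be sketched against AlertManager's standard self-metric (the rate window is an assumption):

```promql
rate(alertmanager_notifications_failed_total[5m]) > 0
```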

Deployment

```shell
ssh ansible@ansible-shared-cwiq-io
ansible-helper
cd prometheus
ansible-playbook -i inventory/shared.yml deploy-prometheus.yml
```

The playbook writes prometheus.yml, alerting-rules.yml, and alertmanager.yml from templates, then starts both containers.


Useful PromQL Queries

```promql
# All hosts reporting metrics
count by (host) (up)

# CPU usage per host
100 - (avg by (host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used per host
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Disk usage percent on /data
(node_filesystem_size_bytes{mountpoint="/data"} - node_filesystem_avail_bytes{mountpoint="/data"})
/ node_filesystem_size_bytes{mountpoint="/data"} * 100

# All currently firing alerts
ALERTS{alertstate="firing"}
```

Testing Slack Integration

```shell
# Fire a test alert (run on prometheus-shared-cwiq-io)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "environment": "shared"
    },
    "annotations": {
      "summary": "Test alert",
      "description": "Slack integration test"
    }
  }]'
```

The test alert appears in #cwiq-shared-infra-alerts within 30 seconds (AlertManager group_wait). It auto-resolves after ~5 minutes.