Prometheus & AlertManager

Prometheus v3.3.0 stores all infrastructure metrics pushed by Alloy agents via remote write, evaluates 39 alert rules, and routes firing alerts through AlertManager v0.28.1 to Slack.


Overview

| Property | Value |
|---|---|
| Prometheus version | v3.3.0 |
| AlertManager version | v0.28.1 |
| Host | prometheus-shared-cwiq-io (VPC 10.0.15.9) |
| Prometheus UI | https://prometheus.shared.cwiq.io |
| Remote write port | 9009 (all interfaces, for Alloy agents) |
| Metric retention | 30 days / 50 GB |
| Playbook | prometheus/deploy-prometheus.yml |
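
The retention values above map to Prometheus startup flags. A minimal sketch of the relevant container arguments (the exact flag set used by the playbook may differ):

```
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
--web.enable-remote-write-receiver   # needed so Alloy agents can push via remote write
```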

Both services run as Docker containers (observability-prometheus, observability-alertmanager) on the same host. Nginx provides SSL termination for the Prometheus UI.
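
The SSL termination can be sketched as a minimal nginx server block. This is an illustration only; the certificate paths and proxy target are assumptions, not taken from the deployed config:

```nginx
server {
    listen 443 ssl;
    server_name prometheus.shared.cwiq.io;

    # hypothetical certificate paths
    ssl_certificate     /etc/nginx/ssl/prometheus.crt;
    ssl_certificate_key /etc/nginx/ssl/prometheus.key;

    location / {
        # Prometheus UI/API listens on localhost:9090 only
        proxy_pass http://127.0.0.1:9090;
    }
}
```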


How Metrics Arrive

Prometheus does not scrape hosts directly. All metrics are pushed by Alloy agents via remote write:

```
Alloy agent (every host)
    | remote write every 15 seconds
    v
prometheus-shared-cwiq-io:9009  (all interfaces)
    |
prometheus-shared-cwiq-io:9090  (internal API, localhost only)
```

Static scrape targets are only used for co-located services (Prometheus itself and AlertManager).
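
On the agent side, the corresponding Alloy configuration is roughly as follows. The block label and the `/api/v1/write` path are assumptions (that is the path Prometheus's built-in remote write receiver uses; the 9009 listener may front it differently):

```alloy
prometheus.remote_write "shared" {
  endpoint {
    // Tailscale hostname; 9009 is the remote write listener
    url = "http://prometheus-shared-cwiq-io:9009/api/v1/write"
  }
}
```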

**Cross-VPC agents must use the Tailscale hostname**

Alloy on DEV servers writes to http://prometheus-shared-cwiq-io:9009, the Tailscale hostname. The FQDN prometheus.shared.cwiq.io resolves to the Shared VPC private IP, which is not routable from the DEV VPC.


AlertManager Routing

AlertManager routes alerts to Slack based on the environment label attached to every metric by the Alloy agent:

```
route (default: shared-alerts)
  ├── environment=shared          → #cwiq-shared-infra-alerts
  └── environment=development|demo → #cwiq-dev-infra-alerts
        ├── severity=critical  → repeat_interval: 1h
        └── severity=warning   → repeat_interval: 4h
```
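
The tree above corresponds to an alertmanager.yml route block along these lines. This is a hedged sketch: `group_by` and any timing values other than `repeat_interval` are assumptions, not copied from the deployed file:

```yaml
route:
  receiver: shared-alerts            # default: environment=shared lands here
  group_by: [alertname, host]        # assumed grouping
  routes:
    - matchers: ['environment =~ "development|demo"']
      receiver: dev-alerts
      routes:
        - matchers: ['severity = "critical"']
          repeat_interval: 1h
        - matchers: ['severity = "warning"']
          repeat_interval: 4h
```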

Receivers

| Receiver | Environment | Slack Channel |
|---|---|---|
| shared-alerts | shared | #cwiq-shared-infra-alerts |
| dev-alerts | development, demo | #cwiq-dev-infra-alerts |

Inhibition Rules

  • A critical alert suppresses all warning alerts with the same {alertname, host}.
  • A ServiceDown alert suppresses all warning alerts with the same instance — prevents alert storms when a host goes unreachable.
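
In alertmanager.yml terms, the two rules look roughly like this (a sketch, not the deployed file):

```yaml
inhibit_rules:
  # critical suppresses warnings for the same alert on the same host
  - source_matchers: ['severity = "critical"']
    target_matchers: ['severity = "warning"']
    equal: [alertname, host]
  # ServiceDown suppresses all warnings from the same instance
  - source_matchers: ['alertname = "ServiceDown"']
    target_matchers: ['severity = "warning"']
    equal: [instance]
```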

Slack Message Fields

| Field | Example |
|---|---|
| Alert name (links to Grafana) | HighDiskUsage |
| Host | prometheus-shared-cwiq-io |
| Mount | /data or n/a |
| Severity | warning |
| Environment | shared |
| Description | Root filesystem at 83% on prometheus-shared-cwiq-io |

Alert Rules Reference

39 rules total across 6 groups. Rules are in prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2.

Infrastructure Group (25 rules)

| Alert | Threshold | For | Severity |
|---|---|---|---|
| HighDiskUsage | root / > 80% | 5m | warning |
| HighDiskUsageCritical | root / > 90% | 2m | critical |
| HighDiskUsageData | /data > 80% | 5m | warning |
| HighDiskUsageDataCritical | /data > 90% | 2m | critical |
| HighDiskUsageContainerd | /var/lib/containerd > 80% | 5m | warning |
| HighDiskUsageContainerdCritical | /var/lib/containerd > 90% | 2m | critical |
| DiskWillFillIn24h | / predicted to fill within 24h | 30m | warning |
| DataDiskWillFillIn24h | /data predicted to fill within 24h | 30m | warning |
| ContainerdDiskWillFillIn24h | /var/lib/containerd predicted to fill within 24h | 30m | warning |
| FilesystemInodeExhaustion | / inodes < 10% free | 5m | warning |
| DataFilesystemInodeExhaustion | /data inodes < 10% free | 5m | warning |
| ContainerdFilesystemInodeExhaustion | /var/lib/containerd inodes < 10% free | 5m | warning |
| HighDiskUsageOther | any other mount (ext4/xfs) > 80% | 5m | warning |
| HighDiskUsageOtherCritical | any other mount > 90% | 2m | critical |
| OtherDiskWillFillIn24h | any other mount predicted to fill in 24h | 30m | warning |
| OtherFilesystemInodeExhaustion | any other mount inodes < 10% free | 5m | warning |
| HighMemoryUsage | memory > 80% | 5m | warning |
| HighMemoryUsageCritical | memory > 90% | 2m | critical |
| HighCPUUsage | CPU > 80% | 5m | warning |
| HighCPUUsageCritical | CPU > 90% | 2m | critical |
| HighSwapUsage | swap > 80% | 5m | warning |
| ServiceDown | up == 0 | 2m | critical |
| NetworkReceiveErrors | receive errors > 0 | 5m | warning |
| NetworkTransmitErrors | transmit errors > 0 | 5m | warning |
| NTPClockDrift | NTP offset > 50ms | 10m | warning |
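
The *WillFillIn24h* rules use predict_linear to extrapolate free space. A sketch of the root-filesystem variant (the 6h lookback window is an assumption):

```promql
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
```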

Catch-all disk rules

The *Other* rules use fstype=~"ext4|xfs" with a negative mountpoint filter. This covers any additional EBS volumes automatically without requiring a rule change when new volumes are added.
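
A sketch of what the catch-all expression looks like (the exact excluded-mountpoint regex is an assumption):

```promql
100 * (1 - node_filesystem_avail_bytes{fstype=~"ext4|xfs", mountpoint!~"/|/data|/var/lib/containerd"}
         / node_filesystem_size_bytes{fstype=~"ext4|xfs", mountpoint!~"/|/data|/var/lib/containerd"}) > 80
```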

Docker Group (4 rules — dormant)

Requires cadvisor metrics. Alloy does not currently collect cadvisor data, so these rules produce no alerts. They will activate when cadvisor collection is added.

| Alert | Trigger | Severity |
|---|---|---|
| ContainerRestartLoop | Container restarted > 3 times in 15m | warning |
| ContainerOOMKilled | Container killed by OOM | critical |
| ContainerHighCPU | Container CPU > 90% sustained | warning |
| ContainerHighMemory | Container near memory limit | warning |

Loki Group (2 rules)

| Alert | Trigger | Severity |
|---|---|---|
| LokiIngestionErrors | Loki dropping ingested logs | warning |
| LokiRequestErrors | Loki returning 5xx errors | warning |

Application Group (3 rules)

| Alert | Trigger | Severity |
|---|---|---|
| HighHTTPErrorRate | Application 5xx rate > 0.1/s per job | warning |
| HighHTTPLatency | p99 latency > 5s | warning |
| HighHTTPLatencyCritical | p99 latency > 15s | critical |

Requires the application to expose http_requests_total (with status label) and http_request_duration_seconds histogram metrics.
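
Given those metrics, the rule expressions are likely close to the following (window sizes are assumptions):

```promql
# HighHTTPErrorRate: per-job 5xx rate
sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) > 0.1

# HighHTTPLatency: p99 from the duration histogram
histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 5
```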

Kafka / Redpanda Group (2 rules)

| Alert | Trigger | Severity |
|---|---|---|
| RedpandaBrokerDown | Redpanda metrics absent for 2m | critical |
| KafkaConsumerLagHigh | Consumer group lag > 10,000 events | warning |

Observability Self-Monitoring Group (3 rules)

| Alert | Trigger | Severity |
|---|---|---|
| PrometheusStorageHigh | Prometheus /data volume > 80% | warning |
| PrometheusRemoteWriteFailures | Remote write failures > 0 | warning |
| AlertmanagerNotificationsFailing | AlertManager notification failures > 0 | critical |
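
The AlertManager failure rule can be sketched against AlertManager's standard self-metric (the rate window is an assumption):

```promql
rate(alertmanager_notifications_failed_total[5m]) > 0
```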

Deployment

```shell
ssh ansible@ansible-shared-cwiq-io
ansible-helper
cd prometheus
ansible-playbook -i inventory/shared.yml deploy-prometheus.yml
```

The playbook writes prometheus.yml, alerting-rules.yml, and alertmanager.yml from templates, then starts both containers.


Useful PromQL Queries

```promql
# All hosts reporting metrics
count by (host) (up)

# CPU usage per host
100 - (avg by (host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used per host
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Disk usage percent on /data
(node_filesystem_size_bytes{mountpoint="/data"} - node_filesystem_avail_bytes{mountpoint="/data"})
/ node_filesystem_size_bytes{mountpoint="/data"} * 100

# All currently firing alerts
ALERTS{alertstate="firing"}
```

Testing Slack Integration

```shell
# Fire a test alert (run on prometheus-shared-cwiq-io)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "environment": "shared"
    },
    "annotations": {
      "summary": "Test alert",
      "description": "Slack integration test"
    }
  }]'
```

The test alert appears in #cwiq-shared-infra-alerts within 30 seconds (AlertManager group_wait). It auto-resolves after ~5 minutes.