Prometheus & AlertManager¶
Prometheus v3.3.0 stores all infrastructure metrics pushed by Alloy agents via remote write, evaluates 39 alert rules, and routes firing alerts through AlertManager v0.28.1 to Slack.
Overview¶
| Property | Value |
|---|---|
| Prometheus version | v3.3.0 |
| AlertManager version | v0.28.1 |
| Host | prometheus-shared-cwiq-io (VPC 10.0.15.9) |
| Prometheus UI | https://prometheus.shared.cwiq.io |
| Remote write port | 9009 (all interfaces — for Alloy agents) |
| Metric retention | 30 days / 50 GB |
| Playbook | prometheus/deploy-prometheus.yml |
Both services run as Docker containers (observability-prometheus, observability-alertmanager) on the same host. Nginx provides SSL termination for the Prometheus UI.
How Metrics Arrive¶
Prometheus does not scrape hosts directly. All metrics are pushed by Alloy agents via remote write:
Alloy agent (every host)
| remote write every 15 seconds
v
prometheus-shared-cwiq-io:9009 (all interfaces)
|
prometheus-shared-cwiq-io:9090 (internal API, localhost only)
Static scrape targets are only used for co-located services (Prometheus itself and AlertManager).
Cross-VPC agents must use the Tailscale hostname
Alloy on DEV servers writes to http://prometheus-shared-cwiq-io:9009 (the Tailscale hostname). The FQDN resolves to the Shared VPC private IP, which is not routable from the DEV VPC.
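For reference, the remote write side of an Alloy agent might look like the following River snippet. This is a sketch, not the deployed config: the component label `default`, the `/api/v1/write` path (the standard Prometheus remote-write receiver endpoint), and the use of `external_labels` for the environment label are all assumptions.

```river
prometheus.remote_write "default" {
  endpoint {
    // Tailscale hostname -- the FQDN is not routable across VPCs.
    url = "http://prometheus-shared-cwiq-io:9009/api/v1/write"
  }
  // The environment label that AlertManager routing keys on.
  external_labels = {
    environment = "shared",
  }
}
```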
AlertManager Routing¶
AlertManager routes alerts to Slack based on the environment label attached to every metric by the Alloy agent:
route (default: shared-alerts)
├── environment=shared → #cwiq-shared-infra-alerts
└── environment=development|demo → #cwiq-dev-infra-alerts
    ├── severity=critical → repeat_interval: 1h
    └── severity=warning → repeat_interval: 4h
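Expressed in alertmanager.yml, this tree might look roughly as follows. This is a sketch using standard AlertManager matcher syntax (v0.22+); the deployed template may differ in detail.

```yaml
route:
  receiver: shared-alerts            # default receiver
  routes:
    - matchers: ['environment="shared"']
      receiver: shared-alerts
    - matchers: ['environment=~"development|demo"']
      receiver: dev-alerts
      routes:
        - matchers: ['severity="critical"']
          repeat_interval: 1h
        - matchers: ['severity="warning"']
          repeat_interval: 4h
```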
Receivers¶
| Receiver | Environment | Slack Channel |
|---|---|---|
| `shared-alerts` | shared | #cwiq-shared-infra-alerts |
| `dev-alerts` | development, demo | #cwiq-dev-infra-alerts |
Inhibition Rules¶
- A `critical` alert suppresses all `warning` alerts with the same `{alertname, host}`.
- A `ServiceDown` alert suppresses all `warning` alerts with the same `instance` — prevents alert storms when a host goes unreachable.
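These two rules could be expressed in alertmanager.yml roughly as follows (a sketch; the deployed template may differ):

```yaml
inhibit_rules:
  # critical suppresses warning for the same alert on the same host
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: [alertname, host]
  # ServiceDown suppresses all warnings from the same instance
  - source_matchers: ['alertname="ServiceDown"']
    target_matchers: ['severity="warning"']
    equal: [instance]
```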
Slack Message Fields¶
| Field | Example |
|---|---|
| Alert name (links to Grafana) | HighDiskUsage |
| Host | prometheus-shared-cwiq-io |
| Mount | /data or n/a |
| Severity | warning |
| Environment | shared |
| Description | Root filesystem at 83% on prometheus-shared-cwiq-io |
Alert Rules Reference¶
39 rules total across 6 groups. Rules are in prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2.
Infrastructure Group (25 rules)¶
| Alert | Threshold | For | Severity |
|---|---|---|---|
| `HighDiskUsage` | root `/` > 80% | 5m | warning |
| `HighDiskUsageCritical` | root `/` > 90% | 2m | critical |
| `HighDiskUsageData` | `/data` > 80% | 5m | warning |
| `HighDiskUsageDataCritical` | `/data` > 90% | 2m | critical |
| `HighDiskUsageContainerd` | `/var/lib/containerd` > 80% | 5m | warning |
| `HighDiskUsageContainerdCritical` | `/var/lib/containerd` > 90% | 2m | critical |
| `DiskWillFillIn24h` | `/` predicted to fill within 24h | 30m | warning |
| `DataDiskWillFillIn24h` | `/data` predicted to fill within 24h | 30m | warning |
| `ContainerdDiskWillFillIn24h` | `/var/lib/containerd` predicted to fill within 24h | 30m | warning |
| `FilesystemInodeExhaustion` | `/` inodes < 10% free | 5m | warning |
| `DataFilesystemInodeExhaustion` | `/data` inodes < 10% free | 5m | warning |
| `ContainerdFilesystemInodeExhaustion` | `/var/lib/containerd` inodes < 10% free | 5m | warning |
| `HighDiskUsageOther` | any other mount (ext4/xfs) > 80% | 5m | warning |
| `HighDiskUsageOtherCritical` | any other mount > 90% | 2m | critical |
| `OtherDiskWillFillIn24h` | any other mount predicted to fill in 24h | 30m | warning |
| `OtherFilesystemInodeExhaustion` | any other mount inodes < 10% free | 5m | warning |
| `HighMemoryUsage` | memory > 80% | 5m | warning |
| `HighMemoryUsageCritical` | memory > 90% | 2m | critical |
| `HighCPUUsage` | CPU > 80% | 5m | warning |
| `HighCPUUsageCritical` | CPU > 90% | 2m | critical |
| `HighSwapUsage` | swap > 80% | 5m | warning |
| `ServiceDown` | `up == 0` | 2m | critical |
| `NetworkReceiveErrors` | receive errors > 0 | 5m | warning |
| `NetworkTransmitErrors` | transmit errors > 0 | 5m | warning |
| `NTPClockDrift` | NTP offset > 50ms | 10m | warning |
Catch-all disk rules
The *Other* rules use `fstype=~"ext4|xfs"` with a negative mountpoint filter. This covers any additional EBS volumes automatically, without requiring a rule change when new volumes are added.
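As an illustration, the catch-all warning rule likely takes a shape similar to the sketch below. The exact template lives in alerting-rules.yml.j2; the mountpoint exclusion list shown here is an assumption based on the dedicated per-mount rules above.

```yaml
- alert: HighDiskUsageOther
  expr: |
    100 * (1 - node_filesystem_avail_bytes{fstype=~"ext4|xfs",
                   mountpoint!~"/|/data|/var/lib/containerd"}
             / node_filesystem_size_bytes{fstype=~"ext4|xfs",
                   mountpoint!~"/|/data|/var/lib/containerd"}) > 80
  for: 5m
  labels:
    severity: warning
```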
Docker Group (4 rules — dormant)¶
Requires cadvisor metrics. Alloy does not currently collect cadvisor data, so these rules produce no alerts. They will activate when cadvisor collection is added.
| Alert | Trigger | Severity |
|---|---|---|
| `ContainerRestartLoop` | Container restarted > 3 times in 15m | warning |
| `ContainerOOMKilled` | Container killed by OOM | critical |
| `ContainerHighCPU` | Container CPU > 90% sustained | warning |
| `ContainerHighMemory` | Container near memory limit | warning |
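When cadvisor collection is added, the Alloy side could look roughly like this. `prometheus.exporter.cadvisor` and `prometheus.scrape` are real Alloy components, but the component labels and the `prometheus.remote_write.default` reference are assumptions about the local config.

```river
// Embedded cadvisor collector (component label is illustrative)
prometheus.exporter.cadvisor "containers" { }

prometheus.scrape "cadvisor" {
  targets    = prometheus.exporter.cadvisor.containers.targets
  forward_to = [prometheus.remote_write.default.receiver]
}
```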
Loki Group (2 rules)¶
| Alert | Trigger | Severity |
|---|---|---|
| `LokiIngestionErrors` | Loki dropping ingested logs | warning |
| `LokiRequestErrors` | Loki returning 5xx errors | warning |
Application Group (3 rules)¶
| Alert | Trigger | Severity |
|---|---|---|
| `HighHTTPErrorRate` | Application 5xx rate > 0.1/s per job | warning |
| `HighHTTPLatency` | p99 latency > 5s | warning |
| `HighHTTPLatencyCritical` | p99 latency > 15s | critical |
Requires the application to expose `http_requests_total` (with a `status` label) and `http_request_duration_seconds` histogram metrics.
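Given those metrics, the underlying expressions are presumably along these lines (illustrative sketches, not copied from the rule file):

```promql
# 5xx rate per job > 0.1/s
sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) > 0.1

# p99 latency > 5s
histogram_quantile(0.99,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 5
```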
Kafka / Redpanda Group (2 rules)¶
| Alert | Trigger | Severity |
|---|---|---|
| `RedpandaBrokerDown` | Redpanda metrics absent for 2m | critical |
| `KafkaConsumerLagHigh` | Consumer group lag > 10,000 events | warning |
Observability Self-Monitoring Group (3 rules)¶
| Alert | Trigger | Severity |
|---|---|---|
| `PrometheusStorageHigh` | Prometheus `/data` volume > 80% | warning |
| `PrometheusRemoteWriteFailures` | Remote write failures > 0 | warning |
| `AlertmanagerNotificationsFailing` | AlertManager notification failures > 0 | critical |
Deployment¶
ssh ansible@ansible-shared-cwiq-io
ansible-helper
cd prometheus
ansible-playbook -i inventory/shared.yml deploy-prometheus.yml
The playbook writes prometheus.yml, alerting-rules.yml, and alertmanager.yml from templates, then starts both containers.
Useful PromQL Queries¶
# All hosts reporting metrics
count by (host) (up)
# CPU usage per host
100 - (avg by (host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory used per host
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Disk usage percent on /data
(node_filesystem_size_bytes{mountpoint="/data"} - node_filesystem_avail_bytes{mountpoint="/data"})
/ node_filesystem_size_bytes{mountpoint="/data"} * 100
# All currently firing alerts
ALERTS{alertstate="firing"}
Testing Slack Integration¶
# Fire a test alert (run on prometheus-shared-cwiq-io)
curl -X POST http://localhost:9093/api/v2/alerts \
-H 'Content-Type: application/json' \
-d '[{
"labels": {
"alertname": "TestAlert",
"severity": "warning",
"environment": "shared"
},
"annotations": {
"summary": "Test alert",
"description": "Slack integration test"
}
}]'
The test alert appears in #cwiq-shared-infra-alerts within 30 seconds (AlertManager group_wait). It auto-resolves after ~5 minutes.
Related Documentation¶
- Monitoring Overview
- Alloy Log & Metric Collection
- Slack Alerting
- Adding Monitoring for New Infrastructure
- Source: ansible-playbooks/prometheus/docs/ALERTING.md
- Source: ansible-playbooks/prometheus/docs/README.md