Monitoring & Alerting Overview
The CWIQ observability stack collects logs and metrics from all 22 servers, routes alerts to Slack, and provides dashboards through Grafana. It combines three layers: Alloy (collection), Loki + Prometheus (storage), and Grafana + Icinga (visualization and health checks).
Architecture
                      22 Servers (Alloy Agents)
                 |                              |
        logs via HTTP push           metrics via remote write
                 v                              v
      loki-shared-cwiq-io            prometheus-shared-cwiq-io
      Loki 3.6.7 :3100               Prometheus v3.3.0 :9090 / :9009
      S3 backend                     AlertManager v0.28.1 :9093
                 |                              |
                 v                              v
      grafana-shared-cwiq-io         #cwiq-shared-infra-alerts
      Grafana 12.4.0                 #cwiq-dev-infra-alerts
      https://grafana.shared.cwiq.io

      icinga-shared-cwiq-io   ←→   icinga-dev-cwiq-io
      Master (14 hosts)            Satellite (7 hosts)
      https://icinga.shared.cwiq.io
                 |
                 v
      Slack channels (same as above)
Components
| Component | Version | Host | Purpose |
|---|---|---|---|
| Grafana Alloy | 1.13.2 | All 22 servers | Log and metric collection agent |
| Loki | 3.6.7 | loki-shared-cwiq-io | Log aggregation with S3 backend |
| Prometheus | v3.3.0 | prometheus-shared-cwiq-io | Metrics storage and alerting |
| AlertManager | v0.28.1 | prometheus-shared-cwiq-io | Alert routing and silencing |
| Grafana | 12.4.0 | grafana-shared-cwiq-io | Dashboards and log exploration |
| Icinga2 | 2.15.2 | icinga-shared-cwiq-io (master) + icinga-dev-cwiq-io (satellite) | Infrastructure health checks |
Connection Reference
Cross-VPC: always use Tailscale hostnames
Alloy agents on DEV servers (10.1.x.x) MUST use Tailscale hostnames to reach the Shared observability stack. VPC peering does not cover the Shared private subnets, so an FQDN such as loki.shared.cwiq.io resolves to a VPC private IP that is not routable from DEV.
| Service | For Shared VPC agents | For DEV VPC agents | Port | Protocol |
|---|---|---|---|---|
| Loki log push | loki.shared.cwiq.io (Route53 private DNS) | loki-shared-cwiq-io (Tailscale) | 3100 | HTTP |
| Prometheus remote write | prometheus.shared.cwiq.io:9009 | prometheus-shared-cwiq-io:9009 | 9009 | HTTP |
| Grafana UI | https://grafana.shared.cwiq.io | same (HTTPS via Nginx) | 443 | HTTPS |
| Prometheus UI | https://prometheus.shared.cwiq.io | same (HTTPS via Nginx) | 443 | HTTPS |
| Icinga UI | https://icinga.shared.cwiq.io | same (HTTPS via Nginx) | 443 | HTTPS |
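The push endpoints in the table above can be expressed as a small lookup. The following is a minimal shell sketch, not part of the CWIQ tooling: the function names are hypothetical, and it assumes that a private IP in 10.1.x.x identifies a DEV VPC agent (as stated in the note above) and that every other address can be treated as Shared.

```shell
#!/bin/sh
# Hypothetical helpers: choose the push endpoints for an Alloy agent
# from its private IP address.

# Assumption: 10.1.x.x means DEV VPC; any other address is treated as Shared.
agent_vpc() {
  case "$1" in
    10.1.*) echo "dev" ;;
    *)      echo "shared" ;;
  esac
}

# Loki log push target (HTTP, port 3100).
loki_push_endpoint() {
  if [ "$(agent_vpc "$1")" = "dev" ]; then
    echo "loki-shared-cwiq-io:3100"        # Tailscale hostname
  else
    echo "loki.shared.cwiq.io:3100"        # Route53 private DNS
  fi
}

# Prometheus remote write target (HTTP, port 9009).
prometheus_remote_write_endpoint() {
  if [ "$(agent_vpc "$1")" = "dev" ]; then
    echo "prometheus-shared-cwiq-io:9009"  # Tailscale hostname
  else
    echo "prometheus.shared.cwiq.io:9009"  # Shared VPC FQDN
  fi
}
```

For example, `loki_push_endpoint 10.1.4.20` prints the Tailscale hostname with port 3100, while an address outside 10.1.x.x yields the Route53 name.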
Slack Channels
| Channel | Covers | Trigger |
|---|---|---|
| #cwiq-shared-infra-alerts | Shared Services environment (14 hosts) | Prometheus AlertManager + Icinga master |
| #cwiq-dev-infra-alerts | DEV and Demo environments (7 hosts) | Prometheus AlertManager + Icinga satellite |
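The channel mapping above amounts to a lookup by environment. Here is a minimal sketch, assuming hypothetical environment labels `shared`, `dev`, and `demo`; the labels actually used in the AlertManager and Icinga configuration are not stated in this document.

```shell
#!/bin/sh
# Hypothetical helper: map an environment label to its Slack alert channel.
# The labels "shared", "dev", and "demo" are assumptions for illustration.
alert_channel() {
  case "$1" in
    shared)   echo "#cwiq-shared-infra-alerts" ;;
    dev|demo) echo "#cwiq-dev-infra-alerts" ;;
    *)
      echo "unknown environment: $1" >&2
      return 1
      ;;
  esac
}
```

Failing on an unknown label, rather than falling back to a default channel, keeps a mislabeled host from alerting silently into the wrong place.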
Alert Coverage
Every server in the CWIQ infrastructure must be covered by all three monitoring systems:
| System | Coverage Requirement |
|---|---|
| Alloy | Running on every server, forwarding logs and metrics |
| Icinga | Every host has a host object with SSH + HTTPS checks |
| Prometheus | Alert rules cover disk, CPU, memory, swap, and NTP for every host, using metrics received via remote write |
See Adding Monitoring for the mandatory co-change checklist when adding a new server or service.
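The "all three systems" requirement can be checked mechanically by comparing host lists. The sketch below assumes you have exported one plain-text file per system with one hostname per line, plus an inventory file listing every server. Those files and the function names are hypothetical, and how you produce the lists is left open.

```shell
#!/bin/sh
# Hypothetical coverage check over plain-text host lists (one host per line).

# Print hosts that appear in the inventory ($1) but not in a coverage list ($2).
missing_from() {
  # -v invert, -x whole-line match, -F fixed strings, -f read patterns from file
  grep -vxFf "$2" "$1" || true
}

# Usage: coverage_report INVENTORY LIST...
# Returns non-zero if any list is missing a host from the inventory.
coverage_report() {
  inventory=$1
  shift
  status=0
  for list in "$@"; do
    missing=$(missing_from "$inventory" "$list")
    if [ -n "$missing" ]; then
      printf '%s is missing:\n%s\n' "$list" "$missing"
      status=1
    fi
  done
  return "$status"
}
```

A call such as `coverage_report all-hosts.txt alloy.txt icinga.txt prometheus.txt` (file names are placeholders) would list any host that lacks coverage in a given system.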
Playbook Reference
All playbooks are run from the Ansible server via ansible-helper:
ssh ansible@ansible-shared-cwiq-io
ansible-helper
git pull origin main
# Deploy a specific component
cd alloy && ansible-playbook -i inventory/dev.yml deploy-alloy.yml
cd ../prometheus && ansible-playbook -i inventory/shared.yml deploy-prometheus.yml
cd ../loki && ansible-playbook -i inventory/shared.yml deploy-loki.yml
cd ../grafana && ansible-playbook -i inventory/shared.yml deploy-grafana.yml
cd ../icinga && ansible-playbook -i inventory/shared.yml deploy-config.yml