Monitoring & Alerting Overview

The CWIQ observability stack collects logs and metrics from all 22 servers, routes alerts to Slack, and provides dashboards through Grafana. It is built from three layers: Alloy (collection), Loki + Prometheus (storage), and Grafana + Icinga (visualization and health checks).


Architecture

22 Servers — Alloy Agents
      |  logs via HTTP push            |  metrics via remote write
      v                                v
loki-shared-cwiq-io             prometheus-shared-cwiq-io
  Loki 3.6.7 :3100                Prometheus v3.3.0 :9090 / :9009
  S3 backend                      AlertManager v0.28.1 :9093
      |                                |
      v                                v
grafana-shared-cwiq-io          #cwiq-shared-infra-alerts
  Grafana 12.4.0                 #cwiq-dev-infra-alerts
  https://grafana.shared.cwiq.io

icinga-shared-cwiq-io  ←→  icinga-dev-cwiq-io
  Master (14 hosts)           Satellite (7 hosts)
  https://icinga.shared.cwiq.io
      |
      v
Slack channels (same as above)

Components

| Component | Version | Host | Purpose |
|---|---|---|---|
| Grafana Alloy | 1.13.2 | All 22 servers | Log and metric collection agent |
| Loki | 3.6.7 | loki-shared-cwiq-io | Log aggregation with S3 backend |
| Prometheus | v3.3.0 | prometheus-shared-cwiq-io | Metrics storage and alerting |
| AlertManager | v0.28.1 | prometheus-shared-cwiq-io | Alert routing and silencing |
| Grafana | 12.4.0 | grafana-shared-cwiq-io | Dashboards and log exploration |
| Icinga2 | 2.15.2 | icinga-shared-cwiq-io (master) + icinga-dev-cwiq-io (satellite) | Infrastructure health checks |

Connection Reference

Cross-VPC: always use Tailscale hostnames

Alloy agents on DEV servers (10.1.x.x) MUST use Tailscale hostnames to reach the Shared observability stack. VPC peering does not cover the Shared private subnets, so an FQDN such as loki.shared.cwiq.io resolves to a VPC private IP that is not routable from DEV.

| Service | For Shared VPC agents | For DEV VPC agents | Port | Protocol |
|---|---|---|---|---|
| Loki log push | loki.shared.cwiq.io (Route53 private DNS) | loki-shared-cwiq-io (Tailscale) | 3100 | HTTP |
| Prometheus remote write | prometheus.shared.cwiq.io:9009 | prometheus-shared-cwiq-io:9009 | 9009 | HTTP |
| Grafana UI | https://grafana.shared.cwiq.io | same (HTTPS via Nginx) | 443 | HTTPS |
| Prometheus UI | https://prometheus.shared.cwiq.io | same (HTTPS via Nginx) | 443 | HTTPS |
| Icinga UI | https://icinga.shared.cwiq.io | same (HTTPS via Nginx) | 443 | HTTPS |
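
The hostname rule for the two push endpoints can be sketched as a small shell helper. This is an illustration only and not part of the CWIQ playbooks: the function name pick_endpoint is hypothetical, and it assumes that a 10.1.x.x private IP identifies a DEV host while any other address belongs to the Shared VPC.

```shell
#!/usr/bin/env bash
# Sketch: choose the push endpoint for an Alloy agent based on its VPC.
# Assumption: a 10.1.x.x private IP means the host is in DEV; every other
# address is treated as a Shared VPC host.

# pick_endpoint SERVICE PRIVATE_IP -> prints host:port
pick_endpoint() {
  local service="$1" ip="$2" port
  case "$service" in
    loki)       port=3100 ;;   # Loki log push
    prometheus) port=9009 ;;   # Prometheus remote write
    *) echo "unknown service: $service" >&2; return 1 ;;
  esac
  case "$ip" in
    10.1.*) echo "${service}-shared-cwiq-io:${port}" ;;  # Tailscale hostname
    *)      echo "${service}.shared.cwiq.io:${port}" ;;  # Route53 private DNS
  esac
}
```

For example, `pick_endpoint loki 10.1.4.20` prints `loki-shared-cwiq-io:3100`; the IP address here is made up for illustration.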

Slack Channels

| Channel | Covers | Trigger |
|---|---|---|
| #cwiq-shared-infra-alerts | Shared Services environment (14 hosts) | Prometheus AlertManager + Icinga master |
| #cwiq-dev-infra-alerts | DEV and Demo environments (7 hosts) | Prometheus AlertManager + Icinga satellite |

Alert Coverage

Every server in the CWIQ infrastructure must be covered by all three monitoring systems:

| System | Coverage Requirement |
|---|---|
| Alloy | Running on every server, forwarding logs and metrics |
| Icinga | Every host has a host object with SSH + HTTPS checks |
| Prometheus | Alert rules cover disk, CPU, memory, swap, NTP for every host via remote write |

See Adding Monitoring for the mandatory co-change checklist when adding a new server or service.
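
One way to audit the "all three systems" requirement is to compare a server inventory against a host list from each system. The sketch below assumes four plain-text files with one hostname per line; how those lists are exported from Alloy, Icinga, and Prometheus is not covered here, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report servers that are missing from any of the three systems.
# Assumption: each input file holds one hostname per line.

# missing_from INVENTORY SYSTEM_LIST -> inventory hosts absent from SYSTEM_LIST
missing_from() {
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
  # grep exits 1 when nothing is printed, which is not an error here.
  grep -Fxv -f "$2" "$1" || true
}

# coverage_gaps INVENTORY ALLOY_LIST ICINGA_LIST PROMETHEUS_LIST
# -> prints one "system: host" line per gap; prints nothing when coverage is complete
coverage_gaps() {
  inventory="$1"; shift
  for system in alloy icinga prometheus; do
    missing_from "$inventory" "$1" | sed "s/^/${system}: /"
    shift
  done
}
```

An empty result means every inventory host appears in all three lists. It does not prove that the checks or alert rules themselves are correct.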


Playbook Reference

All playbooks run from the ansible server via ansible-helper:

ssh ansible@ansible-shared-cwiq-io
ansible-helper
git pull origin main

# Deploy a specific component
cd alloy && ansible-playbook -i inventory/dev.yml deploy-alloy.yml
cd ../prometheus && ansible-playbook -i inventory/shared.yml deploy-prometheus.yml
cd ../loki && ansible-playbook -i inventory/shared.yml deploy-loki.yml
cd ../grafana && ansible-playbook -i inventory/shared.yml deploy-grafana.yml
cd ../icinga && ansible-playbook -i inventory/shared.yml deploy-config.yml
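
Before a real deploy, a dry run can be useful. The sketch below wraps the commands above and adds ansible-playbook's standard --check and --diff flags. The function names are hypothetical, the component-to-inventory mapping is copied from the commands above, and it assumes the playbooks behave sensibly in check mode, which has not been verified here.

```shell
#!/usr/bin/env bash
# Sketch: dry-run one component's playbook from its own directory.

# playbook_args COMPONENT -> prints "DIR INVENTORY PLAYBOOK"
playbook_args() {
  case "$1" in
    alloy)      echo "alloy inventory/dev.yml deploy-alloy.yml" ;;
    prometheus) echo "prometheus inventory/shared.yml deploy-prometheus.yml" ;;
    loki)       echo "loki inventory/shared.yml deploy-loki.yml" ;;
    grafana)    echo "grafana inventory/shared.yml deploy-grafana.yml" ;;
    icinga)     echo "icinga inventory/shared.yml deploy-config.yml" ;;
    *) echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# dry_run COMPONENT -> runs the playbook with --check --diff (no changes applied)
dry_run() {
  args=$(playbook_args "$1") || return 1
  set -- $args                      # split into DIR INVENTORY PLAYBOOK
  # The subshell keeps the caller's working directory unchanged.
  ( cd "$1" && ansible-playbook -i "$2" "$3" --check --diff )
}
```

Run `dry_run loki` from the repository root after `git pull origin main`, then rerun the original command without the extra flags to apply the change.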