CWIQ.IO Wiki

Monitoring & Alerting Overview

The CWIQ observability stack collects logs and metrics from all 22 servers, routes alerts to Slack, and provides dashboards through Grafana. It combines three layers: Alloy (collection), Loki + Prometheus (storage), and Grafana + Icinga (visualization and health checks).

Architecture

22 Servers — Alloy Agents
      |  logs via HTTP push            |  metrics via remote write
      v                                v
loki-shared-cwiq-io             prometheus-shared-cwiq-io
  Loki 3.6.7 :3100                Prometheus v3.3.0 :9090 / :9009
  S3 backend                      AlertManager v0.28.1 :9093
      |                                |
      v                                v
grafana-shared-cwiq-io          #cwiq-shared-infra-alerts
  Grafana 12.4.0                 #cwiq-dev-infra-alerts
  https://grafana.shared.cwiq.io

icinga-shared-cwiq-io  ←→  icinga-dev-cwiq-io
  Master (14 hosts)           Satellite (7 hosts)
  https://icinga.shared.cwiq.io
      |
      v
Slack channels (same as above)

Components

Component	Version	Host	Purpose
Grafana Alloy	1.13.2	All 22 servers	Log and metric collection agent
Loki	3.6.7	`loki-shared-cwiq-io`	Log aggregation with S3 backend
Prometheus	v3.3.0	`prometheus-shared-cwiq-io`	Metrics storage and alerting
AlertManager	v0.28.1	`prometheus-shared-cwiq-io`	Alert routing and silencing
Grafana	12.4.0	`grafana-shared-cwiq-io`	Dashboards and log exploration
Icinga2	2.15.2	`icinga-shared-cwiq-io` (master) + `icinga-dev-cwiq-io` (satellite)	Infrastructure health checks

Connection Reference

Cross-VPC: always use Tailscale hostnames

Alloy agents on DEV servers (10.1.x.x) MUST use Tailscale hostnames to reach the Shared observability stack. VPC peering does not cover the Shared private subnets, so an FQDN such as loki.shared.cwiq.io resolves to a VPC private IP that is not routable from DEV.

Service	For Shared VPC agents	For DEV VPC agents	Port	Protocol
Loki log push	`loki.shared.cwiq.io` (Route53 private DNS)	`loki-shared-cwiq-io` (Tailscale)	3100	HTTP
Prometheus remote write	`prometheus.shared.cwiq.io:9009`	`prometheus-shared-cwiq-io:9009`	9009	HTTP
Grafana UI	`https://grafana.shared.cwiq.io`	same (HTTPS via Nginx)	443	HTTPS
Prometheus UI	`https://prometheus.shared.cwiq.io`	same (HTTPS via Nginx)	443	HTTPS
Icinga UI	`https://icinga.shared.cwiq.io`	same (HTTPS via Nginx)	443	HTTPS
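
The choice between the two hostname columns can be captured in a small lookup. The sketch below is a minimal illustration, assuming a hypothetical `endpoint_for` helper and the keywords `shared` and `dev` for the two VPCs. Only the hostnames and ports are taken from the table above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the CWIQ tooling): print the host:port an
# Alloy agent should push to, given the VPC it runs in and the target service.
# The "shared"/"dev" keywords and the function name are assumptions for this sketch.
endpoint_for() {
  # $1 = VPC of the agent (shared | dev), $2 = service (loki | prometheus)
  case "$1:$2" in
    shared:loki)       echo "loki.shared.cwiq.io:3100" ;;            # Route53 private DNS
    dev:loki)          echo "loki-shared-cwiq-io:3100" ;;            # Tailscale hostname
    shared:prometheus) echo "prometheus.shared.cwiq.io:9009" ;;
    dev:prometheus)    echo "prometheus-shared-cwiq-io:9009" ;;      # Tailscale hostname
    *)
      printf 'unknown VPC/service: %s/%s\n' "$1" "$2" >&2
      return 1
      ;;
  esac
}
```

For example, `endpoint_for dev loki` prints the Tailscale name `loki-shared-cwiq-io:3100`, while an unrecognized VPC or service writes a message to stderr and returns a non-zero status.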

Slack Channels

Channel	Covers	Trigger
`#cwiq-shared-infra-alerts`	Shared Services environment (14 hosts)	Prometheus AlertManager + Icinga master
`#cwiq-dev-infra-alerts`	DEV and Demo environments (7 hosts)	Prometheus AlertManager + Icinga satellite

Alert Coverage

Every server in the CWIQ infrastructure must be covered by all three monitoring systems:

System	Coverage Requirement
Alloy	Running on every server, forwarding logs and metrics
Icinga	Every host has a host object with SSH + HTTPS checks
Prometheus	Alert rules cover disk, CPU, memory, swap, and NTP for every host, using the metrics received via remote write
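
One way to audit this requirement is to compare the full inventory against the hosts each system actually knows about. The following is a minimal sketch, assuming the host lists have already been exported as space-separated strings. The host names, variable names, and the `missing_from` and `report_gaps` helpers are placeholders invented for this example, not part of the CWIQ tooling.

```shell
#!/bin/sh
# Sketch: list inventory hosts that one monitoring system does not cover.
# All host names below are placeholders; how each list is exported is out of scope.

missing_from() {
  # $1 = space-separated inventory, $2 = space-separated hosts known to one system
  # Prints the inventory hosts absent from $2, joined by single spaces.
  missing=""
  for host in $1; do
    case " $2 " in
      *" $host "*) ;;                                   # host is covered
      *) missing="${missing:+$missing }$host" ;;        # host is not covered
    esac
  done
  printf '%s\n' "$missing"
}

report_gaps() {
  # $1 = system name, $2 = that system's host list
  gaps=$(missing_from "$inventory" "$2")
  [ -z "$gaps" ] || printf '%s: missing %s\n' "$1" "$gaps"
}

inventory="host-a host-b host-c"
alloy_hosts="host-a host-b host-c"
icinga_hosts="host-a host-c"
prometheus_hosts="host-a host-b host-c"

report_gaps alloy "$alloy_hosts"
report_gaps icinga "$icinga_hosts"
report_gaps prometheus "$prometheus_hosts"
```

With the placeholder lists above, only Icinga has a gap, so the script prints `icinga: missing host-b`. A system with full coverage produces no output.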

See Adding Monitoring for the mandatory co-change checklist when adding a new server or service.

Playbook Reference

All playbooks run from the Ansible server via ansible-helper:

# Connect to the Ansible server and start the helper
ssh ansible@ansible-shared-cwiq-io
ansible-helper
# Pull the latest playbooks
git pull origin main

# Deploy a specific component. Each line runs in a subshell, so it can be
# run on its own from the repository root.
(cd alloy && ansible-playbook -i inventory/dev.yml deploy-alloy.yml)
(cd prometheus && ansible-playbook -i inventory/shared.yml deploy-prometheus.yml)
(cd loki && ansible-playbook -i inventory/shared.yml deploy-loki.yml)
(cd grafana && ansible-playbook -i inventory/shared.yml deploy-grafana.yml)
(cd icinga && ansible-playbook -i inventory/shared.yml deploy-config.yml)