Adding Monitoring for New Infrastructure

This is the mandatory checklist for updating the observability stack when you add a new server or service. Each row in the tables below names a file that must be updated in the same MR.


When This Applies

This checklist applies any time you:

  • Provision a new EC2 server
  • Add a new Docker container to an existing server
  • Add a new FastAPI service with a /health endpoint
  • Change a service port or nginx upstream
  • Modify files in alloy/, icinga/, prometheus/, or loki/

Adding a New Server (5 mandatory files)

CRITICAL: All five files must be in the same MR

MRs that add infrastructure without updating all monitoring documentation will be rejected.

#  File                                              Action
1  alloy/inventory/{shared|dev}.yml                  Add host entry with alloy_environment, ansible_host, ansible_user
2  alloy/docs/README.md                              Add row to Monitored Servers table (Docker Logs, App Metrics, Targets)
3  icinga/conf.d/hosts/{shared|dev}/<hostname>.conf  Create host config importing cwiq-shared-host or cwiq-dev-host
4  icinga/README.md                                  Add row to Monitored Hosts table (zone, host, checks)
5  docs/SLACK_ALERTING.md                            Add row to Coverage Map table and Icinga Checks table
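
One way to catch a missing file before review is to compare the MR's changed paths against the five entries above. The following is a minimal sketch, not something that exists in the repository: `missing_server_files` is a hypothetical helper, and it assumes that paths are relative to the repository root and that the host argument matches the Icinga config file name.

```shell
# Hypothetical helper: print every mandatory file that is absent from a
# list of changed paths (one path per line on stdin).
#   $1 = inventory/zone name (shared or dev)
#   $2 = host name as used in the Icinga config file name
missing_server_files() {
  local env="$1" host="$2" changed f
  changed="$(cat)"
  for f in \
    "alloy/inventory/${env}.yml" \
    "alloy/docs/README.md" \
    "icinga/conf.d/hosts/${env}/${host}.conf" \
    "icinga/README.md" \
    "docs/SLACK_ALERTING.md"
  do
    # -x: match the whole line, -F: literal string, -q: no output
    printf '%s\n' "$changed" | grep -qxF -- "$f" || echo "$f"
  done
}
```

For example, `git diff --name-only origin/main...HEAD | missing_server_files dev <hostname>` prints nothing when all five files are part of the MR. The target branch `origin/main` is an assumption; substitute the branch your MR targets.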

Template: Alloy inventory entry

# For a DEV server in alloy/inventory/dev.yml
<hostname>-dev-cwiq-io:
  ansible_host: <hostname>-dev-cwiq-io      # Tailscale hostname
  ansible_user: ec2-user
  alloy_environment: development
  alloy_scrape_app_metrics: false
  alloy_app_metrics_targets: []
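
To confirm that the new entry parses and carries the three variables the checklist requires, the host's variables can be dumped with `ansible-inventory` and checked. This is a sketch under assumptions: `missing_host_vars` is a hypothetical helper name, `python3` is assumed to be available, and the inventory is assumed to resolve the host without extra options.

```shell
# Hypothetical helper: read the JSON printed by `ansible-inventory --host`
# on stdin and list any required variable that is not set.
# Exits 1 if something is missing, 0 otherwise.
missing_host_vars() {
  python3 -c '
import json, sys

hostvars = json.load(sys.stdin)
required = ["alloy_environment", "ansible_host", "ansible_user"]
missing = [key for key in required if key not in hostvars]
for key in missing:
    print(key)
sys.exit(1 if missing else 0)
'
}
```

Usage: `ansible-inventory -i alloy/inventory/dev.yml --host <hostname>-dev-cwiq-io | missing_host_vars`. The sketch checks only the three variables named in the table; it does not validate `alloy_scrape_app_metrics` or `alloy_app_metrics_targets`.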

Template: Icinga host config

# icinga/conf.d/hosts/dev/<hostname>-dev.conf
object Host "<hostname>-dev-cwiq-io" {
  import "cwiq-dev-host"

  address = "<hostname>-dev-cwiq-io"
  display_name = "<Service Name> DEV"

  vars.environment = "dev"
  vars.os = "AlmaLinux"

  vars.http_vhosts["HTTPS"] = {
    http_address = "<service>.dev.cwiq.io"
    http_ssl = true
    http_vhost = "<service>.dev.cwiq.io"
    http_uri = "/health"
    http_port = 443
  }
}

Adding a New Service to an Existing Server (up to 7 files)

#  File                                                                Required When
1  alloy/inventory/{env}.yml                                           Service exposes /metrics endpoint
2  alloy/docs/README.md                                                Always: update host's row
3  icinga/conf.d/hosts/{env}/<hostname>.conf                           Always: add HTTP/TCP check for new port
4  icinga/README.md                                                    Always: update host's row
5  docs/SLACK_ALERTING.md                                              Always: update Coverage Map
6  prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2  Service needs a dedicated alert rule
7  prometheus/docs/ALERTING.md                                         New Prometheus alert rules were added
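
Because three of the seven files are conditional, it can help to spell out the resulting file list for a given service. The sketch below encodes the table as a shell function; `service_files` and its `yes`/`no` arguments are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: print the files to update when adding a service.
#   $1 = environment, $2 = host name
#   $3 = "yes" if the service exposes /metrics (default: no)
#   $4 = "yes" if the service needs a dedicated alert rule (default: no)
service_files() {
  local env="$1" host="$2" metrics="${3:-no}" alert="${4:-no}"
  if [ "$metrics" = "yes" ]; then
    echo "alloy/inventory/${env}.yml"
  fi
  # These four are always required.
  echo "alloy/docs/README.md"
  echo "icinga/conf.d/hosts/${env}/${host}.conf"
  echo "icinga/README.md"
  echo "docs/SLACK_ALERTING.md"
  if [ "$alert" = "yes" ]; then
    # A new rule and its documentation row go together.
    echo "prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2"
    echo "prometheus/docs/ALERTING.md"
  fi
}
```

`service_files dev <hostname>` lists the four files that are always required; `service_files dev <hostname> yes yes` lists all seven.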

Adding a New Prometheus Alert Rule (2 files)

#  File                                                                Action
1  prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2  Add rule to appropriate group
2  prometheus/docs/ALERTING.md                                         Add row to the alert rules table

Alert rule template

- alert: <AlertName>
  expr: |
    <promql_expression> > <threshold>
  for: 5m
  labels:
    severity: warning   # or: critical
  annotations:
    summary: "<Short description> on {{ $labels.host }}"
    description: "<Detailed description>. Current value: {{ $value | humanize }}"

After updating rules, deploy and verify:

cd prometheus
ansible-playbook -i inventory/shared.yml deploy-prometheus.yml

# Verify rule loaded (allow 30 seconds)
curl -s https://prometheus.shared.cwiq.io/api/v1/rules | \
  python3 -m json.tool | grep -A3 "<AlertName>"
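
The `grep` above matches the alert name anywhere in the output, including inside expressions or annotations. A stricter check parses the JSON and compares rule names exactly. This is a sketch that assumes the standard shape of the Prometheus rules API response (`data.groups[].rules[].name`); `rule_loaded` is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 if a rule named $1 appears in the
# /api/v1/rules JSON supplied on stdin, exit 1 otherwise.
rule_loaded() {
  python3 -c '
import json, sys

name = sys.argv[1]
groups = json.load(sys.stdin)["data"]["groups"]
found = any(
    rule.get("name") == name
    for group in groups
    for rule in group.get("rules", [])
)
sys.exit(0 if found else 1)
' "$1"
}
```

Usage: `curl -s https://prometheus.shared.cwiq.io/api/v1/rules | rule_loaded "<AlertName>" && echo loaded`.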

Modifying Existing Monitoring Configuration

Any modification to these directories requires documentation updates in the same commit:

Directory Modified                             Documentation Files to Update
alloy/ (inventory, config templates, roles)    alloy/docs/README.md, docs/SLACK_ALERTING.md
icinga/conf.d/ (host configs, services)        icinga/README.md, docs/SLACK_ALERTING.md
prometheus/roles/deploy_prometheus/templates/  prometheus/docs/ALERTING.md, docs/SLACK_ALERTING.md
loki/ (config templates, roles)                loki/docs/README.md, loki/docs/OPERATIONS.md
Docker Compose files (new service containers)  alloy/docs/README.md, icinga/README.md, docs/SLACK_ALERTING.md
nginx configs (new upstream/proxy_pass)        icinga/README.md, docs/SLACK_ALERTING.md

This applies to additions, modifications, AND removals. Examples:

  • Removing a check type → update Icinga and Slack docs
  • Changing alert thresholds → update prometheus/docs/ALERTING.md and docs/SLACK_ALERTING.md
  • Adding a Loki pipeline stage → update loki/docs/README.md
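
The directory-to-documentation mapping in the table above can be expressed as a lookup, which is handy in a pre-commit hook or MR check. The sketch below covers only the four directory rows; Docker Compose and nginx files are omitted because their paths are not stated here. `docs_for_path` is a hypothetical helper name.

```shell
# Hypothetical helper: print the documentation files that must be updated
# when the given repository-relative path changes.
# Returns 1 for paths outside the monitored directories.
docs_for_path() {
  case "$1" in
    alloy/*)
      echo "alloy/docs/README.md docs/SLACK_ALERTING.md" ;;
    icinga/conf.d/*)
      echo "icinga/README.md docs/SLACK_ALERTING.md" ;;
    prometheus/roles/deploy_prometheus/templates/*)
      echo "prometheus/docs/ALERTING.md docs/SLACK_ALERTING.md" ;;
    loki/*)
      echo "loki/docs/README.md loki/docs/OPERATIONS.md" ;;
    *)
      return 1 ;;
  esac
}
```

Usage: `docs_for_path icinga/conf.d/hosts/dev/<hostname>.conf` prints the two files to update alongside that change.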


Adding a New Environment

A new environment requires 4 new configs and 4 doc updates:

Step  Action
1     Create Slack channel #cwiq-{env}-infra-alerts and configure Incoming Webhook
2     Store webhook in Vault: vault kv patch secret/slack/webhooks {env}="https://..."
3     Add AlertManager receiver in prometheus/.../alertmanager.yml.j2
4     Add Icinga notification user in icinga/conf.d/notifications.conf.j2
5     Create Alloy inventory file at alloy/inventory/{env}.yml
6     Update docs/SLACK_ALERTING.md (add channel to strategy table)
7     Update prometheus/docs/ALERTING.md (add receiver to receivers section)
8     Update icinga/README.md (add zone to zones table)
9     Update alloy/docs/README.md (add inventory file to inventory section)

Verification Checklist

After deploying new monitoring coverage, confirm that Alloy, Loki, Prometheus, and Icinga are all working:

# 1. Alloy service is running on the new host
ssh ec2-user@<hostname>-cwiq-io "sudo systemctl status alloy"

# 2. Logs are arriving in Loki
# Grafana Explore → Loki: {host="<hostname>-cwiq-io"}

# 3. Metrics are arriving in Prometheus
# Grafana Explore → Prometheus: up{host="<hostname>-cwiq-io"}

# 4. Icinga checks are green
# https://icinga.shared.cwiq.io → host → all checks OK
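
Check 3 can also be scripted against the Prometheus HTTP API instead of Grafana Explore. The sketch below assumes the standard instant-query response shape (`data.result[].value`) and treats an empty result as a failure, since that means no metrics are arriving. `all_targets_up` is a hypothetical helper name.

```shell
# Hypothetical helper: read a Prometheus instant-query response on stdin.
# Exit 0 only if at least one series is returned and every series has value 1.
all_targets_up() {
  python3 -c '
import json, sys

result = json.load(sys.stdin)["data"]["result"]
# Each sample is [timestamp, "value"]; Prometheus encodes the value as a string.
ok = bool(result) and all(series["value"][1] == "1" for series in result)
sys.exit(0 if ok else 1)
'
}
```

Usage: `curl -s -G https://prometheus.shared.cwiq.io/api/v1/query --data-urlencode 'query=up{host="<hostname>-cwiq-io"}' | all_targets_up && echo "metrics OK"`.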

Reference Files

The primary documentation files for the monitoring stack:

File                                           Contains
ansible-playbooks/alloy/docs/README.md         What Alloy collects per server
ansible-playbooks/icinga/README.md             What is checked per server
ansible-playbooks/prometheus/docs/ALERTING.md  All 39 alert rules with PromQL
ansible-playbooks/loki/docs/README.md          Log ingestion pipeline
ansible-playbooks/docs/SLACK_ALERTING.md       Unified alerting overview and coverage map