Adding Monitoring for New Infrastructure

This is the mandatory checklist for updating the observability stack when you add a new server or service. Each table row below names a specific file or action, and every required change must land in the same MR.

When This Applies

This checklist applies any time you:

- Provision a new EC2 server
- Add a new Docker container to an existing server
- Add a new FastAPI service with a /health endpoint
- Change a service port or nginx upstream
- Modify files in alloy/, icinga/, prometheus/, or loki/

Adding a New Server (5 mandatory files)

CRITICAL: All five files must be in the same MR

MRs that add infrastructure without updating all monitoring documentation will be rejected.

| # | File | Action |
|---|------|--------|
| 1 | `alloy/inventory/{shared\|dev}.yml` | Add host entry with `alloy_environment`, `ansible_host`, `ansible_user` |
| 2 | `alloy/docs/README.md` | Add row to Monitored Servers table (Docker Logs, App Metrics, Targets) |
| 3 | `icinga/conf.d/hosts/{shared\|dev}/<hostname>.conf` | Create host config importing `cwiq-shared-host` or `cwiq-dev-host` |
| 4 | `icinga/README.md` | Add row to Monitored Hosts table (zone, host, checks) |
| 5 | `docs/SLACK_ALERTING.md` | Add row to Coverage Map table and Icinga Checks table |
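
The path patterns in the table can be expanded mechanically for a given host. The following is a minimal sketch, not part of the repository: the function name `new_server_files` is hypothetical, and it assumes the Icinga file is named exactly as the table shows (`<hostname>.conf`).

```python
from typing import List


def new_server_files(env: str, hostname: str) -> List[str]:
    """Expand the five mandatory paths for one new server.

    `env` must be "shared" or "dev", the two values the table allows.
    """
    if env not in ("shared", "dev"):
        raise ValueError(f"unknown environment: {env!r}")
    return [
        f"alloy/inventory/{env}.yml",
        "alloy/docs/README.md",
        f"icinga/conf.d/hosts/{env}/{hostname}.conf",
        "icinga/README.md",
        "docs/SLACK_ALERTING.md",
    ]
```

Printing the result for a new host gives a list to tick off against the MR's changed files before requesting review.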

Template: Alloy inventory entry

# For a DEV server in alloy/inventory/dev.yml
<hostname>-dev-cwiq-io:
  ansible_host: <hostname>-dev-cwiq-io      # Tailscale hostname
  ansible_user: ec2-user
  alloy_environment: development
  alloy_scrape_app_metrics: false
  alloy_app_metrics_targets: []

Template: Icinga host config

# icinga/conf.d/hosts/dev/<hostname>-dev.conf
object Host "<hostname>-dev-cwiq-io" {
  import "cwiq-dev-host"

  address = "<hostname>-dev-cwiq-io"
  display_name = "<Service Name> DEV"

  vars.environment = "dev"
  vars.os = "AlmaLinux"

  vars.http_vhosts["HTTPS"] = {
    http_address = "<service>.dev.cwiq.io"
    http_ssl = true
    http_vhost = "<service>.dev.cwiq.io"
    http_uri = "/health"
    http_port = 443
  }
}

Adding a New Service to Existing Server (up to 7 files)

| # | File | Required When |
|---|------|---------------|
| 1 | `alloy/inventory/{env}.yml` | Service exposes `/metrics` endpoint |
| 2 | `alloy/docs/README.md` | Always: update the host's row |
| 3 | `icinga/conf.d/hosts/{env}/<hostname>.conf` | Always: add HTTP/TCP check for the new port |
| 4 | `icinga/README.md` | Always: update the host's row |
| 5 | `docs/SLACK_ALERTING.md` | Always: update the Coverage Map |
| 6 | `prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2` | Service needs a dedicated alert rule |
| 7 | `prometheus/docs/ALERTING.md` | New Prometheus alert rules were added |
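
Because three of the seven rows are conditional, it can help to compute the file list for a specific service. This is a hedged sketch: the function name `new_service_files` and its two boolean flags are hypothetical, and it treats rows 6 and 7 as always going together, since a new rule always requires the doc row.

```python
from typing import List


def new_service_files(env: str, hostname: str,
                      exposes_metrics: bool,
                      needs_alert_rule: bool) -> List[str]:
    """Return the files to touch when adding a service to an existing server."""
    # Rows 2-5 are always required.
    files = [
        "alloy/docs/README.md",
        f"icinga/conf.d/hosts/{env}/{hostname}.conf",
        "icinga/README.md",
        "docs/SLACK_ALERTING.md",
    ]
    # Row 1: only when the service exposes /metrics.
    if exposes_metrics:
        files.append(f"alloy/inventory/{env}.yml")
    # Rows 6 and 7: a dedicated alert rule plus its documentation row.
    if needs_alert_rule:
        files.append(
            "prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2"
        )
        files.append("prometheus/docs/ALERTING.md")
    return files
```

With both flags false the function returns the four always-required files; with both true it returns all seven.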

Adding a New Prometheus Alert Rule (2 files)

| # | File | Action |
|---|------|--------|
| 1 | `prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2` | Add rule to appropriate group |
| 2 | `prometheus/docs/ALERTING.md` | Add row to the alert rules table |

Alert rule template

- alert: <AlertName>
  expr: |
    <promql_expression> > <threshold>
  for: 5m
  labels:
    severity: warning   # or: critical
  annotations:
    summary: "<Short description> on {{ $labels.host }}"
    description: "<Detailed description>. Current value: {{ $value | humanize }}"

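Because the rules file is a Jinja2 template (`.yml.j2`), Prometheus expressions such as `{{ $labels.host }}` must be protected from Jinja2 rendering, for example inside a `{% raw %}` block, unless the surrounding group already is.

After updating rules, deploy and verify: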

cd prometheus
ansible-playbook -i inventory/shared.yml deploy-prometheus.yml

# Verify rule loaded (allow 30 seconds)
curl -s https://prometheus.shared.cwiq.io/api/v1/rules | \
  python3 -m json.tool | grep -A3 "<AlertName>"
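
The `grep` above matches any line containing the alert name. A stricter check can walk the JSON structure instead. The sketch below assumes the standard Prometheus `/api/v1/rules` response shape (`data.groups[].rules[]` with `name` and `type` fields); the function names are hypothetical, and the URL is the one used in the command above.

```python
import json
import urllib.request
from typing import Dict, List

RULES_URL = "https://prometheus.shared.cwiq.io/api/v1/rules"


def fetch_rules(url: str = RULES_URL) -> Dict:
    """Download and parse the rules endpoint (requires network access)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def find_alert_rules(payload: Dict, alert_name: str) -> List[Dict]:
    """Return every alerting rule named `alert_name` in a rules response."""
    groups = payload.get("data", {}).get("groups", [])
    return [
        rule
        for group in groups
        for rule in group.get("rules", [])
        if rule.get("type") == "alerting" and rule.get("name") == alert_name
    ]
```

Calling `find_alert_rules(fetch_rules(), "<AlertName>")` returns an empty list when the rule has not loaded, which is easier to act on in a script than scanning `grep` output.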

Modifying Existing Monitoring Configuration

Any modification to these directories requires documentation updates in the same commit:

| Directory Modified | Documentation Files to Update |
|--------------------|-------------------------------|
| `alloy/` (inventory, config templates, roles) | `alloy/docs/README.md`, `docs/SLACK_ALERTING.md` |
| `icinga/conf.d/` (host configs, services) | `icinga/README.md`, `docs/SLACK_ALERTING.md` |
| `prometheus/roles/deploy_prometheus/templates/` | `prometheus/docs/ALERTING.md`, `docs/SLACK_ALERTING.md` |
| `loki/` (config templates, roles) | `loki/docs/README.md`, `loki/docs/OPERATIONS.md` |
| Docker Compose files (new service containers) | `alloy/docs/README.md`, `icinga/README.md`, `docs/SLACK_ALERTING.md` |
| nginx configs (new upstream/proxy_pass) | `icinga/README.md`, `docs/SLACK_ALERTING.md` |

This applies to additions, modifications, AND removals. Examples:

- Removing a check type → update Icinga and Slack docs
- Changing alert thresholds → update prometheus/docs/ALERTING.md and docs/SLACK_ALERTING.md
- Adding a Loki pipeline stage → update loki/docs/README.md
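
The directory-to-documentation mapping lends itself to an automated pre-review check. The following is a rough sketch rather than an existing tool: `DOCS_FOR_PREFIX` and `missing_docs` are hypothetical names, and the Docker Compose and nginx rows are left out because this page does not say where those files live.

```python
from typing import Dict, Iterable, List, Set

# First four rows of the table: path prefix -> docs that must change with it.
DOCS_FOR_PREFIX: Dict[str, Set[str]] = {
    "alloy/": {"alloy/docs/README.md", "docs/SLACK_ALERTING.md"},
    "icinga/conf.d/": {"icinga/README.md", "docs/SLACK_ALERTING.md"},
    "prometheus/roles/deploy_prometheus/templates/": {
        "prometheus/docs/ALERTING.md",
        "docs/SLACK_ALERTING.md",
    },
    "loki/": {"loki/docs/README.md", "loki/docs/OPERATIONS.md"},
}


def missing_docs(changed_files: Iterable[str]) -> List[str]:
    """Return required documentation files absent from the change set."""
    changed = set(changed_files)
    required: Set[str] = set()
    for path in changed:
        for prefix, docs in DOCS_FOR_PREFIX.items():
            # A doc file itself does not trigger its own requirement.
            if path.startswith(prefix) and path not in docs:
                required |= docs
    return sorted(required - changed)
```

Feeding it the file list from `git diff --name-only` against the target branch yields the documentation files still missing from the change.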

Adding a New Environment

A new environment requires a Slack channel (step 1), 4 new configs (steps 2 to 5), and 4 doc updates (steps 6 to 9)

| Step | Action |
|------|--------|
| 1 | Create Slack channel `#cwiq-{env}-infra-alerts` and configure Incoming Webhook |
| 2 | Store webhook in Vault: `vault kv patch secret/slack/webhooks {env}="https://..."` |
| 3 | Add AlertManager receiver in `prometheus/.../alertmanager.yml.j2` |
| 4 | Add Icinga notification user in `icinga/conf.d/notifications.conf.j2` |
| 5 | Create Alloy inventory file at `alloy/inventory/{env}.yml` |
| 6 | Update `docs/SLACK_ALERTING.md` (add channel to strategy table) |
| 7 | Update `prometheus/docs/ALERTING.md` (add receiver to receivers section) |
| 8 | Update `icinga/README.md` (add zone to zones table) |
| 9 | Update `alloy/docs/README.md` (add inventory file to inventory section) |
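
Steps 1, 2, and 5 each embed the environment name in a fixed pattern, so the concrete names can be derived in one place to avoid typos. This is an illustrative sketch only: `new_environment_names` is a hypothetical helper, and the webhook URL stays elided exactly as in the table.

```python
from typing import Dict


def new_environment_names(env: str) -> Dict[str, str]:
    """Derive the env-specific names used in steps 1, 2, and 5."""
    return {
        "slack_channel": f"#cwiq-{env}-infra-alerts",
        # The webhook URL is a placeholder; substitute the real one by hand.
        "vault_command": (
            f'vault kv patch secret/slack/webhooks {env}="https://..."'
        ),
        "alloy_inventory": f"alloy/inventory/{env}.yml",
    }
```

For a hypothetical environment called `staging`, the channel would be `#cwiq-staging-infra-alerts` and the inventory file `alloy/inventory/staging.yml`.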

Verification Checklist

After deploying new monitoring coverage, confirm that all four checks pass:

# 1. Alloy service is running on the new host
ssh ec2-user@<hostname>-cwiq-io "sudo systemctl status alloy"

# 2. Logs are arriving in Loki
# Grafana Explore → Loki: {host="<hostname>-cwiq-io"}

# 3. Metrics are arriving in Prometheus
# Grafana Explore → Prometheus: up{host="<hostname>-cwiq-io"}

# 4. Icinga checks are green
# https://icinga.shared.cwiq.io → host → all checks OK
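
Check 3 can also be scripted against the Prometheus HTTP API instead of Grafana Explore. The sketch below assumes the standard `/api/v1/query` instant-query response shape and that `prometheus.shared.cwiq.io` (the host used for the rules check earlier) serves the same data Grafana reads; the function names are hypothetical.

```python
import urllib.parse
from typing import Dict, List

PROMETHEUS = "https://prometheus.shared.cwiq.io"


def up_query_url(host: str, base: str = PROMETHEUS) -> str:
    """Build the instant-query URL for up{host="<host>"}."""
    query = f'up{{host="{host}"}}'
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": query})


def targets_down(payload: Dict) -> List[Dict]:
    """Return the label sets whose latest `up` sample is not 1."""
    results = payload.get("data", {}).get("result", [])
    return [r["metric"] for r in results if float(r["value"][1]) != 1.0]


def host_is_up(payload: Dict) -> bool:
    """True only if at least one series exists and every series reports 1."""
    results = payload.get("data", {}).get("result", [])
    return bool(results) and not targets_down(payload)
```

An empty result counts as a failure here, since a host that reports no `up` series at all is not being scraped.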

Reference Files

The primary documentation files for the monitoring stack:

| File | Contains |
|------|----------|
| `ansible-playbooks/alloy/docs/README.md` | What Alloy collects per server |
| `ansible-playbooks/icinga/README.md` | What is checked per server |
| `ansible-playbooks/prometheus/docs/ALERTING.md` | All 39 alert rules with PromQL |
| `ansible-playbooks/loki/docs/README.md` | Log ingestion pipeline |
| `ansible-playbooks/docs/SLACK_ALERTING.md` | Unified alerting overview and coverage map |