Adding Monitoring for New Infrastructure
This is the mandatory checklist for updating the observability stack when you add a new server or service. Each row in the tables below names a file or step that must be completed in the same MR.
When This Applies
This checklist applies any time you:
- Provision a new EC2 server
- Add a new Docker container to an existing server
- Add a new FastAPI service with a /health endpoint
- Change a service port or nginx upstream
- Modify files in alloy/, icinga/, prometheus/, or loki/
Adding a New Server (5 mandatory files)
CRITICAL: All five files must be in the same MR
MRs that add infrastructure without updating all monitoring documentation will be rejected.
| # | File | Action |
|---|---|---|
| 1 | alloy/inventory/{shared\|dev}.yml | Add host entry with alloy_environment, ansible_host, ansible_user |
| 2 | alloy/docs/README.md | Add row to Monitored Servers table (Docker Logs, App Metrics, Targets) |
| 3 | icinga/conf.d/hosts/{shared\|dev}/<hostname>.conf | Create host config importing cwiq-shared-host or cwiq-dev-host |
| 4 | icinga/README.md | Add row to Monitored Hosts table (zone, host, checks) |
| 5 | docs/SLACK_ALERTING.md | Add row to Coverage Map table and Icinga Checks table |
Template: Alloy inventory entry
# For a DEV server in alloy/inventory/dev.yml
<hostname>-dev-cwiq-io:
  ansible_host: <hostname>-dev-cwiq-io  # Tailscale hostname
  ansible_user: ec2-user
  alloy_environment: development
  alloy_scrape_app_metrics: false
  alloy_app_metrics_targets: []
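Row 1 of the server table asks for three keys on every host entry. A minimal sketch of a pre-MR check for them, assuming the inventory entry has already been loaded into a plain dictionary (the loader, the function name `missing_host_keys`, and the example hostname are hypothetical, not part of the repository):

```python
from typing import Any, Dict, List

# Keys named in row 1 of the "Adding a New Server" table.
REQUIRED_HOST_KEYS = ("alloy_environment", "ansible_host", "ansible_user")


def missing_host_keys(entry: Dict[str, Any]) -> List[str]:
    """Return the required keys that are absent or empty in a host entry."""
    return [key for key in REQUIRED_HOST_KEYS if not entry.get(key)]


# Hypothetical entry mirroring the DEV template above.
example_entry = {
    "ansible_host": "myservice-dev-cwiq-io",
    "ansible_user": "ec2-user",
    "alloy_environment": "development",
    "alloy_scrape_app_metrics": False,
    "alloy_app_metrics_targets": [],
}

print(missing_host_keys(example_entry))
```

An empty list means the entry carries everything the checklist names. The check does not validate the values themselves.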
Template: Icinga host config
# icinga/conf.d/hosts/dev/<hostname>-dev.conf
object Host "<hostname>-dev-cwiq-io" {
  import "cwiq-dev-host"
  address = "<hostname>-dev-cwiq-io"
  display_name = "<Service Name> DEV"
  vars.environment = "dev"
  vars.os = "AlmaLinux"
  vars.http_vhosts["HTTPS"] = {
    http_address = "<service>.dev.cwiq.io"
    http_ssl = true
    http_vhost = "<service>.dev.cwiq.io"
    http_uri = "/health"
    http_port = 443
  }
}
Adding a New Service to Existing Server (up to 7 files)
| # | File | Required When |
|---|---|---|
| 1 | alloy/inventory/{env}.yml | Service exposes /metrics endpoint |
| 2 | alloy/docs/README.md | Always: update host's row |
| 3 | icinga/conf.d/hosts/{env}/<hostname>.conf | Always: add HTTP/TCP check for new port |
| 4 | icinga/README.md | Always: update host's row |
| 5 | docs/SLACK_ALERTING.md | Always: update Coverage Map |
| 6 | prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2 | Service needs a dedicated alert rule |
| 7 | prometheus/docs/ALERTING.md | New Prometheus alert rules were added |
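Because three of the seven rows are conditional, it can help to compute the file list for a given service before opening the MR. The sketch below encodes the table as written. The function name and its two flags are illustrative assumptions, and `hostname` stands for whatever stem the host's Icinga config file already uses:

```python
from typing import List


def required_service_files(
    env: str,
    hostname: str,
    exposes_metrics: bool = False,
    needs_alert_rule: bool = False,
) -> List[str]:
    """Return the files the table above requires for a new service."""
    files: List[str] = []
    if exposes_metrics:
        # Row 1: only when the service exposes a /metrics endpoint.
        files.append(f"alloy/inventory/{env}.yml")
    # Rows 2-5: always required.
    files += [
        "alloy/docs/README.md",
        f"icinga/conf.d/hosts/{env}/{hostname}.conf",
        "icinga/README.md",
        "docs/SLACK_ALERTING.md",
    ]
    if needs_alert_rule:
        # Rows 6-7: the rule template and its documentation change together.
        files += [
            "prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2",
            "prometheus/docs/ALERTING.md",
        ]
    return files


# Hypothetical service that exposes metrics but needs no dedicated alert.
for path in required_service_files("dev", "myservice-dev", exposes_metrics=True):
    print(path)
```

Comparing this list against the MR's changed files gives a quick view of what is still missing.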
Adding a New Prometheus Alert Rule (2 files)
| # | File | Action |
|---|---|---|
| 1 | prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2 | Add rule to appropriate group |
| 2 | prometheus/docs/ALERTING.md | Add row to the alert rules table |
Alert rule template
- alert: <AlertName>
  expr: |
    <promql_expression> > <threshold>
  for: 5m
  labels:
    severity: warning  # or: critical
  annotations:
    summary: "<Short description> on {{ $labels.host }}"
    description: "<Detailed description>. Current value: {{ $value | humanize }}"
After updating rules, deploy and verify:
cd prometheus
ansible-playbook -i inventory/shared.yml deploy-prometheus.yml
# Verify rule loaded (allow 30 seconds)
curl -s https://prometheus.shared.cwiq.io/api/v1/rules | \
  python3 -m json.tool | grep -A3 "<AlertName>"
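The grep above matches any line containing the alert name. For a stricter check, the JSON returned by `/api/v1/rules` can be parsed instead. A minimal sketch, assuming the standard Prometheus response shape (`data.groups[].rules[]` with `name` and `type` fields); the sample payload, group name, and alert names are made up for illustration:

```python
from typing import Any, Dict, List


def find_alert_rules(payload: Dict[str, Any], alert_name: str) -> List[Dict[str, Any]]:
    """Return alerting rules named alert_name from an /api/v1/rules response."""
    groups = payload.get("data", {}).get("groups", [])
    return [
        rule
        for group in groups
        for rule in group.get("rules", [])
        # Recording rules share the endpoint, so filter on type as well.
        if rule.get("type") == "alerting" and rule.get("name") == alert_name
    ]


# Hypothetical, trimmed-down response for demonstration only.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "example-group",
                "rules": [
                    {"name": "ExampleHighCpu", "type": "alerting", "health": "ok"},
                    {"name": "example:cpu:rate5m", "type": "recording", "health": "ok"},
                ],
            }
        ]
    },
}

print(bool(find_alert_rules(SAMPLE_RESPONSE, "ExampleHighCpu")))
```

In practice the payload would come from the curl command above, for example by piping its output into a script that calls `json.load(sys.stdin)` and passes the result to `find_alert_rules`.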
Modifying Existing Monitoring Configuration
Any modification to these directories requires documentation updates in the same commit:
| Directory Modified | Documentation Files to Update |
|---|---|
| alloy/ (inventory, config templates, roles) | alloy/docs/README.md, docs/SLACK_ALERTING.md |
| icinga/conf.d/ (host configs, services) | icinga/README.md, docs/SLACK_ALERTING.md |
| prometheus/roles/deploy_prometheus/templates/ | prometheus/docs/ALERTING.md, docs/SLACK_ALERTING.md |
| loki/ (config templates, roles) | loki/docs/README.md, loki/docs/OPERATIONS.md |
| Docker Compose files (new service containers) | alloy/docs/README.md, icinga/README.md, docs/SLACK_ALERTING.md |
| nginx configs (new upstream/proxy_pass) | icinga/README.md, docs/SLACK_ALERTING.md |
This applies to additions, modifications, AND removals. Examples:
- Removing a check type → update Icinga and Slack docs
- Changing alert thresholds → update prometheus/docs/ALERTING.md and docs/SLACK_ALERTING.md
- Adding a Loki pipeline stage → update loki/docs/README.md
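The directory-to-documentation mapping lends itself to an automated check on an MR's changed files. The sketch below covers only the four directory rows. The Docker Compose and nginx rows are left out because the table does not give their paths, and the function name and prefix-matching approach are assumptions rather than an existing tool:

```python
from typing import Dict, Iterable, List

# Directory prefixes and the docs the table above pairs with them.
DOCS_BY_PREFIX: Dict[str, List[str]] = {
    "alloy/": ["alloy/docs/README.md", "docs/SLACK_ALERTING.md"],
    "icinga/conf.d/": ["icinga/README.md", "docs/SLACK_ALERTING.md"],
    "prometheus/roles/deploy_prometheus/templates/": [
        "prometheus/docs/ALERTING.md",
        "docs/SLACK_ALERTING.md",
    ],
    "loki/": ["loki/docs/README.md", "loki/docs/OPERATIONS.md"],
}


def docs_still_missing(changed: Iterable[str]) -> List[str]:
    """Return required documentation files absent from the changed-file list."""
    changed_list = list(changed)
    required = set()
    for path in changed_list:
        for prefix, docs in DOCS_BY_PREFIX.items():
            # Editing one of the docs themselves does not trigger the rule.
            if path.startswith(prefix) and path not in docs:
                required.update(docs)
    return sorted(required - set(changed_list))


# Hypothetical MR that touches an Alloy inventory file only.
print(docs_still_missing(["alloy/inventory/dev.yml"]))
```

An empty result means every documentation file implied by the changed paths is already part of the change.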
Adding a New Environment
A new environment requires the nine steps below: five configuration steps (1-5) and four documentation updates (6-9).
| Step | Action |
|---|---|
| 1 | Create Slack channel #cwiq-{env}-infra-alerts and configure Incoming Webhook |
| 2 | Store webhook in Vault: vault kv patch secret/slack/webhooks {env}="https://..." |
| 3 | Add AlertManager receiver in prometheus/.../alertmanager.yml.j2 |
| 4 | Add Icinga notification user in icinga/conf.d/notifications.conf.j2 |
| 5 | Create Alloy inventory file at alloy/inventory/{env}.yml |
| 6 | Update docs/SLACK_ALERTING.md (add channel to strategy table) |
| 7 | Update prometheus/docs/ALERTING.md (add receiver to receivers section) |
| 8 | Update icinga/README.md (add zone to zones table) |
| 9 | Update alloy/docs/README.md (add inventory file to inventory section) |
Verification Checklist
After deploying new monitoring coverage, confirm that all four checks pass:
# 1. Alloy service is running on the new host
ssh ec2-user@<hostname>-cwiq-io "sudo systemctl status alloy"
# 2. Logs are arriving in Loki
# Grafana Explore → Loki: {host="<hostname>-cwiq-io"}
# 3. Metrics are arriving in Prometheus
# Grafana Explore → Prometheus: up{host="<hostname>-cwiq-io"}
# 4. Icinga checks are green
# https://icinga.shared.cwiq.io → host → all checks OK
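Check 3 can also be scripted against the Prometheus HTTP API rather than Grafana Explore. This is a sketch only: it assumes the Prometheus instance used in the deploy step is reachable at the same URL for instant queries, and that targets carry a `host` label as in the query above. The helper names and the sample response are hypothetical:

```python
from typing import Any, Dict
from urllib.parse import urlencode

# Same base URL as the rule-verification curl command earlier in this page.
PROMETHEUS_URL = "https://prometheus.shared.cwiq.io"


def up_query_url(host: str) -> str:
    """Build the instant-query URL for up{host="<host>"}."""
    query = f'up{{host="{host}"}}'
    return f"{PROMETHEUS_URL}/api/v1/query?" + urlencode({"query": query})


def host_is_up(payload: Dict[str, Any]) -> bool:
    """True when at least one target is returned and every target reports 1."""
    results = payload.get("data", {}).get("result", [])
    # Each instant-vector sample is [timestamp, "value-as-string"].
    return bool(results) and all(r["value"][1] == "1" for r in results)


# Hypothetical response for a host with a single healthy target.
SAMPLE_UP_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "host": "myhost-cwiq-io"},
             "value": [1700000000.0, "1"]},
        ],
    },
}

print(up_query_url("myhost-cwiq-io"))
print(host_is_up(SAMPLE_UP_RESPONSE))
```

An empty result set is treated as a failure here, since a host with no `up` series usually means its metrics are not arriving at all.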
Reference Files
The primary documentation files for the monitoring stack:
| File | Contains |
|---|---|
| ansible-playbooks/alloy/docs/README.md | What Alloy collects per server |
| ansible-playbooks/icinga/README.md | What is checked per server |
| ansible-playbooks/prometheus/docs/ALERTING.md | All 39 alert rules with PromQL |
| ansible-playbooks/loki/docs/README.md | Log ingestion pipeline |
| ansible-playbooks/docs/SLACK_ALERTING.md | Unified alerting overview and coverage map |