Adding a New Service to an Existing Server¶
How to add a new application or container to a server that is already in the observability stack. Every new service requires updates to Alloy, Icinga, Prometheus, and the monitoring documentation in the same MR.
Overview¶
When a new service (Docker container, systemd service, or API endpoint) is added to an existing server, the observability stack must be updated to cover it. This is mandatory — not optional.
The seven files that must be updated:
| # | System | File |
|---|---|---|
| 1 | Alloy | alloy/inventory/{env}.yml |
| 2 | Alloy docs | alloy/docs/README.md |
| 3 | Icinga | icinga/conf.d/hosts/{env}/<hostname>.conf |
| 4 | Icinga docs | icinga/README.md |
| 5 | Alerting docs | docs/SLACK_ALERTING.md |
| 6 | Prometheus rules | prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2 (if new rules needed) |
| 7 | Prometheus docs | prometheus/docs/ALERTING.md (if new rules added) |
Step 1: Write the Ansible Role¶
The service must be deployed entirely through Ansible. Never install packages or edit config files directly on the server.
Docker Compose Services¶
Create a Docker Compose template in the role's templates/ directory:
```
<app-name>/
└── roles/
    └── <service-name>/
        ├── defaults/main.yml        # Port, image version, volume paths
        ├── tasks/main.yml           # Pull image, template compose file, docker compose up
        ├── handlers/main.yml        # restart service
        └── templates/
            └── docker-compose.yml.j2
```
Example templates/docker-compose.yml.j2:
```yaml
services:
  {{ service_name }}:
    image: "{{ service_image }}:{{ service_version }}"
    container_name: "{{ service_name }}"
    restart: unless-stopped
    ports:
      - "127.0.0.1:{{ service_port }}:{{ service_container_port }}"
    volumes:
      - "{{ service_data_dir }}:/data"
    environment:
      - SERVICE_CONFIG={{ service_config }}
```
Binding to 127.0.0.1 is the default — services are not exposed directly to the network. Access is through nginx or Tailscale.
Systemd Services¶
For non-container services, create a systemd unit template:
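A minimal `templates/<service-name>.service.j2` might look like the sketch below; `service_binary`, `service_args`, and `service_user` are illustrative variable names (not defined elsewhere in this guide), so substitute whatever the role's `defaults/main.yml` actually provides:

```ini
[Unit]
Description={{ service_name }}
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User={{ service_user }}
ExecStart={{ service_binary }} {{ service_args }}
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```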
Deploy it with the template module and reload systemd:
```yaml
- name: Deploy systemd unit
  template:
    src: <service-name>.service.j2
    dest: /etc/systemd/system/<service-name>.service
  notify: reload systemd and restart service
```
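The `notify` above expects a handler whose name matches the string exactly. A minimal sketch of `handlers/main.yml`, using the standard `ansible.builtin.systemd` module (the service name is a placeholder):

```yaml
# handlers/main.yml
- name: reload systemd and restart service
  ansible.builtin.systemd:
    name: <service-name>
    state: restarted
    daemon_reload: true
```

`daemon_reload: true` picks up the freshly templated unit file before the restart, so a separate `systemctl daemon-reload` task is not needed.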
Step 2: Add Nginx Upstream (if service needs HTTPS access)¶
If the service requires an HTTPS endpoint, add an nginx upstream and server block through Ansible.
Never edit nginx configs directly on the server
All nginx configuration changes must go through the Ansible role. See IaC Principles.
```nginx
upstream {{ service_name }}_upstream {
    server 127.0.0.1:{{ service_port }};
}

server {
    listen 443 ssl;
    server_name {{ service_fqdn }};

    ssl_certificate     /etc/letsencrypt/live/{{ service_fqdn }}/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/{{ service_fqdn }}/privkey.pem;

    location / {
        proxy_pass http://{{ service_name }}_upstream;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /health {
        proxy_pass http://{{ service_name }}_upstream/health;
        access_log off;
    }
}
```
Step 3: Update Alloy Inventory¶
In ansible-playbooks/alloy/inventory/{env}.yml, update the existing host entry to add a scrape target for the new service's metrics endpoint.
If the service exposes a /metrics endpoint¶
```yaml
# Before
<hostname>-cwiq-io:
  ansible_host: <hostname>-cwiq-io
  ansible_user: ec2-user
  alloy_environment: development
  alloy_scrape_app_metrics: false
  alloy_app_metrics_targets: []

# After
<hostname>-cwiq-io:
  ansible_host: <hostname>-cwiq-io
  ansible_user: ec2-user
  alloy_environment: development
  alloy_scrape_app_metrics: true
  alloy_app_metrics_targets:
    - { name: <service-name>, address: "localhost:<port>", metrics_path: "/metrics" }
```
If the service only produces logs (no metrics endpoint)¶
No Alloy inventory change is needed. Docker container logs are collected automatically via the Docker socket. Systemd service logs are collected automatically via journald.
Redeploy Alloy to the Host¶
```shell
ssh ansible@ansible-shared-cwiq-io
ansible-helper
cd alloy
ansible-playbook -i inventory/dev.yml deploy-alloy.yml \
  --limit <hostname>-cwiq-io
```
Update alloy/docs/README.md¶
Update the existing row for the host in the Monitored Servers table to mention the new scrape target.
Step 4: Add Icinga Health Check¶
In icinga/conf.d/hosts/{env}/<hostname>.conf, add a check for the new service's port or HTTP endpoint.
HTTP health check (most common)¶
```
vars.http_vhosts["<Service Name>"] = {
  http_address = "<service-fqdn>"
  http_ssl = true
  http_vhost = "<service-fqdn>"
  http_uri = "/health"
  http_port = 443
}
```
TCP port check (for non-HTTP services)¶
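The source gives no example here; by analogy with the `http_vhosts` dictionary above, a `check_tcp`-based check might be declared like this (the `tcp_ports` variable name assumes a matching apply rule exists in the Icinga configuration):

```
vars.tcp_ports["<Service Name>"] = {
  tcp_port = <service-port>
}
```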
Docker container check (for containers on the host)¶
Container checks use check_by_ssh to run docker inspect on the host. The Icinga2 daemon connects via Tailscale SSH as user icinga (UID 5665).
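A sketch of what such a check could look like, assuming an apply rule that runs `docker inspect` over `check_by_ssh` (the `docker_containers` dictionary name and the exact command are illustrative, not confirmed by this guide; `by_ssh_command` and `by_ssh_logname` are standard Icinga2 `by_ssh` parameters):

```
vars.docker_containers["<service-name>"] = {
  by_ssh_command = "docker inspect -f '{{.State.Status}}' <service-name>"
  by_ssh_logname = "icinga"
}
```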
Redeploy the Icinga Configuration¶
```shell
cd icinga

# For DEV hosts:
ansible-playbook -i inventory/shared.yml deploy-config.yml --tags dev

# For Shared hosts:
ansible-playbook -i inventory/shared.yml deploy-config.yml --tags shared
```
Update icinga/README.md¶
Update the row for the host in the Monitored Hosts table to add the new check.
Step 5: Add Prometheus Alert Rules (if needed)¶
Most services are covered by the existing catch-all alerting rules (volume, host-down, SSL expiry). Add service-specific rules only when the service has its own health endpoint or SLA requirements.
In prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2:
```yaml
- name: <service-name>
  rules:
    - alert: <ServiceName>Down
      expr: |
        probe_success{job="<service-name>"} == 0
      for: 2m
      labels:
        severity: critical
        environment: "{{ '{{' }} $labels.environment {{ '}}' }}"
      annotations:
        summary: "<Service Name> is unreachable"
        description: "<Service Name> at {{ '{{' }} $labels.instance {{ '}}' }} has been unreachable for 2 minutes."
```
Redeploy Prometheus after updating rules:
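The exact command is not given in this guide; by analogy with the Alloy and Icinga deploy steps above, it would look something like the following (the playbook and inventory file names are assumptions, so check the `prometheus/` directory for the real ones):

```shell
cd prometheus
ansible-playbook -i inventory/shared.yml deploy-prometheus.yml
```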
If you add rules, update prometheus/docs/ALERTING.md with a row in the alert rules reference table.
Mandatory Documentation Co-Change Checklist¶
CRITICAL: All applicable files must be in the same MR
MRs that modify service configurations without updating documentation will be rejected.
| # | File | Required When |
|---|---|---|
| 1 | alloy/inventory/{env}.yml | Always |
| 2 | alloy/docs/README.md | Always |
| 3 | icinga/conf.d/hosts/{env}/<hostname>.conf | Always |
| 4 | icinga/README.md | Always |
| 5 | docs/SLACK_ALERTING.md | Always |
| 6 | prometheus/roles/.../alerting-rules.yml.j2 | New alert rules needed |
| 7 | prometheus/docs/ALERTING.md | New alert rules added |
Verification Checklist¶
After deploying, confirm all systems are working:
```shell
# 1. Service is running
ssh ec2-user@<hostname>-cwiq-io "docker ps | grep <service-name>"
# or
ssh ec2-user@<hostname>-cwiq-io "systemctl status <service-name>"

# 2. Health endpoint responds
curl -sf https://<service-fqdn>/health

# 3. Alloy is collecting (check for errors)
ssh ec2-user@<hostname>-cwiq-io "sudo journalctl -u alloy -n 20 --no-pager"

# 4. Logs are in Loki
# Grafana Explore → Loki: {host="<hostname>-cwiq-io", container="<service-name>"}

# 5. Icinga check passes
# https://icinga.shared.cwiq.io → host → verify new check is green
```