Adding a New Service to an Existing Server¶
How to add a new application or container to a server that is already in the observability stack. Every new service requires updates to Alloy, Icinga, Prometheus, and the monitoring documentation in the same MR.
Overview¶
When a new service (Docker container, systemd service, or API endpoint) is added to an existing server, the observability stack must be updated to cover it. This is mandatory — not optional.
The seven files that must be updated:
| # | System | File |
|---|---|---|
| 1 | Alloy | alloy/inventory/{env}.yml |
| 2 | Alloy docs | alloy/docs/README.md |
| 3 | Icinga | icinga/conf.d/hosts/{env}/<hostname>.conf |
| 4 | Icinga docs | icinga/README.md |
| 5 | Alerting docs | docs/SLACK_ALERTING.md |
| 6 | Prometheus rules | prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2 (if new rules needed) |
| 7 | Prometheus docs | prometheus/docs/ALERTING.md (if new rules added) |
Step 1: Write the Ansible Role¶
The service must be deployed entirely through Ansible. Never install packages or edit config files directly on the server.
Docker Compose Services¶
Create a Docker Compose template in the role's templates/ directory:
```
<app-name>/
└── roles/
    └── <service-name>/
        ├── defaults/main.yml        # Port, image version, volume paths
        ├── tasks/main.yml           # Pull image, template compose file, docker compose up
        ├── handlers/main.yml        # restart service
        └── templates/
            └── docker-compose.yml.j2
```
Example templates/docker-compose.yml.j2:
```yaml
services:
  {{ service_name }}:
    image: "{{ service_image }}:{{ service_version }}"
    container_name: "{{ service_name }}"
    restart: unless-stopped
    ports:
      - "127.0.0.1:{{ service_port }}:{{ service_container_port }}"
    volumes:
      - "{{ service_data_dir }}:/data"
    environment:
      - SERVICE_CONFIG={{ service_config }}
```
Binding to 127.0.0.1 is the default — services are not exposed directly to the network. Access is through nginx or Tailscale.
Systemd Services¶
For non-container services, create a systemd unit template:
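A minimal `templates/<service-name>.service.j2` might look like the sketch below; `service_binary`, `service_args`, and `service_user` are illustrative variable names (not defined elsewhere in this guide), so substitute whatever the role's `defaults/main.yml` actually provides:

```ini
[Unit]
Description={{ service_name }}
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User={{ service_user }}
ExecStart={{ service_binary }} {{ service_args }}
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```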
Deploy it with the template module and reload systemd:
```yaml
- name: Deploy systemd unit
  template:
    src: <service-name>.service.j2
    dest: /etc/systemd/system/<service-name>.service
  notify: reload systemd and restart service
```
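The `notify` above expects a handler whose name matches the string exactly. A minimal sketch of `handlers/main.yml`, using the standard `ansible.builtin.systemd` module (the service name is a placeholder):

```yaml
# handlers/main.yml
- name: reload systemd and restart service
  ansible.builtin.systemd:
    name: <service-name>
    state: restarted
    daemon_reload: true
```

`daemon_reload: true` picks up the freshly templated unit file before the restart, so a separate `systemctl daemon-reload` task is not needed.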
Step 2: Add Nginx Upstream (if service needs HTTPS access)¶
If the service requires an HTTPS endpoint, add an nginx upstream and server block through Ansible.
Never edit nginx configs directly on the server
All nginx configuration changes must go through the Ansible role. See IaC Principles.
```nginx
upstream {{ service_name }}_upstream {
    server 127.0.0.1:{{ service_port }};
}

server {
    listen 443 ssl;
    server_name {{ service_fqdn }};

    ssl_certificate     /etc/letsencrypt/live/{{ service_fqdn }}/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/{{ service_fqdn }}/privkey.pem;

    location / {
        proxy_pass http://{{ service_name }}_upstream;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /health {
        proxy_pass http://{{ service_name }}_upstream/health;
        access_log off;
    }
}
```
Step 3: Update Alloy Inventory¶
In ansible-playbooks/alloy/inventory/{env}.yml, update the existing host entry to add a scrape target for the new service's metrics endpoint.
If the service exposes a /metrics endpoint¶
```yaml
# Before
<hostname>-cwiq-io:
  ansible_host: <hostname>-cwiq-io
  ansible_user: ec2-user
  alloy_environment: development
  alloy_scrape_app_metrics: false
  alloy_app_metrics_targets: []

# After
<hostname>-cwiq-io:
  ansible_host: <hostname>-cwiq-io
  ansible_user: ec2-user
  alloy_environment: development
  alloy_scrape_app_metrics: true
  alloy_app_metrics_targets:
    - { name: <service-name>, address: "localhost:<port>", metrics_path: "/metrics" }
```
If the service only produces logs (no metrics endpoint)¶
No Alloy inventory change is needed. Docker container logs are collected automatically via the Docker socket. Systemd service logs are collected automatically via journald.
Redeploy Alloy to the Host¶
```shell
ssh ansible@ansible-shared-cwiq-io
ansible-helper
cd alloy
ansible-playbook -i inventory/dev.yml deploy-alloy.yml \
  --limit <hostname>-cwiq-io
```
Update alloy/docs/README.md¶
Update the existing row for the host in the Monitored Servers table to mention the new scrape target.
Step 4: Add Icinga Health Check¶
In icinga/conf.d/hosts/{env}/<hostname>.conf, add a check for the new service's port or HTTP endpoint.
HTTP health check (most common)¶
```
vars.http_vhosts["<Service Name>"] = {
  http_address = "<service-fqdn>"
  http_ssl = true
  http_vhost = "<service-fqdn>"
  http_uri = "/health"
  http_port = 443
}
```
TCP port check (for non-HTTP services)¶
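The source gives no example here; by analogy with the `http_vhosts` dictionary above, a `check_tcp`-based check might be declared like this (the `tcp_ports` variable name assumes a matching apply rule exists in the Icinga configuration):

```
vars.tcp_ports["<Service Name>"] = {
  tcp_port = <service-port>
}
```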
Docker container check (for containers on the host)¶
Container checks use check_by_ssh to run docker inspect on the host. The Icinga2 daemon connects via Tailscale SSH as user icinga (UID 5665).
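A sketch of what such a check could look like, assuming an apply rule that runs `docker inspect` over `check_by_ssh` (the `docker_containers` dictionary name and the exact command are illustrative, not confirmed by this guide; `by_ssh_command` and `by_ssh_logname` are standard Icinga2 `by_ssh` parameters):

```
vars.docker_containers["<service-name>"] = {
  by_ssh_command = "docker inspect -f '{{.State.Status}}' <service-name>"
  by_ssh_logname = "icinga"
}
```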
Redeploy the Icinga Configuration¶
```shell
cd icinga

# For DEV hosts:
ansible-playbook -i inventory/shared.yml deploy-config.yml --tags dev

# For Shared hosts:
ansible-playbook -i inventory/shared.yml deploy-config.yml --tags shared
```
Update icinga/README.md¶
Update the row for the host in the Monitored Hosts table to add the new check.
Step 5: Add Prometheus Alert Rules (if needed)¶
Most services are covered by the existing catch-all alerting rules (volume, host-down, SSL expiry). Add service-specific rules only when the service has its own health endpoint or SLA requirements.
In prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2:
```yaml
- name: <service-name>
  rules:
    - alert: <ServiceName>Down
      expr: |
        probe_success{job="<service-name>"} == 0
      for: 2m
      labels:
        severity: critical
        environment: "{{ '{{' }} $labels.environment {{ '}}' }}"
      annotations:
        summary: "<Service Name> is unreachable"
        description: "<Service Name> at {{ '{{' }} $labels.instance {{ '}}' }} has been unreachable for 2 minutes."
```
Redeploy Prometheus after updating rules:
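The exact command is not given in this guide; by analogy with the Alloy and Icinga deploy steps above, it would look something like the following (the playbook and inventory file names are assumptions, so check the `prometheus/` directory for the real ones):

```shell
cd prometheus
ansible-playbook -i inventory/shared.yml deploy-prometheus.yml
```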
If you add rules, update prometheus/docs/ALERTING.md with a row in the alert rules reference table.
Mandatory Documentation Co-Change Checklist¶
CRITICAL: All applicable files must be in the same MR
MRs that modify service configurations without updating documentation will be rejected.
| # | File | Required When |
|---|---|---|
| 1 | alloy/inventory/{env}.yml | Always |
| 2 | alloy/docs/README.md | Always |
| 3 | icinga/conf.d/hosts/{env}/<hostname>.conf | Always |
| 4 | icinga/README.md | Always |
| 5 | docs/SLACK_ALERTING.md | Always |
| 6 | prometheus/roles/.../alerting-rules.yml.j2 | New alert rules needed |
| 7 | prometheus/docs/ALERTING.md | New alert rules added |
Verification Checklist¶
After deploying, confirm all systems are working:
```shell
# 1. Service is running
ssh ec2-user@<hostname>-cwiq-io "docker ps | grep <service-name>"
# or
ssh ec2-user@<hostname>-cwiq-io "systemctl status <service-name>"

# 2. Health endpoint responds
curl -sf https://<service-fqdn>/health

# 3. Alloy is collecting (check for errors)
ssh ec2-user@<hostname>-cwiq-io "sudo journalctl -u alloy -n 20 --no-pager"

# 4. Logs are in Loki
# Grafana Explore → Loki: {host="<hostname>-cwiq-io", container="<service-name>"}

# 5. Icinga check passes
# https://icinga.shared.cwiq.io → host → verify new check is green
```