Adding a New Service to an Existing Server

How to add a new application or container to a server that is already in the observability stack. Every new service requires updates to Alloy, Icinga, Prometheus, and the monitoring documentation in the same MR.


Overview

When a new service (Docker container, systemd service, or API endpoint) is added to an existing server, the observability stack must be updated to cover it. This is mandatory — not optional.

The seven files that must be updated:

#   System            File
1   Alloy             alloy/inventory/{env}.yml
2   Alloy docs        alloy/docs/README.md
3   Icinga            icinga/conf.d/hosts/{env}/<hostname>.conf
4   Icinga docs       icinga/README.md
5   Alerting docs     docs/SLACK_ALERTING.md
6   Prometheus rules  prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2 (if new rules needed)
7   Prometheus docs   prometheus/docs/ALERTING.md (if new rules added)

Step 1: Write the Ansible Role

The service must be deployed entirely through Ansible. Never install packages or edit config files directly on the server.

Docker Compose Services

Create a Docker Compose template in the role's templates/ directory:

<app-name>/
└── roles/
    └── <service-name>/
        ├── defaults/main.yml       # Port, image version, volume paths
        ├── tasks/main.yml          # Pull image, template compose file, docker compose up
        ├── handlers/main.yml       # restart service
        └── templates/
            └── docker-compose.yml.j2

Example templates/docker-compose.yml.j2:

services:
  {{ service_name }}:
    image: "{{ service_image }}:{{ service_version }}"
    container_name: "{{ service_name }}"
    restart: unless-stopped
    ports:
      - "127.0.0.1:{{ service_port }}:{{ service_container_port }}"
    volumes:
      - "{{ service_data_dir }}:/data"
    environment:
      - SERVICE_CONFIG={{ service_config }}

Binding to 127.0.0.1 is the default — services are not exposed directly to the network. Access is through nginx or Tailscale.
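
The role's tasks/main.yml could follow this shape (a sketch, not the canonical role: the variable service_compose_dir and the use of the community.docker.docker_compose_v2 module are assumptions, and the handler name must match whatever handlers/main.yml defines):

```yaml
# roles/<service-name>/tasks/main.yml (sketch)
- name: Create data directory
  file:
    path: "{{ service_data_dir }}"
    state: directory
    mode: "0755"

- name: Template the compose file
  template:
    src: docker-compose.yml.j2
    dest: "{{ service_compose_dir }}/docker-compose.yml"  # service_compose_dir is a hypothetical variable
  notify: restart service

- name: Pull images and start the service
  community.docker.docker_compose_v2:
    project_src: "{{ service_compose_dir }}"
    pull: always
    state: present
```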

Systemd Services

For non-container services, create a systemd unit template:

templates/
└── <service-name>.service.j2

Deploy it with the template module and reload systemd:

- name: Deploy systemd unit
  template:
    src: <service-name>.service.j2
    dest: /etc/systemd/system/<service-name>.service
  notify: reload systemd and restart service
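
The matching handler could look like this (a sketch; the handler name must match the notify string exactly, and <service-name> is a placeholder):

```yaml
# handlers/main.yml (sketch)
- name: reload systemd and restart service
  systemd:
    name: <service-name>
    state: restarted
    daemon_reload: true
```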

Step 2: Add Nginx Upstream (if service needs HTTPS access)

If the service requires an HTTPS endpoint, add an nginx upstream and server block through Ansible.

Never edit nginx configs directly on the server

All nginx configuration changes must go through the Ansible role. See IaC Principles.

templates/
└── nginx-<service-name>.conf.j2

Example templates/nginx-<service-name>.conf.j2:

upstream {{ service_name }}_upstream {
    server 127.0.0.1:{{ service_port }};
}

server {
    listen 443 ssl;
    server_name {{ service_fqdn }};

    ssl_certificate     /etc/letsencrypt/live/{{ service_fqdn }}/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/{{ service_fqdn }}/privkey.pem;

    location / {
        proxy_pass http://{{ service_name }}_upstream;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /health {
        proxy_pass http://{{ service_name }}_upstream/health;
        access_log off;
    }
}
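
Deploying that template through the role could look like the following (a sketch — the destination path under /etc/nginx/conf.d/ and the "reload nginx" handler name are assumptions about the local nginx role):

```yaml
- name: Deploy nginx site config
  template:
    src: nginx-<service-name>.conf.j2
    dest: /etc/nginx/conf.d/<service-name>.conf
  notify: reload nginx

- name: Validate nginx configuration
  command: nginx -t
  changed_when: false
```

Running nginx -t before the reload handler fires means a broken template fails the play instead of taking down the running server.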

Step 3: Update Alloy Inventory

In ansible-playbooks/alloy/inventory/{env}.yml, update the existing host entry to add a scrape target for the new service's metrics endpoint.

If the service exposes a /metrics endpoint

# Before
<hostname>-cwiq-io:
  ansible_host: <hostname>-cwiq-io
  ansible_user: ec2-user
  alloy_environment: development
  alloy_scrape_app_metrics: false
  alloy_app_metrics_targets: []

# After
<hostname>-cwiq-io:
  ansible_host: <hostname>-cwiq-io
  ansible_user: ec2-user
  alloy_environment: development
  alloy_scrape_app_metrics: true
  alloy_app_metrics_targets:
    - { name: <service-name>, address: "localhost:<port>", metrics_path: "/metrics" }

If the service only produces logs (no metrics endpoint)

No Alloy inventory change is needed. Docker container logs are collected automatically via the Docker socket. Systemd service logs are collected automatically via journald.

Redeploy Alloy to the Host

ssh ansible@ansible-shared-cwiq-io
ansible-helper
cd alloy
ansible-playbook -i inventory/dev.yml deploy-alloy.yml \
  --limit <hostname>-cwiq-io

Update alloy/docs/README.md

Update the existing row for the host in the Monitored Servers table to mention the new scrape target.


Step 4: Add Icinga Health Check

In icinga/conf.d/hosts/{env}/<hostname>.conf, add a check for the new service's port or HTTP endpoint.

HTTP health check (most common)

vars.http_vhosts["<Service Name>"] = {
  http_address = "<service-fqdn>"
  http_ssl     = true
  http_vhost   = "<service-fqdn>"
  http_uri     = "/health"
  http_port    = 443
}
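
For intuition, an http_vhosts entry like the one above maps onto Icinga's built-in http CheckCommand, roughly equivalent to this manual invocation (illustrative only — the plugin path and exact flag mapping depend on the Icinga installation):

```shell
# Approximate equivalent of the http_vhosts check above
check_http -I <service-fqdn> -H <service-fqdn> -S -u /health -p 443
```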

TCP port check (for non-HTTP services)

vars.tcp_ports["<Service Name>"] = {
  tcp_port = <port>
}

Docker container check (for containers on the host)

vars.docker_containers["<container-name>"] = {
  display_name = "<Service Name> container"
}

Container checks use check_by_ssh to run docker inspect on the host. The Icinga2 daemon connects via Tailscale SSH as user icinga (UID 5665).
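
Conceptually, the container check boils down to something like this on the target host (illustrative — the actual plugin wrapper and its output format may differ):

```shell
# What the Docker container check amounts to
docker inspect --format '{{ .State.Status }}' <container-name>
# A healthy container reports "running"
```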

Redeploy the Icinga Configuration

cd icinga
# For DEV hosts:
ansible-playbook -i inventory/shared.yml deploy-config.yml --tags dev
# For Shared hosts:
ansible-playbook -i inventory/shared.yml deploy-config.yml --tags shared

Update icinga/README.md

Update the row for the host in the Monitored Hosts table to add the new check.


Step 5: Add Prometheus Alert Rules (if needed)

Most services are covered by the existing catch-all alerting rules (volume, host-down, SSL expiry). Add service-specific rules only when the service has its own health endpoint or SLA requirements.

In prometheus/roles/deploy_prometheus/templates/alerting-rules.yml.j2:

- name: <service-name>
  rules:
    - alert: <ServiceName>Down
      expr: |
        probe_success{job="<service-name>"} == 0
      for: 2m
      labels:
        severity: critical
        environment: "{{ '{{' }} $labels.environment {{ '}}' }}"
      annotations:
        summary: "<Service Name> is unreachable"
        description: "<Service Name> at {{ '{{' }} $labels.instance {{ '}}' }} has been unreachable for 2 minutes."
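
The probe_success metric in the rule above comes from a blackbox-exporter-style probe job. If no such job exists yet for the service, a scrape config along these lines would also be needed (a sketch — the http_2xx module name and the exporter address 127.0.0.1:9115 are assumptions about the local blackbox exporter setup):

```yaml
- job_name: <service-name>
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets: ["https://<service-fqdn>/health"]
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115
```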

Redeploy Prometheus after updating rules:

cd prometheus
ansible-playbook -i inventory/shared.yml deploy-prometheus.yml

If you add rules, update prometheus/docs/ALERTING.md with a row in the alert rules reference table.


Mandatory Documentation Co-Change Checklist

CRITICAL: All applicable files must be in the same MR

MRs that modify service configurations without updating documentation will be rejected.

#   File                                         Required when
1   alloy/inventory/{env}.yml                    Always
2   alloy/docs/README.md                         Always
3   icinga/conf.d/hosts/{env}/<hostname>.conf    Always
4   icinga/README.md                             Always
5   docs/SLACK_ALERTING.md                       Always
6   prometheus/roles/.../alerting-rules.yml.j2   New alert rules needed
7   prometheus/docs/ALERTING.md                  New alert rules added

Verification Checklist

After deploying, confirm all systems are working:

# 1. Service is running
ssh ec2-user@<hostname>-cwiq-io "docker ps | grep <service-name>"
# or
ssh ec2-user@<hostname>-cwiq-io "systemctl status <service-name>"

# 2. Health endpoint responds
curl -sf https://<service-fqdn>/health

# 3. Alloy is collecting (check for errors)
ssh ec2-user@<hostname>-cwiq-io "sudo journalctl -u alloy -n 20 --no-pager"

# 4. Logs are in Loki
# Grafana Explore → Loki: {host="<hostname>-cwiq-io", container="<service-name>"}

# 5. Icinga check passes
# https://icinga.shared.cwiq.io → host → verify new check is green