Grafana Dashboards

Grafana 12.4.0 is the primary UI for the CWIQ observability stack. It provides log exploration through Loki and metrics dashboards through Prometheus, and it is served behind Nginx, which terminates SSL, with optional Authentik SSO.

Overview

| Property | Value |
|---|---|
| Version | 12.4.0 |
| Host | `grafana-shared-cwiq-io` (VPC IP `10.0.15.178`) |
| External URL | `https://grafana.shared.cwiq.io` |
| Grafana container | `observability-grafana` (port 3000, bound to localhost) |
| Nginx container | `observability-nginx` (ports 80/443) |
| Datasources | Loki (default) and Prometheus, both pre-provisioned |
| SSO | Authentik OIDC (optional, deployed by a separate playbook) |
| Playbook | `grafana/deploy-grafana.yml` |

Pre-Provisioned Datasources

Both datasources are provisioned at startup. No manual configuration is needed.

| Name | Type | URL | Default |
|---|---|---|---|
| `Loki` | Loki | `http://10.0.15.157:3100` | Yes |
| `Prometheus` | Prometheus | `http://10.0.15.9:9090` | No |

The URLs use VPC private IPs because Loki and Prometheus run on hosts other than the Grafana host, and Docker's container DNS only resolves names within a Docker network on the same host.
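A quick way to confirm that these private IPs are reachable from the Grafana host is to probe each backend's readiness endpoint. The sketch below assumes it runs on `grafana-shared-cwiq-io` with `curl` installed; `/ready` (Loki) and `/-/ready` (Prometheus) are the standard readiness endpoints of those projects, while the function name `check_datasources` and the `LOKI_URL`/`PROMETHEUS_URL` overrides are hypothetical.

```bash
# Hypothetical helper: probe the readiness endpoints behind the two
# pre-provisioned datasources (run on grafana-shared-cwiq-io).
LOKI_URL="${LOKI_URL:-http://10.0.15.157:3100}"
PROMETHEUS_URL="${PROMETHEUS_URL:-http://10.0.15.9:9090}"

check_datasources() {
  local failed=0 pair name url
  # "name|readiness URL" pairs: Loki serves /ready, Prometheus serves /-/ready
  for pair in "Loki|${LOKI_URL}/ready" "Prometheus|${PROMETHEUS_URL}/-/ready"; do
    name="${pair%%|*}"
    url="${pair#*|}"
    # -f fails on HTTP errors; --max-time stops a blocked port from hanging the check
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   ${name} (${url})"
    else
      echo "FAIL ${name} (${url})"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: check_datasources
```

A `FAIL` that ends in a timeout most likely points at network reachability between the hosts (for example security group rules) rather than at Grafana itself.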

Using Grafana

Log Exploration (Loki)

1. Open https://grafana.shared.cwiq.io.
2. Click Explore (compass icon) in the left sidebar.
3. Select Loki from the datasource dropdown.
4. Enter a LogQL query (the Loki page has the full LogQL reference).

Common queries:

```logql
# All logs from a specific server
{host="orchestrator-dev-cwiq-io"}

# Lines containing "error" (case-sensitive) from all orchestrator containers
{job="docker", compose_project="orchestrator"} |= "error"

# Systemd journal lines that mention "alloy"
{job="journal"} |= "alloy"
```
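The same LogQL can also be sent straight to Loki's HTTP API, which is useful for scripting outside the UI. This is a minimal sketch that assumes network access to `10.0.15.157:3100` (for example from the Grafana host); `GET /loki/api/v1/query_range` with its `query`, `since`, and `limit` parameters is part of Loki's documented API, and `loki_query` is a hypothetical helper name.

```bash
# Hypothetical helper: run a LogQL query through Loki's HTTP API
# (GET /loki/api/v1/query_range) and print the raw JSON response.
LOKI_URL="${LOKI_URL:-http://10.0.15.157:3100}"

loki_query() {
  local query="$1" since="${2:-1h}" limit="${3:-100}"
  # -G appends the --data-urlencode pairs to the URL as a query string,
  # escaping the braces and quotes that LogQL selectors contain.
  curl -fsS -G "${LOKI_URL}/loki/api/v1/query_range" \
    --data-urlencode "query=${query}" \
    --data-urlencode "since=${since}" \
    --data-urlencode "limit=${limit}"
}

# Usage: loki_query '{job="docker", compose_project="orchestrator"} |= "error"' 30m 50
```

Using `curl -G` with `--data-urlencode` avoids hand-escaping the braces and quotes in a LogQL selector, which is the usual stumbling block when pasting queries into a URL.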

Metrics Exploration (Prometheus)

1. Click Explore.
2. Select Prometheus from the datasource dropdown.
3. Enter a PromQL query.

Common queries:

```promql
# CPU usage per host (percent)
100 - (avg by (host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory in use (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# All currently firing alerts
ALERTS{alertstate="firing"}
```
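PromQL can likewise be evaluated outside the UI through Prometheus' HTTP API. A minimal sketch, assuming network access to `10.0.15.9:9090`; `GET /api/v1/query` with a `query` and an optional `time` parameter is Prometheus' standard instant-query endpoint, and `prom_query` is a hypothetical helper name.

```bash
# Hypothetical helper: evaluate an instant PromQL query through
# Prometheus' HTTP API (GET /api/v1/query) and print the JSON result.
PROMETHEUS_URL="${PROMETHEUS_URL:-http://10.0.15.9:9090}"

prom_query() {
  local query="$1" eval_time="${2:-}"
  local -a args=(--data-urlencode "query=${query}")
  # Optional evaluation time (RFC 3339 or Unix seconds); Prometheus defaults to now
  if [[ -n "$eval_time" ]]; then
    args+=(--data-urlencode "time=${eval_time}")
  fi
  curl -fsS -G "${PROMETHEUS_URL}/api/v1/query" "${args[@]}"
}

# Usage: prom_query 'ALERTS{alertstate="firing"}'
```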

Available Dashboards

Three Node Exporter dashboards are pre-provisioned:

| Dashboard | Shows |
|---|---|
| Node Exporter — System Overview | CPU, memory, swap, and load across all hosts |
| Node Exporter — Disk & Filesystem | Disk usage and IOPS per host, per mount |
| Node Exporter — Network | Network throughput, errors, and drops per interface |

All dashboards include a Host dropdown populated from the host label on incoming metrics. Select a host to filter all panels.
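To check which hosts the dropdown should offer, Prometheus can be asked for the values of the `host` label directly through its `GET /api/v1/label/<name>/values` endpoint. This sketch restricts the lookup with a `match[]` selector on the node_exporter metric `node_uname_info`; that selector is an assumption, not a copy of the dashboards' actual variable query. It also assumes `python3` on the calling host to unwrap the JSON, and `list_hosts` is a hypothetical name.

```bash
# Hypothetical helper: list the values of the "host" label on node_exporter
# series, roughly what the dashboards' Host dropdown should offer.
PROMETHEUS_URL="${PROMETHEUS_URL:-http://10.0.15.9:9090}"

list_hosts() {
  # match[] restricts the lookup to node_uname_info series (an assumption about
  # which metric the dashboards' variable is built from).
  curl -fsS -G "${PROMETHEUS_URL}/api/v1/label/host/values" \
    --data-urlencode 'match[]=node_uname_info' |
    python3 -c 'import json, sys; print("\n".join(json.load(sys.stdin)["data"]))'
}

# Usage: list_hosts
```

A host that is missing here but expected in the dropdown most likely has no `host` label on its node_exporter series.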

Alert Management

The Grafana Alerting UI connects to AlertManager at http://observability-alertmanager:9093 via the Docker network on prometheus-shared-cwiq-io. Use Grafana to:

- View currently firing alerts
- Create and manage silences
- Inspect the alert routing tree

AlertManager's API is not directly accessible from browsers. Access it through Grafana or via SSH on prometheus-shared-cwiq-io:

```bash
ssh ec2-user@prometheus-shared-cwiq-io "curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool"
```
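Silences can be created through the same AlertManager v2 API over SSH, for example when Grafana itself is unavailable. The sketch below builds the JSON body for `POST /api/v2/silences` (fields `matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`) locally and posts it from the AlertManager host. It assumes GNU `date` for the relative end time and performs no JSON escaping, so keep double quotes out of the arguments; the helper names and the alert name `HighCPU` are hypothetical.

```bash
# Hypothetical helpers: silence one alertname for N hours via AlertManager's
# v2 API (POST /api/v2/silences). Requires GNU date; no JSON escaping is done.
silence_payload() {
  local alertname="$1" hours="$2" comment="$3" starts ends
  starts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  ends="$(date -u -d "+${hours} hours" +%Y-%m-%dT%H:%M:%SZ)"
  printf '{"matchers":[{"name":"alertname","value":"%s","isRegex":false,"isEqual":true}],' "$alertname"
  # createdBy falls back to "ops" when $USER is unset
  printf '"startsAt":"%s","endsAt":"%s","createdBy":"%s","comment":"%s"}\n' \
    "$starts" "$ends" "${USER:-ops}" "$comment"
}

create_silence() {
  # Build the payload locally and POST it from the AlertManager host,
  # where the API is reachable on localhost:9093 (curl reads the body from stdin).
  silence_payload "$@" |
    ssh ec2-user@prometheus-shared-cwiq-io \
      "curl -fsS -X POST -H 'Content-Type: application/json' --data @- http://localhost:9093/api/v2/silences"
}

# Usage: create_silence HighCPU 2 "planned maintenance"
```

On success AlertManager replies with the new `silenceID`, and the silence should then be listed in Grafana's Alerting UI alongside ones created there.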

Configuration Variables

| Variable | Default | Description |
|---|---|---|
| `grafana_version` | `12.4.0` | Docker image tag |
| `grafana_admin_password` | (required) | Admin password; read it from Vault at `secret/grafana/admin` |
| `grafana_data_dir` | `/data/grafana` | Host path for data, plugins, and dashboards |
| `grafana_domain` | `grafana.shared.cwiq.io` | External hostname |
| `grafana_loki_url` | `http://10.0.15.157:3100` | Loki datasource URL |
| `grafana_prometheus_url` | `http://10.0.15.9:9090` | Prometheus datasource URL |
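As a non-interactive alternative to typing the password into `group_vars/all.yml`, it can be pulled from Vault into a private extra-vars file. This sketch assumes the Vault CLI is installed and authenticated on the Ansible host and that the secret at `secret/grafana/admin` stores the value under a key named `password`; that key name and the helper `write_vault_vars` are assumptions.

```bash
# Hypothetical helper: write grafana_admin_password from Vault into a JSON
# extra-vars file readable only by the current user.
# Assumes an authenticated Vault CLI and a "password" key at secret/grafana/admin.
write_vault_vars() {
  local out="$1" password
  password="$(vault kv get -field=password secret/grafana/admin)" || return 1
  rm -f "$out"
  # umask 077 in a subshell makes the new file mode 0600 without touching the
  # caller's umask; the password reaches python3 on stdin, not on its command line.
  ( umask 077 && printf '%s' "$password" |
      python3 -c 'import json, sys; print(json.dumps({"grafana_admin_password": sys.stdin.read()}))' > "$out" )
}

# Usage (in the grafana playbook directory):
#   write_vault_vars /tmp/grafana-vars.json
#   ansible-playbook -i inventory/shared.yml deploy-grafana.yml -e @/tmp/grafana-vars.json
#   rm -f /tmp/grafana-vars.json
```

Extra vars passed with `-e @file` take precedence over `group_vars`, so the value in the file wins over anything set in `all.yml`. JSON is used because `json.dumps` handles quoting of arbitrary password characters and Ansible accepts JSON extra-vars files.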

Deployment

```bash
ssh ansible@ansible-shared-cwiq-io
ansible-helper
cd grafana
cp group_vars/all.yml.template group_vars/all.yml
# Set grafana_admin_password from Vault
vi group_vars/all.yml

ansible-playbook -i inventory/shared.yml deploy-grafana.yml
```

The playbook creates the config and data directories, renders `grafana.ini` and the Nginx config from templates, and starts both containers. It then waits up to 300 seconds for `/api/health` to return HTTP 200.
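For scripted deploys, or when waiting on a restart by hand, the same health wait can be reproduced with a small polling loop. A sketch, assuming it runs on the Grafana host against `localhost:3000`; the 300-second default mirrors the playbook, while the 5-second poll interval and the name `wait_for_grafana` are assumptions.

```bash
# Hypothetical helper mirroring the playbook's wait: poll /api/health until it
# returns HTTP 200 or the timeout expires (300 s default, as in deploy-grafana.yml).
wait_for_grafana() {
  local url="${1:-http://localhost:3000/api/health}"
  local timeout="${2:-300}" interval="${3:-5}"
  local deadline=$((SECONDS + timeout)) code=""
  while (( SECONDS < deadline )); do
    # -w prints only the status code; curl prints 000 when it cannot connect
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")"
    if [[ "$code" == "200" ]]; then
      echo "Grafana healthy at ${url}"
      return 0
    fi
    sleep "$interval"
  done
  echo "Timed out after ${timeout}s waiting for ${url} (last status: ${code:-none})" >&2
  return 1
}

# Usage (on grafana-shared-cwiq-io): wait_for_grafana
```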

Health Check Commands

```bash
# External health via Nginx
curl https://grafana.shared.cwiq.io/api/health

# Internal Grafana health (from the Grafana host)
ssh ec2-user@grafana-shared-cwiq-io "curl http://localhost:3000/api/health"

# Container status
ssh ec2-user@grafana-shared-cwiq-io "docker ps \
  --filter 'name=observability-grafana' \
  --filter 'name=observability-nginx' \
  --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"

# Ansible healthcheck (on ansible-shared-cwiq-io, as in Deployment)
cd grafana
ansible-playbook -i inventory/shared.yml healthcheck.yml
```
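`/api/health` also returns a small JSON document containing Grafana's `version` and a `database` status, so a check can look beyond the HTTP status code. This sketch assumes `python3` is available where it runs and treats anything other than `"database": "ok"` as a failure; `grafana_health` is a hypothetical name.

```bash
# Hypothetical helper: fetch /api/health, print Grafana's version and database
# status, and exit non-zero unless "database" is "ok".
grafana_health() {
  local url="${1:-https://grafana.shared.cwiq.io/api/health}"
  # If curl fails, python3 gets no JSON and also exits non-zero
  curl -fsS --max-time 10 "$url" |
    python3 -c '
import json, sys
health = json.load(sys.stdin)
print("version={} database={}".format(health.get("version"), health.get("database")))
sys.exit(0 if health.get("database") == "ok" else 1)
'
}

# Usage: grafana_health
#        grafana_health http://localhost:3000/api/health   # on the Grafana host
```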

Operational Playbooks

| Playbook | Purpose |
|---|---|
| `deploy-grafana.yml` | Full deployment (config, image pull, start) |
| `healthcheck.yml` | Check `/api/health` |
| `restart.yml` | Restart the Grafana and Nginx containers |
| `stop.yml` | Stop both containers |

Monitoring Overview
Loki Log Aggregation
Prometheus & AlertManager
Source: ansible-playbooks/grafana/docs/README.md