Authentik High-Availability Architecture

Authentik runs Active-Active across two EC2 instances behind an internal ALB, with RDS PostgreSQL, EFS shared media, and Vault Agent for secrets management.

Infrastructure Overview

Component           Value
Version             2025.10.3
Mode                Active-Active HA
DNS                 sso.shared.cwiq.io
EC2 instances       2 × t3.medium (us-west-2a, us-west-2b)
Database            RDS PostgreSQL 17 (db.t3.medium, Single-AZ)
Shared media        EFS (General Purpose)
Load balancer       Internal ALB with ACM certificate
Session stickiness  lb_cookie, 24 hours
Secrets             Vault Agent sidecar (tmpfs, AppRole)

No Redis in 2025.10+

Authentik 2025.10 removed the Redis dependency. Caching and task queuing now use PostgreSQL, so the Docker Compose stack has no Redis container.

Architecture Diagram

            Users (Tailscale)
                    │
                    ▼
          Tailscale Subnet Router
   (SNAT mode: ALB sees router's VPC IP)
                    │
                    ▼
    Internal ALB (HTTPS:443, ACM cert)
         Stickiness: 24h lb_cookie
         Ingress: 10.0.12.0/26 only
                    │
             ┌──────┴──────┐
             │             │
           EC2-1         EC2-2
        us-west-2a    us-west-2b
         t3.medium     t3.medium
             │             │
   Each instance runs:
     ├── Vault Agent (sidecar, AppRole, tmpfs)
     ├── Authentik Server (HTTP:9000)
     └── Authentik Worker
             │             │
             └──────┬──────┘
             ┌──────┴──────┐
             │             │
            EFS     RDS PostgreSQL 17
     (shared media) (Single-AZ, db.t3.medium)

Components

EC2 Instances

Property       Value
Instance type  t3.medium
Count          2
Names          cwiq-shared-authentik-1, cwiq-shared-authentik-2
Subnets        us-west-2a (10.0.11.0/26), us-west-2b (10.0.11.64/26)
AMI            AlmaLinux 9

Each instance runs three Docker containers: vault-agent, authentik-server, authentik-worker.

Application Load Balancer

Property       Value
Type           Internal (VPC-only access via Tailscale)
SSL policy     ELBSecurityPolicy-TLS13-1-2-2021-06
Certificate    ACM (imported from Let's Encrypt, managed by cert-server)
Health check   /-/health/live/
Stickiness     lb_cookie, 24 hours
Allowed CIDRs  10.0.12.0/26 (Tailscale subnet router subnet)

Session stickiness is required because OAuth and SAML flows must complete on the same Authentik instance that initiated them.
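
To check that the target group carries the stickiness settings listed above, the relevant attributes can be compared against the output of `aws elbv2 describe-target-group-attributes`. The sketch below only builds the expected attribute list. The helper name `stickiness_attributes` is hypothetical; the attribute keys are the standard ELB target group attribute names, and the duration is converted to seconds because that is the unit ELB uses.

```shell
# Hypothetical helper: print the ELB target group attributes for
# load-balancer-cookie stickiness lasting the given number of hours.
stickiness_attributes() (
  hours=$1
  seconds=$(( hours * 3600 ))   # ELB expects the duration in seconds
  printf 'Key=stickiness.enabled,Value=true '
  printf 'Key=stickiness.type,Value=lb_cookie '
  printf 'Key=stickiness.lb_cookie.duration_seconds,Value=%s\n' "$seconds"
)

stickiness_attributes 24
```

The same list is accepted by `aws elbv2 modify-target-group-attributes --attributes`, although if the ALB is managed in Terraform the change belongs there instead.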

Tailscale Access Model

The ALB security group only permits ingress from 10.0.12.0/26 (the Tailscale subnet router's VPC subnet). The subnet router operates in SNAT mode (default) — all traffic arriving at the ALB appears to come from the router's VPC IP rather than individual Tailscale client IPs. This simplifies security group rules and eliminates return-route configuration.
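
When debugging a rejected connection, it helps to confirm whether a given source address actually falls inside the permitted 10.0.12.0/26 range (10.0.12.0 through 10.0.12.63). The following is a minimal sketch in POSIX shell; the function names `ip_to_int` and `ip_in_cidr` and the sample address are illustrative only.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1                     # split the address on dots
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if address $1 lies inside CIDR block $2 (e.g. 10.0.12.0/26).
ip_in_cidr() (
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")    # network part before the slash
  bits=${2#*/}                  # prefix length after the slash
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
)

# Example address inside the subnet router's subnet (made up for illustration)
ip_in_cidr 10.0.12.17 10.0.12.0/26 && echo "allowed"
```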

RDS PostgreSQL

Property          Value
Engine            PostgreSQL 17
Instance          db.t3.medium
Multi-AZ          No (Single-AZ)
Storage           20 GB, auto-scaling to 100 GB
Backup retention  7 days
Encryption        At rest (KMS)

EFS (Shared Media)

Property     Value
Purpose      Authentik uploaded assets (icons, logos, branding)
Mount path   /data/authentik/media
Performance  General Purpose
Encryption   Enabled

Both EC2 instances mount the same EFS file system so media assets are consistent regardless of which instance handles a request.

Vault Agent Sidecar

Property        Value
Image           hashicorp/vault:1.18
Auth method     AppRole
Secret refresh  Every 5 minutes
Output volume   tmpfs (RAM only; secrets never touch disk)

Secrets managed by Vault Agent:

Secret        Vault Path                                              Env Var
RDS host      secret/data/cwiq/shared/authentik/database.pg_host      AUTHENTIK_POSTGRESQL__HOST
RDS user      secret/data/cwiq/shared/authentik/database.pg_user      AUTHENTIK_POSTGRESQL__USER
RDS password  secret/data/cwiq/shared/authentik/database.pg_password  AUTHENTIK_POSTGRESQL__PASSWORD
RDS database  secret/data/cwiq/shared/authentik/database.pg_name      AUTHENTIK_POSTGRESQL__NAME
Secret key    secret/data/cwiq/shared/authentik/config.secret_key     AUTHENTIK_SECRET_KEY
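
The Vault Path column appears to combine a KV v2 API path with a field name, separated by the last dot. Assuming that reading, and assuming the KV engine is mounted at `secret/`, the sketch below turns a table entry into the equivalent `vault kv get` command for spot-checking a value. The helper name `vault_kv_command` is hypothetical.

```shell
# Hypothetical helper: turn "<mount>/data/<path>.<field>" into the
# matching `vault kv get` command line (printed, not executed).
vault_kv_command() (
  ref=$1
  field=${ref##*.}            # text after the last dot, e.g. pg_host
  api_path=${ref%.*}          # everything before the last dot
  mount=${api_path%%/*}       # first path segment, e.g. secret
  rest=${api_path#*/data/}    # drop "<mount>/data/" (KV v2 API prefix)
  printf 'vault kv get -mount=%s -field=%s %s\n' "$mount" "$field" "$rest"
)

vault_kv_command secret/data/cwiq/shared/authentik/database.pg_host
```

Printing the command rather than running it keeps the secret value out of the terminal until the operator decides to execute it.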

SSL Certificate Architecture

Authentik uses a dual SSL architecture:

1. HTTPS termination (ALB + ACM)

Users (HTTPS:443) → ALB (ACM cert) → HTTP:9000 → Authentik

The Let's Encrypt certificate is imported into AWS ACM and managed by cert-server/acm-import.yml. It is renewed automatically every 60 days.

2. SAML signing (Authentik internal)

The same Let's Encrypt RSA certificate is imported via the Authentik API (by configure.yml) for SAML assertion signing. This is required for the AWS Identity Center integration.

Failover Behavior

EC2 Instance Failure

Step                                            Time  Action
Instance stops responding                       0s    -
ALB marks instance unhealthy (3 failed checks)  ~30s  -
All traffic routed to healthy instance          ~30s  Automatic

Total failover time: ~30 seconds, fully automatic.
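
The ~30 second figure follows from the ALB health check settings: a target is taken out of rotation after a run of consecutive failed checks, so detection takes roughly interval × unhealthy threshold. The table gives the threshold (3). The 10-second interval used below is inferred from the ~30 s total and is not stated in this document. A minimal sketch of the arithmetic, with a hypothetical function name:

```shell
# Approximate seconds until the ALB marks a dead target unhealthy.
# $1 = health check interval in seconds, $2 = unhealthy threshold
failover_seconds() (
  echo $(( $1 * $2 ))
)

failover_seconds 10 3   # assumed 10 s interval, 3 failed checks
```

Requests already in flight to the failed instance during that window can still fail, so clients should expect to retry.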

RDS Failure (Single-AZ)

Step                           Estimated Time  Action
RDS becomes unavailable        0s              -
Restore from automated backup  5–10 min        Manual intervention
DNS/config update if needed    2–5 min         Manual intervention
Total RTO                      ~15 minutes     Manual

Single-AZ RDS was chosen to keep costs down. If an RTO under 15 minutes is required, enable Multi-AZ in Terraform.

Security

Control              Configuration
Secrets on disk      None (Vault Agent renders to tmpfs)
Token TTL            1 hour (Vault Agent auto-renews at 45 min)
Secret refresh       Every 5 minutes
Login rate limiting  5 failed attempts → user blocked; 10 → IP blocked; 24h reset
Network              Internal ALB only; EC2 instances have no public IPs
RDS access           Isolated data subnet, security group ingress from Authentik instances only

Recovery Procedures

Restart Services

ssh ansible@ansible-shared-cwiq-io
ansible-helper
ansible-playbook authentik/restart.yml

Recover EC2 Instance from Backup

ansible-playbook aws/recover_instance.yml \
  -e "app=authentik env=shared"

# From a specific snapshot
ansible-playbook aws/recover_instance.yml \
  -e "app=authentik env=shared snapshot_id=snap-xxxxx"

Recover RDS (Point-in-Time)

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier cwiq-shared-authentik-postgres \
  --target-db-instance-identifier cwiq-shared-authentik-postgres-recovered \
  --restore-time 2026-03-15T10:00:00Z \
  --profile shared-services
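
After the restore command returns, the new instance still has to finish provisioning, and because it has a different identifier it comes up under a different endpoint hostname. The sketch below waits for the instance and prints its endpoint so that the `pg_host` value in Vault can be updated. The function name `restored_endpoint` is hypothetical; the identifier and profile in the usage line are the ones from the command above.

```shell
# Hypothetical helper: block until the restored instance is available,
# then print its endpoint hostname.
# $1 = DB instance identifier, $2 = AWS CLI profile
restored_endpoint() (
  aws rds wait db-instance-available \
    --db-instance-identifier "$1" \
    --profile "$2" || exit 1
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text \
    --profile "$2"
)

# Usage:
#   restored_endpoint cwiq-shared-authentik-postgres-recovered shared-services
```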

Regenerate Vault Secret ID

# On Vault server
vault write -f auth/approle/role/authentik/secret-id

# Redeploy with new Secret ID
ansible-playbook authentik/setup.yml \
  -e "vault_enabled=true" \
  -e "vault_role_id=<role_id>" \
  -e "vault_secret_id=<new_secret_id>"

Health Endpoints

Endpoint                                    Purpose
https://sso.shared.cwiq.io/-/health/live/   Liveness (ALB health check)
https://sso.shared.cwiq.io/-/health/ready/  Readiness

# Quick health check
curl -sf https://sso.shared.cwiq.io/-/health/live/ && echo "healthy"

# Container logs
ssh cwiq@cwiq-shared-authentik-1
docker compose logs server --tail 50
docker compose logs worker --tail 20
docker compose logs vault-agent --tail 20
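
After a restart or recovery, an instance can take a little while to report healthy, so a single `curl` may fail spuriously. The quick health check above can be wrapped in a polling loop. This is a minimal sketch assuming `curl` is available; the function name `wait_healthy` and the retry values in the usage line are illustrative, not part of the existing tooling.

```shell
# Poll a health endpoint until it returns a successful response.
# $1 = URL, $2 = maximum attempts, $3 = seconds between attempts
wait_healthy() (
  url=$1 attempts=$2 delay=$3
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -f: fail on HTTP errors, --max-time: cap each try
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      exit 0
    fi
    i=$(( i + 1 ))
    # Pause only if another attempt will follow
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  exit 1
)

# Usage:
#   wait_healthy https://sso.shared.cwiq.io/-/health/ready/ 12 5
```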