Authentik High-Availability Architecture
Authentik runs Active-Active across two EC2 instances behind an internal ALB, with RDS PostgreSQL, EFS shared media, and Vault Agent for secrets management.
Infrastructure Overview
| Component | Value |
|---|---|
| Version | 2025.10.3 |
| Mode | Active-Active HA |
| DNS | sso.shared.cwiq.io |
| EC2 instances | 2 × t3.medium (us-west-2a, us-west-2b) |
| Database | RDS PostgreSQL 17 (db.t3.medium, Single-AZ) |
| Shared media | EFS (General Purpose) |
| Load balancer | Internal ALB with ACM certificate |
| Session stickiness | lb_cookie, 24 hours |
| Secrets | Vault Agent sidecar (tmpfs, AppRole) |
No Redis in 2025.10+
Authentik 2025.10 removed the Redis dependency. All caching and task queuing now use PostgreSQL, so the Docker Compose stack has no Redis container.
Architecture Diagram
            Users (Tailscale)
                    │
                    ▼
         Tailscale Subnet Router
  (SNAT mode: ALB sees router's VPC IP)
                    │
                    ▼
   Internal ALB (HTTPS:443, ACM cert)
        Stickiness: 24h lb_cookie
       Ingress: 10.0.12.0/26 only
                    │
             ┌──────┴──────┐
             │             │
           EC2-1         EC2-2
        us-west-2a    us-west-2b
         t3.medium     t3.medium
             │             │
   (each instance runs)
   ├── Vault Agent (sidecar, AppRole, tmpfs)
   ├── Authentik Server (HTTP:9000)
   └── Authentik Worker
             │             │
             └──────┬──────┘
                    │
             ┌──────┴──────┐
             │             │
            EFS    RDS PostgreSQL 17
      (shared media) (Single-AZ, db.t3.medium)
Components
EC2 Instances
| Property | Value |
|---|---|
| Instance type | t3.medium |
| Count | 2 |
| Names | cwiq-shared-authentik-1, cwiq-shared-authentik-2 |
| Subnets | us-west-2a (10.0.11.0/26), us-west-2b (10.0.11.64/26) |
| AMI | AlmaLinux 9 |
Each instance runs three Docker containers: vault-agent, authentik-server, authentik-worker.
Application Load Balancer
| Property | Value |
|---|---|
| Type | Internal (VPC-only access via Tailscale) |
| SSL policy | ELBSecurityPolicy-TLS13-1-2-2021-06 |
| Certificate | ACM (imported from Let's Encrypt, managed by cert-server) |
| Health check | /-/health/live/ |
| Stickiness | lb_cookie, 24 hours |
| Allowed CIDRs | 10.0.12.0/26 (Tailscale subnet router subnet) |
Session stickiness is required because OAuth and SAML flows must complete on the same Authentik instance that initiated them.
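The stickiness row above corresponds to three target group attributes on an ALB. The following is a minimal sketch using the AWS CLI, not necessarily how this target group is provisioned. The target group ARN is a hypothetical argument supplied by the caller, and the function names are illustrative.

```shell
# Convert a stickiness window in hours to the seconds value the ALB expects.
stickiness_seconds() {
  hours=$1
  echo $(( hours * 3600 ))
}

# Apply 24-hour lb_cookie stickiness to a target group.
# $1 = target group ARN (hypothetical; supply your own)
set_stickiness() {
  tg_arn=$1
  aws elbv2 modify-target-group-attributes \
    --target-group-arn "$tg_arn" \
    --attributes \
      Key=stickiness.enabled,Value=true \
      Key=stickiness.type,Value=lb_cookie \
      "Key=stickiness.lb_cookie.duration_seconds,Value=$(stickiness_seconds 24)"
}
```

With a 24-hour window the duration attribute is 86400 seconds.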
Tailscale Access Model
The ALB security group only permits ingress from 10.0.12.0/26 (the Tailscale subnet router's VPC subnet). The subnet router operates in SNAT mode (default) — all traffic arriving at the ALB appears to come from the router's VPC IP rather than individual Tailscale client IPs. This simplifies security group rules and eliminates return-route configuration.
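To make the ingress rule concrete, the sketch below checks whether a source address falls inside the permitted 10.0.12.0/26 range, which is effectively the test the security group applies to the subnet router's SNAT address. It is an illustration in plain POSIX shell with hypothetical helper names, and it assumes the shell uses 64-bit arithmetic.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Return 0 if IPv4 address $1 falls inside CIDR $2 (for example 10.0.12.0/26).
ip_in_cidr() {
  net=${2%/*}
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}
```

A /26 covers 64 addresses, so `ip_in_cidr 10.0.12.63 10.0.12.0/26` succeeds while `ip_in_cidr 10.0.12.64 10.0.12.0/26` fails.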
RDS PostgreSQL
| Property | Value |
|---|---|
| Engine | PostgreSQL 17 |
| Instance | db.t3.medium |
| Multi-AZ | No (Single-AZ) |
| Storage | 20 GB, auto-scaling to 100 GB |
| Backup retention | 7 days |
| Encryption | At rest (KMS) |
EFS (Shared Media)
| Property | Value |
|---|---|
| Purpose | Authentik uploaded assets (icons, logos, branding) |
| Mount path | /data/authentik/media |
| Performance | General Purpose |
| Encryption | Enabled |
Both EC2 instances mount the same EFS file system so media assets are consistent regardless of which instance handles a request.
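One way to confirm that an instance is really serving media from the shared file system, rather than from a local directory left behind by a failed mount, is to check the mount table. This is a minimal sketch: the function name is hypothetical, and it relies on EFS being exposed to Linux clients as an NFS mount.

```shell
# Return 0 if $1 is mounted as an NFS file system.
# $2 = mounts table to read; defaults to /proc/mounts on Linux.
is_nfs_mount() {
  path=$1
  mounts=${2:-/proc/mounts}
  # /proc/mounts fields: device mountpoint fstype options dump pass
  awk -v p="$path" '$2 == p && $3 ~ /^nfs/ { found = 1 } END { exit !found }' "$mounts"
}
```

On either instance, `is_nfs_mount /data/authentik/media || echo "media path is not an NFS mount"` would flag a host that has lost the shared volume.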
Vault Agent Sidecar
| Property | Value |
|---|---|
| Image | hashicorp/vault:1.18 |
| Auth method | AppRole |
| Secret refresh | Every 5 minutes |
| Output volume | tmpfs (RAM only — secrets never touch disk) |
Secrets managed by Vault Agent:
| Secret | Vault Path | Env Var |
|---|---|---|
| RDS host | secret/data/cwiq/shared/authentik/database.pg_host | AUTHENTIK_POSTGRESQL__HOST |
| RDS user | secret/data/cwiq/shared/authentik/database.pg_user | AUTHENTIK_POSTGRESQL__USER |
| RDS password | secret/data/cwiq/shared/authentik/database.pg_password | AUTHENTIK_POSTGRESQL__PASSWORD |
| RDS database | secret/data/cwiq/shared/authentik/database.pg_name | AUTHENTIK_POSTGRESQL__NAME |
| Secret key | secret/data/cwiq/shared/authentik/config.secret_key | AUTHENTIK_SECRET_KEY |
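The Vault Path column appears to use a `path.field` notation on top of KV v2 API paths. Under that assumption, and assuming the KV v2 engine is mounted at `secret`, the sketch below splits a reference into its parts and reads a single field with the Vault CLI for manual verification. The helper names are hypothetical, and this is not how Vault Agent itself renders the values.

```shell
# Split "mount/data/path.field" into the KV v2 API path and the field name.
vault_ref_path()  { echo "${1%.*}"; }
vault_ref_field() { echo "${1##*.}"; }

# Print one field via the Vault CLI. "vault kv get" takes the logical path,
# so the "data/" segment of the KV v2 API path is dropped first.
vault_read_ref() {
  api_path=$(vault_ref_path "$1")
  mount=${api_path%%/data/*}
  rest=${api_path#*/data/}
  vault kv get -field="$(vault_ref_field "$1")" "$mount/$rest"
}
```

For example, `vault_read_ref secret/data/cwiq/shared/authentik/database.pg_host` runs `vault kv get -field=pg_host secret/cwiq/shared/authentik/database`.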
SSL Certificate Architecture
Authentik uses a dual SSL architecture:
1. HTTPS termination (ALB + ACM)
The Let's Encrypt certificate is imported into AWS ACM and managed by cert-server/acm-import.yml. It is renewed automatically every 60 days.
2. SAML signing (Authentik internal)
The same Let's Encrypt RSA certificate is imported via the Authentik API (by configure.yml) for SAML assertion signing. This is required for the AWS Identity Center integration.
Failover Behavior
EC2 Instance Failure
| Step | Time | Action |
|---|---|---|
| Instance stops responding | 0s | — |
| ALB marks instance unhealthy (3 failed checks) | ~30s | — |
| All traffic routed to healthy instance | ~30s | Automatic |
Total failover time: ~30 seconds, fully automatic.
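The ~30 second figure follows from the health check settings: detection time is roughly the number of consecutive failed checks multiplied by the check interval. The table gives the threshold (3) but not the interval, so the 10-second interval used below is an assumption inferred from the stated total, and the function name is illustrative.

```shell
# Approximate seconds before the ALB marks a target unhealthy:
# consecutive failed checks multiplied by the check interval.
detection_seconds() {
  unhealthy_threshold=$1
  interval_seconds=$2
  echo $(( unhealthy_threshold * interval_seconds ))
}
```

`detection_seconds 3 10` prints 30, matching the table. With a 30-second interval, the same threshold would stretch detection to 90 seconds.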
RDS Failure (Single-AZ)
| Step | Estimated Time | Action |
|---|---|---|
| RDS becomes unavailable | 0s | — |
| Restore from automated backup | 5–10 min | Manual intervention |
| DNS/config update if needed | 2–5 min | Manual intervention |
| Total RTO | ~15 minutes | Manual |
Single-AZ RDS was chosen for cost. If an RTO under 15 minutes is required, enable Multi-AZ in Terraform.
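Before and after changing the Terraform setting, the current deployment mode can be read back from AWS. This is a small sketch, assuming the instance identifier and CLI profile that appear in the recovery commands (`cwiq-shared-authentik-postgres`, `shared-services`). The function name is hypothetical.

```shell
# Print the MultiAZ flag reported by RDS for one instance.
# $1 = DB instance identifier, $2 = AWS CLI profile
rds_is_multi_az() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].MultiAZ' \
    --output text \
    --profile "$2"
}
```

Usage: `rds_is_multi_az cwiq-shared-authentik-postgres shared-services`.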
Security
| Control | Configuration |
|---|---|
| Secrets on disk | None — Vault Agent renders to tmpfs |
| Token TTL | 1 hour (Vault Agent auto-renews at 45 min) |
| Secret refresh | Every 5 minutes |
| Login rate limiting | 5 failed attempts → user blocked; 10 → IP blocked; 24h reset |
| Network | Internal ALB only; EC2 instances have no public IPs |
| RDS access | Isolated data subnet, security group ingress from Authentik instances only |
Recovery Procedures
Restart Services
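A minimal sketch of a one-host-at-a-time restart, so that the other instance keeps serving traffic behind the ALB. It assumes the Compose service names used in the log commands (`server`, `worker`), the `cwiq` SSH user, and that the Compose project is in the directory the SSH session starts in. The function names are hypothetical.

```shell
# Restart the Authentik containers on one host over SSH.
# $1 = hostname (for example cwiq-shared-authentik-1)
restart_on_host() {
  ssh "cwiq@$1" 'docker compose restart server worker'
}

# Restart the given hosts one at a time, stopping at the first failure
# so the remaining instance is left untouched.
rolling_restart() {
  for host in "$@"; do
    restart_on_host "$host" || return 1
  done
}
```

Usage: `rolling_restart cwiq-shared-authentik-1 cwiq-shared-authentik-2`. Waiting for `/-/health/ready/` to recover on the first host before moving to the second would be a sensible refinement.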
Recover EC2 Instance from Backup
ansible-playbook aws/recover_instance.yml \
  -e "app=authentik env=shared"
# From a specific snapshot
ansible-playbook aws/recover_instance.yml \
  -e "app=authentik env=shared snapshot_id=snap-xxxxx"
Recover RDS (Point-in-Time)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier cwiq-shared-authentik-postgres \
  --target-db-instance-identifier cwiq-shared-authentik-postgres-recovered \
  --restore-time 2026-03-15T10:00:00Z \
  --profile shared-services
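The restore creates a new instance with its own endpoint, which is probably what the "DNS/config update if needed" step in the failover table refers to. The sketch below waits for the restored instance to become available and prints its endpoint address, using standard AWS CLI subcommands. The function name is hypothetical.

```shell
# Block until the restored instance is available, then print its endpoint.
# $1 = restored DB instance identifier, $2 = AWS CLI profile
restored_endpoint() {
  aws rds wait db-instance-available \
    --db-instance-identifier "$1" --profile "$2" || return 1
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text --profile "$2"
}
```

Usage: `restored_endpoint cwiq-shared-authentik-postgres-recovered shared-services`. The printed address would then replace the `pg_host` value in Vault so that both instances pick it up on the next secret refresh.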
Regenerate Vault Secret ID
# On Vault server
vault write -f auth/approle/role/authentik/secret-id
# Redeploy with new Secret ID
ansible-playbook authentik/setup.yml \
  -e "vault_enabled=true" \
  -e "vault_role_id=<role_id>" \
  -e "vault_secret_id=<new_secret_id>"
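When scripting this step, the `-field` output option of the Vault CLI can return just the `secret_id` value instead of the full response table. This is a minimal sketch with a hypothetical function name. The role name `authentik` is taken from the command above.

```shell
# Generate a new AppRole Secret ID and print only the secret_id value.
# $1 = AppRole role name
new_secret_id() {
  vault write -f -field=secret_id "auth/approle/role/$1/secret-id"
}
```

Usage: `secret_id=$(new_secret_id authentik)`, then pass `$secret_id` as `vault_secret_id` to the playbook. Avoid echoing the value or writing it to shell history.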
Health Endpoints
| Endpoint | Purpose |
|---|---|
| https://sso.shared.cwiq.io/-/health/live/ | Liveness (ALB health check) |
| https://sso.shared.cwiq.io/-/health/ready/ | Readiness |
# Quick health check
curl -sf https://sso.shared.cwiq.io/-/health/live/ && echo "healthy"
# Container logs
ssh cwiq@cwiq-shared-authentik-1
docker compose logs server --tail 50
docker compose logs worker --tail 20
docker compose logs vault-agent --tail 20
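After a restart or recovery, a single `curl` can fail simply because the service is still starting. The sketch below polls an endpoint with a bounded number of retries. The function name and default values are illustrative, not taken from the deployment.

```shell
# Poll a health endpoint until curl succeeds or attempts run out.
# $1 = URL, $2 = max attempts (default 10), $3 = seconds between attempts (default 5)
wait_until_healthy() {
  url=$1
  attempts=${2:-10}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (4xx/5xx)
    curl -sf -o /dev/null "$url" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}
```

Usage: `wait_until_healthy https://sso.shared.cwiq.io/-/health/ready/ 12 5 && echo "ready"`.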
Related Documentation
- App Onboarding — Connecting applications via OIDC/SAML/Proxy
- User Lifecycle — Onboarding and offboarding users
- Vault Integration — AppRole setup and secret management
- MFA — TOTP enforcement
- SSL: ACM Import — How the ALB certificate is managed