SSL: Troubleshooting

Diagnosis and resolution steps for SSL certificate issuance failures, deployment failures, services not picking up renewed certs, and ACM issues.

Certificate Issuance Failures

DNS-01 Challenge Fails

Certbot uses Route53 to create TXT records for DNS-01 validation. If the cert-server's IAM role does not have Route53 permissions, or the hosted zone is not accessible, issuance will fail.

# Verify IAM role has Route53 permissions
aws sts get-caller-identity  # Run on cert-server; confirms instance role
aws route53 list-hosted-zones  # Should list shared.cwiq.io and dev.cwiq.io zones

# Test with a dry run before issuing
certbot certonly --dns-route53 --dry-run -d test.dev.cwiq.io

If list-hosted-zones fails with AccessDenied, update the IAM instance role policy in Terraform (terraform-plan/organization/environments/shared-services/).
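As a rough sketch, the zone check can be scripted so that a missing zone is reported by name. The helper name missing_zones is hypothetical and not part of the cert-server playbooks. It only compares text, so the aws call appears in the usage comment.

```shell
# missing_zones ZONE...: read hosted zone names on stdin (one per line) and
# print every expected ZONE that is absent. Hypothetical helper.
missing_zones() {
  # Route 53 returns zone names with a trailing dot; strip it before comparing.
  zones="$(sed 's/\.$//')"
  for want in "$@"; do
    printf '%s\n' "$zones" | grep -qxF -- "$want" || echo "$want"
  done
}

# Usage on the cert-server (prints nothing when both zones are visible):
#   aws route53 list-hosted-zones --query 'HostedZones[].Name' --output text \
#     | tr '\t' '\n' | missing_zones shared.cwiq.io dev.cwiq.io
```

An empty result with a successful aws call suggests the role can see both zones. Any printed name points at the zone to check in the IAM policy.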

Certificate Already Exists

If you receive an error that the certificate already exists, check the current cert status:

certbot certificates | grep -A6 <domain>

If the cert is valid and not near expiry, no action is needed. If it is corrupted or missing files:

# Remove corrupted cert and re-issue
sudo certbot delete --cert-name <domain>
ansible-playbook ssl-issue-all.yml

Rate Limit Hit

Let's Encrypt enforces rate limits, including a cap of 5 duplicate certificates per week for the same exact set of hostnames. If you hit this limit:

  • Wait for the rate limit window to reset (up to 7 days)
  • Use the dry-run flag for testing; dry runs go to the Let's Encrypt staging environment and do not count against production limits: certbot certonly --dry-run --dns-route53 -d <domain>
  • Check certbot certificates to confirm whether a valid cert already exists before re-issuing
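To avoid spending an issuance on a certificate that is still valid, the remaining lifetime can be read from the certbot certificates output before re-issuing. This is a minimal sketch. The function name valid_days is hypothetical, and it assumes the "Certificate Name:" and "Expiry Date: ... (VALID: N days)" layout that certbot prints.

```shell
# valid_days NAME: read `certbot certificates` output on stdin and print the
# remaining validity in days for the certificate named NAME.
# Prints nothing if the cert is missing or no "VALID: N days" marker is found.
valid_days() {
  awk -v name="$1" '
    $1 == "Certificate" && $2 == "Name:" { current = $3 }
    current == name && /Expiry Date:/ && match($0, /VALID: [0-9]+ day/) {
      # Skip the 7 characters of "VALID: " and drop the trailing " day".
      print substr($0, RSTART + 7, RLENGTH - 11)
      exit
    }
  '
}

# Usage on the cert-server:
#   sudo certbot certificates 2>/dev/null | valid_days test.dev.cwiq.io
```

An empty result means the certificate is either absent or not reported as valid, which is the case where re-issuing is worth the rate-limit cost.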

Deployment Failures

Tailscale Connectivity Lost

The cert-server deploys certificates over SSH via Tailscale. If a target host drops off the tailnet, deployment fails for that host.

# Check Tailscale status on the cert-server
tailscale status

# Ping the target host
tailscale ping gitlab-dev-cwiq-io
tailscale ping authentik-shared-cwiq-io-1

# Test SSH directly
ssh -i ~/.ssh/cwiq-ansible ec2-user@gitlab-dev-cwiq-io "hostname"

If the host is not reachable on Tailscale, run tailscale status on the target server and restart the Tailscale daemon if needed (for example, sudo systemctl restart tailscaled on systemd hosts).
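When several targets fail at once, the ping step can be looped over the affected hosts. The sketch below uses a hypothetical helper named check_hosts and assumes that tailscale ping exits non-zero when no reply arrives.

```shell
# check_hosts HOST...: send one Tailscale ping to each host and print its state.
# Returns 1 if any host did not answer, 0 otherwise.
check_hosts() {
  failed=0
  for host in "$@"; do
    if tailscale ping --c 1 "$host" >/dev/null 2>&1; then
      echo "ok   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}

# Usage on the cert-server:
#   check_hosts gitlab-dev-cwiq-io authentik-shared-cwiq-io-1
```

Hosts marked FAIL are the ones to inspect on the target side before re-running the deploy playbook.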

Nexus Shared Hostname Workaround

Deployments to nexus-shared-cwiq-io target its Tailscale IP (100.67.249.34) rather than its hostname. If ssl-deploy-nexus.yml fails for the shared instance, verify that the Tailscale IP is still correct: tailscale status | grep nexus-shared.

SSH Permission Denied

If ssh -i ~/.ssh/cwiq-ansible ec2-user@<host> fails:

# Verify the key exists
ls -la ~/.ssh/cwiq-ansible

# Check known_hosts
ssh-keygen -F <hostname>

# If host key changed (e.g., after EC2 instance replacement)
ssh-keygen -R <hostname>
ssh -i ~/.ssh/cwiq-ansible ec2-user@<hostname>  # Accept new host key
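One more cause worth ruling out: OpenSSH refuses a private key that is readable by group or others. The following sketch checks the key's mode. The helper name key_mode_ok is hypothetical, and it assumes GNU stat as found on typical Linux hosts.

```shell
# key_mode_ok FILE: succeed if FILE grants no permissions to group or others.
# Returns 1 with a hint on stderr if the mode is too open, 2 if stat fails.
key_mode_ok() {
  mode="$(stat -c '%a' "$1")" || return 2   # octal mode, e.g. 600
  case "$mode" in
    [0-7]00) return 0 ;;
    *) echo "$1 has mode $mode; run: chmod 600 $1" >&2
       return 1 ;;
  esac
}

# Usage:
#   key_mode_ok ~/.ssh/cwiq-ansible && echo "key permissions look fine"
```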

Cert Files Missing After Deployment

If the deploy playbook reports success but the cert files are not on the target server:

ssh ec2-user@<hostname>
ls -la /data/ssl/<domain>/

Check that /data/ssl/ exists and the deploy user has write access. The playbook creates the directory if it does not exist, but the parent /data/ mount point must be present.

Service Not Using New Certificate

After a certificate is deployed, the service must reload to pick up the new files. This is handled by the reload_command in inventory.yml. If the reload did not happen or failed, run it manually:

GitLab

# Dev
ssh ec2-user@gitlab-dev-cwiq-io \
  "sudo -u gitlab docker exec gitlab gitlab-ctl hup nginx"

# Shared
ssh ec2-user@gitlab-shared-cwiq-io \
  "sudo -u gitlab docker exec gitlab gitlab-ctl hup nginx"

Orchestrator

# Dev
ssh cwiq@orchestrator-dev-cwiq-io "docker restart orchestrator-nginx"

# Demo
ssh cwiq@orchestrator-demo-cwiq-io "docker restart orchestrator-nginx"

Vault

ssh ec2-user@vault-shared-cwiq-io \
  "sudo -u vault docker compose -f /data/vault/docker-compose.yml restart"

After a restart, verify the certificate:

openssl s_client -connect vault.shared.cwiq.io:443 \
  -servername vault.shared.cwiq.io < /dev/null 2>/dev/null | \
  openssl x509 -noout -subject -dates

Authentik

ssh ec2-user@authentik-shared-cwiq-io-1 \
  "sudo -u authentik docker compose -f /data/authentik/docker-compose.yml restart server"

ssh ec2-user@authentik-shared-cwiq-io-2 \
  "sudo -u authentik docker compose -f /data/authentik/docker-compose.yml restart server"

Nginx-Based Services (Grafana, Prometheus, OpenLDAP, etc.)

ssh ec2-user@grafana-shared-cwiq-io "docker restart grafana-nginx"
ssh ec2-user@prometheus-shared-cwiq-io "docker restart prometheus-nginx"
ssh ec2-user@openldap-shared-cwiq-io "docker restart openldap-nginx"
ssh ec2-user@sonarqube-shared-cwiq-io "docker restart sonarqube-nginx"
ssh ec2-user@defectdojo-shared-cwiq-io "docker restart defectdojo-nginx"
ssh ec2-user@reportportal-shared-cwiq-io "docker restart reportportal-nginx"
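After any of the manual reloads above, it helps to confirm that the service is actually serving the certificate held on the cert-server. The sketch below compares SHA-256 fingerprints. The function names are hypothetical, and it assumes the issued certificate lives under /etc/letsencrypt/live/<domain>/ on the cert-server and that the service listens on port 443.

```shell
# Fingerprint of the certificate a host currently serves on port 443.
served_fp() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -fingerprint -sha256
}

# Fingerprint of the leaf certificate certbot holds on the cert-server.
issued_fp() {
  sudo openssl x509 -in "/etc/letsencrypt/live/$1/fullchain.pem" \
    -noout -fingerprint -sha256
}

# cert_is_live DOMAIN: succeed only if both fingerprints exist and match.
cert_is_live() {
  served="$(served_fp "$1")"
  issued="$(issued_fp "$1")"
  [ -n "$served" ] && [ "$served" = "$issued" ]
}

# Usage on the cert-server:
#   cert_is_live vault.shared.cwiq.io || echo "service still serves another cert"
```

A mismatch after a reload usually means the reload did not take effect or the deploy step did not copy the new files to the target.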

Vault: Private Key Permission Error

Vault's container runs as UID 100, which differs from the vault OS user (UID 1001). If the Vault container cannot read privkey.pem, the deploy playbook may not have set the correct ownership.

ssh ec2-user@vault-shared-cwiq-io
ls -la /data/ssl/vault.shared.cwiq.io/
# privkey.pem must be owned by UID 100

# Fix if needed
sudo chown 100:100 /data/ssl/vault.shared.cwiq.io/privkey.pem
sudo chmod 640 /data/ssl/vault.shared.cwiq.io/privkey.pem

# Then restart Vault
sudo -u vault docker compose -f /data/vault/docker-compose.yml restart

The ssl-deploy-vault.yml playbook sets ownership to 100:100 automatically. If you deploy with ssl-deploy-all.yml instead, check that the Vault-specific ownership logic also runs.

Docker: SSL Files Mounted as Directories

If Docker created directory stubs for the SSL files before the cert was deployed, the container will see directories instead of files. This typically happens when a container is started before the cert files exist.

ssh ec2-user@<hostname>
ls -la /data/ssl/<domain>/
# If fullchain.pem or privkey.pem is listed as a directory (mode string starts with d), remove the stub and redeploy

sudo rm -rf /data/ssl/<domain>/fullchain.pem /data/ssl/<domain>/privkey.pem

Then redeploy the certificate and force-recreate the container:

# Redeploy cert
ansible-playbook -i inventory.yml ssl-deploy-<service>.yml

# Force-recreate the container that mounts the SSL files
ssh ec2-user@<hostname>
cd /data/<service>
sudo docker compose -f docker-compose.yml up -d --force-recreate <nginx-container>
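The directory check can also be scripted, which is handy for confirming the stubs are gone before recreating the container. A minimal sketch follows. The helper name find_dir_stubs is hypothetical, and the two file names are the ones used in this section.

```shell
# find_dir_stubs DIR: print each expected PEM path under DIR that is a
# directory instead of a regular file. Returns 1 if any stub is found.
find_dir_stubs() {
  found=0
  for name in fullchain.pem privkey.pem; do
    if [ -d "$1/$name" ]; then
      echo "$1/$name"
      found=1
    fi
  done
  return "$found"
}

# Usage on the target host:
#   find_dir_stubs /data/ssl/<domain> || echo "remove the stubs listed above"
```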

ACM Troubleshooting

ACM Certificate Not Importing

# Verify the cert exists on the cert-server
ls -la /etc/letsencrypt/live/<domain>/

# If missing, issue it first
ansible-playbook ssl-issue-all.yml

# Then import
ansible-playbook -i inventory.yml acm-import.yml -e "cert_domain=<domain>"

ALB Still Showing Old Certificate

ACM certificate updates take effect immediately on new TLS handshakes, but existing connections use the old cert until they terminate. If the ALB appears to serve an old cert:

# Verify the import actually updated ACM
aws acm list-certificates --region us-west-2 --profile shared-services \
  --query "CertificateSummaryList[?DomainName=='<domain>']"

aws acm describe-certificate \
  --certificate-arn <arn> \
  --region us-west-2 \
  --profile shared-services \
  --query "Certificate.{Status:Status,NotAfter:NotAfter,InUseBy:InUseBy}"

# Verify the ALB listener is using this certificate
aws elbv2 describe-listeners \
  --load-balancer-arn <alb_arn> \
  --region us-west-2 \
  --profile shared-services \
  --query "Listeners[*].Certificates"

If the certificate ARN in the ALB listener does not match the ACM certificate, update the ALB listener in Terraform.
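The ARN comparison can be done mechanically rather than by eye. In this sketch, listener_uses_cert is a hypothetical helper that reads the listener certificate ARNs from stdin. The --query expression in the usage comment is one way to extract them.

```shell
# listener_uses_cert ARN: read certificate ARNs on stdin (tab, space, or
# newline separated) and succeed if ARN is one of them.
listener_uses_cert() {
  tr '\t ' '\n\n' | grep -qxF -- "$1"
}

# Usage (placeholders as in the commands above):
#   aws elbv2 describe-listeners \
#     --load-balancer-arn <alb_arn> \
#     --region us-west-2 \
#     --profile shared-services \
#     --query 'Listeners[].Certificates[].CertificateArn' --output text \
#     | listener_uses_cert <arn> || echo "listener points at a different cert"
```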

ACM Import Fails: IAM Permissions

If acm-import.yml fails with an access denied error:

# Verify the cert-server instance role has ACM permissions
aws sts get-caller-identity  # Run on cert-server

aws acm list-certificates --region us-west-2
# If this fails with AccessDenied, the IAM role is missing ACM permissions

Update the IAM instance role policy in terraform-plan/organization/environments/shared-services/.

Renewal Timer Not Running

# Check if the timer is active
systemctl status ssl-renew-deploy.timer
systemctl list-timers ssl-renew-deploy.timer

# If timer is inactive, re-run setup
cd /data/ansible/cwiq-ansible-playbooks/cert-server
ansible-playbook setup-auto-renewal.yml

# Check that both unit files exist
ls /etc/systemd/system/ssl-renew-deploy.{timer,service}

# Check renewal logs for recent activity
tail -50 /var/log/ssl-renewal.log
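The timer checks above can be condensed into a single state summary with a suggested next step. This is a sketch only. The function names timer_state and timer_hint are hypothetical, and the hints are suggestions rather than the playbook's own logic.

```shell
# timer_state UNIT: print "<enabled-state>/<active-state>" for a systemd unit.
# systemctl exits non-zero for disabled or inactive units, hence "|| true".
timer_state() {
  enabled="$(systemctl is-enabled "$1" 2>/dev/null || true)"
  active="$(systemctl is-active "$1" 2>/dev/null || true)"
  echo "${enabled:-unknown}/${active:-unknown}"
}

# timer_hint UNIT: map the state to a suggested next step.
timer_hint() {
  case "$(timer_state "$1")" in
    enabled/active) echo "ok" ;;
    enabled/*)      echo "run: sudo systemctl start $1" ;;
    disabled/*)     echo "run: sudo systemctl enable --now $1" ;;
    *)              echo "unit state unclear; re-run setup-auto-renewal.yml" ;;
  esac
}

# Usage on the cert-server:
#   timer_hint ssl-renew-deploy.timer
```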

Certificate Expiry Check

To check all managed certificates at once:

# On cert-server
certbot certificates

# Check a specific cert remotely
openssl s_client -connect <domain>:443 -servername <domain> < /dev/null 2>/dev/null | \
  openssl x509 -noout -subject -dates

# Examples:
openssl s_client -connect vault.shared.cwiq.io:443 -servername vault.shared.cwiq.io < /dev/null 2>/dev/null | openssl x509 -noout -dates
openssl s_client -connect sso.shared.cwiq.io:443 -servername sso.shared.cwiq.io < /dev/null 2>/dev/null | openssl x509 -noout -dates

Quick Reference: Symptom to Action

| Symptom | Most Likely Cause | First Step |
|---|---|---|
| certbot certonly returns AccessDenied | IAM role missing Route53 permissions | aws route53 list-hosted-zones to confirm |
| Deploy playbook fails on one host | Tailscale connectivity | tailscale ping <host> |
| Service still shows old cert after deploy | Reload command failed | Restart the nginx/service container manually |
| Vault: permission denied reading privkey.pem | UID mismatch (vault container is UID 100) | chown 100:100 privkey.pem on the target |
| SSL file mounted as directory in container | Container started before cert deployed | Remove directory stubs, redeploy, --force-recreate |
| ACM shows cert but ALB shows old cert | Listener not pointing to ACM cert | Check ALB listener certificate ARN in describe-listeners |
| Timer exists but not running | systemd unit disabled | systemctl enable --now ssl-renew-deploy.timer |