SSL: Troubleshooting
Diagnosis and resolution steps for SSL certificate issuance failures, deployment failures, services not picking up renewed certs, and ACM issues.
Certificate Issuance Failures
DNS-01 Challenge Fails
Certbot uses Route53 to create TXT records for DNS-01 validation. If the cert-server's IAM role does not have Route53 permissions, or the hosted zone is not accessible, issuance will fail.
# Verify IAM role has Route53 permissions
aws sts get-caller-identity # Run on cert-server; confirms instance role
aws route53 list-hosted-zones # Should list shared.cwiq.io and dev.cwiq.io zones
# Test with a dry run before issuing
certbot certonly --dns-route53 --dry-run -d test.dev.cwiq.io
If list-hosted-zones fails with AccessDenied, update the IAM instance role policy in Terraform (terraform-plan/organization/environments/shared-services/).
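The zone check above can be scripted so that a missing zone is reported by name. This is a minimal sketch, assuming the AWS CLI is configured on the cert-server; missing_zones is a hypothetical helper, not part of the playbooks, and the zone names are the two mentioned in the comment above.

```shell
# missing_zones: read hosted zone names on stdin (whitespace-separated, as the
# AWS CLI prints them with --output text) and print every expected zone that
# is absent. Route53 returns names with a trailing dot, which is stripped.
missing_zones() (
    zones=$(tr -s ' \t' '\n\n' | sed 's/\.$//')
    for expected in "$@"; do
        printf '%s\n' "$zones" | grep -Fxq "$expected" || printf '%s\n' "$expected"
    done
)

# Usage on the cert-server (prints nothing when both zones are visible):
#   aws route53 list-hosted-zones --query 'HostedZones[].Name' --output text \
#     | missing_zones shared.cwiq.io dev.cwiq.io
```

An empty result only shows that the zones are listable; the dry run is still the check that record changes are permitted.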
Certificate Already Exists
If Certbot reports that the certificate already exists, run certbot certificates on the cert-server to check the current cert status.
If the cert is valid and not near expiry, no action is needed. If it is corrupted or missing files:
# Remove corrupted cert and re-issue
sudo certbot delete --cert-name <domain>
ansible-playbook ssl-issue-all.yml
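Before deleting anything, it can help to confirm which files are actually missing or empty. The sketch below assumes Certbot's standard live directory layout (cert.pem, chain.pem, fullchain.pem, privkey.pem); broken_cert_files is a hypothetical helper name.

```shell
# broken_cert_files: print the name of each expected PEM file in DIR that is
# missing, empty, or not a regular file.
broken_cert_files() (
    dir=$1
    for f in cert.pem chain.pem fullchain.pem privkey.pem; do
        # -f and -s follow symlinks, so a dangling link in live/ is reported too.
        if [ ! -f "$dir/$f" ] || [ ! -s "$dir/$f" ]; then
            printf '%s\n' "$f"
        fi
    done
)

# Usage (as root, since the live directory is normally not world-readable):
#   broken_cert_files /etc/letsencrypt/live/<domain>
```

If the function prints nothing, the files are present and the delete and re-issue steps are probably unnecessary.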
Rate Limit Hit
Let's Encrypt enforces rate limits, including 5 duplicate certificates per week for the same exact set of hostnames. If you hit this limit:
- Wait for the rate limit window to reset (up to 7 days)
- Use the dry-run flag for testing: certbot certonly --dns-route53 --dry-run -d <domain> (dry runs go to the Let's Encrypt staging environment and do not count against production limits)
- Check certbot certificates to confirm whether a valid cert already exists before re-issuing
Deployment Failures
Tailscale Connectivity Lost
The cert-server deploys certificates over SSH via Tailscale. If a target host drops off the tailnet, deployment fails for that host.
# Check Tailscale status on the cert-server
tailscale status
# Ping the target host
tailscale ping gitlab-dev-cwiq-io
tailscale ping authentik-shared-cwiq-io-1
# Test SSH directly
ssh -i ~/.ssh/cwiq-ansible ec2-user@gitlab-dev-cwiq-io "hostname"
If the host is not reachable on Tailscale, check Tailscale status on the target server and restart the Tailscale daemon if needed.
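When several hosts need checking, the ping step can be looped. This is a minimal sketch, assuming tailscale ping exits non-zero when no reply arrives within the timeout and that the --c and --timeout flags are available in the installed version; unreachable_hosts is a hypothetical helper.

```shell
# unreachable_hosts: print each host that does not answer a Tailscale ping.
unreachable_hosts() {
    for host in "$@"; do
        # --c 1 sends a single ping; --timeout bounds the wait for a reply.
        tailscale ping --c 1 --timeout 5s "$host" >/dev/null 2>&1 \
            || printf '%s\n' "$host"
    done
}

# Usage on the cert-server:
#   unreachable_hosts gitlab-dev-cwiq-io authentik-shared-cwiq-io-1
```

Any hostname printed is a candidate for the Tailscale daemon check on the target server.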
Nexus Shared hostname workaround
nexus-shared-cwiq-io is deployed using its Tailscale IP (100.67.249.34) rather than its hostname. If ssl-deploy-nexus.yml fails for the shared instance, verify the Tailscale IP is still correct: tailscale status | grep nexus-shared.
SSH Permission Denied
If ssh -i ~/.ssh/cwiq-ansible ec2-user@<host> fails:
# Verify the key exists
ls -la ~/.ssh/cwiq-ansible
# Check known_hosts
ssh-keygen -F <hostname>
# If host key changed (e.g., after EC2 instance replacement)
ssh-keygen -R <hostname>
ssh -i ~/.ssh/cwiq-ansible ec2-user@<hostname> # Accept new host key
Cert Files Missing After Deployment
If the deploy playbook reports success but the cert files are not on the target server:
Check that /data/ssl/ exists and the deploy user has write access. The playbook creates the directory if it does not exist, but the parent /data/ mount point must be present.
Service Not Using New Certificate
After a certificate is deployed, the service must reload to pick up the new files. This is handled by the reload_command in inventory.yml. If the reload did not happen or failed, run it manually:
GitLab
# Dev
ssh ec2-user@gitlab-dev-cwiq-io \
"sudo -u gitlab docker exec gitlab gitlab-ctl hup nginx"
# Shared
ssh ec2-user@gitlab-shared-cwiq-io \
"sudo -u gitlab docker exec gitlab gitlab-ctl hup nginx"
Orchestrator
# Dev
ssh cwiq@orchestrator-dev-cwiq-io "docker restart orchestrator-nginx"
# Demo
ssh cwiq@orchestrator-demo-cwiq-io "docker restart orchestrator-nginx"
Vault
ssh ec2-user@vault-shared-cwiq-io \
"sudo -u vault docker compose -f /data/vault/docker-compose.yml restart"
After a restart, verify the certificate:
openssl s_client -connect vault.shared.cwiq.io:443 \
-servername vault.shared.cwiq.io < /dev/null 2>/dev/null | \
openssl x509 -noout -subject -dates
Authentik
ssh ec2-user@authentik-shared-cwiq-io-1 \
"sudo -u authentik docker compose -f /data/authentik/docker-compose.yml restart server"
ssh ec2-user@authentik-shared-cwiq-io-2 \
"sudo -u authentik docker compose -f /data/authentik/docker-compose.yml restart server"
Nginx-Based Services (Grafana, Prometheus, OpenLDAP, etc.)
ssh ec2-user@grafana-shared-cwiq-io "docker restart grafana-nginx"
ssh ec2-user@prometheus-shared-cwiq-io "docker restart prometheus-nginx"
ssh ec2-user@openldap-shared-cwiq-io "docker restart openldap-nginx"
ssh ec2-user@sonarqube-shared-cwiq-io "docker restart sonarqube-nginx"
ssh ec2-user@defectdojo-shared-cwiq-io "docker restart defectdojo-nginx"
ssh ec2-user@reportportal-shared-cwiq-io "docker restart reportportal-nginx"
Vault: Private Key Permission Error
Vault's container runs as UID 100, which differs from the vault OS user (UID 1001). If the Vault container cannot read privkey.pem, the deploy playbook may not have set the correct ownership.
ssh ec2-user@vault-shared-cwiq-io
ls -la /data/ssl/vault.shared.cwiq.io/
# privkey.pem must be owned by UID 100
# Fix if needed
sudo chown 100:100 /data/ssl/vault.shared.cwiq.io/privkey.pem
sudo chmod 640 /data/ssl/vault.shared.cwiq.io/privkey.pem
# Then restart Vault
sudo -u vault docker compose -f /data/vault/docker-compose.yml restart
The ssl-deploy-vault.yml playbook sets ownership to 100:100 automatically. If you deploy with ssl-deploy-all.yml instead, check that the Vault-specific ownership logic runs.
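The ownership check can be made explicit instead of reading ls output by eye. This is a minimal sketch, assuming GNU stat (the -c option) on the target host; key_owner_ok is a hypothetical helper.

```shell
# key_owner_ok: succeed only if FILE exists and is owned by the numeric UID given.
key_owner_ok() {
    [ "$(stat -c %u "$1" 2>/dev/null)" = "$2" ]
}

# Usage on vault-shared-cwiq-io:
#   key_owner_ok /data/ssl/vault.shared.cwiq.io/privkey.pem 100 \
#     || echo "privkey.pem is missing or not owned by UID 100"
```

Comparing the numeric UID avoids depending on how the host maps UID 100 to a user name.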
Docker: SSL Files Mounted as Directories
If Docker created directory stubs for the SSL files before the cert was deployed, the container will see directories instead of files. This typically happens when a container is started before the cert files exist.
ssh ec2-user@<hostname>
ls -la /data/ssl/<domain>/
# If fullchain.pem or privkey.pem shows as a directory (mode string starts with d, e.g. drwxr-xr-x), remove and redeploy
sudo rm -rf /data/ssl/<domain>/fullchain.pem /data/ssl/<domain>/privkey.pem
Then redeploy the certificate and force-recreate the container:
# Redeploy cert
ansible-playbook -i inventory.yml ssl-deploy-<service>.yml
# Force-recreate the container that mounts the SSL files
ssh ec2-user@<hostname>
cd /data/<service>
sudo docker compose -f docker-compose.yml up -d --force-recreate <nginx-container>
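The directory-stub condition described above can be detected directly rather than inferred from the ls listing. A minimal sketch, assuming the two file names used in this section; dir_stubs is a hypothetical helper.

```shell
# dir_stubs: print the path of each SSL file in DIR that is a directory
# instead of a regular file.
dir_stubs() (
    dir=$1
    for f in fullchain.pem privkey.pem; do
        if [ -d "$dir/$f" ]; then
            printf '%s\n' "$dir/$f"
        fi
    done
)

# Usage on the target host:
#   dir_stubs /data/ssl/<domain>
```

Only the paths this prints need the rm -rf step; regular files can be left for the redeploy to overwrite.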
ACM Troubleshooting
ACM Certificate Not Importing
# Verify the cert exists on the cert-server
ls -la /etc/letsencrypt/live/<domain>/
# If missing, issue it first
ansible-playbook ssl-issue-all.yml
# Then import
ansible-playbook -i inventory.yml acm-import.yml -e "cert_domain=<domain>"
ALB Still Showing Old Certificate
ACM certificate updates take effect immediately on new TLS handshakes, but existing connections use the old cert until they terminate. If the ALB appears to serve an old cert:
# Verify the import actually updated ACM
aws acm list-certificates --region us-west-2 --profile shared-services \
--query "CertificateSummaryList[?DomainName=='<domain>']"
aws acm describe-certificate \
--certificate-arn <arn> \
--region us-west-2 \
--profile shared-services \
--query "Certificate.{Status:Status,NotAfter:NotAfter,InUseBy:InUseBy}"
# Verify the ALB listener is using this certificate
aws elbv2 describe-listeners \
--load-balancer-arn <alb_arn> \
--region us-west-2 \
--profile shared-services \
--query "Listeners[*].Certificates"
If the certificate ARN in the ALB listener does not match the ACM certificate, update the ALB listener in Terraform.
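The ARN comparison can be done mechanically. This is a minimal sketch, assuming the listener ARNs are extracted with a JMESPath query and --output text; listener_uses_cert is a hypothetical helper.

```shell
# listener_uses_cert: read listener certificate ARNs on stdin
# (whitespace-separated) and succeed only if the expected ARN is among them.
listener_uses_cert() {
    tr -s ' \t' '\n\n' | grep -Fxq "$1"
}

# Usage:
#   aws elbv2 describe-listeners --load-balancer-arn <alb_arn> \
#     --region us-west-2 --profile shared-services \
#     --query "Listeners[].Certificates[].CertificateArn" --output text \
#     | listener_uses_cert <arn> || echo "listener does not reference <arn>"
```

The match is exact and fixed-string, so a similar ARN from another account or region does not pass.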
ACM Import Fails: IAM Permissions
If acm-import.yml fails with an access denied error:
# Verify the cert-server instance role has ACM permissions
aws sts get-caller-identity # Run on cert-server
aws acm list-certificates --region us-west-2
# If this fails with AccessDenied, the IAM role is missing ACM permissions
Update the IAM instance role policy in terraform-plan/organization/environments/shared-services/.
Renewal Timer Not Running
# Check if the timer is active
systemctl status ssl-renew-deploy.timer
systemctl list-timers ssl-renew-deploy.timer
# If timer is inactive, re-run setup
cd /data/ansible/cwiq-ansible-playbooks/cert-server
ansible-playbook setup-auto-renewal.yml
# Check that the timer and service unit files exist
ls /etc/systemd/system/ssl-renew-deploy.{timer,service}
# Check renewal logs for recent activity
tail -50 /var/log/ssl-renewal.log
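The checks above can be folded into one function that suggests the next step. This is a sketch under the assumption that a missing unit should be recreated with setup-auto-renewal.yml and a disabled one re-enabled; timer_next_step is a hypothetical helper.

```shell
# timer_next_step: print a suggested action for a systemd timer unit.
timer_next_step() {
    unit=$1
    if ! systemctl cat "$unit" >/dev/null 2>&1; then
        echo "missing: re-run setup-auto-renewal.yml"
    elif ! systemctl is-enabled --quiet "$unit" 2>/dev/null; then
        echo "disabled: systemctl enable --now $unit"
    elif ! systemctl is-active --quiet "$unit" 2>/dev/null; then
        echo "inactive: systemctl start $unit"
    else
        echo "ok"
    fi
}

# Usage on the cert-server:
#   timer_next_step ssl-renew-deploy.timer
```

A result of ok only means the timer is scheduled; the renewal log still shows whether recent runs succeeded.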
Certificate Expiry Check
To check all managed certificates at once:
# On cert-server
certbot certificates
# Check a specific cert remotely
openssl s_client -connect <domain>:443 -servername <domain> < /dev/null 2>/dev/null | \
openssl x509 -noout -subject -dates
# Examples:
openssl s_client -connect vault.shared.cwiq.io:443 -servername vault.shared.cwiq.io < /dev/null 2>/dev/null | openssl x509 -noout -dates
openssl s_client -connect sso.shared.cwiq.io:443 -servername sso.shared.cwiq.io < /dev/null 2>/dev/null | openssl x509 -noout -dates
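To turn the notAfter date into a number that is easier to compare, the output of openssl x509 -noout -enddate can be piped through a small helper. This is a minimal sketch, assuming GNU date (the -d option); days_left is a hypothetical name.

```shell
# days_left: read a "notAfter=..." line on stdin and print the whole days
# remaining relative to NOW_EPOCH (first argument, default: current time).
# The result is negative once the certificate has expired.
days_left() (
    now=${1:-$(date +%s)}
    end=$(sed -n 's/^notAfter=//p' | head -n 1)
    [ -n "$end" ] || exit 1                      # no notAfter line on stdin
    end_epoch=$(date -u -d "$end" +%s) || exit 1 # unparseable date
    echo $(( (end_epoch - now) / 86400 ))
)

# Usage:
#   openssl s_client -connect <domain>:443 -servername <domain> </dev/null 2>/dev/null \
#     | openssl x509 -noout -enddate | days_left
```

The function fails rather than printing a number when no notAfter line arrives, so a failed TLS connection is not mistaken for a valid certificate.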
Quick Reference: Symptom to Action
| Symptom | Most Likely Cause | First Step |
|---|---|---|
| certbot certonly returns AccessDenied | IAM role missing Route53 permissions | aws route53 list-hosted-zones to confirm |
| Deploy playbook fails on one host | Tailscale connectivity | tailscale ping <host> |
| Service still shows old cert after deploy | Reload command failed | Restart the nginx/service container manually |
| Vault: permission denied reading privkey.pem | UID mismatch (vault container is UID 100) | chown 100:100 privkey.pem on the target |
| SSL file mounted as directory in container | Container started before cert deployed | Remove directory stubs, redeploy, --force-recreate |
| ACM shows cert but ALB shows old cert | Listener not pointing to ACM cert | Check ALB listener certificate ARN in describe-listeners |
| Timer exists but not running | systemd unit disabled | systemctl enable --now ssl-renew-deploy.timer |
Related Documentation
- SSL: Architecture — How the cert-server centralizes SSL management
- SSL: Renewal and Deployment — Renewal pipeline and manual deploy commands
- SSL: ACM Import — ACM-specific operations and troubleshooting
- SSL: Inventory — Per-host reload commands