Operations & Emergency

Day-to-day Vault operations: health checks, audit logs, secret rotation, and emergency procedures for sealed Vault, lost tokens, and compromised secrets.


Health Checks

# Full status check
curl -s https://vault.shared.cwiq.io/v1/sys/health | jq

# HTTP response codes:
# 200 — Initialized, unsealed, active (normal)
# 429 — Unsealed, standby
# 501 — Not initialized
# 503 — Sealed
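
For monitoring scripts, the status codes above can be turned into a readable state. The following is a minimal sketch, not part of Vault: the helper name describe_health_code is hypothetical, and it only knows the four codes listed above, treating anything else as unknown.

```shell
# describe_health_code is a hypothetical helper, not a Vault command.
# It maps the sys/health HTTP status codes listed above to a short label.
describe_health_code() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    501) echo "not initialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown ($1)"; return 1 ;;
  esac
}

# Example (not run here): fetch only the status code, then describe it.
# code="$(curl -s -o /dev/null -w '%{http_code}' https://vault.shared.cwiq.io/v1/sys/health)"
# describe_health_code "$code"
```

The non-zero return for unknown codes lets a caller treat anything unexpected as a failure.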

# Container status on vault-shared-cwiq-io
ssh ec2-user@vault-shared-cwiq-io \
  "sudo -u vault docker compose -f /data/vault/docker-compose.yml ps"

# Live container logs
ssh ec2-user@vault-shared-cwiq-io \
  "sudo -u vault docker compose -f /data/vault/docker-compose.yml logs -f"

Audit Logs

Every Vault request and response is recorded in /data/vault/logs/vault-audit.log, one JSON object per line.

ssh ec2-user@vault-shared-cwiq-io

# Live log stream
sudo tail -f /data/vault/logs/vault-audit.log | jq

# Search for access to a specific path
sudo grep "secret/data/cwiq" /data/vault/logs/vault-audit.log | jq

# View last 100 entries
sudo tail -100 /data/vault/logs/vault-audit.log | jq

Each entry includes the timestamp, the requesting identity, the operation (read/write/delete), and the secret path. By default the audit device hashes secret values with HMAC-SHA256 before logging them, so plaintext values never appear in the audit log.


Secret Rotation

Rotate a Secret

# Create a new version (KV v2 — previous version is retained)
vault kv put secret/cwiq/shared/<app>/database \
  pg_password="<new-password>" \
  pg_host="<same-host>" \
  pg_user="<same-user>"

# Verify new version
vault kv metadata get secret/cwiq/shared/<app>/database

Rotation Workflow

  1. Update the secret in Vault (creates a new version).
  2. Vault Agent re-renders the secret within vault_agent_render_interval (default: 5 minutes).
  3. If the application does not pick up new env vars at runtime, restart the application container.
  4. Verify the application is using the new credential.
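
Steps 2 to 4 can be partly automated by waiting for the rendered file to change before checking the application. This is a minimal sketch, assuming Vault Agent renders the secret to a file on disk. The function name wait_for_render and the example path are hypothetical.

```shell
# wait_for_render is a hypothetical helper: it polls a file's checksum and
# returns 0 as soon as the content changes, 1 on timeout, 2 if unreadable.
# Usage: wait_for_render <file> [timeout_seconds] [poll_seconds]
wait_for_render() {
  local file="$1" timeout="${2:-300}" poll="${3:-5}"
  local before after waited=0
  before="$(cksum < "$file")" || return 2
  while [ "$waited" -lt "$timeout" ]; do
    sleep "$poll"
    waited=$((waited + poll))
    after="$(cksum < "$file")" || return 2
    if [ "$after" != "$before" ]; then
      return 0   # content changed: the agent re-rendered the file
    fi
  done
  return 1       # no change within the timeout
}

# Example (not run here; the path is a placeholder):
# wait_for_render /path/to/rendered.env 300 5 \
#   || echo "no re-render within 5 minutes; restart the application container"
```

The default timeout of 300 seconds matches the 5-minute render interval mentioned in step 2.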

RDS Backups

Vault's storage is RDS PostgreSQL (vault-shared-storage). Automated backups are enabled.

# List RDS snapshots
aws rds describe-db-snapshots \
  --profile shared-services \
  --db-instance-identifier vault-shared-storage \
  --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
  --output table

# Create a manual snapshot before any risky operation
aws rds create-db-snapshot \
  --profile shared-services \
  --db-instance-identifier vault-shared-storage \
  --db-snapshot-identifier vault-manual-$(date +%Y%m%d)
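
RDS snapshot identifiers must be unique, so the date-only identifier above works once per day. A possible variation is sketched below. It assumes a time suffix is acceptable in your naming scheme, and the helper name snapshot_id is hypothetical.

```shell
# snapshot_id is a hypothetical helper. It keeps the "vault-manual-" prefix
# used above and appends a UTC date and time so repeated runs do not collide.
snapshot_id() {
  printf 'vault-manual-%s\n' "$(date -u +%Y%m%d-%H%M%S)"
}

# Example (not run here): create the snapshot, then block until it is usable.
# id="$(snapshot_id)"
# aws rds create-db-snapshot --profile shared-services \
#   --db-instance-identifier vault-shared-storage \
#   --db-snapshot-identifier "$id"
# aws rds wait db-snapshot-available --profile shared-services \
#   --db-snapshot-identifier "$id"
```

Waiting for the snapshot to become available before starting the risky operation avoids relying on a backup that is still being written.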

Emergency: Vault is Sealed

With AWS KMS auto-unseal, Vault unseals itself on startup. If it stays sealed, work through the following steps:

Step 1: Check the seal status and KMS access

ssh ec2-user@vault-shared-cwiq-io
sudo docker logs vault 2>&1 | grep -i kms

If KMS errors appear, the EC2 IAM role may have lost its kms:Decrypt permission, or the KMS key may have been disabled.

Step 2: Verify KMS connectivity

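# Any HTTP status code means the endpoint is reachable; a curl error points to a network or DNS problem
curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 https://kms.us-west-2.amazonaws.com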

Step 3: Restart the Vault container

sudo -u vault docker compose -f /data/vault/docker-compose.yml restart

Vault should unseal within 30 seconds after restart if KMS is reachable.
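
To confirm the unseal without watching the logs, the health endpoint can be polled for up to 30 seconds. This is a sketch under assumptions: the function names health_code and wait_unsealed are hypothetical, and VAULT_ADDR defaults to the address used elsewhere on this page.

```shell
VAULT_ADDR="${VAULT_ADDR:-https://vault.shared.cwiq.io}"

# Print the HTTP status of sys/health ("000" if the request itself failed).
health_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$VAULT_ADDR/v1/sys/health"
}

# Poll until Vault reports unsealed (200 active or 429 standby),
# or give up after the deadline (default 30 seconds).
wait_unsealed() {
  local deadline="${1:-30}" waited=0 code
  while [ "$waited" -lt "$deadline" ]; do
    code="$(health_code)"
    case "$code" in
      200|429) echo "unsealed (HTTP $code) after ${waited}s"; return 0 ;;
    esac
    sleep 2
    waited=$((waited + 2))
  done
  echo "still not unsealed after ${deadline}s (last HTTP $code)" >&2
  return 1
}

# Example (not run here): call right after the restart in Step 3.
# wait_unsealed 30
```

If the function times out with HTTP 503, go back to Step 1 and check the logs for KMS errors.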


Emergency: Revoke a Compromised Token

# Revoke a specific token
vault token revoke <token>

# Revoke by accessor (if token value is unknown)
vault token revoke -accessor <accessor>

# List all active token accessors (requires root)
vault list auth/token/accessors

# Revoke all AppRole tokens (emergency — breaks all sidecars)
vault token revoke -mode=path auth/approle/

After revoking AppRole tokens, the affected application sidecars will stop working. Redeploy the service after rotating the secret-id:

vault write -f auth/approle/role/<app>/secret-id
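
The command above prints the new secret-id to the terminal. During an incident it may be preferable to write it straight to a protected file. This is a minimal sketch, assuming the deployment reads the secret-id from a file. The helper name issue_secret_id and the destination path are hypothetical.

```shell
# issue_secret_id is a hypothetical helper: it generates a new secret-id for an
# AppRole and writes only that field to a file, without echoing it on screen.
issue_secret_id() {
  local app="$1" dest="$2"
  if [ -z "$app" ] || [ -z "$dest" ]; then
    echo "usage: issue_secret_id <app> <dest-file>" >&2
    return 2
  fi
  (
    umask 077   # a newly created file gets mode 0600
    vault write -f -field=secret_id "auth/approle/role/${app}/secret-id" > "$dest"
  )
}

# Example (not run here; the destination path is a placeholder):
# issue_secret_id <app> /path/to/secret-id
```

The subshell keeps the umask change from leaking into the rest of the session.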

Emergency: Regenerate Root Token

If the root token is lost or compromised, a quorum of recovery key holders can generate a new one:

# Start the recovery process
vault operator generate-root -init
# Returns a Nonce and OTP — save both

# Each recovery key holder runs (requires recovery key threshold)
vault operator generate-root -nonce=<nonce>
# Enter recovery key when prompted

# After the threshold is reached, decode the encoded token:
vault operator generate-root -decode=<encoded-token> -otp=<otp>

Revoke the root token after use

Once the root token has been used to complete the emergency task, revoke it immediately:

vault token revoke <root-token>


Emergency: Compromised Secret

If a secret (password, API token, etc.) may have been exposed:

  1. Immediately rotate the secret in the upstream system (e.g., reset the database password, revoke the API token).
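  2. Update Vault with the new value. vault kv put replaces the whole secret, so include every key it should contain, not only the changed one: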
    vault kv put secret/<path> <key>=<new-value>
    
  3. Restart affected application containers to force Vault Agent to re-render.
  4. Check the audit log to determine when and by whom the secret was accessed:
    ssh ec2-user@vault-shared-cwiq-io
    sudo grep "<secret-path>" /data/vault/logs/vault-audit.log | jq '{time: .time, auth: .auth.display_name, op: .request.operation}'
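
To narrow step 4 to the exposure window, the audit entries can be filtered by path prefix and start time. This is a sketch under assumptions: it expects each entry to carry a top-level type field ("request" or "response") and a UTC RFC 3339 time field, so that plain string comparison orders timestamps correctly. The function name audit_access_since is hypothetical.

```shell
# audit_access_since is a hypothetical helper. It reads audit log lines on
# stdin and prints one compact JSON object per matching response entry.
# Usage: audit_access_since <path-prefix> <since, e.g. 2024-05-01T00:00:00Z>
audit_access_since() {
  local path="$1" since="$2"
  jq -c --arg path "$path" --arg since "$since" '
    select(.type == "response"
           and ((.request.path // "") | startswith($path))
           and .time >= $since)
    | {time, who: .auth.display_name, op: .request.operation, path: .request.path}'
}

# Example (not run here):
# sudo cat /data/vault/logs/vault-audit.log \
#   | audit_access_since "<secret-path>" "<start-of-window-utc>"
```

Selecting only response entries avoids counting each access twice, since every request is logged together with its response.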
    

RDS Point-in-Time Recovery

If Vault's storage backend needs to be restored:

  1. Go to RDS → Databases → vault-shared-storage in the AWS Console (shared-services account).
  2. Actions → Restore to point in time.
  3. Select the recovery time point.
  4. Launch the recovery instance.
  5. Update Vault's configuration to point to the new RDS endpoint:
    # Update group_vars/all.yml on the ansible server, then redeploy
    cd vault-server
    ansible-playbook -i inventory/shared.yml setup.yml
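
Steps 1 to 4 can also be done from the AWS CLI instead of the console. This is a minimal sketch, assuming the same shared-services profile used above. The function name restore_vault_db and the target instance name are hypothetical, and options such as the subnet group and security groups are omitted and may need to be set for the restored instance to be reachable.

```shell
# restore_vault_db is a hypothetical helper: it restores vault-shared-storage
# to a NEW instance at the given UTC time. The source instance is not modified.
# Usage: restore_vault_db <target-instance-id> <restore-time, e.g. 2024-05-01T09:30:00Z>
restore_vault_db() {
  local target="$1" when="$2"
  if [ -z "$target" ] || [ -z "$when" ]; then
    echo "usage: restore_vault_db <target-instance-id> <restore-time-utc>" >&2
    return 2
  fi
  aws rds restore-db-instance-to-point-in-time \
    --profile shared-services \
    --source-db-instance-identifier vault-shared-storage \
    --target-db-instance-identifier "$target" \
    --restore-time "$when"
}

# Example (not run here; both arguments are placeholders):
# restore_vault_db <target-instance-id> <restore-time-utc>
```

Once the new instance is available, continue with step 5 to point Vault at its endpoint.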
    

Prometheus Metrics

Vault exposes metrics at https://vault.shared.cwiq.io/v1/sys/metrics?format=prometheus.

Metric                      Description
vault_core_unsealed         1 if unsealed, 0 if sealed
vault_token_count           Number of active tokens
vault_secret_kv_count       Number of secrets in KV store
vault_runtime_alloc_bytes   Memory allocation
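
For a quick check without a Prometheus server, a single metric can be pulled out of the text output. This is a minimal sketch, assuming each sample line ends with the value (no trailing timestamp). The helper name metric_value is hypothetical.

```shell
# metric_value is a hypothetical helper: it reads Prometheus text format on
# stdin and prints the value of the first sample whose name matches exactly.
metric_value() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                 # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)     # drop the label set, if any
      if (metric == name) { print $NF; exit }
    }'
}

# Example (not run here): prints 1 when Vault is unsealed.
# curl -s -H "X-Vault-Token: $VAULT_TOKEN" \
#   "https://vault.shared.cwiq.io/v1/sys/metrics?format=prometheus" \
#   | metric_value vault_core_unsealed
```

Unless unauthenticated metrics access is enabled on the listener, the request needs a token, which is why the example passes X-Vault-Token.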