
Service identity and secret rotation: a production runbook, not an isolated task

Build operable rotation for secrets, certificates, and application identities with dependency inventory, evidence, change windows, monitoring, and rollback.

04 Jun 2026. Tags: identity, automation, monitoring, runbook, logs, rollback, security, azure

Rotating an application secret looks simple as long as it is treated as a one-off operation: generate a new value, store it in a vault, restart the service. In production, generating the secret is rarely the hard part. The risk sits in undocumented dependencies, caches, scheduled jobs, SaaS connectors, CI identities, forgotten certificates, and applications that cannot reload configuration without redeployment.

The use case here is a platform where several applications consume secrets, certificates, or service identities: database access, internal APIs, cloud service principals, third-party webhooks, storage, message buses, or Terraform backends. The goal is not only to rotate a value before it expires. The runbook must prove who consumes what, how the new value is propagated, which signals confirm success, and how to roll back without reopening access too broadly.

Start with the real consumer inventory

A secret does not belong only to an application. It belongs to an execution path. The same credential may be used by a worker, an API, a nightly job, a CI pipeline, and a maintenance script. Until those consumers are listed, rotation remains a bet.

text secret-consumer-inventory.txt

Secret or credential
  Functional name
  Source location: Key Vault, CI variable, SaaS vault, encrypted file
  Owning identity
  Expiration date or rotation policy

Consumers
  Application service
  Scheduled job
  CI/CD pipeline
  Operations script
  External integration

Expected evidence
  Successful authentication log
  Stable error metric
  Targeted application test
  Previous value unused after cutover

The useful question is not “where is the secret stored?” but “which processes fail if this value changes now?” That distinction keeps the runbook from stopping at the vault when the incident is more likely to surface in a worker, scheduler, or pipeline.
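One way to make that question answerable is to keep the inventory as data rather than prose. The sketch below is a minimal illustration; the class names, the consumer names, the owning identity, and the `reload_mode` field are hypothetical and do not come from any vault API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Consumer:
    name: str
    kind: str         # "service", "scheduled-job", "ci-pipeline", "ops-script", "external"
    reload_mode: str  # "hot-reload", "restart", or "redeploy"


@dataclass
class SecretRecord:
    functional_name: str
    source_location: str
    owning_identity: str
    consumers: list[Consumer] = field(default_factory=list)


def impacted_processes(record: SecretRecord) -> list[str]:
    """Every process that authenticates with this secret."""
    return [c.name for c in record.consumers]


def needs_manual_cutover(record: SecretRecord) -> list[str]:
    """Consumers that keep the old value until they are restarted or redeployed."""
    return [c.name for c in record.consumers if c.reload_mode != "hot-reload"]


# Hypothetical inventory entry, for illustration only.
record = SecretRecord(
    functional_name="api-backend-client-secret",
    source_location="Key Vault kv-prod-app",
    owning_identity="sp-api-backend",
    consumers=[
        Consumer("orders-api", "service", "hot-reload"),
        Consumer("billing-worker", "service", "restart"),
        Consumer("nightly-export", "scheduled-job", "redeploy"),
    ],
)

print(impacted_processes(record))
print(needs_manual_cutover(record))
```

The second list is the one that tends to be missing from a change ticket: consumers that will not pick up the new value unless someone acts on them.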

Separate rotation, deployment, and revocation

A healthy rotation happens in three phases. First, the new value is created and made available. Then consumers switch to that value. Only after that is the old value revoked. Mixing those steps increases outage risk and makes diagnosis harder.

text rotation-phases.txt

Phase 1 - Prepare
  Create the new secret or certificate
  Store it in the target location
  Verify read permissions
  Do not revoke the old value

Phase 2 - Cut over
  Redeploy or reload consumers
  Validate real calls
  Watch authentication errors and latency
  Confirm the critical path uses the new value

Phase 3 - Revoke
  Confirm no usage of the old value remains
  Delete or disable the old value
  Keep evidence in the change ticket
  Update the next rotation date

This separation also gives the team a practical rollback. If cutover fails, the old value still exists. If revocation fails, consumers are already running on the new value.
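The ordering of the three phases can be enforced in tooling instead of relying on discipline alone. The following is a minimal sketch, assuming a hypothetical `Rotation` class that tracks which consumers have confirmed cutover; it is not tied to any real vault or deployment system.

```python
from enum import Enum


class Phase(Enum):
    PREPARED = 1
    CUT_OVER = 2
    REVOKED = 3


class Rotation:
    """Tracks one rotation and refuses to revoke before every consumer has cut over."""

    def __init__(self, consumers: set[str]) -> None:
        self.phase = Phase.PREPARED
        self.pending = set(consumers)  # consumers still on the old value

    def confirm_cutover(self, consumer: str) -> None:
        self.pending.discard(consumer)
        if not self.pending and self.phase is Phase.PREPARED:
            self.phase = Phase.CUT_OVER

    def revoke_old_value(self) -> None:
        if self.phase is not Phase.CUT_OVER:
            raise RuntimeError(
                f"cannot revoke in phase {self.phase.name}; pending: {sorted(self.pending)}"
            )
        self.phase = Phase.REVOKED


# Hypothetical consumers, for illustration only.
rotation = Rotation({"orders-api", "billing-worker"})
rotation.confirm_cutover("orders-api")
try:
    rotation.revoke_old_value()  # refused: billing-worker has not confirmed
except RuntimeError as exc:
    print(exc)
rotation.confirm_cutover("billing-worker")
rotation.revoke_old_value()
print(rotation.phase.name)
```

Keeping revocation as a separate call that can fail loudly is the point of the design: the old value stays available until the last consumer is confirmed.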

Verify permissions before the change window

Many rotations fail for a plain reason: the service that must read the new value lacks vault permissions, or the pipeline that injects it uses a different identity than the one tested manually. Readiness checks must therefore use the identity that will actually execute the change.

bash 01-check-secret-readiness.sh

# Azure Key Vault example: verify presence and expiration
az keyvault secret show \
  --vault-name kv-prod-app \
  --name api-backend-client-secret \
  --query '{name:name, enabled:attributes.enabled, expires:attributes.expires, updated:attributes.updated}'

# Verify from the real pipeline or workload identity.
# A manual test from an admin account does not prove the production path.

The critical step is documenting which identity ran the test. Success from an administration workstation does not prove that the CI runner, application pod, or managed service will be able to read the new version.
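The output of the readiness command can be turned into an evidence record that names the identity behind the test. This is a sketch under assumptions: the `check_readiness` helper and the identity name are hypothetical, and the sample JSON is illustrative rather than captured from a real vault. Its field names follow the `--query` projection used in the command above.

```python
import json
from datetime import datetime, timezone


def check_readiness(az_output: str, tested_by: str, window_end: datetime) -> dict:
    """Turn the JSON printed by the readiness command into an evidence record."""
    secret = json.loads(az_output)
    # A secret without an expiration date is treated as valid through the window.
    expires = datetime.fromisoformat(secret["expires"]) if secret["expires"] else None
    return {
        "secret": secret["name"],
        "tested_by": tested_by,  # the identity that actually ran the command
        "enabled": secret["enabled"] is True,
        "valid_through_window": expires is None or expires > window_end,
    }


# Illustrative sample, shaped like the projected output of the command above.
sample = json.dumps({
    "name": "api-backend-client-secret",
    "enabled": True,
    "expires": "2026-12-01T00:00:00+00:00",
    "updated": "2026-06-01T07:30:00+00:00",
})

window_end = datetime(2026, 6, 4, 12, 0, tzinfo=timezone.utc)
evidence = check_readiness(sample, tested_by="ci-runner-identity", window_end=window_end)
print(evidence)
```

Recording `tested_by` alongside the result lets a reviewer tell a check run from the pipeline identity apart from one run by an administrator.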

Look for residual usage in logs

Before revoking the old value, look for signs of residual usage. Depending on the platform, that signal may come from authentication logs, application logs, 401/403 errors, database connection metrics, or vault events.

kusto 02-detect-auth-failures-after-rotation.kql

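// Start of the change window (UTC); adjust for each rotation.
let RotationStart = datetime(2026-06-04T08:00:00Z);
AppServiceHTTPLogs
| where TimeGenerated > RotationStart
// 401 and 403 are direct authentication or authorization failures; a failed downstream credential can also surface as a 500.
| where ScStatus in (401, 403, 500)
| summarize errors=count() by bin(TimeGenerated, 5m), CsHost, CsUriStem, ScStatus
| order by TimeGenerated desc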

This query is not universal proof, but it illustrates the expected reflex: tie the rotation to a usable time-based signal. The runbook should say where to look, which window to observe, and which threshold stops revocation.
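The "threshold that stops revocation" can be written down as an explicit rule. The sketch below assumes per-bin error counts like the `errors` column produced by the query, a baseline measured before the change, and a hypothetical tolerance factor of 2; none of these numbers come from the source, and the right values depend on the service.

```python
def revocation_blocked(error_counts: list[int], baseline: float, tolerance: float = 2.0) -> bool:
    """Return True if any observed bin exceeds tolerance times the baseline.

    A floor of 1 error per bin keeps a zero baseline from blocking on a single stray error.
    """
    limit = max(baseline * tolerance, 1.0)
    return any(count > limit for count in error_counts)


# Hypothetical 5-minute bins observed after cutover; baseline is 3 errors per bin.
quiet_window = [2, 4, 3, 5]
noisy_window = [2, 4, 19]

print(revocation_blocked(quiet_window, baseline=3))
print(revocation_blocked(noisy_window, baseline=3))
```

Writing the rule in the runbook, even this crudely, replaces "errors look fine" with a decision that two operators would make the same way.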

Account for caches and long-running processes

A service may have reloaded the secret while keeping an old connection open. A worker may reload configuration only at startup. A monthly job may be absent from immediate tests. The runbook must distinguish immediate validation from delayed validation.

text delayed-consumers-checklist.txt

Delayed-risk consumers
  Workers and message queues
  Persistent database connections
  Infrequent batch jobs
  CI runners used only on demand
  External integrations that cache credentials

Additional evidence
  Controlled restart when needed
  Synthetic test after redeployment
  24h watch on authentication errors
  Follow-up ticket for jobs that cannot be replayed immediately

Without this distinction, the team may conclude too early that rotation succeeded. The real incident sometimes arrives at the next batch run, after the old value has already been removed.
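The split between immediate and delayed validation can be computed from each consumer's next scheduled run. This is a minimal sketch: the `split_validation` helper, the job names, and the schedule are hypothetical, and the 24-hour default mirrors the watch period in the checklist above.

```python
from datetime import datetime, timedelta, timezone


def split_validation(
    cutover: datetime,
    next_runs: dict[str, datetime],
    watch: timedelta = timedelta(hours=24),
) -> tuple[list[str], list[str]]:
    """Separate consumers that run inside the watch window from those needing a follow-up ticket."""
    deadline = cutover + watch
    in_window = sorted(name for name, run in next_runs.items() if run <= deadline)
    follow_up = sorted(name for name, run in next_runs.items() if run > deadline)
    return in_window, follow_up


# Hypothetical schedule, for illustration only.
cutover = datetime(2026, 6, 4, 8, 0, tzinfo=timezone.utc)
next_runs = {
    "nightly-export": datetime(2026, 6, 5, 2, 0, tzinfo=timezone.utc),
    "monthly-invoicing": datetime(2026, 7, 1, 3, 0, tzinfo=timezone.utc),
}

in_window, follow_up = split_validation(cutover, next_runs)
print("validated during watch:", in_window)
print("needs follow-up ticket:", follow_up)
```

Consumers in the second list are the ones that either need a controlled replay before revocation or an explicit ticket that survives the change window.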

Make rollback precise

Rollback should not mean “put it back as before” with no further detail. It must state which value can be reactivated, who is allowed to do it, how long the exception stays open, and which signal lets the team resume rotation. Otherwise, the recovery turns into a permanent exposure.

text secret-rotation-rollback.txt

Bounded rollback
  Reactivate the previous version if it was not deleted
  Switch back only failing consumers
  Keep the new value available for analysis
  Add a short expiration to the exception
  Replay the application test and log check
  Plan a new rotation window

A rollback is also a security event. A reactivated old value must have an explicit lifetime and owner. It should not linger as silent debt after stabilization.
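A bounded rollback can be represented as a record that carries its own deadline. The sketch below is an assumption about how a team might model it; the `RollbackException` class, the owner, the version identifier, and the 4-hour lifetime are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RollbackException:
    secret: str
    previous_version: str        # version identifier, never the secret value
    owner: str                   # person or team accountable for closing the exception
    consumers: tuple[str, ...]   # only the failing consumers switch back
    opened_at: datetime
    max_lifetime: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.opened_at + self.max_lifetime

    def is_overdue(self, now: datetime) -> bool:
        """True once the old value has been active longer than agreed."""
        return now >= self.expires_at


# Hypothetical exception, for illustration only.
exception = RollbackException(
    secret="api-backend-client-secret",
    previous_version="previous-version-id",
    owner="platform-oncall",
    consumers=("billing-worker",),
    opened_at=datetime(2026, 6, 4, 9, 30, tzinfo=timezone.utc),
    max_lifetime=timedelta(hours=4),
)

print(exception.expires_at.isoformat())
print(exception.is_overdue(datetime(2026, 6, 4, 14, 0, tzinfo=timezone.utc)))
```

An overdue exception is something a scheduled check can alert on, which keeps the reactivated value from outliving the incident unnoticed.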

Automate without losing human-readable evidence

Automation helps generate, store, and propagate secrets. It should not remove operational evidence. A rotation pipeline must produce a readable log that never displays secret values and records the versions affected, the consumers redeployed, the tests executed, the errors observed, and the revocation decision.

text rotation-automation-contract.txt

Automation contract
  No secret value in logs
  Dedicated execution identity
  New and previous versions traced by identifier
  Mandatory post-cutover tests
  Revocation separated from creation
  Human approval before final deletion on critical paths

This trace avoids two extremes: fragile manual rotation and opaque automatic rotation. Operations need a reproducible mechanism, but also readable evidence during an incident.
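Part of that contract can be expressed in the function that writes the rotation log. The following is a minimal sketch, assuming a hypothetical `journal_entry` helper and a hypothetical list of forbidden field names. Rejecting fields by name is only a coarse guard and does not replace secret scanning on pipeline output.

```python
import json
from datetime import datetime, timezone

# Hypothetical field names that must never reach the log.
FORBIDDEN_KEYS = {"value", "secret_value", "password", "token"}


def journal_entry(step: str, at: datetime, **details) -> str:
    """Serialize one rotation step as a JSON line, refusing fields that look like secret material."""
    leaked = FORBIDDEN_KEYS.intersection(details)
    if leaked:
        raise ValueError(f"refusing to log secret material: {sorted(leaked)}")
    return json.dumps({"at": at.isoformat(), "step": step, **details}, sort_keys=True)


# Versions are traced by identifier only; all names here are illustrative.
line = journal_entry(
    "cutover",
    datetime(2026, 6, 4, 8, 15, tzinfo=timezone.utc),
    secret="api-backend-client-secret",
    new_version="new-version-id",
    previous_version="previous-version-id",
    consumers_redeployed=["orders-api", "billing-worker"],
    tests_passed=True,
)
print(line)
```

One JSON line per step stays readable during an incident and can still be queried later when the change ticket is reviewed.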

Conclusion

Secret and service identity rotation becomes reliable when it is treated as a production runbook: inventory consumers; separate preparation, cutover, and revocation; test with real identities; observe errors after the change; and keep rollback bounded.

The right indicator is not only “the secret changed”. It is “critical consumers use the new value, the old one is no longer needed, and the team knows what to do if a forgotten path fails”. At that point, rotation stops being an isolated security task and becomes a controlled operations practice.