
Operational architecture documentation: write a note that actually helps operations after deployment

Structure architecture documentation so that it remains useful in production: decisions, limits, validation commands, failure modes, rollback and operational evidence.

24 May 2026
Tags: architecture-documentation, runbook, validation, operations, failure-modes

Architecture documentation can look very clean on delivery day and be useless three weeks later. A diagram, a few component choices and a resource list are not always enough to operate a system. Run teams need to know how to validate, where to look when it breaks, which limits were accepted and which decisions should not be reversed without review.

The scenario here is a cloud or infrastructure architecture handed over to operations: private networking, DNS, HTTPS frontend, automation, backup or Linux integration. The goal is not to produce a long document. The goal is to produce a note that remains useful after deployment.

Write the problem before the solution

A good note starts with the operational need, not with service names. If the first sentence only says that the architecture uses Application Gateway, Private Endpoint or AWX, it lacks context. Explain the flow or problem being solved.

text problem-statement.txt
Useful context
A business application must be published to an external partner.
The call must be authenticated with a client certificate.
The backend remains private.
The run team must diagnose DNS, TLS, WAF and backend health separately.

Weak context
Deployment of an Application Gateway with WAF.

That difference changes the rest of the document. The first context prepares validations and failure modes. The second says almost nothing about operations.

Document decisions and non-decisions

An architecture always contains alternatives that were not chosen. Writing them down avoids repeating the same discussion during an incident or a later change. A useful decision record explains the choice, the reason and the limit.

text decision-record.txt
Decision
Use Application Gateway as the HTTPS entry point.

Reason
The main need is controlled publication of a web application with a dedicated listener, mTLS, WAF and backend probe.

Not selected
API Management is not used because the need is not API lifecycle, products, subscriptions or payload transformation.

Limit
If the need evolves toward API governance, this decision must be reviewed.

This format is short, but it avoids documents that only describe the final state without explaining why it exists.

Add validation commands

Operable documentation must let the team prove that the system works. Commands should not be decorative. They should validate critical points: DNS resolution, certificate, backend health, identity, backup, automation job or restore.

bash validation-examples.sh
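# DNS: expect the address of the intended frontend when run from this source
nslookup app.example.com
# TLS and health: without -k, curl fails if the certificate is not trusted
curl -v https://app.example.com/health

# Backend health: expect every backend server to be reported as Healthy
az network application-gateway show-backend-health -g rg-network-hub-prod -n agw-internet-prod-001 -o table

# Linux identity: expect the account to resolve and the domain to be online
id user@example.local
sssctl domain-status example.local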

These commands must be adapted to the topic, but the intent stays the same: give operations a way to validate a hypothesis. A command without an expected result is less useful. State what success or failure means.
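One way to attach an expected result to each command is a small wrapper that runs the command and compares its output with a pattern. The sketch below is only an illustration: the `expect` function, its labels and its PASS/FAIL wording are assumptions, not part of any tool mentioned in this article.

```bash
# expect: run a command and compare its combined output with an expected pattern.
# Usage: expect "<label>" "<extended regex>" <command> [args...]
# (hypothetical helper, written for this example)
expect() {
  local label="$1" pattern="$2" output
  shift 2
  # Capture stdout and stderr; a failing command still yields output to match.
  output="$("$@" 2>&1)" || true
  if printf '%s\n' "$output" | grep -Eq -- "$pattern"; then
    printf 'PASS  %s\n' "$label"
    return 0
  fi
  printf 'FAIL  %s (expected output matching: %s)\n' "$label" "$pattern"
  return 1
}
```

A possible call for the health endpoint would be: expect "health endpoint answers 200" '^200$' curl -s -o /dev/null -w '%{http_code}' https://app.example.com/health. The label states what success means, and the pattern makes it checkable.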

Write likely failure modes

Failure modes are often more useful than a complete diagram. They say what to inspect when a symptom appears. The goal is not to enumerate every possible error, but to list the predictable failures tied to the design.

text failure-modes.txt
Symptom: client receives 502
Check Application Gateway backend health
Verify probe, backend hostname and TLS certificate
Read access and WAF logs for the same period

Symptom: PaaS service answers publicly
Check publicNetworkAccess
Verify DNS resolution from the tested source
Confirm that the test does not use an external resolver

Symptom: an AWX job changes too many machines
Check inventory associated with the template
Check launch-time variables
Check guardrails in the playbook

This block gives immediate value. It turns design experience into diagnostic help.
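Some of these checks can be made mechanical. For the "answers publicly" symptom, a minimal sketch could classify the address returned by DNS as private or public, assuming the private path is expected to resolve to an RFC 1918 address. The function names and the message wording below are hypothetical.

```bash
# is_private_ipv4: return 0 for an RFC 1918 address, 1 for any other valid
# IPv4 address, 2 when the argument is not a dotted-quad IPv4 address.
is_private_ipv4() {
  local a b c d octet
  # Split on dots; extra fields stay attached to d and are rejected below.
  IFS=. read -r a b c d <<EOF
$1
EOF
  for octet in "$a" "$b" "$c" "$d"; do
    case "$octet" in
      ''|*[!0-9]*|????*) return 2 ;;
    esac
    [ "$octet" -le 255 ] || return 2
  done
  if [ "$a" -eq 10 ]; then return 0; fi
  if [ "$a" -eq 172 ] && [ "$b" -ge 16 ] && [ "$b" -le 31 ]; then return 0; fi
  if [ "$a" -eq 192 ] && [ "$b" -eq 168 ]; then return 0; fi
  return 1
}

# describe_resolution: turn the result into a line the run team can act on.
describe_resolution() {
  local name="$1" address="$2" rc=0
  is_private_ipv4 "$address" || rc=$?
  case "$rc" in
    0) printf '%s -> %s (private, expected for a private endpoint)\n' "$name" "$address" ;;
    1) printf '%s -> %s (PUBLIC, check the resolver used by this source)\n' "$name" "$address" ;;
    *) printf '%s -> "%s" (not an IPv4 address, resolution failed?)\n' "$name" "$address" ;;
  esac
  return "$rc"
}
```

On a Linux source, the address could come from the system resolver, for example: describe_resolution backend.internal.example.com "$(getent hosts backend.internal.example.com | awk '{print $1; exit}')". Using the system resolver rather than an explicit external one is the point of the check.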

Describe rollback or return path

Not every change has a simple rollback, but every change should have a strategy. Temporarily re-enabling public access, returning to an old backend setting, restoring a VM, disabling an AWX job or going back to a playbook version are different decisions. Writing them down before the incident reduces improvisation.

text rollback-notes.txt
Change: close Function public access
Possible return: temporarily re-enable publicNetworkAccess
Condition: security approval and limited duration
After-return validation: documented public and private tests

Change: new Application Gateway backend setting
Possible return: re-associate the old rule to the previous backend setting
Condition: old object kept during the change window

Rollback should also state what does not return automatically: modified data, rotated secrets, caches, propagated DNS or automation-applied configuration.
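A return path is easier to trust when the state before the change was recorded. The following sketch assumes the state can be printed by a read-only command, saves that output before the change, and compares it after the return. The helper names `save_state` and `state_matches` are hypothetical.

```bash
# save_state: store the output of a read-only command under a label.
# Usage: save_state <dir> <label> <command> [args...]
save_state() {
  local dir="$1" label="$2"
  shift 2
  mkdir -p "$dir"
  "$@" >"$dir/$label.txt" 2>&1
}

# state_matches: succeed when the command output equals the stored snapshot,
# otherwise print a unified diff and return non-zero.
state_matches() {
  local dir="$1" label="$2"
  shift 2
  "$@" 2>&1 | diff -u "$dir/$label.txt" -
}
```

The recorded command should print only stable fields, such as the backend setting a rule points to. Timestamps or version tags would make the comparison fail even after a correct return.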

Keep the document alive but bounded

An operable note must evolve with real changes, but it should not become an infinite wiki. Stable sections are context, decisions, flow diagram, validations, failure modes and rollback. Volatile details can point to inventories or source-of-truth tools.

text operational-note-outline.txt
Recommended structure
1. Context and objective
2. Expected flow
3. Decisions and limits
4. Components and responsibilities
5. Operational validations
6. Likely failure modes
7. Rollback or return path
8. Short history of important changes
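To keep the structure bounded over time, a review step could check that a note still contains each recommended section. This is a minimal sketch that assumes the note is a plain-text file using the section titles listed above. The function name is hypothetical.

```bash
# missing_sections: print each recommended section title absent from a note.
# Returns 0 when all titles are present, 1 otherwise.
missing_sections() {
  local note="$1" title status=0
  for title in \
    "Context and objective" \
    "Expected flow" \
    "Decisions and limits" \
    "Components and responsibilities" \
    "Operational validations" \
    "Likely failure modes" \
    "Rollback or return path" \
    "Short history of important changes"
  do
    # -F: fixed string, -i: ignore case, -q: no output, only an exit status
    if ! grep -Fqi -- "$title" "$note"; then
      printf 'missing: %s\n' "$title"
      status=1
    fi
  done
  return "$status"
}
```

The check only proves that a section exists, not that its content is current, so it complements a human review rather than replacing it.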

Example handover card

To make the note truly operable, finish with a short card that the run team can use without reading the full project file again. This card does not replace the architecture, but it summarizes the controls that matter once the application is in production.

text handover-card.txt
Service: partner portal
Entry point: app.example.com through Application Gateway
Backend: private application in application spoke
Critical DNS: app.example.com, backend.internal.example.com
Daily validation: healthy backend health, valid certificate, no abnormal WAF block
502 incident: check probe, backend hostname, certificate and WAF logs
Sensitive change: listener, backend setting, WAF policy or DNS zone modification
Rollback: previous backend setting kept during the change window
Run contact: platform team
Application contact: relevant product team

This card makes the document usable in practice. It shows what must be known before an incident: names, dependencies, validations, symptoms, rollback and responsibilities.

Conclusion

The most useful architecture documentation is not the one that describes the most components. It is the one that helps a team understand the flow, validate the current state, diagnose failures and decide on rollback. It keeps decisions, limits and evidence close to the design.

The right level of detail is the one that remains usable under pressure. A diagram can help, but validation commands, failure modes, non-selected decisions and a handover card often make the difference when the system no longer behaves as expected.