Backups you can actually restore from
A backup is only useful when it can be restored inside the recovery window and to an acceptable point in time. Until restoration is tested, the backup is only an assumption.
Start with RPO and RTO
Recovery point objective, or RPO, is the maximum acceptable data loss measured as time. Recovery time objective, or RTO, is the maximum acceptable time to restore service.
These numbers should be set by business impact, not by default tool settings. A public marketing site, an order database, and an audit log archive can have very different recovery needs.
Backups must be designed to meet both values. A daily backup cannot meet a fifteen-minute RPO. A backup that takes two days to restore cannot meet a four-hour RTO.
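The arithmetic behind those two examples can be written down as a quick check. This is a minimal sketch that assumes a simple periodic backup, where worst-case data loss is roughly one backup interval; it ignores backup duration and failed jobs, and the function name `meets_objectives` is illustrative only.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for a simple periodic backup scheme."""
    # Assumption: worst-case data loss is about one full interval between backups.
    rpo_ok = backup_interval <= rpo
    # The measured or estimated restore time must fit inside the RTO.
    rto_ok = restore_duration <= rto
    return rpo_ok, rto_ok


# The two examples from the text: a daily backup against a fifteen-minute RPO,
# and a two-day restore against a four-hour RTO.
result = meets_objectives(
    backup_interval=timedelta(days=1),
    restore_duration=timedelta(days=2),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=4),
)
print(result)  # (False, False)
```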
Know what must be recovered
Reliable recovery needs more than database files. List everything required to restore the service.
Typical recovery scope includes:
- application data
- schema and migration history
- object storage
- search indexes or a rebuild plan
- message streams or replay position
- configuration
- secrets and key material
- infrastructure definitions
- container images or release artefacts
- DNS and routing configuration
- runbooks and access paths
If a component can be rebuilt, document the rebuild steps and expected duration. If it cannot be rebuilt inside the RTO, back it up and include its restoration in the recovery plan.
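One way to keep that inventory honest is to record it as data and flag anything that has neither a backup nor a rebuild that fits the RTO. The sketch below is illustrative: the `Component` fields, the example durations, and the `uncovered` helper are assumptions, and it treats each rebuild independently rather than summing rebuilds that must run in sequence.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Component:
    name: str
    backed_up: bool
    # None means there is no documented rebuild procedure.
    rebuild_time: Optional[timedelta] = None


def uncovered(components, rto):
    """Names of components with neither a backup nor a rebuild that fits the RTO."""
    gaps = []
    for c in components:
        rebuildable = c.rebuild_time is not None and c.rebuild_time <= rto
        if not c.backed_up and not rebuildable:
            gaps.append(c.name)
    return gaps


# Hypothetical inventory; the durations are made up for illustration.
inventory = [
    Component("application data", backed_up=True),
    Component("search indexes", backed_up=False, rebuild_time=timedelta(hours=6)),
    Component("container images", backed_up=False, rebuild_time=timedelta(minutes=30)),
    Component("secrets and key material", backed_up=False),
]
gaps = uncovered(inventory, rto=timedelta(hours=4))
print(gaps)  # ['search indexes', 'secrets and key material']
```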
Protect backups from the same failure
Backups should survive the incident that made them necessary. Accidental deletion, compromised credentials, ransomware, regional outage, bad deployment, and operator error all have different failure patterns.
Use separation deliberately. That may mean separate accounts, separate regions, immutable storage, restricted deletion rights, separate encryption keys, and monitored backup access. The right design depends on the threat model and recovery objectives.
A backup that can be deleted by the same identity that can delete production data is not strong protection against account compromise.
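A simple audit for this is to compare the set of identities that can delete production data with the set that can delete backups. The sketch below assumes those two sets have already been exported from whatever access system is in use; the identity names are hypothetical.

```python
def shared_delete_identities(production_deleters, backup_deleters):
    """Identities that can delete both production data and its backups."""
    return sorted(set(production_deleters) & set(backup_deleters))


# Hypothetical identities, for illustration only.
production_deleters = {"ci-deployer", "ops-admin"}
backup_deleters = {"backup-admin", "ops-admin"}

overlap = shared_delete_identities(production_deleters, backup_deleters)
print(overlap)  # ['ops-admin']
```

An empty result does not prove the separation is sound, since roles can be assumed or escalated, but a non-empty one is a concrete finding to fix.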
Automate creation and verification
Manual backups are easy to miss. Automate backup creation, retention, expiry, and monitoring.
Monitor at least:
- last successful backup time
- backup size and unexpected size changes
- backup duration
- backup failure count
- replication or copy status
- retention policy compliance
- encryption status
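Two of these signals, backup age and unexpected size change, can be turned into alert rules with very little code. This is a rough sketch: the function name, the 26-hour freshness limit, and the 50% size tolerance are all assumptions chosen for illustration, not recommended thresholds.

```python
from datetime import datetime, timedelta, timezone


def backup_alerts(last_success, sizes, now, max_age, size_tolerance=0.5):
    """Return alert names for a stale backup or an unusual size change.

    sizes is a list of recent backup sizes, oldest first, latest last.
    """
    alerts = []
    if now - last_success > max_age:
        alerts.append("stale")
    if len(sizes) >= 2:
        *history, latest = sizes
        baseline = sum(history) / len(history)
        # Flag both sudden shrinkage and sudden growth relative to recent history.
        if baseline > 0 and abs(latest - baseline) / baseline > size_tolerance:
            alerts.append("size_change")
    return alerts


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
alerts = backup_alerts(
    last_success=datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    sizes=[100, 104, 98, 20],  # arbitrary units; the last backup is far smaller
    now=now,
    max_age=timedelta(hours=26),
)
print(alerts)  # ['stale', 'size_change']
```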
Verification must go beyond job success. A completed backup job does not prove that the data is usable.
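As one step beyond job success, a verification pass can confirm that the stored archive matches its recorded checksum and that every member can be read to the end. The sketch below assumes the backup is a gzip-compressed tar archive small enough to hold in memory, which is purely for illustration; `verify_archive` is a hypothetical helper, and passing it still does not prove the application can use the data.

```python
import hashlib
import io
import tarfile


def verify_archive(data, expected_sha256):
    """Check the checksum, then read every file in a tar.gz archive in full."""
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch")
    names = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        for member in tar:
            if member.isfile():
                # Reading to the end surfaces truncated or corrupt members.
                stream = tar.extractfile(member)
                while stream.read(1 << 20):
                    pass
            names.append(member.name)
    return names


# Build a tiny stand-in archive so the example is self-contained.
payload = b"id,name\n1,example\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    info = tarfile.TarInfo("dump.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
archive = buffer.getvalue()
recorded_digest = hashlib.sha256(archive).hexdigest()

print(verify_archive(archive, recorded_digest))  # ['dump.csv']
```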
Restore regularly
Periodic restore tests prove whether the backup process meets RPO and RTO. They also reveal missing permissions, missing configuration, slow transfer paths, incompatible versions, broken encryption keys, and undocumented manual steps.
A restore test should create a fresh environment, restore from backup, run integrity checks, run application smoke tests, and record elapsed time. The result should be reviewed against the stated objectives.
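That sequence lends itself to a small harness that runs each stage, stops at the first failure, and records timings for comparison with the RTO. The following is a sketch only: the step functions are placeholders for real provisioning, restore, and test commands, and `run_restore_test` is an illustrative name.

```python
import time
from datetime import timedelta


def run_restore_test(steps, rto, clock=time.monotonic):
    """Run (name, callable) steps in order, stopping at the first failure."""
    start = clock()
    results = []
    for name, step in steps:
        step_start = clock()
        error = None
        try:
            step()
        except Exception as exc:  # record the failure instead of crashing the test run
            error = str(exc)
        results.append({"step": name, "ok": error is None,
                        "seconds": clock() - step_start, "error": error})
        if error is not None:
            break
    elapsed = timedelta(seconds=clock() - start)
    passed = len(results) == len(steps) and all(r["ok"] for r in results)
    return {"passed": passed, "within_rto": elapsed <= rto,
            "elapsed": elapsed, "steps": results}


def placeholder():
    """Stand-in for a real stage such as provisioning or restoring."""


steps = [
    ("create fresh environment", placeholder),
    ("restore from backup", placeholder),
    ("run integrity checks", placeholder),
    ("run application smoke tests", placeholder),
]
report = run_restore_test(steps, rto=timedelta(hours=4))
print(report["passed"], report["within_rto"])  # True True
```

A test only counts as meeting the objective when both `passed` and `within_rto` are true; a fast restore that fails its smoke tests is still a failed restore.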
Do not test only the easiest path. Test point-in-time recovery, single object recovery, full environment recovery, and recovery after a deliberately bad change where relevant.
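For point-in-time recovery and recovery after a bad change, the first decision is which backup to start from: the latest one taken at or before the target time. A minimal sketch, assuming backups are identified by their completion timestamps and that any log replay up to the target is handled separately by the database tooling:

```python
from bisect import bisect_right
from datetime import datetime


def choose_base_backup(backup_times, target):
    """Return the latest backup time that is at or before the target time."""
    times = sorted(backup_times)
    index = bisect_right(times, target)
    if index == 0:
        raise ValueError("no backup exists at or before the target time")
    return times[index - 1]


backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 12, 0),
]
# Suppose a bad change landed at 12:00; recover to just before it.
base = choose_base_backup(backups, target=datetime(2024, 1, 1, 11, 55))
print(base)  # 2024-01-01 06:00:00
```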
Make restoration repeatable
The restore process should be scripted where possible and documented where judgement is required.
A good restore runbook includes:
- prerequisites and required access
- how to choose the restore point
- how to create the recovery environment
- restore commands or workflows
- validation checks
- cutover steps
- rollback or abort criteria
- communication points
- expected timings
Keep commands current. A restore command copied from an old incident can be worse than no command at all.
Validate integrity and application behaviour
A database that starts is not necessarily a recovered service. Validate the data and the application.
Use checks such as:
- database consistency checks
- expected table and object counts
- application smoke tests
- authentication tests
- critical read and write paths
- background worker checks
- audit log continuity
- monitoring and alerting checks
Record known gaps. If search indexes are rebuilt after restore, state how long that takes and what users see while it happens.
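The first two checks in the list can be scripted directly against the restored database. The sketch below uses SQLite from the Python standard library as a stand-in; other databases have their own consistency commands. The table names, the minimum row counts, and the `validate_restore` helper are assumptions for illustration.

```python
import sqlite3


def validate_restore(conn, minimum_rows):
    """Return a list of failures; an empty list means these checks passed."""
    failures = []
    # SQLite's built-in consistency check returns a single 'ok' row when healthy.
    status = conn.execute("PRAGMA integrity_check").fetchone()[0]
    if status != "ok":
        failures.append(f"integrity_check: {status}")
    existing = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    for table, minimum in minimum_rows.items():
        if table not in existing:
            failures.append(f"{table}: missing")
            continue
        # The table name was matched against the schema above before being used here.
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if count < minimum:
            failures.append(f"{table}: {count} rows, expected at least {minimum}")
    return failures


# Stand-in for a restored database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,)])

failures = validate_restore(conn, {"orders": 2, "customers": 1})
print(failures)  # ['customers: missing']
```

Checks like these cover the data layer only. The application-level items in the list still need their own smoke tests.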
Practise destructive scenarios safely
The hardest recoveries are caused by bad writes, accidental deletion, and compromise. Practise them in non-production or isolated recovery environments.
Useful exercises include:
- restore after a deleted table
- restore after corrupted application data
- restore a single tenant or account where architecture supports it
- restore after credentials are rotated
- restore into a clean account or region
- prove that immutable backups cannot be altered by normal production roles
The point is not to create theatre. The point is to find the step that fails before a real incident.
Keep evidence
Keep records of backup tests. Record the source backup, restore point, environment, commands or workflow used, duration, validation results, issues found, and follow-up actions.
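A structured record makes these fields hard to forget and easy to compare between tests. The sketch below shows one possible shape; the field names follow the list above, while the example values, the identifier format, and the JSON output are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RestoreTestRecord:
    source_backup: str
    restore_point: str
    environment: str
    workflow: str
    duration_seconds: float
    validation_results: dict
    issues: list = field(default_factory=list)
    follow_ups: list = field(default_factory=list)


# Hypothetical values for illustration.
record = RestoreTestRecord(
    source_backup="example-backup-id",
    restore_point="2024-01-01T06:00:00Z",
    environment="isolated-recovery",
    workflow="restore runbook, scripted path",
    duration_seconds=5400.0,
    validation_results={"integrity_check": "ok", "smoke_tests": "passed"},
    issues=["restore role lacked read access to the key store"],
    follow_ups=["grant and document the missing permission"],
)

# One JSON line per test is easy to append to a log and to diff over time.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```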
This evidence helps audits, but its operational value is greater. It shows whether the team can still restore after architecture, tooling, data volume, or staffing changes.
Conclusion
Backups are a recovery capability, not a storage habit. Define RPO and RTO, protect backups from the failures you expect, automate creation, test restoration, validate the recovered service, and keep evidence. A backup you cannot restore from is not a recovery plan.
