Service Levels for MOJ Applications
Purpose and Context
The Ministry of Justice operates a diverse portfolio of applications ranging from mission-critical systems supporting courts and prisons to administrative utilities. Historically, investment in resilience, recovery, and continuity has been inconsistent and often based on reactive incident response rather than planned, tiered expectations.
This inconsistency has created several problems:
- Misaligned investment decisions – teams over-engineer low-value systems and under-resource critical services
- Unclear recovery expectations – there is no consistent framework for defining RTO/RPO targets across the department
- Inefficient resource allocation – continuity and disaster recovery planning is ad hoc and often duplicative
- Compliance and audit risk – there is no clear traceability between business impact and architectural decisions
This Service Level framework establishes a five-tier model of business criticality with aligned Recovery Time Objectives (RTO), Recovery Point Objectives (RPO), and architectural expectations. By doing so, we achieve:
- Consistent architectural decisions across all MOJ applications based on business impact
- Predictable recovery and continuity capabilities aligned to operational need
- Efficient resource allocation by right-sizing investment to business criticality
- Clear accountability for continuity planning and resilience engineering
- Reduced operational risk by ensuring critical systems meet appropriate availability standards
This framework applies to all applications in the MOJ portfolio, from citizen-facing services and operational systems to internal administrative tools. New systems must be classified during design; existing systems should be re-evaluated during their next major change or upgrade.
The model is grounded in industry best practice:
- ISO 22301:2019 (Business Continuity Management Systems)
- ISO 22313:2020 (Business Continuity Management Systems – Guidance on the Use of ISO 22301)
- NIST SP 800-34 Rev. 1 (Contingency Planning)
- ServiceNow Recovery Tier Guidance (widely adopted cloud recovery models)
| Level | Description | Typical RTO | Typical RPO | Architectural Expectation |
|---|---|---|---|---|
| 1 — Mission Critical | Immediate harm if unavailable | Minutes | Near-zero | Multi-region active-active; continuous replication |
| 2 — Business Critical | Material operational disruption | 4–8 hours | <1 hour | Warm standby; rapid failover |
| 3 — Operational | Workarounds available | <24 hours | <24 hours | Cold/warm DR; daily backups |
| 4 — Administrative | Low operational impact | 1–3 days | 1–2 days | Standard backup and restore |
| 5 — Non-Critical | Minimal impact | Days–weeks | 2+ days | Fix-on-failure |
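The tier table can also be held as data, so that a classification is derived from stated tolerances rather than chosen by feel. The following is a minimal sketch, not part of the framework itself. The names `ServiceLevel`, `LEVELS` and `suggest_level` are hypothetical. The bounds are the upper ends of the per-level RTO and RPO ranges; "near-zero" is treated as zero and Level 5 is treated as open-ended, both of which are simplifying assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ServiceLevel:
    level: int
    name: str
    max_rto: Optional[timedelta]  # None = open-ended, no upper bound stated
    max_rpo: Optional[timedelta]


# Upper bounds of each tier's RTO/RPO range (assumption: "near-zero" -> 0).
LEVELS = [
    ServiceLevel(1, "Mission Critical", timedelta(minutes=15), timedelta(0)),
    ServiceLevel(2, "Business Critical", timedelta(hours=8), timedelta(hours=1)),
    ServiceLevel(3, "Operational", timedelta(hours=24), timedelta(hours=24)),
    ServiceLevel(4, "Administrative", timedelta(hours=72), timedelta(hours=48)),
    ServiceLevel(5, "Non-Critical", None, None),
]


def _fits(bound: Optional[timedelta], tolerance: Optional[timedelta]) -> bool:
    """True if a tier's bound is within the business tolerance."""
    if tolerance is None:   # the business states no hard limit
        return True
    if bound is None:       # open-ended tier cannot promise a finite limit
        return False
    return bound <= tolerance


def suggest_level(
    tolerable_downtime: Optional[timedelta],
    tolerable_data_loss: Optional[timedelta],
) -> ServiceLevel:
    """Return the least stringent tier whose RTO and RPO bounds both fit."""
    for lvl in reversed(LEVELS):  # try Level 5 first, then work upwards
        if _fits(lvl.max_rto, tolerable_downtime) and _fits(
            lvl.max_rpo, tolerable_data_loss
        ):
            return lvl
    return LEVELS[0]  # nothing fits: fall back to the most stringent tier


if __name__ == "__main__":
    choice = suggest_level(timedelta(hours=10), timedelta(hours=2))
    print(choice.level, choice.name)
```

A service that tolerates 10 hours of downtime lands in Level 2 rather than Level 3, because Level 3 only commits to recovery within 24 hours. In practice the tolerances would come from the Business Impact Analysis rather than being typed in by hand.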
Level 1 — Mission Critical
Systems whose loss results in immediate and unacceptable harm to life, liberty, security, financial obligations, or statutory duties. These systems typically underpin core operational services where effectively no downtime and no data loss are acceptable.
| Parameter | Value |
|---|---|
| RTO | 0–15 minutes |
| RPO | Zero or near-zero |
Continuity and DR expectations
- Active-active architectures, multi-region failover or hot-standby
- Automated failover, continuous monitoring, and tested DR drills
- Regular resilience testing including cyber-attack and infrastructure-level scenarios
- Must align with a very short Maximum Tolerable Period of Disruption (MTPD)
Level 2 — Business Critical
Systems supporting essential services where downtime produces material operational, financial or reputational impact, but not immediate organisational failure.
| Parameter | Value |
|---|---|
| RTO | 4–8 hours |
| RPO | 15 minutes to 1 hour |
Continuity and DR expectations
- Warm standby, high-availability clustering, and replicated storage
- Sub-hourly backups or asynchronous replication frequent enough to meet the RPO; daily backups alone do not satisfy a one-hour RPO
- Annual DR exercises; scenario testing recommended
- Clear manual fallback processes to maintain partial operations
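The relationship between backup or replication frequency and the RPO can be checked with simple arithmetic. The sketch below is illustrative only and the function names are hypothetical. It assumes that the worst-case data loss from periodic backups is roughly one backup interval, and that healthy asynchronous replication limits loss to the replication lag. It ignores backup duration and the risk that replication copies corrupted data.

```python
from datetime import timedelta
from typing import Optional


def worst_case_data_loss(
    backup_interval: timedelta,
    replication_lag: Optional[timedelta] = None,
) -> timedelta:
    """Estimate the largest window of data that could be lost.

    Assumption: the tightest protection mechanism in place decides the loss.
    """
    windows = [backup_interval]
    if replication_lag is not None:
        windows.append(replication_lag)
    return min(windows)


def meets_rpo(
    rpo: timedelta,
    backup_interval: timedelta,
    replication_lag: Optional[timedelta] = None,
) -> bool:
    """True if the estimated worst-case loss is within the RPO."""
    return worst_case_data_loss(backup_interval, replication_lag) <= rpo


if __name__ == "__main__":
    level_2_rpo = timedelta(hours=1)  # upper end of the Level 2 RPO range
    print(meets_rpo(level_2_rpo, backup_interval=timedelta(days=1)))
    print(
        meets_rpo(
            level_2_rpo,
            backup_interval=timedelta(days=1),
            replication_lag=timedelta(minutes=10),
        )
    )
```

The first call models daily backups with no replication and the second adds replication with a ten-minute lag, which is the difference between missing and meeting a one-hour RPO under these assumptions.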
Level 3 — Operational
Systems supporting day-to-day operations. Temporary outages are tolerable and well-defined workarounds exist.
| Parameter | Value |
|---|---|
| RTO | <24 hours |
| RPO | 4–24 hours |
Continuity and DR expectations
- Daily backups; restore from secondary region or cold standby environment
- Documented manual operating procedures
- DR testing twice a year or annually, depending on change velocity
Level 4 — Administrative / Supporting
Internal productivity, administration, and support systems where downtime produces inconvenience but limited operational consequence.
| Parameter | Value |
|---|---|
| RTO | 24–72 hours |
| RPO | 24–48 hours |
Continuity and DR expectations
- Standard backup regimes; recovery from snapshots or cold standby
- Manual workaround acceptable for several days
- Periodic review to ensure the system does not become more business-relevant over time
Level 5 — Non-Critical / Low-Impact
Systems with purely convenience-based, informational or peripheral use. Failure does not materially impact organisational outcomes (e.g. internal dashboards with no operational effect).
| Parameter | Value |
|---|---|
| RTO | 72 hours to weeks |
| RPO | 48 hours or more |
Continuity and DR expectations
- Backups at standard intervals
- Fix-on-failure; no immediate failover required
- Minimal DR testing; restore as part of BAU backlog
Cross-cutting expectations
Business continuity
- Each service's classification must link to ISO 22301 Business Impact Analysis (BIA) outputs, including MTPD, RTO, RPO, and the Minimum Business Continuity Objective (MBCO)
- Each service must define dependencies, upstream/downstream integrations, and manual fallback capability
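Once dependencies are recorded, they can be checked against the criticality levels. The sketch below flags any service that depends on something classified as less critical than itself. That rule is a common rule of thumb and an assumption here, not something this framework states; a documented manual fallback may make such a gap acceptable. The `Service` type and the service names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    level: int  # 1 = Mission Critical ... 5 = Non-Critical
    depends_on: list[str] = field(default_factory=list)


def dependency_gaps(services: dict[str, Service]) -> list[tuple[str, str]]:
    """Return (service, dependency) pairs where the dependency has a
    higher level number, i.e. is classified as less critical."""
    gaps = []
    for svc in services.values():
        for dep_name in svc.depends_on:
            dep = services[dep_name]
            if dep.level > svc.level:
                gaps.append((svc.name, dep.name))
    return gaps


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        s.name: s
        for s in [
            Service("case-management", 1, ["identity", "document-store"]),
            Service("identity", 1),
            Service("document-store", 3),
        ]
    }
    for service, dependency in dependency_gaps(inventory):
        print(f"{service} depends on less critical {dependency}")
```

Each flagged pair is a prompt for a conversation: either the dependency is reclassified, or the dependent service documents how it operates without it.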
Disaster recovery
- Recovery strategies must be justified against cost and risk
- DR parameters (RTO, RPO, WRT and MTD, where WRT is Work Recovery Time and MTD is Maximum Tolerable Downtime) should be documented following the NIST SP 800-34 recommended structure
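When these parameters are documented together, a simple consistency check becomes possible. The sketch below assumes the widely used relationship that recovery time plus work recovery time should not exceed the maximum tolerable downtime (RTO + WRT ≤ MTD). This relationship is a convention from business continuity practice rather than a requirement quoted from this framework, and the `DRParameters` type is hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DRParameters:
    rto: timedelta  # time to restore the system
    rpo: timedelta  # tolerable data loss, recorded but not used in the check
    wrt: timedelta  # time to verify data and resume normal work
    mtd: timedelta  # maximum tolerable downtime

    def headroom(self) -> timedelta:
        """Time left in the MTD after recovery and work recovery."""
        return self.mtd - (self.rto + self.wrt)

    def is_consistent(self) -> bool:
        """True when RTO + WRT fits within the MTD."""
        return self.headroom() >= timedelta(0)


if __name__ == "__main__":
    # Illustrative figures only.
    params = DRParameters(
        rto=timedelta(hours=4),
        rpo=timedelta(hours=1),
        wrt=timedelta(hours=2),
        mtd=timedelta(hours=8),
    )
    print(params.is_consistent(), params.headroom())
```

Negative headroom suggests that either the RTO is too loose for the business tolerance or the MTD recorded in the BIA needs revisiting.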
Continuity testing
- Regular exercises required across all levels, scaling from full failover for Level 1 to table-top for Level 5
- Plans must be validated regularly to remain effective as systems and priorities change
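One way to make exercise results comparable across services is to record three timestamps per drill and derive the achieved RTO and RPO from them. The sketch below is a hedged illustration; the `DrillResult` type, its field names and the example timestamps are all hypothetical, and the targets used are the upper ends of the Level 2 ranges.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    outage_start: datetime          # when the simulated failure began
    service_restored: datetime      # when users could work again
    last_recovered_write: datetime  # newest data present after recovery

    @property
    def achieved_rto(self) -> timedelta:
        return self.service_restored - self.outage_start

    @property
    def achieved_rpo(self) -> timedelta:
        return self.outage_start - self.last_recovered_write


def evaluate(
    result: DrillResult, rto_target: timedelta, rpo_target: timedelta
) -> dict[str, bool]:
    """Compare what the drill achieved with the tier's targets."""
    return {
        "rto_met": result.achieved_rto <= rto_target,
        "rpo_met": result.achieved_rpo <= rpo_target,
    }


if __name__ == "__main__":
    drill = DrillResult(
        outage_start=datetime(2025, 1, 15, 9, 0),
        service_restored=datetime(2025, 1, 15, 14, 30),
        last_recovered_write=datetime(2025, 1, 15, 8, 40),
    )
    print(drill.achieved_rto, drill.achieved_rpo)
    print(evaluate(drill, timedelta(hours=8), timedelta(hours=1)))
```

Keeping the raw timestamps rather than a pass/fail flag means the same record can be re-evaluated if a service is later moved to a different level.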
Architecture alignment
- Design patterns (HA, DR, geo-redundancy, backup frequency) must be driven by the chosen criticality level
- Lower RTO → more automation, multi-site redundancy and monitoring
- Higher RTO → simpler backup/restore approaches