Recovery Block

The Recovery Block structural pattern is a derivative of the Design Diversity architectural pattern and the Compensation strategy pattern in the original resilience design pattern specification (Fig. 32) [B24]. It offers detection, containment, and mitigation with continuous operation in the presence of an error or failure, with some interruption but no loss of progress. The following describes the Pattern and its application in the System Scope and in the Service Scope of the INTERSECT federated ecosystem for instrument science. Note that the Pattern description uses the terms system, subsystem, and service in an abstract way, while the System Scope and the Service Scope map those terms to the INTERSECT federated ecosystem.

Pattern

Problem

A hardware or software error or subsystem failure due to a design fault (e.g., human mistake or defective design tool) causes software, such as a service, to experience an error and potentially a subsequent failure.

Context

The pattern applies to a system that has the following characteristics:

  • The system is deterministic, i.e., forward progress of the system is defined in terms of the input state to the system and the execution steps completed since system initialization.

  • The system has a well-defined specification for which multiple implementation variants may be created.

  • There is an implicit assumption of independence between multiple variants of the implementation.

Forces

  • The pattern introduces an execution time and/or resource requirement (storage space, computational capability, etc.) penalty independent of whether an error or failure occurs during system operation or not.

  • The scope and strength of the diversity employed by the pattern determine its execution time and resource requirement overhead.

  • The pattern requires distinct implementations of the same design specification, which may need to be created by different individuals.

  • The pattern increases design complexity because additional design and verification effort is required to create multiple implementations.

  • The pattern introduces a performance penalty upon an error or failure, as the recovery block processes the input and validates its output after error/failure discovery.

Solution

The pattern enables the continuous correct operation of a system impacted by an error or failure. It supports resilient operation by applying redundancy to system state and optionally to system resources. This redundancy takes the form of a functionally equivalent alternate system implementation encapsulated in a recovery block. The alternate implementation is designed differently from the original system, so that design diversity provides error and failure resilience: the two different implementations are less likely to experience the same error or failure.

The pattern requires well-defined input and output to permit input replication and output validation. Input is replicated to the functionally equivalent alternate system implementation. The original system processes the input and validates its output. Upon discovery of an error or failure, the recovery block implementation processes the input, validates its output, and corrects the output of the original system. The scope and strength of the redundancy are defined by the implementation design diversity.
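The input-replicate, execute, validate, and recover sequence described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the pattern specification: the names `recovery_block`, `RecoveryBlockError`, `primary_sort`, `alternate_sort`, and `is_sorted_permutation` are hypothetical, and the primary variant contains a deliberately injected design fault so that the recovery path is exercised.

```python
import copy
from collections import Counter
from typing import Callable, Sequence, TypeVar

I = TypeVar("I")
O = TypeVar("O")


class RecoveryBlockError(Exception):
    """Raised when no variant produces an output that passes validation."""


def recovery_block(
    variants: Sequence[Callable[[I], O]],
    validate: Callable[[I, O], bool],
    data: I,
) -> O:
    """Run the primary variant first; run the alternates only after an error or failure."""
    for variant in variants:
        replica = copy.deepcopy(data)  # input replication: each variant gets its own copy
        try:
            output = variant(replica)
        except Exception:
            continue  # failure detected: activate the next variant
        if validate(data, output):
            return output  # output validation passed
        # error detected by output validation: fall through to the next variant
    raise RecoveryBlockError("no variant produced a valid output")


def primary_sort(xs):
    # Hypothetical primary with an injected design fault: duplicates are dropped.
    return sorted(set(xs))


def alternate_sort(xs):
    # Hypothetical alternate: an independently written insertion sort.
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def is_sorted_permutation(xs, ys):
    # Output validation: ys must be ordered and contain exactly the elements of xs.
    ordered = all(a <= b for a, b in zip(ys, ys[1:]))
    return ordered and Counter(xs) == Counter(ys)


result = recovery_block([primary_sort, alternate_sort], is_sorted_permutation, [3, 1, 3, 2])
print(result)  # [1, 2, 3, 3], produced by the alternate after the primary fails validation
```

In this sketch the alternate runs only after the primary's output is rejected, so the redundancy is in time. Running the alternate on different resources would add redundancy in space.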

Redundancy is in time, as the recovery block processes the input and validates its output after error/failure discovery. It may be additionally in space, if the recovery block is executed on different resources than the system. The components of this pattern are illustrated in Fig. 66.

Fig. 66 Recovery Block pattern components

Capability

A system using this pattern is able to continue to operate in the presence of an error or failure with some interruption for the execution of the recovery block. This pattern provides error and/or failure detection in the system by output validation. The pattern provides mitigation of an error or failure in the system by applying redundancy to system state and optionally to system resources, such that the system continues to operate correctly in the presence of such an event. The flowchart of the pattern is shown in Fig. 67, the state diagram in Fig. 68, and its parameters in Table 15.

Fig. 67 Flowchart

Fig. 68 State diagram

Table 15 Recovery Block pattern parameters

| Parameter | Definition |
|-----------|------------|
| \(T_{a}\) | Time to activate the recovery block of the (sub-) system |
| \(T_{i}\) | Time to replicate the input to the (sub-) system and the recovery block of the (sub-) system |
| \(T_{e}\) | Time to execute (sub-) system progress |
| \(T_{o}\) | Time to validate the output from the (sub-) system |
| \(T_{r}\) | Time to execute the recovery block of the (sub-) system |

Protection Domain

The protection domain extends to the system state and the system resources described by the design specification that implement the recovery block.

Resulting Context

Correct operation is performed despite an error or failure impacting the system. Progress in the system is not lost due to an error or failure. The system is not interrupted during error/failure-free operation. It is interrupted only when an error or failure is encountered, for the duration of the recovery block execution. Resource usage in time or space increases by the additional resource usage and execution time of the recovery block, which provides the redundancy in the form of the functionally equivalent alternate system implementation.

The pattern may be used in conjunction with other patterns that provide containment and mitigation in a complementary fashion, where some error/failure types are covered by the other pattern(s) and the pattern covers for the remaining error/failure types.

Performance

The error/failure-free performance \(T_{f=0}\) of the pattern is defined by the total task execution time without any resilience strategy \(T_{E}\), the time to activate the recovery block of the (sub-) system \(T_{a}\), the time to replicate the input to the (sub-) system and the recovery block of the (sub-) system \(T_{i}\), and the time to validate the output from the (sub-) system \(T_{o}\), with the total number of input-execute-output cycles \(P\).

\[\begin{aligned} T_{f=0} = T_{E} + T_{a} + P (T_{i} + T_{o}) \end{aligned}\]

The performance under errors/failures \(T_{f!=0}\) is defined by the failure-free performance \(T_{f=0}\) plus the time to execute the recovery block of the (sub-) system \(T_{r}\) for each of the \(N\) errors or failures. Assuming a constant time to execute the recovery block of the (sub-) system \(T_{r}\), the performance under errors/failures \(T_{f!=0}\) can be formulated as:

\[\begin{aligned} T_{f!=0} = T_{f=0} + N T_{r} \end{aligned}\]
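As a worked example, the two performance equations can be evaluated directly. The sketch below is illustrative only: the function names and all numeric values are assumptions chosen to make the arithmetic easy to follow, not measurements.

```python
def failure_free_time(t_E, t_a, t_i, t_o, cycles):
    """T_{f=0} = T_E + T_a + P (T_i + T_o)."""
    return t_E + t_a + cycles * (t_i + t_o)


def time_under_failures(t_f0, n_failures, t_r):
    """T_{f!=0} = T_{f=0} + N T_r, assuming a constant recovery block time T_r."""
    return t_f0 + n_failures * t_r


# Assumed values in seconds: T_E = 1000, T_a = 2, T_i = 0.5, T_o = 1.5, P = 100 cycles.
t_f0 = failure_free_time(t_E=1000.0, t_a=2.0, t_i=0.5, t_o=1.5, cycles=100)
print(t_f0)  # 1202.0

# Assumed N = 3 errors or failures, each costing T_r = 15 seconds of recovery block execution.
t_f = time_under_failures(t_f0, n_failures=3, t_r=15.0)
print(t_f)  # 1247.0
```

With these assumed values, input replication and output validation account for 200 of the 202 seconds of failure-free overhead, which illustrates why the per-cycle costs \(T_{i}\) and \(T_{o}\) tend to dominate when \(P\) is large.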

Reliability

The reliability \(R(t)\) of a system applying this pattern is defined by the parallel reliability of the \(N\)-redundant execution and the performance under errors/failures \(T_{f!=0}\), assuming a constant probabilistic rate \(\lambda_{n}\) of errors and failures for each redundant execution (or its corresponding inverse, the mean time to interrupt (MTTI) \(M_{n}\)). Note that \(N\) here denotes the number of redundant executions, not the number of errors or failures as in the performance equations. The reliability can be simplified for redundancy of identical systems \(R_{i}(t)\), assuming an identical constant probabilistic error/failure rate \(\lambda\) (or its corresponding inverse \(M\)).

\[\begin{split}\begin{aligned} R(t) &= 1 - \prod_{n=1}^{N}(1-e^{-\lambda_{n} T_{f!=0}}) = 1 - \prod_{n=1}^{N}(1-e^{-T_{f!=0}/M_{n}})\\ R_{i}(t) &= 1 - (1 - e^{-\lambda T_{f!=0}})^{N} = 1 - (1 - e^{-T_{f!=0}/M})^{N} \end{aligned}\end{split}\]
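The reliability equations can be evaluated numerically as follows. This is a sketch under the same assumptions as the equations (constant, independent error/failure rates); the function names and the example values are hypothetical.

```python
import math


def reliability(rates, t):
    """R(t) = 1 - prod_n (1 - exp(-lambda_n * t)) for independent redundant executions."""
    unreliability = 1.0
    for lam in rates:
        unreliability *= 1.0 - math.exp(-lam * t)
    return 1.0 - unreliability


def reliability_identical(mtti, t, n):
    """R_i(t) = 1 - (1 - exp(-t / M))^N for N identical redundant executions."""
    return 1.0 - (1.0 - math.exp(-t / mtti)) ** n


# Assumed values: MTTI M = 10000 s per execution, T_{f!=0} = 1247 s.
single = reliability_identical(mtti=10_000.0, t=1247.0, n=1)
dual = reliability_identical(mtti=10_000.0, t=1247.0, n=2)
print(f"N=1: {single:.4f}, N=2: {dual:.4f}")
```

The gain from the second execution depends entirely on the independence assumption stated in the Context. Variants that share a design fault fail together and do not multiply out in this way.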

Availability

The availability \(A\) of a system applying this pattern is defined by the \(N\)-parallel availability and the performance under errors/failures \(T_{f!=0}\). It can be simplified for redundancy of identical systems \(A_{i}\). If \(T_{a}\), \(T_{i}\), \(T_{d}\), \(T_{r}\), and \(T_{f}\) are small enough, the non-identical and identical availability can be simplified further, where \(M_{n}\) (or \(M\)) is the MTTI and \(R_{n}\) (or \(R\)) is the mean time to recover (MTTR) of each individual system (\(T_{f}\)).

\[\begin{split}\begin{aligned} A &= 1 - \prod_{n=1}^{N} (1 - A_{n})\notag\\ &= 1 - \prod_{n=1}^{N} \left(1 - \frac{T_{E,n}}{T_{n}}\right)\\ A_{i} &= 1 - (1-A)^{N}\notag\\ &= 1 - \left(1 - \frac{T_{E}}{T}\right)^{N} \end{aligned}\end{split}\]
\[\begin{aligned} A &= 1 - \prod_{n=1}^{N} \left(1 - \frac{M_{n}}{M_{n} + R_{n}}\right)\\ A_{i} &= 1 - \left(1 - \frac{M}{M + R}\right)^{N} \end{aligned}\]
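The simplified availability equations, expressed in terms of MTTI and MTTR, can be checked with a short calculation. The function names and values below are assumptions for illustration only. Note that \(R\) denotes the MTTR here, not the reliability.

```python
from math import prod


def availability(systems):
    """A = 1 - prod_n (1 - M_n / (M_n + R_n)); systems is an iterable of (MTTI, MTTR) pairs."""
    return 1.0 - prod(1.0 - m / (m + r) for m, r in systems)


def availability_identical(mtti, mttr, n):
    """A_i = 1 - (1 - M / (M + R))^N for N identical systems."""
    return 1.0 - (1.0 - mtti / (mtti + mttr)) ** n


# Assumed values: each system has MTTI M = 99 h and MTTR R = 1 h, i.e. 0.99 availability alone.
print(f"{availability_identical(99.0, 1.0, 1):.4f}")  # 0.9900
print(f"{availability_identical(99.0, 1.0, 2):.4f}")  # 0.9999
```

Under these assumed values, adding one functionally equivalent alternate reduces the unavailability from 0.01 to 0.0001, again assuming that the two implementations fail independently.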

Examples

Containment Domains [B67] provide language-based approaches for recovery blocks. Applications also often contain verification routines that check for the validity of a computation and correct any detected errors using application-specific knowledge.

Rationale

The pattern enables a system to tolerate an error or failure through continuation of correct operation after impact. It relies on system state redundancy in the form of a functionally equivalent alternate system implementation encapsulated in a recovery block. The pattern performs some proactive actions, such as maintaining redundancy, but mainly relies on reactive actions, such as the execution of a recovery block after an error or failure has been detected. Error or failure detection is part of the pattern in the form of output validation. The pattern has high design complexity due to the need for a functionally equivalent alternate system implementation encapsulated in a recovery block.

System Scope

In the context of INTERSECT Systems, Subsystems, and Services, this pattern is not applicable to INTERSECT systems and subsystems. Its redundancy takes the form of a functionally equivalent alternate implementation encapsulated in a recovery block, and INTERSECT system and subsystem functionality is provided by INTERSECT services.

Service Scope

In the context of INTERSECT Systems, Subsystems, and Services, this pattern can be applied to an INTERSECT service.

Microservice Scope

In the context of the INTERSECT Microservices Architecture, this pattern can be applied to an INTERSECT microservice. If it is applied to a group of microservices, then this is typically within the Service Scope.