N-Modular Redundancy

The N-Modular Redundancy structural pattern is a derivative of the Redundancy architectural pattern and the Compensation strategy pattern in the original resilience design pattern specification (Fig. 32) [B24]. It offers detection, containment and mitigation with continuous operatation in the presence of an error or failure, and with none-to-little interruption and no loss of progress. The following describes the Pattern and its application in the System Scope and in the Service Scope of the INTERSECT federated ecosystem for instrument science. Note that the Pattern description uses the terms system, subsystem, and service in an abstract way, while the System Scope and the Service Scope map those terms to the INTERSECT federated ecosystem.

Pattern

Problem

A hardware error or subsystem failure due to a physical fault (e.g., wear-out or destruction) causes a software, such as a service, to experience an error and potentially a subsequent failure.

Context

The pattern applies to a system that has the following characteristics:

The system is deterministic, i.e., forward progress of the system is defined in terms of the input state to the system and the execution steps completed since system initialization.
The system has a modular design that has a well-defined scope and a set of inputs and outputs.

Forces

The pattern introduces an execution time and/or resource requirement (storage space, computational capability, etc.) penalty independent of whether an error or failure occurs during system operation or not.
The scope and strength of the redundancy employed by the pattern determine its execution time and resource requirement overhead.

Solution

The pattern enables the continuous correct operation of a system impacted by an error or failure. It supports resilient operation by applying redundancy to system state and optionally to system resources. This redundancy is in the form of \(N\) functionally identical replicas. The pattern requires very well defined input and output to permit input replication and output comparison. Input is replicated to identical instances of the system, processed by each replica system, and the output is then compared. The comparison corrects an error or failure of a replica system. The scope and strength of the redundancy are defined by the number of functionally identical replicas \(N\).

Redundancy can be in time, meaning the same system resources are used for redundancy and execute the \(N\) functionally identical replicas in time. Redundancy can also be in space, meaning additional (redundant) system resources are used and execute the \(N\) functionally identical replicas in space. Redundancy in time saves system resources, while redundancy in space offers more error/failure coverage. A mix between redundancy in time and space is possible as well, where there are more functionally identical replicas than additional (redundant) system resources. The components of this pattern are illustrated in Fig. 60.

N-modular Redundancy pattern components — Fig. 60 N-Modular Redundancy pattern components

Capability

A system using this pattern is able to continue to operate in the presence of an error or failure with no interruption. This pattern provides error and/or failure detection in the system by applying redundancy to system state in the form of \(N\) functionally identical replicas. The pattern provides mitigation of an error or failure in the system by applying redundancy to system state and optionally to system resources, such that the system continues to operate correctly in the presence of such an event. The flowchart of the pattern is shown in Fig. 61, the state diagram in Fig. 62, and its parameters in Table 13.

Table 13 N-Modular Redundancy pattern parameters
Parameter	Definition
\(T_{a}\)	Time to activate \(N\) replicas of the (sub-) system
\(T_{i}\)	Time to replicate the input to the \(N\) replicas of the (sub-) system
\(T_{e}\)	Time to execute (sub-) system progress in the \(N\) replicas of the (sub-) system
\(T_{o}\)	Time to compare the outputs from the \(N\) replicas of the (sub-) system
\(T_{r}\)	Time to remove, replace, or discount the affected redundant (sub) system replica(s)

Protection Domain

The protection domain extends to the system state and the system resources that implement the \(N\) functionally identical replica systems.

Resulting Context

Correct operation is performed despite an error or failure impacting the system. Progress in the system is not lost due to an error or failure. The system is not interrupted during error/failure-free operation or when encountering an error or failure. Resource usage in time or space is increased according to the amount of redundancy employed in the form of \(N\) functionally identical replicas and due to the replication of input and comparison and correction of output.

A trade-off exists between the amount of redundancy employed and the number of errors and/or failures that can be tolerated at the same time and/or in time. More redundancy tolerates generally more errors and/or failures, but requires either more resources or more execution time.

This pattern may be used in conjunction with other patterns that provide containment and mitigation in a complementary fashion, where some error/failure types are covered by the other pattern(s) and this pattern covers for the remaining error/failure types.

Performance

The error/failure-free free performance \(T_{f=0}\) of the pattern is defined by the task total execution time without any resilience strategy \(T_{E}\), the time to activate N replicas of the system \(T_{a}\), the time to replicate the input \(T_{i}\), and the time to compare the outputs \(T_{o}\) with the total number of input-execute-output cycles \(P\).

\[\begin{aligned} T_{f=0} = T_{E} + T_{a} + P (T_{i} + T_{o}) \end{aligned}\]

The performance under errors/failures \(T_{f!=0}\) is defined by the failure free performance \(T_{f=0}\) plus the time to remove, replace, or discount the affected redundant (sub) system replica(s) \(T_{r}\) for each of the errors or failures \(N\). Assuming constant times to remove, replace, or discount the affected redundant (sub) system replica(s) \(T_{r}\) and a ratio for replication in space vs. in time of \(\alpha\), the performance under errors/failures \(T_{f!=0}\) can be reformulated to:

\[\begin{aligned} T_{f!=0} = \alpha T_{E} + (1 - \alpha) N T_{E} + T_{a} + P (T_{i} + T_{o}) + N T_{r} \end{aligned}\]

Reliability

The reliability \(R(t)\) of a system applying this pattern is defined by the parallel reliability of the \(N\)-redundant execution and the performance under errors/failures \(T_{f!=0}\), assuming constant probabilistic rate \(\lambda_{n}\) of errors and failures for each redundant execution (or its corresponding inverse, the MTTI \(M\)). It can be simplified for redundancy of identical systems \(R_{i}(t)\), assuming an identical constant probabilistic error/failure rate \(\lambda\) (or its corresponding inverse \(M\)).

\[\begin{split}\begin{aligned} R(t) &= 1 - \prod_{n=1}^{N}(1-e^{-\lambda_{n} T_{f!=0}}) = 1 - \prod_{n=1}^{N}(1-e^{-T_{f!=0}/M})\\ R_{i}(t) &= 1 - (1 - e^{-\lambda T_{f!=0}})^{N} = 1 - (1 - e^{-T_{f!=0}/M})^{N} \end{aligned}\end{split}\]

Availability

The availability \(A\) of a system applying this pattern is defined by \(N\)-parallel availability and the performance under failure \(T_{f!=0}\). It can be simplified for redundancy of identical systems \(A_{i}\). If \(T_{a}\), \(T_{i}\), \(T_{d}\), \(T_{r}\), and \(T_{f}\) are small enough, non-identical and identical availability can be simplified further, where \(M_{n}\) (or \(M\)) is the MTTI and \(R_{n}\) (or \(R\)) is the mean-time to recover (MTTR) of each individual system (\(T_{f}\)).

\[\begin{split}\begin{aligned} A &= 1 - \prod_{n=1}^{N} (1 - A_{n})\notag\\ &= 1 - \prod_{n=1}^{N} \left(1 - \frac{T_{E,n}}{T_{n}}\right)\\ A_{i} &= 1 - (1-A)^{N}\notag\\ &= 1 - \left(1 - \frac{T_{E}}{T}\right)^{N} \end{aligned}\end{split}\]

\[\begin{aligned} A &= 1 - \prod_{n=1}^{N} \left(1 - \frac{M_{n}}{M_{n} + R_{n}}\right) A_{i} &= 1 - \left(1 - \frac{M}{M + R}\right)^{N} \end{aligned}\]

Examples

The use of the pattern in various hardware and software systems enables detection and correction of errors, or the compensation of failures. Dual-modular redundancy for error detection and failure compensation and triple-modular redundancy for error detection and correction and failure compensation are used forms of this pattern in high-performance computing (HPC) environments. Examples include dual-redundant cooling fans, dual- and triple–modular redundant Message Passing Interface (MPI) implementations [B65], dual-redundant parallel file system metadata service (MDS) solutions [B66] and dual-redundant mission-critical HPC systems (e.g., weather forecast).

Rationale

The pattern enables a system to tolerate an error or failure through continuation of correct operation after impact. It relies on system state redundancy in the form of functionally identical replicas. The pattern performs mostly proactive actions, such as maintaining redundancy. Error or failure detection is part of the pattern in the form of output comparison. The pattern has some design complexity, as input needs to be replicated and output needs to be compared.

System Scope

In the context of INTERSECT Systems, Subsystems, and Services, this pattern can be applied to INTERSECT systems and subsystems. It would be primarily applied to an entire infrastructure system and its subsystems, as opposed to an entire logical system that spans across multiple infrastructure systems. It could be applied to a logical subsystem of an infrastructure system only.

Service Scope

In the context of INTERSECT Systems, Subsystems, and Services, this pattern can be applied to an INTERSECT service. If it is applied to a group of services, then this is typically within the System Scope.

Microservice Scope

In the context of the INTERSECT Microservices Architecture, this pattern can be applied to an INTERSECT microservice. If it is applied to a group of microservices, then this is typically within the Service Scope.