1 Introduction
Reliability is defined as the probability that a system or component will perform its required function for a specified period of time under specified operating conditions (Ebeling, 2019). Traditional reliability theory assumes that both a system and its components have only two possible states in the operation stage, either perfectly functioning or completely failed. However, as engineered systems become more sophisticated, involving larger sizes, increased functionality, and greater intelligence, and with advancements in sensing technologies, it has been extensively observed that systems and components exhibit multiple distinct states as they deteriorate over time. The failure behavior, damage severity, performance capacity, and operational efficiency of the system and components vary among these states (Lisnianski et al., 2018). This necessitates the development of advanced reliability theories and methodologies capable of characterizing systems and components of a multi-state nature.
The earliest research work on multi-state system (MSS) reliability can be traced back to Hirsch et al. (1968). In this work, components were assumed to have two states, while systems could have multiple states with distinct performance levels. Since 1975, there has been significant development and study of concepts and models of MSS reliability (Lisnianski et al., 2018). In 2003, the first book introducing MSS reliability models and theories was published by Lisnianski and Levitin (2003), followed by several other monographs tracking recent advances in MSS reliability research (Lisnianski et al., 2018; Natvig, 2011). Meanwhile, the state-of-the-art MSS reliability models and tools have been extensively implemented in a diversity of engineering practices, such as power systems, manufacturing systems, computer networks and grid systems, spacecraft, transportation systems, and infrastructures (Lisnianski et al., 2018).
The multi-state nature of engineered systems presents new challenges. From a reliability modeling perspective, traditional lifetime distributions are inadequate for characterizing the deteriorating behaviors of both individual components and entire systems with multiple states. Additionally, the heterogeneity of deteriorating behaviors among components complicates reliability modeling from the component level to the system level because of the state-space explosion problem, which is more severe than in binary-state systems. To address these challenges, advanced models and algorithms have been developed to characterize state transitions of components and map components’ state combinations to system states.
From a reliability assessment standpoint, data collected for MSSs throughout their lifecycle are more diverse than those for binary-state systems. The observed data can be collected at different inspection intervals and can span multiple physical levels of a system. They can include the observed states at each inspection time, the sojourn time in each state, or a mixture of both. The collected data can range from accurate to imprecise because of uncertainties stemming from poor sensing techniques or subjective judgments by practitioners. Consequently, new parameter estimation algorithms and data fusion methods must be explored to effectively utilize all collected data and attain improved reliability assessments at the beginning of system use or during operation.
From the perspective of reliability optimization, the incorporation of multiple states allows for greater flexibility in the design and operation of components and systems. The decision space increases exponentially with the number of heterogeneous components and the number of components’ state combinations. Advanced solution algorithms that can overcome the “curse of dimensionality” in both state and action spaces are highly desirable. In addition, the traditional models and methods for maintenance efficiency quantification, redundancy allocation, and maintenance/operation decision-making in the binary-state context need to be extended to accommodate the multi-state context.
2 Reliability modeling and assessment of MSSs
Fig.1 illustrates typical methods for reliability modeling and assessment of MSSs. In engineering applications, MSSs can be categorized into the following three scenarios: (1) Systems composed of different binary-state units that collectively impact the overall system performance; (2) Systems that exhibit varying levels of efficiency or performance capacities during their operation; (3) Systems that experience deterioration throughout their lifetimes, which can be differentiated into more than two states based on the severity of damage or health status. The subsequent subsections delve into the reliability modeling and assessment methods of MSSs, as outlined in Fig.1.
2.1 Reliability modeling methods of MSSs
Markov and Semi-Markov models: These models have been extensively applied in MSS reliability modeling. Markov models are based on the memoryless assumption: future states depend only on the current state and are independent of the past state sequence. Conversely, the Semi-Markov model takes into account not only the current state but also the time already spent in that state, allowing the sojourn time to follow an arbitrary distribution (Wu et al., 2019). These stochastic models are powerful tools for considering various factors such as repair, periodic testing, common cause failure, fault coverage, and fault-tolerance (Son et al., 2020). Nevertheless, their applicability is limited to relatively small MSSs because the number of system states increases rapidly as the number of multi-state components grows. Recently, methods based on state aggregation (Yi et al., 2021a) have been explored to simplify Markov models and enhance the efficiency of MSS reliability assessment.
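As an illustration of the Markov modeling idea described above, the following minimal sketch computes the transient state probabilities of a single three-state component modeled as a continuous-time Markov chain. The transition intensities, the state ordering, and the function name `transient_probs` are hypothetical choices made for this example, and uniformization is used here only as one convenient numerical scheme; it is not claimed to be the method of the cited works.

```python
import math

def transient_probs(Q, p0, t, tol=1e-12, max_terms=10_000):
    """State probabilities of a continuous-time Markov chain at time t.

    Uses uniformization: p(t) = sum_k Poisson(k; lam*t) * p0 * P^k,
    where P = I + Q/lam and lam bounds the largest exit rate.
    Intended for moderate lam*t; exp(-lam*t) underflows for very large values.
    """
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n)) or 1.0  # guard against lam = 0
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    v = list(p0)                     # holds p0 * P^k, starting with k = 0
    weight = math.exp(-lam * t)      # Poisson probability of k = 0 jumps
    result = [weight * x for x in v]
    covered = weight                 # Poisson mass accumulated so far
    k = 0
    while 1.0 - covered > tol and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k
        covered += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

# Hypothetical generator matrix; index 0 = perfect, 1 = degraded, 2 = failed.
Q = [
    [-0.025, 0.020, 0.005],
    [0.000, -0.030, 0.030],
    [0.000, 0.000, 0.000],
]
probs_at_10 = transient_probs(Q, p0=[1.0, 0.0, 0.0], t=10.0)
print([round(p, 4) for p in probs_at_10])
```

For a system of several such components, the joint state space is the product of the component state spaces, which is the source of the scalability limitation noted above.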
Hidden (Semi-) Markov models (HMMs/HSMMs): Owing to the widespread adoption of sensing technology, a significant volume of inspection data can be amassed for assessing and updating the reliability of MSSs. HMMs (Zhao et al., 2021) and HSMMs (Moghaddass et al., 2013) have been proposed to incorporate inspection data into MSS reliability modeling, wherein the actual system state is latent but can be inferred from inspection data. Specifically, the data used for parameter estimation in HMMs and HSMMs can include direct system state observations, indirect system state observations, expert judgment, and combinations of these. Furthermore, recognizing the hierarchical nature of MSSs, HMMs and HSMMs integrated with Bayesian networks have been developed to incorporate multi-level observations into parameter estimation (Jiang and Liu, 2017; Soleimani et al., 2021).
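To make the notion of a latent state inferred from inspection data concrete, the sketch below applies the standard normalized forward recursion of an HMM to a short sequence of indirect observations. All matrices are hypothetical illustrative values, the model parameters are assumed to be known rather than estimated, and the function name `forward_filter` is introduced only for this example.

```python
def forward_filter(initial, transition, emission, observations):
    """Posterior over the latent state after the last observation.

    initial[i]       : prior probability of latent state i
    transition[i][j] : probability of moving from state i to state j
    emission[i][o]   : probability of observing symbol o in state i
    Raises ZeroDivisionError if an observation has zero likelihood.
    """
    n = len(initial)
    belief = [initial[i] * emission[i][observations[0]] for i in range(n)]
    norm = sum(belief)
    belief = [b / norm for b in belief]
    for obs in observations[1:]:
        # Prediction step: propagate the belief through the transition matrix.
        predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                     for j in range(n)]
        # Update step: weight by the likelihood of the new observation.
        belief = [predicted[j] * emission[j][obs] for j in range(n)]
        norm = sum(belief)
        belief = [b / norm for b in belief]
    return belief

# Hypothetical three-state degradation model; state 2 is the failed state.
initial = [0.9, 0.1, 0.0]
transition = [
    [0.90, 0.08, 0.02],
    [0.00, 0.85, 0.15],
    [0.00, 0.00, 1.00],
]
emission = [
    [0.80, 0.15, 0.05],
    [0.20, 0.60, 0.20],
    [0.05, 0.15, 0.80],
]
posterior = forward_filter(initial, transition, emission, [0, 1, 1, 2])
print([round(p, 4) for p in posterior])
```

Parameter estimation, for example with the EM algorithm, would wrap a recursion of this kind; that step is omitted here.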
2.2 Reliability assessment methods of MSSs
2.2.1 Graph-based approaches
Bayesian Networks (BNs): BNs are extensively implemented for the reliability assessment of MSSs due to their ability to handle nodes with multiple possible values corresponding to components, subsystems, or the system itself. BNs use conditional probability tables (CPTs) to quantitatively capture probabilistic relationships among nodes. Unlike traditional reliability models, BNs have the capacity to represent complex probabilistic relationships within MSSs. Moreover, BNs can integrate data from diverse sources, enhancing the accuracy of reliability assessment for MSSs. They also accommodate correlations among component failures, making it possible to model common-cause failures (Li et al., 2022) and cascading failures (Codetta-Raiteri et al., 2012).
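The role of CPTs can be sketched with a toy network of two three-state component nodes and one system node. The priors and the "noisy minimum" CPT below are hypothetical, and inference is done by brute-force enumeration, which is only practical for very small networks; dedicated BN inference engines would be used in practice.

```python
from itertools import product

STATES = (0, 1, 2)  # 0 = failed, 2 = perfect
prior_a = {0: 0.1, 1: 0.3, 2: 0.6}
prior_b = {0: 0.2, 1: 0.3, 2: 0.5}

def cpt_system(a, b):
    """Hypothetical CPT: the system usually sits at min(a, b), but with
    probability 0.1 it performs one level lower."""
    m = min(a, b)
    if m == 0:
        return {0: 1.0}
    return {m: 0.9, m - 1: 0.1}

def joint():
    """Enumerate (a, b, s, probability) over the full joint distribution."""
    for a, b in product(STATES, STATES):
        for s, p in cpt_system(a, b).items():
            yield a, b, s, prior_a[a] * prior_b[b] * p

def system_marginal():
    """Marginal distribution of the system state."""
    out = {s: 0.0 for s in STATES}
    for _, _, s, p in joint():
        out[s] += p
    return out

def posterior_a_given_system(s_obs):
    """Updated distribution of component A after observing the system state."""
    out = {a: 0.0 for a in STATES}
    for a, _, s, p in joint():
        if s == s_obs:
            out[a] += p
    z = sum(out.values())
    return {a: p / z for a, p in out.items()}

print(system_marginal())
print(posterior_a_given_system(1))
```

The second call illustrates how evidence observed at the system level can be propagated back to a component node.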
Decision Diagrams: Different forms of decision diagrams, including multi-state binary decision diagrams (MBDDs), logarithmic binary decision diagrams (LBDDs), multi-state multi-valued decision diagrams (MMDDs), and edge-valued multi-valued decision diagrams (EVMDDs), have been introduced to analyze the reliability of MSSs. These methods extend the Boolean methods established for binary-state systems and adapt them for MSSs. MMDDs, in particular, are effective for assessing the reliability of phased-mission systems (Mo et al., 2014) and cold standby sparing systems (Zhai et al., 2015). They are capable of efficiently handling large-scale MSSs with components that have many states. For a comprehensive resource on decision diagrams for reliability analysis of complex systems, including MSSs, see Xing and Amari (2015).
Petri Nets (PNs): PNs, introduced by Carl Petri (1966), are graph-based structures consisting of places and transitions. PNs have been widely used for modeling the dynamic behaviors of MSSs: firing an enabled transition removes tokens from its input places and creates tokens in its output places (Taleb-Berrouane et al., 2020). PNs are particularly suitable for analyzing the distributed and concurrent behaviors of MSSs. However, traditional PNs have limitations in integrating evidence/data during the system’s operation stage. To address this, BNs have been incorporated into PNs to enhance their modeling capabilities and enable dynamic reliability assessment of MSSs through data collection (Taleb-Berrouane et al., 2020).
GO Methods: GO (Shen et al., 2003) is a success-oriented technique used to model components with complex logic relationships and multi-state nature. In GO, operators represent system components, while signal flows depict the connections between these components. One significant advantage of GO methods is the graphical similarity between the GO model and the system structure. However, traditional GO methods have limitations in two scenarios: the extra computation needed to correct for shared signals, and the heavy computational burden for large-scale MSSs. Recently, the integration of GO methods with other tools, such as multi-valued decision diagrams (MVDDs) and BNs, has led to the development of MVDD-GO and BN-GO (Ye et al., 2023), significantly enhancing the computational efficiency of reliability assessment of MSSs.
2.2.2 Analytical/approximate approaches
Universal Generating Function (UGF): Introduced by Ushakov (1986), UGF is a primary technique for the reliability assessment of MSSs. UGF employs the u-function to relate the performance levels of a component’s states to their associated probabilities. To obtain the u-functions of subsystems and the system, a set of composition operators is designed to perform algebraic operations on the u-functions of components. By utilizing a fast algebraic procedure, UGF efficiently determines the state probabilities of the MSS. However, there are several challenges associated with traditional UGF-based methods when they are applied to real-world cases. First, the UGF can only be used to analyze the steady-state probability of MSSs, as the generating function is defined exclusively for random variables rather than stochastic processes. To address this limitation, the Lz-transform was introduced (Lisnianski et al., 2018) to evaluate MSS reliability while considering the dynamic behaviors of components. Second, UGF can only be applied when both the state performance and the corresponding probability are precise values. Recent extensions of UGF, such as fuzzy UGF, interval UGF, and belief UGF, allow for the consideration of epistemic uncertainties surrounding the state performance and the corresponding probability. Finally, UGF is applicable only to s-independent components. Significant research efforts have been devoted to extending traditional UGF to incorporate unilateral dependency (Levitin, 2004) and bilateral dependency (Jafary and Fiondella, 2016) between components.
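The composition of u-functions can be sketched as follows. A u-function is represented here as a dictionary mapping each performance level to its probability, and a composition operator combines two u-functions term by term while collecting like terms. The component data, the structure (two parallel units feeding one unit in series), and the function names are hypothetical and assume s-independent components with precise probabilities.

```python
from collections import defaultdict

def compose(u1, u2, op):
    """Combine two u-functions with a structure operator `op`.

    Each u-function maps performance level -> probability. Probabilities of
    term pairs multiply, and terms with equal resulting performance are merged.
    """
    out = defaultdict(float)
    for g1, p1 in u1.items():
        for g2, p2 in u2.items():
            out[op(g1, g2)] += p1 * p2
    return dict(out)

def reliability(u, demand):
    """Probability that the performance meets or exceeds the demand."""
    return sum(p for g, p in u.items() if g >= demand)

# Hypothetical capacities: units A and B in parallel, followed by unit C in series.
unit_a = {0: 0.1, 50: 0.3, 100: 0.6}
unit_b = {0: 0.2, 80: 0.8}
unit_c = {0: 0.05, 120: 0.95}

u_parallel = compose(unit_a, unit_b, lambda x, y: x + y)  # capacities add
u_system = compose(u_parallel, unit_c, min)               # bottleneck limits flow

print(sorted(u_system.items()))
print(reliability(u_system, demand=100))
```

Merging like terms after each composition is what keeps the number of terms well below the number of component state combinations in many practical structures.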
Signature/Survival Signature Function: The signature, first proposed by Samaniego (1985), plays a pivotal role in the reliability assessment of coherent systems with identical components. The survival signature extends the concept of signature to systems involving multiple component types and various failure time distributions. Both methods have been investigated in the context of MSSs. However, the computational complexity of the MSS signature and survival signature increases rapidly with the size of the system. To alleviate this computational burden, UGF-based approaches (Coolen and Coolen-Maturi, 2012) and finite Markov chain imbedding methods (Yi et al., 2021b) have been proposed.
Recursive Algorithms: Recursive algorithms serve as the primary means of efficiently and accurately evaluating the performance and reliability of small-scale MSSs, particularly multi-state weighted k-out-of-n: G and multi-state weighted k-out-of-n: F systems (Li and Zuo, 2008). These algorithms establish recursive equations to evaluate the state distribution of the MSS and identify the boundary conditions for the recursive function. Recursive algorithms have demonstrated superior computational efficiency compared to UGF methods, although they remain computationally demanding for large-scale MSSs.
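The structure of such a recursion, with its boundary conditions, can be sketched for a simplified weighted k-out-of-n: G model in which each component contributes a state-dependent non-negative integer weight and the system works if the total weight is at least k. This is an illustrative variant under those assumptions and is not claimed to reproduce the exact formulation of the cited work; all data are hypothetical.

```python
from functools import lru_cache

def weighted_k_out_of_n(weights, probs, k):
    """P(total weight >= k) for independent multi-state components.

    weights[i][j] : weight contributed by component i in state j (non-negative)
    probs[i][j]   : probability that component i is in state j
    """
    n = len(weights)

    @lru_cache(maxsize=None)
    def r(demand, i):
        # Boundary conditions: demand already met, or no components left.
        if demand <= 0:
            return 1.0
        if i == 0:
            return 0.0
        # Condition on the state of component i and recurse on the rest.
        return sum(p * r(demand - w, i - 1)
                   for w, p in zip(weights[i - 1], probs[i - 1]))

    return r(k, n)

# Hypothetical three-component system.
weights = [[0, 1, 2], [0, 2, 4], [0, 3]]
probs = [[0.1, 0.3, 0.6], [0.2, 0.3, 0.5], [0.1, 0.9]]
print(weighted_k_out_of_n(weights, probs, k=5))
```

Memoization keeps the number of evaluated subproblems bounded by roughly n times k, instead of the product of the component state counts.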
d-MPs (d-MCs)-based algorithms: These algorithms are formulated based on minimal path/cut vectors, known as d-MPs (d-MCs), which are the lower (upper) boundary component state vectors that satisfy the system demand (Xu et al., 2022). Once all d-MPs (d-MCs) of an MSS have been identified, the reliability can be calculated using the inclusion-exclusion (IE) principle or the sum of the disjoint product (SDP) principle. However, when dealing with large-scale MSSs, it becomes cumbersome to obtain the exact reliability using d-MP (d-MC)-based algorithms. Reliability-bound algorithms (Liu et al., 2021) offer approximations of the reliability of large-scale MSSs.
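A minimal sketch of the IE step is given below, assuming that the d-MPs have already been found and that components are independent. The joint event for a subset of d-MPs is that every component is at or above the component-wise maximum of the vectors in the subset. The two-component example and the function name are hypothetical.

```python
from itertools import combinations

def reliability_from_dmps(dmps, state_probs):
    """Inclusion-exclusion over d-MPs for independent components.

    dmps           : list of lower-boundary state vectors meeting the demand
    state_probs[i] : list with P(component i is in state s) at index s
    """
    def p_at_least(i, level):
        return sum(p for s, p in enumerate(state_probs[i]) if s >= level)

    total = 0.0
    for size in range(1, len(dmps) + 1):
        for subset in combinations(dmps, size):
            # All vectors in the subset hold iff X >= their component-wise max.
            joint = [max(column) for column in zip(*subset)]
            term = 1.0
            for i, level in enumerate(joint):
                term *= p_at_least(i, level)
            total += (-1) ** (size + 1) * term
    return total

# Hypothetical example: two parallel components whose capacity equals the
# state index; the demand is 2, so the d-MPs are (2, 0), (1, 1), and (0, 2).
dmps = [(2, 0), (1, 1), (0, 2)]
state_probs = [[0.1, 0.3, 0.6], [0.2, 0.3, 0.5]]
print(reliability_from_dmps(dmps, state_probs))
```

The number of IE terms grows as 2 to the power of the number of d-MPs, which illustrates why exact evaluation becomes cumbersome for large-scale MSSs.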
Monte Carlo Simulation (MCS): Analytical methods, like the ones mentioned above, may become impractical due to the challenges posed by large-scale systems and complex structures. In such scenarios, MCS serves as an approximation and benchmark for evaluating the reliability of the MSS through random sampling and repetitive simulation. However, MCS requires significant computational resources and time, especially when dealing with extensive system state spaces. Recent proposals, such as population-based intelligent search methods (Green et al., 2011) and machine learning-based methods (Urgun and Singh, 2019), aim to alleviate the computational burden and improve the convergence speed of MCS.
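A crude MCS estimator can be sketched as follows: component states are sampled independently, the structure function is evaluated, and the fraction of samples meeting the demand is reported together with its standard error. The system (two parallel units feeding one unit in series), the data, and the function names are hypothetical, and no variance reduction is applied.

```python
import math
import random

def simulate_reliability(components, structure, demand, n_samples=20_000, seed=0):
    """Estimate P(system performance >= demand) by plain random sampling.

    components : list of (performance_levels, probabilities) per component
    structure  : function mapping a list of component performances to the
                 system performance
    Returns the estimate and its binomial standard error.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        perf = [rng.choices(levels, weights=probs)[0]
                for levels, probs in components]
        if structure(perf) >= demand:
            hits += 1
    estimate = hits / n_samples
    std_error = math.sqrt(estimate * (1.0 - estimate) / n_samples)
    return estimate, std_error

components = [
    ([0, 50, 100], [0.1, 0.3, 0.6]),
    ([0, 80], [0.2, 0.8]),
    ([0, 120], [0.05, 0.95]),
]

def structure(perf):
    # First two units work in parallel; the third is a series bottleneck.
    return min(perf[0] + perf[1], perf[2])

estimate, std_error = simulate_reliability(components, structure, demand=100)
print(estimate, std_error)
```

For this toy system the exact value is 0.95 × (0.6 + 0.3 × 0.8) = 0.798, so the estimate can be checked directly; the standard error shrinks only with the square root of the sample size, which reflects the computational cost noted above.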
3 Reliability optimization of MSSs
3.1 Reliability optimization of MSSs in the design stage
Reliability optimization plays a crucial role in the design of engineered systems. Enhancing component reliability, increasing component redundancy, and combining the two are effective strategies for improving system reliability during the design stage (Sun et al., 2019), as depicted in Fig.2.
In the context of MSSs, system performance is prioritized over reliability. The main focus is on improving system performance by enhancing the performance of individual components. The overall system performance is influenced by numerous combinations of component performance levels. The large number of possible combinations of component states increases the complexity of assessing reliability. Therefore, various approximate assessment approaches, including Monte Carlo simulation and cut-based approximation, have been introduced into reliability optimization to improve computational efficiency. However, obtaining reliable and representative data for MSSs is challenging. Limited data and an unclear understanding of the systems’ failure mechanisms contribute to imprecise parameters and epistemic uncertainty. Several non-probabilistic measures, such as evidence theory (Xiahou et al., 2018), p-box theory (Liu et al., 2020), and interval theory (Xiahou and Liu, 2020), have been introduced to address epistemic uncertainty related to component state probabilities. Nonetheless, reliability optimization for MSSs under epistemic uncertainty presents a nested optimization problem that involves identifying the upper and lower bounds of system reliability as a lower-level optimization and searching for an optimal design scheme as an upper-level optimization (Sun et al., 2019). Additionally, robust optimization is another effective approach for handling uncertainty in reliability optimization problems by ensuring optimal system design in worst-case scenarios (Zhang and Li, 2022).
Addressing reliability optimization problems for MSSs is challenging due to the discrete, probabilistic, and nonlinear nature of the resulting optimization models and variables. Various optimization algorithms have been developed to tackle the complexity of reliability optimization, including branch-and-bound (Ghare and Taylor, 1969) and dynamic programming (Bellman and Dreyfus, 1958). However, exact algorithms tend to be time-consuming, primarily due to the complexity of reliability assessment in MSS reliability optimization. To overcome this limitation, numerous heuristic and meta-heuristic algorithms have been developed to provide near-optimal solutions within a reasonable timeframe for multi-state reliability optimization problems.
3.2 Reliability optimization of MSSs in the operation stage
In the operation stage of MSSs, this review focuses primarily on three types of reliability optimization problems: maintenance optimization, mission abort problems, and component assignment problems, as depicted in Fig.3.
Maintenance Optimization: Maintenance is a critical activity in the operation phase of engineered systems, aimed at restoring system performance. In the context of maintenance optimization for MSSs, three imperfect maintenance models have been developed to represent maintenance efficiency. The first model considers that executing a maintenance action can restore the system to a specified state (Liu et al., 2018). The second model incorporates a state transition matrix to account for the stochastic nature of maintenance actions, which arises from the inherent uncertainty in the maintenance environment and the skill of the repairperson (Liu et al., 2022). The third model acknowledges that even after restoration, an MSS cannot be considered to be in a completely new condition. Consequently, the state transition intensities increase after repairs, so that degradation from a better state to a worse state proceeds faster than before the repairs (Liu and Huang, 2010). Markov decision processes (MDPs) have been used for sequential maintenance optimization for MSSs. Many dynamic programming and reinforcement learning (RL) approaches, such as value iteration, policy iteration, and Q-learning, have been developed to solve MDPs. However, the curse of dimensionality is a common drawback of MDPs as the problem size increases. To overcome this limitation, deep reinforcement learning (DRL) techniques, such as deep deterministic policy gradient (Chen et al., 2022) and proximal policy optimization (Chen et al., 2023), have been introduced to approximate the optimal maintenance strategy efficiently. In practice, accurately determining the system’s condition is challenging due to measurement errors and sensor signal noise. To address maintenance optimization under such circumstances, the partially observable MDP (POMDP) is an effective tool. The POMDP introduces the concept of a belief state to characterize uncertain information about the system condition (Liu et al., 2022).
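The MDP formulation can be sketched with value iteration on a toy maintenance problem. The four-state degradation matrix, the costs, the discount factor, and the assumption that a repair always restores the best state are hypothetical simplifications chosen for this example; they correspond to none of the three imperfect maintenance models in particular.

```python
def q_value(s, a, values, transition, cost, gamma):
    """Expected discounted cost of taking action a in state s."""
    return cost(s, a) + gamma * sum(
        p * values[s2] for s2, p in transition(s, a).items())

def value_iteration(states, actions, transition, cost, gamma=0.9, tol=1e-8):
    """Minimize expected discounted cost; returns (values, greedy policy)."""
    values = {s: 0.0 for s in states}
    while True:
        new_values = {
            s: min(q_value(s, a, values, transition, cost, gamma) for a in actions)
            for s in states
        }
        gap = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if gap < tol:
            break
    policy = {
        s: min(actions, key=lambda a: q_value(s, a, values, transition, cost, gamma))
        for s in states
    }
    return values, policy

STATES = [3, 2, 1, 0]            # 3 = best state, 0 = failed
ACTIONS = ["wait", "repair"]
DEGRADE = {                      # hypothetical one-period degradation
    3: {3: 0.80, 2: 0.15, 1: 0.04, 0: 0.01},
    2: {2: 0.70, 1: 0.20, 0: 0.10},
    1: {1: 0.60, 0: 0.40},
    0: {0: 1.00},
}

def transition(s, a):
    # Assumption: a repair restores the best state by the next period.
    return {3: 1.0} if a == "repair" else DEGRADE[s]

def cost(s, a):
    downtime = 100.0 if s == 0 else 0.0
    return downtime + (20.0 if a == "repair" else 0.0)

values, policy = value_iteration(STATES, ACTIONS, transition, cost)
print(policy)
```

With several components, the state set becomes the product of the component state sets, so the tabular loop above quickly becomes infeasible; this is the setting in which function approximation by DRL is used.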
Mission Abort: In engineering practices, ensuring the survival of systems is just as critical as successfully completing the mission, particularly in fields such as aerospace, submarine, and nuclear power. When the risk of system failure is deemed unacceptable, the mission can be aborted or terminated before completion in order to strike a balance between mission success and system survival. This is known as mission abort (Levitin et al., 2020a). The literature has explored various factors to capture the complexity of the mission environment, including external shocks and multi-attempt missions. In mission abort models for MSSs, system degradation is typically characterized by a delay-time model (Qiu et al., 2023) or a general multi-state model (Levitin et al., 2020b). Determining the mission abort policy of multi-component MSSs involves defining a subset of system states or combinations of components’ states that trigger an immediate mission abort. However, the multitude of possible combinations of component states can make mission abort decisions challenging.
Component Assignment Problems (CAPs): In addition to implementing maintenance activities to restore system performance, another approach to improving the reliability of reconfigurable systems is to reassign functionally exchangeable components. The CAP aims to find the optimal permutation of components that maximizes system reliability. CAPs can generally be classified as single-type or multi-type component CAPs (Zhu et al., 2017). The CAP is a combinatorial optimization problem that is typically NP-hard. Numerous heuristic approaches have been proposed to solve such problems within a reasonable time frame, including the ZKA and ZKB heuristics (Zuo and Kuo, 1990), as well as the Birnbaum importance-based genetic algorithm (Cai et al., 2016). CAPs and condition-based maintenance have also been jointly optimized to reduce the system’s long-run average operational cost (Sun et al., 2020). Most existing CAPs have focused on binary-state systems, with limited attention given to MSSs. In the CAPs of MSSs, combinatorial explosion and the complexity of system reliability assessment pose challenges in developing tractable optimization models. Additionally, real-world engineering problems involve parametric uncertainty regarding component reliabilities and model uncertainty regarding the system structure (Qiu et al., 2017). Therefore, several importance measures and heuristic algorithms have been developed to efficiently solve the CAPs of MSSs under uncertainty (Qiu and Ming, 2020).
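The permutation structure of the CAP can be sketched by exhaustive search on a deliberately small binary-state example. The three-position series-parallel structure, the component reliabilities, and the function names are hypothetical; the enumeration below is a baseline for illustration and is none of the cited heuristics.

```python
from itertools import permutations

def best_assignment(component_reliabilities, system_reliability):
    """Exhaustively search all assignments of components to positions.

    Returns (perm, value), where perm[k] is the index of the component placed
    in position k. Feasible only for small n because there are n! permutations.
    """
    n = len(component_reliabilities)
    best_perm, best_value = None, -1.0
    for perm in permutations(range(n)):
        r = [component_reliabilities[c] for c in perm]
        value = system_reliability(r)
        if value > best_value:
            best_perm, best_value = perm, value
    return best_perm, best_value

def series_parallel(r):
    # Position 0 is in series with the parallel pair formed by positions 1 and 2.
    return r[0] * (1.0 - (1.0 - r[1]) * (1.0 - r[2]))

assignment, value = best_assignment([0.9, 0.8, 0.7], series_parallel)
print(assignment, value)
```

In this example the most reliable component is placed in the series position. For MSSs, `system_reliability` would be replaced by a multi-state assessment routine, and the factorial growth of the search space is what motivates importance-measure-based heuristics.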
4 Challenges and future avenues
Despite significant achievements in MSS reliability research, there are several challenges that persist, providing potential avenues for future research endeavors.
(1) Scalability. The issue of scalability in reliability assessment of MSSs arises from two aspects. First, the multi-state nature of these systems leads to a rapid increase in the number of component state combinations as the numbers of components and component states increase. Second, assessing the reliability of large-scale systems is time-consuming. Recently, the physics-informed neural network (PINN) has emerged as a promising approach. It incorporates physical knowledge of MSSs into neural networks to enhance the performance and applicability of MSS reliability assessment.
(2) Parameter Estimation. The existing literature primarily focuses on providing point estimates of the model parameters of MSSs from a data-driven perspective, using methods such as maximum likelihood estimation (MLE) or the Expectation-Maximization (EM) algorithm. However, these methods may encounter overfitting issues when dealing with small sample sizes. A potential research direction involves integrating prior information and extending these methods to Bayesian EM algorithms. Additionally, the prevailing literature often assumes parameter stationarity, which may not align with the actual dynamic nature of MSSs. Therefore, considering the time-varying nature of parameters is another important direction to explore.
(3) Curse of Dimensionality. In reliability optimization of MSSs, the demand for computational resources and the complexity of the optimization process increase exponentially as the dimensionality grows. This phenomenon, known as the “curse of dimensionality,” renders reliability optimization problems computationally intractable. Recent advancements in artificial intelligence (AI) approaches have demonstrated their potential to facilitate decision-making in various fields. Leveraging advanced AI-based algorithms, such as multi-agent DRL techniques, to improve computational efficiency in solving large-scale reliability optimization problems is an interesting direction for further exploration. Another promising approach for addressing computational challenges is to develop effective decomposition methods. These methods decompose a large-scale problem into a series of tractable sub-problems, thereby facilitating the utilization of parallel and distributed computing architectures to collaboratively solve the sub-problems and distribute the computational workload.