1 Introduction
A system-of-systems (SoS) is defined as an integrated system comprising multiple constituent systems that operate independently from one another (
Maier, 1998;
DeLaurentis and Ayyalasomayajula, 2009;
ISO/IEC/IEEE, 2019). SoS possesses capabilities that individual constituent systems are unable to provide and has found extensive application across various contexts, including transportation, energy, and urban infrastructure SoS (
Arasteh et al., 2019;
Dui et al., 2021;
Wu et al., 2022), production, manufacturing, and supply chain SoS (
Bourne et al., 2018;
Chen et al., 2020;
Panetto et al., 2019), as well as emergency and disaster management SoS (
Liu, 2011;
Cavallo and Ireland, 2014), and defense weapon and equipment SoS (
Chen et al., 2023;
Yang et al., 2022). SoS signifies an evolution and innovation in system thinking, emphasizing the complex interactions and significant effects among constituent systems, as well as between the SoS and its surrounding environment. Acknowledging this complexity, system-of-systems engineering (SoSE) employs principles and methodologies derived from complexity science to tackle emerging challenges in engineering practice. SoSE is particularly effective in addressing issues that traditional systems engineering methods struggle to resolve, such as distribution, dynamics, coupling, uncertainty, evolution, and adaptability within SoS (
De Domenico, 2023;
Han et al., 2022;
Jiao et al., 2024;
Panetto et al., 2019;
Pan et al., 2022;
Wu et al., 2022).
The emergence of SoS presents significant challenges in SoS engineering practices, particularly concerning SoS reliability. These challenges stem from the complex interactions among constituent systems, which include dependencies, conflicts, and cooperation, as well as from variable environmental conditions. Flaws in architectural design or disturbances during operation can result in failure or even severe collapse of the SoS (
Duan et al., 2019;
Ma et al., 2022). Unlike failures in traditional systems, failures in SoS are characterized by multiple states; disturbances do not invariably lead to a complete breakdown. Instead, a subset of constituent systems within the SoS may remain operational. These systems utilize strategies such as repair and logistics to recover, dynamically reconfigure, or adaptively evolve to establish a new system architecture, thereby enabling the SoS to function reliably and robustly (
Sun et al., 2022). Given this characteristic of SoS, research on SoS reliability has transitioned from an emphasis on failure in traditional reliability to an examination of perturbations during operation (
Pan et al., 2019).
As SoSs emerge across different domains, designing and verifying these complex engineered systems with systems engineering approaches presents reliability challenges that differ from those encountered in conventional systems. Drawing upon aerospace engineering practices, this paper summarizes and generalizes the concept of SoS and establishes a comprehensive research framework for SoS reliability. Based on these insights, it offers a perspective on future directions for SoS reliability and discusses the issues and challenges associated with general SoS reliability.
2 Reliability of SoS: Requirements from complex engineered systems
As complex engineered systems, SoSs across various domains face a common challenge: ensuring normal and reliable operation. However, compared to simpler systems, the methodologies for measuring SoS reliability and the engineering approaches for designing and verifying such systems have evolved. This evolution highlights the universal necessity for conducting research on SoS reliability within complex engineered systems.
2.1 SoS reliability measurement and evaluation
Reliability refers to the capacity of a system or product to function properly over time. In the context of SoS, reliability includes the ability to operate continuously and stably. The reliability of SoS is often examined within the framework of a complex engineered system comprising several constituent systems that collaborate toward a common objective (
Chen et al., 2023;
Eusgeld et al., 2011). In comparison to general system reliability, SoS reliability possesses a more nuanced connotation, and its measurement and evaluation present challenges across several dimensions.
(1) The duration of operation is challenging to quantify. System reliability is typically assessed by calendar duration, operational duration, or frequency, indicating its dependence on time. However, SoS reliability cannot be measured in this manner due to its loose structure, ambiguous boundaries, variable operational states, uncertain environments, and evolving configurations (
Baldwin et al., 2012;
Mostafavi et al., 2011). Therefore, it is more effective to evaluate SoS reliability from a task-oriented perspective rather than a time-oriented perspective.
(2) The operating environment is difficult to define. System reliability is closely linked to the environment, which includes factors such as force, heat, and vibration. In contrast, the operating environment of SoS has a broader scope that includes not only the environments in which constituent systems function but also the rules and processes that interconnect these systems, as well as the human and organizational factors that SoS relies on during operation (
Corsello, 2008). This expanded scope complicates the definition of environmental conditions and operational profiles necessary for assessing SoS reliability.
(3) Establishing criteria for failure is particularly challenging. Precisely defining failure criteria for SoS is difficult, as these criteria vary among different types of SoS. The failure of SoS is contingent upon its specified operational objectives, which may include mission failure (
Chen et al., 2024c), loss of capability (
Jia et al., 2019), service interruption (
Thacker et al., 2017), and disintegration of network structure (
Gao et al., 2016). Furthermore, these failure patterns do not necessarily signify unreliability, as an SoS can sustain degraded operation following failures or recover through resilience processes, thereby maintaining overall reliability.
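The task-oriented view in item (1) can be illustrated with a minimal sketch. It assumes, purely for illustration, that a mission is a series of tasks, that each task succeeds when at least k of its assigned constituent systems are available, and that availabilities are independent. The function names, the example mission, and all numbers are hypothetical and are not taken from the cited works.

```python
def prob_at_least_k(availabilities, k):
    """Probability that at least k of the independent systems are available."""
    # dist[j] holds the probability that exactly j systems are available so far.
    dist = [1.0]
    for a in availabilities:
        new = [0.0] * (len(dist) + 1)
        for j, p in enumerate(dist):
            new[j] += p * (1.0 - a)   # this system is unavailable
            new[j + 1] += p * a       # this system is available
        dist = new
    return sum(dist[k:])


def mission_reliability(tasks):
    """Mission succeeds only if every task succeeds (series assumption)."""
    reliability = 1.0
    for availabilities, k in tasks:
        reliability *= prob_at_least_k(availabilities, k)
    return reliability


# Hypothetical mission: an observation task needs 1 of 2 satellites,
# and a downlink task needs its single ground station.
tasks = [([0.9, 0.9], 1), ([0.8], 1)]
mission_r = mission_reliability(tasks)
print(round(mission_r, 4))
```

The measure is indexed by tasks rather than by operating time, so a degraded configuration that still completes every task counts as reliable under this assumption.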
Beyond these measurement challenges, the concept of SoS reliability has expanded beyond the traditional focus on sustained operational capability over time. It is now recognized that the reliability of SoS should incorporate resilience, defined as the ability to resist, absorb, and recover from disturbance events, as well as to operate stably under uncertain disruptions. This shift signifies a movement from merely resisting failure to managing risks, and from passive post-failure repair to proactive defense, tolerance, adaptation, and recovery from internal failures or external shocks (
Sun et al., 2024;
Uday and Marais, 2015;
Yu et al., 2024).
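One common way to quantify resilience as described above is the ratio of delivered performance to nominal performance over a disturbance window. The sketch below is an illustrative assumption rather than the metric used in the cited studies: the function name, the sampled performance curve, and the nominal level are hypothetical, and the area is computed with the trapezoid rule.

```python
def resilience_index(times, performance, nominal=1.0):
    """Area under the performance curve divided by the area under nominal performance."""
    if len(times) != len(performance) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # Trapezoid rule between consecutive samples.
        area += 0.5 * (performance[i] + performance[i - 1]) * dt
    return area / (nominal * (times[-1] - times[0]))


# Hypothetical curve: a disturbance at t=1 halves performance,
# and recovery is complete by t=3.
times = [0, 1, 2, 3, 4]
performance = [1.0, 0.5, 0.5, 1.0, 1.0]
index = resilience_index(times, performance)
print(index)
```

A value of 1.0 means no performance was lost. A shallower drop (resistance) or a faster return to nominal (recovery) both raise the index.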
2.2 SoS reliability design and testing
When constructing an SoS for a specific purpose, it is essential to optimize its reliability design and evaluate its reliability through simulation. The specific measures are outlined as follows.
The optimization of SoS reliability design necessitates not only the application of methods such as demonstration and allocation to establish the reliability requirements for the design and analysis of constituent systems (
Holt et al., 2015), but also the optimization of the SoS architecture, which includes operational rules, coupling mechanisms, and configuration (
Fiksel, 2003;
Uday and Marais, 2015). Among various methodologies, an effective approach uses contribution rates to analyze and optimize the constituent systems within the SoS from the perspectives of sensitivity and importance analysis (
Li et al., 2021;
Wang et al., 2020). An alternative approach is to employ AI-based techniques, such as reinforcement learning, to enhance the self-adaptability of the SoS, thereby deriving an operational architecture and strategy that maximizes reliability (
Lin et al., 2023;
Liu et al., 2024;
Sun et al., 2024). AI methodologies aid in identifying the optimal architecture of the SoS to mitigate failures under uncertain conditions and to enable more rapid and effective recovery in the event of random disturbances (
Tao et al., 2025).
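The importance-analysis perspective mentioned above can be sketched with the classical Birnbaum importance measure, used here as a stand-in for the contribution-rate methods of the cited works, whose exact formulation is not given in this text. The structure function, system names, and probabilities are hypothetical, and constituent systems are assumed to fail independently.

```python
from itertools import product


def system_reliability(structure, probs):
    """Exact SoS reliability by enumerating the up/down states of all systems."""
    names = list(probs)
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        state = dict(zip(names, states))
        if structure(state):
            p = 1.0
            for name in names:
                p *= probs[name] if state[name] else 1.0 - probs[name]
            total += p
    return total


def birnbaum_importance(structure, probs, name):
    """Reliability with `name` always up minus reliability with it always down."""
    up = dict(probs, **{name: 1.0})
    down = dict(probs, **{name: 0.0})
    return system_reliability(structure, up) - system_reliability(structure, down)


# Hypothetical architecture: the mission needs the command node
# and at least one of two satellites.
def structure(s):
    return s["command"] and (s["sat_a"] or s["sat_b"])


probs = {"command": 0.95, "sat_a": 0.9, "sat_b": 0.8}
ranking = sorted(probs, key=lambda n: birnbaum_importance(structure, probs, n),
                 reverse=True)
print(ranking)
```

In this toy architecture the non-redundant command node ranks highest, which is the kind of result that would direct design effort toward adding redundancy or hardening that constituent system.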
The assessment of SoS reliability can be performed using various simulation methods. Techniques for measuring the reliability of complex systems, such as multi-agent simulation, Markov processes, and Petri nets, remain applicable to SoS reliability (
Ge et al., 2014;
Silva and Braga, 2020). In the domain of defense equipment, simulation experiments based on confrontation and game theory represent a significant methodology for SoS reliability testing (
Axelsson, 2019). Furthermore, AI-based methods such as neural networks and deep learning can support SoS reliability assessment by generating diverse application scenarios and inferring reliability (
Feng et al., 2023).
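As a minimal illustration of the Markov-process approach mentioned above, the sketch below models an SoS with three hypothetical states (normal, degraded, failed) as a discrete-time Markov chain. The transition probabilities are invented for illustration, and treating "normal or degraded" as mission-capable is an assumption rather than a criterion from the cited works.

```python
def step(dist, transition):
    """Advance the state distribution by one time step."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]


def distribution_after(dist, transition, steps):
    """State distribution after the given number of steps."""
    for _ in range(steps):
        dist = step(dist, transition)
    return dist


# Rows and columns are ordered: normal, degraded, failed.
# The failed row includes recovery to the degraded state (repair or reconfiguration).
transition = [
    [0.9, 0.08, 0.02],
    [0.3, 0.6, 0.1],
    [0.0, 0.5, 0.5],
]

start = [1.0, 0.0, 0.0]                       # the SoS starts in the normal state
after_two = distribution_after(start, transition, 2)
mission_capable = after_two[0] + after_two[1]  # normal or degraded
print([round(p, 3) for p in after_two])
```

Because the failed state is not absorbing, the chain reflects the multi-state, recoverable behavior attributed to SoS rather than a one-way path to breakdown.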
3 Equipment SoS reliability: Insights from aerospace engineering
Equipment SoS is currently a critical area of development in aerospace engineering. To meet the high reliability demands inherent in aerospace engineering practice, the architecture of an SoS is designed to possess reliability capabilities, including task coordination, stable operation, rapid recovery under disturbances, and timely and effective resource allocation. This paper refines insights derived from equipment and aerospace engineering practice, thereby offering guidance for reliability research across all SoSs.
3.1 Background
Equipment SoS refers to a large, complex engineered system composed of multiple pieces of equipment that are functionally interconnected and capable of independently performing specific tasks to achieve a common mission (
Li et al., 2017;
Sun et al., 2024). This concept is widely applied across various fields of engineering, particularly in aerospace. An aerospace equipment SoS includes a diverse array of equipment, such as spacecraft, satellites, and ground systems. Extensive efforts have been dedicated to the development of aerospace equipment SoSs, with the reliability of equipment SoS emerging as a significant concern in this domain (
Chen et al., 2023). Specifically, the reliability of aerospace equipment SoS can be enhanced through measures such as resistance, reconfiguration, and logistics. When conventional systems, such as certain satellites or ground systems, are impacted by disturbances due to failures or attacks, these measures can ensure that critical systems essential for the overall operation of the aerospace equipment SoS remain functional, thereby allowing the SoS to maintain its minimum operational capability and reliably execute its designated missions (
Wang et al., 2023). Drawing inspiration from research practices in aerospace, this study will provide a comprehensive analysis of the concept and research framework of equipment SoS reliability. Moreover, the insights derived from aerospace engineering will contribute to addressing existing challenges in general SoS reliability, with a focus on resilience, including resistivity and recoverability, and extending beyond traditional notions of reliability.
3.2 Concept of equipment SoS reliability
Equipment SoS reliability pertains to the ability of an equipment SoS to fulfill its mission tasks as outlined in the specified mission profile. This includes the SoS’s resilience when faced with breakdowns and destruction, its capacity for reconfiguration to promptly restore system functionality, its logistical capabilities after system disturbances, and its ability to accomplish tasks, ultimately leading to the attainment of the mission objective. These capabilities reflect a nuanced understanding of equipment SoS reliability, emphasizing resilience and recoverability under various disturbances. Equipment SoS reliability is subject to influence from multiple factors, including the natural environment in which the constituent systems operate, as well as factors inherent to the architecture of the SoS, such as the reconfiguration of equipment, the logical relationships within the SoS mission, and the command and control network (
Tsilipanos et al., 2013). Based on this understanding, a research framework for equipment SoS reliability has been systematically summarized and developed to guide the design and analysis of reliability.
3.3 Research framework for equipment SoS reliability
The foundational approach of SoSE is to design the system architecture around capabilities (
DoD, 2008). The development of an SoS follows three fundamental steps: capability transformation, architecture development, and requirement validation, in accordance with the iterative process of “objective proposing-solution design-test evaluation” (
Maier, 1998;
Nielsen et al., 2015). Grounded in the SoSE framework, a framework for equipment SoS reliability is proposed, as illustrated in Fig. 1. This framework follows a hierarchical structure of “SoS mission-architecture capability-system function” for reliability capability analysis and system architecture development, while the hierarchy of “function verification-capability verification-mission verification” is essential for conducting verification and evaluation (
Cherfa et al., 2019). These iterative processes create a cyclical pattern of SoS reliability analysis, design, and validation.
Specifically, the framework for equipment SoS reliability comprises three principal components: reliability capability analysis, reliability-oriented architecture development, and reliability verification and evaluation.
(1) SoS reliability capability analysis aligns the mission with the reliability capability requirements of the SoS, including task coordination, stable operation, disturbance recovery, and resource allocation, through the decomposition method of “mission-task-capability.” Additionally, high-level concepts of SoS may be derived from capability analysis, including mission and task processes, which provide initial input for reliability-oriented SoS architecture development.
(2) When developing SoS architecture from the perspective of reliability, it is necessary to analyze the tasks and coordination activities related to the normal operation of constituent systems within a network environment. Furthermore, it is essential to consider resilience processes from a reliability perspective, such as resistance, reconfiguration, and logistics under abnormal disturbances. These considerations yield an architecture model, mission models, disturbance models, and fault models.
(3) The architecture model is subsequently employed to support verification and evaluation of equipment SoS reliability. In addition to testing the applicability of SoS, simulation for reliability is a common and effective method in engineering. The architecture model can be utilized to facilitate logical simulation and multi-agent simulation for assessing SoS reliability requirements, evaluating reliability indicators such as survivability, robustness, and supportability, as well as optimizing the reliability of architectural design.
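The "mission-task-capability" decomposition in component (1) and its hand-off to architecture development in component (2) can be sketched as a small data structure. This is an illustrative assumption about how such a decomposition might be recorded, not the authors' tooling: the class names, the mission, and the capability labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Task:
    name: str
    capabilities: List[str]          # reliability capabilities the task requires


@dataclass
class Mission:
    name: str
    tasks: List[Task] = field(default_factory=list)


def required_capabilities(mission: Mission) -> Set[str]:
    """Flatten mission -> task -> capability into one requirement set."""
    return {c for task in mission.tasks for c in task.capabilities}


def uncovered_capabilities(mission: Mission,
                           architecture: Dict[str, Set[str]]) -> Set[str]:
    """Capabilities the mission needs that no constituent system provides."""
    provided = set().union(*architecture.values())
    return required_capabilities(mission) - provided


# Hypothetical mission and candidate architecture.
mission = Mission("earth_observation", [
    Task("image_target", ["task_coordination", "stable_operation"]),
    Task("recover_from_outage", ["disturbance_recovery", "resource_allocation"]),
])
architecture = {
    "satellite": {"stable_operation", "task_coordination"},
    "ground_station": {"resource_allocation"},
}
gaps = uncovered_capabilities(mission, architecture)
print(gaps)
```

A non-empty result marks a capability requirement that the candidate architecture does not yet satisfy, which is the kind of finding that would feed the next design iteration.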
4 SoS reliability: Challenges of complex engineered systems
4.1 Conceptual shift of SoS reliability
To investigate SoS reliability, it is crucial to model both the SoS and its reliability. The inherent complexity of SoS poses challenges for effective modeling. Regarding the modeling of SoS reliability, it is necessary to represent the SoS behavior under both normal operational conditions and various disturbances. Moreover, the focus on SoS reliability has transitioned from solely preventing failures to ensuring availability and maintaining efficient operation, even in the presence of disturbances. This transition has broadened the concept of SoS reliability to include resilience, defined as the ability of a system to resist, absorb, adapt to, and recover from disturbances in a timely manner to maintain effective functioning (
Holling, 1973). Therefore, a more comprehensive approach to modeling SoS reliability should prioritize the system’s resilience rather than remain confined to the traditional perspective of failure modes and occurrences (
Pan et al., 2022).
4.2 Network reliability in SoS
Network reliability refers to the capacity of a network to perform designated functions within a specified timeframe and under defined conditions. The reliability of SoS is closely linked to network reliability (
Chen et al., 2024a). On one hand, communication and interconnection among constituent systems serve as the foundation for effective task coordination. In particular, command and control systems, data links, and communication networks within equipment SoS play a crucial role in task completion (
Tran et al., 2015). On the other hand, the network functions as a valuable tool for describing and modeling various complex relationships within SoS, such as collaboration, support, and command and control. Research on network reliability can investigate issues related to evolution, adaptation, and emergence within complex systems (
Gao et al., 2016). As networked systems increase in complexity, a novel form of SoS, referred to as the Network of Networks, has emerged as a modeling tool to assess SoS reliability across multiple domains, including infrastructure and biological systems (
Gao et al., 2011).
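A simple network reliability measure consistent with the definition above is two-terminal reliability: the probability that a command node can still reach an effector over links that fail independently. The sketch below computes it exactly by enumerating link states, which is feasible only for small networks. The topology, node names, and link reliabilities are hypothetical.

```python
from collections import deque
from itertools import product


def is_connected(up_links, source, target):
    """Breadth-first search over the links that are currently working."""
    neighbours = {}
    for u, v in up_links:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def two_terminal_reliability(links, source, target):
    """Sum the probabilities of all link states in which source reaches target."""
    link_list = list(links)
    total = 0.0
    for states in product([False, True], repeat=len(link_list)):
        p = 1.0
        up_links = []
        for link, up in zip(link_list, states):
            p *= links[link] if up else 1.0 - links[link]
            if up:
                up_links.append(link)
        if is_connected(up_links, source, target):
            total += p
    return total


# Hypothetical data-link network: two disjoint relay paths from command (c2) to a UAV.
links = {
    ("c2", "relay_a"): 0.9, ("relay_a", "uav"): 0.9,
    ("c2", "relay_b"): 0.9, ("relay_b", "uav"): 0.9,
}
r = two_terminal_reliability(links, "c2", "uav")
print(round(r, 4))
```

For larger networks, the enumeration would typically be replaced by Monte Carlo sampling or by decomposition methods, since the number of link states grows exponentially.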
4.3 Disturbance propagation and recovery mechanism within SoS
Research on system reliability predominantly addresses functional failures. Similarly, the reliability of SoS should be examined from the perspective of normal and stable operation, with particular attention to the adverse effects of disturbances, including internal failures and external shocks (Pan et al., 2022). It remains essential to identify and characterize faults and failures within SoS, which may be defined in terms of task failure, loss of capability, service disruption, and network collapse. Given the complexity and interconnectivity of SoS, any faults or failures that arise are likely to propagate from local to global levels and from individual components to others, potentially resulting in widespread collapse of the SoS (
Duan et al., 2019). Furthermore, research on SoS reliability has expanded beyond merely addressing faults and failures to include recovery and adaptation strategies in response to these challenges. This shift has prompted the application of various methods to manage disturbances, including survivability analysis, vulnerability assessment, reconfiguration or reorganization, and self-adaptation (
Chen et al., 2023;
Sun et al., 2022). These methods represent generalized restoration and recovery strategies for addressing disturbances in system reliability research, and they differ significantly from traditional component or system maintenance approaches.
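The local-to-global propagation described in this section can be sketched as a fixed-point computation on a dependency graph. The model assumed here is deliberately simple and hypothetical: each constituent system lists the providers it depends on and fails once fewer than a required number of them survive. The names and thresholds are illustrative only.

```python
def propagate_failures(providers, required, initial_failures):
    """Return the set of failed systems once propagation has settled."""
    failed = set(initial_failures)
    changed = True
    while changed:
        changed = False
        for system, deps in providers.items():
            if system in failed:
                continue
            alive = sum(1 for d in deps if d not in failed)
            # A system fails when too few of its providers are still working.
            if alive < required.get(system, 0):
                failed.add(system)
                changed = True
    return failed


# Hypothetical equipment SoS: the UAV can use either relay (redundancy).
providers = {
    "ground_station": [],
    "relay_a": ["ground_station"],
    "relay_b": ["ground_station"],
    "uav": ["relay_a", "relay_b"],
    "sensor": ["uav"],
}
required = {"relay_a": 1, "relay_b": 1, "uav": 1, "sensor": 1}

contained = propagate_failures(providers, required, {"relay_a"})
collapsed = propagate_failures(providers, required, {"ground_station"})
print(sorted(contained), len(collapsed))
```

In this toy case, losing one relay is absorbed by redundancy, while losing the ground station cascades to every system. Recovery strategies such as reconfiguration could be explored in the same model by editing `providers` after a disturbance and re-running the propagation.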
4.4 Design and simulation technology of SoS reliability
As the concept of SoS reliability continues to evolve, research should focus on application scenarios and objectives, addressing several critical issues in SoS design and simulation. First, the reliability design of SoS must be closely integrated with its architecture. Mission profiles and actual usage factors can be analyzed through the establishment of real application scenarios. It is crucial to organize and decompose the elements of tasks, components, activities, and processes during normal operation and disturbances in accordance with the architecture of SoS (
Cherfa et al., 2019). Second, the reliability design of SoS should undergo iterative optimization through a comprehensive closed-loop verification approach that takes into account the application scenario, hierarchy, and task logic. Research should focus on requirement decomposition for SoS reliability, vulnerability analysis, and optimization of the SoS architecture. Reliability verification in a closed-loop framework is conducted based on a checklist derived from capability analysis of SoS (
Garro and Tundis, 2015). Third, the design and verification of SoS reliability should be integrated with intelligent technologies. In this context, SoS is conceptualized as a multi-agent system characterized by autonomous perception, learning, decision-making, and execution. Techniques such as reinforcement learning and large models are employed to optimize the architecture through multiple iterations, leading to reliable designs for the configuration of constituent systems. Furthermore, intelligent technologies such as AI agents and adversarial networks can produce a greater number of credible scenarios that enhance the reliability testing of SoS under sparse and uncertain conditions (
Chen et al., 2024b). Finally, it is essential to concurrently develop the necessary tools, software platforms, and technical standards, such as those established by NASA and ISO, in conjunction with research on SoS reliability.
4.5 Reliability management of SoSE
The practice of SoSE has demonstrated that the adaptability and flexibility inherent in the development of SoS, configuration changes, and interfaces among constituent systems significantly affect the management of SoS reliability. Various characteristics of SoS present challenges to the reliability management of such systems (
Gorod et al., 2008). First, SoS is recognized as an engineered resilient system, with resilience manifesting in three key aspects: the capacity to resist, recover from, and adapt to faults and disturbances; a flexible architecture that enables adaptation to uncertain environments, thereby enhancing the versatility of SoS; and the interfaces and configuration modifications of constituent systems during the incremental development of SoS (
Uday and Marais, 2015). The resilience of SoS distinguishes its reliability management process from that of a single system, as the former encounters greater uncertainties and potential risks, resulting in a continuously evolving configuration within SoSE. This contrasts with the requirements in traditional systems engineering, where the configuration must be established prior to conducting reliability testing and evaluation. Second, SoS is not developed in a single step or with initially defined objectives; rather, it undergoes continuous optimization, supplementation, and evolution based on existing configurations. Its development follows a wave-like trajectory, characterized by ongoing upgrades and incremental enhancements (
Keating et al., 2008). Collectively, these attributes significantly affect reliability management in SoSE.
5 Conclusions
This paper defines SoS as a complex engineered system and analyzes SoS reliability from multiple perspectives. It first outlines the differences between SoS reliability and the reliability of individual systems or products, particularly in relation to reliability measurement and evaluation. It then explores the challenges associated with researching SoS reliability within the engineering domain, considering the design and testing requirements pertinent to SoS reliability. Subsequently, drawing from research practices related to equipment SoS reliability in the aerospace engineering sector, this paper defines equipment SoS reliability as the capability of the SoS to fulfill its mission in accordance with the specified mission profile and proposes a research framework for equipment SoS reliability grounded in the SoSE framework. This framework incorporates SoS reliability capability analysis, reliability-oriented SoS architecture development, and SoS reliability verification and evaluation, thereby providing a comprehensive and referential methodology and research paradigm for equipment SoS reliability engineering. Finally, it addresses the challenges present in the modeling, mechanisms, technology, and management of SoS reliability, aiming to inspire future research endeavors.