With the demand of agile development and management, cloud applications today are moving towards a more fine-grained microservice paradigm, where smaller and simpler functioning parts are combined for providing end-to-end services. In recent years, we have witnessed many research efforts that strive to optimize the performance of cloud computing system in this new era. This paper provides an overview of existing works on recent system performance optimization techniques and classify them based on their design focuses. We also identify open issues and challenges in this important research direction.
Wide-area high-performance computing is widely used for large-scale parallel computing applications owing to its high computing and storage resources. However, the geographical distribution of computing and storage resources makes efficient task distribution and data placement more challenging. To achieve a higher system performance, this study proposes a two-level global collaborative scheduling strategy for wide-area high-performance computing environments. The collaborative scheduling strategy integrates lightweight solution selection, redundant data placement and task stealing mechanisms, optimizing task distribution and data placement to achieve efficient computing in wide-area environments. The experimental results indicate that compared with the state-of-the-art collaborative scheduling algorithm HPS+, the proposed scheduling strategy reduces the makespan by 23.24%, improves computing and storage resource utilization by 8.28% and 21.73% respectively, and achieves similar global data migration costs.
As the complexity of software systems is increasing; software maintenance is becoming a challenge for software practitioners. The prediction of classes that require high maintainability effort is of utmost necessity to develop cost-effective and high-quality software. In research of software engineering predictive modeling, various software maintainability prediction (SMP) models are evolved to forecast maintainability. To develop a maintainability prediction model, software practitioners may come across situations in which classes or modules requiring high maintainability effort are far less than those requiring low maintainability effort. This condition gives rise to a class imbalance problem (CIP). In this situation, the minority classes’ prediction, i.e., the classes demanding high maintainability effort, is a challenge. Therefore, in this direction, this study investigates three techniques for handling the CIP on ten open-source software to predict software maintainability. This empirical investigation supports the use of resampling with replacement technique (RR) for treating CIP and develop useful models for SMP.
Container-based virtualization techniques are becoming an alternative to traditional virtual machines, due to less overhead and better scaling. As one of the most widely used open-source container orchestration systems, Kubernetes provides a built-in mechanism, that is, horizontal pod autoscaler (HPA), for dynamic resource provisioning. By default, scaling pods only based on CPU utilization, a single performance metric, HPA may create more pods than actually needed. Through extensive measurements of a containerized n-tier application benchmark, RUBBoS, we find that excessive pods consume more CPU and memory and even deteriorate response times of applications, due to interference. Furthermore, a Kubernetes service does not balance incoming requests among old pods and new pods created by HPA, due to stateful HTTP. In this paper, we propose a bi-metric approach to scaling pods by taking into account both CPU utilization and utilization of a thread pool, which is a kind of important soft resource in Httpd and Tomcat. Our approach collects the utilization of CPU and memory of pods. Meanwhile, it makes use of ELBA, a milli-bottleneck detector, to calculate queue lengths of Httpd and Tomcat pods and then evaluate the utilization of their thread pools. Based on the utilization of both CPU and thread pools, our approach could scale up less replicas of Httpd and Tomcat pods, contributing to a reduction of hardware resource utilization. At the same time, our approach leverages preStop hook along with liveness and readiness probes to relieve load imbalance among old Tomcat pods and new ones. Based on the containerized RUBBoS, our experimental results show that the proposed approach could not only reduce the usage of CPU and memory by as much as 14% and 24% when compared with HPA, but also relieve the load imbalance to reduce average response time of requests by as much as 80%. Our approach also demonstrates that it is better to scale pods by multiple metrics rather than a single one.
As the mean-time-between-failures (MTBF) continues to decline with the increasing number of components on large-scale high performance computing (HPC) systems, program failures might occur during the execution period with high probability. Ensuring successful execution of the HPC programs has become an issue that the unprivileged users should be concerned. From the user perspective, if the program failure cannot be detected and handled in time, it would waste resources and delay the progress of program execution. Unfortunately, the unprivileged users are unable to perform program state checking due to execution control by the job management system as well as the limited privilege. Currently, automated tools for supporting user-level failure detection and autorecovery of parallel programs in HPC systems are missing. This paper proposes an innovative method for the unprivileged user to achieve failure detection of job execution and automatic resubmission of failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach.We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experiment results show that ARL can detect the execution failures effectively on Tianhe-2 system. In addition, the communication and performance overhead caused by ARL is negligible. The good scalability of ARL makes it applicable for large-scale HPC systems.
Recently, a growing number of scientific applications have been migrated into the cloud. To deal with the problems brought by clouds, more and more researchers start to consider multiple optimization goals in workflow scheduling. However, the previous works ignore some details, which are challenging but essential. Most existing multi-objective workflow scheduling algorithms overlook weight selection, which may result in the quality degradation of solutions. Besides, we find that the famous partial critical path (PCP) strategy, which has been widely used to meet the deadline constraint, can not accurately reflect the situation of each time step. Workflow scheduling is an NP-hard problem, so self-optimizing algorithms are more suitable to solve it.
In this paper, the aim is to solve a workflow scheduling problem with a deadline constraint. We design a deadline constrained scientific workflow scheduling algorithm based on multi-objective reinforcement learning (RL) called DCMORL. DCMORL uses the Chebyshev scalarization function to scalarize its Q-values. This method is good at choosing weights for objectives. We propose an improved version of the PCP strategy calledMPCP. The sub-deadlines in MPCP regularly update during the scheduling phase, so they can accurately reflect the situation of each time step. The optimization objectives in this paper include minimizing the execution cost and energy consumption within a given deadline. Finally, we use four scientific workflows to compare DCMORL and several representative scheduling algorithms. The results indicate that DCMORL outperforms the above algorithms. As far as we know, it is the first time to apply RL to a deadline constrained workflow scheduling problem.
Exploring the interleaving space of a multithreaded program to efficiently detect concurrency bugs is important but also difficult because of the astronomically many thread schedules. This paper presents a novel framework to decompose a thread schedule generator that explores the interleaving space into the composition of a basic generator and its extension under the “small interleaving hypothesis”. Under this framework, we in-depth analyzed research work on interleaving space exploration, illustrated how to design an effective schedule generator, and shed light on future research opportunities.
With the advent of new computing paradigms, parallel file systems serve not only traditional scientific computing applications but also non-scientific computing applications, such as financial computing, business, and public administration. Parallel file systems provide storage services for multiple applications. As a result, various requirements need to be met. However, parallel file systems usually provide a unified storage solution, which cannot meet specific application needs. In this paper, an extended file handle scheme is proposed to deal with this problem. The original file handle is extended to record I/O optimization information, which allows file systems to specify optimizations for a file or directory based on workload characteristics. Therefore, fine-grained management of I/O optimizations can be achieved. On the basis of the extended file handle scheme, data prefetching and small file optimization mechanisms are proposed for parallel file systems. The experimental results show that the proposed approach improves the aggregate throughput of the overall system by up to 189.75%.
Mobile computing has fast emerged as a pervasive technology to replace the old computing paradigms with portable computation and context-aware communication. Existing software systems can be migrated (while preserving their data and logic) to mobile computing platforms that support portability, context-sensitivity, and enhanced usability. In recent years, some research and development efforts have focused on a systematic migration of existing software systems to mobile computing platforms.
To investigate the research state-of-the-art on the migration of existing software systems to mobile computing platforms. We aim to analyze the progression and impacts of existing research, highlight challenges and solutions that reflect dimensions of emerging and futuristic research.
We followed evidence-based software engineering (EBSE) method to conduct a systematic mapping study (SMS) of the existing research that has progressed over more than a decade (25 studies published from 1996–2017).We have derived a taxonomical classification and a holistic mapping of the existing research to investigate its progress, impacts, and potential areas of futuristic research and development.
The SMS has identified three types of migration namely Static, Dynamic, and State-based Migration of existing software systems to mobile computing platforms.Migration to mobile computing platforms enables existing software systems to achieve portability, context-sensitivity, and high connectivity. However, mobile systems may face some challenges such as resource poverty, data security, and privacy. The emerging and futuristic research aims to support patterns and tool support to automate the migration process. The results of this SMS can benefit researchers and practitioners–by highlighting challenges, solutions, and tools, etc., –to conceptualize the state-ofthe- art and futuristic trends that support migration of existing software to mobile computing.