Algorithms for online fault tolerance server consolidation

Li Boyu, Wu Bin, Shen Meng, Peng Hao, Li Weisheng, Zhang Hong, Gan Jie, Tian Zhihong, Xu Guangquan

2025, Vol. 11, Issue (2): 514-523. DOI: 10.1016/j.dcan.2024.06.007
Original article


Abstract

We study a novel replication mechanism that ensures service continuity under multiple simultaneous server failures. In this mechanism, each item represents a computing task and is replicated onto ξ+1 servers for some integer ξ≥1, with its workload specified by the amount of resources it requires. If one or more servers fail, the affected workloads are redirected to other servers that host replicas of the same item, so that service is not interrupted by the failure of up to ξ servers. Any feasible assignment algorithm must therefore reserve some capacity in each server to accommodate the workload redirected from potentially failed servers without overloading, and determining how best to reserve this capacity becomes a key issue. Unlike existing algorithms, which assume that no two servers share replicas of more than one item, we first formulate capacity reservation for the general scenario. Due to the combinatorial nature of this problem, finding the optimal solution is difficult. To this end, we propose a Generalized and Simple Calculating Reserved Capacity (GSCRC) algorithm, whose time complexity depends only on the number of items packed in the server. In conjunction with GSCRC, we propose a robust replica packing algorithm with capacity optimization (RobustPack), which aims to minimize the number of servers hosting replicas while tolerating multiple server failures. Through theoretical analysis and experimental evaluation, we show that the RobustPack algorithm can achieve better performance.
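To make the capacity-reservation requirement concrete, the following is a minimal brute-force sketch in Python. It is not the GSCRC or RobustPack algorithm, neither of which is specified in the abstract. It simply checks, by enumerating every failure set of size at most ξ, whether a given placement keeps all surviving servers within capacity. The function names, the unit server capacity, and the rule that an item's workload is split evenly among its surviving replicas are all assumptions made for illustration.

```python
from itertools import combinations


def server_loads(placement, workloads, failed=frozenset()):
    """Return the load on each surviving server.

    Assumption (not from the paper): an item's total workload is shared
    evenly among the replicas that are still alive.
    """
    loads = {}
    for item, hosts in placement.items():
        alive = [s for s in hosts if s not in failed]
        if not alive:
            raise ValueError(f"item {item!r} has no surviving replica")
        share = workloads[item] / len(alive)
        for s in alive:
            loads[s] = loads.get(s, 0.0) + share
    return loads


def tolerates_failures(placement, workloads, capacity, xi):
    """Check every failure set of size 0..xi by exhaustive enumeration."""
    servers = sorted({s for hosts in placement.values() for s in hosts})
    for k in range(xi + 1):
        for failed in combinations(servers, k):
            loads = server_loads(placement, workloads, frozenset(failed))
            # Small tolerance guards against floating-point rounding.
            if any(load > capacity + 1e-9 for load in loads.values()):
                return False
    return True


# Hypothetical instance with xi = 1: each item has xi + 1 = 2 replicas.
placement = {"a": ("s1", "s2"), "b": ("s1", "s3")}
light = {"a": 0.6, "b": 0.6}
heavy = {"a": 0.8, "b": 0.8}

ok_light = tolerates_failures(placement, light, capacity=1.0, xi=1)
ok_heavy = tolerates_failures(placement, heavy, capacity=1.0, xi=1)
```

In this instance the lighter workloads survive any single failure, while the heavier ones overload `s1` when `s2` fails (0.8 + 0.4 exceeds the capacity of 1.0). The number of failure sets grows combinatorially with the number of servers and with ξ, which is why exhaustive checking of this kind does not scale and a per-server reservation rule is attractive.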

Keywords

Cloud computing / Server consolidation / Replica / Fault tolerance

Cite this article

Li Boyu, Wu Bin, Shen Meng, Peng Hao, Li Weisheng, Zhang Hong, Gan Jie, Tian Zhihong, Xu Guangquan. Algorithms for online fault tolerance server consolidation. , 2025, 11(2): 514-523 DOI:10.1016/j.dcan.2024.06.007


CRediT authorship contribution statement

Boyu Li: Writing - review & editing. Bin Wu: Methodology. Meng Shen: Methodology. Hao Peng: Writing - review & editing. Weisheng Li: Writing - review & editing. Zhihong Tian: Writing - review & editing. Guangquan Xu: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported in part by the National Key R&D Program of China under Grant No. 2023YFB2703800, the National Science Foundation of China under Grants U22B2027, 62172297, 62102262, 61902276 and 62272311, the Tianjin Intelligent Manufacturing Special Fund Project under Grant 20211097, the China Guangxi Science and Technology Plan Project (Guangxi Science and Technology Base and Talent Special Project) under Grant AD23026096 (Application Number 2022AC20001), the Henan Provincial Natural Science Foundation of China under Grant 622RC616, and the CCF-Nsfocus Kunpeng Fund Project under Grant CCF-NSFOCUS202207.

