ICCG: low-cost and efficient consistency with adaptive synchronization for metadata replication
Chenhao ZHANG, Liang WANG, Jing SHANG, Zhiwen XIAO, Limin XIAO, Meng HAN, Bing WEI, Runnan SHEN, Jinquan WANG
The rapid growth in the storage scale of wide-area distributed file systems (DFS) calls for fast and scalable metadata management. Metadata replication is a widely used technique for improving the performance and scalability of metadata management. Because file systems must meet POSIX requirements, many existing metadata management techniques adopt costly designs to preserve metadata consistency, which leads to unacceptable performance overhead. We propose a new metadata consistency maintenance method (ICCG), which includes an incremental consistency guaranteed directory tree synchronization (ICGDT) and a causal consistency guaranteed replica index synchronization (CCGRI), to ensure system performance without sacrificing metadata consistency. ICGDT uses a flexible consistency scheme based on the state of files and directories, maintained through a conflict state tree, to provide incremental consistency for metadata, which satisfies both metadata consistency and performance requirements. CCGRI ensures low-latency and consistent access to data by establishing causal consistency for replica indexes through multi-version extent trees and logical time. Experimental results demonstrate the effectiveness of our methods. Compared with the strong consistency policies widely used in modern DFSes, our methods significantly improve system performance. For example, in file creation, ICCG can improve the performance of directory tree operations by at least 36.4 times.
metadata management / metadata replication / consistency / directory tree / replica index
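As a rough illustration of how logical time and multiple versions can combine to give causally consistent index reads, the sketch below stamps each replica-index update with a Lamport timestamp and serves reads from the newest version at or before a given timestamp. It is not the CCGRI implementation: the class names `LamportClock` and `VersionedIndex`, the key format, and the site labels are hypothetical, and a plain dictionary of sorted version lists stands in for the multi-version extent trees named in the abstract.

```python
from bisect import bisect_right


class LamportClock:
    """Logical clock: a counter that only moves forward (hypothetical helper)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the counter by one.
        self.time += 1
        return self.time

    def observe(self, remote_time):
        # Receiving a remote update: jump past both local and remote time.
        self.time = max(self.time, remote_time) + 1
        return self.time


class VersionedIndex:
    """Keeps every version of an index entry, ordered by logical timestamp.

    Assumption: a dict of sorted lists replaces a real multi-version extent tree.
    """

    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value), sorted by timestamp

    def put(self, key, timestamp, value):
        versions = self._versions.setdefault(key, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])

    def get_at(self, key, timestamp):
        """Return the newest value whose timestamp is <= the given timestamp."""
        versions = self._versions.get(key, [])
        stamps = [version[0] for version in versions]
        position = bisect_right(stamps, timestamp)
        return versions[position - 1][1] if position else None


# Site A records where an extent lives; site B later moves it after seeing A's update.
clock_a, clock_b = LamportClock(), LamportClock()
index = VersionedIndex()

t_a = clock_a.tick()
index.put("file1:extent0", t_a, "site-A")

t_b = clock_b.observe(t_a)  # B's update is ordered after A's
index.put("file1:extent0", t_b, "site-B")

print(index.get_at("file1:extent0", t_a))
print(index.get_at("file1:extent0", t_b))
```

A reader that has only observed site A's update reads at `t_a` and still gets a location consistent with what it has seen, while a reader that has observed site B's update reads at `t_b` and gets the newer location. Keeping old versions is what lets both reads proceed without blocking on synchronization.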
Chenhao Zhang received the BS degree in Internet of Things engineering from China University of Petroleum (East China), China in 2019. He is currently pursuing a PhD degree in computer science at Beihang University, China. His main research interests include distributed file systems, storage systems, and high performance computing.
Liang Wang received the BEng and MSc degrees in electronics engineering from Harbin Institute of Technology, China in 2011 and 2013, respectively, and the PhD degree in computer science and engineering from The Chinese University of Hong Kong, China in 2017. He is currently an assistant professor with the School of Computer Science and Engineering, Beihang University, China. He was a postdoctoral research fellow at the Institute of Microelectronics, Tsinghua University, China from 2017 to 2020. His research interests include power-efficient and reliability-aware design for network-on-chip and many-core systems.
Jing Shang received the PhD degree in circuits and systems from Beijing University of Posts and Telecommunications, China in 2005. She is the chief architect of the big data domain at China Mobile Information Technology Center, China. Her current research interests include large-scale distributed systems, big data storage, cloud computing, and wide-area data analysis.
Zhiwen Xiao received the PhD degree in signal and information processing from University of Chinese Academy of Sciences, China in 2022. He is a researcher at China Mobile Information Technology Center, China. His current research interests include cloud computing, machine learning, optimization theory, large-scale distributed systems, and data mining.
Limin Xiao received the BS degree in computer science from Tsinghua University, China in 1993, and the MS and PhD degrees in computer science from Institute of Computing, Chinese Academy of Sciences, China in 1996 and 1998, respectively. He is a professor at the School of Computer Science and Engineering, Beihang University, China. He is a senior member of the China Computer Federation. His main research areas are computer architecture, computer system software, high performance computing, virtualization, and cloud computing.
Meng Han received the BS degree in computer science from Beijing University of Posts and Telecommunications, China in 2019. He is currently working toward the PhD degree in computer architecture with the School of Computer Science and Engineering, Beihang University, China. His research interests include computer architecture and deep learning accelerators.
Bing Wei received the BS degree in electrical engineering and the MS degree in computer science from Capital Normal University, China in 2012 and 2015, respectively. He is currently pursuing a PhD degree in computer science at Beihang University, China. His main research interests include file systems, high performance computing, software engineering, and clusters.
Runnan Shen received the BS degree in computer science and technology from Beihang University, China in 2020. He is currently working toward the PhD degree in computer architecture with the School of Computer Science and Engineering, Beihang University, China. His research interests include blockchain, cryptography, and distributed storage systems.
Jinquan Wang received the BS degree in software engineering from Hunan University, China in 2021. He is currently pursuing a PhD degree in computer architecture at Beihang University, China. His main research interests include hybrid storage systems, distributed storage systems, and scheduling systems.