
Scalable and adaptive log manager in distributed systems
Huan ZHOU, Weining QIAN, Xuan ZHOU, Qiwen DONG, Aoying ZHOU, Wenrong TAN
Front. Comput. Sci. ›› 2023, Vol. 17 ›› Issue (2) : 172205.
Scalable and adaptive log manager in distributed systems
On-line transaction processing (OLTP) systems rely on transaction logging and quorum-based consensus protocol to guarantee durability, high availability and strong consistency. This makes the log manager a key component of distributed database management systems (DDBMSs). The leader of DDBMSs commonly adopts a centralized logging method to writing log entries into a stable storage device and uses a constant log replication strategy to periodically synchronize its state to followers. With the advent of new hardware and high parallelism of transaction processing, the traditional centralized design of logging limits scalability, and the constant trigger condition of replication can not always maintain optimal performance under dynamic workloads.
In this paper, we propose a new log manager named Salmo with scalable logging and adaptive replication for distributed database systems. The scalable logging eliminates centralized contention by utilizing a highly concurrent data structure and speedy log hole tracking. The kernel of adaptive replication is an adaptive log shipping method, which dynamically adjusts the number of log entries transmitted between leader and followers based on the real-time workload. We implemented and evaluated Salmo in the open-sourced transaction processing systems Cedar and DBx1000. Experimental results show that Salmo scales well by increasing the number of working threads, improves peak throughput by and reduces latency by more than over log replication of Raft, and maintains efficient and stable performance under dynamic workloads all the time.
distributed database systems / transaction logging / log replication / scalable / adaptive
Huan Zhou is a lecturer of the School of Computer Science and Engineering, Southwest Minzu University (SWU), China. She received her BS in computer science and technology from Sichuan Normal University (SICNU), China in 2013 and her PhD in software engineering from East China Normal University (ECNU), China in 2019. Before she joined SWU in 2021, she had worked as a postdoctoral researcher in the School of Data Science and Engineering, ECNU, China. Her research interests include database management system and data mining
Weining Qian is a professor and dean of the School of Data Science and Engineering, East China Normal University (ECNU), China. He received his BS, MS and PhD in computer science from Fudan University, China in 1998, 2001 and 2004, respectively. He is now serving as a standing committee member of Technical Committee on Databases of China Computer Federation (CCF), a member of Expert Group on Artificial Intelligence Science and Technology Innovation of MoE and a vice president of Data Science and Knowledge Systems Engineering Committee of Systems Engineering Society of China (SESC). His research interests include scalable transaction processing, benchmarking on big data systems, big data analysis and processing, big data application and computational education
Xuan Zhou is a professor and a vice dean of the School of Data Science and Engineering, East China Normal University (ECNU), China. He obtained his BS from Fudan University, China in 2001 and his PhD from the National University of Singapore, Singapore in 2005, both in computer science. Since his graduation, he had worked as a scientist at the L3S Research Centre (Germany) and the CSIRO ICT Centre (Australia) until the end of 2010. Before he joined ECNU in 2017, he spent six years in Renmin University of China, as an associate professor. He is the winner of the Program for New Century Excellent Talents in University of Ministry of Education (MoE). His research interests include database system and information retrieval
Qiwen Dong is a professor of the School of Data Science and Engineering, East China Normal University (ECNU), China. He received his BS and MS in aerospace engineering and mechanics from Harbin Institute of Technology, China in 2000 and 2002, respectively, and his PhD in computer science from Harbin Institute of Technology, China in 2008. He was a postdoctoral researcher in the School of Computer Science of Fudan University, China from 2008 to 2010, and subsequently served as an associate professor there until 2016. In the meantime, he visited the University of Michigan, USA. His research interests include bioinformatics, data mining, big-data. etc
Aoying Zhou, Vice President of East China Normal University (ECNU), Vice President of Guizhou University (GZU), Founding Dean of School of Data Science and Engineering (DaSE) ECNU, Professor. He got his BS and MS in computer science from Sichuan University, China in 1985 and 1988, respectively, and he won his PhD from Fudan University, China in 1993. He is the winner of the National Science Fund for Distinguished Young Scholars supported by the National Natural Science Foundation of China (NSFC). He is CCF (China Computer Federation) Fellow and Associate Editor-in-Chief of Chinese Journal of Computer. He served General Chair of ER’2004, Vice PC Chair of ICDE’2009 and ICDE’2012, PC Co-chair of VLDB’2014. His research interests include Web data management, data management for data-intensive computing, in-memory cluster computing, distributed transaction processing, benchmarking for big data and performance
Wenrong Tan is a professor and dean of the School of Computer Science and Engineering, Southwest Minzu University (SWU), China. She received her BS from Sichuan University, China in 1988 and her MS from Chongqing University, China in 1991, both in computer science. She is now serving as a vice president of Sichuan Province Computer Federation and a vice president of Sichuan Institute of Artificial Intelligence. Her research interests include internet of things and natural language processing
[1] |
Stonebraker M . Concurrency control and consistency of multiple copies of data in distributed INGRES. IEEE Transactions on Software Engineering, 1979, SE-5( 3): 188– 194
|
[2] |
Corbett J C, Dean J, Epstein M, Fikes A, Frost C, .
|
[3] |
Lamport L . The part-time parliament. ACM Transactions on Computer Systems, 1998, 16( 2): 133– 169
|
[4] |
Ongaro D, Ousterhout J. In search of an understandable consensus algorithm. In: Proceedings of 2014 USENIX Conference on USENIX Annual Technical Conference. 2014, 305– 320
|
[5] |
Gray J, McJones P, Blasgen M, Lindsay B, Lorie R, Price T, Putzolu F, Traiger I . The recovery manager of the system R database manager. ACM Computing Surveys, 1981, 13( 2): 223– 243
|
[6] |
Mohan C, Haderle D, Lindsay B, Pirahesh H, Schwarz P . ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 1992, 17( 1): 94– 162
|
[7] |
Diaconu C, Freedman C, Ismert E, Larson P A, Mittal P, Stonecipher R, Verma N, Zwilling M. Hekaton: SQL server’s memory-optimized OLTP engine. In: Proceedings of 2013 ACM SIGMOD International Conference on Management of Data. 2013, 1243– 1254
|
[8] |
Levandoski J J, Lomet D B, Sengupta S, Stutsman R, Wang R. High performance transactions in deuteronomy. In: Proceedings of the 7th Biennial Conference on Innovative Data Systems Research. 2015
|
[9] |
Lim H, Kaminsky M, Andersen D G. Cicada: dependably fast multi-core in-memory transactions. In: Proceedings of 2017 ACM International Conference on Management of Data. 2017, 21– 35
|
[10] |
Johnson R, Pandis I, Stoica R, Athanassoulis M, Ailamaki A . Aether: a scalable approach to logging. Proceedings of the VLDB Endowment, 2010, 3( 1−2): 681– 692
|
[11] |
Johnson R, Pandis I, Stoica R, Athanassoulis M, Ailamaki A . Scalability of write-ahead logging on multicore and multisocket hardware. The VLDB Journal, 2012, 21( 2): 239– 263
|
[12] |
Kim K, Wang T, Johnson R, Pandis I. ERMIA: fast memory-optimized database system for heterogeneous workloads. In: Proceedings of 2016 International Conference on Management of Data. 2016, 1675– 1687
|
[13] |
Jung H, Han H, Kang S . Scalable database logging for multicores. Proceedings of the VLDB Endowment, 2017, 11( 2): 135– 148
|
[14] |
Tu S, Zheng W, Kohler E, Liskov B, Madden S. Speedy transactions in multicore in-memory databases. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles. 2013, 18– 32
|
[15] |
Zheng W, Tu S, Kohler E, Liskov B. Fast databases with fast durability and recovery through multicore parallelism. In: Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation. 2014, 465– 477
|
[16] |
Kimura H. FOEDUS: OLTP engine for a thousand cores and NVRAM. In: Proceedings of 2015 ACM SIGMOD International Conference on Management of Data. 2015, 691– 706
|
[17] |
Kim J, Jang H, Son S, Han H, Kang S, Jung H. Border-collie: a wait-free, read-optimal algorithm for database logging on multicore hardware. In: Proceedings of 2019 International Conference on Management of Data. 2019, 723– 740
|
[18] |
Xia Y, Yu X, Pavlo A, Devadas S . Taurus: lightweight parallel logging for in-memory database management systems. Proceedings of the VLDB Endowment, 2020, 14( 2): 189– 201
|
[19] |
Haubenschild M, Sauer C, Neumann T, Leis V. Rethinking logging, checkpoints, and recovery for high-performance storage engines. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 877– 892
|
[20] |
Zhou H, Guo J, Hu H, Qian W, Zhou X, Zhou A . Plover: parallel logging for replication systems. Frontiers of Computer Science, 2020, 14( 4): 144606
|
[21] |
Hong C, Zhou D, Yang M, Kuo C, Zhang L, Zhou L. KuaFu: closing the parallelism gap in database replication. In: Proceedings of the 29th International Conference on Data Engineering. 2013, 1186– 1195
|
[22] |
Poke M, Hoefler T. DARE: high-performance state machine replication on RDMA networks. In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. 2015, 107– 118
|
[23] |
Qin D, Brown A D, Goel A . Scalable replay-based replication for fast databases. Proceedings of the VLDB Endowment, 2017, 10( 13): 2025– 2036
|
[24] |
Wang T, Johnson R, Pandis I . Query fresh: log shipping on steroids. Proceedings of the VLDB Endowment, 2017, 11( 4): 406– 419
|
[25] |
Guo J, Chu J, Cai P, Zhou M, Zhou A . Low-overhead Paxos replication. Data Science and Engineering, 2017, 2( 2): 169– 177
|
[26] |
Arora V, Mittal T, Agrawal D, El-Abbadi A, Xue X, Zhi Y, Zhu J. Leader or majority: why have one when you can have both? Improving read scalability in raft-like consensus protocols. In: Proceedings of the 9th USENIX Conference on Hot Topics in Cloud Computing. 2017
|
[27] |
Cao W, Liu Z, Wang P, Chen S, Zhu C, Zheng S, Wang Y, Ma G . PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proceedings of the VLDB Endowment, 2018, 11( 12): 1849– 1862
|
[28] |
Zhang Z, Hu H, Yu Y, Qian W, Shu K. Dependency preserved raft for transactions. In: Proceedings of the 25th International Conference of Database Systems for Advanced Applications. 2020, 228– 245
|
[29] |
Pan Y, Li Y, Zhang C, Zhang R, Hong D. The design and implementation of an efficient order management system based on CEDAR. Journal of East China Normal University (Natural Science), 2018, (3): 88− 96
|
[30] |
Helland P, Sammer H, Lyon J, Carr R, Garrett P, Reuter A. Group commit timers and high volume transaction systems. In: Proceedings of the 2nd International Workshop on High Performance Transaction Systems. 1987, 301– 329
|
[31] |
Spiro P M, Joshi A M, Rengarajan T K. Designing an optimized transaction commit protocol. Digital Technical Journal, 1991, 3(1): 1– 32
|
[32] |
Howard H. ARC: analysis of raft consensus. Technical Report UCAM-CL-TR-857. Cambridge: University of Cambridge, 2014
|
[33] |
Huang D, Liu Q, Cui Q, Fang Z, Ma X, Xu F, Shen L, Tang L, Zhou Y, Huang M, Wei W, Liu C, Zhang J, Li J, Wu X, Song L, Sun R, Yu S, Zhao L, Cameron N, Pei L, Tang X . TiDB: a raft-based HTAP database. Proceedings of the VLDB Endowment, 2020, 13( 12): 3072– 3084
|
[34] |
Wang T, Johnson R . Scalable logging through emerging non-volatile memory. Proceedings of the VLDB Endowment, 2014, 7( 10): 865– 876
|
[35] |
Li K, Han F. Memory transaction engine of OceanBase. Journal of East China Normal University (Natural Science), 2014, 5: 149− 163
|
[36] |
Cooper B F, Silberstein A, Tam E, Ramakrishnan R, Sears R. Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM Symposium on Cloud Computing. 2010, 143– 154
|
[37] |
Zhu T, Zhao Z, Li F, Qian W, Zhou A, Xie D, Stutsman R, Li H, Hu H. SolarDB: toward a shared-everything database on distributed log-structured storage. ACM Transaction on Storage, 2019, 15( 2): 11
|
[38] |
Gray J. Notes on data base operating systems. In: Proceedings of Operating Systems, an Advanced Course. 1978, 393– 481
|
[39] |
Harizopoulos S, Abadi D J, Madden S, Stonebraker M. OLTP through the looking glass, and what we found there. In: Proceedings of 2008 ACM SIGMOD International Conference on Management of Data. 2008, 981– 992
|
[40] |
Huang J, Schwan K, Qureshi M K . NVRAM-aware logging in transaction systems. Proceedings of the VLDB Endowment, 2014, 8( 4): 389– 400
|
[41] |
Arulraj J, Perron M, Pavlo A . Write-behind logging. Proceedings of the VLDB Endowment, 2016, 10( 4): 337– 348
|
[42] |
Hagmann R. Reimplementing the cedar file system using logging and group commit. In: Proceedings of the Eleventh ACM Symposium on Operating Systems Principles. 1987, 155– 162
|
[43] |
Lamport L. Paxos made simple, fast, and byzantine. In: Proceedings of the 6th International Conference on Principles of Distributed Systems. 2002, 7– 9
|
[44] |
Lamport L . Fast Paxos. Distributed Computing, 2006, 19( 2): 79– 103
|
[45] |
Burrows M. The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation. 2006, 335– 350
|
[46] |
Baker J, Bond C, Corbett J C, Furman J J, Khorlin A, Larson J, Leon J M, Li Y, Lloyd A, Yushprakh V. Megastore: providing scalable, highly available storage for interactive services. In: Proceedings of the 5th Biennial Conference on Innovative Data Systems Research. 2011, 223– 234
|
[47] |
Shute J, Vingralek R, Samwel B, Handy B, Whipkey C, Rollins E, Oancea M, Littlefield K, Menestrina D, Ellner S, Cieslewicz J, Rae I, Stancescu T, Apte H . F1: a distributed SQL database that scales. Proceedings of the VLDB Endowment, 2013, 6( 11): 1068– 1079
|
[48] |
Zheng J, Lin Q, Xu J, Wei C, Zeng C, Yang P, Zhang Y . PaxosStore: high-availability storage made practical in WeChat. Proceedings of the VLDB Endowment, 2017, 10( 12): 1730– 1741
|
[49] |
Rao J, Shekita E J, Tata S . Using Paxos to build a scalable, consistent, and highly available datastore. Proceedings of the VLDB Endowment, 2011, 4( 4): 243– 254
|
[50] |
Ousterhout J, Agrawal P, Erickson D, Kozyrakis C, Leverich J, Mazières D, Mitra S, Narayanan A, Parulkar G, Rosenblum M, Rumble S M, Stratmann E, Stutsman R . The case for RAMClouds: scalable high-performance storage entirely in DRAM. ACM SIGOPS Operating Systems Review, 2010, 43( 4): 92– 105
|
/
〈 |
|
〉 |