Scalable and adaptive log manager in distributed systems

Huan ZHOU; Weining QIAN; Xuan ZHOU; Qiwen DONG; Aoying ZHOU; Wenrong TAN

doi:10.1007/s11704-022-1357-5

Front. Comput. Sci. ›› 2023, Vol. 17 ›› Issue (2) : 172205 DOI: 10.1007/s11704-022-1357-5

Software

RESEARCH ARTICLE

Scalable and adaptive log manager in distributed systems

Author information +

History +

PDF (12769KB)

Abstract

On-line transaction processing (OLTP) systems rely on transaction logging and quorum-based consensus protocol to guarantee durability, high availability and strong consistency. This makes the log manager a key component of distributed database management systems (DDBMSs). The leader of DDBMSs commonly adopts a centralized logging method to writing log entries into a stable storage device and uses a constant log replication strategy to periodically synchronize its state to followers. With the advent of new hardware and high parallelism of transaction processing, the traditional centralized design of logging limits scalability, and the constant trigger condition of replication can not always maintain optimal performance under dynamic workloads.

In this paper, we propose a new log manager named Salmo with scalable logging and adaptive replication for distributed database systems. The scalable logging eliminates centralized contention by utilizing a highly concurrent data structure and speedy log hole tracking. The kernel of adaptive replication is an adaptive log shipping method, which dynamically adjusts the number of log entries transmitted between leader and followers based on the real-time workload. We implemented and evaluated Salmo in the open-sourced transaction processing systems Cedar and DBx1000. Experimental results show that Salmo scales well by increasing the number of working threads, improves peak throughput by $1.56 ×$ and reduces latency by more than $4 ×$ over log replication of Raft, and maintains efficient and stable performance under dynamic workloads all the time.

Graphical abstract

Keywords

distributed database systems / transaction logging / log replication / scalable / adaptive

Cite this article

Download citation ▾

Huan ZHOU, Weining QIAN, Xuan ZHOU, Qiwen DONG, Aoying ZHOU, Wenrong TAN. Scalable and adaptive log manager in distributed systems. Front. Comput. Sci., 2023, 17(2): 172205 DOI:10.1007/s11704-022-1357-5

登录浏览全文

4963

注册一个新账户忘记密码

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	Stonebraker M . Concurrency control and consistency of multiple copies of data in distributed INGRES. IEEE Transactions on Software Engineering, 1979, SE-5( 3): 188– 194

[2]	Corbett J C, Dean J, Epstein M, Fikes A, Frost C, . . Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems, 2013, 31( 3): 8

[3]	Lamport L . The part-time parliament. ACM Transactions on Computer Systems, 1998, 16( 2): 133– 169

[4]	Ongaro D, Ousterhout J. In search of an understandable consensus algorithm. In: Proceedings of 2014 USENIX Conference on USENIX Annual Technical Conference. 2014, 305– 320

[5]	Gray J, McJones P, Blasgen M, Lindsay B, Lorie R, Price T, Putzolu F, Traiger I . The recovery manager of the system R database manager. ACM Computing Surveys, 1981, 13( 2): 223– 243

[6]	Mohan C, Haderle D, Lindsay B, Pirahesh H, Schwarz P . ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 1992, 17( 1): 94– 162

[7]	Diaconu C, Freedman C, Ismert E, Larson P A, Mittal P, Stonecipher R, Verma N, Zwilling M. Hekaton: SQL server’s memory-optimized OLTP engine. In: Proceedings of 2013 ACM SIGMOD International Conference on Management of Data. 2013, 1243– 1254

[8]	Levandoski J J, Lomet D B, Sengupta S, Stutsman R, Wang R. High performance transactions in deuteronomy. In: Proceedings of the 7th Biennial Conference on Innovative Data Systems Research. 2015

[9]	Lim H, Kaminsky M, Andersen D G. Cicada: dependably fast multi-core in-memory transactions. In: Proceedings of 2017 ACM International Conference on Management of Data. 2017, 21– 35

[10]	Johnson R, Pandis I, Stoica R, Athanassoulis M, Ailamaki A . Aether: a scalable approach to logging. Proceedings of the VLDB Endowment, 2010, 3( 1−2): 681– 692

[11]	Johnson R, Pandis I, Stoica R, Athanassoulis M, Ailamaki A . Scalability of write-ahead logging on multicore and multisocket hardware. The VLDB Journal, 2012, 21( 2): 239– 263

[12]	Kim K, Wang T, Johnson R, Pandis I. ERMIA: fast memory-optimized database system for heterogeneous workloads. In: Proceedings of 2016 International Conference on Management of Data. 2016, 1675– 1687

[13]	Jung H, Han H, Kang S . Scalable database logging for multicores. Proceedings of the VLDB Endowment, 2017, 11( 2): 135– 148

[14]	Tu S, Zheng W, Kohler E, Liskov B, Madden S. Speedy transactions in multicore in-memory databases. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles. 2013, 18– 32

[15]	Zheng W, Tu S, Kohler E, Liskov B. Fast databases with fast durability and recovery through multicore parallelism. In: Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation. 2014, 465– 477

[16]	Kimura H. FOEDUS: OLTP engine for a thousand cores and NVRAM. In: Proceedings of 2015 ACM SIGMOD International Conference on Management of Data. 2015, 691– 706

[17]	Kim J, Jang H, Son S, Han H, Kang S, Jung H. Border-collie: a wait-free, read-optimal algorithm for database logging on multicore hardware. In: Proceedings of 2019 International Conference on Management of Data. 2019, 723– 740

[18]	Xia Y, Yu X, Pavlo A, Devadas S . Taurus: lightweight parallel logging for in-memory database management systems. Proceedings of the VLDB Endowment, 2020, 14( 2): 189– 201

[19]	Haubenschild M, Sauer C, Neumann T, Leis V. Rethinking logging, checkpoints, and recovery for high-performance storage engines. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 877– 892

[20]	Zhou H, Guo J, Hu H, Qian W, Zhou X, Zhou A . Plover: parallel logging for replication systems. Frontiers of Computer Science, 2020, 14( 4): 144606

[21]	Hong C, Zhou D, Yang M, Kuo C, Zhang L, Zhou L. KuaFu: closing the parallelism gap in database replication. In: Proceedings of the 29th International Conference on Data Engineering. 2013, 1186– 1195

[22]	Poke M, Hoefler T. DARE: high-performance state machine replication on RDMA networks. In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. 2015, 107– 118

[23]	Qin D, Brown A D, Goel A . Scalable replay-based replication for fast databases. Proceedings of the VLDB Endowment, 2017, 10( 13): 2025– 2036

[24]	Wang T, Johnson R, Pandis I . Query fresh: log shipping on steroids. Proceedings of the VLDB Endowment, 2017, 11( 4): 406– 419

[25]	Guo J, Chu J, Cai P, Zhou M, Zhou A . Low-overhead Paxos replication. Data Science and Engineering, 2017, 2( 2): 169– 177

[26]	Arora V, Mittal T, Agrawal D, El-Abbadi A, Xue X, Zhi Y, Zhu J. Leader or majority: why have one when you can have both? Improving read scalability in raft-like consensus protocols. In: Proceedings of the 9th USENIX Conference on Hot Topics in Cloud Computing. 2017

[27]	Cao W, Liu Z, Wang P, Chen S, Zhu C, Zheng S, Wang Y, Ma G . PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proceedings of the VLDB Endowment, 2018, 11( 12): 1849– 1862

[28]	Zhang Z, Hu H, Yu Y, Qian W, Shu K. Dependency preserved raft for transactions. In: Proceedings of the 25th International Conference of Database Systems for Advanced Applications. 2020, 228– 245

[29]	Pan Y, Li Y, Zhang C, Zhang R, Hong D. The design and implementation of an efficient order management system based on CEDAR. Journal of East China Normal University (Natural Science), 2018, (3): 88− 96

[30]	Helland P, Sammer H, Lyon J, Carr R, Garrett P, Reuter A. Group commit timers and high volume transaction systems. In: Proceedings of the 2nd International Workshop on High Performance Transaction Systems. 1987, 301– 329

[31]	Spiro P M, Joshi A M, Rengarajan T K. Designing an optimized transaction commit protocol. Digital Technical Journal, 1991, 3(1): 1– 32

[32]	Howard H. ARC: analysis of raft consensus. Technical Report UCAM-CL-TR-857. Cambridge: University of Cambridge, 2014

[33]	Huang D, Liu Q, Cui Q, Fang Z, Ma X, Xu F, Shen L, Tang L, Zhou Y, Huang M, Wei W, Liu C, Zhang J, Li J, Wu X, Song L, Sun R, Yu S, Zhao L, Cameron N, Pei L, Tang X . TiDB: a raft-based HTAP database. Proceedings of the VLDB Endowment, 2020, 13( 12): 3072– 3084

[34]	Wang T, Johnson R . Scalable logging through emerging non-volatile memory. Proceedings of the VLDB Endowment, 2014, 7( 10): 865– 876

[35]	Li K, Han F. Memory transaction engine of OceanBase. Journal of East China Normal University (Natural Science), 2014, 5: 149− 163

[36]	Cooper B F, Silberstein A, Tam E, Ramakrishnan R, Sears R. Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM Symposium on Cloud Computing. 2010, 143– 154

[37]	Zhu T, Zhao Z, Li F, Qian W, Zhou A, Xie D, Stutsman R, Li H, Hu H. SolarDB: toward a shared-everything database on distributed log-structured storage. ACM Transaction on Storage, 2019, 15( 2): 11

[38]	Gray J. Notes on data base operating systems. In: Proceedings of Operating Systems, an Advanced Course. 1978, 393– 481

[39]	Harizopoulos S, Abadi D J, Madden S, Stonebraker M. OLTP through the looking glass, and what we found there. In: Proceedings of 2008 ACM SIGMOD International Conference on Management of Data. 2008, 981– 992

[40]	Huang J, Schwan K, Qureshi M K . NVRAM-aware logging in transaction systems. Proceedings of the VLDB Endowment, 2014, 8( 4): 389– 400

[41]	Arulraj J, Perron M, Pavlo A . Write-behind logging. Proceedings of the VLDB Endowment, 2016, 10( 4): 337– 348

[42]	Hagmann R. Reimplementing the cedar file system using logging and group commit. In: Proceedings of the Eleventh ACM Symposium on Operating Systems Principles. 1987, 155– 162

[43]	Lamport L. Paxos made simple, fast, and byzantine. In: Proceedings of the 6th International Conference on Principles of Distributed Systems. 2002, 7– 9

[44]	Lamport L . Fast Paxos. Distributed Computing, 2006, 19( 2): 79– 103

[45]	Burrows M. The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation. 2006, 335– 350

[46]	Baker J, Bond C, Corbett J C, Furman J J, Khorlin A, Larson J, Leon J M, Li Y, Lloyd A, Yushprakh V. Megastore: providing scalable, highly available storage for interactive services. In: Proceedings of the 5th Biennial Conference on Innovative Data Systems Research. 2011, 223– 234

[47]	Shute J, Vingralek R, Samwel B, Handy B, Whipkey C, Rollins E, Oancea M, Littlefield K, Menestrina D, Ellner S, Cieslewicz J, Rae I, Stancescu T, Apte H . F1: a distributed SQL database that scales. Proceedings of the VLDB Endowment, 2013, 6( 11): 1068– 1079

[48]	Zheng J, Lin Q, Xu J, Wei C, Zeng C, Yang P, Zhang Y . PaxosStore: high-availability storage made practical in WeChat. Proceedings of the VLDB Endowment, 2017, 10( 12): 1730– 1741

[49]	Rao J, Shekita E J, Tata S . Using Paxos to build a scalable, consistent, and highly available datastore. Proceedings of the VLDB Endowment, 2011, 4( 4): 243– 254

[50]	Ousterhout J, Agrawal P, Erickson D, Kozyrakis C, Leverich J, Mazières D, Mitra S, Narayanan A, Parulkar G, Rosenblum M, Rumble S M, Stratmann E, Stutsman R . The case for RAMClouds: scalable high-performance storage entirely in DRAM. ACM SIGOPS Operating Systems Review, 2010, 43( 4): 92– 105