Efficient and stable quorum-based log replication and replay for modern cluster-databases

Donghui WANG, Peng CAI, Weining QIAN, Aoying ZHOU

Front. Comput. Sci. ›› 2022, Vol. 16 ›› Issue (5) : 165612. DOI: 10.1007/s11704-020-0210-y
Information Systems
RESEARCH ARTICLE

Abstract

Modern in-memory databases (IMDBs) can support highly concurrent on-line transaction processing (OLTP) workloads and generate massive volumes of transactional logs per second. Quorum-based replication protocols such as Paxos and Raft are widely used in distributed databases to provide high availability and fault tolerance. However, replicating an IMDB is non-trivial because the high transaction rate brings new challenges. First, the leader node in quorum replication must adapt to varying transaction arrival rates and to the processing capability of the follower nodes. Second, followers must replay logs under high concurrency to catch up with the state of the leader and so reduce the visibility gap. Third, modern databases are often built on clusters of commodity machines connected by low-end networks, in which network anomalies are common. In this case, performance degrades significantly because a follower node can fall into a long-running exception-handling process (e.g., fetching lost logs from the leader). To this end, we build QuorumX, an efficient and stable quorum-based replication framework for IMDBs under heavy OLTP workloads. QuorumX combines critical-path-based batching and pipeline batching into an adaptive log propagation scheme that delivers stable, high performance across a range of settings. Further, we propose a safe and coordination-free log replay scheme that minimizes the visibility gap between the leader and follower IMDBs. We also carefully design the processing on the follower node to alleviate the influence of an unreliable network on replication performance. Our evaluation with YCSB, TPC-C, and a realistic micro-benchmark demonstrates that QuorumX achieves performance close to that of asynchronous primary-backup replication and consistently provides a stable service with data consistency and a small visibility gap.
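
As a minimal illustration of the majority rule that quorum-based protocols such as Paxos and Raft rely on, the sketch below computes the highest log position that a leader could treat as committed. It is a generic example, not QuorumX's implementation; the function name `quorum_commit_index` and its parameters are hypothetical and do not come from the paper.

```python
def quorum_commit_index(acked_lsns, cluster_size):
    """Return the highest log sequence number (LSN) held by a majority.

    acked_lsns: highest LSN each replica, leader included, has persisted.
    Both names are illustrative assumptions, not taken from the paper.
    """
    if len(acked_lsns) != cluster_size:
        raise ValueError("expected one acknowledged LSN per replica")
    majority = cluster_size // 2 + 1
    # Sorted in descending order, the majority-th largest LSN is held by at
    # least `majority` replicas, so every entry up to it is safe to commit.
    return sorted(acked_lsns, reverse=True)[majority - 1]


# Leader at LSN 120, followers at 118 and 90: two of three replicas hold 118.
print(quorum_commit_index([120, 118, 90], cluster_size=3))
```

A slow or disconnected follower (LSN 90 here) does not hold back the commit point as long as a majority has persisted the entry, which is why a lagging follower affects the visibility gap rather than commit latency.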

Keywords

log replication / log replay / consensus protocol / high performance / high availability / quorum / unreliable network / packet loss

Cite this article

Donghui WANG, Peng CAI, Weining QIAN, Aoying ZHOU. Efficient and stable quorum-based log replication and replay for modern cluster-databases. Front. Comput. Sci., 2022, 16(5): 165612 https://doi.org/10.1007/s11704-020-0210-y

Acknowledgements

This work was partially supported by the National Key R&D Program of China (2018YFB1003404), the NSFC (Grant Nos. 61972149, 61977026), and the ECNU Academic Innovation Promotion Program for Excellent Doctoral Students.

RIGHTS & PERMISSIONS

2022 Higher Education Press