Localitycache: Toward efficient straggler tolerance in LRC-coded storage via caching local parity blocks

Ximeng Chen; Si Wu; Yinlong Xu

doi:10.1016/j.hcc.2025.100339

High-Confidence Computing ›› 2026, Vol. 6 ›› Issue (1) :100339 DOI: 10.1016/j.hcc.2025.100339

Research Articles

research-article

Localitycache: Toward efficient straggler tolerance in LRC-coded storage via caching local parity blocks

Author information +

History +

PDF (813KB)

Abstract

Modern distributed storage systems increasingly employ Locally Repairable Codes (LRCs) to provide reliable, low-cost data storage with high repair efficiency. However, the presence of stragglers, i.e., nodes that unpredictably slow down, can significantly impact access latency. Traditional approaches for handling stragglers, such as detection, blacklisting, or speculative execution, are often insufficient for efficient straggler tolerance. In this paper, we show how an in-memory caching strategy coupled with LRCs can bypass stragglers without relying on precise straggler detection. We propose LocalityCache, a novel in-memory caching mechanism designed for LRC-coded distributed storage systems, which effectively mitigates the impact of stragglers by caching local parity blocks. We provide theoretical guarantees for LocalityCache and show that caching local parity blocks minimizes the likelihood of encountering stragglers. Additionally, we devise optimized workflows for write, read, and repair operations under LocalityCache to ensure system efficiency. We implement LocalityCache in a distributed key-value store prototype atop Redis. Our extensive testbed evaluations show that LocalityCache can significantly reduce read latency of the baselines by up to 73.6% in the presence of stragglers.

Keywords

Locally Repairable Codes / Straggler tolerance / System performance / Distributed storage

Cite this article

Download citation ▾

Ximeng Chen, Si Wu, Yinlong Xu. Localitycache: Toward efficient straggler tolerance in LRC-coded storage via caching local parity blocks. High-Confidence Computing, 2026, 6(1): 100339 DOI:10.1016/j.hcc.2025.100339

登录浏览全文

4963

注册一个新账户忘记密码

CRediT authorship contribution statement

Ximeng Chen: Writing - original draft, Methodology, Data curation, Project administration, Formal analysis, Software, Validation, Visualization, Investigation, Conceptualization. Si Wu: Supervision, Conceptualization, Project administration, Writing - review & editing. Yinlong Xu: Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by NSFC (62202440).

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	D. Ford, F. Labelle, F.I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, S. Quinlan, Availability in globally distributed storage systems, in: Proc. of USENIX OSDI, 2010.

[2]	Y. He, Z. Zhou, Y. Pan, F. Chong, B. Wu, K. Xiao, H. Li, Review of data security within energy blockchain: A comprehensive analysis of storage, management, and utilization, High-Confid. Comput. 4 (3) (2024) 100233.

[3]	C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, S. Yekhanin, Erasure coding in windows azure storage, in: Proc. of USENIX ATC, 2012.

[4]	M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A.G. Dimakis, R. Vadali, S. Chen, D. Borthakur, Xoring elephants: Novel erasure codes for big data,in:Proc. of VLDB Endowment, 2013.

[5]	S. Muralidhar, W. Lloyd, S. Roy, C. Hill, E. Lin, W. Liu, S. Pan, S. Shankar, V. Sivakumar, L. Tang, et al., f4: Facebook’s warm BLOB storage system,in:Proc. of USENIX OSDI, 2014.

[6]	H. Weatherspoon, J.D. Kubiatowicz, Erasure coding vs. replication: A quantitative comparison,in:Proc. of IPTPS, 2002.

[7]	Ceph - Locally Repairable Erasure Code Plugin, https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/.

[8]	CubeFS - Erasure Code Subsystem, https://www.cubefs.io/.

[9]	O. Kolosov, G. Yadgar, M. Liram, I. Tamo, A. Barg, On fault tolerance, locality, and optimality in locally repairable codes, ACM Trans. Storage 16 (2) (2020) 1-32.

[10]	S. Kadekodi, S. Silas, D. Clausen, A. Merchant, Practical design considerations for wide locally recoverable codes (LRCs), ACM Trans. Storage 19 (4) (2023) 1-26.

[11]	H.S. Gunawi, R.O. Suminto, R. Sears, C. Golliher, S. Sundararaman, X. Lin, T. Emami, W. Sheng, N. Bidokhti, C. McCaffrey, et al., Fail-slow at scale: Evidence of hardware performance faults in large production systems, ACM Trans. Storage (TOS) 14 (3) (2018) 1-26.

[12]	P. Huang, C. Guo, L. Zhou, J.R. Lorch, Y. Dang, M. Chintalapati, R. Yao, Gray failure: The achilles’ heel of cloud-scale systems, in: Proc. of ACM HotOS, 2017.

[13]	Q. Yu, M.A. Maddah-Ali, A.S. Avestimehr, Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding, IEEE Trans. Inform. Theory 66 (3) (2020) 1920-1933.

[14]	W. Halbawi, N. Azizan, F. Salehi, B. Hassibi, Improving distributed gradient descent using reed-solomon codes, in: Proc. of IEEE ISIT, 2018.

[15]	X. Ren, G. Ananthanarayanan, A. Wierman, M. Yu, Hopper: Decentralized speculation-aware cluster scheduling at scale, in: Proc. of ACM SIGCOMM, 2015.

[16]	G. Ananthanarayanan, A. Ghodsi, S. Shenker, I. Stoica, Effective straggler mitigation: Attack of the clones,in:Proc. of USENIX NSDI, 2013.

[17]	K. Rashmi, M. Chowdhury, J. Kosaian, I. Stoica, K. Ramchandran, EC-Cache:Load-balanced, Low-Latency cluster caching with online erasure coding, in: Proc. of USENIX OSDI, 2016.

[18]	Y. Xu, Z. Musgrave, B. Noble, M. Bailey, Bobtail: Avoiding long tails in the cloud,in:Proc. of USENIX NSDI, 2013.

[19]	Z. Cao, V. Tarasov, H.P. Raman, D. Hildebrand, E. Zadok, On the performance variation in modern storage stacks, in: Proc. of USENIX FAST, 2017.

[20]	T. Zhu, A. Tumanov, M.A. Kozuch, M. Harchol-Balter, G.R. Ganger, Prioritymeister: Tail latency qos for shared networked storage,in:Proc. of ACM Symposium on Cloud Computing, 2014.

[21]	M. Zhang, Q. Wang, Z. Shen, P.P. Lee, POCache: Toward robust and configurable straggler tolerance with parity-only caching, J. Parallel Distrib. Comput. 167 (2022) 157-172.

[22]	J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107-113.

[23]	G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, E. Harris, Reining in the outliers in Map-Reduce clusters using mantri, in: Proc. of USENIX OSDI, 2010.

[24]	M. Zaharia, A. Konwinski, A.D. Joseph, R.H. Katz, I. Stoica, Improving MapReduce performance in heterogeneous environments., in: Proc. of USENIX OSDI, 2008.

[25]	Q. Huang, H. Gudmundsdottir, Y. Vigfusson, D.A. Freedman, K. Birman, R. Van Renesse, Characterizing load imbalance in real-world networked caches, in: Proc. of ACM Workshop on Hot Topics in Networks, 2014.

[26]	S. Melnik, A. Gubarev, J.J. Long, G. Romer, S. Shivakumar, M. Tolton, T. Vassilakis, Dremel: interactive analysis of web-scale datasets, Proc. VLDB Endow. 3 (1-2) (2010) 330-339.

[27]	R. Halalai, P. Felber, A.-M. Kermarrec, F. Taïani, Agar: A caching system for erasure-coded data, in: Proc. of IEEE ICDCS, 2017.

[28]	Redis, https://redis.io/.

[29]	M. Chowdhury, S. Kandula, I. Stoica, Leveraging endpoint flexibility in dataintensive clusters, ACM SIGCOMM Comput. Commun. Rev. 43 (4) (2013) 231-242.

[30]	Y. Hu, L. Cheng, Q. Yao, P.P. Lee, W. Wang, W. Chen, Exploiting combined locality for Wide-Stripe erasure coding in distributed storage, in: Proc. of USENIX FAST, 2021.

[31]	I. Reed, G. Solomon, Polynomial codes over certain finite fields, J. Soc. Ind. Appl. Math. 8 (2) (1960) 300-304.

[32]	K.V. Rashmi, N.B. Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran, A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the facebook warehouse cluster,in:Proc. of USENIX HotStorage, 2013.

[33]	L. Gao, W. Li, H. Ma, Y. Liu, C. Li, Data cube-based storage optimization for resource-constrained edge computing, High-Confid. Comput. 4 (4) (2024) 100212.

[34]	M.F. Aktaş, E. Soljanin, Straggler mitigation at scale, IEEE/ACM Trans. Netw. 27 (6) (2019) 2266-2279.

[35]	M. Xia, M. Saxena, M. Blaum, D.A. Pease, A tale of two erasure codes in HDFS, in: Proc. of USENIX FAST, 2015.

[36]	K. Liu, J. Peng, J. Wang, J. Pan, Optimal caching for low latency in distributed coded storage systems, IEEE/ACM Trans. Netw. 30 (3) (2021) 1132-1145.

[37]	M. Abebe, K. Daudjee, B. Glasbergen, Y. Tian, EC-store: Bridging the gap between storage and latency in distributed erasure coded systems, in: Proc. of IEEE ICDCS, 2018.

[38]	V. Aggarwal, Y.-F.R. Chen, T. Lan, Y. Xiang, Sprout: A functional caching approach to minimize service latency in erasure-coded storage, IEEE/ACM Trans. Netw. 25 (6) (2017) 3683-3694.

[39]	A. Vulimiri, C. Curino, P.B. Godfrey, T. Jungblut, J. Padhye, G. Varghese, Global analytics in the face of bandwidth and regulatory constraints, in: Proc. of USENIX NSDI, 2015.

[40]	Erasure code - Ceph documentation, https://docs.ceph.com/docs/master/rados/operations/erasure-code/.

[41]	Alibaba YaLanTingLibs, https://github.com/alibaba/yalantinglibs.

[42]	J.S. Plank, J. Luo, C.D. Schuman, L. Xu, Z. Wilcox-O’Hearn, et al., A performance evaluation and examination of open-source erasure coding libraries for storage., in: Proc. of USENIX FAST, 2009.

[43]	The Wonder Shaper 1.4, https://github.com/magnific0/wondershaper/.

[44]	C. Huang, M. Chen, J. Li, Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems, ACM Trans. Storage (TOS) 9 (1) (2013) 1-28.

[45]	I. Tamo, A. Barg, A family of optimal locally recoverable codes, IEEE Trans. Inform. Theory 60 (8) (2014) 4661-4676.

[46]	P. Gopalan, C. Huang, H. Simitci, S. Yekhanin, On the locality of codeword symbols, IEEE Trans. Inform. Theory 58 (11) (2012) 6925-6934.

[47]	A.S. Rawat, D.S. Papailiopoulos, A.G. Dimakis, S. Vishwanath, Locality and availability in distributed storage, IEEE Trans. Inform. Theory 62 (8) (2016) 4481-4493.

[48]	N. Silberstein, A.S. Rawat, O.O. Koyluoglu, S. Vishwanath, Optimal locally repairable codes via rank-metric codes, in: Proc. of IEEE ISIT, 2013.

[49]	I. Tamo, D.S. Papailiopoulos, A.G. Dimakis, Optimal locally repairable codes and connections to matroid theory, IEEE Trans. Inform. Theory 62 (12) (2016) 6661-6671.

[50]	S. Wu, Z. Shen, P.P. Lee, On the optimal repair-scaling trade-off in locally repairable codes, in: Proc. of IEEE INFOCOM, 2020.

[51]	J. Hao, K.W. Shum, S.-T. Xia, D. Li, Optimal locally repairable codes for parallel reading, IEEE Access 8 (2020) 80447-80453.

[52]	J. Dean, Achieving rapid response times in large online services, in: Berkeley AMPLab Cloud Seminar, 2012.

[53]	N.J. Yadwadkar, W. Choi, Proactive straggler avoidance using machine learning, White Pap. Univ. Berkeley (2012).

[54]	R.O. Suminto, C.A. Stuardo, A. Clark, H. Ke, T. Leesatapornwongsa, B. Fu, D.H. Kurniawan, V. Martin, M.R.G. Uma, H.S. Gunawi, Pbse: A robust path-based speculative execution for degraded-network tail tolerance in data-parallel frameworks,in:Proc. of Symposium on Cloud Computing, 2017.

[55]	J. Xiong, M. Wei, Z. Lu, Y. Liu, Assessing the effectiveness of crawlers and large language models in detecting adversarial hidden link threats in meta computing, High-Confid. Comput. (2024) 100292.

[56]	Q. Xia, W. Ye, Z. Tao, J. Wu, Q. Li, A survey of federated learning for edge computing: Research problems and solutions, High-Confid. Comput. 1 (1) (2021) 100008.

[57]	H. Bommala, R. Aluvalu, S. Mudrakola, et al., Machine learning job failure analysis and prediction model for the cloud environment, High-Confid. Comput. 3 (4) (2023) 100165.

[58]	B. Fitzpatrick, Distributed caching with memcached, Linux J. 2004 (124) (2004) 5.

[59]	R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H.C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, et al., Scaling memcache at facebook,in:Proc. of USENIX NSDI, 2013.