AutoCache: an online and automatic caching solution for Spark

Hui LI; Shuping JI; Yang LI; Yujie QIAO; Huayi SUI; Zhen TANG; Wei CHEN; Zheng QIN; Wei WANG; Hua ZHONG; Tao HUANG


Front. Comput. Sci., 2026, Vol. 20, Issue 5: 2005108. DOI: 10.1007/s11704-025-40776-9

Architecture

RESEARCH ARTICLE



Abstract

Apache Spark is one of the most popular in-memory distributed computing frameworks for processing large-scale datasets, and caching is indispensable for the performance of Spark applications. However, using the cache properly in Spark is non-trivial. Developers must manually cache hot data and evict unneeded data in a timely manner, which requires a deep understanding of Spark and considerable experience. Wrong caching decisions can lead to performance degradation, application bugs, and even system crashes. To overcome these challenges, we propose AutoCache, a non-intrusive caching solution that identifies hot datasets and caches them automatically during the execution of a workload, without changing any application code. For a given Spark application, AutoCache first parses the execution paths of the application and then analyzes the data references based on the DAG maintained within Spark. It then heuristically identifies the datasets, in the form of RDDs, that will be accessed multiple times at run time. As the application executes, AutoCache automatically caches and evicts these RDDs on the fly by invoking Spark's underlying APIs. We evaluate AutoCache using an open-source benchmark that contains various applications. The experimental results show that AutoCache significantly improves the performance of real-world applications and clearly outperforms related work. Moreover, by comparing the caching decisions of AutoCache with the existing manually written caching logic in these applications, we detected nine previously unknown caching-related issues. All nine have been confirmed, and five have been fixed, by the respective developers, which provides further evidence of the effectiveness of AutoCache.
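The idea of finding datasets that are referenced multiple times in a lineage DAG can be illustrated with a small sketch. This is not AutoCache's actual algorithm or rule set, which the abstract does not spell out. It is a minimal reference-counting heuristic in plain Python, under the assumption that a dataset is worth caching when it has at least two distinct consumers (child datasets or job actions). The function `plan_caching`, the `lineage` mapping, and the dataset names are all hypothetical.

```python
from collections import defaultdict


def plan_caching(parents, job_targets):
    """Return {dataset: (first_job, last_job)} for datasets worth caching.

    parents:     dict mapping a dataset name to the list of datasets it is
                 derived from (a hypothetical stand-in for the lineage DAG).
    job_targets: ordered list with the final dataset of each job (one job
                 per action), in execution order.

    Assumed heuristic: a dataset is a caching candidate when it has two or
    more distinct consumers. It would be cached before `first_job` and
    evicted after `last_job`.
    """
    consumers = defaultdict(set)  # dataset -> distinct consumers seen so far
    first_use, last_use = {}, {}

    for job_id, target in enumerate(job_targets):
        # The action that triggers the job consumes the target dataset.
        consumers[target].add(("action", job_id))

        # Walk the lineage backwards from the target of this job.
        stack, seen = [target], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            first_use.setdefault(node, job_id)
            last_use[node] = job_id
            for parent in parents.get(node, []):
                # Each child dataset counts once as a consumer of its parent.
                consumers[parent].add(("dataset", node))
                stack.append(parent)

    return {
        dataset: (first_use[dataset], last_use[dataset])
        for dataset, users in consumers.items()
        if len(users) >= 2
    }


# Hypothetical application: two jobs both derive their result from `parsed`.
lineage = {
    "parsed": ["raw"],
    "counts": ["parsed"],
    "errors": ["parsed"],
}
jobs = ["counts", "errors"]

plan = plan_caching(lineage, jobs)
print(plan)
```

In this example only `parsed` is selected, for the span from job 0 to job 1: `raw` has a single consumer, so caching `parsed` already avoids recomputing it. In a real Spark program the corresponding manual calls would be `RDD.persist()` (or `RDD.cache()`) before the first job and `RDD.unpersist()` after the last one.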


Keywords

AutoCache / Spark / in-memory data processing / RDD history graph / heuristic rules

Cite this article


Hui LI, Shuping JI, Yang LI, Yujie QIAO, Huayi SUI, Zhen TANG, Wei CHEN, Zheng QIN, Wei WANG, Hua ZHONG, Tao HUANG. AutoCache: an online and automatic caching solution for Spark. Front. Comput. Sci., 2026, 20(5): 2005108 DOI:10.1007/s11704-025-40776-9

