AutoCache: an online and automatic caching solution for Spark
Hui LI, Shuping JI, Yang LI, Yujie QIAO, Huayi SUI, Zhen TANG, Wei CHEN, Zheng QIN, Wei WANG, Hua ZHONG, Tao HUANG
Front. Comput. Sci., 2026, Vol. 20, Issue (5): 2005108
Apache Spark is one of the most popular in-memory distributed computing frameworks for processing large-scale datasets, and caching is indispensable for improving the performance of Spark applications. However, proper cache usage in Spark is non-trivial. To achieve good performance, developers must manually cache hot data and evict unnecessary data in a timely manner within their applications. This requires deep understanding and substantial experience; wrong caching decisions can lead to performance degradation, application bugs, and even system crashes. To overcome these challenges, we propose AutoCache, a non-intrusive caching solution: it identifies hot datasets and caches them automatically during the execution of a workload, without any change to application code. For a given Spark application, AutoCache first parses the execution paths of the application and then analyzes the data references based on the DAG maintained within Spark. After that, AutoCache heuristically identifies the datasets, in the form of RDDs, that will be accessed multiple times at run time. As the application executes, AutoCache automatically caches and evicts these RDDs by invoking Spark’s underlying APIs on the fly. We evaluate AutoCache using an open-source benchmark that contains various applications. Our experimental results show that AutoCache significantly improves the performance of real-world applications and clearly outperforms related work. Moreover, by comparing the caching decisions of AutoCache with the existing manually written caching logic in these applications, we detected nine previously unknown caching-related issues; all of them have been confirmed, and five have been fixed by the respective developers. This provides further strong evidence of the effectiveness of AutoCache.
AutoCache / Spark / in-memory data processing / RDD history graph / heuristic rules
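
The idea in the abstract of finding RDDs that are accessed multiple times can be illustrated with a small, self-contained sketch. It is not AutoCache's implementation and does not use Spark: the lineage graph, the job list, and the functions `ancestors` and `plan_caching` are hypothetical names introduced here, and the single rule applied (cache an RDD needed by more than one job, and release it after the last such job) is an assumed simplification of the paper's heuristic rules.

```python
from collections import defaultdict

# Hypothetical toy lineage (not from the paper): RDD name -> parent RDD names.
LINEAGE = {
    "lines": [],
    "parsed": ["lines"],
    "filtered": ["parsed"],
    "counts": ["filtered"],
    "top": ["filtered"],
}

# Hypothetical action order: each entry is the RDD an action is called on,
# so each entry corresponds to one job.
JOBS = ["counts", "top"]


def ancestors(rdd, lineage):
    """Return the RDD itself plus every RDD it transitively depends on."""
    seen, stack = set(), [rdd]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(lineage[node])
    return seen


def plan_caching(lineage, jobs):
    """Map each RDD worth caching to (first job using it, last job using it).

    Assumed rule: an RDD is a candidate when more than one job needs it.
    Reuse inside a single job (for example a diamond-shaped lineage) is
    deliberately ignored to keep the sketch short.
    """
    # Record, for every RDD, the indices of the jobs whose lineage contains it.
    used_by = defaultdict(list)
    for job_index, target in enumerate(jobs):
        for rdd in ancestors(target, lineage):
            used_by[rdd].append(job_index)

    # Invert the lineage so each RDD knows its direct children.
    children = defaultdict(list)
    for rdd, parents in lineage.items():
        for parent in parents:
            children[parent].append(rdd)

    plan = {}
    for rdd, job_indices in used_by.items():
        if len(job_indices) < 2:
            continue  # needed by a single job only: no cross-job reuse
        # Only children that some job actually needs are relevant.
        kids = [c for c in children[rdd] if c in used_by]
        # If a single child is needed by exactly the same jobs, caching that
        # child already covers this RDD, so caching both would waste memory.
        if len(kids) == 1 and used_by.get(kids[0]) == job_indices:
            continue
        plan[rdd] = (job_indices[0], job_indices[-1])
    return plan


print(plan_caching(LINEAGE, JOBS))  # {'filtered': (0, 1)}
```

In this toy case only `filtered` is selected: both jobs need it, while `lines` and `parsed` are reached only through it. In a real Spark application, the manual equivalent would be calling `persist()` (or `cache()`) on that RDD before the first action and `unpersist()` after the last one, which is the kind of decision the abstract says AutoCache makes without changes to application code.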
Higher Education Press