Superior F1-score: I/O feature driven algorithms for stream computing systems workload identification

Yuxiao HAN; Yubo LIU; Ziyan ZHANG; Fei LI; Zhiguang CHEN; Nong XIAO

doi:10.1007/s11704-024-40710-5

Front. Comput. Sci., 2026, Vol. 20, Issue 5: 2005102

Architecture

RESEARCH ARTICLE

Abstract

Workload identification is fundamental to resource management in stream computing systems and is a key factor in improving their cost-effectiveness. However, existing workload identification algorithms often fail to handle the diversity of workload types and the complexity of deployment environments, so they usually cannot provide guidance for improving the performance of stream computing systems.

In this work, we propose two workload identification algorithms for different scenarios. The first is the Fine-Grained I/O traces Workload Identification (FGWI) algorithm, which suits systems that are not sensitive to overhead and primarily pursue a high identification F1-score. FGWI analyzes the basic, time, spatial, and temporal access features of every I/O operation and then uses CatBoost to classify the workloads, meeting the high F1-score requirement. The second is a simplified version of FGWI called Aggregated I/O traces Workload Identification (AWI), which focuses mainly on the temporal access features of minute-level aggregated I/O traces to reduce overhead. We evaluate the two algorithms with experiments driven by traces collected from Alibaba Cloud. Experimental results demonstrate that FGWI achieves an average 8.2% improvement in F1-score over the state-of-the-art algorithms, while AWI incurs a time overhead of only 0.22% relative to FGWI and still achieves an average 6.8% improvement in F1-score over the state-of-the-art algorithms. Both algorithms are robust and scalable across disks, demonstrating their effectiveness for workload identification.
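To make the idea of minute-level aggregation concrete, the following is a minimal sketch of collapsing per-operation I/O records into one feature summary per minute. It is not the authors' AWI implementation: the record layout (`IORecord` and its fields), the function name `aggregate_by_minute`, and the specific per-minute features are all assumptions chosen for illustration, since the abstract does not list the features AWI actually uses.

```python
from collections import defaultdict, namedtuple

# Hypothetical record layout for one I/O operation (not taken from the paper):
# timestamp_s = seconds since trace start, op = "R" or "W",
# offset = byte offset on the disk, size = bytes transferred.
IORecord = namedtuple("IORecord", ["timestamp_s", "op", "offset", "size"])


def aggregate_by_minute(records):
    """Collapse per-operation records into one feature dict per minute."""
    buckets = defaultdict(list)
    for rec in records:
        # Integer minute index of the operation within the trace.
        buckets[int(rec.timestamp_s // 60)].append(rec)

    features = {}
    for minute, recs in sorted(buckets.items()):
        reads = [r for r in recs if r.op == "R"]
        writes = [r for r in recs if r.op == "W"]
        # Illustrative features only; the real feature set is an open assumption.
        features[minute] = {
            "read_count": len(reads),
            "write_count": len(writes),
            "read_bytes": sum(r.size for r in reads),
            "write_bytes": sum(r.size for r in writes),
            "write_ratio": len(writes) / len(recs),
        }
    return features


# Tiny synthetic trace: two operations in minute 0, two in minute 1.
trace = [
    IORecord(1.0, "R", 0, 4096),
    IORecord(30.5, "W", 8192, 8192),
    IORecord(61.2, "W", 4096, 4096),
    IORecord(119.9, "W", 16384, 4096),
]
per_minute = aggregate_by_minute(trace)
```

A classifier would then consume one small feature vector per minute instead of one per I/O operation, which is the general reason aggregation reduces processing cost; the overhead figures reported above come from the paper's own experiments, not from this sketch.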

Keywords

stream computing system / workload identification / I/O feature

Cite this article

Yuxiao HAN, Yubo LIU, Ziyan ZHANG, Fei LI, Zhiguang CHEN, Nong XIAO. Superior F1-score: I/O feature driven algorithms for stream computing systems workload identification. Front. Comput. Sci., 2026, 20(5): 2005102. DOI: 10.1007/s11704-024-40710-5
