MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency

Yixuan DU; Edwin Hsing-Mean SHA; Yuhong SONG; Yibo GUO; Longshan XU; Qingfeng ZHUGE

doi:10.1007/s11704-023-3566-y

PDF(5990 KB)

Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (5) : 195107. DOI: 10.1007/s11704-023-3566-y

Architecture

RESEARCH ARTICLE

MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency

Author information +

History +

Abstract

Scratch-pad memory (SPM) has been widely used in embedded systems because it allows software-controlled data placement. By designing data placement strategies, optimal solutions with minimal memory access latency for loops on SPM-DRAM architecture can be explored. Although existing works effectively reduce the latency by using fine-grained data placement methods, they fail in solving the case of inconsecutive array access. Meanwhile, fine-grained strategy can lead to excessive memory activation overhead, making it less efficient. Therefore, in this paper, we first propose a fine-grained dynamic programming algorithm, called FiDP, to tackle unsolved case and minimize latency. In order to mitigate the frequent activation before data access, we then add a medium-grained scheme to our strategy. It can achieve a better solution than FiDP by strictly formulating an integer linear programming (ILP) problem and considering multiple granularities, which is called MuILP. Furthermore, to compensate for the high time complexity of ILP, we develop a heuristic multi-granularity data placement algorithm, called HMuDP, which achieves a near-optimal solution with lower complexity. Experimental results show that our FiDP reduces the total latency by 75.90%, 47.70% and 12.34% compared with LRU-cache, a greedy-based comparison method (called Uday) and a dynamic programming-based comparison method (called DLAA). Besides, our MuILP and HMuDP yield less latency than FiDP with 45.10% and 43.14% average improvement, respectively.

Graphical abstract

Keywords

scratch-pad memory / data placement / loops / embedded system

Cite this article

EndNote

Ris (Procite)

Bibtex

Download citation ▾

Yixuan DU, Edwin Hsing-Mean SHA, Yuhong SONG, Yibo GUO, Longshan XU, Qingfeng ZHUGE. MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency. Front. Comput. Sci., 2025, 19(5): 195107 https://doi.org/10.1007/s11704-023-3566-y

This is a preview of subscription content, contact us for subscripton.

Yixuan Du received the BS degree from College of Computer Science and Engineering, Shandong University of Science and Technology, China in 2022. She is working towards the master degree in the Department of Computer Science and Technology, East China Normal University, China. Her current research field is data allocation algorithm in scratch-pad memory

Edwin Hsing-Mean Sha received PhD degree from Princeton University, USA in 1992. Then he was with University of Notre Dame, USA. Since 2000, he has been a tenured full professor at the University of Texas at Dallas, USA. After serving as the Dean of CS College at Chongqing University, China, he is currently a distinguished professor at East China Normal University, China. He has published 500 research articles and served as program committee members and chairs of numerous conferences such as conference chair of ESWEEK 2022. He received many awards including NSF CAREER Award, China National Distinguished Expert, etc. and the best paper award of premier journals such as ACM TODAES, IEEE TC, IEEE TCAD

Yuhong Song received the BS degree from the Department of Computer Science, Nanjing Agricultural University, China in 2019. She is currently working towards the PhD degree in the Department of Computer Science and Technology, East China Normal University, China. Her current research interests include embedded system, software-hardware co-design, automated machine learning, stochastic computing and quantum computing

Yibo Guo received the BS degree in information security from Hunan University, China in 2009, and the MS and PhD degrees in computer science from University of Texas at Dallas, USA in 2011 and 2014, respectively. He is currently an associate professor with the School of Computer and Artificial Intelligence, Zhengzhou University, China. His current research interest includes artificial intelligence

Longshan Xu received the BS degree from the Department of Computer Science and Technology, East China Normal University, China in 2023. She will continue to work towards the master degree in this department. Her current research interest is CPU dynamic voltage frequency scaling on mobile devices

Qingfeng Zhuge received the BS and MS degrees in electronics engineering from Fudan University, China and the PhD degree from the Department of Computer Science, University of Texas at Dallas, USA in 2003. She is currently a professor at East China Normal University, China. She received Best PhD Dissertation Award in 2003. She has published more than 90 research articles in premier journals and conferences. Her research interests include parallel architectures, embedded systems, supply-chain management, real-time systems, optimization algorithms, compilers, and scheduling. She is a member of the IEEE

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	VanHattum A, Nigam R, Lee V T, Bornholt J, Sampson A. A synthesis-aided compiler for DSP architectures (WiP Paper). In: Proceedings of the 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. 2020, 131−135

[2]	Song Y, Jiang W, Li B, Qi P, Zhuge Q, Sha E H M, Dasgupta S, Shi Y, Ding C. Dancing along battery: enabling transformer with run-time reconfigurability on mobile devices. In: Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC). 2021, 1003−1008

[3]	Sun X H, Ni L M. Another view on parallel speedup. In: Proceedings of 1990 ACM/IEEE conference on Supercomputing. 1990, 324−333

[4]	Wu M, Liu Y, Cui H, Wei Q, Li Q, Li L, Lv F, Xue J, Feng X. Bandwidth-aware loop tiling for DMA-supported scratchpad memory. In: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 2020, 97−109

[5]	Venkataramani V, Chan M C, Mitra T . Scratchpad-memory management for multi-threaded applications on many-core architectures. ACM Transactions on Embedded Computing Systems, 2019, 18( 1): 10

[6]	Guo Y, Zhuge Q, Hu J, Yi J, Qiu M, Sha E H M . Data placement and duplication for embedded multicore systems with scratch pad memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2013, 32( 6): 809–817

[7]	Guo Y, Zhuge Q, Zhang J, Hu J, Sha E H M. Optimal data allocation algorithm for loop-centric applications on scratch-PAD memories. In: Proceedings of 2013 SiPS. 2013, 383−388

[8]	Singh A, Dave S, Zardoshti P, Brotzman R, Zhang C, Guo X, Shrivastava A, Tan G, Spear M . SPX64: a scratchpad memory for general-purpose microprocessors. ACM Transactions on Architecture and Code Optimization, 2021, 18( 1): 14

[9]	Şuşu A E . A vector-length agnostic compiler for the Connex-S accelerator with scratchpad memory. ACM Transactions on Embedded Computing Systems, 2020, 19( 6): 51

[10]	Jeong H J, Yeo J, Bahk C, Park J. Pin or fuse? Exploiting scratchpad memory to reduce off-chip data transfer in DNN accelerators. In: Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization. 2023, 224−235

[11]	Xu R, Sha E H M, Zhuge Q, Song Y, Wang H . Loop interchange and tiling for multi-dimensional loops to minimize write operations on NVMs. Journal of Systems Architecture, 2023, 135: 102799

[12]	Udayakumaran S, Dominguez A, Barua R . Dynamic allocation for scratch-pad memory using compile-time decisions. ACM Transactions on Embedded Computing Systems, 2006, 5( 2): 472–511

[13]	Li J, Zhang Z, Qiu M, Zhang P, Quan G, Zhu Y. Optimizing scheduling in embedded CMP systems with phase change memory. In: Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems. 2012, 532−539

[14]	Li Y, Zhan J, Jiang W, Li Y. Writing-aware data variable allocation on hybrid SRAM+NVM SPM: work-in-progress. In: Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES). 2018, 11

[15]	Li Y, Zhan J, Jiang W, Yu J. Energy optimization of branch-aware data variable allocation on hybrid SRAM+NVM SPM for CPS. In: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. 2019, 236−241

[16]	Zhan J, Li Y, Jiang W, Yu J, Yu J . Branch-aware data variable allocation for energy optimization of hybrid SRAM+NVM SPM. Journal of Systems Architecture, 2020, 109: 101797

[17]	Xu R, Sha E H M, Zhuge Q, Song Y, Wang H, Shi L . Optimizing data placement for hybrid SRAM+racetrack memory SPM in embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42( 3): 847–859

[18]	Zhuge Q, Guo Y, Hu J, Tseng W C, Xue C J, Sha E H M . Minimizing access cost for multiple types of memory units in embedded systems through data allocation and scheduling. IEEE Transactions on Signal Processing, 2012, 60( 6): 3253–3263

[19]	Suhendra V, Raghavan C, Mitra T. Integrated scratchpad memory optimization and task scheduling for MPSoC architectures. In: Proceedings of 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems. 2006, 401−410

[20]	Wang G, Ju L, Jia Z, Li X. Data allocation for embedded systems with hybrid on-chip scratchpad and caches. In: Proceedings of the 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing. 2013, 366−373

[21]	Muralimanohar N, Balasubramonian R, Jouppi N. Optimizing nuca organizations and wiring alternatives for large caches with cacti 6.0. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). 2007, 3–14