MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency

Yixuan DU , Edwin Hsing-Mean SHA , Yuhong SONG , Yibo GUO , Longshan XU , Qingfeng ZHUGE

Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (5) : 195107

PDF (5990KB)
Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (5) : 195107 DOI: 10.1007/s11704-023-3566-y
Architecture
RESEARCH ARTICLE

MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency

Author information +
History +
PDF (5990KB)

Abstract

Scratch-pad memory (SPM) has been widely used in embedded systems because it allows software-controlled data placement. By designing data placement strategies, optimal solutions with minimal memory access latency for loops on SPM-DRAM architecture can be explored. Although existing works effectively reduce the latency by using fine-grained data placement methods, they fail in solving the case of inconsecutive array access. Meanwhile, fine-grained strategy can lead to excessive memory activation overhead, making it less efficient. Therefore, in this paper, we first propose a fine-grained dynamic programming algorithm, called FiDP, to tackle unsolved case and minimize latency. In order to mitigate the frequent activation before data access, we then add a medium-grained scheme to our strategy. It can achieve a better solution than FiDP by strictly formulating an integer linear programming (ILP) problem and considering multiple granularities, which is called MuILP. Furthermore, to compensate for the high time complexity of ILP, we develop a heuristic multi-granularity data placement algorithm, called HMuDP, which achieves a near-optimal solution with lower complexity. Experimental results show that our FiDP reduces the total latency by 75.90%, 47.70% and 12.34% compared with LRU-cache, a greedy-based comparison method (called Uday) and a dynamic programming-based comparison method (called DLAA). Besides, our MuILP and HMuDP yield less latency than FiDP with 45.10% and 43.14% average improvement, respectively.

Graphical abstract

Keywords

scratch-pad memory / data placement / loops / embedded system

Cite this article

Download citation ▾
Yixuan DU, Edwin Hsing-Mean SHA, Yuhong SONG, Yibo GUO, Longshan XU, Qingfeng ZHUGE. MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency. Front. Comput. Sci., 2025, 19(5): 195107 DOI:10.1007/s11704-023-3566-y

登录浏览全文

4963

注册一个新账户 忘记密码

References

[1]

VanHattum A, Nigam R, Lee V T, Bornholt J, Sampson A. A synthesis-aided compiler for DSP architectures (WiP Paper). In: Proceedings of the 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. 2020, 131−135

[2]

Song Y, Jiang W, Li B, Qi P, Zhuge Q, Sha E H M, Dasgupta S, Shi Y, Ding C. Dancing along battery: enabling transformer with run-time reconfigurability on mobile devices. In: Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC). 2021, 1003−1008

[3]

Sun X H, Ni L M. Another view on parallel speedup. In: Proceedings of 1990 ACM/IEEE conference on Supercomputing. 1990, 324−333

[4]

Wu M, Liu Y, Cui H, Wei Q, Li Q, Li L, Lv F, Xue J, Feng X. Bandwidth-aware loop tiling for DMA-supported scratchpad memory. In: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 2020, 97−109

[5]

Venkataramani V, Chan M C, Mitra T . Scratchpad-memory management for multi-threaded applications on many-core architectures. ACM Transactions on Embedded Computing Systems, 2019, 18( 1): 10

[6]

Guo Y, Zhuge Q, Hu J, Yi J, Qiu M, Sha E H M . Data placement and duplication for embedded multicore systems with scratch pad memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2013, 32( 6): 809–817

[7]

Guo Y, Zhuge Q, Zhang J, Hu J, Sha E H M. Optimal data allocation algorithm for loop-centric applications on scratch-PAD memories. In: Proceedings of 2013 SiPS. 2013, 383−388

[8]

Singh A, Dave S, Zardoshti P, Brotzman R, Zhang C, Guo X, Shrivastava A, Tan G, Spear M . SPX64: a scratchpad memory for general-purpose microprocessors. ACM Transactions on Architecture and Code Optimization, 2021, 18( 1): 14

[9]

Şuşu A E . A vector-length agnostic compiler for the Connex-S accelerator with scratchpad memory. ACM Transactions on Embedded Computing Systems, 2020, 19( 6): 51

[10]

Jeong H J, Yeo J, Bahk C, Park J. Pin or fuse? Exploiting scratchpad memory to reduce off-chip data transfer in DNN accelerators. In: Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization. 2023, 224−235

[11]

Xu R, Sha E H M, Zhuge Q, Song Y, Wang H . Loop interchange and tiling for multi-dimensional loops to minimize write operations on NVMs. Journal of Systems Architecture, 2023, 135: 102799

[12]

Udayakumaran S, Dominguez A, Barua R . Dynamic allocation for scratch-pad memory using compile-time decisions. ACM Transactions on Embedded Computing Systems, 2006, 5( 2): 472–511

[13]

Li J, Zhang Z, Qiu M, Zhang P, Quan G, Zhu Y. Optimizing scheduling in embedded CMP systems with phase change memory. In: Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems. 2012, 532−539

[14]

Li Y, Zhan J, Jiang W, Li Y. Writing-aware data variable allocation on hybrid SRAM+NVM SPM: work-in-progress. In: Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES). 2018, 11

[15]

Li Y, Zhan J, Jiang W, Yu J. Energy optimization of branch-aware data variable allocation on hybrid SRAM+NVM SPM for CPS. In: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. 2019, 236−241

[16]

Zhan J, Li Y, Jiang W, Yu J, Yu J . Branch-aware data variable allocation for energy optimization of hybrid SRAM+NVM SPM. Journal of Systems Architecture, 2020, 109: 101797

[17]

Xu R, Sha E H M, Zhuge Q, Song Y, Wang H, Shi L . Optimizing data placement for hybrid SRAM+racetrack memory SPM in embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42( 3): 847–859

[18]

Zhuge Q, Guo Y, Hu J, Tseng W C, Xue C J, Sha E H M . Minimizing access cost for multiple types of memory units in embedded systems through data allocation and scheduling. IEEE Transactions on Signal Processing, 2012, 60( 6): 3253–3263

[19]

Suhendra V, Raghavan C, Mitra T. Integrated scratchpad memory optimization and task scheduling for MPSoC architectures. In: Proceedings of 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems. 2006, 401−410

[20]

Wang G, Ju L, Jia Z, Li X. Data allocation for embedded systems with hybrid on-chip scratchpad and caches. In: Proceedings of the 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing. 2013, 366−373

[21]

Muralimanohar N, Balasubramonian R, Jouppi N. Optimizing nuca organizations and wiring alternatives for large caches with cacti 6.0. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). 2007, 3–14

RIGHTS & PERMISSIONS

Higher Education Press

AI Summary AI Mindmap
PDF (5990KB)

Supplementary files

FCS-23566-OF-YD_suppl_1

948

Accesses

0

Citation

Detail

Sections
Recommended

AI思维导图

/