ARCHER: a ReRAM-based accelerator for compressed recommendation systems
Xinyang SHEN, Xiaofei LIAO, Long ZHENG, Yu HUANG, Dan CHEN, Hai JIN
Recommendation systems are widely deployed in modern data centers. On traditional platforms, their main performance bottleneck is the random, sparse embedding lookup, which induces heavy data movement between computing units and memory. ReRAM-based processing-in-memory (PIM) can resolve this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity of a monolithic ReRAM-based PIM chip, and the resulting off-chip accesses may offset the benefits of PIM. We therefore deploy a decomposed (compressed) model on-chip and leverage the high computing efficiency of ReRAM to compensate for the performance loss of decompression. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that implements fully on-chip recommendation under resource constraints. First, we thoroughly analyze the computation and access patterns on the decomposed table. Based on the computation pattern, we unify the operations of every layer of the decomposed model into multiply-and-accumulate operations. Based on the access pattern, we propose a hierarchical mapping scheme and a specialized hardware design to maximize resource utilization. With the unified computation and mapping strategy, we can coordinate the pipeline across processing elements. The evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by , , and in terms of performance and , , and in terms of energy savings, respectively.
recommendation system / ReRAM / processing-in-memory / embedding layer
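To make the idea of a decomposed embedding table concrete, the following is a minimal software sketch, not the ARCHER design itself. The abstract does not name the decomposition, so the sketch assumes a two-core tensor-train-style factorization; the function names, core shapes, and index split are hypothetical and chosen only for illustration. It shows how an embedding row can be rebuilt from two small cores using nothing but multiply-and-accumulate steps, and how pooling several rows is again an accumulation.

```python
import random


def matmul(a, b):
    """Matrix product of two nested lists, written as explicit multiply-and-accumulate."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            for j in range(cols):
                out[i][j] += a[i][k] * b[k][j]  # one MAC per step
    return out


def make_cores(n1, n2, d1, d2, rank, seed=0):
    """Random cores for an assumed factorization of an (n1*n2) x (d1*d2) table.

    core1 has shape [n1][d1][rank]; core2 has shape [n2][rank][d2].
    """
    rng = random.Random(seed)
    core1 = [[[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(d1)]
             for _ in range(n1)]
    core2 = [[[rng.uniform(-1, 1) for _ in range(d2)] for _ in range(rank)]
             for _ in range(n2)]
    return core1, core2


def lookup(core1, core2, index):
    """Rebuild one embedding row: split the row index, then multiply two core slices."""
    n2 = len(core2)
    i1, i2 = divmod(index, n2)            # row index -> (i1, i2)
    block = matmul(core1[i1], core2[i2])  # (d1 x rank) @ (rank x d2) -> d1 x d2
    return [v for row in block for v in row]  # flatten to length d1*d2


def pooled_lookup(core1, core2, indices):
    """Sum-pool the embedding rows selected by a sparse list of indices."""
    dim = len(core1[0]) * len(core2[0][0])
    acc = [0.0] * dim
    for idx in indices:
        for j, v in enumerate(lookup(core1, core2, idx)):
            acc[j] += v
    return acc


def parameter_counts(n1, n2, d1, d2, rank):
    """Return (uncompressed, decomposed) parameter counts for the assumed shapes."""
    full = n1 * n2 * d1 * d2
    compressed = n1 * d1 * rank + n2 * rank * d2
    return full, compressed
```

Under these assumed shapes, a table with 100 x 100 rows and 4 x 4 dimensions at rank 2 shrinks from 160000 to 1600 stored parameters, at the cost of one small matrix product per lookup. This is the kind of extra computation that the abstract proposes to absorb with in-memory multiply-and-accumulate.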
Xinyang Shen is currently a PhD student at the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His research interests include ReRAM-based processing in memory, graph processing, and recommendation systems.
Xiaofei Liao received his PhD degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China in 2005. He is currently a Professor in the School of Computer Science and Technology at HUST. He has served as a reviewer for many conferences and journals. He is a member of the IEEE. His research interests are in the areas of system software, P2P systems, cluster computing, graph processing, and streaming services.
Long Zheng is now an Associate Professor in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST), China. He received his PhD degree from HUST in 2016. His current research interests include runtime systems, program analysis, and configurable computer architecture.
Yu Huang received a BS degree from Huazhong University of Science and Technology (HUST), China in 2016. He is now working toward a PhD degree at the School of Computer Science and Technology, HUST, China. His research interests focus on distributed stream processing and graph processing.
Dan Chen received a BS degree from North China Electric Power University, China in 2018. He is now working toward a PhD degree at the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His research interests focus on processing-in-memory and graph neural networks.
Hai Jin is a Professor of computer science and engineering at Huazhong University of Science and Technology (HUST) in China. He received his PhD in computer engineering from HUST in 1994. He is the chief scientist of ChinaGrid, the largest grid computing project in China. He is an IEEE Fellow, CCF Fellow, and a member of the ACM. His research interests include computer architecture, big data processing, data storage, and system security.