ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory
Huize LI, Hai JIN, Long ZHENG, Yu HUANG, Xiaofei LIAO
ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory
With the increasing amount of data, there is an urgent need for efficient sorting algorithms to process large data sets. Hardware sorting algorithms have attracted much attention because they can take advantage of different hardware’s parallelism. But the traditional hardware sort accelerators suffer “memory wall” problems since their multiple rounds of data transmission between the memory and the processor. In this paper, we utilize the in-situ processing ability of the ReRAM crossbar to design a new ReCAM array that can process the matrix-vector multiplication operation and the vector-scalar comparison in the same array simultaneously. Using this designed ReCAM array, we present ReCSA, which is the first dedicated ReCAM-based sort accelerator. Besides hardware designs, we also develop algorithms to maximize memory utilization and minimize memory exchanges to improve sorting performance. The sorting algorithm in ReCSA can process various data types, such as integer, float, double, and strings.
We also present experiments to evaluate the performance and energy efficiency against the state-of-the-art sort accelerators. The experimental results show that ReCSA has , , , , and speedups against CPU-, GPU-, FPGA-, NDP-, and PIM-based platforms when processing numeric data sets. ReCSA also has , , and performance improvement when processing string data sets compared with CPU-, GPU-, and FPGA-based platforms.
ReCAM / parallel sorting / architecture design / processing-in-memory
Huize Li is currently a PhD student in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His current research interests include computer architecture and emerging non-volatile memory
Hai Jin is a Chair Professor of computer science and engineering at Huazhong University of Science and Technology (HUST), China. Jin received his PhD in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz, Germany. Jin worked at The University of Hong Kong, China between 1998 and 2000, and as a visiting scholar at the University of Southern California, USA between 1999 and 2000. He was awarded Excellent Youth Award from the National Science Foundation of China in 2001. Jin is a Fellow of IEEE, Fellow of CCF, and a life member of the ACM. He has co-authored more than 20 books and published over 900 research papers. His research interests include computer architecture, parallel and distributed computing, big data processing, data storage, and system security
Long Zheng is now an associate professor in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. He received his PhD degree in computer engineering from HUST, China in 2016. His current research interests include program analysis, runtime systems, and configurable computer architecture with a particular focus on graph processing
Yu Huang received the BS degree from Huazhong University of Science and Technology (HUST), China in 2016. He is now working toward the PhD degree in the School of Computer Science and Technology, HUST, China. His research interests focus on processing-in-memory architecture and graph processing
Xiaofei Liao received his PhD degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China in 2005. He is now a Professor in the School of Computer Science and Technology at HUST, China. He has served as a reviewer for many conferences and journal papers. His research interests are in the areas of system software, P2P system, cluster computing and streaming services. He is a member of IEEE and the IEEE Computer Society
[1] |
Carlson D, Carin L . Continuing progress of spike sorting in the era of big data. Current Opinion in Neurobiology, 2019, 55: 90– 96
|
[2] |
Kuritzin A, Kischka T, Schmitz J, Churakov G . Incomplete lineage sorting and hybridization statistics for large-scale retroposon insertion data. PLoS Computational Biology, 2016, 12( 3): e1004812
|
[3] |
Heath L S, Vergara J P C . Sorting by short swaps. Current Opinion in Neurobiology, 2003, 10( 5): 775– 789
|
[4] |
Tsuda N, Satoh T, Kawada T. A piepline sorting chip. In: Proceedings of IEEE International Solid-State Circuits Conference. 1987, 270– 271
|
[5] |
Casper J, Olukotun K. Hardware acceleration of database operations. In: Proceedings of 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2014, 151– 160
|
[6] |
Cole R . Parallel merge sort. SIAM Journal on Computing, 1988, 17( 4): 770– 785
|
[7] |
Mueller R, Teubner J, Alonso G . Sorting networks on FPGAs. The VLDB Journal, 2012, 21( 1): 1– 23
|
[8] |
Bentley J L, Sedgewick R. Fast algorithms for sorting and searching strings. In: Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms. 1997, 360– 369
|
[9] |
Arisland K Ø, Aasbø A C, Nundal A . VLSI parallel shift sort algorithm and design. Integration, 1984, 2( 4): 331– 347
|
[10] |
Farmahini-Farahani A, Duwe III H J, Schulte M J, Compton K . Modular design of high-throughput, low-latency sorting units. IEEE Transactions on Computers, 2013, 62( 7): 1389– 1402
|
[11] |
Govindaraju N, Gray J, Kumar R, Manocha D. GPUTeraSort: high performance graphics co-processor sorting for large database management. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. 2006, 325– 336
|
[12] |
Singh D P, Joshi I, Choudhary J . Survey of GPU based sorting algorithms. International Journal of Parallel Programming, 2018, 46( 6): 1017– 1034
|
[13] |
Satish N, Harris M, Garland M. Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of IEEE International Symposium on Parallel & Distributed Processing. 2009, 1– 10
|
[14] |
Satish N, Kim C, Chhugani J, Nguyen A D, Lee V W, Kim D, Dubey P. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010, 351– 362
|
[15] |
Koch D, Torresen J. FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In: Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 2011, 45– 54
|
[16] |
Saitoh M, Elsayed E A, Van Chu T, Mashimo S, Kise K. A high-performance and cost-effective hardware merge sorter without feedback datapath. In: Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. 2018, 197– 204
|
[17] |
Cho M, Brand D, Bordawekar R, Finkler U, Kulandaisamy V, Puri R . PARADIS: an efficient parallel algorithm for in-place radix sort. Proceedings of the VLDB Endowment, 2015, 8( 2): 1518– 1529
|
[18] |
Minutoli M, Kuntz S K, Tumeo A, Kogge P. Implementing radix sort on emu 1. In: Proceedings of the 3rd Workshop NearData Process. 2015
|
[19] |
Balasubramonian R, Chang J, Manning T, Moreno J H, Murphy R, Nair R, Swanson S . Near-data processing: insights from a MICRO-46 workshop. IEEE Micro, 2014, 34( 4): 36– 42
|
[20] |
Zhu Q, Akin B, Sumbul H E, Sadi F, Hoe J C, Pileggi L, Franchetti F. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In: Proceedings of IEEE International 3D Systems Integration Conference. 2013, 1– 7
|
[21] |
Akinaga H, Shima H . Resistive random access memory (ReRAM) based on metal oxides. Proceedings of the IEEE, 2010, 98( 12): 2237– 2251
|
[22] |
Yavits L, Kvatinsky S, Morad A, Ginosar R . Resistive associative processor. IEEE Computer Architecture Letters, 2015, 14( 2): 148– 151
|
[23] |
Imani M, Gupta S, Sharma S, Rosing T S . NVQuery: efficient query processing in nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38( 4): 628– 639
|
[24] |
Li H, Jin H, Zheng L, Liao X . ReSQM: accelerating database operations using ReRAM-based content addressable memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39( 11): 4030– 4041
|
[25] |
Leischner N, Osipov V, Sanders P. GPU sample sort. In: Proceedings of the IEEE International Symposium on Parallel & Distributed Processing. 2010, 1– 10
|
[26] |
Sintorn E, Assarsson U . Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 2008, 68( 10): 1381– 1388
|
[27] |
Cederman D, Tsigas P . GPU-Quicksort: a practical quicksort algorithm for graphics processors. ACM Journal of Experimental Algorithmics, 2009, 14: 4
|
[28] |
Song W, Koch D, Luján M, Garside J. Parallel hardware merge sorter. In: Proceedings of Annual International Symposium on Field-Programmable Custom Computing Machines. 2016, 95– 102
|
[29] |
Samardzic N, Qiao W, Aggarwal V, Chang M C F, Cong J. Bonsai: high-performance adaptive merge tree sorting. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 282– 294
|
[30] |
Siegl P, Buchty R, Berekovic M. Data-centric computing frontiers: a survey on processing-in-memory. In: Proceedings of the 2nd International Symposium on Memory Systems. 2016, 295– 308
|
[31] |
Li Z, Challapalle N, Ramanathan A K, Narayanan V. IMC-Sort: in-memory parallel sorting architecture using hybrid memory cube. In: Proceedings of the 30th Great Lakes Symposium on VLSI. 2020, 45– 50
|
[32] |
Prasad A K, Rezaalipour M, Dehyadegari M, Bojnordi M N. Memristive data ranking. In: Proceedings of International Symposium on High-Performance Computer Architecture. 2021, 440– 452
|
[33] |
Wong H S P, Raoux S, Kim S, Liang J, Reifenberg J P, Rajendran B, Asheghi M, Goodson K E . Phase change memory. Proceedings of the IEEE, 2010, 98( 12): 2201– 2227
|
[34] |
Wang K L, Alzate J G, Amiri P K . Low-power non-volatile spintronic memory: STT-RAM and beyond. Journal of Physics D: Applied Physics, 2013, 46( 7): 074003
|
[35] |
Li J, Montoye R K, Ishii M, Chang L . 1 Mb 0. 41 µm2 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing. IEEE Journal of Solid-State Circuits , 2014, 49( 4): 896– 907
|
[36] |
Chang M F, Huang L Y, Lin W Z, Chiang Y N, Kuo C C, Chuang C H, Yang K H, Tsai H J, Chen T F, Sheu S S . A ReRAM-based 4T2R nonvolatile TCAM using RC-filtered stress-decoupled scheme for frequent-OFF instant-ON search engines used in IoT and big-data processing. IEEE Journal of Solid-State Circuits, 2016, 51( 11): 2786– 2798
|
[37] |
Matsunaga S, Katsumata A, Natsui M, Fukami S, Endoh T, Ohno H, Hanyu T. Fully parallel 6T-2MTJ nonvolatile TCAM with single-transistor-based self match-line discharge control. In: Proceedings of Symposium on VLSI Circuits - Digest of Technical Papers. 2011, 298– 299
|
[38] |
Zhao L, Deng Q, Zhang Y, Yang J. RFAcc: a 3D ReRAM Associative array based random forest accelerator. In: Proceedings of the ACM International Conference on Supercomputing. 2019, 473– 483
|
[39] |
Huangfu W, Li S, Hu X, Xie Y. RADAR: a 3D-ReRAM based DNA alignment accelerator architecture. In: Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC). 2018, 1– 6
|
[40] |
Neelima B, Shamsundar B, Narayan A, Prabhu R, Gomes C . Kepler GPU accelerated recursive sorting using dynamic parallelism. Concurrency and Computation: Practice and Experience, 2017, 29( 4): e3865
|
[41] |
Asiatici M, Maiorano D, Ienne P. FPGAs in the datacenters: the case of parallel hybrid super scalar string sample sort. In: Proceedings of the 31st IEEE International Conference on Application-specific Systems, Architectures and Processors. 2020, 133– 140
|
[42] |
Pawlowski J T. Hybrid memory cube (HMC). In: Proceedings of IEEE Hot Chips 23 Symposium. 2011, 1– 24
|
[43] |
Kim J, Kim Y. HBM: Memory solution for bandwidth-hungry processors. In: Proceedings of IEEE Hot Chips 26 Symposium. 2014, 1– 24
|
[44] |
Sinha R, Zobel J, Ring D. Cache-efficient string sorting using copying. ACM Journal of Experimental Algorithmics, 2006, 11( 1.2): 1– 32
|
[45] |
Yavits L, Morad A, Ginosar R . Computer architecture with associative processor replacing last-level cache and SIMD accelerator. IEEE Transactions on Computers, 2015, 64( 2): 368– 381
|
[46] |
Lien Y C. A 4.5-mW 8-b 750-MS/s 2-b/step asynchronous subranged SAR ADC in 28-nm CMOS technology. In: Proceedings of Symposium on VLSI Circuits. 2012, 88– 89
|
[47] |
Niu D, Xu C, Muralimanohar N, Jouppi N P, Xie Y. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In: Proceedings of IEEE/ACM International Conference on Computer-Aided Design. 2013, 17– 23
|
[48] |
Stehle E, Jacobsen H A. A memory bandwidth-efficient hybrid radix sort on GPUs. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, 417– 432
|
[49] |
David H, Gorbatov E, Hanebutte U R, Khanna R, Le C. RAPL: memory power estimation and capping. In: Proceedings of ACM/IEEE International Symposium on Low-Power Electronics and Design. 2010, 189– 194
|
[50] |
Deshpande A, Narayanan P J. Can GPUs sort strings efficiently? In: Proceedings of the 20th Annual International Conference on High Performance Computing. 2013, 305– 313
|
/
〈 | 〉 |