ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory

Huize LI; Hai JIN; Long ZHENG; Yu HUANG; Xiaofei LIAO

doi:10.1007/s11704-022-1322-3

Front. Comput. Sci. ›› 2023, Vol. 17 ›› Issue (2) :172103 DOI: 10.1007/s11704-022-1322-3

Architecture

RESEARCH ARTICLE

ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory

Author information +

History +

PDF (7128KB)

Abstract

With the increasing amount of data, there is an urgent need for efficient sorting algorithms to process large data sets. Hardware sorting algorithms have attracted much attention because they can take advantage of different hardware’s parallelism. But the traditional hardware sort accelerators suffer “memory wall” problems since their multiple rounds of data transmission between the memory and the processor. In this paper, we utilize the in-situ processing ability of the ReRAM crossbar to design a new ReCAM array that can process the matrix-vector multiplication operation and the vector-scalar comparison in the same array simultaneously. Using this designed ReCAM array, we present ReCSA, which is the first dedicated ReCAM-based sort accelerator. Besides hardware designs, we also develop algorithms to maximize memory utilization and minimize memory exchanges to improve sorting performance. The sorting algorithm in ReCSA can process various data types, such as integer, float, double, and strings.

We also present experiments to evaluate the performance and energy efficiency against the state-of-the-art sort accelerators. The experimental results show that ReCSA has $90.92 ×$ , $46.13 ×$ , $27.38 ×$ , $84.57 ×$ , and $3.36 ×$ speedups against CPU-, GPU-, FPGA-, NDP-, and PIM-based platforms when processing numeric data sets. ReCSA also has $24.82 ×$ , $32.94 ×$ , and $18.22 ×$ performance improvement when processing string data sets compared with CPU-, GPU-, and FPGA-based platforms.

Graphical abstract

Keywords

ReCAM / parallel sorting / architecture design / processing-in-memory

Cite this article

Download citation ▾

Huize LI, Hai JIN, Long ZHENG, Yu HUANG, Xiaofei LIAO. ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory. Front. Comput. Sci., 2023, 17 (2) : 172103 DOI:10.1007/s11704-022-1322-3

登录浏览全文

4963

注册一个新账户忘记密码

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	Carlson D, Carin L . Continuing progress of spike sorting in the era of big data. Current Opinion in Neurobiology, 2019, 55: 90– 96

[2]	Kuritzin A, Kischka T, Schmitz J, Churakov G . Incomplete lineage sorting and hybridization statistics for large-scale retroposon insertion data. PLoS Computational Biology, 2016, 12( 3): e1004812

[3]	Heath L S, Vergara J P C . Sorting by short swaps. Current Opinion in Neurobiology, 2003, 10( 5): 775– 789

[4]	Tsuda N, Satoh T, Kawada T. A piepline sorting chip. In: Proceedings of IEEE International Solid-State Circuits Conference. 1987, 270– 271

[5]	Casper J, Olukotun K. Hardware acceleration of database operations. In: Proceedings of 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2014, 151– 160

[6]	Cole R . Parallel merge sort. SIAM Journal on Computing, 1988, 17( 4): 770– 785

[7]	Mueller R, Teubner J, Alonso G . Sorting networks on FPGAs. The VLDB Journal, 2012, 21( 1): 1– 23

[8]	Bentley J L, Sedgewick R. Fast algorithms for sorting and searching strings. In: Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms. 1997, 360– 369

[9]	Arisland K Ø, Aasbø A C, Nundal A . VLSI parallel shift sort algorithm and design. Integration, 1984, 2( 4): 331– 347

[10]	Farmahini-Farahani A, Duwe III H J, Schulte M J, Compton K . Modular design of high-throughput, low-latency sorting units. IEEE Transactions on Computers, 2013, 62( 7): 1389– 1402

[11]	Govindaraju N, Gray J, Kumar R, Manocha D. GPUTeraSort: high performance graphics co-processor sorting for large database management. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. 2006, 325– 336

[12]	Singh D P, Joshi I, Choudhary J . Survey of GPU based sorting algorithms. International Journal of Parallel Programming, 2018, 46( 6): 1017– 1034

[13]	Satish N, Harris M, Garland M. Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of IEEE International Symposium on Parallel & Distributed Processing. 2009, 1– 10

[14]	Satish N, Kim C, Chhugani J, Nguyen A D, Lee V W, Kim D, Dubey P. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010, 351– 362

[15]	Koch D, Torresen J. FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In: Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 2011, 45– 54

[16]	Saitoh M, Elsayed E A, Van Chu T, Mashimo S, Kise K. A high-performance and cost-effective hardware merge sorter without feedback datapath. In: Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. 2018, 197– 204

[17]	Cho M, Brand D, Bordawekar R, Finkler U, Kulandaisamy V, Puri R . PARADIS: an efficient parallel algorithm for in-place radix sort. Proceedings of the VLDB Endowment, 2015, 8( 2): 1518– 1529

[18]	Minutoli M, Kuntz S K, Tumeo A, Kogge P. Implementing radix sort on emu 1. In: Proceedings of the 3rd Workshop NearData Process. 2015

[19]	Balasubramonian R, Chang J, Manning T, Moreno J H, Murphy R, Nair R, Swanson S . Near-data processing: insights from a MICRO-46 workshop. IEEE Micro, 2014, 34( 4): 36– 42

[20]	Zhu Q, Akin B, Sumbul H E, Sadi F, Hoe J C, Pileggi L, Franchetti F. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In: Proceedings of IEEE International 3D Systems Integration Conference. 2013, 1– 7

[21]	Akinaga H, Shima H . Resistive random access memory (ReRAM) based on metal oxides. Proceedings of the IEEE, 2010, 98( 12): 2237– 2251

[22]	Yavits L, Kvatinsky S, Morad A, Ginosar R . Resistive associative processor. IEEE Computer Architecture Letters, 2015, 14( 2): 148– 151

[23]	Imani M, Gupta S, Sharma S, Rosing T S . NVQuery: efficient query processing in nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38( 4): 628– 639

[24]	Li H, Jin H, Zheng L, Liao X . ReSQM: accelerating database operations using ReRAM-based content addressable memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39( 11): 4030– 4041

[25]	Leischner N, Osipov V, Sanders P. GPU sample sort. In: Proceedings of the IEEE International Symposium on Parallel & Distributed Processing. 2010, 1– 10

[26]	Sintorn E, Assarsson U . Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 2008, 68( 10): 1381– 1388

[27]	Cederman D, Tsigas P . GPU-Quicksort: a practical quicksort algorithm for graphics processors. ACM Journal of Experimental Algorithmics, 2009, 14: 4

[28]	Song W, Koch D, Luján M, Garside J. Parallel hardware merge sorter. In: Proceedings of Annual International Symposium on Field-Programmable Custom Computing Machines. 2016, 95– 102

[29]	Samardzic N, Qiao W, Aggarwal V, Chang M C F, Cong J. Bonsai: high-performance adaptive merge tree sorting. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 282– 294

[30]	Siegl P, Buchty R, Berekovic M. Data-centric computing frontiers: a survey on processing-in-memory. In: Proceedings of the 2nd International Symposium on Memory Systems. 2016, 295– 308

[31]	Li Z, Challapalle N, Ramanathan A K, Narayanan V. IMC-Sort: in-memory parallel sorting architecture using hybrid memory cube. In: Proceedings of the 30th Great Lakes Symposium on VLSI. 2020, 45– 50

[32]	Prasad A K, Rezaalipour M, Dehyadegari M, Bojnordi M N. Memristive data ranking. In: Proceedings of International Symposium on High-Performance Computer Architecture. 2021, 440– 452

[33]	Wong H S P, Raoux S, Kim S, Liang J, Reifenberg J P, Rajendran B, Asheghi M, Goodson K E . Phase change memory. Proceedings of the IEEE, 2010, 98( 12): 2201– 2227

[34]	Wang K L, Alzate J G, Amiri P K . Low-power non-volatile spintronic memory: STT-RAM and beyond. Journal of Physics D: Applied Physics, 2013, 46( 7): 074003

[35]	Li J, Montoye R K, Ishii M, Chang L . 1 Mb 0. 41 µm² 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing. IEEE Journal of Solid-State Circuits , 2014, 49( 4): 896– 907

[36]

Chang M F, Huang L Y, Lin W Z, Chiang Y N, Kuo C C, Chuang C H, Yang K H, Tsai H J, Chen T F, Sheu S S . A ReRAM-based 4T2R nonvolatile TCAM using RC-filtered stress-decoupled scheme for frequent-OFF instant-ON search engines used in IoT and big-data processing. IEEE Journal of Solid-State Circuits, 2016, 51( 11): 2786– 2798

[37]	Matsunaga S, Katsumata A, Natsui M, Fukami S, Endoh T, Ohno H, Hanyu T. Fully parallel 6T-2MTJ nonvolatile TCAM with single-transistor-based self match-line discharge control. In: Proceedings of Symposium on VLSI Circuits - Digest of Technical Papers. 2011, 298– 299

[38]	Zhao L, Deng Q, Zhang Y, Yang J. RFAcc: a 3D ReRAM Associative array based random forest accelerator. In: Proceedings of the ACM International Conference on Supercomputing. 2019, 473– 483

[39]	Huangfu W, Li S, Hu X, Xie Y. RADAR: a 3D-ReRAM based DNA alignment accelerator architecture. In: Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC). 2018, 1– 6

[40]	Neelima B, Shamsundar B, Narayan A, Prabhu R, Gomes C . Kepler GPU accelerated recursive sorting using dynamic parallelism. Concurrency and Computation: Practice and Experience, 2017, 29( 4): e3865

[41]	Asiatici M, Maiorano D, Ienne P. FPGAs in the datacenters: the case of parallel hybrid super scalar string sample sort. In: Proceedings of the 31st IEEE International Conference on Application-specific Systems, Architectures and Processors. 2020, 133– 140

[42]	Pawlowski J T. Hybrid memory cube (HMC). In: Proceedings of IEEE Hot Chips 23 Symposium. 2011, 1– 24

[43]	Kim J, Kim Y. HBM: Memory solution for bandwidth-hungry processors. In: Proceedings of IEEE Hot Chips 26 Symposium. 2014, 1– 24

[44]	Sinha R, Zobel J, Ring D. Cache-efficient string sorting using copying. ACM Journal of Experimental Algorithmics, 2006, 11( 1.2): 1– 32

[45]	Yavits L, Morad A, Ginosar R . Computer architecture with associative processor replacing last-level cache and SIMD accelerator. IEEE Transactions on Computers, 2015, 64( 2): 368– 381

[46]	Lien Y C. A 4.5-mW 8-b 750-MS/s 2-b/step asynchronous subranged SAR ADC in 28-nm CMOS technology. In: Proceedings of Symposium on VLSI Circuits. 2012, 88– 89

[47]	Niu D, Xu C, Muralimanohar N, Jouppi N P, Xie Y. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In: Proceedings of IEEE/ACM International Conference on Computer-Aided Design. 2013, 17– 23

[48]	Stehle E, Jacobsen H A. A memory bandwidth-efficient hybrid radix sort on GPUs. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, 417– 432

[49]	David H, Gorbatov E, Hanebutte U R, Khanna R, Le C. RAPL: memory power estimation and capping. In: Proceedings of ACM/IEEE International Symposium on Low-Power Electronics and Design. 2010, 189– 194

[50]	Deshpande A, Narayanan P J. Can GPUs sort strings efficiently? In: Proceedings of the 20th Annual International Conference on High Performance Computing. 2013, 305– 313