In recent years, mature advanced packaging technologies have increasingly enabled the integration of multiple small dies into larger chips while retaining chip-scale density and high-bandwidth interconnects. To address the inefficiency of manual design and the difficulty of heterogeneous optimization in wafer-scale chip (WSC) development, we systematically explore the key factors in WSC architecture design. We integrate chip layout, operator mapping, and hardware-software codesign, and formulate WSC architecture exploration as a multi-objective optimization problem. First, we establish a hierarchical architecture model for WSCs that quantifies core constraints and interconnect topology constraints in a unified way. Second, we propose a hierarchical multi-objective collaborative optimization framework that jointly optimizes physical constraints and the communication patterns of task mapping. Finally, we develop a WSC optimizer toolchain that supports mixed-granularity simulation and generates optimal configurations for representative workloads. Experimental results demonstrate that, compared with traditional computer architectures, the optimized architectures generated by our WSC optimizer achieve up to a 22× throughput improvement and a 5× latency reduction in application domains such as cryptographic decryption and signal processing.
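To make the multi-objective formulation concrete, the following is a minimal sketch of Pareto filtering over candidate configurations, assuming only two objectives (throughput to maximize, latency to minimize). The `Config` type, the candidate names, and all numeric values are hypothetical and purely illustrative; they are not taken from the paper, and the authors' framework may use a different search and selection method.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """Hypothetical WSC candidate; the fields are illustrative assumptions."""
    name: str
    throughput: float  # higher is better
    latency: float     # lower is better


def dominates(a: Config, b: Config) -> bool:
    """Return True if a is no worse than b on both objectives and better on one."""
    no_worse = a.throughput >= b.throughput and a.latency <= b.latency
    strictly_better = a.throughput > b.throughput or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up candidates, used only to exercise the filter.
candidates = [
    Config("A", throughput=10.0, latency=5.0),
    Config("B", throughput=8.0, latency=3.0),
    Config("C", throughput=7.0, latency=6.0),
    Config("D", throughput=10.0, latency=4.0),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

Here candidates B and D remain: neither is better than the other on both objectives, so choosing between them is a trade-off between throughput and latency.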
RIGHTS & PERMISSIONS
The Authors. Published by Zhejiang University Press Co., Ltd.