Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations
Mei WEN, Da-fei HUANG, Chang-qing XUN, Dong CHEN
Front. Inform. Technol. Electron. Eng, 2015, Vol. 16, Issue 11: 899-916.
OpenCL is an open framework for heterogeneous programming. Although OpenCL programs are functionally portable, their performance is not, so code transformation often plays an irreplaceable role. When GPU-specific OpenCL kernels are adapted to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has therefore been used extensively. However, the locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may harm CPU performance. Typically, using OpenCL’s local memory on multi-core/many-core CPUs can degrade rather than improve performance, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To resolve this dilemma, we analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels. The kernels can then be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that transforms GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel’s many-integrated-core coprocessor. The resulting performance on both architectures is better than or comparable with the corresponding OpenMP performance.
OpenCL / Performance portability / Multi-core/many-core CPU / Analysis-based transformation
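The transformation described in the abstract can be illustrated with a small sketch in plain C. It is not the authors' tool chain or its output. It assumes a hypothetical 3-point stencil kernel, a hypothetical work-group size `WG`, and an input array padded by one element on each side. The first function mimics what plain work-item coalescing might produce when the local-memory tile and the barrier are inherited from the GPU version: the barrier splits the work-item loop in two, and the tile becomes an extra copy. The second function shows the same computation after the tile and the barrier are removed, leaving a single inner loop that a compiler can vectorize.

```c
#include <assert.h>
#include <stddef.h>

#define WG 8 /* hypothetical work-group size, chosen for illustration */

/* Coalesced kernel that still carries the GPU-style local-memory tile.
   `in` holds n + 2 elements (one halo element on each side), `out` holds n.
   The barrier of the original kernel becomes the boundary between the
   staging loop and the compute loop. */
static void stencil_with_local(const float *in, float *out, size_t n)
{
    assert(n % WG == 0); /* simplifying assumption of this sketch */
    for (size_t base = 0; base < n; base += WG) { /* one work-group */
        float tile[WG + 2]; /* stands in for the __local array */

        /* Loop 1: stage the work-group's data, halo included. */
        for (size_t lid = 0; lid < WG + 2; ++lid)
            tile[lid] = in[base + lid];

        /* (the barrier of the GPU kernel would sit here) */

        /* Loop 2: every work-item computes from the tile. */
        for (size_t lid = 0; lid < WG; ++lid)
            out[base + lid] = tile[lid] + tile[lid + 1] + tile[lid + 2];
    }
}

/* Same computation with the tile and the barrier removed: one loop over
   the work-items that reads global memory directly and relies on the
   CPU cache for locality. */
static void stencil_direct(const float *in, float *out, size_t n)
{
    assert(n % WG == 0);
    for (size_t base = 0; base < n; base += WG)
        for (size_t lid = 0; lid < WG; ++lid)
            out[base + lid] = in[base + lid] + in[base + lid + 1]
                            + in[base + lid + 2];
}
```

Both functions produce identical results. The second avoids the staging copy and the loop split, which is the kind of overhead the abstract attributes to local-memory use on CPUs.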
Higher Education Press