Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations

Mei WEN; Da-fei HUANG; Chang-qing XUN; Dong CHEN

doi:10.1631/FITEE.1500032

Front. Inform. Technol. Electron. Eng, 2015, Vol. 16, Issue 11: 899-916. DOI: 10.1631/FITEE.1500032

Original Article


Abstract

OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has therefore been used extensively. However, the locality concerns expressed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. Typically, the use of OpenCL’s local memory on multi-core/many-core CPUs may degrade rather than improve performance, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels. The kernels can then be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that transforms GPU-specific OpenCL kernels into a CPU-friendly form and which, together with a scheduler, forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel’s many-integrated-core coprocessor. The resulting performance on both architectures is better than or comparable with the corresponding OpenMP performance.
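The idea of step (1) can be illustrated with a minimal sketch in plain C. It is not the authors' tool chain or its output: the function names, the work-group size `WG`, and the neighbor-sum computation are all hypothetical and chosen only for illustration. The first function shows a coarsened work-group that still stages data in a local array, with the barrier turned into a split between two work-item loops. The second shows the same computation after the local array and barrier are removed, reading directly from the global buffer.

```c
#include <stddef.h>

#define WG 8  /* hypothetical work-group size, for illustration only */

/* Coarsened work-group that keeps the GPU-style local array.
   The barrier of the original kernel splits the work-item loop in two. */
void group_with_local(const float *in, float *out, size_t group)
{
    float tile[WG];                      /* stands in for a __local array */
    size_t base = group * WG;

    for (size_t l = 0; l < WG; ++l)      /* code before the barrier */
        tile[l] = in[base + l];

    /* barrier(CLK_LOCAL_MEM_FENCE) would sit here in the kernel */

    for (size_t l = 0; l < WG; ++l)      /* code after the barrier */
        out[base + l] = tile[l] + tile[(l + 1) % WG];
}

/* CPU-friendly form: local array and barrier removed, so a single loop
   reads the global buffer directly and relies on the CPU cache. */
void group_without_local(const float *in, float *out, size_t group)
{
    size_t base = group * WG;

    for (size_t l = 0; l < WG; ++l)
        out[base + l] = in[base + l] + in[base + (l + 1) % WG];
}
```

Both functions write the same values for any work-group index. The second avoids the extra copy and leaves one simple loop that a compiler may be able to vectorize. Whether such a removal is legal in a real kernel depends on the access-pattern analysis the abstract describes, which this sketch does not attempt.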

Keywords

OpenCL / Performance portability / Multi-core/many-core CPU / Analysis-based transformation

Cite this article


Mei WEN, Da-fei HUANG, Chang-qing XUN, Dong CHEN. Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations. Front. Inform. Technol. Electron. Eng, 2015, 16(11): 899-916. DOI: 10.1631/FITEE.1500032

