************************************ Data-parallelism with SIMD ************************************ The ``HIPP::SIMD`` module provides the OOP interface to single-instruction-multiple-data (SIMD) operations. Currently it only supports the Intel x86 platform where AVX vector instructions are implemented (see `Intel's Guide `_ and check the `Wiki `_ page to see the processors that supports SIMD). In this tutorial we will show some examples of how to use the ``HIPP::SIMD`` module to accelerate your program. We will begin with an ordinary program using the non-SIMD instructions. Then we show how to convert it to the SIMD version. Sometimes it is useful to go back to the SIMD intrinsics (i.e., those defined in ````), so we also show how to implement the program with only the intrinsics and how HIPP's OOP interface interacts with the intrinsics. A Simple Example ================================ Let us begin with a simple example: the addition of two arrays, "arr2 = arr1 + arr2"; This is a trival task, but it shows the basic ideas of how to vectorize your operations. It demonstrates three parts that are used in (almost) every SIMD program: (1) memory management, (2) load/store operations, and (3) arithmetic operations. The non-SIMD version ------------------------- The following code sample demonstrates the array addition without any SIMD operation:: void vecadd_nonSIMD(){ const int N = 1024*8; float *arr1 = new float[N], *arr2 = new float[N]; // allocate memory for 8192 floats for(int i=0; i`_, for example). C++17 also standardize the ``std::aligned_alloc`` (see the `cpp reference `_, for example). You may use any of them, but the most convinient way is to use HIPP's memory interface. To replace the two ``new`` operations in the non-SIMD version, you write:: #include // HIPP's SIMD library #include // MemObj and prt_a typedef HIPP::SIMD::Vec f8_t; typedef HIPP::MemObj mem_t; const int N = 1024; f8_t *arr1 = mem_t::alloc_a(N, 1), // allocate aligned memory *arr2 = mem_t::alloc_a(N, 1); The two header files are necessary here and in the following - they provide the interface to HIPP's SIMD classes and other useful functions, respectively. The SIMD classes are defined in the namespace ``HIPP::SIMD``. Other useful functions are defined in the namespace ``HIPP``. All HIPP SIMD vectors are defined through the class template ``Vec``, which hosts an array of ``n`` elements each typed ``scalar_type``. We alias ``f8_t`` as a vector that consists of 8 single-precision floats. To allocate aligned memory for the vector type, you define the memory allocator ``MemObj`` of it (we alias it as ``mem_t``). Then the call of ``addr = mem_t::alloc(N, A)`` allocates space for ``N`` vectors (i.e., ``N*8`` floats) which is aligned at ``A`` minimal alignments of ``f8_t`` (i.e., ``A * alignof(f8_t)`` ), and returns its address into ``addr``. The call of ``mem_t::dealloc(addr)`` release the memory allocated by the allocator. Note that you cannot define a SIMD vector of arbitrary size. An AVX-compatible processor provide 256-bit "ymm" registors, so you can use a pack of 8 floats (8*32=256 bits). The second part we need to change is the memory access operations (i.e., load/store operations). This can be done through the methods ``load`` and ``store`` of the ``Vec`` type. ``load( (Vec *)addr )`` load a vector of data elements from an aligned address ``addr`` into the ``Vec`` instance, while ``store( (Vec *)addr )`` store the data in the instance to the aligned memory address ``addr``. Therefore, the part of the initialization code can be written as:: f8_t low {7.0f,6.0f,5.0f,4.0f,3.0f,2.0f,1.0f,0.0f}, // 0, 1, 2, ..., 7 hi = f8_t(N*8.0f) - low, // 8192, 8191, 8190, ..., 8185 stride = f8_t(8.0); // 8, 8, 8, ..., 8 for(int i=0; i // HIPP's SIMD library #include // MemObj and prt_a typedef HIPP::SIMD::Vec f8_t; typedef HIPP::MemObj mem_t; void vecadd_SIMD(){ const int N = 1024; f8_t *arr1 = mem_t::alloc_a(N, 1), *arr2 = mem_t::alloc_a(N, 1); // allocate aligned memory f8_t low {7.0f,6.0f,5.0f,4.0f,3.0f,2.0f,1.0f,0.0f}, hi = f8_t(N*8.0f) - low, stride = f8_t(8.0); for(int i=0; i // SIMD instructions #include // MemObj and prt_a void vecadd_intrinsics(){ const int N = 1024, size_vec = sizeof(__m256); __m256 *arr1 = (__m256 *)aligned_alloc(size_vec, size_vec*N), *arr2 = (__m256 *)aligned_alloc(size_vec, size_vec*N); __m256 low = _mm256_set_ps(7.0f,6.0f,5.0f,4.0f,3.0f,2.0f,1.0f,0.0f), hi = _mm256_sub_ps( _mm256_set1_ps(float(N*8)), low), stride = _mm256_set1_ps(8.0f); for(int i=0; i`_ for the detail signatures of this functions. Example: Matrix-matrix Multiplication ======================================= Matrix-matrix multiplication is another example where data-parallelism can be used for acceleration. Given two matrix :math:`A \in \mathbb{R}^{m\times n}` and ::math:`B \in \mathbb{R}^{n\times p}`, the matrix-matrix multiplication maps them into another matrix :math:`C \in \mathbb{R}^{m\times p}`, where each element of :math:`C` is given by .. math:: C_{\rm i, k} = \sum_{j=1}^{n} A_{\rm i,j}B_{\rm j,k}. Many subroutines have been developed for different types of matrix-matrix multiplication. Here we implement a simplified version of "DGEMM", the general matrix-matrix multiplication for double-precision floating-points, which can be found in the BLAS libraries. A brute-force implementation simply consists of three nested loops:: void dgemm_nonSIMD(const double *A, const double *B, double *C, int m, int n, int p) { for(int i=0; i d4_t; void dgemm_SIMD(const double *A, const double *B, double *C, int m, int n, int p){ for(int i=0; i