************************************
Data-parallelism with SIMD
************************************
The ``HIPP::SIMD`` module provides the OOP interface to single-instruction-multiple-data (SIMD) operations.
Currently it only supports the Intel x86 platform where AVX vector instructions are implemented
(see `Intel's Guide `_
and check the `Wiki `_ page to see
the processors that supports SIMD).
In this tutorial we will show some examples of how to use the ``HIPP::SIMD`` module to accelerate
your program.
We will begin with an ordinary program using the non-SIMD instructions.
Then we show how to convert it to the SIMD version.
Sometimes it is useful to go back to the SIMD intrinsics
(i.e., those defined in ````),
so we also show how to implement the program
with only the intrinsics and how HIPP's OOP interface interacts with the intrinsics.
A Simple Example
================================
Let us begin with a simple example: the addition of two arrays, "arr2 = arr1 + arr2";
This is a trival task, but it shows the basic ideas of how to vectorize your operations.
It demonstrates three parts that are used in (almost) every SIMD program:
(1) memory management,
(2) load/store operations, and
(3) arithmetic operations.
The non-SIMD version
-------------------------
The following code sample demonstrates the array addition without any SIMD operation::
void vecadd_nonSIMD(){
const int N = 1024*8;
float *arr1 = new float[N],
*arr2 = new float[N]; // allocate memory for 8192 floats
for(int i=0; i`_, for example).
C++17 also standardize the ``std::aligned_alloc`` (see the `cpp reference `_, for example).
You may use any of them, but the most convinient way is to use HIPP's memory interface.
To replace the two ``new`` operations in the non-SIMD version, you write::
#include // HIPP's SIMD library
#include // MemObj and prt_a
typedef HIPP::SIMD::Vec f8_t;
typedef HIPP::MemObj mem_t;
const int N = 1024;
f8_t *arr1 = mem_t::alloc_a(N, 1), // allocate aligned memory
*arr2 = mem_t::alloc_a(N, 1);
The two header files are necessary here and in the following - they provide the
interface to HIPP's SIMD classes and other useful functions, respectively.
The SIMD classes are defined in the namespace ``HIPP::SIMD``. Other useful functions
are defined in the namespace ``HIPP``.
All HIPP SIMD vectors are defined through the class template ``Vec``, which
hosts an array of ``n`` elements each typed ``scalar_type``. We alias ``f8_t``
as a vector that consists of 8 single-precision floats.
To allocate aligned memory for the vector type, you define the memory allocator
``MemObj`` of it (we alias it as ``mem_t``). Then the call of ``addr = mem_t::alloc(N, A)``
allocates space for ``N`` vectors (i.e., ``N*8`` floats) which is aligned at ``A`` minimal
alignments of ``f8_t`` (i.e., ``A * alignof(f8_t)`` ), and returns its address into ``addr``.
The call of ``mem_t::dealloc(addr)`` release the memory allocated by the allocator.
Note that you cannot define a SIMD vector of arbitrary size. An AVX-compatible
processor provide 256-bit "ymm" registors, so you can use a pack of 8
floats (8*32=256 bits).
The second part we need to change is the memory access operations (i.e., load/store operations).
This can be done through the methods ``load`` and ``store`` of the ``Vec`` type.
``load( (Vec *)addr )`` load a vector of data elements from an aligned address ``addr``
into the ``Vec`` instance, while ``store( (Vec *)addr )`` store the data in the
instance to the aligned memory address ``addr``. Therefore, the part of the initialization
code can be written as::
f8_t low {7.0f,6.0f,5.0f,4.0f,3.0f,2.0f,1.0f,0.0f}, // 0, 1, 2, ..., 7
hi = f8_t(N*8.0f) - low, // 8192, 8191, 8190, ..., 8185
stride = f8_t(8.0); // 8, 8, 8, ..., 8
for(int i=0; i // HIPP's SIMD library
#include // MemObj and prt_a
typedef HIPP::SIMD::Vec f8_t;
typedef HIPP::MemObj mem_t;
void vecadd_SIMD(){
const int N = 1024;
f8_t *arr1 = mem_t::alloc_a(N, 1),
*arr2 = mem_t::alloc_a(N, 1); // allocate aligned memory
f8_t low {7.0f,6.0f,5.0f,4.0f,3.0f,2.0f,1.0f,0.0f},
hi = f8_t(N*8.0f) - low,
stride = f8_t(8.0);
for(int i=0; i // SIMD instructions
#include // MemObj and prt_a
void vecadd_intrinsics(){
const int N = 1024, size_vec = sizeof(__m256);
__m256 *arr1 = (__m256 *)aligned_alloc(size_vec, size_vec*N),
*arr2 = (__m256 *)aligned_alloc(size_vec, size_vec*N);
__m256 low = _mm256_set_ps(7.0f,6.0f,5.0f,4.0f,3.0f,2.0f,1.0f,0.0f),
hi = _mm256_sub_ps( _mm256_set1_ps(float(N*8)), low),
stride = _mm256_set1_ps(8.0f);
for(int i=0; i`_
for the detail signatures of this functions.
Example: Matrix-matrix Multiplication
=======================================
Matrix-matrix multiplication is another example where data-parallelism can be used
for acceleration. Given two matrix :math:`A \in \mathbb{R}^{m\times n}` and ::math:`B \in \mathbb{R}^{n\times p}`,
the matrix-matrix multiplication maps them into another matrix :math:`C \in \mathbb{R}^{m\times p}`, where each
element of :math:`C` is given by
.. math::
C_{\rm i, k} = \sum_{j=1}^{n} A_{\rm i,j}B_{\rm j,k}.
Many subroutines have been developed for different types of matrix-matrix multiplication. Here we implement
a simplified version of "DGEMM", the general matrix-matrix multiplication for double-precision floating-points, which
can be found in the BLAS libraries.
A brute-force implementation simply consists of three nested loops::
void dgemm_nonSIMD(const double *A, const double *B, double *C,
int m, int n, int p)
{
for(int i=0; i d4_t;
void dgemm_SIMD(const double *A, const double *B, double *C,
int m, int n, int p){
for(int i=0; i