AVX (256-bit) Vectors ============================== The following classes are all defined within namespace ``HIPP::SIMD``. .. namespace:: HIPP::SIMD Vec ----------------------------- .. class:: template<> Vec A vector of four double-precision (256 bits in total) values. ``Vec`` can be **copied**, **copy-constructed**, **moved**, and **move-constructed**. The copy and move operations and destructor are all ``noexcept``. ``Vec`` is binary-compatible with the intrinsic type ``__m256d``, i.e., they have the same length and alignment. .. type:: double scal_t float scal_hp_t __m256d vec_t __m128d vec_hc_t __m128 vec_hp_t int64_t iscal_t int32_t iscal_hp_t __m256i ivec_t __m128d ivec_hp_t __mmask8 mask8_t Type aliases. ``scal_t`` is the scalar type (i.e., element type) of the SIMD vector. ``scal_hp_t`` is half-precision scalar type. ``vec_t`` is the intrinsic SIMD vector, ``vec_hc_t`` and ``vec_hp_t`` represent the half-precision and half-count types, respectively. Scalar and vector types for integers are also defined. .. enum:: @simd_vecd4enum : size_t .. enumerator:: NPACK=4 NBIT=256 VECSIZE=sizeof(vec_t) SCALSIZE=sizeof(scal_t) ``NPACK`` is the number of scalars in the vector, ``NBIT`` is the number of bits of the vector register. ``VECSIZE`` and ``SCALSIZE`` are size in bytes of the vector and scalar. .. function:: Vec() noexcept Vec( scal_t e3, scal_t e2, scal_t e1, scal_t e0 ) noexcept explicit Vec( scal_t a ) noexcept explicit Vec( caddr_t mem_addr ) noexcept Vec( const vec_t &a ) noexcept Vec( const scal_t *base_addr, \ ivec_t vindex, const int scale=SCALSIZE ) noexcept Vec( vec_t src, const scal_t *base_addr, \ ivec_t vindex, vec_t mask, const int scale=SCALSIZE ) noexcept Initializers. (1) Default Initializer: ``Vec()`` gives an un-initialized vector. (2) ``Vec(e3, e2, e1, e0)`` constructs a vector of four given elements, from higher address value ``e3``, to lower address value ``e0``. (3) ``Vec(scal_t a)`` constructs a vector of four repeated scalar value ``a``. (4) ``Vec(mem_addr)``: the four elements are loaded from the memory address ``mem_addr`` (must be aligned at 32-byte boundary). (5) ``Vec(const vec_t &a)``: copy the intrinsic vector ``a``. (6) ``Vec(base_addr, vindex, scale)``: load using :func:`gather`. (7) ``Vec(src, base_addr, vindex, mask, scale)``: load using :func:`gatherm`. The address type ``caddr_t`` can be either ``const double *``, ``const vec_t *`` or ``const Vec *``. .. function:: ostream & info( ostream &os = cout, int fmt_cntl = 1 ) const friend ostream & operator<<( ostream &os, const Vec &v ) ``info()`` displays the content of the vector to ``os``. :arg fmt_cntl: Control the display format. 0 for an inline printing and 1 for a verbose, multiple-line version. :return: The argument ``os`` is returned. The overloaded `<<` operator is equivalent to ``info()`` with default ``fmt_cntl``. The returned reference to ``os`` allows you to chain the outputs, such as ``vec.info(cout) << " continue printing " << std::endl``. .. function:: const vec_t & val() const noexcept vec_t & val() noexcept const scal_t & operator[]( size_t n )const noexcept scal_t & operator[]( size_t n ) noexcept ``val()`` return the intrinsic vector value. ``operator[](n)`` takes the n-th scalar element from the vector. .. function:: Vec & load( caddr_t mem_addr ) noexcept Vec & loadu( caddr_t mem_addr ) noexcept Vec & loadm( caddr_t mem_addr, ivec_t mask ) noexcept Vec & load1( const scal_t *mem_addr ) noexcept Vec & bcast( const scal_t *mem_addr ) noexcept Vec & bcast( const vec_hc_t *mem_addr ) noexcept Vec & gather( const scal_t *base_addr, ivec_t vindex, \ const int scale=SCALSIZE ) noexcept Vec & gatherm( vec_t src, const scal_t *base_addr, ivec_t vindex, \ vec_t mask, const int scale=SCALSIZE ) noexcept Vec & gather_idxhp( const scal_t *base_addr, ivec_hp_t vindex, \ const int scale=SCALSIZE ) noexcept Vec & gatherm_idxhp( vec_t src, const scal_t *base_addr, ivec_hp_t vindex, \ vec_t mask, const int scale=SCALSIZE ) noexcept Load operations: load data from memory. The address type ``caddr_t`` can be either ``const double *``, ``const vec_t *`` or ``const Vec *``. (1) ``load()`` loads a pack of 4 double precision floating-point scalar values into the calling instance from the aligned address ``mem_addr``. (2) ``loadu()`` allows that ``mem_addr`` is not aligned. (3) ``loadm()`` uses ``mask`` (elements are zeroed out when the highest bit of the corresponding element is not set). (4) ``load1()`` load a single scalar value and repeats it four times to make a vector. (5) ``bcast(const scal_t *)`` is the same as ``load1()``. (6) ``bcast(const vec_hc_t *)`` loads two scalar values and repeats them twice to make a vector. (7) ``gather()`` loads 4 scalar values from address starting at ``base_addr``, each offset by the corresponding 64-bit element in ``vindex`` (in bytes, and scaled by ``scale``; ``scale`` can be 1, 2, 4, or 8). (8) ``gatherm()`` is the same as ``gather()`` but using ``mask`` (elements are copied from src when the highest bit is not set in the corresponding element). (9) ``gather_idxhp()`` is like ``gather()`` but uses 32-bit offset. (10) ``gatherm_idxhp()`` us like ``gatherm()`` but uses 32-bit offset. .. function:: const Vec & store( addr_t mem_addr ) const noexcept const Vec & storeu( addr_t mem_addr ) const noexcept const Vec & storem( addr_t mem_addr, ivec_t mask ) const noexcept const Vec & stream( addr_t mem_addr ) const noexcept const Vec & scatter( void *base_addr, \ ivec_t vindex, int scale=SCALSIZE ) const noexcept const Vec & scatterm( void *base_addr, mask8_t k,\ ivec_t vindex, int scale=SCALSIZE ) const noexcept const Vec & scatter_idxhp( void *base_addr, \ ivec_hp_t vindex, int scale=SCALSIZE ) const noexcept const Vec & scatterm_idxhp( void *base_addr, mask8_t k,\ ivec_hp_t vindex, int scale=SCALSIZE ) const noexcept Vec & store( addr_t mem_addr ) noexcept Vec & storeu( addr_t mem_addr ) noexcept Vec & storem( addr_t mem_addr, ivec_t mask ) noexcept Vec & stream( addr_t mem_addr ) noexcept Vec & scatter( void *base_addr, \ ivec_t vindex, int scale=SCALSIZE ) noexcept Vec & scatterm( void *base_addr, mask8_t k,\ ivec_t vindex, int scale=SCALSIZE ) noexcept Vec & scatter_idxhp( void *base_addr, \ ivec_hp_t vindex, int scale=SCALSIZE ) noexcept Vec & scatterm_idxhp( void *base_addr, mask8_t k,\ ivec_hp_t vindex, int scale=SCALSIZE ) noexcept Store operations: store element from the current instance to a memory location. The address type ``addr_t`` can be either ``double *``, ``vec_t *`` or ``Vec *``. Each store operation has a non-``const`` version used for a non-constant instance. All the store operations return the reference to the instance itself. (1) ``store()`` stores 4 double precision floating-point scalar values into the aligned address ``mem_addr``. (2) ``storeu()`` does not need the address to be aligned. (3) ``storem()`` uses the ``mask`` (elements are not stored when the highest bit is not set in the corresponding element). (4) ``stream()`` uses a non-temporal memory hint. ``mem_addr`` must be aligned. (5) ``scatter()`` stores elements into the address starting at ``base_addr`` and offset by each 64-bit element in ``vindex`` (in byte, and scaled by ``scale``; ``scale`` can be 1, 2, 4, or 8). (6) ``scatterm()`` is the same as ``scatter()`` but uses a ``mask`` (elements are not stored when the corresponding mask bit is not set). (7) ``scatter_idxhp()`` is the same as ``scatter()`` but uses 32-bit offset. (8) ``scatterm_idxhp()`` is the same as ``scatterm()`` but uses 32-bit offset. .. function:: scal_t to_scal( )const noexcept int movemask( ) const noexcept Vec movedup( ) const noexcept ``to_scal()`` returns the lower double-precision floating-point scalar value. ``movemask()`` sets each bit of the returned value based on the corresponding most significate bit in each double precision floating-point scalar value. ``movedup()`` duplicates even-indexed scalar values. .. function:: Vec & set( scal_t e3, scal_t e2, scal_t e1, scal_t e0 ) noexcept Vec & set1( scal_t a ) noexcept Vec & set1( vec_hc_t a ) noexcept Vec & set() noexcept Vec & setzero() noexcept Vec & undefined() noexcept Set the scalar values of the calling instance. (1) ``set(e3,e2,e1,e0)`` sets each elements from the higher address value ``e3`` to lower address value ``e0``. (2) ``set1(scal_t a)`` repeats a scalar value 4 times. (3) ``set1(vec_hc_t a)`` repeats the lower scalar value of ``a`` 4 times. (4) ``set()`` is the same as ``setzero()``. (5) ``setzero()`` set all bits to zero. (6) ``undefined()`` set scalars to undefined values. .. function:: Vec operator+( const Vec &a ) const noexcept Vec operator-( const Vec &a ) const noexcept Vec operator*( const Vec &a ) const noexcept Vec operator/( const Vec &a ) const noexcept Vec operator++(int) noexcept Vec & operator++() noexcept Vec operator--(int) noexcept Vec & operator--() noexcept Vec & operator+=( const Vec &a ) noexcept Vec & operator-=( const Vec &a ) noexcept Vec & operator*=( const Vec &a ) noexcept Vec & operator/=( const Vec &a ) noexcept Vec hadd( const Vec &a ) const noexcept Vec hsub( const Vec &a ) const noexcept Arithmetic operations. All of the above operations are element-wise. ``hadd()`` performs horizontal addtion, i.e., the result of a.hadd(b) is { a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3] }. ``hsub()`` performs horizontal subtration, i.e., the result of a.hsub(b) is { a[0]-a[1], b[0]-b[1], a[2]-a[3], b[2]-b[3] }. .. function:: Vec operator&( const Vec &a ) const noexcept Vec andnot( const Vec &a ) const noexcept Vec operator|( const Vec &a ) const noexcept Vec operator~()const noexcept Vec operator^( const Vec &a ) const noexcept Vec & operator&=( const Vec &a ) noexcept Vec & operator|=( const Vec &a ) noexcept Vec & operator^=( const Vec &a ) noexcept Bitwise Logic operations. .. function:: Vec operator==( const Vec &a ) const noexcept Vec operator!=( const Vec &a ) const noexcept Vec operator<( const Vec &a ) const noexcept Vec operator<=( const Vec &a ) const noexcept Vec operator>( const Vec &a ) const noexcept Vec operator>=( const Vec &a ) const noexcept Relation (comparison) operations. The comparision is element-wise for each scalar. If true, all the bits are set in the corresponding result element. .. function:: Vec blend( const Vec &a, const int imm8 ) const noexcept Vec blend( const Vec &a, const Vec &mask ) const noexcept Blend two vectors using control mask ``imm8``. For each bit in ``imm8``, if set, taken the corresponding result element from ``b``, otherwise from ``a``. The second version uses a vector ``mask``, i.e., each mask bit is taken from the highest bit of the corresponding 64-bit elements. .. function:: Vec sqrt() const noexcept Vec ceil() const noexcept Vec floor() const noexcept Vec round( const int rounding ) const noexcept Vec max( const Vec &a ) const noexcept Vec min( const Vec &a ) const noexcept Vec sin() const noexcept Vec cos() const noexcept Vec log() const noexcept Vec exp() const noexcept Vec pow( const Vec &a ) const noexcept Elementary math functions. ``sin()``, ``cos()``, ``log()``, ``exp()``, ``pow()`` may not be serialized, depending on the compiler. Vector ---------------------- .. class:: template<> Vec A vector of eight single-precision (256 bits in total) values. ``Vec`` can be **copied**, **copy-constructed**, **moved**, and **move-constructed**. The copy and move operations and destructor are all ``noexcept``. ``Vec`` is binary-compatible with the intrinsic type ``__m256``, i.e., they have the same length and alignment. .. type:: float scal_t __m256 vec_t __m128 vec_hc_t int32_t iscal_t __m256i ivec_t __mmask8 mask8_t Type aliases. ``scal_t`` is the scalar type (i.e., element type) of the SIMD vector. ``vec_t`` is the intrinsic SIMD vector, ``vec_hc_t`` represents the half-count type. .. enum:: @simd_vecd4enum : size_t .. enumerator:: NPACK=8 NBIT=256 VECSIZE=sizeof(vec_t) SCALSIZE=sizeof(scal_t) ``NPACK`` is the number of scalars in the vector, ``NBIT`` is the number of bits of the vector register. ``VECSIZE`` and ``SCALSIZE`` are size in bytes of the vector and scalar. .. function:: Vec() noexcept Vec( scal_t e7, scal_t e6, scal_t e5, scal_t e4, \ scal_t e3, scal_t e2, scal_t e1, scal_t e0 ) noexcept explicit Vec( scal_t a ) noexcept explicit Vec( caddr_t mem_addr ) noexcept Vec( const vec_t &a ) noexcept Vec( const scal_t *base_addr, ivec_t vindex, \ const int scale=SCALSIZE ) noexcept Vec( vec_t src, const scal_t *base_addr, \ ivec_t vindex, vec_t mask, const int scale ) noexcept Initializers. (1) Default Initializer: ``Vec()`` gives an un-initialized vector. (2) ``Vec(e7, e6, ..., e0)`` constructs a vector of eight given elements from higher address value ``e7``, to lower address value ``e0``. (3) ``Vec(scal_t a)`` constructs a vector of eight repeated scalar value ``a``. (4) ``Vec(mem_addr)``: the eight elements are loaded from the memory address ``mem_addr`` (must be aligned at 32-byte boundary). (5) ``Vec(const vec_t &a)``: copy the intrinsic vector ``a``. (6) ``Vec(base_addr, vindex, scale)``: load using :func:`gather`. (7) ``Vec(src, base_addr, vindex, mask, scale)``: load using :func:`gatherm`. The address type ``caddr_t`` can be either ``const float *``, ``const vec_t *`` or ``const Vec *``. .. function:: ostream & info( ostream &os = cout, int fmt_cntl = 1 ) const friend ostream & operator<<( ostream &os, const Vec &v ) ``info()`` displays the content of the vector to ``os``. :arg fmt_cntl: Control the display format. 0 for an inline printing and 1 for a verbose, multiple-line version. :return: The argument ``os`` is returned. The overloaded `<<` operator is equivalent to ``info()`` with default ``fmt_cntl``. The returned reference to ``os`` allows you to chain the outputs, such as ``vec.info(cout) << " continue printing " << std::endl``. .. function:: const vec_t & val() const noexcept vec_t & val() noexcept const scal_t & operator[]( size_t n ) const noexcept scal_t & operator[]( size_t n ) noexcept ``val()`` return the intrinsic vector value. ``operator[](n)`` takes the n-th scalar element from the vector. .. function:: Vec & load( caddr_t mem_addr ) noexcept Vec & loadu( caddr_t mem_addr ) noexcept Vec & loadm( caddr_t mem_addr, ivec_t mask ) noexcept Vec & load1( const scal_t *mem_addr ) noexcept Vec & bcast( const scal_t *mem_addr ) noexcept Vec & bcast( const vec_hc_t *mem_addr ) noexcept Vec & gather( const scal_t *base_addr, \ ivec_t vindex, const int scale=SCALSIZE ) noexcept Vec & gatherm( vec_t src, const scal_t *base_addr, \ ivec_t vindex, vec_t mask, const int scale=SCALSIZE ) noexcept Load operations: load data from memory. The address type ``caddr_t`` can be either ``const double *``, ``const vec_t *`` or ``const Vec *``. (1) ``load()`` loads a pack of 8 single precision floating-point scalar values into the calling instance from the aligned address ``mem_addr``. (2) ``loadu()`` allows that ``mem_addr`` is not aligned. (3) ``loadm()`` uses ``mask`` (elements are zeroed out when the highest bit of the corresponding element is not set). (4) ``load1()`` load a single scalar value and repeats it eight times to make a vector. (5) ``bcast(const scal_t *)`` is the same as ``load1()``. (6) ``bcast(const vec_hc_t *)`` loads four scalar values and repeats them twice to make a vector. (7) ``gather()`` loads 8 scalar values from address starting at ``base_addr``, each offset by the corresponding 32-bit element in ``vindex`` (in bytes, and scaled by ``scale``; ``scale`` can be 1, 2, 4, or 8). (8) ``gatherm()`` is the same as ``gather()`` but using ``mask`` (elements are copied from src when the highest bit is not set in the corresponding element). .. function:: const Vec & store( addr_t mem_addr ) const noexcept const Vec & storeu( addr_t mem_addr ) const noexcept const Vec & storem( addr_t mem_addr, ivec_t mask ) const noexcept const Vec & stream( addr_t mem_addr ) const noexcept const Vec & scatter(void *base_addr, ivec_t vindex, \ int scale=SCALSIZE) const noexcept const Vec & scatterm(void *base_addr, mask8_t k, ivec_t vindex, \ int scale=SCALSIZE) const noexcept Vec & store( addr_t mem_addr ) noexcept Vec & storeu( addr_t mem_addr ) noexcept Vec & storem( addr_t mem_addr, ivec_t mask ) noexcept Vec & stream( addr_t mem_addr ) noexcept Vec & scatter(void *base_addr, ivec_t vindex, \ int scale=SCALSIZE) noexcept Vec & scatterm(void *base_addr, mask8_t k, ivec_t vindex, \ int scale=SCALSIZE) noexcept Store operations: store element from the current instance to a memory location. The address type ``addr_t`` can be either ``double *``, ``vec_t *`` or ``Vec *``. Each store operation has a non-``const`` version used for a non-constant instance. All the store operations return the reference to the instance itself. (1) ``store()`` stores 8 single precision floating-point scalar values into the aligned address ``mem_addr``. (2) ``storeu()`` does not need the address to be aligned. (3) ``storem()`` uses the ``mask`` (elements are not stored when the highest bit is not set in the corresponding element). (4) ``stream()`` uses a non-temporal memory hint. ``mem_addr`` must be aligned. (5) ``scatter()`` stores elements into the address starting at ``base_addr`` and offset by each 32-bit element in ``vindex`` (in byte, and scaled by ``scale``; ``scale`` can be 1, 2, 4, or 8). (6) ``scatterm()`` is the same as ``scatter()`` but uses a ``mask`` (elements are not stored when the corresponding mask bit is not set). .. function:: scal_t to_scal( )const noexcept int movemask( ) const noexcept Vec movehdup( ) const noexcept Vec moveldup( ) const noexcept ``to_scal()`` returns the lower single precision floating-point scalar value. ``movemask()`` sets each bit of the returned value based on the corresponding most significate bit in each single precision floating-point scalar value. ``movehdup()`` duplicates odd-indexed scalar values. ``moveldup()`` duplicates even-indexed scalar values. .. function:: Vec & set( scal_t e7, scal_t e6, scal_t e5, scal_t e4, \ scal_t e3, scal_t e2, scal_t e1, scal_t e0 ) noexcept Vec & set1( scal_t a ) noexcept Vec & set1( vec_hc_t a ) noexcept Vec & set() noexcept Vec & setzero() noexcept Vec & undefined() noexcept Set the scalar values of the calling instance. (1) ``set(e7,e6,...,e0)`` sets each elements from the higher address value ``e7`` to lower address value ``e0``. (2) ``set1(scal_t a)`` repeats a scalar value 8 times. (3) ``set1(vec_hc_t a)`` repeats the lower scalar value of ``a`` 8 times. (4) ``set()`` is the same as ``setzero()``. (5) ``setzero()`` set all bits to zero. (6) ``undefined()`` set scalars to undefined values. .. function:: Vec operator+( const Vec &a ) const noexcept Vec operator-( const Vec &a ) const noexcept Vec operator*( const Vec &a ) const noexcept Vec operator/( const Vec &a ) const noexcept Vec operator++(int) noexcept Vec & operator++() noexcept Vec operator--(int) noexcept Vec & operator--() noexcept Vec & operator+=(const Vec &a) noexcept Vec & operator-=(const Vec &a) noexcept Vec & operator*=(const Vec &a) noexcept Vec & operator/=(const Vec &a) noexcept Vec hadd( const Vec &a ) const noexcept Vec hsub( const Vec &a ) const noexcept Arithmetic operations. All of the above operations are element-wise. ``hadd()`` performs horizontal addtion, i.e., the result of a.hadd(b) is { a[0]+a[1], a[2]+a[3], b[0]+b[1], b[2]+b[3], ..., b[4]+b[5], b[6]+b[7] }. ``hsub()`` performs horizontal subtration, i.e., the result of a.hsub(b) is { a[0]-a[1], a[2]-a[3], b[0]-b[1], b[2]-b[3], ..., b[4]-b[5], b[6]-b[7] }. .. function:: Vec operator&( const Vec &a ) const noexcept Vec andnot( const Vec &a ) const noexcept Vec operator|( const Vec &a ) const noexcept Vec operator~()const noexcept Vec operator^( const Vec &a ) const noexcept Vec & operator&=( const Vec &a ) noexcept Vec & operator|=( const Vec &a ) noexcept Vec & operator^=( const Vec &a ) noexcept Bitwise Logic operations. .. function:: Vec operator==( const Vec &a ) const noexcept Vec operator!=( const Vec &a ) const noexcept Vec operator<( const Vec &a ) const noexcept Vec operator<=( const Vec &a ) const noexcept Vec operator>( const Vec &a ) const noexcept Vec operator>=( const Vec &a ) const noexcept Relation (comparison) operations. The comparision is element-wise for each scalar. If true, all the bits are set in the corresponding result element. .. function:: Vec blend( const Vec &a, const int imm8 ) const noexcept Vec blend( const Vec &a, const Vec &mask ) const noexcept Blend two vectors using control mask ``imm8``. For each bit in ``imm8``, if set, taken the corresponding result element from ``b``, otherwise from ``a``. The second version uses a vector ``mask``, i.e., each mask bit is taken from the highest bit of the corresponding 64-bit elements. .. function:: Vec rcp() const noexcept Vec sqrt() const noexcept Vec rsqrt() const noexcept Vec ceil() const noexcept Vec floor() const noexcept Vec round( const int rounding ) const noexcept Vec max( const Vec &a ) const noexcept Vec min( const Vec &a ) const noexcept Elementary math functions. .. function:: Vec log2_fast( ) const noexcept Vec log_fast( ) const noexcept Vec log10_fast( ) const noexcept Vec log2_faster( ) const noexcept Vec log_faster( ) const noexcept Vec log10_faster( ) const noexcept Vec pow2_fast() const noexcept Vec exp_fast() const noexcept Vec pow10_fast() const noexcept Vec pow2_faster() const noexcept Vec exp_faster() const noexcept Vec pow10_faster() const noexcept Vectorized math functions. These functions are not supported by hardware. They are implemented by approximation algorithms. ``xxx_faster()`` is faster than ``xxx_fast()``, but has lower precision.