AVX (256-bit) Vectors

The following classes are all defined within namespace HIPP::SIMD.

Vec<double, 4>

template<>
class Vec<double, 4>

A vector of four double-precision (256 bits in total) values. Vec<double, 4> can be copied, copy-constructed, moved, and move-constructed. The copy and move operations and destructor are all noexcept.

Vec<double, 4> is binary-compatible with the intrinsic type __m256d, i.e., they have the same length and alignment.

typedef double scal_t
typedef float scal_hp_t
typedef __m256d vec_t
typedef __m128d vec_hc_t
typedef __m128 vec_hp_t
typedef int64_t iscal_t
typedef int32_t iscal_hp_t
typedef __m256i ivec_t
typedef __m128i ivec_hp_t
typedef __mmask8 mask8_t

Type aliases. scal_t is the scalar type (i.e., element type) of the SIMD vector. scal_hp_t is the half-precision scalar type. vec_t is the intrinsic SIMD vector type; vec_hc_t and vec_hp_t are the half-count and half-precision vector types, respectively.

Scalar and vector types for integers are also defined.

enum [anonymous] : size_t
enumerator NPACK = 4
enumerator NBIT = 256
enumerator VECSIZE = sizeof(vec_t)
enumerator SCALSIZE = sizeof(scal_t)

NPACK is the number of scalars in the vector, and NBIT is the number of bits in the vector register. VECSIZE and SCALSIZE are the sizes in bytes of the vector and the scalar, respectively.

Vec() noexcept
Vec(scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept
explicit Vec(scal_t a) noexcept
explicit Vec(caddr_t mem_addr) noexcept
Vec(const vec_t &a) noexcept
Vec(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept
Vec(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept

Initializers.

  1. Default initializer: Vec() gives an uninitialized vector.

  2. Vec(e3, e2, e1, e0) constructs a vector from four given elements, ordered from e3 at the highest address down to e0 at the lowest address.

  3. Vec(scal_t a) constructs a vector of four copies of the scalar value a.

  4. Vec(mem_addr): the four elements are loaded from the memory address mem_addr (must be aligned at 32-byte boundary).

  5. Vec(const vec_t &a): copy the intrinsic vector a.

  6. Vec(base_addr, vindex, scale): load using gather().

  7. Vec(src, base_addr, vindex, mask, scale): load using gatherm().

The address type caddr_t can be either const double *, const vec_t * or const Vec<double, 4> *.
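The argument order of the four-element initializer is easy to misread: e0 is written last but occupies the lowest address. A minimal scalar sketch of that ordering, using std::array in place of the SIMD register (pack4 and check_pack4 are hypothetical names for illustration, not part of HIPP):

```cpp
#include <array>
#include <cassert>

// Hypothetical scalar model of Vec(e3, e2, e1, e0): e0 is stored in
// element 0 (lowest address) and e3 in element 3 (highest address).
inline std::array<double, 4> pack4(double e3, double e2, double e1, double e0) {
    return {e0, e1, e2, e3};
}

// Usage: the last argument becomes the first element in memory.
inline void check_pack4() {
    const std::array<double, 4> v = pack4(3.0, 2.0, 1.0, 0.0);
    assert(v[0] == 0.0 && v[3] == 3.0);
}
```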

ostream &info(ostream &os = cout, int fmt_cntl = 1) const
friend ostream &operator<<(ostream &os, const Vec &v)

info() displays the content of the vector to os.

Parameters

fmt_cntl – Control the display format. 0 for an inline printing and 1 for a verbose, multiple-line version.

Returns

The argument os is returned.

The overloaded << operator is equivalent to info() with default fmt_cntl.

The returned reference to os allows you to chain the outputs, such as vec.info(cout) << " continue printing " << std::endl.

const vec_t &val() const noexcept
vec_t &val() noexcept
const scal_t &operator[](size_t n) const noexcept
scal_t &operator[](size_t n) noexcept

val() returns the intrinsic vector value. operator[](n) accesses the n-th scalar element of the vector.

Vec &load(caddr_t mem_addr) noexcept
Vec &loadu(caddr_t mem_addr) noexcept
Vec &loadm(caddr_t mem_addr, ivec_t mask) noexcept
Vec &load1(const scal_t *mem_addr) noexcept
Vec &bcast(const scal_t *mem_addr) noexcept
Vec &bcast(const vec_hc_t *mem_addr) noexcept
Vec &gather(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept
Vec &gatherm(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept
Vec &gather_idxhp(const scal_t *base_addr, ivec_hp_t vindex, const int scale = SCALSIZE) noexcept
Vec &gatherm_idxhp(vec_t src, const scal_t *base_addr, ivec_hp_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept

Load operations: load data from memory. The address type caddr_t can be either const double *, const vec_t * or const Vec<double, 4> *.

  1. load() loads a pack of 4 double precision floating-point scalar values into the calling instance from the aligned address mem_addr.

  2. loadu() does not require mem_addr to be aligned.

  3. loadm() uses mask (elements are zeroed out when the highest bit of the corresponding element is not set).

  4. load1() loads a single scalar value and repeats it four times to make a vector.

  5. bcast(const scal_t *) is the same as load1().

  6. bcast(const vec_hc_t *) loads two scalar values and repeats them twice to make a vector.

  7. gather() loads 4 scalar values from addresses starting at base_addr, each offset by the corresponding 64-bit element in vindex (in bytes, and scaled by scale; scale can be 1, 2, 4, or 8).

  8. gatherm() is the same as gather() but uses a mask (elements are copied from src when the highest bit is not set in the corresponding mask element).

  9. gather_idxhp() is like gather() but uses 32-bit offset.

  10. gatherm_idxhp() is like gatherm() but uses 32-bit offset.
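The addressing rule of gather() and the masking rule of loadm() can be made concrete with a scalar model. The sketch below is plain C++ written only to illustrate the semantics described above; gather_model and loadm_model are hypothetical names, the index and mask vectors are represented as std::array<std::int64_t, 4>, and none of this is HIPP's implementation.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of gather(): element i is read from the byte
// address base_addr + vindex[i] * scale.
inline std::array<double, 4> gather_model(const double *base_addr,
                                          const std::array<std::int64_t, 4> &vindex,
                                          int scale = sizeof(double)) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const char *bytes = reinterpret_cast<const char *>(base_addr);
    std::array<double, 4> out{};
    for (int i = 0; i < 4; ++i) {
        // memcpy avoids a misaligned typed read when scale < sizeof(double).
        std::memcpy(&out[i], bytes + vindex[i] * scale, sizeof(double));
    }
    return out;
}

// Hypothetical scalar model of loadm(): element i is loaded only if the
// highest bit of mask[i] is set; otherwise it is zero.
inline std::array<double, 4> loadm_model(const double *mem_addr,
                                         const std::array<std::int64_t, 4> &mask) {
    std::array<double, 4> out{};
    for (int i = 0; i < 4; ++i) {
        if (mask[i] < 0) out[i] = mem_addr[i];  // negative means highest bit set
    }
    return out;
}
```

With the default scale (the size of one double), the entries of vindex act as ordinary array indices; with scale set to 1 they are raw byte offsets.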

const Vec &store(addr_t mem_addr) const noexcept
const Vec &storeu(addr_t mem_addr) const noexcept
const Vec &storem(addr_t mem_addr, ivec_t mask) const noexcept
const Vec &stream(addr_t mem_addr) const noexcept
const Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) const noexcept
const Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) const noexcept
const Vec &scatter_idxhp(void *base_addr, ivec_hp_t vindex, int scale = SCALSIZE) const noexcept
const Vec &scatterm_idxhp(void *base_addr, mask8_t k, ivec_hp_t vindex, int scale = SCALSIZE) const noexcept
Vec &store(addr_t mem_addr) noexcept
Vec &storeu(addr_t mem_addr) noexcept
Vec &storem(addr_t mem_addr, ivec_t mask) noexcept
Vec &stream(addr_t mem_addr) noexcept
Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) noexcept
Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) noexcept
Vec &scatter_idxhp(void *base_addr, ivec_hp_t vindex, int scale = SCALSIZE) noexcept
Vec &scatterm_idxhp(void *base_addr, mask8_t k, ivec_hp_t vindex, int scale = SCALSIZE) noexcept

Store operations: store elements from the current instance to a memory location. The address type addr_t can be either double *, vec_t * or Vec<double, 4> *.

Each store operation has a non-const version used for a non-constant instance.

All the store operations return the reference to the instance itself.

  1. store() stores 4 double precision floating-point scalar values into the aligned address mem_addr.

  2. storeu() does not need the address to be aligned.

  3. storem() uses the mask (elements are not stored when the highest bit is not set in the corresponding element).

  4. stream() uses a non-temporal memory hint. mem_addr must be aligned.

  5. scatter() stores elements into addresses starting at base_addr, each offset by the corresponding 64-bit element in vindex (in bytes, and scaled by scale; scale can be 1, 2, 4, or 8).

  6. scatterm() is the same as scatter() but uses a mask (elements are not stored when the corresponding mask bit is not set).

  7. scatter_idxhp() is the same as scatter() but uses 32-bit offset.

  8. scatterm_idxhp() is the same as scatterm() but uses 32-bit offset.
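As an illustration of the masked scatter rule, the following scalar sketch writes each element to its own byte offset and skips the elements whose mask bit is clear. It assumes that bit i of the mask k corresponds to element i; scatterm_model is a hypothetical name and the code is a reference model, not HIPP's implementation.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of scatterm(): element i is written to the byte
// address base_addr + vindex[i] * scale only if bit i of k is set
// (assumed bit-to-element mapping).
inline void scatterm_model(const std::array<double, 4> &v, void *base_addr,
                           std::uint8_t k,
                           const std::array<std::int64_t, 4> &vindex,
                           int scale = sizeof(double)) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    char *bytes = static_cast<char *>(base_addr);
    for (int i = 0; i < 4; ++i) {
        if (k & (1u << i)) {
            std::memcpy(bytes + vindex[i] * scale, &v[i], sizeof(double));
        }
    }
}
```

Passing a mask with all four low bits set (0x0F) reproduces the unmasked scatter() behaviour in this model.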

scal_t to_scal() const noexcept
int movemask() const noexcept
Vec movedup() const noexcept

to_scal() returns the lowest double-precision floating-point scalar value. movemask() sets each bit of the returned value from the most significant bit of the corresponding double-precision floating-point scalar value. movedup() duplicates the even-indexed scalar values.
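A scalar sketch of movemask() and movedup() may help fix the bit and element order. It assumes that bit i of the result corresponds to element i; movemask_model, movedup_model and check_move_models are hypothetical names, not HIPP functions.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical model of movemask(): bit i of the result is the sign bit
// (most significant bit) of element i.
inline int movemask_model(const std::array<double, 4> &v) {
    int m = 0;
    for (int i = 0; i < 4; ++i) {
        if (std::signbit(v[i])) m |= 1 << i;
    }
    return m;
}

// Hypothetical model of movedup(): each even-indexed element is copied into
// the odd-indexed slot that follows it.
inline std::array<double, 4> movedup_model(const std::array<double, 4> &v) {
    return {v[0], v[0], v[2], v[2]};
}

inline void check_move_models() {
    const std::array<double, 4> v{1.0, -2.0, -0.0, 3.0};
    assert(movemask_model(v) == 6);  // elements 1 and 2 have the sign bit set
    assert(movedup_model(v)[1] == 1.0);
}
```

Note that -0.0 counts as set, because only the sign bit is inspected and not the numeric value.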

Vec &set(scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept
Vec &set1(scal_t a) noexcept
Vec &set1(vec_hc_t a) noexcept
Vec &set() noexcept
Vec &setzero() noexcept
Vec &undefined() noexcept

Set the scalar values of the calling instance.

  1. set(e3,e2,e1,e0) sets each element, from e3 at the highest address down to e0 at the lowest address.

  2. set1(scal_t a) repeats a scalar value 4 times.

  3. set1(vec_hc_t a) repeats the lower scalar value of a 4 times.

  4. set() is the same as setzero().

  5. setzero() sets all bits to zero.

  6. undefined() sets the scalars to undefined values.

Vec operator+(const Vec &a) const noexcept
Vec operator-(const Vec &a) const noexcept
Vec operator*(const Vec &a) const noexcept
Vec operator/(const Vec &a) const noexcept
Vec operator++(int) noexcept
Vec &operator++() noexcept
Vec operator--(int) noexcept
Vec &operator--() noexcept
Vec &operator+=(const Vec &a) noexcept
Vec &operator-=(const Vec &a) noexcept
Vec &operator*=(const Vec &a) noexcept
Vec &operator/=(const Vec &a) noexcept
Vec hadd(const Vec &a) const noexcept
Vec hsub(const Vec &a) const noexcept

Arithmetic operations. All of the above operations are element-wise.

hadd() performs horizontal addition, i.e., the result of a.hadd(b) is { a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3] }. hsub() performs horizontal subtraction, i.e., the result of a.hsub(b) is { a[0]-a[1], b[0]-b[1], a[2]-a[3], b[2]-b[3] }.
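The interleaved layout of the horizontal operations is easy to get wrong, so here is a scalar sketch that follows the two formulas above term by term. The names hadd_model, hsub_model and check_hadd_hsub are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cassert>

using D4 = std::array<double, 4>;

// Scalar model of a.hadd(b): pairwise sums, alternating between a and b.
inline D4 hadd_model(const D4 &a, const D4 &b) {
    return {a[0] + a[1], b[0] + b[1], a[2] + a[3], b[2] + b[3]};
}

// Scalar model of a.hsub(b): pairwise differences in the same layout.
inline D4 hsub_model(const D4 &a, const D4 &b) {
    return {a[0] - a[1], b[0] - b[1], a[2] - a[3], b[2] - b[3]};
}

inline void check_hadd_hsub() {
    const D4 a{1.0, 2.0, 3.0, 4.0};
    const D4 b{10.0, 20.0, 30.0, 40.0};
    const D4 s = hadd_model(a, b);
    assert(s[0] == 3.0 && s[1] == 30.0 && s[2] == 7.0 && s[3] == 70.0);
    const D4 d = hsub_model(a, b);
    assert(d[0] == -1.0 && d[1] == -10.0 && d[2] == -1.0 && d[3] == -10.0);
}
```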

Vec operator&(const Vec &a) const noexcept
Vec andnot(const Vec &a) const noexcept
Vec operator|(const Vec &a) const noexcept
Vec operator~() const noexcept
Vec operator^(const Vec &a) const noexcept
Vec &operator&=(const Vec &a) noexcept
Vec &operator|=(const Vec &a) noexcept
Vec &operator^=(const Vec &a) noexcept

Bitwise Logic operations.

Vec operator==(const Vec &a) const noexcept
Vec operator!=(const Vec &a) const noexcept
Vec operator<(const Vec &a) const noexcept
Vec operator<=(const Vec &a) const noexcept
Vec operator>(const Vec &a) const noexcept
Vec operator>=(const Vec &a) const noexcept

Relation (comparison) operations. The comparison is element-wise for each scalar. If true, all the bits are set in the corresponding result element.
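Since the comparison result is a bit pattern rather than a bool, it can help to see it modelled explicitly. The sketch below represents each result element as a 64-bit integer pattern; less_model, Mask4 and check_less_model are hypothetical names used only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Mask4 = std::array<std::uint64_t, 4>;

// Hypothetical scalar model of operator<: each result element is viewed as a
// 64-bit pattern, all ones where the comparison holds and all zeros elsewhere.
inline Mask4 less_model(const std::array<double, 4> &a,
                        const std::array<double, 4> &b) {
    Mask4 out{};
    for (int i = 0; i < 4; ++i) {
        out[i] = (a[i] < b[i]) ? ~std::uint64_t{0} : std::uint64_t{0};
    }
    return out;
}

inline void check_less_model() {
    const std::array<double, 4> a{1.0, 5.0, 3.0, 7.0};
    const std::array<double, 4> b{2.0, 4.0, 3.0, 8.0};
    const Mask4 m = less_model(a, b);
    const std::uint64_t ones = ~std::uint64_t{0};
    assert(m[0] == ones && m[1] == 0 && m[2] == 0 && m[3] == ones);
}
```

Because a true element has its highest bit set, such a result can serve as the vector mask of blend() or be condensed into an integer with movemask().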

Vec blend(const Vec &a, const int imm8) const noexcept
Vec blend(const Vec &a, const Vec &mask) const noexcept

Blend two vectors using the control mask imm8. For each bit in imm8, if set, take the corresponding result element from b, otherwise from a.

The second version uses a vector mask, i.e., each mask bit is taken from the highest bit of the corresponding 64-bit elements.
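Both blend variants can be summarised by a scalar model. The sketch below uses the two-operand naming of the description (a and b) as free-function arguments; how these map onto the calling instance and the member function's argument is not modelled here. blend_imm_model, blend_vec_model and check_blend_models are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using B4 = std::array<double, 4>;

// Immediate-mask blend: bit i of imm8 selects b[i] when set, a[i] otherwise.
inline B4 blend_imm_model(const B4 &a, const B4 &b, int imm8) {
    B4 out{};
    for (int i = 0; i < 4; ++i) out[i] = ((imm8 >> i) & 1) ? b[i] : a[i];
    return out;
}

// Vector-mask blend: the highest (sign) bit of mask[i] plays the role of bit i.
inline B4 blend_vec_model(const B4 &a, const B4 &b, const B4 &mask) {
    B4 out{};
    for (int i = 0; i < 4; ++i) out[i] = std::signbit(mask[i]) ? b[i] : a[i];
    return out;
}

inline void check_blend_models() {
    const B4 a{1.0, 2.0, 3.0, 4.0};
    const B4 b{10.0, 20.0, 30.0, 40.0};
    const B4 r = blend_imm_model(a, b, 0x5);  // bits 0 and 2 set
    assert(r[0] == 10.0 && r[1] == 2.0 && r[2] == 30.0 && r[3] == 4.0);
    const B4 mask{-1.0, 0.0, 0.0, -0.0};
    const B4 s = blend_vec_model(a, b, mask);
    assert(s[0] == 10.0 && s[1] == 2.0 && s[2] == 3.0 && s[3] == 40.0);
}
```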

Vec sqrt() const noexcept
Vec ceil() const noexcept
Vec floor() const noexcept
Vec round(const int rounding) const noexcept
Vec max(const Vec &a) const noexcept
Vec min(const Vec &a) const noexcept
Vec sin() const noexcept
Vec cos() const noexcept
Vec log() const noexcept
Vec exp() const noexcept
Vec pow(const Vec &a) const noexcept

Elementary math functions. Depending on the compiler, sin(), cos(), log(), exp(), and pow() may not be vectorized, i.e., they may be evaluated one element at a time.

Vec<float, 8>

template<>
class Vec<float, 8>

A vector of eight single-precision (256 bits in total) values. Vec<float, 8> can be copied, copy-constructed, moved, and move-constructed. The copy and move operations and destructor are all noexcept.

Vec<float, 8> is binary-compatible with the intrinsic type __m256, i.e., they have the same length and alignment.

typedef float scal_t
typedef __m256 vec_t
typedef __m128 vec_hc_t
typedef int32_t iscal_t
typedef __m256i ivec_t
typedef __mmask8 mask8_t

Type aliases. scal_t is the scalar type (i.e., element type) of the SIMD vector. vec_t is the intrinsic SIMD vector, vec_hc_t represents the half-count type.

enum [anonymous] : size_t
enumerator NPACK = 8
enumerator NBIT = 256
enumerator VECSIZE = sizeof(vec_t)
enumerator SCALSIZE = sizeof(scal_t)

NPACK is the number of scalars in the vector, and NBIT is the number of bits in the vector register. VECSIZE and SCALSIZE are the sizes in bytes of the vector and the scalar, respectively.

Vec() noexcept
Vec(scal_t e7, scal_t e6, scal_t e5, scal_t e4, scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept
explicit Vec(scal_t a) noexcept
explicit Vec(caddr_t mem_addr) noexcept
Vec(const vec_t &a) noexcept
Vec(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept
Vec(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale) noexcept

Initializers.

  1. Default initializer: Vec() gives an uninitialized vector.

  2. Vec(e7, e6, ..., e0) constructs a vector from eight given elements, ordered from e7 at the highest address down to e0 at the lowest address.

  3. Vec(scal_t a) constructs a vector of eight copies of the scalar value a.

  4. Vec(mem_addr): the eight elements are loaded from the memory address mem_addr (must be aligned at 32-byte boundary).

  5. Vec(const vec_t &a): copy the intrinsic vector a.

  6. Vec(base_addr, vindex, scale): load using gather().

  7. Vec(src, base_addr, vindex, mask, scale): load using gatherm().

The address type caddr_t can be either const float *, const vec_t * or const Vec<float, 8> *.

ostream &info(ostream &os = cout, int fmt_cntl = 1) const
friend ostream &operator<<(ostream &os, const Vec &v)

info() displays the content of the vector to os.

Parameters

fmt_cntl – Control the display format. 0 for an inline printing and 1 for a verbose, multiple-line version.

Returns

The argument os is returned.

The overloaded << operator is equivalent to info() with default fmt_cntl.

The returned reference to os allows you to chain the outputs, such as vec.info(cout) << " continue printing " << std::endl.

const vec_t &val() const noexcept
vec_t &val() noexcept
const scal_t &operator[](size_t n) const noexcept
scal_t &operator[](size_t n) noexcept

val() returns the intrinsic vector value. operator[](n) accesses the n-th scalar element of the vector.

Vec &load(caddr_t mem_addr) noexcept
Vec &loadu(caddr_t mem_addr) noexcept
Vec &loadm(caddr_t mem_addr, ivec_t mask) noexcept
Vec &load1(const scal_t *mem_addr) noexcept
Vec &bcast(const scal_t *mem_addr) noexcept
Vec &bcast(const vec_hc_t *mem_addr) noexcept
Vec &gather(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept
Vec &gatherm(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept

Load operations: load data from memory. The address type caddr_t can be either const float *, const vec_t * or const Vec<float, 8> *.

  1. load() loads a pack of 8 single precision floating-point scalar values into the calling instance from the aligned address mem_addr.

  2. loadu() does not require mem_addr to be aligned.

  3. loadm() uses mask (elements are zeroed out when the highest bit of the corresponding element is not set).

  4. load1() loads a single scalar value and repeats it eight times to make a vector.

  5. bcast(const scal_t *) is the same as load1().

  6. bcast(const vec_hc_t *) loads four scalar values and repeats them twice to make a vector.

  7. gather() loads 8 scalar values from addresses starting at base_addr, each offset by the corresponding 32-bit element in vindex (in bytes, and scaled by scale; scale can be 1, 2, 4, or 8).

  8. gatherm() is the same as gather() but uses a mask (elements are copied from src when the highest bit is not set in the corresponding mask element).

const Vec &store(addr_t mem_addr) const noexcept
const Vec &storeu(addr_t mem_addr) const noexcept
const Vec &storem(addr_t mem_addr, ivec_t mask) const noexcept
const Vec &stream(addr_t mem_addr) const noexcept
const Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) const noexcept
const Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) const noexcept
Vec &store(addr_t mem_addr) noexcept
Vec &storeu(addr_t mem_addr) noexcept
Vec &storem(addr_t mem_addr, ivec_t mask) noexcept
Vec &stream(addr_t mem_addr) noexcept
Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) noexcept
Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) noexcept

Store operations: store elements from the current instance to a memory location. The address type addr_t can be either float *, vec_t * or Vec<float, 8> *.

Each store operation has a non-const version used for a non-constant instance.

All the store operations return the reference to the instance itself.

  1. store() stores 8 single precision floating-point scalar values into the aligned address mem_addr.

  2. storeu() does not need the address to be aligned.

  3. storem() uses the mask (elements are not stored when the highest bit is not set in the corresponding element).

  4. stream() uses a non-temporal memory hint. mem_addr must be aligned.

  5. scatter() stores elements into addresses starting at base_addr, each offset by the corresponding 32-bit element in vindex (in bytes, and scaled by scale; scale can be 1, 2, 4, or 8).

  6. scatterm() is the same as scatter() but uses a mask (elements are not stored when the corresponding mask bit is not set).

scal_t to_scal() const noexcept
int movemask() const noexcept
Vec movehdup() const noexcept
Vec moveldup() const noexcept

to_scal() returns the lowest single-precision floating-point scalar value. movemask() sets each bit of the returned value from the most significant bit of the corresponding single-precision floating-point scalar value. movehdup() duplicates the odd-indexed scalar values. moveldup() duplicates the even-indexed scalar values.
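The two duplication operations differ only in which element of each adjacent pair is kept. A scalar sketch, with hypothetical names movehdup_model, moveldup_model and check_dup_models that are not part of HIPP:

```cpp
#include <array>
#include <cassert>

using F8 = std::array<float, 8>;

// Hypothetical model of movehdup(): each odd-indexed element fills its pair.
inline F8 movehdup_model(const F8 &v) {
    return {v[1], v[1], v[3], v[3], v[5], v[5], v[7], v[7]};
}

// Hypothetical model of moveldup(): each even-indexed element fills its pair.
inline F8 moveldup_model(const F8 &v) {
    return {v[0], v[0], v[2], v[2], v[4], v[4], v[6], v[6]};
}

inline void check_dup_models() {
    const F8 v{0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f};
    const F8 h = movehdup_model(v);
    const F8 l = moveldup_model(v);
    assert(h[0] == 1.f && h[1] == 1.f && h[6] == 7.f && h[7] == 7.f);
    assert(l[0] == 0.f && l[1] == 0.f && l[6] == 6.f && l[7] == 6.f);
}
```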

Vec &set(scal_t e7, scal_t e6, scal_t e5, scal_t e4, scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept
Vec &set1(scal_t a) noexcept
Vec &set1(vec_hc_t a) noexcept
Vec &set() noexcept
Vec &setzero() noexcept
Vec &undefined() noexcept

Set the scalar values of the calling instance.

  1. set(e7,e6,...,e0) sets each element, from e7 at the highest address down to e0 at the lowest address.

  2. set1(scal_t a) repeats a scalar value 8 times.

  3. set1(vec_hc_t a) repeats the lower scalar value of a 8 times.

  4. set() is the same as setzero().

  5. setzero() sets all bits to zero.

  6. undefined() sets the scalars to undefined values.

Vec operator+(const Vec &a) const noexcept
Vec operator-(const Vec &a) const noexcept
Vec operator*(const Vec &a) const noexcept
Vec operator/(const Vec &a) const noexcept
Vec operator++(int) noexcept
Vec &operator++() noexcept
Vec operator--(int) noexcept
Vec &operator--() noexcept
Vec &operator+=(const Vec &a) noexcept
Vec &operator-=(const Vec &a) noexcept
Vec &operator*=(const Vec &a) noexcept
Vec &operator/=(const Vec &a) noexcept
Vec hadd(const Vec &a) const noexcept
Vec hsub(const Vec &a) const noexcept

Arithmetic operations. All of the above operations are element-wise.

hadd() performs horizontal addition, i.e., the result of a.hadd(b) is { a[0]+a[1], a[2]+a[3], b[0]+b[1], b[2]+b[3], …, b[4]+b[5], b[6]+b[7] }. hsub() performs horizontal subtraction, i.e., the result of a.hsub(b) is { a[0]-a[1], a[2]-a[3], b[0]-b[1], b[2]-b[3], …, b[4]-b[5], b[6]-b[7] }.

Vec operator&(const Vec &a) const noexcept
Vec andnot(const Vec &a) const noexcept
Vec operator|(const Vec &a) const noexcept
Vec operator~() const noexcept
Vec operator^(const Vec &a) const noexcept
Vec &operator&=(const Vec &a) noexcept
Vec &operator|=(const Vec &a) noexcept
Vec &operator^=(const Vec &a) noexcept

Bitwise Logic operations.

Vec operator==(const Vec &a) const noexcept
Vec operator!=(const Vec &a) const noexcept
Vec operator<(const Vec &a) const noexcept
Vec operator<=(const Vec &a) const noexcept
Vec operator>(const Vec &a) const noexcept
Vec operator>=(const Vec &a) const noexcept

Relation (comparison) operations. The comparison is element-wise for each scalar. If true, all the bits are set in the corresponding result element.

Vec blend(const Vec &a, const int imm8) const noexcept
Vec blend(const Vec &a, const Vec &mask) const noexcept

Blend two vectors using the control mask imm8. For each bit in imm8, if set, take the corresponding result element from b, otherwise from a.

The second version uses a vector mask, i.e., each mask bit is taken from the highest bit of the corresponding 32-bit element.

Vec rcp() const noexcept
Vec sqrt() const noexcept
Vec rsqrt() const noexcept
Vec ceil() const noexcept
Vec floor() const noexcept
Vec round(const int rounding) const noexcept
Vec max(const Vec &a) const noexcept
Vec min(const Vec &a) const noexcept

Elementary math functions.

Vec log2_fast() const noexcept
Vec log_fast() const noexcept
Vec log10_fast() const noexcept
Vec log2_faster() const noexcept
Vec log_faster() const noexcept
Vec log10_faster() const noexcept
Vec pow2_fast() const noexcept
Vec exp_fast() const noexcept
Vec pow10_fast() const noexcept
Vec pow2_faster() const noexcept
Vec exp_faster() const noexcept
Vec pow10_faster() const noexcept

Vectorized math functions. These functions are not supported by hardware. They are implemented by approximation algorithms. xxx_faster() is faster than xxx_fast(), but has lower precision.
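To give a feel for the speed/precision trade-off between the two families, the sketch below shows two generic scalar approximations of log2 for positive, normal floats. They are textbook bit-manipulation techniques chosen for illustration and are an assumption throughout: the algorithms HIPP actually uses are not described here, and the names log2_crude, log2_quadratic and check_log2_models as well as the constant 0.346607 are not taken from the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Cheaper variant: reinterpret the IEEE-754 bits as an integer, so that
// bits / 2^23 - 127 approximates exponent + log2(mantissa). Error below 0.09.
inline float log2_crude(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return static_cast<float>(bits) * (1.0f / 8388608.0f) - 127.0f;
}

// More precise variant: split exponent and mantissa fraction t in [0, 1),
// then correct log2(1 + t) with a quadratic term. Error below 0.01.
inline float log2_quadratic(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const int exponent = static_cast<int>(bits >> 23) - 127;
    const float t = static_cast<float>(bits & 0x007FFFFFu) * (1.0f / 8388608.0f);
    return static_cast<float>(exponent) + t + 0.346607f * t * (1.0f - t);
}

inline void check_log2_models() {
    const float xs[] = {0.1f, 0.75f, 1.0f, 3.0f, 10.0f, 1000.0f};
    for (float x : xs) {
        const float exact = std::log2(x);
        assert(std::fabs(log2_crude(x) - exact) < 0.09f);
        assert(std::fabs(log2_quadratic(x) - exact) < 0.01f);
    }
}
```

Both functions use only integer and basic floating-point operations, which is what makes this style of approximation straightforward to apply to all eight lanes of a vector at once.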