AVX (256-bit) Vectors¶

The following classes are all defined within namespace HIPP::SIMD.

Vec<double, 4>¶

template<> class Vec<double, 4>¶

A vector of four double-precision (256 bits in total) values. Vec<double, 4> can be copied, copy-constructed, moved, and move-constructed. The copy and move operations and destructor are all noexcept.

Vec<double, 4> is binary-compatible with the intrinsic type __m256d, i.e., they have the same length and alignment.

typedef double scal_t¶

typedef float scal_hp_t¶

typedef __m256d vec_t¶

typedef __m128d vec_hc_t¶

typedef __m128 vec_hp_t¶

typedef int64_t iscal_t¶

typedef int32_t iscal_hp_t¶

typedef __m256i ivec_t¶

typedef __m128i ivec_hp_t¶

typedef __mmask8 mask8_t¶

Type aliases. scal_t is the scalar type (i.e., element type) of the SIMD vector. scal_hp_t is the half-precision scalar type. vec_t is the intrinsic SIMD vector type; vec_hc_t and vec_hp_t are the half-count and half-precision vector types, respectively.

Scalar and vector types for integers are also defined.

enum [anonymous] : size_t¶

enumerator NPACK = 4¶
enumerator NBIT = 256¶
enumerator VECSIZE = sizeof(vec_t)¶
enumerator SCALSIZE = sizeof(scal_t)¶

NPACK is the number of scalars in the vector, and NBIT is the number of bits in the vector register. VECSIZE and SCALSIZE are the sizes in bytes of the vector and of the scalar, respectively.
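The relation between these constants can be checked with a small standalone sketch. The struct name `Vec4dConsts` is hypothetical and not part of HIPP, and the sketch assumes a 32-byte `__m256d` and an 8-byte `double`.

```cpp
#include <cstddef>

// Hypothetical stand-in that mirrors the documented constants of
// Vec<double, 4>; it is not the library's class.
struct Vec4dConsts {
    enum : std::size_t {
        NPACK = 4,                 // scalars per vector
        NBIT = 256,                // bits per vector register
        VECSIZE = 32,              // assumed sizeof(__m256d), in bytes
        SCALSIZE = sizeof(double)  // bytes per scalar
    };
};

// The four constants are tied together by simple arithmetic.
static_assert(Vec4dConsts::VECSIZE * 8 == Vec4dConsts::NBIT,
              "vector size in bytes times 8 gives the register width");
static_assert(Vec4dConsts::NPACK * Vec4dConsts::SCALSIZE == Vec4dConsts::VECSIZE,
              "NPACK scalars fill exactly one vector");
```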

Vec() noexcept¶

Vec(scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept¶

explicit Vec(scal_t a) noexcept¶

explicit Vec(caddr_t mem_addr) noexcept¶

Vec(const vec_t &a) noexcept¶

Vec(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept¶

Vec(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept¶

Initializers.

Default initializer: Vec() gives an uninitialized vector.
Vec(e3, e2, e1, e0) constructs a vector from four given elements, ordered from the higher-address value e3 to the lower-address value e0.
Vec(scal_t a) constructs a vector whose four elements all equal the scalar value a.
Vec(mem_addr): the four elements are loaded from the memory address mem_addr (must be aligned at 32-byte boundary).
Vec(const vec_t &a): copy the intrinsic vector a.
Vec(base_addr, vindex, scale): load using gather().
Vec(src, base_addr, vindex, mask, scale): load using gatherm().

The address type caddr_t can be either const double *, const vec_t * or const Vec<double, 4> *.
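The argument order of the element-wise constructor is easy to get backwards, so here is a scalar model of it. This is only a sketch: `make_vec4` is a hypothetical helper, and it assumes that operator[] indexes elements in memory order, so that e0 is element 0.

```cpp
#include <array>

// Scalar model of Vec(e3, e2, e1, e0): the arguments run from the highest
// element down to the lowest, so e0 ends up at index 0.
std::array<double, 4> make_vec4(double e3, double e2, double e1, double e0) {
    return {e0, e1, e2, e3};
}

// Scalar model of Vec(a): the same scalar in every element.
std::array<double, 4> make_vec4(double a) {
    return {a, a, a, a};
}
```

Under that assumption, a vector built as Vec(3.0, 2.0, 1.0, 0.0) would hold 0.0 in element 0 and 3.0 in element 3.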

ostream &info(ostream &os = cout, int fmt_cntl = 1) const¶

friend ostream &operator<<(ostream &os, const Vec &v)¶

info() displays the content of the vector to os.

Parameters: fmt_cntl – Control the display format. 0 for an inline printing and 1 for a verbose, multiple-line version.
Returns: The argument os is returned.

The overloaded << operator is equivalent to info() with default fmt_cntl.

The returned reference to os allows you to chain the outputs, such as vec.info(cout) << " continue printing " << std::endl.

const vec_t &val() const noexcept¶
vec_t &val() noexcept¶
const scal_t &operator[](size_t n) const noexcept¶
scal_t &operator[](size_t n) noexcept¶

val() returns the intrinsic vector value. operator[](n) returns a reference to the n-th scalar element of the vector.

Vec &load(caddr_t mem_addr) noexcept¶

Vec &loadu(caddr_t mem_addr) noexcept¶

Vec &loadm(caddr_t mem_addr, ivec_t mask) noexcept¶

Vec &load1(const scal_t *mem_addr) noexcept¶

Vec &bcast(const scal_t *mem_addr) noexcept¶

Vec &bcast(const vec_hc_t *mem_addr) noexcept¶

Vec &gather(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept¶

Vec &gatherm(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept¶

Vec &gather_idxhp(const scal_t *base_addr, ivec_hp_t vindex, const int scale = SCALSIZE) noexcept¶

Vec &gatherm_idxhp(vec_t src, const scal_t *base_addr, ivec_hp_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept¶

Load operations: load data from memory. The address type caddr_t can be either const double *, const vec_t * or const Vec<double, 4> *.

load() loads a pack of 4 double precision floating-point scalar values into the calling instance from the aligned address mem_addr.
loadu() allows mem_addr to be unaligned.
loadm() uses mask (elements are zeroed out when the highest bit of the corresponding element is not set).
load1() loads a single scalar value and repeats it four times to make a vector.
bcast(const scal_t *) is the same as load1().
bcast(const vec_hc_t *) loads two scalar values and repeats them twice to make a vector.
gather() loads 4 scalar values from addresses starting at base_addr, each offset by the corresponding 64-bit element in vindex (in bytes, and scaled by scale; scale can be 1, 2, 4, or 8).
gatherm() is the same as gather() but uses a mask (elements are copied from src when the highest bit is not set in the corresponding element).
gather_idxhp() is like gather() but uses 32-bit offsets.
gatherm_idxhp() is like gatherm() but uses 32-bit offsets.
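The addressing rule of gather() and gatherm() can be written out as a scalar model. This is a sketch of the semantics described above, not the library's implementation: `gather_model` and `gatherm_model` are hypothetical names, plain arrays stand in for ivec_t and vec_t, and the mask is shown as raw 64-bit patterns.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Scalar model of gather(): element i is read from the byte address
// base_addr + vindex[i] * scale.
std::array<double, 4> gather_model(const double *base_addr,
                                   const std::array<std::int64_t, 4> &vindex,
                                   int scale = sizeof(double)) {
    std::array<double, 4> out{};
    const char *bytes = reinterpret_cast<const char *>(base_addr);
    for (int i = 0; i < 4; ++i)
        std::memcpy(&out[i], bytes + vindex[i] * scale, sizeof(double));
    return out;
}

// Scalar model of gatherm(): element i is loaded only when the highest bit
// of mask[i] is set; otherwise it is copied from src[i].
std::array<double, 4> gatherm_model(const std::array<double, 4> &src,
                                    const double *base_addr,
                                    const std::array<std::int64_t, 4> &vindex,
                                    const std::array<std::uint64_t, 4> &mask,
                                    int scale = sizeof(double)) {
    std::array<double, 4> out = src;
    const char *bytes = reinterpret_cast<const char *>(base_addr);
    for (int i = 0; i < 4; ++i)
        if (mask[i] >> 63)
            std::memcpy(&out[i], bytes + vindex[i] * scale, sizeof(double));
    return out;
}
```

With the default scale of SCALSIZE the entries of vindex act as element indices; with scale = 1 they are raw byte offsets.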

const Vec &store(addr_t mem_addr) const noexcept¶

const Vec &storeu(addr_t mem_addr) const noexcept¶

const Vec &storem(addr_t mem_addr, ivec_t mask) const noexcept¶

const Vec &stream(addr_t mem_addr) const noexcept¶

const Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) const noexcept¶

const Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) const noexcept¶

const Vec &scatter_idxhp(void *base_addr, ivec_hp_t vindex, int scale = SCALSIZE) const noexcept¶

const Vec &scatterm_idxhp(void *base_addr, mask8_t k, ivec_hp_t vindex, int scale = SCALSIZE) const noexcept¶

Vec &store(addr_t mem_addr) noexcept¶

Vec &storeu(addr_t mem_addr) noexcept¶

Vec &storem(addr_t mem_addr, ivec_t mask) noexcept¶

Vec &stream(addr_t mem_addr) noexcept¶

Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) noexcept¶

Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) noexcept¶

Vec &scatter_idxhp(void *base_addr, ivec_hp_t vindex, int scale = SCALSIZE) noexcept¶

Vec &scatterm_idxhp(void *base_addr, mask8_t k, ivec_hp_t vindex, int scale = SCALSIZE) noexcept¶

Store operations: store elements from the current instance to a memory location. The address type addr_t can be either double *, vec_t * or Vec<double, 4> *.

Each store operation has a non-const version used for a non-constant instance.

All the store operations return the reference to the instance itself.

store() stores 4 double precision floating-point scalar values into the aligned address mem_addr.
storeu() does not need the address to be aligned.
storem() uses the mask (elements are not stored when the highest bit is not set in the corresponding element).
stream() uses a non-temporal memory hint. mem_addr must be aligned.
scatter() stores elements to addresses starting at base_addr, each offset by the corresponding 64-bit element in vindex (in bytes, and scaled by scale; scale can be 1, 2, 4, or 8).
scatterm() is the same as scatter() but uses a mask (elements are not stored when the corresponding mask bit is not set).
scatter_idxhp() is the same as scatter() but uses 32-bit offsets.
scatterm_idxhp() is the same as scatterm() but uses 32-bit offsets.
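The masked store and the scatter rule can be modelled in scalar code as follows. This is a sketch of the described semantics only: `storem_model` and `scatter_model` are hypothetical names, and plain arrays stand in for the vector, mask, and index types.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Scalar model of storem(): element i is written only when the highest
// bit of mask[i] is set; other memory locations are left untouched.
void storem_model(double *mem_addr, const std::array<double, 4> &v,
                  const std::array<std::uint64_t, 4> &mask) {
    for (int i = 0; i < 4; ++i)
        if (mask[i] >> 63) mem_addr[i] = v[i];
}

// Scalar model of scatter(): element i is written to the byte address
// base_addr + vindex[i] * scale.
void scatter_model(void *base_addr, const std::array<double, 4> &v,
                   const std::array<std::int64_t, 4> &vindex,
                   int scale = sizeof(double)) {
    char *bytes = static_cast<char *>(base_addr);
    for (int i = 0; i < 4; ++i)
        std::memcpy(bytes + vindex[i] * scale, &v[i], sizeof(double));
}
```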

scal_t to_scal() const noexcept¶
int movemask() const noexcept¶
Vec movedup() const noexcept¶

to_scal() returns the lowest double-precision floating-point scalar value. movemask() sets each bit of the returned value to the most significant bit of the corresponding double-precision floating-point scalar value. movedup() duplicates the even-indexed scalar values.
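A scalar model may make movemask() and movedup() more concrete. The helper names are hypothetical, and the sketch assumes that "duplicates even-indexed scalar values" means each even-indexed element is copied into the odd-indexed slot after it.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Scalar model of movemask(): bit i of the result is the sign bit
// (most significant bit) of element i.
int movemask_model(const std::array<double, 4> &v) {
    int result = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint64_t bits;
        std::memcpy(&bits, &v[i], sizeof bits);
        result |= static_cast<int>(bits >> 63) << i;
    }
    return result;
}

// Scalar model of movedup(): each even-indexed element is copied into
// the odd-indexed slot that follows it.
std::array<double, 4> movedup_model(const std::array<double, 4> &v) {
    return {v[0], v[0], v[2], v[2]};
}
```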

Vec &set(scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept¶

Vec &set1(scal_t a) noexcept¶

Vec &set1(vec_hc_t a) noexcept¶

Vec &set() noexcept¶

Vec &setzero() noexcept¶

Vec &undefined() noexcept¶

Set the scalar values of the calling instance.

set(e3, e2, e1, e0) sets the elements, ordered from the higher-address value e3 to the lower-address value e0.
set1(scal_t a) repeats a scalar value 4 times.
set1(vec_hc_t a) repeats the lowest scalar value of a 4 times.
set() is the same as setzero().
setzero() sets all bits to zero.
undefined() sets the scalars to undefined values.

Vec operator+(const Vec &a) const noexcept¶

Vec operator-(const Vec &a) const noexcept¶

Vec operator*(const Vec &a) const noexcept¶

Vec operator/(const Vec &a) const noexcept¶

Vec operator++(int) noexcept¶

Vec &operator++() noexcept¶

Vec operator--(int) noexcept¶

Vec &operator--() noexcept¶

Vec &operator+=(const Vec &a) noexcept¶

Vec &operator-=(const Vec &a) noexcept¶

Vec &operator*=(const Vec &a) noexcept¶

Vec &operator/=(const Vec &a) noexcept¶

Vec hadd(const Vec &a) const noexcept¶

Vec hsub(const Vec &a) const noexcept¶

Arithmetic operations. All of the above operations are element-wise.

hadd() performs horizontal addition, i.e., the result of a.hadd(b) is { a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3] }. hsub() performs horizontal subtraction, i.e., the result of a.hsub(b) is { a[0]-a[1], b[0]-b[1], a[2]-a[3], b[2]-b[3] }.
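The element layout of the horizontal operations can be transcribed directly into a scalar model. The names `V4`, `hadd_model`, and `hsub_model` are hypothetical; the sketch only restates the formulas given above.

```cpp
#include <array>

using V4 = std::array<double, 4>;

// Scalar model of a.hadd(b): adjacent pairs are summed, with pairs from
// a and b interleaved in the result.
V4 hadd_model(const V4 &a, const V4 &b) {
    return {a[0] + a[1], b[0] + b[1], a[2] + a[3], b[2] + b[3]};
}

// Scalar model of a.hsub(b): same layout, with subtraction within pairs.
V4 hsub_model(const V4 &a, const V4 &b) {
    return {a[0] - a[1], b[0] - b[1], a[2] - a[3], b[2] - b[3]};
}
```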

Vec operator&(const Vec &a) const noexcept¶
Vec andnot(const Vec &a) const noexcept¶
Vec operator|(const Vec &a) const noexcept¶
Vec operator~() const noexcept¶
Vec operator^(const Vec &a) const noexcept¶
Vec &operator&=(const Vec &a) noexcept¶
Vec &operator|=(const Vec &a) noexcept¶
Vec &operator^=(const Vec &a) noexcept¶

Bitwise logic operations.

Vec operator==(const Vec &a) const noexcept¶
Vec operator!=(const Vec &a) const noexcept¶
Vec operator<(const Vec &a) const noexcept¶
Vec operator<=(const Vec &a) const noexcept¶
Vec operator>(const Vec &a) const noexcept¶
Vec operator>=(const Vec &a) const noexcept¶

Relational (comparison) operations. The comparison is element-wise for each scalar. If it is true, all the bits are set in the corresponding result element; otherwise all the bits are cleared.
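Because a comparison returns a bit mask rather than a boolean, it may help to see the result as raw bit patterns. The sketch below models operator< in scalar code. `less_model` is a hypothetical name, and the result is returned as 64-bit integers instead of a Vec so that the patterns can be inspected directly.

```cpp
#include <array>
#include <cstdint>

// Scalar model of operator<: each result element is all ones when the
// comparison holds and all zeros otherwise.
std::array<std::uint64_t, 4> less_model(const std::array<double, 4> &a,
                                        const std::array<double, 4> &b) {
    std::array<std::uint64_t, 4> out{};
    for (int i = 0; i < 4; ++i)
        out[i] = (a[i] < b[i]) ? ~std::uint64_t{0} : std::uint64_t{0};
    return out;
}
```

A mask of this form has its highest bit set exactly in the elements where the comparison is true, which is the bit that movemask() and the vector-mask version of blend() inspect.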

Vec blend(const Vec &a, const int imm8) const noexcept¶

Vec blend(const Vec &a, const Vec &mask) const noexcept¶

Blend two vectors using the control mask imm8. Writing the call as a.blend(b, imm8), each bit of imm8 selects one result element: if the bit is set, the element is taken from b; otherwise it is taken from a.

The second version uses a vector mask, i.e., each mask bit is taken from the highest bit of the corresponding 64-bit elements.
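The selection rule of the immediate-mask version can be sketched in scalar code. `blend_model` is a hypothetical name, and a and b follow the naming used in the description above, with bit i of imm8 controlling element i.

```cpp
#include <array>

// Scalar model of blending a and b under imm8: bit i of imm8 selects
// b[i] when set and a[i] when clear.
std::array<double, 4> blend_model(const std::array<double, 4> &a,
                                  const std::array<double, 4> &b, int imm8) {
    std::array<double, 4> out{};
    for (int i = 0; i < 4; ++i)
        out[i] = ((imm8 >> i) & 1) ? b[i] : a[i];
    return out;
}
```

For example, imm8 = 5 (binary 0101) takes elements 0 and 2 from b and elements 1 and 3 from a.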

Vec sqrt() const noexcept¶
Vec ceil() const noexcept¶
Vec floor() const noexcept¶
Vec round(const int rounding) const noexcept¶
Vec max(const Vec &a) const noexcept¶
Vec min(const Vec &a) const noexcept¶
Vec sin() const noexcept¶
Vec cos() const noexcept¶
Vec log() const noexcept¶
Vec exp() const noexcept¶
Vec pow(const Vec &a) const noexcept¶

Elementary math functions. sin(), cos(), log(), exp(), pow() may not be serialized, depending on the compiler.

Vec<float, 8>¶

template<> class Vec<float, 8>¶

A vector of eight single-precision (256 bits in total) values. Vec<float, 8> can be copied, copy-constructed, moved, and move-constructed. The copy and move operations and destructor are all noexcept.

Vec<float, 8> is binary-compatible with the intrinsic type __m256, i.e., they have the same length and alignment.

typedef float scal_t¶
typedef __m256 vec_t¶
typedef __m128 vec_hc_t¶
typedef int32_t iscal_t¶
typedef __m256i ivec_t¶
typedef __mmask8 mask8_t¶

Type aliases. scal_t is the scalar type (i.e., element type) of the SIMD vector. vec_t is the intrinsic SIMD vector type, and vec_hc_t is the half-count vector type.

enum [anonymous] : size_t¶

enumerator NPACK = 8¶
enumerator NBIT = 256¶
enumerator VECSIZE = sizeof(vec_t)¶
enumerator SCALSIZE = sizeof(scal_t)¶

NPACK is the number of scalars in the vector, and NBIT is the number of bits in the vector register. VECSIZE and SCALSIZE are the sizes in bytes of the vector and of the scalar, respectively.

Vec() noexcept¶

Vec(scal_t e7, scal_t e6, scal_t e5, scal_t e4, scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept¶

explicit Vec(scal_t a) noexcept¶

explicit Vec(caddr_t mem_addr) noexcept¶

Vec(const vec_t &a) noexcept¶

Vec(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept¶

Vec(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale) noexcept¶

Initializers.

Default initializer: Vec() gives an uninitialized vector.
Vec(e7, e6, ..., e0) constructs a vector from eight given elements, ordered from the higher-address value e7 to the lower-address value e0.
Vec(scal_t a) constructs a vector whose eight elements all equal the scalar value a.
Vec(mem_addr): the eight elements are loaded from the memory address mem_addr (must be aligned at 32-byte boundary).
Vec(const vec_t &a): copy the intrinsic vector a.
Vec(base_addr, vindex, scale): load using gather().
Vec(src, base_addr, vindex, mask, scale): load using gatherm().

The address type caddr_t can be either const float *, const vec_t * or const Vec<float, 8> *.

ostream &info(ostream &os = cout, int fmt_cntl = 1) const¶

friend ostream &operator<<(ostream &os, const Vec &v)¶

info() displays the content of the vector to os.

Parameters: fmt_cntl – Control the display format. 0 for an inline printing and 1 for a verbose, multiple-line version.
Returns: The argument os is returned.

The overloaded << operator is equivalent to info() with default fmt_cntl.

The returned reference to os allows you to chain the outputs, such as vec.info(cout) << " continue printing " << std::endl.

const vec_t &val() const noexcept¶
vec_t &val() noexcept¶
const scal_t &operator[](size_t n) const noexcept¶
scal_t &operator[](size_t n) noexcept¶

val() returns the intrinsic vector value. operator[](n) returns a reference to the n-th scalar element of the vector.

Vec &load(caddr_t mem_addr) noexcept¶

Vec &loadu(caddr_t mem_addr) noexcept¶

Vec &loadm(caddr_t mem_addr, ivec_t mask) noexcept¶

Vec &load1(const scal_t *mem_addr) noexcept¶

Vec &bcast(const scal_t *mem_addr) noexcept¶

Vec &bcast(const vec_hc_t *mem_addr) noexcept¶

Vec &gather(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept¶

Vec &gatherm(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept¶

Load operations: load data from memory. The address type caddr_t can be either const float *, const vec_t * or const Vec<float, 8> *.

load() loads a pack of 8 single precision floating-point scalar values into the calling instance from the aligned address mem_addr.
loadu() allows mem_addr to be unaligned.
loadm() uses mask (elements are zeroed out when the highest bit of the corresponding element is not set).
load1() loads a single scalar value and repeats it eight times to make a vector.
bcast(const scal_t *) is the same as load1().
bcast(const vec_hc_t *) loads four scalar values and repeats them twice to make a vector.
gather() loads 8 scalar values from addresses starting at base_addr, each offset by the corresponding 32-bit element in vindex (in bytes, and scaled by scale; scale can be 1, 2, 4, or 8).
gatherm() is the same as gather() but uses a mask (elements are copied from src when the highest bit is not set in the corresponding element).

const Vec &store(addr_t mem_addr) const noexcept¶

const Vec &storeu(addr_t mem_addr) const noexcept¶

const Vec &storem(addr_t mem_addr, ivec_t mask) const noexcept¶

const Vec &stream(addr_t mem_addr) const noexcept¶

const Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) const noexcept¶

const Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) const noexcept¶

Vec &store(addr_t mem_addr) noexcept¶

Vec &storeu(addr_t mem_addr) noexcept¶

Vec &storem(addr_t mem_addr, ivec_t mask) noexcept¶

Vec &stream(addr_t mem_addr) noexcept¶

Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) noexcept¶

Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) noexcept¶

Store operations: store elements from the current instance to a memory location. The address type addr_t can be either float *, vec_t * or Vec<float, 8> *.

Each store operation has a non-const version used for a non-constant instance.

All the store operations return the reference to the instance itself.

store() stores 8 single precision floating-point scalar values into the aligned address mem_addr.
storeu() does not need the address to be aligned.
storem() uses the mask (elements are not stored when the highest bit is not set in the corresponding element).
stream() uses a non-temporal memory hint. mem_addr must be aligned.
scatter() stores elements to addresses starting at base_addr, each offset by the corresponding 32-bit element in vindex (in bytes, and scaled by scale; scale can be 1, 2, 4, or 8).
scatterm() is the same as scatter() but uses a mask (elements are not stored when the corresponding mask bit is not set).

scal_t to_scal() const noexcept¶
int movemask() const noexcept¶
Vec movehdup() const noexcept¶
Vec moveldup() const noexcept¶

to_scal() returns the lowest single-precision floating-point scalar value. movemask() sets each bit of the returned value to the most significant bit of the corresponding single-precision floating-point scalar value. movehdup() duplicates the odd-indexed scalar values. moveldup() duplicates the even-indexed scalar values.
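The two duplication functions differ only in which element of each pair survives. A scalar model is sketched below. The names are hypothetical, and the sketch assumes that each duplicated value fills both slots of its even/odd pair.

```cpp
#include <array>

using V8f = std::array<float, 8>;

// Scalar model of movehdup(): the odd-indexed element of each pair
// fills both slots of the pair.
V8f movehdup_model(const V8f &v) {
    return {v[1], v[1], v[3], v[3], v[5], v[5], v[7], v[7]};
}

// Scalar model of moveldup(): the even-indexed element of each pair
// fills both slots of the pair.
V8f moveldup_model(const V8f &v) {
    return {v[0], v[0], v[2], v[2], v[4], v[4], v[6], v[6]};
}
```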

Vec &set(scal_t e7, scal_t e6, scal_t e5, scal_t e4, scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept¶

Vec &set1(scal_t a) noexcept¶

Vec &set1(vec_hc_t a) noexcept¶

Vec &set() noexcept¶

Vec &setzero() noexcept¶

Vec &undefined() noexcept¶

Set the scalar values of the calling instance.

set(e7, e6, ..., e0) sets the elements, ordered from the higher-address value e7 to the lower-address value e0.
set1(scal_t a) repeats a scalar value 8 times.
set1(vec_hc_t a) repeats the lowest scalar value of a 8 times.
set() is the same as setzero().
setzero() sets all bits to zero.
undefined() sets the scalars to undefined values.

Vec operator+(const Vec &a) const noexcept¶

Vec operator-(const Vec &a) const noexcept¶

Vec operator*(const Vec &a) const noexcept¶

Vec operator/(const Vec &a) const noexcept¶

Vec operator++(int) noexcept¶

Vec &operator++() noexcept¶

Vec operator--(int) noexcept¶

Vec &operator--() noexcept¶

Vec &operator+=(const Vec &a) noexcept¶

Vec &operator-=(const Vec &a) noexcept¶

Vec &operator*=(const Vec &a) noexcept¶

Vec &operator/=(const Vec &a) noexcept¶

Vec hadd(const Vec &a) const noexcept¶

Vec hsub(const Vec &a) const noexcept¶

Arithmetic operations. All of the above operations are element-wise.

hadd() performs horizontal addition, i.e., the result of a.hadd(b) is { a[0]+a[1], a[2]+a[3], b[0]+b[1], b[2]+b[3], …, b[4]+b[5], b[6]+b[7] }. hsub() performs horizontal subtraction, i.e., the result of a.hsub(b) is { a[0]-a[1], a[2]-a[3], b[0]-b[1], b[2]-b[3], …, b[4]-b[5], b[6]-b[7] }.
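For the eight-element case, the pattern repeats once per 128-bit half of the register. The sketch below is a scalar model under the assumption that hadd() follows the per-lane layout of the AVX intrinsic _mm256_hadd_ps; the source does not state this explicitly, and `Vec8Model` and `hadd8_model` are hypothetical names.

```cpp
#include <array>

using Vec8Model = std::array<float, 8>;

// Scalar model of a.hadd(b) for eight floats, assuming the layout of
// _mm256_hadd_ps: each 128-bit lane holds two pair sums from a followed
// by two pair sums from b.
Vec8Model hadd8_model(const Vec8Model &a, const Vec8Model &b) {
    Vec8Model out{};
    for (int lane = 0; lane < 2; ++lane) {
        const int o = 4 * lane;  // index of the first element in this lane
        out[o + 0] = a[o + 0] + a[o + 1];
        out[o + 1] = a[o + 2] + a[o + 3];
        out[o + 2] = b[o + 0] + b[o + 1];
        out[o + 3] = b[o + 2] + b[o + 3];
    }
    return out;
}
```

The hsub() case would be obtained by replacing each addition with a subtraction.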

Vec operator&(const Vec &a) const noexcept¶
Vec andnot(const Vec &a) const noexcept¶
Vec operator|(const Vec &a) const noexcept¶
Vec operator~() const noexcept¶
Vec operator^(const Vec &a) const noexcept¶
Vec &operator&=(const Vec &a) noexcept¶
Vec &operator|=(const Vec &a) noexcept¶
Vec &operator^=(const Vec &a) noexcept¶

Bitwise logic operations.

Vec operator==(const Vec &a) const noexcept¶
Vec operator!=(const Vec &a) const noexcept¶
Vec operator<(const Vec &a) const noexcept¶
Vec operator<=(const Vec &a) const noexcept¶
Vec operator>(const Vec &a) const noexcept¶
Vec operator>=(const Vec &a) const noexcept¶

Relational (comparison) operations. The comparison is element-wise for each scalar. If it is true, all the bits are set in the corresponding result element; otherwise all the bits are cleared.

Vec blend(const Vec &a, const int imm8) const noexcept¶

Vec blend(const Vec &a, const Vec &mask) const noexcept¶

Blend two vectors using the control mask imm8. Writing the call as a.blend(b, imm8), each bit of imm8 selects one result element: if the bit is set, the element is taken from b; otherwise it is taken from a.

The second version uses a vector mask, i.e., each mask bit is taken from the highest bit of the corresponding 32-bit element.
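The vector-mask version can be modelled in scalar code as well. Since each float element is 32 bits wide, this sketch reads the selection bit from bit 31 of each mask element. `blendv8_model` is a hypothetical name, and the mask is shown as raw 32-bit patterns rather than as a Vec.

```cpp
#include <array>
#include <cstdint>

// Scalar model of blending a and b under a vector mask: the highest bit
// of mask[i] selects b[i] when set and a[i] when clear.
std::array<float, 8> blendv8_model(const std::array<float, 8> &a,
                                   const std::array<float, 8> &b,
                                   const std::array<std::uint32_t, 8> &mask) {
    std::array<float, 8> out{};
    for (int i = 0; i < 8; ++i)
        out[i] = (mask[i] >> 31) ? b[i] : a[i];
    return out;
}
```

Only the highest bit matters: a mask element of 0x7FFFFFFF still selects from a.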

Vec rcp() const noexcept¶
Vec sqrt() const noexcept¶
Vec rsqrt() const noexcept¶
Vec ceil() const noexcept¶
Vec floor() const noexcept¶
Vec round(const int rounding) const noexcept¶
Vec max(const Vec &a) const noexcept¶
Vec min(const Vec &a) const noexcept¶

Elementary math functions.

Vec log2_fast() const noexcept¶
Vec log_fast() const noexcept¶
Vec log10_fast() const noexcept¶
Vec log2_faster() const noexcept¶
Vec log_faster() const noexcept¶
Vec log10_faster() const noexcept¶
Vec pow2_fast() const noexcept¶
Vec exp_fast() const noexcept¶
Vec pow10_fast() const noexcept¶
Vec pow2_faster() const noexcept¶
Vec exp_faster() const noexcept¶
Vec pow10_faster() const noexcept¶

Vectorized math functions. These functions are not supported by hardware; they are implemented with approximation algorithms. xxx_faster() is faster than xxx_fast(), but has lower precision.
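To illustrate the speed-for-precision trade-off behind such functions, here is one classic scalar approximation of log2. It is only an illustration of the general idea: the source does not describe the algorithms HIPP uses, so this should not be read as the library's method. `log2_sketch` is a hypothetical name, and the sketch assumes IEEE-754 single precision and a positive, normal input.

```cpp
#include <cstdint>
#include <cstring>

// Crude log2 approximation: read the exponent from the IEEE-754 bit
// pattern and approximate log2 of the mantissa m in [1, 2) by the
// straight line m - 1. The absolute error stays below about 0.09.
float log2_sketch(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const int exponent = static_cast<int>((bits >> 23) & 0xFF) - 127;
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;  // force the exponent to 0
    float mantissa;
    std::memcpy(&mantissa, &bits, sizeof mantissa);
    return static_cast<float>(exponent) + (mantissa - 1.0f);
}
```

The result is exact for powers of two and least accurate between them; a higher-order polynomial for the mantissa term would trade some speed for a smaller error.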