AVX (256-bit) Vectors
The following classes are all defined within namespace HIPP::SIMD.
Vec<double, 4>
-
template<>
class Vec<double, 4>
A vector of four double-precision values (256 bits in total).
Vec<double, 4> can be copied, copy-constructed, moved, and move-constructed. The copy and move operations and the destructor are all noexcept. Vec<double, 4> is binary-compatible with the intrinsic type __m256d, i.e., they have the same length and alignment.
-
typedef double scal_t
-
typedef float scal_hp_t
-
typedef __m256d vec_t
-
typedef __m128d vec_hc_t
-
typedef __m128 vec_hp_t
-
typedef int64_t iscal_t
-
typedef int32_t iscal_hp_t
-
typedef __m256i ivec_t
-
typedef __m128i ivec_hp_t
-
typedef __mmask8 mask8_t
Type aliases.
scal_t is the scalar type (i.e., element type) of the SIMD vector. scal_hp_t is the half-precision scalar type. vec_t is the intrinsic SIMD vector type; vec_hc_t and vec_hp_t represent the half-count and half-precision vector types, respectively. Scalar and vector types for integers are also defined.
-
enum [anonymous] : size_t
-
enumerator NPACK = 4
-
enumerator NBIT = 256
-
enumerator VECSIZE = sizeof(vec_t)
-
enumerator SCALSIZE = sizeof(scal_t)
NPACK is the number of scalars in the vector, and NBIT is the number of bits in the vector register. VECSIZE and SCALSIZE are the sizes in bytes of the vector and the scalar, respectively.
-
Vec() noexcept
-
Vec(scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept
-
explicit Vec(scal_t a) noexcept
-
explicit Vec(caddr_t mem_addr) noexcept
-
Vec(const vec_t &a) noexcept
-
Vec(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept
-
Vec(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept
Initializers.
Vec() is the default initializer and gives an uninitialized vector.
Vec(e3, e2, e1, e0) constructs a vector from four given elements, from the higher-address value e3 down to the lower-address value e0.
Vec(scal_t a) constructs a vector of four copies of the scalar value a.
Vec(mem_addr) loads the four elements from the memory address mem_addr, which must be aligned on a 32-byte boundary.
Vec(const vec_t &a) copies the intrinsic vector a.
Vec(base_addr, vindex, scale) loads using gather().
Vec(src, base_addr, vindex, mask, scale) loads using gatherm().
The address type caddr_t can be either const double *, const vec_t * or const Vec<double, 4> *.
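The argument order of the element-wise initializer is easy to get backwards, so here is a minimal scalar sketch of it. The helper make_vec4 is hypothetical and only models the ordering described above with a std::array; it is not part of HIPP and does not use SIMD registers.

```cpp
#include <array>
#include <cassert>

using Pack4 = std::array<double, 4>;

// Hypothetical scalar model of Vec(e3, e2, e1, e0): the LAST argument (e0)
// lands in element 0 (lowest address), the FIRST (e3) in element 3.
Pack4 make_vec4(double e3, double e2, double e1, double e0) {
    Pack4 out = {{e0, e1, e2, e3}};
    return out;
}
```

Under this model, make_vec4(3.0, 2.0, 1.0, 0.0)[0] is 0.0, so writing the arguments from highest to lowest index gives a vector whose n-th element equals n.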
-
ostream &info(ostream &os = cout, int fmt_cntl = 1) const
-
friend ostream &operator<<(ostream &os, const Vec &v)
info() displays the content of the vector to os.
- Parameters
fmt_cntl – Control the display format: 0 for inline printing and 1 for a verbose, multi-line version.
- Returns
The argument os is returned.
The overloaded << operator is equivalent to info() with the default fmt_cntl.
The returned reference to os allows you to chain outputs, such as vec.info(cout) << " continue printing " << std::endl.
-
const vec_t &val() const noexcept
-
vec_t &val() noexcept
-
const scal_t &operator[](size_t n) const noexcept
-
scal_t &operator[](size_t n) noexcept
val() returns the intrinsic vector value. operator[](n) accesses the n-th scalar element of the vector.
-
Vec &load(caddr_t mem_addr) noexcept
-
Vec &loadu(caddr_t mem_addr) noexcept
-
Vec &loadm(caddr_t mem_addr, ivec_t mask) noexcept
-
Vec &load1(const scal_t *mem_addr) noexcept
-
Vec &bcast(const scal_t *mem_addr) noexcept
-
Vec &bcast(const vec_hc_t *mem_addr) noexcept
-
Vec &gather(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept
-
Vec &gatherm(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept
-
Vec &gather_idxhp(const scal_t *base_addr, ivec_hp_t vindex, const int scale = SCALSIZE) noexcept
-
Vec &gatherm_idxhp(vec_t src, const scal_t *base_addr, ivec_hp_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept
Load operations: load data from memory. The address type caddr_t can be either const double *, const vec_t * or const Vec<double, 4> *.
load() loads a pack of 4 double-precision floating-point scalar values into the calling instance from the aligned address mem_addr.
loadu() allows mem_addr to be unaligned.
loadm() uses mask (an element is zeroed out when the highest bit of the corresponding mask element is not set).
load1() loads a single scalar value and repeats it four times to make a vector.
bcast(const scal_t *) is the same as load1().
bcast(const vec_hc_t *) loads two scalar values and repeats them twice to make a vector.
gather() loads 4 scalar values from addresses starting at base_addr, each offset by the corresponding 64-bit element in vindex (in bytes, after scaling by scale; scale can be 1, 2, 4, or 8).
gatherm() is the same as gather() but uses mask (an element is copied from src when the highest bit of the corresponding mask element is not set).
gather_idxhp() is like gather() but uses 32-bit offsets.
gatherm_idxhp() is like gatherm() but uses 32-bit offsets.
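The addressing rule of gather() and the masking rule of loadm() can be sketched in plain scalar C++. The functions gather_model and loadm_model below are hypothetical illustrations of the semantics described above, not HIPP code; they use std::array in place of the intrinsic vector types.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using Pack4 = std::array<double, 4>;
using Index4 = std::array<std::int64_t, 4>;

// Hypothetical model of gather(): element i is read from the byte address
// base_addr + vindex[i] * scale.
Pack4 gather_model(const double *base_addr, const Index4 &vindex,
                   int scale = sizeof(double)) {
    Pack4 out{};
    const char *bytes = reinterpret_cast<const char *>(base_addr);
    for (int i = 0; i < 4; ++i)
        std::memcpy(&out[i], bytes + vindex[i] * scale, sizeof(double));
    return out;
}

// Hypothetical model of loadm(): element i is loaded only when the highest
// bit of mask[i] is set (mask[i] is negative as a signed integer);
// otherwise the element is zero.
Pack4 loadm_model(const double *mem_addr, const Index4 &mask) {
    Pack4 out{};
    for (int i = 0; i < 4; ++i)
        out[i] = (mask[i] < 0) ? mem_addr[i] : 0.0;
    return out;
}
```

With the default scale equal to sizeof(double), the entries of vindex act as ordinary array indices; a scale of 1 would make them raw byte offsets instead.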
-
const Vec &store(addr_t mem_addr) const noexcept
-
const Vec &storeu(addr_t mem_addr) const noexcept
-
const Vec &storem(addr_t mem_addr, ivec_t mask) const noexcept
-
const Vec &stream(addr_t mem_addr) const noexcept
-
const Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) const noexcept
-
const Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) const noexcept
-
const Vec &scatter_idxhp(void *base_addr, ivec_hp_t vindex, int scale = SCALSIZE) const noexcept
-
const Vec &scatterm_idxhp(void *base_addr, mask8_t k, ivec_hp_t vindex, int scale = SCALSIZE) const noexcept
-
Vec &store(addr_t mem_addr) noexcept
-
Vec &storeu(addr_t mem_addr) noexcept
-
Vec &storem(addr_t mem_addr, ivec_t mask) noexcept
-
Vec &stream(addr_t mem_addr) noexcept
-
Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) noexcept
-
Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) noexcept
-
Vec &scatter_idxhp(void *base_addr, ivec_hp_t vindex, int scale = SCALSIZE) noexcept
-
Vec &scatterm_idxhp(void *base_addr, mask8_t k, ivec_hp_t vindex, int scale = SCALSIZE) noexcept
Store operations: store elements from the current instance to a memory location. The address type addr_t can be either double *, vec_t * or Vec<double, 4> *.
Each store operation has a non-const version used for a non-constant instance. All the store operations return a reference to the instance itself.
store() stores 4 double-precision floating-point scalar values into the aligned address mem_addr.
storeu() does not need the address to be aligned.
storem() uses mask (an element is not stored when the highest bit of the corresponding mask element is not set).
stream() uses a non-temporal memory hint. mem_addr must be aligned.
scatter() stores elements to addresses starting at base_addr, each offset by the corresponding 64-bit element in vindex (in bytes, after scaling by scale; scale can be 1, 2, 4, or 8).
scatterm() is the same as scatter() but uses a mask k (an element is not stored when the corresponding mask bit is not set).
scatter_idxhp() is the same as scatter() but uses 32-bit offsets.
scatterm_idxhp() is the same as scatterm() but uses 32-bit offsets.
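The two masked stores use different kinds of mask: storem() takes a vector mask and tests the highest bit of each element, while scatterm() takes a bit mask with one bit per element. A minimal scalar sketch of both rules follows; storem_model and scatterm_model are hypothetical names introduced here for illustration and are not HIPP functions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using Pack4 = std::array<double, 4>;
using Index4 = std::array<std::int64_t, 4>;

// Hypothetical model of storem(): element i is written only when the
// highest bit of mask[i] is set; other memory locations are left untouched.
void storem_model(const Pack4 &v, double *mem_addr, const Index4 &mask) {
    for (int i = 0; i < 4; ++i)
        if (mask[i] < 0) mem_addr[i] = v[i];
}

// Hypothetical model of scatterm(): bit i of k decides whether element i is
// written to the byte address base_addr + vindex[i] * scale.
void scatterm_model(const Pack4 &v, void *base_addr, std::uint8_t k,
                    const Index4 &vindex, int scale = sizeof(double)) {
    char *bytes = static_cast<char *>(base_addr);
    for (int i = 0; i < 4; ++i)
        if ((k >> i) & 1)
            std::memcpy(bytes + vindex[i] * scale, &v[i], sizeof(double));
}
```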
-
scal_t to_scal() const noexcept
-
int movemask() const noexcept
-
Vec movedup() const noexcept
to_scal() returns the lowest double-precision floating-point scalar value.
movemask() sets each bit of the returned value from the most significant bit of the corresponding double-precision floating-point scalar value.
movedup() duplicates the even-indexed scalar values.
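As a rough scalar illustration of the two less obvious operations here: the most significant bit of a double is its sign bit, so movemask() effectively collects sign bits, and movedup() copies each even-indexed element into the odd slot above it. The helpers movemask_model and movedup_model are hypothetical and only model that behavior.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Pack4 = std::array<double, 4>;

// Hypothetical model of movemask(): bit i of the result is the sign bit
// (most significant bit) of element i.
int movemask_model(const Pack4 &v) {
    int m = 0;
    for (int i = 0; i < 4; ++i)
        if (std::signbit(v[i])) m |= (1 << i);
    return m;
}

// Hypothetical model of movedup(): elements 0 and 2 are duplicated into
// slots 1 and 3.
Pack4 movedup_model(const Pack4 &v) {
    Pack4 out = {{v[0], v[0], v[2], v[2]}};
    return out;
}
```

Note that negative zero also has its sign bit set, so it contributes a 1 bit in this model.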
-
Vec &set(scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept
-
Vec &set1(scal_t a) noexcept
-
Vec &set1(vec_hc_t a) noexcept
-
Vec &set() noexcept
-
Vec &setzero() noexcept
-
Vec &undefined() noexcept
Set the scalar values of the calling instance.
set(e3, e2, e1, e0) sets each element, from the higher-address value e3 down to the lower-address value e0.
set1(scal_t a) repeats a scalar value 4 times.
set1(vec_hc_t a) repeats the lower scalar value of a 4 times.
set() is the same as setzero().
setzero() sets all bits to zero.
undefined() sets the scalars to undefined values.
-
Vec operator+(const Vec &a) const noexcept
-
Vec operator-(const Vec &a) const noexcept
-
Vec operator*(const Vec &a) const noexcept
-
Vec operator/(const Vec &a) const noexcept
-
Vec operator++(int) noexcept
-
Vec &operator++() noexcept
-
Vec operator--(int) noexcept
-
Vec &operator--() noexcept
-
Vec &operator+=(const Vec &a) noexcept
-
Vec &operator-=(const Vec &a) noexcept
-
Vec &operator*=(const Vec &a) noexcept
-
Vec &operator/=(const Vec &a) noexcept
-
Vec hadd(const Vec &a) const noexcept
-
Vec hsub(const Vec &a) const noexcept
Arithmetic operations. Except for hadd() and hsub(), all of the above operations are element-wise.
hadd() performs horizontal addition, i.e., the result of a.hadd(b) is { a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3] }.
hsub() performs horizontal subtraction, i.e., the result of a.hsub(b) is { a[0]-a[1], b[0]-b[1], a[2]-a[3], b[2]-b[3] }.
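The interleaving of the two operands in the horizontal operations is easiest to see with concrete numbers. The sketch below is a plain scalar model of the formulas given above; hadd_model and hsub_model are hypothetical helper names, not HIPP functions.

```cpp
#include <array>
#include <cassert>

using Pack4 = std::array<double, 4>;

// Hypothetical model of a.hadd(b): adjacent pairs are summed, and the
// results from a and b alternate in the output.
Pack4 hadd_model(const Pack4 &a, const Pack4 &b) {
    Pack4 out = {{a[0] + a[1], b[0] + b[1], a[2] + a[3], b[2] + b[3]}};
    return out;
}

// Hypothetical model of a.hsub(b): same layout, with the second element of
// each pair subtracted from the first.
Pack4 hsub_model(const Pack4 &a, const Pack4 &b) {
    Pack4 out = {{a[0] - a[1], b[0] - b[1], a[2] - a[3], b[2] - b[3]}};
    return out;
}
```

For a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the model gives {3, 30, 7, 70} for the horizontal sum and {-1, -10, -1, -10} for the horizontal difference.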
-
Vec operator&(const Vec &a) const noexcept
-
Vec andnot(const Vec &a) const noexcept
-
Vec operator|(const Vec &a) const noexcept
-
Vec operator~() const noexcept
-
Vec operator^(const Vec &a) const noexcept
-
Vec &operator&=(const Vec &a) noexcept
-
Vec &operator|=(const Vec &a) noexcept
-
Vec &operator^=(const Vec &a) noexcept
Bitwise Logic operations.
-
Vec operator==(const Vec &a) const noexcept
-
Vec operator!=(const Vec &a) const noexcept
-
Vec operator<(const Vec &a) const noexcept
-
Vec operator<=(const Vec &a) const noexcept
-
Vec operator>(const Vec &a) const noexcept
-
Vec operator>=(const Vec &a) const noexcept
Relation (comparison) operations. The comparison is element-wise for each scalar. If the comparison is true, all the bits are set in the corresponding result element; otherwise all the bits are cleared.
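Because a comparison yields an all-ones or all-zeros bit pattern per element rather than a bool, its result is normally combined with the bitwise operators to select elements. The following scalar sketch models that idea; less_model and and_model are hypothetical helpers, and the mask is kept as 64-bit integers here for readability, whereas the class returns it as a Vec.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using Pack4 = std::array<double, 4>;
using Bits4 = std::array<std::uint64_t, 4>;

// Hypothetical model of operator<: a true comparison gives an element with
// all 64 bits set, a false one gives all bits cleared.
Bits4 less_model(const Pack4 &a, const Pack4 &b) {
    Bits4 out{};
    for (int i = 0; i < 4; ++i)
        out[i] = (a[i] < b[i]) ? ~std::uint64_t{0} : std::uint64_t{0};
    return out;
}

// Hypothetical model of a bitwise AND between a comparison result and a
// vector: elements under an all-zeros mask become +0.0, the rest pass
// through unchanged.
Pack4 and_model(const Bits4 &mask, const Pack4 &v) {
    Pack4 out{};
    for (int i = 0; i < 4; ++i) {
        std::uint64_t bits;
        std::memcpy(&bits, &v[i], sizeof bits);
        bits &= mask[i];
        std::memcpy(&out[i], &bits, sizeof bits);
    }
    return out;
}
```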
-
Vec blend(const Vec &a, const int imm8) const noexcept
-
Vec blend(const Vec &a, const Vec &mask) const noexcept
Blend two vectors using the control mask imm8. For each bit in imm8, if set, take the corresponding result element from b, otherwise from a.
The second version uses a vector mask, i.e., each mask bit is taken from the highest bit of the corresponding 64-bit element.
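The selection rule of both blend variants can be written as a short scalar model. The sketch keeps the a and b naming used in the description; how those two operands map onto the calling instance and the method argument is not modeled here. Both function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Pack4 = std::array<double, 4>;

// Hypothetical model of the imm8 variant: bit i set selects b[i],
// bit i clear selects a[i].
Pack4 blend_imm_model(const Pack4 &a, const Pack4 &b, int imm8) {
    Pack4 out{};
    for (int i = 0; i < 4; ++i)
        out[i] = ((imm8 >> i) & 1) ? b[i] : a[i];
    return out;
}

// Hypothetical model of the vector-mask variant: the highest bit of a
// double is its sign bit, so a set sign bit in mask[i] selects b[i].
Pack4 blend_vec_model(const Pack4 &a, const Pack4 &b, const Pack4 &mask) {
    Pack4 out{};
    for (int i = 0; i < 4; ++i)
        out[i] = std::signbit(mask[i]) ? b[i] : a[i];
    return out;
}
```

A comparison result works directly as the vector mask in this model, since a true element has every bit set, including the highest one.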
-
Vec sqrt() const noexcept
-
Vec ceil() const noexcept
-
Vec floor() const noexcept
-
Vec round(const int rounding) const noexcept
-
Vec max(const Vec &a) const noexcept
-
Vec min(const Vec &a) const noexcept
-
Vec sin() const noexcept
-
Vec cos() const noexcept
-
Vec log() const noexcept
-
Vec exp() const noexcept
-
Vec pow(const Vec &a) const noexcept
Elementary math functions.
sin(), cos(), log(), exp(), pow() may not be serialized, depending on the compiler.
-
Vec<float, 8>
-
template<>
class Vec<float, 8>
A vector of eight single-precision values (256 bits in total).
Vec<float, 8> can be copied, copy-constructed, moved, and move-constructed. The copy and move operations and the destructor are all noexcept. Vec<float, 8> is binary-compatible with the intrinsic type __m256, i.e., they have the same length and alignment.
-
typedef float scal_t
-
typedef __m256 vec_t
-
typedef __m128 vec_hc_t
-
typedef int32_t iscal_t
-
typedef __m256i ivec_t
-
typedef __mmask8 mask8_t
Type aliases.
scal_t is the scalar type (i.e., element type) of the SIMD vector. vec_t is the intrinsic SIMD vector type, and vec_hc_t represents the half-count vector type.
-
enum [anonymous] : size_t
-
enumerator NPACK = 8
-
enumerator NBIT = 256
-
enumerator VECSIZE = sizeof(vec_t)
-
enumerator SCALSIZE = sizeof(scal_t)
NPACK is the number of scalars in the vector, and NBIT is the number of bits in the vector register. VECSIZE and SCALSIZE are the sizes in bytes of the vector and the scalar, respectively.
-
Vec() noexcept
-
Vec(scal_t e7, scal_t e6, scal_t e5, scal_t e4, scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept
-
explicit Vec(scal_t a) noexcept
-
explicit Vec(caddr_t mem_addr) noexcept
-
Vec(const vec_t &a) noexcept
-
Vec(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept
-
Vec(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale) noexcept
Initializers.
Vec() is the default initializer and gives an uninitialized vector.
Vec(e7, e6, ..., e0) constructs a vector from eight given elements, from the higher-address value e7 down to the lower-address value e0.
Vec(scal_t a) constructs a vector of eight copies of the scalar value a.
Vec(mem_addr) loads the eight elements from the memory address mem_addr, which must be aligned on a 32-byte boundary.
Vec(const vec_t &a) copies the intrinsic vector a.
Vec(base_addr, vindex, scale) loads using gather().
Vec(src, base_addr, vindex, mask, scale) loads using gatherm().
The address type caddr_t can be either const float *, const vec_t * or const Vec<float, 8> *.
-
ostream &info(ostream &os = cout, int fmt_cntl = 1) const
-
friend ostream &operator<<(ostream &os, const Vec &v)
info() displays the content of the vector to os.
- Parameters
fmt_cntl – Control the display format: 0 for inline printing and 1 for a verbose, multi-line version.
- Returns
The argument os is returned.
The overloaded << operator is equivalent to info() with the default fmt_cntl.
The returned reference to os allows you to chain outputs, such as vec.info(cout) << " continue printing " << std::endl.
-
const vec_t &val() const noexcept
-
vec_t &val() noexcept
-
const scal_t &operator[](size_t n) const noexcept
-
scal_t &operator[](size_t n) noexcept
val() returns the intrinsic vector value. operator[](n) accesses the n-th scalar element of the vector.
-
Vec &load(caddr_t mem_addr) noexcept
-
Vec &loadu(caddr_t mem_addr) noexcept
-
Vec &loadm(caddr_t mem_addr, ivec_t mask) noexcept
-
Vec &load1(const scal_t *mem_addr) noexcept
-
Vec &bcast(const scal_t *mem_addr) noexcept
-
Vec &bcast(const vec_hc_t *mem_addr) noexcept
-
Vec &gather(const scal_t *base_addr, ivec_t vindex, const int scale = SCALSIZE) noexcept
-
Vec &gatherm(vec_t src, const scal_t *base_addr, ivec_t vindex, vec_t mask, const int scale = SCALSIZE) noexcept
Load operations: load data from memory. The address type caddr_t can be either const float *, const vec_t * or const Vec<float, 8> *.
load() loads a pack of 8 single-precision floating-point scalar values into the calling instance from the aligned address mem_addr.
loadu() allows mem_addr to be unaligned.
loadm() uses mask (an element is zeroed out when the highest bit of the corresponding mask element is not set).
load1() loads a single scalar value and repeats it eight times to make a vector.
bcast(const scal_t *) is the same as load1().
bcast(const vec_hc_t *) loads four scalar values and repeats them twice to make a vector.
gather() loads 8 scalar values from addresses starting at base_addr, each offset by the corresponding 32-bit element in vindex (in bytes, after scaling by scale; scale can be 1, 2, 4, or 8).
gatherm() is the same as gather() but uses mask (an element is copied from src when the highest bit of the corresponding mask element is not set).
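The two broadcast forms differ in what gets repeated: a single scalar eight times, or a block of four scalars twice. A minimal scalar sketch is given below. The helper names are hypothetical, and the half-count source is passed as a plain const float * for simplicity, whereas the documented overload takes a const vec_hc_t *.

```cpp
#include <array>
#include <cassert>

using Pack8 = std::array<float, 8>;

// Hypothetical model of load1() / bcast(const scal_t *): one scalar is
// repeated in all eight elements.
Pack8 load1_model(const float *mem_addr) {
    Pack8 out{};
    out.fill(*mem_addr);
    return out;
}

// Hypothetical model of bcast(const vec_hc_t *): four consecutive scalars
// are loaded and the block is repeated in the upper half.
Pack8 bcast_half_model(const float *mem_addr) {
    Pack8 out{};
    for (int i = 0; i < 8; ++i)
        out[i] = mem_addr[i % 4];
    return out;
}
```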
-
const Vec &store(addr_t mem_addr) const noexcept
-
const Vec &storeu(addr_t mem_addr) const noexcept
-
const Vec &storem(addr_t mem_addr, ivec_t mask) const noexcept
-
const Vec &stream(addr_t mem_addr) const noexcept
-
const Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) const noexcept
-
const Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) const noexcept
-
Vec &store(addr_t mem_addr) noexcept
-
Vec &storeu(addr_t mem_addr) noexcept
-
Vec &storem(addr_t mem_addr, ivec_t mask) noexcept
-
Vec &stream(addr_t mem_addr) noexcept
-
Vec &scatter(void *base_addr, ivec_t vindex, int scale = SCALSIZE) noexcept
-
Vec &scatterm(void *base_addr, mask8_t k, ivec_t vindex, int scale = SCALSIZE) noexcept
Store operations: store elements from the current instance to a memory location. The address type addr_t can be either float *, vec_t * or Vec<float, 8> *.
Each store operation has a non-const version used for a non-constant instance. All the store operations return a reference to the instance itself.
store() stores 8 single-precision floating-point scalar values into the aligned address mem_addr.
storeu() does not need the address to be aligned.
storem() uses mask (an element is not stored when the highest bit of the corresponding mask element is not set).
stream() uses a non-temporal memory hint. mem_addr must be aligned.
scatter() stores elements to addresses starting at base_addr, each offset by the corresponding 32-bit element in vindex (in bytes, after scaling by scale; scale can be 1, 2, 4, or 8).
scatterm() is the same as scatter() but uses a mask k (an element is not stored when the corresponding mask bit is not set).
-
scal_t to_scal() const noexcept
-
int movemask() const noexcept
-
Vec movehdup() const noexcept
-
Vec moveldup() const noexcept
to_scal() returns the lowest single-precision floating-point scalar value.
movemask() sets each bit of the returned value from the most significant bit of the corresponding single-precision floating-point scalar value.
movehdup() duplicates the odd-indexed scalar values.
moveldup() duplicates the even-indexed scalar values.
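The two duplication operations are mirror images of each other, which a small scalar model makes concrete. The helpers movehdup_model and moveldup_model are hypothetical and only illustrate which source element ends up in each slot.

```cpp
#include <array>
#include <cassert>

using Pack8 = std::array<float, 8>;

// Hypothetical model of movehdup(): each pair (2k, 2k+1) is filled with the
// odd-indexed element 2k+1.
Pack8 movehdup_model(const Pack8 &v) {
    Pack8 out{};
    for (int i = 0; i < 8; ++i)
        out[i] = v[i | 1];
    return out;
}

// Hypothetical model of moveldup(): each pair (2k, 2k+1) is filled with the
// even-indexed element 2k.
Pack8 moveldup_model(const Pack8 &v) {
    Pack8 out{};
    for (int i = 0; i < 8; ++i)
        out[i] = v[i & ~1];
    return out;
}
```

For an input of {0, 1, 2, 3, 4, 5, 6, 7}, the first model yields {1, 1, 3, 3, 5, 5, 7, 7} and the second {0, 0, 2, 2, 4, 4, 6, 6}.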
-
Vec &set(scal_t e7, scal_t e6, scal_t e5, scal_t e4, scal_t e3, scal_t e2, scal_t e1, scal_t e0) noexcept
-
Vec &set1(scal_t a) noexcept
-
Vec &set1(vec_hc_t a) noexcept
-
Vec &set() noexcept
-
Vec &setzero() noexcept
-
Vec &undefined() noexcept
Set the scalar values of the calling instance.
set(e7, e6, ..., e0) sets each element, from the higher-address value e7 down to the lower-address value e0.
set1(scal_t a) repeats a scalar value 8 times.
set1(vec_hc_t a) repeats the lower scalar value of a 8 times.
set() is the same as setzero().
setzero() sets all bits to zero.
undefined() sets the scalars to undefined values.
-
Vec operator+(const Vec &a) const noexcept
-
Vec operator-(const Vec &a) const noexcept
-
Vec operator*(const Vec &a) const noexcept
-
Vec operator/(const Vec &a) const noexcept
-
Vec operator++(int) noexcept
-
Vec &operator++() noexcept
-
Vec operator--(int) noexcept
-
Vec &operator--() noexcept
-
Vec &operator+=(const Vec &a) noexcept
-
Vec &operator-=(const Vec &a) noexcept
-
Vec &operator*=(const Vec &a) noexcept
-
Vec &operator/=(const Vec &a) noexcept
-
Vec hadd(const Vec &a) const noexcept
-
Vec hsub(const Vec &a) const noexcept
Arithmetic operations. Except for hadd() and hsub(), all of the above operations are element-wise.
hadd() performs horizontal addition, i.e., the result of a.hadd(b) is { a[0]+a[1], a[2]+a[3], b[0]+b[1], b[2]+b[3], …, b[4]+b[5], b[6]+b[7] }.
hsub() performs horizontal subtraction, i.e., the result of a.hsub(b) is { a[0]-a[1], a[2]-a[3], b[0]-b[1], b[2]-b[3], …, b[4]-b[5], b[6]-b[7] }.
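For eight floats the horizontal sum works independently on each 128-bit half of the register. The sketch below is a scalar model that assumes the method follows the per-half layout of the AVX intrinsic _mm256_hadd_ps, which matches the first and last terms listed above; hadd8_model is a hypothetical name and the assumption about the elided middle terms is not confirmed by this page.

```cpp
#include <array>
#include <cassert>

using Pack8 = std::array<float, 8>;

// Hypothetical model, assuming the layout of _mm256_hadd_ps: within each
// group of four elements, the two pair sums of a come first, followed by
// the two pair sums of b.
Pack8 hadd8_model(const Pack8 &a, const Pack8 &b) {
    Pack8 out{};
    for (int half = 0; half < 2; ++half) {
        const int o = 4 * half;  // offset of this 128-bit half
        out[o + 0] = a[o + 0] + a[o + 1];
        out[o + 1] = a[o + 2] + a[o + 3];
        out[o + 2] = b[o + 0] + b[o + 1];
        out[o + 3] = b[o + 2] + b[o + 3];
    }
    return out;
}
```

Under that assumption, a = {1, ..., 8} and b = {10, ..., 80} give {3, 7, 30, 70, 11, 15, 110, 150}.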
-
Vec operator&(const Vec &a) const noexcept
-
Vec andnot(const Vec &a) const noexcept
-
Vec operator|(const Vec &a) const noexcept
-
Vec operator~() const noexcept
-
Vec operator^(const Vec &a) const noexcept
-
Vec &operator&=(const Vec &a) noexcept
-
Vec &operator|=(const Vec &a) noexcept
-
Vec &operator^=(const Vec &a) noexcept
Bitwise Logic operations.
-
Vec operator==(const Vec &a) const noexcept
-
Vec operator!=(const Vec &a) const noexcept
-
Vec operator<(const Vec &a) const noexcept
-
Vec operator<=(const Vec &a) const noexcept
-
Vec operator>(const Vec &a) const noexcept
-
Vec operator>=(const Vec &a) const noexcept
Relation (comparison) operations. The comparison is element-wise for each scalar. If the comparison is true, all the bits are set in the corresponding result element; otherwise all the bits are cleared.
-
Vec blend(const Vec &a, const int imm8) const noexcept
-
Vec blend(const Vec &a, const Vec &mask) const noexcept
Blend two vectors using the control mask imm8. For each bit in imm8, if set, take the corresponding result element from b, otherwise from a.
The second version uses a vector mask, i.e., each mask bit is taken from the highest bit of the corresponding 32-bit element.
-
Vec rcp() const noexcept
-
Vec sqrt() const noexcept
-
Vec rsqrt() const noexcept
-
Vec ceil() const noexcept
-
Vec floor() const noexcept
-
Vec round(const int rounding) const noexcept
-
Vec max(const Vec &a) const noexcept
-
Vec min(const Vec &a) const noexcept
Elementary math functions.
-
Vec log2_fast() const noexcept
-
Vec log_fast() const noexcept
-
Vec log10_fast() const noexcept
-
Vec log2_faster() const noexcept
-
Vec log_faster() const noexcept
-
Vec log10_faster() const noexcept
-
Vec pow2_fast() const noexcept
-
Vec exp_fast() const noexcept
-
Vec pow10_fast() const noexcept
-
Vec pow2_faster() const noexcept
-
Vec exp_faster() const noexcept
-
Vec pow10_faster() const noexcept
Vectorized math functions. These functions are not supported directly by hardware; they are implemented by approximation algorithms.
xxx_faster() is faster than xxx_fast(), but has lower precision.
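To give a feel for the speed versus precision trade-off, here is a scalar sketch of one well-known style of logarithm approximation. It is not the algorithm used by the library, whose method is not described here; log2_sketch is a hypothetical function, and it assumes that float is an IEEE-754 binary32 value and that the input is positive and normal.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical sketch (NOT the library's algorithm): split x into
// 2^exponent * (1 + m) using its IEEE-754 bits, then approximate
// log2(1 + m) by m, which is exact at m = 0 and approaches exactness as
// m tends to 1.
float log2_sketch(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const int exponent = static_cast<int>((bits >> 23) & 0xFF) - 127;
    const float m = static_cast<float>(bits & 0x7FFFFF) / 8388608.0f;  // [0, 1)
    return static_cast<float>(exponent) + m;
}
```

The sketch is exact for powers of two and is off by less than 0.09 elsewhere. Replacing the linear term with a higher-order polynomial in m reduces the error at the cost of more arithmetic, which is the kind of trade-off the fast and faster variants expose.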
-