The Intel® Advanced Vector Extensions (Intel® AVX) intrinsics map directly to the Intel® AVX instructions and other enhanced 128-bit single-instruction multiple data (SIMD) instructions. The Intel® AVX instructions are architecturally similar to, and extend, the existing Intel® 64 architecture-based vector streaming SIMD portions of Intel® Streaming SIMD Extensions (Intel® SSE) instructions and the double-precision floating-point portions of Intel® SSE2 instructions. However, Intel® AVX introduces the following architectural enhancements:
Intel® AVX adds 16 registers (YMM0-YMM15), each 256 bits wide, aliased onto the 16 SIMD (XMM0-XMM15) registers. The new Intel® AVX instructions operate on the YMM registers. Intel® AVX also extends certain existing instructions to operate on the YMM registers and defines a new way of encoding up to three sources and one destination in a single instruction.
Because each of these registers can hold more than one data element, the processor can process more than one data element simultaneously. This processing capability is also known as single-instruction multiple data processing (SIMD).
For each computational and data manipulation instruction in the new extension sets, there is a corresponding C intrinsic that implements that instruction directly. This frees you from managing registers and assembly programming. Further, the compiler optimizes the instruction scheduling so that your executable runs faster.
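As a minimal sketch of how one intrinsic stands in for one SIMD instruction, the following C function adds two float arrays eight elements at a time. The function name add_floats_avx and the AVX_TARGET macro are illustrative assumptions, not part of the Intel® AVX intrinsics; the macro only lets GCC-compatible compilers build the function when no AVX command-line option is given.

```c
#include <assert.h>
#include <immintrin.h>  /* Intel AVX intrinsics */
#include <stddef.h>

/* Assumption: on GCC-compatible compilers built without an AVX option,
   enable AVX for this one function; elsewhere the macro expands to nothing. */
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
AVX_TARGET
void add_floats_avx(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));

    /* Each iteration handles eight floats held in one YMM register. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  /* unaligned 256-bit load */
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vs = _mm256_add_ps(va, vb);   /* eight additions at once */
        _mm256_storeu_ps(dst + i, vs);       /* unaligned 256-bit store */
    }

    /* Scalar loop for the remaining 0 to 7 elements. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The unaligned load and store forms are used here so that the sketch works for any array address; the register allocation and instruction scheduling are left to the compiler.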
The Fused Multiply Add (FMA) new instructions comprise 256-bit and 128-bit SIMD instructions operating on YMM states.
The Intel® AVX intrinsic functions use three new C data types as operands, representing the new registers used as operands to the intrinsic functions. These are the __m256, __m256d, and __m256i data types.
The __m256 data type represents the contents of the extended SSE register (the YMM register) used by the Intel® AVX intrinsics. The __m256 data type can hold eight 32-bit floating-point values.
The __m256d data type can hold four 64-bit double-precision floating-point values.
The __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values.
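The element counts above follow from dividing the 256-bit (32-byte) width by the element size. The following sketch derives them with sizeof; the enumerator names and the check_m256_lane_counts function are hypothetical, and the code assumes the usual x86 sizes of 4 bytes for float and 8 bytes for double.

```c
#include <assert.h>
#include <immintrin.h>  /* defines __m256, __m256d, __m256i */
#include <stdint.h>

/* Illustrative constants: number of elements in one 256-bit value. */
enum {
    M256_FLOATS   = sizeof(__m256)  / sizeof(float),    /* 32-bit floats   */
    M256D_DOUBLES = sizeof(__m256d) / sizeof(double),   /* 64-bit doubles  */
    M256I_INT8    = sizeof(__m256i) / sizeof(int8_t),   /* 8-bit integers  */
    M256I_INT16   = sizeof(__m256i) / sizeof(int16_t),  /* 16-bit integers */
    M256I_INT32   = sizeof(__m256i) / sizeof(int32_t),  /* 32-bit integers */
    M256I_INT64   = sizeof(__m256i) / sizeof(int64_t)   /* 64-bit integers */
};

/* Hypothetical self-check that the counts match the descriptions above. */
void check_m256_lane_counts(void)
{
    assert(M256_FLOATS == 8);
    assert(M256D_DOUBLES == 4);
    assert(M256I_INT8 == 32);
    assert(M256I_INT16 == 16);
    assert(M256I_INT32 == 8);
    assert(M256I_INT64 == 4);
}
```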
The compiler aligns local (stack) and global data of the __m256, __m256d, and __m256i types to 32-byte boundaries. To align integer, float, or double arrays, use the __declspec(align(32)) specifier as follows:
typedef struct __declspec(align(32)) { float f[8]; } __m256;
typedef struct __declspec(align(32)) { double d[4]; } __m256d;
typedef struct __declspec(align(32)) { int i[8]; } __m256i;
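The following sketch shows why the 32-byte alignment matters in practice: the aligned load and store intrinsics, _mm256_load_ps and _mm256_store_ps, expect a 32-byte-aligned address. The ALIGN32 and AVX_TARGET macros and the scale8_aligned function are illustrative assumptions; ALIGN32 falls back to the GCC-style aligned attribute where __declspec is not the native spelling.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Assumption: pick the alignment spelling the compiler understands. */
#if defined(_MSC_VER)
#define ALIGN32 __declspec(align(32))
#else
#define ALIGN32 __attribute__((aligned(32)))
#endif

/* Assumption: enable AVX per function when no AVX option was given. */
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: multiplies eight floats in place by k.
   p must point to a 32-byte-aligned array of at least eight floats. */
AVX_TARGET
void scale8_aligned(float *p, float k)
{
    assert(((uintptr_t)p % 32) == 0);         /* aligned forms need this */

    __m256 v = _mm256_load_ps(p);             /* aligned 256-bit load  */
    v = _mm256_mul_ps(v, _mm256_set1_ps(k));  /* k broadcast to all eight */
    _mm256_store_ps(p, v);                    /* aligned 256-bit store */
}
```

A caller would declare the array with the alignment macro, for example ALIGN32 float x[8], before passing it to scale8_aligned. For data whose alignment is not known, the unaligned forms _mm256_loadu_ps and _mm256_storeu_ps can be used instead.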
The Intel® AVX intrinsics also use Intel® SSE and Intel® SSE2 data types such as __m128, __m128d, and __m128i for some operations. See the Details of Intrinsics topic for more information.
Intel® AVX introduces a new prefix, referred to as VEX, in the Intel® 64 and IA-32 instruction encoding format. Instruction encoding using the VEX prefix provides several capabilities:
The VEX prefix encoding applies to SIMD instructions operating on YMM registers, XMM registers, and in some cases with a general-purpose register as one of the operands. The VEX prefix is not supported for instructions operating on MMX™ or x87 registers.
It is recommended to use Intel® AVX and FMA intrinsics with the /QxAVX option (on Windows* operating systems) or the -xAVX option (on Linux* operating systems), because their corresponding instructions are encoded with the VEX prefix. The /QxAVX or -xAVX option forces other packed instructions to be encoded with the VEX prefix too. As a result, there are fewer performance stalls caused by transitions between AVX/FMA code and legacy SSE code.
Most Intel® AVX and FMA intrinsic names use the following notational convention:
_mm256_<intrin_op>_<suffix>(<data type> <parameter1>, <data type> <parameter2>, <data type> <parameter3>)
The following table explains each item in the syntax.
extern __m256d _mm256_add_pd(__m256d m1, __m256d m2);
where,
The packed values are represented in right-to-left order, with the lowest-order element used for scalar operations. Consider the following example operation:
__declspec(align(32)) double a[4] = {1.0, 2.0, 3.0, 4.0};
__m256d t = _mm256_load_pd(a);
The result is the following:
__m256d t = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
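In other words, the YMM register that holds the value t appears as follows, from the highest element (bits 255:192) on the left to the lowest element (bits 63:0) on the right:
4.0 | 3.0 | 2.0 | 1.0
The "scalar" element is 1.0.
Due to the nature of the instruction, some intrinsics require their arguments to be immediates (constant integer literals).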
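The element order and the immediate-argument requirement can be checked with a short sketch. The function name element_order_demo and the AVX_TARGET macro are hypothetical. The sketch uses the unaligned forms _mm256_loadu_pd and _mm256_storeu_pd so that it does not depend on how the local arrays are aligned, and it uses _mm256_extractf128_pd as one example of an intrinsic whose second argument must be a constant integer literal.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: enable AVX per function when no AVX option was given. */
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical demo: returns the lowest ("scalar") element of t. */
AVX_TARGET
double element_order_demo(void)
{
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double from_load[4], from_set[4], upper[2];

    __m256d t = _mm256_loadu_pd(a);                 /* a[0] becomes the lowest element */
    __m256d u = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);  /* arguments run highest to lowest */

    /* Write both registers back to memory and compare element by element. */
    _mm256_storeu_pd(from_load, t);
    _mm256_storeu_pd(from_set, u);
    for (int i = 0; i < 4; ++i)
        assert(from_load[i] == from_set[i]);

    /* The second argument is an immediate: 1 selects the upper 128 bits. */
    __m128d hi = _mm256_extractf128_pd(t, 1);
    _mm_storeu_pd(upper, hi);
    assert(upper[0] == 3.0 && upper[1] == 4.0);

    return from_load[0];
}
```

Passing a run-time variable instead of the literal 1 to _mm256_extractf128_pd is typically rejected at compile time, because the selector is encoded in the instruction itself.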
Copyright © 1996-2010, Intel Corporation. All rights reserved.