Intel Corp. this week published details regarding new advanced vector extensions 512 (Intel AVX-512) instructions first to be found in the Xeon Phi “Knights Landing” co-processors as well as future microprocessors. The new instructions enables processing of twice the number of data elements that AVX/AVX2 can process and represent a significant leap to 512-bit SIMD support.
512-Bit AVX Instructions Incoming
Intel AVX-512 instructions are important because they offer higher performance for the most demanding computational tasks. Intel AVX-512 instructions offer the highest degree of compiler support by including an unprecedented level of richness in the design of the instructions. Programs can pack eight double precision or sixteen single precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers within the 512-bit vectors. This enables processing of twice the number of data elements that AVX/AVX2 can process with a single instruction and four times that of SSE.
Intel AVX-512 features include 32 vector registers each 512 bits wide, eight dedicated mask registers, 512-bit operations on packed floating point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, new operations, additional gather/scatter support, high speed math instructions, compact representation of large displacement value, and the ability to have optional capabilities beyond the foundational capabilities. It is interesting to note that the 32 ZMM registers represent 2K of register space.
Intel AVX-512 will be first implemented in the future Intel Xeon Phi processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing. Intel AVX-512 brings the capabilities of 512-bit vector operations, first seen in the first Xeon Phi co-processors (previously code named Knights Corner), into the official Intel instruction set in a way that can be utilized in processors as well. Intel AVX-512 offers some improvements and refinement over the 512-bit SIMD found on Knights Corner that I've seen bring smiles to compiler writers and application developers alike. This is done in a way that offers source code compatibility for almost all applications with a simple recompile or relinking to libraries with Knights Landing support.
512-Bit Instructions Detailed
Intel AVX-512 offers a level of compatibility with AVX that is stronger than prior transitions to new widths for SIMD operations. Unlike SSE and AVX that cannot be mixed without performance penalties, the mixing of AVX and Intel AVX-512 instructions is supported without penalty. AVX registers YMM0–YMM15 map into the Intel AVX-512 registers ZMM0–ZMM15, very much like SSE registers map into AVX registers. Therefore, in processors with Intel AVX-512 support, AVX and AVX2 instructions operate on the lower 128 or 256 bits of the first 16 ZMM registers.
Intel AVX instructions use the VEX prefix while Intel AVX-512 instructions use the EVEX prefix which is one byte longer. The EVEX prefix enables the additional functionality of Intel AVX-512. In general, if the extra capabilities of the EVEX prefix are not needed then the AVX2 instructions can be used, coded using the VEX prefix saving a byte in certain cases. Such optimizations can be done in compiler code generators or assemblers automatically
Intel AVX-512 foundation instructions will be included in all implementations of Intel AVX-512. Products may also include capabilities that extend Intel AVX-512 and have distinct CPUID bits for detection. Knights Landing will support three sets of capabilities to augment the foundation instructions. This is documented in the programmer’s guide; they are known as Intel AVX-512 Conflict Detection Instructions (CDI), Intel AVX-512 Exponential and Reciprocal Instructions (ERI) and Intel AVX-512 Prefetch Instructions (PFI). These capabilities provide efficient conflict detection to allow more loops to be vectorized, exponential and reciprocal operations and new prefetch capabilities, respectively.
The evolution to Intel AVX-512 contributes to Intel’s goal to grow peak FLOP/sec by 8x over 4 generations: 2x with AVX1.0 with the Sandy Bridge architecture over the prior SSE4.2, extended by Ivy Bridge architecture with 16-bit float and random number support, 2x with AVX2.0 and its fused multiply-add (FMA) in the Haswell architecture and then 2x more with Intel AVX-512.