AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions (AVX) SIMD instructions for the x86 instruction set architecture (ISA), proposed by Intel in July 2013 and first implemented in the 2016 Intel Xeon Phi x200 (Knights Landing),[1] then later in a number of AMD and other Intel CPUs (see list below). AVX-512 consists of multiple extensions that may be implemented independently.[2] This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension, AVX-512F (AVX-512 Foundation), is required by all AVX-512 implementations.
Besides widening most 256-bit instructions, the extensions introduce various new operations, such as new data conversions, scatter operations, and permutations.[2] The number of AVX registers is increased from 16 to 32, and eight new "mask registers" are added, which allow for variable selection and blending of the results of instructions. In CPUs with the vector length (VL) extension—included in most AVX-512-capable processors (see § CPUs with AVX-512)—these instructions may also be used on the 128-bit and 256-bit vector sizes. AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first generation Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible.[1]
The AVX-512 instruction set consists of several separate sets, each with its own unique CPUID feature bit; however, they are typically grouped by the processor generation that implements them.
The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512. This led Intel to define a new prefix called EVEX.
Compared to VEX, EVEX adds the following benefits:[6]
The extended registers, SIMD width bit, and opmask registers of AVX-512 are mandatory and all require support from the OS.
The AVX-512 instructions are designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty. However, the AVX-512VL extension allows the use of AVX-512 instructions on the 128/256-bit registers XMM/YMM, so most SSE and AVX/AVX2 instructions have new AVX-512 versions encoded with the EVEX prefix which give access to new features such as opmasks and additional registers. Unlike AVX-256, the new instructions do not have new mnemonics but share the namespace with AVX, making the distinction between VEX- and EVEX-encoded versions of an instruction ambiguous in the source code. Since AVX-512F only works on 32- and 64-bit values, SSE and AVX/AVX2 instructions that operate on bytes or words are available only with the AVX-512BW extension (byte & word support).[6]
The width of the SIMD register file is increased from 256 to 512 bits, and the number of registers is expanded from 16 to a total of 32 (ZMM0–ZMM31). These registers can be addressed as 256-bit YMM registers from the AVX extensions and as 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16–XMM31 and YMM16–YMM31 when the EVEX-encoded form is used.
AVX-512 vector instructions may specify an opmask register to control which values are written to the destination. The instruction encoding supports values 0–7 for this field, but only the opmask registers k1–k7 (of k0–k7) can be used as a mask, corresponding to the values 1–7; the value 0 is reserved to indicate that no opmask register is used, i.e. a hardcoded constant (instead of k0) signals an unmasked operation. The special opmask register k0 is still a functioning, valid register: it can be used in opmask-manipulation instructions or as the destination opmask register.[8] A flag controls the opmask behavior, which can be either "zero", which zeroes everything not selected by the mask, or "merge", which leaves everything not selected untouched. The merge behavior is identical to that of the blend instructions.
The opmask registers are normally 16 bits wide, but can be up to 64 bits wide with the AVX-512BW extension.[6] How many of the bits are actually used, though, depends on the element type of the masked instructions. For 32-bit single-precision floats or doublewords, 16 bits are used to mask the 16 elements in a 512-bit register. For 64-bit double-precision floats and quadwords, at most 8 mask bits are used.
The opmask registers are the reason several bitwise instructions, which naturally have no element width, gained element-width variants in AVX-512. For instance, bitwise AND, OR and 128-bit shuffle now exist in both doubleword and quadword variants, with the only difference being in the final masking.
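The merge and zero masking behaviors described above can be modeled in Python. This is an illustrative sketch of the semantics only, not real intrinsics; `masked_add` is a hypothetical helper name.

```python
# Model of AVX-512 opmask semantics on a 4-element vector (illustration only).
def masked_add(dst, a, b, mask, zeroing=False):
    """Element-wise a+b under an opmask: merge keeps dst, zeroing writes 0."""
    out = []
    for i in range(len(a)):
        if (mask >> i) & 1:
            out.append(a[i] + b[i])
        else:
            out.append(0 if zeroing else dst[i])
    return out

dst = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(masked_add(dst, a, b, 0b0101))                 # merge: [11, 9, 33, 9]
print(masked_add(dst, a, b, 0b0101, zeroing=True))   # zero:  [11, 0, 33, 0]
```

The unselected lanes either keep the destination's old contents (merge) or become zero (zeroing), exactly the two behaviors selectable in the EVEX encoding.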
The opmask registers have a new mini-extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX-encoded. The initial opmask instructions are all 16-bit (word) versions. With AVX-512DQ, 8-bit (byte) versions were added to better match the needs of masking 8 64-bit values, and with AVX-512BW, 32-bit (doubleword) and 64-bit (quadword) versions were added so they can mask up to 64 8-bit values. The instructions KORTEST and KTEST can be used to set the x86 flags based on mask registers, so that they can be used together with non-SIMD x86 branch and conditional instructions.
KAND
KANDN
KMOV
KUNPCK
KNOT
KOR
KORTEST
KSHIFTL
KSHIFTR
KXNOR
KXOR
KADD
KTEST
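The flag-setting behavior of KORTEST, which lets branch instructions react to mask contents, can be sketched as follows. This is an illustrative Python model with 16-bit masks as plain integers, not hardware behavior verbatim.

```python
# Model of 16-bit opmask logic and KORTEST flag semantics (illustrative only).
MASK16 = 0xFFFF

def kand(k1, k2):
    """KANDW: bitwise AND of two 16-bit opmasks."""
    return k1 & k2 & MASK16

def kortest(k1, k2):
    """KORTESTW: OR the masks, return (ZF, CF).
    ZF is set when the OR is all zeros, CF when it is all ones."""
    r = (k1 | k2) & MASK16
    return r == 0, r == MASK16

print(kortest(0x0000, 0x0000))  # (True, False): no lanes active, loop can exit
print(kortest(0xFF00, 0x00FF))  # (False, True): every lane active
```

A loop vectorized with masks can therefore test "all lanes done" or "no lanes active" with a single KORTEST followed by a conditional jump.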
Many AVX-512 instructions are simply EVEX versions of old SSE or AVX instructions. There are, however, several new instructions, and old instructions that have been replaced with new AVX-512 versions. The new or heavily reworked instructions are listed below. These foundation instructions also include the extensions from AVX-512VL and AVX-512BW since those extensions merely add new versions of these instructions instead of new instructions.
There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of blending instructions using mask registers as selectors. Together with the general compare into mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.
Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.
VBLENDMPD
VBLENDMPS
VPBLENDMD
VPBLENDMQ
VPBLENDMB
VPBLENDMW
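The mask-driven selection performed by these blend instructions can be sketched in Python. This is a simplified model of the per-element choice (the hypothetical helper `blendm` ignores the additional merge/zero masking layer).

```python
# Sketch of VPBLENDMD-style blending: the opmask selects per element.
def blendm(a, b, mask):
    """result[i] = b[i] where mask bit i is set, else a[i]."""
    return [b[i] if (mask >> i) & 1 else a[i] for i in range(len(a))]

print(blendm([0, 0, 0, 0], [5, 6, 7, 8], 0b1010))  # [0, 6, 0, 8]
```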
AVX-512F has four new compare instructions. Like their XOP counterparts, they use the immediate field to select between eight different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register, and initially they only support doubleword and quadword comparisons; the AVX-512BW extension provides the byte and word versions. Note that two mask registers may be specified for these instructions: one to write to and one to supply the regular mask.[6]
VPCMPD
VPCMPUD
VPCMPQ
VPCMPUQ
VPCMPB
VPCMPUB
VPCMPW
VPCMPUW
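The compare-into-mask operation can be modeled in Python. The immediate encodings below (0=EQ, 1=LT, 2=LE, 4=NE, 5=NLT, 6=NLE) follow the published predicate table; the helper name `vpcmp` and the list-based lanes are illustrative only.

```python
# Model of VPCMPD: compare packed elements, write one bit per lane into a mask.
import operator

CMP_OPS = {0: operator.eq, 1: operator.lt, 2: operator.le,
           4: operator.ne, 5: operator.ge, 6: operator.gt}

def vpcmp(a, b, imm):
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if CMP_OPS[imm](x, y):
            mask |= 1 << i
    return mask

print(bin(vpcmp([1, 5, 3, 7], [4, 4, 4, 4], 6)))  # 0b1010: lanes where a > b
```

The resulting mask can then drive any subsequent masked instruction, which is how these compares combine with blends to build cmov-like ternary operations.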
The final way to set masks is using Logical Set Mask. These instructions perform either an AND or a NAND, and then set the destination opmask based on whether the result values are zero or non-zero. Note that, like the comparison instructions, they take two opmask registers: one as the destination and one as a regular opmask.
VPTESTMD
VPTESTMQ
VPTESTNMD
VPTESTNMQ
VPTESTMB
VPTESTMW
VPTESTNMB
VPTESTNMW
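The AND/NAND test-into-mask behavior can be sketched as follows (an illustrative Python model; the regular-opmask input is omitted for brevity).

```python
# Model of VPTESTMD / VPTESTNMD: AND the elements pairwise, set the mask bit
# when the result is non-zero (VPTESTM) or zero (VPTESTNM).
def vptestm(a, b, negate=False):
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        nonzero = (x & y) != 0
        if nonzero != negate:
            mask |= 1 << i
    return mask

print(bin(vptestm([0b01, 0b10, 0b11], [0b10, 0b10, 0b10])))  # 0b110
```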
The compress and expand instructions match the APL operations of the same name. They use the opmask in a slightly different way from other AVX-512 instructions. Compress saves only the values marked in the mask, compacting them by skipping, and not reserving space for, unmarked values. Expand operates in the opposite way: it loads as many values as are indicated in the mask and then spreads them to the selected positions.
VCOMPRESSPD
VCOMPRESSPS
VPCOMPRESSD
VPCOMPRESSQ
VEXPANDPD
VEXPANDPS
VPEXPANDD
VPEXPANDQ
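The compress/expand semantics can be sketched in Python. This is a simplified model: the real expand instruction merges or zeroes unselected destination lanes depending on the masking mode; here they are simply zeroed.

```python
# Sketch of VPCOMPRESSD / VPEXPANDD semantics (illustrative model).
def compress(src, mask):
    """Pack the elements whose mask bit is set into the low positions."""
    return [src[i] for i in range(len(src)) if (mask >> i) & 1]

def expand(src, mask, length):
    """Spread consecutive source elements into the positions selected by the mask."""
    out, it = [], iter(src)
    for i in range(length):
        out.append(next(it) if (mask >> i) & 1 else 0)
    return out

print(compress([10, 20, 30, 40], 0b1010))  # [20, 40]
print(expand([20, 40], 0b1010, 4))         # [0, 20, 0, 40]
```

Expanding the compressed result with the same mask recovers the marked elements in their original positions, which is what makes the pair useful for packing sparse data.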
A new set of permute instructions has been added for full two-input permutations. They all take three arguments: two source registers and one index register; the result is output by overwriting either the first source register or the index register. AVX-512BW extends the instructions to include 16-bit (word) versions, and the AVX-512_VBMI extension defines the byte versions of the instructions.
VPERMB
VPERMW
VPERMT2B
VPERMT2W
VPERMI2PD
VPERMI2PS
VPERMI2D
VPERMI2Q
VPERMI2B
VPERMI2W
VPERMT2PS
VPERMT2PD
VPERMT2D
VPERMT2Q
VSHUFF32X4
VSHUFF64X2
VSHUFI32X4
VSHUFI64X2
VPMULTISHIFTQB
Two newly added instructions can logically implement all possible bitwise operations between three inputs. They take three registers and an 8-bit immediate field as input. Each bit in the output is generated by using the three corresponding input bits as a 3-bit index to select one of the 8 bits of the immediate. Since only 8 combinations are possible with three bits, this allows all possible 3-input bitwise operations to be performed.[6] These are the only bitwise vector instructions in AVX-512F; EVEX versions of the two-source SSE and AVX bitwise vector instructions AND, ANDN, OR and XOR were added in AVX-512DQ.
The difference between the doubleword and quadword versions is only the application of the opmask.
VPTERNLOGD
VPTERNLOGQ
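The immediate-as-truth-table mechanism can be modeled bit by bit in Python (an illustrative sketch; the helper names are hypothetical). For example, the immediate 0x96 is the truth table of three-input XOR.

```python
# Model of VPTERNLOGD: each result bit looks up imm8 indexed by the three input bits.
def ternlog_bit(a, b, c, imm8):
    idx = (a << 2) | (b << 1) | c          # 3-bit index into the truth table
    return (imm8 >> idx) & 1

def vpternlog(a, b, c, imm8, width=32):
    r = 0
    for i in range(width):
        bit = ternlog_bit((a >> i) & 1, (b >> i) & 1, (c >> i) & 1, imm8)
        r |= bit << i
    return r

# imm8 = 0x96 encodes a ^ b ^ c (bit idx of 0x96 is set exactly when idx has odd parity).
print(hex(vpternlog(0x0F0F, 0x00FF, 0x3333, 0x96, 16)))  # 0x3cc3
```

Any other 3-input Boolean function is obtained the same way by writing its 8-row truth table into the immediate.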
A number of conversion or move instructions were added; these complete the set of conversion instructions available from SSE2.
VPMOVQD
VPMOVSQD
VPMOVUSQD
VPMOVQW
VPMOVSQW
VPMOVUSQW
VPMOVQB
VPMOVSQB
VPMOVUSQB
VPMOVDW
VPMOVSDW
VPMOVUSDW
VPMOVDB
VPMOVSDB
VPMOVUSDB
VPMOVWB
VPMOVSWB
VPMOVUSWB
VCVTPS2UDQ
VCVTPD2UDQ
VCVTTPS2UDQ
VCVTTPD2UDQ
VCVTSS2USI
VCVTSD2USI
VCVTTSS2USI
VCVTTSD2USI
VCVTPS2QQ
VCVTPD2QQ
VCVTPS2UQQ
VCVTPD2UQQ
VCVTTPS2QQ
VCVTTPD2QQ
VCVTTPS2UQQ
VCVTTPD2UQQ
VCVTUDQ2PS
VCVTUDQ2PD
VCVTUSI2PS
VCVTUSI2PD
VCVTUSI2SD
VCVTUSI2SS
VCVTUQQ2PS
VCVTUQQ2PD
VCVTQQ2PD
VCVTQQ2PS
Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these methods are completely new, they also exist in scalar versions.
VGETEXPPD
VGETEXPPS
VGETEXPSD
VGETEXPSS
VGETMANTPD
VGETMANTPS
VGETMANTSD
VGETMANTSS
VFIXUPIMMPD
VFIXUPIMMPS
VFIXUPIMMSD
VFIXUPIMMSS
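The default decomposition performed by VGETEXP and VGETMANT can be illustrated with Python's `math.frexp`. This is a sketch of the default behavior only; the real instructions also offer sign handling and normalization-interval options via an immediate, and treat special values (zero, NaN, infinity) specially.

```python
# Sketch of VGETEXP / VGETMANT default behavior using math.frexp.
import math

def vgetexp(x):
    """Unbiased exponent of x as a float: floor(log2(|x|)) for normal x."""
    m, e = math.frexp(x)     # x = m * 2**e with 0.5 <= |m| < 1
    return float(e - 1)

def vgetmant(x):
    """Normalized mantissa in [1, 2)."""
    m, e = math.frexp(x)
    return m * 2.0

print(vgetexp(10.0), vgetmant(10.0))  # 3.0 1.25  (10 = 1.25 * 2**3)
```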
This is the second set of new floating-point methods, which includes new scaling and approximate calculation of the reciprocal and of the reciprocal of the square root. The approximate reciprocal instructions are guaranteed to have a relative error of at most 2⁻¹⁴.[6]
VRCP14PD
VRCP14PS
VRCP14SD
VRCP14SS
VRNDSCALEPS
VRNDSCALEPD
VRNDSCALESS
VRNDSCALESD
VRSQRT14PD
VRSQRT14PS
VRSQRT14SD
VRSQRT14SS
VSCALEFPS
VSCALEFPD
VSCALEFSS
VSCALEFSD
VBROADCASTSS
VBROADCASTSD
VPBROADCASTB
VPBROADCASTW
VPBROADCASTD
VPBROADCASTQ
VBROADCASTI32X2
VBROADCASTI64X2
VBROADCASTI32X4
VBROADCASTI32X8
VBROADCASTI64X4
VALIGND
VALIGNQ
VDBPSADBW
VPABSQ
VPMAXSQ
VPMAXUQ
VPMINSQ
VPMINUQ
VPROLD
VPROLVD
VPROLQ
VPROLVQ
VPRORD
VPRORVD
VPRORQ
VPRORVQ
VPSCATTERDD
VPSCATTERDQ
VPSCATTERQD
VPSCATTERQQ
VSCATTERDPS
VSCATTERDPD
VSCATTERQPS
VSCATTERQPD
The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.[9]
VPCONFLICTD
VPCONFLICTQ
VPLZCNTD
VPLZCNTQ
LZCNT
VPBROADCASTMB2Q
VPBROADCASTMW2D
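The conflict-detection idea can be modeled in Python: each lane receives a bitmask flagging the earlier lanes that hold the same value, which is what lets a vectorized scatter/gather loop find the subset of lanes that can safely proceed together. This is an illustrative sketch, not real intrinsics.

```python
# Model of VPCONFLICTD: for each lane, flag earlier lanes holding the same value.
def vpconflict(v):
    out = []
    for i, x in enumerate(v):
        m = 0
        for j in range(i):
            if v[j] == x:
                m |= 1 << j
        out.append(m)
    return out

# Lanes 0 and 2 hold the same index, so lane 2 reports a conflict with lane 0.
print(vpconflict([7, 3, 7, 5]))  # [0, 0, 1, 0]
```

A lane whose conflict mask is zero has no duplicate among earlier lanes, so histogram-style updates for those lanes can be performed in parallel.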
AVX-512 exponential and reciprocal (AVX-512ER) instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation; relative error is at most 2⁻²⁸. They also contain two new exponential functions that have a relative error of at most 2⁻²³.[6]
VEXP2PD
VEXP2PS
VRCP28PD
VRCP28PS
VRCP28SD
VRCP28SS
VRSQRT28PD
VRSQRT28PS
VRSQRT28SD
VRSQRT28SS
AVX-512 prefetch (AVX-512PF) instructions contain new prefetch operations for the new scatter and gather functionality introduced in AVX2 and AVX-512. T0 prefetch means prefetching into level 1 cache and T1 means prefetching into level 2 cache.
T0
T1
VGATHERPF0DPS
VGATHERPF0QPS
VGATHERPF0DPD
VGATHERPF0QPD
VGATHERPF1DPS
VGATHERPF1QPS
VGATHERPF1DPD
VGATHERPF1QPD
VSCATTERPF0DPS
VSCATTERPF0QPS
VSCATTERPF0DPD
VSCATTERPF0QPD
VSCATTERPF1DPS
VSCATTERPF1QPS
VSCATTERPF1DPD
VSCATTERPF1QPD
These two sets of instructions (4FMAPS and 4VNNIW) perform multiple iterations of processing. They are generally found only in Xeon Phi products.
V4FMADDPS
V4FMADDSS
V4FNMADDPS
V4FNMADDSS
VP4DPWSSD
VP4DPWSSDS
AVX-512DQ adds new doubleword and quadword instructions. AVX-512BW adds byte and word versions of the same instructions, as well as byte and word versions of the doubleword/quadword instructions in AVX-512F. A few instructions that get only word forms with AVX-512BW acquire byte forms with the AVX-512_VBMI extension (VPERMB, VPERMI2B, VPERMT2B, VPMULTISHIFTQB).
Two new instructions were added to the mask instruction set: KADD and KTEST (B and W forms with AVX-512DQ, D and Q with AVX-512BW). The rest of the mask instructions, which had only word forms, got byte forms with AVX-512DQ and doubleword/quadword forms with AVX-512BW. KUNPCKBW was extended to KUNPCKWD and KUNPCKDQ by AVX-512BW.
KUNPCKBW
KUNPCKWD
KUNPCKDQ
Among the instructions added by AVX-512DQ are several SSE and AVX instructions that did not get AVX-512 versions with AVX-512F, including all the two-input bitwise instructions and the integer extract/insert instructions.
Instructions that are completely new are covered below.
Three new floating-point operations are introduced. Since they are entirely new rather than extensions of existing instructions, they exist in both packed/SIMD and scalar versions.
The VFPCLASS instructions test whether the floating-point value is one of eight special floating-point values; which of the eight values triggers a bit in the output mask register is controlled by the immediate field. The VRANGE instructions perform minimum or maximum operations depending on the value of the immediate field, which can also control whether the operation is performed on absolute values and, separately, how the sign is handled. The VREDUCE instructions operate on a single source and subtract from it the integer part of the source value plus the number of fraction bits specified in the immediate field.
VFPCLASS
VRANGE
VREDUCE
VFPCLASSPS
VFPCLASSPD
VFPCLASSSS
VFPCLASSSD
VRANGEPS
VRANGEPD
VRANGESS
VRANGESD
VREDUCEPS
VREDUCEPD
VREDUCESS
VREDUCESD
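The VREDUCE operation described above can be sketched in Python. This is a simplified model assuming round-toward-negative-infinity; the real instruction's immediate also selects the rounding mode and other options.

```python
# Sketch of VREDUCE: subtract the part of the value that is representable
# with `fraction_bits` fraction bits (floor rounding assumed here).
import math

def vreduce(x, fraction_bits):
    scale = 2.0 ** fraction_bits
    return x - math.floor(x * scale) / scale

# With 2 fraction bits, 3.71875 reduces to 3.71875 - 3.5 = 0.21875.
print(vreduce(3.71875, 2))  # 0.21875
```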
VPMOVM2D
VPMOVM2Q
VPMOVM2B
VPMOVM2W
VPMOVD2M
VPMOVQ2M
VPMOVB2M
VPMOVW2M
VPMULLQ
These extend VPCOMPRESS and VPEXPAND with byte and word variants. The shift instructions are new.
VPCOMPRESSB
VPCOMPRESSW
VPEXPANDB
VPEXPANDW
VPSHLD
VPSHLDV
VPSHRD
VPSHRDV
Vector Neural Network Instructions:[10] AVX512-VNNI adds EVEX-coded instructions described below. With AVX-512F, these instructions can operate on 512-bit vectors, and AVX-512VL further adds support for 128- and 256-bit vectors.
A later AVX-VNNI extension adds VEX encodings of these instructions, which can operate only on 128- or 256-bit vectors. AVX-VNNI is not part of the AVX-512 suite; it does not require AVX-512F and can be implemented independently.
VPDPBUSD
VPDPBUSDS
VPDPWSSD
VPDPWSSDS
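The core VNNI operation, VPDPBUSD, multiplies unsigned bytes from one source with signed bytes from the other and accumulates each group of four products into a 32-bit lane. One lane of it can be modeled in Python (an illustrative sketch; the saturating variant VPDPBUSDS is not modeled).

```python
# Model of VPDPBUSD for one 32-bit lane: four unsigned x signed byte products
# summed into the accumulator.
def vpdpbusd_lane(acc, a_bytes, b_bytes):
    """a_bytes: four unsigned bytes (0..255); b_bytes: four signed bytes (-128..127)."""
    return acc + sum(u * s for u, s in zip(a_bytes, b_bytes))

print(vpdpbusd_lane(10, [1, 2, 3, 4], [5, -6, 7, -8]))  # 10 + (5 - 12 + 21 - 32) = -8
```

This fuses what previously took a multiply, a pairwise add and an accumulate into a single instruction, which is why it speeds up 8-bit neural-network inner loops.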
Integer fused multiply-add instructions. AVX512-IFMA adds EVEX-coded instructions described below.
A separate AVX-IFMA instruction set extension defines VEX encodings of these instructions. This extension is not part of the AVX-512 suite and can be implemented independently.
VPMADD52LUQ
VPMADD52HUQ
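Per 64-bit lane, these instructions multiply the low 52 bits of each source, then add either the low (LUQ) or high (HUQ) 52 bits of the 104-bit product to the accumulator. The Python model below is an illustrative sketch of that per-lane arithmetic.

```python
# Model of VPMADD52LUQ / VPMADD52HUQ per 64-bit lane (illustrative).
M52 = (1 << 52) - 1
M64 = (1 << 64) - 1

def vpmadd52luq(acc, a, b):
    """Add the low 52 bits of the 52x52-bit product to the accumulator."""
    return (acc + ((a & M52) * (b & M52) & M52)) & M64

def vpmadd52huq(acc, a, b):
    """Add the high 52 bits of the 52x52-bit product to the accumulator."""
    return (acc + ((a & M52) * (b & M52) >> 52)) & M64

# Together the two halves reconstruct the full product:
a, b = 0x3FF_FFFF_FFFF, 0x1234_5678
lo = vpmadd52luq(0, a, b)
hi = vpmadd52huq(0, a, b)
print((hi << 52) + lo == a * b)  # True
```

The 52-bit width matches the double-precision mantissa datapath, which is what makes these instructions useful for big-integer arithmetic such as RSA.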
VPOPCNTD
VPOPCNTQ
VPOPCNTB
VPOPCNTW
VPSHUFBITQMB
VP2INTERSECTD
VP2INTERSECTQ
Galois field new instructions are useful for cryptography,[11] as they can be used to implement Rijndael-style S-boxes such as those used in AES, Camellia, and SM4.[12] These instructions may also be used for bit manipulation in networking and signal processing.[11]
GFNI is a standalone instruction set extension and can be enabled separately from AVX or AVX-512. Depending on whether AVX and AVX-512F support is indicated by the CPU, GFNI support enables legacy (SSE), VEX-, or EVEX-coded instructions operating on 128-, 256- or 512-bit vectors.
VGF2P8AFFINEINVQB
VGF2P8AFFINEQB
VGF2P8MULB
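The byte-wise Galois-field multiply, GF2P8MULB, operates in GF(2⁸) reduced by the AES polynomial x⁸ + x⁴ + x³ + x + 1 (0x11B). One byte of it can be modeled in Python; the sketch below is illustrative, not the hardware definition.

```python
# Model of GF2P8MULB on one byte: carry-less multiply, then reduce by the
# AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B).
def gf2p8mul(a, b):
    r = 0
    for i in range(8):           # carry-less (XOR) multiplication
        if (b >> i) & 1:
            r ^= a << i
    for i in range(14, 7, -1):   # reduce the up-to-15-bit product
        if (r >> i) & 1:
            r ^= 0x11B << (i - 8)
    return r & 0xFF

print(hex(gf2p8mul(0x53, 0xCA)))  # 0x1 - these bytes are inverses in AES's field
```

Because this is the same field arithmetic AES uses, the instruction (together with the affine variants) can evaluate Rijndael-style S-boxes directly.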
VPCLMULQDQ with AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction. With AVX-512VL, it adds EVEX-encoded 256- and 128-bit versions. VPCLMULQDQ alone (that is, on non-AVX-512 CPUs) adds only the VEX-encoded 256-bit version. (Availability of the VEX-encoded 128-bit version is indicated by different CPUID bits: PCLMULQDQ and AVX.) The wider-than-128-bit variants of the instruction perform the same operation on each 128-bit portion of the input registers, but do not extend it to select quadwords from different 128-bit fields (the meaning of the imm8 operand is the same: either the low or the high quadword of the 128-bit field is selected).
VPCLMULQDQ
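The carry-less multiplication performed on each quadword pair can be modeled in Python: it is ordinary long multiplication with XOR in place of addition, i.e. polynomial multiplication over GF(2). The sketch below is illustrative only.

```python
# Model of the carry-less multiply at the heart of (V)PCLMULQDQ.
def clmul(a, b):
    """Multiply a and b as polynomials over GF(2): adds become XORs."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

# (x^2 + 1)(x + 1) = x^3 + x^2 + x + 1
print(bin(clmul(0b101, 0b11)))  # 0b1111
```

Carry-less multiplication is the building block for CRC computation and for GHASH in AES-GCM.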
VEX- and EVEX-encoded AES instructions. The wider-than-128-bit variants of the instructions perform the same operation on each 128-bit portion of the input registers. The VEX versions can be used without AVX-512 support.
VAESDEC
VAESDECLAST
VAESENC
VAESENCLAST
AI acceleration instructions operating on bfloat16 numbers.
VCVTNE2PS2BF16
VCVTNEPS2BF16
VDPBF16PS
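The FP32-to-bfloat16 conversion performed by VCVTNEPS2BF16 amounts to keeping the top 16 bits of the FP32 encoding with round-to-nearest-even. The Python sketch below models one value and omits the instruction's special handling of NaN and denormal inputs.

```python
# Sketch of VCVTNEPS2BF16 on one value: round FP32 to bfloat16 (nearest-even).
import struct

def fp32_to_bf16(x):
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    round_bit = (bits >> 16) & 1          # ties round toward the even result
    return ((bits + 0x7FFF + round_bit) >> 16) & 0xFFFF

print(hex(fp32_to_bf16(1.0)))                 # 0x3f80
print(hex(fp32_to_bf16(3.141592653589793)))   # 0x4049 (pi rounds to 3.140625)
```

Because bfloat16 keeps the full 8-bit FP32 exponent and only truncates the mantissa, this conversion preserves dynamic range at the cost of precision, which suits neural-network training.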
An extension of the earlier F16C instruction set, adding comprehensive support for binary16 floating-point numbers (also known as FP16, float16 or half-precision floating-point numbers). The new instructions implement most operations that were previously available for single- and double-precision floating-point numbers, and also introduce new complex-number and conversion instructions. Scalar and packed operations are supported.
Unlike the single and double-precision format instructions, the half-precision operands are neither conditionally flushed to zero (FTZ) nor conditionally treated as zero (DAZ) based on MXCSR settings. Subnormal values are processed at full speed by hardware to facilitate using the full dynamic range of the FP16 numbers. Instructions that create FP32 and FP64 numbers still respect the MXCSR.FTZ bit.[13]
MXCSR
MXCSR.FTZ
VADDPH
VADDSH
VSUBPH
VSUBSH
VMULPH
VMULSH
VDIVPH
VDIVSH
VSQRTPH
VSQRTSH
VFMADD{132, 213, 231}PH
VFMADD{132, 213, 231}SH
VFNMADD{132, 213, 231}PH
VFNMADD{132, 213, 231}SH
VFMSUB{132, 213, 231}PH
VFMSUB{132, 213, 231}SH
VFNMSUB{132, 213, 231}PH
VFNMSUB{132, 213, 231}SH
VFMADDSUB{132, 213, 231}PH
VFMSUBADD{132, 213, 231}PH
VREDUCEPH
VREDUCESH
VRNDSCALEPH
VRNDSCALESH
VSCALEFPH
VSCALEFSH
VFMULCPH
VFMULCSH
VFCMULCPH
VFCMULCSH
VFMADDCPH
VFMADDCSH
VFCMADDCPH
VFCMADDCSH
VRCPPH
VRCPSH
VRSQRTPH
VRSQRTSH
VCMPPH
VCMPSH
VCOMISH
VUCOMISH
VMAXPH
VMAXSH
VMINPH
VMINSH
VFPCLASSPH
VFPCLASSSH
VCVTW2PH
VCVTUW2PH
VCVTDQ2PH
VCVTUDQ2PH
VCVTQQ2PH
VCVTUQQ2PH
VCVTPS2PHX
VCVTPS2PH
VCVTPD2PH
VCVTSI2SH
VCVTUSI2SH
VCVTSS2SH
VCVTSD2SH
VCVTPH2W
VCVTTPH2W
VCVTPH2UW
VCVTTPH2UW
VCVTPH2DQ
VCVTTPH2DQ
VCVTPH2UDQ
VCVTTPH2UDQ
VCVTPH2QQ
VCVTTPH2QQ
VCVTPH2UQQ
VCVTTPH2UQQ
VCVTPH2PSX
VCVTPH2PS
VCVTPH2PD
VCVTSH2SI
VCVTTSH2SI
VCVTSH2USI
VCVTTSH2USI
VCVTSH2SS
VCVTSH2SD
VGETEXPPH
VGETEXPSH
VGETMANTPH
VGETMANTSH
VMOVSH
VMOVW
VADDPD
VADDPS
VADDSD
VADDSS
VANDPD
VANDPS
VANDNPD
VANDNPS
VCMPPD
VCMPPS
VCMPSD
VCMPSS
VCOMISD
VCOMISS
VDIVPD
VDIVPS
VDIVSD
VDIVSS
VCVTDQ2PD
VCVTDQ2PS
VCVTPD2DQ
VCVTPD2PS
VCVTPS2DQ
VCVTPS2PD
VCVTSD2SI
VCVTSD2SS
VCVTSI2SD
VCVTSI2SS
VCVTSS2SD
VCVTSS2SI
VCVTTPD2DQ
VCVTTPS2DQ
VCVTTSD2SI
VCVTTSS2SI
VMAXPD
VMAXPS
VMAXSD
VMAXSS
VMINPD
VMINPS
VMINSD
VMINSS
VMOVAPD
VMOVAPS
VMOVD
VMOVQ
VMOVDDUP
VMOVHLPS
VMOVHPD
VMOVHPS
VMOVLHPS
VMOVLPD
VMOVLPS
VMOVNTDQA
VMOVNTDQ
VMOVNTPD
VMOVNTPS
VMOVSD
VMOVSHDUP
VMOVSLDUP
VMOVSS
VMOVUPD
VMOVUPS
VMOVDQA32
VMOVDQA64
VMOVDQU8
VMOVDQU16
VMOVDQU32
VMOVDQU64
VMULPD
VMULPS
VMULSD
VMULSS
VORPD
VORPS
VSQRTPD
VSQRTPS
VSQRTSD
VSQRTSS
VSUBPD
VSUBPS
VSUBSD
VSUBSS
VUCOMISD
VUCOMISS
VUNPCKHPD
VUNPCKHPS
VUNPCKLPD
VUNPCKLPS
VXORPD
VXORPS
VEXTRACTPS
VINSERTPS
VPEXTRB
VPEXTRW
VPEXTRD
VPEXTRQ
VPINSRB
VPINSRW
VPINSRD
VPINSRQ
VPACKSSWB
VPACKSSDW
VPACKUSDW
VPACKUSWB
VPADDB
VPADDW
VPADDD
VPADDQ
VPADDSB
VPADDSW
VPADDUSB
VPADDUSW
VPANDD
VPANDQ
VPANDND
VPANDNQ
VPAVGB
VPAVGW
VPCMPEQB
VPCMPEQW
VPCMPEQD
VPCMPEQQ
VPCMPGTB
VPCMPGTW
VPCMPGTD
VPCMPGTQ
VPMAXSB
VPMAXSW
VPMAXSD
VPMAXUB
VPMAXUW
VPMAXUD
VPMINSB
VPMINSW
VPMINSD
VPMINUB
VPMINUW
VPMINUD
VPMOVSXBW
VPMOVSXBD
VPMOVSXBQ
VPMOVSXWD
VPMOVSXWQ
VPMOVSXDQ
VPMOVZXBW
VPMOVZXBD
VPMOVZXBQ
VPMOVZXWD
VPMOVZXWQ
VPMOVZXDQ
VPMULDQ
VPMULUDQ
VPMULHRSW
VPMULHUW
VPMULHW
VPMULLD
VPMULLW
VPORD
VPORQ
VPSUBB
VPSUBW
VPSUBD
VPSUBQ
VPSUBSB
VPSUBSW
VPSUBUSB
VPSUBUSW
VPUNPCKHBW
VPUNPCKHWD
VPUNPCKHDQ
VPUNPCKHQDQ
VPUNPCKLBW
VPUNPCKLWD
VPUNPCKLDQ
VPUNPCKLQDQ
VPXORD
VPXORQ
VPSADBW
VPSHUFB
VPSHUFHW
VPSHUFLW
VPSHUFD
VPSLLDQ
VPSLLW
VPSLLD
VPSLLQ
VPSRAW
VPSRAD
VPSRAQ
VPSRLDQ
VPSRLW
VPSRLD
VPSRLQ
VPSLLVW
VPSLLVD
VPSLLVQ
VPSRLVW
VPSRLVD
VPSRLVQ
VPSHUFPD
VPSHUFPS
VEXTRACTF32X4
VEXTRACTF64X2
VEXTRACTF32X8
VEXTRACTF64X4
VEXTRACTI32X4
VEXTRACTI64X2
VEXTRACTI32X8
VEXTRACTI64X4
VINSERTF32X4
VINSERTF64X2
VINSERTF32X8
VINSERTF64X4
VINSERTI32X4
VINSERTI64X2
VINSERTI32X8
VINSERTI64X4
VPABSB
VPABSW
VPABSD
VPALIGNR
VPERMD
VPERMILPD
VPERMILPS
VPERMPD
VPERMPS
VPERMQ
VPMADDUBSW
VPMADDWD
VFMADD132PD
VFMADD213PD
VFMADD231PD
VFMADD132PS
VFMADD213PS
VFMADD231PS
VFMADD132SD
VFMADD213SD
VFMADD231SD
VFMADD132SS
VFMADD213SS
VFMADD231SS
VFMADDSUB132PD
VFMADDSUB213PD
VFMADDSUB231PD
VFMADDSUB132PS
VFMADDSUB213PS
VFMADDSUB231PS
VFMSUBADD132PD
VFMSUBADD213PD
VFMSUBADD231PD
VFMSUBADD132PS
VFMSUBADD213PS
VFMSUBADD231PS
VFMSUB132PD
VFMSUB213PD
VFMSUB231PD
VFMSUB132PS
VFMSUB213PS
VFMSUB231PS
VFMSUB132SD
VFMSUB213SD
VFMSUB231SD
VFMSUB132SS
VFMSUB213SS
VFMSUB231SS
VFNMADD132PD
VFNMADD213PD
VFNMADD231PD
VFNMADD132PS
VFNMADD213PS
VFNMADD231PS
VFNMADD132SD
VFNMADD213SD
VFNMADD231SD
VFNMADD132SS
VFNMADD213SS
VFNMADD231SS
VFNMSUB132PD
VFNMSUB213PD
VFNMSUB231PD
VFNMSUB132PS
VFNMSUB213PS
VFNMSUB231PS
VFNMSUB132SD
VFNMSUB213SD
VFNMSUB231SD
VFNMSUB132SS
VFNMSUB213SS
VFNMSUB231SS
VGATHERDPS
VGATHERDPD
VGATHERQPS
VGATHERQPD
VPGATHERDD
VPGATHERDQ
VPGATHERQD
VPGATHERQQ
VPSRAVW
VPSRAVD
VPSRAVQ
^Note 1 : Intel does not officially support the AVX-512 family of instructions on the Alder Lake microprocessors. In early 2022, Intel began disabling AVX-512 in silicon (fusing it off) in Alder Lake microprocessors to prevent customers from enabling it.[34] In older Alder Lake family CPUs with some legacy combinations of BIOS and microcode revisions, it was possible to execute AVX-512 family instructions by disabling all the efficiency cores, which do not contain the silicon for AVX-512.[35][36][23]
Intel Vectorization Advisor (starting from version 2017) supports native AVX-512 performance and vector code quality analysis (for "Core", Xeon and Intel Xeon Phi processors). Along with the traditional hotspots profile, Advisor Recommendations and "seamless" integration of Intel Compiler vectorization diagnostics, the Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits", e.g. Scatter, Compress/Expand and mask utilization.[37][38]
On some processors (mostly pre-Ice Lake Intel), AVX-512 instructions can cause frequency throttling even greater than that of their predecessors, causing a penalty for mixed workloads. The additional downclocking is triggered by the 512-bit width of the vectors and depends on the nature of the instructions being executed; using the 128- or 256-bit forms (AVX-512VL) does not trigger it. As a result, gcc and clang default to preferring 256-bit vectors for Intel targets.[39][40][41]
C/C++ compilers also automatically handle loop unrolling and prevent pipeline stalls in order to use AVX-512 most effectively, which means that a programmer using language intrinsics to force the use of AVX-512 can sometimes end up with worse performance than the code the compiler generates for loops plainly written in the source.[42] In other cases, using AVX-512 intrinsics in C/C++ code can yield a performance improvement over plainly written C/C++.[43]
There are many examples of AVX-512 applications, including media processing, cryptography, video games,[44] neural networks,[45] and even OpenJDK, which employs AVX-512 for sorting.[46]
In a much-cited quote from 2020, Linus Torvalds said "I hope AVX-512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on,"[47] stating that he would prefer the transistor budget be spent on additional cores and integer performance instead, and that he "detests" floating point benchmarks.[48]
Numenta touts their "highly sparse"[49] neural network technology, which they say obviates the need for GPUs as their algorithms run on CPUs with AVX-512.[50] They claim a ten times speedup relative to A100 largely because their algorithms reduce the size of the neural network, while maintaining accuracy, by techniques such as the Sparse Evolutionary Training (SET) algorithm[51] and Foresight Pruning.[52]
Newer x86-64 processors also support the Galois Field New Instructions (GFNI), which allow implementing the Camellia S-box in a more straightforward manner and yield even better performance.