The best reference manual for learning assembly; it helps with understanding.

x86/x64 SIMD Instruction List (SSE to AVX512)

MMX register (64-bit) instructions are omitted.

S1=SSE  S2=SSE2 S3=SSE3 SS3=SSSE3 S4.1=SSE4.1 S4.2=SSE4.2 V1=AVX V2=AVX2 V5=AVX512

Instructions marked * become scalar instructions (only the lowest element is calculated) when PS/PD/DQ is changed to SS/SD/SI.

The corresponding C/C++ intrinsic names are listed below each instruction.

AVX/AVX2

  • Add the prefix 'V' to an SSE instruction name to get the corresponding AVX instruction name.
  • Floating-point AVX instructions can do 256-bit operations on YMM registers.
  • Integer AVX instructions can use YMM registers starting with AVX2.
  • To use 256-bit intrinsics, change the prefix _mm to _mm256, and the suffix si128 to si256.
  • Using YMM registers requires OS support (for Windows, Windows 7 SP1 or later is required).
  • A YMM register is essentially split into two 128-bit lanes (upper and lower), and most instructions operate within each lane. Horizontal operations (unpacks, shuffles, horizontal calculations, byte shifts, conversions) can therefore behave unexpectedly; check the manuals carefully. See the sketch after this list.
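
A minimal sketch of the lane behavior (assuming AVX2 and <immintrin.h>; the values are only for illustration): _mm256_unpacklo_epi8 interleaves the low 8 bytes of each 128-bit lane separately, so bytes 8-15 and 24-31 of the first source are not used at all.

    #include <immintrin.h>

    __m256i a = _mm256_setr_epi8(
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31);
    __m256i b = _mm256_set1_epi8(100);
    // result bytes: 0,100,1,100,...,7,100 | 16,100,17,100,...,23,100
    // each 128-bit lane unpacks on its own; this is NOT a whole-256-bit unpack
    __m256i r = _mm256_unpacklo_epi8(a, b);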

AVX512

  • Instructions marked only "(V5" can be used if the CPUID AVX512F flag is set.
  • Instructions marked "(V5" plus "+xx" can be used only if both the CPUID AVX512F flag and the AVX512xx flag are set.
  • Using AVX512 instructions requires OS support.
  • The features common to most AVX512 instructions ({k1}{z}, {er}/{sae}, bcst) are not repeated for each instruction. See this -> AVX512 Memo
  • Opmask register instructions are listed on a separate page.

This document is intended to help you find the correct name of an instruction you are not sure of, so that you can then look it up in the manuals. Refer to the manuals before coding.

Intel's manuals -> https://software.intel.com/en-us/articles/intel-sdm

If you find any errors or other issues, please post to the feedback form or email me at the address at the bottom of this page.

 


MOVE     ?MM = XMM / YMM / ZMM

Columns: Integer (QWORD / DWORD / WORD / BYTE), Floating-Point (Double / Single / Half), YMM lane (128-bit)
?MM whole
from / to
?MM/mem
MOVDQA (S2
_mm_load_si128
_mm_store_si128
MOVDQU (S2
_mm_loadu_si128
_mm_storeu_si128
MOVAPD (S2
_mm_load_pd
_mm_loadr_pd
_mm_store_pd
_mm_storer_pd
MOVUPD (S2
_mm_loadu_pd
_mm_storeu_pd
MOVAPS (S1
_mm_load_ps
_mm_loadr_ps
_mm_store_ps
_mm_storer_ps
MOVUPS (S1
_mm_loadu_ps
_mm_storeu_ps
  
VMOVDQA64 (V5...
_mm_mask_load_epi64
_mm_mask_store_epi64
etc
VMOVDQU64 (V5...
_mm_mask_loadu_epi64
_mm_mask_store_epi64
etc
VMOVDQA32 (V5...
_mm_mask_load_epi32
_mm_mask_store_epi32
etc
VMOVDQU32 (V5...
_mm_mask_loadu_epi32
_mm_mask_storeu_epi32
etc
VMOVDQU16 (V5+BW...
_mm_mask_loadu_epi16
_mm_mask_storeu_epi16
etc
VMOVDQU8 (V5+BW...
_mm_mask_loadu_epi8
_mm_mask_storeu_epi8
etc
XMM upper half
from / to
mem
    MOVHPD (S2
_mm_loadh_pd
_mm_storeh_pd
MOVHPS (S1
_mm_loadh_pi
_mm_storeh_pi
  
XMM upper half
from / to
XMM lower half
     MOVHLPS (S1
_mm_movehl_ps
MOVLHPS (S1
_mm_movelh_ps
  
XMM lower half
from / to
mem
MOVQ (S2
_mm_loadl_epi64
_mm_storel_epi64
   MOVLPD (S2
_mm_loadl_pd
_mm_storel_pd
MOVLPS (S1
_mm_loadl_pi
_mm_storel_pi
  
XMM lowest 1 elem
from / to
r/m
MOVQ (S2
_mm_cvtsi64_si128
_mm_cvtsi128_si64
MOVD (S2
_mm_cvtsi32_si128
_mm_cvtsi128_si32
      
XMM lowest 1 elem
from / to
XMM/mem
MOVQ (S2
_mm_move_epi64
   MOVSD (S2
_mm_load_sd
_mm_store_sd
_mm_move_sd
MOVSS (S1
_mm_load_ss
_mm_store_ss
_mm_move_ss
  
XMM whole
from
1 elem
TIP 2
_mm_set1_epi64x
VPBROADCASTQ (V2
_mm_broadcastq_epi64
TIP 2
_mm_set1_epi32
VPBROADCASTD (V2
_mm_broadcastd_epi32
TIP 2
_mm_set1_epi16
VPBROADCASTW (V2
_mm_broadcastw_epi16
_mm_set1_epi8
VPBROADCASTB (V2
_mm_broadcastb_epi8
TIP 2
_mm_set1_pd
_mm_load1_pd
MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd
TIP 2
_mm_set1_ps
_mm_load1_ps
VBROADCASTSS
from mem (V1
from XMM (V2
_mm_broadcast_ss
  
YMM / ZMM whole
from
1 elem
VPBROADCASTQ (V2
_mm256_broadcastq_epi64
VPBROADCASTD (V2
_mm256_broadcastd_epi32
VPBROADCASTW (V2
_mm256_broadcastw_epi16
VPBROADCASTB (V2
_mm256_broadcastb_epi8
VBROADCASTSD
 from mem (V1
 from XMM (V2
_mm256_broadcast_sd
VBROADCASTSS
 from mem (V1
 from XMM (V2
_mm256_broadcast_ss
 VBROADCASTF128 (V1
_mm256_broadcast_ps
_mm256_broadcast_pd
VBROADCASTI128 (V2
_mm256_broadcastsi128_si256
YMM / ZMM whole
from
2/4/8 elems
VBROADCASTI64X2 (V5+DQ...
_mm512_broadcast_i64x2
VBROADCASTI64X4 (V5
_mm512_broadcast_i64x4
VBROADCASTI32X2 (V5+DQ...
_mm512_broadcast_i32x2
VBROADCASTI32X4 (V5...
_mm512_broadcast_i32x4
VBROADCASTI32X8 (V5+DQ
_mm512_broadcast_i32x8
  VBROADCASTF64X2 (V5+DQ...
_mm512_broadcast_f64x2
VBROADCASTF64X4 (V5
_mm512_broadcast_f64x4
VBROADCASTF32X2 (V5+DQ...
_mm512_broadcast_f32x2
VBROADCASTF32X4 (V5...
_mm512_broadcast_f32x4
VBROADCASTF32X8 (V5+DQ
_mm512_broadcast_f32x8
  
?MM
from
multiple elems
_mm_set_epi64x
_mm_setr_epi64x
_mm_set_epi32
_mm_setr_epi32
_mm_set_epi16
_mm_setr_epi16
_mm_set_epi8
_mm_setr_epi8
_mm_set_pd
_mm_setr_pd
_mm_set_ps
_mm_setr_ps
  
?MM whole
from
zero
TIP 1
_mm_setzero_si128
TIP 1
_mm_setzero_pd
TIP 1
_mm_setzero_ps
  
extract: PEXTRQ (S4.1
_mm_extract_epi64
PEXTRD (S4.1
_mm_extract_epi32
PEXTRW to r (S2
PEXTRW to r/m (S4.1
_mm_extract_epi16
PEXTRB (S4.1
_mm_extract_epi8
->MOVHPD (S2
_mm_loadh_pd
_mm_storeh_pd
->MOVLPD (S2
_mm_loadl_pd
_mm_storel_pd
EXTRACTPS (S4.1
_mm_extract_ps
 VEXTRACTF128 (V1
_mm256_extractf128_ps
_mm256_extractf128_pd
_mm256_extractf128_si256
VEXTRACTI128 (V2
_mm256_extracti128_si256
VEXTRACTI64X2 (V5+DQ...
_mm512_extracti64x2_epi64
VEXTRACTI64X4 (V5
_mm512_extracti64x4_epi64
VEXTRACTI32X4 (V5...
_mm512_extracti32x4_epi32
VEXTRACTI32X8 (V5+DQ
_mm512_extracti32x8_epi32
  VEXTRACTF64X2 (V5+DQ...
_mm512_extractf64x2_pd
VEXTRACTF64X4 (V5
_mm512_extractf64x4_pd
VEXTRACTF32X4 (V5...
_mm512_extractf32x4_ps
VEXTRACTF32X8 (V5+DQ
_mm512_extractf32x8_ps
  
insert: PINSRQ (S4.1
_mm_insert_epi64
PINSRD (S4.1
_mm_insert_epi32
PINSRW (S2
_mm_insert_epi16
PINSRB (S4.1
_mm_insert_epi8
->MOVHPD (S2
_mm_loadh_pd
_mm_storeh_pd
->MOVLPD (S2
_mm_loadl_pd
_mm_storel_pd
INSERTPS (S4.1
_mm_insert_ps
 VINSERTF128 (V1
_mm256_insertf128_ps
_mm256_insertf128_pd
_mm256_insertf128_si256
VINSERTI128 (V2
_mm256_inserti128_si256
VINSERTI64X2 (V5+DQ...
_mm512_inserti64x2
VINSERTI64X4 (V5...
_mm512_inserti64x4
VINSERTI32X4 (V5...
_mm512_inserti32x4
VINSERTI32X8 (V5+DQ
_mm512_inserti32x8
  VINSERTF64X2 (V5+DQ...
_mm512_insertf64x2
VINSERTF64X4 (V5
_mm512_insertf64x4
VINSERTF32X4 (V5...
_mm512_insertf32x4
VINSERTF32X8 (V5+DQ
_mm512_insertf32x8
  
unpack
PUNPCKHQDQ (S2
_mm_unpackhi_epi64
PUNPCKLQDQ (S2
_mm_unpacklo_epi64
PUNPCKHDQ (S2
_mm_unpackhi_epi32
PUNPCKLDQ (S2
_mm_unpacklo_epi32
PUNPCKHWD (S2
_mm_unpackhi_epi16
PUNPCKLWD (S2
_mm_unpacklo_epi16
PUNPCKHBW (S2
_mm_unpackhi_epi8
PUNPCKLBW (S2
_mm_unpacklo_epi8
UNPCKHPD (S2
_mm_unpackhi_pd
UNPCKLPD (S2
_mm_unpacklo_pd
UNPCKHPS (S1
_mm_unpackhi_ps
UNPCKLPS (S1
_mm_unpacklo_ps
  
shuffle/permute
VPERMQ (V2
_mm256_permute4x64_epi64
VPERMI2Q (V5...
_mm_permutex2var_epi64
PSHUFD (S2
_mm_shuffle_epi32
VPERMD (V2
_mm256_permutevar8x32_epi32
_mm256_permutexvar_epi32
VPERMI2D (V5...
_mm_permutex2var_epi32
PSHUFHW (S2
_mm_shufflehi_epi16
PSHUFLW (S2
_mm_shufflelo_epi16
VPERMW (V5+BW...
_mm_permutexvar_epi16
VPERMI2W (V5+BW...
_mm_permutex2var_epi16
PSHUFB (SS3
_mm_shuffle_epi8
SHUFPD (S2
_mm_shuffle_pd
VPERMILPD (V1
_mm_permute_pd
_mm_permutevar_pd
VPERMPD (V2
_mm256_permute4x64_pd
VPERMI2PD (V5...
_mm_permutex2var_pd
SHUFPS (S1
_mm_shuffle_ps
VPERMILPS (V1
_mm_permute_ps
_mm_permutevar_ps
VPERMPS (V2
_mm256_permutevar8x32_ps
VPERMI2PS (V5...
_mm_permutex2var_ps
 VPERM2F128 (V1
_mm256_permute2f128_ps
_mm256_permute2f128_pd
_mm256_permute2f128_si256
VPERM2I128 (V2
_mm256_permute2x128_si256
VSHUFI64X2 (V5...
_mm512_shuffle_i64x2
VSHUFI32X4 (V5...
_mm512_shuffle_i32x4
  VSHUFF64X2 (V5...
_mm512_shuffle_f64x2
VSHUFF32X4 (V5...
_mm512_shuffle_f32x4
  
blend
VPBLENDMQ (V5...
_mm_mask_blend_epi64
VPBLENDD (V2
_mm_blend_epi32
VPBLENDMD (V5...
_mm_mask_blend_epi32
PBLENDW (S4.1
_mm_blend_epi16
VPBLENDMW (V5+BW...
_mm_mask_blend_epi16
PBLENDVB (S4.1
_mm_blendv_epi8
VPBLENDMB (V5+BW...
_mm_mask_blend_epi8
BLENDPD (S4.1
_mm_blend_pd
BLENDVPD (S4.1
_mm_blendv_pd
VBLENDMPD (V5...
_mm_mask_blend_pd
BLENDPS (S4.1
_mm_blend_ps
BLENDVPS (S4.1
_mm_blendv_ps
VBLENDMPS (V5...
_mm_mask_blend_ps
  
move and duplicate    MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd
MOVSHDUP (S3
_mm_movehdup_ps
MOVSLDUP (S3
_mm_moveldup_ps
  
mask move: VPMASKMOVQ (V2
_mm_maskload_epi64
_mm_maskstore_epi64
VPMASKMOVD (V2
_mm_maskload_epi32
_mm_maskstore_epi32
  VMASKMOVPD (V1
_mm_maskload_pd
_mm_maskstore_pd
VMASKMOVPS (V1
_mm_maskload_ps
_mm_maskstore_ps
  
extract highest bit   PMOVMSKB (S2
_mm_movemask_epi8
MOVMSKPD (S2
_mm_movemask_pd
MOVMSKPS (S1
_mm_movemask_ps
  
VPMOVQ2M (V5+DQ...
_mm_movepi64_mask
VPMOVD2M (V5+DQ...
_mm_movepi32_mask
VPMOVW2M (V5+BW...
_mm_movepi16_mask
VPMOVB2M (V5+BW...
_mm_movepi8_mask
    
gather
VPGATHERDQ (V2
_mm_i32gather_epi64
_mm_mask_i32gather_epi64
VPGATHERQQ (V2
_mm_i64gather_epi64
_mm_mask_i64gather_epi64
VPGATHERDD (V2
_mm_i32gather_epi32
_mm_mask_i32gather_epi32
VPGATHERQD (V2
_mm_i64gather_epi32
_mm_mask_i64gather_epi32
  VGATHERDPD (V2
_mm_i32gather_pd
_mm_mask_i32gather_pd
VGATHERQPD (V2
_mm_i64gather_pd
_mm_mask_i64gather_pd
VGATHERDPS (V2
_mm_i32gather_ps
_mm_mask_i32gather_ps
VGATHERQPS (V2
_mm_i64gather_ps
_mm_mask_i64gather_ps
  
scatter
VPSCATTERDQ (V5...
_mm_i32scatter_epi64
_mm_mask_i32scatter_epi64
VPSCATTERQQ (V5...
_mm_i64scatter_epi64
_mm_mask_i64scatter_epi64
VPSCATTERDD (V5...
_mm_i32scatter_epi32
_mm_mask_i32scatter_epi32
VPSCATTERQD (V5...
_mm_i64scatter_epi32
_mm_mask_i64scatter_epi32
  VSCATTERDPD (V5...
_mm_i32scatter_pd
_mm_mask_i32scatter_pd
VSCATTERQPD (V5...
_mm_i64scatter_pd
_mm_mask_i64scatter_pd
VSCATTERDPS (V5...
_mm_i32scatter_ps
_mm_mask_i32scatter_ps
VSCATTERQPS (V5...
_mm_i64scatter_ps
_mm_mask_i64scatter_ps
  
compress
VPCOMPRESSQ (V5...
_mm_mask_compress_epi64
_mm_mask_compressstoreu_epi64
VPCOMPRESSD (V5...
_mm_mask_compress_epi32
_mm_mask_compressstoreu_epi32
  VCOMPRESSPD (V5...
_mm_mask_compress_pd
_mm_mask_compressstoreu_pd
VCOMPRESSPS (V5...
_mm_mask_compress_ps
_mm_mask_compressstoreu_ps
  
expand
VEXPANDQ (V5...
_mm_mask_expand_epi64
_mm_mask_expandloadu_epi64
VEXPANDD (V5...
_mm_mask_expand_epi32
_mm_mask_expandloadu_epi32
  VEXPANDPD (V5...
_mm_mask_expand_pd
_mm_mask_expandloadu_pd
VEXPANDPS (V5...
_mm_mask_expand_ps
_mm_mask_expandloadu_ps
  
align right: VALIGNQ (V5...
_mm_alignr_epi64
VALIGND (V5...
_mm_alignr_epi32
 PALIGNR (SS3
_mm_alignr_epi8
    
expand Opmask bits: VPMOVM2Q (V5+DQ...
_mm_movm_epi64
VPMOVM2D (V5+DQ...
_mm_movm_epi32
VPMOVM2W (V5+BW...
_mm_movm_epi16
VPMOVM2B (V5+BW...
_mm_movm_epi8
    

 

Conversions

Rows: source type (from). Columns: destination type (to): Integer (QWORD / DWORD / WORD / BYTE), Floating-Point (Double / Single / Half)
Integer QWORD: VPMOVQD (V5...
_mm_cvtepi64_epi32
VPMOVSQD (V5...
_mm_cvtsepi64_epi32
VPMOVUSQD (V5...
_mm_cvtusepi64_epi32
VPMOVQW (V5...
_mm_cvtepi64_epi16
VPMOVSQW (V5...
_mm_cvtsepi64_epi16
VPMOVUSQW (V5...
_mm_cvtusepi64_epi16
VPMOVQB (V5...
_mm_cvtepi64_epi8
VPMOVSQB (V5...
_mm_cvtsepi64_epi8
VPMOVUSQB (V5...
_mm_cvtusepi64_epi8
CVTSI2SD (S2 scalar only
_mm_cvtsi64_sd
VCVTQQ2PD* (V5+DQ...
_mm_cvtepi64_pd
VCVTUQQ2PD* (V5+DQ...
_mm_cvtepu64_pd
CVTSI2SS (S1 scalar only
_mm_cvtsi64_ss
VCVTQQ2PS* (V5+DQ...
_mm_cvtepi64_ps
VCVTUQQ2PS* (V5+DQ...
_mm_cvtepu64_ps
 
DWORD: TIP 3
PMOVSXDQ (S4.1
_mm_cvtepi32_epi64
PMOVZXDQ (S4.1
_mm_cvtepu32_epi64
 PACKSSDW (S2
_mm_packs_epi32
PACKUSDW (S4.1
_mm_packus_epi32
VPMOVDW (V5...
_mm_cvtepi32_epi16
VPMOVSDW (V5...
_mm_cvtsepi32_epi16
VPMOVUSDW (V5...
_mm_cvtusepi32_epi16
VPMOVDB (V5...
_mm_cvtepi32_epi8
VPMOVSDB (V5...
_mm_cvtsepi32_epi8
VPMOVUSDB (V5...
_mm_cvtusepi32_epi8
CVTDQ2PD* (S2
_mm_cvtepi32_pd
VCVTUDQ2PD* (V5...
_mm_cvtepu32_pd
CVTDQ2PS* (S2
_mm_cvtepi32_ps
VCVTUDQ2PS* (V5...
_mm_cvtepu32_ps
 
WORD: PMOVSXWQ (S4.1
_mm_cvtepi16_epi64
PMOVZXWQ (S4.1
_mm_cvtepu16_epi64
TIP 3
PMOVSXWD (S4.1
_mm_cvtepi16_epi32
PMOVZXWD (S4.1
_mm_cvtepu16_epi32
 PACKSSWB (S2
_mm_packs_epi16
PACKUSWB (S2
_mm_packus_epi16
VPMOVWB (V5+BW...
_mm_cvtepi16_epi8
VPMOVSWB (V5+BW...
_mm_cvtsepi16_epi8
VPMOVUSWB (V5+BW...
_mm_cvtusepi16_epi8
   
BYTE: PMOVSXBQ (S4.1
_mm_cvtepi8_epi64
PMOVZXBQ (S4.1
_mm_cvtepu8_epi64
PMOVSXBD (S4.1
_mm_cvtepi8_epi32
PMOVZXBD (S4.1
_mm_cvtepu8_epi32
TIP 3
PMOVSXBW (S4.1
_mm_cvtepi8_epi16
PMOVZXBW (S4.1
_mm_cvtepu8_epi16
    
Floating-Point Double: CVTSD2SI / CVTTSD2SI (S2 scalar only
_mm_cvtsd_si64 / _mm_cvttsd_si64
VCVTPD2QQ* / VCVTTPD2QQ* (V5+DQ...
_mm_cvtpd_epi64 / _mm_cvttpd_epi64
VCVTPD2UQQ* / VCVTTPD2UQQ* (V5+DQ...
_mm_cvtpd_epu64 / _mm_cvttpd_epu64
right ones are with truncation
CVTPD2DQ* / CVTTPD2DQ* (S2
_mm_cvtpd_epi32 / _mm_cvttpd_epi32
VCVTPD2UDQ* / VCVTTPD2UDQ* (V5...
_mm_cvtpd_epu32 / _mm_cvttpd_epu32
right ones are with truncation
   CVTPD2PS* (S2
_mm_cvtpd_ps
 
Single: CVTSS2SI / CVTTSS2SI (S1 scalar only
_mm_cvtss_si64 / _mm_cvttss_si64
VCVTPS2QQ* / VCVTTPS2QQ* (V5+DQ...
_mm_cvtps_epi64 / _mm_cvttps_epi64
VCVTPS2UQQ* / VCVTTPS2UQQ* (V5+DQ...
_mm_cvtps_epu64 / _mm_cvttps_epu64
right ones are with truncation
CVTPS2DQ* / CVTTPS2DQ* (S2
_mm_cvtps_epi32 / _mm_cvttps_epi32
VCVTPS2UDQ* / VCVTTPS2UDQ* (V5...
_mm_cvtps_epu32 / _mm_cvttps_epu32
right ones are with truncation
  CVTPS2PD* (S2
_mm_cvtps_pd
 VCVTPS2PH (F16C
_mm_cvtps_ph
Half     VCVTPH2PS (F16C
_mm_cvtph_ps
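
For example, the rounded and truncating conversion forms from the table above differ as follows (a small sketch assuming SSE2 and <emmintrin.h>): CVTPS2DQ rounds according to MXCSR (round-to-nearest-even by default), while CVTTPS2DQ always truncates toward zero.

    #include <emmintrin.h>

    __m128 f = _mm_setr_ps(1.5f, -1.5f, 2.7f, -2.7f);
    __m128i rounded   = _mm_cvtps_epi32(f);   //  2, -2,  3, -3 (round to nearest even)
    __m128i truncated = _mm_cvttps_epi32(f);  //  1, -1,  2, -2 (truncate toward zero)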
 

 

Arithmetic Operations

Columns: Integer (QWORD / DWORD / WORD / BYTE), Floating-Point (Double / Single / Half)
add: PADDQ (S2
_mm_add_epi64
PADDD (S2
_mm_add_epi32
PADDW (S2
_mm_add_epi16
PADDSW (S2
_mm_adds_epi16
PADDUSW (S2
_mm_adds_epu16
PADDB (S2
_mm_add_epi8
PADDSB (S2
_mm_adds_epi8
PADDUSB (S2
_mm_adds_epu8
ADDPD* (S2
_mm_add_pd
ADDPS* (S1
_mm_add_ps
 
sub: PSUBQ (S2
_mm_sub_epi64
PSUBD (S2
_mm_sub_epi32
PSUBW (S2
_mm_sub_epi16
PSUBSW (S2
_mm_subs_epi16
PSUBUSW (S2
_mm_subs_epu16
PSUBB (S2
_mm_sub_epi8
PSUBSB (S2
_mm_subs_epi8
PSUBUSB (S2
_mm_subs_epu8
SUBPD* (S2
_mm_sub_pd
SUBPS* (S1
_mm_sub_ps
 
mul: VPMULLQ (V5+DQ...
_mm_mullo_epi64
PMULDQ (S4.1
_mm_mul_epi32
PMULUDQ (S2
_mm_mul_epu32
PMULLD (S4.1
_mm_mullo_epi32
PMULHW (S2
_mm_mulhi_epi16
PMULHUW (S2
_mm_mulhi_epu16
PMULLW (S2
_mm_mullo_epi16
 MULPD* (S2
_mm_mul_pd
MULPS* (S1
_mm_mul_ps
 
div    DIVPD* (S2
_mm_div_pd
DIVPS* (S1
_mm_div_ps
 
reciprocal    VRCP14PD* (V5...
_mm_rcp14_pd
VRCP28PD* (V5+ER
_mm512_rcp28_pd
RCPPS* (S1
_mm_rcp_ps
VRCP14PS* (V5...
_mm_rcp14_ps
VRCP28PS* (V5+ER
_mm512_rcp28_ps
 
square root    SQRTPD* (S2
_mm_sqrt_pd
SQRTPS* (S1
_mm_sqrt_ps
 
reciprocal of square root    VRSQRT14PD* (V5...
_mm_rsqrt14_pd
VRSQRT28PD* (V5+ER
_mm512_rsqrt28_pd
RSQRTPS* (S1
_mm_rsqrt_ps
VRSQRT14PS* (V5...
_mm_rsqrt14_ps
VRSQRT28PS* (V5+ER
_mm512_rsqrt28_ps
 
power of two    VEXP2PD* (V5+ER
_mm512_exp2a23_round_pd
VEXP2PS* (V5+ER
_mm512_exp2a23_round_ps
 
multiply nth power of 2    VSCALEFPD* (V5...
_mm_scalef_pd
VSCALEFPS* (V5...
_mm_scalef_ps
 
max: TIP 8
VPMAXSQ (V5...
_mm_max_epi64
VPMAXUQ (V5...
_mm_max_epu64
TIP 8
PMAXSD (S4.1
_mm_max_epi32
PMAXUD (S4.1
_mm_max_epu32
PMAXSW (S2
_mm_max_epi16
PMAXUW (S4.1
_mm_max_epu16
TIP 8
PMAXSB (S4.1
_mm_max_epi8
PMAXUB (S2
_mm_max_epu8
TIP 8
MAXPD* (S2
_mm_max_pd
TIP 8
MAXPS* (S1
_mm_max_ps
 
min: TIP 8
VPMINSQ (V5...
_mm_min_epi64
VPMINUQ (V5...
_mm_min_epu64
TIP 8
PMINSD (S4.1
_mm_min_epi32
PMINUD (S4.1
_mm_min_epu32
PMINSW (S2
_mm_min_epi16
PMINUW (S4.1
_mm_min_epu16
TIP 8
PMINSB (S4.1
_mm_min_epi8
PMINUB (S2
_mm_min_epu8
TIP 8
MINPD* (S2
_mm_min_pd
TIP 8
MINPS* (S1
_mm_min_ps
 
average  PAVGW (S2
_mm_avg_epu16
PAVGB (S2
_mm_avg_epu8
   
absolute: TIP 4
VPABSQ (V5...
_mm_abs_epi64
TIP 4
PABSD (SS3
_mm_abs_epi32
TIP 4
PABSW (SS3
_mm_abs_epi16
TIP 4
PABSB (SS3
_mm_abs_epi8
TIP 5 (Double) / TIP 5 (Single)
sign operation PSIGND (SS3
_mm_sign_epi32
PSIGNW (SS3
_mm_sign_epi16
PSIGNB (SS3
_mm_sign_epi8
   
round    ROUNDPD* (S4.1
_mm_round_pd
_mm_floor_pd
_mm_ceil_pd
VRNDSCALEPD* (V5...
_mm_roundscale_pd
ROUNDPS* (S4.1
_mm_round_ps
_mm_floor_ps
_mm_ceil_ps
VRNDSCALEPS* (V5...
_mm_roundscale_ps
 
difference from rounded value    VREDUCEPD* (V5+DQ...
_mm_reduce_pd
VREDUCEPS* (V5+DQ...
_mm_reduce_ps
 
add / sub    ADDSUBPD (S3
_mm_addsub_pd
ADDSUBPS (S3
_mm_addsub_ps
 
horizontal add PHADDD (SS3
_mm_hadd_epi32
PHADDW (SS3
_mm_hadd_epi16
PHADDSW (SS3
_mm_hadds_epi16
 HADDPD (S3
_mm_hadd_pd
HADDPS (S3
_mm_hadd_ps
 
horizontal sub PHSUBD (SS3
_mm_hsub_epi32
PHSUBW (SS3
_mm_hsub_epi16
PHSUBSW (SS3
_mm_hsubs_epi16
 HSUBPD (S3
_mm_hsub_pd
HSUBPS (S3
_mm_hsub_ps
 
dot product    DPPD (S4.1
_mm_dp_pd
DPPS (S4.1
_mm_dp_ps
 
multiply and add  PMADDWD (S2
_mm_madd_epi16
PMADDUBSW (SS3
_mm_maddubs_epi16
   
fused multiply and add / sub    VFMADDxxxPD* (FMA
_mm_fmadd_pd
VFMSUBxxxPD* (FMA
_mm_fmsub_pd
VFMADDSUBxxxPD (FMA
_mm_fmaddsub_pd
VFMSUBADDxxxPD (FMA
_mm_fmsubadd_pd
VFNMADDxxxPD* (FMA
_mm_fnmadd_pd
VFNMSUBxxxPD* (FMA
_mm_fnmsub_pd
xxx=132/213/231
VFMADDxxxPS* (FMA
_mm_fmadd_ps
VFMSUBxxxPS* (FMA
_mm_fmsub_ps
VFMADDSUBxxxPS (FMA
_mm_fmaddsub_ps
VFMSUBADDxxxPS (FMA
_mm_fmsubadd_ps
VFNMADDxxxPS* (FMA
_mm_fnmadd_ps
VFNMSUBxxxPS* (FMA
_mm_fnmsub_ps
xxx=132/213/231
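
Usage sketch for the FMA intrinsics listed above (assuming FMA support and <immintrin.h>); the 132/213/231 suffixes only encode operand order, and the compiler picks one for you:

    #include <immintrin.h>

    __m128 a = _mm_set1_ps(2.0f);
    __m128 b = _mm_set1_ps(3.0f);
    __m128 c = _mm_set1_ps(1.0f);
    __m128 r1 = _mm_fmadd_ps(a, b, c);   //  a*b + c =  7.0 in each element
    __m128 r2 = _mm_fnmsub_ps(a, b, c);  // -a*b - c = -7.0 in each element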
 

 

Compare

Columns: Integer QWORD / DWORD / WORD / BYTE
compare for ==: PCMPEQQ (S4.1
_mm_cmpeq_epi64
_mm_cmpeq_epi64_mask (V5...
VPCMPUQ (0) (V5...
_mm_cmpeq_epu64_mask
PCMPEQD (S2
_mm_cmpeq_epi32
_mm_cmpeq_epi32_mask (V5...
VPCMPUD (0) (V5...
_mm_cmpeq_epu32_mask
PCMPEQW (S2
_mm_cmpeq_epi16
_mm_cmpeq_epi16_mask (V5+BW...
VPCMPUW (0) (V5+BW...
_mm_cmpeq_epu16_mask
PCMPEQB (S2
_mm_cmpeq_epi8
_mm_cmpeq_epi8_mask (V5+BW...
VPCMPUB (0) (V5+BW...
_mm_cmpeq_epu8_mask
compare for <: VPCMPQ (1) (V5...
_mm_cmplt_epi64_mask
VPCMPUQ (1) (V5...
_mm_cmplt_epu64_mask
VPCMPD (1) (V5...
_mm_cmplt_epi32_mask
VPCMPUD (1) (V5...
_mm_cmplt_epu32_mask
VPCMPW (1) (V5+BW...
_mm_cmplt_epi16_mask
VPCMPUW (1) (V5+BW...
_mm_cmplt_epu16_mask
VPCMPB (1) (V5+BW...
_mm_cmplt_epi8_mask
VPCMPUB (1) (V5+BW...
_mm_cmplt_epu8_mask
compare for <=: VPCMPQ (2) (V5...
_mm_cmple_epi64_mask
VPCMPUQ (2) (V5...
_mm_cmple_epu64_mask
VPCMPD (2) (V5...
_mm_cmple_epi32_mask
VPCMPUD (2) (V5...
_mm_cmple_epu32_mask
VPCMPW (2) (V5+BW...
_mm_cmple_epi16_mask
VPCMPUW (2) (V5+BW...
_mm_cmple_epu16_mask
VPCMPB (2) (V5+BW...
_mm_cmple_epi8_mask
VPCMPUB (2) (V5+BW...
_mm_cmple_epu8_mask
compare for >: PCMPGTQ (S4.2
_mm_cmpgt_epi64
VPCMPQ (6) (V5...
_mm_cmpgt_epi64_mask
VPCMPUQ (6) (V5...
_mm_cmpgt_epu64_mask
PCMPGTD (S2
_mm_cmpgt_epi32
VPCMPD (6) (V5...
_mm_cmpgt_epi32_mask
VPCMPUD (6) (V5...
_mm_cmpgt_epu32_mask
PCMPGTW (S2
_mm_cmpgt_epi16
VPCMPW (6) (V5+BW...
_mm_cmpgt_epi16_mask
VPCMPUW (6) (V5+BW...
_mm_cmpgt_epu16_mask
PCMPGTB (S2
_mm_cmpgt_epi8
VPCMPB (6) (V5+BW...
_mm_cmpgt_epi8_mask
VPCMPUB (6) (V5+BW...
_mm_cmpgt_epu8_mask
compare for >=: VPCMPQ (5) (V5...
_mm_cmpge_epi64_mask
VPCMPUQ (5) (V5...
_mm_cmpge_epu64_mask
VPCMPD (5) (V5...
_mm_cmpge_epi32_mask
VPCMPUD (5) (V5...
_mm_cmpge_epu32_mask
VPCMPW (5) (V5+BW...
_mm_cmpge_epi16_mask
VPCMPUW (5) (V5+BW...
_mm_cmpge_epu16_mask
VPCMPB (5) (V5+BW...
_mm_cmpge_epi8_mask
VPCMPUB (5) (V5+BW...
_mm_cmpge_epu8_mask
compare for !=: VPCMPQ (4) (V5...
_mm_cmpneq_epi64_mask
VPCMPUQ (4) (V5...
_mm_cmpneq_epu64_mask
VPCMPD (4) (V5...
_mm_cmpneq_epi32_mask
VPCMPUD (4) (V5...
_mm_cmpneq_epu32_mask
VPCMPW (4) (V5+BW...
_mm_cmpneq_epi16_mask
VPCMPUW (4) (V5+BW...
_mm_cmpneq_epu16_mask
VPCMPB (4) (V5+BW...
_mm_cmpneq_epi8_mask
VPCMPUB (4) (V5+BW...
_mm_cmpneq_epu8_mask

 

Floating-Point (columns: Double / Single / Half)
Within each type, the four variants are ordered by the comparison result when either (or both) operand is NaN (condition unmet / condition met), and within that by whether a QNaN operand raises an exception (YES / NO).
compare for ==: VCMPEQ_OSPD* (V1
_mm_cmp_pd
CMPEQPD* (S2
_mm_cmpeq_pd
VCMPEQ_USPD* (V1
_mm_cmp_pd
VCMPEQ_UQPD* (V1
_mm_cmp_pd
VCMPEQ_OSPS* (V1
_mm_cmp_ps
CMPEQPS* (S1
_mm_cmpeq_ps
VCMPEQ_USPS* (V1
_mm_cmp_ps
VCMPEQ_UQPS* (V1
_mm_cmp_ps
 
compare for <: CMPLTPD* (S2
_mm_cmplt_pd
VCMPLT_OQPD* (V1
_mm_cmp_pd
  CMPLTPS* (S1
_mm_cmplt_ps
VCMPLT_OQPS* (V1
_mm_cmp_ps
   
compare for <=: CMPLEPD* (S2
_mm_cmple_pd
VCMPLE_OQPD* (V1
_mm_cmp_pd
  CMPLEPS* (S1
_mm_cmple_ps
VCMPLE_OQPS* (V1
_mm_cmp_ps
   
compare for >: VCMPGTPD* (V1
_mm_cmpgt_pd (S2
VCMPGT_OQPD* (V1
_mm_cmp_pd
  VCMPGTPS* (V1
_mm_cmpgt_ps (S1
VCMPGT_OQPS* (V1
_mm_cmp_ps
   
compare for >=: VCMPGEPD* (V1
_mm_cmpge_pd (S2
VCMPGE_OQPD* (V1
_mm_cmp_pd
  VCMPGEPS* (V1
_mm_cmpge_ps (S1
VCMPGE_OQPS* (V1
_mm_cmp_ps
   
compare for !=: VCMPNEQ_OSPD* (V1
_mm_cmp_pd
VCMPNEQ_OQPD* (V1
_mm_cmp_pd
VCMPNEQ_USPD* (V1
_mm_cmp_pd
CMPNEQPD* (S2
_mm_cmpneq_pd
VCMPNEQ_OSPS* (V1
_mm_cmp_ps
VCMPNEQ_OQPS* (V1
_mm_cmp_ps
VCMPNEQ_USPS* (V1
_mm_cmp_ps
CMPNEQPS* (S1
_mm_cmpneq_ps
 
compare for ! <  CMPNLTPD* (S2
_mm_cmpnlt_pd
VCMPNLT_UQPD* (V1
_mm_cmp_pd
  CMPNLTPS* (S1
_mm_cmpnlt_ps
VCMPNLT_UQPS* (V1
_mm_cmp_ps
 
compare for ! <=  CMPNLEPD* (S2
_mm_cmpnle_pd
VCMPNLE_UQPD* (V1
_mm_cmp_pd
  CMPNLEPS* (S1
_mm_cmpnle_ps
VCMPNLE_UQPS* (V1
_mm_cmp_ps
 
compare for ! >  VCMPNGTPD* (V1
_mm_cmpngt_pd (S2
VCMPNGT_UQPD* (V1
_mm_cmp_pd
  VCMPNGTPS* (V1
_mm_cmpngt_ps (S1
VCMPNGT_UQPS* (V1
_mm_cmp_ps
 
compare for ! >=  VCMPNGEPD* (V1
_mm_cmpnge_pd (S2
VCMPNGE_UQPD* (V1
_mm_cmp_pd
  VCMPNGEPS* (V1
_mm_cmpnge_ps (S1
VCMPNGE_UQPS* (V1
_mm_cmp_ps
 
compare for ordered: VCMPORD_SPD* (V1
_mm_cmp_pd
CMPORDPD* (S2
_mm_cmpord_pd
  VCMPORD_SPS* (V1
_mm_cmp_ps
CMPORDPS* (S1
_mm_cmpord_ps
   
compare for unordered  VCMPUNORD_SPD* (V1
_mm_cmp_pd
CMPUNORDPD* (S2
_mm_cmpunord_pd
  VCMPUNORD_SPS* (V1
_mm_cmp_ps
CMPUNORDPS* (S1
_mm_cmpunord_ps
 
TRUE  VCMPTRUE_USPD* (V1
_mm_cmp_pd
VCMPTRUEPD* (V1
_mm_cmp_pd
  VCMPTRUE_USPS* (V1
_mm_cmp_ps
VCMPTRUEPS* (V1
_mm_cmp_ps
 
FALSE: VCMPFALSE_OSPD* (V1
_mm_cmp_pd
VCMPFALSEPD* (V1
_mm_cmp_pd
  VCMPFALSE_OSPS* (V1
_mm_cmp_ps
VCMPFALSEPS* (V1
_mm_cmp_ps
   

 

Floating-Point (columns: Double / Single / Half)
compare scalar values
to set flag register
COMISD (S2
_mm_comieq_sd
_mm_comilt_sd
_mm_comile_sd
_mm_comigt_sd
_mm_comige_sd
_mm_comineq_sd
UCOMISD (S2
_mm_ucomieq_sd
_mm_ucomilt_sd
_mm_ucomile_sd
_mm_ucomigt_sd
_mm_ucomige_sd
_mm_ucomineq_sd
COMISS (S1
_mm_comieq_ss
_mm_comilt_ss
_mm_comile_ss
_mm_comigt_ss
_mm_comige_ss
_mm_comineq_ss
UCOMISS (S1
_mm_ucomieq_ss
_mm_ucomilt_ss
_mm_ucomile_ss
_mm_ucomigt_ss
_mm_ucomige_ss
_mm_ucomineq_ss
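
These scalar compare intrinsics return an int, so they can be used directly for branching (a sketch assuming SSE2 and <emmintrin.h>); the UCOMI variants differ only in not signaling on QNaN operands:

    #include <emmintrin.h>

    __m128d a = _mm_set_sd(1.0);
    __m128d b = _mm_set_sd(2.0);
    if (_mm_comilt_sd(a, b)) {
        // taken: only the lowest doubles are compared (1.0 < 2.0)
    }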
 

 

Bitwise Logical Operations

Columns: Integer (QWORD / DWORD / WORD / BYTE), Floating-Point (Double / Single / Half)
and: PAND (S2
_mm_and_si128
ANDPD (S2
_mm_and_pd
ANDPS (S1
_mm_and_ps
 
VPANDQ (V5...
_mm512_and_epi64
etc
VPANDD (V5...
_mm512_and_epi32
etc
  
and not: PANDN (S2
_mm_andnot_si128
ANDNPD (S2
_mm_andnot_pd
ANDNPS (S1
_mm_andnot_ps
 
VPANDNQ (V5...
_mm512_andnot_epi64
etc
VPANDND (V5...
_mm512_andnot_epi32
etc
  
or: POR (S2
_mm_or_si128
ORPD (S2
_mm_or_pd
ORPS (S1
_mm_or_ps
 
VPORQ (V5...
_mm512_or_epi64
etc
VPORD (V5...
_mm512_or_epi32
etc
  
xor: PXOR (S2
_mm_xor_si128
XORPD (S2
_mm_xor_pd
XORPS (S1
_mm_xor_ps
 
VPXORQ (V5...
_mm512_xor_epi64
etc
VPXORD (V5...
_mm512_xor_epi32
etc
  
test: PTEST (S4.1
_mm_testz_si128
_mm_testc_si128
_mm_testnzc_si128
VTESTPD (V1
_mm_testz_pd
_mm_testc_pd
_mm_testnzc_pd
VTESTPS (V1
_mm_testz_ps
_mm_testc_ps
_mm_testnzc_ps
 
VPTESTMQ (V5...
_mm_test_epi64_mask
VPTESTNMQ (V5...
_mm_testn_epi64_mask
VPTESTMD (V5...
_mm_test_epi32_mask
VPTESTNMD (V5...
_mm_testn_epi32_mask
VPTESTMW (V5+BW...
_mm_test_epi16_mask
VPTESTNMW (V5+BW...
_mm_testn_epi16_mask
VPTESTMB (V5+BW...
_mm_test_epi8_mask
VPTESTNMB (V5+BW...
_mm_testn_epi8_mask
   
ternary operation: VPTERNLOGQ (V5...
_mm_ternarylogic_epi64
VPTERNLOGD (V5...
_mm_ternarylogic_epi32
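
Example of the test intrinsics above (SSE4.1, <smmintrin.h>): _mm_testz_si128 returns 1 when the bitwise AND of its operands is all zero, giving a cheap whole-register zero check.

    #include <smmintrin.h>

    __m128i v = _mm_setzero_si128();
    if (_mm_testz_si128(v, v)) {
        // taken: v AND v == 0, i.e. v is all-zero
    }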
     

 

Bit Shift / Rotate

Columns: Integer QWORD / DWORD / WORD / BYTE
shift left logical: PSLLQ (S2
_mm_slli_epi64
_mm_sll_epi64
PSLLD (S2
_mm_slli_epi32
_mm_sll_epi32
PSLLW (S2
_mm_slli_epi16
_mm_sll_epi16
 
VPSLLVQ (V2
_mm_sllv_epi64
VPSLLVD (V2
_mm_sllv_epi32
VPSLLVW (V5+BW...
_mm_sllv_epi16
 
shift right logical: PSRLQ (S2
_mm_srli_epi64
_mm_srl_epi64
PSRLD (S2
_mm_srli_epi32
_mm_srl_epi32
PSRLW (S2
_mm_srli_epi16
_mm_srl_epi16
 
VPSRLVQ (V2
_mm_srlv_epi64
VPSRLVD (V2
_mm_srlv_epi32
VPSRLVW (V5+BW...
_mm_srlv_epi16
 
shift right arithmetic: VPSRAQ (V5...
_mm_srai_epi64
_mm_sra_epi64
PSRAD (S2
_mm_srai_epi32
_mm_sra_epi32
PSRAW (S2
_mm_srai_epi16
_mm_sra_epi16
 
VPSRAVQ (V5...
_mm_srav_epi64
VPSRAVD (V2
_mm_srav_epi32
VPSRAVW (V5+BW...
_mm_srav_epi16
 
rotate left: VPROLQ (V5...
_mm_rol_epi64
VPROLD (V5...
_mm_rol_epi32
  
VPROLVQ (V5...
_mm_rolv_epi64
VPROLVD (V5...
_mm_rolv_epi32
  
rotate right: VPRORQ (V5...
_mm_ror_epi64
VPRORD (V5...
_mm_ror_epi32
  
VPRORVQ (V5...
_mm_rorv_epi64
VPRORVD (V5...
_mm_rorv_epi32
  

 

Byte Shift

 128-bit
shift left logical: PSLLDQ (S2
_mm_slli_si128
shift right logical: PSRLDQ (S2
_mm_srli_si128
packed align right: PALIGNR (SS3
_mm_alignr_epi8
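
These shift the whole 128-bit register by an immediate byte count (not a bit count). A small example (SSE2, <emmintrin.h>):

    #include <emmintrin.h>

    __m128i v = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    __m128i l = _mm_slli_si128(v, 4);  // bytes: 0,0,0,0, 0,1,2,...,11
    __m128i r = _mm_srli_si128(v, 4);  // bytes: 4,5,...,15, 0,0,0,0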

 

Compare Strings

Columns: explicit length / implicit length
return index: PCMPESTRI (S4.2
_mm_cmpestri
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRI (S4.2
_mm_cmpistri
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz
return mask: PCMPESTRM (S4.2
_mm_cmpestrm
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRM (S4.2
_mm_cmpistrm
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz
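
A hedged sketch of the implicit-length form (SSE4.2, <nmmintrin.h>): with _SIDD_CMP_EQUAL_ANY, the first operand is a null-terminated set of characters, and the returned index is the position of the first byte of the second operand that appears in that set (16 if none).

    #include <nmmintrin.h>
    #include <string.h>

    __m128i set = _mm_setr_epi8('\r', '\n', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    __m128i text;
    memcpy(&text, "hello\nworld.....", 16);   // 16 bytes of text, no terminator needed
    int idx = _mm_cmpistri(set, text, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY);
    // idx == 5 : position of '\n', the first CR/LF byte in the text block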

 

Others

LDMXCSR (S1
_mm_setcsr
Load MXCSR register
STMXCSR (S1
_mm_getcsr
Save MXCSR register state

 

PSADBW (S2
_mm_sad_epu8
Compute sum of absolute differences
MPSADBW (S4.1
_mm_mpsadbw_epu8
Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word integers.
VDBPSADBW (V5+BW...
_mm_dbsad_epu8
Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes

 

PMULHRSW (SS3
_mm_mulhrs_epi16
Packed Multiply High with Round and Scale

 

PHMINPOSUW (S4.1
_mm_minpos_epu16
Finds the value and location of the minimum unsigned word from one of 8 horizontally packed unsigned words. The resulting value and location (offset within the source) are packed into the low dword of the destination XMM register.
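
Usage sketch (SSE4.1, <smmintrin.h>):

    #include <smmintrin.h>

    __m128i v = _mm_setr_epi16(9, 4, 7, 4, 8, 3, 6, 5);
    __m128i r = _mm_minpos_epu16(v);
    int minValue = _mm_extract_epi16(r, 0);  // 3 (the minimum word)
    int minIndex = _mm_extract_epi16(r, 1);  // 5 (its offset within the source)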

 

VPCONFLICTQ (V5+CD...
_mm512_conflict_epi64
VPCONFLICTD (V5+CD...
_mm512_conflict_epi32
Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/Register

 

VPLZCNTQ (V5+CD...
_mm_lzcnt_epi64
VPLZCNTD (V5+CD...
_mm_lzcnt_epi32
Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values

 

VFIXUPIMMPD* (V5...
_mm512_fixupimm_pd
VFIXUPIMMPS* (V5...
_mm512_fixupimm_ps
Fix Up Special Packed Float64/32 Values
VFPCLASSPD* (V5...
_mm512_fpclass_pd_mask
VFPCLASSPS* (V5...
_mm512_fpclass_ps_mask
Tests Types of Packed Float64/32 Values
VRANGEPD* (V5+DQ...
_mm_range_pd
VRANGEPS* (V5+DQ...
_mm_range_ps
Range Restriction Calculation For Packed Pairs of Float64/32 Values
VGETEXPPD* (V5...
_mm512_getexp_pd
VGETEXPPS* (V5...
_mm512_getexp_ps
Convert Exponents of Packed DP/SP FP Values to FP Values
VGETMANTPD* (V5...
_mm512_getmant_pd
VGETMANTPS* (V5...
_mm512_getmant_ps
Extract Float64/32 Vector of Normalized Mantissas from Float64/32 Vector

 

AESDEC (AESNI
_mm_aesdec_si128
Perform an AES decryption round using a 128-bit state and a round key
AESDECLAST (AESNI
_mm_aesdeclast_si128
Perform the last AES decryption round using a 128-bit state and a round key
AESENC (AESNI
_mm_aesenc_si128
Perform an AES encryption round using a 128-bit state and a round key
AESENCLAST (AESNI
_mm_aesenclast_si128
Perform the last AES encryption round using a 128-bit state and a round key
AESIMC (AESNI
_mm_aesimc_si128
Perform an inverse mix column transformation primitive
AESKEYGENASSIST (AESNI
_mm_aeskeygenassist_si128
Assist the creation of round keys with a key expansion schedule
PCLMULQDQ (PCLMULQDQ
_mm_clmulepi64_si128
Perform carryless multiplication of two 64-bit numbers

 

SHA1RNDS4 (SHA
_mm_sha1rnds4_epu32
Perform Four Rounds of SHA1 Operation
SHA1NEXTE (SHA
_mm_sha1nexte_epu32
Calculate SHA1 State Variable E after Four Rounds
SHA1MSG1 (SHA
_mm_sha1msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords
SHA1MSG2 (SHA
_mm_sha1msg2_epu32
Perform a Final Calculation for the Next Four SHA1 Message Dwords
SHA256RNDS2 (SHA
_mm_sha256rnds2_epu32
Perform Two Rounds of SHA256 Operation
SHA256MSG1 (SHA
_mm_sha256msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA256 Message
SHA256MSG2 (SHA
_mm_sha256msg2_epu32
Perform a Final Calculation for the Next Four SHA256 Message Dwords

 

VPBROADCASTMB2Q (V5+CD...
_mm_broadcastmb_epi64
VPBROADCASTMW2D (V5+CD...
_mm_broadcastmw_epi32
Broadcast Mask to Vector Register

 

VZEROALL (V1
_mm256_zeroall
Zero all YMM registers
VZEROUPPER (V1
_mm256_zeroupper
Zero upper 128 bits of all YMM registers

 

MOVNTPS (S1
_mm_stream_ps
Non-temporal store of four packed single-precision floating-point values from an XMM register into memory
MASKMOVDQU (S2
_mm_maskmoveu_si128
Non-temporal store of selected bytes from an XMM register into memory
MOVNTPD (S2
_mm_stream_pd
Non-temporal store of two packed double-precision floating-point values from an XMM register into memory
MOVNTDQ (S2
_mm_stream_si128
Non-temporal store of double quadword from an XMM register into memory
LDDQU (S3
_mm_lddqu_si128
Special 128-bit unaligned load designed to avoid cache line splits
MOVNTDQA (S4.1
_mm_stream_load_si128
Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers ("streaming load buffers"). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput.

 

VGATHERPFxDPS (V5+PF
_mm512_mask_prefetch_i32gather_ps
VGATHERPFxQPS (V5+PF
_mm512_mask_prefetch_i64gather_ps
VGATHERPFxDPD (V5+PF
_mm512_mask_prefetch_i32gather_pd
VGATHERPFxQPD (V5+PF
_mm512_mask_prefetch_i64gather_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint
VSCATTERPFxDPS (V5+PF
_mm512_prefetch_i32scatter_ps
VSCATTERPFxQPS (V5+PF
_mm512_prefetch_i64scatter_ps
VSCATTERPFxDPD (V5+PF
_mm512_prefetch_i32scatter_pd
VSCATTERPFxQPD (V5+PF
_mm512_prefetch_i64scatter_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint with Intent to Write

 

 

TIPS

TIP 1: Zero Clear

XOR instructions work for both integer and floating-point data.

Example: Zero all of 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1

        pxor         xmm1, xmm1

Example: Set 0.0f to 4 floats in XMM1

        xorps        xmm1, xmm1

Example: Set 0.0 to 2 doubles in XMM1

        xorpd        xmm1, xmm1
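
Example (intrinsics): the setzero intrinsics typically compile to exactly these XOR idioms.

    // requires <emmintrin.h>
    __m128i izero = _mm_setzero_si128();  // pxor
    __m128d dzero = _mm_setzero_pd();     // xorpd
    __m128  fzero = _mm_setzero_ps();     // xorps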

 

TIP 2: Copy the lowest 1 element to other elements in XMM register

Shuffle instructions do this.

Example: Copy the lowest float element to other 3 elements in XMM1.

        shufps       xmm1, xmm1, 0

Example: Copy the lowest WORD element to other 7 elements in XMM1

        pshuflw       xmm1, xmm1, 0
        pshufd        xmm1, xmm1, 0

Example: Copy the lower QWORD element to the upper element in XMM1

        pshufd        xmm1, xmm1, 44h     ; 01 00 01 00 B = 44h

Is this better?

        punpcklqdq    xmm1, xmm1
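
Example (intrinsics): broadcast the lowest float element of an __m128 variable floats4 to all 4 elements with a shuffle.

    // requires <xmmintrin.h>
    floats4 = _mm_shuffle_ps(floats4, floats4, 0);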

 

TIP 3: Integer Sign Extension / Zero Extension

Unpack instructions do this.

Example: Zero extend 8 WORDS in XMM1 to DWORDS in XMM1 (lower 4) and XMM2 (upper 4).

        movdqa     xmm2, xmm1     ; src data WORD[7] [6] [5] [4] [3] [2] [1] [0]
        pxor       xmm3, xmm3     ; upper 16-bit to attach to each WORD = all 0
        punpcklwd  xmm1, xmm3     ; lower 4 DWORDS:  0 [3] 0 [2] 0 [1] 0 [0] 
        punpckhwd  xmm2, xmm3     ; upper 4 DWORDS:  0 [7] 0 [6] 0 [5] 0 [4]

Example: Sign extend 16 BYTES in XMM1 to WORDS in XMM1 (lower 8) and XMM2 (upper 8).

        pxor       xmm3, xmm3
        movdqa     xmm2, xmm1
        pcmpgtb    xmm3, xmm1     ; upper 8-bit to attach to each BYTE = src >= 0 ? 0 : -1
        punpcklbw  xmm1, xmm3     ; lower 8 WORDS
        punpckhbw  xmm2, xmm3     ; upper 8 WORDS

Example (intrinsics): Sign extend 8 WORDS in __m128i variable words8 to DWORDS in dwords4lo (lower 4) and dwords4hi (upper 4)

    const __m128i izero = _mm_setzero_si128();
    __m128i words8hi = _mm_cmpgt_epi16(izero, words8);
    __m128i dwords4lo = _mm_unpacklo_epi16(words8, words8hi);
    __m128i dwords4hi = _mm_unpackhi_epi16(words8, words8hi);

 

TIP 4: Absolute Values of Integers

If an integer value is positive or zero, it is already the absolute value. Otherwise, complementing all bits and then adding 1 gives the absolute value.

Example: Set absolute values of 8 signed WORDS in XMM1 to XMM1

                                  ; if src is positive or 0; if src is negative
        pxor      xmm2, xmm2      
        pcmpgtw   xmm2, xmm1      ; xmm2 <- 0              ; xmm2 <- -1
        pxor      xmm1, xmm2      ; xor with 0(do nothing) ; xor with -1(complement all bits)
        psubw     xmm1, xmm2      ; subtract 0(do nothing) ; subtract -1(add 1)

Example (intrinsics): Set absolute values of 4 DWORDS in __m128i variable dwords4 to dwords4

    const __m128i izero = _mm_setzero_si128();
    __m128i tmp = _mm_cmpgt_epi32(izero, dwords4);
    dwords4 = _mm_xor_si128(dwords4, tmp);
    dwords4 = _mm_sub_epi32(dwords4, tmp);

 

TIP 5: Absolute Values of Floating-Points

Floating-point values are not stored in two's complement, so simply clearing the sign (highest) bit gives the absolute value.

Example: Set absolute values of 4 floats in XMM1 to XMM1

; data
              align   16
signoffmask   dd      4 dup (7fffffffH)       ; mask for clearing the highest bit
        
; code
        andps   xmm1, xmmword ptr signoffmask        

Example (intrinsics): Set absolute values of 4 floats in __m128 variable floats4 to floats4

        const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000

        floats4 = _mm_andnot_ps(signmask, floats4);

 

TIP 6: Lacking some integer MUL instructions?

Signed vs. unsigned makes a difference only when computing the upper part of the product. For the lower part, the same instruction can be used for both signed and unsigned.

unsigned WORD * unsigned WORD -> Upper WORD: PMULHUW, Lower WORD: PMULLW

signed WORD * signed WORD -> Upper WORD: PMULHW, Lower WORD: PMULLW
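
Example (intrinsics): full 32-bit products of the 8 signed WORDS in __m128i variables a and b, combining the low and high halves with unpacks (a sketch; requires <emmintrin.h>).

    __m128i lo = _mm_mullo_epi16(a, b);            // lower 16 bits of each product
    __m128i hi = _mm_mulhi_epi16(a, b);            // upper 16 bits (signed)
    __m128i prod4lo = _mm_unpacklo_epi16(lo, hi);  // 4 full 32-bit products (low half)
    __m128i prod4hi = _mm_unpackhi_epi16(lo, hi);  // 4 full 32-bit products (high half)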

 

TIP 8: max / min

Getting a mask by comparison and then applying bitwise operations does this.

Example: Compare each signed DWORD in XMM1 and XMM2 and set smaller one to XMM1

; A=xmm1  B=xmm2                    ; if A>B        ; if A<=B
        movdqa      xmm0, xmm1
        pcmpgtd     xmm1, xmm2      ; xmm1=-1       ; xmm1=0
        pand        xmm2, xmm1      ; xmm2=B        ; xmm2=0
        pandn       xmm1, xmm0      ; xmm1=0        ; xmm1=A
        por         xmm1, xmm2      ; xmm1=B        ; xmm1=A

Example (intrinsics): Compare each signed byte in __m128i variables a, b and set larger one to maxAB

    __m128i mask = _mm_cmpgt_epi8(a, b);
    __m128i selectedA = _mm_and_si128(mask, a);
    __m128i selectedB = _mm_andnot_si128(mask, b);
    __m128i maxAB = _mm_or_si128(selectedA, selectedB);

 

TIP 10: Set all bits

A PCMPEQx instruction does this.

Example: set -1 to all of the 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1.

        pcmpeqb         xmm1, xmm1
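
Example (intrinsics): comparing a register with itself is always equal, producing all-ones.

    // requires <emmintrin.h>
    __m128i v = _mm_setzero_si128();        // any value works here
    __m128i allOnes = _mm_cmpeq_epi8(v, v); // 0xFF in every byte, i.e. -1 everywhere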

 

