Intel Architecture Manual Updates: bfloat16 for Cooper Lake Xeon Scalable Only?

Intel recently released a new version of its software developer reference, revealing additional details about its upcoming Xeon Scalable ‘Cooper Lake-SP’ processors. As it turns out, the new CPUs will support AVX512_BF16 instructions and therefore the bfloat16 format. The main intrigue, however, is that at this point AVX512_BF16 appears to be supported only by the Cooper Lake-SP microarchitecture, and not by its direct successor, the Ice Lake-SP microarchitecture.

bfloat16 is a truncated 16-bit version of the 32-bit IEEE 754 single-precision floating-point format. It preserves all 8 exponent bits, but reduces the precision of the significand from 24 bits to 8 bits, saving memory, bandwidth, and processing resources while retaining the same dynamic range. The format was designed primarily for machine learning and near-sensor computing applications, where precision is needed near zero but not so much at the maximum range. The number representation is supported by Intel’s upcoming FPGAs and Nervana neural network processors, as well as Google’s TPUs. Given that Intel already supports the bfloat16 format across two of its product lines, it makes sense to support it elsewhere as well, which is what the company is set to do by adding AVX512_BF16 instruction support to its upcoming Xeon Scalable ‘Cooper Lake-SP’ platform.
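To make the relationship between the two formats concrete, here is a minimal C sketch (our illustration, not Intel code) that converts a single-precision float to bfloat16 simply by keeping the upper 16 bits of its encoding. Note that the actual VCVTNE* instructions round to nearest even rather than truncating, so hardware results can differ by one unit in the last place:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Illustrative only: bfloat16 is the upper half of the IEEE 754
   single-precision encoding (1 sign bit, 8 exponent bits, top 7
   mantissa bits). */
static uint16_t float_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return (uint16_t)(bits >> 16);
}

/* Widening a bfloat16 back to float is exact: restore 16 zero bits. */
static float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

int main(void) {
    float x = 3.14159265f;
    /* Prints roughly 3.140625: same range, 7 explicit mantissa bits. */
    printf("%f -> %f\n", x, bf16_to_float(float_to_bf16(x)));
    return 0;
}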

AVX-512 Support Propagation by Various Intel CPUs
(a newer microarchitecture also supports the extensions introduced by the older ones in its column)

Xeon (general):
Skylake-SP: AVX512F, AVX512CD, AVX512BW, AVX512DQ, AVX512VL
Cannon Lake: AVX512VBMI, AVX512IFMA
Cascade Lake-SP: AVX512_VNNI
Cooper Lake: AVX512_BF16
Ice Lake: AVX512_VNNI, AVX512_VBMI2, AVX512_BITALG, AVX512_VPOPCNTDQ, AVX512+VAES, AVX512+GFNI, AVX512+VPCLMULQDQ (but not AVX512_BF16)

Xeon Phi:
Knights Landing: AVX512F, AVX512CD, AVX512ER, AVX512PF
Knights Mill: AVX512_4FMAPS, AVX512_4VNNIW, AVX512_VPOPCNTDQ

Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (page 16)

The list of Intel’s AVX512_BF16 Vector Neural Network Instructions includes VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. All of them can be executed on 128-bit, 256-bit, or 512-bit vectors, so software developers can pick one of nine versions in total based on their requirements; a short usage sketch follows the listing below.

Intel AVX512_BF16 Instructions

VCVTNE2PS2BF16: Convert Two Packed Single Data to One Packed BF16 Data
Intel C/C++ Compiler Intrinsic Equivalents:
__m128bh _mm_cvtne2ps_pbh (__m128, __m128);
__m128bh _mm_mask_cvtne2ps_pbh (__m128bh, __mmask8, __m128, __m128);
__m128bh _mm_maskz_cvtne2ps_pbh (__mmask8, __m128, __m128);
__m256bh _mm256_cvtne2ps_pbh (__m256, __m256);
__m256bh _mm256_mask_cvtne2ps_pbh (__m256bh, __mmask16, __m256, __m256);
__m256bh _mm256_maskz_cvtne2ps_pbh (__mmask16, __m256, __m256);
__m512bh _mm512_cvtne2ps_pbh (__m512, __m512);
__m512bh _mm512_mask_cvtne2ps_pbh (__m512bh, __mmask32, __m512, __m512);
__m512bh _mm512_maskz_cvtne2ps_pbh (__mmask32, __m512, __m512);

VCVTNEPS2BF16: Convert Packed Single Data to Packed BF16 Data
Intel C/C++ Compiler Intrinsic Equivalents:
__m128bh _mm_cvtneps_pbh (__m128);
__m128bh _mm_mask_cvtneps_pbh (__m128bh, __mmask8, __m128);
__m128bh _mm_maskz_cvtneps_pbh (__mmask8, __m128);
__m128bh _mm256_cvtneps_pbh (__m256);
__m128bh _mm256_mask_cvtneps_pbh (__m128bh, __mmask8, __m256);
__m128bh _mm256_maskz_cvtneps_pbh (__mmask8, __m256);
__m256bh _mm512_cvtneps_pbh (__m512);
__m256bh _mm512_mask_cvtneps_pbh (__m256bh, __mmask16, __m512);
__m256bh _mm512_maskz_cvtneps_pbh (__mmask16, __m512);

VDPBF16PS: Dot Product of BF16 Pairs Accumulated into Packed Single Precision
Intel C/C++ Compiler Intrinsic Equivalents:
__m128 _mm_dpbf16_ps (__m128, __m128bh, __m128bh);
__m128 _mm_mask_dpbf16_ps (__m128, __mmask8, __m128bh, __m128bh);
__m128 _mm_maskz_dpbf16_ps (__mmask8, __m128, __m128bh, __m128bh);
__m256 _mm256_dpbf16_ps (__m256, __m256bh, __m256bh);
__m256 _mm256_mask_dpbf16_ps (__m256, __mmask8, __m256bh, __m256bh);
__m256 _mm256_maskz_dpbf16_ps (__mmask8, __m256, __m256bh, __m256bh);
__m512 _mm512_dpbf16_ps (__m512, __m512bh, __m512bh);
__m512 _mm512_mask_dpbf16_ps (__m512, __mmask16, __m512bh, __m512bh);
__m512 _mm512_maskz_dpbf16_ps (__mmask16, __m512, __m512bh, __m512bh);
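For a sense of how these intrinsics compose in practice, below is a minimal sketch of our own (not Intel sample code), assuming a compiler with AVX512_BF16 support (e.g. GCC or Clang with -mavx512bf16) and an array length divisible by 32. It computes a dot product of two float arrays through the bfloat16 path:

#include <immintrin.h>
#include <stddef.h>

/* Sketch: sum of products of two float arrays via bfloat16.
   Assumes n is a multiple of 32 and AVX512_BF16 hardware support. */
float bf16_dot(const float *a, const float *b, size_t n)
{
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        /* VCVTNE2PS2BF16: pack 2 x 16 floats into 32 bfloat16 values. */
        __m512bh va = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(a + i + 16),
                                          _mm512_loadu_ps(a + i));
        __m512bh vb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(b + i + 16),
                                          _mm512_loadu_ps(b + i));
        /* VDPBF16PS: multiply adjacent bf16 pairs and accumulate the
           results into 16 single-precision lanes. */
        acc = _mm512_dpbf16_ps(acc, va, vb);
    }
    return _mm512_reduce_add_ps(acc);  /* horizontal sum of the lanes */
}

Note that the conversions narrow each input to bfloat16, while the accumulation stays in full single precision, which is the usual arrangement for machine learning inference and training kernels.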

Only for Cooper Lake?

When Intel describes an instruction in its Intel Architecture Instruction Set Extensions and Future Features Programming Reference, the company usually states the first microarchitecture to support it and indicates that its successors also support it (or are set to support it) with the word ‘later’. For example, Intel’s original AVX is listed as supported by ‘Sandy Bridge and later’.

This is not the case with AVX512_BF16, which is listed as supported by ‘Future Cooper Lake’ only. Meanwhile, after the Cooper Lake-SP platform comes the long-awaited 10 nm Ice Lake-SP server platform, and it would be a bit odd for it not to support something its predecessor does. However, this is not an entirely impossible scenario: Intel is keen on offering differentiated solutions these days, so tailoring Cooper Lake-SP for certain workloads while focusing Ice Lake-SP on others may be what is happening here.
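However the roadmap shakes out, software should not key off microarchitecture names but query the feature bit directly: per the same programming reference, AVX512_BF16 is enumerated in CPUID leaf 7, sub-leaf 1, EAX bit 5. A minimal GCC/Clang-only sketch of our own:

#include <cpuid.h>

/* Returns non-zero if the CPU reports AVX512_BF16 support
   (CPUID.(EAX=07H, ECX=1):EAX[bit 5]). */
static int supports_avx512_bf16(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        return 0;  /* CPUID leaf 7 sub-leaf 1 not available */
    return (eax >> 5) & 1;
}

A production-grade check would additionally verify, via XGETBV, that the operating system has enabled AVX-512 state saving.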

We have reached out to Intel for additional information and will update the story if we get some extra details on the matter.

Update: Intel has sent us the following:

At this time, Cooper Lake will add support for Bfloat16 to DL Boost. We’re not giving any more guidance beyond that in our roadmap.

Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (via InstLatX64/Twitter)