AMD Calls New FPU "Flex FP", Defends Dual FMAC Approach

AMD: Dual FMAC Is the Most Efficient FPU Today

by Anton Shilov
10/26/2010 | 11:45 PM

Advanced Micro Devices continues to share secrets about its forthcoming products based on the design code-named Bulldozer. Recently the company explained in detail its new floating point unit (FPU), called "Flex FP", which promises to deliver high performance computing while being very efficient in terms of die size and power consumption.


As is known, Bulldozer processors consist of several so-called modules. Each module has two integer engines as well as one "Flex FP" FPU consisting of two 128-bit FMAC units that share their own scheduler. The approach differs from a hypothetical 256-bit FPU with appropriate data paths, which would often be underutilized. Moreover, a unified scheduler for both FP and integer execution units would also be less efficient, according to AMD.

"Each Flex FP has its own scheduler; it does not rely on the integer scheduler to schedule FP commands, nor does it take integer resources to schedule 256-bit executions. This helps to ensure that the FP unit stays full as floating point commands occur. Our competitors’ architectures have had single scheduler for both integer and floating point, which means that both integer and floating point commands are issued by a single shared scheduler vs. having dedicated schedulers for both integer and floating point executions," said John Fruehe, the director of product marketing for server/workstation products at AMD.

Modern 128-bit FPUs can execute four single precision commands or two double precision commands in parallel per cycle. The yet-to-come AVX technology makes it possible to execute eight 32-bit commands or four 64-bit commands per cycle. However, if a program does not support AVX, then "that flashy new 256-bit FPU only executes in 128-bit mode". This is naturally a blow to the 256-bit FPU of Intel's Sandy Bridge processors.
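The per-cycle figures above follow from dividing the vector width by the element width. A minimal Python sketch of that arithmetic is shown here; the function name `lanes` is purely illustrative and does not come from AMD or Intel documentation.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Return how many elements of element_bits fit in one vector of vector_bits."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("vector width must be a positive multiple of element width")
    return vector_bits // element_bits


# Widths taken from the article: 128-bit (current FPUs) and 256-bit (AVX);
# 32-bit single precision and 64-bit double precision elements.
for vector_bits in (128, 256):
    for name, element_bits in (("single", 32), ("double", 64)):
        count = lanes(vector_bits, element_bits)
        print(f"{vector_bits}-bit vector, {name} precision: {count} per cycle")
```

The same division explains why code that is not recompiled for AVX gets no benefit from wider hardware: its instructions still describe 128-bit vectors, so the lane count stays at four or two.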

The beauty of the Flex FP is that it is a single 256-bit FPU that is shared by two integer cores. With each cycle, either core can operate on 256 bits of parallel data via two 128-bit instructions or one 256-bit instruction, or each of the integer cores can execute 128-bit commands simultaneously.
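The sharing modes described above can be illustrated with a toy model. This is only a sketch under a simplifying assumption: the two 128-bit FMAC units are treated as 256 bits of capacity per cycle that the two cores split between them. The names `FMAC_WIDTH`, `FMAC_UNITS` and `fits_in_one_cycle` are hypothetical, and the model says nothing about how the real Flex FP scheduler arbitrates between cores.

```python
FMAC_WIDTH = 128   # bits per FMAC unit, as stated in the article
FMAC_UNITS = 2     # FMAC units per Flex FP, as stated in the article


def fits_in_one_cycle(core0_bits: int, core1_bits: int) -> bool:
    """Toy check: can both cores' FP requests be served in the same cycle?

    Each argument is the amount of FP data a core wants to process this
    cycle: 0 (no FP work), 128 (one 128-bit instruction) or 256 (two
    128-bit instructions or one 256-bit instruction).
    """
    for bits in (core0_bits, core1_bits):
        if bits not in (0, 128, 256):
            raise ValueError("request must be 0, 128 or 256 bits")
    return core0_bits + core1_bits <= FMAC_WIDTH * FMAC_UNITS


print(fits_in_one_cycle(256, 0))    # one core uses the whole FPU
print(fits_in_one_cycle(128, 128))  # each core uses one FMAC unit
print(fits_in_one_cycle(256, 128))  # exceeds the shared capacity
```

In this simplified picture, the first two cases correspond to the modes the article lists, while the third would have to wait for a later cycle.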

"In today’s typical data center workloads, the bulk of the processing is integer and a smaller portion is floating point. So, in most cases you don’t want one massive 256-bit floating point unit per core consuming all of that die space and all of that power just to sit around watching the integer cores do all of the heavy lifting. By sharing one 256-bit floating point unit per every 2 cores, we can keep die size and power consumption down, helping hold down both the acquisition cost and long-term management costs," explained Mr. Fruehe.

Sharing the Flex FP between two cores holds down the processor's power budget, which allows AMD to fit more integer cores into the same power envelope. In fact, AMD claims that the Flex FP is designed to reduce its active idle power consumption to a mere 2% of its peak power consumption.

"Obviously, there are benefits of recompiled code that will support the new AVX instructions. But, if you think that you will have some older 128-bit FP code hanging around (and let’s face it, you will), then don’t you think having a flexible floating point solution is a more flexible choice for your applications? For applications to support the new 256-bit AVX capabilities they will need to be recompiled; this takes time and testing, so I wouldn’t expect to see rapid movement to AVX until well after platforms are available on the streets," concluded Mr. Fruehe.