Floating Point Instructions
In the K8 processor, the floating-point instruction scheduler is separate from the integer instruction scheduler and organized differently. Its buffer can hold up to 12 groups of three macro-ops (theoretically, 36 floating-point operations), and a queue consolidation mechanism merges two incomplete triplets into one complete and one empty triplet. Unlike the integer execution unit with its symmetrical computational channels, the FPU contains three different units: FADD, FMUL, and FMISC (also known as FSTORE) for floating-point addition, multiplication, and auxiliary operations. The scheduler buffer therefore does not tie a macro-op's position in an instruction group to a particular execution unit (Figure 4).
Each clock cycle, one operation can be dispatched to each of the K8's 80-bit FP units. 128-bit SSE instructions are split into two 64-bit macro-ops at the decode stage and are then dispatched over two consecutive cycles. Theoretically, up to three macro-ops can be dispatched per cycle, but this rate is unachievable in practice because of decoding limitations: besides floating-point instructions, the code also contains auxiliary commands such as loads and loop control. Moreover, the simple scheduling algorithm does not always distribute operations among the free units in the optimal order, which can reduce the dispatch rate through inefficient utilization of the execution units.
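As an illustration, a single 128-bit packed-double addition (ADDPD, written here with intrinsics) performs two 64-bit additions at once; on the K8 the decoder splits it into two 64-bit macro-ops for the FADD unit. A minimal sketch (the function and variable names are ours, not from the article):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One 128-bit ADDPD adds two doubles at once. The K8 decodes it into
   two 64-bit macro-ops, so it occupies two FADD dispatch slots; a
   128-bit FADD unit could handle it as a single macro-op. */
void packed_add(const double a[2], const double b[2], double out[2]) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));  /* one ADDPD instruction */
}
```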
Thanks to its two 64-bit read buses, the K8 processor can receive up to two 64-bit operands per cycle from the L1 cache, which helps it sustain a high execution rate when floating-point instructions frequently access data in memory. This is an important feature of the architecture: four operands are necessary to execute two instructions in parallel (two operands per instruction), and in many streaming-data algorithms two of those four operands are read from memory.
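This load pressure is easy to see in a scalar dot-product loop, where every multiply consumes two 64-bit values fetched from memory (an illustrative sketch with our own naming, not code from the article):

```c
/* Each iteration issues two 64-bit loads (a[i] and b[i]), one FMUL and
   one FADD. The K8's two 64-bit read buses can supply both operands in
   a single cycle, keeping the FP units fed. */
double dot64(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
```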
In the K8L processor the FADD and FMUL devices will be expanded to 128 bits (Figure 5), which will help double the theoretical floating-point performance with code that uses vector SSE instructions (not only due to a doubled dispatch rate, but also due to an increased decoding and retiring rate caused by the reduced number of generated macro-ops).
The buses for reading data from the cache will also double in width, which will enable the processor to perform two 128-bit data loads from the L1 cache per cycle. This ability can give the K8L an advantage in some algorithms over a Conroe-core processor, which can perform only one 128-bit load per cycle.
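The benefit of two 128-bit loads per cycle shows up in vectorized streaming code such as an SSE2 dot product, where each iteration needs one load from each input array (a sketch under our own naming; n is assumed even):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Each iteration performs two 128-bit loads, one MULPD and one ADDPD.
   A core that sustains two 128-bit loads per cycle can feed this loop
   at twice the rate of one limited to a single 128-bit load per cycle.
   n is assumed to be even. */
double dot128(const double *a, const double *b, int n) {
    __m128d acc = _mm_setzero_pd();
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  /* 128-bit load from a */
        __m128d vb = _mm_loadu_pd(b + i);  /* 128-bit load from b */
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }
    double t[2];
    _mm_storeu_pd(t, acc);
    return t[0] + t[1];
}
```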
According to the information revealed so far, the FMISC (FSTORE) unit will remain 64 bits wide. This seems illogical: writes from 128-bit SSE registers to memory are executed by the FMISC (FSTORE) unit, and if it stays as it is now, it will automatically become a bottleneck in streaming calculations because the CPU will be unable to perform a 128-bit data write every clock cycle. One opinion is that AMD's presentation contains an error and the FMISC (FSTORE) unit will in fact write at the full rate of 128 bits per cycle. Alternatively, 128-bit stores may be handled by the other two units while the auxiliary-operations unit runs at half rate, which would not be crucial for overall performance. The scheduling algorithm, which does not always work optimally in the K8, also calls for improvement.
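The store-rate concern matters for loops like the following array scaling, where every iteration ends with a 128-bit store; a store unit that moves only 64 bits per cycle would need two cycles per store and would cap the loop's throughput (an illustrative sketch, names are ours; n is assumed even):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One 128-bit load, one MULPD and one 128-bit store per iteration.
   A 64-bit FSTORE unit would split each store into two macro-ops,
   halving the achievable store bandwidth of this loop. */
void scale128(const double *src, double *dst, double k, int n) {
    __m128d vk = _mm_set1_pd(k);
    for (int i = 0; i < n; i += 2)
        _mm_storeu_pd(dst + i, _mm_mul_pd(_mm_loadu_pd(src + i), vk));
}
```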
As we said above, the Conroe, unlike the K8, uses a common queue for integer and floating-point instructions, with all the ensuing advantages and shortcomings. Besides that, the ports for dispatching integer and floating-point instructions are shared, which prohibits certain combinations of integer and floating-point instructions. Another limitation of the Conroe, which can show up in floating-point algorithms that use x87 commands (i.e. without SSE optimizations), is the half-rate dispatch of the FMUL multiplication instruction.
Besides having wider execution units, the K8L will also have wider integer datapaths inside the FADD and FMUL blocks that handle integer SSE2 instructions. As a result, integer applications using these instruction sets will run faster. The K8L will also learn a few extra SSE instructions that we won't discuss here.
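Integer SSE2 operations of the kind that would benefit from the widened datapaths look like this packed 32-bit addition (a hedged sketch with our own naming):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* PADDD adds four 32-bit integers in one instruction. Wider integer
   datapaths inside FADD/FMUL let such 128-bit integer SSE2 operations
   execute without being split into 64-bit halves. */
void add4_epi32(const int a[4], const int b[4], int out[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```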
The FPU in the K8L promises to be very effective, and in some respects (such as the ability to read two 128-bit values per cycle) even more effective than the Conroe's.