The opportunity in question is to hand over just the necessary part of a result immediately, as a partial operand for the next addition, without spending any time on the data transfer itself. Say, for example, that we have a chain of dependent addition instructions, where each instruction depends on the result of the previous one. Let’s start with the first instruction. During the first tick the low-order half of its result is computed and immediately passed on to the next instruction. During the second tick the high-order half of the first result is calculated, and at the same time the low-order half of the next result is calculated, too. And so on and so forth. In other words, a chain of 100 dependent ADD operations, each of which takes the previous result as its operand, will be completed in about 50 clock cycles. As a result, the effective latency of each operation is half a clock! Excellent result! This is exactly how Pentium 4 processors based on Willamette and Northwood cores work.
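The overlap described above can be sketched as a small tick-by-tick model. This is only an illustration under our own assumptions: 32-bit operands split into two 16-bit halves, one half processed per tick, and a hypothetical helper named `run_chain`. It is not a description of the actual circuitry.

```python
MASK16 = 0xFFFF  # assumed width of one "half" of a 32-bit operand


def run_chain(start, addends):
    """Model a chain of dependent 32-bit ADDs done in 16-bit halves.

    Returns (final_value, ticks_used). In each tick the low half of the
    current ADD is computed in parallel with the high half of the
    previous ADD, so every additional ADD costs one extra tick.
    """
    lo, hi = start & MASK16, (start >> 16) & MASK16
    carry = 0          # carry out of the most recent low-half addition
    pending_hi = None  # high-half addend still waiting for that carry
    ticks = 0
    for b in addends:
        ticks += 1
        # High half of the previous ADD (uses the carry from its low half).
        if pending_hi is not None:
            hi = (hi + pending_hi + carry) & MASK16
        # Low half of the current ADD, available to the next ADD right away.
        s = lo + (b & MASK16)
        lo, carry = s & MASK16, s >> 16
        pending_hi = (b >> 16) & MASK16
    if pending_hi is not None:
        ticks += 1  # one last tick for the high half of the final ADD
        hi = (hi + pending_hi + carry) & MASK16
    return (hi << 16) | lo, ticks


value, ticks = run_chain(0, [1] * 100)
clocks = ticks / 2  # two ticks per clock cycle
```

In this model 100 dependent ADDs take 101 ticks, i.e. 50.5 clocks, which matches the "about 50 clock cycles" figure.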
Of course, there are no miracles in this world, which is why it will still take the same time to obtain the flags. However, a much more important limitation appears now: fast ALU doesn’t work with all operations but only with the simplest ones, such as ADD. For the mechanism described above to work flawlessly, fast ALU should execute only those micro-operations that process their operands starting from the low-order bits and moving towards the high-order bits, and never the other way round. All operations that do not comply with this condition are processed by slow ALU. And if the operands have to be transferred to the slow unit in this case, the resulting delay will equal 4 ticks (two clock cycles).
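To see why the processing direction matters, here is a small check written for illustration. The 16-bit half width and the choice of a right shift as the counter-example are our assumptions. The low half of a sum depends only on the low halves of the operands, so it can be produced first; the low half of a right shift needs bits from the high half.

```python
MASK16 = 0xFFFF  # assumed width of the low-order half


def low_half_of_add(a, b):
    """Low 16 bits of a 32-bit sum."""
    return (a + b) & MASK16


def low_half_of_shr(a, k):
    """Low 16 bits of a 32-bit value shifted right by k positions."""
    return (a >> k) & MASK16


# Two operands with identical low halves but different high halves.
a1, a2 = 0x00001234, 0xABCD1234
b = 0x00000001

# ADD: the low half of the result ignores the high halves entirely.
add_low_is_independent = low_half_of_add(a1, b) == low_half_of_add(a2, b)

# Right shift: high-order bits move down into the low half of the result.
shr_low_is_independent = low_half_of_shr(a1, 4) == low_half_of_shr(a2, 4)
```

Here `add_low_is_independent` is `True` and `shr_low_is_independent` is `False`, which is the kind of operation that cannot start from the low-order half alone.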
The best example of an operation that ruins this convenient processing order is the shift. In the Northwood processor this command is supposedly executed in the FPU block or, to be more exact, in a unit called the MMX-shifter.
Let’s read it once again: all shifts are performed in the MMX-shifter with a 4-tick delay. But they are sent to slow ALU, aren’t they? What, then, is the connection between slow ALU and this MMX-shifter?
We get the impression that the slow ALU is a kind of virtual unit altogether! It looks as though this unit receives tasks, transforms them into an acceptable format and then sends them to the actual execution units. In other words, it serves as a “helper and manager” rather than as a direct executor of commands.
Since different operations are executed in different units, the results turn out to be pretty interesting. Let’s consider, for instance, a shift to the left by 4 positions, i.e. multiplication by 16. This operation can be performed as a MULTIPLY command, but it will take a lot of time this way, because Northwood doesn’t have an integer multiplication unit (!) (which used to make many CPU-design gurus really furious). The operation can also be performed directly as a shift, and we already know that in this case it will take 4 clocks (8 ticks). But since not all operations require the same amount of time to execute, it makes sense to replace one type of operation with another. In particular, you can see from the example above that it is much more economical to replace the SHIFT operation requiring 4 clocks with four ADD operations, each of which adds the value to itself, because together they need only 2 clocks (4 ticks). Moreover, replacing SHIFT with ADD operations pays off for shifts of up to 7 positions: seven dependent ADDs take 3.5 clocks, which is still less than 4. In fact, Intel’s compiler used to do exactly this until the Prescott core arrived.
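The arithmetic behind this replacement can be sketched as follows. The latency constants are simply the figures quoted in the text (half a clock per dependent ADD, 4 clocks per shift on Northwood). The function names are hypothetical, and this is not what any particular compiler actually emits.

```python
ADD_CLOCKS = 0.5    # effective latency of a dependent fast-ALU ADD (from the text)
SHIFT_CLOCKS = 4.0  # Northwood shift latency quoted in the text


def shl_by_adds(x, k, width=32):
    """Shift x left by k positions using only k dependent ADDs (x + x)."""
    mask = (1 << width) - 1
    for _ in range(k):
        x = (x + x) & mask  # each ADD doubles the value, i.e. shifts by one
    return x


def adds_cost(k):
    """Clocks needed for a chain of k dependent ADDs."""
    return k * ADD_CLOCKS


def prefer_adds(k):
    """True when k ADDs are strictly cheaper than one SHIFT."""
    return adds_cost(k) < SHIFT_CLOCKS


# Largest shift distance for which the ADD chain still wins.
max_k = max(k for k in range(1, 32) if prefer_adds(k))
```

With these numbers `shl_by_adds(3, 4)` gives 48 at a cost of 2 clocks, and `max_k` comes out as 7: an 8-position shift would need 4 clocks of ADDs, which is no better than the shift itself.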