Because of the ALU structure described above, the way a program is written and compiled now matters a great deal to the CPU. Here the Pentium 4 micro-architecture departs somewhat from the traditional direction of microprocessor development, where engineers do their best to make sure that processing times do not differ significantly from one instruction to another.
The example we have just discussed illustrates well why the Pentium 4 processor turned out to be so sensitive to software optimization. Intel did not forget to offer a corresponding compiler, though, because without software optimization the Pentium 4 will simply fail to perform fast.
Now it is high time we said a few words about Prescott. We have already mentioned that it is very different from the Northwood micro-architecture. For example, the ALUs of the Prescott core are organized completely differently from all other processor units, because they use transistor logic based on differential pairs. This “differential logic” can work at much higher frequencies, which is why it was applied here. However, it requires many more transistors, generates much more heat per transistor, and generates heat even when idling.
The two architectures also process operations differently.
Firstly, the effective latency of the instructions sent to the fast ALU (including ADD instructions) has changed. It used to be half a clock, and now it is twice as long, i.e. 1 clock cycle. This means that the result computed for the low-order bits can be passed on for further processing only after one full clock. So, the above-mentioned chain of 100 ADD operations will take 100 clock cycles on Prescott, instead of 50 clock cycles on Northwood.
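The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustrative model, not a simulator: it assumes that every ADD in the chain needs the result of the previous one, so the total time is simply the chain length multiplied by the effective latency quoted in the text. The names `ADD_LATENCY` and `dependent_chain_clocks` are made up for this example.

```python
from fractions import Fraction

# Effective fast-ALU ADD latency in core clocks, as quoted in the text.
ADD_LATENCY = {
    "Northwood": Fraction(1, 2),  # half a clock
    "Prescott": Fraction(1),      # one full clock
}

def dependent_chain_clocks(core, length):
    """Clocks needed for `length` ADDs where each one waits for the previous result.

    Simplifying assumption: nothing else limits execution, so the chain
    costs exactly length * latency.
    """
    return ADD_LATENCY[core] * length

for core in ADD_LATENCY:
    print(core, dependent_chain_clocks(core, 100))
```

For a chain of 100 ADDs the model gives 50 clocks for Northwood and 100 clocks for Prescott, matching the numbers above.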
Secondly, SHIFT operations are now processed in the fast ALU. Prescott acquired a special unit that executes the SHIFT operations the previous fast ALU could not handle (i.e. shifts whose direction does not coincide with the direction from low-order bits to high-order bits). Such a shift takes one clock, and up to two shifts can be performed per clock. As you remember, a single shift used to take 4 clocks, with no more than one shift per clock, because shifts were done in the MMX-shifter unit.
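To see what the latency and throughput figures mean together, here is a rough Python sketch. It assumes a textbook pipelined-unit model (an assumption of this example, not something the article states): dependent shifts pay the full latency each time, while independent shifts are issued at the quoted rate and the last one finishes one latency after it is issued. The names `SHIFT`, `dependent_shifts` and `independent_shifts` are hypothetical.

```python
import math

# Shift latency (clocks) and peak rate (shifts per clock), as quoted in the text.
SHIFT = {
    "Northwood": {"latency": 4, "per_clock": 1},  # MMX-shifter unit
    "Prescott": {"latency": 1, "per_clock": 2},   # fast ALU
}

def dependent_shifts(core, n):
    """Clocks for n shifts where each one needs the previous result."""
    return SHIFT[core]["latency"] * n

def independent_shifts(core, n):
    """Clocks for n unrelated shifts under a simple pipeline model (an assumption)."""
    if n == 0:
        return 0
    unit = SHIFT[core]
    issue_clocks = math.ceil(n / unit["per_clock"])  # clocks spent issuing
    return issue_clocks - 1 + unit["latency"]        # last shift completes here

for core in SHIFT:
    print(core, dependent_shifts(core, 8), independent_shifts(core, 8))
```

Under these assumptions, eight dependent shifts cost 32 clocks on Northwood and 8 on Prescott, while eight independent shifts cost 11 and 4 clocks respectively. Real code will also be limited by other parts of the pipeline, which this sketch ignores.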
Thirdly, the Prescott CPU acquired an integer multiplication unit. We have already mentioned that the absence of this unit was a sore point for some CPU design gurus.
This actually leads to pretty funny consequences. It used to be much better to replace a SHIFT with ADD operations in the code, but now the situation is just the opposite. In the example above this replacement allowed completing the operation in 2 clock cycles instead of 4. But now the code optimized for the Northwood processor will take 4 clock cycles on Prescott, i.e. much longer than the non-optimized code, which requires only 1 clock. It is also remarkable that Prescott turned out to be much more similar to all other x86 processors than Northwood in terms of instruction processing times: the processing times of different instructions no longer deviate hugely from one another, as they did on Northwood.
Therefore, all the software compiled for CPUs with the Northwood core should be recompiled for Prescott-based processors.
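The reversal can be illustrated with a small Python sketch. It assumes that the earlier example replaced a left shift by 4 bits with four dependent ADDs (each doubling the value); this is inferred from the 2-clock and 4-clock figures rather than stated in this section. The latencies are the ones quoted in the text, and the names `LATENCY`, `shl_by_adds` and `clocks` are invented for the example.

```python
from fractions import Fraction

# Per-instruction latencies in core clocks, as quoted in the text.
LATENCY = {
    "Northwood": {"add": Fraction(1, 2), "shift": Fraction(4)},
    "Prescott": {"add": Fraction(1), "shift": Fraction(1)},
}

MASK32 = 0xFFFFFFFF  # model a 32-bit register

def shl_by_adds(x, bits):
    """Shift left by `bits` using only ADDs: each x + x doubles the value."""
    for _ in range(bits):
        x = (x + x) & MASK32
    return x

def clocks(core, bits):
    """Latency of one SHIFT versus a chain of `bits` dependent ADDs."""
    lat = LATENCY[core]
    return {"shift": lat["shift"], "adds": lat["add"] * bits}

# Both forms compute the same value...
assert shl_by_adds(3, 4) == (3 << 4) & MASK32

# ...but which one is faster depends on the core.
for core in LATENCY:
    print(core, clocks(core, 4))
```

With these numbers the ADD chain wins on Northwood (2 clocks against 4) and loses on Prescott (4 clocks against 1), which is why a compiler has to make this choice separately for each core.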
What were the really interesting things we learned in this chapter? We learned that the CPU has some units working at twice the core frequency (not only the fast ALU, but also a few other units); that the slow ALU is a kind of “virtual” unit; and that there is a really big difference between the Northwood and Prescott micro-architectures. Of course, you understand now that the ALUs of Pentium 4 processors are very different from the ALU of any other CPU.