Chapter IV: Arithmetic Logic Units in NetBurst Micro-Architecture
This chapter is devoted to the ALUs of the NetBurst micro-architecture, including the double-frequency ALUs.
As you remember, the major task of the Pentium 4 micro-architecture is to increase performance by raising the working frequency. To raise the frequency, the major integer operations have to be performed as fast as possible. However, the integer units differ a lot in complexity, because so do the operations they execute: some are very complex and some are simple.
Intel engineers found a very unusual but efficient way of speeding up the execution of integer operations. All uops that arrive at the ALUs are executed by two types of units: one slow [integer] ALU and two fast [integer] ALUs. The slow ALU can process quite a large number of integer operations; in particular, it deals with the most complex ones. At least this is what we thought at first, judging by its name. The reality, however, turns out to be much more sophisticated. But let’s not get ahead of ourselves.
The two fast ALUs are much more interesting and much more narrowly specialized. They are intended for simple integer operations, such as calculating the sum of two integers. But they perform this addition much faster than the slow ALU would, because they work at double the processor clock frequency. This means that a CPU with an actual core clock frequency of 3GHz has some units working at 6GHz (note that the fast ALUs are not the only units working at double the processor frequency; they are just part of a much bigger system). We would like to point out that these two units are not identical: fast ALU 0 is more universal than fast ALU 1 and can execute many more instructions.
But the most important point is the following: to make it easier to speed up this part of the CPU, each fast ALU consists of two 16-bit wide “sub-units”, which are shifted by one stage relative to one another. The addition is split into several stages, which reduces the amount of work done at each stage and so allows the ALU working frequency to be raised. Note that we are talking about the Northwood core here. The Prescott core is completely different in this respect.
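The idea of adding “by halves” can be illustrated with a minimal sketch. This is only a software model under simplifying assumptions, not a description of the actual Northwood circuitry: the function name staggered_add and the two-tick split are hypothetical, and flag generation is ignored entirely.

```python
MASK16 = 0xFFFF  # selects one 16-bit half of a 32-bit value


def staggered_add(a, b):
    """Add two 32-bit integers as two 16-bit halves, one half per tick.

    Illustrative model only: the low half is computed first, and its
    carry-out feeds the high half one tick later.
    """
    # Tick 0: the low sub-unit adds the low 16 bits and produces a carry.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16

    # Tick 1: the high sub-unit adds the high 16 bits plus that carry.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry

    # Combine the halves; a carry out of bit 31 is dropped (32-bit wrap).
    return ((high & MASK16) << 16) | (low & MASK16)


if __name__ == "__main__":
    print(hex(staggered_add(0x0001FFFF, 0x00000001)))  # 0x20000
```

Each step only has to handle a 16-bit addition, which is the point of the split: less work per stage leaves room for a shorter stage.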
As a result, the fast ALU processes numbers “by halves”. It does so not once per clock cycle but twice as often, every half cycle (these half cycles are called ticks). To be more exact, it works at its own frequency, which is twice as high as the frequency of the rest of the CPU (let me call the latter the nominal frequency). Let me remind you that when we speak of the frequency the “rest of the CPU” works at, we exclude the Trace cache (which works at half the frequency of the data cache) and a few other units working at twice the CPU frequency, just like the fast ALU. Each operation in the fast ALU is synchronized with half of the nominal processor clock. As a result, each clock of the fast ALU is only half as long as the “nominal” clock. All units working at twice the processor frequency are collectively called the Rapid Execution Engine, although there is now a tendency to use this term in a broader sense.
So, let’s sum it all up. In the Pentium 4 processor there are units that work at different frequencies at the same time: the Trace cache works at half the nominal core frequency, the Rapid Execution Engine (including the fast ALUs and the corresponding managing units) works at twice the nominal core frequency, and all the other processor units work at the nominal core frequency.
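The relationship between these three clock domains is simple arithmetic, and a short sketch makes it concrete. The function and key names below are hypothetical and chosen only for illustration; the ratios (one half, one, two) are the ones given in the text.

```python
def clock_domains(nominal_ghz):
    """Return the frequency (in GHz) of each clock domain described above."""
    return {
        "trace_cache": nominal_ghz / 2,             # half the nominal frequency
        "core": nominal_ghz,                        # nominal frequency
        "rapid_execution_engine": nominal_ghz * 2,  # twice the nominal frequency
    }


def period_ns(freq_ghz):
    """Length of one clock period in nanoseconds for a frequency in GHz."""
    return 1.0 / freq_ghz


if __name__ == "__main__":
    domains = clock_domains(3.0)
    for name, ghz in domains.items():
        print(f"{name}: {ghz} GHz, period {period_ns(ghz):.4f} ns")
```

For a 3GHz processor this gives 1.5GHz for the Trace cache and 6GHz for the Rapid Execution Engine, so one fast ALU tick lasts half of a nominal clock period.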