Prescott’s Enhanced Architecture

With the launch of the new Prescott processor, Intel took a significant step towards improving its NetBurst architecture. The picture below shows a kind of NetBurst family tree, with the improvements introduced in the new Prescott core highlighted.

Let’s discuss these improvements in a bit more detail now.

Improved Branch Predictor. Most processor delays are caused by the need to flush and refill Prescott’s long pipeline after an incorrect branch prediction. Therefore, the best way to eliminate these delays is to avoid incorrect predictions altogether. Although the branch prediction algorithm of the NetBurst architecture was very efficient from the very beginning, Intel has managed to improve it further.

The Branch Prediction unit in Intel processors with the NetBurst architecture relies on the Branch Target Buffer (BTB), a 4KB buffer that stores statistics about branches that have already been executed. In other words, Intel’s branch prediction is based on a probabilistic model: the CPU rates a given branch as likely or unlikely to be taken according to the collected statistical data. This algorithm has proved very efficient; however, it is useless when there are no statistics for a certain branch. In this case, Northwood based CPUs predicted that a “backward” branch would be taken, on the assumption that such branches most often close loops.
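To make the idea of prediction from collected statistics more concrete, here is a minimal sketch. It uses a textbook 2-bit saturating counter per branch address, which is an assumption chosen for illustration: the article does not describe the actual BTB entry format. The class name `BranchTable` and its methods are hypothetical. When a branch has no history, the sketch falls back to the backward-taken rule described above for Northwood.

```python
class BranchTable:
    """Toy history table: one 2-bit saturating counter per branch address.

    Illustrative only; not a model of the real NetBurst BTB layout.
    """

    def __init__(self):
        self.counters = {}  # branch address -> counter in the range 0..3

    def predict(self, branch_addr, target_addr):
        """Return True if the branch is predicted taken."""
        counter = self.counters.get(branch_addr)
        if counter is None:
            # No statistics yet: use the static fallback the article
            # attributes to Northwood (backward branches assumed taken).
            return target_addr < branch_addr
        return counter >= 2  # 2 and 3 mean "taken", 0 and 1 "not taken"

    def update(self, branch_addr, taken):
        """Record the real outcome once the branch has been resolved."""
        if branch_addr not in self.counters:
            # First observation: start in the weak state matching the outcome.
            self.counters[branch_addr] = 2 if taken else 1
            return
        counter = self.counters[branch_addr]
        if taken:
            self.counters[branch_addr] = min(3, counter + 1)
        else:
            self.counters[branch_addr] = max(0, counter - 1)


table = BranchTable()
first_guess = table.predict(0x400, 0x3F0)   # backward branch, no history yet
table.update(0x400, taken=False)            # the branch was actually not taken
second_guess = table.predict(0x400, 0x3F0)  # history now overrides the fallback
```

In this sketch the first prediction comes from the static rule, while the second one comes from the recorded outcome, which is the distinction the paragraph above draws.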

This branch prediction algorithm has been significantly improved in the new Prescott core. Now, if there are no statistics for a certain branch, the branch prediction unit no longer assumes a fixed direction. Since backward branches usually span no more than a certain empirically determined distance, the unit bases its decision on the branch distance in each particular case.
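The difference between the two fallback rules can be sketched as follows. The threshold value, the function names and the treatment of forward branches as not taken are all assumptions made for this illustration; the article gives neither the actual distance nor the forward-branch rule.

```python
# Hypothetical threshold in bytes; the article does not state the real value.
LOOP_DISTANCE_THRESHOLD = 1024


def northwood_style_static(branch_addr, target_addr):
    """Predict taken for every backward branch, regardless of distance."""
    return target_addr < branch_addr


def prescott_style_static(branch_addr, target_addr,
                          threshold=LOOP_DISTANCE_THRESHOLD):
    """Predict taken only for backward branches within the threshold."""
    if target_addr >= branch_addr:
        # Forward branch: assumed not taken (an assumption of this sketch).
        return False
    distance = branch_addr - target_addr
    return distance <= threshold


short_jump = (0x5000, 0x4FF0)  # 16 bytes back, typical of a tight loop
long_jump = (0x5000, 0x1000)   # 16 KB back, unlikely to be a loop edge

old_rule = [northwood_style_static(*short_jump), northwood_style_static(*long_jump)]
new_rule = [prescott_style_static(*short_jump), prescott_style_static(*long_jump)]
```

Under these assumptions the two rules agree on the short backward jump and differ only on the long one, where the distance-based rule no longer predicts the branch as taken.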

Moreover, the dynamic branch prediction algorithm has also been slightly improved. The Prescott processor has acquired an indirect branch predictor, which was first used in Pentium M processors and proved highly efficient there.

So, while Northwood based processors averaged 0.86 incorrect predictions per 100 instructions, the new Prescott averages a lower 0.75 per 100 instructions. In other words, we get about 12% fewer incorrect branch predictions, which leads to fewer delays caused by the need to flush and refill the execution pipeline.
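The percentage follows directly from the two misprediction rates. A quick check, with a hypothetical helper name, shows that the relative reduction comes to a little under 13%, which the article quotes as 12%:

```python
def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


northwood_rate = 0.86  # incorrect predictions per 100 instructions
prescott_rate = 0.75   # incorrect predictions per 100 instructions

reduction = relative_reduction(northwood_rate, prescott_rate)
print(f"Relative reduction in mispredictions: {reduction:.1%}")
```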

Faster Instruction Execution. The new processor core has the same number of integer ALUs: two fast ALUs working at double the core frequency for simple instructions, and one more ALU for complex instructions. However, some instructions are now processed much faster. Prescott owes this performance increase to a few changes introduced in the ALUs.

First of all, Intel added a shifter/rotator unit to one of the fast ALUs, which now handles all shift and rotate instructions. As a result, these instructions are performed much faster, because previous Pentium 4 processors treated them as complex instructions and processed them in the slow ALU.

Integer multiplication is also faster on Prescott processors. In previous versions of Intel’s NetBurst architecture, integer multiplication was performed by the FPU, which required the operands to be converted into floating-point format and the result converted back to integer format. In the Prescott processor, integer multiplication is performed by the integer ALU, which is considerably faster.

According to the measurements, shifts and rotations are now performed at least 4 times faster, while integer multiplication is 25% faster. However, we should keep in mind that the longer pipeline and the different L1 cache algorithms have affected the time required to process other simple instructions. Many instructions that used to take about half a clock cycle now take an entire clock cycle, which is why it wouldn’t be correct to claim an overall improvement in ALU performance.
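A rough way to see why no overall figure can be given is to weight these per-instruction changes by an instruction mix. The sketch below is only an illustration: the two mixes are invented, "25% faster" is read as 1.25 times the speed, the comparison is in clock cycles at equal frequency, and pipelining effects are ignored.

```python
# Time per instruction on Prescott relative to the previous core (1.0 = same).
RELATIVE_TIME = {
    "shift_rotate": 1 / 4,     # "at least 4 times faster"
    "int_multiply": 1 / 1.25,  # "25% faster", read as 1.25x the speed
    "simple_alu": 2.0,         # half a clock before, a full clock now
}


def relative_runtime(mix):
    """Return new time / old time for a given instruction mix.

    `mix` maps an instruction class to its share of the old execution
    time; the shares must add up to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must add up to 1")
    return sum(share * RELATIVE_TIME[name] for name, share in mix.items())


# Two hypothetical workloads, invented purely for illustration.
shift_heavy = {"shift_rotate": 0.6, "int_multiply": 0.2, "simple_alu": 0.2}
simple_heavy = {"shift_rotate": 0.1, "int_multiply": 0.1, "simple_alu": 0.8}

shift_heavy_ratio = relative_runtime(shift_heavy)
simple_heavy_ratio = relative_runtime(simple_heavy)
```

With these made-up mixes the first ratio is below 1 (the code runs faster) and the second is above 1 (the code runs slower), so the net effect depends entirely on what the code actually does.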

 