Part 4: PowerPC970 as a Fortune-Teller
First, let’s discuss a unit that every modern processor features, irrespective of its architecture: the Branch Prediction Unit (BPU). This unit is necessary because every long program contains conditional branches of various kinds. There is another fact about modern processors: all of them use pipelines as a means of increasing the operating frequency and also of increasing the share of transistors in the processor that work simultaneously. In other words, each program instruction moves along a pipeline, acquiring on the way data read from memory, results of other instructions, and various additional properties and pointers. As a result, at any given moment the entire pipeline is usually busy processing different instructions at its various stages. This would be a perfect setup if it were not for those unfortunate branches. If the program flow jumps to another stretch of code (or a jump is merely expected), the processor does not know which instructions to fetch next until the branch is resolved, so the pipeline interrupts its smooth operation and the performance of the microprocessor drops considerably. The engineers came up with a solution: a unit whose main purpose is to “guess” the most probable direction of the jump.
If the “guess” is right, we get our reward: the pipeline keeps running continuously at the maximum possible workload. If the “guess” is wrong, we get our punishment: the pipeline stalls, the buffers are all cleared, and the correct program branch is loaded. Of course, the penalty for a wrong guess (in processor clock cycles) may be greater than the gain from a correct one. What saves the performance is that correct guesses are the overwhelming majority. As a rule, developers do their best to bring their processor as close to 100% correct predictions as possible in the majority of real-world algorithms. Most modern processors have a prediction accuracy of about 90% or higher! So overall, this method pays off, although sometimes it is still necessary to clear the pipeline and fill it up again. On the other hand, improving the prediction accuracy makes it affordable to lengthen the pipeline, which favorably affects the CPU frequency (all other factors being equal).
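To put the trade-off between prediction accuracy and misprediction penalty into rough numbers, here is a minimal sketch of a back-of-the-envelope model. The function name, the one-cycle base cost, and the 16-cycle penalty are assumptions chosen purely for illustration (the penalty is taken to be about as long as a 16-stage pipeline); they are not measured figures for any real processor.

```python
def average_cycles_per_branch(accuracy, penalty_cycles, base_cycles=1.0):
    """Expected cost of one branch under a deliberately simple model.

    accuracy       -- fraction of branches predicted correctly (0.0 to 1.0)
    penalty_cycles -- extra cycles lost on each wrong guess (assumed value)
    base_cycles    -- cost of a correctly predicted branch (assumed value)
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Every branch pays the base cost; only mispredicted ones pay the penalty.
    return base_cycles + (1.0 - accuracy) * penalty_cycles


if __name__ == "__main__":
    # Hypothetical penalty of 16 cycles, roughly one full pipeline refill.
    for accuracy in (0.50, 0.90, 0.95, 0.99):
        cost = average_cycles_per_branch(accuracy, penalty_cycles=16)
        print(f"accuracy {accuracy:.0%}: {cost:.2f} cycles per branch")
```

Under this model the average cost grows in proportion to both the share of wrong guesses and the penalty, which is why a longer pipeline is only worth having when the predictor is accurate enough.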
Modern compilers use the branch prediction feature of the processor as a reserve for improving program performance. They shape the code in such a way as to increase the branch prediction accuracy (compiler developers know well how to use this feature of the processors they write the compiler for). By the way, note that the PowerPC970 has a longer pipeline than its progenitor, the Power4 (16 stages against 12 stages; SIMD/FPU instructions may take as many as 25 stages!). As I have mentioned above, this was another trick to increase the frequency of the PowerPC970 (and to close the gap in this parameter, as Mac processors used to lag behind modern x86 processors in frequency).
But let’s return to the branch prediction unit. Let’s first see what the flagship models from AMD and Intel have in this area. The Pentium 4 processor uses a branch history table (BHT) of 4096 (4K) entries; its algorithm is based on the history of branching. In other words, branching statistics are accumulated, and the most probable branch direction (according to the accumulated statistical data) gets the advantage.
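To make the idea of a history-based table more concrete, here is a minimal sketch of a classic textbook scheme: a table of 2-bit saturating counters indexed by the low bits of the branch address. This is an assumption made for illustration only; the actual Pentium 4 algorithm is not described here and is certainly more elaborate. The class name, the method names, the loop workload, and the address 0x4000 are all hypothetical; only the table size of 4096 entries comes from the text above.

```python
class BranchHistoryTable:
    """Toy BHT: one 2-bit saturating counter per entry (textbook scheme)."""

    def __init__(self, entries=4096):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        # Counter values: 0-1 predict "not taken", 2-3 predict "taken".
        # Start every entry at 1 (weakly not taken).
        self.counters = [1] * entries

    def predict(self, address):
        """Return True if the branch at this address is predicted taken."""
        return self.counters[address & self.mask] >= 2

    def update(self, address, taken):
        """Record the real outcome, saturating the counter at 0 and 3."""
        index = address & self.mask
        if taken:
            self.counters[index] = min(3, self.counters[index] + 1)
        else:
            self.counters[index] = max(0, self.counters[index] - 1)


def simulate_loop_branch(iterations=10, repeats=100, address=0x4000):
    """Prediction accuracy for the closing branch of a repeated loop."""
    bht = BranchHistoryTable()
    correct = total = 0
    for _ in range(repeats):
        for i in range(iterations):
            # The loop-back branch is taken on every pass except the last.
            taken = i < iterations - 1
            if bht.predict(address) == taken:
                correct += 1
            bht.update(address, taken)
            total += 1
    return correct / total


if __name__ == "__main__":
    print(f"accuracy on the loop branch: {simulate_loop_branch():.1%}")
```

In this toy run the predictor is right on roughly nine branches out of ten: it misses the loop exit each time, plus the very first iteration while the counter is still warming up. Because the table is indexed by only the low address bits, two different branches can share one entry, which is one reason real designs are more sophisticated than this sketch.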