Let me now explain the operation of the second table. Let’s see how a random code sequence is executed. Each time we meet a branch instruction, we just mark what happened: whether the jump was really needed or not. We don’t need the instruction address for now, only the result. As an outcome, we get a sequence of “yes” and “no” answers. Now we can try guessing the next branch. Let’s take 8 last results. There are 2^8=256 combinations possible, so we need a special bit array for 256 elements. When we receive a new result, we write it into the array element, which corresponds to our history. For example, if we have a “history” like “yes-no-yes-no-no-no-no-no” and we find out during the code execution that there is no need in the next jump, we put 0 (“no”) into the array element number 5 (5 = 00000101 in the binary system). Next time we meet the same historical combination, we use the record. It is easily seen that this table would adjust itself nicely to short, but exotic branch sequences.
So, we have seen the main difference of this prediction method from the first one. In the first variant, we follow each branch instruction, independent of its connection to the others. It is all done the opposite way in the second variant: we make predictions basing on a sequence of results, without tying it up to any definite instruction. That’s what the names of the tables – local and global – stand for.
We can introduce an improvement to the method: we can try to enhance the latter idea to avoid dealing the same way with all instructions. To do this, we need a bigger table first. Second, we shouldn’t use the global counter alone to access an element of the table, but rather the counter with, say, the address of the current instruction. Thus, the Athlon XP extracts 4 address bits, adds 8 bits of the counter (the data on the last 8 branches) and gets the element index in the global history bimodal counter table (GHBC). The PowerPC970 uses an 11-bit counter, combines it with the address bits (by a logical operation rather than simple addition) and gets a 14-bit address. By the way, there is one more important difference: the Athlon XP (Athlon 64) has a table with 2 bits per entry, not 1 bit as with PowerPC970. 2 bits give us more flexibility as the entry can be not only “yes” or “no”, but also “perhaps yes”, “perhaps no”. But the PowerPC970 has three tables!
The gist of the branch prediction unit of the PowerPC970 processor is that there is a third 16K buffer (!) that analyzes which of the two prediction systems has proven to be more efficient over a given period of time, that is, which of them has a lower percent of wrong predictions. As a result, the processor can adapt itself to the environment in a short time and switch to the prediction algorithm, which is more efficient under the current conditions! It’s really sad that this most exciting solution is rarely mentioned in press. The information above was collected from the datasheets on the Power4 – it seems like PowerPC970 inherited this system from its predecessor.
My resume: it is probable that PowerPC970 has the best branch prediction unit among all modern processors. And we are looking forward to the Power5 where this unit should be enhanced further!