The latest microprocessor from AMD, Opteron (and Athlon 64), uses a buffer for 16K entries (!). That is, the branch history table is four times bigger than that of Pentium 4 (and the same four times bigger than that of AMD’s previous processor, Athlon XP). This helped to increase the predictions precision considerably (as far as I know, it is over 95% now). The improvement from 90% to 95% doesn’t seem to be a significant one, but look at the situation from the opposite point of view. 90% correct predictions means 10% wrong predictions, and 95% correct predictions means 5% errors. A double reduction of errors is a significant thing, don’t you think so?
How good are the current and previous Mac microprocessors at this kind of fortune-telling? Let’s start with the older one, the G4+. It was humble enough: a 2KB branching history buffer, and a BTB for 128 entries. On the other hand, the G4+ has a pipeline of 7 stages only, so it doesn’t actually need a powerful branch prediction unit as desperately as modern processors do – it suffers a much milder penalty per an incorrect branch guess. PowerPC970 has it another, much more interesting way.
First of all, let’s recall that it is a relative to the high-end Power4 processor. PowerPC970 has a sophisticated branch prediction mechanism. It is the most efficient, too, I only regret IBM never revealed their estimates of the prediction precision. At least, I couldn’t find them anywhere. First of all, the PowerPC970 is interesting for its system of two branch prediction units that work constantly. Moreover, they work simultaneously.
The first unit uses a traditional branch history buffer for 16KB of branch entries. The entries in the table denote the absence/presence of a branch and the correctness of the prediction. The next prediction is made by analyzing this information.
Let me explain it in more detail to you, as we will need this information later. That’s how the first table works. Ideally, a 1-bit element in the table should correspond to each branch instruction. The number of the element can be inferred from the instruction address, but since the size of the table is limited, the element is accessed by means of some bits of the 64-bit (32-bit) address, rather than by the entire address. That is, 14 bits should be extracted for a 2^14 table. Of course, we may run into a situation when one and the same element corresponds to several branch instructions, but that’s not important: these instructions are more likely to be performed in different periods of time (which will be pretty far from one another), so overall this “reassignment” will be invisible. The element 1 bit long can only store information about whether or not the jump in the program flow happened last time. This kind of a counter would be wrong only twice – for example, when the jump happens only once in 10 passes. If the jump should take place every second pass, this counter will always be wrong.
The second working scheme uses a table of the same size, 16KB. However, the second branch table is global, while the first one is local. Besides that, each entry in the second table is associated with an 11-bit counter (there is only one such counter for the whole processor). This counter marks the branch direction chosen in the previous 11 times when the instruction group was selected from the L1 cache (the load unit loads 8 instructions at once from the L1 instruction cache) and also remembers if the prediction was correct. This information helps to predict the outcome of the next branching.