Each clock cycle, the K8 processor fetches an aligned 16-byte block of instructions from the L1 instruction cache into a buffer, where the instructions are extracted from the block and sent to the decoder’s channels. A fetch rate of 16 bytes per cycle is enough to supply 3 instructions with an average length of up to 5 bytes for decoding on each cycle. In certain algorithms, however, the average instruction length in a chain may exceed 5 bytes.
In particular, a simple SSE2 instruction with register-register operands (for example, MOVAPD XMM0, XMM1) is 4 bytes long. If an instruction uses indirect addressing (with a base register and an offset, as in MOVAPD XMM0, [EAX+16]), its length increases to 6-8 bytes. In 64-bit mode, a one-byte REX prefix is added to the instruction code when the additional registers are used. Thus, SSE2 instructions may be as long as 7-9 bytes in 64-bit mode. An SSE1 instruction may be 1 byte shorter if it is a vector one (that is, operates on four 32-bit values), but a scalar one (operating on a single 32-bit value) is also 7-9 bytes long under the same conditions.
In this situation, a fetch rate of 16 bytes per cycle doesn’t seem fast enough to sustain decoding at 3 instructions per cycle. This limitation is not important for the K8 processor because vector SSE and SSE2 instructions are decoded at a rate of 3 instructions per 2 clock cycles (or 1.5 instructions per cycle), which is enough to load the two 64-bit FPUs. In the future processor, a rate of at least 3 instructions per cycle must be maintained. Considering this, fetching in 32-byte blocks, announced in the presentation of the K8L’s architectural innovations, doesn’t seem excessive. If a sequence of such long instructions spans several adjacent 16-byte blocks, fetching 16 bytes per cycle cannot sustain an average rate of 3 instructions per cycle.
Figure 2 illustrates the positioning of five long instructions in a 32-byte block which can be fetched in one clock cycle. If the instructions are fetched in 16-byte blocks, it is impossible to achieve a fetch rate of 3 instructions per cycle.
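The arithmetic behind this can be sketched in a few lines of Python. This is a simplified model, not a description of the real fetch hardware: it assumes the instructions are packed back to back starting at a block boundary and that exactly one aligned block arrives per cycle. The function names and the five lengths (6, 7, 6, 7 and 6 bytes) are hypothetical, chosen only so that the instructions fill 32 bytes; they are not taken from Figure 2.

```python
def fetch_cycles(lengths, block_size, start=0):
    """Cycles needed to fetch all instructions, one aligned block per cycle."""
    end = start + sum(lengths) - 1  # offset of the last instruction byte
    return end // block_size - start // block_size + 1


def instructions_per_cycle(lengths, block_size, start=0):
    """Average number of instructions delivered per fetch cycle."""
    return len(lengths) / fetch_cycles(lengths, block_size, start)


# Hypothetical instruction lengths in bytes; together they fill 32 bytes.
long_instructions = [6, 7, 6, 7, 6]

rate_16 = instructions_per_cycle(long_instructions, 16)  # spans two 16-byte blocks
rate_32 = instructions_per_cycle(long_instructions, 32)  # fits one 32-byte block
print(rate_16, rate_32)
```

Under these assumptions, 16-byte fetching delivers 2.5 instructions per cycle, short of the 3 the decoder could accept, while 32-byte fetching delivers all five instructions in a single cycle.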
By the way, Conroe processors fetch instructions in 16-byte blocks, just like K8 processors do, so they can decode the instruction stream at a rate of 4 instructions per clock only when the average instruction length is no more than 4 bytes. With longer instructions the decoder cannot sustain even 3 instructions per clock, let alone 4. To counter this in short loops, the Conroe has a special 64-byte internal buffer that caches loops up to 64 bytes long (four 16-byte blocks) and allows instructions in such loops to be fetched at a rate of 32 bytes per cycle. If a loop is longer than 4 blocks, it cannot be cached in this buffer.
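The relation between fetch width, average instruction length and decode rate can be written as a simple upper bound. The sketch below is a rough model only: it ignores block alignment, branches and decoder restrictions, and the function name and the 6-byte example length are assumptions made for illustration.

```python
def sustained_decode_rate(avg_length, fetch_bytes, decode_width):
    """Upper bound on instructions per cycle: limited either by the
    decoder width or by how many average-length instructions the
    fetched bytes can hold."""
    return float(min(decode_width, fetch_bytes / avg_length))


# Conroe-like parameters from the text: 16-byte fetch, 4-wide decode.
print(sustained_decode_rate(4, 16, 4))  # 4-byte average: full decode rate
print(sustained_decode_rate(6, 16, 4))  # 6-byte average: fetch-limited, below 3
print(sustained_decode_rate(6, 32, 4))  # 32 bytes per cycle from the loop buffer
```

With a 6-byte average length, the bound at 16 bytes per cycle is about 2.67 instructions per cycle. At 32 bytes per cycle the decoder width becomes the limit again.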
If the instruction stream contains branch instructions, the next block to fetch is chosen by the branch prediction mechanism. The K8 processor predicts branches with simpler algorithms than those employed in the Conroe. For example, the K8 cannot predict alternating indirect branches (this may have a negative effect on the execution of object-oriented polymorphic code), and it doesn’t always correctly predict branches that follow regular patterns. The branch prediction mechanism will be improved in the K8L, but there’s no detailed info about that yet. The branch tables and counters will probably be made larger, and the algorithm for predicting branches that alternate in regular patterns may be improved.
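To see why alternating indirect branches are hard for a simple predictor, consider a generic "last target" scheme that always predicts the target the branch jumped to the previous time. This is an illustrative model assumed for the example, not a documented description of the K8’s predictor; the function name and target labels are hypothetical.

```python
def count_mispredictions(targets):
    """Count misses of a predictor that always guesses the previous target."""
    predicted = None  # nothing is known before the first execution
    misses = 0
    for actual in targets:
        if predicted != actual:
            misses += 1
        predicted = actual  # remember only the most recent target
    return misses


# One call site that always reaches the same method (monomorphic).
same_target = ["A"] * 8
# One call site alternating between two implementations (polymorphic).
alternating = ["A", "B"] * 4

print(count_mispredictions(same_target))  # only the first, cold execution misses
print(count_mispredictions(alternating))  # every execution misses
```

In this model the alternating sequence defeats the predictor on every execution, whereas a predictor that also takes recent branch history into account could learn the pattern.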