The translation itself works much like the decoder algorithm of the Athlon processor. All instructions fall into two categories. The first group, called cracked instructions, includes instructions that are translated into two simpler operations (IBM calls them IOPs). The second group, millicoded instructions, is decoded into more than two IOPs. Every clock cycle the PowerPC970 can send a group of five IOPs to its execution units. Most micro-operations occupy slots 0 to 3 in the group. Slot 4 is reserved for branch operations. If there are no operations to occupy the preceding slots, or too few of them, the decoder fills the gaps with so-called NOPs (No Operation). A NOP is literally an instruction to “do nothing”.
Besides that, there are certain restrictions on where micro-operations may sit inside a group. After a cracked instruction is translated into two IOPs, both IOPs must be placed in the same group. If they do not fit, the decoder pads the current group with NOPs and starts a new one. Millicoded instructions always start a new group. The instruction that follows a millicoded instruction also starts a new group.
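The grouping rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IBM's actual logic: the `Instr` type, the `form_groups` function and the instruction names are hypothetical, a branch is assumed to go into slot 4 and close its group, and the group is assumed to be closed after a millicoded instruction.

```python
from dataclasses import dataclass

NOP = "nop"
REGULAR_SLOTS = 4  # slots 0..3 hold ordinary IOPs
GROUP_SIZE = 5     # slot 4 is reserved for a branch


@dataclass
class Instr:
    name: str
    iops: int = 1            # 1 = simple, 2 = cracked, >2 = millicoded
    is_branch: bool = False


def form_groups(instrs):
    """Pack instructions into five-slot groups (simplified model)."""
    groups, cur = [], []

    def flush():
        # Close the current group, padding unused slots with NOPs.
        nonlocal cur
        if cur:
            groups.append(cur + [NOP] * (GROUP_SIZE - len(cur)))
            cur = []

    for ins in instrs:
        if ins.is_branch:
            # Assumption: pad slots 0..3, put the branch in slot 4, close.
            cur = cur + [NOP] * (REGULAR_SLOTS - len(cur)) + [ins.name]
            flush()
            continue
        millicoded = ins.iops > 2
        if millicoded or len(cur) + ins.iops > REGULAR_SLOTS:
            flush()  # millicoded starts a new group; cracked must not be split
        for i in range(ins.iops):
            if len(cur) == REGULAR_SLOTS:
                flush()  # a long millicoded sequence spills into more groups
            cur.append(f"{ins.name}.{i}" if ins.iops > 1 else ins.name)
        if millicoded:
            flush()  # assumption: the next instruction starts a new group
    flush()
    return groups


# Hypothetical instruction stream: three simple ops, one cracked op, one branch.
program = [Instr("a"), Instr("b"), Instr("c"), Instr("crk", iops=2),
           Instr("br", is_branch=True)]
for group in form_groups(program):
    print(group)
```

In this example the cracked instruction does not fit into the one remaining regular slot of the first group, so that group is padded with NOPs and both of its IOPs land together in the second group, followed by the branch in slot 4.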
These nuances and limitations closely resemble the decoder of the Athlon XP processor. The Athlon 64, as we know, has a significantly improved decoder. So this unit of the PowerPC970 should be considered adequate, although it is not the best; at least, we know of better instruction decoder implementations.
Besides this very strange (for a RISC processor) decoder unit, the PowerPC970 boasts bigger buffers along its pipelines (by the way, at the decoding stages the processor not only translates cracked instructions, but also resolves dependencies and forms groups of micro-operations). Right now, IBM (and Apple) say they are proud that the PowerPC970 can have as many as 215 instructions at different pipeline stages at the same time (“on the fly”). Apple says this is much more than the Pentium 4 can handle, as it has a “window” of only 126 instructions. The G4+ processor has only 16 “on the fly” instructions. I will explain later that the PowerPC970 doesn’t have any big advantage here; the total number of “on the fly” instructions is very similar for the PowerPC970, Athlon 64 and Pentium 4.
About half of these instructions (IOPs) are stored in a buffer called the Group Completion Table. This is a functional analog of the Reorder Buffer: it can store up to 20 formed groups of micro-instructions (that is, about 100 IOPs) that are waiting to be sent for execution. Note that everything up to this point happens in the order set by the program code. The micro-instructions are then sent to the execution units as soon as they are ready, regardless of their sequential order in the program. “Out-of-order” execution happens only here! As soon as the functional unit “confirms” that the operation is being performed successfully, its place in the queue is freed. Note that:
- this can happen before the functional unit has finished executing the micro-instruction;
- the Group Completion Table keeps tracking the micro-instruction thereafter.
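To make the bookkeeping more concrete, here is a minimal Python sketch of a table with the capacity quoted above (20 groups of 5 IOPs, i.e. 100 IOPs). The class and method names are hypothetical, and the sketch assumes that, like a conventional reorder buffer, the table releases whole groups oldest first once all their IOPs have finished; the text above does not spell out that detail.

```python
from collections import deque

GCT_ENTRIES = 20     # groups the table can hold (figure from the text)
SLOTS_PER_GROUP = 5  # IOPs per group (figure from the text)


class GroupCompletionTable:
    """Toy model: in-order allocation, out-of-order finish, in-order release."""

    def __init__(self):
        self.groups = deque()  # oldest group sits on the left

    def dispatch(self, group_id):
        """Allocate an entry in program order; refuse when the table is full."""
        if len(self.groups) >= GCT_ENTRIES:
            return False
        self.groups.append({"id": group_id, "pending": SLOTS_PER_GROUP})
        return True

    def finish_iop(self, group_id):
        """Record that one IOP of the given group has finished executing."""
        for group in self.groups:
            if group["id"] == group_id and group["pending"] > 0:
                group["pending"] -= 1
                return
        raise KeyError(group_id)

    def retire(self):
        """Release finished groups, oldest first (assumed behaviour)."""
        released = []
        while self.groups and self.groups[0]["pending"] == 0:
            released.append(self.groups.popleft()["id"])
        return released


gct = GroupCompletionTable()
for gid in range(GCT_ENTRIES):
    gct.dispatch(gid)
print(GCT_ENTRIES * SLOTS_PER_GROUP)  # capacity in IOPs
```

In this model a younger group whose IOPs all finish early still waits in the table until every older group has been released, which is what keeps the final results in program order.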