When an entire group of micro-instructions has been executed, and all older groups have been executed as well, the processor commits the final results and the group's entry in the Group Completion Table is freed. Moreover, if the Group Completion Table is full, the decoder stops decoding instructions and forming new groups until an entry becomes free. So, to evaluate the capabilities of a processor we need to know two things: the size of the Group Completion Table and the depth of the issue queues. Let me clarify it once again: the size of the Group Completion Table roughly indicates the maximum size of a contiguous block of instructions (as if cut out of the program) that the processor can be working on at a given moment. More precisely, it is the maximum number of in-flight micro-instructions into which the instructions of that contiguous part of the program are translated. The queue depths give the maximum number of micro-instructions from which instructions are selected for out-of-order execution. For simplicity, I assume here that one program instruction corresponds to one micro-operation. Other features of the micro-architecture must be taken into account as well. For example, a processor with a long pipeline needs a larger Group Completion Table, since each instruction stays in flight longer.
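To make the in-order completion rule more concrete, here is a minimal Python sketch of a completion table. It is a toy model under stated assumptions, not the PowerPC970's actual logic: the names CompletionTable, dispatch, finish and retire are hypothetical, the capacity is counted in groups, and execution latency is not modeled. Micro-ops may finish in any order, but a group is retired only when it is complete and every older group has already retired; when all entries are taken, dispatch fails and the decoder has to stall.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Group:
    """A group of micro-ops dispatched together (hypothetical model)."""
    gid: int
    pending: set  # indices of micro-ops that have not finished executing yet


class CompletionTable:
    """Toy in-order completion table with a fixed number of group entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # oldest group on the left
        self.next_gid = 0

    def is_full(self):
        return len(self.entries) >= self.capacity

    def dispatch(self, n_uops):
        """Allocate an entry for a new group; return its id, or None if the decoder must stall."""
        if self.is_full():
            return None
        group = Group(self.next_gid, set(range(n_uops)))
        self.entries.append(group)
        self.next_gid += 1
        return group.gid

    def finish(self, gid, uop):
        """Mark one micro-op as executed (this may happen out of order)."""
        for group in self.entries:
            if group.gid == gid:
                group.pending.discard(uop)
                return
        raise KeyError(gid)

    def retire(self):
        """Commit completed groups from the head only; return the ids of retired groups."""
        retired = []
        while self.entries and not self.entries[0].pending:
            retired.append(self.entries.popleft().gid)
        return retired


# Usage: a table with room for only two groups.
gct = CompletionTable(capacity=2)
g0 = gct.dispatch(3)
g1 = gct.dispatch(2)
print(gct.dispatch(1))  # None: table is full, the decoder stalls
gct.finish(g1, 0)
gct.finish(g1, 1)
print(gct.retire())     # []: g1 is done, but the older g0 is not
for u in range(3):
    gct.finish(g0, u)
print(gct.retire())     # [0, 1]: both groups retire in program order
print(gct.dispatch(1))  # 2: an entry is free again
```

The deque keeps groups in program order, so retiring only from its left end is what enforces in-order completion in this sketch.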
Now for a slight digression from the main topic: marketing tricks are at play here, too. The “126 in-flight instructions” that Apple attributes to the Pentium 4 processor refer not to the entire pipeline but to the Reorder Buffer. A quote: “The Allocator allocates a Reorder Buffer (ROB) entry, which tracks the completion status of one of the 126 uops that could be in flight simultaneously in the machine”. So it would be more correct to compare the “width” of this window with the Group Completion Table of the PowerPC970 processor.
The same correction applies to the Athlon 64 processor: the comparison should be made with its reorder buffer, which holds 72 macro-operations (“The reorder buffer allows the instruction control unit to track and monitor up to 72 in-flight macro-ops (whether integer or floating-point) for maximum instruction throughput”). Note also that this number refers to macro-operations; since a macro-operation can contain up to two micro-operations, it ideally corresponds to 144 micro-operations, although in the real world the figure is closer to 72. Compare this with about a hundred instructions in the Group Completion Table of the PowerPC970. As far as the number of simultaneously processed instructions is concerned, the PowerPC970 processor cannot boast any exceptional capabilities, quite contrary to what Apple says.

Register renaming also requires a lot of space on the chip. As you probably know, this procedure allows several instructions that work with one and the same register to be processed at the same time. Every instruction receives a register with the name it expects, while the real physical register behind that name is a different matter, chosen by the processor. Nearly all modern processors perform register renaming. It should also be mentioned that the total number of internal “rename registers” must be proportional to the total number of instructions in flight.
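The renaming idea can be sketched in a few lines of Python, assuming a simple scheme with a map table and a free list of physical registers (a common textbook approach, not a description of how any of the processors above implements it). RenameTable, rename and release are hypothetical names. Two consecutive writes to r1 receive different physical registers, and with only two spare physical registers at most two renamed instructions can be in flight, which illustrates why the number of rename registers has to grow with the number of instructions in flight.

```python
from collections import deque


class RenameTable:
    """Toy register renamer: architectural names -> physical registers (hypothetical model)."""

    def __init__(self, arch_regs, n_physical):
        if n_physical < len(arch_regs):
            raise ValueError("need at least one physical register per architectural register")
        # Initially each architectural register maps to its own physical register.
        self.map = {name: i for i, name in enumerate(arch_regs)}
        self.free = deque(range(len(arch_regs), n_physical))

    def rename(self, dest, srcs):
        """Rename one instruction 'dest <- op(srcs)'.

        Returns (new_phys_dest, phys_srcs, old_phys_dest), or None when no
        physical register is free and renaming (hence dispatch) must stall.
        """
        if not self.free:
            return None
        phys_srcs = [self.map[s] for s in srcs]  # read source mappings before updating dest
        old = self.map[dest]
        new = self.free.popleft()
        self.map[dest] = new
        return new, phys_srcs, old

    def release(self, phys):
        """Free a physical register once the instruction that replaced its mapping retires."""
        self.free.append(phys)


# Usage: three architectural registers backed by five physical ones.
rt = RenameTable(["r1", "r2", "r3"], n_physical=5)
a = rt.rename("r1", ["r2", "r3"])  # r1 = r2 + r3
b = rt.rename("r1", ["r1"])        # r1 = r1 * 2, reads the result of the first instruction
print(a, b)                        # (3, [1, 2], 0) (4, [3], 3)
print(rt.rename("r2", ["r1"]))     # None: out of physical registers, renaming stalls
rt.release(a[2])                   # first instruction retires, old r1 (p0) is freed
print(rt.rename("r2", ["r1"]))     # (0, [4], 1)
```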