The decoded triplets of macro-ops arrive at the instruction control unit (ICU), which puts information about them into the reorder buffer (ROB), and are then transferred to the schedulers. The ROB keeps track of the state of the macro-ops and controls the order of their retirement. The macro-ops come into the queues and are retired in groups of three (in lines) in the same manner as they have arrived at the ICU, but are received by the schedulers and are dispatched to the execution units independently.
The macro-ops from each group are distributed among the three independent queues of the scheduler, 8 elements each (24 macro-ops in total), assigned to three symmetrical integer channels. The queue number corresponds to the position of the macro-op in the group as it was shaped on the decoder’s output. As soon as the data is ready, the scheduler can dispatch one integer operation to the ALU and one address operation to the AGU from each queue. There can be two simultaneous memory accesses at most. Thus, each clock cycle 3 integer operations and 2 memory operations (64-bit reads and writes in any combination) can be dispatched. Integer operations are dispatched from the queues out of order as soon as the data is ready for them. But the load from memory operation is performed in program order, for example:
add ebx, ecx ;
mov eax, [ebx+10h] ; quick address calculation
mov ecx, [eax+ebx] ; the address depends on the result of the previous instruction
mov edx, [ebx+24h] ; this instruction won’t be executed until the addresses of all the
; previous instructions are calculated
This is one of the limiting factors in the K8 processor. It is because of it that the K8, although can dispatch two read instructions per clock, may be less efficient with memory than the Conroe, which can only dispatch one read instruction per clock but has a mechanism for speculative out-of-order execution of read instructions bypassing previous reads and writes (called Memory Disambiguation). Fortunately, a mechanism for out-of-order loads will appear in the K8L, and this bottleneck will be eliminated. Details of this mechanism are not yet disclosed, but the reordering of read instructions will not probably affect write instructions, and this may become a reason for less efficient execution of some types of code.
A group of three macro-ops is removed from the ROB after all the instructions from this group are executed. The queuing and removal of macro-ops in groups simplifies control over the resources and helps load the schedulers in a more efficient way. If one of the three queues is fully loaded, new triplets of macro-ops cannot arrive at the scheduler and empty slots may appear in the other queues. However, there is a small percent of free slots in practice, and they do not worsen the CPU efficiency much.
Besides that, there can theoretically occur a certain reduction in the scheduler efficiency due to the static linking of the position of a macro-op in the group to the scheduler’s queue because one queue can have two or more micro-ops ready for execution, and another can have none (Figure 3). But this is not very probable in practice and is not frequently observed since there are usually quite enough of execution-ready instructions in the pipeline.
Unlike in the K8, there is a common queue for all instructions, including floating-point ones, in the Conroe. The queue length is 32 macro-ops. The common queue theoretically helps avoid empty slots and the possible limitations due to the static linking to the execution units. Besides that, the stack engine mechanism helps reduce the number of data dependencies between the instructions PUSH, POP, CALL and RET. In practice, however, it is very difficult to organize a fully associative queue from which all 5 micro-ops could be dispatched simultaneously. That’s why the common queue is still divided into sections. This produces a reverse effect: the insufficiently ordered selection of execution-ready instructions leads to the so-called chaotic scheduling problem of P6+ family processors and reduces the execution speed. Besides that, the Conroe has a limitation on the number of registers that can be simultaneously read from the ROB (not more than three), which puts a limitation on the scheduling of the instruction stream.
The out-of-order execution mechanisms in the K8L and the Conroe differ but slightly, because both processors can send up to 5 commands per clock cycle to be executed (3 ALU + 2 Mem). The peculiarities of scheduling and execution algorithms may show differently depending on the code generated by the compiler.