So, at stage 6 (Allocator) three micro-operations are selected from the only available queue, the Fetch Queue, and processor resources are reserved for them. These operations are then placed into two uopQ queues: one for address operations, and the other for all remaining operations. As you can see from our description of the pipeline architecture, this distribution between the two queues happens at stage 9.
The main task of the uopQ queues is to distribute micro-operations of different types among the different schedulers correctly. That is exactly why the uopQ for address operations accepts only two types of uops: “load [address]” and “store [address]”. All other operations, including “store data”, are placed into the other, main queue. The address queue is 16 micro-operations deep; the main queue is twice as long and can hold 32 operations. Micro-operations enter these queues in program order: when one of the queues overflows, the other also closes and stops accepting micro-operations. This queue organization offers two practical advantages: “early” loading, and faster execution of short stretches of code that depend on the results of longer operations.
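The allocation rule above can be sketched as a small simulation. The queue depths (16 and 32) come from the text; the uop names and the boolean stall convention are illustrative assumptions, not Intel's terminology:

```python
from collections import deque

# Capacities taken from the text: the address queue holds 16 uops,
# the main queue 32.
ADDR_CAPACITY = 16
MAIN_CAPACITY = 32

# Only load-address and store-address uops go to the address queue.
ADDRESS_UOPS = {"load", "store_address"}

addr_q = deque()
main_q = deque()

def allocate(uop):
    """Place a uop into the proper uopQ. If either queue is full,
    both stop accepting uops (in-order allocation stalls)."""
    if len(addr_q) >= ADDR_CAPACITY or len(main_q) >= MAIN_CAPACITY:
        return False  # stall: one full queue closes the other as well
    if uop in ADDRESS_UOPS:
        addr_q.append(uop)
    else:
        main_q.append(uop)
    return True

# Example: a store splits into a store-address uop and a store-data uop,
# which land in different queues.
for uop in ["load", "add", "store_address", "store_data"]:
    allocate(uop)

print(len(addr_q), len(main_q))  # 2 2
```

Note how “store data” travels through the main queue even though it belongs to the same store as the “store [address]” uop in the address queue.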
When the schedulers process micro-operations (at stages 10, 11 and 12), the uops are distributed among the next type of queue: the scheduler queues (schQ).
Micro-operations leave the uopQ sequentially, according to the FIFO principle (first in, first out). The two queues work independently of one another: a micro-operation from the address queue can be selected before all of the preceding operations have left the main queue. In many cases this allows data loading to start early, which is helpful. There is, however, another side to the coin: this independence significantly increases the probability of incorrect situations, such as an attempt to read data before it has actually been calculated by the corresponding micro-operation in the main queue.
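The early-load effect and its risk can be illustrated with two independent FIFOs. The uop names and the single-issue-per-step model here are simplifying assumptions for illustration only:

```python
from collections import deque

# Independent dequeue: the address queue may issue a load before older
# uops have left the main queue -- an "early load". This also models the
# hazard the text mentions: reading data before the producing uop runs.
addr_q = deque(["load X"])               # younger uop
main_q = deque(["store_data X", "add"])  # older uops, still queued

issued = []
issued.append(addr_q.popleft())          # "load X" issues early
while main_q:
    issued.append(main_q.popleft())

# The load reached execution before the store-data that produces X.
early = issued.index("load X") < issued.index("store_data X")
print(early)  # True
```

In a real processor such a premature load has to be detected and replayed, since the value of X was not yet available when the load executed.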
The data transfer rate here is quite high: up to two micro-operations per clock cycle for each scheduler. This speed applies not only to the fast schedulers (see below) but can also be achieved by the slow schedulers. The uopQ queues are sensitive to half-clock events (and can thus also be considered units working at twice the clock frequency). If a micro-operation cannot be sent to a scheduler because its schQ queue is full, the transfer of all other uops from the uopQ is halted. If a micro-operation can be sent to either of two available schedulers, the system chooses the scheduler depending on the status of their schQ queues.
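The scheduler-selection rule can be sketched as follows. The schQ depth of 8 is a hypothetical value (the text does not give one), and the least-loaded-queue policy is an assumption consistent with “depending on the schQ queues status”:

```python
from collections import deque

SCHQ_CAPACITY = 8  # hypothetical schQ depth, not given in the text

# FAST_1 accepts a subset of what FAST_0 accepts, so a simple ALU uop
# may be eligible for either scheduler queue.
fast0_q, fast1_q = deque(), deque()

def dispatch(uop, eligible):
    """Send a uop to the least-occupied eligible scheduler queue.
    Returns False (a stall) if the chosen queue is full: a schQ
    overflow halts further transfers from the uopQ."""
    target = min(eligible, key=len)
    if len(target) >= SCHQ_CAPACITY:
        return False
    target.append(uop)
    return True

# Six simple ALU uops alternate between the two fast scheduler queues.
for i in range(6):
    dispatch(f"add_{i}", [fast0_q, fast1_q])

print(len(fast0_q), len(fast1_q))  # 3 3
```

Balancing between the two eligible queues keeps either schQ from filling up prematurely and stalling the uopQ.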
As we have already mentioned in the pipeline description, there are five such queues and five schedulers. Let me list all these schedulers now:
FAST_0 – works with ALU micro-operations: logical operations (and, or, xor, test, not); ALU store data; branches; transfer operations (mov reg-reg, mov reg-imm, movzx/movsx, simple forms of lea); and simple arithmetic operations (add/sub, cmp, inc/dec, neg).
FAST_1 – also works with ALU micro-operations: a subset of the transfer operations (all except movsx) and a subset of the arithmetic operations (all except neg). It appears that every operation that can be sent to FAST_1 can also be sent to FAST_0.
SLOW_0 – works with FPU micro-operations for data transfer and conversion (x87, MMX, SSE and SSE2 instructions), as well as FPU store data.
SLOW_1 – works with both ALU and FPU micro-operations: a number of simple ALU operations (shift/rotate; some uops produced by adc/sbb), all complex uops starting with multiplication, and the majority of “computational” FPU operations.
MEM – works with AGU operations: load address and store address.
Evidently, all operations from the address uopQ are sent to the MEM scheduler queue, while all other micro-operations fall into one of the remaining four scheduler queues.
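The scheduler assignment described above can be summarized as a lookup table. The uop class names are illustrative labels of ours, not Intel's, and borderline cases (uops accepted by both fast schedulers) are pinned to one entry for simplicity:

```python
# Hypothetical dispatch table summarizing the five schedulers;
# class names are illustrative, not Intel terminology.
SCHEDULER_FOR = {
    "logic":          "FAST_0",  # and, or, xor, test, not
    "alu_store_data": "FAST_0",
    "branch":         "FAST_0",
    "movsx":          "FAST_0",  # FAST_1 takes transfers except movsx
    "neg":            "FAST_0",  # FAST_1 takes arithmetic except neg
    "mov":            "FAST_1",  # also acceptable to FAST_0
    "add_sub":        "FAST_1",  # also acceptable to FAST_0
    "fp_move":        "SLOW_0",  # x87/MMX/SSE/SSE2 transfer, conversion
    "fp_store_data":  "SLOW_0",
    "shift_rotate":   "SLOW_1",
    "multiply":       "SLOW_1",  # and more complex uops
    "fp_compute":     "SLOW_1",  # most "computational" FPU operations
    "load_address":   "MEM",     # the entire address uopQ goes here
    "store_address":  "MEM",
}

print(SCHEDULER_FOR["store_address"])  # MEM
```

Note that only the last two entries come from the address uopQ; everything else arrives via the main queue.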