We have already mentioned that the fast ALUs are not the only units working at twice the nominal clock frequency. Other processor units run at double speed as well, among them two schedulers, FAST_0 and FAST_1, each servicing its own fast ALU. Here the speed refers to the rate at which micro-operations are selected and transferred. In some cases the receiving unit cannot sustain this transfer rate, as happens, for instance, with “ALU store data” micro-operations. Such micro-operations can therefore be sent to the corresponding units only on every even tick.
The schQ queues, which correspond to the reservation stations (RS) of the P6 micro-architecture, select micro-operations in an order that may differ from program order. This is exactly the place where out-of-order selection of commands occurs.
Operations are sent for execution so that, by the time they arrive, the execution unit is already free and the operands are available. If more than one operation in a given schQ queue meets these requirements, preference is given to the “older” one.
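The selection rule described above can be sketched in a few lines of Python. This is only an illustrative model, not a description of the actual hardware logic: the `Uop` record, its `age` and `sources` fields, and the function `select_oldest_ready` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Uop:
    age: int                  # hypothetical: lower value = older micro-operation
    name: str
    sources: Tuple[str, ...]  # hypothetical: registers this micro-operation reads


def select_oldest_ready(queue: List[Uop],
                        ready_regs: Set[str],
                        unit_free: bool) -> Optional[Uop]:
    """Pick the oldest micro-operation whose operands are all available.

    Returns None if the execution unit is busy or nothing is ready.
    The chosen micro-operation is removed from the queue.
    """
    if not unit_free:
        return None
    candidates = [u for u in queue if all(s in ready_regs for s in u.sources)]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda u: u.age)  # "older" one is preferred
    queue.remove(chosen)
    return chosen


queue = [
    Uop(0, "add", ("r5",)),  # oldest, but r5 is not available yet
    Uop(1, "sub", ("r3",)),
    Uop(2, "xor", ("r3",)),
]
picked = select_oldest_ready(queue, ready_regs={"r3"}, unit_free=True)
print(picked.name)  # sub
```

In this toy run the oldest micro-operation has to wait for its operand, so the scheduler picks the oldest of the remaining ready ones, which is precisely a change of order relative to the program.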
As you can see from everything we have just said, the schedulers are responsible for the preparatory work that is essential for proper CPU functioning. In fact, the schedulers are the heart of the logic servicing the functional units of the CPU. In particular, they have to “feed” commands and data into the functional units so as to reduce their idle time to a minimum. To plan the work of the execution units efficiently, however, they need to know the execution schedule of the micro-operations and the moment the operands become available. Most commands take a fixed, known amount of time to execute. The real problems occur when the data delivery time is hard to predict (especially when the data has to be transferred from memory).
Since each micro-operation has to pass a few pipeline stages on its way from the scheduler to the functional unit, the scheduler has to predict the situation a few clock cycles ahead. Certain allowances are made for micro-operations that depend on operations with a variable execution time (such as data loads). Let’s bookmark this place: later on we will need to recall this peculiarity of the scheduler in order to better understand how replay works.
The schQ scheduler queues have the following sizes: SLOW_1 has 12 positions, SLOW_0 has 10 positions, and all the others have 8 positions each. These numbers define the maximum “window” of instructions of the same type whose execution order can be changed. Let me stress once again: the quality of out-of-order execution depends on the length of the scheduler queues.
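To see why the queue length matters, consider a minimal sketch of a single queue that issues at most one micro-operation per cycle and always prefers the oldest ready one. The function `issue_order` and the `ready_at` table are hypothetical and exist only for this illustration. The sizes in `SCHQ_SIZE` are taken from the text above, under the assumption that “all the others” means the FAST_0, FAST_1 and MEM queues named in this article.

```python
from collections import deque

# Queue sizes from the text; the three 8-entry queues are an assumption
# about what "all the others" refers to.
SCHQ_SIZE = {"FAST_0": 8, "FAST_1": 8, "SLOW_0": 10, "SLOW_1": 12, "MEM": 8}


def issue_order(stream, capacity, ready_at):
    """Return the order in which micro-operations leave one scheduler queue.

    stream   -- micro-operation ids in program order
    capacity -- number of positions in the queue (the "window")
    ready_at -- ready_at[uop] is the first cycle its operands are available
    """
    pending = deque(stream)
    queue = []   # kept in program order, so the first ready entry is the oldest
    order = []
    cycle = 0
    while pending or queue:
        # Fill the queue in program order while there is room.
        while pending and len(queue) < capacity:
            queue.append(pending.popleft())
        # Issue at most one micro-operation per cycle: the oldest ready one.
        for uop in queue:
            if ready_at[uop] <= cycle:
                queue.remove(uop)
                order.append(uop)
                break
        cycle += 1
    return order


# Micro-operations 0 and 1 wait for data until cycle 3; 2 and 3 are ready at once.
ready_at = {0: 3, 1: 3, 2: 0, 3: 0}

print(issue_order([0, 1, 2, 3], 2, ready_at))                    # [0, 1, 2, 3]
print(issue_order([0, 1, 2, 3], SCHQ_SIZE["FAST_0"], ready_at))  # [2, 3, 0, 1]
```

With a two-entry window both positions are taken by the stalled micro-operations, so nothing can overtake them. With an eight-entry window the ready micro-operations are visible to the scheduler and go first.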
Once the schedulers have completed their part of the micro-operation processing, the micro-operations are sent to the execution units via four issue ports.
Port 0 is used by FAST_0 and SLOW_0; port 1 is used by FAST_1 and SLOW_1. The MEM scheduler sends load micro-operations to port 2 and store-address micro-operations to port 3. Ports 0 and 1 receive micro-operations from the fast schedulers on every tick of the clock, and from the slow schedulers on every second (even) tick. A port can process only one micro-operation per tick (half a clock), and if two schedulers conflict, the “older” micro-operation is preferred. Registers are read from the Register File at twice the clock frequency. The schedulers, the Register File, the issue ports and the fast ALUs are all parts of the so-called Rapid Execution Engine.
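The arbitration on port 0 or port 1 can be modelled roughly as follows. This is a sketch under stated assumptions: ticks are numbered from 0 so that “even” means `tick % 2 == 0`, each scheduler offers at most one candidate per tick, and a smaller age value means an older micro-operation. The function name `arbitrate_port` is hypothetical.

```python
from typing import Optional


def arbitrate_port(tick: int,
                   fast_age: Optional[int],
                   slow_age: Optional[int]) -> Optional[str]:
    """Decide which scheduler gets the port on a given tick (half a clock).

    fast_age / slow_age -- age of the candidate offered by the fast / slow
    scheduler, or None if that scheduler has nothing to send.
    Returns "fast", "slow", or None if the port stays idle.
    """
    candidates = []
    if fast_age is not None:
        candidates.append((fast_age, "fast"))   # fast scheduler: every tick
    if slow_age is not None and tick % 2 == 0:
        candidates.append((slow_age, "slow"))   # slow scheduler: even ticks only
    if not candidates:
        return None
    return min(candidates)[1]                   # the older micro-operation wins


print(arbitrate_port(0, fast_age=5, slow_age=3))  # slow
print(arbitrate_port(1, fast_age=5, slow_age=3))  # fast
```

On the even tick both schedulers compete and the older micro-operation from the slow scheduler wins. On the odd tick the slow scheduler cannot send anything, so the port goes to the fast one.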
We have just discussed the pipeline structure of the Northwood processor core. The pipeline of the Prescott core is known to be much longer: it consists of 31 stages. However, most of the new stages in this pipeline are Drive stages, in which micro-operations or data are simply transferred along the pipeline without any specific processing.