- Dispatch 1
- Dispatch 2: at these stages micro-operations are prepared for execution and supplied with the corresponding operands. They then go to the execution units through the corresponding issue ports. The order in which micro-operations are dispatched is determined by their readiness for execution, not by their order in the original program code.
- Register File 1
- Register File 2: at these stages the operands are read from the register file. As we have already said, this file can be read at twice the core frequency, which the fast ALUs require.
- Execute: this is the stage for which the entire structure described above was created. Here the micro-operation, together with its operands, enters its functional unit and gets executed. It is important to note that the result of the operation can be immediately “grabbed” by an operation waiting for it while it is being written into the Register File (this is called the bypass mechanism). Instead of the logical but very time-consuming chain of actions (first save the data to the register and then read it back from there), the result is sent directly to the unit waiting for it. This “immediate” data transfer can even happen between two different fast ALUs. In all other cases data synchronization or transfer can cause short delays (see Appendix 2).
- Flags: at this stage the flags required by the program are calculated and set, for example zero result, positive result, readiness indicator, etc. In particular, these flags serve as the input data for the next stage, where the branch predictions are verified.
- Branch Check: at this stage the branch prediction unit compares the actual branch address obtained for the just-executed instruction with the address that was predicted earlier. If the prediction was wrong, the prediction algorithm is corrected accordingly. This way the branch prediction unit collects prediction accuracy statistics, which are then used to correct the prediction model.
- Drive: the result of the check done at the previous stage is sent to the decoder.
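The readiness-driven ordering of the dispatch stages and the bypass at the Execute stage can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not a model of the real scheduler: the `MicroOp` class, the `dispatch_order` function, the two-port limit and the latencies are all hypothetical, and registers are assumed to be already renamed so that every destination is unique.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    name: str
    sources: tuple    # registers the micro-operation reads
    dest: str         # register it writes (assumed already renamed, so unique)
    latency: int = 1  # cycles spent in the execution unit (illustrative values)


def dispatch_order(uops, ready_regs, ports=2):
    """Return (cycle, name) pairs in the order micro-operations start executing."""
    # register -> first cycle in which its value can be consumed
    available = {reg: 0 for reg in ready_regs}
    waiting = list(uops)  # kept in program order
    issued = []
    cycle = 0
    while waiting:
        started = 0
        for uop in list(waiting):
            if started == ports:
                break
            if all(available.get(src, float("inf")) <= cycle for src in uop.sources):
                # Bypass: the result is usable as soon as execution ends,
                # with no extra write-then-read trip through the register file.
                available[uop.dest] = cycle + uop.latency
                issued.append((cycle, uop.name))
                waiting.remove(uop)
                started += 1
        if started == 0 and all(c <= cycle for c in available.values()):
            raise ValueError("some micro-operation waits for a value nobody produces")
        cycle += 1
    return issued


program = [
    MicroOp("load", ("r1",), "r2", latency=3),
    MicroOp("add", ("r2", "r3"), "r4"),   # depends on the slow load
    MicroOp("sub", ("r3", "r5"), "r6"),   # independent of the load
    MicroOp("mul", ("r6", "r3"), "r7"),   # depends on sub
]
order = dispatch_order(program, ready_regs=["r1", "r3", "r5"])
print(order)  # [(0, 'load'), (0, 'sub'), (1, 'mul'), (3, 'add')]
```

In this example `add` comes second in the program but starts last, because it has to wait for the result of `load`, while `sub` and `mul` overtake it as soon as their operands are ready.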
The micro-operation waits for retirement in order to free the resources it used and to have the final non-speculative result written into memory. Retirement is strictly sequential (it follows the program code) and is performed on the same triplets of micro-operations that were formed at the earlier pipeline stages. Only one triplet per clock can be processed here. It is important that if one of the three micro-operations in the triplet is not ready yet, the whole triplet waits until its execution completes. Retirement relies on the data about the initial order of micro-operations, which was written into the ROB at stage 6. This whole process is the responsibility of the Retire Unit, which is located at the very end of the pipeline.
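The retirement rule can be sketched in the same spirit. The code below is an illustration only: the `retire` function, the micro-operation names and the completion cycles are hypothetical, and a triplet is assumed to retire in the first clock after its slowest member has finished executing.

```python
def retire(rob, completion_cycle, width=3):
    """Retire micro-operations in program order, one group of `width` per clock.

    rob              -- micro-operation names in program order
    completion_cycle -- name -> cycle in which its execution finished
    Returns a list of (retirement cycle, group) pairs.
    """
    retired = []
    last_retire = -1
    for i in range(0, len(rob), width):
        group = rob[i:i + width]
        # The whole group waits for its slowest member...
        ready_at = max(completion_cycle[name] for name in group)
        # ...and for the previous group, since only one group retires per clock.
        last_retire = max(last_retire + 1, ready_at + 1)
        retired.append((last_retire, group))
    return retired


rob = ["a", "b", "c", "d", "e", "f"]
done = {"a": 1, "b": 5, "c": 2, "d": 1, "e": 1, "f": 3}
schedule = retire(rob, done)
print(schedule)  # [(6, ['a', 'b', 'c']), (7, ['d', 'e', 'f'])]
```

Here `d`, `e` and `f` finish executing long before `b`, yet their triplet cannot retire until the first one has, which is exactly the in-order behavior described above.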
The picture below shows a schematic of the pipeline, which helps to illustrate the statements above:
An attentive reader has already noticed some “new characters” at many pipeline stages. These “characters” are queues of at least three types. Since we haven’t yet paid special attention to this subject, let’s discuss them in a little more detail now.