Chapter VI: Pentium 4 Pipeline Architecture
Well, it’s time we took a closer look at the Pentium 4 pipeline, especially since hardly any detailed information about it is publicly available. What we know is that the pipeline of the Northwood processor (the part after the Trace cache) consists of 20 stages, while in Prescott this number is even bigger: 31 stages in total.
Let’s go through the Northwood pipeline and find out what exactly happens at each stage:
- TC Nxt IP 1
- TC Nxt IP 2: at these two stages the Trace cache locates the micro-operations that the last executed instruction points to.
- TC Fetch 1
- TC Fetch 2: at these two stages up to 6 micro-operations are selected and sent to a special Fetch Queue, where the order of the micro-operations strictly corresponds to the original program code. If there is an MROM-vector among these micro-operations, further reading from the Trace cache is temporarily halted, and the MROM-vector in the Fetch Queue is replaced with the sequence of micro-operations it represents. As you remember, the Trace cache actually works at half the nominal frequency.
- Drive: micro-operations are moved towards a special unit called the allocator. At this stage the micro-operations do not change, they simply move along the pipeline. This stage seems to be necessary because at the pipeline’s working frequency a single clock is sometimes too short for the micro-operations to reach the required unit.
- Allocator: here a special unit takes three micro-operations from the Fetch Queue and reserves the processor resources they need. Among these resources are positions in queues, register file entries and reorder buffer entries. The operations prepared this way are then sent to other queues, which we will call uopQ. As I have just said, this stage also reserves slots in the Reorder Buffer (ROB) for these micro-operations, which are needed later when the instructions retire.
- Rename registers 1
- Rename registers 2: at these stages the logical registers are mapped onto the actual physical registers. In IA32 there are 8 general purpose logical registers, while the physical registers are much more numerous: 128. This mapping allows separate instructions to be processed independently, without waiting for the register they need to become free. The order of the micro-operations here corresponds to the original program code.
- Queue: at this stage the micro-operations prepared by the Allocator are sorted into special queues, uopQ. There are two uopQs: one is intended for address calculations, and the other for all other uops. Within each queue the order of the micro-operations corresponds to the program code. From the uopQs micro-operations are sent to the scheduler queues, schQ. We are going to pay special attention to uopQ and schQ later in this chapter.
- Schedule 1
- Schedule 2
- Schedule 3: at these three stages a lot of interesting things happen. First, the schedulers receive micro-operations from the uopQs. Note that they always take the oldest micro-operations and retain their order, although micro-operations are selected from the two uopQs independently of each other. In addition, each scheduler accepts only a specific type of micro-operation and places it into its own schQ. There are five schedulers in total, and therefore five schQs. The type of a given micro-operation determines the one scheduler queue it can go to. From the schQs micro-operations are sent for execution through the issue ports (also called execution ports). There are four such ports.
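To make the Queue and Schedule stages described above a bit more tangible, here is a minimal Python sketch of how micro-operations could travel from the two uopQs into the five schQs. It is only an illustration under assumptions: the micro-operation types, the scheduler queue names (schQ_mem, schQ_fast0 and so on) and the type-to-scheduler table are made up for the example and are not taken from Intel documentation. Treating loads and stores as the "address calculation" uops is also an assumption. Issue ports and timing are not modeled at all.

```python
from collections import deque

# Assumption: loads and stores stand in for the "address calculation" uops
# that go to the first uopQ; everything else goes to the second one.
MEMORY_TYPES = {"load", "store"}

# Hypothetical table: the uop type selects exactly one of five scheduler queues.
SCHEDULER_FOR_TYPE = {
    "load": "schQ_mem",
    "store": "schQ_mem",
    "add": "schQ_fast0",
    "sub": "schQ_fast1",
    "shift": "schQ_slow",
    "fmov": "schQ_fpmove",
}


def enqueue(uops):
    """Queue stage: split uops into two uopQs, keeping program order in each."""
    mem_q, gen_q = deque(), deque()
    for seq, kind in enumerate(uops):  # seq is the position in program order
        target = mem_q if kind in MEMORY_TYPES else gen_q
        target.append((seq, kind))
    return mem_q, gen_q


def schedule(mem_q, gen_q):
    """Schedule stages: each uopQ hands over its oldest uop first."""
    names = sorted(set(SCHEDULER_FOR_TYPE.values()))
    sch_qs = {name: deque() for name in names}
    # The two uopQs are drained independently. In this table no schQ is fed
    # by both uopQs, so the order in which the uopQs are visited is irrelevant.
    for uop_q in (mem_q, gen_q):
        while uop_q:
            seq, kind = uop_q.popleft()  # oldest entry of this uopQ
            sch_qs[SCHEDULER_FOR_TYPE[kind]].append((seq, kind))
    return sch_qs


if __name__ == "__main__":
    program = ["load", "add", "store", "sub", "fmov", "shift", "add"]
    mem_q, gen_q = enqueue(program)
    for name, queue in schedule(mem_q, gen_q).items():
        print(name, list(queue))
```

Each schQ ends up holding only micro-operations of "its" types, and inside every queue the sequence numbers still increase, which reflects the oldest-first selection mentioned in the text.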