When we studied how micro-operations move along the replay pipeline, we discovered that some commands stay there longer than they should. For example, this happened in case of aliasing errors or a D-TLB miss. We got the impression that the scheduler has more than one replay loop.
And this turned out to be true. Some types of errors in the Northwood core (aliasing, for instance; for more details see our article called Replay: Unknown Peculiarities of the NetBurst Core) send the operation to a different replay loop, which is 12 clock cycles long instead of 7.
This happens because some cases require a special, non-standard error check whose result becomes available a bit later than usual. However, there is only a limited amount of time in which an operation can be diverted into the replay loop if something went wrong. If the result of the check doesn’t arrive in time, the incorrectly executed micro-operation will continue its way down the pipeline and retire, which is a catastrophe: the program code has been executed incorrectly.
Since this check is located farther from the scheduler, the “rejected” micro-operation has to travel along a longer loop: 12 clock cycles, compared with 7 clock cycles for the first, shorter loop.
We named these replay loops after their length in clocks: RL-7 and RL-12 respectively.
Here is the result: the Northwood core has two replay loops. The Prescott core has only one replay loop, RL-18 (for details see our article called Replay: Unknown Peculiarities of the NetBurst Core). The reason is that, since the L2 access latency has grown, there is now more time to perform the check, so the CPU manages to complete it within a single pass.
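To make the loop lengths easier to compare, here is a minimal Python sketch of the delay an operation picks up in replay. It is only an illustration: the names `REPLAY_LOOPS`, `loop_length` and `replay_delay` are our own, and the model assumes that every pass around a loop costs exactly the loop length in clocks, which ignores everything else that happens in the real core.

```python
# Replay loop lengths in clocks, as given in the text.
# "late_check" marks errors (such as aliasing) whose check result
# arrives too late for the short loop. The key names are hypothetical.
REPLAY_LOOPS = {
    "Northwood": {"standard": 7, "late_check": 12},   # RL-7 and RL-12
    "Prescott": {"standard": 18, "late_check": 18},   # single loop, RL-18
}


def loop_length(core: str, late_check: bool) -> int:
    """Return the length of the replay loop the operation is sent to."""
    kind = "late_check" if late_check else "standard"
    return REPLAY_LOOPS[core][kind]


def replay_delay(core: str, late_check: bool, passes: int = 1) -> int:
    """Extra clocks spent in replay.

    Assumption: each pass around the loop costs exactly the loop length.
    """
    if passes < 0:
        raise ValueError("passes must be non-negative")
    return loop_length(core, late_check) * passes


if __name__ == "__main__":
    for core in REPLAY_LOOPS:
        for late in (False, True):
            print(core, "late check" if late else "standard check",
                  replay_delay(core, late), "clocks per pass")
```

Under this simplified model, an aliasing error on Northwood costs 12 clocks per pass instead of 7, while on Prescott any replay costs 18 clocks per pass regardless of the type of error.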
Here I would like to draw your attention to the fact that we have been considering only one scheduler all this time. From Chapter VI we remember that there are FIVE schedulers like that in the Pentium 4 core. They are all independent of one another, each of them has its own queue, and it means…
It means that each scheduler has its own replay system. In other words, with two replay loops for each of the five schedulers, there are 10 fictitious pipelines hidden from the user in the Northwood core!
Wow, the size of this part of the Pentium 4 processor, which we had never heard anything about before, is impressive. We can’t help asking ourselves: can THIS be called a beautiful architectural solution?
In the Prescott core things are a bit different. There is only one replay loop for each scheduler, but these loops are longer: 18 clocks each. So, we have a total of five fictitious pipelines. Keeping in mind that at least some of them work at twice the core frequency (together with the fast ALU pipelines), and that differential LVS logic is used, we are no longer surprised at the amount of heat dissipated by the Prescott core: this is all quite natural. Every operation that gets into replay makes the pipeline do its work at least twice. So, since more work has to be done for the same piece of program code, more heat is generated.
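The pipeline counts quoted for the two cores are simple arithmetic: the number of schedulers multiplied by the number of replay loops per scheduler. A short Python sketch of that count follows; the names `SCHEDULERS`, `LOOPS_PER_SCHEDULER` and `fictitious_pipelines` are ours and are used only for illustration.

```python
# Number of independent schedulers in the Pentium 4 core (from the text).
SCHEDULERS = 5

# Replay loop lengths (in clocks) attached to each scheduler, per core.
LOOPS_PER_SCHEDULER = {
    "Northwood": (7, 12),  # RL-7 and RL-12
    "Prescott": (18,),     # RL-18 only
}


def fictitious_pipelines(core: str) -> int:
    """Count the replay loops in a core: one set of loops per scheduler."""
    return SCHEDULERS * len(LOOPS_PER_SCHEDULER[core])


if __name__ == "__main__":
    for core in LOOPS_PER_SCHEDULER:
        print(core, fictitious_pipelines(core))
```

This gives 10 for Northwood and 5 for Prescott, matching the figures in the text.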