We can also venture a guess about how the additional transistors of the core were spent. As you remember, there is no definite information about the transistor count of the Prescott core: the officially claimed number is too high to be explained by the mere doubling of the L2 cache size. So, we dare assume that there is “something else” in there. This “something” could be a combination of five replay loops and an ALU with EM64T support.
But let’s return to the replay. It turns out that the replay system can theoretically block the CPU completely. In particular, our example shows that when there is a hole, the commands in the replay loop change their initial order. A dependency chain circling around the replay loop may then be able to leave it only after a command that is still sitting in the scheduler has been executed. If the replay loop is completely full, there is no “hole” for this long-awaited command to slip into, and the cycle will never break, because replayed commands have higher execution priority than commands coming from the scheduler. This kind of blocking is called a “livelock”, and it cannot be resolved by the regular means.
Nevertheless, practical experience suggests that the CPU does not actually end up in a livelock. This implies that there is some emergency system that resolves the problem when necessary. Getting a little ahead of our story, I would like to say that this system most likely breaks the “endless” replay after a few dozen loops (for details about this system see our article called Replay: Unknown Peculiarities of the NetBurst Core).
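The livelock condition and the suspected breaker can be illustrated with a toy model. This is only a sketch under our own assumptions, not a description of the real hardware: the function name, the loop length of 7 slots and the threshold of 24 passes (standing in for “a few dozen loops”) are all hypothetical. The model only captures the priority rule: a replayed command always takes the dispatch slot, so the command waiting in the scheduler can enter only through a hole.

```python
from typing import Optional


def cycles_until_producer_dispatch(
    loop_slots: int,
    waiting: int,
    watchdog_passes: Optional[int] = None,
    max_cycles: int = 10_000,
) -> Optional[int]:
    """Toy model of the replay-loop livelock (all parameters are assumptions).

    loop_slots      -- number of slots in the replay loop
    waiting         -- slots holding commands that depend on a "producer"
                       command still sitting in the scheduler
    watchdog_passes -- hypothetical breaker: after this many full passes of
                       the loop, replay is suspended for one cycle
    Returns the cycle at which the producer is dispatched, or None if it is
    never dispatched within max_cycles (a livelock in this model).
    """
    if not 0 <= waiting <= loop_slots:
        raise ValueError("waiting must be between 0 and loop_slots")

    # True = slot holds a replayed command, False = a "hole".
    slots = [True] * waiting + [False] * (loop_slots - waiting)

    for cycle in range(max_cycles):
        head = slots[cycle % loop_slots]
        if not head:
            # A hole reached the dispatch port: the scheduler may use it.
            return cycle
        # Replayed command has priority; it fails again and stays in the loop.
        passes = cycle // loop_slots
        if watchdog_passes is not None and passes >= watchdog_passes:
            # Hypothetical emergency system breaks the cycle.
            return cycle
    return None


print(cycles_until_producer_dispatch(7, 5))                      # hole exists
print(cycles_until_producer_dispatch(7, 7))                      # full loop
print(cycles_until_producer_dispatch(7, 7, watchdog_passes=24))  # breaker
```

With a hole in the loop the producer gets in on the first pass; with a completely full loop and no breaker the model never dispatches it; with the assumed breaker it gets in once the pass counter reaches the threshold.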
So, we understand that besides the replay, the Pentium 4 processor also has at least one unknown (!) emergency system, which serves to resolve livelock situations.
In fact, there are two systems: one of them detects the problem and the other one resolves it. Here we should give due credit to the brave architects and engineers who brought all this to life.
However, we are still studying these systems, so for now we suggest turning to another interesting matter: replay and the FPU.
All this time we have been talking about the replay without even mentioning floating-point operations. And there is a good explanation for that. The thing is that the replay system interacts with FPU commands in a different way.
Loading FPU, MMX and SSE2 registers from the L1 cache takes much longer than loading integer registers (9/12 clock cycles against 2/4 clock cycles on Northwood/Prescott respectively). These additional 7/8 clock cycles are just enough to arrange feedback between the execution units and the scheduler. While we are waiting for the fp_load command to be executed, there is just enough time to let the scheduler know whether there has been an L1 cache miss. The scheduler takes this sad news into account and does not release the dependent FPU/MMX/SSE2 instructions for execution. In other words, before these operations are sent for execution, they manage to check whether their operands are already available. This automatically eliminates the main reason for replay to occur. And in fact, this is very handy, because the NetBurst architecture doesn’t contain any additional FPU units. There is only one FPU unit processing one instruction per clock cycle (while the ALU processes six instructions per clock cycle). So, if replayed operations wasted the resources of this unit, the overall processor performance would inevitably drop.
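The arithmetic behind these “additional 7/8 clock cycles”, and the difference in how a dependent instruction is treated on an L1 miss, can be sketched as follows. The latencies are the ones quoted above; the function names and the three outcome labels are our own illustrative assumptions, and the model deliberately ignores every other cause of replay.

```python
# Load latencies in clock cycles, as quoted in the text above.
LOAD_LATENCY = {
    "Northwood": {"int": 2, "fp": 9},
    "Prescott": {"int": 4, "fp": 12},
}


def feedback_window(core: str) -> int:
    """Extra cycles an FPU/MMX/SSE2 load takes compared to an integer load."""
    latency = LOAD_LATENCY[core]
    return latency["fp"] - latency["int"]


def dependent_outcome(kind: str, l1_hit: bool) -> str:
    """Toy model of what happens to an instruction that depends on a load.

    kind is "int" or "fp" (hypothetical labels). Integer dependents are sent
    for execution speculatively, so an L1 miss sends them into replay.
    FP dependents are released only after the scheduler has learned the
    hit/miss result, so on a miss they simply stay in the scheduler.
    """
    if kind not in ("int", "fp"):
        raise ValueError("kind must be 'int' or 'fp'")
    if l1_hit:
        return "execute"
    return "hold in scheduler" if kind == "fp" else "replay"


for core in LOAD_LATENCY:
    print(core, feedback_window(core))  # Northwood 7, Prescott 8

print(dependent_outcome("int", l1_hit=False))  # replay
print(dependent_outcome("fp", l1_hit=False))   # hold in scheduler
```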