So, on the Northwood core the minimum wait for data delivery is 7 clock cycles: the data simply cannot arrive any earlier than that. Therefore, the round-trip time of a micro-operation in the replay system should be chosen so that the micro-operation can pass by the scheduler (holding back its operation at this point) and then return to the execution unit 7 clocks later. And remember that the algorithm has to be fairly simple, so that the replay system can work at high frequencies.
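To make this arithmetic concrete, here is a minimal sketch in Python. It assumes a simplified model that the text does not spell out: the micro-operation first reaches the execution unit at clock 0, comes back every 7 clocks after that, and completes on the first pass at which its data has already arrived. The names replay_passes and completion_clock are hypothetical and serve only as an illustration.

```python
import math

# From the text: a replayed micro-operation returns to the
# execution unit 7 clocks later.
REPLAY_LOOP_CLOCKS = 7


def replay_passes(data_delay_clocks: int, loop_clocks: int = REPLAY_LOOP_CLOCKS) -> int:
    """Trips around the replay loop before the data is available.

    data_delay_clocks is an assumed input: how many clocks after the first
    execution attempt (clock 0) the operands become valid.
    """
    if data_delay_clocks <= 0:
        return 0  # data was already there, no replay needed
    return math.ceil(data_delay_clocks / loop_clocks)


def completion_clock(data_delay_clocks: int, loop_clocks: int = REPLAY_LOOP_CLOCKS) -> int:
    """Clock (relative to the first attempt) at which execution succeeds."""
    return replay_passes(data_delay_clocks, loop_clocks) * loop_clocks


if __name__ == "__main__":
    for delay in (0, 7, 8, 20):
        print(delay, replay_passes(delay), completion_clock(delay))
```

In this model, data that arrives just one clock too late (a delay of 8 instead of 7) costs a whole extra trip around the loop, because the micro-operation can only retry at multiples of the loop length.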
The replay system of the Pentium 4 processor is exactly a compromise like that: an attempt to re-execute micro-operations without increasing the complexity of their processing.
So, what does this replay look like? In fact, the replay system is nothing more than a section of a fictitious pipeline running parallel to the main one.
When a micro-operation leaves the scheduler, it enters a unit called the Replay multiplexer, aka Replay mux (for details see our article called Replay: Unknown Peculiarities of the NetBurst Core). There this micro-operation is cloned, i.e. its exact copy is created. The original micro-operation continues its way down the pipeline towards the execution units, while the cloned micro-operation has a much more exciting destiny. There is a replay pipeline parallel to the main pipeline. The length of this fictitious pipeline equals the distance between the scheduler and the execution unit (later you will understand why).
The original micro-operation leaves the Replay multiplexer and starts off towards the execution unit, and at the same time the clone of this micro-operation starts the same movement along the fictitious pipeline without any actual processing. Both micro-operations, the original and the clone, are moving parallel to one another through all the stages on the way to the execution unit.
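The parallel movement can be sketched with two delay lines of equal length. This is only a toy model in Python: the number of stages is an arbitrary parameter (the text gives no figure here), and make_pipe, step and arrival_clocks are hypothetical names introduced for the illustration.

```python
from collections import deque


def make_pipe(stages: int) -> deque:
    """An empty pipeline modeled as a fixed-length shift register."""
    return deque([None] * stages, maxlen=stages)


def step(pipe: deque, incoming):
    """Advance one clock: return what leaves the last stage, feed the first."""
    leaving = pipe[-1]
    pipe.appendleft(incoming)  # maxlen silently drops the old last element
    return leaving


def arrival_clocks(stages: int) -> dict:
    """Clock at which the original and its clone leave their pipelines."""
    main, replay = make_pipe(stages), make_pipe(stages)
    # Clock 0: the Replay mux sends the original and the clone on their way.
    step(main, "uop")
    step(replay, "uop-clone")
    arrivals = {}
    clock = 0
    while len(arrivals) < 2:
        clock += 1
        for label, pipe in (("main", main), ("replay", replay)):
            if step(pipe, None) is not None:
                arrivals[label] = clock
    return arrivals


if __name__ == "__main__":
    print(arrival_clocks(4))  # stage count chosen arbitrarily
```

Because both delay lines have the same length, the clone reaches the end of the fictitious pipeline on the same clock at which the original reaches the execution unit, whatever stage count is plugged in.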
I know this may sound confusing, but I would still like to point out that in reality the micro-operation doesn’t move anywhere. The thing is that the replay system we are describing may not look exactly like this in silicon. What matters is that the described system and the real one in silicon react in the same way. Nevertheless, it is very convenient to explain how the replay system works with the help of a fictitious pipeline model. To tell the truth, the exact configuration of this system is not that important for our story. It is the behavior of this system that matters most.
During micro-operation execution, a special Checker of the Pentium 4 processor checks if the data obtained as a result of the given micro-operation is “legitimate” or not.
If the answer is positive, then the micro-operation goes further down the main pipeline towards the retirement unit, and its clone on the fictitious pipeline is simply deleted. If everything was done correctly, the replay system doesn’t have to interfere.
If the check indicates a cache miss or some other event, such as data being loaded from memory (for more details about these events see our article called Replay: Unknown Peculiarities of the NetBurst Core), then the replay mechanism is activated. Here I would like to stop for a second and point out that there can be two reasons for an execution failure: failed execution of the current micro-operation itself, or failed execution of the uops that our current micro-operation depends on.
In this case the original micro-operation with incorrect operands is deleted, and its clone is sent to replay, making a loop around the pipeline. It re-enters the actual pipeline right after the scheduler and gets into the multiplexer (whose main function is to automatically hold back the release of the next micro-operation from the scheduler once a micro-operation from replay comes in). So, this micro-operation starts moving towards the execution units again.
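The whole cycle of cloning, checking and replaying can be put into a short Python sketch. It is a behavioral toy model, not a description of the real hardware: the Uop class, its data_ready_clock field and the run_with_replay function are hypothetical, the checker is reduced to a single comparison, and the 7-clock loop length is taken from the Northwood discussion earlier in the article. The way the multiplexer holds back the scheduler is not modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Uop:
    name: str
    # Assumed field: clock at which the operands of this uop become valid.
    data_ready_clock: int


def run_with_replay(uop: Uop, loop_clocks: int = 7) -> list:
    """Trace of (clock, event) pairs until the uop passes the checker.

    clock counts from the first arrival at the execution unit (clock 0).
    """
    clock = 0
    trace = []
    while True:
        clone = replace(uop)  # the Replay mux makes an exact copy
        if clock >= uop.data_ready_clock:
            # Checker is satisfied: the original moves on towards
            # retirement and the clone is simply dropped.
            trace.append((clock, "retire"))
            return trace
        # Checker failed: the original is deleted, the clone goes around
        # the loop and becomes the new "original" on re-entry.
        trace.append((clock, "replay"))
        uop = clone
        clock += loop_clocks


if __name__ == "__main__":
    print(run_with_replay(Uop("load-dependent add", data_ready_clock=10)))
```

For a micro-operation whose data becomes valid at clock 10, the trace shows two failed passes (at clocks 0 and 7) and a successful one at clock 14.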