As a result, FPU commands never get into the RL-7 replay loop. However, they can still end up in the RL-12 replay loop: for example, when an FPU micro-operation depends on the result of an integer micro-operation that has already got into the RL-7 loop.
In conclusion I would like to point out two more interesting facts connected with the FPU operations:
- In the overwhelming majority of computational algorithms, the data locality space is much bigger than the L1 cache.
- No data prefetch command can load a cache line directly into L1. So, without the preliminary check of operand availability, FPU commands would have been frequent guests in the replay system, especially in the popular streaming algorithms.
Now we have to discuss one more application for the replay system.
In the examples above we discussed the situation when replay is used to resolve issues caused by a cache miss. In fact, this is far from the only function the replay system performs in the Pentium 4 processor. It is used to solve very diverse problems. I would even say: wherever possible.
In particular, replay is used to deal with a very common situation (sincerely disliked by software developers): loading data right after storing it.
Here is what the problem is actually about. When we have some data and execute a Store command for it, so that the data is written to memory, some time must pass before this particular data can be read back from memory again.
We will discuss this whole situation in greater detail in our article called Replay: Unknown Peculiarities of the NetBurst Core. Here I would only like to stress that if not enough time has passed after the store operation completes, replay steps in as the way out.
The worst thing is that there is nothing software developers can do to prevent this replay: the CPU reorders instructions internally very aggressively, so any data loading command may change its position relative to other instructions in the program code, even if it was initially placed very far away (and as we know, new independent dependency chains usually start with a data load).
As a result, the processor tries to load the data too early, and this operation gets sent to replay. And the entire dependency chain follows it.
What does this actually mean? Operations like these accompany every function call: the calling code stores the parameters on the stack, and the called function reads them from the stack. Function calls are present in all programs without exception. So here is the conclusion: all programs, without exception, contain situations favorable for replay.
And in conclusion I would like to say a few words about the interaction between replay and Hyper Threading.
As you remember, Hyper Threading technology is intended to increase the utilization of heavily loaded processor units. Since replay eats up some of the execution unit resources, we were wondering if there is any mutual influence, and if there is, how big it is. The answer is traditionally given in our article called Replay: Unknown Peculiarities of the NetBurst Core. So, you might want to check it out :)
From general considerations it is clear that the heavier the workload on the processor execution units, the less efficient Hyper Threading technology becomes. At the same time, replay causes a number of operations to be executed multiple times, which eats up processor resources. So, the two subsystems, both tending to use the same resources, will inevitably conflict with each other.
The result of our investigation met our expectations.
The replay system can reduce the efficiency of Hyper Threading technology significantly. In particular, in certain situations replay can cause an overall performance loss of up to 45% on the Northwood core and up to 20% on the Prescott core. Moreover, the increase in Hyper Threading efficiency that we observe on Prescott processors is most likely connected with replay improvements rather than with enhancements of the Hyper Threading technology itself.