Replay in the FPU Pipeline
The replay mechanism in the FPU pipeline follows a different algorithm than ALU replay. There appears to be some kind of feedback between the data loading unit and the scheduler: the scheduler sends the dependent instruction further only after the L1 data cache has been checked and the data has been found there. So, if the data is reported missing in the L1 data cache (the case that corresponds to the RL-7 loop for ALU loads), the FP-load (a category that includes x87, MMX, SSE and SSE2 loads) is replayed, but the dependent instructions are not issued. For RL-12 nothing changes in this case: FP operations circle in the RL in just the same way. If the data is found in the L1 cache, the latency of FP-load operations is 9 clock cycles. If the data is not there, n*7 or n*12 clock cycles are added, depending on the situation.
In fact, we failed to send any chain of FP operations to RL-7 at all. For example, if there is an Int-chain circling around RL-7, the dependent FP-chain ends up in RL-12. For instance, the two instructions “MOVD MM0,EAX – MOVD EAX,MM0” transfer the Int-chain from RL-7 to RL-12 (through the EAX dependency).
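The latency arithmetic above (9 clock cycles on an L1 hit, plus n*7 or n*12 otherwise) can be written out as a minimal sketch. The function name `fp_load_latency` is hypothetical, and the code assumes that n counts the passes an FP-load makes through a single replay loop before the data arrives, which the text implies but does not state outright.

```python
# Illustrative model only; the constants come from the measurements quoted in the text.
L1_HIT_LATENCY = 9           # FP-load latency in clock cycles when the data is in L1
REPLAY_LOOP_LENGTHS = (7, 12)  # lengths of RL-7 and RL-12 in clock cycles


def fp_load_latency(replay_passes: int, loop_length: int) -> int:
    """Estimate FP-load latency as 9 + n * loop_length clock cycles.

    replay_passes is n (assumed: passes through one replay loop),
    loop_length is 7 for RL-7 or 12 for RL-12.
    """
    if loop_length not in REPLAY_LOOP_LENGTHS:
        raise ValueError("loop_length must be 7 (RL-7) or 12 (RL-12)")
    if replay_passes < 0:
        raise ValueError("replay_passes must be non-negative")
    return L1_HIT_LATENCY + replay_passes * loop_length


if __name__ == "__main__":
    print(fp_load_latency(0, 7))   # L1 hit, no replay
    print(fp_load_latency(2, 7))   # two passes through RL-7
    print(fp_load_latency(1, 12))  # one pass through RL-12
```

Under these assumptions an L1 hit costs 9 clock cycles, two passes through RL-7 cost 23, and one pass through RL-12 costs 21.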
Why this way and not the other way around? We assume that most instructions passing through FP Move actually go through something like the K8's “Convert & Classify” unit, where the result is translated into a certain internal representation (formatting). This hypothesis is supported by the following facts:
- the latency of inter-register transfers is very high;
- chains of instructions that process the contents of an SSE register as different data types, such as “ADDSD XMM0,XMM0 – ADDSS XMM0,XMM0”, incur significant penalties.
Perhaps most FP Move operations are nothing but more or less fixed pairs of primitive operations such as “load + convert” or “convert + store”, where the “convert” part takes about 6-7 clock cycles. Returning to replay: in this (hypothetical) case the time required to execute “convert” exceeds the “distance” in clock cycles between the scheduler and the execution unit. So the scheduler can safely send the dependent operation further based on the result of the first check. In case of a miss, only the “load + convert” pair needs to be replayed.
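The hypothesis above can be expressed as a toy decision rule. This is only a sketch of the reasoning, not a description of the real scheduler: the function name `ops_replayed_on_miss` is invented for illustration, and the scheduler-to-execution-unit distance passed in the example is a placeholder, since the text gives no figure for it.

```python
def ops_replayed_on_miss(convert_cycles: int, sched_to_exec_distance: int) -> list:
    """Return which operations would be replayed after an L1 miss.

    convert_cycles: assumed duration of the "convert" primitive (about 6-7 in the text).
    sched_to_exec_distance: hypothetical scheduler-to-execution-unit distance in cycles.
    """
    if convert_cycles > sched_to_exec_distance:
        # The hit/miss result is known before the dependent operation has to
        # leave the scheduler, so the dependent operation is simply held back.
        return ["load+convert"]
    # Otherwise the dependent operation would have to be issued speculatively
    # and would be replayed together with its producer.
    return ["load+convert", "dependent op"]


if __name__ == "__main__":
    # 7 cycles for "convert" is taken from the text; 4 is a placeholder distance.
    print(ops_replayed_on_miss(convert_cycles=7, sched_to_exec_distance=4))
```

With a "convert" time longer than the distance, only the “load + convert” pair is replayed, which matches the behavior the hypothesis is meant to explain.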