Additional Replay Loops
An L1 cache miss is the main event that causes replay, but not the only one. A number of other events do so as well (note that this list is incomplete):
- L1 cache miss when L1 holds a line whose address bits partially coincide with those of the requested address (so-called aliasing; the aliasing bits differ between Willamette/Northwood and Prescott).
- DTLB miss.
- STLF (store-to-load forwarding) that cannot be performed.
- New lines in Write Buffer.
These events occur at different pipeline stages, and the time needed for the correct data to arrive also differs. Additional replay loops are used to handle this.
While investigating how latencies change in the Pentium 4 Northwood processor, we found that some events (such as L1 miss with L2 hit, or L1 64KB aliasing) produce a latency equal to the replay loop length (7 clocks) multiplied by the number of passes, plus 2 clocks (the L1D cache access latency). The latency of other events (such as L1 hit with DTLB miss, or L1 1MB aliasing) is a multiple of 12 clocks plus the same 2 clocks. This indicates that there is one more replay loop, a longer one. To simplify the explanation, let’s call the replay loop with the 7-clock rotation cycle RL-7 and the replay loop with the 12-clock rotation cycle RL-12. Now let’s find out how they differ.
Let’s look at a fairly common situation: the line with the requested data is in L1, but the DTLB has no entry for the memory page that contains this line (L1 hit, DTLB miss).
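The latency rule described above can be written down as a short calculation. This is only a sketch of the arithmetic, not of the hardware: the function names `replay_latency` and `matching_loops` are made up for illustration, and the only figures taken from the text are the 7-clock and 12-clock loop lengths and the 2-clock L1D access latency.

```python
# Arithmetic sketch of the observed latency rule on Northwood:
#   latency = loop_length * passes + L1D access latency
# Function names are illustrative, not taken from any tool or manual.

L1D_LATENCY = 2  # clocks, as stated in the text for Northwood


def replay_latency(loop_len, passes, l1d_latency=L1D_LATENCY):
    """Latency in clocks after `passes` trips through a replay loop."""
    return loop_len * passes + l1d_latency


def matching_loops(latency, loop_lengths=(7, 12), l1d_latency=L1D_LATENCY):
    """Return the loop lengths that could explain a measured latency."""
    extra = latency - l1d_latency
    if extra <= 0:
        return []  # no replay pass took place
    return [n for n in loop_lengths if extra % n == 0]


if __name__ == "__main__":
    for passes in (1, 2, 3):
        print(passes, replay_latency(7, passes), replay_latency(12, passes))
```

Note that some measured values fit both loops (for example, 86 clocks is 12 passes of RL-7 or 7 passes of RL-12), so a single measurement does not always identify the loop.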
The chain of commands with LD at the head moves along the pipeline. When it reaches the CacheRead stage, the LD operation issues a request to the L1 cache controller, and at the Hit/Miss stage it receives the L1 Hit signal, which means that the tag for the requested line was found in L1. The parallel EarlyCheck stage receives the signal that LD executed successfully, so it does not turn LD back for re-execution and lets it continue down the pipeline together with the dependent commands that follow it.
The L1 cache of the Northwood processor is designed so that its tags are looked up faster than the translation buffers (TLB), where virtual addresses are translated into physical ones. In addition, the TLB holds considerably fewer entries than the L1D cache holds lines. That is why, if no entry for the requested page is found in the DTLB, the LD command is in for an unpleasant surprise: a few stages later, when LD reaches the LateCheck stage, the DTLB miss event occurs. At this point the LD command has to be turned to replay, and all commands that depend on its result follow it into replay at the same LateCheck stage.
While the commands are being turned to replay at the LateCheck stage, a force-early-replay-safe signal is sent to the EarlyCheck stage. Its main purpose is to prevent an incorrectly executed command from being turned to replay at the EarlyCheck stage, so that two commands cannot arrive at the replay mux simultaneously and cause a conflict. However, receiving a force-early-replay-safe signal does not mean that the command was executed correctly. It only means that the command will be sent to replay at the LateCheck stage and not earlier. The LateChecker unit can process all replay events, including those processed by the EarlyChecker. Therefore, if a command had to be sent for re-execution at the EarlyCheck stage (for instance, it received an L1 miss signal together with the force-early-replay-safe signal), it will certainly be turned to replay at the LateCheck stage.
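The routing rule described above can be summarized as a small decision function. This is a simplified sketch, not a model of the actual checker logic: the function name `replay_stage`, the event names, and the assumption that only an L1 miss is detectable at the EarlyCheck stage are illustrative choices made here.

```python
# Simplified sketch of where a command is turned to replay.
# Assumption: only an L1 miss is detectable at EarlyCheck; every other
# replay event (for example a DTLB miss) is detected at LateCheck.
# Event names and the function name are hypothetical.

EARLY_DETECTABLE = {"l1_miss"}


def replay_stage(events, force_early_replay_safe=False):
    """Return the stage that turns the command to replay, or None."""
    if not events:
        return None  # executed correctly, no replay needed
    early_event = any(e in EARLY_DETECTABLE for e in events)
    if early_event and not force_early_replay_safe:
        return "EarlyCheck"
    # LateCheck handles all events, including those EarlyCheck was
    # forced to let through by force-early-replay-safe.
    return "LateCheck"


if __name__ == "__main__":
    print(replay_stage({"l1_miss"}))
    print(replay_stage({"dtlb_miss"}))
    print(replay_stage({"l1_miss"}, force_early_replay_safe=True))
```

The third call corresponds to the case in the text: an L1 miss that arrives together with force-early-replay-safe is not lost, it is simply caught one stage later.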
Here a question arises: why do we need two replay loops, RL-7 and RL-12, working in parallel in each computational channel, if all commands can be turned to replay through RL-12? The answer is quite simple: the processor tries to perform speculative re-execution. If the LD command is turned to replay at an early stage after an L1 miss, it is re-executed sooner, and the latency is lower if the data is successfully found in the L2 cache. This way, RL-7 serves as an auxiliary loop that reduces the latency in some cases.
In Prescott-based processors the loop structure has changed slightly. Since the L1 cache latency grew to 4 clocks, the stages at which L1 miss and DTLB miss events may occur now coincide. Since the L2 cache latency grew to 18 clocks, there is no longer any need to re-execute the LD command sooner. That is why the Prescott core has only one type of replay loop: RL-18 (18 clocks per loop). Our tests confirm this.
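A toy calculation shows why the shorter loop helps only in some cases. It assumes that a load keeps circulating until its data is available and that each pass costs one full loop length. The wait times used below are hypothetical values chosen for illustration, not measured L2 latencies, and `load_latency` is a made-up helper name.

```python
# Toy model: a load circulates in a replay loop until its data is ready.
# Each pass costs one full loop length; the final L1D access costs 2 clocks.
# The wait times below are hypothetical, not measured values.

L1D_LATENCY = 2  # clocks, as stated in the text for Northwood


def load_latency(data_ready_after, loop_len, l1d_latency=L1D_LATENCY):
    """Total latency if the data becomes available after `data_ready_after` clocks."""
    passes = max(1, -(-data_ready_after // loop_len))  # ceiling division
    return passes * loop_len + l1d_latency


if __name__ == "__main__":
    for wait in (7, 10, 14):
        print(wait, load_latency(wait, 7), load_latency(wait, 12))
```

In this model, data that arrives within the first 7-clock pass gives 9 clocks through the short loop against 14 through the long one, while data that arrives after 10 clocks gives 16 against 14, so the early replay is a speculative bet that pays off only when the data shows up quickly.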