Why so? Since no official manuals reveal any details about the reasons behind the observed phenomena, we had to search through Intel patents ourselves. We managed to find a very interesting patent called “Multi-threading for a processor utilizing a replay queue”, which seems to shed some light on the results we keep getting. Let’s take a brief look at the replay queue and its main working principles.
Pic.12: Flow-chart illustrating the work of the replay system with Replay Queues.
In the system presented in Pic.12 you can see that the commands sent to replay from the Check stage arrive at the replay queue loading controller, which is responsible for deciding the further fate of the commands coming to replay. If a load command could not be executed correctly because of an L1 cache miss, it still has a chance of getting the data from the L2 cache, so the controller returns this command to the replay mux through the replay loop. If an L2 cache miss occurs during the re-execution attempt, the replay queue loading controller receives an L2 Miss signal and sends the command into the replay queue, to prevent it from wasting execution unit resources while waiting hundreds of clock cycles for the requested data to arrive from RAM. The controller also sends to the replay queue all incorrectly executed commands that follow the first one in program order, since hidden data dependencies on it may be exactly what caused them to go to replay. Since a CPU with Hyper-Threading technology can process two independent threads at a time, there have to be two independent queues for the commands of these two threads, which are processed in parallel.
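The routing logic described above can be sketched in a few lines of Python. This is only an illustration of the decision rules named in the patent (L1 miss: retry via the replay loop; L2 Miss signal: park the command and its followers in the per-thread replay queue); all class and method names here are our own, not anything from the patent or from Intel documentation.

```python
# Hypothetical sketch of the replay-queue loading controller. Assumptions:
# two hardware threads (Hyper-Threading), one replay queue per thread, and
# a single short replay loop back to the replay mux.

from collections import deque

class ReplayQueueLoadingController:
    def __init__(self, num_threads=2):
        # one replay queue per hardware thread
        self.replay_queues = [deque() for _ in range(num_threads)]
        self.replay_loop = deque()                  # short path back to the replay mux
        self.l2_miss_pending = [False] * num_threads

    def on_replay(self, cmd, thread, l2_miss=False):
        """Route a command that failed verification at the Check stage."""
        if l2_miss:
            self.l2_miss_pending[thread] = True
        if self.l2_miss_pending[thread]:
            # L2 miss outstanding: park the command (and any later dependent
            # commands of the same thread) until the data returns from RAM
            self.replay_queues[thread].append(cmd)
        else:
            # plain L1 miss: retry quickly, the data may still come from L2
            self.replay_loop.append(cmd)

    def on_data_return(self, thread):
        """Data Return signal: release the parked commands for re-execution."""
        self.l2_miss_pending[thread] = False
        released = list(self.replay_queues[thread])
        self.replay_queues[thread].clear()
        return released
```

Run through the scenario from the text: a load misses L1 and circles the replay loop; on the L2 miss it goes into the queue, dragging its followers with it, and everything is released at once when the data arrives.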
The commands are re-sent from the replay queue for re-execution only when the replay queue unloading controller receives a Data Return signal, meaning that the data has arrived from system RAM. The replay queue unloading controller then releases the commands from both queues in the hope that this time they will execute successfully. The choice of which replay mux input supplies the next command (one of the replay queues, the replay loop, or the scheduler) is left to the priority system. Different priority systems are possible: fixed (the priority goes to thread 0, then thread 1, then the replay loop, and lastly the scheduler) or aged (priorities are assigned to instructions according to the order in which they arrived at the scheduler). Unfortunately, we don’t know for sure which system is used in the Prescott processor.
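For concreteness, here is what the fixed-priority variant of the replay mux arbitration might look like. The priority order (thread 0's queue, thread 1's queue, replay loop, scheduler) is the one named in the text; everything else, including the function name, is our illustrative assumption, since, as said, the actual Prescott policy is unknown.

```python
# Hypothetical sketch of a fixed-priority replay mux: each cycle, pick the
# next command from the highest-priority source that has one ready.

from collections import deque

def select_next(queue_t0, queue_t1, replay_loop, scheduler):
    """Return (source_name, command) from the first non-empty source."""
    for name, source in (("thread0", queue_t0),
                         ("thread1", queue_t1),
                         ("replay_loop", replay_loop),
                         ("scheduler", scheduler)):
        if source:
            return name, source.popleft()
    return None, None  # nothing ready this cycle
```

An aged scheme would instead tag every command with its scheduler arrival time and pick the oldest one across all four sources.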
Besides the above-described way of keeping a chain of commands waiting for data from RAM from wasting computational resources, replay queues can also serve to combat livelocks, which we have already discussed above. We assume that this buffer may be used for the kind of commands that got stuck in the replay system of the Northwood processor.
It is quite possible that the Prescott processor implements exactly the replay queue system described in the “Multi-threading for a processor utilizing a replay queue” patent. At least, the significant difference in test results compared to Northwood suggests so. However, we can’t help pointing out that the performance of the second thread in Prescott still drops by about 20% if the first thread, working with commands of the same type, suffers a lot of L1 cache misses.