Replay Influence on Hyper-Threading
The main goal of Hyper-Threading technology is to use the computational resources more efficiently: since the two threads do not share any dependent data, one thread can use the resources that are not occupied by the other, especially when the latter is idle. A processor often spends a long time waiting for data to arrive from RAM, and its resources sit idle during that wait. So why not let another thread use them, provided it is not itself waiting for data from system RAM?
As we have already described above, if a command loops in the replay system for a long time, it wastes computational resources. For instance, if the data is in neither the L1 nor the L2 cache, the chain of commands has to make tens and maybe hundreds of loops in the replay system, occupying the computational resources for nothing while waiting for the data to arrive from main memory. If a NetBurst processor runs a single thread, waiting for data from RAM will hardly cause any serious performance issues in the replay system, because the CPU has to wait for this data anyway, losing hundreds of clock cycles (processing stops until the data arrives).
In this case the work of the additional processor unit affects the processor heat dissipation more than its performance :) But when two threads are processed simultaneously, the inefficient use of computational resources by one thread in the replay system cannot go unnoticed by the other. I dare suppose that the more often a thread requests data absent from the L1 and L2 caches, the more resources the replay system will eat up while waiting for the data to be delivered.
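To get a feeling for the scale of this waste, here is a rough back-of-the-envelope sketch. It is only a toy model: the function names and the numbers fed into it (miss latency, replay loop length, chain length) are illustrative assumptions rather than measured NetBurst values, and it assumes that every command of the dependent chain is re-executed once per pass through the replay loop.

```python
import math


def replay_loops(miss_latency_cycles, loop_length_cycles):
    """Passes through the replay loop before the requested data arrives.

    Both arguments are hypothetical inputs, not measured NetBurst figures.
    """
    if loop_length_cycles <= 0:
        raise ValueError("loop length must be positive")
    if miss_latency_cycles <= 0:
        return 0
    return math.ceil(miss_latency_cycles / loop_length_cycles)


def wasted_executions(miss_latency_cycles, loop_length_cycles, chain_length):
    """Useless executions, assuming the whole dependent chain replays on each pass."""
    return replay_loops(miss_latency_cycles, loop_length_cycles) * chain_length


if __name__ == "__main__":
    # Illustrative placeholders: a short wait versus a long wait for main memory.
    for latency in (20, 300):
        loops = replay_loops(latency, loop_length_cycles=10)
        wasted = wasted_executions(latency, loop_length_cycles=10, chain_length=4)
        print(f"latency={latency} cycles: {loops} loops, {wasted} wasted executions")
```

In a single-threaded run these wasted executions cost only power, but with two threads each of them occupies an execution unit that the other thread could have used.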
We decided to check if this theory is true. We wrote a program in which one thread has a long chain of data-dependent commands and constantly requests data from system memory at random addresses, while the other thread simply carries out calculations in registers and hardly addresses memory at all. Both threads execute the same type of commands (AND) on the same FastALU0. The main goal of this experiment was to check how the performance of the second thread, which does not work with memory, changes depending on the location of the data requested by the first thread: L1 cache, L2 cache or system RAM. The results of our tests for the Pentium 4 Northwood processor are given in Pic.10.
Pic.10: Replay influence on Hyper-Threading (Northwood CPU).
In the picture above (Pic.10) you can see how the performance of the second thread (Thread 2) depends on the size of the data buffer of the first thread (Thread 1), which requests data at pseudo-random addresses.
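The access pattern of Thread 1 can be sketched as follows. This is not the original test program (which would have to be written in assembly or C to control which commands actually execute); it only illustrates one common way to obtain a data-dependent chain of pseudo-random accesses over a buffer of a chosen size. The names build_chain and chase are hypothetical, and the use of Sattolo's algorithm to build the chain is our assumption.

```python
import random


def build_chain(n_slots, seed=0):
    """Build a pseudo-random cyclic chain over n_slots buffer slots.

    Sattolo's algorithm yields a permutation with a single cycle, so
    following nxt[i] repeatedly visits every slot before returning to
    the start. n_slots must be a power of two because chase() masks
    indices with AND.
    """
    if n_slots < 1 or n_slots & (n_slots - 1):
        raise ValueError("n_slots must be a power of two")
    rng = random.Random(seed)
    nxt = list(range(n_slots))
    for i in range(n_slots - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i keeps the permutation a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain: every access depends on the result of the previous one."""
    mask = len(nxt) - 1
    idx = start
    for _ in range(steps):
        idx = nxt[idx] & mask  # AND on the loaded value keeps the chain dependent
    return idx


if __name__ == "__main__":
    chain = build_chain(1024, seed=1)
    print(chase(chain, steps=len(chain)))  # one full lap around the buffer
```

Varying n_slots changes the buffer size, which in a native version of the test decides whether the requested data comes from the L1 cache, the L2 cache or system RAM.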
The results are more than illustrative. While one thread is waiting for data to arrive from memory, it slows down the second thread (by more than 35% compared with the situation when the data comes from the L1 cache). The thread waiting for data from RAM occupies more resources during the wait than it would during regular execution with the data available in the L1 cache. The situation with HT is aggravated by the fact that the two threads share the capacity of the L1 and L2 caches, which makes the effective cache size for each thread half as large. This in turn means that the number of cache misses increases, and so does the number of replay cases, so the performance of both threads drops. Replay could be one of the reasons why enabling HT may turn out harmful for performance in certain tasks.
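The effect of the shared cache can be illustrated with a small sketch. It is a deliberately crude model: it assumes that the cache capacity is split evenly between the threads (in reality the sharing is dynamic), and the function name serving_level is hypothetical. The 8 KB L1 data cache and 512 KB L2 cache used in the example are the Northwood cache sizes.

```python
def serving_level(buffer_kb, l1_kb, l2_kb, threads=1):
    """Guess which memory level serves a buffer of the given size.

    Crude assumption: each cache level is divided evenly between the
    threads, so every thread effectively sees capacity / threads.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if buffer_kb <= l1_kb / threads:
        return "L1"
    if buffer_kb <= l2_kb / threads:
        return "L2"
    return "RAM"


if __name__ == "__main__":
    # Northwood: 8 KB L1 data cache, 512 KB L2 cache.
    for size_kb in (4, 8, 256, 400):
        single = serving_level(size_kb, 8, 512, threads=1)
        shared = serving_level(size_kb, 8, 512, threads=2)
        print(f"{size_kb} KB buffer: HT off -> {single}, HT on -> {shared}")
```

Under this model, buffers that sit between half of a cache and its full capacity are exactly the ones that start to miss once HT is enabled, and each extra miss means extra trips through the replay system.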
Now that we have figured out what is behind the results of the Pentium 4 on the Northwood core, we decided to test a Pentium 4 processor on the new Prescott core, especially since Intel claimed to have enhanced HT technology there. The results of the tests didn’t disappoint us (see the influence of the number of cache misses and replay cases of one thread on the performance of the other):
Pic.11: Replay influence on Hyper-Threading (Prescott CPU).
The influence of the replay system on performance has not just become smaller: it is completely different. Firstly, thread performance is now always higher when HT technology is enabled. Secondly, if the data is in neither the L1 nor the L2 cache, the performance of the second thread turns out somewhat higher than when the data is available in the L2 cache.