Of course, it is logical to assume that two computational programs that both use the FPU will not run any faster: there is still only one FPU, and it cannot execute two tasks simultaneously. Pure logic therefore suggests that different program threads should use different processor resources if we want to benefit from multi-threaded processing.
But things are not that simple. In reality, two threads with intensive floating-point calculations may be processed faster in HT mode than one after the other, while a thread waiting for data from memory may slow down another thread that does not work with memory at all.
Why does this happen?
Let’s take a closer look at the example with two FP threads (the same is true for MMX and SSE, of course), each running, for instance, a simple iterative loop. As you know, each operation has a fixed execution time (latency). Say we have a multiplication, FP_MUL, with a latency of 6 clocks. Having sent this command for execution, the scheduler will hold back all commands that depend on the result of this multiplication for at least 6 clock cycles, although the FPU will be ready to accept the next FP_ADD command on the very next clock cycle. If there are no independent commands of that type in the queue (only FP_MUL commands), the next clock cycle will be skipped. If there are no independent commands in the queue at all, the FPU will idle for 5 clocks.
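The idle clocks described above can be visualized with a small sketch. It is a toy model, not a description of the real NetBurst scheduler: we assume a single FP unit that could accept one operation per clock, and a chain of operations in which each one depends on the previous result. The function name `issue_timeline` and its parameters are our own, introduced only for illustration.

```python
# Toy model (assumption): one FP unit that could accept a new operation
# every clock, and a fixed result latency. Ports, queues and
# per-instruction throughput limits are deliberately ignored.

def issue_timeline(num_ops: int, latency: int = 6) -> str:
    """Return one character per clock: 'M' = an op is issued, '.' = idle slot.

    Every op depends on the result of the previous one, so the next op
    cannot be issued until `latency` clocks after its predecessor.
    """
    timeline = []
    for op in range(num_ops):
        timeline.append("M")
        if op < num_ops - 1:
            # The dependent successor waits for the result: latency - 1 idle slots.
            timeline.extend("." * (latency - 1))
    return "".join(timeline)


line = issue_timeline(3)
print(line)
print("idle issue slots:", line.count("."))
```

Each `M` is followed by five dots: these are the 5 clocks during which the unit has nothing to do unless some independent command is available.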
The second Hyper Threading thread will use these “empty” clocks for its calculations, because the commands of the two threads are totally independent of one another. Of course, the average number of independent operations in the FP queue can also be increased for a single thread with the help of special optimization techniques (such as “de-looping”, i.e. loop unrolling). However, if you want to use the entire potential of the FPU, you will need 5-6 independent threads with an equal share of FP_ADD and FP_MUL commands, all of which have their data readily available in the cache. And this is far from a trivial optimization task for most algorithms.
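To see why it takes about as many independent instruction streams as the latency is long, here is a minimal sketch under simplifying assumptions: one FP unit that accepts at most one operation per clock, a uniform latency of 6 clocks, and several independent dependency chains (they could belong to different threads or to one unrolled loop). The function `fpu_utilization` and the oldest-ready-first selection rule are hypothetical and chosen only for illustration.

```python
# Toy model (assumption): a single FP unit issuing at most one op per clock.
# Each chain is a sequence of ops where every op needs the previous result.

def fpu_utilization(num_chains: int, ops_per_chain: int, latency: int = 6) -> float:
    """Fraction of clocks in which the unit actually issues an operation."""
    ready_at = [0] * num_chains              # clock at which a chain's next op may issue
    remaining = [ops_per_chain] * num_chains
    total_ops = num_chains * ops_per_chain
    issued = 0
    clock = 0
    while issued < total_ops:
        ready = [c for c in range(num_chains)
                 if remaining[c] > 0 and ready_at[c] <= clock]
        if ready:
            # Pick the chain that has been waiting longest (oldest-ready-first).
            chain = min(ready, key=lambda c: ready_at[c])
            remaining[chain] -= 1
            ready_at[chain] = clock + latency
            issued += 1
        clock += 1                           # otherwise this clock is an idle slot
    return issued / clock


for chains in range(1, 7):
    print(chains, "chain(s):", round(fpu_utilization(chains, 100), 3))
```

In this model a single chain keeps the unit busy roughly one clock in six, a second chain doubles that, and only six independent chains fill every issue slot.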
This simple example leads to two somewhat paradoxical observations:
- The resources of the NetBurst execution units seem excessive at first glance, so a shortage of them should not limit Hyper Threading efficiency. This is best illustrated by the fast integer ALUs, which can execute more uops (up to four per clock) than the Trace cache can deliver (up to three per clock).
- The maximum performance gain from Hyper Threading compared with sequential processing of the threads can be obtained in non-optimized applications. Any optimization that increases the IPC of a single thread reduces Hyper Threading efficiency. Moreover, if the threads compete not only for the execution units but also for other shared processor resources (cache, queues, buffers), then at a certain level of optimization Hyper Threading can start harming overall processor performance, and the two threads will be processed faster in succession than in parallel in Hyper Threading mode.
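The second observation can be illustrated with a back-of-the-envelope model. This is our own simplification rather than a measured result: we assume a thread running alone uses a fraction `u` of the available issue slots, two threads together can use at most all of them, and contention for shared caches, queues and buffers costs a fraction `sharing_penalty` of the combined throughput. Both parameter names and the function `ht_speedup` are hypothetical.

```python
# Toy model (assumption): throughput is measured as a fraction of issue slots used.

def ht_speedup(u: float, sharing_penalty: float = 0.0) -> float:
    """Throughput of two threads run together relative to running them in succession.

    u               -- fraction of issue slots one thread uses when running alone
    sharing_penalty -- fraction of throughput lost to contention for shared resources
    """
    if not 0.0 < u <= 1.0:
        raise ValueError("u must be in (0, 1]")
    if not 0.0 <= sharing_penalty < 1.0:
        raise ValueError("sharing_penalty must be in [0, 1)")
    combined = min(1.0, 2.0 * u) * (1.0 - sharing_penalty)
    return combined / u


for u in (0.25, 0.5, 0.8, 1.0):
    print(f"u={u:.2f}  no contention: {ht_speedup(u):.2f}  "
          f"10% contention: {ht_speedup(u, 0.1):.2f}")
```

In this model the gain shrinks as the single-thread utilization grows, and once a thread is well optimized even a small contention penalty pushes the ratio below 1, meaning that running the threads one after another would be faster.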
But let’s return to our discussion of Hyper Threading technology implementation in the Pentium 4 CPU.
Hyper Threading technology is fairly easy to implement in the NetBurst architecture thanks to specific features of this architecture such as the Trace cache. In a traditional architecture such as P6, the decoder is tightly coupled with the execution units. To process two instruction threads simultaneously, both would have to be translated into micro-operations at the same time, which is really hard to achieve. Worse still, instructions would have to be selected for both threads at the same time, which is a pretty sophisticated task because x86 instructions have variable length.