It is a completely different story if we have a Trace cache: it already holds decoded commands for different program threads (or different programs). This makes it much easier to select the micro-operations for the appropriate thread: all you need is to add a service index to each micro-operation, which identifies the thread it belongs to. This works all the better since the execution core is not closely coupled to the decoder and its functioning doesn’t depend on the decoder operation directly (all we need is a sufficient amount of pre-decoded instructions in the Trace cache).
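To make the idea of a service index concrete, here is a minimal sketch in Python. The `Uop` class, its field names and the opcode strings are hypothetical and chosen only for illustration; they do not reflect the real micro-operation format of the Pentium 4.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Uop:
    opcode: str      # illustrative mnemonic, not a real Pentium 4 encoding
    thread_id: int   # the "service index": 0 or 1, one per logical processor

def select_for_thread(trace_cache, thread_id):
    """Return the decoded uops of one logical processor, in program order."""
    return [u for u in trace_cache if u.thread_id == thread_id]

# Decoded uops of both threads sit side by side in the (modeled) Trace cache.
trace_cache = [Uop("load", 0), Uop("add", 1), Uop("store", 0), Uop("mul", 1)]

print([u.opcode for u in select_for_thread(trace_cache, 0)])  # ['load', 'store']
print([u.opcode for u in select_for_thread(trace_cache, 1)])  # ['add', 'mul']
```

The point of the sketch is that selection needs nothing beyond the tag itself: no re-decoding is involved at this stage.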
So, both logical processors work on the same physical core. We have already mentioned above that all core resources are split into two large categories: “shared” and “distributed”. The so-called distributed resources include the Fetch Queue, Uop Queue and Scheduler Queue. For each logical processor the queue depth is smaller because a part of its capacity is assigned to the other logical processor. In other words, these resources are split into two halves, one for each logical processor. For example, one position in the queue can be occupied only by the first logical processor, and another position only by the second one. In the case of shared resources, their actual distribution between the logical processors is arranged individually for each particular case. However, note that there is a special mechanism preventing one logical processor from using up the entire resource capacity. It is easy to see why this mechanism is necessary: if we have one fast and one slow (or stalled) command thread, then the latter can theoretically occupy all the queues, thus blocking the execution of the first thread.
Therefore, there should be a certain algorithm which would distribute the queue capacity between the two logical processors in the most efficient way. How can two threads share the positions in the UopQ (or any other queue) with a fixed number of positions? There are two different ways: the competitive way (when each thread tries to take as many resources from its opponent as possible) and the fixed way (50:50, for instance). In the first case it is quite possible that one of the threads will slow down the second thread or oust it from the queue completely. In the second case, the resources will not be used efficiently if one of the micro-operation threads requires more than 50% of them: that thread will lack resources, while the other one will simply waste its share.
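The difference between the two ways can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the queue has 8 positions (a made-up figure), thread 0 is fast and drains one entry per cycle, thread 1 is stalled and never drains, both try to add one entry per cycle, and the stalled thread gets to allocate first. None of these numbers come from the actual processor.

```python
def simulate(policy, capacity=8, cycles=40):
    """Count uops completed by a fast thread (0) sharing a queue with a stalled thread (1)."""
    occupancy = [0, 0]        # queue positions currently held by each thread
    retired_fast = 0
    limit = capacity // 2     # per-thread cap, used only by the fixed policy
    for _ in range(cycles):
        # The fast thread drains one of its entries; the stalled thread drains nothing.
        if occupancy[0] > 0:
            occupancy[0] -= 1
            retired_fast += 1
        # Each thread tries to allocate one position (assumption: stalled thread first).
        for thread in (1, 0):
            if policy == "competitive":
                allowed = sum(occupancy) < capacity   # any free position will do
            else:
                allowed = occupancy[thread] < limit   # only the thread's own half
            if allowed:
                occupancy[thread] += 1
    return retired_fast

print(simulate("competitive"))  # 7: the stalled thread ends up holding every position
print(simulate("fixed"))        # 39: the fast thread keeps making progress
```

Under the competitive policy the fast thread stops for good once the stalled one has filled the queue. Under the fixed policy it keeps going, although four of the eight positions are tied up by a thread that does nothing with them.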
In the Pentium 4 processor the queue capacity for each thread is fixed: each of the logical processors gets a Fetch Queue, Uop Queue and Scheduler Queue half as long as in the example above with disabled (or absent) Hyper-Threading technology. This certainly has some negative influence on the performance of each logical processor, but makes it impossible for either thread to block the processor. So, here the resources are distributed strictly equally. Important note: the micro-operations move along the queue independently for each logical processor.
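A fixed split with independent movement can be sketched as a small data structure. The class name `PartitionedQueue`, the total of 8 entries and the uop strings are assumptions made for illustration; the real queue depths are not given here.

```python
from collections import deque

class PartitionedQueue:
    """A queue whose capacity is split into two fixed halves, one per logical processor."""

    def __init__(self, total_entries):
        self.per_thread = total_entries // 2   # each thread gets exactly half
        self.halves = (deque(), deque())       # entries of thread 0 and thread 1

    def push(self, thread_id, uop):
        """Try to enqueue a uop; fails only if this thread's own half is full."""
        half = self.halves[thread_id]
        if len(half) >= self.per_thread:
            return False
        half.append(uop)
        return True

    def pop(self, thread_id):
        """Advance one thread's half; the other thread's entries are not touched."""
        half = self.halves[thread_id]
        return half.popleft() if half else None

q = PartitionedQueue(total_entries=8)          # hypothetical depth
stalled = [q.push(1, f"t1_uop{i}") for i in range(5)]
print(stalled)            # [True, True, True, True, False]: thread 1 is capped at 4
print(q.push(0, "t0_uop0"))  # True: thread 0 still has room
print(q.pop(0))           # t0_uop0: thread 0 moves on while thread 1 stays put
```

Since each half has its own head, a stalled entry at the front of one thread's half never holds up the other thread.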
All other resources of the processor are shared in this scheme: the register file, schedulers, execution units, all caches and load units. Here the resources are used according to the competitive principle: first come, first served. Of course, not without arbitration: if both logical processors address the same unit at the same time, the arbiter sets a strict processing order.
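How the arbiter picks the order is not described here, so the following sketch simply assumes that ties are broken by alternating between the two logical processors. The `Arbiter` class and its `grant` method are hypothetical names, not a description of the actual hardware logic.

```python
class Arbiter:
    """Orders requests from two logical processors to one shared unit."""

    def __init__(self):
        self.last_winner = 1   # assumption: thread 0 wins the very first tie

    def grant(self, requests):
        """requests: set of thread ids (0 and/or 1) asking for the unit this cycle.
        Returns the order in which the requests will be processed."""
        if len(requests) < 2:
            return sorted(requests)        # no conflict: first come, first served
        first = 1 - self.last_winner       # conflict: the other thread goes first this time
        self.last_winner = first
        return [first, 1 - first]

arbiter = Arbiter()
print(arbiter.grant({1}))      # [1]: a lone request is served right away
print(arbiter.grant({0, 1}))   # [0, 1]
print(arbiter.grant({0, 1}))   # [1, 0]: priority alternates on the next conflict
```

Alternation is just one simple way to guarantee that neither logical processor loses every conflict; other priority schemes would fit the description equally well.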
Let me offer you a schematic representation of the pipeline with the distributed resources labeled:
The blue color indicates the resources of one logical processor, and the gray color those of the other.