The instruction decoder has to process x86 instructions for both threads, even though it can decode only one x86 instruction at a time (and, of course, only for one logical processor). Moreover, the decoder can process several x86 instructions before it switches to the buffer of the second logical processor. In other words, the decoding window for each logical processor can be longer than one instruction, so the decoder doesn’t have to switch between the two logical processors every clock cycle. In this respect the decoder is shared in a completely different way from the Trace cache. We don’t know the exact size of the decoding window, but it seems quite logical to suppose that it is determined by the number of x86 instructions in the streaming buffer of the logical processor.
It is also very important to pay special attention to the way the resources are shared inside the CPU core. Let’s now discuss the Back End block and take a look at the corresponding part of the pipeline scheme:
Some terms on this picture differ from the ones we have been using before. Nevertheless, the picture is very illustrative, so we will specifically point out where its markings differ from our terms.
As we remember from the previous chapter, out-of-order execution begins at the schQ stage (marked as Sched on the picture) and finishes at the Register stage (marked as Register Write on the picture). So, what are the peculiarities here?
The Allocator, one of the key logic elements, which we have already discussed in detail in the previous chapter, has the following resources at its disposal: 126 records in the ROB (Reorder Buffer), 48 records in the load buffers and 24 records in the store buffers. It also has 128 integer registers and 128 floating-point registers.
The resources available to each of the two logical processors are limited. Each logical processor can occupy up to 63 ROB records, up to 24 load buffer records and up to 12 store buffer records. This limitation is imposed to prevent either of the logical processors from taking over all the resources.
When the Fetch Queue contains micro-operations for both threads, the allocator switches between the logical processors every clock cycle, allocating resources for them in turn.
If one of the two logical processors runs out of any resource (for example, free store buffer positions), the allocator generates a “stall” signal for this processor and continues allocating resources for the second processor. Likewise, if the Fetch Queue contains micro-operations for only one logical processor, the allocator provides this processor with resources every clock cycle in order to maximize its performance. However, the limits on the maximum resources available to a single logical processor still apply, so that one thread cannot block the CPU.
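The allocation policy described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not a description of the real hardware: it allocates for one micro-operation per clock, never releases resources (retirement is left out), and the names `Allocator`, `step` and `can_allocate` are made up for illustration. Only the totals and the per-processor limits come from the text.

```python
from collections import deque

# Totals quoted in the text; each logical processor may hold at most half.
TOTAL = {"rob": 126, "load": 48, "store": 24}
PER_LP_CAP = {name: total // 2 for name, total in TOTAL.items()}


class Allocator:
    """Toy model: serves at most one micro-operation per clock (an assumption)."""

    def __init__(self):
        # used[lp][resource] -> records currently held by that logical processor
        self.used = [dict.fromkeys(TOTAL, 0) for _ in range(2)]
        self.last = 1  # logical processor served on the previous clock

    def can_allocate(self, lp, needs):
        """True if the uop fits within the per-logical-processor limits."""
        return all(self.used[lp][r] + n <= PER_LP_CAP[r] for r, n in needs.items())

    def step(self, queues):
        """Simulate one clock.

        queues[lp] is a deque of dicts such as {"rob": 1, "store": 1}.
        Returns the logical processor served, or None if nothing was allocated.
        """
        first = 1 - self.last  # take turns: prefer the one not served last
        for lp in (first, 1 - first):
            # An empty queue or an exhausted limit ("stall") hands the
            # clock over to the other logical processor.
            if queues[lp] and self.can_allocate(lp, queues[lp][0]):
                needs = queues[lp].popleft()
                for r, n in needs.items():
                    self.used[lp][r] += n
                self.last = lp
                return lp
        return None
```

With two busy queues the model alternates between the logical processors. Once one of them has used up its 12 store buffer records, it stalls and the other one is served every clock. With a single busy queue that processor is served every clock, but only until it reaches its own limit of 63 ROB records.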
Another duplicated unit shown on the picture in the beginning of this chapter is the Register Alias Table (RAT). It maps the 8 architectural registers onto the 128 physical ones. Each logical processor has its own RAT, and the data in it are an integral part of the architectural state of this logical processor.
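To make the idea of a duplicated RAT more tangible, here is a minimal sketch of register renaming with one table per logical processor. It rests on assumptions the text does not confirm: the physical registers are modeled as a single shared free list, registers are never returned to that list (in a real CPU this happens at retirement), and the names `RenameUnit`, `rename_dest` and `lookup` are hypothetical. The eight names in `ARCH_REGS` are the IA-32 general-purpose registers.

```python
ARCH_REGS = ("eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi")
NUM_PHYSICAL = 128  # integer physical registers, as quoted in the text


class RenameUnit:
    """Toy renamer: one RAT per logical processor, one pool of physical registers."""

    def __init__(self):
        # Assumption: both logical processors draw from the same free list.
        self.free = list(range(NUM_PHYSICAL))
        # rat[lp] maps an architectural register name to a physical register index.
        self.rat = [{}, {}]

    def rename_dest(self, lp, arch_reg):
        """Give a destination register a fresh physical register and record it."""
        if arch_reg not in ARCH_REGS:
            raise ValueError(f"unknown architectural register: {arch_reg}")
        if not self.free:
            raise RuntimeError("no free physical registers")
        phys = self.free.pop(0)
        self.rat[lp][arch_reg] = phys
        return phys

    def lookup(self, lp, arch_reg):
        """Return the current mapping for this logical processor, or None."""
        return self.rat[lp].get(arch_reg)
```

The point of the sketch is that the same architectural name, for example `eax`, resolves to different physical registers depending on which logical processor asks, so the two threads never see each other's register contents.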
The schedulers, the heart of the out-of-order execution system, treat the logical processors in a slightly different way. It doesn’t matter to them which logical processor a micro-operation belongs to. The schedulers can send up to 6 micro-operations for execution within a single clock cycle. These can be, for example, a pair of micro-operations from each logical processor, or three uops from one logical processor and one uop from the other. There is still one limitation here: a single logical processor cannot occupy all the positions in the queue of a given scheduler. This prevents one logical processor from capturing all the resources and thus blocking the other logical processor.
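The two rules for the schedulers (selection that ignores the owner, plus a cap on how many queue positions one logical processor may hold) can be sketched as follows. This models a single scheduler queue only. The queue size and the cap are hypothetical parameters, since the text gives no figures for them, and the oldest-first selection order is likewise an assumption.

```python
class SchedulerQueue:
    """Toy scheduler queue shared by two logical processors."""

    def __init__(self, size, per_lp_cap):
        self.size = size        # total positions in the queue (hypothetical)
        self.cap = per_lp_cap   # positions one logical processor may hold (hypothetical)
        self.entries = []       # (lp, uop) tuples in arrival order

    def try_insert(self, lp, uop):
        """Accept the uop unless the queue is full or this processor is at its cap."""
        held = sum(1 for owner, _ in self.entries if owner == lp)
        if len(self.entries) >= self.size or held >= self.cap:
            return False
        self.entries.append((lp, uop))
        return True

    def dispatch(self, ready, width):
        """Send up to `width` ready uops, oldest first, regardless of owner."""
        picked, kept = [], []
        for entry in self.entries:
            if len(picked) < width and entry[1] in ready:
                picked.append(entry)
            else:
                kept.append(entry)
        self.entries = kept
        return picked
```

For instance, with `SchedulerQueue(size=8, per_lp_cap=6)` the first logical processor can fill at most six positions, so the second one always finds room for its own micro-operations, while `dispatch` picks ready uops from either processor in whatever mix happens to be available.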