And now let’s find out what the arbitration looks like for the Trace cache. Trace cache is an 8-channel partially associative cache working according to LRU algorithm. The preliminarily decoded micro-operations in Trace cache are already assigned two independent arrays of “next instruction pointers” – one for each logical processor. This shows each logical processor where exactly the next micro-operation for its thread is located.
Both logical processors compete for the Trace cache access each clock. If their requests come in simultaneously, the Trace cache access is granted in turns to each of them every other clock. In other words, the first one accesses the Trace cache on the first clock, the second – on the second, then the first one access Trace cache again on the third clock, etc. If one of the threads is stalled (of there are no decoded micro-operations in the Trace cache for this thread), then the second thread has the Trace cache at its full disposal.
The similar access scheme is used if there is a link to Microcode ROM in the Trace cache. The Microcode ROM access is arbitrated the same way as Trace cache access.
If the micro-operation the next-instruction-pointer flag is pointing at is not in the Trace cache, the request should go to L2 cache, and when the x86-command arrives from the L2 cache it should be decoded into the micro-operations. While we are working on obtaining the next x86 instruction, we should transform the virtual address of the micro-operation indicated by the next-instruction-pointer into the physical instruction address (because l2 cache works with physical addresses only). There is a special Instruction Translation Lookaside Buffer (I-TLB) used for this purpose. It receives the request from the Trace cache, translates the virtual address of the instruction into the physical one and sends the request into the L2 cache. The searched command is available in the L2 cache in most cases. It is then moved to a special streaming buffer of the corresponding logical processor (2 lines, 64 bytes each) and stays there until it is completely decoded and transferred to the trace cache.
Each logical processor has its own I-TLB table, i.e. this unit was duplicated when they implemented Hyper Threading support (the processor core scheme shown in the beginning of this chapter demonstrated exactly this additional I-TLB table). Moreover, each logical processor features its own set of next-instruction-pointers, which help to monitor the decoding of x86 commands.
The cache requests of the two logical processors are arbitrated according to the FIFO principle (the earlier request is processed first, all requests are processed in order of appearance).
The branch prediction unit is partially shared and partially duplicated. The return stack buffer is duplicated, because it is a small unit and moreover, the request/return addresses should be collected for each logical processor individually. The Branch History Table is also duplicated for each logical processor, because the prediction history should be monitored for each logical processor individually, too. Nevertheless, the global Branch Table is a general unit where each branch occurrence is marked with the individual index of the logical processor.