One of the most interesting and frequently asked questions concerns the size of the Trace cache. Intel states that it is designed to accommodate 12,000 micro-operations. However, to compare it with a conventional instruction cache, it makes sense to estimate the effective size of the Trace cache in KB of code. Of course, its structure implies that this effective size will differ depending on the code type. Approximate calculations show that it corresponds to somewhere between 8KB and 16KB of “standard cache” space, depending on the calculation algorithm applied.
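Such an estimate can be sketched with simple arithmetic: divide the uop capacity by an assumed average number of uops per x86 instruction, then multiply by an assumed average instruction length. Both ratios below are purely hypothetical illustration values, not Intel figures; the real averages depend entirely on the code being run.

```python
# Illustrative back-of-the-envelope estimate of the trace cache's
# "effective size" in bytes of x86 code. The uops-per-instruction and
# bytes-per-instruction ratios are hypothetical averages chosen for
# illustration; real ratios depend on the code type.

TRACE_CACHE_UOPS = 12_000  # capacity stated by Intel


def effective_size_kb(uops_per_instr, bytes_per_instr):
    """Translate a uop capacity into an equivalent amount of x86 code."""
    instructions = TRACE_CACHE_UOPS / uops_per_instr
    return instructions * bytes_per_instr / 1024


# Complex code: many uops per instruction, short instructions.
print(f"{effective_size_kb(3.0, 2.0):.1f} KB")   # ~7.8 KB
# Simpler code: fewer uops per instruction, longer instructions.
print(f"{effective_size_kb(2.0, 2.7):.1f} KB")   # ~15.8 KB
```

Different assumed ratios land at different points of the 8KB–16KB range quoted above, which is exactly why the effective size cannot be stated as a single number.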
In the meanwhile, let us return to the Front End block. Besides translating x86 commands into the processor’s simple internal format, there is one more problem to be solved here.
This is the branching problem. Let me explain. Say we have a piece of program code currently being executed by the processor. Everything goes smoothly: the decoder turns x86 commands into uops, the execution units process them. Beautiful! But if there is a jump instruction, we should learn about it long before the previous instruction finishes executing. Otherwise, upon suddenly encountering the jump, we would have to wait as long as it takes for the entire pipeline to switch to the new branch.
To save us this time there is a special unit called the Branch Prediction Unit. Its job is to try to foresee the branch direction, saving us time whenever the prediction succeeds. If the prediction turns out to be wrong, the CPU pays a penalty: the pipeline is completely flushed and all buffers are cleared.
The second case where we need a unit like this is a conditional branch in the program, i.e. a branch that depends on the result of some operation. That is why we need to “guess” whether this branch will be taken or not.
Guessing at random would hardly do any good. To make the Branch Prediction Unit’s task easier, the results of approximately 4,000 recent branches are stored in a special Branch History Table.
Moreover, the unit also monitors the accuracy of its recent predictions, so that the prediction algorithm can be corrected if necessary. As a result, the decoder takes a conditional branch “de facto”, following the hint from the BPU, and the BPU then checks whether the condition was predicted correctly.
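The idea of a history table whose entries self-correct after mispredictions can be sketched with the classic two-bit saturating-counter scheme. This is a textbook illustration of the principle, not the actual (undisclosed) Pentium 4 algorithm; the table size merely echoes the roughly 4,000 branches mentioned above.

```python
# A two-bit saturating-counter predictor over a small branch history
# table. Each entry remembers how the branch at that address behaved
# recently; two bits mean a single misprediction does not immediately
# flip a strongly held prediction.

TABLE_SIZE = 4096  # entries, echoing the ~4,000 branches in the article


class BranchPredictor:
    def __init__(self):
        # Counter values: 0-1 predict "not taken", 2-3 predict "taken".
        # Every entry starts as weakly not-taken.
        self.table = [1] * TABLE_SIZE

    def _index(self, branch_addr):
        return branch_addr % TABLE_SIZE

    def predict(self, branch_addr):
        """True means the branch is predicted taken."""
        return self.table[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        """Correct the counter once the real outcome is known."""
        i = self._index(branch_addr)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


# A loop branch taken several times in a row is soon predicted "taken".
bp = BranchPredictor()
for _ in range(3):
    bp.update(0x400100, taken=True)
print(bp.predict(0x400100))  # True
```

The decoder would consult `predict()` to choose which path to fetch “de facto”, and `update()` plays the role of the BPU checking afterwards whether the guess was right.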
This way, we have come to the Branch Prediction Unit and the Prefetch mechanism. The latter should “guess” what data the CPU might need later on. Of course, this is not a wild guess at all: special algorithms analyze the sequence of addresses used for data loads and try to derive the next address from them.
Here is why this unit turned out so necessary. During program execution there comes a moment when you have to address the memory and request some data from it. This would be fine if the memory did not take so long to deliver the requested data: hundreds of CPU cycles, during which the CPU has nothing to do. That is why it makes sense to request the data in advance, saving time and thus increasing CPU efficiency. The question is where we should go for this data and what data we will need next. We will return to the details of the Data Prefetch mechanism later in this article; for now, this is all you need to know about it. The Data Prefetch mechanism works side by side with the Branch Prediction Unit and with the third block, called the Memory Subsystem.
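The address-analysis idea can be illustrated with a minimal stride detector: watch the stream of load addresses and, once the same step repeats, guess the next address. This is only a sketch of the general technique; the real Pentium 4 prefetch logic is more elaborate and not publicly documented.

```python
# A minimal stride prefetcher sketch: it observes demand-load
# addresses and, after seeing the same stride twice in a row,
# predicts the next address so it can be requested in advance.

class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Feed a load address; return a predicted next address once a
        steady stride is detected, otherwise None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                prediction = addr + stride  # steady pattern: fetch ahead
            self.last_stride = stride
        self.last_addr = addr
        return prediction


# An array walked in 64-byte steps is picked up after two strides.
pf = StridePrefetcher()
for a in (0x1000, 0x1040, 0x1080):
    hint = pf.observe(a)
print(hex(hint))  # 0x10c0
```

Requesting `0x10c0` while the CPU is still working on the current data is exactly the time-saving trick described above: by the time the program actually asks for that address, the hundreds-of-cycles memory trip is already under way.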