Of course, both these tasks can be executed one after another, just the way everything was done in the first example, but that would take quite a lot of time. If we had one more free execution unit, however, these tasks could be completed twice as fast: this is the case when a second execution unit doubles our performance.
This is how we arrive at the idea of a superscalar processor – a processor capable of performing more than one operation per clock cycle. In fact, the ability to perform more than one operation per clock cycle is the very essence of the superscalar architecture; the Pentium 4, for example, is a superscalar CPU. Suppose, however, that the second task required increasing the contents of the “B” register by 1. In this case we would have to wait for the first task to be completed, even though we had a second execution unit at our disposal. So the control logic must also decide whether the tasks are interdependent or can be performed in parallel – this is another responsibility of the Back End block. The Pentium 4 core, in particular, can perform up to three elementary operations per clock cycle, which is why its control logic must figure out the interdependencies between these instructions very quickly.
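As a rough illustration of such a dependency check (the micro-operation format and register names here are invented for the example, not the actual Pentium 4 mechanism), two operations can issue in the same cycle only if neither reads or writes a register the other writes:

```python
# Toy micro-operations: (destination register, source registers).
def independent(op1, op2):
    dst1, srcs1 = op1
    dst2, srcs2 = op2
    # op2 must not read op1's result (read-after-write),
    # op1 must not read op2's destination (write-after-read),
    # and they must not write the same register (write-after-write).
    return dst1 not in srcs2 and dst2 not in srcs1 and dst1 != dst2

add_a = ("A", ("A", "C"))   # A = A + C
inc_b = ("B", ("B",))       # B = B + 1
use_a = ("D", ("A",))       # D = A  (needs the first task's result)

print(independent(add_a, inc_b))  # True: can issue in the same cycle
print(independent(add_a, use_a))  # False: must wait for A
```

A real core performs this comparison in hardware, across all instructions in flight, within a fraction of a clock cycle.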
As we have already found out, it makes a lot of sense to perform many operations in parallel in order to increase the overall performance. However, it is not always possible. But maybe we could start not only the next micro-operation in line but also a more remote one, which was scheduled to run later? Maybe we could temporarily skip a few micro-operations whose operands haven’t been calculated yet? Of course, only if the change in instruction order does not affect the result – and this condition needs to be controlled very accurately, too.
Well, we have just arrived at the idea of the Out-of-Order execution algorithm. Imagine that we have a number of tasks. Some of them have to be fulfilled in a certain order, because they depend on the results of previous operations; some of them don’t; and a third group is waiting for data to arrive from memory and hence will not start for a while.
There are two ways of solving this problem. We can either wait for the tasks to be completed in their original order, according to the initial program code, or we can try to perform all the operations that do not require any additional data. Of course, we will not be able to perform all the tasks “out of order”, but even doing it for a few of them can save us some time.
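The saving is easy to estimate with a toy example (all latencies here are invented numbers, purely for illustration): let the first task wait 5 cycles on memory, while two independent one-cycle tasks follow it.

```python
# Invented latencies: task 1 waits on memory, tasks 2 and 3 are
# independent single-cycle operations.
MEM_WAIT, OP = 5, 1

in_order = MEM_WAIT + OP + OP          # tasks 2 and 3 wait for task 1
out_of_order = max(MEM_WAIT, OP + OP)  # tasks 2 and 3 run during the wait

print(in_order)      # 7 cycles
print(out_of_order)  # 5 cycles
```

The independent work simply hides inside the memory stall, which is exactly the point of executing out of order.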
Let’s think about what we actually need in this case. We need a certain buffer where we can store the tasks; in the P6 architecture this buffer is called the reservation station. A certain unit (let me reveal now that it is called the scheduler, although we are going to discuss its details in the upcoming chapters) will pick out the tasks that have all the necessary operands and can be completed at the given moment. The same unit is responsible for deciding which tasks can be completed out of order and which cannot. Since the intermediate data produced by successfully completed operations need to be stored somewhere, we may require free registers. But there are only eight general-purpose registers, while there can be more than eight tasks in the buffer – not to mention that we cannot freely use the registers already occupied by previous micro-operations.
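The selection step can be sketched in a few lines (a minimal model with invented names and entries, not the real hardware structure): each buffered task lists the operands it still needs, and the scheduler issues any task whose operands are all ready, regardless of program order.

```python
# A minimal reservation-station sketch (names invented for illustration).
ready = {"A", "C"}          # values already computed
station = [
    ("uop1", {"A", "C"}),   # all operands ready
    ("uop2", {"B"}),        # waiting for B (say, a memory load)
    ("uop3", {"C"}),        # ready, even though uop2 precedes it
]

# Issue every task whose required operands are a subset of the ready set.
issued = [name for name, needs in station if needs <= ready]
print(issued)  # ['uop1', 'uop3'] – uop2 is skipped for now
```

Note that uop2 stays in the buffer until its operand arrives; the hardware does this check for every entry on every cycle.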