The addition of two numbers is performed as follows. At the first tick (half a clock) the fast ALU adds the 16 low-order positions (bits) of the two 32-bit numbers. The whole algorithm is very similar to the way you would add two numbers on paper, writing one above the other. Two short numbers can be added mentally. Two long numbers have to be written down first so that you can add them digit by digit. The ALU works in much the same way: it can add two short numbers quickly, but two long numbers take much longer. That is why 32-bit numbers are added in 16-bit halves, one half at a time, which speeds up the addition (and with it the processing of numbers in general). Very often adding one position produces a digit that has to be carried into the next higher position. When we do mental calculations we simply memorize this digit. In the CPU it is called the “carry bit”. By the end of the first tick the carry bit for the low-order positions of the two numbers is known.
At the second tick the second 16-bit sub-unit of the fast ALU, which is offset by half a clock relative to the first one, adds the 16 high-order positions of the first pair of numbers, taking the carry bit from the low-order half into account.
At the third tick we already know the service flags that accompany all numeric operations (whether there was an overflow, a zero result, a negative result, etc.). These flags can be used by subsequent operations, such as conditional branches.
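The three ticks described above can be sketched in software. This is only an illustrative model, not a description of the actual circuit: the function name `staggered_add32` and the choice of flags (carry, zero, negative, overflow) are assumptions made for the example.

```python
MASK16 = 0xFFFF


def staggered_add32(a, b):
    """Add two 32-bit numbers in 16-bit halves (hypothetical software model)."""
    # Tick 1: add the 16 low-order bits and remember the carry out of bit 15.
    low_sum = (a & MASK16) + (b & MASK16)
    low = low_sum & MASK16
    carry = low_sum >> 16

    # Tick 2: add the 16 high-order bits together with the carry from tick 1.
    high_sum = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    high = high_sum & MASK16
    carry_out = high_sum >> 16

    # Tick 3: the flags are derived once the whole result is known.
    result = (high << 16) | low
    # Signed overflow: the operands share a sign and the result's sign differs.
    overflow = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    flags = {
        "carry": bool(carry_out),
        "zero": result == 0,
        "negative": bool(result >> 31),
        "overflow": bool(overflow),
    }
    return result, flags
```

For example, `staggered_add32(0x0001FFFF, 0x00000001)` produces a carry out of the low half at the first step, which the second step adds into the high half to give `0x00020000`.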
Note that the first sub-unit of the fast ALU, which processes the 16 low-order positions, is already free by the second tick. Of course, it makes perfect sense to keep it busy. This is exactly what happens: at the second tick the first sub-unit starts processing the next pair of numbers. As a result, the addition of the second pair of numbers starts only half a clock after the first one!
What is the result then? First of all, we have significantly sped up the start of new addition operations compared with the traditional ALU structure. Instead of starting one pair of numbers per clock, we can now start an addition every half a clock, which means that two pairs of numbers are processed completely per clock. It is for the sake of this peak performance increase that all these changes were made to the ALU.
At first glance we have to pay for the higher processing speed with a higher latency: one and a half clocks instead of one clock in the traditional ALU. But you shouldn’t be too upset about it, because this design also gives us one beautiful opportunity. Let’s see when the different parts of the end result will be ready:
- 16 low order positions of the number will be ready in one tick (half a clock);
- If we need the high order positions of the number, they will be ready in two ticks (one clock cycle);
- If we need the flags, they will be obtained in three ticks (one and a half clocks).
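The timing above can be laid out as a small schedule. The sketch below is a simplified model under the assumptions stated in the text (a new addition starts every tick, and the low half, high half and flags are ready one, two and three ticks after the start); the names `schedule` and `ticks_to_clocks` are made up for this illustration.

```python
def ticks_to_clocks(ticks):
    """One tick is half a clock."""
    return ticks / 2


def schedule(num_adds):
    """Tick at which each part of each addition is ready (ticks count from 1)."""
    rows = []
    for i in range(num_adds):
        start = i + 1  # a new addition enters the fast ALU every tick
        rows.append({
            "add": i,
            "low_ready_tick": start,        # 16 low-order positions
            "high_ready_tick": start + 1,   # 16 high-order positions
            "flags_ready_tick": start + 2,  # service flags
        })
    return rows


if __name__ == "__main__":
    for row in schedule(4):
        print(row, "flags after", ticks_to_clocks(row["flags_ready_tick"]), "clocks")
```

In this model the complete 32-bit result of addition `i` is ready at tick `i + 2`, so from the second tick onwards one complete result appears every tick, that is, two per clock, even though the flags of each individual addition take one and a half clocks.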