Part 6: Execution Units
But let’s continue our exploration of the internal processor structure. We have approached the execution units. It’s not as simple here as it seems at first glance. PowerPC970 contains two integer units (IU), each of which is paired with a Load/Store unit. I can’t say we have anything exceptional, as far as the number of units goes: Pentium 4 features two fast ALUs that work at a double frequency (and one slow ALU for certain types of operations). G4+ also has two of them: one integer unit for simple operations (addition) and one for complex operations (integer division). Athlon 64 boasts even three ALUs. But IBM has thrown in some technological nuances into the architecture. These two IUs we have in PowerPC970 are not exactly the same. Both units perform simple operations like addition and subtraction in perfectly the same way. When it comes to more complex operations, there is certain specialization: for example, it is the IU2 that performs all divisions in PowerPC970. IBM refers to these two units as “slightly different”, but that makes the situation even more confusing. Unfortunately, I couldn’t find any information about what unit sorts the commands according to their specifics and sends them to the appropriate IU. It’s not also quite clear what happens if a group of IOPs contains several division commands. Will they all be waiting for the IU2? If this is the case, we can pinpoint a definite bottleneck of this architecture. The decoder unit may try to keep the groups as balanced as possible: this variant seems more probable, but there is a question of how good the decoder is at changing the IOPs positions.
To our great disappointment, IBM also doesn’t disclose any information on the latencies and the instruction throughput for the PowerPC970. We only know that some simple instructions take one clock cycle (excluding the decoder, of course), while others take at least several clocks. Independent IOPs can start each clock cycle, while IOPs that are dependent on each other can start no faster than each second cycle. We could try evaluating a few things according to the given pipeline stages scheme: for instance, the most probable latency for accessing data stored in L1 cache will be equal to 4 clocks.
The Load/Store units come next. These units show some deviation from the ideology we see in x86 processors (Pentium 4, Athlon 64). The x86 processor includes two units for performing integer and floating-point loads/stores. For example, the Pentium 4 has the following specialization, according to the data from arstechnica.com:
- All LOAD: LSU (Load/Store Unit);
- Integer STORE: LSU;
- Floating-point STORE: FSTORE;
- Vector STORE: FSTORE.
The author is doubtful about this, as he seems to have some concerns about the correctness of such a “division of labor”, which is actually quite hard to confirm or deny. And in PowerPC970 these units are identical, that is there are two special units responsible for all types of loads/stores. There remains some uncertainty about vector operations, though: the predecessor of the family, Power 4 processor, didn’t have a vector unit. That’s why it is not quite clear how many vector Load/Store operations and of what kinds the appropriate unit of the PowerPC970 processor can perform. There is probably a specialization, too, so that one unit is responsible for Vector LOAD and the other - for Vector STORE. But that is only a supposition. Another variant is also possible: the units only perform Vector LOAD, while Vector STORE is combined with the execution units of the pipeline. The worst variant is when only one unit is responsible for all operations with vectors.