Now let’s dwell on the FPUs. The PowerPC970 has two identical FPUs, each of which can execute any floating-point operation. The fastest operation takes 6 cycles, and the slowest takes 25. Both units are fully pipelined: a new instruction can be sent for execution on each clock after the previous one, provided the two do not depend on one another. Keep in mind that the PowerPC970 has 72 physical FPU registers: 32 architectural registers and 40 “rename registers”. There are also a few pleasant peculiarities. In particular, the PowerPC970 supports a very useful combined instruction that performs a multiplication and an addition “all-in-one”. Since each of the two FPUs can start one such instruction every clock cycle, the processor can perform up to 4 floating-point operations per clock. This can prove essential for matrix multiplication and many other linear algebra tasks.
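To make the arithmetic above concrete, here is a minimal sketch. It assumes the simplest possible model (independent instructions, no stalls), and the function names are mine, not IBM's. It shows where the figure of 4 operations per clock comes from and what "fully pipelined" means for a stream of independent instructions.

```python
def peak_flops_per_clock(fpu_count: int = 2, flops_per_fma: int = 2) -> int:
    """Peak floating-point operations per clock if every FPU starts one
    multiply-add (counted as 2 operations) each cycle."""
    return fpu_count * flops_per_fma


def pipelined_cycles(n_instructions: int, latency: int, units: int = 1) -> int:
    """Cycles needed to finish n independent instructions on fully pipelined units.

    Assumption: each unit accepts one new instruction per clock, so the last
    instruction is issued after ceil(n / units) - 1 cycles and completes
    `latency` cycles later.
    """
    if n_instructions == 0:
        return 0
    issue_slots = -(-n_instructions // units)  # ceiling division
    return issue_slots - 1 + latency
```

In this model, 100 independent 6-cycle operations spread over the two FPUs need 55 cycles rather than 600, which is exactly why pipelining matters more than raw latency for long, dependency-free loops.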
Besides that, the PowerPC970 is very likely able to process two additions (or two multiplications) simultaneously, which the Athlon XP, for instance, cannot do, because its FPUs are asymmetric (one for addition and one for multiplication). The same is true for the Pentium 4, where the situation looks even worse: in x87 mode the throughput of the ports becomes the bottleneck, since they allow only 1 operation per clock.
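The difference between symmetric and asymmetric FPUs can be sketched with a very rough issue-slot model. It counts only how many cycles are needed to start a given mix of additions and multiplications, and it deliberately ignores latencies, dependencies and decode limits. The function names are hypothetical and only illustrate the three cases described above.

```python
from math import ceil


def cycles_symmetric(adds: int, muls: int, units: int = 2) -> int:
    """Two identical FPUs: any unit takes any operation (the PowerPC970 case)."""
    return ceil((adds + muls) / units)


def cycles_asymmetric(adds: int, muls: int) -> int:
    """One adder and one multiplier (the Athlon XP case described above):
    the longer of the two queues determines the time."""
    return max(adds, muls)


def cycles_single_port(adds: int, muls: int) -> int:
    """Only one operation can be started per clock (the Pentium 4 in x87 mode,
    as described above)."""
    return adds + muls
```

For a balanced mix of 50 additions and 50 multiplications the first two models both give 50 cycles, but for 100 additions alone the symmetric design still needs 50 cycles while the asymmetric one needs 100.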
Of course, the higher computational power of the processor called for more bandwidth from the memory and the system bus, which it did receive. We will talk about it shortly.
Now let’s review one more unit of the PowerPC970 processor, the AltiVec unit. First, take a look at the following picture, which is a flowchart of this unit in the G4+ processor:
The picture is taken from arstechnica.com
As we see, the unit includes:
- Vector Permute Unit;
- Vector Simple Integer Unit;
- Vector Complex Integer Unit;
- Vector Floating-point Unit.
Besides that, the unit uses 32 registers, each 128 bits long. 16 “rename” registers complete the unit and facilitate out-of-order instruction execution. The performance of this unit looks as follows: the G4+ processor can execute two vector IOPs per clock cycle in any three units of the four.
The same unit in PowerPC970 processor is somewhat different. Here is the flowchart:
The picture is taken from arstechnica.com
As you see, the structure of the AltiVec unit is a little bit different (by the way, IBM has another word for it, but I use this term on purpose to avoid confusion). There are two separate units: the Vector Permute Unit and the Vector Arithmetic Logic Unit. The latter consists of:
- Vector Simple Integer Unit;
- Vector Complex Integer Unit;
- Vector Floating-point Unit.
This structure of the AltiVec unit leads to certain limitations that the G4+ processor never had. The PowerPC970 can execute two vector IOPs per clock cycle, but only if one of them goes to the Vector Permute Unit and the other goes to any of the three pipelines of the Vector Arithmetic Logic Unit. Of course, this additional limitation doesn’t make the unit perform any faster.
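The pairing rule just described can be written down as a small sketch. Note the assumptions: the unit labels are my own, dependencies between IOPs are ignored, and the scheduler only tries to pair neighbouring IOPs in program order, whereas the real issue logic is certainly more flexible.

```python
PERMUTE = "permute"
# The three pipelines of the Vector Arithmetic Logic Unit (labels are hypothetical).
VALU_PIPES = {"simple_int", "complex_int", "float"}


def can_pair_on_970(a: str, b: str) -> bool:
    """Two vector IOPs may start in the same cycle only if one goes to the
    Vector Permute Unit and the other to a Vector ALU pipeline."""
    return (a == PERMUTE and b in VALU_PIPES) or (b == PERMUTE and a in VALU_PIPES)


def issue_cycles_970(iops: list[str]) -> int:
    """Cycles needed to start all IOPs, pairing adjacent ones when the rule allows.

    Simplification: strictly in-order, adjacent pairing only.
    """
    cycles = 0
    i = 0
    while i < len(iops):
        if i + 1 < len(iops) and can_pair_on_970(iops[i], iops[i + 1]):
            i += 2  # both IOPs start in this cycle
        else:
            i += 1  # the IOP starts alone
        cycles += 1
    return cycles
```

In this model an alternating stream of permute and floating-point IOPs sustains two IOPs per clock, while a stream consisting only of floating-point IOPs drops to one per clock.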
Besides the 32 architectural registers, the PowerPC970 has a number of “rename registers”. The total number of physical registers in the AltiVec unit is estimated at 72 or 80.
It seems IBM had to redesign this unit to reach higher processor frequencies. This is just a supposition, so I will try to back it up. The following table lists the pipeline depth (in stages) for several types of instructions:
Unit | G4/G4+ | PowerPC 970 |
Vector Simple Integer Unit | 1 | 1 |
Vector Complex Integer Unit | 4 | 4 |
Vector Floating-point Unit | 4 | 7 |
Vector Permute Unit | 2 | 1 |
So, we can see that this unit has undergone a revision that, without doubt, reduced its overall performance per clock cycle, but allowed it to reach considerably higher operational frequencies.
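A quick way to see this trade-off is to convert pipeline depth into time. The sketch below assumes a chain of dependent instructions, where the latency of each one is fully exposed. The clock frequency is left as a parameter, and any value you pass in is your own assumption rather than a figure from this article.

```python
# Pipeline depth in stages, taken from the table above: (G4/G4+, PowerPC 970).
PIPELINE_STAGES = {
    "Vector Simple Integer Unit": (1, 1),
    "Vector Complex Integer Unit": (4, 4),
    "Vector Floating-point Unit": (4, 7),
    "Vector Permute Unit": (2, 1),
}


def break_even_clock_ratio(g4_stages: int, ppc970_stages: int) -> float:
    """How many times faster the PowerPC 970 must be clocked so that one
    instruction has the same latency in nanoseconds as on the G4/G4+."""
    return ppc970_stages / g4_stages


def latency_ns(stages: int, clock_ghz: float) -> float:
    """Latency of one instruction in nanoseconds at the given clock frequency."""
    return stages / clock_ghz
```

For the vector floating-point pipeline the ratio is 7 / 4 = 1.75, so on a dependent chain the PowerPC 970 has to run at 1.75 times the G4+ clock just to match it, whereas for permute operations the ratio is 0.5 and the new design wins even at equal frequency.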