It feels embarrassing to apply a name like “microprocessor” to the IBM Power4. The module is monstrous: an assemblage of four processors together with the L3 cache forms a 115x115mm square – that’s 13225 square millimeters! There is nothing “micro” about this microprocessor.
Well, if someone makes processors of that size, someone certainly needs them. Let’s see what it has inside. First of all, the Power4 contains two processor cores. You can see them in the following figure:
You see that the internal structure of the processor is nontrivial. Two processor cores are linked with a special high-speed switch. In fact, we have an SMP system within one CPU – the cores are joined with a bus that works at 500MHz!
Other subsystems are impressive, too: the L2 cache uses three independent cache controllers and three banks (visible in the figure) with a total capacity of 1536KB, and delivers a bandwidth of over 100GB/s when working at 1.7GHz (the frequency of the flagship Power4+ model).
The processor core is curious in itself. First of all, the IBM Power4 decodes the external instruction set into internal micro-instructions, just as x86 CPUs do. The reason for this solution is obvious: there is too much software written for the previous CPU generations, and that software costs more than the hardware. IBM simply couldn’t abandon this software baggage. Thus, the same problem met the same solution.
The micro-architecture is designed to perform up to eight instructions per cycle – that’s an impressive degree of parallel execution.
Let’s now see what a single core looks like. The decoder translates external instructions into a set of elementary operations (ops), which are then packed into groups. One command typically unfolds into two or three ops. A group contains five slots: the first four are filled freely, while the fifth is reserved for a branch instruction. Commands travel down the pipeline and go for execution in such groups.
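As a rough illustration of this grouping scheme, here is a toy sketch in Python (my own simplification, not IBM’s actual dispatch logic; the op names are invented):

```python
# Toy sketch of Power4-style group formation (illustrative only).
# Decoded ops are packed into 5-slot groups: slots 0-3 take any
# non-branch op, slot 4 is reserved for a branch (or left empty
# if the group fills before a branch arrives).

def form_groups(ops):
    """ops: list of (name, is_branch) tuples, in program order."""
    groups, current = [], []
    for name, is_branch in ops:
        if is_branch:
            # A branch closes the group, occupying the fifth slot.
            current += [None] * (4 - len(current))  # pad unused slots
            current.append(name)
            groups.append(current)
            current = []
        else:
            current.append(name)
            if len(current) == 4:        # slots 0-3 full, slot 4 empty
                current.append(None)
                groups.append(current)
                current = []
    if current:                          # flush a partial final group
        current += [None] * (5 - len(current))
        groups.append(current)
    return groups

stream = [("add", False), ("load", False), ("mul", False),
          ("bc", True), ("store", False)]
for g in form_groups(stream):
    print(g)
```

Running this packs the first three ops plus the branch into one group (with one empty slot) and flushes the trailing store into a second, mostly empty group.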
Each core has two ALUs, two FPUs (with slightly different functions; for example, division is only performed by the FPU2), two load/store units, a branch execution unit, and a condition register unit – eight functional blocks overall. Out-of-order execution is supported: the Group Completion Table (an analog of the Reorder Buffer in Xeon processors) can hold up to 20 groups of elementary operations (i.e. about 100 ops), sending them to the execution units as they become ready. Overall, the processor can have as many as 215 instructions at various execution stages at any given moment.
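The quoted capacity is easy to sanity-check with a bit of arithmetic (the slot count comes from the five-command groups described above):

```python
# Back-of-the-envelope check of the in-flight figures quoted above.
gct_groups = 20          # Group Completion Table capacity, in groups
slots_per_group = 5      # 4 free slots + 1 branch slot
gct_ops = gct_groups * slots_per_group
print(gct_ops)           # 100 ops tracked by the GCT alone
```

The remaining in-flight instructions (up to the 215 total) sit in fetch, decode, and issue queues outside the GCT.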
Besides that, the core can launch a fused multiply-add operation each cycle on each FPU – an operation that occurs very often in real programs. Thus, we get four floating-point operations per cycle, an absolute record among all processors (well, nearly every characteristic of the IBM Power4 aspires to be record-breaking). It’s also possible to launch two floating-point additions or two multiplications at a time, which no other micro-architecture allows.
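Assuming the 1.7GHz flagship frequency quoted earlier, the implied peak throughput is simple arithmetic (the per-chip figure is my own extrapolation, not stated in the source):

```python
# Peak floating-point throughput implied by the figures above.
freq_ghz = 1.7           # flagship Power4+ frequency
fpus_per_core = 2
flops_per_fma = 2        # a fused multiply-add counts as two FP operations
per_core_per_cycle = fpus_per_core * flops_per_fma   # 4 FP ops per cycle
peak_gflops_core = per_core_per_cycle * freq_ghz     # per core
peak_gflops_chip = peak_gflops_core * 2              # two cores per die
print(peak_gflops_core, peak_gflops_chip)
```
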
The cache subsystem tries to match this record-setting trend. Each core has 32KB of two-way set-associative data cache (with an access latency of only 1 cycle!) and 64KB of two-way set-associative instruction cache. Both caches use 128-byte lines; each data-cache line is organized as four 32-byte sectors, which can be accessed independently (it’s possible to write into one sector and read from two others without conflicts). The instruction cache can read or write 32 bytes per cycle. The L2 cache is eight-way set-associative, with 128-byte lines and a capacity of 1536KB. Each processor also contains an L3 cache controller; the amount of L3 cache can reach 32MB per processor (i.e. per two cores). Finally, the processor integrates a memory controller with a bandwidth of 11GB/s; the maximum amount of memory supported by each processor is 16GB.
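The stated sizes, line length, and associativity pin down the cache geometry; here is a quick sanity check (my own arithmetic – the set counts are not given in the source):

```python
# Cache geometry implied by the figures above (illustrative arithmetic).
line_bytes = 128

d_cache_bytes = 32 * 1024        # 32KB data cache, two-way
d_lines = d_cache_bytes // line_bytes
d_sets = d_lines // 2            # two-way set-associative
print(d_lines, d_sets)           # 256 lines in 128 sets

l2_bytes = 1536 * 1024           # 1536KB L2, eight-way
l2_lines = l2_bytes // line_bytes
l2_sets = l2_lines // 8
print(l2_lines, l2_sets)         # 12288 lines in 1536 sets
```
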