The pipeline is 17 stages long, which is very long for a RISC processor. To keep such a long pipeline from sitting idle, the Power4 uses an advanced branch-prediction system. It is based on three tables, each of which holds up to 16K entries of branch history. The first table is a traditional branch history buffer that records how each individual branch behaved. The second table (also 16K entries) is a global, rather than local, branch table. Each entry in this table is associated with an 11-bit history vector that remembers which path was chosen on each of the last eleven instruction fetches from the L1 cache (each fetch brings in eight instructions from the L1-I) and whether the prediction was correct. The result of processing this information becomes the basis for the next branch prediction.
Let me stress the difference between the two methods: in the first variant we track each branch instruction on its own, without regard to the others; in the second variant we do the opposite, dealing with a sequence of recent results without tracking any particular instruction. That’s why the tables have two names: local and global. Now, there’s also a third table that records which method has been more accurate (has caused fewer mispredictions)! As a result, the Power4 can switch its branch prediction method within a few hundred CPU cycles.
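To make the three-table idea more tangible, here is a minimal sketch of a predictor of this kind in Python. It is a toy model, not IBM's implementation: the class and method names are hypothetical, the one-bit table entries and the XOR of the branch address with the history are my assumptions, and for simplicity the history records branch outcomes rather than instruction fetches from the L1 cache.

```python
class TournamentPredictor:
    """Toy model: a local table, a global table and a selector table."""

    def __init__(self, entries=16 * 1024, history_bits=11):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.local_table = [False] * entries   # last outcome per branch address
        self.global_table = [False] * entries  # last outcome per address/history mix
        self.selector = [False] * entries      # False: trust local, True: trust global
        self.history = 0                       # recent outcomes, newest in bit 0

    def _indices(self, pc):
        # Assumption: the global index mixes the address with the history by XOR.
        local_idx = pc % self.entries
        global_idx = (pc ^ self.history) % self.entries
        return local_idx, global_idx

    def predict(self, pc):
        """Return True if the branch at address pc is predicted taken."""
        local_idx, global_idx = self._indices(pc)
        if self.selector[global_idx]:
            return self.global_table[global_idx]
        return self.local_table[local_idx]

    def update(self, pc, taken):
        """Record the real outcome and retrain all three tables."""
        local_idx, global_idx = self._indices(pc)
        local_ok = self.local_table[local_idx] == taken
        global_ok = self.global_table[global_idx] == taken
        # The selector moves only when exactly one of the two methods was right.
        if local_ok != global_ok:
            self.selector[global_idx] = global_ok
        self.local_table[local_idx] = taken
        self.global_table[global_idx] = taken
        # Shift the newest outcome into the history, keeping history_bits bits.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


if __name__ == "__main__":
    predictor = TournamentPredictor()
    hits = 0
    for i in range(200):
        taken = i % 2 == 0  # a branch that alternates taken / not taken
        hits += predictor.predict(0x40) == taken
        predictor.update(0x40, taken)
    print(f"correct predictions: {hits} of 200")
```

A branch that strictly alternates is the worst case for a one-bit local table, which always repeats the previous outcome and is therefore always wrong. The global table tells the two situations apart by their history, so after a short warm-up the selector in this sketch settles on the global method.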
This Leviathan of a processor has several different system buses: a 32-bit I/O bus (working at one third of the CPU clock rate) and three 128-bit bidirectional buses (working at half of the CPU clock rate) for linking to other processors in the “assemblage”. A 64-bit bus for linking different assemblages crowns the structure. This abundance of buses and the advanced cache hierarchy serve one purpose – keeping the processors busy at all times. Thanks to the appropriate protocol, the processors can access each other’s caches (L2 and L3).
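The bus widths and clock ratios above translate into peak bandwidth with simple arithmetic. Here is a rough sketch; the function name is hypothetical, the 1.2 GHz clock is only an illustrative value chosen to keep the numbers round, and I assume one transfer per bus cycle in one direction, which the text above does not state.

```python
def peak_bandwidth(width_bits, cpu_clock_hz, clock_divisor):
    """Peak bytes per second, assuming one transfer per bus cycle."""
    bus_clock_hz = cpu_clock_hz // clock_divisor
    return (width_bits // 8) * bus_clock_hz


CPU_CLOCK_HZ = 1_200_000_000  # hypothetical clock rate, for illustration only

# 32-bit I/O bus at one third of the CPU clock
io_bus = peak_bandwidth(32, CPU_CLOCK_HZ, 3)
# one 128-bit inter-processor bus at half of the CPU clock
link_bus = peak_bandwidth(128, CPU_CLOCK_HZ, 2)

print(f"I/O bus:  {io_bus / 1e9:.1f} GB/s")
print(f"link bus: {link_bus / 1e9:.1f} GB/s")
```

At this assumed clock, each 128-bit link carries six times the traffic of the I/O bus, and there are three such links per processor.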
Let me explain what I mean by the word “assemblage”. IBM surprised us on the manufacturing side too, as they managed to put four processors into one package, with all their buses and 128MB of L3 cache. That’s a manufacturing achievement – the area of the assemblage is 13,225 sq. mm! By the way, each of the four processors links to the others through a “point-to-point” bus.
This technological miracle is of course priced accordingly – about $10,000. However, this is not a high price for a processor of this class.
The topology of systems that use this curious processor is also off the beaten track. IBM calls it Distributed Switch. Such a topology has no clear center. In fact, the links between the processor assemblages are closed, forming two parallel rings. Thus, each processor can be reached by several routes, which eliminates congestion on the bus. The maximum number of assemblages is 4, or 32 processors in a system. The high efficiency of this organization allows such a system to perform as fast as 64-128-way systems from other manufacturers.
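Why does a closed ring offer several routes? A small sketch shows it: between any two nodes there are always two paths, one in each direction around the ring. The function names are hypothetical, and I model only a single ring of four assemblages, not IBM's actual routing protocol.

```python
def ring_routes(src, dst, nodes=4):
    """Hop counts of the two routes from src to dst on a closed ring."""
    clockwise = (dst - src) % nodes
    counter_clockwise = (src - dst) % nodes
    return clockwise, counter_clockwise


def shortest_hops(src, dst, nodes=4):
    """Hop count of the shorter of the two routes."""
    return min(ring_routes(src, dst, nodes))


if __name__ == "__main__":
    for dst in range(1, 4):
        print(f"0 -> {dst}: routes {ring_routes(0, dst)}, "
              f"shortest {shortest_hops(0, dst)}")
```

With four nodes, no destination is more than two hops away, and if one direction is busy the other one still reaches the same node.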
All that remains is to see what the Power4 shows in SPEC CPU 2000. Note, though, that this test measures a single processor and does not reflect the nuances of the system organization. In other words, the total performance in real applications will be higher thanks to the remarkable system organization.
So, this processor scores 1077 points in SPEC_int base 2000. This is an average result, but it’s still more than any other RISC processor has scored (save for the Itanium, which is not a pure RISC).
The result is better in SPEC_fp base 2000 – 1598 points. In fact, only the Itanium managed to outperform it. The Power4 is good at this kind of test.
Once again, this test cannot capitalize on the main advantages of Power4-based systems. In real applications (and in real systems), the Power4 is the world’s fastest CPU in the number of transactions per processor.