by Victor Kartunov
01/27/2004 | 11:36 AM
While the proponents of Intel and AMD processors clash head-on and comb through the specifications for fresh arguments, other manufacturers manage to sell computers of quite another kind, with quite different technical characteristics. So there is a natural question: what kind of machines are they? Are they better or worse than those we usually use? Do they have any unique features? Why do people buy Apple Macintosh computers (hereafter referred to as Macs) at all?
So, this is one of the first times in our history that we at X-Bit Labs put aside the glorious PC to take a look at the Mac, or, to be more precise, at the microprocessor used in the Mac. I think this article should be strongly recommended to anyone who has listened to the prophets of the Mac faith (yeah, they are not far from religious zealots sometimes). Some of them say that Macs are just absolutely different, and others claim the Mac has an unbelievably fast architecture that competitors would need a decade just to catch up with.
However, I have yet to see a Mac processor in the list of the fastest processors according to SPEC CPU 2000. This means the microprocessors used in Macs are not that fast as far as raw performance is concerned. So, where do these rumors about a “fundamentally different architecture” come from? Why is there an opinion that Macs are best suited to industries like printing? Do Apple’s advertising claims about leadership in graphics processing have any grounds at all?
There was also one more reason for my decision to write this article. We don’t often stop and take a look around; we don’t readily see what other things, besides the x86 architecture in all its variations, the computer industry has developed for us. I think it is exciting to take a closer look at microprocessors that are now a “parallel world” rather than competitors to the x86 models. The Mac is another world, not a direct competitor in any way.
Over a year ago, the new processor for the Mac, the PowerPC970, was announced. Now that the initial excitement has subsided, we can tell whether the release was successful. Was it the kind of breakthrough Apple is known to make with every new release of the Mac, or does the new Mac just continue the tradition of its predecessors?
As a starting point, let us recall what a typical computer is made of, since the functional units of the PC (x86) and of the Mac are basically the same. Whatever their names, we always have the following stuff inside the system case:
So, let’s discuss each of the items one by one. Item 1 is undoubtedly Apple’s calling card. Impeccable design has always been a distinctive feature of the Mac. Remember the iMac with its transparent case that let you see the insides? The industry immediately seized on the idea, although it seems rather trivial and boring now. Back then, it was a small revolution. The machines from Apple have always been elegant, coming in beautiful cases. I would say that Apple’s design concepts helped a lot to improve the exterior of the PC: you’ve got more options now besides the standard putty-colored sarcophagus of a system case.
Skipping over Items 2, 3 and 4 (we will discuss them at length shortly, since they are going to become the main topic of this article), let’s run through the rest of them.
Item 5 is graphics cards. In the past, graphics cards on the Mac were quite different from what the PC had: the two architectures had nothing in common with regard to graphics. As time went by, the PC got much more skillful at graphics, while the Mac found itself lagging behind. Of course, the drive for more graphics power on the PC was mostly caused by the ever-growing popularity of 3D games that wouldn’t run smoothly on a slow graphics card. This drive stimulated demand for faster graphics cards, which in turn made the manufacturers develop even faster models. When the PCI bus and, later, AGP came to the Mac, it practically caught up with the PC: the fastest graphics card models are now available for the Mac, or at least the gap is negligibly small. Summing up, I would say that the Mac is only slightly behind the PC in graphics hardware.
Hard disk drives… There was a time when Macs were built around the SCSI interface in all its revisions. SCSI drives used to be much faster than their humble IDE counterparts, and this fact contributed a lot to the myth about the sky-high performance of the Mac. In the 1980s, it was not a myth but reality (back then, the Mac was overall the PC’s superior in hardware, that’s true). The market lives by its own rules, though. Open standards and fierce competition pushed the price of the PC down, and the price of the Mac, which at first was comparable to that of the PC, started to feel rather steep. The price gap only grew over time. Demand for the Mac declined, fewer machines were sold, and their manufacturing cost grew accordingly.
The process threatened to become irreversible, and many analysts and users considered Apple dead or nearly dead, but the corporation managed to stop right at the brink of the abyss. They changed the Mac, abandoning many of its exclusive or distinctive features such as SCSI devices (including hard disk drives), proprietary AppleTalk standards, and exclusive buses. Instead, they brought in industry standards such as EIDE, PCI and AGP. The company took this step while in dire financial straits, and it worked.
In other words, half of the peripherals are now basically the same in the Mac and the PC. By doing this, Apple improved the most important consumer factor: the price of the Mac. As for the image of an “exclusive, different solution”, it can be shaped by other means, as Apple later proved.
Another key feature of the Mac is its operating system. I am inclined to think it was one of the factors that drew the crowd of fans and consumers under the banner of the Mac. Well, Apple was the first company to bring an OS with a graphical interface to the masses (the Xerox research center invented the interface and influenced the attempts that followed, but the results of its work never reached the masses). And Apple has a long record of such key innovations. Of course, it is hard to evaluate an operating system, so I will skip any marks for efficiency, performance or user-friendliness and other characteristics that are hard to estimate, since they depend heavily on the user’s particular needs and habits. Let’s just state that there is an operating system and that it is one of the main distinctions between the Mac and the PC.
So, we are left with our favorite courses: microprocessor, system bus, memory. The stuff comes under the capacious name “Platform”. We will talk about platforms today.
First of all, let me remind you that Apple touted the new microprocessor for the Mac as something revolutionary: 64 bits for desktop computing! Apple was a little sly here (although it meant no harm). It was first only because the Athlon 64 from AMD had been postponed or, to be more exact, because AMD released its 64-bit processor into the server market first. That’s how Apple took the title of the pioneer of “64 bits for the desktop”.
Of course, the way of a pioneer has never been easy. As it turned out, glitches and compatibility problems are not the prerogative of Microsoft alone. Moreover, the famed “bug- and problem-free” platform from Apple was more of a myth, originating from the fact that Apple profited from a much narrower and better-ordered hardware platform. It is easier to develop and debug an operating system for a small, fixed set of equipment than an OS that must run on any combination of hardware parts. While the Mac platform was more or less uniform, it had no evident problems, but as soon as a platform appeared that was quite different from the previous Macs (because of the 64 bits, for example), typical transition problems arose. There is a minute but characteristic fact to confirm the point: the Mac OS X release that was going to be a triumph quickly and quietly turned into versions 10.1 and 10.2, and the current edition is 10.3. There is no triumph at all: many popular programs have not yet been ported to the new OS, old applications running in compatibility mode are not very stable either, and the performance of the platform is not as high as Apple had promised. In other words, Apple couldn’t make the transition as smooth and painless as it had hoped. Of course, this is just a temporary situation: the applications will be ported, the OS will be debugged, and the errors will be corrected.
Without getting into the ideological argument about the need for 64-bit desktop computing, I will just list a few useful things this technology can provide: more virtual memory, a larger address space for physical memory, and benefits to specific applications. For example, using 64-bit operands in cryptography often means fewer operations (sometimes several times fewer) and therefore a shorter calculation time.
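To make the cryptography point concrete, here is a minimal Python sketch. It assumes the simplest textbook ("schoolbook") multiplication of two 1024-bit numbers split into machine words; the function names and the operand values are made up for illustration, and real cryptographic libraries use more refined algorithms.

```python
def to_limbs(value, word_bits, count):
    """Split a non-negative integer into `count` machine words, lowest word first."""
    mask = (1 << word_bits) - 1
    return [(value >> (i * word_bits)) & mask for i in range(count)]


def schoolbook_multiply(a, b, operand_bits, word_bits):
    """Multiply two operand_bits-wide integers word by word.

    Returns the product and the number of word-by-word multiplications used.
    """
    count = -(-operand_bits // word_bits)  # ceiling division
    a_words = to_limbs(a, word_bits, count)
    b_words = to_limbs(b, word_bits, count)
    product = 0
    multiplies = 0
    for i, x in enumerate(a_words):
        for j, y in enumerate(b_words):
            # Each partial product is shifted to the position of its words.
            product += (x * y) << ((i + j) * word_bits)
            multiplies += 1
    return product, multiplies


# Hypothetical 1024-bit operands, chosen only for the demonstration.
a = (1 << 1023) + 12345
b = (1 << 1000) + 67890
product32, multiplies32 = schoolbook_multiply(a, b, 1024, 32)
product64, multiplies64 = schoolbook_multiply(a, b, 1024, 64)
print(multiplies32, multiplies64)  # 1024 256
```

With this particular algorithm, doubling the word size cuts the number of word multiplications by a factor of four. Other algorithms give other ratios, so treat the figure as an illustration rather than a benchmark.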
We should also keep in mind that the amount of system memory in modern computers is slowly approaching the address space limit of 32-bit processors (4GB with flat memory addressing). In other words, it is better to make sure there is sufficient bit depth in advance than to make hasty and frantic decisions later. My own opinion on the 64-bit issue is simple: why not? It cannot do you any harm, but can sometimes help you a lot. So, users don’t actually have any complaints about 64 bits, but it transpires that someone else does, as I will tell you shortly.
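The 4GB figure follows directly from the address width. A small sketch of the arithmetic, assuming byte-addressed memory and a flat address (the helper name is hypothetical):

```python
GIB = 2 ** 30  # bytes in one binary gigabyte


def addressable_bytes(address_bits):
    """Bytes reachable with a flat address of the given width."""
    return 2 ** address_bits


print(addressable_bytes(32) // GIB)  # 4
print(addressable_bytes(64) // GIB)  # 17179869184
```

Note that this is the architectural limit of a 64-bit address; actual processors usually implement fewer physical address lines than that.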
Thus, most of today’s advanced processors (counted by brands, not by sales volumes or quantities) are actually 64-bit ones. Nearly all RISC processors (MIPS, Alpha, HP PA-RISC), post-RISC processors (Itanium and, with certain reservations, the processors from Transmeta) and some x86 processors (Opteron and Athlon 64) fit into this category. In fact, only the Pentium 4 remains the stronghold of 32-bitness in the modern processor world (the Xeon can be considered a 36-bit processor).
Against this background, it is all the more interesting to watch the situation around the SPEC CPU 2004 tests. Some informed sources say that the SPEC committee has practically agreed upon the set of algorithms to include in this test. There are over 40 subtests that require about 2GB (!) of system memory and about a day to perform one pass. The testing procedure will become more demanding, as you see. At the same time, there will be no algorithms in SPEC CPU 2004 that gain a significant performance boost from 64 bits! That’s a real surprise. They say such algorithms are of less relevance, although right now we see 64-bit processors emerging for the desktop. Right now, nearly every processor manufacturer offers 64-bit models. Right now, only one processor maker does not offer 64-bit processors in its solutions for the mass market. I guess you understand who I am talking about. Yes, it is sad, but Intel took an understandable, if unpleasant, position: “We don’t like this game”. Under the company’s pressure (and because of its veto right), the SPEC committee has approved a set of algorithms that provoke nothing save mild amusement.
Of course, Intel can stand up for its own interests as one of the financial sources (perhaps the biggest one) of the committee, but I had hoped that it wouldn’t go beyond a certain limit. The hopes were misplaced. The situation around SPEC CPU 2004 resembles very much (I would even say “too much”) the story of the “optimization” of SYSMark 2002: after that optimization the benchmark fell in love with one platform and became too… controversial for the others. I do fear the SPEC tests may repeat the same story. Until now, they have been highly respectable cross-platform performance tests that specialists could rely upon. If this situation changes, there will be practically no benchmark to replace SPEC CPU: no other benchmark can boast the same unanimous recognition and such a big results database.
I do hope nothing like that will happen and the SPEC committee never publishes the proposed set of tests. Otherwise, the SPEC tests would become a marketing tool rather than an adequate reflection of the efficiency of the algorithms used in modern programming. I don’t think we need such marketing methods – they are too dirty. Moreover, good products wouldn’t really need such “support”.
Let’s get back to the platform. Next comes the microprocessor, the brain of every computer. So, what innovations did the current IBM PowerPC970 processor bring to the user compared to its chronological predecessor, the Motorola G4+? By the way, note that the main supplier of processors for Apple has changed – this is a rare thing in the business world. How good is this chip? How does it rank among the newest representatives of the x86 architecture?
The basic characteristics like die size, frequency and heat dissipation come first. The following table compares the PowerPC970 to other modern processors:
Processors compared: PowerPC 970 1.8GHz; Pentium 4 2.8GHz; Opteron 144 1.8GHz.
Heat dissipation (Opteron 144 1.8GHz): up to 89 Watts*
* - This is the maximum heat dissipation for the entire platform for today. I couldn’t find the heat dissipation specs for the processor alone. Some sources say the Opteron 144 1.8GHz has a maximum heat dissipation of 70W.
It is clear that the PowerPC970 is much closer to the modern x86 processors than to its predecessor, and even surpasses them in some respects, such as heat dissipation. The excellent value of this parameter is partially due to the lower operating voltage (IBM owns one of the best technological bases in the industry overall).
The following table shows you the number and organization of caches in the processors:
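PowerPC 970 1.8GHz: L1 instruction cache 64KB, direct mapped (!); L1 data cache 32KB, 2-way assoc.; L2 cache 512KB, 8-way assoc.
Pentium 4 2.8GHz: Trace cache of 12,000 micro-ops*; L1 data cache 8KB, 4-way assoc.; L2 cache 512KB, 8-way assoc.
Opteron 144 1.8GHz: L1 instruction cache 64KB, 2-way assoc.; L1 data cache 64KB, 2-way assoc.; L2 cache 1024KB, 16-way assoc.
(row label missing): 32KB, 8-way assoc.; 32KB, 8-way assoc.; 256KB, 8-way assoc.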
* - As you know, Intel doesn’t publish the size of the Trace cache in kilobytes. Making assumptions about the size of a micro-op (micro-architecture op-code) gives a rough estimate of 80KB-120KB for the physical size of this cache! On the other hand, fewer instructions can be stored in a cache of this size, since the Trace cache holds them in decoded form, in which instructions tend to “swell up”. I guess this is the reason for Intel to forget that a cache can be measured in kilobytes rather than in micro-ops. Or maybe they just try to be correct and not to compare apples with oranges. Whatever the case, we can very roughly estimate the effective capacity of this cache from the following facts: the average size of an x86 instruction is 3-4 bytes (let’s take 4 bytes), and most x86 instructions are decoded into two micro-ops. That is, 12,000 micro-ops correspond to about 6,000 x86 instructions, and that many instructions in their standard form would fit into a conventional 24KB cache. Once again, this estimate is very rough. Anyway, I think it is the upper bound for the effective size of the Trace cache: it can’t possibly be any bigger.
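The footnote’s estimate is plain arithmetic, and it can be written out as a short Python sketch. The constants are the rough assumptions from the footnote itself (two micro-ops per instruction, 4 bytes per instruction); the function name is hypothetical.

```python
TRACE_CACHE_UOPS = 12_000      # capacity quoted in micro-ops
UOPS_PER_X86_INSTRUCTION = 2   # rough assumption from the footnote
BYTES_PER_X86_INSTRUCTION = 4  # upper end of the 3-4 byte average


def equivalent_code_size(uops, uops_per_instruction, bytes_per_instruction):
    """Return (x86 instructions, bytes of ordinary x86 code) that the micro-ops represent."""
    instructions = uops // uops_per_instruction
    return instructions, instructions * bytes_per_instruction


instructions, code_bytes = equivalent_code_size(
    TRACE_CACHE_UOPS, UOPS_PER_X86_INSTRUCTION, BYTES_PER_X86_INSTRUCTION
)
print(instructions)  # 6000
print(code_bytes)    # 24000
```

So the 12,000 micro-ops stand for roughly 24,000 bytes of ordinary code, which is where the 24KB figure comes from. Change either assumption and the estimate moves with it.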
Curiously, the instruction cache in this architecture is bigger than the data cache, but this is often the case with RISC systems (and, for example, with the NexGen and K5 processors, which were RISC internally). The simplicity of the instructions (which the engineers highly approve of) means that the code is much larger than a functionally comparable chunk of x86 code.
It is also curious that the instruction cache uses the simplest structure, direct mapping, with each cache line 128 bytes long and made up of 4 sectors of 32 bytes each. One sector (32 bytes) can be either written into or read from the cache per clock cycle. The L1 Data cache is more interesting: two 8-byte segments can be read through two ports while another 8 bytes are written through a third. All of this takes one clock cycle, without any conflicts.
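For readers who want to see what “direct mapped” means in practice, here is a minimal sketch using the sizes quoted above (64KB cache, 128-byte lines, 32-byte sectors). The way the address is split is the textbook scheme and is an assumption; I have no data on which exact address bits the PowerPC970 uses.

```python
CACHE_BYTES = 64 * 1024
LINE_BYTES = 128
SECTOR_BYTES = 32
NUM_LINES = CACHE_BYTES // LINE_BYTES  # 512 lines


def split_address(address):
    """Split an address into (tag, line index, sector, byte offset in the line)."""
    offset = address % LINE_BYTES
    line_number = address // LINE_BYTES
    index = line_number % NUM_LINES   # the only line this address may occupy
    tag = line_number // NUM_LINES    # tells apart addresses that share that line
    sector = offset // SECTOR_BYTES
    return tag, index, sector, offset


first = 0x12345
second = first + CACHE_BYTES  # exactly one cache size further on
print(split_address(first))   # (1, 70, 2, 69)
print(split_address(second))  # (2, 70, 2, 69)
```

The two addresses map to the same line (index 70) with different tags, so in a direct-mapped cache they evict each other. A set-associative cache would let them coexist, which is why direct mapping is the simplest but also the most conflict-prone organization.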
The L2 cache also consists of 128-byte lines, and is updated using the industry-standard method – Pseudo LRU, 7 bit.
As for the direct mapping of the L1 Instruction Cache, I can’t find the reason why the engineers chose it. The engineering team at IBM may have thought this structure would provide an acceptable cache hit rate. Or they may have just been simplifying the circuitry of the die. The forebear’s die was not small (it was gigantic, to tell you the truth): it is no secret that a processor from an absolutely different price range, the Power4, was the base for developing the PowerPC970. Yes, it was that core that, after some redesign, turned into a mainstream processor.
The idea was sensible. The Power4 was one of the performance leaders when it came out, and it remains quite competitive today. So it was much easier to use the available intellectual property and adapt it to the requirements of mass production (the die size of the Power4 is 417sq.mm; the manufacturing cost of such a monster would be unacceptable in mass production). They managed to reduce the cost considerably: a Power4 processor (which also includes the L3 cache, though) costs about $10,000 to manufacture, while the PowerPC970 costs a few hundred dollars (I couldn’t find the exact number, only the cost of the whole platform). The following illustration is taken from an IBM presentation; it shows that these two processors are truly close relatives:
So, they sacrificed the second core for the sake of clocking the remaining one at a higher frequency (the finer production technology and the longer pipeline also helped to reach higher frequencies). They also added SIMD instructions to the remaining core. We will have a chance to discuss this set of features later; there is really a lot to discuss.
Right now, let’s get back to the microprocessor. What does it have inside?
First, let’s discuss a unit that every modern processor features, irrespective of its architecture: the Branch Prediction Unit (BPU). This unit is necessary because every long program contains conditional branches of various kinds. There is another fact about modern processors: all of them use pipelines as a means of raising the clock frequency and of increasing the share of the processor’s transistors that work simultaneously. In other words, each program instruction moves along a pipeline, acquiring on the way data read from memory, results of other instructions, and various additional properties and pointers. As a result, at any given moment the entire pipeline is usually busy processing different instructions at its various stages. This would be a perfect setup if it were not for those unfortunate branches. If the program flow jumps to another stretch of code (or a jump is expected), the pipeline’s smooth operation is interrupted and the resulting performance of the microprocessor drops greatly. The engineers came up with a solution: a unit whose main purpose is to “guess” the most probable direction of the jump.
If the “guess” is right, we get our reward: continuous operation of the pipeline with the maximum possible workload. If the “guess” is wrong, we get our punishment: the pipeline stalls, the buffers are flushed, and the right program branch is loaded. Of course, the penalty for a wrong guess (in processor clock cycles) may be greater than the gain from a correct one. What saves the performance is that correct guesses are in the overwhelming majority. As a rule, developers do their best to bring their processor as close to 100% correct predictions as possible in the majority of real-world algorithms. Most modern processors have a prediction accuracy of about 90% or higher! So overall this method pays off, although sometimes it is still necessary to clear the pipeline and fill it up again. On the other hand, improving the prediction accuracy allows a longer pipeline, which favorably affects the CPU frequency (all other factors being equal).
Modern compilers use the processor’s branch prediction as a reserve for improving program performance. They shape the code in such a way as to increase the branch prediction accuracy (compiler developers know well how to use this feature of the processors they write the compiler for). By the way, note that the PowerPC970 has a longer pipeline than its progenitor, the Power4 (16 stages against 12 stages; SIMD/FPU instructions may take as many as 25 stages!). As I have mentioned above, this was another trick to increase the frequency of the PowerPC970 (and to close the gap, as Mac processors used to lag behind modern x86 processors in frequency).
But let’s return to the branch prediction unit. Let’s first see what the flagship models from AMD and Intel have in this area. The Pentium 4 uses a branch history table (BHT) with 4096 (4K) entries; its algorithm is based on the history of branching. In other words, branching statistics are accumulated, and the most probable branch direction (according to the accumulated statistics) gets the advantage.
The latest microprocessor from AMD, the Opteron (and the Athlon 64), uses a buffer of 16K entries (!). That is, the branch history table is four times bigger than that of the Pentium 4 (and likewise four times bigger than that of AMD’s previous processor, the Athlon XP). This helped to increase the prediction accuracy considerably (as far as I know, it is over 95% now). The improvement from 90% to 95% doesn’t seem significant, but look at the situation from the opposite point of view. 90% correct predictions means 10% wrong predictions, and 95% correct predictions means 5% errors. Halving the number of errors is a significant thing, don’t you think?
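The same reasoning in numbers, as a small Python sketch. The 16-cycle penalty is a hypothetical value picked only for illustration, not a measured figure for any of these processors, and the function names are mine.

```python
def mispredictions(branches, accuracy_percent):
    """Number of wrong guesses among `branches` predicted branches."""
    return branches * (100 - accuracy_percent) // 100


def average_stall_cycles(accuracy_percent, penalty_cycles):
    """Average number of cycles lost per branch because of wrong guesses."""
    return (100 - accuracy_percent) * penalty_cycles / 100


ASSUMED_PENALTY = 16  # hypothetical misprediction penalty, in clock cycles

print(mispredictions(1000, 90), mispredictions(1000, 95))  # 100 50
print(average_stall_cycles(90, ASSUMED_PENALTY))           # 1.6
print(average_stall_cycles(95, ASSUMED_PENALTY))           # 0.8
```

Whatever the real penalty is, the lost cycles scale with the error rate, not with the accuracy, so going from 90% to 95% halves the stall time.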
How good are the current and previous Mac microprocessors at this kind of fortune-telling? Let’s start with the older one, the G4+. It was humble enough: a branch history buffer of 2K entries and a BTB of 128 entries. On the other hand, the G4+ has a pipeline of only 7 stages, so it doesn’t need a powerful branch prediction unit as desperately as modern processors do: it suffers a much milder penalty for each incorrect guess. The PowerPC970 does it in another, much more interesting way.
First of all, let’s recall that it is a relative of the high-end Power4 processor. The PowerPC970 has a sophisticated branch prediction mechanism. It is probably the most efficient one, too; I only regret that IBM never revealed its estimates of the prediction accuracy. At least, I couldn’t find them anywhere. The PowerPC970 is interesting above all for its system of two branch prediction units that work constantly and, moreover, simultaneously.
The first unit uses a traditional branch history buffer of 16K entries. The entries in the table record the absence/presence of a branch and the correctness of the prediction. The next prediction is made by analyzing this information.
Let me explain it in more detail, as we will need this information later. Here is how the first table works. Ideally, a 1-bit element in the table should correspond to each branch instruction. The number of the element can be inferred from the instruction address, but since the size of the table is limited, the element is selected by some bits of the 64-bit (32-bit) address rather than by the entire address. That is, 14 bits are extracted for a table of 2^14 elements. Of course, we may run into a situation where one and the same element corresponds to several branch instructions, but that’s not important: these instructions are likely to be executed at moments pretty far apart in time, so overall this “sharing” will go unnoticed. A 1-bit element can only store whether or not the jump in the program flow happened last time. When the jump happens only once in 10 passes, for example, such a predictor is wrong only twice per 10 passes (on the jump itself and on the pass right after it). If the jump takes place every second pass, it is always wrong.
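The behaviour of such a 1-bit table is easy to check with a minimal Python simulation. Which 14 address bits are taken is my assumption (the two lowest bits are dropped because PowerPC instructions are 4 bytes long); the function names and the sample address are hypothetical.

```python
TABLE_BITS = 14
TABLE_SIZE = 1 << TABLE_BITS  # 2^14 one-bit elements


def table_index(address):
    """Pick 14 address bits as the element number (bit choice is an assumption)."""
    return (address >> 2) & (TABLE_SIZE - 1)


def count_mispredictions(address, outcomes):
    """Predict 'same as last time' for one branch and count the wrong guesses."""
    table = [False] * TABLE_SIZE  # False = jump not taken last time
    index = table_index(address)
    wrong = 0
    for taken in outcomes:
        if table[index] != taken:
            wrong += 1
        table[index] = taken  # remember only the latest outcome
    return wrong


once_in_ten = ([False] * 9 + [True]) * 10  # 100 passes, jump on every 10th
alternating = [True, False] * 50           # 100 passes, jump on every 2nd

print(count_mispredictions(0x1000, once_in_ten))  # 19
print(count_mispredictions(0x1000, alternating))  # 100
```

The first pattern costs about two wrong guesses per ten passes (19 in total, since the very last jump has no pass after it), while the alternating pattern defeats the 1-bit element completely.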
The second scheme uses a table of the same size, 16K entries. However, the second branch table is global, while the first one is local. Besides that, the second table is accessed with the help of an 11-bit counter (there is only one such counter for the whole processor). This counter records the branch direction chosen in each of the previous 11 times an instruction group was fetched from the L1 cache (the fetch unit loads 8 instructions at once from the L1 instruction cache) and also remembers whether the prediction was correct. This information helps to predict the outcome of the next branch.
Let me now explain the operation of the second table. Let’s see how an arbitrary code sequence is executed. Each time we meet a branch instruction, we just note what happened: whether the jump was taken or not. We don’t need the instruction address for now, only the result. As an outcome, we get a sequence of “yes” and “no” answers. Now we can try guessing the next branch. Let’s take the last 8 results. There are 2^8=256 possible combinations, so we need a special bit array of 256 elements. When we receive a new result, we write it into the array element that corresponds to our history. For example, if we have a “history” like “yes-no-yes-no-no-no-no-no” and we find out during code execution that the next jump is not taken, we put 0 (“no”) into array element number 5 (5 = 00000101 in binary, with the first result of the history in the lowest bit). Next time we meet the same combination of results, we use the record. It is easy to see that this table adjusts itself nicely to short but exotic branch sequences.
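Here is a minimal Python sketch of the 8-result scheme just described. It assumes the most recent result is kept in the lowest bit of the history, which is one reading that reproduces element number 5 from the example; the function names are hypothetical.

```python
HISTORY_BITS = 8
HISTORY_MASK = (1 << HISTORY_BITS) - 1


def history_index(results):
    """Turn a list of yes/no results into an element number (results[0] is bit 0)."""
    index = 0
    for position, taken in enumerate(results):
        if taken:
            index |= 1 << position
    return index


def run_global_predictor(outcomes):
    """Predict every branch from the last 8 results and count the wrong guesses."""
    table = [False] * (1 << HISTORY_BITS)  # 256 one-bit elements
    history = 0
    wrong = 0
    for taken in outcomes:
        if table[history] != taken:
            wrong += 1
        table[history] = taken  # record what followed this history
        history = ((history << 1) | int(taken)) & HISTORY_MASK
    return wrong


example = [True, False, True, False, False, False, False, False]
print(history_index(example))                      # 5
print(run_global_predictor([True, False] * 50))    # 5
```

On the “every second pass” pattern the history table makes only 5 wrong guesses while it warms up and none afterwards, whereas a single 1-bit element per branch is wrong every time on the same pattern.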
So, we have seen the main difference between this prediction method and the first one. In the first variant, we follow each branch instruction independently of its connection to the others. The second variant does the opposite: we make predictions based on a sequence of results, without tying it to any definite instruction. That’s what the names of the tables, local and global, stand for.
We can improve the method further: we can refine the latter idea to avoid treating all instructions the same way. To do this, we first need a bigger table. Second, we shouldn’t use the global counter alone to access an element of the table, but rather the counter together with, say, the address of the current instruction. Thus, the Athlon XP extracts 4 address bits, appends the 8 bits of the counter (the data on the last 8 branches) and gets the element index in the global history bimodal counter table (GHBC). The PowerPC970 takes an 11-bit counter, combines it with the address bits (by a logical operation rather than by simple concatenation) and gets a 14-bit index. By the way, there is one more important difference: the Athlon XP (Athlon 64) has a table with 2 bits per entry, not 1 bit as in the PowerPC970. 2 bits give more flexibility, as the entry can be not only “yes” or “no”, but also “perhaps yes” or “perhaps no”. But the PowerPC970 has three tables!
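Both ideas from this paragraph fit into a short Python sketch. Note the assumptions: the article only says the PowerPC970 uses “a logical operation”, so the XOR below is my guess, as is the choice of address bits; the 2-bit entry is the usual saturating counter, and all names are hypothetical.

```python
INDEX_BITS = 14
HISTORY_BITS = 11


def combined_index(address, history):
    """Mix 14 address bits with the 11-bit history (XOR is an assumption)."""
    address_bits = (address >> 2) & ((1 << INDEX_BITS) - 1)
    return address_bits ^ (history & ((1 << HISTORY_BITS) - 1))


# 2-bit entry: 0 = "no", 1 = "perhaps no", 2 = "perhaps yes", 3 = "yes"
def predict_two_bit(counter):
    return counter >= 2


def update_two_bit(counter, taken):
    """Move one step towards "yes" or "no" without leaving the 0..3 range."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


def two_bit_mispredictions(outcomes, counter=0):
    wrong = 0
    for taken in outcomes:
        if predict_two_bit(counter) != taken:
            wrong += 1
        counter = update_two_bit(counter, taken)
    return wrong


loop_branch = ([True] * 9 + [False]) * 10  # taken 9 times, then the loop exits

print(combined_index(0x1000, 0b101))        # 1029
print(two_bit_mispredictions(loop_branch))  # 12
```

After two wrong guesses during warm-up, the 2-bit entry misses only once per loop exit: a single “no” moves it from “yes” to “perhaps yes”, so the next pass is still predicted correctly. A 1-bit entry would miss twice per loop on the same pattern.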
The gist of the branch prediction unit of the PowerPC970 is that there is a third 16K-entry buffer (!) that tracks which of the two prediction systems has proven more efficient over a given period of time, that is, which of them has the lower percentage of wrong predictions. As a result, the processor can adapt itself to the environment in a short time and switch to the prediction algorithm that is more efficient under the current conditions! It’s really sad that this most exciting solution is rarely mentioned in the press. The information above was collected from the datasheets on the Power4; it seems the PowerPC970 inherited this system from its predecessor.
My verdict: the PowerPC970 probably has the best branch prediction unit among all modern processors. And we are looking forward to the Power5, where this unit should be enhanced further!
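The selector idea can be sketched in a few lines of Python. This is a deliberately reduced model under my own assumptions: one branch, one local 1-bit element, an 8-result global table and a single 1-bit selector entry instead of three 16K tables. It shows the principle, not IBM’s actual update rules, and all names are hypothetical.

```python
def update_selector(prefer_global, local_guess, global_guess, taken):
    """Switch preference only when exactly one of the two predictors was right."""
    local_ok = local_guess == taken
    global_ok = global_guess == taken
    if local_ok == global_ok:
        return prefer_global  # both right or both wrong: nothing learned
    return global_ok


def run_tournament(outcomes, history_bits=8):
    """Return (wrong guesses, final selector state) for one branch."""
    mask = (1 << history_bits) - 1
    local_bit = False                           # local predictor: last outcome
    global_table = [False] * (1 << history_bits)
    history = 0
    prefer_global = False                       # selector starts on the local side
    wrong = 0
    for taken in outcomes:
        local_guess = local_bit
        global_guess = global_table[history]
        guess = global_guess if prefer_global else local_guess
        if guess != taken:
            wrong += 1
        prefer_global = update_selector(prefer_global, local_guess, global_guess, taken)
        local_bit = taken
        global_table[history] = taken
        history = ((history << 1) | int(taken)) & mask
    return wrong, prefer_global


print(run_tournament([True, False] * 50))  # (6, True)
```

On the alternating pattern the selector moves to the global table after the second branch and stays there, so the combined predictor makes 6 wrong guesses out of 100 instead of being wrong on every pass, as the local element alone would be.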
Here is a picture that illustrates the internal structure of the processor:
There are a few things you can notice in the picture: the processor has two execution pipelines, two FPUs, and a unit for processing SIMD instructions called AltiVec. But there is one thing that catches the eye: the instruction processing stage called “Decode”. But why Decode? We have been taught that RISC means a Reduced Instruction Set Computer: simple instructions are ready to be fed directly to the processor. Why does the processor transform them here? Why does it need to turn those wonderful RISC instructions into anything else? How does it all work?
It is all logical. The PowerPC970 does transform RISC instructions into an internal set of instructions (this is quite a mysterious thing, and I found no information about it except the fact that it exists). This internal instruction set is completely different from the external one. This feature makes the processor look very similar to x86 processors, which transform external, irregular x86 instructions of variable length into internal commands (or command sequences) of constant length. We are used to seeing that in x86 processors, but it is quite strange to see it in a RISC one. A set of simple and short instructions has always been considered the key advantage of the RISC architecture. Thanks to that simplicity, the execution units were relatively simple, and many commands were executed directly, without any transformation or translation. But it is one thing to execute a few dozen instructions directly, and quite another when the number of supported instructions approaches two hundred (this processor supports the AltiVec set, which consists of as many as 162 instructions). So I think this measure was adequate: it is easier to add a decoding unit to the processor than to try executing several hundred instructions directly.
The problem is not even the AltiVec set, but rather that IBM (with its clients) has accumulated a lot of software that is too expensive to just throw away. The older instruction set doesn’t suit modern programming, and the standard methods of increasing processor performance are practically exhausted. So, similar problems (the need to protect the money invested in older software along with the need for higher performance) met similar solutions on the Mac and on the x86: the introduction of a decoding unit.
History once again shows its ironic nature. The instruction set for the x86 and the system of decoders (often referred to as “crutches for the x86”) have always been the laughing stock for the proponents of the RISC architecture. In fact, the translation of x86 instructions into RISC-like commands inside the processor was considered a victory of the RISC camp. They used to say that a real processor didn’t need to transform instructions into a “digestible” form, while “crippled” x86 processors had to translate their external instruction set into the “right” RISC-like internal set. I wonder what they are going to say now. :)
The translation itself is performed in a way similar to the algorithm of the decoder in Athlon processor. All instructions fall into two categories. The first group, called cracked instructions, includes instructions that can be translated into two simplest ones (IBM calls them IOPs). The second group, millicoded instructions, is decoded into more than two IOPs. Every clock cycle PowerPC970 microprocessor can send a group of five IOPs into its execution units. Most micro-operations occupy slots 0 to 3 in the group. Slot 4 is reserved for branch prediction operations. If there are no operations to occupy the preceding slots, or they are too few, the decoder inserts the so-called NOPs (No Operation). The NOP is an instruction to literally “do nothing”.
Besides that, there are certain limitations connected with the positions of micro-operations inside the group. After a cracked instruction is translated into two IOPs, these two IOPs must be both included into the same group. If this cannot be done, the decoder inserts a NOP and starts a new group. Millicoded instructions always start a new group. If some instruction calls a millicoded instruction, it also starts with a new group.
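The grouping rules described above can be illustrated with a small Python sketch. It is only a model of the rules as I have stated them, not IBM's actual dispatch logic: the `Instr` class, the `form_groups` function and the instruction names are hypothetical, and two points are my assumptions, namely that a branch closes the group it lands in and that the instruction following a millicoded one starts a fresh group.

```python
from dataclasses import dataclass

ORDINARY_SLOTS = 4  # slots 0-3 hold ordinary IOPs; slot 4 is the branch slot


@dataclass
class Instr:
    name: str
    iops: int = 1            # 1 = simple, 2 = cracked, more than 2 = millicoded
    is_branch: bool = False


def form_groups(program):
    """Pack instructions into five-slot dispatch groups, padding with NOPs."""
    groups = []
    current = []             # IOPs collected for slots 0-3 of the open group

    def close(branch=None):
        nonlocal current
        if not current and branch is None:
            return           # nothing to emit
        padding = ["nop"] * (ORDINARY_SLOTS - len(current))
        groups.append(current + padding + [branch or "nop"])
        current = []

    for ins in program:
        if ins.is_branch:
            close(branch=ins.name)          # assumption: a branch ends its group
        elif ins.iops > 2:
            close()                         # millicoded: always starts a new group
            for i in range(ins.iops):
                if len(current) == ORDINARY_SLOTS:
                    close()
                current.append(f"{ins.name}.{i}")
            close()                         # assumption: successor starts a new group
        else:
            if len(current) + ins.iops > ORDINARY_SLOTS:
                close()                     # a cracked pair must not straddle groups
            if ins.iops == 1:
                current.append(ins.name)
            else:
                current.extend(f"{ins.name}.{i}" for i in range(ins.iops))
    close()
    return groups


demo = form_groups([
    Instr("add"), Instr("load"), Instr("sub"),
    Instr("cracked", iops=2),
    Instr("branch", is_branch=True),
])
for group in demo:
    print(group)
```

In this example the cracked instruction does not fit into the one remaining slot of the first group, so that slot is filled with a NOP and both of its IOPs move to the next group, which the branch then closes.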
These nuances and limitations closely resemble the decoder of the Athlon XP processor. The Athlon 64, as we know, has a significantly improved and enhanced decoder. So this unit of the PowerPC970 should be considered adequate, although it is not the best: we know better implementations of the instruction decoder.
Besides this very strange (for a RISC processor) decoder unit, the PowerPC970 boasts bigger buffers along the pipelines (by the way, at the decoding stages the processor not only translates cracked instructions, but also resolves dependencies and forms groups of micro-operations). Right now IBM (and Apple) say they are proud that the PowerPC970 can have as many as 215 instructions in different pipeline stages at the same time (“on the fly”). Apple says this is much more than the Pentium 4 can handle, as it has a “window” of only 126 instructions. The G4+ processor has only 16 “on the fly” instructions. I will explain later that the PowerPC970 doesn’t have any big advantage here: the total number of “on the fly” instructions is very similar for the PowerPC970, Athlon 64 and Pentium 4.
About half of these instructions (IOPs) are stored in a buffer called Group Completion Table. This is a functional analog of the Reorder Buffer – it can store up to 20 formed groups of micro-instructions (that is, about 100 IOPs) that are waiting to be sent for execution. Note that this all happens in the order set by the program code. The micro-instructions are sent to execution units as soon as they are properly prepared, and without keeping their sequential order in the program. “Out-of-order” execution happens only here! As soon as the functional unit “confirms” that the operation is being performed successfully, the place in the queue gets free. Note that
When the entire group of micro-instructions is executed, and all preceding groups are also executed, the processor writes down the final results and the Group Completion Table gets cleared. Besides that, if the Group Completion Table buffer is full, the decoder will not decode instructions and form other groups until there is free space. Of course, we need to know two things to evaluate the capabilities of a processor: the size of the Group Completion Table and the length of its queue. Let me clarify it once again: the size of the Group Completion Table roughly indicates the maximum size of a continuous instruction block (as if cut out of the program) that can be processed by the processor at a given moment. To be exact, this is the maximum number of processed micro-instructions, which the instructions from the continuous part of the program are translated into. The queue depths are the maximum number of micro-instructions from which out-of-order instructions are selected. Of course, I suggest that the program instruction is the same as micro-operation for the sake of simplicity. We should also take into account other features of the micro-architecture. For example, the size of the Group Completion Table should be increased if the processor has a long pipeline, since each instruction takes more time to be processed.
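The bookkeeping described above can be modelled in a few lines of Python. This is a toy sketch rather than IBM's design: the class and method names are hypothetical, the table size (20 groups of five IOPs) comes from the text, and I assume a group leaves the table only when it and every older group have finished.

```python
from collections import deque

GCT_GROUPS = 20       # groups the table can track, per the text
IOPS_PER_GROUP = 5    # slots per dispatch group


class GroupCompletionTable:
    def __init__(self, capacity=GCT_GROUPS):
        self.capacity = capacity
        self.pending = deque()   # oldest group first: [group_id, finished]
        self.retired = []        # group ids in the order they were committed

    def dispatch(self, group_id):
        """Try to allocate an entry; False means the decoder must stall."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append([group_id, False])
        return True

    def finish(self, group_id):
        """Mark a group as fully executed (this may happen out of order)."""
        for entry in self.pending:
            if entry[0] == group_id:
                entry[1] = True
        # Commit strictly in program order: only from the oldest end.
        while self.pending and self.pending[0][1]:
            self.retired.append(self.pending.popleft()[0])


gct = GroupCompletionTable(capacity=3)       # tiny table to make the stall visible
print([gct.dispatch(g) for g in range(4)])   # the fourth dispatch fails: table is full
gct.finish(1)                                # group 1 is done, but group 0 is older
print(gct.retired)
gct.finish(0)                                # now groups 0 and 1 can both retire
print(gct.retired)
print(GCT_GROUPS * IOPS_PER_GROUP)           # upper bound on tracked IOPs
```

The small capacity in the usage lines is chosen only to show the stall; with the real table size the upper bound is the "about 100 IOPs" mentioned earlier.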
Now follows a slight deviation from the main topic. The problem is that we deal with marketing tricks here, too. Those “126 on-the-fly instructions” that Apple attributes to the Pentium 4 don’t refer to the entire pipeline, but to the Reorder Buffer. A quote: “The Allocator allocates a Reorder Buffer (ROB) entry, which tracks the completion status of one of the 126 uops that could be in flight simultaneously in the machine”. So it would be more correct to compare the “width” of this window with the Group Completion Table of the PowerPC970 processor.
The same correction is true for the Athlon 64 processor: we should make the comparison with its reorder buffer, which holds 72 macro-operations (“The reorder buffer allows the instruction control unit to track and monitor up to 72 in-flight macro-ops (whether integer or floating-point) for maximum instruction throughput”). Note also that this number refers to macro-operations, so it ideally corresponds to 144 micro-operations, although it is closer to 72 in the real world. Compare this to about a hundred IOPs in the Group Completion Table. It seems the PowerPC970 cannot boast any exceptional features as far as the number of simultaneously processed instructions is concerned. Quite the opposite of what Apple says.
A lot of space is also necessary for the “register renaming” procedure. As you probably know, this procedure makes it possible to execute several in-flight instructions that work with one and the same register. Every instruction receives a register with the name it needs, and it is quite another matter which real physical register that name corresponds to. Register renaming is performed by nearly all modern processors. It should also be mentioned that the total number of internal “rename registers” must be proportional to the total number of processed instructions.
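For readers who have not met register renaming before, here is a minimal Python sketch of the idea. It is a generic textbook scheme, not the PowerPC970's actual implementation: the `RenameTable` class and the register counts in the example are hypothetical, and for brevity the sketch never returns physical registers to the free list.

```python
class RenameTable:
    """Maps architectural register numbers onto a larger physical register file."""

    def __init__(self, arch_regs, phys_regs):
        if phys_regs < arch_regs:
            raise ValueError("need at least one physical register per architectural one")
        self.mapping = list(range(arch_regs))           # arch index -> physical index
        self.free = list(range(arch_regs, phys_regs))   # the "rename registers"

    def rename(self, dest, sources):
        """Rename one instruction; returns (physical dest, physical sources)."""
        phys_sources = [self.mapping[s] for s in sources]   # read current mappings first
        if not self.free:
            raise RuntimeError("out of rename registers: dispatch would stall")
        phys_dest = self.free.pop(0)
        self.mapping[dest] = phys_dest    # later readers of `dest` see the new copy
        return phys_dest, phys_sources


table = RenameTable(arch_regs=4, phys_regs=8)
first = table.rename(dest=1, sources=[2, 3])    # r1 = r2 op r3
second = table.rename(dest=1, sources=[0, 0])   # r1 = r0 op r0, independent of the first
third = table.rename(dest=2, sources=[1])       # reads the newest r1
print(first, second, third)
```

The two writes to the same architectural register end up in different physical registers, so they no longer have to wait for each other. The sketch also shows why the pool of rename registers has to grow with the number of instructions in flight: every instruction that produces a result takes one entry from the free list.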
Let’s compare PowerPC970 processor with those we have mentioned earlier:
They all look humble compared to the Itanium, though, with its 320 architectural registers! :)
Let me veer from the main topic of our discussion once more. The dividing line between “architectural” and “rename” registers is very thin. We actually deal with a fixed set of physical registers, onto which the architectural registers are dynamically mapped. It is also clear that the more physical registers there are, the better. However, growth in the number of registers brings certain problems. First of all, the registers are the hottest part of the processor (the register renaming unit always has something to do). Then, the more registers we have, the harder it is to fetch data from them. The number of architectural registers also affects the amount of fetched data. Thus, a big set of registers is good, but it also leads to certain problems.
As you see, PowerPC970 can’t show anything extraordinary in this respect. It is not the winner, but it is also no loser – just a well-made processor. Moreover, the number of its physical rename registers has grown considerably since the previous Mac processor, G4+. It was necessary just to support the much bigger number of “on-the-fly” instructions: the more instructions are on the fly, the more physical registers are required. Note also that talking about the registers, I meant just general-purpose registers, excluding the rest of them. Otherwise, we would get quite different numbers. For example, here is a table for the PowerPC970 (original source: here):
8 (9) 4-bit fields
Thus, the total number of registers in the PowerPC970 is 244! Of course, other processors also have many registers, besides general-purpose ones. However, when it comes to estimating processor power, we usually count in only the general-purpose registers.
But let’s continue our exploration of the internal processor structure. We have approached the execution units, and things are not as simple here as they seem at first glance. The PowerPC970 contains two integer units (IU), each of which is paired with a Load/Store unit. I can’t say there is anything exceptional about the number of units: the Pentium 4 features two fast ALUs that work at double frequency (and one slow ALU for certain types of operations). The G4+ also has two: one integer unit for simple operations (addition) and one for complex operations (integer division). The Athlon 64 boasts as many as three ALUs. But IBM has added some technological nuances to the architecture. The two IUs in the PowerPC970 are not exactly the same. Both units perform simple operations like addition and subtraction in exactly the same way. When it comes to more complex operations, there is a certain specialization: for example, it is IU2 that performs all divisions in the PowerPC970. IBM refers to these two units as “slightly different”, but that makes the situation even more confusing. Unfortunately, I couldn’t find any information about which unit sorts the commands according to their specifics and sends them to the appropriate IU. It is also not quite clear what happens if a group of IOPs contains several division commands. Will they all wait for IU2? If so, we can pinpoint a definite bottleneck in this architecture. The decoder unit may try to keep the groups as balanced as possible: this variant seems more probable, but it raises the question of how good the decoder is at rearranging the IOP positions.
To our great disappointment, IBM also doesn’t disclose any information on the latencies and the instruction throughput for the PowerPC970. We only know that some simple instructions take one clock cycle (excluding the decoder, of course), while others take at least several clocks. Independent IOPs can start each clock cycle, while IOPs that are dependent on each other can start no faster than each second cycle. We could try evaluating a few things according to the given pipeline stages scheme: for instance, the most probable latency for accessing data stored in L1 cache will be equal to 4 clocks.
The Load/Store units come next. These units show some deviation from the ideology we see in x86 processors (Pentium 4, Athlon 64). The x86 processor includes two units for performing integer and floating-point loads/stores. For example, the Pentium 4 has the following specialization, according to the data from arstechnica.com:
The author is doubtful about this, as he seems to have some concerns about the correctness of such a “division of labor”, which is actually quite hard to confirm or deny. And in PowerPC970 these units are identical, that is there are two special units responsible for all types of loads/stores. There remains some uncertainty about vector operations, though: the predecessor of the family, Power 4 processor, didn’t have a vector unit. That’s why it is not quite clear how many vector Load/Store operations and of what kinds the appropriate unit of the PowerPC970 processor can perform. There is probably a specialization, too, so that one unit is responsible for Vector LOAD and the other - for Vector STORE. But that is only a supposition. Another variant is also possible: the units only perform Vector LOAD, while Vector STORE is combined with the execution units of the pipeline. The worst variant is when only one unit is responsible for all operations with vectors.
Now, let’s dwell on the FPUs. PowerPC970 has two identical FPUs; each of the two can execute any operation with floating-point operands. The fastest operation may be performed in 6 cycles, and the slowest – in 25 cycles. The two units are fully pipelined, that is, another instruction can be sent for execution each clock following the previous one (if they do not depend on one another). You should also remember that PowerPC970 has 72 physical FPU registers: 32 architectural registers and 40 “rename registers”. Moreover, there are a few pleasant peculiarities. In particular, PowerPC970 supports one very useful combined instruction: multiplication + addition “all-in-one”. Since it can be performed in each of the two FPUs each clock cycle, it means that there will be 4 operations performed in a single clock. This can turn out essential when we have to multiply matrices or solve many other tasks from linear algebra.
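The arithmetic behind the “4 operations per clock” figure, and the effect of full pipelining, can be written out as a short Python sketch. The unit count and the 6-cycle latency come from the text; the function names are hypothetical, and the cycle estimate assumes that independent operations are spread evenly over both FPUs with one issue per unit per clock.

```python
FPU_COUNT = 2          # two identical FPUs, per the text
FLOPS_PER_FMA = 2      # a combined multiply-add counts as two operations
FASTEST_LATENCY = 6    # cycles for the fastest FP operation, per the text


def peak_flops_per_clock(fpus=FPU_COUNT, flops_per_op=FLOPS_PER_FMA):
    """Theoretical peak when every FPU starts a multiply-add each clock."""
    return fpus * flops_per_op


def peak_gflops(clock_ghz):
    """Peak rate in GFLOPS at a given core clock."""
    return clock_ghz * peak_flops_per_clock()


def cycles_for_independent_ops(n_ops, latency=FASTEST_LATENCY, fpus=FPU_COUNT):
    """Cycles until the last of n independent ops completes on pipelined units."""
    if n_ops == 0:
        return 0
    issue_cycles = -(-n_ops // fpus)      # ceiling division
    return latency + issue_cycles - 1


print(peak_flops_per_clock())
print(peak_gflops(2.0))
print(cycles_for_independent_ops(100))
```

The last function shows why pipelining matters: a hundred independent operations cost far fewer cycles than a hundred times the latency. These are theoretical peaks only; real code with dependencies between operations will not reach them.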
Besides that, PowerPC970 is very likely to allow simultaneous processing of two addition operations (or two multiplication operations), which cannot be done on Athlon XP, for instance, since the FPU units of the latter are asymmetric (one for addition and one for multiplication). The same is true for Pentium 4, where the situation looks even worse, since in x87 mode the ports throughput will become a bottleneck limiting the performance (they allow only 1 operation per clock).
Of course, the higher computational power of the processor called for more bandwidth from the memory and the system bus, which it did receive. We will talk about it shortly.
Now let’s review one more unit of the PowerPC970 processor, the AltiVec unit. First, take a look at the following picture, which is a flowchart of this unit in the G4+ processor:
The picture is taken from arstechnica.com
As we see, the unit includes:
Besides that, the unit uses 32 registers, each 128 bits long. 16 “rename” registers make the unit complete, facilitating out-of-order instruction execution. The performance of this unit looks as follows: the G4+ processor can execute two vector IOPs per clock cycle in any three units of the four.
The same unit in PowerPC970 processor is somewhat different. Here is the flowchart:
The picture is taken from arstechnica.com
As you see, the structure of the AltiVec unit is a little bit different (by the way, IBM has another word for it, but I use this term on purpose to avoid confusion). There are two different units: Vector Permute Unit and Vector Logic Unit. The latter consists of:
This structure of the AltiVec unit leads to certain limitations that the G4+ processor never had. The PowerPC970 can execute two vector IOPs per clock cycle, but only if one of them goes to the Vector Permute Unit. The second IOP must go to one of the three pipelines of the Vector Arithmetic Logic Unit. Of course, this additional limitation doesn’t make the unit perform any faster.
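The pairing restriction just described is easy to express as a predicate. The Python sketch below is purely illustrative: the function name and the string labels are hypothetical, and I assume the three arithmetic pipelines are the simple integer, complex integer and floating-point ones.

```python
PERMUTE = "permute"
VALU_PIPELINES = {"simple_int", "complex_int", "float"}   # assumed pipeline labels


def can_issue_together(iop_a, iop_b):
    """True if two vector IOPs may start in the same clock under the rule above."""
    kinds = (iop_a, iop_b)
    exactly_one_permute = kinds.count(PERMUTE) == 1
    other_is_arithmetic = any(kind in VALU_PIPELINES for kind in kinds)
    return exactly_one_permute and other_is_arithmetic


print(can_issue_together("permute", "float"))
print(can_issue_together("float", "simple_int"))
print(can_issue_together("permute", "permute"))
```

Only the first pair qualifies: two arithmetic IOPs, or two permutes, have to be issued in separate clocks under this rule.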
Besides the architectural 32 registers, PowerPC970 processor has a bunch of “rename registers”. The total number of physical registers in the AltiVec unit is estimated at 72 or 80.
It seems IBM had to redesign this unit to reach higher processor frequencies. This is just a supposition, so I will try to prove it. The following table lists the pipeline depth (in stages) for several types of instructions:
Vector Simple Integer
Vector Complex Integer
Vector Floating Point Unit
Vector Permute Unit
So, we can see that this unit has undergone certain revision that, without doubt, reduced the overall performance per clock cycle, but allowed reaching considerably higher operational frequencies.
Now we have come to the processor bus and the memory subsystem of the PowerPC970 platform. The situation is appealing here. The processor bus works at one fourth of the processor frequency (that is, at 450MHz for the 1.8GHz processor). The bus uses DDR signaling, so its effective transfer rate is 900MHz. As the width of the bus is 64 bits (there are nuances here, discussed below), we get a theoretical peak bandwidth of 7.2GB/s. IBM says the weighted average bandwidth is 6.4GB/s, which is closer to reality considering the effect of latencies, memory access, and chipset peculiarities. In other words, the PowerPC970 platform features the fastest bus among today’s processor architectures. In fact, however, the bus of the PowerPC970 consists of two one-way buses, each 32 bits wide. That is, you can send 3.6GB/s in each direction, but not 7.2GB/s in one direction (there is a certain similarity with the HyperTransport “system bus” of the Athlon 64 processor). Another thing is that the RAPID I/O bus (that’s what IBM calls the whole family of high-speed serial buses, although it is called Elastic I/O here) has another interesting property, at least in theory: it can change its direction. It takes some time (a few hundred CPU clock cycles) for the bus controller in the chipset to switch the bus into a single-directional mode, in which the bus pumps data in one direction only. Unfortunately, I don’t know if this operational mode is implemented in the chipset of the new platform from Apple. It would be interesting to check the performance of this concept, and to learn whether the data streams from and into the processor differ so much that it makes sense to use the bus in this way.
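The bus figures above follow from simple arithmetic, reproduced here as a Python sketch. The function name and its parameters are hypothetical, and the result is a theoretical peak in decimal gigabytes per second, ignoring protocol overhead.

```python
def bus_bandwidth_gbs(cpu_mhz, divider=4, transfers_per_clock=2, width_bits=64):
    """Theoretical peak bandwidth of the processor bus in GB/s."""
    bus_mhz = cpu_mhz / divider                      # bus clock derived from the core clock
    mega_transfers = bus_mhz * transfers_per_clock   # DDR: two transfers per clock
    return mega_transfers * 1e6 * (width_bits / 8) / 1e9


total = bus_bandwidth_gbs(1800)                          # both directions together
per_direction = bus_bandwidth_gbs(1800, width_bits=32)   # one 32-bit one-way link
print(total, per_direction)
```

For the 1.8GHz processor this reproduces the 7.2GB/s aggregate and 3.6GB/s per direction quoted above. Because the bus clock is tied to the core clock, calling the function with 2000 gives the corresponding peak for the 2GHz model.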
The memory subsystem changed, too. Instead of DDR333 SDRAM (PC2700), the PowerPC970 platform uses dual-channel DDR400 SDRAM. The difference is obvious: 6.4GB/s of bandwidth instead of 2.7GB/s. Note that the G4+ platform couldn’t use the full bandwidth of DDR SDRAM, as its processor bus could only transfer data at 1.3GB/s (that was similar to the Pentium 3 platform, where DDR SDRAM provided no benefits). The PowerPC970 platform actually owes much of its improved performance to the memory subsystem (and the faster bus): the previous platform, G4+, used a bus clocked at 166MHz (synchronously with the memory). That is, the memory bandwidth more than doubled and the bus bandwidth grew more than five times.
Besides other things, the PowerPC970 supports SMP (symmetric multiprocessing), so you can easily use these processors in pairs. By the way, the ability of the PowerPC970 to work in multi-processor systems is actively employed by Apple in its marketing campaign. It is very interesting that, in spite of the similarity between the Elastic I/O and HyperTransport buses, a bus architecture is used for building the dual-processor system, as in Intel systems. That’s curious, although quite reasonable: a bus architecture is usually simpler to implement. Moreover, it is enough for desktops or inexpensive workstations, while a NUMA version would require a serious redesign of the operating system. And this is a sore spot: as I have mentioned earlier, Apple is suffering great pains moving to the new Mac OS X.
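To check the ratios, the memory and bus numbers can be recomputed with a few lines of Python. The helper function is hypothetical; I assume 64-bit memory channels and a 64-bit G4+ bus with one transfer per clock, which matches the 1.3GB/s figure quoted above. The 7.2GB/s value for the new bus is taken from the bus discussion.

```python
def bandwidth_gbs(mega_transfers_per_s, width_bits=64, channels=1):
    """Theoretical peak bandwidth in GB/s for a parallel bus or memory channel."""
    return mega_transfers_per_s * 1e6 * (width_bits // 8) * channels / 1e9


new_memory = bandwidth_gbs(400, channels=2)   # dual-channel DDR400
old_memory = bandwidth_gbs(333)               # single-channel DDR333 (PC2700)
old_bus = bandwidth_gbs(166)                  # G4+ bus: 166MHz, assumed 64-bit
new_bus = 7.2                                 # PowerPC970 bus peak, from the text

print(round(new_memory, 1), round(old_memory, 1), round(old_bus, 1))
print(round(new_memory / old_memory, 1), round(new_bus / old_bus, 1))
```

Under these assumptions the memory bandwidth grows by a little more than two times and the bus bandwidth by a little more than five times.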
Overall, PowerPC970 platform feels all right on this front.
If you have read the previous chapters carefully, you know what I am going to say now: the PowerPC970 is a well-made, successful processor. It is good in some areas, and the best in others. And there is no area where this processor would clearly lose to competing solutions. So we can expect a high level of performance from the PowerPC970, at least no worse than that of current chips from AMD and Intel. We will see the performance results shortly.
Right now I’ve got a lyrical digression for you: however excellent and potentially winning the features of a processor architecture are, there is one most important thing you have to do with it. You have to sell it to somebody. That’s why the folks from the marketing department, rather than the engineering team, usually determine the launch schedule for a particular product. If you think the engineering component is paramount, I can remind you of one striking example: DEC. This corporation was a well-known developer and inventor of many efficient technical solutions, but it slipped once. It didn’t pay attention to the marketing that would have helped to sell its solutions (the best in the world at that time). Note: best from the technical point of view! The outcome was sad, although quite natural. The developer of one of the fastest microprocessors of the day (by the way, it was the legendary first 64-bit Alpha processor) went bankrupt and was bought by Compaq (which later merged with Hewlett-Packard). DEC was an exemplary company in designing high-performance systems, but proved too weak for this harsh business world. They thought their excellent products would sell well enough without any marketing. Regrettably, this was not so. Most customers are not specialists in computer technologies: they are professionals in other areas. So they need someone to tell them why this very product is exactly what they need. And it is the marketing department of any company that does the telling.
No doubt, Apple learned these simple truths long ago. I would even say that marketing has traditionally been Apple’s strong point: they used to come up with ever more ingenious ways to promote their products. The new microprocessor and the new platform immediately became the focus of the intensive marketing work. And that is when a small, but very unpleasant event happened.
One of the trickiest nuances in the marketing campaign for any product is its comparison to potential competitors. Of course, you must tout your product as well as you can. It is also clear that you need to stay politically correct, or at least seem to. If you are trying to point out the deficiencies of a competing solution, you cannot lie. Otherwise, if you get caught, people will turn very negative towards your company and your products. That’s psychology: they will believe you only until you are exposed as lying. Accordingly, the marketing department cannot lie openly, as that could ruin the company’s image. I think these are all well-known truths.
Therefore, I was shocked to see what awful disinformation was offered to the public in the official Apple document on the launch of their new systems. It is not about the numerous overstatements like “the fastest system”, “leaves far behind” and other magnificent phrases: that’s normal, that’s marketing. But running along the lines of the document I found the performance results of the cross-platform benchmark SPEC CPU 2000…
Let’s first see who’s in the lead of the performance parade today. The following table lists the performance results in this test (I took the Base result). Note that these data are taken directly from the official SPEC website. You will see shortly why this is so important. Here you are:
Pentium 4 XE
* - IBM’s preliminary estimate
The table needs some comments of ours, I suppose. First of all, the results of PowerPC970 are the lowest among the competitors (Power 4+ processor belongs to quite another market sector; I was just curious to know how far the offspring ran from the forefather). The results refer to the 1.8GHz processor of course, while we now have a 2.0GHz version, but this cannot conceal the fact that PowerPC970 is considerably slower than the others. By the way, the release of the “extreme versions” of Pentium 4 and Athlon 64 has been like a spurt of the x86 processors: they are now beyond reach for the poor PowerPC970.
It’s also clear that PowerPC970 loses to Power 4+ in floating-point calculations (its FPU was greatly cut down compared to the predecessor). What’s worse is its significant loss (from 20% to 33%) to the competitors from the x86 world. Accordingly, you cannot expect this processor to compete successfully with x86 processors in serious scientific calculations: it won’t fall too far back at best (if the software is properly optimized for each platform, of course).
On the other hand, this PowerPC970 is a good foundation for Apple which has been in constant trouble in terms of performance (I don’t have much faith in the results of Apple’s “tests” in Photoshop without any clearly defined methodology – just columns of numbers). So there is some hope that the new processors will help Mac platform to bridge the gap to the x86 platforms, which we see today.
Of course, I was looking for performance results of PowerPC970 from Apple. And when Apple did publish them, I couldn’t believe my eyes. Yes, you have the right to show your platform in a most favorable light, everyone does so. But when the company goes for a direct fraud during its marketing campaign and deliberately lowers the results of the competitors, it’s unacceptable! Take a look yourselves, as here are the results for Pentium 4 3.0GHz as benchmarked by Apple:
Results from SPEC.org
“Results” from Apple
I am quite curious how they tested to get the performance cut by more than half! As you know, the main concept of the SPEC tests is to achieve the maximum performance for the platform considered. You choose the compiler as you wish. So I can’t think of a reason for such a big difference between the results, save for a deliberate inaccuracy on the part of Apple’s marketing folk. Even with a bad compiler it’s hard to halve the performance in SPECfp_base2000. Apple chose a compiler that is not the fastest for the Pentium 4 and didn’t use the SSE2 instruction set, saying it couldn’t get a performance advantage from it (!). Well, the whole world sees a performance advantage from SSE2 optimization, and Apple doesn’t. Should we buy them spectacles?
It’s even funny that this “cheap” way to success brings Apple no victory. The PowerPC970 results as benchmarked by Apple were 800 in SPECint_base and 840 in SPECfp_base. Considering that Apple tested a 2GHz PowerPC970, it’s strange that they got lower results than IBM quotes for the 1.8GHz model (937 and 1051, respectively). That’s a shameless fraud, and I really doubt such benchmarking methods will win Apple any popularity. Are sales so bad that this measure was necessary?
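To put a number on this discrepancy, here is a small Python sketch using only the four scores quoted above. The variable and function names are hypothetical, and the linear scaling of IBM's estimates to 2GHz is my own optimistic assumption (real SPEC scores rarely scale perfectly with clock speed), so treat the last figures as a rough upper bound.

```python
ibm_1_8ghz = {"int": 937, "fp": 1051}    # IBM's estimates for the 1.8GHz part
apple_2_0ghz = {"int": 800, "fp": 840}   # Apple's published figures for the 2GHz part


def shortfall_percent(apple_score, ibm_score):
    """How far Apple's score falls below IBM's, in percent."""
    return (1 - apple_score / ibm_score) * 100


def scaled_to_clock(score, from_ghz=1.8, to_ghz=2.0):
    """Assumption: the score grows linearly with the clock frequency."""
    return score * to_ghz / from_ghz


for test in ("int", "fp"):
    gap = shortfall_percent(apple_2_0ghz[test], ibm_1_8ghz[test])
    expected = scaled_to_clock(ibm_1_8ghz[test])
    print(test, round(gap, 1), round(expected))
```

So the faster chip, as measured by Apple, comes out roughly 15% below IBM's integer estimate for the slower chip and about 20% below the floating-point one.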
But maybe that’s because the trump of PowerPC970 processor, the AltiVec technology, is not used in any way? Maybe this technology can turn the tables?
Well… We see the same marketing miracles here again. Firstly, Apple juggles some mythical 16 Gigaflops for the 2GHz PowerPC970. But they forget to stress a trifle: the number refers to 32-bit precision. Let me remind you that the very notion of the Gigaflop was defined specifically for 64-bit precision. Yes, I can think of some areas where 32-bit representation of numbers is quite enough for calculations, but they are scarce. Scientific and technical calculations as a rule involve 64-bit precision, as this is the minimum sufficient precision for this type of task. It’s a lie that 32-bit precision is enough for everyone. And of course, this figure is compared to correct performance results for x86 processors, which use 64-bit precision. All of this goes under the banner of “fair game”.
Sorry, dear Apple marketing guys, but I am not swallowing this. Either follow the standard testing methodology of the SPEC committee (or give your systems to someone else to benchmark), or don’t quote SPEC results at all. Give me nothing if you can’t give me the truth. User-friendliness, ease of use, impressive stability: these are things you can limit yourselves to. You can talk about your leadership in multimedia. In trying to say your processor is the best, you forget that marketing should show advantages rather than confuse the potential customer. Your “marketing” is a blatant lie.
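A quick back-of-the-envelope check in Python shows what the 16 Gigaflops claim implies. The 16 GFLOPS and 2GHz figures are Apple's, and the four 64-bit operations per clock come from the FPU discussion earlier; the explanation of the 32-bit figure (one multiply-add on four 32-bit lanes of a 128-bit vector per clock) is my assumption, not something Apple's document spells out.

```python
CLOCK_GHZ = 2.0
CLAIMED_GFLOPS_32BIT = 16.0        # Apple's headline number, 32-bit precision
FPU_FLOPS_PER_CLOCK_64BIT = 4      # two FPUs, one multiply-add each per clock

# What the claim implies per clock cycle.
implied_flops_per_clock = CLAIMED_GFLOPS_32BIT / CLOCK_GHZ

# Assumption: a 128-bit vector holds four 32-bit values, and a vector
# multiply-add counts as two operations per value.
lanes = 128 // 32
assumed_vector_flops_per_clock = lanes * 2

# The like-for-like 64-bit peak from the scalar FPUs.
peak_gflops_64bit = CLOCK_GHZ * FPU_FLOPS_PER_CLOCK_64BIT

print(implied_flops_per_clock, assumed_vector_flops_per_clock, peak_gflops_64bit)
```

If the assumption holds, the headline figure is exactly the 32-bit vector peak, while the comparable 64-bit peak of the same chip is half of it.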
You didn’t make it, guys, and that’s sad. Your processor wasn’t that bad, really.
Let me end this article on this sad note. The PowerPC970 is an example of an improper marketing campaign. As we have seen, the PowerPC970 processor (and the whole platform) is not bad at all. But it is not the best, either. Yes, it is slower than x86 processors, and I can’t see any reason for this gap to become smaller. No PowerPC970 clusters will change this situation (Apple recently trumpeted a cluster built from two-way PowerPC970 machines at a US university). By the way, Apple is again not quite honest in saying the cluster cost $5 million. This number doesn’t include the cost of building the cluster (the students who built it were working for pizza!), the cost of the cooling system, and other expenses. Apple says the university got the equipment at a special price. I won’t even tell you about the size of the system: they used ordinary system cases rather than rack mounts.
I hope Apple won’t continue using such “marketing methods”. It is simply very sad to see things like that happening in a company that was a pioneer and initiator of many revolutions and breakthroughs in the personal computer industry.
I would like to sincerely thank Vadim Levchenko, Yury Malich, Sergey Romanov and Valeria Filimonova whose feedback was very helpful. My special thanks go to Jan Keruchenko who contributed a lot to the hardest passages.