CPUs with 64bit Architecture: Evolution or Revolution?

We are facing the transition from 32bit to 64bit technology. But are we ready for it? In this article we take a look at the processor market to see whether there is a real need for this transition and which solutions will be the pioneers of the 64bit world.

by Andy Yaschenko
07/07/2003 | 10:47 AM

Not so long ago, some 10 years or so, a very important revolution happened: we moved from 16bit applications to 32bit ones. Little by little, overcoming some problems on the way, we carried out this transition, starting with Windows for Workgroups 3.11 and Windows 95 and finishing with the newer versions. Most of you remember this process very well, I suppose, as well as the fact that the change didn’t bring any tangible performance improvements. When we move from 32bit to 64bit, the situation will be just the same.

Now that we have made a pretty optimistic introduction, let’s try to figure out why things go this particular way. We also suggest checking whether the situation really is as we have just described it, or whether our vision of things is erroneous. This matters all the more since the marketing people have created a great stir around 64bit processors, and it grows bigger as we get closer to their launch. It is remarkable that we have never seen anything like that before, although 64bit RISC chips arrived back in the early 90s. Perhaps that is because those first chips were strictly server processors, well known only to a limited number of specialists, so the public response was not that tremendous. Today, however, we have Intel investing many millions of dollars into the marketing of this particular field.

As a result, Intel’s aggressive marketing campaign sometimes creates weird myths and makes users imagine very strange things, such as a doubling of system performance once the transition to 64bit is complete. So far there is nothing really bad about it: 64bit processors are still used strictly in servers and powerful workstations, and the people who work with this hardware usually know very well what they need and what the real potential of the equipment is.

Nevertheless, the situation is changing. When Athlon 64 arrives, 64bit processors will start to be considered a PC solution. That is why we think it is a good idea to recall everything we know about these CPUs and how they function, and to separate the myths from the facts.

First we would like to discuss the very basics of how a CPU works, as the rest of the story builds on these basic notions. CPUs have a pretty long history. It starts with the Intel 4004, which was a 4bit processor. This means that its general purpose registers could hold only 4bit numbers, the ALU (arithmetic logic unit) working with integers could process only such numbers, and the CPU used 4bit numbers when addressing memory blocks. In other words, instructions working with signed integers could use only values between -8 and 7. Well, this is too little for any serious arithmetic, don’t you think?

However, very soon the 8bit 8008 processor came out (it could use signed numbers in the interval from -128 to 127). After that came the 16bit 8086 processor (the interval included integers from -32768 to 32767). And in 1986 we all saw the 80386 processor, which was the first one to support a 32bit work mode and hence to deal with integers exceeding 2 billion. By the way, do you have any idea what work we are talking about when we discuss these integers?
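The ranges quoted above follow directly from the register width. As a small illustration, assuming the usual two's-complement representation of signed integers (the helper names `signed_range` and `unsigned_max` are ours, introduced only for this sketch), the limits for each generation can be computed like this:

```python
def signed_range(bits):
    """Return (min, max) for a two's-complement integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_max(bits):
    """Largest value an unsigned integer of the given width can hold."""
    return (1 << bits) - 1


# Register widths of the processor generations mentioned in the text,
# plus the 64bit case this article is about.
for bits in (4, 8, 16, 32, 64):
    lo, hi = signed_range(bits)
    print(f"{bits:2d}bit: signed {lo} .. {hi}, unsigned 0 .. {unsigned_max(bits)}")
```

Each extra bit doubles the number of representable values, so going from 32bit to 64bit does not double the range but squares it.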


Each CPU has an ALU (the latest models already have a few of them), which receives all instructions and data for integer operations. This is usually one of the fastest processor units in terms of clock frequency. Roughly speaking, the unit works as follows: the data arrive (two integer values: 3 and 4), the instruction arrives (multiplication), and the outgoing result equals 12. Addition, subtraction, division, etc., are all carried out the same way.

Of course, it doesn’t matter whether we use numbers in 4bit or 64bit representation: the algorithm remains the same. The ALU receives the numbers, applies certain instructions to them and produces the result. It is potentially possible to split part of an algorithm into two parallel streams by finding two pairs of numbers and a pair of operations that do not affect one another. This is a pretty rare situation (that is why we talk about an algorithm in the first place, i.e. a chain of sequential steps), but it is not impossible. Contemporary programs contain enough procedures that are not connected with one another in any way.

Then you can ask: what do the registers have to do with all that? As we have already said, the data should be loaded to the processor ALU from a certain source. The source can be of any type: starting with the system hard disk drive and finishing with the processor cache. From the processor cache the data get directly to the CPU, to the data storage unit, which is not very big, but quite fast. However, this unit cannot be placed close enough to the ALU for physical reasons, that is why there are special transitional blocks for temporary data storage working very fast and known as registers.

This is exactly where the data for the processor ALU come from. For the ALU the “3x4” operation will most probably sound like: “multiply register A by register B and save the result in one of the registers” (maybe even in one of the registers already involved in this operation). Later on the result can be taken from this register for further processing, or saved to the RAM, so that the precious register space is freed. When we talk about the processor capacity, we first of all mean the capacity of its registers, i.e. whether they can store 8, 16, 32 or 64bit numbers.

When it comes to fractional numbers, namely floating-point numbers, the situation changes completely. Much bigger capacity is evidently required to store and process them properly: just imagine how many digits usually have to be stored for numbers like that, especially if you need to perform high-precision calculations with a lot of digits after the decimal point. However, since these numbers have been in use for a long time, no one was going to wait until 64bit processors arrived.

There is a separate x87 instructions set working with the floating-point numbers. There are special computational units organized in a co-processor, which deal with these numbers. The numbers are stored and processed in a special internal format with 80bit representation. So, everything is alright here, and the floating-point numbers were not the ones to cause the changes. Especially, since this direction takes its own autonomous way: SSE, 3DNow!, etc. In other words, the transition from 32bit to 64bit has very little influence here.


Anyway, there is one more aspect to this question: memory access. The thing is that in the basic “flat addressing” mode the same general purpose registers are used to store memory addresses. 32 bits allow about 4.3 billion possible combinations, so a 32bit processor can address only 4GB of memory. Its registers simply cannot store the addresses of the cells lying beyond this limit.

Of course, the advantages of 64bit technology are evident here, because the size of the address space grows to a theoretical 18 million terabytes. Even though 4GB of memory can still be seen only in servers, we anticipate that very soon this memory capacity will become a common thing for PCs as well. The practical limit of 64bit systems will definitely be lower than the theoretical one, but it will still be enough to last a few decades.
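The figures above are simple powers of two. A minimal sketch, assuming a flat byte-addressed memory where every distinct address selects one byte (the function name `addressable_bytes` is ours):

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses a register of this width can hold."""
    return 1 << address_bits


GIB = 1 << 30      # binary gigabyte, as used for RAM sizes
TB = 10 ** 12      # decimal terabyte

# 32bit addressing: exactly 4 binary gigabytes (about 4.3 billion bytes).
print(addressable_bytes(32), "bytes =", addressable_bytes(32) // GIB, "GB")

# 64bit addressing: roughly 18.4 million decimal terabytes.
print(addressable_bytes(64) / TB / 10 ** 6, "million TB")
```

The "4.3 billion" and "4GB" figures describe the same quantity: 4,294,967,296 bytes is 4.3 billion in decimal terms and exactly 4GB in the binary units used for memory.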

So, since we came to speak about servers, it seems the right time to ask: what are the advantages of 64bit technology, and who will benefit from it? Of course, the first thing that comes to mind is the notorious database servers. Serious databases have already exceeded 4GB in size, and the opportunity to cache them completely in the RAM is too attractive to give up. So, just like in the case of the coprocessor, the makers of server solutions didn’t want to wait for someone else to solve this problem. Thanks to various tricks they managed to make the 32bit Xeon capable of addressing over 4GB of memory (up to 64GB). You should keep in mind, though, that these hacker-like solutions can hardly be called a serious platform for the future, and besides, they cause too big a performance drop when working with the memory: it can reach tens of percent. Anyway, Intel connects its future with the Itanium processor and not with Xeon.

Well, this is about it. It is pretty hard to think even of a workstation that might need more than 4GB of memory. We don’t doubt that the 2D and 3D applications as well as video processing tasks will be improved so greatly within the next few years that we will easily overcome this undeclared barrier. Nevertheless, today 64bit technology is hardly of any real practical value.

It is simply ridiculous to talk about Word, games and the like in this context. On the other hand, we can imagine that more memory could be needed for some games displaying highly realistic worlds with a high level of detail. Just think of the servers supporting Ultima Online or Everquest as a prototype of the mainstream gaming PC of the future.

However, some of you may have already noticed that the whole discussion is based on 64bit addressing, as if it were the only advantage of 64bit processors. Does it mean that nothing is going to change for integers when we start using 64bit representation? Of course, that is not quite true. Something will definitely change, but mostly for the already mentioned servers, which deal with such tasks as simulations of nuclear explosions and weather, cryptography, and the like, where the 32bit integer range may not be enough. On common PCs, 64bit integers will find no real application. By the way, if you really need 64bit representation that badly, high-level programming languages such as C let you use a standard trick, where two 32bit registers hold one 64bit number. This will definitely affect the performance a little, because the number of available registers is strictly limited.
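To make the two-register trick concrete, here is a minimal sketch of a 64bit addition carried out with 32bit pieces only. Python integers are unbounded, so the 32bit registers are simulated by masking; the helper names (`split64`, `join64`, `add64_with_32bit_regs`) are ours and do not come from any compiler or library:

```python
MASK32 = 0xFFFFFFFF  # keeps only the low 32 bits, like a 32bit register would


def split64(value):
    """Split a 64bit number into (low, high) 32bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def join64(lo, hi):
    """Recombine two 32bit halves into one 64bit number."""
    return (hi << 32) | lo


def add64_with_32bit_regs(a_lo, a_hi, b_lo, b_hi):
    """Add two 64bit numbers held as pairs of 32bit halves.

    The low halves are added first; the carry out of that addition is
    then fed into the addition of the high halves (add, then add-with-carry).
    """
    lo_sum = a_lo + b_lo
    lo = lo_sum & MASK32
    carry = lo_sum >> 32            # 0 or 1
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi


a = 0x00000001FFFFFFFF
b = 0x0000000000000001
result = join64(*add64_with_32bit_regs(*split64(a), *split64(b)))
print(hex(result))
```

One 64bit addition turns into two dependent 32bit additions and occupies four registers for the operands instead of two, which is exactly the cost the paragraph above refers to.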


This leads to a logical conclusion: for the same type of architecture, 64bit processors will be faster only in those tasks that use 64bit calculations, because they will no longer need to resort to tricks like the use of extra registers (see the previous passage). 32bit applications will run as fast as they did on 32bit processors, although they might be able to access the 64bit address space after some additional individual changes, provided, of course, that the time and effort spent on that is really worth it.

It is not for nothing that we said our statement is valid only “for the same type of architecture”. Speaking about x86, it holds only for Athlon 64 and Opteron from AMD. All other players took a completely different direction and developed totally new architectures for 64bit technology. Take Intel, for instance, which was actually the one to start the whole stir around 64bit processors.

The company preferred to carry out a so-called “velvet” revolution: they kept the 32bit processor generation on the market and started introducing the 64bit family little by little. The move from 16bit to 32bit was an abrupt one: the i386 completely replaced the i286. Now the situation is different.

From the very beginning the CPU was developed in two versions at a time: one by Intel engineers and one by Hewlett-Packard engineers. In fact, both chips were based on the same ideas, because the ideas were worked out by both teams together and were intended to give birth to one and the same family of future processors. The uniting power in this case was the general EPIC (Explicitly Parallel Instruction Computing) ideology, which came to replace CISC, and the IA-64 architecture, which describes the instructions, registers and so on. However, the implementation of an architecture is usually subject to change: remember the differences between the 8086 and i486 CISC processors, both based on the same x86 architecture?

Itanium and Itanium 2, codenamed Merced and McKinley respectively, are both based on the same ideology but differ in their internal architecture. Something like that has already happened once: we are talking about Pentium and Pentium Pro. Those two did have some common traits, and today’s newcomers also have something in common: this is what EPIC is here for. First of all we are talking about fully-fledged large-scale superscalar execution, i.e. the ability to perform several instructions at a time. This is why the CPU needs multiple execution units: for integer operations, for floating-point operations, etc.

Unlike Pentium and its successors, which analyze the code themselves, EPIC processors rely on the compiler, which should analyze the code and find the best spots where its processing can be parallelized. This info is then passed on to the CPU. This is why these CPUs are called “explicitly parallel”. In fact, it is a very convenient thing: the CPU doesn’t have to decide anything, the compiler is the one to explain everything to the CPU. Besides, all execution units of the processor should be evenly loaded, that is why the new processors boast powerful branch prediction algorithms, preliminary code processing, data prefetch, etc.
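The kind of analysis the compiler performs can be illustrated with a toy example. The sketch below is a much-simplified illustration and not the real IA-64 bundling rules: instructions are hypothetical `(dest, src1, src2)` tuples, and the function `group_independent` with its `max_width` parameter is our own invention. It walks the program in order and starts a new group whenever an instruction depends on something already in the current group:

```python
def group_independent(instructions, max_width=3):
    """Greedily pack in-order instructions into groups that can run in parallel.

    Each instruction is a tuple (dest, src1, src2, ...). A new group is
    started when the instruction conflicts with the current group or the
    group is already full (max_width is a hypothetical issue width).
    """
    groups = []
    current = []
    written = set()   # registers written by the current group
    read = set()      # registers read by the current group
    for dest, *srcs in instructions:
        depends = (
            any(s in written for s in srcs)   # read-after-write
            or dest in written                # write-after-write
            or dest in read                   # write-after-read
        )
        if depends or len(current) == max_width:
            groups.append(current)
            current, written, read = [], set(), set()
        current.append((dest, *srcs))
        written.add(dest)
        read.update(srcs)
    if current:
        groups.append(current)
    return groups


program = [
    ("r1", "a", "b"),     # r1 = a op b
    ("r2", "c", "d"),     # r2 = c op d   (independent of r1)
    ("r3", "r1", "r2"),   # r3 = r1 op r2 (needs both results above)
    ("r4", "e", "f"),     # r4 = e op f   (independent of r3)
]
for i, group in enumerate(group_independent(program)):
    print("group", i, group)
```

Here the first two instructions land in one group and the last two in another. A real EPIC compiler would also reorder instructions to fill the groups better, which this in-order sketch deliberately does not attempt.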


They also tried to radically solve the problem with the lack of registers: their number grew several times bigger. Itanium has 128 general purpose registers, 128 floating-point registers, 8 branch registers and 64 predicate registers. Well, so many 64bit registers will definitely be enough for storing any amount of numbers for any reasonable amount of execution units. Itanium, the first representative of the new family, has 5 of them: 2 integer units and 2 units for work with the memory (it makes 4 ALU instructions per clock), and 4 for floating-point operations. The physical memory is addressed with 44bit numbers, which actually limits the capacity to “only” 17.6 terabytes. The floating-point units work with 82bit number representations.

Intel gave up the idea of designing hardware 32bit x86 core, because they considered it to be just a waste of the die surface. That is why to make Itanium capable of performing the good old x86 code, they had to develop a special translation system transforming the x86 code into IA-64 code on the fly. Of course, the performance of such solution will be lower than that of the pure x86 working at the same core clock frequency. But to tell the truth, no one expects Itanium to be really fast with x86 applications: the support of this architecture is actually considered just an expense of the transitional period. Nevertheless, the fact is undeniable: this family is not good at solving 32bit tasks. Although, I really doubt that anyone will ever buy Itanium for this purpose.

Besides, Itanium was mostly a pilot project, like Pentium Pro, that is why we would consider it a demonstration of the architecture capabilities rather than a real commercial product. A typical indication of that is the fact that the chipset for Itanium processors, 460GX, supports only PC100 SDRAM, which gives you at least some idea of the overall processor performance. On the other hand, a very large L3 cache makes up for the not very fast interface between the CPU and the RAM: it can be 2MB or 4MB big and it works at the full processor frequency (733MHz or 800MHz) with a bandwidth of up to 12.8GB/sec.

Another Itanium’s goal was to solve the matter with the compilers, especially, since EPIC processors depend a lot on them, as we have already said. Unlike x86 compilers, which hardly affected the CPU performance at all, these compilers are fully-fledged partners of the EPIC CPUs, because they supply the processor with the vitally important information. So, the processor performance directly depends on the quality of the information supplied by the compiler.


Itanium 2 is a much more exciting product from the commercial point of view. This CPU was developed by the Hewlett-Packard engineering team, which is much more experienced in 64bit processor design thanks to PA-RISC, and that is why Itanium 2 is closer to perfection than its predecessor. With a slightly smaller L3 cache (1.5MB or 3MB) and a slightly higher working frequency (900MHz or 1GHz) it provides 1.5-2 times better performance in the same tasks than the first Itanium. In fact, it is the first truly commercial IA-64 product. Intel’s further plans have already been set very strictly, so there are no more revolutions to come within the next couple of years: all performance boosts will be achieved only through production technology improvements and polishing.

Later on they are going to increase the level of parallelizing in the today’s most popular way: the processor will have to move to two physical dies, which will almost double the performance at a reasonable cost. The result will anyway be much cheaper than in case they tried to fit the same number of execution units and registers onto a single die.

Yamhill technology, Intel’s half-mythical response to AMD x86-64, i.e. 64bit tuning for 32bit processors, still remains something unclear despite all rumors circulating around. No doubt that Intel does have something done in this respect, but the Pentium 4, Xeon and Itanium families cover all possible applications so well, that it doesn’t make any sense to add anything new right now.

AMD has a completely opposite approach. The company believes that it is not the right time and place for revolutions, and that the smooth evolutionary development of the x86 architecture from the times of the 8086 to Pentium 4 and Athlon XP shouldn’t be interrupted. This idea was the basis for the next generation AMD K8 processor. Even back then they could already see the future need for 64bit technology; besides, Intel had announced the development of the IA-64 architecture. That is why they simply didn’t bother with the eternal “to be or not to be” question: they just started working on it.

However, the resources, the financial capacity and the market share of AMD and Intel are incomparably different, that is why AMD couldn’t follow in Intel’s footsteps and introduce a brand new ideology in response to its competitor’s moves. So, there remained only one way for AMD, the way they have always taken: they decided to take the good old x86 and make it better than what Intel had. This is where the x86-64 architecture comes from.

First of all, this means widening the 8 general purpose registers to 64bit. However, 8 registers is awfully little for today’s processors, even if we take into account another 8 floating-point registers and 8 SIMD registers. It is an incomparably small number against the background of the new generation Itanium processors, whose large register files are part of what ensures their new performance level. We are very much used to evaluating processor performance by the cache size, but the number of registers is also very important, because they serve as a sort of pre-cache.


Anyway, we are not trying to pass for brilliant fortune-tellers here, as the processor developers have been aware of this fact for ages. And they have also been aware of the possible solutions. Say, the same Pentium 4 really has eight 32bit general purpose registers. However, these are only the registers seen by the executed code. Beyond that, there are 120 (!) other registers hidden inside the CPU, and any of them can be presented as one of the basic eight at any time. Hm, this is an interesting turn, but not an evident one, especially for the compiler. But as we remember, Pentium 4 doesn’t depend on the compiler that much.
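This technique is generally known as register renaming. The sketch below is a deliberately simplified model, not a description of how Pentium 4 actually implements it: the class `RenameTable` and its methods are hypothetical, and the model never returns physical registers to the free list, which real hardware does once older values are no longer needed. It assumes 8 visible registers backed by 128 physical ones, matching the 8 + 120 figure above:

```python
class RenameTable:
    """Toy register renamer: maps a few visible registers onto a larger hidden file."""

    def __init__(self, arch_regs, num_physical):
        # Initially each visible register is backed by one physical register.
        self.mapping = {name: i for i, name in enumerate(arch_regs)}
        # The remaining physical registers are free for renaming.
        self.free = list(range(len(arch_regs), num_physical))

    def read(self, arch):
        """Physical register that currently holds the visible register's value."""
        return self.mapping[arch]

    def write(self, arch):
        """Give a write to a visible register a fresh physical register."""
        phys = self.free.pop(0)
        self.mapping[arch] = phys
        return phys


ARCH_REGS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]
table = RenameTable(ARCH_REGS, num_physical=128)

# Two back-to-back writes to eax land in different physical registers,
# so the second write does not have to wait for readers of the first value.
first = table.write("eax")
second = table.write("eax")
print(first, second, table.read("eax"))
```

Because the program still sees only eight names, the compiler cannot plan around the hidden registers; the hardware simply removes false dependencies at run time.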

AMD chose another solution, which is more evident for the compiler. They simply doubled the number of visible registers, so that in 64bit mode K8 features 16 64bit general purpose registers. The floating-point registers remained unchanged: there are 8 of them, because x86 is not the No.1 during floating-point calculations. AMD pins most of its hope on SIMD instructions in this case, that is why they also doubled the number of SIMD registers: there are 16 of them now.

Then everything depends only on the processor working mode. There is a backward compatibility mode (AMD calls it “legacy mode”), in which K8 works like a regular 32bit CPU and doesn’t use any of its new features besides those simply improving the CPU performance. And there is the so-called “long mode”, which in its turn consists of two sub-modes: “64bit mode” and “compatibility mode”. Long mode requires a 64bit OS, which can then run both code types: 64bit code as well as regular x86 code. In the latter case the program will certainly have no access to the memory beyond 4GB, to the full width of the registers, to the additional registers, etc. All in all, Windows 95 used to do the same thing for 16bit applications.

However, the registers are all set to 32bit mode by default, because even a 64bit program doesn’t use only 64bit numbers. It is even more likely to use the good old 32bit ones. Assigning 64bit registers to 32bit numbers would be a waste of resources, that is why the default setting is 32bit. If the program needs to work with 64bit registers, it has to use a special 1-byte prefix when addressing them. This is quite a lot, especially if there are many instructions like that, although AMD estimates that the code will grow by 10% at most when moving from x86 to x86-64.
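In AMD's documentation this prefix is called REX (the name does not appear in the text above). A minimal sketch of how the prefix byte is composed and what it does to the size of one instruction follows; the function `rex_prefix` is our own helper, and the example encodes a register-to-register ADD by hand rather than with an assembler:

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte: binary 0100WRXB.

    w=1 selects 64bit operand size; r, x and b each extend a register
    field by one bit, which is how the eight additional registers are reached.
    """
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


# "add eax, ebx": opcode 0x01 followed by the ModRM byte 0xD8.
add_eax_ebx = bytes([0x01, 0xD8])

# "add rax, rbx": the same two bytes preceded by a REX prefix with W set.
add_rax_rbx = bytes([rex_prefix(w=1)]) + add_eax_ebx

print(add_eax_ebx.hex(), "->", add_rax_rbx.hex())
print("size:", len(add_eax_ebx), "->", len(add_rax_rbx), "bytes")
```

A two-byte instruction becoming a three-byte one looks dramatic, but only instructions that actually need 64bit operands or the new registers pay this price, which is why the overall code growth stays modest.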


But on the other hand, 32bit programs will feel absolutely at home in all backward compatibility modes: they will simply not realize that they are running on a 64bit processor, while still taking advantage of the other benefits of the new technology, such as the on-die chipset North Bridge and the like. And keeping in mind that during the next few years 32bit will be more than enough for most applications running on PCs, workstations and low-end servers, Athlon 64 and Opteron will have a definite advantage over Intel Itanium: they are not offering any useless extras for extra money. Need 64bit? No problem. Don’t need 64bit? Then let’s use our dear 32bit mode without any slow emulation.

Actually, the major question is: do we really need such CPUs? I mean CPUs capable of running the standard 32bit code, but using the advantages of 64bit calculations and access to more than 4GB of memory if necessary. Some people think we do. Among them I could mention Tim Sweeney, the developer of the Unreal game. He believes that a platform like that will allow building low-cost workstations with a much better price-to-performance ratio than that of Itanium based systems. By the time Athlon 64 is out, Epic is going to release a 64bit version of Unreal. Actually, they are very optimistic about the prospects of this approach.

Moreover, the mere possibility to access more registers (the additional eight general purpose and eight SIMD registers) speeds up the rewritten code by tens of percent, according to the developers. Add to this higher processor performance due to pure technical enhancements with almost zero price growth (unlike Itanium) and you will get an excellent product for its niche.

The first representative of the K8 family has already come out: it is Opteron, a server solution for one-, two- and four-way configurations, with a possibility to build eight-way systems in the future. Unfortunately, not everything is clear yet with its “lite” version, Athlon 64: the CPU launch has been postponed several times already, and is now expected some time closer to 2004. You wonder what causes these delays? Well, at times AMD faced technological problems, and at times it probably didn’t see any real 64bit PC applications (including Windows-64, because Unix is not a PC operating system), as we have already mentioned, and hence decided to wait a little longer. I tend to believe that now it is mostly the second reason, and AMD is waiting for Microsoft to help it out. Hopefully, this will not happen too late.


However, there are 64bit processors that do not have any problem with the OS and other software support. Of course, we are talking about the server veterans: RISC solutions from IBM, Sun and Compaq.

The IBM PowerPC processor family known as Power4/Power4+ is actually today’s biggest competitor to Intel Itanium 2. Power4+ was the first 0.13micron 64bit processor in the market, and since then IBM has been working on the improvement of its technical specs. So, today we have a relatively inexpensive processor with two physical cores, 1.4MB-1.5MB L2 cache and on-die North Bridge. The processor uses a pretty exotic approach: it combines up to 5 instructions into a group and then processes the group as a single unit. On the one hand, it makes the work much simpler, but on the other, the processor has to go far back if something goes wrong. Besides, this limitation does not do much good for the possibility to parallelize operations, thus reducing the overall flexibility.

We can’t help mentioning one more PowerPC representative: 64bit PowerPC 970 targeted for a totally different market segment. This processor is designed for PowerMac computers and is none other but a cut-down Power4 version featuring Altivec SIMD instructions from Motorola. In fact, it is a nearly ideal replacement for PowerPC G4 and G4+. It is a very economical solution, which also boasts the core clock potential beyond 2GHz.


Sun has been taking quite a few hits from the competitors lately, losing more and more of its share in the server and server processor market. The company has been too slow developing its product families, so that today’s UltraSPARC III looks very unattractive in terms of performance against the competitors’ background. Moreover, UltraSPARC IV is just a 0.13micron version of the predecessor. We can expect something really new only from UltraSPARC V, but this is still a while ahead, as we can hardly hope to see it before 2005.

Alpha? The situation here is mostly evident: the architecture has no future. HP made a long-term stake on Itanium, Samsung hasn’t been very active here, either. Nevertheless, the inertia of this architecture is big enough, so it will keep developing even without any support on the side. Small but powerful 1.25GHz EV68 appeared powerful enough to compete with a more serious-looking Power4, and its EV7 successor turns out the today’s most powerful solution for multi-processor complexes. Besides, HP is going to introduce a 0.13micron EV7 version aka EV79 this year, which will boast a much larger L2 cache and will support faster PC1066 RDRAM.

The major question, however, is not about the technical characteristics of the processors, but about the power behind them: are the manufacturers ready to ensure strong support? From this point of view, and also due to their technical potential, Itanium and Power4+ will definitely be among the leaders one day. The future of Athlon 64 and Opteron is not quite clear. Everything depends on their ability to use the 64bit potential to the full extent, which means that they will need 64bit operating systems and mass applications. And the situation here is far from good today. However, this is not at all surprising: it took Microsoft about 10 years to shift from 16bit to 32bit.
