SSE Technology in New Intel Prescott Processors

In this detailed technology overview we look at the SSE technology implemented in the new Intel Prescott processors: its exciting history, its peculiarities and the advantages it offers to software developers. We are also going to compare the pros and cons of the new SSE instructions with AMD x86-64.

by Lev Dymchenko
03/25/2003 | 10:30 PM

At the last Intel Developer Forum, Intel officially introduced its new PC processor (see this news story for more information on the new solution). It is a next-generation processor manufactured with 90nm technology, which allows clock frequencies of up to 4-5GHz. The new manufacturing process must have made it economically justifiable to increase the L2 cache to 1MB and to double the L1 cache. The FSB frequency grew to 800MHz. Overall, nearly every unit of the CPU has been improved in some way. But what does this polished product bring to software developers? A larger cache is a good thing: you worry less about the speed of memory reads and writes, which often becomes the limiting factor. But it doesn’t eliminate all problems: when there is a lot of data, even a double-sized cache won’t help much.

The higher front-side bus speed suggests that Intel’s new processor will be rather well-balanced and free from evident bottlenecks, unlike some previous processor models whose performance didn’t grow in proportion to their frequency.

But things like the new seven-layer CPU design are hardly of any interest to software developers. It is much more important for them to know what new processor instructions have become available and what optimization techniques a program should use to reach maximum performance, or at least to run no slower than on previous processor models. Intel’s latest processor, Pentium 4, required significant software optimization to achieve higher performance. Across a wide range of tasks, Pentium 4 would lose to a Pentium III of the same frequency, or even of half the frequency. We will discuss this phenomenon in detail later in this article. For now we will just point out that the main reasons were connected with the radical redesign of the processor core needed to support higher frequencies.

There is nothing revolutionary about the new CPU core from Intel. Everything Pentium 4 dislikes (especially branching) has been passed on to the 90nm newcomer. It has even grown worse! To increase the clock rate, Intel lengthened the Prescott pipeline, so we may expect quite significant performance losses whenever a branch misprediction forces the pipeline to be flushed.

But there is also good news: the extension of the processor instruction set. Software developers were “very pleased” about the introduction of MMX, SSE and SSE2, as they had to do extra work to optimize their programs for these instructions. Otherwise, the programs would never run fast. But the 13 new instructions introduced in Prescott really do make the developer’s lot a great deal easier.


MMX, 3DNow!, SSE, SSE2: Operation and Optimization Principles

Before discussing the new instructions, let’s have a brief overview of the previous SIMD extensions of the processor instruction set. First of all we will estimate the possible performance growth provided by each instruction set. If you are already familiar with this topic, please skip straight to the next chapter of this article.

Sometimes in product reviews you can read that some program is well optimized for SSE, which allows the CPU tested to show high performance when running it. But what is SSE? Before going further, let’s recall the meaning of the abbreviations: SSE is Streaming SIMD Extensions, and SIMD is Single Instruction, Multiple Data (several operands are processed by a single instruction).

How did older x86 processor models, like the 486 or the first Pentium, work? It was very simple. They had a few registers to store binary numbers (one number per register). You could add the numbers in two registers and compare the result with the number stored in a third register; if the result was bigger, execution could jump to another part of the instruction list. But problems arose when engineers tried to find ways to increase CPU performance. The thing is that the processor can’t execute an instruction before the necessary operands have been calculated by the preceding instructions. We could make a CPU with a hundred ALUs, but it wouldn’t work any faster, as only one ALU would be working while the others waited for the results of its calculation. That’s why people at Intel decided to introduce processor instructions that process more than one pair of operands at a time. Here is the way it works in MMX, the first SIMD extension for x86:

Operation   Part 3   Part 2   Part 1   Part 0   Register
            70       50       30       10       1st register
+
            80       60       40       20       2nd register
=
            150      110      70       30       result

Four pairs of numbers are summed at a time rather than a single pair. The same works for subtraction, multiplication and other operations dealing with several pairs of operands, with each group of operands packed into its own register. This technology makes it easy to speed up the CPU by increasing the number of calculation units while the frequency remains unchanged. To be more exact, it is not the number of calculation units that increases, but the number of operand pairs they process at a time. However, this distinction is not that important for software development.
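The addition shown in the table can be modelled in a few lines of portable C++. This is only a sketch of the semantics, not MMX code: the loop does one lane at a time, whereas a packed-add instruction (PADDW, in MMX terms) does all four additions at once. The names Lanes4 and packed_add4 are made up for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Four 16-bit lanes, as they would sit side by side in one 64-bit MMX register.
using Lanes4 = std::array<std::int16_t, 4>;

// Scalar model of a packed add: every lane is summed independently
// of its neighbours, so no carry crosses from one lane into the next.
Lanes4 packed_add4(const Lanes4 &a, const Lanes4 &b)
{
    Lanes4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        // On mainstream compilers the cast wraps around on overflow,
        // which is also what the non-saturating packed add does.
        r[i] = static_cast<std::int16_t>(a[i] + b[i]);
    }
    return r;
}
```

Calling packed_add4 with the lanes (10, 30, 50, 70) and (20, 40, 60, 80), listed from Part 0 to Part 3, reproduces the result row of the table.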

Of course, to achieve this performance boost, the program must actually use SIMD instructions. The CPU cannot by itself gather different data that need the same processing into a single register. The programmer has to tell the CPU explicitly to load particular data into MMX registers and perform SIMD operations on them. The data should be specially laid out in memory to “fit” into the registers. In some cases special SIMD-optimizing compilers may be of help: wherever possible they use one highly efficient SIMD instruction instead of several ordinary ones. But, as a rule, the program code has to be written according to certain guidelines to be compiled effectively.

Not every algorithm can be optimized using SIMD. Consider the expression (a + b * c) * d. You can’t calculate it in fewer than three instructions, because each operation needs the result of the previous one. On the other hand, a task such as adding four vectors (x, y, z, w), that is, calculating x1+x2+x3+x4, y1+y2+y3+y4, z1+z2+z3+z4 and w1+w2+w3+w4, also takes only three SIMD instructions, although it involves twelve additions. If the CPU performs a SIMD instruction as fast as an ordinary one, this brings a severalfold performance growth for SIMD-optimized programs.
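As a rough illustration of the four-vector sum, here is a sketch written with the SSE intrinsics from xmmintrin.h. The function name sum4 and the convention that each pointer refers to four consecutive floats are assumptions made for this example. The scalar branch is only there so the sketch still builds when the compiler is not targeting SSE.

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics

// Sums four 4-component vectors with three packed additions (ADDPS).
void sum4(const float *v1, const float *v2, const float *v3, const float *v4,
          float *out)
{
    __m128 s = _mm_add_ps(_mm_loadu_ps(v1), _mm_loadu_ps(v2));  // 1st SIMD add
    s = _mm_add_ps(s, _mm_loadu_ps(v3));                        // 2nd SIMD add
    s = _mm_add_ps(s, _mm_loadu_ps(v4));                        // 3rd SIMD add
    _mm_storeu_ps(out, s);
}
#else
// Scalar fallback: the same result takes twelve ordinary additions.
void sum4(const float *v1, const float *v2, const float *v3, const float *v4,
          float *out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = ((v1[i] + v2[i]) + v3[i]) + v4[i];
}
#endif
```

Both branches add in the same order, so they give identical results; only the instruction count differs.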

From the developer’s point of view, there are several types of SIMD optimization: direct manual code optimization, which is very hard although most effective; the use of a SIMD compiler, which may produce unexpected results; and the use of standard application libraries, optimized and distributed by the CPU makers themselves. Those libraries contain standard functions to perform the most popular tasks. This last variant seems the most attractive.

Now, what is the difference between the various SIMD extensions? How does MMX differ from 3DNow! or SSE? First of all, they differ in the type of supported data, in the size and number of registers and in the sets of available instructions. The data type is the most important difference, as the other parameters are more or less similar across the extensions.

The CPU works with data in a number of formats; the most important of them are the floating-point and integer formats. Different tasks require different data formats. The second thing to consider is the size in bytes. Floating-point data hold approximations of real numbers: the more bytes are assigned to each number, the higher its precision. The more bytes are allocated for an integer, the wider the range of values it can hold.


MMX

The MMX extension appeared quite a long time ago and is now considered standard for the PC. MMX stands for Multi Media Extensions. This extension was intended for processing multimedia data: images and sound.

Processors supporting MMX have eight MMX registers, each 64 bits (8 bytes) wide. MMX works only with integer numbers; 1-, 2-, 4- and 8-byte data are supported. That is, one MMX register can store 8, 4, 2 or 1 operands.

Byte 7   Byte 6   Byte 5   Byte 4   Byte 3   Byte 2   Byte 1   Byte 0
-128     127      100      70       60       -50      20       10

Word 3   Word 2   Word 1   Word 0
15000    -30000   20000    10000

And so on. The data stored in the MMX registers can be added, multiplied and subtracted componentwise. There are other instructions that often occur in multimedia applications, like saturating addition (which clamps the result instead of overflowing), arithmetic mean calculation, and the bitwise logical operations and, or and xor. There is one restriction, however: there is no division operation. But still, a lot of operations can be performed much faster than before. On the other hand, MMX requires manual optimization; no compiler can help you much. Various audio codecs, for example, are often optimized for MMX, as their algorithms get along well with it. Usually only a small part of the program, the one doing most of the encoding work, is optimized. This simplifies the entire optimization procedure a lot.
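Addition without overflow, usually called saturating addition, is easy to model for a single lane. The sketch below is plain C++ rather than MMX code, and add_saturate is a name invented for this example; the corresponding MMX instruction for signed bytes is PADDSB, which applies the same rule to all eight bytes of a register at once.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of one lane of a saturating signed byte add:
// the result is clamped to [-128, 127] instead of wrapping around.
std::int8_t add_saturate(std::int8_t a, std::int8_t b)
{
    const int sum = static_cast<int>(a) + static_cast<int>(b);
    return static_cast<std::int8_t>(std::min(127, std::max(-128, sum)));
}
```

For sound samples or pixel brightness this is usually what you want: 100 + 100 gives 127 (the loudest or brightest value) rather than a wrapped-around negative number.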

SSE2 – Integer Instructions

We jump from the “oldie” MMX straight to the newcomer SSE2. It makes sense, as SSE2 consists of two quite different parts: a development of SSE and a development of MMX. The former deals with real numbers, the latter with integers. Compared to MMX, SSE2 registers are twice as big, i.e. they hold not 8 single-byte numbers but 16. That means up to twice the application performance after SSE2 optimization, because the instruction processing speed remained unchanged. By the way, a program already optimized for MMX can easily be further optimized for SSE2 thanks to the similarity of their instruction sets.

Athlon XP processors don’t support SSE2. So we could witness a curious picture: Pentium 4 would at first lose to Athlon XP in speed, but after the application was optimized for SSE2, it would run faster on Pentium 4.

We should acknowledge that the idea to develop MMX into SSE2 was most felicitous. There are few programs that are optimized for MMX, but those that are, are optimized very thoroughly. We also should mention that Intel offered the software developers a number of SSE2-optimized libraries with some typical encoding functions almost for free, which played a crucial part in “saving” the Pentium 4 performance.


SSE

Let’s now discuss the SSE instruction set. It was introduced in Pentium III processors, but came into full bloom after the launch of Pentium 4, where the use of SSE provided a great performance boost.

SSE is intended for faster processing of real (floating-point) data. Such data are often used in geometrical calculations, that is, in 3D graphics applications, computer games, 3DStudioMax-like editors and a number of other tasks. After 3D accelerators took over texturing in Quake-like games, the need for integer calculations became less urgent. It was now more important to speed up floating-point calculations, like the multiplication of a floating-point vector by a floating-point matrix. Let’s see what SSE can offer to the developer.

Due to the introduction of SSE, the processor acquired eight new 128-bit registers in addition to the standard x87 registers. Each register stores four 32-bit floating-point numbers.

Part 3   Part 2    Part 1     Part 0
                                          Register 7
                                          Register 6
                                          Register 5
                                          Register 4
                                          Register 3
*        *         *          *           Register 2
2        55.9      -1.9e10    1.567e-6    Register 1
0.7      -100.0    11.2       0.5         Register 0

The following arithmetic operations can be performed componentwise on the groups of four numbers stored in the registers: you can add, subtract, multiply or divide two groups of four. You can also find four square roots or inverse square roots at a time, accurately or approximately. The register contents can be shuffled, moved from one part of the register to another and so on. But moving data is no faster than adding it, so SSE is most effective when applied to specially pre-packed data.

Pentium 4 takes about as long to execute one SSE operation as an ordinary instruction. That means optimization may bring about a fourfold performance growth, or even more thanks to the new large registers. But not all calculations can be effectively optimized for SSE. An example of a “good” task is the multiplication of a 4x4 matrix by a four-dimensional vector: you get a fourfold acceleration without any problems.
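To make the matrix example concrete, here is one possible way to write it with SSE intrinsics. It is a sketch under a few assumptions of ours: the matrix is stored column by column in 16 consecutive floats, the function is called mat4_mul_vec4, and unaligned loads are used for simplicity. The scalar branch exists only so that the code builds without SSE.

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics

// out = M * v, where m[0..3] is column 0 of M, m[4..7] is column 1, and so on.
void mat4_mul_vec4(const float *m, const float *v, float *out)
{
    // Each column is scaled by one component of v (broadcast to all four lanes),
    // and the four scaled columns are accumulated.
    __m128 r = _mm_mul_ps(_mm_loadu_ps(m + 0), _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4), _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8), _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), _mm_set1_ps(v[3])));
    _mm_storeu_ps(out, r);
}
#else
// Scalar fallback: 16 multiplications and 12 additions.
void mat4_mul_vec4(const float *m, const float *v, float *out)
{
    for (int row = 0; row < 4; ++row)
        out[row] = m[row] * v[0] + m[4 + row] * v[1]
                 + m[8 + row] * v[2] + m[12 + row] * v[3];
}
#endif
```

The SSE branch needs four packed multiplications and three packed additions where the scalar one needs 28 arithmetic operations, which is where the fourfold figure comes from. The column-by-column layout matters: with rows stored contiguously, each output component would need a horizontal sum instead.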

First of all, the use of SSE allows modern processors to compete with up-to-date graphics accelerators when transforming the vertices of the triangles that make up a 3D scene. However, the CPU has a lot of other work to do, and it’s better to offload it as much as possible so that it works in parallel with the 3D accelerator.

And what about Athlon XP? Actually, the main innovation in this CPU compared to ordinary Athlons is the implementation of SSE. We may expect about a twofold speedup compared to ordinary program code. We should note, however, that SSE is implemented slightly “worse” in Athlon XP, while ordinary code is executed very efficiently. Athlon XP also has a certain advantage in branch prediction. It’s really good at that.

When only SSE instructions are used, Athlon XP and Pentium 4 of the same core frequency show similar performance. But Pentium 4 reaches much higher working frequencies, which together with the opportunity to use SIMD instructions makes it much faster in a variety of tasks.


3DNow! and 3 Mistakes in AMD’s Strategy

Now we will dwell upon the instruction set extension introduced by AMD. It can be viewed as a competitor to SSE or, to be more exact, SSE is the competitor to 3DNow!, as it appeared later. The 3DNow! set was introduced in AMD K6-2-3D processors, competing with Intel Pentium II. We view 3DNow! and SSE as rivals, because both extensions work with real numbers and are intended for geometry-related applications.

Well, 3DNow! is very similar to SSE, but there are some differences, too. It also works with eight registers, but they are 64 bits wide, not 128 (3DNow! reuses the MMX registers rather than adding new ones). Thus, each stores two numbers, not four. The operations are similar to those of SSE: you can add, subtract or multiply two pairs of operands, and derive a reciprocal or an inverse square root approximately, refining the result with extra instructions when more accuracy is needed (the approximate form is faster).

Part 1     Part 0
                      Register 7
                      Register 6
                      Register 5
                      Register 4
*          *          Register 3
10000.1    6.7        Register 2
-0.5       1.5e7      Register 1
2.0        1.0        Register 0

As you may guess, 3DNow!-optimized code would run twice as fast thanks to the simultaneous processing of two pairs of operands. Seems less promising than SSE, doesn’t it? Indeed, if you sit down to optimize your program, you want the maximum possible performance growth. This factor, combined with the traditionally dominant position of Intel processors in the market, played a crucial role and prevented 3DNow! from becoming widespread among software developers.

By the way, SSE in Pentium III provided only half the improvement it does in Pentium 4, and almost equaled the effect of 3DNow! in AMD Athlon processors. So it was mostly the predominance of SSE-supporting CPUs that sealed the fate of 3DNow!.

Nevertheless, 3DNow! offered some appealing options. One of them was the possibility to add up numbers stored in one register. That is, you can perform “horizontal” operations as well as “vertical” ones. This flexibility comes in handy in a lot of popular tasks, for example when calculating the scalar product of two 3D vectors. Try doing that with SSE and the result will be disappointing: you cannot add up the elements of the long SSE register without involving extra registers. As a result it will be no faster than without SSE, and maybe even slower. And the scalar product is a very popular operation, needed in particular to get a vector norm or the distance between two points. In this case 3DNow! looks much more preferable thanks to its higher flexibility.
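To show what “involving extra registers” means in practice, here is a sketch of a scalar product written with the original SSE intrinsics only. The name dot4_sse is ours, and the vectors are assumed to be stored as four floats (for a 3D vector, the fourth component would simply be zero). The scalar branch is a fallback for builds without SSE.

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics

float dot4_sse(const float *a, const float *b)
{
    // One packed multiply gives the four products (p0, p1, p2, p3).
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    // Extra register and shuffle no. 1: swap neighbours -> (p1, p0, p3, p2).
    __m128 t = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 s = _mm_add_ps(p, t);  // (p0+p1, p0+p1, p2+p3, p2+p3)
    // Shuffle no. 2: swap the two halves -> (p2+p3, p2+p3, p0+p1, p0+p1).
    t = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2));
    s = _mm_add_ps(s, t);         // every lane now holds the full sum
    return _mm_cvtss_f32(s);      // take the lowest lane
}
#else
// Scalar fallback with the same order of additions.
float dot4_sse(const float *a, const float *b)
{
    return (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
}
#endif
```

Only one of the five instructions after the loads is a multiplication; the other four just move and add partial sums inside the register, which is exactly the overhead a horizontal add is meant to remove.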

One more advantage of 3DNow! is the possibility of effective automatic optimization by the compiler. SSE, with its large registers, is too bulky for automatic data organization. A 3DNow! compiler would make a floating-point-heavy program run 1.5 times faster. But AMD didn’t bother to produce one, while Intel was actively promoting its SSE-supporting and 3DNow!-oblivious compiler. Things took such a turn that AMD had to use Intel’s compiler for its SPEC benchmark results (www.spec.org). They simply used the most effective compiler to reach the highest performance in the benchmark.

Software developers were not eager to optimize their programs once more for Athlon CPUs: they had had enough trouble with Intel processors already. As a result, either a program had no SIMD optimization at all and Athlon did well compared to Pentium 4, or it was optimized for SSE(2) and Athlon would lose.

Overall, AMD made a mistake with 3DNow!. It never got beyond the advertising level. Among popular 3DNow!-optimized applications we can only name OpenGL drivers, where we could see a significant performance boost. I believe that since AMD couldn’t push its instruction set forward, it should have carefully implemented all of Intel’s innovations in its processors. If it had implemented SSE2 in Athlon XP instead of 3DNow!, even with a lower efficiency, Athlon XP would have been a very strong product, almost free from weak spots.

Running a little ahead, we should say that AMD has to take care of a compiler for its processors. The new x86-64 architecture from AMD requires existing software to be recompiled before it can use the new capabilities. AMD Athlon 64 still features 3DNow!, so we will have an opportunity to compare the efficiency of SSE and 3DNow! optimizations.

SSE2

SSE2 is the latest SIMD extension for x86 processors on today’s market. We have already discussed the integer component of this set above. But SSE2 has more than integer instructions. The same eight 128-bit registers can also be interpreted as storing not four 32-bit floating-point numbers, but two 64-bit double-precision floating-point numbers. Higher-precision numbers are used when calculations with ordinary precision result in errors that are too large. The SSE operations are then applied to two pairs of operands, not four. The approximate calculation of the square root is no longer available, of course.
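A tiny example shows the kind of error the 64-bit format avoids. The two helper functions below are invented for this illustration; they are plain C++ and do not use SSE2 explicitly. The volatile qualifiers are there only to force each value to be rounded to its declared type.

```cpp
// A 32-bit float has a 24-bit significand, so 2^24 + 1 cannot be represented:
// the sum is rounded back to 2^24 and the added 1 is lost.
bool float_loses_one()
{
    volatile float big = 16777216.0f;  // 2^24
    volatile float sum = big + 1.0f;
    return sum == big;
}

// A 64-bit double has a 53-bit significand and stores the same sum exactly.
bool double_keeps_one()
{
    volatile double big = 16777216.0;  // 2^24
    volatile double sum = big + 1.0;
    return sum == 16777217.0;
}
```

Both functions return true. In a long accumulation such lost increments add up, which is when the two-operand 64-bit mode is worth the halved throughput.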

Part 1     Part 0
                      Register 7
                      Register 6
                      Register 5
                      Register 4
                      Register 3
*          *          Register 2
-1.5e10    0.00001    Register 1
1e-25      5.5        Register 0

So, we have a kind of 3DNow! analogue, but without the flexible addition of the numbers stored in one register.

What about speed? If you need the high precision of SSE2, you will “only” get a double performance growth.

We can put together a kind of rule to estimate the performance of calculation-heavy programs. If the program is SSE2-optimized, it will run faster on Pentium 4 than on Athlon XP. If there is no special optimization, Pentium 4 will lose to an Athlon XP of the corresponding rating.

New Prescott Instructions

At last, we have come to the point of this article: the new instructions introduced by Intel in its Prescott CPU.

Many developers of 3D applications want to have a handy and fast class library that would represent geometrical objects, vectors and matrices.

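#include <cmath>

class Vector
{
    float x, y, z;

public:
    Vector(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}

    // addition
    friend Vector operator+(const Vector &a, const Vector &b)
    {
        return Vector(a.x + b.x, a.y + b.y, a.z + b.z);
    }

    // scalar product
    friend float Dot(const Vector &a, const Vector &b)
    {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    float norm2() const { return Dot(*this, *this); }  // vector length squared
    float norm() const { return std::sqrt(norm2()); }  // vector length
};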

That lets you write just a = b + c instead of the clumsy a.x = b.x + c.x; a.y = b.y + c.y; a.z = b.z + c.z. But there have always been troubles with that. First, inefficient compilers handled calls to the vector addition function so badly that it was easier to write in the “clumsy” but faster way. Then it turned out that SSE helped with adding vectors, but didn’t help much with scalar or vector products.

And now we witness a kind of miracle: the new instructions should make the optimization of vector mathematics (along with other things) for SSE and SSE2 much easier.

Let’s delve a little bit into details.

First of all, it is now possible to add up the components of one SSE register.

            w2       z2       y2       x2
HADDPS
            w1       z1       y1       x1
=
            z1+w1    x1+y1    z2+w2    x2+y2

Thus, it is possible to derive the scalar product of two vectors or a vector’s norm in only three instructions and without using additional registers. Of course, we hope the “horizontal” addition instruction executes fast enough. At least, we hope it doesn’t flush the CPU pipeline.
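The three instructions in question would be one packed multiplication followed by two horizontal additions. The sketch below expresses this with the SSE3 intrinsic _mm_hadd_ps from pmmintrin.h. The function name dot4_hadd is hypothetical, and vectors are again assumed to be stored as four floats. The intrinsic branch is compiled only when the compiler targets SSE3 (with GCC or Clang that usually means passing -msse3); otherwise a scalar fallback computes the same value.

```cpp
#if defined(__SSE3__)
#include <pmmintrin.h>  // SSE3 intrinsics (the new Prescott instructions)

float dot4_hadd(const float *a, const float *b)
{
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));  // MULPS:  (p0, p1, p2, p3)
    p = _mm_hadd_ps(p, p);    // HADDPS: (p0+p1, p2+p3, p0+p1, p2+p3)
    p = _mm_hadd_ps(p, p);    // HADDPS: the full sum in every lane
    return _mm_cvtss_f32(p);  // take the lowest lane
}
#else
// Scalar fallback with the same order of additions.
float dot4_hadd(const float *a, const float *b)
{
    return (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
}
#endif
```

No temporary register and no shuffles are needed. Whether this is actually faster than a shuffle-based version depends on how quickly the processor executes HADDPS, which this sketch does not measure.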

Similarly, we can subtract the components of one register.

It is also possible to add and subtract the two numbers located in one SSE2 register. There is also an instruction that combines the addition and subtraction of elements.

            y2       x2
ADDSUBPD
            y1       x1
=
            y1+y2    x1-x2

This instruction owes its appearance to the nature of complex number multiplication: (a+bi)*(c+di) = a*c-b*d + (b*c+a*d)i. SSE2 optimization of complex number calculations has now become considerably simpler. In fact, SSE2 is now much like 3DNow!, but works with higher-precision numbers.
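Here is how the formula (a+bi)*(c+di) might map onto ADDSUBPD. It is a sketch under our own assumptions: the helper is called cmul, it takes std::complex<double> values, and the real part is kept in the low half of the register. As before, the intrinsic branch requires the compiler to target SSE3, and a scalar branch with the same formula is provided otherwise.

```cpp
#include <complex>

#if defined(__SSE3__)
#include <pmmintrin.h>  // SSE3 intrinsics

std::complex<double> cmul(std::complex<double> x, std::complex<double> y)
{
    const __m128d ab = _mm_set_pd(x.imag(), x.real());         // (a, b), a in the low half
    const __m128d ba = _mm_shuffle_pd(ab, ab, 1);              // (b, a)
    const __m128d t1 = _mm_mul_pd(ab, _mm_set1_pd(y.real()));  // (a*c, b*c)
    const __m128d t2 = _mm_mul_pd(ba, _mm_set1_pd(y.imag()));  // (b*d, a*d)
    // ADDSUBPD subtracts in the low half and adds in the high half:
    const __m128d r = _mm_addsub_pd(t1, t2);                   // (a*c - b*d, b*c + a*d)
    double out[2];
    _mm_storeu_pd(out, r);
    return {out[0], out[1]};
}
#else
// Scalar fallback: the same formula written out directly.
std::complex<double> cmul(std::complex<double> x, std::complex<double> y)
{
    const double a = x.real(), b = x.imag(), c = y.real(), d = y.imag();
    return {a * c - b * d, b * c + a * d};
}
#endif
```

Without ADDSUBPD, the mixed subtract-and-add step needs an extra operation to flip the sign of one half of the register before a plain packed addition.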

Here is the table listing all the new Prescott instructions.

Instruction     Description

Horizontal operations with registers

HADDPS          Horizontal addition of the SSE registers
HSUBPS          Horizontal subtraction of the SSE registers
HADDPD          Sum of two elements stored in the same SSE2 register
HSUBPD          Difference of two elements stored in the same SSE2 register

These are very useful instructions from all viewpoints, long awaited by software developers. They greatly ease both automatic and manual optimization.

Data loading commands

MOVSHDUP        Loads the destination register by duplicating only the 2nd and 4th 32-bit elements of the source
MOVSLDUP        Loads the destination register by duplicating only the 1st and 3rd 32-bit elements of the source
MOVDDUP         Loads the destination register by duplicating the first (lower) half of the source register

Useful for automatic and manual optimization.

Combined addition-subtraction

ADDSUBPS        (x1,y1,z1,w1) * (x2,y2,z2,w2) = (x1-x2, y1+y2, z1-z2, w1+w2)
ADDSUBPD        (x1, y1) * (x2, y2) = (x1-x2, y1+y2)

Simplify operations with complex numbers a lot. Help during automatic program optimization.

Special data loading

LDDQU           Optimized loading of unaligned data

For relatively rare fine-tuning, both manual and automatic.

Data transformation

FISTTP          The only new x87 instruction. Converts the value on top of the co-processor stack to an integer, with truncation

Very helpful for compilers during automatic software optimization. This instruction had been missing from the x87 instruction set ever since it was introduced long ago; now the gap has been filled.

Hyper-Threading support improvement

MONITOR/MWAIT   The processor tracks writes to a selected region of memory and wakes up the “sleeping” thread

Simplifies the optimization of programs and OS services for Hyper-Threading, and for multi-threading in general.
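Why would a conversion instruction help compilers? C and C++ define a floating-point-to-integer cast as truncation toward zero, while the older x87 store instruction (FISTP) rounds according to the current rounding mode, which by default is round-to-nearest. To implement a cast, a compiler therefore had to switch the rounding mode before the store and restore it afterwards. FISTTP always truncates, so a single instruction suffices. The two behaviours can be compared in standard C++; the function names below are ours, and std::lrint stands in for a store that follows the current rounding mode.

```cpp
#include <cmath>

// What the language requires of a cast: truncation toward zero.
// This is the behaviour FISTTP provides directly.
int to_int_truncate(double x)
{
    return static_cast<int>(x);
}

// Rounding according to the current rounding mode
// (round-to-nearest, ties to even, unless the program changes it).
long to_int_nearest(double x)
{
    return std::lrint(x);
}
```

For 2.7 the first function returns 2 and the second returns 3; for -2.7 they return -2 and -3.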


Intel’s Innovations Efficiency and Their Comparison with AMD x86-64

The additional Prescott instructions make the whole instruction set well-shaped and complete. We can live with it for a long time, until some radical changes happen. Of course, we might wish for more registers, but that would be hard to do without losing compatibility because of the restrictions of the x86 instruction format. But there are a lot of little treats, like the effective FISTTP conversion instruction. Automatic generation of SIMD-optimized code has also become much simpler.

The upcoming processor may have only one weak spot besides (possibly) the price: the long pipeline and, accordingly, a strong “dislike” of conditional branching. The longer pipeline seems to be an unavoidable evil on the way to higher frequencies.

Overall, I think Prescott is going to be the most polished product Intel has released in the last few years. I am not talking now about its power supply and cooling requirements; we just hope there won’t be any problems here. As for the price, there is every reason to think that it won’t be too high. Look at the size of the L2 cache: it is 1MB. What would the next Celeron be like then? Is it going to have a 512KB L2 cache? That isn’t our good old Celeron, it’s just a monster! Most applications don’t need more than 256KB of cache, so there is a big reserve for cost reduction. The transition to the new manufacturing process does promise to be profitable.

There is one circumstance, however, that could boost the processor’s price: a lack of competitors. We’d like to view the new 64-bit processor from AMD, Athlon 64, as a strong rival to Prescott. You probably already know a lot about its technical peculiarities and strengths; if not, refer to our AMD Athlon 64 article, which covers this topic in great detail. Here we would like to dwell upon its attractiveness to software developers.

AMD has finally implemented SSE2 support. Now Intel processors won’t have an a priori advantage in applications that use SSE2. Athlon 64 features a compatibility mode and a 64-bit mode; the latter brings all the advantages of the AMD x86-64 architecture. In compatibility mode, the new processor looks to the software developer like a Pentium 4 with 3DNow! bolted on. Moreover, it is free from certain drawbacks like the terrible execution of non-SIMD-optimized code and the dislike of branching. We guess the new AMD processor is going to be a worthy rival to Pentium 4, provided, of course, that it ever reaches the boxing ring.

We may venture a prediction that Pentium 4 will need a considerable frequency advantage to equal the new Athlon in composite benchmarks that cover a wide range of applications. Moreover, a lot of benchmark results for Athlon 64 samples on the Web show a performance-per-clock ratio twice as high as that of Pentium 4.

But what about the exclusive 64-bit mode? Besides the need to recompile programs, there are three main things: twice as many registers, 64-bit arithmetic and a virtual address space of over 2GB per application. 64-bit arithmetic is a good thing by itself, but it has a narrow scope in desktop systems; most applications do quite well with 32-bit arithmetic. AMD has pointed out one relevant field of application for 64-bit integer arithmetic: much faster execution of some cryptography tasks.

The larger address space is going to be in demand in desktop applications in the near future, although 2GB of RAM will remain quite a rare thing to see in desktops for the next few years.

The extra registers in the CPU give new opportunities to the developer. They can hold function parameters and allow independent instructions to be arranged in the code more efficiently, without extra memory accesses. Moreover, the optimizing compiler will take care of all of this. Thanks to the increased number of registers, the CPU will be kept busier and the memory bus less so. This seems to be one of the major advantages. The thing is that modern CPUs feature a lot of lazy write buffers that allow registers to be swapped out without loading the memory that heavily. If the data just written is immediately needed again, it is taken from the buffer in no time rather than from memory. By the way, the number of such buffers has been increased in Prescott.

AMD’s new processor is going to be a success in server systems. It doesn’t fear branching, has a large address space and many registers, and can encrypt very fast. All this suits a database server well. Acceptable pricing may make it an ideal solution in a certain niche.

As for its prospects in the desktop market, especially in heavy graphics applications like 3D games, it can do quite well there, too. What factors determine performance in a computer game? First, there is scene transformation, the culling of invisible elements with the portal technique, AI and physical model calculations. The second factor is fast data exchange with the graphics accelerator. The sky-high frequencies of Intel processors as well as the high system bus frequency are going to help in “feeding” the video processing unit. As for the performance of the game engine itself, AMD looks stronger here. Its processor is more tolerant of branching, which occurs often in applications like that. 3DNow! might also be of some help, as it offers a handy instruction set for processing geometrical objects. Moreover, as AMD Athlon 64 requires existing applications to be recompiled, 3DNow! optimization will now be carried out automatically by the compiler, without any intervention on the developer’s part.

Had Intel not introduced new instructions in Prescott, we would certainly prefer Athlon 64, which seems to have no drastic weaknesses. As things stand today, we cannot be so sure any more.

Intel offers optimized program libraries and the new instructions to software developers, which are certainly going to be used to the utmost effect. AMD also went this way and announced its libraries fully utilizing x86-64 capabilities.

There is one more circumstance. There have been rumors that Intel will implement the AMD x86-64 architecture in its Prescott CPUs if this architecture becomes a success. It sounds like science fiction: we have never seen anything like that! On the other hand, as AMD is already delaying the launch of Athlon 64, it might add the new Prescott instructions to its new CPU. Then all Prescott-optimized libraries would work well on AMD CPUs, and Prescott would no longer have any advantage from its more convenient instructions. There should be no big problems with that, especially since AMD CPUs have long been executing flexible 3DNow! operations. In fact, I would expect AMD to introduce these “horizontal” data operations, given its more suitable processor core architecture, lower frequencies, larger number of execution units, etc.

We hope AMD will do it as fast as possible, because willy-nilly it will have to implement Intel’s SSE developments in its processors, as it always has.
