MMX, 3DNow!, SSE, SSE2: Operation and Optimization Principles
Before discussing the new instructions, let’s briefly review the earlier SIMD extensions to the processor instruction set and estimate the performance gain each of them can provide. If you are already familiar with this topic, feel free to skip straight to the next chapter of this article.
Product reviews sometimes say that a program is well optimized for SSE, which lets the tested CPU show high performance when running it. But what is SSE? Before going further, let’s spell out the abbreviations: SSE stands for Streaming SIMD Extensions, and SIMD stands for Single Instruction, Multiple Data (a single instruction processes several operands).
How did older x86 processors, such as the 486 or the first Pentium, work? It was very simple. They had a few registers, each storing one binary number. You could add the numbers in two registers and compare the result with the number in a third register; if the result was bigger, execution could jump to another part of the program. Problems arose when engineers looked for ways to increase CPU performance. The processor cannot execute an instruction until the preceding instructions have calculated its operands. We could build a CPU with a hundred ALUs, but a chain of dependent calculations would run no faster: only one ALU would be working while the others waited for its result. That’s why people at Intel decided to introduce instructions that process more than one pair of operands at a time. Here is how it works in MMX, the first SIMD extension for x86:
| Operation | Part 3 | Part 2 | Part 1 | Part 0 | Register |
| --- | --- | --- | --- | --- | --- |
|  | 70 | 50 | 30 | 10 | 1st register |
| + | 80 | 60 | 40 | 20 | 2nd register |
| = | 150 | 110 | 70 | 30 | result |
Four pairs of numbers are added at once rather than a single pair. The same works for subtraction, multiplication and other operations on several pairs of operands, with the two operands of each pair located in two separate registers. This technology speeds up the CPU simply by doing more work per instruction, with the clock frequency unchanged. To be more exact, it is not the number of calculation units that increases but the number of operand pairs they process at a time. For software developers, however, this distinction is not that important.
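The addition in the table can be modeled in plain C. The sketch below is not MMX code: it is a scalar model that assumes a 64-bit register split into four 16-bit parts and assumes the wraparound form of the packed add. The names `packed_add16` and `pack16x4` are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Scalar model of a packed 16-bit add. A 64-bit "register" holds four
   independent 16-bit parts (Part 0 is the lowest 16 bits). Each part
   wraps around on overflow and never carries into its neighbor. */
uint64_t packed_add16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int part = 0; part < 4; ++part) {
        uint16_t x = (uint16_t)(a >> (16 * part));
        uint16_t y = (uint16_t)(b >> (16 * part));
        uint16_t sum = (uint16_t)(x + y);        /* wraps modulo 65536 */
        result |= (uint64_t)sum << (16 * part);
    }
    return result;
}

/* Hypothetical helper: pack four 16-bit values, Part 3 first, as in the table. */
uint64_t pack16x4(uint16_t p3, uint16_t p2, uint16_t p1, uint16_t p0)
{
    return ((uint64_t)p3 << 48) | ((uint64_t)p2 << 32) |
           ((uint64_t)p1 << 16) | (uint64_t)p0;
}
```

With these helpers, `packed_add16(pack16x4(70, 50, 30, 10), pack16x4(80, 60, 40, 20))` yields the same value as `pack16x4(150, 110, 70, 30)`. A real SIMD unit does in one instruction what the loop here does in four iterations.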
Of course, to get this performance boost, the program must actually use SIMD instructions. The CPU cannot by itself gather separate data items that need the same processing into a single register. The programmer has to tell the CPU explicitly to load particular data into MMX registers and perform SIMD operations on them. The data must also be laid out in memory so that they “fit” into the registers. In some cases SIMD-optimizing compilers can help: wherever possible they replace several ordinary instructions with one highly efficient SIMD instruction. As a rule, though, the program code has to follow certain guidelines for the compiler to do this effectively.
Not every algorithm can be optimized using SIMD. Consider the expression (a + b * c) * d. Each operation depends on the result of the previous one, so it cannot be calculated in fewer than three instructions, with or without SIMD. On the other hand, adding four vectors (x, y, z, w), that is, calculating x1+x2+x3+x4, y1+y2+y3+y4, z1+z2+z3+z4 and w1+w2+w3+w4, takes twelve ordinary additions but only three SIMD instructions. If the CPU executes a SIMD instruction as fast as an ordinary one, SIMD-optimized programs can run several times faster.
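The four-vector sum can be sketched as follows. This is a portable C model, not real SIMD code: the `vec4` type and the `vec4_add` and `sum4` functions are hypothetical names, and each call to `vec4_add` stands in for one SIMD add instruction on a register holding four values.

```c
/* Hypothetical 4-component vector; stands in for one SIMD register. */
typedef struct { float x, y, z, w; } vec4;

/* Models ONE SIMD add: the four component sums belong to one instruction. */
vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Sum of four vectors in three vec4_add calls instead of twelve scalar adds. */
vec4 sum4(vec4 v1, vec4 v2, vec4 v3, vec4 v4)
{
    vec4 s12 = vec4_add(v1, v2);
    vec4 s34 = vec4_add(v3, v4);   /* does not depend on s12 */
    return vec4_add(s12, s34);
}
```

Written as scalar code, the same job needs three additions for each of the four components. The model only counts instructions; whether a real CPU executes a SIMD add as fast as an ordinary one depends on the processor.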
From the developer’s point of view, there are several approaches to SIMD optimization: manual code optimization, which is the hardest but also the most effective; a SIMD-optimizing compiler, which may produce unexpected results; and standard application libraries, optimized and distributed by the CPU makers themselves. These libraries contain ready-made functions for the most common tasks, and this last option looks the most attractive.
Now, what is the difference between the various SIMD extensions? How does MMX differ from 3DNow! or SSE? They differ in the supported data types, in the size and number of registers, and in the sets of available instructions. The data type is the most important difference, as the other parameters are more or less similar across the extensions.
The CPU works with data in a number of formats, the most important being the floating-point and integer formats. Different tasks require different data formats. The second thing to consider is the size in bytes. Floating-point data hold approximations of real numbers: the more bytes assigned to each number, the higher its precision. For integers, the more bytes allocated, the wider the range of values that can be represented.
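The link between size, range and precision can be illustrated with a few small C functions. This is only a sketch: the function names are hypothetical, and the floating-point figures come from the compiler's `<float.h>` constants, so they describe whatever platform the code is built on.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* Largest value an unsigned integer of the given size (1 to 8 bytes) can hold. */
uint64_t max_unsigned(int bytes)
{
    assert(bytes >= 1 && bytes <= 8);
    if (bytes == 8)
        return UINT64_MAX;              /* avoid shifting by the full width */
    return ((uint64_t)1 << (8 * bytes)) - 1;
}

/* Significant binary digits kept by float (use_double == 0) or double. */
int significand_bits(int use_double)
{
    return use_double ? DBL_MANT_DIG : FLT_MANT_DIG;
}

/* Hypothetical helper: how many elements of a given size fit in one register. */
int elements_per_register(int register_bytes, int element_bytes)
{
    assert(element_bytes > 0);
    return register_bytes / element_bytes;
}
```

For example, `max_unsigned(1)` is 255 and `max_unsigned(2)` is 65535, while an 8-byte register, such as an MMX register, holds `elements_per_register(8, 2)`, that is, four 2-byte integers. On typical platforms with IEEE 754 arithmetic, `float` keeps 24 significant bits and `double` keeps 53.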





