SSE2
SSE2 is the latest SIMD extension for x86 processors on today’s market. We have already discussed the integer part of this set above, but SSE2 adds more than integer instructions. The same eight 128-bit registers can now be interpreted as holding not four 32-bit single-precision floating-point numbers, but two 64-bit double-precision floating-point numbers. Double precision is used when calculations in single precision produce errors that are too large. The SSE operations are now applied to two pairs of operands instead of four. The fast approximate calculation of the reciprocal square root is no longer available for these numbers, of course.
Register | Part 1 | Part 0 |
Register 7 | | |
Register 6 | | |
Register 5 | | |
Register 4 | | |
Register 3 | | |
Register 2 | * | * |
Register 1 | -1.5e10 | 0.00001 |
Register 0 | 1e-25 | 5.5 |
So, we have a kind of 3DNow! analogue, but without the flexible addition of the numbers stored in one register.
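To make the two-part layout concrete, here is a minimal sketch of one packed addition applied to both parts of a register at once. The `Pair` type and the `add_pairs` function are illustrative names that do not come from the article. The SSE2 path uses the `_mm_add_pd` intrinsic from `<emmintrin.h>` and is compiled only when the compiler defines `__SSE2__`; otherwise a plain scalar fallback with the same result is used.

```cpp
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>   // SSE2 intrinsics
#endif

// Hypothetical helper type: the two 64-bit parts of one 128-bit register.
struct Pair { double part0, part1; };

// Adds two registers part by part (what ADDPD does in one instruction).
inline Pair add_pairs(Pair a, Pair b)
{
#if defined(__SSE2__)
    __m128d va = _mm_set_pd(a.part1, a.part0);   // arguments are (high, low)
    __m128d vb = _mm_set_pd(b.part1, b.part0);
    __m128d sum = _mm_add_pd(va, vb);            // ADDPD: two additions at once
    double out[2];
    _mm_storeu_pd(out, sum);                     // out[0] receives part 0
    return { out[0], out[1] };
#else
    return { a.part0 + b.part0, a.part1 + b.part1 };   // scalar fallback
#endif
}

// Usage example: both parts are processed by the same operation.
inline void add_pairs_example()
{
    Pair r = add_pairs({5.5, 1.0}, {0.5, 2.0});
    assert(r.part0 == 6.0 && r.part1 == 3.0);
}
```

Only two additions are performed per operation, which is why the gain over scalar code is at most twofold.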
What about speed? If you need the higher precision of SSE2, you will “only” get a twofold performance gain.
We can put together a rule of thumb for estimating the performance of calculation-heavy programs. If the program is SSE2-optimized, it will run faster on a Pentium 4 than on an Athlon XP. Without special optimization, the Pentium 4 will lose to an Athlon XP of the corresponding rating.
New Prescott Instructions
At last, we have come to the point of this article – the new instructions introduced by Intel in its Prescott CPU.
Many developers of 3D applications want to have a handy and fast class library that would represent geometrical objects, vectors and matrices.
class Vector
{
    float x, y, z;
public:
    inline friend Vector operator +(const Vector &a, const Vector &b); // addition
    inline float norm() const;  // vector length
    inline float norm2() const; // vector length squared
    inline friend float Dot(const Vector &a, const Vector &b); // scalar product
};
The idea is that you could just write a = b + c instead of the clumsy a.x = b.x + c.x; a.y = b.y + c.y; a.z = b.z + c.z. But this has always caused trouble. At first, inefficient compilers handled calls to the vector addition function so poorly that it was easier to write the “clumsy” but faster way. Then it turned out that SSE helped with adding vectors, but didn’t help much with finding the scalar or vector product.
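For reference, the declarations above could be filled in roughly as follows in plain C++, without any SIMD. This is only a sketch: the constructor and the `vector_example` function are assumptions added for illustration, since the article shows only the declarations.

```cpp
#include <cassert>
#include <cmath>

class Vector
{
    float x, y, z;
public:
    // Assumed constructor; the original declaration does not show one.
    Vector(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}

    friend Vector operator +(const Vector &a, const Vector &b)   // addition
    {
        return Vector(a.x + b.x, a.y + b.y, a.z + b.z);
    }
    friend float Dot(const Vector &a, const Vector &b)           // scalar product
    {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }
    float norm2() const { return x * x + y * y + z * z; }        // length squared
    float norm() const { return std::sqrt(norm2()); }            // length
};

// Usage example (hypothetical helper, not from the article).
inline void vector_example()
{
    Vector b(1.0f, 2.0f, 2.0f), c(2.0f, 2.0f, 1.0f);
    Vector a = b + c;              // (3, 4, 3)
    assert(a.norm2() == 34.0f);    // 9 + 16 + 9
    assert(b.norm() == 3.0f);      // sqrt(1 + 4 + 4)
    assert(Dot(b, c) == 8.0f);     // 2 + 4 + 2
}
```

The addition maps directly onto one packed SSE addition, while `Dot` and `norm2` need a sum across the components of a single vector, which is exactly the part plain SSE handles poorly.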
And now we witness a kind of miracle: the new instructions should make the optimization of vector mathematics (along with other things) for SSE and SSE2 much easier.
Let’s delve a little bit into details.
First of all, it is now possible to add up the components stored within a single SSE register.
| w2 | z2 | y2 | x2 |
HADDPS
| w1 | z1 | y1 | x1 |
=
| z1+w1 | x1+y1 | z2+w2 | x2+y2 |
Thus, it is possible to derive the scalar product of two vectors or a vector’s norm in only three instructions and without using additional registers. Of course, we hope the “horizontal” addition instruction executes fast enough. At least, we hope it doesn’t flush the CPU pipeline.
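The three-instruction sequence can be sketched as one multiplication followed by two horizontal additions. In the code below, `Vec4`, `hadd_model` and `dot4` are illustrative names, not part of the article or of any library. When the compiler defines `__SSE3__`, the sketch uses the `_mm_hadd_ps` intrinsic from `<pmmintrin.h>`; otherwise it falls back to a scalar model of the same steps, so it also runs where SSE3 is not enabled.

```cpp
#include <cassert>
#if defined(__SSE3__)
#include <pmmintrin.h>   // SSE3 intrinsics, including _mm_hadd_ps
#endif

struct Vec4 { float x, y, z, w; };   // x is the lowest element of the register

// Scalar model of "HADDPS dst, src": adjacent pairs are summed inside each operand.
inline Vec4 hadd_model(const Vec4 &dst, const Vec4 &src)
{
    return { dst.x + dst.y, dst.z + dst.w, src.x + src.y, src.z + src.w };
}

// Scalar product in three steps: one multiply, two horizontal additions.
inline float dot4(const Vec4 &a, const Vec4 &b)
{
#if defined(__SSE3__)
    __m128 p = _mm_mul_ps(_mm_set_ps(a.w, a.z, a.y, a.x),
                          _mm_set_ps(b.w, b.z, b.y, b.x));     // MULPS
    p = _mm_hadd_ps(p, p);                                     // HADDPS
    p = _mm_hadd_ps(p, p);                                     // HADDPS
    return _mm_cvtss_f32(p);                                   // lowest element
#else
    Vec4 p = { a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };   // models MULPS
    p = hadd_model(p, p);   // (x+y, z+w, x+y, z+w)
    p = hadd_model(p, p);   // the full sum in every element
    return p.x;
#endif
}

// Usage example: a 3D vector is stored with w = 0.
inline void dot4_example()
{
    assert(dot4({1.0f, 2.0f, 3.0f, 0.0f}, {4.0f, 5.0f, 6.0f, 0.0f}) == 32.0f);
}
```

Calling `dot4(v, v)` gives the squared norm in the same way; no temporary register is needed because each step overwrites its own operand.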
Similarly, we can subtract the components of one register.
It is also possible to add and subtract two numbers located in one SSE2 register. There is also an instruction that combines addition and subtraction of the two elements.
| y2 | x2 |
ADDSUBPD
| y1 | x1 |
=
| y1+y2 | x1-x2 |
This instruction owes its appearance to the nature of complex number multiplication: (a+bi)*(c+di) = a*c - b*d + (b*c + a*d)i. The SSE2 optimization of complex number calculations has now become considerably simpler. In fact, SSE2 is now much like 3DNow!, but works with higher precision numbers.
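One way the multiplication formula could map onto ADDSUBPD is sketched below as a portable scalar model. `Pd`, `addsub_model` and `complex_mul` are illustrative names, and the instruction sequence named in the comments is an assumption about how such code might be written, not a sequence taken from the article.

```cpp
#include <cassert>

// One SSE2 register holding a complex number: lo = real part, hi = imaginary part.
struct Pd { double lo, hi; };

// Scalar model of "ADDSUBPD dst, src": subtract in the low half, add in the high half.
inline Pd addsub_model(Pd dst, Pd src)
{
    return { dst.lo - src.lo, dst.hi + src.hi };
}

// (a+bi)*(c+di) with x = (a, b) and y = (c, d).
inline Pd complex_mul(Pd x, Pd y)
{
    Pd cc = { y.lo, y.lo };                     // (c, c), what MOVDDUP would produce
    Pd dd = { y.hi, y.hi };                     // (d, d), e.g. via UNPCKHPD
    Pd ba = { x.hi, x.lo };                     // halves swapped, e.g. via SHUFPD
    Pd t1 = { x.lo * cc.lo, x.hi * cc.hi };     // MULPD: (a*c, b*c)
    Pd t2 = { ba.lo * dd.lo, ba.hi * dd.hi };   // MULPD: (b*d, a*d)
    return addsub_model(t1, t2);                // (a*c - b*d, b*c + a*d)
}

// Usage example: (1+2i)*(3+4i) = -5 + 10i.
inline void complex_mul_example()
{
    Pd r = complex_mul({1.0, 2.0}, {3.0, 4.0});
    assert(r.lo == -5.0 && r.hi == 10.0);
}
```

The final step is the point of the instruction: the real part needs a subtraction and the imaginary part an addition, and ADDSUBPD performs both in one operation.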
Here is the table listing all the new Prescott instructions.
Instruction | Description |
Horizontal operations with registers | |
HADDPS | Horizontal addition of the elements of SSE registers |
HSUBPS | Horizontal subtraction of the elements of SSE registers |
HADDPD | Sum of the two elements stored in the same SSE2 register |
HSUBPD | Difference of the two elements stored in the same SSE2 register |
These very useful instructions have been long awaited by software developers. | |
Data loading commands | |
MOVSHDUP | Loads the destination register with two copies each of the 2nd and 4th 32-bit elements of the source |
MOVSLDUP | Loads the destination register with two copies each of the 1st and 3rd 32-bit elements of the source |
MOVDDUP | Loads the destination register with two copies of the first 64-bit half of the source |
Useful for automatic and manual optimization | |
Combined addition-subtraction | |
ADDSUBPS | (x1,y1,z1,w1) * (x2,y2,z2,w2) = (x1-x2,y1+y2,z1-z2,w1+w2) |
ADDSUBPD | (x1, y1) * (x2, y2) = (x1-x2, y1+y2) |
Greatly simplify operations with complex numbers. | |
Special data loading | |
LDDQU | Optimized loading of unaligned data |
For relatively rare cases of fine manual and automatic optimization | |
Data transformation | |
FISTTP | The only new x87 instruction. Converts the value on top of the co-processor stack to an integer, with truncation |
Very helpful for compilers during automatic software optimization. | |
Hyper-Threading support improvement | |
MONITOR/MWAIT | The processor monitors writes to a selected memory region and wakes up the “sleeping” thread when one occurs. |
Simplifies the optimization of programs and OS services for Hyper-Threading, and for multi-threading in general | |
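The element movement performed by the duplicating loads and by ADDSUBPS is easier to see in code. The following is a portable scalar model only: the `*_model` function names and the `Ps`/`Pd` aliases are illustrative, and index 0 is assumed to be the 1st (lowest) element of a register.

```cpp
#include <array>
#include <cassert>

using Ps = std::array<float, 4>;    // four 32-bit elements, index 0 = 1st element
using Pd = std::array<double, 2>;   // two 64-bit elements, index 0 = first half

// MOVSLDUP: the 1st and 3rd elements, each written twice.
inline Ps movsldup_model(const Ps &s) { return { s[0], s[0], s[2], s[2] }; }

// MOVSHDUP: the 2nd and 4th elements, each written twice.
inline Ps movshdup_model(const Ps &s) { return { s[1], s[1], s[3], s[3] }; }

// MOVDDUP: the first 64-bit half, written twice.
inline Pd movddup_model(const Pd &s) { return { s[0], s[0] }; }

// ADDSUBPS: subtract in the 1st and 3rd elements, add in the 2nd and 4th.
inline Ps addsubps_model(const Ps &a, const Ps &b)
{
    return { a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3] };
}

// Usage example with small exact values.
inline void dup_example()
{
    Ps v = { 1.0f, 2.0f, 3.0f, 4.0f };
    assert((movsldup_model(v) == Ps{ 1.0f, 1.0f, 3.0f, 3.0f }));
    assert((movshdup_model(v) == Ps{ 2.0f, 2.0f, 4.0f, 4.0f }));
    assert((movddup_model({ 5.0, 6.0 }) == Pd{ 5.0, 5.0 }));
    assert((addsubps_model(v, { 1.0f, 1.0f, 1.0f, 1.0f }) == Ps{ 0.0f, 3.0f, 2.0f, 5.0f }));
}
```

With two complex numbers packed as (real, imaginary, real, imaginary), MOVSLDUP and MOVSHDUP produce the duplicated real and imaginary parts that a packed complex multiplication needs before the final ADDSUBPS.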



