SSE
Let’s now discuss the SSE instruction set. It was introduced with Pentium III processors, but came into its own after the launch of Pentium 4, where the use of SSE provided a great performance boost.
SSE is intended for faster processing of real (floating-point) data. Such data are common in geometric calculations, that is, in 3D graphics applications, computer games, 3DStudioMax-like editors and a number of other tasks. Once 3D accelerators took over texturing in Quake-like games, the need for fast integer calculations became less urgent. It became more important to speed up floating-point calculations, such as multiplying a floating-point vector by a floating-point matrix. Let’s see what SSE can offer to the developer.
With the introduction of SSE, the processor gained eight new 128-bit registers in addition to the standard x87 registers. Each register stores four 32-bit floating-point numbers.
| Part 3 | Part 2 | Part 1 | Part 0 | |
| | | | | Register 7 |
| | | | | Register 6 |
| | | | | Register 5 |
| | | | | Register 4 |
| | | | | Register 3 |
| * | * | * | * | Register 2 |
| 2 | 55.9 | -1.9e10 | 1.567e-6 | Register 1 |
| 0.7 | -100.0 | 11.2 | 0.5 | Register 0 |
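To make the register layout concrete, here is a minimal sketch in C that packs the four values shown for Register 0 into one 128-bit value and copies them back to memory. It assumes a compiler that provides the SSE intrinsics header `<xmmintrin.h>`. The function name `pack_register0` is hypothetical and used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_set_ps, _mm_storeu_ps */

/* Hypothetical helper: fills one 128-bit value with the Register 0 row
   of the table and writes it to out. out[0] receives Part 0 (the lowest
   32 bits), out[3] receives Part 3 (the highest 32 bits). */
void pack_register0(float out[4])
{
    assert(out != NULL);

    /* _mm_set_ps lists the parts from highest to lowest, which matches
       the left-to-right order of the table: Part 3, Part 2, Part 1, Part 0. */
    __m128 reg0 = _mm_set_ps(0.7f, -100.0f, 11.2f, 0.5f);

    /* Unaligned store, so no alignment of out is assumed. */
    _mm_storeu_ps(out, reg0);
}
```

Note that Part 0 ends up at the lowest memory address, so the array order is the reverse of the way the table is drawn.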
Arithmetic operations are performed componentwise on the groups of four numbers stored in the registers: you can add, subtract, multiply or divide two groups of four numbers. You can also compute four square roots or four inverse (reciprocal) square roots at a time, accurately or approximately. The register contents can be shuffled, moved from one part of the register to another and so on. But moving data is no faster than adding it, so SSE is most effective when applied to data that has been specially pre-packaged.
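The componentwise operations described above can be sketched with the C intrinsics from `<xmmintrin.h>`. This is an illustration under stated assumptions, not a tuned routine: the `PackedResults` struct and the `packed_demo` function are hypothetical names, and unaligned loads and stores are used so that no alignment of the arrays is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical container for the results of the demo. */
typedef struct {
    float sum[4];       /* a + b, componentwise                 */
    float prod[4];      /* a * b, componentwise                 */
    float root[4];      /* sqrt(a), accurate                    */
    float inv_root[4];  /* 1 / sqrt(a), fast approximation      */
    float reversed[4];  /* parts of a in reverse order          */
} PackedResults;

void packed_demo(const float a[4], const float b[4], PackedResults *r)
{
    assert(a != NULL && b != NULL && r != NULL);

    __m128 va = _mm_loadu_ps(a);  /* four floats loaded at once */
    __m128 vb = _mm_loadu_ps(b);

    _mm_storeu_ps(r->sum,  _mm_add_ps(va, vb));  /* four additions       */
    _mm_storeu_ps(r->prod, _mm_mul_ps(va, vb));  /* four multiplications */
    _mm_storeu_ps(r->root, _mm_sqrt_ps(va));     /* four square roots    */

    /* Approximate reciprocal square roots: faster, but less precise. */
    _mm_storeu_ps(r->inv_root, _mm_rsqrt_ps(va));

    /* Shuffle: take parts 3, 2, 1, 0 of va as the new parts 0, 1, 2, 3. */
    _mm_storeu_ps(r->reversed,
                  _mm_shuffle_ps(va, va, _MM_SHUFFLE(0, 1, 2, 3)));
}
```

Each arithmetic line replaces four scalar operations with one packed instruction. The shuffle does no arithmetic at all, which is why data that has to be rearranged first eats into the gain.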
Pentium 4 takes about as long to execute one SSE operation as it does an ordinary instruction. This means optimization can bring about a fourfold performance gain, or even more thanks to the new large registers. But not all calculations can be optimized effectively for SSE. An example of a “good” task is multiplying a 4×4 matrix by a four-component vector: you get a fourfold speedup without any problems.
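As an illustration of the matrix-by-vector case, here is a minimal sketch in C using the intrinsics from `<xmmintrin.h>`. It assumes the matrix is already “pre-packaged” column by column, so that each column fits one register. That layout and the function name `mat4_mul_vec4` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: out = M * v for a 4x4 matrix M.
   Assumed layout: cols[4 * c + r] holds row r of column c,
   so every column is four consecutive floats. */
void mat4_mul_vec4(const float cols[16], const float v[4], float out[4])
{
    assert(cols != NULL && v != NULL && out != NULL);

    /* Load the four columns, one register each. */
    __m128 c0 = _mm_loadu_ps(cols + 0);
    __m128 c1 = _mm_loadu_ps(cols + 4);
    __m128 c2 = _mm_loadu_ps(cols + 8);
    __m128 c3 = _mm_loadu_ps(cols + 12);

    /* Broadcast each vector component into all four parts of a register. */
    __m128 x = _mm_set1_ps(v[0]);
    __m128 y = _mm_set1_ps(v[1]);
    __m128 z = _mm_set1_ps(v[2]);
    __m128 w = _mm_set1_ps(v[3]);

    /* M * v = c0 * x + c1 * y + c2 * z + c3 * w, four rows at a time. */
    __m128 lo = _mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y));
    __m128 hi = _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w));

    _mm_storeu_ps(out, _mm_add_ps(lo, hi));
}
```

Four packed multiplications and three packed additions stand in for the 16 multiplications and 12 additions of the scalar version. With a row-by-row layout the columns would first have to be gathered by shuffles, which is the kind of data movement that reduces the gain.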
First of all, SSE allows modern processors to compete with up-to-date graphics accelerators in transforming the vertices of the triangles that make up a 3D scene. However, the CPU has a lot of other work to do, so it’s better to offload it as much as possible and let it work in parallel with the 3D accelerator.
And what about Athlon XP? Actually, the main innovation in this CPU compared to ordinary Athlons is the implementation of SSE. We may expect SSE code to run about twice as fast on it as ordinary program code. We should note, however, that SSE is implemented slightly “worse” in Athlon XP, while ordinary code is executed very efficiently. Athlon XP also has a certain advantage in branch prediction. It’s really good at that.
When only SSE instructions are used, Athlon XP and Pentium 4 at the same core frequency show similar performance. But Pentium 4 reaches much higher clock frequencies, which, together with the opportunity to use SIMD instructions, makes it much faster in a variety of tasks.





