Advanced Processor Core
Although Intel presents Nehalem processors as based on a new microarchitecture, their most important part, the computational core, has barely changed.
As we have already said, the major improvements have been made in the infrastructure. However, you shouldn’t feel deceived by the manufacturer. Intel simply focused on eliminating the bottlenecks of the previous microarchitecture, and the core hardly had any. I doubt anyone will dispute that Core 2 processors are excellent solutions with great performance.
Nevertheless, Intel did improve a few things inside the processor core. With these improvements, the engineers didn’t want to increase CPU performance at any cost, but tried to make Nehalem more efficient and capable of using its resources in a more optimal way. Just like with Atom processors, all the changes were made with heat dissipation in mind. That is why the new generation processors should have a very attractive performance-per-watt ratio.
In line with this philosophy, the first modifications concern the decoders. We would like to remind you that processors with Core microarchitecture had four decoders at their disposal: three for simple instructions and one for complex ones. These processors could decode a maximum of five instructions per clock cycle thanks to Macrofusion technology, which allowed Core 2 processors to treat certain pairs of instructions as a single operation, for example a comparison followed by a conditional branch.
Nehalem has the same number of decoders of the same types. However, Macrofusion technology has changed significantly. First of all, more pairs of x86 instructions can now be fused and decoded in one go. Secondly, Macrofusion in Nehalem processors also works in 64-bit mode, while in Core 2 processors it could only be activated when the CPU was running 32-bit code. So, CPUs with the new microarchitecture will be able to decode five instructions per clock instead of four in a larger number of cases than their predecessors.
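To make the "five instead of four" arithmetic concrete, here is a minimal sketch of a decode stage with macro-fusion. It is a toy model, not a description of Intel's hardware: it assumes four decode slots per clock, at most one fused pair per clock (which is what gives the five-instruction peak), and it ignores the difference between simple and complex decoders. The function names and the fusion rule in is_fusable are hypothetical.

```python
# Toy model of decode throughput with macro-fusion (illustrative only).
# Assumptions, not taken from Intel documentation: four decode slots per
# clock, at most one fused pair per clock, every instruction is "simple".

DECODE_SLOTS = 4            # four decoders, as described in the text
MAX_FUSIONS_PER_CLOCK = 1   # assumption that yields the 5-per-clock peak


def is_fusable(first, second):
    """Hypothetical rule: a compare/test followed by a conditional jump."""
    return first in ("cmp", "test") and second.startswith("j") and second != "jmp"


def decode_clocks(instructions, fusion_enabled=True):
    """Return the number of clocks needed to decode the instruction list."""
    clocks = 0
    i = 0
    n = len(instructions)
    while i < n:
        slots = DECODE_SLOTS
        fusions = 0
        while slots > 0 and i < n:
            can_fuse = (
                fusion_enabled
                and fusions < MAX_FUSIONS_PER_CLOCK
                and i + 1 < n
                and is_fusable(instructions[i], instructions[i + 1])
            )
            if can_fuse:
                i += 2          # the pair occupies a single decode slot
                fusions += 1
            else:
                i += 1
            slots -= 1
        clocks += 1
    return clocks


# A loop-like stream: every fifth instruction closes a cmp + jne pair.
stream = ["mov", "add", "cmp", "jne", "mov"] * 4   # 20 instructions
print(decode_clocks(stream, fusion_enabled=False))
print(decode_clocks(stream, fusion_enabled=True))
```

In this model the 20-instruction stream needs five clocks without fusion and four with it. Disabling fusion for a given stream is also a rough stand-in for Core 2 running 64-bit code, where, as noted above, Macrofusion was not available.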
The next improvement is aimed at raising the efficiency of the execution pipeline and concerns the Loop Stream Detector block. This block first appeared in CPUs with Core microarchitecture and was designed to speed up loop processing. The Loop Stream Detector detected small loops in the program code and saved them in a special buffer. As a result, the CPU didn’t have to fetch them from the cache over and over again or predict branches within these loops. Nehalem processors have an even more efficient Loop Stream Detector, which has been moved past the instruction decoding stage. In other words, the Loop Stream Detector now stores decoded loops, which makes it somewhat similar to the Trace Cache of Pentium 4 processors. However, the Loop Stream Detector of Nehalem CPUs is a very specialized cache. First, it is very small: it holds only 28 micro-ops. Second, it stores only loops.
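The benefit of storing decoded loops can be sketched with a very rough model. It assumes that a loop whose body fits into the 28 micro-op buffer is decoded once and then replayed, while a larger loop goes through the decoders on every iteration. Detection latency and all other real-world details are ignored, and the function name is hypothetical.

```python
LSD_CAPACITY_UOPS = 28  # buffer size quoted in the text for Nehalem


def decoded_uops(loop_body_uops, iterations, lsd_capacity=LSD_CAPACITY_UOPS):
    """Micro-ops that must pass through the decoders for the whole loop.

    Simplifying assumption: a loop that fits in the buffer is decoded once
    and then replayed; a loop that does not fit is decoded on every pass.
    """
    if loop_body_uops <= lsd_capacity:
        return loop_body_uops
    return loop_body_uops * iterations


print(decoded_uops(20, 1000))   # small loop: fits in the buffer
print(decoded_uops(40, 1000))   # larger loop: decoded on every iteration
```

Under these assumptions a 20 micro-op loop executed 1000 times costs the decoders 20 micro-ops of work, whereas a 40 micro-op loop costs 40,000, which is why the small buffer pays off only for tight loops.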
While refining the Core microarchitecture, Intel engineers found a way of improving what was already one of the industry’s best branch prediction algorithms. There is nothing tricky about it, though: they simply added a second-level predictor to the existing branch prediction unit. It is slower than the first one, but features a larger buffer for storing branch statistics and hence offers greater analysis depth. I have to say that this improvement is unlikely to boost performance dramatically in typical desktop applications. However, the dual-level branch prediction unit may prove extremely effective in servers. This proves once again that Nehalem microarchitecture is universal: it features engineering solutions targeted at different user needs.
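Why a larger second level helps code with many branches can be illustrated with a small simulation. Intel has not described the internal structure of its predictors in the material discussed here, so everything below is an assumption made for illustration: both levels are modelled as LRU tables of 2-bit saturating counters, the first level is small, the second is large, and the second is consulted only when the first has no entry. All class and function names are hypothetical.

```python
from collections import OrderedDict


class CounterTable:
    """LRU table of 2-bit saturating counters keyed by branch address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, addr):
        """Return the stored counter (0..3), or None for an unknown branch."""
        if addr in self.entries:
            self.entries.move_to_end(addr)
            return self.entries[addr]
        return None

    def update(self, addr, taken):
        counter = self.entries.get(addr, 1)   # start at "weakly not taken"
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.entries[addr] = counter
        self.entries.move_to_end(addr)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


class TwoLevelPredictor:
    """Small first-level table backed by a larger second-level table."""

    def __init__(self, l1_size, l2_size):
        self.l1 = CounterTable(l1_size)
        self.l2 = CounterTable(l2_size)

    def predict(self, addr):
        counter = self.l1.lookup(addr)
        if counter is None:
            counter = self.l2.lookup(addr)    # the slower path in hardware
        if counter is None:
            return False                      # default: predict not taken
        return counter >= 2

    def update(self, addr, taken):
        self.l1.update(addr, taken)
        self.l2.update(addr, taken)


def hit_rate(predictor, branch_count, passes):
    """Replay `branch_count` always-taken branches `passes` times."""
    taken = True
    correct = total = 0
    for _ in range(passes):
        for addr in range(branch_count):
            if predictor.predict(addr) == taken:
                correct += 1
            total += 1
            predictor.update(addr, taken)
    return correct / total


# 64 distinct branches overflow an 8-entry first level on every pass.
print(hit_rate(TwoLevelPredictor(l1_size=8, l2_size=0), 64, 4))
print(hit_rate(TwoLevelPredictor(l1_size=8, l2_size=128), 64, 4))
```

With a second level of size zero the small table forgets every branch before it comes around again, so the model never predicts correctly. With the larger second level only the first pass is mispredicted. This mirrors the point made above: code with a large branch footprint, which is more typical of servers than of desktop applications, is where the extra level matters.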
Intel also improved the efficiency of the branch prediction unit by changing the Return Stack Buffer. I would like to remind you that this unit is responsible for correctly predicting function return addresses. Previous generation processors could predict function return addresses incorrectly, for example when recursive algorithms were running and the corresponding buffer overflowed. The new Return Stack Buffer implemented in Nehalem processors no longer has this problem.
Although Intel engineers have introduced a lot of changes to the early stages of Nehalem’s pipeline, they left the execution units of the new processor almost intact.
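The overflow problem described above can be reproduced with a toy return-address stack. The sketch assumes a fixed-size circular buffer that silently overwrites its oldest entry when full, and it gives every nested call a distinct return address so that a lost entry always shows up as a misprediction. The buffer size of 16 is picked for illustration only, and the way Nehalem actually avoids the problem is not modelled.

```python
class ReturnStackBuffer:
    """Fixed-size circular return-address stack (toy model)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.top = 0   # number of pushes minus number of pops

    def push(self, return_addr):
        # When the buffer is full this overwrites the oldest entry.
        self.slots[self.top % self.size] = return_addr
        self.top += 1

    def pop(self):
        self.top -= 1
        return self.slots[self.top % self.size]


def mispredicted_returns(call_depth, rsb_size):
    """Count wrong return predictions after `call_depth` nested calls."""
    rsb = ReturnStackBuffer(rsb_size)
    real_stack = []
    for level in range(call_depth):
        addr = 0x1000 + level       # hypothetical distinct return address
        real_stack.append(addr)
        rsb.push(addr)
    misses = 0
    while real_stack:
        if rsb.pop() != real_stack.pop():
            misses += 1
    return misses


print(mispredicted_returns(call_depth=10, rsb_size=16))
print(mispredicted_returns(call_depth=20, rsb_size=16))
```

As long as the call depth stays within the buffer size, every return is predicted correctly. Once the depth exceeds it, the oldest entries are overwritten, and in this model the 20-deep chain mispredicts its last four returns.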
Like Core 2, CPUs on Nehalem microarchitecture can send up to six micro-operations at a time for processing. However, the developers have increased the size of the buffers at the execution stage. The Reorder Buffer of Nehalem processors can now hold up to 128 micro-ops waiting to be executed, which is 33% more than in Core 2. The Reservation Station, which sends micro-operations directly to the execution units, has likewise been enlarged from 32 to 36 entries. The data buffers have been made larger as well.
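The buffer figures are easy to check. The short calculation below uses the Nehalem numbers from the text and takes the Core 2 Reorder Buffer to hold 96 micro-ops, the size implied by the 33% figure. The helper name is hypothetical.

```python
def percent_increase(old, new):
    """Relative growth from `old` to `new`, in percent."""
    return (new - old) / old * 100


CORE2_ROB, NEHALEM_ROB = 96, 128   # Reorder Buffer entries
CORE2_RS, NEHALEM_RS = 32, 36      # Reservation Station entries

print(round(percent_increase(CORE2_ROB, NEHALEM_ROB), 1))
print(round(percent_increase(CORE2_RS, NEHALEM_RS), 1))
```

The Reorder Buffer grows by about 33.3%, while the Reservation Station grows by a more modest 12.5%.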