First Look at Nehalem Microarchitecture

In the next few days Intel is going to stage another revolution in the processor market by launching Core i7, new processors based on the Nehalem microarchitecture. This microarchitecture should become the next significant step after the extremely successful Core microarchitecture. In today's article we are going to discuss the details of Nehalem that will help us better understand what we can expect from Intel.

by Ilya Gavrichenkov
11/02/2008 | 09:00 PM

We have known for a long time that in 2008 we would see Intel processors with a new microarchitecture. We first learned about it two years ago, when Intel introduced their "Tick-Tock" concept, according to which a new manufacturing process and a new microarchitecture arrive in alternating years. Last year Intel introduced the Penryn processor family, a refresh of the Core microarchitecture: these CPUs were made with a 45nm process using hafnium-based high-k dielectrics. Therefore, this year we should see the new Nehalem microarchitecture, represented on the desktop by Bloomfield processors.

 

Alternating a new manufacturing process with the development of a new microarchitecture has allowed Intel to avoid delays with new processor launches. That is why, a year after the announcement of Penryn processors, we are ready to talk about processor innovations again: the launch of CPUs with Nehalem microarchitecture is getting closer day by day.

Taking into account how important this event actually is, we decided to split our review of this long-anticipated solution into several parts. Today we are going to talk about the technological and architectural peculiarities of Nehalem microarchitecture and in a few days we will offer you analysis of Nehalem performance in real applications and discussion of its other practical features.

General Principles of Nehalem Microarchitecture

Before we get acquainted with the promising Nehalem microarchitecture, we would like to say a few words about the reasons for its arrival. Although Intel has been working on it for a long time, they hardly intended to announce CPUs based on it only for the sake of sticking to their own "Tick-Tock" schedule. It seems that even though the Core microarchitecture has been extremely successful, there is something about it that no longer quite satisfies the microprocessor giant. And these reasons are not superficial: Core processors have a lot of advantages, sell very well and are way ahead of the competitor's solutions.

It turns out that the serious drawback of the Core microarchitecture that makes Intel very unhappy is its non-modular design. Being a continuation of the mobile Pentium M CPUs, Core 2 microprocessors were initially designed as dual-core semiconductor dies. When Intel started making multi-core Core 2 and Xeon solutions later on, they discovered several drawbacks of this approach. Quad-core and recently released 6-core processors on Core microarchitecture were simply composed of several dual-core dies that had difficulties communicating with one another. The separate dies exchanged data through system memory, which resulted in serious delays caused by limited processor bus bandwidth.

Another bottleneck surfaced in multi-processor systems. Although Intel had already solved the problem of sharing the system bus between processors by launching new chipsets providing an individual bus to each processor, the performance of these systems would often be limited by insufficient memory bus bandwidth. We could see this problem even in Skulltrail platform targeted for computer enthusiasts, not to mention high-performance workstations and servers.

In other words, increasing system performance by adding more cores to CPUs and more processors to systems would have sooner or later brought Intel to a dead end, despite the fact that the contemporary Core microarchitecture seems very successful overall. That is why Intel is working hard to switch to the new Nehalem microarchitecture, which solves the above-described structural problems first and foremost. Nehalem's key peculiarities that immediately catch your eye are the integrated memory controller and the new bus with point-to-point topology called QuickPath Interconnect (QPI), which not only connects the processor to the chipset but also connects several CPUs directly to one another.

All these innovations remind us of the AMD processors’ structure: a few years ago AMD discovered the advantages of integrating the memory controller into the CPU and connecting the CPUs with one another in multi-processor systems. However, even though Intel is currently the one catching up, Intel CPU cores have been offering higher performance since the launch of Core microarchitecture.

Nehalem’s second important innovation is the modular CPU design. In fact, the actual microarchitecture consists only of a few building blocks that will be used to form a processor at the final design and production stage. This set of building blocks includes a processor core with an L2 cache, L3 cache, QPI bus controller, memory controller, graphics core, etc.

The appropriate blocks will be put together within a single semiconductor die and presented as a solution for this or that market segment. For example, the Bloomfield CPU we are going to discuss fairly soon consists of four cores, an L3 cache, a memory controller and one QPI bus controller.

Server processors with the same microarchitecture, which should be announced in early 2009, will have up to eight cores, up to four QPI bus controllers for multi-processor systems, an L3 cache and a memory controller. The upcoming budget Nehalem processors scheduled to come out in H2 2009 will have two cores, a memory controller, a graphics core and a DMI bus controller connecting the processor directly to the South Bridge. These are far from the only possible configurations: we mention them simply to illustrate how flexible the Nehalem microarchitecture is.

New principles in platform and processor design are certainly a significant innovation, but far from the only one arriving with the new Intel microarchitecture. A lot of changes have been made to the main part of the CPU: its computational core. Although Nehalem processor cores may be regarded as merely enhanced Core-microarchitecture cores, they still support a lot of new technologies and boast numerous improvements, which give Nehalem processors higher "pure" performance. Among the important innovations we should mention SMT (Simultaneous Multi-Threading), similar to Hyper-Threading technology, which allows processing two computational threads simultaneously in a single core. We should also point out support for the new SSE4.2 instructions, more efficient branch prediction algorithms, larger internal buffers, and more efficient and faster cache memory.

Summing up everything we have just said, let's once again list the major distinguishing features of the new CPUs from the Nehalem family:

- modular design assembled from a set of building blocks: cores, L3 cache, memory controller, QPI and other interfaces;
- integrated memory controller and the new point-to-point QPI bus;
- SMT technology processing two computational threads per core;
- new SSE4.2 instructions, more efficient branch prediction and larger internal buffers;
- more efficient and faster cache memory.

Now that we have briefly discussed the general concept of the new microarchitecture, let’s take a closer look at the individual parts of the CPU based on it.

Advanced Processor Core

Although Intel introduces Nehalem processors as based on new microarchitecture, their most important part, the computational core, has barely changed.

As we have already said the major improvements have been made in the infrastructure. However, you shouldn’t feel deceived by the manufacturer. Intel simply focused on eliminating the bottlenecks of the previous microarchitecture, and the core hardly had any. I doubt anyone will argue that Core 2 processors are excellent solutions with great performance.

However, Intel did improve a few things inside the processor core. By implementing these improvements, the engineers didn't simply want to increase CPU performance at any cost, but tried to make Nehalem more efficient and capable of utilizing its resources in a more optimal way. Just like with Atom processors, all changes were made with heat dissipation in mind. That is why the new generation of processors should have a very attractive performance-per-watt ratio.

In line with this philosophy, the modifications dealt with the decoders in the first place. We would like to remind you that processors with Core microarchitecture had four decoders at their disposal: three for simple instructions and one for complex ones. These processors could decode a maximum of five instructions per clock cycle thanks to Macrofusion technology, which allowed Core 2 processors to process certain pairs of instructions as a single command - for example, a comparison followed by a conditional branch.

Nehalem has the same number of the same decoders. However, Macrofusion technology did change significantly. First, more pairs of x86 instructions can now be decoded in one go by this technology. Second, Macrofusion in Nehalem processors works in 64-bit mode, while in Core 2 processors it could only be activated when the CPU worked with 32-bit code. So, CPUs with the new microarchitecture will be able to decode five instructions per clock instead of four in a larger number of cases than their predecessors.

The next improvement, aimed at increasing the throughput of the execution pipeline, occurred in the Loop Stream Detector. This block first appeared in CPUs with Core microarchitecture and was designed to speed up loop processing. The Loop Stream Detector detected small loops in the program code and saved them in a special buffer. As a result, the CPU didn't have to fetch them from the cache over and over again or predict branches within these loops. Nehalem processors have an even more efficient Loop Stream Detector, which has been moved past the instruction decoding stage. In other words, the Loop Stream Detector now saves decoded loops, which makes it somewhat similar to the Trace Cache of Pentium 4 processors. However, the Loop Stream Detector of Nehalem CPUs is a very specific cache. First, it is very small: only 28 micro-ops. Second, it saves only loops.

While enhancing the Core microarchitecture, Intel engineers also found a way of improving one of the industry's best branch prediction algorithms. There is nothing tricky about it: they simply added a second-level predictor to the already existing branch prediction unit. It is slower than the first one, but features a larger buffer for storing branching statistics and hence boasts greater analysis depth. I have to say that this improvement will hardly boost performance dramatically in typical desktop applications. However, the dual-level branch prediction unit may become extremely efficient in servers. This proves once again that Nehalem microarchitecture is universal: it features engineering solutions targeted at different user needs.

They also improved the efficiency of the branch prediction unit by reworking the Return Stack Buffer. I would like to remind you that this unit is responsible for correct prediction of function return addresses. Previous-generation processors could predict function return addresses incorrectly - for example, when recursive algorithms were running and the corresponding buffer overflowed. The new Return Stack Buffer implemented in Nehalem processors no longer has this problem.

Although Intel engineers have introduced a lot of changes to the early stages of Nehalem's pipeline, they left the execution units of the new processor almost intact.

Like Core 2, CPUs on Nehalem microarchitecture can send up to six micro-operations at a time for processing. However, the developers have increased the size of the buffers at the command execution stage. Nehalem processors can hold up to 128 micro-ops waiting to be executed in the Reorder Buffer, which is 33% more than Core 2 can. Similarly, the Reservation Station, which sends micro-operations directly to the execution units, has grown from 32 to 36 entries. The data buffers have also been made larger.

All these seemingly insignificant changes were called for by the fact that new processor cores support SMT technology and can simultaneously process up to two computational threads that require resource sharing. As a result, Intel used a few simple microarchitectural solutions to increase the efficiency of the CPU execution units, i.e. they increased the CPU performance without any serious power consumption changes.

I have to say that the return of SMT technology in Nehalem processors is one of the most significant innovations and may have the biggest positive effect on CPU performance. Pentium 4 processors, where the very same technology was presented as Hyper-Threading, received up to a 10% average performance boost from enabling it. New processors with Nehalem microarchitecture should benefit even more from SMT. First, they have a memory subsystem with much higher bandwidth that can supply two computational threads with data much better. Second, Nehalem boasts a "wider" microarchitecture that allows processing more instructions simultaneously.

Here I have to say that Intel engineers didn't have to increase the complexity of their processors significantly in order to implement SMT in Nehalem, just as in Pentium 4 back in the day. In fact, they only duplicated the processor registers and the Return Stack Buffer in the core. When SMT is enabled, all other resources are either shared dynamically between the threads (for example, the Reservation Station or the cache memory) or split 50-50 (for example, the Reorder Buffer).

By the way, like in Pentium 4 processors, enabling SMT in Nehalem makes the operating system see each physical processor core as a pair of logical cores. For example, software will see a quad-core Nehalem processor as an 8-core CPU.

However, remembering that SMT activation may sometimes lower performance, Intel engineers made sure that physical and logical cores can be easily told apart and are not treated as equals. This way, software developers can decide for themselves how the resources should be distributed between them most efficiently.

To illustrate how the above-described changes affect performance in practice, we decided to compare Nehalem against Penryn in a few simple benchmarks from the SiSoftware Sandra 2009 suite. These results are especially valuable because they are not sensitive to the memory subsystem parameters and hence allow us to draw conclusions about the efficiency of the processor microarchitectures themselves:

Indeed, you can clearly see the advantages of the new microarchitecture with SMT enabled. Sandra 2009 tests are optimized for multi-threading, so no wonder that enabling SMT improves Nehalem results by 15-60%. However, if we compare the results of Nehalem and Penryn processors without SMT, the new processor is not always better than its predecessor. Everything depends on the type of workload, which indicates that there have been no revolutionary or universal changes made to the new core.

TLB and Cache-Memory

In the beginning of this article we mentioned that most features distinguishing the new Nehalem processors from their predecessors are not in their cores, but in interfaces and general CPU structure. The modifications introduced in cache-memory and TLB prove this statement fully. They look much more significant than the small modifications in the internal CPU units we have just discussed.

First of all, Intel engineers have significantly increased the size of the TLB (Translation-Lookaside Buffer). As you know, the TLB is a high-speed buffer that maps virtual page addresses to physical ones. By making the TLB bigger, they increased the number of memory pages that can be used without additional costly lookups in the address translation tables stored in regular memory.

Moreover, the TLB of Nehalem processors became dual-level. In fact, Intel simply added a second-level buffer to the TLB inherited from Core 2 processors. The new L2 TLB is not only large - it can hold up to 512 entries - but also boasts relatively low latency. It is also unified and can translate page addresses of any size.
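To make the role of the two TLB levels more tangible, here is a toy Python model of a dual-level TLB lookup. Only the 512-entry L2 TLB size comes from the description above; the L1 TLB size, the eviction policy and the page size are arbitrary assumptions for illustration, not Nehalem's real parameters.

```python
# Toy model of a dual-level TLB: an L1 hit is fastest, an L2 hit still
# avoids the costly page-table walk, and only a full miss touches memory.

PAGE_SIZE = 4096  # assumed 4KB pages

class TwoLevelTLB:
    def __init__(self, l1_entries=64, l2_entries=512):
        self.l1 = {}  # virtual page number -> physical page number
        self.l2 = {}
        self.l1_entries, self.l2_entries = l1_entries, l2_entries
        self.hits_l1 = self.hits_l2 = self.misses = 0

    def translate(self, vaddr, page_table):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.l1:                  # fast path: L1 TLB hit
            self.hits_l1 += 1
        elif vpage in self.l2:                # slower, but no page-table walk
            self.hits_l2 += 1
            self._fill(self.l1, self.l1_entries, vpage, self.l2[vpage])
        else:                                 # TLB miss: walk the page table
            self.misses += 1
            ppage = page_table[vpage]
            self._fill(self.l2, self.l2_entries, vpage, ppage)
            self._fill(self.l1, self.l1_entries, vpage, ppage)
        return self.l1[vpage] * PAGE_SIZE + offset

    @staticmethod
    def _fill(cache, limit, key, value):
        if len(cache) >= limit:               # simplistic FIFO-like eviction
            cache.pop(next(iter(cache)))
        cache[key] = value

page_table = {v: v + 100 for v in range(1024)}  # arbitrary demo mapping
tlb = TwoLevelTLB()
tlb.translate(5 * PAGE_SIZE + 12, page_table)   # miss: walks the page table
tlb.translate(5 * PAGE_SIZE + 700, page_table)  # L1 TLB hit
```

The point of the second level is visible in the counters: repeated accesses to recently used pages never reach the page table at all.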

It is evident that the TLB modifications were intended primarily for server applications that require a lot of memory. However, the increased number of TLB entries may have a positive effect on memory subsystem performance in desktop tasks, too, especially since both TLB levels are dynamically shared between the virtual cores when SMT technology is enabled, so the opportunity to save additional entries in this buffer will not go to waste.

Another innovation that should increase memory subsystem performance in CPUs on Nehalem microarchitecture is the significant acceleration of instructions dealing with data that is not aligned to cache-line boundaries. The first tentative attempts to implement this were made back in Penryn processors, but only in Nehalem CPUs did Intel succeed. Now SSE instructions using 16-byte operands work equally fast regardless of the instruction type: for aligned or unaligned data. Since most compilers emit the unaligned versions of these instructions, this innovation should definitely improve the performance of applications working with media content.

However, faster processing of unaligned data and the added L2 TLB are trifles compared with the dramatic modification of the cache memory subsystem in the new Nehalem processors. From the old dual-level cache structure, with an L2 cache shared by each pair of cores, Nehalem borrowed only the 64KB L1 cache split into two equal parts for data and instructions. And although the L1 cache in Nehalem processors remained the same size, its latency grew by one clock cycle compared with the L1 cache of Core 2. This resulted from the more aggressive power-saving modes introduced in the new processors, which, according to Intel, have little effect on overall performance.

Although the shared L2 cache proved highly efficient in CPUs on Core microarchitecture, it appeared pretty difficult to implement in processors with more cores. Therefore, Nehalem microarchitecture, which allows processors with up to eight cores, doesn't have a shared L2 cache any more. Each core gets its own L2 cache of relatively small size: 256KB. However, due to its limited size, this cache boasts lower latency than the L2 cache of Core 2 processors, which partially makes up for the higher latency of Nehalem's L1 cache.

Nehalem also acquired an L3 cache, which is shared by all cores. As a result, the L2 cache acts as a buffer between each processor core and the pretty big shared cache memory. For example, quad-core desktop processors with the new microarchitecture will have an 8MB L3 cache.

The three-level cache memory reminds us of AMD processors on K10 microarchitecture; however, Nehalem's cache memory is organized in a completely different way. First, the L3 cache of the upcoming Intel processors works at a higher frequency, which will be set at 2.66GHz for the first representatives of this family and may increase later on. Second, the cache memory is inclusive, i.e. the data stored in the L1 and L2 caches is duplicated in the L3 cache. And there is a very good reason for that. An inclusive shared cache speeds up the memory subsystem in multi-core processors precisely because of this duplication of the L1 and L2 caches of all cores. Namely, if the data requested by one of the cores is not in the L3 cache, there is no point in looking for it in the individual caches of the other cores either. And since each line in the L3 cache has additional flags indicating which core the data comes from, the reverse modification of a cache line is also performed fairly simply: if a core modifies data in the L3 cache and this data initially belongs to a different core or cores, the L1/L2 caches of those cores get updated. This eliminates excessive inter-core traffic while ensuring coherency of the inclusive cache memory.
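The core idea of the inclusive L3 can be sketched in a few lines of Python. This is a deliberately simplified illustration of the principle described above, not Nehalem's actual coherency protocol: per-line core flags decide which, if any, private caches have to be consulted, and an L3 miss guarantees no private cache holds the line.

```python
# Sketch of why an inclusive L3 with per-line core flags avoids
# unnecessary snooping of other cores' private L1/L2 caches.

class InclusiveL3:
    def __init__(self, num_cores=4):
        # address -> set of core ids whose L1/L2 may also hold the line
        self.lines = {}
        self.num_cores = num_cores

    def fill(self, addr, core):
        """Record that `core` brought this line into its private caches."""
        self.lines.setdefault(addr, set()).add(core)

    def lookup(self, addr, core):
        """Return (hit, cores_to_check). Inclusion guarantees that a line
        absent from L3 cannot sit in any core's private cache, so an L3
        miss goes straight to memory without snooping the other cores."""
        if addr not in self.lines:
            return False, set()               # miss: no snoop needed at all
        owners = self.lines[addr] - {core}    # the per-line "core valid" flags
        return True, owners                   # only these caches may need updating

l3 = InclusiveL3()
l3.fill(0x1000, core=0)
hit, owners = l3.lookup(0x1000, core=1)   # hit; only core 0 may hold a copy
miss, _ = l3.lookup(0x2000, core=1)       # miss; memory access, zero snoops
```

The `owners` set plays the role of the line flags mentioned above: on a write to a shared line, only the caches of the listed cores have to be updated.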

The results of Nehalem cache-memory latency tests show that this solution is extremely efficient:

The L2 cache of the Nehalem processor does in fact have extremely low latency. The L3 cache also shows very good access times despite its relatively large size. By the way, the four times smaller exclusive L3 cache of AMD Phenom X4 processors shows pretty much the same latency of 54 cycles in Sandra 2009. However, in absolute terms the access time of the L3 cache in Phenom CPUs is significantly higher than that of Nehalem because of the lower clock speeds of AMD processors.

Despite a dramatic modification of the caching system, Intel engineers didn’t change the prefetch algorithms: Nehalem has borrowed them as is from Core 2. It means that prefetched data and instructions get delivered only into L1 and L2 cache. Nevertheless, even with old algorithms prefetch units started working faster. Each core in Nehalem processors has an individual L2 cache, and it is much easier to track memory request patterns with cache-memory organized like that. Moreover, operation of this prefetch unit barely affects the memory bus bandwidth thanks to L3 cache. Therefore, the prefetch units will no longer be disabled in server Nehalem processors, like they used to do with Xeon CPUs based on Core microarchitecture.

New SSE4.2 Instructions

Intel continued increasing the number of supported SIMD instructions in their new Nehalem microarchitecture. They added a set of seven new instructions called SSE4.2 (the SSE4.1 name was used for the SIMD instructions of Penryn CPUs). Intel specifically stressed that the new SSE4.2 instructions are designed not so much for processing streaming media content as for slightly different things. That is why the new Nehalem instructions are also called ATA (Application Targeted Accelerators).

The ATA concept implies that contemporary manufacturing processes allow spending some processor transistors not only on universal functional units, but also on specialized ones that increase performance in particular applications.

In line with this concept, SSE4.2 has five instructions accelerating XML file parsing; these instructions also speed up string and text processing. The other two new instructions from the SSE4.2 set are intended for completely different applications. The first one, CRC32, accumulates a CRC32 checksum, and the second one, POPCNT, counts the number of set bits in the source operand. These instructions can be used for a wide range of data-processing and networking tasks.
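What the two scalar instructions compute is easy to sketch in software. One detail worth knowing: the hardware CRC32 instruction implements CRC-32C (the Castagnoli polynomial), not the CRC-32 variant used by zip/zlib, so the reference implementation below uses the reflected Castagnoli polynomial 0x82F63B78.

```python
# Software reference for the two scalar SSE4.2 instructions.

def popcnt(x: int) -> int:
    """Count set bits, as the POPCNT instruction does."""
    count = 0
    while x:
        x &= x - 1      # clear the lowest set bit
        count += 1
    return count

def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C over a byte string (reflected polynomial 0x82F63B78),
    i.e. the checksum the hardware CRC32 instruction accumulates."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

print(popcnt(0b10110100))         # 4
print(hex(crc32c(b"123456789")))  # 0xe3069283, the standard CRC-32C check value
```

In hardware these operations complete in a few cycles per word, which is exactly the point of the ATA idea: a tiny amount of dedicated logic replacing long software loops like the ones above.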

Integrated Memory Controller

Nehalem is Intel's first microarchitecture with a memory controller integrated into the CPU. It may seem that Intel engineers borrowed this idea from their AMD colleagues, who have been integrating memory controllers into their processors since 2003. However, that is not quite correct, because the first processor with an integrated memory controller should have been Intel's Timna, which was developed back in 1999 but never saw the light of day. Moreover, no accusations of plagiarism are really warranted, because the memory controller Intel developed for their Nehalem CPUs is very different from the one used in existing AMD processors. Intel's approach turned out to be much more ambitious.

The main feature of the Nehalem memory controller is its flexibility. Keeping in mind the modular design of the entire upcoming processor family, which may include solutions differing dramatically in features and market positioning, Intel provided the opportunity not just to enable or disable support for buffered memory modules, but also to vary the memory speed and the number of channels.

The first processors with Nehalem microarchitecture will be quad-core models with a triple-channel memory controller supporting DDR3 SDRAM. This way, desktop systems built on the new CPUs will boast unprecedented memory subsystem bandwidth: with three DDR3-1067 SDRAM modules it will reach 25.6GB/s.
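The 25.6GB/s figure is straightforward arithmetic: each 64-bit DDR3-1067 channel moves 8 bytes roughly 1066.67 million times per second, and Bloomfield has three such channels. A quick Python check:

```python
# Peak bandwidth of Bloomfield's memory subsystem: three 64-bit channels
# of DDR3-1067 (~1066.67 million transfers per second each).

transfers_per_sec = 1066.67e6  # DDR3-1067 data rate
bytes_per_transfer = 8         # one 64-bit channel moves 8 bytes per transfer
channels = 3                   # triple-channel controller

bandwidth_gbps = transfers_per_sec * bytes_per_transfer * channels / 1e9
print(round(bandwidth_gbps, 1))  # 25.6
```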

However, the main advantage of moving the DRAM controller into the CPU is not the bandwidth increase, but the lower memory subsystem latency. Although Intel designed their new processors to work with DDR3 SDRAM, which has relatively high latency, it will still be lower than the latency of Core 2 based systems equipped with DDR2 SDRAM.

To prove that this statement is correct we would like to offer you the results of our memory subsystem tests performed on a Nehalem based platform in Everest 4.60:

Even in single-channel mode, the Nehalem memory controller performs better than the chipset memory controllers of LGA775 platforms. It is an absolutely logical result, because there are no intermediate devices between the CPU and the memory in the new generation of processors. Previously, the chipset North Bridge was responsible for working with the memory subsystem, and since it had to synchronize the memory bus with the FSB, it inevitably added to the memory subsystem latency.

Another indirect advantage of the built-in memory controller is its complete independence from the chipset and the mainboard. As a result, Nehalem will work with the memory subsystem equally fast on platforms from different developers.

QPI Bus

You may believe that by moving the memory controller into the CPU, Intel should have taken a lot of load off the processor bus, which in this case no longer has to transfer data between the CPU and the memory. This is partially true, but only for single-processor systems. Nehalem microarchitecture is universal: it should serve desktop and mobile as well as server solutions. That is why Intel designed a new processor bus that would suit multi-processor systems and provide sufficient bandwidth and scalability. Intel engineers didn't have much choice anyway, because the traditional FSB cannot be used in this case: multi-processor systems built from processors with integrated memory controllers have to use the NUMA (Non-Uniform Memory Access) memory model and hence require a direct high-speed connection between the CPUs.

To accomplish this task they built a special serial interface with point-to-point topology called CSI (Common System Interface), later renamed QPI (QuickPath Interconnect). On the technical side, QPI consists of two 20-bit links transferring data in opposite directions. 16 bits are assigned to data, and the remaining 4 bits serve auxiliary purposes: they are used by the protocol and for error correction. This bus performs a maximum of 6.4 billion transfers per second (6.4GT/s) and has 12.8GB/s of bandwidth in each direction, or 25.6GB/s of total bandwidth.
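The bandwidth figures follow directly from the link parameters just quoted: 16 payload bits (2 bytes) per transfer at 6.4GT/s, with a separate link for each direction. A quick sanity check in Python:

```python
# QPI bandwidth: each of the two 20-bit links carries 16 payload bits
# (2 bytes) per transfer, at 6.4 billion transfers per second.

transfers_per_sec = 6.4e9  # 6.4GT/s
payload_bytes = 16 / 8     # 16 of the 20 lanes carry data

per_direction_gbps = transfers_per_sec * payload_bytes / 1e9
total_gbps = 2 * per_direction_gbps  # both links run simultaneously
print(per_direction_gbps, total_gbps)  # 12.8 25.6
```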

The current bandwidth of the new QPI bus allows us to call it the fastest processor bus out there. The old Quad Pumped Bus can only reach 12.8GB/s total bandwidth at 1600MHz frequency. HyperTransport 3.0 bus similar to QPI and used in contemporary AMD processors can boast only 24GB/s peak bandwidth.

Depending on their market positioning, processors on Nehalem microarchitecture may come equipped with one or multiple QPI interfaces. As a result, each CPU in the multi-processor system may be directly connected to all other processors to reduce the latency when working with the memory connected to another controller. CPUs for single-processor desktop systems will have one QPI connecting it to the chipset.

Power Management and Turbo-Mode

A lot of what Intel engineers introduced in their Nehalem processors is inspired by the optimization of this microarchitecture for native multi-core design. Therefore, it was also necessary to revise the processor power management system. Multi-core processors on Core microarchitecture are rather power-inefficient in the sense that a single power management algorithm serves the whole CPU without taking individual cores into account. As a result, it is a pretty frequent situation when one heavily loaded core in a contemporary quad-core CPU prevents the other cores from going into power-saving mode even though they are hardly doing anything.

That is why Nehalem microarchitecture has one more important processor unit called PCU (Power Control Unit). It is actually just another programmable micro-controller built into the CPU that should manage power consumption intelligently. No wonder that PCU is of pretty complex design: it consists of about 1 million transistors.

PCU’s main task is to adjust the frequency and voltage of individual cores and it has everything it takes for that. It receives the sensor readings of temperatures, voltage and current for all cores. PCU analyzes these data and switches qualifying cores to power-saving mode by adjusting their frequency and voltage. Namely, PCU may disable inactive cores and put them in deep sleep state where their power consumption will be close to 0.

To make it all happen, Intel engineers and technologists created a special transistor material that allows disconnecting individual cores from the power rail independently. The main advantage of this technology is that power management of individual cores is performed inside the CPU and doesn't require enhancing the processor voltage regulator circuitry on mainboards in any way.

As for the processor units identical for all cores, such as memory controllers and QPI interface, they go into power-saving mode when all processor cores sleep.

An intelligent controller that can manage processor cores independently allowed Intel to implement one more interesting technology called Turbo Boost. It introduces a Turbo mode, in which individual cores can work at frequencies exceeding the nominal, i.e. be overclocked. According to the main principle of Turbo Boost, the overall processor power consumption and heat dissipation drop when some cores go into power-saving mode, which allows increasing the frequencies of the other cores without the risk of exceeding the TDP limits.

In fact, something similar to this technology was already introduced in mobile dual-core Penryn processors; however, it has been developed much further in Nehalem. If there is no risk of exceeding the typical power consumption and heat dissipation, the PCU may increase the clock frequency of certain cores one step (133MHz) above the nominal. This may happen when the workload is not parallelized and some cores are idling.

Moreover, if all the above described conditions are met, frequency of one of the cores may be increased two steps above the nominal (266MHz).
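The turbo steps are simple multiplier arithmetic. Assuming the 133MHz base clock mentioned above and a 24x multiplier for a nominal 3.2GHz model, one and two speed bins work out like this:

```python
# Turbo-mode arithmetic: core frequency = multiplier x base clock.

BCLK = 133.33e6  # base clock in Hz

def core_frequency_ghz(multiplier: int) -> float:
    """Core frequency in GHz for a given multiplier."""
    return multiplier * BCLK / 1e9

nominal = core_frequency_ghz(24)   # nominal frequency of a 3.2GHz model
one_bin = core_frequency_ghz(25)   # one 133MHz step up
two_bins = core_frequency_ghz(26)  # two steps up
print(round(nominal, 2), round(one_bin, 2), round(two_bins, 2))  # 3.2 3.33 3.47
```

The two-bin result of 3.466GHz is the figure usually rounded to 3.46GHz.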

I have to stress that Turbo-mode doesn’t necessarily get enabled when one or more cores go into power-saving mode. It is simply one of the possible scenarios. Since the PCU can get all the information on the current processor cores status, Turbo-mode can also be enabled when all cores are active but the workload is relatively small.

Turbo Boost Technology is absolutely transparent to the operating system, which is its great advantage: it is implemented entirely in hardware and doesn't require any software applications or utilities to be running.

To see what it actually means in practical tests we observed the status of our quad-core Nehalem processor with 3.2GHz nominal frequency during work with 1-8 computational threads created by Prime95 utility.



As you can see in the animated illustration above, Intel Enhanced SpeedStep technology kicks in when there is no workload: the processor frequency drops to 1.6GHz. Launching one thread loads a single core, so the CPU can increase its multiplier from 24x to 26x, overclocking itself to 3.46GHz. Two threads increase the processor load so much that the PCU only dares raise the clock speed to 3.33GHz. The frequency remains at this point until we have five simultaneous threads running, and only the sixth thread increases CPU utilization to 75%, lowering the frequency back to the nominal 3.2GHz. In other words, Turbo Boost Technology is not an ephemeral concept: its effect is real.

Bloomfield is the First Nehalem

The first mass production processors based on the new Nehalem microarchitecture will be desktop processors codenamed Bloomfield. They will have four cores. Besides these four cores, the Bloomfield processor die will also contain an 8MB L3 cache, a triple-channel memory controller supporting DDR3 SDRAM and one QPI interface. Such a CPU will consist of 731 million transistors and will be manufactured on a 45nm process using high-k dielectric metal gates. The die will measure 263 sq. mm.

Bloomfield processors will be marketed as Core i7. The first models, due out in mid-November, will work at 3.2GHz, 2.93GHz and 2.66GHz, and the typical heat dissipation of all three will be set at 130W.

Since processors on Nehalem microarchitecture have a built-in memory controller and a high-speed QPI interface, they require different packaging. Core i7 will be manufactured in the LGA1366 form-factor, considerably bigger than LGA775: 42.5 x 45mm.

So, Core i7 will require new mainboards, which can currently be based on only one chipset supporting the QPI interface: Intel X58 Express.

More details on Core i7 processors are summed up in the table below:

The Extreme Edition model will not only boast a higher clock frequency, but will also be overclocking-friendly: it will have unlocked CPU and memory frequency multipliers.

It is evident that Core i7 processors will take over the high-end desktop segment, replacing the top Core 2 Quad and Core 2 Extreme processors that will soon stop shipping. Note that the Core i7 lineup will include only three models for a relatively long time. According to Intel's roadmap, things will change only in mid-2009, when more Nehalem-based choices appear not only in the high-end but also in the mainstream segment at more affordable prices.

New processors should arrive in the server and workstation segment early next year, and in mobile platforms in H2 2009.

Conclusion

Of course, the launch of Nehalem processors is a big event for the computer market. Intel has been skillfully warming up the public for the new microarchitecture, and they did very well: all IT publications are full of rumors and opinions about these processors.

But we are going to hold back our optimistic conclusions and would like to warn our readers that, despite all the advantages of the new microarchitecture, you shouldn't expect too much of the new Core i7 processors. First, most of the innovations and improvements are targeted primarily at server applications, and desktop users will hardly have a chance to really feel them. Second, when Intel engineers worked on Nehalem, they tried not so much to increase "pure" performance as to eliminate the platform bottlenecks that were especially critical in server applications.

The main performance gain desktop users can hope for from Core i7 processors will be determined by three factors. First, Turbo Boost technology, which quite often raises the CPU frequency above the nominal. Second, the integrated memory controller, providing remarkable bandwidth and impressively low latency. And third, SMT technology, which will have its positive effect primarily under multi-threaded load.

Although these factors can significantly increase the new processors' performance, their effect depends a lot on the type of workload. Therefore, Core i7 processors may not seem so innovative and revolutionary after all, especially since their computational core has changed very little compared with the "classical" representatives of Core microarchitecture. It means that Core i7 will not delight us as much as Core 2 did when it came to replace Pentium 4.

The real revolution will happen when Nehalem processors come to servers. And although the inertia of this market will hardly let these CPUs take over the segment quickly despite all their advantages, servers with processors based on the new Intel microarchitecture will definitely be much more attractive solutions.

As a result, from the perspective of the desktop CPU market, we could consider Nehalem just another refresh of Core microarchitecture along the evolutionary path from Conroe through Penryn to Nehalem. However, even in this case Intel will remain the indisputable leader in the microprocessor market. Especially since the "Tick-Tock" concept is still in place, which means that a year from now Nehalem will move to a finer 32nm process, and in two years we will see another new microarchitecture.