Memory Subsystem Performance
First, we decided to focus on the evident advantages of the new Core i7 processors. One of their major trump cards is the new three-level cache hierarchy with a shared L3 cache and a memory controller built into the CPU. Let me remind you that, despite the similarities between CPUs on the Nehalem and Core microarchitectures, only the L1 cache of Core i7 resembles the L1 cache of Core 2 Quad. The L2 cache of the new processor is organized differently: it has become much smaller, but each core now has its own individual L2 cache.

Core 2 Extreme QX9770

Core i7-965 Extreme Edition
Note that the 6MB L2 cache of the Core 2 processor family has 24-way set associativity. This means that, to accelerate lookups, the cache is split into 256KB areas (6MB divided by 24 ways). The Core i7 processor has only 256KB of L2 cache per core, but it is 8-way set associative, so each of its ways is just 32KB. As a result, a processor on the Nehalem microarchitecture should spend considerably less time searching its L2 cache for data.
To estimate the performance of the entire subsystem, including the cache and the memory, we resorted to the synthetic bandwidth and latency test built into the Everest 4.60 suite.
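The associativity arithmetic above can be reproduced with a short sketch. The cache sizes and way counts come from the text; the 64-byte cache line is our assumption and is not stated in this article, and the function names are purely illustrative.

```python
# Sketch of the way-size arithmetic for the two L2 caches discussed above.
# Assumption: 64-byte cache lines (not stated in the article).
KB = 1024
MB = 1024 * KB
LINE_BYTES = 64


def way_size_bytes(cache_bytes, ways):
    """Size of one way: the total cache capacity divided by the associativity."""
    return cache_bytes // ways


def num_sets(cache_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets: each set holds `ways` lines of `line_bytes` bytes."""
    return cache_bytes // (ways * line_bytes)


caches = {
    "Core 2 L2 (6MB, 24-way)": (6 * MB, 24),
    "Core i7 L2 (256KB, 8-way)": (256 * KB, 8),
}

for name, (size, ways) in caches.items():
    print(f"{name}: way size {way_size_bytes(size, ways) // KB}KB, "
          f"{num_sets(size, ways)} sets")
```

A lookup compares the tags of every way in one set, so fewer ways means fewer tag comparisons per access. That is only one of several factors behind the latency figures that follow.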

Core 2 Extreme QX9770

Core i7-965 Extreme Edition
First of all, look at the difference in L1 cache latency. Although Core i7 processors inherited the L1 cache from their predecessors, Intel gave it slightly higher latency for the sake of supporting power-saving modes. Our practical results confirm this.
However, the L2 cache of the new processors does work much faster. Its measured latency is half that of the L2 cache in CPUs on the Core microarchitecture. The L2 cache of Core i7 also delivers higher bandwidth during reading, writing and copying. It is the L3 cache of Core i7 processors that works about as fast as the L2 cache in Core 2 CPUs.
In other words, the three-level cache of the new CPUs should be at least as efficient as that of their predecessors. Its only bottleneck is the higher L1 cache latency. However, the faster L2 cache should make up for it, as it actually serves as an intermediate buffer between the L1 and L3 caches, which work at speeds similar to those of the L1 and L2 caches in Core 2 Quad processors.
As for memory performance, Nehalem processors are simply beyond all competition here. The bandwidth of triple-channel DDR3-1067 SDRAM is 45% higher than the memory bandwidth of an LGA775 system working with dual-channel DDR3-1600 SDRAM. And the latency of the memory subsystem in the Core i7 platform is about 30% lower.
The Core i7 platform remains an indisputable leader even when we switch the memory controller into dual-channel mode. Although our LGA775 system uses faster memory modules, it still loses in the access time and bandwidth tests.

Dual-channel mode of the Core i7-965 Extreme Edition memory controller
By the way, as you can see, when we switched the Core i7 platform from triple-channel to dual-channel mode, the memory subsystem performance didn't drop much. And the latency not only didn't increase, but actually got lower. It means that there is nothing wrong with using dual-channel memory in LGA1366 platforms: the processor can employ two memory channels efficiently, too. In some cases you can even expect triple-channel memory to turn out slower than dual-channel memory, because its higher latency will not be compensated by its small advantage in bandwidth.
To conclude our short test session of the Core i7 memory subsystem, I have to mention one more parameter that may speed it up. I am talking about the frequency of the L3 cache and memory controller, which may be adjusted in the mainboard BIOS Setup, as we have already said above. The results we have just discussed were obtained with the processor interface blocks working at twice the frequency of the memory, namely at 2133MHz. If we use a higher uncore multiplier for the processor interface blocks, for example 20x, the L3 cache and memory controller frequencies will increase to 2667MHz, and the benchmark results will be higher as well.
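To put the channel configurations in perspective, here is a minimal sketch of their theoretical peak bandwidth. It assumes a 64-bit (8-byte) data bus per channel and an effective rate of 1066.67 MT/s for DDR3-1067; these are nominal figures for the memory type, not our Everest measurements, and the function name is illustrative only.

```python
# Sketch: theoretical peak bandwidth of the memory configurations discussed above.
# Assumptions: 64-bit (8-byte) bus per channel; DDR3-1067 taken as 1066.67 MT/s.
BUS_BYTES = 8


def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=BUS_BYTES):
    """Peak bandwidth in GB/s: channels x transfers per second x bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


configs = [
    ("Triple-channel DDR3-1067", 3, 1066.67),
    ("Dual-channel DDR3-1067", 2, 1066.67),
    ("Dual-channel DDR3-1600", 2, 1600.0),
]

for label, channels, rate in configs:
    print(f"{label}: {peak_bandwidth_gbs(channels, rate):.1f} GB/s")
```

Under these assumptions, triple-channel DDR3-1067 and dual-channel DDR3-1600 share the same theoretical peak of about 25.6 GB/s. The measured gap between the platforms is therefore presumably down to the memory controller and its path to the cores rather than to raw module speed.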
Here are the numbers we got in this case in triple-channel memory mode:

Interface blocks work at 2.66GHz
The L3 cache and memory controller frequencies increased by 25%. As a result, we can see about a 24% improvement in memory subsystem bandwidth during writes and a smaller improvement of only 10% during copying. The latency of the L3 cache and the memory also dropped by 8-9%. Unfortunately, this highly efficient way of boosting performance has very limited application, because increasing the frequency of the processor interface blocks may often affect system stability. In our case, for example, any further increase of this multiplier made the system less reliable.
Therefore, all further tests were performed with the L3 cache and memory controller working at 2667MHz.
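The uncore frequencies quoted in this section follow from simple multiplier arithmetic, sketched below. The 133.33MHz base clock and the 16x default multiplier are assumptions inferred from the 20x and 2667MHz figures above; they are not settings read from the BIOS, and the function names are illustrative.

```python
# Sketch: uncore (L3 cache and memory controller) frequency from its multiplier.
# Assumptions: base clock of 400/3 = 133.33MHz, inferred from 20x -> 2667MHz;
# 16x as the multiplier behind the default 2133MHz setting.
BASE_CLOCK_MHZ = 400.0 / 3.0


def uncore_mhz(multiplier, base_clock_mhz=BASE_CLOCK_MHZ):
    """Uncore frequency in MHz for a given multiplier."""
    return multiplier * base_clock_mhz


def relative_increase(old_multiplier, new_multiplier):
    """Fractional frequency gain when moving between two multipliers."""
    return uncore_mhz(new_multiplier) / uncore_mhz(old_multiplier) - 1.0


for mult in (16, 20):
    print(f"{mult}x -> {uncore_mhz(mult):.0f}MHz")

print(f"Increase from 16x to 20x: {relative_increase(16, 20):.0%}")
```

The 25% frequency gain computed here sets an upper bound on what the benchmarks could show, which matches the 24% write bandwidth improvement being the largest of the reported gains.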




