TLB and Cache-Memory
At the beginning of this article we mentioned that most of the features distinguishing the new Nehalem processors from their predecessors lie not in their cores, but in the interfaces and overall CPU structure. The changes introduced in the cache-memory and TLB fully confirm this point: they look much more significant than the small modifications in the internal CPU units we have just discussed.
First of all, Intel engineers have significantly increased the size of the TLB (Translation Lookaside Buffer). As you know, the TLB is a high-speed buffer that caches mappings between virtual and physical page addresses. By making the TLB bigger, they increase the number of memory pages that can be used without resorting to costly walks through the address translation tables stored in regular memory.
Moreover, the TLB in Nehalem processors became two-level. In fact, Intel simply added a second-level buffer to the TLB inherited from Core 2 processors. The new L2 TLB is not only large, holding up to 512 entries, but also boasts relatively low latency. It is also unified, so it can translate addresses for pages of any size.
It is evident that the TLB modifications were intended primarily for server applications that require a lot of memory. However, the increased number of TLB entries may have a positive effect on memory subsystem performance in desktop tasks, too. Especially since both TLB levels are dynamically shared between the virtual cores when SMT technology is enabled, the opportunity to store additional entries in this buffer will not go to waste.
Another innovation that should increase memory subsystem performance in CPUs on the Nehalem microarchitecture is the significant acceleration of instructions dealing with data that are not aligned to cache-memory lines. Intel made the first tentative steps in this direction back in Penryn processors, but only in Nehalem CPUs did they fully succeed. Now SSE instructions using 16-byte data blocks as operands work equally fast regardless of the instruction type, i.e. whether it is intended for aligned or unaligned data. Since most compilers emit code with unaligned instructions, this improvement should definitely benefit applications working with media content.
However, faster processing of unaligned data and the added L2 TLB are trifles compared with the dramatic modification of the cache-memory subsystem in the new Nehalem processors. From the old two-level cache structure with an L2 cache shared by each pair of cores, only the 64KB L1 cache, split into two equal parts for data and instructions, was carried over. And although the L1 cache in Nehalem processors kept its size, its latency is one clock cycle higher than that of the L1 cache in Core 2. This results from the more aggressive power-saving modes introduced in the new processors, which, according to Intel, have little effect on overall performance.
Although the shared L2 cache proved highly efficient in CPUs on the Core microarchitecture, it turned out to be quite difficult to implement in processors with more cores. Therefore, the Nehalem microarchitecture, which allows processors with up to eight cores, no longer has a shared L2 cache. Each core gets its own L2 cache of relatively small size: 256KB. However, due to its limited size, this cache boasts lower latency than the L2 cache of Core 2 processors, which partially makes up for the higher L1 latency in Nehalem.
Nehalem also acquired an L3 cache that connects all the cores and is shared between them. As a result, the L2 cache becomes a buffer through which the processor cores send their requests to the fairly large shared cache-memory. For example, quad-core desktop processors with the new microarchitecture will have an 8MB L3 cache.
The three-level cache-memory reminds us of AMD processors on the K10 microarchitecture; however, Nehalem's cache-memory is organized in a completely different way. First, the L3 cache of the upcoming Intel processors works at a higher frequency, which will be set at 2.66GHz for the first representatives of this family and may increase later on. Second, the cache-memory remained inclusive, i.e. the data stored in the L1 and L2 caches is duplicated in the L3 cache, and there is a very good reason for that. An inclusive shared cache speeds up the memory subsystem in multi-core processors precisely because of this duplication of the L1 and L2 contents of all the cores: if the data requested by one of the cores is not in the L3 cache, there is no point looking for it in the individual caches of the other cores. And since each line in the L3 cache carries additional flags indicating which core the data comes from, the reverse update of a cache line is also performed fairly simply: if a core modifies data in the L3 cache that originally belongs to a different core (or cores), the L1/L2 caches of those cores get updated. This eliminates the excessive inter-core traffic otherwise needed to keep the cache-memory coherent.
The results of Nehalem cache-memory latency tests show that this solution is extremely efficient:
The L2 cache of the Nehalem processor does indeed have extremely low latency. The L3 cache also shows very good access times despite its relatively large size. By the way, the four-times-smaller exclusive L3 cache of AMD Phenom X4 processors shows pretty much the same latency of 54 cycles in Sandra 2009. However, the absolute access time of the L3 cache in Phenom CPUs is significantly higher than in Nehalem because of the lower clock speeds of AMD processors.
Despite the dramatic modification of the caching system, Intel engineers didn't change the prefetch algorithms: Nehalem borrowed them as-is from Core 2. This means that prefetched data and instructions are delivered only into the L1 and L2 caches. Nevertheless, even with the old algorithms the prefetch units started working faster: since each core in Nehalem processors has an individual L2 cache, it is much easier to track memory request patterns with cache-memory organized this way. Moreover, thanks to the L3 cache, the prefetch units barely affect the memory bus bandwidth. Therefore, prefetch will no longer be disabled in server Nehalem processors, as it used to be in Xeon CPUs based on the Core microarchitecture.