Besides instruction and data caches, processors have one more type of cache memory: translation lookaside buffers (TLBs). These buffers store the mappings between virtual and physical page addresses obtained from the page tables. The number of TLB entries determines how many memory pages can be covered without additional costly page-table walks. This is especially critical for applications that access memory randomly, constantly requesting data from different pages. The K10 processor has considerably more translation buffers than the K8. For your convenience, they are all listed in the table below:
Table 1: TLB capacity of K8 and K10 processors.
As you can see from the table, there are many more buffers for translating 2MB pages. Support for large 1GB pages has also been added, which may be very useful for servers processing large volumes of data. With appropriate OS support, applications using large 2MB and 1GB pages should run considerably faster.
When the requested data isn't found in any of the caches, the request is issued to the memory controller integrated onto the processor die. The on-die location of the memory controller reduces memory access latency, but at the same time it ties the processor to a specific memory type, increases the die size, and complicates die binning, thus affecting production yields. The on-die memory controller was one of the advantages of the K8 processors; however, it wasn't always efficient enough. The memory controller of K10 processors has been improved significantly.
Firstly, it can now transfer data not only over a single 128-bit channel, but also over two independent 64-bit channels. As a result, two or more processor cores can work with memory simultaneously more efficiently.
Secondly, the scheduling and reordering algorithms in the memory controller have been optimized. The memory controller groups reads and writes so that the memory bus is utilized with maximum efficiency. Read operations take priority over writes. Data to be written is stored in a buffer of still-undisclosed size (it is assumed to hold between 16 and 30 64-byte lines). Handling write requests in groups avoids constantly switching the memory bus between reading and writing, saving bus turnaround time. This significantly improves performance under alternating read and write requests.
Thirdly, the memory controller can analyze request sequences and perform prefetch.
Prefetch was never a particular strength of K8 processors: their low-latency integrated memory controller long allowed AMD processors to demonstrate excellent memory subsystem performance without it. However, with the new DDR2 memory, K8 processors failed to prove as efficient as Core 2 with its powerful prefetch mechanism. K8 processors have two prefetch units: one for code and one for data. The data prefetch unit fetches data into the L2 cache based on simple sequential access patterns.
K10 features an improved prefetch mechanism.
First of all, K10 processors prefetch data directly into the L1 cache, which hides the L2 cache latency when the data is requested. Although this increases the probability of polluting the L1 cache with unnecessary data, especially considering its low associativity, AMD claims it is a justified trade-off that pays off well.
Secondly, an adaptive prefetch mechanism has been implemented that changes the prefetch distance dynamically, so that data arrives in time without loading the cache with data that is not needed yet. The prefetch unit has also become more flexible: it can now train on memory requests at any addresses, not only those falling into adjacent lines. Moreover, the prefetch unit now trains on software prefetch requests as well.
Thirdly, a separate prefetch unit has been added directly to the memory controller. It analyzes request sequences from the cores and loads the data into the write buffer, utilizing the memory bus optimally. Keeping prefetched lines in the write buffer helps keep the cache memory clean and significantly reduces data access latency.
As a result, the memory subsystem of K10 processors has undergone notable improvements. Still, it potentially yields to the memory subsystem of Intel processors in several respects: the absence of speculative loads past write operations to unknown addresses, lower L1D cache associativity, a narrower bus between the L1 and L2 caches (in terms of data transfer rate), a smaller L2 cache, and simpler prefetch. Despite all the improvements, Core 2 prefetch is potentially more powerful than K10 prefetch: for example, K10 has no prefetch keyed to instruction addresses that could track individual instructions, and no prefetch from L2 to L1 that could hide L2 latency efficiently enough. These factors affect different applications differently, but in most cases they will give Intel processors higher performance.
Let's take a quick look at other innovations introduced in the K10 micro-architecture.