Once the AGUs of the K8 processor have calculated the memory request addresses, all load and store operations are sent to the LSU (Load/Store Unit). The LSU contains two queues: LS1 and LS2. Load and store operations first enter the 12-entry LS1 queue. At a rate of two operations per clock, the LS1 queue issues requests to the L1 cache in the order determined by the program code. On a cache miss, operations are moved to the 32-entry LS2 queue, which issues the requests to the L2 cache and RAM.
The LSU of the K10 processor has been modified: the LS1 queue now receives only load operations, while store operations go straight to the LS2 queue. Loads from LS1 can execute out of order, taking into account the addresses of the store operations waiting in LS2. As we have already mentioned above, K10 processes 128-bit stores as two 64-bit ones, which is why each of them occupies two entries in the LS2 queue.
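The ordering rule can be sketched in a few lines. This is an illustrative model, not actual AMD microarchitecture logic: a load from LS1 may issue ahead of program order only if no store waiting in LS2 targets the same address (the function and parameter names are our own).

```python
# Simplified model: LS2 is represented by the set of addresses of
# stores that have not yet been written to the cache.

def load_may_issue(load_addr, pending_store_addrs):
    """Return True if the load does not conflict with any queued store."""
    return load_addr not in pending_store_addrs

print(load_may_issue(0x100, {0x200, 0x300}))  # True: no conflicting store
print(load_may_issue(0x200, {0x200, 0x300}))  # False: must wait for the store
```

A real implementation also has to handle partial overlaps between differently sized accesses, which this sketch ignores.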
The L1 cache in K8 and K10 processors is split: 64KB for instructions (L1I) and 64KB for data (L1D). Each cache is 2-way set associative with a 64-byte line. Such low associativity can cause frequent conflicts between lines mapping to the same set, which in turn increases the number of cache misses and hurts performance. The low associativity is partly compensated by the rather large size of the L1 cache. A significant advantage of the L1D cache is its two ports: it can process two read and/or write instructions per clock in any combination.
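The conflict problem follows directly from the geometry. With 64KB, 2-way associativity and 64-byte lines, the cache has 512 sets, so addresses 32KB apart land in the same set; a third such address already exceeds the two available ways. A minimal sketch of the set-index arithmetic, assuming the parameters quoted above:

```python
# L1D geometry from the article: 64KB, 2-way, 64-byte lines.
CACHE_SIZE = 64 * 1024
WAYS = 2
LINE = 64
SETS = CACHE_SIZE // (WAYS * LINE)  # 512 sets

def set_index(addr):
    """Which set a byte address maps to."""
    return (addr // LINE) % SETS

stride = SETS * LINE  # 32KB: the distance at which addresses collide
addrs = [0, stride, 2 * stride]
print([set_index(a) for a in addrs])  # [0, 0, 0]: three lines, two ways
```

Accessing all three addresses repeatedly would keep evicting one of them, which is exactly the conflict-miss pattern described above.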
Unfortunately, the L1 cache of the K10 processor retains the same size and associativity. The only noticeable improvement is the wider read bus: as we said in the previous chapter, the CPU can now perform two 128-bit reads every clock cycle, which makes it considerably more efficient at processing SSE data in local memory.
Each core of the dual-core and quad-core K8 and K10 processors has its own individual L2 cache. The L2 cache in K10 remains the same: 512KB per core, 16-way set associative. Exclusive per-core L2 caches have their pros and cons compared with the shared L2 cache of Core 2 CPUs. Among the advantages is the absence of conflicts and competition for the cache when several processor cores are heavily loaded at the same time. The drawback is that less cache is available to each core when only one application is running full throttle.
The L2 cache is exclusive: data stored in the L1 and L2 caches are not duplicated. The L1 and L2 caches exchange data over two unidirectional buses: one for receiving data, the other for sending it. In the K8 processor each bus is 64 bits (8 bytes) wide (Pic.5a). This organization limits the data delivery rate from the L2 cache to a modest 8 bytes per clock. In other words, transferring a 64-byte line takes 8 clock cycles, so data delivery to the core is delayed noticeably, especially when two or more L2 cache lines are requested at the same time.
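The arithmetic is simple enough to write down. A short sketch of the line-transfer cost under the bus widths discussed here (the 128-bit K10 figure is the unconfirmed one from the next paragraph):

```python
LINE_BYTES = 64  # L2 cache line size

def transfer_cycles(bus_bytes_per_clock, lines=1):
    """Clock cycles to move 'lines' cache lines over the L1<->L2 bus."""
    return lines * LINE_BYTES // bus_bytes_per_clock

print(transfer_cycles(8))      # K8, 64-bit bus: 8 clocks per line
print(transfer_cycles(8, 2))   # two lines back to back: 16 clocks
print(transfer_cycles(16))     # K10, assumed 128-bit bus: 4 clocks
```

Doubling the bus width thus halves the serialization penalty when several lines are in flight, which is where the latency benefit mentioned below comes from.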
Although it hasn’t been officially confirmed yet, the send and receive buses in the K10 processor are expected to become twice as wide, i.e. 128 bits each (Pic.5b). This should significantly reduce the effective cache access latency when two or more lines are requested at the same time.
To make up for the relatively small L2 caches, K10 acquired a 2MB, 32-way set associative L3 cache shared by all cores. The L3 cache is adaptive and exclusive: it stores the data evicted from the L2 caches of all cores as well as data shared by several cores. When a core issues a line read request, a special check is performed. If the line is used by only one core, it is removed from L3, freeing room for the line evicted from the L2 cache of the requesting core. If the requested line is also used by another core, it remains in the cache; in that case another, older line is removed instead to accommodate the line evicted from the L2 cache.
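The replacement decision above can be sketched as follows. This is a hypothetical model of the sharing-aware policy just described, not AMD's implementation; all names are ours, and `l3` simply maps a line address to the set of cores using it, oldest entry first (Python dicts preserve insertion order).

```python
def l3_read(l3, line, requester, victim_from_l2):
    """Core 'requester' reads 'line' from L3; 'victim_from_l2' is the
    line just evicted from its L2 cache and needing a slot in L3."""
    users = l3.get(line, set())
    if users <= {requester}:
        # Used by the requester alone: hand the line over and reuse
        # its slot for the incoming L2 victim.
        l3.pop(line, None)
    else:
        # Shared with another core: keep it resident and evict the
        # oldest *other* line instead.
        for old in l3:
            if old != line:
                del l3[old]
                break
    l3[victim_from_l2] = {requester}
```

For example, reading a privately held line frees its slot directly, while reading a shared line sacrifices the oldest unrelated line, which matches the two cases described in the paragraph above.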
The L3 cache should also speed up data transfers between the cores. As we have already found out, contemporary Athlon 64 processors exchange data between the cores via the memory bus, so access to shared modified data is much slower. According to AMD’s materials, quad-core K10 processors can exchange data via the L3 cache: once a request from one of the cores arrives, the core holding the modified data copies it to the L3 cache, where the requesting core can read it. The access time to modified data in another core’s cache should thus become much shorter. When we get a chance, we will certainly check it out.
Pic.6: Data transfer between the cores in K10 processors.
The L3 cache latency will evidently be higher than the L2 cache latency. However, AMD’s materials suggest that it will vary adaptively depending on the workload: under a light load latency improves, while under a heavy load bandwidth rises. We still have to check what really stands behind this.