Chapter V: Cache Hierarchy
You may have noticed that we jumped straight to the execution units and skipped quite a few other components of the CPU architecture. This is not because there is nothing to say about the earlier stages. On the contrary, we found some really interesting details there. However, before moving on to a detailed discussion of the schedulers and micro-instruction queues, we need to revisit a few things we already know, cache access latencies among them.
Access to each level of the cache hierarchy is characterized by the time the access takes and by the width of the bus connecting that level to the next one. The table below sums up these figures for the different caches of the CPU:
Bus bandwidth at 3GHz frequency
L1 D cache* - registers
L2 cache - L1 D cache
L3 cache** - L2 cache
* - the numbers are given only for the data cache, because no bus bandwidth is defined for the Trace cache. We only know its maximum transfer rate of 6 uops per two clock cycles, but since a uop has no fixed size, it is hard to estimate the actual bandwidth.
** - Only the Gallatin core has an L3 cache right now. It is a modification of the 130nm Northwood core with an integrated on-die L3 cache (with ECC support). This core is used in the following CPUs: Pentium 4 Extreme Edition, Xeon MP, Xeon DP with L3 cache.
*** - in fact, the more exact bus bandwidth rates are 44.8GB/s, 89.5GB/s and 22.4GB/s.
You can easily see that the cache levels differ in their characteristics. In particular, the read speed from the L1 cache into the registers is much lower than the data transfer rate of the bus between the L1 and L2 caches. How did this come about? Why do we need the L1 cache at all, then? Does this arrangement make any sense?
The reason is that the L1 cache has a different job from the L2 cache: the L1 cache must find and deliver the needed data quickly, with minimal latency. Note that the read speed from this cache doesn't exceed 16 bytes per clock (48GB/s at 3GHz). Moreover, one of the main reasons the L1 cache exists is that the CPU cannot access the L2 cache directly. To use data held there, it has to issue a request, find the data, transfer it to the L1 data cache, and so on.
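The figure in parentheses follows directly from the transfer width and the clock rate. Here is a minimal sketch of that arithmetic, assuming a width of 16 bytes per clock and the 3GHz frequency used for the table. The function name peak_bandwidth_gb_per_s is made up for illustration, and GB is taken as 10^9 bytes:

```python
def peak_bandwidth_gb_per_s(bytes_per_clock: int, clock_hz: float) -> float:
    """Peak transfer rate in GB/s, with 1 GB taken as 10**9 bytes."""
    return bytes_per_clock * clock_hz / 1e9


# Assumed inputs: 16 bytes per clock at a 3 GHz core clock.
l1_to_registers = peak_bandwidth_gb_per_s(bytes_per_clock=16, clock_hz=3e9)
print(f"L1 D cache -> registers: {l1_to_registers:.0f} GB/s")
```

This is only a peak rate: it says nothing about latency, which is what the L1 cache is actually optimized for.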
But if the requested data is not in the L1 cache, then it needs to be transferred there as soon as possible.
The following example illustrates why this is necessary. Imagine that a program needs data that is not laid out as an ordered chain with sequential addresses. Every piece of data is more than one cache line away from the next. Or, even worse, the pieces are scattered all over a fairly large area of memory. In this case the entire 64-byte cache line has to be transferred in order to read just one byte of information(!). In other words, the amount of data transferred from the L2 cache is 64 times larger than what we actually need. So the higher the data transfer rate from the L2 cache to the L1 data cache, the better. Moreover, the decoder also fetches its data from the L2 cache.
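The 64-fold overhead in this example can be checked with a small model. This is a rough sketch, not a simulation of the real cache: it assumes that every distinct 64-byte line touched is transferred from L2 exactly once (cold L1 cache, no evictions, no prefetching). The function name traffic_for_reads and the address patterns are hypothetical:

```python
LINE_SIZE = 64  # bytes per cache line, as in the example above


def traffic_for_reads(addresses, line_size=LINE_SIZE):
    """Return (bytes_transferred, bytes_used) for single-byte reads.

    Simplifying assumption: each distinct line touched is moved
    from L2 to L1 exactly once.
    """
    distinct_bytes = set(addresses)
    lines_touched = {addr // line_size for addr in distinct_bytes}
    return len(lines_touched) * line_size, len(distinct_bytes)


# Sequential bytes: all 64 reads fall into a single line.
seq_moved, seq_used = traffic_for_reads(range(0, 64))

# Scattered bytes: 64 reads, 128 bytes apart, each in its own line.
sc_moved, sc_used = traffic_for_reads(range(0, 64 * 128, 128))

print(seq_moved / seq_used)  # 1.0
print(sc_moved / sc_used)    # 64.0
```

Under these assumptions the sequential pattern uses every byte it transfers, while the scattered pattern moves 64 bytes for each byte it actually reads.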