That is why the bus between the caches features simply fantastic bandwidth of 96GB/s (for 3GHz clock frequency). All this should reduce the time for data transfer and the delays. Roughly speaking, L1 data cache should feature provide latency, and L2 cache – maximum data transfer rate.
In Prescott core, the caches are not so strictly specified. The L1 cache latency is more nanoseconds than that of the L1 cache in Northwood core. And the data transfer rate between the L1 and L2 cache in Prescott is very often lower than that in Northwood. You can see it clearly if you try addressing the beginning of every string, which is absent in L1 but present in L2 cache.
Northwood core demonstrates stable data transfer rate of 32 bytes per clock, while Prescott core shows about 1.5-2 times lower result.
I have to point out here that the time it takes to receive the data from L1 cache is greatly dependent on where this data comes from. If it comes directly from L1 data cache, then its maximum read speed will be 48GB/s. If the data is transferred from L2 cache, then it arrives into L1 at 96GB/s.
This way you see that the physical meaning of these two bandwidth rates is completely different: since the data cannot get “lost”, then we have to admit that L1 data cache works at different speed in different situations. That is why it doesn’t make much sense to provide only one number for the data transfer rate value, we should always specify what case we are talking about. Moreover, the data transfer rate also depends on the type of the requested data. For example, if we need to transfer data from the L1 data cache into the SSE2 registers, the maximum speed would be 16 bytes per clock. For the transfer from L1 data cache into MMX registers the maximum speed would be 8 bytes per clock, and for the transfer into integer registers – 4 bytes per clock. Not that simple, eh?
Note that the data can be read and stored at the same time. But in this case the reading can not go at the maximum speed of 16 bytes per clock that is why our estimated maximum data transfer rate of 48GB/s will be quite correct in the long run.
Northwood Cache latency, clocks
Prescott cache latency, clocks
2 (ALU)/9 (FPU)
4 (ALU)/12 (FPU)
* - if the open page is accessed. TLB is relatively small that is why L3 cache access always occurs to the closed page. In this case the latency is about 50 clocks.
The string of L1 data cache is 64 bytes long. Half the cache string (32 bytes of data) can be loaded from the L2 cache every clock cycle.
L2 cache is arranged with 128-byte strings, which are split in two sectors 64 bytes each. Each of the sectors can be read independently. You will understand what’s behind this organization if you recall that the data is transferred from the memory in 128-byte blocks, while into the memory it goes in 64-byte blocks.
L3 cache is not of that much interest for us here, because it performs an auxiliary function in NetBurst micro-architecture we are discussing today. It can be up to 4MB big, is located on the processor die and is connected to the core with the 64bit bus. The L3 cache access latency is certainly higher than by L2 cache, but it is still much smaller than the memory access latency. Besides, it is bigger than the L2 cache which significantly improves the probability that the requested data is available there.
L2 and L3 caches do not get blocked and can process up to 8 requests at a time. Northwood features 4-channel partially associative L1 cache, and Prescott – 8-channel partially associative L1 cache.