One important point deserves mention: the cache access latencies quoted in the whitepaper documentation differ from those we obtain in real testing, because they do not take the cache operation strategy into account. The quoted values are simply the L2 cache latencies for the case when the cache is completely isolated from all other subsystems.
In practice, however, any request to the L2 cache of the Northwood processor shows 9 or more clock cycles. When new data is requested, the L1 cache is always checked first; if the data is not found there, the search continues down the cache hierarchy. In our case this means the total latency is the sum of 2 clock cycles spent in the L1 cache and 7 clock cycles in the L2 cache.
Accordingly, the actual access latency of the Prescott core comes to 22 clock cycles or even more. Note that these numbers are not easy to obtain: the requests have to be composed in a special way, so that hardware data prefetch cannot influence the measured latency.
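The standard way to compose such requests is a pointer chase: every load depends on the result of the previous one, and the chain of pointers is randomly permuted so the prefetcher cannot guess the next address. Here is a minimal Python sketch of the idea (the buffer size and timing are purely illustrative; a real cycle-accurate measurement would be done in C or assembly):

```python
import random
import time

def build_chain(n, seed=0):
    """Link n slots into a single random cycle; chain[i] holds the next index."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)                      # random order defeats the prefetcher
    chain = [0] * n
    for i in range(n):
        chain[order[i]] = order[(i + 1) % n]
    return chain

def chase(chain, steps):
    """Follow the chain; each access depends on the previous one."""
    i = 0
    for _ in range(steps):
        i = chain[i]
    return i

chain = build_chain(4096)
t0 = time.perf_counter()
chase(chain, 10 * len(chain))
dt = time.perf_counter() - t0
print(f"{dt * 1e9 / (10 * len(chain)):.1f} ns per dependent access")
```

Because every load waits for the previous one to complete, the time per iteration approximates the raw access latency rather than the throughput.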
By the way, since we have come to speak about prefetch, let me remind you that this mechanism was developed even further in Prescott (its major task, as you remember, is to deliver data to the processor core in time). To be more precise, the limitations present in the Northwood core have finally been eliminated. In particular, the Northwood prefetcher could not cross the 4KB boundary of a virtual memory page, which reduced its efficiency tremendously. The problem was that the Northwood prefetch mechanism could fetch data from a page only if that page's translation had already been written into the TLB. If data from the "next" page was requested and that page was not yet in the TLB, the Northwood core would not initiate a page-table walk for the translation and would not add a new TLB entry. Strictly speaking, then, the prefetcher could go beyond the 4KB boundary, but only if the target page was already present in the TLB; it could not add a TLB entry for a page that was not there yet.
In Prescott everything is different: the prefetch mechanism is now more aggressive and is no longer limited by the 4KB virtual memory page boundary. Moreover, it handles loop exits much better.
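The Northwood-era limitation described above can be modeled in a few lines. The function below is our own illustrative model, not Intel's actual logic: a prefetch of the next address succeeds only if it stays within the current 4KB page, or if the destination page already has a TLB entry.

```python
PAGE_SIZE = 4096  # 4KB virtual memory page

def page_of(addr):
    return addr // PAGE_SIZE

def northwood_style_prefetch(addr, stride, tlb):
    """Illustrative model: prefetch the next address only if it stays on
    the same page, or the target page is already in the (simulated) TLB."""
    nxt = addr + stride
    if page_of(nxt) == page_of(addr) or page_of(nxt) in tlb:
        return nxt          # prefetch proceeds
    return None             # prefetch stops at the page boundary

# A 256-byte stride starting near the end of page 0, with only page 0 mapped:
tlb = {0}
print(northwood_style_prefetch(4096 - 128, 256, tlb))   # prints None
```

In Prescott terms, the fix amounts to letting the prefetcher trigger the page-table walk itself, so the `page_of(nxt) in tlb` condition no longer gates the prefetch.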
As a result of all these improvements (as well as the increased bus speed), work with the memory subsystem has become Prescott's major advantage. Work with the caches, however, has become somewhat less efficient, and the maximum bandwidth has dropped, especially for write operations; in fact, we have already mentioned this above.
I have to point out that once the sizes of the L1 and L2 caches changed, the speed of cache conflict processing also changed. Aliasing conflicts are the most important among these. In this context the term describes a situation when two different memory addresses have identical low-order bits, for instance, identical lower 16 bits. If we try to place both memory strings in the cache and access them in turns, they will keep evicting one another from the cache, which greatly reduces caching efficiency.
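The conflict condition is easy to express in code. Two distinct addresses whose lower 16 bits coincide are some multiple of 64KB apart, so they compete for the same place in the cache (the function name below is our own):

```python
def aliases_64k(addr_a, addr_b):
    """Two distinct addresses conflict if their low-order 16 bits match,
    i.e. if they are a non-zero multiple of 64KB apart."""
    return addr_a != addr_b and (addr_a & 0xFFFF) == (addr_b & 0xFFFF)

a = 0x12340
b = a + 0x10000           # exactly 64KB apart: same lower 16 bits
print(aliases_64k(a, b))  # prints True
```

In the scenario from the text, alternately reading two buffers placed like `a` and `b` makes each access evict the other buffer's line, so nearly every access misses despite the tiny working set.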
Since we haven't yet dwelt on aliasing in this article, let's do it now.
To better understand what an aliasing conflict is, we will have to recall the structure of an n-way set-associative cache.
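As a preview, the way an address maps onto such a cache can be sketched as follows. The parameters are illustrative, not Prescott's actual geometry: the set is selected by the address bits just above the cache line offset, so addresses a multiple of `SETS * LINE` bytes apart land in the same set and compete for its few ways.

```python
LINE = 64        # cache line size in bytes (illustrative)
WAYS = 4         # associativity: lines per set (illustrative)
SETS = 256       # number of sets; total size = LINE * WAYS * SETS = 64KB

def set_index(addr):
    """The set is chosen by the address bits just above the line offset."""
    return (addr // LINE) % SETS

# Two addresses 64KB apart select the same set and compete
# for the same WAYS slots:
print(set_index(0x12340), set_index(0x12340 + 64 * 1024))  # same index twice
```

With only `WAYS` slots per set, more than `WAYS` such aliased addresses accessed in turns cannot all stay resident, which is exactly the eviction ping-pong described above.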