Well, this is the last functional block we have to discuss. The task of the memory subsystem is simple to state: it has to deliver the requested data as quickly as possible and store the obtained results. Since the working frequencies of contemporary memory are far below the working frequencies of contemporary CPUs, caches of different levels get involved at different stages. These are small amounts of high-speed memory used to store the most frequently requested data. The Pentium 4 processor has two cache levels. Some other processors may have three cache levels.
The L2 cache plays the major role in the Pentium 4 processor. In fact, it is exactly the L2 cache that serves as the main data storage for the Pentium 4 micro-architecture. You will find out why later on.
The most frequently requested data is still located in the L1 cache, as it used to be. But unlike in the previous processor generation, the Pentium III, the L1 data cache is relatively small: only 8KB (in the Willamette and Northwood cores) or 16KB (in the Prescott core). However, it is very fast: the cache access latency is only 2 clock cycles compared with 3 clock cycles for Pentium III. The difference doesn’t look immense, but if we take into account the working frequencies (with the same production technology the Pentium III ran at 1GHz, while Willamette ran at 2GHz), it turns out that an L1 data cache access now takes 1ns instead of 3ns. This is a much more impressive difference, isn’t it?
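The arithmetic behind these figures is easy to check. Here is a minimal sketch; the function name latency_ns is ours, and the cycle counts and frequencies are the ones quoted above:

```python
def latency_ns(cycles: int, freq_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    At f GHz one clock cycle lasts 1/f ns, so the latency is cycles / f.
    """
    return cycles / freq_ghz


# Figures quoted in the text: 3 cycles at 1GHz vs. 2 cycles at 2GHz.
pentium3_ns = latency_ns(3, 1.0)
willamette_ns = latency_ns(2, 2.0)

print(f"Pentium III L1 latency: {pentium3_ns:.1f} ns")
print(f"Willamette  L1 latency: {willamette_ns:.1f} ns")
```

So the one-cycle advantage and the doubled clock frequency together give a threefold reduction in absolute access time.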
Since the L1 data cache is not that big, it is quite possible that the requested data will not be there, because 8KB (16KB) is less than the “locality space” of most programs (except specifically written ones). If the necessary data is absent, a request to the L2 cache is initiated. From its predecessor, Pentium 4 inherited a non-blocking L2 cache, which can process up to 8 requests at a time. If the data we are looking for is there, it is copied over the 256bit bus into the L1 data cache. Moreover, the Pentium 4 processor can copy 256 bits of data every clock cycle! This technology is called Advanced Transfer Cache and is a continuation of the corresponding technology implemented in the Pentium III processor, which could copy 256 bits of information every second clock cycle. This bus bandwidth is really needed, because if the CPU is addressing different cache lines, it cannot start working with the second line until the transfer of the first line is finished. That is why, even though a data transfer rate like that might seem high and even excessive from the theoretical point of view, it proves necessary in real life.
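To get a feeling for what a 256bit bus means in absolute terms, here is a rough sketch of the peak transfer rate. The function name is ours, and the 1GHz and 2GHz clock frequencies are assumed for illustration (they are the example frequencies used earlier for Pentium III and Willamette); real sustained rates will be lower than this theoretical peak.

```python
def peak_bandwidth_gb_s(bus_bits: int, freq_ghz: float, cycles_per_transfer: int) -> float:
    """Theoretical peak bandwidth of a cache bus in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_bits // 8  # a 256bit bus moves 32 bytes at once
    return bytes_per_transfer * freq_ghz / cycles_per_transfer


# Pentium III: 256 bits every second clock cycle, assumed 1GHz clock.
pentium3_bw = peak_bandwidth_gb_s(256, 1.0, cycles_per_transfer=2)
# Pentium 4: 256 bits every clock cycle, assumed 2GHz clock.
pentium4_bw = peak_bandwidth_gb_s(256, 2.0, cycles_per_transfer=1)

print(f"Pentium III L2 -> L1 peak: {pentium3_bw:.0f} GB/s")
print(f"Pentium 4   L2 -> L1 peak: {pentium4_bw:.0f} GB/s")
```

Under these assumptions the Pentium 4 bus has four times the peak rate: twice from transferring on every cycle and twice from the higher clock.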
If the requested data is not available in the L2 cache, the Bus Unit turns to RAM by sending a corresponding request there. Of course, RAM takes much longer to respond than the L2 cache does (the L2 cache latency is 7 clock cycles in the Northwood core, while in the Prescott core the situation is completely different: 18 clock cycles). A request to RAM is an emergency situation for the CPU, because the response from the memory will take hundreds of processor clock cycles. This is why we need the Data Prefetch mechanism so badly: it is supposed to predict in advance what data we are going to need, so that the data can be delivered in time.
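The cost of walking down this chain (L1, then L2, then RAM) can be estimated with a simple average-access-time model. The sketch below is only an illustration: the hit rates (90% for L1, 95% for L2) and the 200-cycle RAM latency are made-up assumptions, not measurements, and each level's latency is treated as an extra cost paid by every request that reaches that level. Only the 2, 7 and 18 cycle figures come from the text.

```python
def average_access_cycles(levels, ram_cycles):
    """Average access time in cycles for a cache hierarchy.

    levels: list of (latency_cycles, hit_rate) tuples, ordered from L1 outward.
    ram_cycles: latency paid by requests that miss in every cache level.
    """
    total = 0.0
    reach_prob = 1.0  # probability that a request gets as far as this level
    for latency, hit_rate in levels:
        total += reach_prob * latency
        reach_prob *= 1.0 - hit_rate
    return total + reach_prob * ram_cycles


L1_HIT, L2_HIT = 0.90, 0.95  # assumed hit rates, for illustration only
RAM_CYCLES = 200             # assumed value for "hundreds of clock cycles"

northwood = average_access_cycles([(2, L1_HIT), (7, L2_HIT)], RAM_CYCLES)
prescott = average_access_cycles([(2, L1_HIT), (18, L2_HIT)], RAM_CYCLES)

print(f"L2 latency 7 cycles:  {northwood:.2f} cycles on average")
print(f"L2 latency 18 cycles: {prescott:.2f} cycles on average")
```

Even with only 0.5% of all requests reaching RAM in this model, those requests contribute about one full cycle to the average, which is why every miss the prefetcher manages to hide matters.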
But let’s get back to the cache. The L2 cache is 256KB in the Willamette core, 512KB in the Northwood core and 1MB in the Prescott core. Moreover, Pentium 4 XE has a 2MB L3 cache (in Xeon processors the size of the L3 cache may reach 4MB). This cache is connected to the core with a 64bit bus, features higher latency than the L2 cache, and is also inclusive, i.e. it duplicates the contents of the L2 cache in itself. Nevertheless, you can see the effect of the L3 cache in many programs: even though it is considerably slower than the cache levels above it in the hierarchy, it is still much faster than RAM.
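What “inclusive” means for the usable capacity can be shown with a tiny calculation. This is a simplified sketch: the function name is ours, the 512KB L2 size for the CPU with the 2MB L3 cache is an assumption made for the example, and the exclusive case is shown only as a hypothetical comparison.

```python
def unique_capacity_kb(l2_kb: int, l3_kb: int, inclusive: bool) -> int:
    """Amount of distinct data the L2 and L3 caches can hold together.

    An inclusive L3 keeps a copy of everything in L2, so L2 adds no new
    capacity; an exclusive L3 would hold only data that is not in L2.
    """
    return max(l2_kb, l3_kb) if inclusive else l2_kb + l3_kb


L2_KB = 512       # assumed L2 size for this example
L3_KB = 2 * 1024  # the 2MB L3 cache from the text

print("inclusive L3:", unique_capacity_kb(L2_KB, L3_KB, inclusive=True), "KB")
print("exclusive L3 (hypothetical):", unique_capacity_kb(L2_KB, L3_KB, inclusive=False), "KB")
```

In other words, with an inclusive design the total on-die capacity for distinct data is set by the L3 cache alone.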
Well, these were the major blocks of the processor architecture split into functional units. Of course, this is not a full description, just an illustration of how the CPU works. We are going to take a closer look at the functioning of selected processor systems in the next chapters. So keep on reading! :)