The easiest way is to imagine the cache as a 3D cube, as in the picture below. The number of layers (lines) in each set of the cache equals the number of banks, or ways (shown in the picture). A vertical section of this cube is a set of cache lines sharing the same index.
Let’s consider a real cache as an example. Take, for instance, the L1 data cache of the Northwood CPU, with a total size of 8KB. We know that the L1 cache of this processor core is a four-way set-associative cache with 64 bytes of data in each line.
It consists of four ways, each containing 32 sets (set 1, set 2, set 3, …). In the picture above, each cache channel, or way, is marked as Line 1, Line 2, Line 3 and Line 4. Each line is 64 bytes long. So the total cache size is calculated as follows: 4 × 32 × 64 = 8,192 bytes, or 8KB.
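The geometry above can be sketched in a few lines of code. This is just an illustration of the arithmetic from the text, with the constant names chosen here for readability:

```python
# Northwood L1 data cache geometry as described in the text.
WAYS = 4          # associativity (the "channels" or ways in the picture)
SETS = 32         # sets per way
LINE_BYTES = 64   # bytes of data in each cache line

total_bytes = WAYS * SETS * LINE_BYTES
print(total_bytes)  # 8192 bytes = 8 KB
```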
When we need to find a certain address, the following happens. The cache controller logic processes the low-order address bits: bits 6 to 10 select the set we need. Then bits 11 to 15 identify the way within that set whose tag matches our initial request. As a result, we get the desired element.
This is exactly where the first obstacle appears. Since we cannot process all the address bits fast enough, we process only some of them. But two different addresses with identical low-order 16 bits look identical from the cache controller’s point of view. As a result, when we request these two different addresses, they keep evicting one another from the cache.
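The bit fields described above can be modeled directly. The helper below is purely illustrative (the name `northwood_index` and the exact field split are taken from the text’s description, not from any hardware documentation); it shows why two addresses 64KB apart are indistinguishable to the controller:

```python
def northwood_index(addr):
    """Split an address the way the text describes for Northwood:
    bits 6-10 select the set, bits 11-15 form the 5-bit partial tag."""
    offset  = addr & 0x3F            # bits 0-5: byte within the 64-byte line
    set_idx = (addr >> 6) & 0x1F     # bits 6-10: one of 32 sets
    tag     = (addr >> 11) & 0x1F    # bits 11-15: 5-bit partial tag
    return set_idx, tag, offset

a = 0x12345
b = a + 0x10000                      # same low 16 bits, 64 KB apart
print(northwood_index(a) == northwood_index(b))  # True: the controller
# cannot tell them apart, so the lines evict each other (64 KB aliasing)
```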
This is what is called the aliasing problem.
In the Prescott core this problem was solved in a very smart way. The set is still selected by bits 6 to 10, but the way is identified differently: now bits 11 to 21 are used. In other words, 11 bits are involved instead of 5. So now we only face the aliasing problem for addresses that are 4MB apart, rather than 64KB apart as in Northwood.
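The Prescott change amounts to widening the partial tag. Again, `prescott_index` is an illustrative name for a sketch of the scheme the text describes, not real hardware code:

```python
def prescott_index(addr):
    """Prescott scheme per the text: same set selection, wider tag."""
    set_idx = (addr >> 6) & 0x1F     # bits 6-10 still select the set
    tag     = (addr >> 11) & 0x7FF   # bits 11-21: 11-bit partial tag
    return set_idx, tag

a = 0x12345
print(prescott_index(a) == prescott_index(a + 0x10000))   # False: 64 KB
# apart no longer aliases, because bit 16 is now part of the tag
print(prescott_index(a) == prescott_index(a + 0x400000))  # True: 4 MB
# apart still aliases (bit 22 lies outside the checked bits)
```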
Moreover, the associativity of the L1 cache has also changed slightly: it is now an 8-way set-associative cache. There are still 32 sets, and each line still contains 64 bytes. The total size is 8 × 32 × 64 = 16,384 bytes, or 16KB.
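The doubled capacity follows directly from the doubled associativity, with the other parameters unchanged:

```python
# Prescott L1 data cache geometry from the text:
# doubled associativity, same set count and line size as Northwood.
WAYS, SETS, LINE_BYTES = 8, 32, 64
print(WAYS * SETS * LINE_BYTES)  # 16384 bytes = 16 KB
```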
As a result, the cache access speed dropped somewhat (cache latency is now 4 clocks instead of 2), but the aliasing problem has been almost completely eliminated. The performance penalties caused by aliasing became really rare. To be more exact, each penalty grew bigger, but they started to occur much more rarely.
It would be reasonable to assume that the longer tags are one of the major reasons for the higher cache access latency.
I should also mention that 64KB aliasing is not the only type of aliasing in the Northwood core; it also suffers from 1MB aliasing. Its nature is certainly less evident. Our best guess is that the processing of bits 21-30 is postponed until the data arrives from the cache. In most cases this speeds up overall processing, but sometimes the delayed handling of the remaining address bits leads to collisions.
In fact, the aliasing problem is not inevitable for an L1 cache. The K7/K8 CPUs, in particular, do not have it, because they always check all bits of the requested address.
It’s time we took a closer look at the Pentium 4 processor pipeline and summed up everything we know about it.