TLB-Bug and Its Fix
The description of the notorious TLB-bug is available in AMD’s technical documentation, where it is referred to as ERRATA 298.
The problem is that under certain unlucky circumstances, page translation table entries located in L2 cache, which the OS uses to translate virtual addresses into physical ones, may get duplicated in L3 cache with wrong flag settings. This not only contradicts the exclusive cache architecture, but may also result in data corruption if the wrong entry from the shared L3 cache is used by another processor core. According to the official documents, this duplication can occur only in one very rare case: while the processor is changing the bit flags in L2 for a given page translation table entry, another operation evicts that entry into L3 cache.
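The sequence of events can be illustrated with a toy model. This is only a simplified sketch of the race as described above, not AMD's actual cache logic: the two caches are plain dictionaries, and PTE_ADDR and the function name are hypothetical. The only real-world detail is that the "accessed" flag is bit 5 and the "present" flag is bit 0 of an x86 page table entry.

```python
# Toy model of the race; not AMD's actual cache logic.
# The dict-based caches, PTE_ADDR and the function name are hypothetical.

PRESENT = 0x01     # "present" flag of an x86 page table entry (bit 0)
ACCESSED = 0x20    # "accessed" flag of an x86 page table entry (bit 5)
PTE_ADDR = 0x1000  # made-up address of one page table entry


def set_accessed_with_race(l2, l3):
    """Non-atomic flag update that is interrupted by an eviction."""
    entry = l2[PTE_ADDR]              # 1: the core reads the entry from L2
    l3[PTE_ADDR] = l2.pop(PTE_ADDR)   # 2: another operation evicts the line to L3
    l2[PTE_ADDR] = entry | ACCESSED   # 3: the core writes the modified entry back to L2


l2, l3 = {PTE_ADDR: PRESENT}, {}
set_accessed_with_race(l2, l3)

# The entry now lives in both cache levels, and the L3 copy has stale flags.
duplicated = PTE_ADDR in l2 and PTE_ADDR in l3
stale_in_l3 = l3[PTE_ADDR] != l2[PTE_ADDR]
print(duplicated, stale_in_l3)  # prints: True True
```

In this model another core that reads the entry from the shared L3 would see it without the "accessed" flag, which is the kind of inconsistency the erratum warns about.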
The patch AMD developed immediately after the bug was discovered, which can be activated in the mainboard BIOS Setup, solves this problem very radically: it simply prohibits caching of the page translation tables altogether. As a result, every time a translation cannot be found in the TLB (Translation Lookaside Buffer), which stores recently used mappings from virtual to physical addresses, the processor has to go to main system memory and read the uncached page table. This certainly increases the memory subsystem latency, which is why giving up page table caching can hardly be considered a good solution.
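A back-of-the-envelope model shows why uncached page tables hurt latency. It is only a rough sketch: the function name and every number in it are made-up placeholders rather than measurements of any Phenom processor, and it ignores page-walk caches and any overlap between memory requests. The one architectural fact it relies on is that x86-64 long mode uses four page table levels for 4KB pages.

```python
# Rough latency model; all numbers are hypothetical placeholders,
# not measurements of any real processor.

def avg_access_ns(data_ns, tlb_miss_rate, walk_levels, pte_fetch_ns):
    """Average latency of one memory access including page-walk overhead.

    On a TLB miss the processor reads one page table entry per level
    of the page table hierarchy before it can fetch the data itself.
    """
    walk_ns = walk_levels * pte_fetch_ns
    return data_ns + tlb_miss_rate * walk_ns


# Page table entries served from cache (assumed 5 ns per entry).
cached = avg_access_ns(data_ns=60, tlb_miss_rate=0.25, walk_levels=4, pte_fetch_ns=5)

# Page table entries always read from main memory (assumed 60 ns per entry).
uncached = avg_access_ns(data_ns=60, tlb_miss_rate=0.25, walk_levels=4, pte_fetch_ns=60)

print(cached, uncached)  # prints: 65.0 120.0
```

The absolute values mean nothing here. The point is that the page-walk term grows with the cost of fetching each entry, so workloads with many TLB misses suffer most when the entries cannot be cached.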
Even the simplest synthetic benchmarks measuring the memory subsystem performance reveal dramatic performance drops when this TLB-patch is activated. For example, the charts below show the memory subsystem performance measured in a system with an AMD Phenom X4 9600 processor based on B2 stepping. You can see the results with the patch and without it:
As you can see from the screenshots, enabling this patch results in a latency increase of about 50%. The practical bandwidth also gets worse. As we have already shown in our article called AMD Fan Kit: Phenom 9600 Black Edition CPU + DFI LANParty UT 790FX-M2R Mainboard, the patch also affects the performance in real applications, causing an average drop of about 10% and a slowdown of up to 30% in some individual cases.
Although there are not many cases where the TLB-bug has a serious effect on reliability, and only extremely unlucky desktop users working with some specific applications have a chance of ever running into it, a hardware fix for ERRATA 298 turned into one of the most urgent tasks for AMD.
The new B3 processor stepping does solve the problem at the hardware level without any performance loss and without sacrificing page table caching. According to AMD representatives, the performance of the new processors should be the same as that of CPUs based on B2 stepping with the TLB patch disabled. Synthetic benchmark results confirm this: a Phenom X4 9850 working at a lower 2.3GHz frequency with its integrated North Bridge running at 1.8GHz demonstrates practically the same results as a Phenom 9600 with the patch disabled.
Nevertheless, the results are still a little different. The new processor stepping shows slightly higher latency when working with the memory subsystem. This is probably connected with the new algorithms for handling page translation tables in the cache memory, which no longer pose any potential hazard to the data. However, when we compared the performance of processors based on B2 and B3 stepping in real applications, this difference was hardly noticeable.
Unfortunately, AMD engineers didn’t really explain to us what exactly was done to fix the TLB bug in the new B3 processor stepping. However, some indirect data we have at our disposal gives us reason to believe that now, after the processor core changes the bit flags for page table entries stored in L2 cache, these entries are always evicted into L3 cache. This may be the reason for the slightly higher latency.
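The behavior we suspect can be sketched in the same spirit. This is purely an illustration of our guess, which AMD has not confirmed: the caches are plain dictionaries, and PTE_ADDR and the function name are hypothetical.

```python
# Sketch of the suspected B3 behaviour; a guess, not confirmed by AMD.
# The dict-based caches, PTE_ADDR and the function name are hypothetical.

PRESENT = 0x01     # "present" flag of an x86 page table entry (bit 0)
ACCESSED = 0x20    # "accessed" flag of an x86 page table entry (bit 5)
PTE_ADDR = 0x1000  # made-up address of one page table entry


def set_accessed_then_evict(l2, l3):
    """Update the flag, then move the only copy of the entry into L3."""
    entry = l2.pop(PTE_ADDR) | ACCESSED  # modify the entry and drop it from L2
    l3[PTE_ADDR] = entry                 # the up-to-date entry now lives only in L3


l2, l3 = {PTE_ADDR: PRESENT}, {}
set_accessed_then_evict(l2, l3)

# Exactly one copy exists, and it carries the updated flags.
single_copy = PTE_ADDR in l3 and PTE_ADDR not in l2
print(single_copy, hex(l3[PTE_ADDR]))  # prints: True 0x21
```

If this is what happens, the next access to such an entry has to be served from L3 rather than from the faster L2, which would agree with the slightly higher latency we measured.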