Writing Data into the Write Buffer
The story here is rather amusing: it began when we discovered entirely new replay loops that seemed to appear out of nowhere. Naturally, we had to find out where they came from and why. At first the investigation went very slowly, because the loops would emerge often and then disappear for no apparent reason. A closer look revealed a connection with writing data into the Write Buffer. Let's go over a few details.
The L1 cache of the Pentium 4 processor uses a write-through update policy, which implies that the contents of both the L1 and L2 caches must be updated at the same time. The write is therefore actually made to the L2 cache, but the results of multiple operations are first accumulated in the Write Buffer (WB), so that the transfer to the L2 cache takes as few transactions as possible. When a line that is not yet present in the WB is written, it is blocked for reading for a while.
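The benefit of accumulating stores before sending them to L2 can be illustrated with a toy model. This is only a sketch, not a description of the actual hardware: it assumes 64-byte lines and a buffer large enough to hold every touched line, and it simply counts how many transfers to L2 would be needed with and without combining. All function names are hypothetical.

```python
LINE_SIZE = 64  # bytes per line; assumed for this illustration


def line_of(address):
    """Return the index of the line that contains the given byte address."""
    return address // LINE_SIZE


def transfers_without_combining(store_addresses):
    """Model in which every store goes to L2 as its own transaction."""
    return len(store_addresses)


def transfers_with_combining(store_addresses):
    """Model in which stores to the same line are merged before the transfer.

    Simplification: the buffer is assumed to hold every touched line until
    all stores have been issued, so each touched line is sent exactly once.
    """
    return len({line_of(a) for a in store_addresses})


stores = [0, 8, 16, 24, 64, 72, 128]  # seven stores that touch three lines
print(transfers_without_combining(stores))  # 7
print(transfers_with_combining(stores))     # 3
```

In this model, seven stores that fall into three lines need three transfers instead of seven, which is the effect the Write Buffer is meant to achieve.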
For reads from the Write Buffer to be blocked, several conditions must be met:
- The line containing the store address must be present in the L1 cache;
- No entry has yet been allocated for this line in the Write Buffer;
- Some time must pass between the write of the line into the Write Buffer and the execution of the read request for this line.
Note that the addresses of the read and write operations may differ: they only need to belong to the same line. According to our test results, if we read from the first half of the line (bytes 0-31), the Load command goes to replay within 21-33 clock cycles after the Store. If we read from the second half of the line (bytes 32-63), the Load command goes to replay within 21-34 clock cycles after the Store. The difference of only one clock cycle between the lower and upper 256 bits of the line indicates that the modified data is merged with the unmodified data from the cache during this period.
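The measurements above can be summarized in a small helper. This is a sketch under stated assumptions: it reads the measured ranges as inclusive windows counted in clock cycles after the Store, it assumes 64-byte lines, and it takes for granted that the other conditions from the list above (line present in L1, no Write Buffer entry yet) already hold. The names are hypothetical.

```python
LINE_SIZE = 64  # bytes per line; assumed for this illustration

# Replay windows in clock cycles after the Store, taken from the measurements
# quoted in the text and interpreted here as inclusive ranges.
REPLAY_WINDOW = {
    "lower": (21, 33),  # Load reads bytes 0-31 of the line
    "upper": (21, 34),  # Load reads bytes 32-63 of the line
}


def same_line(store_addr, load_addr):
    """True if both addresses fall into the same line."""
    return store_addr // LINE_SIZE == load_addr // LINE_SIZE


def half_of_line(addr):
    """Return 'lower' for bytes 0-31 of a line and 'upper' for bytes 32-63."""
    return "lower" if addr % LINE_SIZE < LINE_SIZE // 2 else "upper"


def load_is_replayed(store_addr, load_addr, cycles_after_store):
    """Predict a replay for a Load issued some cycles after a Store.

    Assumes the line is in L1 and had no Write Buffer entry before the Store.
    """
    if not same_line(store_addr, load_addr):
        return False
    first, last = REPLAY_WINDOW[half_of_line(load_addr)]
    return first <= cycles_after_store <= last


print(load_is_replayed(0x1000, 0x1010, 25))  # same line, lower half
print(load_is_replayed(0x1000, 0x1020, 34))  # same line, upper half
print(load_is_replayed(0x1000, 0x1040, 25))  # different line
```

The store and load addresses differ in all three calls; only the first two share a line with the Store, so only they can fall into a replay window in this model.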
The Pentium 4 Northwood processor has a Write Buffer of 6 lines, 64 bytes each. That is not much, so if read and write operations come close together, we will see more than just write delays: it will also be impossible to read the saved data from the Write Buffer immediately, which causes further delays and sends read commands through the replay system.
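To get a feel for how small this buffer is, the sketch below computes its total capacity from the figures above and checks whether a burst of stores touches more lines than the buffer has. It is a deliberately simple model: it ignores how and when lines are drained to L2, which the text does not specify, and the function names are hypothetical.

```python
LINE_SIZE = 64  # bytes per Write Buffer line, as stated for Northwood
WB_LINES = 6    # number of Write Buffer lines, as stated for Northwood


def wb_capacity_bytes():
    """Total amount of data the Write Buffer can hold."""
    return WB_LINES * LINE_SIZE


def lines_touched(store_addresses):
    """Set of line indices written by a burst of stores."""
    return {a // LINE_SIZE for a in store_addresses}


def fits_in_write_buffer(store_addresses):
    """True if the burst touches no more lines than the buffer provides.

    Simplification: no line is drained to L2 while the burst is in flight.
    """
    return len(lines_touched(store_addresses)) <= WB_LINES


dense = list(range(0, 384, 8))   # 48 stores packed into 6 consecutive lines
sparse = list(range(0, 448, 64))  # 7 stores, each to a different line

print(wb_capacity_bytes())           # 384
print(fits_in_write_buffer(dense))   # True
print(fits_in_write_buffer(sparse))  # False
```

In this model, 48 closely packed stores fit into the buffer, while just 7 stores spread over 7 different lines already exceed it.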