The industry chose the second way a long time ago. This unit is already available in the CPU and it is called the decoder. It translates the x86 commands into instructions of a very simple format, and these instructions are now known as micro-operations (uop-s).
The idea of micro-operations is not that new: we already talked about it back in the times of Pentium Pro, Pentium II and Pentium III. The functional units of the Pentium 4 processor likewise work with micro-operations rather than x86 commands, and these uop-s are generated from the x86 instructions by the decoder unit.
Let’s digress for a moment. I believe you are quite used to the so-called Harvard cache scheme, with one cache for data and a separate cache for instructions. This is exactly the way the cache of the Pentium III processor was organized. But unlike Pentium III with its fairly traditional instruction decoding scheme, Pentium 4 can boast some significant modifications.
Instead of the traditional instruction cache that stored x86 code, Intel introduced a modified instruction cache called the Trace cache. It is located after the decoder but before the other processor units. It no longer stores x86 instructions, but the result of their decoding, i.e. uop-s. Here the decoder works independently of all other processor units, filling the Trace cache with uop-s at a rate of at most one x86 instruction per clock cycle. If the Trace cache does not contain the uop-s for the code requested by the CPU, the instructions are loaded relatively slowly from the L2 cache and decoded on the fly. Of course, the decoding and data transfer at this stage are strictly determined by the executed program code. The micro-operations are taken from the decoder as they become ready, and the total throughput at this point never exceeds one x86 instruction per clock cycle.
When the same part of the code is requested from the Trace cache a second time, the corresponding uop-s will be found there. The decoded instructions are then transferred at up to 6 uop-s per two clock cycles (the Trace cache works at half the frequency). Moreover, the CPU doesn’t have to decode this code again.
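To make the hit and miss behaviour described above more tangible, here is a minimal Python sketch of a cache of decoded uop-s. It is only a toy model: the names `TraceCache`, `fetch` and `toy_decode` are made up for illustration, the cache is keyed by a single instruction address and never evicts anything, and none of this reflects the real organization of the Pentium 4 Trace cache.

```python
from typing import Callable, Dict, List


class TraceCache:
    """Toy model: stores decoded uop-s keyed by x86 instruction address."""

    def __init__(self, decode: Callable[[int], List[str]]) -> None:
        self._decode = decode                  # slow path: the decoder
        self._lines: Dict[int, List[str]] = {}  # address -> decoded uop-s
        self.hits = 0
        self.misses = 0

    def fetch(self, address: int) -> List[str]:
        """Return the uop-s for an address, decoding only on a miss."""
        if address in self._lines:
            self.hits += 1
            return self._lines[address]
        # Miss: run the decoder once and keep the result for later requests.
        self.misses += 1
        uops = self._decode(address)
        self._lines[address] = uops
        return uops


def toy_decode(address: int) -> List[str]:
    """Hypothetical decoder: pretends every instruction yields two uop-s."""
    return [f"uop_{address:#x}_{i}" for i in range(2)]


cache = TraceCache(toy_decode)
# A small loop body executed three times: only the first pass is decoded.
for _ in range(3):
    for addr in (0x100, 0x104, 0x108):
        cache.fetch(addr)

print(cache.misses, cache.hits)
```

The first pass over the loop body pays for decoding, while the next two passes are served entirely from the cache, which is exactly the saving the Trace cache is designed for.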
Actually, this saves quite a significant amount of resources: despite Intel’s extreme secrecy regarding the number of clock cycles involved in decoding, there is some indirect evidence that the “length” of the decoder is at least 10-15 and at most 30 clock cycles. In other words, it is comparable with the rest of the Pentium 4 pipeline. The probability that the requested part of the code is available in the Trace cache is about 75-95% for an average program.
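A rough back-of-envelope estimate shows why these numbers matter. The sketch below assumes a deliberately simplified model that is ours, not Intel's: a Trace cache hit costs no decoder cycles at all, while a miss exposes the full decoder length. Real hardware overlaps much of this work, so treat the result as an illustration of the trend only. The function name `average_decode_penalty` is hypothetical.

```python
def average_decode_penalty(hit_rate: float, decoder_cycles: int) -> float:
    """Average decoder cycles exposed per Trace cache lookup.

    Simplified model (an assumption): a hit costs 0 decoder cycles,
    a miss costs the whole decoder length.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) * decoder_cycles


# Sweep the ranges quoted in the text: 75-95% hit rate, 10-30 cycle decoder.
for hit_rate in (0.75, 0.95):
    for decoder_cycles in (10, 30):
        penalty = average_decode_penalty(hit_rate, decoder_cycles)
        print(f"hit rate {hit_rate:.0%}, decoder {decoder_cycles} cycles: "
              f"{penalty:.2f} cycles per lookup on average")
```

Under this model even the worst case in the sweep (75% hit rate, 30-cycle decoder) exposes 7.5 decoder cycles per lookup on average instead of 30.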
In most cases each x86 instruction is turned into 1-4 micro-operations. This is the way simple instructions are decoded. As a rule, each micro-operation occupies one cell in the Trace cache; however, there are some special cases when two uop-s share a single Trace cache cell. Complex x86 instructions are transformed into special uop-s (MROM-vectors), or into a combination of MROM-vectors and regular micro-operations. These MROM-vectors are a kind of tag for a certain chain of uop-s, and they occupy minimal space in the Trace cache. When the corresponding part of the code needs to be executed, the MROM-vector is sent to the Microcode ROM, which responds with the chain of normal micro-operations marked with this vector. This storage scheme saves a lot of space in the Trace cache.
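The MROM-vector mechanism can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the types `Uop` and `MromVector`, the table `MICROCODE_ROM`, the uop names and the length of the microcode chain are all invented, since the actual encoding is not public.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Uop:
    """A regular micro-operation stored directly in the Trace cache."""
    name: str


@dataclass(frozen=True)
class MromVector:
    """A compact tag that stands for a whole chain of uop-s in the ROM."""
    index: int


# Hypothetical Microcode ROM contents: vector index -> chain of uop-s.
MICROCODE_ROM: Dict[int, List[Uop]] = {
    0: [Uop("load"), Uop("store"), Uop("inc_ptr"),
        Uop("dec_count"), Uop("branch")],
}


def expand(entries: List[Union[Uop, MromVector]]) -> List[Uop]:
    """Replace every MROM-vector with the uop chain it refers to."""
    expanded: List[Uop] = []
    for entry in entries:
        if isinstance(entry, MromVector):
            expanded.extend(MICROCODE_ROM[entry.index])
        else:
            expanded.append(entry)
    return expanded


# Trace cache contents: two simple uop-s around one complex instruction.
trace = [Uop("add"), MromVector(0), Uop("sub")]
executed = expand(trace)

print(len(trace), len(executed))
```

In this toy example three Trace cache cells expand into seven executed uop-s, which illustrates where the space saving comes from: the long chain lives in the Microcode ROM once, and the Trace cache holds only the tag.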