Intel Wide Dynamic Execution
The term “Dynamic Execution” goes back to the days of the Pentium Pro, Pentium II and Pentium III. When Intel spoke of dynamic instruction execution in these CPUs, it meant the fundamentally new superscalar P6 microarchitecture, which could analyze the data flow and supported speculative and out-of-order instruction execution. When its CPUs moved to the NetBurst microarchitecture, Intel started talking about enhanced dynamic execution, which performed more in-depth data flow analysis and featured improved branch prediction algorithms.
The new Core Microarchitecture brings “wide” dynamic execution. It is called wide because the future Intel processors will be able to process more instructions per clock cycle than their predecessors. By adding an extra decoder and extra execution units to each core, Intel enabled each core to fetch and process up to four x86 instructions simultaneously, while other Intel processors (desktop and mobile) and competing AMD processors can only handle three instructions per clock. Core Microarchitecture has four decoders (one for complex instructions and three for simple instructions) feeding six dispatch ports (one Load, two Store and three universal ports). Moreover, Core Microarchitecture acquired a more advanced branch prediction unit and larger instruction buffers, which are used at different stages of data flow analysis to optimize execution.
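As a rough illustration of what the extra width means at peak, the sketch below computes the minimum number of clock cycles needed to push a run of instructions through a front end of a given width. It is an idealized bound that assumes no stalls, no dependencies and no limits on complex instructions; the function name `min_cycles` and the figure of 120 instructions are made up for this example.

```python
def min_cycles(instruction_count: int, width: int) -> int:
    """Best-case cycles to process `instruction_count` instructions
    when at most `width` instructions can be handled per clock."""
    if instruction_count < 0 or width <= 0:
        raise ValueError("instruction_count must be >= 0 and width must be > 0")
    # Ceiling division: a partially filled cycle still costs a full cycle.
    return (instruction_count + width - 1) // width


# Hypothetical run of 120 simple instructions.
three_wide = min_cycles(120, 3)  # a 3-wide core
four_wide = min_cycles(120, 4)   # a 4-wide core such as Core Microarchitecture
```

In this best case the four-wide core needs 30 cycles against 40 for the three-wide one. Real code rarely sustains the peak rate, so the actual gain is smaller.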
I would like to remind you that the predecessors of the new Core Microarchitecture, the Pentium M processors, boasted an extremely interesting micro-ops fusion technology that reduced the overhead of executing certain x86 instructions. The idea behind micro-ops fusion is very simple. If an x86 instruction splits into several micro-ops, the decoder links them to one another. Micro-ops fusion ties these micro-op sequences together to ensure that the CPU executes them in a certain order, and the CPU treats them as a single operation all the way to the actual execution stage. This prevents the stalls that could occur if the linked micro-ops were pulled apart by the out-of-order execution algorithms.
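The bookkeeping effect of micro-ops fusion can be sketched with a toy model. This is not Intel's decoder logic: the `DECODE` table, the instruction strings and the function names are hypothetical, and the model only counts how many buffer entries the out-of-order machinery has to track with and without fusion.

```python
# Hypothetical decode table: x86 instruction form -> micro-ops it splits into.
DECODE = {
    "add reg, reg": ["alu"],
    "add reg, [mem]": ["load", "alu"],              # load + operate
    "mov [mem], reg": ["store_addr", "store_data"],  # two-part store
}


def buffer_entries(instructions, fusion):
    """Return the entries tracked before execution.

    With fusion, all micro-ops of one instruction share a single entry;
    without it, every micro-op occupies its own entry.
    """
    entries = []
    for insn in instructions:
        uops = DECODE[insn]
        if fusion:
            entries.append(tuple(uops))
        else:
            entries.extend((uop,) for uop in uops)
    return entries


def execute(entries):
    """At the execution stage a fused entry splits back into its micro-ops."""
    return [uop for entry in entries for uop in entry]


code = ["add reg, [mem]", "mov [mem], reg", "add reg, reg"]
unfused = buffer_entries(code, fusion=False)
fused = buffer_entries(code, fusion=True)
```

For these three instructions the unfused model tracks five entries and the fused one only three, while the same five micro-ops reach the execution units in both cases.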
In addition to the extremely successful micro-ops fusion technology, Core Microarchitecture has also acquired what Intel calls macrofusion. This technology increases the number of instructions processed per clock cycle. Certain pairs of successive x86 instructions, such as a comparison followed by a conditional branch, are presented to the CPU as a single micro-op. The scheduler handles this micro-op, and the CPU then executes it, as a single operation. This way the code runs faster and even consumes a little less power.
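A minimal sketch of the macrofusion idea, assuming a much simplified rule: any `cmp` or `test` immediately followed by a conditional jump is merged into one operation. The names `FUSIBLE_FIRST`, `is_conditional_jump` and `macrofuse` are invented for this example, and real hardware applies further restrictions that are not modeled here.

```python
# Assumed set of instructions that may start a fused pair (simplification).
FUSIBLE_FIRST = {"cmp", "test"}


def is_conditional_jump(mnemonic):
    """Treat every j* mnemonic except the unconditional jmp as conditional."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def macrofuse(instructions):
    """Merge each compare + conditional-jump pair into a single operation."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (current in FUSIBLE_FIRST and following is not None
                and is_conditional_jump(following)):
            fused.append(current + "+" + following)  # one op instead of two
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused


program = ["mov", "add", "cmp", "jne", "mov", "test", "jz", "jmp"]
fused_program = macrofuse(program)
```

Here eight instructions become six operations, because `cmp`/`jne` and `test`/`jz` each travel through the pipeline as one unit.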