Getting Ready to Meet Intel Core 2 Duo: Core Microarchitecture Unleashed

The Internet is full of Conroe processor previews, and today we decided to contribute to the overall excitement. Our article contains all the information on the Core microarchitecture you need to understand why AMD is going to have a really hard time.

by Ilya Gavrichenkov
06/29/2006 | 05:13 PM

The Conroe processor, due to come out very soon, seems to be the most eagerly awaited newcomer of this year. Conroe will be the first desktop processor with the new Core microarchitecture developed by Intel’s Israeli engineering team (the same team that designed the successful Pentium M processor). At this time it looks like a panacea for all the issues that have emerged over the last 6 years. Now we can openly state that the NetBurst microarchitecture, first launched at the end of 2000, didn’t live up to expectations. The use of the NetBurst architecture in Intel desktop processors made these solutions less and less popular, especially in the retail market. Users couldn’t put up with the fact that the growing performance of the Pentium 4 and Pentium D processors went hand in hand with rapidly increasing heat dissipation and power consumption.

 

But Intel learned from its mistakes. At the beginning of this year the company officially announced a radical change in the way it optimizes the consumer features of its CPUs. Intel shifted from chasing maximum performance numbers to finding the best ratio between performance and power consumption. This basic principle guided Intel engineers when they were working on the new Core Microarchitecture (initially called Next Generation Microarchitecture) that became the basis for Conroe processors.

I would like to point out that Intel has been severely criticized for formulating the task in this slightly unusual manner. From the performance-to-power-consumption perspective, even extremely economical CPUs may be considered efficient, although their performance level would hardly be acceptable in the desktop space. However, the new goal shouldn’t be taken too literally. By switching to new priorities, Intel engineers first of all emphasized that from now on they will pay special attention not only to the performance of their processors, but also to their power consumption, which is another very important characteristic of the end product.

Therefore, you shouldn’t regard low heat dissipation and power consumption as the primary advantage of the new Core Microarchitecture based processors. In reality, Intel engineers also did their best to improve performance tremendously. According to preliminary test results, Conroe has every chance of becoming one of the fastest desktop processors, which fuels the growing public interest in it. The arrival of the Conroe processor family has every chance of ousting (even if only temporarily) the Athlon 64 family from the top league into the budget and mainstream market.

All this explains why we should pay special attention to Conroe processors and Core Microarchitecture in general. Before we get a chance to dig into the actual performance analysis, it would be very useful to spend some time on the theory behind it. In today’s article we will try to reveal the major strengths of the Core Microarchitecture that should ensure the high performance and low heat dissipation of the upcoming Conroe processors. We will also discuss some preliminary data on Conroe performance and sum up everything we know about the model range of the upcoming CPU family.

Intel Core Microarchitecture: the Basics

According to a well-known formula popularized by AMD back in the days when they introduced their processor rating system, performance is the processor clock frequency multiplied by the number of instructions the CPU can perform in a single clock cycle. So there are two major ways of increasing processor performance: raising the clock speed and increasing the number of instructions processed per clock cycle. The first parameter is pretty clear, while the second is determined by the internal CPU structure and depends on the number of functional units such as instruction decoders and execution blocks.

Moreover, there is one more way to speed up a CPU: reduce the number of operations needed to process the same amount of data. A great illustration of the progress made in this direction is the SSE, SSE2 and SSE3 SIMD instructions, which process several data elements with a single instruction and thus complete vector operations much faster.

As for power consumption, it is the processor clock frequency multiplied by the square of the processor core voltage (Vcore²) and by the dynamic capacitance, a constant that is determined by the CPU microarchitecture and depends on the number of transistors and their activity during CPU operation.

As a result, we can conclude that in order to optimize a microarchitecture for the best ratio between performance and power consumption, the developers need to balance the number of instructions the CPU can process per clock cycle against its dynamic capacitance. Processor Vcore also has a serious effect on the performance-to-power-consumption ratio; however, it hardly depends on the microarchitecture and is determined mostly by the manufacturing process. The clock frequency doesn’t affect this ratio at all, because it enters both formulas as a multiplier and cancels out. These ideas were all taken into account when Intel developed Core Microarchitecture.
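
As a rough illustration of the two formulas above, the following Python sketch computes performance, dynamic power and their ratio. The function names and the sample numbers (4 instructions per clock, 1.2V, a capacitance constant of 10) are assumptions chosen purely for illustration and are not Intel specifications. The sketch shows that the clock frequency cancels out of the ratio, while Vcore enters it squared.

```python
import math

def performance(freq_ghz, ipc):
    """Performance model from the text: clock frequency times instructions per clock."""
    return freq_ghz * ipc

def dynamic_power(freq_ghz, vcore, c_dyn):
    """Power model from the text: frequency * Vcore^2 * dynamic capacitance constant."""
    return freq_ghz * vcore ** 2 * c_dyn

def perf_per_watt(freq_ghz, ipc, vcore, c_dyn):
    """Ratio of the two models; algebraically equal to ipc / (vcore^2 * c_dyn)."""
    return performance(freq_ghz, ipc) / dynamic_power(freq_ghz, vcore, c_dyn)

# Hypothetical sample values, not measurements of any real CPU.
slow = perf_per_watt(freq_ghz=2.0, ipc=4, vcore=1.2, c_dyn=10.0)
fast = perf_per_watt(freq_ghz=3.0, ipc=4, vcore=1.2, c_dyn=10.0)
low_v = perf_per_watt(freq_ghz=2.0, ipc=4, vcore=1.0, c_dyn=10.0)

print(math.isclose(slow, fast))  # checks that frequency cancels out of the ratio
print(low_v > slow)              # checks that a lower Vcore improves the ratio
```

In this simplified model only the instructions per clock, the core voltage and the capacitance constant influence the ratio, which is exactly where the text says the designers have to look.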

Based on the requirements listed above, Intel engineers decided to give up NetBurst (which is not surprising at all) in favor of the microarchitecture of their mobile processors, because these processors, which descend from Pentium Pro, Pentium II and Pentium III, boast relatively high performance and are very economical in terms of power. However, the new Core microarchitecture has been significantly improved and enhanced in order to deliver higher performance, a wider range of features and lower power consumption. As a result, it would be absolutely incorrect to claim that the upcoming processors are nothing but a Pentium M adapted for a new field of application.

You can see this just by listing the main formal specs of the Core microarchitecture. For example, Intel processors based on Core Microarchitecture can process up to four instructions per clock cycle, which is more than their predecessors could do, including those based on the NetBurst microarchitecture. So the upcoming Intel processors should theoretically be faster than any other contemporary CPU working at the same clock speed, including the competitors from AMD. The execution pipeline of processors based on Core Microarchitecture is 14 stages long. This means that the frequencies of the upcoming processors will definitely be lower than those of Pentium 4 and Pentium D with their pipeline of more than 30 stages. However, if we consider “performance per watt”, a shorter pipeline is an indisputable advantage.

As for the specifics, the first CPUs with Intel Core Microarchitecture will have a dual-core design (within a single die), 64KB of L1 cache per core (32KB for data and 32KB for instructions) and a shared L2 cache of 2MB or 4MB. It is extremely important to point out that Core Microarchitecture based processors will support the 64-bit Extended Memory 64 Technology (EM64T) extensions. This is a significant distinguishing feature of the new microarchitecture versus Pentium M processors, which, just like their successors, Core Duo CPUs, do not support 64-bit modes.

The peculiarities of Core Microarchitecture allow designing CPUs with different features for various market segments. The developers claim that by lowering the clock frequency by only 15%, they can cut the peak power consumption of the future processors in half. This gives the green light to three parallel processor families for the mobile, server and desktop markets at the same time. The new notebook processors, known as Merom, will be designed around a typical heat dissipation limit of 35W. As a result, notebooks on these CPUs will run about 20% faster than those on Intel Core Duo CPUs, while the battery life will remain the same. The server processors known as Woodcrest will be 80% faster than today’s dual-core Xeon CPUs, while their typical power consumption will be about 35% lower, at 80W. As for the desktop processors, they are known as Conroe. Conroe performance is forecast to grow by about 40% compared with the top models of the current Pentium D 9XX family, while typical power consumption will drop by about the same 40%. As a result, the power consumption of the upcoming desktop processors (except the models targeted at computer enthusiasts) will stay within 65W.
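
To put these forecasts into “performance per watt” terms, here is a small Python sketch that divides the claimed performance gain by the claimed power ratio for each family. The dictionary and function names are made up for this example, and treating Merom’s power consumption as unchanged is an assumption based on the statement that battery life stays the same.

```python
# (performance ratio, power ratio) versus the predecessor, as quoted in the text.
# Merom's power ratio of 1.0 is an assumption derived from "battery life will remain the same".
CLAIMS = {
    "Merom": (1.20, 1.00),
    "Woodcrest": (1.80, 0.65),
    "Conroe": (1.40, 0.60),
}

def perf_per_watt_gain(perf_ratio, power_ratio):
    """How many times the performance-per-watt ratio improves over the predecessor."""
    return perf_ratio / power_ratio

GAINS = {name: perf_per_watt_gain(*ratios) for name, ratios in CLAIMS.items()}

for name, gain in GAINS.items():
    print(f"{name}: {gain:.2f}x performance per watt")
```

With these inputs Conroe comes out at roughly 2.3 times the performance per watt of its predecessor and Woodcrest at roughly 2.8 times, provided the forecasts hold.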

The performance and power consumption numbers we have just discussed look very impressive. However, it is hard to believe that CPUs derived from the Pentium M microarchitecture can really deliver them. Therefore, it is high time we discussed the innovations introduced in the new Intel Core Microarchitecture, which should dispel these doubts.

Major Innovations

Intel Wide Dynamic Execution

The term “Dynamic Execution” was first mentioned back in the days of Pentium Pro, Pentium II and Pentium III. By dynamic execution in these CPUs Intel meant the then principally new superscalar P6 microarchitecture, which could analyze the data flow and supported speculative and out-of-order instruction execution. When the CPUs moved to the NetBurst microarchitecture, Intel started talking about enhanced dynamic execution that could perform more in-depth data flow analysis and featured improved branch prediction algorithms.

The new Core Microarchitecture features “wide” dynamic execution. It is wide because the future Intel processors will be able to process more instructions per clock cycle than their predecessors. By adding a decoder and execution units to each core, Intel enabled each core to fetch and process up to 4 x86 instructions simultaneously, while other Intel processors (desktop and mobile) and AMD competitors can only handle three instructions per clock. Core Microarchitecture offers 6 dispatch ports (one Load, two Store and three universal ports) fed by four decoders (one for complex instructions and three for simple instructions). Moreover, Core microarchitecture acquired a more advanced branch prediction unit and larger instruction buffers that are involved at different stages of data flow analysis to optimize execution.

I would like to remind you that the predecessors of the new Core Microarchitecture, Pentium M processors, boasted a very interesting micro-ops fusion technology that reduced the overhead of executing certain x86 instructions. The idea behind micro-ops fusion is very simple. If an x86 instruction splits into several microinstructions, the decoder ties them together so that the CPU is guaranteed to execute them in a certain order. The CPU treats them as a single instruction all the way until the actual execution stage. This avoids the CPU stalls that would occur if the connected microinstructions were split apart by the out-of-order execution algorithms.

In addition to the extremely successful micro-ops fusion technology, Core Microarchitecture has also acquired what Intel calls macrofusion. This technology increases the number of instructions processed per clock cycle. Certain pairs of successive x86 instructions, such as a comparison followed by a conditional branch, are presented to the CPU as a single microinstruction. The scheduler handles this microinstruction and the CPU executes it as a single instruction. This way the code executes faster and even some power is saved.
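
The idea of macrofusion can be sketched in a few lines of Python. This is a toy model working on instruction mnemonics only, not a description of Intel’s decoder: the function names are hypothetical, and the choice of `cmp` and `test` as the fusible first instructions is an assumption made for the example (the text only mentions a comparison followed by a conditional branch).

```python
# Assumed set of instructions that may start a fusible pair (illustrative only).
FUSIBLE_FIRST = {"cmp", "test"}

def is_conditional_jump(mnemonic):
    """Treat every jump mnemonic except the unconditional 'jmp' as conditional."""
    return mnemonic.startswith("j") and mnemonic != "jmp"

def macro_fuse(instructions):
    """Merge each compare + conditional-jump pair into one fused operation."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (current in FUSIBLE_FIRST and following is not None
                and is_conditional_jump(following)):
            fused.append(f"{current}+{following}")  # the pair occupies a single slot
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused

stream = ["mov", "cmp", "jne", "add", "test", "jz", "jmp"]
print(macro_fuse(stream))
```

In this example seven x86 instructions shrink to five operations, which is how a four-wide decoder can, in the ideal case, accept five instructions in one clock.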

Intel Advanced Digital Media Boost

Another direction of Core Microarchitecture improvement was the modification of the SIMD instruction units (SSE, SSE2, SSE3). Contemporary software for image, video and sound editing, data encryption, scientific and financial tasks uses a lot of SSE instructions that work with all sorts of 128-bit operands (vector and integer data as well as high-precision real values).

This fact alone pushed Intel engineers to think about ways to speed up the processor SSE units, especially since today’s Intel processors need two clock cycles to process one SSE instruction working with 128-bit operands. They use the first clock cycle to process the first 64 bits, and the second clock cycle to process the second 64 bits. The new Core microarchitecture will make SSE instruction processing twice as fast as it used to be. Future CPUs will feature 128-bit SSE units, so the amount of data the CPU can handle per clock cycle will increase, especially in tasks that use a lot of SIMD instructions, such as various multimedia applications.
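
The doubling described above follows from simple arithmetic, sketched below in Python. The model is deliberately crude: it assumes a single SSE unit, ignores pipelining and instruction latency, and the function name is hypothetical.

```python
import math

def sse_cycles(num_instructions, operand_bits=128, unit_width_bits=64):
    """Clock cycles needed when each operand is processed in unit-width chunks."""
    passes_per_instruction = math.ceil(operand_bits / unit_width_bits)
    return num_instructions * passes_per_instruction

old_cycles = sse_cycles(1000, unit_width_bits=64)    # 64-bit unit: two passes per operand
new_cycles = sse_cycles(1000, unit_width_bits=128)   # 128-bit unit: one pass per operand
print(old_cycles, new_cycles)
```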

Besides speeding up the SIMD execution units, Intel has once again revised the SSE instruction set. As a result, the SSE3 instruction set acquired 8 new instructions. In fact, this SSE3 extension had been planned since the days of the Tejas processor. However, since Tejas was cancelled, the extension found its way into the new Core Microarchitecture.

Intel Advanced Smart Cache

Since Core Microarchitecture was designed as dual-core right from the start, the developers could optimize some functional units of the upcoming processors accordingly. Unlike all other desktop processors available these days, CPUs with Core Microarchitecture will share their L2 cache between the cores. This cache works similarly to the one you can find in today’s dual-core Intel Core Duo mobile processors.

There are a few evident advantages to this approach to cache implementation. Firstly, the CPU can flexibly adjust the size of the cache portion used by each core. In other words, either of the two cores of a Core Microarchitecture based CPU can get the entire L2 cache at its disposal when the other core is idle. If both cores work at the same time, the cache is split proportionally to the frequency of the memory requests sent by each core. Moreover, if both cores work on the same data simultaneously, this data will be stored only once in the shared L2 cache. In other words, the shared intelligent L2 cache of the Core Microarchitecture processors is much more efficient and effectively even more capacious than two separate caches assigned to each core.
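
The proportional sharing described above can be modeled with a few lines of Python. Please treat it only as a sketch of the behavior the text describes: the function name is hypothetical, and real hardware arrives at such a split through its cache replacement logic rather than through an explicit formula like this one.

```python
def split_shared_cache(total_kb, requests_core0, requests_core1):
    """Split a shared cache between two cores in proportion to their request counts."""
    total_requests = requests_core0 + requests_core1
    if total_requests == 0:
        # Assumption for the idle case: no core has a claim, so report an even split.
        return total_kb / 2, total_kb / 2
    share0 = total_kb * requests_core0 / total_requests
    return share0, total_kb - share0

# A 4MB cache: core 0 issues three times as many requests as core 1.
print(split_shared_cache(4096, 300, 100))
# Core 1 is idle, so core 0 gets the whole cache.
print(split_shared_cache(4096, 500, 0))
```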

A shared cache may also be very useful for dual-core processors in some other cases. Take, for instance, the current discussion of Core Multiplexing Technology, which indicates that Intel engineers are ready to offer their solution for dynamically disabling the second processor core depending on the type of workload. Of course, a single cache can help resolve a lot of technical issues with the implementation of this initiative.

The second significant advantage of a shared L2 cache is that it tremendously reduces the load on the system memory and the processor bus. In this case the system doesn’t have to control and ensure the coherency of the caches of different cores. If the system features a dual-core CPU with a separate cache for each core and both cores work with the same data at a certain time, this data will be duplicated in both caches. It is therefore important to make sure that both caches hold the latest data. Before the data is fetched from the L2 cache for further processing, each processor core has to make sure that the data hasn’t been modified by the other core. And if the data has been modified, the cache needs to be updated immediately. In NetBurst based systems this update is performed via the system bus and system memory. With a shared cache for both cores you can forget about this inconvenient algorithm once and for all.

Moreover, the CPUs with Core microarchitecture will have special controlling core logic that will allow exchanging data between the L1 caches of each processor core through the shared L2 cache. As a result, the cores will work more efficiently together on the same task.

Intel Smart Memory Access

The technologies combined under this general name are designed to eliminate or reduce the delays that occur when the processor accesses the data it needs. Of course, prefetching data from memory into the lower-latency L1 and L2 processor caches is a great solution here. I have to say that data prefetch algorithms have been used in Intel CPUs for a long time now. However, this functional unit will be considerably enhanced in the new CPUs with Core Microarchitecture.

Core Microarchitecture implements six independent data prefetch units. Two units prefetch data from memory into the shared L2 cache, and two more units work with the L1 cache of each of the CPU cores (four in total). Each of these units independently tracks the data access patterns of the execution units (streaming data or data read with a certain stride within an array). Based on the accumulated statistics, the data prefetch units try to load data into the processor cache even before the corresponding request is made.

Also, the L1 cache of each processor core of the Intel Core Microarchitecture based CPUs features the instruction prefetch unit that works similarly.
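
The pattern tracking performed by the data prefetch units can be illustrated with a minimal stride detector in Python. This is a simplified sketch and not Intel’s algorithm: the class name, the single tracked stream and the `degree` parameter (how many addresses to fetch ahead) are assumptions made for the example.

```python
class StridePrefetcher:
    """Toy prefetcher: once two consecutive accesses show the same stride,
    it proposes the next few addresses along that stride."""

    def __init__(self, degree=2):
        self.last_addr = None
        self.last_stride = None
        self.degree = degree  # how many addresses ahead to prefetch (assumed value)

    def access(self, addr):
        """Record a demand access and return the list of addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Prefetch only when the same non-zero stride has been seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * k for k in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches

prefetcher = StridePrefetcher(degree=2)
history = [prefetcher.access(a) for a in (100, 164, 228, 1000)]
print(history)
```

The first two accesses only establish the 64-byte stride, the third one confirms it and triggers prefetches, and the jump to an unrelated address produces none.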

Besides the improved data prefetch, Intel Smart Memory Access includes one more interesting technology called memory disambiguation. This technology is intended to improve the efficiency of the out-of-order algorithms when reading data from and writing data to memory. The thing is that contemporary processors supporting out-of-order execution do not allow a read to start until the preceding writes have been completed. The reason is that the scheduler doesn’t know whether the loaded data depends on the stored data.

However, very often successive store and load instructions are not connected with one another in any way. That is why the inability to change their execution order may sometimes leave the execution units underloaded, thus reducing the overall CPU efficiency. Memory disambiguation technology is intended to resolve this issue. It uses special algorithms that determine with very high probability whether successive store and load instructions depend on each other, and thus allows applying out-of-order execution to these instructions as well.

This way, if the memory disambiguation algorithm works correctly, the CPU can utilize its execution units more efficiently. If the dependency between load and store instructions has been determined incorrectly (which happens very rarely, according to the developers), memory disambiguation technology should detect the conflict, reload the correct data and initiate re-execution of the code.
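
The predict, verify and re-execute cycle described above can be sketched as follows. The article does not disclose how Intel’s predictor works, so the per-load saturating counter used here is purely an assumption for illustration, as are all class, function and parameter names.

```python
class DisambiguationPredictor:
    """Toy predictor: one saturating counter per load instruction (hypothetical scheme).
    A load may be moved ahead of pending stores once its counter reaches the threshold."""

    def __init__(self, threshold=2, max_count=3):
        self.counters = {}
        self.threshold = threshold
        self.max_count = max_count

    def may_hoist(self, load_pc):
        return self.counters.get(load_pc, 0) >= self.threshold

    def update(self, load_pc, conflicted):
        if conflicted:
            self.counters[load_pc] = 0  # a conflict resets the confidence
        else:
            count = self.counters.get(load_pc, 0)
            self.counters[load_pc] = min(self.max_count, count + 1)

def execute_load(predictor, load_pc, load_addr, pending_store_addrs):
    """Return how the load was handled: 'in-order', 'hoisted' or 'replay'."""
    conflict = load_addr in pending_store_addrs
    hoisted = predictor.may_hoist(load_pc)
    predictor.update(load_pc, conflict)
    if hoisted and conflict:
        return "replay"  # misprediction: reload the data and re-execute
    return "hoisted" if hoisted else "in-order"

predictor = DisambiguationPredictor()
# Three loads that never alias with the pending store, then one that does.
outcomes = [execute_load(predictor, 0x40, 0x1000, [0x2000]) for _ in range(3)]
outcomes.append(execute_load(predictor, 0x40, 0x1000, [0x1000]))
print(outcomes)
```

The load is executed in order until the predictor has gained enough confidence, then it is hoisted, and the single aliasing case is caught and replayed, which resets the confidence for that load.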

The use of data prefetch algorithms together with memory disambiguation technology increases the efficiency of processor work with the memory. It not only reduces the possible delays and idling of the processor execution units, but also lowers the latency during memory access and uses the bus bandwidth more efficiently.

Intel Intelligent Power Capability

When working on the new Core Microarchitecture, Intel engineers tried to optimize the “performance per watt” parameter as much as possible. Besides, this microarchitecture is also intended for notebook processors, so the developers paid special attention to technologies that reduce the heat dissipation and power consumption of the upcoming CPUs. Of course, the new processors will have Demand Based Switching technologies at their disposal (primarily Enhanced Intel SpeedStep and Enhanced Halt State). However, we are not going to talk about them here.

CPUs based on Core microarchitecture will be able to dynamically disable subsystems that are not being used at the given moment. And we are not talking about the whole core here. The processor is decomposed into much lower-level units. Each of the processor cores is split into a lot of units and internal busses, which are powered separately with the help of additional logic circuits. The main peculiarity of the circuits within Intel Intelligent Power Capability is that they do not increase the CPU response time to external events when it needs to reactivate previously disabled units.

Note that the possibility to deactivate different CPU units on the fly pushed the developers to revise the way the processor temperature was measured. CPUs based on Core Microarchitecture will be equipped with a few thermal diodes on the core close to those spots that tend to heat most. To process all this thermal data the CPU will be equipped with a special circuitry that will determine the highest temperature. The CPU will report this particular temperature value to the user and hardware monitoring systems.

Microarchitecture Comparison: Intel Core vs. AMD K8

Of course, Intel’s upcoming processors based on Core Microarchitecture will compete primarily with AMD K8 CPUs. These processors are today’s most advanced solutions. Let’s take a close theoretical look at Intel’s new Core Microarchitecture against the background of the good old AMD K8:

 

Parameter                   Intel Core                        AMD K8
L1 data cache               32 KB                             64 KB
L1 instruction cache        32 KB                             64 KB
L1 latency                  3 clock cycles                    3 clock cycles
L1 associativity            8-way                             2-way
L1 TLB size                 128 instr. + 256 data entries     32 instr. + 32 data entries
Max. L2 cache               4 MB for two cores                1 MB for each core
L2 latency                  14 clock cycles                   12 clock cycles
L2 associativity            16-way                            16-way
L2 cache bus width          256 bit                           128 bit
L2 TLB size                 ?                                 512 entries
Pipeline                    14 stages                         12 stages
x86 decoders                1 complex and 3 simple            3 complex
Integer execution units     3 ALU + 2 AGU                     3 ALU + 3 AGU
Load/Store units            2 (1 Load + 1 Store)              1
FP execution units          FADD + FMUL + FLOAD + FSTORE      FADD + FMUL + FSTORE
SSE execution units         3 (128-bit)                       2 (64-bit)

This table explains a lot right away. Most importantly, processors with Core microarchitecture have a “wider” architecture that allows processing more instructions per clock cycle than CPUs with K8 microarchitecture. Although the execution units of both competing architectures can process up to three x86 and x87 instructions per clock cycle, Core Microarchitecture should prove more efficient with SSE instructions. While K8 processors can execute only one 128-bit instruction per clock, Core can process up to three such instructions.

Moreover, Core Microarchitecture boasts another great advantage: a more advanced decoding system. Together with the four decoders, macrofusion technology allows decoding up to five instructions per clock (in the ideal case). The competitor’s processors can only decode three instructions simultaneously. All this indicates that the decoders of Core Microarchitecture based CPUs will be able to load the processor execution units better, sustaining up to four instructions per clock under optimal conditions. In this case overall instruction execution will be 33% faster than on AMD K8 processors (four instructions per clock instead of three).

Here I would also like to mention the more efficient data handling algorithms of the CPUs on Core Microarchitecture. The advantages of this microarchitecture show best in the data caching system. Although the L1 cache of the Core based processors is smaller, it has higher associativity. As for the L2 cache, it is not only bigger but also has higher bandwidth. Moreover, the shared structure of the L2 cache is beneficial for multi-threaded workloads.

An important addition to the data prefetch algorithms of the new Core based processors is the unique memory disambiguation technology that has no analogues in the competitor’s solutions. It lets the upcoming Intel processors reorder even more of the code they execute.

In fact, the only indisputable advantage of the AMD K8 microarchitecture that will survive the arrival of Core is the integrated memory controller, which definitely ensures lower memory access latency. However, whether an integrated memory controller will be enough for AMD to put up a worthy fight against the Conroe processors is a very tough question that we will have to answer later. AMD engineers are not sitting idle, either: the future Athlon 64 cores scheduled to come out in early 2008 will be free from some architectural bottlenecks. But that is a different story and a different article.

Core Microarchitecture for Desktops: Core 2 Duo CPUs

Now that we have discussed all the major peculiarities of the new Core Microarchitecture from the theoretical perspective, let’s try to find out what this microarchitecture will bring to actual desktop platforms.

Conroe processors, the desktop implementation of Core Microarchitecture, are expected to come out at the end of July. The official name of the Conroe processors is Core 2 Duo. Of course, this name points out very clearly that these CPUs belong to the new microarchitecture.

I have to stress that Intel is going to push sales of the new processors very aggressively in order to avoid accusations of a “paper launch” right before the very active “back-to-school” sales season. On the launch date not only will Intel’s leading partners announce the availability of their solutions based on the new microarchitecture, but end-users will also be able to buy the long-awaited CPU in stores. I don’t think we should doubt Intel’s ability to meet this schedule: the company already has quite a few samples available, which indicates that there are hardly any architectural or production issues that could slow down Conroe’s arrival in the market. This is all the more likely since Conroe processors will be manufactured with the well-debugged P1264 65nm technology. In other words, Intel will continue using the same process it already uses for its current 65nm processors.

The first Core 2 Duo processors that we will see in the market will feature 2MB or 4MB of L2 cache shared between the two cores. At first their frequencies will start at 1.86GHz and reach 2.93GHz for the top models. Later on, as they win more market share, the clock speed range of the product family will be expanded both ways.

CPUs on Core Microarchitecture will use the Quad Pumped Bus that has already proven very efficient in all market segments. Core 2 Duo processors will have this bus working at 1067MHz, at least at first. Since Intel retained the old bus, there was no need to experiment with new processor packaging. Therefore, Conroe will be manufactured in the same LGA775 package as today’s Pentium 4 and Pentium D CPUs.

However, the use of the same packaging doesn’t automatically imply that the CPUs will be compatible with older mainboards. The mainboard will have to support a front side bus frequency of 1067MHz to work with the new Core 2 Duo processors. But this is not all. The mainboard must also feature a new voltage regulator module (VRM 11). That is why the manufacturers will have to design new mainboard revisions based on the Intel 975X Express, Intel P965 Express, Nvidia nForce 5XX Intel Edition or ATI Xpress 3200 Intel Edition chipsets.

The Core 2 Duo processor rating will be formed the same way as the rating of mobile Core Duo CPUs. It will look like EXXXX, where the letter “E” indicates that the product belongs to the desktop family and the following 4-digit number reflects the performance level and technical advancement of the solution.

Note that the Core 2 Duo family will also have an “Extreme Edition” model. This CPU will be called Core 2 Extreme and its rating will look like XXXXX (the letter “X” followed by a 4-digit number, as in X6800). The main difference between the Core 2 Extreme and Core 2 Duo (besides the extremely high price) will be its higher clock speed.

By the launch date the Conroe processor family will look as follows:

CPU                   Clock frequency, GHz  L2 cache, MB  Bus frequency, MHz  Typical heat dissipation, W  Price, $
Core 2 Extreme X6800  2.93                  4             1066                75                           999
Core 2 Duo E6700      2.67                  4             1066                65                           530
Core 2 Duo E6600      2.4                   4             1066                65                           316
Core 2 Duo E6400      2.13                  2             1066                65                           224
Core 2 Duo E6300      1.86                  2             1066                65                           183

Performance Preview

No doubt, we can expect a lot from the new processors on Core Microarchitecture. However, in order to get at least some idea of what the performance level of the new Core 2 Duo processors could be, we need some practical results. Unfortunately, we cannot share any results from our own lab with you, because we are working on the review under an NDA with Intel. However, this is no obstacle for curious readers who are willing to dig up the first data on the actual performance of the Core 2 Duo processors.

A lot of web sites have already published the first performance tests of Core 2 Duo engineering samples obtained from unofficial sources. Even Intel itself has given the media the opportunity to test the promising new processor on its own premises many times.

For example, we would highly recommend checking out the results of one such test session at the Computex 2006 show in Taipei, conducted by the respected AnandTech site. In this article you can see a side-by-side comparison of the top-of-the-line desktop processors from AMD and Intel: AMD Athlon 64 FX-62 (2.8GHz) and Intel Core 2 Extreme X6800 (2.93GHz).

The results obtained during this test session indicate that Core 2 Extreme X6800 processor is on average 22% faster than the competitor in office applications, 17% faster in digital content creation and processing applications, and 21% faster in games.

Frankly speaking, no comment is necessary here. The practical performance results confirm all the theoretical conclusions we made above. Core 2 Duo processors definitely call the competitiveness of CPUs with K8 architecture into question. Of course, there are a lot of aspects that make this or that CPU attractive for the user. Performance is certainly not everything today. We still need to find out more about the overclocking potential of the Core 2 Duo processors, their practical heat dissipation, etc. However, the preliminary performance results we have just seen do not look very optimistic for AMD. Until AMD engineers introduce a fresh update to their K8 architecture, Athlon 64 processors will not be able to hold their ground in high-performance platforms.

Conclusion

In fact, it is too early for any definite conclusions. All the statements we make in this article are based only on theoretical data and on preliminary test results of early engineering samples that were posted on the web. However, it looks like the days when Intel processors were falling behind their rivals from AMD are over. The CPUs with Core Microarchitecture will certainly change the situation in the processor market, and most likely not in AMD’s favor.

For now, I would like to wind up our discussion, because we will be able to go into more detail about this promising Intel solution only on the official launch day. So, stay tuned!