Desktop Kabini Series: Architecture Details
The Kabini announcement changes the entry-level market segment. Until now, processors of this kind, including Intel’s Atom and AMD’s Zacate, were soldered to the mainboard, whereas the Kabini is installed into a socket. AMD thinks that the opportunity to upgrade the processor may be an important advantage, one that may attract users who used to prefer inexpensive tablets, netbooks, nettops, Chromebooks and other substitutes for full-featured PCs.
Four processor models are offered at the launch of the Socket AM1 platform: the Athlon 5350, the Athlon 5150, the Sempron 3850 and the Sempron 2650.
Each model is based on a 28nm semiconductor die and incorporates four or two computing cores with Jaguar architecture and a GCN graphics core with 128 shader processors. Thus, the Socket AM1 compatible Kabini products are very similar in their specs to the mobile processors which have been available for nearly a year. The Athlon 5350 is similar to the A6-5200, and the Athlon 5150 to the A4-5100. The Sempron 3850 and 2650 models are closely related to the E2-3800 and E1-2500. There is a small difference in terms of the graphics core frequency and TDP, but overall the new desktop Kabini series is very similar to the old mobile one. And that’s not good news: AMD has simply been unable to improve the frequency potential of its junior processor series over the last year.
People who hope to use the Socket AM1 platform to build something like the latest gaming consoles from Sony or Microsoft will be disappointed, too. The console processors have 8 Jaguar cores clocked at about 2 GHz, and their GCN graphics core has 768 or more shader processors. In other words, the new desktop Kabini models are clearly inferior to the console APUs.
AMD obviously targets the entry-level market segment, promoting the Socket AM1 platform as a further development of the Brazos 2.0. Compared to the Zacate APUs, the Kabini series is more advanced if only because it has twice the number of computing cores.
The Jaguar architecture itself brings certain improvements over its predecessor, the Bobcat. They are not radical, though, much like the step-by-step changes within the Bulldozer branch of AMD’s processor designs. Developed with energy efficiency in mind, the Jaguar can execute two instructions per clock cycle, equaling Intel’s Silvermont architecture implemented in the Bay Trail series processors. The Jaguar uses out-of-order execution, just like its predecessor. The key changes in this microarchitecture are only meant to make better use of the resources that have been available since the Bobcat, so they can mostly be seen in the front end of the execution pipeline.
First, the L1 instruction cache is complemented with an additional 128-byte loop buffer. It helps avoid repeatedly fetching the same instructions from the L1 cache while executing a loop. It doesn’t improve performance as its latency isn’t any better. It only helps reduce power consumption. Second, the Jaguar features an improved instruction prefetch algorithm. Third, the new microarchitecture has a larger buffer between the L1 cache and the instruction decode unit. And fourth, the execution pipeline is longer by one decode stage to improve frequency potential because the clock rate of the Bobcat design was limited by the decode unit.
There are some changes in terms of instruction execution. The Jaguar features an updated instruction set including SSE4.1/4.2, AES, CLMUL, MOVBE, AVX, F16C and BMI1. The floating-point unit had to be redesigned for that. The Bobcat had a 64-bit FPU whereas the Jaguar has a 128-bit one. As a result, 256-bit AVX instructions are executed in two steps while 128-bit instructions don't have to be split up anymore. The floating-point pipeline is longer by 1 stage in the Jaguar, yet the new microarchitecture claims to be much faster than its predecessor when it comes to vector operations.
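The effect of the wider FPU can be illustrated with a very rough model. The sketch below only counts how many passes through the floating-point datapath a vector operation of a given width needs. The function name is ours, and the model ignores pipelining, latency and everything else that determines real throughput.

```python
# Rough illustration only: passes needed = operand width / datapath width, rounded up.
BOBCAT_FPU_BITS = 64    # per the article
JAGUAR_FPU_BITS = 128   # per the article


def fpu_passes(op_width_bits: int, fpu_width_bits: int) -> int:
    """Number of passes a vector op needs through an FPU of the given width."""
    # Integer ceiling division, no floating point involved.
    return -(-op_width_bits // fpu_width_bits)


if __name__ == "__main__":
    print("128-bit op on Bobcat:", fpu_passes(128, BOBCAT_FPU_BITS))  # 2
    print("128-bit op on Jaguar:", fpu_passes(128, JAGUAR_FPU_BITS))  # 1
    print("256-bit AVX on Jaguar:", fpu_passes(256, JAGUAR_FPU_BITS))  # 2
```

In this simplified view, 128-bit code gets twice the per-pass width it had on the Bobcat, while 256-bit AVX code on the Jaguar is split into two halves, as described above.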
Integer instruction execution has changed, too. Although the Bobcat was quite fast with ordinary code, the Jaguar introduces a new integer division unit borrowed from the K10.5 microarchitecture. Division speed is doubled as a result.
AMD has also made the scheduler's buffers larger to improve the efficiency of out-of-order instruction execution.
In the energy-efficient Bobcat and Jaguar designs, the load-store unit works in the same way as in the regular processors: it can prefetch and reorder data requests. The prefetch algorithm was improved in AMD’s latest Piledriver and Steamroller designs, and that improved version is now implemented in the Jaguar. The result is a 15-percent speedup in data loads and stores.
All of the improvements in the microarchitecture make one Jaguar core about 17% more efficient than a Bobcat core. Taking into account the potentially higher clock rates and the increased number of cores as well, AMD claims the Kabini series is going to be 2 or 4 times as fast as its Zacate predecessors.
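To see how these factors combine, here is a back-of-the-envelope sketch. It treats the 17% figure as a per-clock gain, which is our reading of the claim, and it assumes perfect scaling with clock rate and core count, which real software rarely achieves. The clock rates are illustrative placeholders and are not taken from the article.

```python
def ideal_speedup(per_clock_gain: float,
                  old_clock_ghz: float, new_clock_ghz: float,
                  old_cores: int, new_cores: int) -> float:
    """Upper-bound speedup if per-clock, clock and core-count gains all multiply."""
    return ((1.0 + per_clock_gain)
            * (new_clock_ghz / old_clock_ghz)
            * (new_cores / old_cores))


PER_CLOCK_GAIN = 0.17   # "about 17%" per the article, read here as a per-clock gain
OLD_CLOCK_GHZ = 1.6     # hypothetical Zacate clock, for illustration only
NEW_CLOCK_GHZ = 2.0     # hypothetical Kabini clock, for illustration only

if __name__ == "__main__":
    # Same core count: only the per-clock and clock-rate gains apply.
    same_cores = ideal_speedup(PER_CLOCK_GAIN, OLD_CLOCK_GHZ, NEW_CLOCK_GHZ, 2, 2)
    # Twice the cores, assuming a perfectly parallel workload.
    double_cores = ideal_speedup(PER_CLOCK_GAIN, OLD_CLOCK_GHZ, NEW_CLOCK_GHZ, 2, 4)
    print(f"dual-core vs dual-core: {same_cores:.2f}x")
    print(f"quad-core vs dual-core: {double_cores:.2f}x")
```

The point of the sketch is only that the per-core gain is modest on its own. Most of any large multiple has to come from the doubled core count, so it applies to well-threaded workloads only.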
The new processors’ speed under multithreaded load is affected by the structure of the processing module. In the Bobcat, each core had a dedicated L2 cache (clocked at half the processor’s clock rate) and the cores were connected via an external bus. The Jaguar uses a shared L2 cache instead. One quad-core Kabini module includes a shared 16-way associative L2 cache with a capacity of up to 2 megabytes. For the first time in AMD processors, the cache is inclusive, meaning that it duplicates the data stored in the L1 caches. This requires more cache memory but has a positive effect on multithreaded performance.
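The capacity cost of an inclusive design can be estimated with simple arithmetic. The sketch below compares how much unique data an inclusive and a hypothetical exclusive hierarchy could hold. The L1 size of 64 KB per core (32 KB for instructions plus 32 KB for data) is an assumption that does not come from the article, and the function name is ours.

```python
L2_KB = 2048            # "up to 2 megabytes" per the article
CORES = 4               # one quad-core module, per the article
L1_PER_CORE_KB = 64     # ASSUMPTION: 32 KB instruction + 32 KB data per core


def unique_capacity_kb(l2_kb: int, l1_per_core_kb: int, cores: int,
                       inclusive: bool) -> int:
    """Amount of distinct data the L1 + L2 hierarchy can hold, in KB."""
    if inclusive:
        # Every L1 line is also present in L2, so L1 adds no unique capacity.
        return l2_kb
    # In an exclusive design a line lives in either L1 or L2, never both.
    return l2_kb + l1_per_core_kb * cores


if __name__ == "__main__":
    incl = unique_capacity_kb(L2_KB, L1_PER_CORE_KB, CORES, inclusive=True)
    excl = unique_capacity_kb(L2_KB, L1_PER_CORE_KB, CORES, inclusive=False)
    print(f"inclusive: {incl} KB, exclusive: {excl} KB, "
          f"duplicated: {excl - incl} KB")
```

Under these assumptions about an eighth of the L2 cache mirrors L1 contents. What inclusive designs commonly gain in exchange is that the L2 tags alone can tell whether any core's L1 may hold a given line, which simplifies data sharing between cores.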
Thanks to the more advanced 28nm tech process and certain manufacturing techniques from the GPU field, one Jaguar core fits into an area of 3.1 sq. mm whereas a 40nm Bobcat core used to take up 4.9 sq. mm. So the addition of a large L2 cache doesn’t make the die larger and costlier to make.
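The area figures are easy to check. The following sketch uses only the two per-core numbers quoted above and compares the actual shrink with a naive ideal one, in which area scales with the square of the process node. That ideal figure is a textbook simplification, not something AMD has stated.

```python
JAGUAR_CORE_MM2 = 3.1   # 28nm, per the article
BOBCAT_CORE_MM2 = 4.9   # 40nm, per the article

# Actual area ratio of one Jaguar core to one Bobcat core.
actual_ratio = JAGUAR_CORE_MM2 / BOBCAT_CORE_MM2

# Naive ideal: area scales with the square of the feature size.
ideal_ratio = (28 / 40) ** 2

# Core area only (no L2 cache, GPU or I/O) for the typical configurations.
quad_jaguar_mm2 = 4 * JAGUAR_CORE_MM2
dual_bobcat_mm2 = 2 * BOBCAT_CORE_MM2

if __name__ == "__main__":
    print(f"actual core area ratio: {actual_ratio:.2f}")   # 0.63
    print(f"ideal 40nm->28nm ratio: {ideal_ratio:.2f}")    # 0.49
    print(f"four Jaguar cores: {quad_jaguar_mm2:.1f} sq. mm")  # 12.4
    print(f"two Bobcat cores:  {dual_bobcat_mm2:.1f} sq. mm")  # 9.8
```

So four Jaguar cores need only about 2.6 sq. mm more than two Bobcat cores did, even though each Jaguar core is more complex than a simple die shrink of the Bobcat would be.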
Like AMD’s more advanced APUs, the Kabini features an integrated graphics core with the latest GCN architecture, the same as is used in AMD’s flagship graphics cards. That’s why the Kabini supports all modern APIs: DirectX 11.1, OpenGL 4.3 and OpenCL 1.2. The Kabini is inferior in terms of its GPU resources, though. It has only two execution clusters for a total of 128 shader processors, which is fewer than in the junior graphics cards of the Radeon R5 series. The Kabini's graphics core is referred to as Radeon R3. Besides the 128 shader processors, the integrated core contains eight texture-mapping units and four raster operators. It also includes a command processor and four independent asynchronous compute engines for heterogeneous computing. The Kabini doesn't support HSA technologies, though.
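The GPU figures quoted in this article fit together as follows. The sketch uses only numbers from the text. The one assumption is that the console graphics cores are built from execution clusters of the same width as the Kabini's.

```python
# Figures quoted in the article.
KABINI_SHADERS = 128
KABINI_CLUSTERS = 2
KABINI_TMUS = 8
CONSOLE_SHADERS_MIN = 768   # "768 or more"

shaders_per_cluster = KABINI_SHADERS // KABINI_CLUSTERS   # 64
tmus_per_cluster = KABINI_TMUS // KABINI_CLUSTERS         # 4

# ASSUMPTION: console clusters are as wide as the Kabini's.
console_clusters_min = CONSOLE_SHADERS_MIN // shaders_per_cluster   # 12
console_shader_advantage = CONSOLE_SHADERS_MIN / KABINI_SHADERS     # 6.0

if __name__ == "__main__":
    print(f"{shaders_per_cluster} shaders and {tmus_per_cluster} TMUs per cluster")
    print(f"console GPUs: at least {console_clusters_min} clusters, "
          f"{console_shader_advantage:.0f}x the Kabini's shader count")
```

By this count the console graphics cores have at least six times as many shader processors as the Radeon R3, before any difference in clock rates is considered.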
Despite its inferior specs, the Kabini’s integrated graphics core incorporates full-featured VCE and UVD engines, which means hardware acceleration for decoding video in H.264, VC-1, MPEG-2, MVC, DivX and WMV formats and for encoding H.264 video at Full-HD resolution. The latter function isn’t used in popular transcoding tools for some reason, though.
Unfortunately, for all the improvements in the architecture of its x86 and graphics cores, the Kabini still has a single-channel memory controller. It supports up to DDR3-1600 SDRAM, so the Socket AM1 platform may be limited by memory bandwidth in many applications. The graphics core is going to be impeded by this factor, too.
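The theoretical peak of such a memory subsystem is easy to work out. The sketch below assumes the standard 64-bit width of a DDR3 channel, which the article does not state explicitly. The dual-channel figure is given only for comparison and does not describe any Socket AM1 product.

```python
def peak_bandwidth_gbs(transfers_mt_s: float, bus_bits: int = 64,
                       channels: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    return transfers_mt_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    single = peak_bandwidth_gbs(1600)               # Kabini: one DDR3-1600 channel
    dual = peak_bandwidth_gbs(1600, channels=2)     # hypothetical, for comparison
    print(f"single-channel DDR3-1600: {single:.1f} GB/s")   # 12.8
    print(f"dual-channel DDR3-1600:   {dual:.1f} GB/s")     # 25.6
```

These 12.8 GB/s are shared between the x86 cores and the graphics core, which is why the integrated GPU is likely to feel the limitation first.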
The good news is that the new desktop Kabini, just like its mobile counterpart, is a system-on-chip. So besides the x86 and graphics cores, memory controller and North Bridge, it incorporates a South Bridge with SATA 6 Gbit/s, USB 3.0 and PCI Express 2.0 controllers, so peripheral devices can be connected to the processor directly.
Launching its Kabini processors, AMD dusts off its Athlon and Sempron trademarks, which may be a cause for some confusion because AMD also ships the Richland-based Athlon X4 for Socket FM2 and the Sempron 145 for Socket AM3.
The new Athlon and Sempron processors for low-end desktop PCs are affordable indeed. The senior Kabini costs a mere $55 and, as mentioned above, comes with a full selection of integrated peripheral controllers. It means that mainboards for such processors don’t have to carry those controllers and may be as cheap as $35. With a junior processor model instead of the senior one, the cheapest Socket AM1 configuration is going to cost you a mere $65-70. You only have to add system memory, a storage device and a computer case.
There’s nothing extraordinary about the pricing. Incorporating 914 million transistors, the Kabini semiconductor die is extremely small. Its size is only 105 sq. mm.
AMD offers the following illustration: four Jaguar cores take up about the same area as one dual-core Steamroller module.
Indeed, the die of the latest Kaveri measures 245 sq. mm. We can also offer another example: the dual-core Haswell with GT1 graphics, manufactured on the more advanced 22nm tech process, measures 107 sq. mm.
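For a quick sense of scale, the die figures quoted above can be put side by side. The sketch uses only numbers from this article. Transistor density is computed for the Kabini alone because the text gives no transistor counts for the other two dies.

```python
KABINI_TRANSISTORS = 914e6     # per the article
KABINI_MM2 = 105               # per the article
KAVERI_MM2 = 245               # per the article
HASWELL_DC_GT1_MM2 = 107       # dual-core Haswell with GT1, per the article

# Average transistor density of the Kabini die, in millions per sq. mm.
kabini_density_m_per_mm2 = KABINI_TRANSISTORS / KABINI_MM2 / 1e6

# How much larger the other dies are than the Kabini.
kaveri_vs_kabini = KAVERI_MM2 / KABINI_MM2
haswell_vs_kabini = HASWELL_DC_GT1_MM2 / KABINI_MM2

if __name__ == "__main__":
    print(f"Kabini density: {kabini_density_m_per_mm2:.1f} M transistors per sq. mm")
    print(f"Kaveri die is {kaveri_vs_kabini:.2f}x the Kabini's area")
    print(f"Dual-core Haswell GT1 die is {haswell_vs_kabini:.2f}x the Kabini's area")
```

In other words, the quad-core Kabini takes up less than half the area of the Kaveri and is about the same size as Intel's smallest Haswell die, despite being made on an older process.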