As Advanced Micro Devices prepares to launch its next-generation microprocessors with Steamroller high-performance x86 cores, enthusiasts are uncovering details about its fourth-generation Bulldozer-family core, code-named Excavator. It appears that this core will support the 256-bit AVX2 instruction set, which may mean significant architectural changes compared with existing Bulldozer-family cores.

AMD recently submitted a patch to the GCC community that enables support for its future high-performance micro-architecture code-named Excavator, which the compiler identifies as "bdver4". The initial patch provides only general support for Excavator on the Linux operating system, but even that reveals some of the new core's secrets. Based on the information released by AMD, Excavator will support all the instructions found in Intel's modern Haswell microprocessors, including SSE4.1, SSE4.2, AES, PCLMUL, AVX, BMI, F16C, MOVBE, AVX2, BMI2, RDRND and others.

The most important disclosure is support for the AVX2 instructions that Intel introduced with Haswell earlier this year. Although these instructions are barely used by software today, they require major hardware changes compared with previous generations, which makes them clearly important for the future.

The original AVX brought 256-bit floating-point SIMD instructions; AVX2 extends operations on the 256-bit YMM registers to integer data types. The problem with current AMD hardware is that the Bulldozer FPU supports only 128-bit integer operations, such as those used in the XOP instruction set, reports the HardwareLuxx website.
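
To put the register widths in perspective, the short Python sketch below simply divides register width by element width. XMM, YMM and ZMM are the x86 names for the 128-, 256- and 512-bit vector registers; the dictionaries and the lanes() helper are illustrative only. It shows that a 256-bit AVX2 integer instruction works on twice as many elements as a 128-bit SSE or XOP integer instruction.

```python
# Illustrative only: how many integer elements fit in each x86 vector register.
REGISTER_BITS = {"XMM": 128, "YMM": 256, "ZMM": 512}
ELEMENT_BITS = {"byte": 8, "word": 16, "dword": 32, "qword": 64}


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements (lanes) of element_bits that fit in one register."""
    return register_bits // element_bits


for reg, rbits in REGISTER_BITS.items():
    counts = ", ".join(
        f"{lanes(rbits, ebits)} {name}s" for name, ebits in ELEMENT_BITS.items()
    )
    print(f"{reg} ({rbits}-bit): {counts}")
# XMM (128-bit): 16 bytes, 8 words, 4 dwords, 2 qwords
# YMM (256-bit): 32 bytes, 16 words, 8 dwords, 4 qwords
# ZMM (512-bit): 64 bytes, 32 words, 16 dwords, 8 qwords
```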

To support AVX2 instructions, AMD will need either to considerably upgrade its floating-point unit (FPU), which is shared between the two integer cores of a Bulldozer module, or to develop a new one from scratch. An all-new FPU would presumably bring dramatic performance improvements, but even a redesigned one should be noticeably faster than the existing unit in many demanding applications that process large amounts of data.

Considering Excavator's timing (2015 or even 2016), a redesign of the FPU is entirely logical and necessary. By the time Excavator ships commercially, Intel is expected to have released its brand-new Skylake high-performance micro-architecture with support for 512-bit AVX instructions, known as AVX 3.2, so upgrading the FPU is simply a must.

AMD did not comment on the news-story.

Tags: AMD, Bulldozer, Excavator, Steamroller, Piledriver, Haswell, Broadwell

Discussion

Comments currently: 32
Discussion started: 10/19/13 07:54:39 AM
Latest comment: 11/18/13 09:51:26 PM

1. 
And I imagine such big changes will require a new socket (AM4), which AMD really needs, since the current AM3+ socket dates all the way back to the original AM3 socket that came out in 2009.
[Posted by: SteelCity1981 | Date: 10/19/13 07:54:39 AM]

I don't expect to see any new AMx socket. I think we will have only FMx sockets in the future.
[Posted by: john_gre | Date: 10/19/13 11:55:30 AM]

AMx and FMx sockets will both be useless upgrade paths once DDR4 ships. The AMx socket replacement will likely be a single-socket version of whatever replaces the C32 socket. FMx will obviously be replaced by a new socket too, but I doubt it will be called FM3 or FM4; they will likely use a new name for it and make a full break because of DDR4 topology changes.
[Posted by: sollord | Date: 10/19/13 07:29:07 PM]

Kaveri will support DDR4 and therefore so will the FM2+ platform. There are rumors of it even supporting GDDR5, and that used to be the case with early Kaveri prototypes, not sure if they still support it.
[Posted by: Hakob Panosyan | Date: 10/20/13 05:38:05 AM]

(Post hidden)
[Posted by: siuol11 | Date: 10/19/13 10:00:55 PM]

5 years isn't often lol.
[Posted by: SteelCity1981 | Date: 10/20/13 09:52:10 AM]

If you design it right, you only need a socket update every time a new generation of RAM comes out or something else changes dramatically (like on-die PCIe controllers).
[Posted by: Countess | Date: 10/20/13 11:58:29 PM]

Remember to tell Intel about this too!
[Posted by: MHudon | Date: 10/21/13 11:10:44 PM]

2. 
The basic premise of this article seems to be that 256-bit AVX2 support implies a 256-bit-wide FPU. That is incorrect.

The decoders in most modern x86 cores translate from the x86 ISA to an internal instruction set consisting of uops (Intel terminology) or MacroOps (AMD terminology). It is entirely possible to implement a "wide" ISA like AVX2 on a "narrow" FPU by decoding each 256-bit source instruction to 2 or more 128-bit uops/MacroOps. That can be done either via a complex decoder that emits multiple uops/MacroOps per input instruction (AMD calls this "VectorPath" if memory serves) or via microcode.
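
As a toy illustration of the splitting described above, the Python sketch below models a 256-bit packed 32-bit integer add (what VPADDD does on YMM registers) executed either natively or as two independent 128-bit halves. The function names are hypothetical; a real decoder emits micro-ops, not function calls, and this is not a description of any specific AMD or Intel implementation.

```python
LANE_BITS = 32
LANES_PER_128 = 128 // LANE_BITS  # four 32-bit lanes in each 128-bit half


def add_dwords(a, b):
    """Lane-wise 32-bit add with wraparound, like a packed integer add."""
    return [(x + y) & 0xFFFF_FFFF for x, y in zip(a, b)]


def add_256_native(a, b):
    # One operation covering all eight 32-bit lanes of a 256-bit register.
    return add_dwords(a, b)


def add_256_cracked(a, b):
    # The same 256-bit instruction split into two 128-bit micro-ops:
    # one for the low four lanes, one for the high four lanes.
    lo = add_dwords(a[:LANES_PER_128], b[:LANES_PER_128])
    hi = add_dwords(a[LANES_PER_128:], b[LANES_PER_128:])
    return lo + hi


a = [1, 2, 3, 4, 5, 6, 7, 0xFFFF_FFFF]
b = [10, 20, 30, 40, 50, 60, 70, 1]
print(add_256_cracked(a, b))  # [11, 22, 33, 44, 55, 66, 77, 0]
```

The split works here because a packed add never moves data between lanes. Instructions that shuffle data across the full 256-bit width, such as AVX2's VPERMD, cannot be divided into two independent halves this simply.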

Some previous examples:

1. Intel's Pentium III (and its derivatives in the Pentium-M series) supported the 128-bit SSE ISA, but the vector units were only 64 bits wide.

2. AMD's Bobcat supported the 128-bit SSE ISA, but also only had 64-bit vector units.

One catch is that the AVX2 gather instruction is somewhat problematic to split. That's reportedly why Intel didn't implement AVX/AVX2 in Silvermont (Silvermont also directly executes most non-microcoded x86 instructions, so that would also make splitting more complicated).
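
A simplified behavioral model of an AVX2 dword gather (VPGATHERDD-style semantics) helps show why it is awkward to split: every active lane is a separate load from its own address, and the mask is cleared lane by lane to record progress. This Python sketch ignores faults, sign extension of indices and address-size details, and models memory as a dict from byte address to 32-bit value, which is purely an assumption for illustration.

```python
MASK_BIT = 0x8000_0000  # a lane is active when its mask element's sign bit is set


def gather_dd(dest, mask, memory, base, indices, scale):
    """Toy model of a dword gather: dest[i] = memory[base + indices[i] * scale]
    for every active lane; inactive lanes keep their old dest value."""
    assert scale in (1, 2, 4, 8)
    dest, mask = list(dest), list(mask)
    for lane, index in enumerate(indices):
        if mask[lane] & MASK_BIT:
            # Each active lane is an independent load from its own address.
            dest[lane] = memory[base + index * scale]
            # Clearing the lane's mask records progress, so an interrupted
            # gather could resume without redoing the lanes already loaded.
            mask[lane] = 0
    return dest, mask


memory = {0x1000 + 4 * i: 100 + i for i in range(16)}
indices = [7, 0, 3, 3, 12, 1, 9, 5]
mask = [0xFFFF_FFFF] * 8
mask[2] = 0  # lane 2 inactive: keeps its old destination value
dest, mask = gather_dd([0] * 8, mask, memory, base=0x1000, indices=indices, scale=4)
print(dest)  # [107, 100, 0, 103, 112, 101, 109, 105]
print(mask)  # [0, 0, 0, 0, 0, 0, 0, 0]
```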


[Posted by: patrickjchase | Date: 10/19/13 09:32:12 AM]

(Post hidden)
[Posted by: linuxlowdown | Date: 10/19/13 10:19:51 AM]

True, but you do get a substantial performance drop compared to native execution. Considering AMD already suffers from poor performance due to using a single FPU for two cores, they will probably be looking to either double the FPU count or double the width. If they do only the latter, perhaps they can try something fancy to let both cores use the FPU at the same time for smaller commands (assuming the commands are similar enough between cores).
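
The throughput cost of splitting can be estimated with an idealized back-of-the-envelope model. The Python sketch below assumes each 256-bit operation becomes ceil(256 / pipe width) micro-ops and that the FPU has two pipes (roughly the pair of 128-bit FMAC pipes in Bulldozer). It ignores latency, dependencies, decode limits and the module's second core competing for the same FPU, so treat the result as a best case, not a prediction. The fpu_cycles() helper is hypothetical.

```python
import math


def fpu_cycles(n_ops: int, op_bits: int, pipe_bits: int, pipes: int) -> int:
    """Idealized cycles to issue n_ops vector operations of width op_bits.

    Assumption: each operation is split into ceil(op_bits / pipe_bits)
    micro-ops and `pipes` micro-ops issue per cycle. Latency, dependencies,
    decode limits and sharing with a second core are ignored.
    """
    uops_per_op = math.ceil(op_bits / pipe_bits)
    return math.ceil(n_ops * uops_per_op / pipes)


# 1,000 256-bit operations on two 128-bit pipes vs two 256-bit pipes.
narrow = fpu_cycles(1000, op_bits=256, pipe_bits=128, pipes=2)
wide = fpu_cycles(1000, op_bits=256, pipe_bits=256, pipes=2)
print(f"two 128-bit pipes: {narrow} cycles; two 256-bit pipes: {wide} cycles")
```

In this model, four 128-bit pipes reach the same peak figure for 256-bit code as two 256-bit pipes, which matches the "double the count or double the width" choice in the comment above.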
[Posted by: basroil | Date: 10/21/13 11:02:56 AM]

3. 
Some people seem very confused. The AM3+ socket doesn't bottleneck AMD performance in any way. Excavator, like Steamroller, is a processor "core" and can be used in any number of socket configurations.

HSA will make huge changes in both compute and GPU performance.
[Posted by: beenthere | Date: 10/19/13 12:11:22 PM]

(Post hidden)
[Posted by: trumpet-205 | Date: 10/19/13 07:09:26 PM]

A 32-pair HT 3.1 link is way faster than 32x PCIe 3.0; it's even fast enough to handle 48x PCIe 3.0... Not that it matters, since PCIe 3.0 16x and the 16-pair HT 3.1 link at 2.0 GHz used in AM3+ are about the same, so it's a wash. AMD could release a chipset and CPU that fully use the 16-pair HT 3.1 link at 3.2 GHz and pull around 25.6 GB/s, which should be enough for dual CrossFire on PCIe 3.0 with no bottleneck. HyperTransport isn't the culprit here; it's AMD saving money by reusing their base chipsets with minor tweaks until they need to do a full redesign, which won't happen until they roll out PCIe 3.0, DDR4, and SATA Express all on one new platform.
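
The link bandwidths quoted in this comment can be checked with two simple formulas. The Python sketch below assumes HyperTransport moves data on both clock edges (two transfers per clock) and that PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding. Results are per direction in GB/s; doubling them gives the combined both-directions figure, which is how the 25.6 GB/s number above arises. When comparing links, use the same convention on both sides.

```python
def ht_gb_per_s(clock_ghz: float, width_bits: int) -> float:
    """Per-direction HyperTransport bandwidth in GB/s.

    HyperTransport is double data rate: two transfers per clock cycle,
    each moving width_bits / 8 bytes.
    """
    return clock_ghz * 2 * width_bits / 8


def pcie3_gb_per_s(lanes: int) -> float:
    """Per-direction PCIe 3.0 bandwidth in GB/s.

    Each lane signals at 8 GT/s (one bit per transfer), and 128b/130b
    encoding leaves 128 of every 130 bits for payload.
    """
    return lanes * 8 * (128 / 130) / 8


for clock, width in [(2.0, 16), (3.2, 16), (3.2, 32)]:
    one_way = ht_gb_per_s(clock, width)
    print(f"HT {width}-bit @ {clock} GHz: {one_way:.2f} GB/s per direction, "
          f"{2 * one_way:.2f} GB/s both directions")
for lanes in (16, 32, 48):
    one_way = pcie3_gb_per_s(lanes)
    print(f"PCIe 3.0 x{lanes}: {one_way:.2f} GB/s per direction, "
          f"{2 * one_way:.2f} GB/s both directions")
```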
[Posted by: sollord | Date: 10/19/13 07:54:56 PM]

When you add SATA Express, PCIe 3.0, AND other southbridge communication, it is pushing HyperTransport. AMD could just cram in as many HT links as needed to compensate, but I wonder about latency.

In any case, it is in AMD's best interest to integrate the northbridge into the CPU, which means ditching AM3+.
[Posted by: trumpet-205 | Date: 10/19/13 10:59:19 PM]

4. 
There is no AM3+ HSA CPU. HSA is on socket FM2+. AMD "cores" i.e. Steamroller, Excavator, etc. can be used with whatever socket(s) and CPU/APU AMD desires.
[Posted by: beenthere | Date: 10/19/13 07:43:06 PM]

Keep it up, they'll get it sooner... or later!
[Posted by: MHudon | Date: 10/21/13 11:14:53 PM]

5. 
Where was everyone when AMD publicly disclosed that it was abandoning the FX desktop chip series, and AM sockets by extension? The only CPUs going forward are Opterons and APUs. The persons will be phased out for power PC chips and APUs. I only wish AMD shipped 6-core APUs with Crystal Well tech, because AMD will be pressed severely by the soon-to-be-released Broadwell, especially the Crystal Well series for i5s.
[Posted by: ericore | Date: 10/20/13 06:30:30 AM]

Assuming that rumor is even true (AMD, as far as I recall, has not publicly made the abandonment statement), you don't need the FX line to continue on with excavator. They will simply call the CPU core of the next APU line 'excavator'.
[Posted by: jumpingjack | Date: 10/21/13 05:09:32 PM]

6. 
Meh. I'm a long-time AMD advocate, but I'm getting a little tired of bait and wait tactics.
[Posted by: Aquineas | Date: 10/21/13 06:26:43 PM]

I agree, this practice is tiresome. In the end, though, they always did deliver affordable and reliable products.
[Posted by: MHudon | Date: 10/21/13 11:19:32 PM]

7. 
Please explain the "dramatic" performance increase! Will it finally be able to compete with an i5 in games and other apps?
[Posted by: TAViX | Date: 10/22/13 02:25:02 PM]

8. 
I think AMD should focus on unifying the CPU and GPU, then transfer all multimedia extensions as well as the FPU to the GPU. This way they will reduce die size and power and get better performance.
[Posted by: Akram Al | Date: 10/22/13 08:25:26 PM]

9. 
I for one have been waiting for quad-channel RAM support, as well as independent fetch/decode *per core*. (I know, dream on for that last one...) It may be that scaling back to only 6 cores and reducing, or even eliminating, L3 cache would free up enough silicon to allow for true parallelism. Could just pay off.
[Posted by: LifeInChrist | Date: 10/23/13 07:14:37 PM]

10. 
Another idea might be to add one or two very small, low-performance cores for background tasks. (Call it 6.1 or 8.1 if you wish.) It/they could have a small amount of independent cache (say 512k of L2). Non-demanding programs could apply to run on these cores through the scheduler API. By freeing up the primary cores, critical data could persist for longer periods of time in the main cache, thus improving the hit ratio.
[Posted by: LifeInChrist | Date: 10/23/13 07:30:06 PM]
