Advanced Micro Devices on Thursday quietly published the first patch work enabling support for its next-generation Steamroller micro-architecture in the GNU compilers. The bdver3 GCC patch sheds some light on the peculiarities of AMD's future high-performance x86 core, but does not contain many specifics.

"The attached patch (Patch.txt) enables the next version of AMD's Bulldozer core. A new file (bdver3.md) is also attached which describes the pipelines," Ganesh Gopalasubramanian, an AMD engineer, wrote on the GCC project web-site.

According to the Phoronix web-site, the Bulldozer version 3 (bdver3) GCC patch is presently in a very early form and generally carries over most of the tuning work from bdver2 (Piledriver), except that the pipelines have already been modeled in accordance with the new Steamroller core design. Given that the patch does not add support for any new instructions, it is evident that the company's first concern is ensuring that the peculiarities of the Steamroller cores are taken into consideration by software developers.

AMD is reportedly trying to ensure that the Steamroller micro-architecture is supported by the GNU compiler collection (GCC) 4.8, which is due in the first half of 2013. Apparently, the company is particularly concerned about compiler optimization for the new bdver3 pipelines, which were significantly redesigned in the third generation compared to the original Bulldozer.

AMD pins a lot of hopes on the Steamroller micro-architecture and even disclosed many of its peculiarities back in August '12, well ahead of the roll-out of the first chips, which are projected to be due in late 2013.

Tags: AMD, Bulldozer, Piledriver, Steamroller, 28nm

Discussion

Comments currently: 21
Discussion started: 10/12/12 03:47:54 AM
Latest comment: 10/16/12 10:28:47 AM

1. (post hidden)
[Posted by: JBG | Date: 10/12/12 03:47:54 AM]

 
On a related note, the Bulldozer architecture will get a boost in the upcoming Windows 8, with a specially developed kernel driver taking full advantage of Bulldozer's CPU scheduling. For Windows 7 there was a patch rolled out that only partly took advantage of Bulldozer, improving performance by about 3%. It will be higher in Windows 8.
[Posted by: linuxlowdown | Date: 10/13/12 01:58:56 AM]

2. (post hidden)
[Posted by: Tristan | Date: 10/12/12 04:19:28 AM]

 
Huh? Intel has processor instruction set optimisation also. Do you really know what you're talking about? You have lost credibility sir. Read the link for the full story from the horse's mouth.

http://www.phoronix.com/s...=news_item&px=MTIwNDY
[Posted by: linuxlowdown | Date: 10/12/12 06:04:33 AM]
 
Ignore the clueless haters.

Every chance they get they try to bash AMD only to illustrate their technical ignorance.

As noted, Intel uses processor instructions just like AMD. To get the best performance from a processor, software writers need to incorporate BOTH Intel and AMD instructions, but many lazy coders, or those bought by Intel $$$, fail to write good code that fully supports all the AMD instructions. That is why AMD is assisting the industry to better serve its customers and unlock the true performance of AMD processors.

FYI - You'll find many benches do NOT use the proper AMD instructions but do use all of the Intel instructions - to boost Intel's processor performance and make AMD processors look inferior. This dirty little secret was documented by the better PC hardware reviewers quite some time ago.
[Posted by: beenthere | Date: 10/12/12 06:51:47 AM]
 
Hmm, I wonder why people writing compilers would focus more on optimizing for the CPUs that make up 80% of the market than the CPUs that make up 20%. It's almost as if they would like programs to run well on the majority of possible computers. Sure, they should optimize for both vendors and 100% of the market, but there are limits on time and money when writing software.

Also, there are two parts of the CPU that the compiler can optimize for: new instructions and architecture. Almost all compilers support both companies' instructions, and generally both companies use the same instructions within one generation. Architecture didn't used to matter in compilers because there were only single-core CPUs. Most likely a lot of compilers have only just finished adapting to many-core scenarios, and have very little interest in writing lots more code to adapt to Bulldozer's shared FP unit, only to get ~1-3% more performance out of Bulldozer, for a CPU that up until the end of this year will probably be less than 1% of all computers in use (since K10 APUs were still vastly outselling Bulldozer until now).
[Posted by: cashkennedy | Date: 10/12/12 10:26:41 AM]
 
Compiler vendors get help from the CPU manufacturers to add CPU specific optimizations.
[Posted by: user99 | Date: 10/12/12 04:11:37 PM]
 
Even assuming the split is 80/20 (which it may not be), do you realize how many computers we're talking about? The 2012 x86 market is about ~500 million CPUs, and 500 million x 0.20 = ~100 million CPUs. So what you're saying is that, since there are 'only' 100 million AMD CPUs being sold in 2012 alone (of course, a similar number were sold in 2011, and more will be sold in 2013), it's not worthwhile. We're talking hundreds of millions of users.

This is a load of BS. It is perfectly possible to compile for both Intel and AMD architectures and include both optimizations in an installation package which detects which CPU you are using and installs the appropriate code build for that architecture. It's utter nonsense that it can't be done, or that it would cost significantly more.
[Posted by: anubis44 | Date: 10/16/12 10:28:47 AM]
 
100% correct again @beenthere. But you know, I have trouble ignoring the clueless AMD haters ;-)
[Posted by: linuxlowdown | Date: 10/13/12 01:47:14 AM]
 
Any C/C++ compiler worth its salt optimizes for the chip it's running on.
[Posted by: KeyBoardG | Date: 10/12/12 07:22:52 AM]
 
Thus they aren't worth their salt... or they are improperly used for only one type of processor. <LOL>
[Posted by: beenthere | Date: 10/12/12 07:38:19 AM]
 
A compiler doesn't care what CPU it's running on. You can even compile ARM applications while running an x86 OS; it's called cross-compiling.
[Posted by: user99 | Date: 10/12/12 04:13:19 PM]
 
Yes it does: there are compilers which are optimized for different CPUs because they are made by the vendor. Add to that, if they are not open source they can deoptimize for certain hardware. AMD should really add as much support for GCC as it does for Open64; it would really be good.
[Posted by: the_file | Date: 10/12/12 05:50:50 PM]

3. 
I'm amazed at the level of...um..."passion" in this thread, given that it's basically a non-event.

Every CPU vendor works with compiler vendors (gcc project, MS, PGI, etc) to ensure support for their forthcoming microarchitectures. The fact that AMD are doing so simply tells us that they're doing their job.

The REAL question to ask is: "How sensitive is the Steamroller microarchitecture to targeted optimizations?". Consider a pair of recent examples:

1. The Sandy Bridge / Ivy Bridge microarchitecture is notably insensitive to optimization. You can compile for any of a number of recent microarchitectures (including AMD's) and get basically identical performance across a range of problems. This happens because SB/IB have a very flexible out-of-order backend with a large scheduling window and significant memory [re-]ordering capabilities. If the compiler produces a less-than-optimal instruction schedule, then SB/IB can often fix that at runtime.

2. Atom is notoriously sensitive to optimization. There is often a significant performance difference between binaries optimized for Atom and ones that are not. This happens because Atom is in-order, which means that it can only dual-issue if consecutive instructions are independent. Atom therefore depends on the compiler to order the instruction stream based on its issue constraints. The last Intel architecture that was similarly sensitive to the compiler was the original Pentium.

The real question is therefore: Is Steamroller more like SB/IB or Atom in terms of how it responds to compiler optimization? Steamroller has a fairly capable out-of-order engine (not as capable as SB/IB, but very good nonetheless), so I personally anticipate that it will be fairly insensitive to the compiler, like Bulldozer and Piledriver. If that turns out to be true then this whole discussion is much ado about nothing.

AMD's recent "clustered" microarchitectures with shared FP/AVX/cache are *very* sensitive to the operation of the OS thread scheduler (hence all of the discussion about Windows patches and Windows 8), but that's a completely separate issue and discussion. The compiler has very little to do with that (except in certain very narrow cases where it can choose between instructions that use shared vs. unshared units).

[Posted by: patrickjchase | Date: 10/13/12 11:14:48 AM]

 
wow very good post, you should post on xbit more often
[Posted by: cashkennedy | Date: 10/13/12 06:27:57 PM]
 
The Pentium 4 (Willamette/Northwood) was also very sensitive to compiler optimizations, just like Atom, and even more so than the original Pentium.

For example, on all other processors shift was a fast operation, and a multiply by 2 or 4 should be done with a shift, so all compilers did it with shifts.

On the Pentium 4, shifts were slow and addition very fast, so the fastest way to multiply by 2 or 4 was addition.

And the FPU was very slow with old x87 FPU code; the new SSE2 instructions HAD to be used to get good performance.

Also, the very small L1D cache and longer cache lines meant data structures might need different placement/padding optimizations than on other CPUs.

Prescott was a bit better; some of the mentioned things do not apply to it.
[Posted by: hkultala | Date: 10/13/12 11:20:50 PM]
 
I don't think the P4 was as difficult to optimize for as either the Pentium or Atom. It certainly had its share of quirks, but the fact that it was out-of-order with a large instruction window meant that it fundamentally didn't require the sort of hand-holding that the statically scheduled x86s do.

Addressing your specific points:

P4 did indeed have weak x87 FP performance, but so does Atom, and Pentium required FXCH instructions to be "paired" with FP instructions in a specific order to get good FP performance. All 3 therefore required recompilation of traditional x87 FP code to get high performance. P4's SSE performance was pretty easy to exploit (i.e. the compiler just had to use SSE - It didn't have to be all that fancy about how), whereas Atom requires dedicated hand-holding. Atom has high result->use latencies for most SSE ALU ops, and coupled with static scheduling this means that for Atom to attain good SSE throughput the compiler has to manually schedule those latencies (i.e. find independent instructions to "plug the holes" and avoid interlock stalls).

You're right that the lack of a barrel-shifter hurt P4, though the specific example you give (left-shift by an immediate to implement multiplication by a power of 2) is one where the early P4 actually did pretty well, at 1 shift/clock sustained throughput. The early P4 wasn't very good at non-immediate shifts and rotates though.

I personally liked the P4's L1 Dcache design, for one simple reason: the small size enabled lower load->use latencies than any other x86 CPU of its time, in spite of P4's high clock speeds. It's interesting to note that in their recent CPUs (Bulldozer and beyond) AMD has also headed towards small caches, for exactly the same reason. The high bandwidth between L1 and L2 (what Intel called the "advanced transfer cache") also offset the small size of the L1 to a degree.

If I were going to pick on the P4 for something, I'd probably go with the one thing that you didn't mention: its "narrow" frontend (1 instruction/clock decode rate) and the resulting dependency on the uop trace cache. If the P4 missed its trace cache and had to start decoding anew, its performance was absolutely awful. That in turn created some pressure on compilers to do placement optimizations etc.
[Posted by: patrickjchase | Date: 10/14/12 05:10:44 PM]
 
Well, there was a patched version of a program (some video converter or so) which was compiled using newer instructions (for AMD Bulldozer) and showed notable improvements on both SB and BD.
[Posted by: madooo12 | Date: 10/15/12 02:23:55 PM]
