Discussion on Article:
AMD Secretly Rolls-Out "Steamroller" Support Patch for Compilers.

Started by: JBG | Date 10/12/12 03:47:54 AM
Comments: 21 | Last Comment:  10/16/12 10:28:47 AM


1 6 [Posted by: JBG  | Date: 10/12/12 03:47:54 AM]

On a related note, the Bulldozer architecture will get a boost in the upcoming Windows 8, with a specially developed kernel driver taking full advantage of Bulldozer's CPU scheduling. A patch rolled out for Windows 7 only partly took advantage of Bulldozer, improving performance by about 3%. The gain will be higher in Windows 8.
5 1 [Posted by: linuxlowdown  | Date: 10/13/12 01:58:56 AM]

2 9 [Posted by: Tristan  | Date: 10/12/12 04:19:28 AM]

Huh? Intel has processor instruction set optimisation also. Do you really know what you're talking about? You have lost credibility sir. Read the link for the full story from the horse's mouth.
8 3 [Posted by: linuxlowdown  | Date: 10/12/12 06:04:33 AM]
Ignore the clueless haters.

Every chance they get they try to bash AMD only to illustrate their technical ignorance.

As noted, Intel uses processor instructions just like AMD. To get the best performance from a processor, software writers need to incorporate BOTH Intel and AMD instructions, but many lazy coders, or those bought by Intel $$$, fail to write good code that fully supports all the AMD instructions. That is why AMD is assisting the industry: to better serve its customers and unlock the true performance of AMD processors.

FYI - You'll find many benchmarks do NOT use the proper AMD instructions but do use all of the Intel instructions, to boost Intel's processor performance and make AMD processors look inferior. This dirty little secret was documented by the better PC hardware reviewers quite some time ago.
7 5 [Posted by: beenthere  | Date: 10/12/12 06:51:47 AM]
Hmm, I wonder why people writing compilers would focus more on optimizing for the CPUs that make up 80% of the market than for the CPUs that make up 20%. It's almost as if they would like programs to run well on the majority of possible computers. Sure, they should optimize for both CPU families and 100% of the market, but there are limits on time and money when writing software.

Also, there are two parts of the CPU the compiler can optimize for: new instructions and architecture. Almost all compilers support both companies' instructions, and generally both companies use the same instructions within a generation. Architecture didn't use to matter to compilers because there were only single-core CPUs. Most likely a lot of compilers have only just finished adapting to many-core scenarios, and have very little interest in writing lots more code to adapt to Bulldozer's shared FP unit, only to get ~1-3% more performance out of Bulldozer, for a CPU that up until the end of this year will probably be less than 1% of all computers in use (since K10 APUs were still vastly outselling Bulldozer until now).
3 1 [Posted by: cashkennedy  | Date: 10/12/12 10:26:41 AM]
Compiler vendors get help from the CPU manufacturers to add CPU specific optimizations.
0 1 [Posted by: user99  | Date: 10/12/12 04:11:37 PM]
Even assuming the split is 80/20 (which it may not be), do you realize how many computers we're talking about? The 2012 x86 market is roughly 500 million CPUs, and 500 million × 0.20 ≈ 100 million CPUs. So what you're saying is that, since there are 'only' 100 million AMD CPUs being sold in 2012 alone (of course, a similar number were sold in 2011, and more will be sold in 2013), it's not worthwhile. We're talking hundreds of millions of users.

This is a load of BS. It is perfectly possible to compile for both Intel and AMD architectures and include both sets of optimizations in an installation package that detects which CPU you are using and installs the build appropriate for that architecture. It's utter nonsense that it can't be done, or that it would cost significantly more.
0 0 [Posted by: anubis44  | Date: 10/16/12 10:28:47 AM]
100% correct again @beenthere. But you know, I have trouble ignoring the clueless AMD haters ;-)
4 2 [Posted by: linuxlowdown  | Date: 10/13/12 01:47:14 AM]
Any C/C++ compiler worth its salt optimizes for the chip it's running on.
4 1 [Posted by: KeyBoardG  | Date: 10/12/12 07:22:52 AM]
Thus they aren't worth their salt... or they are improperly used for only one type of processor. <LOL>
5 4 [Posted by: beenthere  | Date: 10/12/12 07:38:19 AM]
A compiler doesn't care what CPU it's running on. You can even compile ARM applications while running an x86 OS; it's called cross-compiling.
1 2 [Posted by: user99  | Date: 10/12/12 04:13:19 PM]
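To illustrate the point about cross-compiling, a hedged sketch assuming a Debian/Ubuntu-style ARM cross toolchain is installed (package and binary names vary by distro, and hello.c is a placeholder source file):

```shell
# Native build: by default the compiler targets the machine it runs on
# (add -march=native to tune for the exact host CPU).
gcc -O2 -o hello hello.c

# Cross build: same source, ARM target, still running on an x86 host.
# Assumes the gcc-arm-linux-gnueabihf package (or similar) is installed.
arm-linux-gnueabihf-gcc -O2 -o hello_arm hello.c

# 'file' shows the two binaries target different architectures.
file hello hello_arm
```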
Yes it does: some compilers are optimized for particular CPUs because they are made by the vendor. And if they are not open source, they can even de-optimize for certain hardware. AMD should really support GCC as much as it does Open64; that would be a real improvement.
3 1 [Posted by: the_file  | Date: 10/12/12 05:50:50 PM]

I'm amazed at the level of "passion" in this thread, given that it's basically a non-event.

Every CPU vendor works with compiler vendors (gcc project, MS, PGI, etc) to ensure support for their forthcoming microarchitectures. The fact that AMD are doing so simply tells us that they're doing their job.

The REAL question to ask is: "How sensitive is the Steamroller microarchitecture to targeted optimizations?". Consider a pair of recent examples:

1. The Sandy Bridge / Ivy Bridge microarchitecture is notably insensitive to optimization. You can compile for any of a number of recent microarchitectures (including AMD's) and get basically identical performance across a range of problems. This happens because SB/IB have a very flexible out-of-order backend with a large scheduling window and significant memory [re-]ordering capabilities. If the compiler produces a less-than-optimal instruction schedule, then SB/IB can often fix that at runtime.

2. Atom is notoriously sensitive to optimization. There is often a significant performance difference between binaries optimized for Atom and ones that are not. This happens because Atom is in-order, which means that it can only dual-issue if consecutive instructions are independent. Atom therefore depends on the compiler to order the instruction stream based on its issue constraints. The last Intel architecture that was similarly sensitive to the compiler was the original Pentium.

The real question is therefore: Is Steamroller more like SB/IB or Atom in terms of how it responds to compiler optimization? Steamroller has a fairly capable out-of-order engine (not as capable as SB/IB, but very good nonetheless), so I personally anticipate that it will be fairly insensitive to the compiler, like Bulldozer and Piledriver. If that turns out to be true then this whole discussion is much ado about nothing.

AMD's recent "clustered" microarchitectures with shared FP/AVX/cache are *very* sensitive to the operation of the OS thread scheduler (hence all of the discussion about Windows patches and Windows 8) but that's a completely separate issue and discussion. The compiler has very little to do with that (except in certain very narrow cases where it can choose between instructions that use shared vs unshared units)

6 1 [Posted by: patrickjchase  | Date: 10/13/12 11:14:48 AM]

wow very good post, you should post on xbit more often
1 1 [Posted by: cashkennedy  | Date: 10/13/12 06:27:57 PM]
Pentium 4 (Willamette/Northwood) was also very sensitive to compiler optimizations, just like Atom, and even more so than the original Pentium.

For example, on all other processors shift was a fast operation, so multiplication by 2 or 4 should be done with a shift, and all compilers did it with shifts.

On Pentium 4, shifts were slow and addition very fast, so the fastest way to multiply by 2 or 4 was addition.

And the FPU was very slow with old x87 FPU code; the new SSE2 instructions HAD to be used to get good performance.

Also, with the very small L1D cache and longer cache lines, data structures might need different placement/padding optimizations than on other CPUs.

Prescott was a bit better; some of the things mentioned do not apply to it.
2 1 [Posted by: hkultala  | Date: 10/13/12 11:20:50 PM]
I don't think P4 was as difficult to optimize for as either Pentium or Atom. It certainly had its share of quirks, but the fact that it was out-of-order with a large instruction window meant that it fundamentally didn't require the sort of hand-holding that the statically scheduled x86s do.

Addressing your specific points:

P4 did indeed have weak x87 FP performance, but so does Atom, and Pentium required FXCH instructions to be "paired" with FP instructions in a specific order to get good FP performance. All 3 therefore required recompilation of traditional x87 FP code to get high performance. P4's SSE performance was pretty easy to exploit (i.e. the compiler just had to use SSE - It didn't have to be all that fancy about how), whereas Atom requires dedicated hand-holding. Atom has high result->use latencies for most SSE ALU ops, and coupled with static scheduling this means that for Atom to attain good SSE throughput the compiler has to manually schedule those latencies (i.e. find independent instructions to "plug the holes" and avoid interlock stalls).

You're right that the lack of a barrel-shifter hurt P4, though the specific example you give (left-shift by an immediate to implement multiplication by a power of 2) is one where the early P4 actually did pretty well, at 1 shift/clock sustained throughput. The early P4 wasn't very good at non-immediate shifts and rotates though.

I personally liked the P4's L1 Dcache design, for one simple reason: The small size enabled lower load->use latencies than any other x86 CPU of its time in spite of P4's high clock speeds. It's interesting to note that in their recent CPUs (Bulldozer and beyond) AMD has also headed towards small caches, for exactly the same reason. The high bandwidth between L1 and L2 (what Intel called "advanced transfer cache") also offset the small size of L1 to a degree.

If I were going to pick on the P4 for something, I'd probably go with the one thing that you didn't mention: its "narrow" frontend (1 instruction/clock decode rate) and the resulting dependency on the uop trace cache. If P4 missed its trace cache and had to start decoding anew, its performance was absolutely awful. That in turn created some pressure on compilers to do placement optimizations, etc.
1 0 [Posted by: patrickjchase  | Date: 10/14/12 05:10:44 PM]
Well, there was a patched version of a program (some video converter or so) which was compiled using newer instructions (for AMD Bulldozer) and showed notable improvements on both SB and BD.
0 0 [Posted by: madooo12  | Date: 10/15/12 02:23:55 PM]