Discussion on Article:
AMD Secretly Rolls-Out "Steamroller" Support Patch for Compilers.
Every chance they get they try to bash AMD only to illustrate their technical ignorance.
As noted, Intel uses processor instructions just like AMD. To get the best performance from a processor, software writers need to incorporate BOTH Intel and AMD instructions, but many lazy coders, or those bought by Intel $$$, fail to write good code that fully supports all the AMD instructions. That is why AMD is assisting the industry: to better serve its customers and unlock the true performance of AMD processors.
FYI - you'll find that many benchmarks do NOT use the proper AMD instructions but do use all of the Intel instructions, which boosts Intel's processor performance and makes AMD processors look inferior. This dirty little secret was documented by the better PC hardware reviewers quite some time ago.
Also, there are two aspects of a CPU that the compiler can optimize for: new instructions and the underlying architecture. Almost all compilers support both companies' instructions, and generally both companies use the same instructions within a generation. Architecture didn't use to matter much to compilers because there were only single-core CPUs. Most likely a lot of compilers have only just finished adapting to many-core scenarios, and they have very little interest in writing lots more code to adapt to Bulldozer's shared FP unit, only to get ~1-3% more performance out of Bulldozer, for a CPU that until the end of this year will probably be in less than 1% of all computers in use (since K10 APUs were still vastly outselling Bulldozer until now).
This is a load of BS. It is perfectly possible to compile for both Intel and AMD architectures, and to include both sets of optimizations in an installation package that detects which CPU you are using and installs the code build appropriate to that architecture. It's utter nonsense that it can't be done, or that it would cost significantly more.
Every CPU vendor works with compiler vendors (gcc project, MS, PGI, etc) to ensure support for their forthcoming microarchitectures. The fact that AMD are doing so simply tells us that they're doing their job.
The REAL question to ask is: "How sensitive is the Steamroller microarchitecture to targeted optimizations?". Consider a pair of recent examples:
1. The Sandy Bridge / Ivy Bridge microarchitecture is notably insensitive to optimization. You can compile for any of a number of recent microarchitectures (including AMD's) and get basically identical performance across a range of problems. This happens because SB/IB have a very flexible out-of-order backend with a large scheduling window and significant memory [re-]ordering capabilities. If the compiler produces a less-than-optimal instruction schedule, then SB/IB can often fix that at runtime.
2. Atom is notoriously sensitive to optimization. There is often a significant performance difference between binaries optimized for Atom and ones that are not. This happens because Atom is in-order, which means that it can only dual-issue if consecutive instructions are independent. Atom therefore depends on the compiler to order the instruction stream based on its issue constraints. The last Intel architecture that was similarly sensitive to the compiler was the original Pentium.
The real question is therefore: Is Steamroller more like SB/IB or Atom in terms of how it responds to compiler optimization? Steamroller has a fairly capable out-of-order engine (not as capable as SB/IB, but very good nonetheless), so I personally anticipate that it will be fairly insensitive to the compiler, like Bulldozer and Piledriver. If that turns out to be true then this whole discussion is much ado about nothing.
AMD's recent "clustered" microarchitectures with shared FP/AVX/cache are *very* sensitive to the operation of the OS thread scheduler (hence all of the discussion about Windows patches and Windows 8), but that's a completely separate issue and discussion. The compiler has very little to do with that (except in certain very narrow cases where it can choose between instructions that use shared vs. unshared units).
For example, on all other processors shift was a fast operation, so multiplication by 2 or 4 was best done with a shift, and all compilers emitted shifts for it.
On Pentium 4, shift was slow and addition very fast, so the fastest way to multiply by 2 or 4 was repeated addition.
And the FPU was very slow with old x87 FPU code; the new SSE2 instructions HAD to be used to get good performance.
Also, with the very small L1D cache and longer cache lines, data structures might need different placement/padding optimizations than on other CPUs.
Prescott was a bit better; some of the things mentioned do not apply to it.
Addressing your specific points:
P4 did indeed have weak x87 FP performance, but so does Atom, and Pentium required FXCH instructions to be "paired" with FP instructions in a specific order to get good FP performance. All 3 therefore required recompilation of traditional x87 FP code to get high performance. P4's SSE performance was pretty easy to exploit (i.e. the compiler just had to use SSE; it didn't have to be all that fancy about how), whereas Atom requires dedicated hand-holding. Atom has high result->use latencies for most SSE ALU ops, and coupled with static scheduling this means that for Atom to attain good SSE throughput the compiler has to manually schedule around those latencies (i.e. find independent instructions to "plug the holes" and avoid interlock stalls).
You're right that the lack of a barrel-shifter hurt P4, though the specific example you give (left-shift by an immediate to implement multiplication by a power of 2) is one where the early P4 actually did pretty well, at 1 shift/clock sustained throughput. The early P4 wasn't very good at non-immediate shifts and rotates though.
I personally liked the P4's L1 Dcache design, for one simple reason: the small size enabled lower load->use latencies than any other x86 CPU of its time, in spite of P4's high clock speeds. It's interesting to note that in its recent CPUs (Bulldozer and beyond) AMD has also headed towards small caches, for exactly the same reason. The high bandwidth between L1 and L2 (what Intel called the "advanced transfer cache") also offset the small size of L1 to a degree.
If I were going to pick on the P4 for something, I'd probably go with the one thing that you didn't mention: its "narrow" frontend (1 instruction/clock decode rate) and resulting dependency on the uop trace cache. If P4 missed its trace cache and had to start decoding anew, its performance was absolutely awful. That in turn created some pressure on compilers to do placement optimizations etc.