Reasons for the Poor Performance of Bulldozer in Windows 7

AcceleratorX

Youngling
As far as FMA is concerned, it's actually a legitimate issue... but a processor has to show its performance in real-world applications as well, and not expect the application to be optimized for it.

In most real-world testing, BD is similar to the 2500K at best. For this reason, irrespective of compiler specifics, BD is a failure. NVIDIA, for example, faced similar issues with the GeForce FX series. They worked around it by adopting a new philosophy: optimize the hardware for the compiler, not the other way around. AMD needs to learn from this.

I have respect for AMD because it has put in significant effort to drive competition. But at the same time, their graphics division is aggressively working in a very NVIDIA-like fashion (the old NVIDIA). AMD constantly disses NVIDIA instead of working around bugs, for example. When has NVIDIA done this? Even when the nForce/Intel fiasco broke out, NVIDIA wasn't giving interview after interview about why the competitor sucks.

AMD needs to clean up its engineering and its general act. That being said, if they pull it off, they'll be very competitive, CPUs or otherwise.
 
OP
Cilus

laborare est orare
I think you are getting the picture wrong. I agree with you that AMD should have optimized Bulldozer for Windows 7 and that, rather than creating hype like "first 8-core in the world", they should have made sure the modules are properly detected by the OS.
But what I'm trying to point out is that some software won't run at its maximum performance on AMD or any other non-Intel CPU, because Intel deliberately stops its compilers from using the optimized code path on any non-Intel CPU, or simply pushes vendors to use only Intel's optimizations.

I guess you have missed the VIA Nano issue. It used to be around 8% faster than the Intel Atom CPUs, but it was supposed to be even faster as per VIA's claim. When a mod was used to change the CPUID of the VIA Nano to report an original Intel CPU, the performance advantage jumped to a whopping 30% over the comparable Intel Atom processor.

See, even when software is optimized to run on specific hardware, Intel forcefully prevents it through its money power and influence; it is not that the hardware isn't capable of running optimized software.
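
Just to make that concrete, here is a simplified sketch I put together (not Intel's actual dispatcher code; the two code-path functions are just placeholders for this illustration) of how a runtime dispatcher can pick the fast path based on the CPUID vendor string instead of the feature bits the CPU actually reports:

#include <cpuid.h>    //GCC/Clang helper for the CPUID instruction
#include <string.h>

static int vendor_is_intel(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13] = {0};

    //CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    return strcmp(vendor, "GenuineIntel") == 0;
}

//Hypothetical code paths, placeholders only for this illustration.
static void optimized_codepath(void) { /* SSE/AVX version */ }
static void generic_codepath(void)   { /* plain scalar version */ }

void run_kernel(void)
{
    //Dispatching on the vendor string instead of on the feature flags is
    //what sends a VIA Nano or an AMD CPU down the slow path, and why
    //faking the CPUID vendor as "GenuineIntel" makes it faster.
    if (vendor_is_intel())
        optimized_codepath();
    else
        generic_codepath();
}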

Check the Linux benchmarks of Bulldozer, where plenty of open-source and unbiased compilers are available: in more than 90% of cases the FX-8150 is ahead of the 2500K.

And regarding the software optimization thing, one has to take steps into the future, otherwise innovation will stop. If that were the standard, then dual-core processors shouldn't have been released at all, as Windows XP, the primary OS at that time, was never optimized for dual-core processors. It was only after the release of the Pentium D and Athlon X2 that Microsoft started releasing multi-core optimization patches for XP.
 

vickybat

I am the night...I am...
Vicky, FMA4 does have other benefits apart from being much more flexible from a programmer's point of view. Consider the following case:

const int c = 10;   //c is a constant assigned the value 10
int d[10];          //d is an array of 10 integers
for (int a = 0; a < 10; a++)
{
    int b = a + 1;
    d[a] = a * b + c;
}

Now, if you're using FMA3, the implementation of each iteration will look like this:
The values of a, b and c are loaded into three CPU registers, say A, B and C, and the address of d[a] for the iteration under consideration is kept in an address register, say P.
C = A*B + C
C -> MEM[P]   //store the value of C at the memory location held in P
For the next iteration, load A, B and C again and continue the cycle. Here you have to reload C because its value has been overwritten, although it is a constant.
Now if you look at it, the value of c needs to be read from memory (RAM) 10 times. But c is just a constant and its value stays the same throughout the life of the program. So you are basically wasting 9 memory reads, a very slow operation, on reading the same value.

Now consider FMA4,

Load the values of a, b and c into A, B and C respectively; D is another register.
D = A*B + C
D -> MEM[P]
Load the values of a and b into A and B and continue the cycle.

So here you're saving 9 precious memory reads, as c is a constant and is kept in register C for the whole run without reloading. This is just an example; as the number of iterations increases, you save even more memory reads.

Basically, FMA4 can do everything FMA3 can; it is only Intel who made its implementation incompatible with FMA4 through legal maneuvering.
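
Here is a rough side-by-side sketch of the same idea using the actual intrinsics (my own simplified illustration; float data is used since FMA operates on floating-point values, and the __FMA__/__FMA4__ macros and intrinsic names are the ones GCC uses):

#include <immintrin.h>   //FMA3 intrinsics such as _mm_fmadd_ss
#include <x86intrin.h>   //FMA4 intrinsics such as _mm_macc_ss (AMD-only)

//Compute d = a*b + c for one iteration; c is the constant we want to keep.
void fma_demo(float a, float b, float c, float *d)
{
    __m128 A = _mm_set_ss(a);
    __m128 B = _mm_set_ss(b);
    __m128 C = _mm_set_ss(c);

#if defined(__FMA__)
    //FMA3: e.g. vfmadd231ss C, A, B  ->  C = A*B + C.
    //The encoding has only three register operands, so the result
    //overwrites one of the sources; if that source is still needed,
    //it has to be restored afterwards.
    __m128 R = _mm_fmadd_ss(A, B, C);
#elif defined(__FMA4__)
    //FMA4: vfmaddss D, A, B, C  ->  D = A*B + C.
    //A separate destination register leaves A, B and C intact.
    __m128 R = _mm_macc_ss(A, B, C);
#else
    __m128 R = _mm_add_ss(_mm_mul_ss(A, B), C);   //plain SSE fallback
#endif
    _mm_store_ss(d, R);
}

Whether the compiler actually goes back to memory for c or just keeps a spare copy in another register is its own decision; the underlying point is that the three-operand FMA3 encoding must overwrite one of its inputs, while the four-operand FMA4 encoding does not.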

Yup buddy, this case is fair enough. Since c is a constant, FMA3 takes a performance hit compared to FMA4, as it unnecessarily fetches the value of c from main memory nine times.

But consider a scenario where c is a variable and its value changes every iteration, or let's say it's a randomly generated number. Then won't FMA3's performance equal FMA4's, as both have to fetch the value of c the same number of times?

Won't such a scenario come up in real program logic? Just asking.
 

AcceleratorX

Youngling
The thing with dual core is, well, Intel released the first dual cores at frequencies similar to the top single cores, which meant performance was roughly comparable either way. For this reason dual cores caught on quickly, and Microsoft's later patches only improved performance further.

Now that you mention the VIA Nano issue, I remember it, and it was not just with VIA's CPUs. Even AMD CPUs had this issue (AFAIK the Intel compiler limited the instruction sets used when a non-Intel CPU was detected). However, the issue could be resolved by simply switching to Microsoft's compiler (AFAIK the MS compiler does have FMA4 support now).

I didn't think it was a large enough issue, since I assumed it was easier for developers to use Microsoft's compiler for Windows and GCC for Linux than to resort to Intel's stuff (the developers I know do exactly this; in fact AVG, despite being partly owned by Intel, still uses the Microsoft compiler). IMO this issue never gained enough traction because AMD's processors were still fast enough despite being crippled by the Intel compiler, and VIA never had a significant market share.

I know FX is a good architecture limited by the choice of compiler, but what I'm saying is that pointing fingers at the competitor will not help your bottom line. You gotta find another way, no matter what the odds are.
 
OP
Cilus

laborare est orare
I know FX is a good architecture limited by the choice of compiler, but what I'm saying is that pointing fingers at the competitor will not help your bottom line. You gotta find another way, no matter what the odds are.

Is that so simple when Intel is your competitor?
They have spent 50 million on the compiler business and let developers use it for free. Also, in terms of features and optimization, Intel compilers are far superior to their competitors', even better than Microsoft's. Like it or not, most developers still rely on Intel compilers over Microsoft's. In fact, other CPU makers, including AMD, pay Intel to license their compilers.

I know GCC and Open64 are two open-source compilers with AVX and FMA support for AMD processors, and that's why in Linux benchmarks the 8150 excels over the i5 2500K. Check the review link given on the 1st page.

Most of the synthetic benchmark software, like SiSoft Sandra and 3DMark 05/06, is built with Intel compilers and therefore favors Intel processors.

The latest x264 codec, developed with FMA4, AVX and XOP support (the rest of the SSE5 instructions, which are not part of Intel's specification), performs better on the 8150 than even the mighty 2600K in Windows 7, despite its poor thread scheduling. So isn't it clear that if proper and unbiased benchmarks are used, the FX series gets a big performance boost?
And whether VIA has market share or not, the VIA Nano CPUID mod exposes the foul play by Intel.
 

AcceleratorX

Youngling
Hmm... that is interesting, and I am aware Intel is less than ethical with some of its practices, but I still think AMD can only make the best of a bad situation and not try to change things around (because Intel will not allow it). That's why I'm saying they should try to optimize the architecture for the compiler the next time around.

As for now, with BD only awareness will change things, and not being first to market has hurt AMD a lot. I'm hoping they make a better comeback with Piledriver.

There are many good things to say about AMD though - they are really pushing features in their chipsets. You only need to compare the features of an AMD platform with Intel's - Intel has been so slow in pushing SATA 6 Gbps and USB 3.0.
 
OP
Cilus

laborare est orare
AMD is trying hard, buddy. They have paid some undisclosed amount to Intel to license FMA3 for Piledriver, to make sure Intel compilers will support a code path optimized for FMA3 on AMD processors too. They are also going to use a resonant clock mesh to run their processors at 4 GHz+ without increasing the power draw, and to increase IPC.
 

vickybat

I am the night...I am...
Here's a link that chronicles the "instruction set war" between AMD and Intel:

Source

I hope you'll like it guys. :)
 
OP
Cilus

laborare est orare
Yup buddy, this case is fair enough. Since c is a constant, FMA3 takes a performance hit compared to FMA4, as it unnecessarily fetches the value of c from main memory nine times.

But consider a scenario where c is a variable and its value changes every iteration, or let's say it's a randomly generated number. Then won't FMA3's performance equal FMA4's, as both have to fetch the value of c the same number of times?

Won't such a scenario come up in real program logic? Just asking.

That can be implemented in FMA4 also: since you can use a maximum of 4 registers, you can just as well use only 3. But the thing is, FMA3 uses destructive register mapping, where one operand's register gets overwritten with the result. The example I've given is just one of millions of possible scenarios. Consider another example:

for (int i = 0; i < 10; i++)
{
    int a = i * 2, b = i * 3, c = i + 2;
    int d = a * b + c;
    int j = c * a;
    printf("%d %d %d %d %d\n", a, b, c, d, j);
}

Here a, b, c and d are all variables, but the value of c is needed to calculate both d and j in each iteration.
Now if you use FMA3, C = A*B + C, the value in the C register gets overwritten and can't be used to calculate j. So you need to fetch it from memory again, adding one extra memory read in each of the 10 iterations, i.e. 10 extra reads.
On the other hand, in FMA4 you can mix 4-register and 3-register usage:
D = A*B + C   (FMA4, using 4 registers)
C = A*C       (destructive update, FMA3-style, within FMA4 code)

MOVE C to MEM[j]   //save the value of C to the variable j in memory
MOVE D to MEM[d]   //save the value of D to the variable d in memory

Now look at the advantages:
1. We are reusing the value in C and hence saving one memory read in each iteration.
2. Moving C to the variable j in memory and moving D to the variable d in memory can be done independently, as they come from two separate registers and update two separate memory locations, so we also save memory write time per iteration.

These advantages give a very high degree of flexibility to the compiler as well as to the programmer.
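
To make this concrete, here is a minimal sketch of one iteration of this second example using the real FMA4 intrinsics (my own illustration, not compiler output; float data is used because FMA is a floating-point operation, and it assumes GCC with -mfma4). The point to notice is that _mm_macc_ss writes to a separate destination, so the value held in C stays live for computing j:

#include <x86intrin.h>   //FMA4 intrinsics such as _mm_macc_ss (AMD-only)

void iteration(float a, float b, float c, float *d_out, float *j_out)
{
    __m128 A = _mm_set_ss(a);
    __m128 B = _mm_set_ss(b);
    __m128 C = _mm_set_ss(c);

    __m128 D = _mm_macc_ss(A, B, C);   //D = a*b + c; C is left untouched
    __m128 J = _mm_mul_ss(C, A);       //j = c*a, reusing the value in C

    _mm_store_ss(d_out, D);            //two stores from two separate
    _mm_store_ss(j_out, J);            //registers to two memory locations
}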
 

vickybat

I am the night...I am...
That can be implemented in FMA4 also: since you can use a maximum of 4 registers, you can just as well use only 3. But the thing is, FMA3 uses destructive register mapping, where one operand's register gets overwritten with the result. The example I've given is just one of millions of possible scenarios. Consider another example:

for (int i = 0; i < 10; i++)
{
    int a = i * 2, b = i * 3, c = i + 2;
    int d = a * b + c;
    int j = c * a;
    printf("%d %d %d %d %d\n", a, b, c, d, j);
}

Here a, b, c and d are all variables, but the value of c is needed to calculate both d and j in each iteration.
Now if you use FMA3, C = A*B + C, the value in the C register gets overwritten and can't be used to calculate j. So you need to fetch it from memory again, adding one extra memory read in each of the 10 iterations, i.e. 10 extra reads.
On the other hand, in FMA4 you can mix 4-register and 3-register usage:
D = A*B + C   (FMA4, using 4 registers)
C = A*C       (destructive update, FMA3-style, within FMA4 code)

MOVE C to MEM[j]   //save the value of C to the variable j in memory
MOVE D to MEM[d]   //save the value of D to the variable d in memory

Now look at the advantages:
1. We are reusing the value in C and hence saving one memory read in each iteration.
2. Moving C to the variable j in memory and moving D to the variable d in memory can be done independently, as they come from two separate registers and update two separate memory locations, so we also save memory write time per iteration.

These advantages give a very high degree of flexibility to the compiler as well as to the programmer.

Yes, here FMA4 is advantageous in calculating j = c*a each iteration because it needs the original value of c. In FMA3, using the C register to hold a*b+c would have overwritten c's original value, leaving it unavailable for calculating j = c*a. But in FMA4, the extra D register means the compiler doesn't have to fetch the original value of c again each iteration as in FMA3, since it's still available in register C.

Definitely FMA4 has an advantage. Doubts cleared, buddy. This was a good scenario to test, with c changing value every iteration. So we can conclude that in scenarios where we have to reuse values stored in registers, FMA4 comes in handy and saves excess memory reads/writes. :)
 

Tech_Wiz

Wise Old Owl
It looks like we may have to skip BD, hang on to Phenom II and jump to Piledriver directly, just like we stuck with XP, skipped over Vista and jumped to Win 7.
 

RiGOD

SoLa BeLLaToR...
^^I guess you're right there. If Piledriver has better IPC, better power management features and runs cooler than BD, then it'll be wise to wait till Q3 ;)
 

Tech_Wiz

Wise Old Owl
Yeah, but Phenom II & Thuban still give enough performance to wait till then.

An OCed Phenom/Thuban pretty much matches/exceeds first-gen Core i processors in most cases; they are not bad at all to hang on to.
 