Reason for Poor Performance of Bulldozer in Windows 7

amjath

Human Spambot
Why didn't AMD consider these things before releasing Bulldozer? I hope AMD engineers will at least now coordinate with Microsoft to fix it in SP2.

@Cilus, so does Linux have the upper hand on performance compared with Win7?

[Phoronix] AMD FX-8150 Bulldozer On Ubuntu Linux Review
 

vickybat

I am the night...I am...
Read the article fully; it explains everything in detail. The things to look out for are efficient use of Turbo Core and scheduling threads according to whether they work on shared (dependent) or independent data.

So it all boils down to one thing: the OS should be aware of Bulldozer's modular design in order to assign dependent and independent threads properly. As explained above, dependent threads should share a single module, since they work on the same data, whereas independent threads should be assigned to different modules rather than to the two cores of one module, so that each gets to utilize the module's execution resources to the fullest.

Doing so will solve the Turbo Core issues as well. Even without looking deep into how capable Bulldozer's execution units (integer and float) are, a proper view of the chip by the OS will definitely help a lot with efficient resource utilization and thus increase performance.
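The placement policy described above can be sketched roughly as follows. This is only an illustration in Python, not how Windows or Linux actually schedules: the function name `place_threads`, the "data group" labels and the thread names are all made up for the example, and the only hardware fact assumed is that a module pairs two cores.

```python
from collections import defaultdict

CORES_PER_MODULE = 2  # a Bulldozer module pairs two integer cores


def place_threads(threads, num_modules):
    """Return {thread_name: (module, core)} for a list of (name, data_group).

    Hypothetical policy: threads sharing a data_group are treated as
    dependent and packed into one module; everything else goes to the
    emptiest module available.
    """
    groups = defaultdict(list)
    for name, group in threads:
        groups[group].append(name)

    free = {m: list(range(CORES_PER_MODULE)) for m in range(num_modules)}
    placement = {}

    # Largest groups first, so threads that share data get a whole module.
    for members in sorted(groups.values(), key=len, reverse=True):
        module = None
        for name in members:
            if module is None or not free[module]:
                # Emptiest module wins; ties go to the lowest module number.
                module = max(free, key=lambda m: (len(free[m]), -m))
                if not free[module]:
                    raise ValueError("more threads than cores")
            placement[name] = (module, free[module].pop(0))
    return placement


demo = place_threads(
    [("render_a", "scene"), ("render_b", "scene"),
     ("encode", "video"), ("zip", "archive")],
    num_modules=4,
)
print(demo)
```

In this toy run the two "scene" threads land on the two cores of one module, while "encode" and "zip" each get a module of their own.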

When I first read about Bulldozer's architecture in several articles, it really sounded impressive and a serious breakthrough from conventional designs. But it is held back by the current OS's view of it. Let's hope Win 8 finally recognizes Bulldozer's modular design and puts it to proper use by allocating threads accordingly, as explained in the above article by Cilus.

Great article and a really simple read, buddy. :)
 

RiGOD

SoLa BeLLaToR...
And don't forget that the review sites were making it even worse for buyers by bashing BD by every means possible without even mentioning the inability of the OS to utilise the advanced architecture to the fullest.

Even the most reputed technology discussion forums were flooded with Intel fanboys and AMD haters (even inactive members came out of nowhere just to bash) passing noob comments about BD. But check the user reviews of the FX-8120 at Newegg, man: it has a 5-egg rating and almost every customer is satisfied with the performance even with the OS glitches.

Anyway, I'll say one thing: if Windows 8 makes way for flawless functioning of the architecture and results in at least 20% improved performance and reduced load power consumption, the price-performance ratio will make it a killer VFM product and a worthy opponent to the current hot seller.

BTW check this.
 
OP
Cilus

Cilus

laborare est orare
[Phoronix] AMD FX-8150 Bulldozer On Ubuntu Linux Review

Just finished the review, really a good one. It clearly shows that even without any optimization pack, the BD FX-8150 is ahead of the 2500K in all multi-threaded benchmarks. With the optimization packs, I think Linux users have a clear and big advantage in using Bulldozer modules.
 

gopi_vbboy

Cyborg Agent
Cilus said:
Just finished the review, really a good one. It clearly shows that even without any optimization pack, the BD FX-8150 is ahead of the 2500K in all multi-threaded benchmarks. With the optimization packs, I think Linux users have a clear and big advantage in using Bulldozer modules.

Thanks for reading and letting us know that. :-D
 

amjath

Human Spambot
I recommended Bulldozer to my friend, who is an architect and uses the multi-threaded applications Revit Architecture, Photoshop and 3ds Max, all running at the same time :D
 

sukesh1090

Adam young
@Cilus,
Very good article, bro. Keep it up. Rep added.
BTW, if I am right, some of this poor performance and high power consumption is down to mistakes by AMD as well, and the big problem is that even PD or any other successor of BD can't fix some of them. To fix them, AMD would have to change and redevelop the whole architecture, which is not a choice AMD is considering right now. So with PD and the others we will only see some of the problems getting fixed.
 

Cilus

laborare est orare
No man, for sorting out power consumption, a lot of techniques have been implemented in Piledriver, like a resonant clock mesh for running at higher frequency; you don't need the whole architecture to be changed.

Look at the original Phenom and the Phenom II architecture: Phenom was an absolute disaster for AMD, but Phenom II is, I think, the best VFM processor to date.
 

RiGOD

SoLa BeLLaToR...
^^Seems like every time AMD introduces a new processor it fails and its revision is awesome ;) Like Phenom II after Phenom, and PD (let's wait and see) after BD.
 

Cilus

laborare est orare
It is because, although they have considerably less manpower as well as money power compared to Intel, they have always tried to deliver something out of the box, which actually helps the CPU industry to move forward. Sometimes the result is an instant hit (like the Athlon 64, the 1st x86-64 desktop CPU), sometimes it is a failure like the Phenom.

Below are a couple of innovations introduced by AMD; they give a clear picture of how important AMD has been:

  1. They are the 1st company to provide separate L1 data and instruction caches for reducing the cache coherency problem.
  2. They are the 1st company to produce an on-die memory controller, with the Athlon 64. Intel got that technique with their Nehalem processors, more than 4 years later.
  3. They are the ones who introduced the 1st dual-core (Athlon X2) and quad-core (Phenom) processors where all the cores are implemented inside a single die. Intel's C2Q is just two Core 2 Duo processors packed inside a single package.
  4. They are the pioneers of serial, point-to-point linking between CPU and memory instead of the old FSB. A 3 GHz P4 used to run on a 533 MHz FSB... Even with the Athlon 64, using HyperTransport 1.0, AMD was able to provide 6 times the memory bandwidth. Intel only got that with their Nehalem processors. Intel's QPI, or QuickPath Interconnect, is just a modified version of HyperTransport technology.
  5. Also don't forget the AMD Llano APU, where the whole southbridge, including the PCI-E bus, USB 3.0 and 2.0, everything, is fused inside the CPU die. Intel only has this capability with Ivy Bridge.
 

Omi

-=[-_-]=-
Many people just label AMD products as cheap, backward, under-performing, etc. But we have to see the importance of AMD in the market.
It is the sole company that is giving competition to 2 of the top companies in their fields. It is the company giving us choice, keeping prices under control and giving speed to innovation. Competition is the key to innovation in the corporate world.

Hats off to AMD's effort; it's very hard to do something new when you don't have billions lying around and things are at stake. They have been doing it quite successfully.

@Cilus, very nice article, so simple and lucid.

Windows 7 and the Intel-dominated compilers are not the only ones to blame here. AMD went a bit ahead of its time, and single-threaded performance is not something that can be ignored. And despite having 8 physical cores, it is not that far ahead of the i5-2500K even in heavy multitasking apps. The potential is there, but they need to do a lot more.
 

sukesh1090

Adam young
Cilus said:
No man, for sorting out power consumption, a lot of techniques have been implemented in Piledriver, like a resonant clock mesh for running at higher frequency; you don't need the whole architecture to be changed.

Look at the original Phenom and the Phenom II architecture: Phenom was an absolute disaster for AMD, but Phenom II is, I think, the best VFM processor to date.

Hope you are right. I hope there is some good reason why AMD postponed PD by 1 year. About that VFM, yeah, you are absolutely right; that's the reason I bought the 955.

Cilus said:
It is because, although they have considerably less manpower as well as money power compared to Intel, they have always tried to deliver something out of the box, which actually helps the CPU industry to move forward. Sometimes the result is an instant hit (like the Athlon 64, the 1st x86-64 CPU), sometimes it is a failure like the Phenom.

Hey bro, you left BD out of that list. The new module design looks to me like the future of processors, and look at Turbo Core and its consistency. BD's Turbo Core speed is maintained without any drops, but Intel's Turbo Boost has more peaks and valleys than the Himalayas. Intel's turbo speed never stays put for more than 2 minutes; it always fluctuates.
BTW, aren't they the first company to introduce 64-bit x86 technology, with the Athlon 64? Some OSes still identify 64-bit as AMD64, like Mac and even some Linux distros.

Omi said:
Many people just label AMD products as cheap, backward, under-performing, etc. But we have to see the importance of AMD in the market.
It is the sole company that is giving competition to 2 of the top companies in their fields. It is the company giving us choice, keeping prices under control and giving speed to innovation. Competition is the key to innovation in the corporate world.

AMD is more than that, buddy. It has given us some of the best processors and technologies, without which we would have been in the stone age of processor development.

Omi said:
Windows 7 and the Intel-dominated compilers are not the only ones to blame here. AMD went a bit ahead of its time, and single-threaded performance is not something that can be ignored. And despite having 8 physical cores, it is not that far ahead of the i5-2500K even in heavy multitasking apps. The potential is there, but they need to do a lot more.

You have to check that BD review on Linux from Phoronix given in the previous post. You will see BD is racing ahead of the 2500K in all the multi-threaded benchmarks.
 

sukesh1090

Adam young
Who, me?? But the problem is, what should I do with those 8 cores :wink: and even my mobo doesn't support the 8150. It supports only the 8120. :cry:
 

Cilus

laborare est orare
Sukesh, thanks for pointing that out. Actually I just shared a couple of facts which are not known to all.

Let's have some info about the FMA4 instructions. FMA stands for Fused Multiply-Add. For example, a = b*c + d is an x86 instruction, or macro-op. This instruction can't be calculated directly and needs to be divided into multiple smaller instructions, which are called micro-ops. So in a processor without FMA support:
1. The value of b*c is calculated.
2. It is then rounded off to the required level of accuracy.
3. After that it is added to d, and the value is rounded off again.
4. The final value is stored in a register and saved to the memory location where a is situated. Rounding off may be done once more, as per the data type of the destination operand.
So here 4 steps and up to three levels of rounding off are required, resulting in the use of multiple CPU cycles and a reduced level of precision.

If you have an FMA-enabled processor, then the FPU will execute the instruction a = b*c + d instead of the integer unit. As the FPU has support for FMA, b*c + d is calculated in a single operation and only then is the result rounded off. So the advantages are:
1. Execution speed is faster.
2. Higher accuracy, because there is only one rounding step.
3. The FPU can produce a more accurate result than the integer unit.
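The accuracy point can be checked with a small Python sketch. It does not use a hardware FMA instruction; the "fused" version is emulated with exact fractions and a single final rounding, which is an assumption made purely for illustration. The function names and the test values are made up for the example.

```python
from fractions import Fraction


def unfused_multiply_add(b, c, d):
    """Multiply, round, then add and round again (two roundings)."""
    product = b * c       # rounded to the nearest double
    return product + d    # rounded a second time


def fused_multiply_add(b, c, d):
    """Emulate a fused multiply-add: exact arithmetic, one final rounding."""
    exact = Fraction(b) * Fraction(c) + Fraction(d)  # no rounding here
    return float(exact)   # the only rounding step


# b*c is exactly 1 - 2**-60, which a double cannot represent.
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30
d = -1.0

print(unfused_multiply_add(b, c, d))  # the tiny term is lost in the first rounding
print(fused_multiply_add(b, c, d))    # the tiny term survives
```

With these inputs the two-step version returns exactly 0.0, while the single-rounding version keeps the small negative remainder of -2**-60.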

Now AMD has implemented FMA4 in their BD architecture, while Intel is planning to introduce FMA3 in Haswell. None of Sandy Bridge, Sandy Bridge-E or Ivy Bridge has support for FMA.
In FMA4, 4 operands can be placed in CPU registers. So here the values of a, b, c and d can be fetched from memory into CPU registers. Then the whole operation can be performed inside the CPU without any memory access, resulting in a far better speed-up, as memory access is a very slow process and the CPU has to wait hundreds of cycles for it.

On the other hand, in FMA3 you can only place three operands in CPU registers. So first b, c and d will be fetched and the result of (b*c + d) will be stored in a temp register. After that the CPU will fetch the value of a from memory into a register, save the result in a and then write it back to memory again.

So it is clear to all that FMA4 is faster than FMA3.

Now here comes Intel's foul play. In their compiler instruction-support specification back in 2010, they had shown FMA4 as a supported instruction set. But because of the hype around Bulldozer, just before its release Intel dropped support for FMA4 and took FMA3 instead.

As a result, none of the major software vendors have implemented FMA4 support. There are a couple of open source programs, like the x264 encoder, that have support for it, and you can just check the benchmarks... In the AMD-compiled x264 benchmark's 2nd pass, where the video actually gets converted, the FX-8150 is ahead of the 2600K even under Windows 7.

The good news is that AMD has signed a deal with Intel that from the next iteration of their processors, starting from Piledriver, FMA3 will be used and Intel will support it in their compiler. So you can expect a serious performance boost in Piledriver because of compiler support from all major vendors, as Intel is going to support it.
 

amjath

Human Spambot
^ Oh, I see why the 8150 outperforms the 2600K in video encoding benchmarks. Very well explained, Cilus bro.

I have a question. You said FMA4 is faster than FMA3, so why don't they use it instead of signing up to use FMA3??? Also, is it difficult to implement FMA4 support in software?

I can see Intel wants to make money by making us idiotic sheep, just like Apple is doing.
 

Cilus

laborare est orare
Intel processors are better suited to FMA3 because of the number of registers present and their structure. Intel has 256-bit floating-point registers and can perform operations on 256-bit data, but has fewer registers, whereas AMD has two 128-bit registers and needs to break 256-bit data into two 128-bit pieces to operate on it. But the number of registers is relatively higher.
Now in multimedia work like video conversion or editing, photo editing, etc., you hardly ever operate on 256-bit data and instead operate on large sets of smaller data. So you can basically store a larger number of small operands inside the FPU of Bulldozer, and that's why FMA4 has been implemented.

In Intel's case, although they can't store a large number of small data items in the FPU registers, they can perform a 256-bit operation in a single cycle, compared to two cycles on Bulldozer. That's probably why they stressed operating on big data instead of storing a large number of small data items inside the FPU. So FMA3 is perfect for them.
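The "one pass versus two passes" point can be pictured with a toy Python model. It only counts how many times a unit of a given width has to be used for one 256-bit multiply-add; the "passes" stand in for the cycle counts claimed above and are not real timings. The function names and the choice of eight 32-bit floats per 256 bits are assumptions for the example.

```python
LANES_128 = 4  # four 32-bit floats fit in 128 bits
LANES_256 = 8  # eight 32-bit floats fit in 256 bits


def fma_pass(a, b, c):
    """One pass of the unit: lane-wise a*b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def vector_fma(a, b, c, unit_lanes):
    """Run a vector multiply-add on a unit that is unit_lanes wide.

    Returns (result, passes), where passes is how often the unit was used.
    """
    result = []
    passes = 0
    for start in range(0, len(a), unit_lanes):
        end = start + unit_lanes
        result.extend(fma_pass(a[start:end], b[start:end], c[start:end]))
        passes += 1
    return result, passes


a = [float(i) for i in range(1, LANES_256 + 1)]
b = [2.0] * LANES_256
c = [0.5] * LANES_256

wide_result, wide_passes = vector_fma(a, b, c, LANES_256)
split_result, split_passes = vector_fma(a, b, c, LANES_128)
print(wide_passes, split_passes)  # 1 2
```

Both calls produce the same eight results; the only difference in this model is that the 128-bit-wide unit is used twice.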

Since AMD is going to support FMA3, we will see more optimized software, as Intel-based compilers will have support for FMA3. There is a chance that AMD might retain the current FMA4 compatibility alongside the new FMA3 in their Piledriver architecture, making it a more future-proof and versatile solution.

Have a read here: Single Floating-Point Unit, AVX Performance, And L2 : AMD Bulldozer Review: FX-8150 Gets Tested
 

vickybat

I am the night...I am...
Maybe the architectural revision of Piledriver that we constantly talk about will include one of these. They might follow Intel's path of using fewer registers but working on wider data. This way they can use FMA3 more efficiently.

Logically, FMA4 just uses an extra register and thus saves CPU cycles. Consider an FMA4 instruction d = a*b + c; the same instruction will be implemented in FMA3 as c = a*b + c.

In the first case, after the computation on a, b and c, the result is saved in a different register, i.e. d. But in the 2nd case, after the computation on a, b and c, the result is stored in one of a, b or c, overwriting the previous data.
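The difference between the two operand schemes can be shown with a tiny register-file model in Python. The register names A, B, C and D and the helper names `fma4` and `fma3` are made up for this sketch; real instructions work on vector registers and come in several operand orderings, so this shows only the "separate destination versus overwritten source" idea.

```python
def fma4(regs, dest, a, b, c):
    """Four-operand form: dest = a*b + c, all three sources stay intact."""
    regs[dest] = regs[a] * regs[b] + regs[c]


def fma3(regs, a, b, c):
    """Three-operand form as written above: c = a*b + c, old c is lost."""
    regs[c] = regs[a] * regs[b] + regs[c]


regs4 = {"A": 2.0, "B": 3.0, "C": 10.0, "D": 0.0}
fma4(regs4, "D", "A", "B", "C")
print(regs4)  # D holds the result, C still holds 10.0

regs3 = {"A": 2.0, "B": 3.0, "C": 10.0}
fma3(regs3, "A", "B", "C")
print(regs3)  # C now holds the result, the old value 10.0 is gone
```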

There are some losses in CPU cycles with FMA3, but there are a few advantages to a 3-operand scheme. Fewer operands result in shorter code and thus simplify the implementation. It has been seen that the losses of FMA3 compared to FMA4 are negligible, a mere 1%. So considering simplicity, FMA3 is the new standard.

But others say FMA4 is more flexible from a programmer's perspective.
 

Cilus

laborare est orare
Vicky, FMA4 does have other benefits apart from being much more flexible from a programmer's point of view. Consider the following case:

final int c = 10;        // c is a constant with the value 10
int[] d = new int[10];   // d is an integer array of size 10
for (int a = 0; a < 10; a++)
{
    int b = a + 1;
    d[a] = a * b + c;    // one multiply-add per iteration
}

Now if you're using FMA3, the implementation of each iteration will be like this:
The values of a, b and c are loaded into three CPU registers, say A, B and C, and the address of d[a] for the iteration under consideration is held in an address register, say ADDR.
C = A*B + C
C -> MEM[ADDR] -- store the value of C at the memory location held in ADDR
For the next iteration, load A, B and C again and continue the cycle. Here you have to reload C as its value has been overwritten, although it is a constant.
Now if you look at it, the value of c needs to be read from memory (RAM) 10 times. But think: c is just a constant and its value stays the same throughout the life cycle of the program. So you are basically wasting 9 memory read cycles, a very slow process, on reading the same value.

Now consider FMA4:

Load the values of a, b and c into A, B and C respectively. D is another register.
D = A*B + C
D -> MEM[ADDR]
Load the values of a and b into A and B and continue the cycle.

So here you're saving 9 precious memory reads, as c is a constant and is kept in register C throughout the life cycle without reloading. This is just an example; as the number of iterations increases, you will save more memory reads.
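The counting argument above can be replayed with a short Python simulation. It assumes, as this post does, that an overwritten constant is restored by loading it again; a real compiler could instead keep a spare copy in another register, or pick an FMA3 operand ordering that overwrites one of the multiplicands, so the numbers show only the scenario described here. The function name `run_loop` and its parameters are made up for the sketch.

```python
def run_loop(iterations, c, destructive):
    """Simulate d[a] = a*b + c with b = a + 1.

    destructive=True models the FMA3 usage described above (the result
    lands in the register that held c); destructive=False models FMA4
    (separate destination register).
    Returns (results, number_of_loads_of_c).
    """
    results = []
    loads_of_c = 0
    reg_c = None  # None means register C does not currently hold c
    for a in range(iterations):
        b = a + 1
        if reg_c is None:
            reg_c = c          # (re)load the constant into register C
            loads_of_c += 1
        value = a * b + reg_c  # the multiply-add itself
        if destructive:
            reg_c = None       # result overwrote C, so c must be restored
        results.append(value)
    return results, loads_of_c


fma3_results, fma3_loads = run_loop(10, 10, destructive=True)
fma4_results, fma4_loads = run_loop(10, 10, destructive=False)
print(fma3_loads, fma4_loads)  # 10 1
```

Both runs fill d with the same ten values; under the stated assumption, the destructive form loads c ten times and the non-destructive form once, which is the saving of 9 reads mentioned above.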

Basically, FMA4 can have all the features of FMA3; it is only Intel who made FMA3 incompatible with FMA4, by using legal steps.
 