Reasons for Bulldozer's Poor Performance in Windows 7

Cilus

laborare est orare
Well, after all the recent interest shown here in the reasons for Bulldozer's poor performance, I thought it was time for a detailed explanation. In this article I have only discussed the Windows 7 scheduling problems with Bulldozer modules, but there are other reasons too. One of them can be traced to foul play by Intel in dropping support for advanced instruction sets like FMA4 and XOP, but that is kept for another article. Hope you'll like it.

A Little Detail about the Bulldozer Module:
As you guys know, Bulldozer is a module-based architecture where each module consists of two discrete integer cores, one dual-issue floating point execution core (two instructions can be issued in a single clock cycle), a dual-issue fetch/decode logic unit and a shared L2 cache. The L1 cache arrangement is unique here: both cores share the L1 instruction cache, but each integer core inside a module has its own separate L1 data cache.
This approach is used to utilize both cores of a module even in a lightly threaded environment. As they share the L1 instruction cache, both of them can simultaneously fetch and execute instructions of a single thread currently present in the cache. It improves resource utilization because you don't need two threads to keep the two cores busy; two cores can work on a single thread. Although a single floating point unit might not look so impressive, empirical studies have shown that most of the instructions assigned to a CPU core in a given time slice are actually integer instructions. So AMD tried to shrink the die size and power consumption by duplicating only the most used units rather than doubling everything.

Similarities and Dissimilarities with SMT or HT:
This module design has lots of similarities with the Intel Hyper-Threading design, where each physical core is able to process instructions from two different threads simultaneously (not by switching between them, as explained in most reviews). But there are lots of dissimilarities too.

The basic concept of HT is this: suppose a single core can process four instructions in a single clock cycle (using pipelining to overlap their execution), and fewer than four instructions, say two, are available from one thread at clock cycle C1. The core can then take another two instructions from a second thread, if available, and process them in the same clock cycle C1. As a result, all the CPU resources are utilized to their fullest capability. But remember, an HT-enabled core doesn't have two separate execution units; the two threads just share different portions of the same execution unit of a single core. For example, while the decode unit is decoding the 1st instruction, the fetch unit can start fetching the 2nd instruction from the instruction queue.
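
The slot-filling idea can be sketched as a toy model. The issue width of 4 comes from the example in the text; the function name `fill_issue_slots` and the "thread A goes first" rule are assumptions made only for illustration, and real SMT hardware arbitrates in a far more involved way.

```python
def fill_issue_slots(width, ready_a, ready_b):
    """Toy model of one clock cycle on an SMT (HT-style) core.

    width   -- instructions the core can issue per cycle (4 in the text)
    ready_a -- instructions thread A has ready this cycle
    ready_b -- instructions thread B has ready this cycle
    Returns (issued_from_a, issued_from_b).
    """
    from_a = min(ready_a, width)            # assumption: thread A is served first
    from_b = min(ready_b, width - from_a)   # thread B fills whatever is left
    return from_a, from_b


# The case from the text: a 4-wide core where thread A only has 2 ready.
print(fill_issue_slots(4, 2, 3))  # (2, 2)
# With a single thread running, the spare slots simply stay empty.
print(fill_issue_slots(4, 2, 0))  # (2, 0)
```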

So Intel Hyper-Threading basically exploits instruction level parallelism (ILP), whereas a Bulldozer module has two integer cores, each of which can execute one thread at a time, resulting in two threads running simultaneously and hence a thread level parallelism (TLP) improvement. Each integer core can also use pipelining to overlap the execution of multiple instructions of a single thread.
HT brings no benefit if you have only one thread running on a physical core. On the other hand, both cores of a Bulldozer module can work on the instructions of a single thread simultaneously and execute multiple instructions at a time, because of the presence of two physical integer cores inside a module. Here all the instructions present in the thread are accessible to the whole module.

Turbo Core 2.0:
It is like Intel Turbo Boost but with some more functionality. Unlike Intel, where the dynamic clock frequency change is per core, in Bulldozer it is per module. So the two cores of a module cannot run at different speeds, but one module can run at a different speed from the other modules on the processor die. Bulldozer has the ability to cut power completely from unused modules, resulting in better power management. Also, when a module is boosted, both of its physical integer cores run at the higher speed, not just a single physical integer core as on an Intel processor.

Windows 7 performance issues with Bulldozer architecture:
After reading all of the above details, it looks like the Bulldozer design has solid performance potential, but it actually fails to deliver in the current Windows 7 environment and is plagued by all the problems it promised to improve on: low instruction level parallelism performance, not-so-good multi-threaded performance, high power consumption, etc. Let's discuss the root causes of these issues.

Before starting the explanation, let's be clear about our assumptions:
Let's consider a dual-module Bulldozer processor, say the FX 4100. We have two modules here: M1 and M2. M1 has two integer cores, say C1 and C2, and one floating point core, say F1. Similarly, M2 has two integer cores, C3 and C4, and F2 as its floating point unit.
For Windows 7, M1 and M2 do not exist; it sees four completely independent cores, C1, C2, C3 and C4. It is also not aware that there are only two floating point execution units instead of four. Here is the tabular representation:

Cores | Int Core 1 | Int Core 2 | FPU | Ideal View
Module1 | C1 | C2 | F1 | M1 (C1, C2, F1)
Module2 | C3 | C4 | F2 | M2 (C3, C4, F2)
Windows 7 View | C1, C2, C3, C4 as separate cores
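
To make the two views in the table concrete, here is a minimal sketch in Python. The dictionary layout and the helper names (`flat_view`, `module_of`, `shares_fpu`) are hypothetical and only mirror the M1/M2, C1 to C4 and F1/F2 labels used in this article; this is not how any OS actually stores topology.

```python
# Hypothetical model of the dual-module FX 4100 layout from the table.
MODULES = {
    "M1": {"int_cores": ["C1", "C2"], "fpu": "F1"},
    "M2": {"int_cores": ["C3", "C4"], "fpu": "F2"},
}


def flat_view(modules):
    """What a module-unaware OS sees: just a flat list of cores."""
    return [core for m in modules.values() for core in m["int_cores"]]


def module_of(core, modules):
    """Extra knowledge a module-aware OS would have: the owning module."""
    for name, m in modules.items():
        if core in m["int_cores"]:
            return name
    raise KeyError(core)


def shares_fpu(core_a, core_b, modules):
    """Two cores share one FPU exactly when they sit in the same module."""
    return module_of(core_a, modules) == module_of(core_b, modules)


print(flat_view(MODULES))                # ['C1', 'C2', 'C3', 'C4']
print(shares_fpu("C1", "C2", MODULES))   # True
print(shares_fpu("C1", "C3", MODULES))   # False
```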

  1. Multi-Threaded performance with two independent Threads:
    Suppose we have two independent threads, T1 and T2, running concurrently at some point in time.

    Ideal Case: The ideal scheduling here is for T1 to be assigned to M1 and T2 to M2. Then C1 and C2 can both operate on the integer instructions of T1, and F1 will handle the floating point instructions of T1. The same is true for T2, with C3, C4 and F2 operating simultaneously on its integer and floating point instructions. As a result, the execution of T1 and T2 will be very fast, as each of them is using a very large share of the CPU resources.

    Windows 7 Scheduling: As Win 7 has no idea about M1 and M2, it might assign T1 and T2 to C1 and C2 respectively, i.e. to the two cores of the same module M1. Now T1 and T2 are actually fighting for CPU resources: C1 only works on T1, C2 only on T2, and both T1 and T2 share a single FPU, F1. Meanwhile module M2 is just sitting idle. So this obviously slows down execution and gives very poor CPU resource utilization.
  2. Multiple Dependent Threaded Performances:
    Consider two threads, T1 and T2, that are highly related or tightly coupled. Execution of T2 needs 90% of the instructions of T1 to be completed, and they also share the data space on which they operate. For simplicity, you can think of T1 as holding the instructions to calculate the areas of different geometric figures and T2 as holding the instructions to calculate the price of covering those figures with iron sheet.

    Ideal Case: C1 and C2 of M1 share data and instructions, and they can communicate with each other very fast as they sit inside a single module. So T1 and T2 should be assigned to C1 and C2 respectively. While C2 is executing whatever independent instructions are present in T2, C1 can finish executing the instructions of T1 which T2 requires. As a result there will be no pause or stall in execution: by the time C2 starts on the instructions of T2 that depend on T1, C1 has already completed them and made the results available to C2.
    It also reduces the number of memory accesses, as T1 and T2 share a common data space. Consider the example C = 2^5 + 10 and E = C + 5, two instructions where the former belongs to T1 and the latter to T2. Here, after calculating the value of C, the value of E can be calculated directly inside the CPU, as both threads are present inside a single module. So you are basically saving one memory write (writing C back to RAM) and one read (reading the value of C while calculating E).

    Windows 7 Scheduling: As Windows 7 is not aware of M1 and M2, it might assign the interdependent threads T1 and T2 to cores of different modules. For example, T1 is assigned to C1 (of module M1) and T2 to C4 (of module M2). As a result, C4 needs to wait until C1 finishes executing T1 and saves the result to main memory. So we get poor parallel processing and a larger number of memory accesses, resulting in slower execution.
  3. Turbo Core Performance in an interdependent Multi-threaded environment: Take the scenario from the previous case, i.e. case 2.

    Ideal Case: As the two interdependent threads T1 and T2 have been assigned to cores C1 and C2 of module M1, the logic can completely cut power to module M2 and raise the clock frequency of M1, increasing execution speed while reducing power consumption.

    Windows 7 Scheduling: As discussed in case 2, Windows 7 may assign the dependent threads to cores of different modules rather than to the cores of a single module. As a result, during a given time frame both M1 and M2 are running but neither of them is properly utilized. So power cannot be cut from either module, which prevents Turbo Core from being activated.
  4. Turbo Core Performance in Single-Threaded Environment:
    Suppose we are working in a single-threaded environment where, apart from the Windows processes, one heavy thread operating on a large set of data has priority. This kind of thread needs multiple iterations of CPU time slices to execute completely. (An iteration here means that the thread is allocated a dynamic time period to execute on the CPU, after which it has to release the CPU for other tasks.) In each iteration, some of the instructions get executed or partially executed and the values get saved in main memory. In the next iteration the thread resumes execution from the point where it was released from the CPU in the previous iteration. In our case, take T1 to be the large single thread working on a large set of data.

    Ideal Case: Here T1 should be allocated to only one module, say M1, in every iteration, so that all the other modules, say M2, can be turned off and M1 can hit the maximum Turbo Core frequency, speeding up the single thread significantly. So if T1 needs 10 iterations to complete, it should be assigned to module M1 in all 10 iterations, so that Turbo Core stays active long enough to produce a noticeable performance improvement.

    Windows 7 Scheduling: As Windows 7 is not aware of modules M1 and M2 and only sees C1, C2, C3 and C4 as four independent cores, each iteration of T1 may be assigned to any of the four cores. So consider the case where T1 is assigned to C1 (of module M1) in the 1st iteration, to C3 (of module M2) in the 2nd iteration, to C2 (of module M1) in the 3rd iteration, and so on. As a result, all the modules are kept busy during the execution time of T1 without any performance improvement, and Turbo Core is prevented from being activated. It also increases power consumption indirectly, as none of the modules is turned off.
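
The four cases above boil down to one question: which modules does a given placement keep busy? The following is a minimal sketch of that bookkeeping. The core-to-module mapping follows the FX 4100 example from this article; the function names are hypothetical, and the model deliberately ignores everything except module occupancy.

```python
CORE_TO_MODULE = {"C1": "M1", "C2": "M1", "C3": "M2", "C4": "M2"}


def busy_modules(placement):
    """placement maps thread name -> core. Returns the modules in use."""
    return sorted({CORE_TO_MODULE[core] for core in placement.values()})


def idle_modules(placement):
    """Modules with no thread on them, i.e. candidates for power gating."""
    return sorted(set(CORE_TO_MODULE.values()) - set(busy_modules(placement)))


def modules_touched(core_per_slice):
    """Case 4: one thread, one core per time slice (iteration)."""
    return sorted({CORE_TO_MODULE[core] for core in core_per_slice})


# Case 1, independent threads: spread over both modules vs. packed into M1,
# where T1 and T2 would contend for the single FPU F1.
print(busy_modules({"T1": "C1", "T2": "C3"}))   # ['M1', 'M2']
print(busy_modules({"T1": "C1", "T2": "C2"}))   # ['M1']

# Cases 2 and 3, dependent threads: packed into M1 leaves M2 idle,
# while splitting them across modules leaves nothing to switch off.
print(idle_modules({"T1": "C1", "T2": "C2"}))   # ['M2']
print(idle_modules({"T1": "C1", "T2": "C4"}))   # []

# Case 4, single thread: staying inside M1 vs. hopping C1 -> C3 -> C2.
print(modules_touched(["C1", "C1", "C2"]))      # ['M1']
print(modules_touched(["C1", "C3", "C2"]))      # ['M1', 'M2']
```

Note that the same placement (both threads in M1) is the bad choice in case 1 and the good choice in cases 2 and 3, which is why the scheduler needs to know both the module layout and how the threads relate to each other.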

WHAT SHOULD BE DONE IN WINDOWS 8
Actually, Bulldozer modules should be treated more like an SMT or HT-enabled core with some extra tweaks. The OS needs to understand that one Bulldozer module is not a complete dual core, but it also needs to understand that a module has more CPU resources available than an HT-enabled core.

Suppose we have 4 threads, T1, T2, T3 and T4, waiting in the thread queue to be picked up by the processor. Among them, T1 and T2 are loosely coupled, i.e. they have very little dependency on each other. T3 and T4 are dependent threads: let's say T3 is dependent upon T1, and T4 is dependent upon T2.
How should the scheduler assign these threads to a dual-module (M1 and M2), quad-core (C1, C2, C3 and C4) processor?
First, it should look for the most independent threads in the thread queue, which gives it T1 and T2. The OS should then assign T1 to C1 of module M1 and T2 to C3 of module M2, so that each thread can get two integer cores and one FPU for its processing. The reason is explained in point 1.
Next, it should fetch the next two threads, T3 and T4, and analyze their level of dependency on the other threads. As T3 is dependent upon T1, it should be assigned to C2 of module M1. Similarly, T4 should be assigned to C4 of module M2. The reasons are explained in point 2.

After this scheduling, M1 holds T1 and T3, two dependent threads, and M2 holds T2 and T4, again two dependent threads. Each module now works on data that is independent of the other module, and inside each module we have two dependent threads which can share data and instructions, resulting in faster execution.

Suppose module M2 has finished the execution of T2 and T4 before M1 has finished T1 and T3. As the two dependent threads T1 and T3 have been placed inside a single module M1, M1 does not need to wait for any other module (here M2) to send data to it. So the CPU logic can completely cut power to module M2 and enable Turbo Core to increase the speed of module M1. As a result, the execution of T1 and T3 will be far faster.
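
The placement steps described above can be sketched as a small function. This is only an illustration under strong assumptions: the scheduler is handed the dependency information up front (a real OS scheduler generally is not), there are no more independent threads than modules, and each module has one free sibling core for a dependent thread. The names `MODULES` and `schedule` are hypothetical.

```python
MODULES = {"M1": ["C1", "C2"], "M2": ["C3", "C4"]}


def schedule(depends_on):
    """depends_on maps thread -> the thread it depends on, or None.

    Returns a dict mapping thread -> core.
    """
    free = {module: list(cores) for module, cores in MODULES.items()}
    placement, home = {}, {}

    # Step 1: each independent thread gets a module of its own.
    independent = [t for t, dep in depends_on.items() if dep is None]
    for thread, module in zip(independent, MODULES):
        placement[thread] = free[module].pop(0)
        home[thread] = module

    # Step 2: each dependent thread goes to the sibling core in the
    # module that already hosts the thread it depends on.
    for thread, dep in depends_on.items():
        if dep is not None:
            module = home[dep]
            placement[thread] = free[module].pop(0)
            home[thread] = module
    return placement


# The example from the text: T3 depends on T1, T4 depends on T2.
print(schedule({"T1": None, "T2": None, "T3": "T1", "T4": "T2"}))
# {'T1': 'C1', 'T2': 'C3', 'T3': 'C2', 'T4': 'C4'}
```

With this placement, M1 holds T1 and T3 and M2 holds T2 and T4, so whichever module finishes first has nothing left on it and could be powered down.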
 

RiGOD

SoLa BeLLaToR...
^^Oh man this is heartbreaking. Such a promising architecture and see how Windows 7 ruined the party. BTW read this stuff on a discussion forum

There is most definitely a Windows 7 AMD FX software patch in the works. By most estimates the AMD Bulldozer FX is underperforming by 40-70% in most Windows 7 benchmarks. Forcing Windows 7 to recognize 8 CPU cores has caused a huge performance hit. The Bulldozer FX-8xxx design… really isn’t 8 cores, it’s a 4 core CPU with an extra integer pipeline on each core. If the FX-8xxx series scales according to the 4 and 6 core Bulldozer design, then there is a serious bug in Windows 7 that is crippling the FX-8150's performance.

The one thing that is for-sure here is that every hardware review website rushed to be the first to publish an AMD FX-8150 review, they all used the same generic benchmarks and NONE did any real world computing. The game is fixed, the big-dog spreads around the most ad-dollars.

Any truth in the second para?
 

amjath

Human Spambot
So there is still hope for better performance because of those patches.

BTW, can these patches lower the power consumption too? [noob question!!! :(]
 

RiGOD

SoLa BeLLaToR...
^^You missed the last para it seems.

Suppose module M2 has finished the execution of T2 and T4 before M1 has finished T1 and T3. As the two dependent threads T1 and T3 have been placed inside a single module M1, M1 does not need to wait for any other module (here M2) to send data to it. So the CPU logic can completely cut power to module M2 and enable Turbo Core to increase the speed of module M1. As a result, the execution of T1 and T3 will be far faster.
 

Cilus

laborare est orare
nice info thx
where did you get that from ??

I am a computer engineer and have some understanding of how a CPU works in a little depth. The content of the article has been gathered from the published in-depth architecture details (not only the module/core part, but all the other details too), the Windows 7 problems reported in all the articles, and my personal understanding.

amjath, the hotfixes for Bulldozer in Windows 7 didn't work well; we need a complete revisit here, and patches can't solve the problems. If Windows 8 works properly with BD modules then yes, there is a chance of reduced power consumption. But that's an educated guess based on the different analyses; I can't guarantee it.

RiGOD, what you have posted is actually reflected in my article, but I can't say by how much Win 7 is holding BD back. Could you give me the link where you got that info? Let me have a look at it.
 

gopi_vbboy

Cyborg Agent
^^ No, but I bought it for gaming 2 months back and am planning to get a GPU to have a decent rig. Now this analysis is a shock.
 

Cilus

laborare est orare
FX 6100 is pretty much okay for your need, just add a good GPU and there won't be any problem.
 

amjath

Human Spambot
My friend would be very happy to see this thread because he owns a Bulldozer. He was pretty sad to see the bad performance on Windows 8 CP.

So are you sure it'll be okay in Windows 8?

*off topic*
BTW how is Sabertooth performing. He is looking for an upgrade.
 

RiGOD

SoLa BeLLaToR...
@Cilus : Here you go buddy

Source 1
Source 2
Source 3

BTW I've got a question. Windows 7 was released way back in 2009 and BD in 2011, I mean didn't AMD have enough time to realise that such a widely used OS would respond to their architecture in such a manner? Or did they see it coming and run into trouble even after knowing it? I read somewhere that the development of BD was started in 2007 (not sure how legit it is), is that the reason?

My friend would be very happy to see this thread because he owns a Bulldozer. He was pretty sad to see the bad performance on Windows 8 CP.

So are you sure it'll be okay in Windows 8?

Windows is just a part of the story

Well, after all the recent interest shown here in the reasons for Bulldozer's poor performance, I thought it was time for a detailed explanation. In this article I have only discussed the Windows 7 scheduling problems with Bulldozer modules, but there are other reasons too. One of them can be traced to foul play by Intel in dropping support for advanced instruction sets like FMA4 and XOP, but that is kept for another article. Hope you'll like it.
 

Cilus

laborare est orare
Actually, the performance of a CPU doesn't follow any pre-defined rules; instead it follows empirical patterns, i.e. patterns or trends deduced from the analysis of statistical data. Since BD is a completely new design, no statistical data was available for it. There are certain issues on AMD's side too. They relied too much on feedback from the peers who tested BD, and used machine-based optimization of the design rather than human-based inputs.
That's the reason AMD terminated a lot of those peers after BD's poor performance: the wrong info they provided.
 

AcceleratorX

Youngling
AFAIK the Windows 7 "core parking" feature also plays a role in hindering Bulldozer's performance, effectively disabling some parts of the pipeline when it shouldn't.

This possibly means Bulldozer may perform better under Vista which doesn't have this feature (not sure about XP since that kernel wasn't optimized for multi-core in general). I'm not sure if anyone ran a test to see if this theory is true.
 

Tech_Wiz

Wise Old Owl
Windows people should release an SP2 with the Bulldozer thing fixed. i5 2500K at 12k and FX 4100 at 6k @@.
If that patch can improve performance by 30%+ then BD will be an instant hit.
 

gopi_vbboy

Cyborg Agent
Why didn't AMD consider these things before releasing Bulldozer? Hope at least now the AMD engineers coordinate with Microsoft to fix it in SP2.

@Cilus, so does Linux have any upper hand in performance compared with Win7?
 

RiGOD

SoLa BeLLaToR...
Actually, the performance of a CPU doesn't follow any pre-defined rules; instead it follows empirical patterns, i.e. patterns or trends deduced from the analysis of statistical data. Since BD is a completely new design, no statistical data was available for it. There are certain issues on AMD's side too. They relied too much on feedback from the peers who tested BD, and used machine-based optimization of the design rather than human-based inputs.
That's the reason AMD terminated a lot of those peers after BD's poor performance: the wrong info they provided.

Suicidal, I'd say. Worst way ever of optimising the performance of a much-awaited and highly hyped product.

@gopi_vbboy : Check this post.
 

tarey_g

Hanging, since 2004..
Bought 1090T for 8600/- for a friend last week and saved money. It was hard to find in Pune, everyone is stocking bulldozer these days.
Win for people who read when a new architecture is released.
 