Re: Hardware price list/spec sheet
Max_Snyper, you're right. Even I got a little carried away, as you know CPU architecture is my field of interest. This is my last post regarding it. Tomorrow I'll be moving all the related posts to the Bulldozer Discussion thread. BTW, please have a look at my explanation and share your opinion if you wish.
Extreme Gamer; Yes, the OS can see the CPU cache, but how data is placed inside the cache, in what sequence, or which lines get evicted and written back to memory when the cache is full is handled by CPU hardware logic, not by the OS. The cache inside the CPU die is entirely managed by hardware logic that implements different mechanisms for efficiency, like an LRU algorithm for cache eviction, cache coherency resolving hardware, etc. To the OS, cache operation is a blackbox view.
I hope you know what a blackbox view is: you send the required inputs to a module and you get the outputs without knowing or interfering with the actual mechanism by which they are produced. I mentioned this in my previous post; I never said that the OS is not aware of the cache.
And one thing... thread assignment is never done on the basis of the cache handling algorithm.
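The LRU policy mentioned above is implemented in silicon, but its behaviour can be sketched in software. This is a toy model for illustration only, not how the hardware actually tracks recency:

```python
from collections import OrderedDict

class LRUCacheModel:
    """Toy software model of an LRU cache: on a miss with a full
    cache, the least recently used line is evicted (in hardware,
    a dirty line would also be written back to memory)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # address -> data, oldest first

    def access(self, address, data=None):
        if address in self.lines:
            # Hit: mark this line as most recently used.
            self.lines.move_to_end(address)
            return ("hit", None)
        # Miss: evict the least recently used line if the cache is full.
        evicted = None
        if len(self.lines) >= self.num_lines:
            evicted, _ = self.lines.popitem(last=False)
        self.lines[address] = data
        return ("miss", evicted)

cache = LRUCacheModel(num_lines=2)
cache.access(0x100)   # miss, cache fills
cache.access(0x200)   # miss, cache now full
cache.access(0x100)   # hit, 0x100 becomes most recently used
print(cache.access(0x300))   # miss; evicts 0x200, the LRU line
```

The point of the model matches the text: the OS sees only hits and misses from outside; the eviction decision is internal to the cache logic.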
For example, the L2 cache of a core in an HT-enabled Nehalem CPU like the i7 920/960, although shared by two hyperthreaded logical cores, is partitioned statically to keep any one thread from monopolizing all the resources. It is the same in Bulldozer, where the L3 cache is statically partitioned, 2 MB per module.
But in Sandy Bridge, the two logical cores of a physical core can access any position of the shared L2 cache, as the division is dynamic, decided by the hardware logic at run time. It is the same as the 2 MB L2 cache per module in Bulldozer, shared dynamically by both cores.
So you see, the shared cache mechanism is completely different in Nehalem and Sandy Bridge HT processors, but does it pose any challenge for the OS when assigning two tightly coupled threads to the two logical cores of one physical core? The answer is no, because the entire thing is handled by the cache management logic of the CPU.
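The static-versus-dynamic difference described above can be captured in a one-function sketch. This is an illustrative model of the capacity each hardware thread can claim, with made-up numbers, not vendor-documented behaviour:

```python
def per_thread_cap(total_lines, policy):
    """Toy model: upper bound on cache lines one hardware thread
    may occupy under the two sharing policies described above."""
    if policy == "static":
        # Fixed half of the cache, even if the sibling thread is idle.
        return total_lines // 2
    if policy == "dynamic":
        # Lines are granted on demand at run time, so a lone active
        # thread can end up occupying the whole cache.
        return total_lines
    raise ValueError("unknown policy: " + policy)

# With a hypothetical 512-line cache and the sibling thread idle:
print(per_thread_cap(512, "static"))   # 256 -- half the cache goes unused
print(per_thread_cap(512, "dynamic"))  # 512 -- the idle half is reclaimed
```

Either way, the OS never sees this decision; it is internal to the cache logic, which is why thread placement does not depend on it.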
Actually, one Bulldozer module lies between a true dual-core module and an SMT-enabled module. It does have some independent units, mainly the integer execution units, and some shared units, like the front-end (fetch/decode, OOO) and the FPU.
What you are saying, treating them as one module rather than two separate cores, will come with Windows 8.
Consider the following example, and please make sure you read it:
Suppose we have two threads, T and T1, which are tightly coupled, so they share a lot of common resources and have interdependent instructions. Now consider a dual-module Bulldozer, with module M1 containing cores C1 and C2, and module M2 containing C3 and C4.
To Windows 7 there is no such thing as M1 and M2; it only sees C1, C2, C3 and C4 as four independent cores, and may assign T to C1 and T1 to C4.
Now whenever C1 faces an instruction from T which depends upon some instruction(s) of T1, it has to stall it and push it to the waiting queue until C4 completes those instructions in T1 and writes the result back to the L3 cache, if it is available. Then C1 has to read that value from the L3 cache and proceed with the dependent instruction.
So it will waste lots of valuable CPU cycles because:
1. The L3 cache is far slower than the L2 cache, which is nearer to the CPU.
2. C4 has no knowledge that executing the specific instruction from T1's instruction queue first would speed up the execution of thread T on C1, even if it could do so.
3. C1 and C4, as they belong to different modules, have different front-ends (fetch/decode/out-of-order execution). As they have no knowledge of each other's state, their out-of-order logic cannot reschedule instructions to speed up the execution of both threads.
4. Although T and T1 are interrelated, they are placed in two different modules. Because of that, the single-threaded performance of each module is poor due to the loss of valuable CPU cycles. But the worst thing is: as the threads are using the resources of both modules, Turbo Boost will be disabled and cannot kick up the single-threaded performance.
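The stall pattern described above can be sketched with ordinary OS threads. This is only an analogy: the `Event` here stands in for the slow L3 write-back/read path between modules, and the thread names are the T/T1 from the example:

```python
import threading

# Value produced by T1, which T needs; in the scenario above it
# would have to travel through the slower L3 cache between modules.
result = {}
ready = threading.Event()

def t1_work():
    result["value"] = 6 * 7   # C4 computes the value T depends on
    ready.set()               # analogue of the write-back T is waiting for

def t_work(out):
    ready.wait()                      # C1 stalls on the dependency
    out.append(result["value"] + 1)   # dependent instruction proceeds

out = []
a = threading.Thread(target=t_work, args=(out,))
b = threading.Thread(target=t1_work)
a.start(); b.start()
a.join(); b.join()
print(out[0])  # 43
```

The wasted time is exactly the window where T sits in `ready.wait()`; putting both threads on one module shortens that window because the shared L2 replaces the L3 round trip.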
Now suppose Windows 8 can identify a Bulldozer module, so it treats C1 and C2 as part of M1, and C3 and C4 as part of M2. Then C1 and C2 will be treated not as independent cores, but as something between an SMT core and a dual core.
Now the OS has full access to each of the threads, and by judging their nature, shared resources, memory locations, or the CONTEXT of each thread, it can easily understand the interrelation between T and T1. So it will assign both of them to module M1, i.e. to C1 and C2. Now look at the improvements:
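A module-aware placement decision like the one just described can be sketched in a few lines. This is a toy scheduler for illustration; the module/core names match the example, and the pairing logic is invented, not Windows 8's actual algorithm:

```python
def assign(modules, threads, coupled_pairs):
    """Toy module-aware scheduler: place each tightly coupled pair
    of threads on the two cores of a single module, then spread any
    remaining threads over the leftover cores."""
    placement = {}
    free_modules = list(modules)
    for a, b in coupled_pairs:
        c1, c2 = free_modules.pop(0)   # claim one whole module
        placement[a], placement[b] = c1, c2
    leftover_cores = [c for m in free_modules for c in m]
    for t in threads:
        if t not in placement:
            placement[t] = leftover_cores.pop(0)
    return placement

modules = [("C1", "C2"), ("C3", "C4")]
print(assign(modules, ["T", "T1"], [("T", "T1")]))
# {'T': 'C1', 'T1': 'C2'} -- both threads land on M1, M2 stays free
```

A module-blind scheduler is what Windows 7 does in the example: it sees four flat cores and may scatter the pair across M1 and M2.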
1. Common fetch and decode: Since both threads share resources, i.e. the same memory locations and the same operands, these need to be fetched only once. As you are minimizing slower memory accesses, which are expensive in CPU cycles, performance will improve through recovery of the wasted cycles.
2. Out-of-order rescheduling of both threads' instructions: As you know, the OOO logic reorders instructions to increase the number of simultaneous executions. Now it has two threads full of instructions, many of them interdependent, so the out-of-order execution unit can reschedule them more effectively to increase parallel execution. In HT-enabled cores the OOO logic does the same thing, but here, with two discrete integer execution units available, the process will be far superior.
In computer science it is observed that OOO logic can normally fetch 2 parallel instructions from a single thread in one cycle, and that one core can normally run 4 instructions in parallel using pipelining.
So here, at least for integer work, we have a maximum of four instructions from T and T1, and two cores to run them. So you can understand the speed of execution.
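The arithmetic behind that maximum, using only the figures quoted above (illustrative back-of-envelope numbers, not measured throughput):

```python
# Back-of-envelope from the figures above: ~2 instructions fetched
# per thread per cycle, two coupled threads feeding the two integer
# cores of one module.
fetch_per_thread = 2
threads = 2
instructions_per_cycle = fetch_per_thread * threads
print(instructions_per_cycle)  # 4 integer instructions per cycle, peak
```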
3. Data sharing using the L2 cache: As C1 and C2 share the same dynamically partitioned L2 cache, data sharing between them through it will be far faster.
4. Turbo Boost: As you can see, module M2 is not in use here, so Turbo Boost can put it to sleep and raise the frequency of M1 to increase processing speed.
Hope you understand now.