I'm currently putting the last-minute touches on my article where I've explained everything about multithreading... from a single-threaded, single-core setup to a hyper-threaded one, including how the operating system handles it, with block diagrams and examples. It is huge, though, so you guys will need to be a little patient to read it.
But for now, a few words about HT and Bulldozer cores:
Actually, one Bulldozer module will beat one hyper-threaded core of an Intel i7, but if you consider one module as two cores, then each core's performance is not as good as that of an Intel Sandy Bridge hyper-threaded core.
There is a common misconception that Hyper-Threading is plain TLP and nothing to do with ILP, which is actually wrong. Holding two threads in the CPU's thread registers and switching to one when the other is busy or waiting for resources... which most people think of as the basic concept of Hyper-Threading... is really a different technique called Superthreading, not Hyper-Threading.

Hyper-Threading improves thread-level parallelism by exploiting instruction-level parallelism. In HT, if the CPU front end (fetch, decode, out-of-order logic) can issue 4 instructions per cycle and the current thread can only supply two, then an HT-enabled processor can issue another two instructions from the 2nd thread. So a total of 4 instructions are executed by the CPU in that cycle, two coming from Thread 1 and two from Thread 2, and some instructions from both threads are executed simultaneously. In contrast, Superthreading (which most of us wrongly call Hyper-Threading) can only issue instructions from one of the two threads in any particular cycle, never from both.

But this is the best-case scenario, where the two threads are completely independent of each other. HT performance drops when dependencies exist between the threads, since a dependent instruction has to wait for the instruction it depends on to finish, and both threads' instructions cannot all be processed at once because there is only one set of execution units.
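To make the difference concrete, here is a tiny toy sketch (my own illustration, not from my article; the names and the numbers, like a 4-wide front end and 2 ready instructions per thread per cycle, are just assumptions) that counts how many cycles are needed to drain two threads when each thread can only supply 2 independent instructions per cycle:

```python
# Toy model: SMT (Hyper-Threading style) vs Superthreading (coarse switching).
# All numbers here are illustrative assumptions, not real hardware figures.

ISSUE_WIDTH = 4        # instructions the front end can issue per cycle
PER_THREAD_ILP = 2     # independent instructions each thread can offer per cycle

def cycles_smt(n1, n2):
    """SMT: the issue slots are filled from BOTH threads in the same cycle."""
    cycles = 0
    while n1 > 0 or n2 > 0:
        slots = ISSUE_WIDTH
        take1 = min(PER_THREAD_ILP, n1, slots)   # slots taken by thread 1
        slots -= take1
        take2 = min(PER_THREAD_ILP, n2, slots)   # leftover slots go to thread 2
        n1 -= take1
        n2 -= take2
        cycles += 1
    return cycles

def cycles_superthreading(n1, n2):
    """Superthreading: only ONE thread may issue in any given cycle."""
    cycles = 0
    remaining = [n1, n2]
    current = 0
    while remaining[0] > 0 or remaining[1] > 0:
        if remaining[current] == 0:       # nothing left here, use the other thread
            current = 1 - current
        remaining[current] -= min(PER_THREAD_ILP, remaining[current])
        current = 1 - current             # alternate threads every cycle
        cycles += 1
    return cycles

print(cycles_smt(8, 8))              # 4 cycles: 2 + 2 slots filled every cycle
print(cycles_superthreading(8, 8))   # 8 cycles: half of the 4 slots stay empty
```

With SMT the four slots get filled 2 + 2 every cycle, while the Superthreading-style core leaves half its slots empty, which is exactly the point above.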
Consider the following example:
Thread 1 or T1: Instructions I1, I2, I3, I4 and I5 are present
Thread 2 or T2: Instructions I6, I7, I8, I9 and I10 are present.
Thread 3 or T3: It is in the thread queue, waiting to be picked up once one of the threads T1 and T2 completes. It has instructions I11, I12, I13, I14 and I15, all of them independent.
Dependencies: I1 and I2 are independent, I4 is dependent upon I6, I5 is dependent upon both I4 and I10.
I6 is independent, I7 is dependent upon I2, I8 is dependent upon I5. I9 and I10 are independent.
Case 1: One hyper-threaded i7 core with two logical processors:
In an HT processor which can issue 4 instructions in a single cycle, I1, I2, I6 and I9 can be issued in the 1st cycle as they are independent. Now in the 2nd cycle only I4 (I6 was completed in the 1st cycle), I7 (I2 is completed) and I10 can be issued, as all the remaining instructions are dependent. So one issue slot is wasted.
In cycle 3 only I5 (I4 and I10 are done) can be issued, as I8 requires I5 to be finished, so three issue slots are wasted. Only in cycle 4 can I8 be issued. So in total 4 cycles are required for T1 and T2. T3 is fetched in the 5th cycle, and since it has five instructions it needs two more cycles (6 and 7) to go through the 4-wide front end. So the total number of cycles to finish the three threads is 7.
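If you want to check the Case 1 numbers yourself, here is a small sketch (my own, with the dependency table typed in from the example above; I've left I3 out because its dependencies were never specified) of the 4-wide issue model used here. Each cycle it issues up to 4 instructions, in order, whose producers finished in an earlier cycle:

```python
from typing import Dict, List, Set

ISSUE_WIDTH = 4

# instruction -> the instructions it depends on (taken from the example above;
# I3 is omitted because the example never states its dependencies)
DEPS: Dict[str, Set[str]] = {
    "I1": set(), "I2": set(), "I4": {"I6"}, "I5": {"I4", "I10"},        # Thread 1
    "I6": set(), "I7": {"I2"}, "I8": {"I5"}, "I9": set(), "I10": set(), # Thread 2
}

def simulate(instructions: List[str], deps: Dict[str, Set[str]]) -> int:
    """Issue up to ISSUE_WIDTH ready instructions per cycle; return total cycles."""
    done: Set[str] = set()
    pending = list(instructions)
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for ins in pending:
            if len(issued) == ISSUE_WIDTH:
                break
            if deps[ins] <= done:        # all producers completed in earlier cycles
                issued.append(ins)
        print(f"Cycle {cycle}: {issued}")
        done.update(issued)
        pending = [i for i in pending if i not in issued]
    return cycle

print("total:", simulate(list(DEPS), DEPS))
```

Running it prints exactly the schedule above: I1/I2/I6/I9, then I4/I7/I10, then I5, then I8, for a total of 4 cycles.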
Case 2: One Bulldozer module with two cores and a shared front end:
Now consider a Bulldozer module where two separate execution units are available, and the two threads inside the module are shared among all the execution units inside the module, which here means two. Call them Core 1 and Core 2; they can share all the data, threads and instructions present inside the module.
So in the 1st cycle Core1 will have I1, I2, I6 and I9, and Core2 will have I10 (these five are the only instructions with no unfinished dependencies).
In the 2nd cycle, Core1 will have I4 and I5, and Core2 will have I7 and I8 loaded but waiting on their dependencies.
Now let's divide the 2nd cycle into two time frames: the cycle's time span = t1 (the time to execute the 1st instruction) + t2 (the time to execute the 2nd instruction), so both t1 and t2 are shorter than clock cycle 2's total time period.
At the end of t1, Core1 will finish I4 (I6 was done in cycle 1) and Core2 will finish I7 (I2 was done in the 1st cycle). At the end of t2, I5 is executed, as I4 and I10 are done, while I8 is still sitting in Core2's execution unit waiting for I5.
So now Core1 is completely free and Core2 has 1 instruction left. So at the beginning of cycle 3, T3 is fetched; I11, I12, I13 and I14 are assigned to Core1 and I15 is assigned to Core2, since it still has 3 empty slots.
Now at the end of the 3rd clock cycle, T1, T2 and T3 are all completed.
So the advantage over the HT core is 4 cycles (3 cycles versus 7).
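Just to tie the two cases together, here is the same walkthrough written out as data (again my own sketch; the per-core assignments are copied straight from the text above, with I3 left out as before), so the cycle counts can be checked:

```python
# Case 2 schedule from the walkthrough above (which instructions run on which
# core in which cycle), plus the Case 1 HT schedule for comparison.
BULLDOZER_SCHEDULE = {
    1: {"Core1": ["I1", "I2", "I6", "I9"],     "Core2": ["I10"]},
    2: {"Core1": ["I4", "I5"],                 "Core2": ["I7"]},   # I8 still waits on I5
    3: {"Core1": ["I11", "I12", "I13", "I14"], "Core2": ["I8", "I15"]},
}

HT_SCHEDULE = {
    1: ["I1", "I2", "I6", "I9"],
    2: ["I4", "I7", "I10"],
    3: ["I5"],
    4: ["I8"],
    5: [],                          # thread switch: T3 is fetched
    6: ["I11", "I12", "I13", "I14"],
    7: ["I15"],
}

bd, ht = max(BULLDOZER_SCHEDULE), max(HT_SCHEDULE)
print(f"Bulldozer module: {bd} cycles, HT core: {ht} cycles, advantage: {ht - bd}")
# -> Bulldozer module: 3 cycles, HT core: 7 cycles, advantage: 4
```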