Processor Page

Bulldozer929

Broken In
The cache controller is always observing the memory positions being loaded and loading data from several memory positions after the memory position that has just been read. To give you a real example: if the CPU loaded data stored at address 1,000, the cache controller will load data from "n" addresses after address 1,000. This number "n" is called the page; if a given processor is working with 4 KB pages (which is a typical value), it will load data from the 4,096 addresses after the current memory position being loaded (address 1,000 in our example). By the way, 1 KB equals 1,024 bytes, which is why 4 KB is 4,096 and not 4,000. Figure 7 illustrates this example.
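Not part of the quoted article, but here is a minimal C sketch of why this prefetching of nearby addresses matters (assuming a typical desktop x86 machine and gcc; exact numbers will vary): summing a large array sequentially usually runs noticeably faster than touching the same elements with a 16 KB stride, because the sequential reads keep hitting data the cache controller has already pulled in.

[CODE]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)   /* 16M ints (64 MB), far larger than any cache */

int main(void)
{
    int *a = malloc(N * sizeof *a);
    long sum = 0;
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = 1;

    /* Sequential pass: each read is next to the previous one, so the
       data the cache controller prefetched is already sitting in cache. */
    clock_t t0 = clock();
    for (size_t i = 0; i < N; i++) sum += a[i];
    clock_t t1 = clock();

    /* Strided pass: same number of reads, but jumping 4096 ints (16 KB)
       each time defeats the prefetcher and the cache. */
    for (size_t s = 0; s < 4096; s++)
        for (size_t i = s; i < N; i += 4096) sum += a[i];
    clock_t t2 = clock();

    printf("sequential: %.2f s   strided: %.2f s   (sum = %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(a);
    return 0;
}
[/CODE]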

What are the values for current SNB processors? Do all previous Core 2 processors have 4 KB pages? What about AMD?

also does hyperthreading mean two execution engines running in parallel?

bump!!

Also, could someone please explain a pipeline in a microprocessor? You can use technical terms; I am familiar with most. From what I have gathered so far, it is a series of instructions that always remain in the CPU and allow the fetch unit and the units after it to perform tasks simultaneously instead of waiting for a task to finish. Am I correct?
 

skeletor

Chosen of the Omnissiah
also does hyperthreading mean two execution engines running in parallel?
Intel's HyperThreading is running two threads simultaneously in a single core.

What are the values for current SNB processors? Do all previous Core 2 processors have 4 KB pages? What about AMD?
Whatever it may be, it doesn't make much of a difference as of today.

I think it is just like the cluster sizes we have in file systems. You can format your hard disk partition with NTFS using a 2 KB cluster size or a 4 KB cluster size. When the OS reads a tiny file on the hard disk, it actually reads the whole cluster.
 

Jaskanwar Singh

Aspiring Novelist
Actually, from what I am able to understand, hyperthreading puts the idle execution engines to work on another thread. Maybe I am wrong, not sure.
 

Cilus

laborare est orare
Hyperthreading does not mean two execution engines are running in parallel. Hyperthreading is the Intel name for SMT, or Simultaneous Multi-Threading.

Inside the processor die there is normally a single set of execution units: a single ALU, FPU and fetch unit. HyperThreaded processors, however, have an extra set of registers, the thread registers.
While data is being fetched from a slower memory location, the branch predictors fetch the current instruction and the likely next instructions to be executed (estimated by different prediction logic) and store them in the thread registers.
Now, when the current instruction is waiting for resources, say another memory fetch or an I/O operation like printing, the CPU fetches the next thread's block of instructions from the thread registers and starts executing it.
Here the CPU does not need a memory read, because the instructions are already in the thread registers, and accessing a CPU register is vastly faster than a slow memory read. As a result the CPU gains some extra throughput (the amount of work done in a certain time) over a given time span.

But it will not necessarily increase performance all the time. For that, programs need to be highly multithreaded and the threads loosely coupled, i.e. largely independent. If the threads are tightly bound to each other for resources, then hyperthreading can actually decrease performance.
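To make that last point a bit more concrete, here is a small pthread sketch (my own illustration, nothing from Intel's documentation; compile with something like cc -O2 -pthread smt_demo.c). Two independent workers give SMT something useful to overlap, while two workers serialising on a single mutex-protected counter are "tightly bound" and gain little or nothing from extra hardware threads.

[CODE]
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 20000000L

/* volatile so the compiler keeps the loops instead of folding them away */
static volatile long counter_a, counter_b, shared_counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Loosely coupled: each thread bumps its own counter, so the two
   hardware threads never have to wait for each other. */
static void *independent_a(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) counter_a++; return NULL; }
static void *independent_b(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) counter_b++; return NULL; }

/* Tightly coupled: both threads serialise on one mutex, leaving almost
   no independent work for SMT (or even separate cores) to overlap. */
static void *contended(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Run two threads and return the wall-clock time they took together. */
static double run_pair(void *(*fa)(void *), void *(*fb)(void *))
{
    struct timespec t0, t1;
    pthread_t a, b;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, fa, NULL);
    pthread_create(&b, NULL, fb, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("independent threads: %.2f s\n", run_pair(independent_a, independent_b));
    printf("contended threads:   %.2f s\n", run_pair(contended, contended));
    return 0;
}
[/CODE]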
 

Jaskanwar Singh

Aspiring Novelist
Cilus, the waiting instruction (the one waiting for another memory read or I/O read), where does it remain? In the ALU or FPU (whichever is needed), or in the L1 data cache? And do the various ALUs or FPUs then start executing the next thread on top of the current one?

Actually, I am still learning. So there is a lot of confusion :-D
 

vickybat

I am the night...I am...
Cilus is absolutely correct about SMT. It is usually implemented in a superscalar architecture and allows multiple threads to be processed in a given pipeline stage at a time. So basically the thread register, as pointed out by Cilus, is a large register set that can hold instruction threads from separate tasks.

When an instruction is stalled due to delayed memory I/O, it is kept in one of the CPU's registers, or is still being fetched from main memory by the memory controller, until the CPU picks it up for execution. If the instruction is integer based, the ALU computes it; if it is floating point, the FPU handles it. The ALU and FPU are execution units of the processor and do not store anything.

CMT (chip-level multithreading) is based more on ILP (instruction-level parallelism), as found in all superscalar processors. SMT exploits TLP (thread-level parallelism).

However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing the throughput of computations per amount of hardware used (thanks to working out of CPU register memory).
 

Cilus

laborare est orare
Jas, in simple terms, every CPU has some registers to hold and process the instructions; your waiting instructions remain in these registers. In HT-enabled processors, apart from these registers, another set of registers is present to hold the extra thread. The ALU and FPU are execution units: they don't store anything, they only perform operations on the operands stored in the CPU registers, and the result is also stored in CPU registers.
 

Cilus

laborare est orare
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously.
It may be software based, where an intelligent compiler or scheduler is used, or hardware based, where hardware techniques like VLIW (Very Long Instruction Word) and Out-of-Order Execution (OOO) are used.

1. Software Based

Consider the following instructions:

1. x=y+z
2. c=a+b
3. p=m+n
4. m=x*c

Here instructions 1 and 2 are independent of each other, 4 is dependent upon 1 and 2, and 3 is dependent upon 4.

Let's assume that executing each of them takes time t. Now, if the CPU starts a sequential execution without the help of the compiler, executing 1 and 2 will take 2t. When 3 is fetched, it cannot be executed because the value of m has not been calculated yet, so a slot of time t is wasted. Then instruction 4 is fetched and executed, taking time t. Only now can the execution of 3 be started and completed. So the total time required is 5t.

Now let's reschedule the sequence of execution of the instructions:
1. x=y+z; c=a+b
2. m=x*c
3. p=m+n

As 1 and 2 were independent of each other, they can be executed simultaneously, so the time required to execute both of them in parallel is t.
For the new 2 and 3, the time taken is t + t = 2t.
So the total is 3t, and the speedup from this level of ILP is 5t/3t = 5/3.
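The same four statements written out in C, just to make the dependence chain visible (this is only my illustration; whether the reordering actually happens in the compiler or in an out-of-order core is an implementation detail):

[CODE]
#include <stdio.h>

int main(void)
{
    int y = 2, z = 3, a = 4, b = 5, n = 6;
    int x, c, m, p;

    /* The dependence chain from the example:
       x = y + z and c = a + b are independent of everything else,
       m = x * c needs both of them, and p = m + n needs m.        */
    x = y + z;      /* can issue in the same slot as the next line */
    c = a + b;      /* independent of x = y + z                    */
    m = x * c;      /* must wait for x and c                       */
    p = m + n;      /* must wait for m                             */

    /* A scheduler (compiler or out-of-order hardware) can overlap the
       two independent additions, so the chain costs roughly 3t, not 5t. */
    printf("x=%d c=%d m=%d p=%d\n", x, c, m, p);
    return 0;
}
[/CODE]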

Now, for implementing software-level ILP, you need basic hardware capability too. Examples are VLIW, pipelining and superscalar execution. We are not going to discuss all of these, but will look at two very basic hardware-level ILP techniques: pipelining and VLIW.

Pipelining: Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. This is achieved without additional hardware, simply by letting different parts of the hardware work on different instructions at the same time.

Consider the instruction cycle, i.e. the steps required to execute an instruction. For this, all CPUs have three main units:
a. Fetch Unit, b. Decode Unit, c. Execution Unit

1. Fetch Instruction

2. Decode Instruction

3. Fetch Operand: Normally an instruction specifies the locations of the operands and the operation to be performed on them. So Fetch Operand consists of two sub-steps:
i. Calculate the address of the operand (memory or CPU register)
ii. Fetch the operand

4. Execute Instruction
i. Execute the instruction (perform the operation on the operands)
ii. Write back the result.

Look at the chart

For Fetch Operation
Time Required: t1 ; Start Time: 0; End Time: t1

For Decode Operation
Time Required: t2 ; Start Time: 0+t1; End Time: 0+t1+t2

For Execution Operation

Time Required: t3 ; Start Time: 0+t1+t2 ; End Time: 0+t1+t2+t3

Now, when the decode starts at 0+t1, the fetch unit is sitting idle. Similarly, when the execution starts at 0+t1+t2, the fetch and decode units are sitting idle.
Rather than sitting idle, what can be done is this: at time 0+t1 the fetch unit can start fetching the next instruction; at time 0+t1+t2 the decode unit can start decoding the next fetched instruction; and at time 0+t1+t2+t3 the execution unit can start executing the next decoded instruction while the fetch unit starts fetching the instruction after that. This process goes on, and at every step you get a larger number of concurrent operations.
Obviously this is the simplest example, where we assume all the instructions are independent of each other. But if an instruction arrives that depends on a previous instruction whose data is not yet available, the situation is called a pipeline hazard. This can be handled in different ways, for example by halting the execution of the dependent instruction until the data it requires is available from the previous cycle.

THE BLOCK DIAGRAM: [original image link broken]
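To see the overlap, here is a toy C sketch (my own, nothing processor-specific) that prints which instruction occupies each stage of an ideal three-stage pipeline on every cycle, assuming no hazards:

[CODE]
#include <stdio.h>

#define STAGES     3    /* Fetch, Decode, Execute */
#define NUM_INSTR  5

int main(void)
{
    const char *stage_name[STAGES] = { "Fetch", "Decode", "Execute" };

    /* In an ideal pipeline, instruction i occupies stage s during cycle
       i + s, so 5 instructions finish in 5 + 3 - 1 = 7 cycles instead of
       the 5 * 3 = 15 cycles a strictly sequential CPU would need.       */
    int total_cycles = NUM_INSTR + STAGES - 1;

    for (int cycle = 0; cycle < total_cycles; cycle++) {
        printf("cycle %d:", cycle + 1);
        for (int s = 0; s < STAGES; s++) {
            int instr = cycle - s;             /* which instruction is in stage s */
            if (instr >= 0 && instr < NUM_INSTR)
                printf("  %s(I%d)", stage_name[s], instr + 1);
        }
        printf("\n");
    }
    return 0;
}
[/CODE]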

VLIW OR VERY LONG INSTRUCTION WORD:

As the name suggests, in a VLIW architecture the CPU actually fetches several instructions as a single one. There are a lot of similarities between VLIW, OOO (Out-of-Order Execution) and superscalar execution: in the last two, multiple execution units are placed inside a single CPU, giving it the ability to execute multiple operations simultaneously. The techniques used to achieve high performance, however, are very different, because the parallelism is explicit in VLIW instructions but must be discovered by the hardware at run time in superscalar processors.

A VLIW architecture relies heavily on a good compiler, because the instructions that are fetched together need to be as independent of each other as possible; otherwise there will be a run-time penalty resulting in multiple memory accesses.
In VLIW, since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware, which reduces its cost.
All the current generation processors.

Initially it was implemented in the embedded CPU market, where vendors add a software compiler (firmware) to their embedded systems. The GPU market, mainly AMD, is also an example of VLIW architecture.

It was first implemented in desktop processors when Intel's IA-64 Itanium processors adopted it.

One piece of info, guys: you know why AMD's 64-bit was a huge success? Because at that time Intel did not have native x86-64 support. They in fact used a modified VLIW-style approach where two 32-bit instructions were fetched as a single 64-bit instruction, and inside the CPU there was hardware logic to convert it to x86 instructions. This was a time-consuming process and the prediction logic was not up to the mark, so Intel's P4 processors lost to their AMD counterparts.


Thread-level parallelism (TLP)
It's pretty easy compared to ILP, and I think you guys know it already: running separate threads on different processors (multi-processor systems) or on multiple computers, or our Hyper-Threading or multi-core processors running multiple threads simultaneously.
 