Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously.
It may be software based, where an intelligent compiler or scheduler does the work, or hardware based, where hardware techniques like VLIW (Very Long Instruction Word) and Out-of-Order Execution (OoO) are used.
1. Software Based
Consider the following instructions:
1. x=y+z
2. c=a+b
3. p=m+n
4. m=x*c
Here instructions 1 and 2 are independent of each other, instruction 4 depends upon 1 and 2, and instruction 3 depends upon 4 (it needs the value of m that 4 computes).
Let's assume each of them takes time t to execute. Now if the CPU starts a sequential execution without help from the compiler, executing 1 and 2 takes 2t. When 3 is fetched, it cannot be executed because the value of m is not yet calculated, so a time t is wasted. Then instruction 4 is fetched and executed, taking t. Only now can 3 be executed, taking another t. So the total time required is 5t.
Now let's reschedule the sequence of execution of the instructions:
1. x=y+z; c=a+b
2. m=x*c
3. p=m+n
As original instructions 1 and 2 were independent of each other, they can be executed simultaneously, so executing both of them in parallel takes time t.
The new steps 2 and 3 take t + t = 2t, so the total is 3t.
So here the level of ILP (the speedup) is 5t/3t = 5/3.
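Here is a minimal sketch of how a compiler-style scheduler could arrive at the parallel schedule above. The dependency map and helper names are my own for illustration, not any real compiler's API:

[CODE]
# Each instruction maps to the instructions it depends on.
deps = {
    "x=y+z": [],                      # instruction 1
    "c=a+b": [],                      # instruction 2
    "m=x*c": ["x=y+z", "c=a+b"],      # instruction 4 needs 1 and 2
    "p=m+n": ["m=x*c"],               # instruction 3 needs 4
}

def schedule(deps):
    """Assign each instruction the earliest time slot after its inputs."""
    slot = {}
    remaining = dict(deps)
    while remaining:
        for ins, needs in list(remaining.items()):
            if all(n in slot for n in needs):
                slot[ins] = 1 + max((slot[n] for n in needs), default=0)
                del remaining[ins]
    return slot

slots = schedule(deps)
parallel_time = max(slots.values())   # 3 slots -> 3t
sequential_time = 5                   # 4 executions + 1 wasted slot (see text)
print(slots)   # {'x=y+z': 1, 'c=a+b': 1, 'm=x*c': 2, 'p=m+n': 3}
print(f"ILP speedup = {sequential_time}/{parallel_time}")   # 5/3
[/CODE]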
Now, implementing software-level ILP needs basic hardware capability too. Examples are VLIW, pipelining, and superscalar execution. We are not going to discuss all of these, but will go through two very basic hardware-level ILP techniques: pipelining and VLIW.
Pipelining: Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. This is achieved without additional hardware, simply by letting different parts of the hardware work on different instructions at the same time.
Consider the instruction cycle, i.e. the steps required to execute an instruction. For this, every CPU has three main units:
a. Fetching Unit, b. Decoding Unit, c. Execution Unit
1. Fetch Instruction
2. Decode Instruction
3. Fetch Operand: Normally an instruction specifies the locations of the operands and the operation to be performed on them. So Fetch Operand consists of two sub-processes:
i. Calculate the Address of the Operand (Memory or CPU Register)
ii. Fetch the Operand
4. Execute Instruction
i. Execute Instruction (Perform the operation on the operands)
ii. Write back the result.
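To make the cycle concrete, here is a toy sketch in Python that walks those phases for a made-up 3-address instruction format. The format and register values are invented for illustration, not any real ISA:

[CODE]
memory = ["ADD x y z", "ADD c a b", "MUL m x c", "ADD p m n"]
regs = {"y": 2, "z": 3, "a": 4, "b": 5, "n": 1}

pc = 0
while pc < len(memory):
    raw = memory[pc]                   # 1. fetch the instruction
    op, dst, src1, src2 = raw.split()  # 2. decode it
    v1, v2 = regs[src1], regs[src2]    # 3. fetch the operands
    result = v1 + v2 if op == "ADD" else v1 * v2   # 4.i execute
    regs[dst] = result                 # 4.ii write back the result
    pc += 1

print(regs)   # x=5, c=9, m=45, p=46
[/CODE]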
Look at the chart:

Operation | Time Required | Start Time | End Time
----------|---------------|------------|----------
Fetch     | t1            | 0          | t1
Decode    | t2            | t1         | t1+t2
Execute   | t3            | t1+t2      | t1+t2+t3
Now when Decode starts at t1, the Fetch unit is sitting idle. Similarly, when Execution starts at t1+t2, the Fetch and Decode units are sitting idle.
Rather than sitting idle, here is what can be done: at time t1, the Fetch unit can start fetching the next instruction; at time t1+t2, the Decode unit can start decoding that next fetched instruction; and at time t1+t2+t3, the Execution unit can start executing the next decoded instruction while the Fetch unit starts fetching the instruction after that. This process goes on, and at every step you get more concurrent operations.
Obviously this is the simplest example, where we are assuming all the instructions are independent of each other. But if an instruction comes in which depends upon a previous instruction and the data is not available yet, the situation is called a pipeline hazard. This can be addressed in different ways, for example by stalling the execution of the dependent instruction until the data it requires is available from the previous instruction.
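The timing argument can be sketched in a few lines of Python. This is an assumed model, not taken from any real hardware: each stage takes one time unit, only one instruction executes at a time, and an instruction listed in `stalls` waits one extra cycle for its data (the simplest hazard handling described above):

[CODE]
def pipeline_timeline(n_instructions, stalls=()):
    """Return (fetch, decode, execute) start cycles per instruction."""
    timeline = []
    exec_slot = 0
    for i in range(n_instructions):
        fetch = i                        # a new fetch can start every cycle
        decode = fetch + 1
        execute = max(decode + 1, exec_slot + 1)  # one execute at a time
        if i in stalls:                  # data hazard: wait one extra cycle
            execute += 1
        exec_slot = execute
        timeline.append((fetch, decode, execute))
    return timeline

for i, (f, d, e) in enumerate(pipeline_timeline(4, stalls={3})):
    print(f"I{i}: fetch@{f} decode@{d} execute@{e}")
[/CODE]

Without the pipeline, 4 instructions would need 4 x 3 = 12 time units; here the last execute starts at cycle 6 even with one stall.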
THE BLOCK DIAGRAM
[IMG]http://i52.tinypic.com/fmknl0.jpg[/IMG]
VLIW OR VERY LONG INSTRUCTION WORD:
As the name suggests, in a VLIW architecture the CPU actually fetches several instructions as a single one. There are a lot of similarities among VLIW, Out-of-Order Execution (OoO) and superscalar execution: in the latter two, multiple execution units are placed inside a single CPU, giving it the ability to execute multiple operations simultaneously. The techniques used to achieve high performance, however, are very different, because the parallelism is explicit in VLIW instructions but must be discovered by the hardware at run time in superscalar processors.
VLIW architecture relies heavily upon a good compiler, because the instructions that are going to be fetched together need to be as independent of each other as possible. Otherwise there will be a run-time penalty resulting in multiple memory accesses.
In VLIW, since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware, which reduces the cost.
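Here is a hedged sketch of that compiler-side idea: pack mutually independent operations into one wide bundle that the CPU then fetches as a single instruction. The bundle width and operation format below are made up for illustration:

[CODE]
BUNDLE_WIDTH = 2   # how many ops fit in one very long instruction word

# (destination, set of source operands) for the example instructions
ops = [
    ("x", {"y", "z"}),   # x = y + z
    ("c", {"a", "b"}),   # c = a + b
    ("m", {"x", "c"}),   # m = x * c
    ("p", {"m", "n"}),   # p = m + n
]

bundles, ready = [], set("yzabn")    # inputs available at the start
pending = list(ops)
while pending:
    bundle = []
    for op in list(pending):
        dst, srcs = op
        if srcs <= ready and len(bundle) < BUNDLE_WIDTH:
            bundle.append(op)
            pending.remove(op)
    ready |= {dst for dst, _ in bundle}   # results visible to next bundle
    bundles.append([dst for dst, _ in bundle])

print(bundles)   # [['x', 'c'], ['m'], ['p']] -> three wide instructions
[/CODE]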
Initially it was implemented in the embedded CPU market, where vendors added a software compiler (firmware) to their embedded systems. The GPU market is another example: AMD's GPU architectures in particular were VLIW based.
In desktop processors it was first implemented when Intel's IA-64 Itanium processors adopted it.
One piece of info, guys. You know why AMD's 64-bit was a huge success? Because at that time Intel did not have native x86-64 support. They in fact used a modified VLIW architecture where two 32-bit instructions were fetched as a single 64-bit instruction, and inside the CPU there was hardware logic to convert it to x86 instructions. This was a time-consuming process, and the prediction logic was not up to the mark, so Intel's P4 processors lost to their AMD counterparts.
Thread-level parallelism (TLP)
It's pretty easy compared to ILP, and I think you guys know it already: running separate threads on different processors (multi-processor systems) or multiple computers, or our hyper-threading or multi-core processors running multiple threads simultaneously.
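As a quick sketch (assuming Python and its standard concurrent.futures module; the task bodies are placeholders), here are two independent tasks handed to a process pool so the OS can run them on different cores. Processes are used instead of threads because CPython's GIL limits true parallelism for pure-Python CPU-bound work:

[CODE]
from concurrent.futures import ProcessPoolExecutor

def task(name, n):
    return f"{name} done: {sum(range(n))}"   # stand-in for real work

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(task, "A", 10_000_000),
                   pool.submit(task, "B", 20_000_000)]
        for f in futures:
            print(f.result())   # both tasks ran at the same time
[/CODE]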