Sukesh, thanks for pointing that out. Actually, I just shared a couple of facts that are not known to everyone.
Let's go through some info about the FMA4 instruction set. FMA stands for Fused Multiply-Add. Take a = b*c + d as an example. Without FMA, this can't be done by a single x86 instruction and has to be divided into separate smaller operations (which the CPU then decodes further into Micro Ops). So in a processor without FMA support,
1. The value of b*c will be calculated
2. Then it will be rounded off as per the required level of accuracy.
3. After that it will be added to d and the value will be rounded off again.
4. Now the total calculated value will be stored in a register and saved to the memory location where a is situated. Rounding off may be done once more as per the data type of the destination.
So here 4 steps and up to three levels of rounding off are required, resulting in the use of multiple CPU cycles and a reduced level of precision.
If you have an FMA enabled processor, then the FPU will execute a = b*c + d as one fused instruction instead of a separate multiply and add. As the FPU has support for FMA, b*c + d will be calculated in a single operation and only the final result will be rounded off. So the advantages are:-
1. Execution speed is faster
2. Higher level of accuracy because of only one rounding step
3. The intermediate product b*c is kept at full precision inside the FPU instead of being rounded.
Now AMD has implemented FMA4 in their Bulldozer (BD) architecture, while Intel is planning to introduce FMA3 in Haswell. None of Sandy Bridge, Sandy Bridge-E or Ivy Bridge have support for FMA.
In FMA4, the instruction takes four register operands: the destination a plus the three sources b, c and d. So once b, c and d are fetched from memory into CPU registers, the whole operation can be performed inside the CPU and the result lands in its own register without overwriting any of the inputs. That helps because going back to memory is a very slow process; a miss that goes all the way to main memory can make the CPU wait for hundreds of cycles.
On the other hand, in FMA3 the instruction has only three operands, so the result has to overwrite one of the sources. If the program still needs b, c and d afterwards, one of them has to be copied to a temp register first, and that costs an extra instruction.
So it is clear that FMA4 can be faster than FMA3.
Now here comes Intel's foul play. In their compiler instruction support specification back in 2010, they had shown FMA4 as a supported instruction set. But because of the hype of Bulldozer, just before its release, Intel dropped support for FMA4 and took FMA3 instead.
As a result, none of the major software vendors have implemented FMA4 support. There are a couple of open source programs, like the x264 encoder, that have support for it, and you can just check the benchmarks... In the AMD compiled x264 benchmark's 2nd pass, where the video actually gets converted, the FX-8150 is ahead of the 2600K even under Windows 7.
The good news is that AMD has signed a deal with Intel so that from the next iteration of their processors, starting with Piledriver, FMA3 will be used and Intel will support it in their compiler. So you can expect a serious performance boost in Piledriver because of compiler support from all major vendors, as Intel is going to support it.