Guys, here is my analysis of the new AMD GPU architecture. It took me more than 3 hours, so sorry it is a little bit lengthy. Please have a little patience and finish the reading. God, I had to go back to my old computer architecture books and the architecture of the Cray supercomputer for it. Please share your valuable feedback.
_____________________________________________________________
Well, as per my understanding, the architecture of the new AMD GPU actually uses the same concept as the famous Cell processor currently used in the PS3. There we have a PPE, or Power Processing Element, a simple in-order microprocessor based on the PowerPC architecture, which is responsible for basic scheduling, branch prediction and allocation of work to the lower units, plus a series of SPEs, or Synergistic Processing Elements, which are simple vector processors with their own local memory but lacking extra hardware for branch prediction and memory read/write. It is the duty of the PPE to schedule the tasks among the SPEs and to read/write from main memory. It uses standard memory management techniques like paging, indexing, virtual memory etc.
SPEs are very capable of fast execution of instructions stored in their own local store (256 KB each), they are interconnected to each other by a high-bandwidth ring bus (the Element Interconnect Bus), and they can access the local store of another SPE if required. These SPEs are basically vector units, and with a series of SPEs present, the Cell processor can perform a vector operation over a whole set of data in a single cycle through parallel computing, compared to the one-by-one execution of a scalar unit, or the instruction-level parallelism of today's superscalar, out-of-order, heavily pipelined microarchitectures where all the hardware logic is present in a single die. This approach actually reduces the complexity of each element present in Cell (as the whole prediction and scheduling job is now done by the PPE, while the raw parallel processing power sits in the SPEs), resulting in a lower transistor count and enabling it to run at very high speed.
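To make the scalar-versus-parallel contrast concrete, here is a minimal sketch in CUDA terms rather than Cell SPE intrinsics (the function names and sizes are mine, purely for illustration): the plain loop touches one element per step, while the kernel gives every element its own thread.

```cuda
#include <cstdio>

// Scalar style: one element per loop iteration, one after another.
void scale_scalar(float *a, int n, float k) {
    for (int i = 0; i < n; ++i)
        a[i] *= k;
}

// Data-parallel style: each element gets its own thread, and the
// hardware executes the threads in wide SIMD batches.
__global__ void scale_parallel(float *a, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= k;
}

int main() {
    const int n = 1 << 20;
    float host[4] = {1, 2, 3, 4};
    scale_scalar(host, 4, 2.0f);        // one by one on the CPU

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    scale_parallel<<<(n + 255) / 256, 256>>>(dev, n, 2.0f); // whole array in parallel
    cudaDeviceSynchronize();
    cudaFree(dev);
    return 0;
}
```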
If you compare it to the coming generation of the AMD architecture, you will find a lot of similarity. The instruction fetch logic, control and decoding logic, branch prediction logic etc. are handled by the scalar unit, as these operations don't involve any vector operations; they just use some algorithm to detect the next instruction to be fetched or which way a loop should branch. One striking point is that the GPU memory controller uses x86-64 memory management, which is a revolution, as to date no dedicated GPU has used x86-64 memory management. Instead they use vector memory management, where, to perform a parallel operation on a set of data, they fetch the set of memory locations containing that data into their registers and divide the work into blocks of instructions, each of which can be called a thread. So a thread is basically a group of related instructions required to perform a task.
Now comes the vector unit, consisting of 4 SIMD modules. These parts are similar to the SPEs of the Cell processor, with one difference: the whole block shares a single instruction and data cache. Again, each of the 4 vector modules is divided into geometry processing (vertex shader and pixel shader) and a tessellation unit.
The whole thing is considered a Compute Unit (CU), and there are several compute units available, each connected by an interconnect bus. Compute units are not the same as VLIW and are far more independent, as each unit now has its own fetch, decode and branch logic handled by the SCALAR unit, vector processing units for parallel processing, and caches for storing instructions and data.
For programming the GPU, nVidia has offered their CUDA library for quite some time now. The nVidia architecture relies heavily upon TLP, or thread-level parallelism. For graphics processing the GPU works more like a SIMD model: given a single task, say transforming a set of pixels to their complement, the GPU can perform the operation over all the pixels simultaneously rather than one by one. For implementing TLP it relies heavily upon hardware logic and hardware branch prediction, which increases the hardware cost. The other disadvantage is that programmers don't have access to, or knowledge of, the internal principles of the GPU, which actually restricts how far you can optimize: whatever you have implemented through programming logic, multi-threading, a high degree of parallelism, rescheduling the execution order of the instructions, you can't be sure that everything is going to run inside the GPU the same way.
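As a concrete illustration of that SIMD-style model, here is a minimal CUDA sketch (the kernel name and the 8-bit grayscale format are my assumptions) that complements every pixel of an image in one launch:

```cuda
#include <cstdint>

// Each thread complements exactly one pixel; the hardware runs the
// threads in SIMD batches (warps), so the whole image is processed
// in parallel rather than pixel by pixel.
__global__ void complement_pixels(uint8_t *img, int num_pixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_pixels)
        img[i] = 255 - img[i];   // 8-bit complement
}

// Launch with enough blocks to cover the image, e.g.:
// complement_pixels<<<(num_pixels + 255) / 256, 256>>>(d_img, num_pixels);
```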
To date AMD has been using the VLIW architecture. VLIW, on the other hand, relies heavily upon the software compiler for rescheduling, branching etc. The advantage is that it removes the necessity of costly hardware logic to do the above-mentioned tasks, resulting in reduced cost, smaller die size and lower power consumption.
The main problem with that architecture is that, to maximize the utilization of all 5 (or 4) slots, the software compiler needs to be highly optimized to rearrange the code structure so that it can create blocks of instructions (threads) which are very loosely coupled or independent, which is not an easy task. The other thing is that the missing hardware logic inside the shader cores slows down the processing capabilities, mainly when executing GP-GPU code, which may contain complex instructions that need to be broken down into multiple simple instructions to be executed in parallel. It is very hard in VLIW to move an instruction ahead at execution time, as there is no run-time scheduling logic present at the hardware level; the sketch below shows the kind of code this hurts.
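To see why the compiler's job is hard, consider this small sketch (plain arithmetic on made-up variables, purely illustrative): in the first function the four operations are independent, so a VLIW4 compiler can pack them into one wide instruction; in the second, each operation depends on the previous result, so three of the four slots sit idle each cycle and no run-time hardware exists to find other work for them.

```cuda
// Sketch only: the point is the dependency structure, not the math.
float vliw_friendly(float x0, float x1, float x2, float x3, float k) {
    // Independent: a VLIW4 compiler can bundle all four multiplies
    // into one wide instruction and issue them in a single cycle.
    float a = x0 * k;
    float b = x1 * k;
    float c = x2 * k;
    float d = x3 * k;
    return a + b + c + d;
}

float vliw_hostile(float x0, float x1, float x2, float x3, float k) {
    // Dependent chain: each line needs the previous result, so only
    // one of the four slots can be filled per cycle, and there is no
    // run-time scheduler to fill the idle slots with other work.
    float p = x0 * k;
    float q = p + x1;
    float r = q * x2;
    float s = r - x3;
    return s;
}
```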
The main problem with all current GPUs is memory management. All of you are aware that even the highest-end GPUs hardly use all the memory allocated to them. The reason is that GPU memory management is completely different from CPU memory management. The main logic here is to group the set of data on which the same operation needs to be performed into contiguous memory locations and fetch the whole block at a time to perform the operation over the whole set. For very complex and widely scattered data this approach is not efficient.
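This is the familiar "coalescing" rule. A minimal CUDA sketch of the two access patterns (the stride of 32 is just for illustration):

```cuda
// Coalesced: consecutive threads read consecutive addresses, so the
// hardware can fetch one contiguous block for the whole SIMD batch.
__global__ void read_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;
}

// Scattered: consecutive threads read addresses 32 elements apart,
// so one SIMD batch touches many separate memory blocks and most of
// each fetched block is wasted -- the inefficiency described above.
__global__ void read_strided(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 32 < n)
        out[i] = in[i * 32] * 2.0f;
}
```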
The new AMD architecture's advantage: the Compute Unit, as described by AMD, is almost an independent processor, having its own fetch, decode, scalar and branch units for fetching, decoding and scheduling, and a 4-module vector unit for raw processing power through parallel execution.
The 1st advantage is moving from software-based static compiler scheduling to hardware-based execution-time scheduling. With compiler scheduling, if it is found at run time that some rescheduling could avoid a cycle penalty or expose more parallelism, nothing can be done, as there is no run-time hardware scheduler and the whole execution follows a predefined path.
The 2nd thing is executing multiple groups, each of them containing a set of instructions. In the ideal scenario all of them will be independent and can run in a single compute unit. But if there are dependencies among them, VLIW has to wait until the dependent block is completed and its result is available, and only then can it start executing the next block. Here, just like Intel's HT, if say block A is waiting for a resource in compute unit C1, C1 can fetch another block of instructions, or thread, assigned to it and start processing that instead. This is called Simultaneous Multi-Threading, or SMT. One thing: the instructions present inside a particular block cannot be executed in that fashion; they need to be executed in order. A sketch of the idea follows.
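A minimal CUDA sketch of why this helps (the long global-memory load plays the part of "waiting for a resource"; the behaviour lives in the hardware, so the code itself is trivial and the comments carry the point):

```cuda
__global__ void latency_hiding(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // This load may take hundreds of cycles. Instead of stalling,
    // the CU switches to another resident thread group and runs its
    // instructions while this one waits -- the SMT-style behaviour
    // described above. Within this one thread, though, the multiply
    // below still waits for x: in-order execution inside a block.
    float x = in[i];
    out[i] = x * x + 1.0f;
}
```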
The 3rd thing is more efficient multi-threading performance. Now each CU can process a thread more efficiently due to the presence of the hardware units. VLIW can also perform multi-threading, but not as efficiently, due to the lack of hardware units and the dependency on the compiler.
The 4th thing is writing code. As the scheduling task is no longer required of the compiler, the programmer can write much more dynamic code, specify the execution pattern, and optimize the code to run in parallel more easily, because he can be sure that at execution time the hardware logic will take care of those things.
The 5th and revolutionary thing is the use of x86-64 memory management. This is a very well proven and highly efficient memory management scheme, and programmers can directly write highly multi-threaded and optimized code while developing applications, as most compilers already know how to take advantage of x86 memory management. If you are designing highly parallel, multi-threaded code for your application, x86-64 memory management can take care of it easily and will make sure that the execution path is the one you intended, not something rescheduled by the compiler. It also greatly increases memory usage efficiency through the use of the standard virtual memory concept.
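The practical payoff of a CPU-style virtual address space is that host and GPU code can eventually work on one allocation. CUDA's later unified memory API gives a feel for that style (this is the nVidia analogue, not AMD's implementation; cudaMallocManaged is a real CUDA call):

```cuda
#include <cstdio>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int *data;
    // One allocation visible to both CPU and GPU through a shared
    // virtual address space -- no explicit host/device copies.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;      // CPU writes
    increment<<<(n + 255) / 256, 256>>>(data, n); // GPU updates in place
    cudaDeviceSynchronize();
    printf("data[0] = %d\n", data[0]);            // CPU reads result: 1
    cudaFree(data);
    return 0;
}
```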
6th is the scalar unit: the scalar unit is responsible for looping, branching, prediction etc. But it can also execute completely independent single instructions, so those instructions don't need to be sent to the vector unit, saving valuable SIMD unit time.
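A small sketch of the kind of work that can stay scalar (the split is my illustration; on the real hardware the compiler decides what goes to the scalar unit): values that are identical for every thread in a group, like the loop counter and the row offset below, need to be computed only once rather than once per SIMD lane.

```cuda
__global__ void blend_rows(const float *in, float *out,
                           int width, int rows, float k) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;

    float acc = 0.0f;
    // The loop counter 'r' and the offset 'r * width' are the same
    // for every thread in the group: ideal scalar-unit work, freeing
    // the vector unit for the per-pixel math.
    for (int r = 0; r < rows; ++r)
        acc += in[r * width + col] * k;   // per-lane (vector) work
    out[col] = acc;
}
```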
Another thing is communication between multiple CUs when they are executing different threads. If CU1, executing thread T1, requires some input from thread T2 being executed on CU2, CU2 can share that data through the cache hierarchy shared by the CUs. This is somewhat similar to a superscalar architecture, where each CU is basically considered an execution unit and a superscalar core has multiple execution units under it. AMD's ACEs, or Asynchronous Compute Engines, actually enable these features and give the GPU an out-of-order execution capability: if it is found that executing in a different manner than the defined one can improve performance, the ACE can prioritize the tasks, allowing them to be executed in a different order than the one in which they were received.
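A minimal CUDA sketch of threads on different CUs communicating through the shared cache/memory (atomicAdd is a real CUDA primitive; the counter is just a stand-in for "T1 needs a value T2 produces"):

```cuda
// Blocks are scheduled onto different CUs, yet they can all
// communicate through one location in global memory; the shared
// cache levels service the atomic traffic between them.
__global__ void count_positive(const float *in, int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f)
        atomicAdd(counter, 1);   // cross-CU communication point
}
```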