AMD Graphics Core Next: The architecture of future AMD GPUs!

vickybat

I am the night...I am...
This will clear things up more.

This is the translated part.

Nice findings, JAS. Well done :thumbs:.
 

Skud

Super Moderator
Staff member
Nice find. And thanks Vicky for the Guru3D link. Already posted some links here:-

*www.thinkdigit.com/forum/graphic-c...ds-southern-islands-track-release-2011-a.html

And an update from AnandTech:-

AnandTech - AMD's Graphics Core Next Preview: AMD's New GPU, Architected For Compute
 

Cilus

laborare est orare
Guys, here is my analysis of the new AMD GPU architecture. It took me more than 3 hours. Sorry it is a little bit lengthy; please have a little patience and read it through. God, I had to go back to my old computer architecture books and the architecture of the Cray supercomputer for it. Please share your valuable feedback.

_____________________________________________________________

Well, as per my understanding, the new AMD GPU architecture actually uses the same concept as the famous Cell processor currently used in the PS3. There we have a PPE, or Power Processing Element, a simple in-order 64-bit core based on the PowerPC architecture, which is responsible for basic scheduling, branch prediction and allocation of work to the lower units, and a series of SPEs, or Synergistic Processing Elements, which are simple vector processors with their own local memory but lacking the extra hardware for branch prediction and memory read/write. It is the duty of the PPE to schedule the tasks among the SPEs and to read/write main memory. It uses standard memory management techniques like paging, indexing, virtual memory etc.
The SPEs are capable of very fast execution of instructions stored in their own local memory (256 KB of local store each), they are interconnected to each other by a 256-bit bus, and they can access the local store or instructions of another SPE if required. These SPEs are basically vector units, and with a series of SPEs the Cell processor can perform a vector operation on a whole set of data in a single cycle through parallel computing, compared to the one-by-one execution of a scalar unit or the instruction-level parallelism of current superscalar, out-of-order, heavily pipelined microarchitectures where all the hardware logic sits in a single core. This approach reduces the complexity of each element of Cell (all the prediction and scheduling is done by the PPE, while the raw parallel processing power sits in the SPEs), resulting in a lower transistor count and enabling it to run at very high speed.

If you compare this to the coming generation of AMD's architecture, you will find a lot of similarity. The instruction fetch, control and decode, and branch prediction logic are handled by the scalar unit, as these operations don't involve any vector work; they just use some algorithm to determine the next instruction to fetch or which way a branch or loop should go. One striking point is that the GPU memory controller uses x86-64 memory management, which is a revolution, as to date no dedicated GPU has used x86-64 memory management. Instead they use vector memory management: to perform a parallel operation on a set of data, they fetch the memory locations containing that data into their registers and divide the work into blocks of instructions, which can be called a thread. So a thread is basically a group of related instructions required to perform a task.
Now comes the vector unit, consisting of 4 modules. These parts are similar to the SPEs of the Cell processor with one difference: the whole block shares a single instruction and data cache. Again, each of the 4 vector modules is further divided into geometry processing (vertex shader and pixel shader) and tessellation units.
The whole thing is considered a Compute Engine, and there are several Compute Engines available, each connected by an interconnect bus. Compute Engines are not the same as VLIW and are far more independent, as each unit now has its own fetch, decode and branch logic handled by the scalar unit, vector processing units for parallel processing, and caches for storing instructions and data.

For GPU programming, nVidia has offered its CUDA library for quite some time now. The nVidia architecture relies heavily on TLP, or thread-level parallelism. For graphics processing a GPU works more like a SIMD model: for a single task, say transforming a set of pixels to their complement, the GPU can perform the operation over all the pixels simultaneously rather than one by one. For implementing TLP it relies heavily on hardware logic and hardware branch prediction, which increases the hardware cost. Another disadvantage is that programmers don't have access to, or knowledge of, the internal workings of the GPU, which limits how far GPU optimization can go: whatever you have implemented in your program logic, multi-threading, a high degree of parallelism, rescheduling the execution order of instructions, you can't be sure that everything will actually run inside the GPU the same way.
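
To make the SIMD idea concrete, here is a small CUDA sketch (kernel name, image size and launch numbers are mine, purely for illustration, not anything AMD or nVidia has published): every thread flips one 8-bit pixel, so the whole image is complemented in parallel instead of one pixel at a time.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Each GPU thread complements exactly one 8-bit pixel; thousands of
// threads run the same instruction on different data (the SIMD/SIMT model).
__global__ void complementPixels(unsigned char *img, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        img[i] = 255 - img[i];
}

int main()
{
    const int n = 1 << 20;                            // 1M pixels, made-up size
    unsigned char *h = (unsigned char *)malloc(n);
    for (int i = 0; i < n; ++i) h[i] = i & 0xFF;

    unsigned char *d;
    cudaMalloc(&d, n);
    cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);

    complementPixels<<<(n + 255) / 256, 256>>>(d, n); // one thread per pixel

    cudaMemcpy(h, d, n, cudaMemcpyDeviceToHost);
    printf("pixel[0] = %d\n", h[0]);                  // 255 - 0 = 255
    cudaFree(d);
    free(h);
    return 0;
}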

To date AMD has been using a VLIW architecture. VLIW, on the other hand, relies heavily on the software compiler for scheduling, branching etc. The advantage is that it removes the need for costly hardware logic to do the above-mentioned tasks, resulting in reduced cost, smaller die size and lower power consumption.
The main problem with that architecture is that, to maximize the utilization of all 5/4 slots, the software compiler needs to be highly optimized to rearrange the code so that it creates blocks of instructions (threads) which are very loosely coupled or independent, which is not an easy task. The other issue is that the missing hardware logic inside the shader cores slows down processing, mainly when executing GPGPU code, which may contain complex instructions that need to be broken down into multiple simple instructions to be executed in parallel. It is very hard in VLIW to move an instruction ahead at execution time, as there is no runtime scheduling logic present at the hardware level.
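
A rough illustration of the packing problem (plain host-side code, hypothetical functions and values): a static VLIW compiler can issue the four independent statements of the first function together, but the dependency chain in the second one leaves most of the 5/4 slots empty no matter how clever the compiler is.

#include <stdio.h>

// Four independent operations: the compiler can place these in the slots
// of a single VLIW instruction word and issue them in the same cycle.
float independent_ops(float a, float b, float c, float d)
{
    float w = a * 2.0f;
    float x = b + 1.0f;
    float y = c - 3.0f;
    float z = d * d;
    return w + x + y + z;
}

// A dependency chain: every line needs the previous result, so the
// operations have to issue one after another and the extra slots go idle.
float dependent_ops(float a)
{
    float w = a * 2.0f;
    float x = w + 1.0f;
    float y = x - 3.0f;
    return y * y;
}

int main()
{
    printf("%f %f\n", independent_ops(1, 2, 3, 4), dependent_ops(1));
    return 0;
}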

The main problem with all current GPUs is memory management. You are all aware that even the highest-end GPUs hardly ever use all the memory allocated to them. The reason is that GPU memory management is completely different from CPU memory management. The main idea is to group the data on which the same operation needs to be performed into contiguous memory locations and fetch the whole block at a time to perform the operation over the whole set. For very complex, scattered data this approach is not efficient.
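
In CUDA terms the contiguous-fetch idea looks roughly like this (array names and the stride value are made up): when neighbouring threads read neighbouring addresses the hardware can serve a whole warp/wavefront with one wide memory transaction, while a strided pattern throws most of every fetch away.

#include <cuda_runtime.h>
#include <stdio.h>

// Coalesced: thread i touches element i, so one contiguous transaction
// feeds the whole group of threads.
__global__ void coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Strided: thread i touches element i*stride, so each wide fetch brings in
// a block of which only one value is actually used.
__global__ void strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride] * 2.0f;
}

int main()
{
    const int n = 1 << 22;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    coalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    strided<<<(n / 32 + 255) / 256, 256>>>(in, out, n, 32);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);
    cudaFree(in); cudaFree(out);
    return 0;
}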

The new AMD architecture's advantages: the Compute Engine, as described by AMD, is almost an independent processor, having its own fetch, decode, scalar and branch units for fetching, decoding and scheduling, plus a 4-module vector unit for raw processing power through parallel execution.
The 1st advantage is moving from software-based static compiler scheduling to hardware-based execution-time scheduling. With compiler scheduling, if it turns out at run time that some rescheduling could avoid a cycle penalty or expose more parallelism, nothing can be done, as there is no runtime hardware scheduler and the whole execution follows a predefined path.

The 2nd thing is executing multiple groups, each containing a set of instructions. In the ideal scenario all of them are independent and can run on a single Compute Unit. But if there are dependencies among them, VLIW has to wait until the block being depended on is complete and its result is available before it can start executing the next block. Here, just like Intel's HT, if say block A is waiting for a resource on Compute Unit C1, C1 can fetch another block of instructions (thread) assigned to it and start processing that instead. This is called Simultaneous Multi-Threading, or SMT. One thing: the instructions inside a particular block cannot be executed in that fashion; they need to be executed in order.
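
Today's GPUs already do something very close to this when you give the scheduler more thread groups than the hardware can run at once; here is a hedged CUDA sketch of that latency hiding (sizes and names are mine, just for illustration):

#include <cuda_runtime.h>
#include <stdio.h>

// Each thread performs a long-latency memory load followed by a little ALU
// work. While one warp/wavefront waits on memory, the hardware scheduler
// switches to another ready warp -- the SMT-style overlap described above.
// Inside any single warp the instructions still complete in order.
__global__ void latency_hidden(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];           // stall point: global memory access
        out[i] = v * v + 1.0f;     // cheap math once the data arrives
    }
}

int main()
{
    const int n = 1 << 24;         // far more threads than execution units
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    // Oversubscribing the GPU gives the scheduler spare thread groups to
    // swap in whenever the current one is stalled on memory.
    latency_hidden<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[2] = %f\n", out[2]);   // 2*2 + 1 = 5
    cudaFree(in); cudaFree(out);
    return 0;
}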

The 3rd thing is more efficient multi-threading performance. Now each CU can process a thread more efficiently due to the presence of the hardware units. VLIW can also perform multi-threading, but not as efficiently, due to the lack of such hardware and its dependence on the compiler.

The 4th thing is writing code. As the scheduling task is no longer required of the compiler, the programmer can write much more dynamic code, specify the execution pattern and optimize for parallelism more easily, since he can be sure that at execution time the hardware logic will take care of those things.

The 5th and revolutionary thing is using x86-64 memory management. This is a very well proven and highly efficient memory management scheme, and programmers can directly write highly multi-threaded and optimized code while developing applications, as most compilers know how to take advantage of x86 memory management. If you are designing highly parallel, multi-threaded code, x86 memory management can take care of it easily and will help make sure the execution path is what you intended, not something rescheduled by the compiler. It should also greatly improve memory usage efficiency through the use of the standard virtual memory concept.
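
CUDA's managed memory gives a feel for what a shared, CPU-style virtual address space buys you (this is only an analogy from nVidia's side, not AMD's actual implementation; the variable names are mine): the same pointer is valid on both the CPU and the GPU, so no explicit staging copies are needed.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void addOne(int *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1;
}

int main()
{
    const int n = 1024;
    int *v;
    cudaMallocManaged(&v, n * sizeof(int));   // one pointer, one address space
    for (int i = 0; i < n; ++i) v[i] = i;     // the CPU writes through it...

    addOne<<<(n + 255) / 256, 256>>>(v, n);   // ...the GPU uses the same pointer
    cudaDeviceSynchronize();

    printf("v[10] = %d\n", v[10]);            // 11
    cudaFree(v);
    return 0;
}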

The 6th is the scalar unit: the scalar unit is responsible for looping, branching, prediction etc. But it can also execute completely independent single instructions, so those instructions don't need to be sent to the vector unit, saving valuable SIMD time.
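
The scalar-unit idea can be pictured with a branch example (a hedged sketch; kernel names and data are mine): when the condition is identical for every work-item it only needs to be evaluated once, which is exactly the kind of work a scalar unit can take off the SIMD lanes, whereas a data-dependent condition has to be handled on the vector side.

#include <cuda_runtime.h>
#include <stdio.h>

// 'mode' is the same for every thread in the launch: one test decides the
// path for all lanes, so the SIMD units never have to walk both sides.
__global__ void uniform_branch(float *v, int n, int mode)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (mode == 0) v[i] *= 2.0f;
    else           v[i] += 2.0f;
}

// Here the condition depends on the data, so different lanes may take
// different paths and the vector unit has to execute both of them.
__global__ void divergent_branch(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (v[i] > 0.0f) v[i] *= 2.0f;
    else             v[i] += 2.0f;
}

int main()
{
    const int n = 1024;
    float *v;
    cudaMallocManaged(&v, n * sizeof(float));
    for (int i = 0; i < n; ++i) v[i] = (i % 2) ? 1.0f : -1.0f;

    uniform_branch<<<4, 256>>>(v, n, 0);
    divergent_branch<<<4, 256>>>(v, n);
    cudaDeviceSynchronize();
    printf("v[0] = %f, v[1] = %f\n", v[0], v[1]);
    cudaFree(v);
    return 0;
}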

Another thing is communication between multiple CUs when they are executing different threads. If CU1, executing thread T1, requires some input from thread T2 being executed on CU2, CU2 can share that data through the cache shared among all the CUs. This is somewhat similar to a superscalar architecture, where each CU is basically an execution unit and the superscalar core has multiple execution units under it. AMD's ACE, or Asynchronous Compute Engine, actually enables these features and gives the GPU a degree of out-of-order execution capability: if it is found that executing tasks in a different manner than the defined one can improve performance, the ACE can prioritize them and allow them to be executed in a different order than the one in which they were received.
 

Skud

Super Moderator
Staff member
Still Hebrew to me, but at least your analysis is clearer to me than the online sites'.

Simply superb! :doublethumb:
 

ico

Super Moderator
Staff member
My analysis: This is Fermi done right and better.

Why? Read my second analysis.

Second analysis: This takes us much closer to real 'fusion' ;)

The future beyond Trinity - Bulldozer + VLIW4 - looks interesting. This has been an enormous year for AMD.
 

Cilus

laborare est orare
Ico, it is not that easy to implement general-purpose computing alongside gaming performance. What AMD has shown is actually only the general processing capability of their upcoming architecture, not how good it is going to be in gaming. An example is Fermi, which has better general processing capability than any AMD GPU, but in gaming it simply can't beat AMD, even with its superior architecture. It is too early to predict anything.
 

comp@ddict

EXIT: DATA Junkyard
I saw a chart where even the high-end Enhanced Bulldozer (next-gen Bulldozer) parts will have DX11-capable GPUs.

So AMD will use GPU cores to enhance CPU functions wherever the GPU is more efficient? An amazing plan, I think; hats off to AMD and their Fusion concept.
 

Cilus

laborare est orare
^^Thanks buddy. Let the other members have a look at it and share their opinions. BTW, I'm planning to write another one on OOO, or Out-of-Order Execution.
 

tkin

Back to school!!
@Cilus, wow, nice work there dude, and I read it, ALL of it. Very detailed analysis, and it lines up with pretty much what I learned in comp architecture and advanced OS this year. One thing is for sure: they are planning to decrease dependencies and hazards in the pipeline, with more advanced CPU-like branch prediction and also a multilevel shared cache.

Now, while all of this looks good in theory, in reality the entire implementation will heavily depend on the application. Just as on PCs the apps control threading performance and have to be explicitly written to use multi-core features, my question is: since games are ported from consoles to PC, will any developer rewrite their game engine completely (this has to be done in the engine core) to use this feature? If AMD does pull the entire thing off in software I'll applaud, but it's almost impossible. Take for example: can anyone run a 2-threaded application in Windows across 4 threads without rewriting the core? I guess they are trying to achieve true fusion, a general-purpose core for both the CPU and the GPU together.

One thing is for sure, this will require a lot of research. Even Larrabee could not pull it off (and Intel puts more money into research than AMD makes). I'll say 2015 at least?

But if done right, we could finally see real-time ray tracing in games (one more step towards achieving true photorealism; then photon mapping and we are set).
 

Cilus

laborare est orare
Tkin, actually AMD is moving away from software dependence to hardware branch prediction. The VLIW architecture is all about software compiler scheduling and relies completely on the application's code optimization and AMD's driver optimization. Although it is good in gaming, in GPGPU processing VLIW is behind nVidia's TLP-based architecture.
That's why they are changing the architecture to scalar + vector units with hardware prediction logic. As a result, even if the code path is not optimized at the software level, at runtime the hardware can reschedule the execution path to yield better performance. Even if the code path is already optimized, hardware prediction can still find a better execution path while the code is running, enhancing performance, a feature completely missing in the VLIW 5/4 design. But they need to take care that while increasing general processing performance, gaming performance is not hampered, if not actually increased.
 

tkin

Back to school!!
This is slowly starting to look like Fermi (better, actually). If they can reschedule at the hardware level that would be great (a true CPU), but that would complicate the logic and increase cost, as well as power consumption. Maybe at 22nm.
 

rchi84

In the zone
Now, for the hard part. Can AMD convert it all into a decently performing package?

I remember when the details of the 29xx series came out. It sounded like something from the future, with a 512-bit memory bus, a tessellator and the works. That didn't translate all that well into performance for two generations, until the 4xxx launch.
 