AMD Unveils its Heterogeneous Uniform Memory Access (hUMA) Technology

Cilus

The success of any new hardware architecture or concept depends on both the hardware implementation and the software support for it. If the two aren't in sync, the chances of success are very limited, no matter how advanced either of them is.
Remember when the first-generation x86 dual-core processors (Intel Pentium D and AMD Athlon 64 X2) were released: Windows XP wasn't really ready to understand the hardware design, even though the concept of multithreading had long existed in operating systems.
The same problem exists with the HSA approach. Even with current-generation CPU and GPU hardware using NUMA (Non-Uniform Memory Access), HSA can be implemented through some programming models. The problem is that there isn't widespread support in high-level, general-purpose languages like C++ or Python, which are the first choice of most programmers. There are a few special-purpose languages and add-ons like DirectCompute, OpenCL and CUDA, but general-purpose coding with them is a pretty tough job (see the sketch below). That's why we only see HSA-style acceleration in a limited class of applications like audio/video encoders and converters, encryption, compression and so on.
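
To illustrate why general-purpose coding with these APIs is a tough job, here is a minimal OpenCL host-side sketch in C++ for nothing more than adding two vectors (error handling and cleanup are omitted; a real program needs both). Note the explicit buffer copies, which exist precisely because the CPU and GPU don't share an address space in the classic discrete model:

Code:
#include <CL/cl.h>
#include <vector>
#include <cstdio>

// The kernel itself is trivial; the setup around it is not.
static const char* kSrc =
    "__kernel void vadd(__global const float* a, __global const float* b,"
    "                   __global float* c) {"
    "    size_t i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

    // Device buffers: inputs are copied in, the output is copied back out.
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a.data(), nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b.data(), nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float),
                               nullptr, nullptr);

    // The kernel source is compiled at runtime for the chosen device.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "vadd", nullptr);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c.data(),
                        0, nullptr, nullptr);

    std::printf("c[0] = %f\n", c[0]); // expect 3.0
    return 0;
}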

With hUMA (heterogeneous Uniform Memory Access) implemented at the hardware level, software can be tweaked fairly easily to accommodate it. Put simply, from a programmer's point of view the system looks like one multi-core machine where some of the cores are of one type (the CPU cores of a multi-core CPU) and some are of another (the GPU execution units). You just need to mark which instructions should be picked up by which kind of core (CPU, GPU, or shared by both), and you don't need to think about how the work is handled or distributed internally. In hardware, the CPU and GPU can coordinate with each other using interrupts (to stall one execution unit while the other is working on the same data), token passing and so on; the analogy below sketches that kind of handoff.
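
To make that coordination concrete, here is a CPU-only analogy of my own (not AMD's actual mechanism): two threads stand in for the two kinds of execution units, and an atomic flag stands in for the interrupt or token that hands ownership of a shared buffer from one unit to the other.

Code:
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    // One allocation, visible to both "execution units" with no copying,
    // which is the essence of the hUMA programming model.
    std::vector<int> shared(8, 1);
    std::atomic<bool> token{false};  // hands ownership from producer to consumer

    std::thread producer([&] {
        for (int& x : shared) x *= 2;                 // work on the shared data
        token.store(true, std::memory_order_release); // pass the token
    });

    std::thread consumer([&] {
        while (!token.load(std::memory_order_acquire)) {} // stall until handed off
        int sum = 0;
        for (int x : shared) sum += x;                // read the same buffer
        std::printf("sum = %d\n", sum);               // prints 16
    });

    producer.join();
    consumer.join();
    return 0;
}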

For example, Microsoft has already launched C++ Accelerated Massive Parallelism (C++ AMP) to support HSA, and it can be used with the AMD Fusion APUs, which are the closest thing to the HSA design to date. With widely accepted languages gaining this kind of support, developers can easily leverage the parallel processing capability of the GPU in a unified-memory design. So in the future, expect even day-to-day applications like Microsoft Word or Google Chrome to take advantage of hUMA.
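
For a sense of how much more approachable this is than the raw OpenCL sketch above, here is a minimal C++ AMP vector addition (a standard introductory pattern, not AMD-specific code). The array_view wrapper lets the runtime copy or share data as the hardware allows, so on a shared-memory APU the explicit transfers can disappear:

Code:
#include <amp.h>   // C++ AMP (Visual C++)
#include <vector>
#include <cstdio>
using namespace concurrency;

int main() {
    const int n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // array_view wraps host data; the runtime decides whether to copy it
    // to the accelerator or share it, depending on the hardware.
    array_view<const float, 1> av(n, a), bv(n, b);
    array_view<float, 1> cv(n, c);
    cv.discard_data();  // c is write-only, so skip copying it in

    // restrict(amp) marks the lambda as runnable on the GPU.
    parallel_for_each(cv.extent, [=](index<1> i) restrict(amp) {
        cv[i] = av[i] + bv[i];
    });

    cv.synchronize();   // make the results visible on the CPU side
    std::printf("c[0] = %f\n", c[0]);
    return 0;
}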
 
vickybat (OP)
Yes, at the hardware level things are progressing quite nicely. One thing I would like to point out on the memory side is the overall bandwidth that will be available.
In console implementations like the PS4, there is a single GDDR5 memory pool capable of impressive bandwidth, well over 100 GB/s (176 GB/s, to be precise).

This allows a uniform data feed for all the resources (CPU and GPU) attached to that memory. But in a PC, the two pools will have asymmetric bandwidth because they are of different types: main memory is DDR3 (as of now) while VRAM is GDDR5. So there may be added latency when a resource tries to access data from the slower memory type.
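
To put rough numbers on that asymmetry (assuming a typical dual-channel DDR3-1600 desktop, since no specific configuration is given): dual-channel DDR3-1600 delivers 2 channels × 64 bits × 1600 MT/s ÷ 8 = 25.6 GB/s, while the PS4's 256-bit GDDR5 at 5.5 Gb/s per pin delivers 256 × 5.5 ÷ 8 = 176 GB/s, roughly a seven-fold gap.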

Let's forget general-purpose apps and think about game code. AI, physics and other complex operations are usually handled by the CPU and don't involve vector data types; there is no big array of uniform data that can take advantage of the GPU's parallelism, so this data usually resides in main memory rather than VRAM. But physics code is evolving to use the GPU's parallel architecture and is becoming potentially more GPU-based. In a unified-memory PC, the GPU would then access physics data from main memory directly, albeit with a bandwidth constraint relative to GDDR5.

This is a limitation of the PC architecture that isn't present in the PS4. But there is a workaround: a cache of a faster memory type, which is just what the next Xbox is rumoured to be doing. There, an eSRAM cache is coupled with the slower memory type (in this case DDR3). The cache acts as a bridge between the slower memory and the execution resources, increasing effective bandwidth significantly and coming close to the faster GDDR5 figure. GPUs sporting unified memory access will need a workaround like this; the sketch below illustrates the staging idea in software.
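
Purely as an illustration of that staging idea (a software analogy of my own, not how eSRAM is actually programmed), this C++ sketch streams a large buffer living in "slow" memory through a small "fast" scratch buffer in tiles, so the bandwidth-hungry inner loop only ever touches the fast region:

Code:
#include <vector>
#include <algorithm>
#include <cstddef>

// Stream a large "slow memory" array through a small "fast" scratch buffer
// in tiles, emulating a bridging cache such as the rumoured eSRAM.
void scale_in_tiles(std::vector<float>& slow_data, float factor) {
    constexpr std::size_t kTileSize = 4096;   // stand-in for the small fast pool
    std::vector<float> fast_tile(kTileSize);  // the "eSRAM-like" scratchpad

    for (std::size_t base = 0; base < slow_data.size(); base += kTileSize) {
        const std::size_t len = std::min(kTileSize, slow_data.size() - base);
        // Stage one tile into fast memory...
        std::copy_n(slow_data.begin() + base, len, fast_tile.begin());
        // ...run the hot loop entirely against the fast region...
        for (std::size_t i = 0; i < len; ++i) fast_tile[i] *= factor;
        // ...then write the tile back out to slow memory.
        std::copy_n(fast_tile.begin(), len, slow_data.begin() + base);
    }
}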

Lastly, I would like to add that the inclusion of Python in general-purpose compute APIs is fantastic from a coding point of view, as it makes working with an SIMD model a lot easier.
Since Python uses lists instead of conventional arrays, it's a big relief for programmers, who don't have to deal with the complexities of arrays in lower-level languages like C++. Pushing and popping data in and out of an array means hand-writing stack or queue logic (LIFO or FIFO), and working with a large amount of data of multiple types gets complicated quickly (the C++ sketch below shows the kind of bookkeeping involved).
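
To ground that comparison, here is a small C++ sketch of my own showing the manual bookkeeping being described: a hand-rolled LIFO stack over a raw fixed-size array, which in Python collapses to data.append(x) and data.pop().

Code:
#include <cstddef>
#include <cstdio>

// A hand-rolled LIFO stack over a raw array: capacity, overflow and
// underflow all have to be managed by the programmer.
struct IntStack {
    static const std::size_t kCapacity = 1024;
    int data[kCapacity];
    std::size_t top = 0;                      // index of the next free slot

    bool push(int value) {
        if (top == kCapacity) return false;   // overflow handled manually
        data[top++] = value;
        return true;
    }
    bool pop(int* out) {
        if (top == 0) return false;           // underflow handled manually
        *out = data[--top];
        return true;
    }
};

int main() {
    IntStack s;
    s.push(1); s.push(2); s.push(3);
    int v;
    while (s.pop(&v)) std::printf("%d ", v);  // prints 3 2 1 (LIFO order)
    return 0;
}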

Lists in Python solve these problems to a great extent: they can hold mixed data types, traversal is straightforward (the for-in loop), and methods like append and pop are efficient and fast. Using regular expressions, complex operations on lists of strings can be done quite easily, and slicing with the ":" operator makes pulling out parts of a list trivial. So overall, array-style operations are much easier and more efficient to express in Python, and the GPU's SIMD design benefits largely from that.

Java also has ArrayList, which closely matches Python's list, but because of the JVM it's not ideal for the kind of low-level computation that GPGPU brings. HSA and GPGPU are the future models of computing.
 