Microprocessor core counts have only been increasing over the years. Raising the core count is one of the key methods by which the compute power of a processor is scaled: you can increase the core clock speed and/or you can increase the core count to get more processing power. However, both methods are easier said than done, as each has its own set of implications, which can be either negative or positive. It’s akin to taking medication: you get the desired effect, but there might be a few side-effects thrown into the mix as well.
When it comes to increasing clock speed, the obvious implication is increased power consumption and the resultant extra heat. And when you think of increasing the number of cores, you need to think about optimising the time taken for one core to talk to another. Think of it like a room you keep letting more people into: you can either have everyone talking at the exact same time, resulting in cacophony, or you can have a system of communication wherein a person raises their hand to be heard. Processors face a similar dilemma.
Ring Bus Architecture – The current solution
Back in the simpler days, each core within the CPU had its own dedicated or private interconnect to the L3 cache, or Last Level Cache. Physically, this meant that there would be hundreds (if not thousands) of tiny wires between each core and the L3. As more cores were needed, either an additional set of thousands of wires had to be laid down or a new method of communication had to be introduced. This is when the Ring Bus Architecture came into being.
Cores were placed adjacent to each other in uniform stacks and the interconnect would pass through one stack before hitting the L3 cache. While this did introduce a slight amount of latency between certain cores and the L3 cache, the net result was a much more optimised interconnect. This method of arranging cores in a ring wasn’t new; the world of networking had been using ring networks for ages. However, each system comes with its own disadvantages.
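To make the ring’s trade-off concrete, here is a minimal sketch in Python (the `ring_hops` helper is hypothetical, purely for illustration) of how the hop count between two stops on a bidirectional ring grows with core count. Traffic can travel either way around, so the worst case is half the ring, and it grows with every core added:

```python
def ring_hops(src: int, dst: int, num_cores: int) -> int:
    """Hops between two stops on a bidirectional ring:
    traffic takes the shorter of the two directions."""
    clockwise = (dst - src) % num_cores
    return min(clockwise, num_cores - clockwise)

# Opposite cores on an 8-stop ring are 4 hops apart;
# double the ring and that worst case doubles too.
print(ring_hops(0, 4, 8))    # 4
print(ring_hops(0, 8, 16))   # 8
print(ring_hops(7, 0, 8))    # 1 (wraps around)
```

This worst case scaling linearly with the number of stops is exactly the latency ceiling the article describes.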
Just as the previous method, wherein each core had its own interconnect path to the L3 cache, was limited in the number of cores it could optimally handle, the Ring Bus architecture had a limit of its own. The issue here was, again, latency while communicating between cores, and with the Skylake-SP and Skylake-X processors, Intel is introducing a new architecture – the Intel Mesh Architecture. Again, if we look back at networking, you’ll learn that mesh networks have been in existence for decades, and mesh is currently the most popular method of interconnecting an extremely large number of nodes. It was only natural for a similar system to be implemented in processors.
Introducing the Intel Mesh Architecture
Intel Mesh Architecture first made its foray into the limelight with Knights Landing or Xeon Phi as it’s better known. Knights Landing had 72 cores split into two segments of 36 cores each. All of these cores were arranged according to the new Mesh Architecture. And the upcoming Skylake-X and Skylake-SP Xeon processors will be utilising the Intel Mesh Architecture.
While this is just a sectional diagram of the entire processor, it is more than sufficient to get the hang of the Mesh Architecture. Ensuring a fast interconnect between cores and reduced latency is what this architecture is all about. With the Ring Bus architecture, adding more cores would have required adding more rings, which would then take up a footprint on the die that could otherwise have been allocated to cores. And latency would surely have increased, as moving data from one core to another would have required even more clock cycles.
With the Mesh Architecture, you can scale in an ever-increasing fashion. In this architecture, all cores are arranged in a grid-like manner and each core is connected to its neighbouring cores through a mesh formed by rows and columns of interconnect paths. The number of interconnect paths scales with the number of cores present on the die.
Communicating along a column is quicker than communicating along a row, though not in all cases. If the core in the bottom-left corner were to communicate with the core directly above it or directly beside it, that would take exactly one cycle; let’s call this one unit distance. However, if it has to communicate with a core along the row that’s two units away, it will take 3 cycles; three units away, 3 + 1 = 4 cycles, and so on. Vertically, it continues to cost just one cycle per unit. Thus, communicating with the cores that are farthest away requires considerably more cycles horizontally than vertically.
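The cycle counts above can be sketched as a tiny Python model. To be clear about what is assumed: the `mesh_cycles` function is hypothetical, and treating a trip as a row leg plus a column leg whose costs simply add is our simplification for illustration, not a statement of how Intel’s router actually schedules traffic.

```python
def mesh_cycles(h_units: int, v_units: int) -> int:
    """Cycle cost of a core-to-core trip under the numbers
    quoted above: each vertical unit costs 1 cycle; one
    horizontal unit costs 1 cycle, while a horizontal trip of
    two or more units costs one extra cycle (2 units -> 3,
    3 units -> 4, and so on). Legs are summed (an assumption)."""
    horizontal = h_units + 1 if h_units >= 2 else h_units
    return horizontal + v_units

print(mesh_cycles(1, 0))  # 1: adjacent core in the same row
print(mesh_cycles(2, 0))  # 3: two units along a row
print(mesh_cycles(3, 0))  # 4: three units along a row
print(mesh_cycles(0, 3))  # 3: three units along a column
```

Even under this toy model, the key property of the mesh shows through: worst-case latency grows with the grid’s width and height rather than with the total core count, which is what makes the layout scale.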
With Skylake-X and Skylake-SP, Intel’s QPI will make way for UPI, or UltraPath Interconnect (also known as Keizer Technology Interconnect, or KTI). Not much is known about this new interconnect aside from the fact that it is faster than the QPI it replaces, and that traffic will flow in both directions on the bus.
Another key advantage of the Mesh Architecture is reduced power consumption. Since communication between cores no longer needs to take round trips around a ring bus, the mesh ends up with a far greater amount of intrinsic bandwidth. The net result is lower power consumption for communication, and the savings can be reallocated to the cores for compute purposes. Hence, you can get more performance within the same TDP.
Overall, the new Intel Mesh Architecture has been built from scratch with the intention of scalability for generations to come. With AMD entering the server market with its EPYC 7000 series processors, the race has begun anew.