Cilus
laborare est orare
Graphics card:
I guess the choice of this component is actually the trickiest, and it generates more discussion than any of the other components. Better FPS or smoothness, compute performance or raw gaming performance, single-GPU or multi-GPU setup: all of these have been debated, and in this article I will try to tie them together.
FPS or Frames Per Second: This is the most common measure of performance, and most review sites actually stress this parameter alone. Higher FPS is better in most cases, but there are certain catches.
Gaming latency is very important while playing games, as it determines how responsive the GPU is between the player's input and the effect appearing on screen. Sometimes, even when the full frame rate is maintained, the game lags if a very heavy amount of post-processing is applied. Nvidia SLI setups do seem to have an advantage here, as per the HardOCP review of the GTX 660 Ti that Vicky posted. However, at 1080p this latency is almost negligible among current-generation graphics cards of the same league.
Discarded Frames: Most of today's monitors, including Anand's BenQ G2420HD, are capped at 60 Hz, so it is impossible to see more than 60 FPS in real time. Why, then, do we opt for cards that can sometimes deliver over 100 FPS? There is an effect in gaming, sometimes called plummeting, that reduces smoothness despite a 60 FPS average, and here the extra FPS come to the rescue. By using the discarded frames (the display can show only 60 frames per second while, say, 90 frames are available from the graphics card) and a simple algorithm, the GPU can enhance the output frame quality to provide a smoother gameplay experience as well as very good motion-blur effects. AMD has an edge here, as several reviews observe that the gameplay experience becomes smoother as FPS increases in games. I'm sure they will add new algorithms in upcoming drivers to take advantage of this and reduce gaming latency.
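None of us knows the driver internals, so as a rough illustration only, here is a small Python sketch of how frames that would otherwise be discarded could be blended into the displayed output; the frame representation, weights, and interpolation scheme are all my own assumptions, not any vendor's algorithm.

```python
# Sketch: a 60 Hz display fed from a 90 FPS render stream.
# Frames are modeled as flat lists of pixel intensities.

def blend(frame_a, frame_b, weight=0.5):
    """Weighted average of two rendered frames (a simple motion-blur blend)."""
    return [a * (1 - weight) + b * weight for a, b in zip(frame_a, frame_b)]

def display_stream(rendered_frames, render_fps=90, display_hz=60):
    """Produce display_hz output frames from render_fps rendered frames,
    blending adjacent rendered frames instead of discarding the extras."""
    shown = []
    for i in range(display_hz):
        t = i * render_fps / display_hz      # refresh position in render time
        lo = int(t)
        hi = min(lo + 1, len(rendered_frames) - 1)
        shown.append(blend(rendered_frames[lo], rendered_frames[hi], t - lo))
    return shown

frames = [[float(n)] for n in range(90)]     # 90 rendered "frames", 1 pixel each
out = display_stream(frames)                 # 60 displayed, interpolated frames
```

The point of the sketch is only that the 30 "extra" frames carry real information, so the blended output tracks motion more smoothly than simply dropping every third frame would.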
My vote here goes to AMD for that reason: at the same price point, AMD cards deliver a few extra frames while gaming.
Compute Performance
The GCN architecture is far more capable than both the Fermi and Kepler architectures when it comes to compute performance. Applications based on Microsoft's DirectCompute and the open OpenCL API run far better on the GCN design, thanks to its vector-processor-style layout, in which each compute unit of the GPU can handle a thread independently; each has its own local cache and read/write interface to boost performance. On the other hand, each SP in Kepler is not as powerful as in the Fermi architecture: although Kepler has a larger number of shader processors, a single shader cannot handle a thread alone, and needs a cluster of shaders to do so.
Gaming Performance.
Gaming performance depends on many factors: memory bandwidth, number of shader processors, number of ROPs and their performance, and texture-handling capabilities. The conventional gaming approach basically stresses vertex shader and pixel shader performance, since a GPU can perform a single operation over millions of pixels in parallel. Done conventionally, gaming effects like Ambient Occlusion, Depth of Field, HDR Bloom, and Motion Blur can all be implemented using pixel and texture operations. But these processes are very heavy, and older-generation GPUs got through them on raw parallel processing power alone. Now that games are growing ever more demanding on GPU rendering performance, turning all of those effects on and executing them the conventional way can take a heavy toll on the GPU, resulting in poor performance. So smarter, more efficient techniques are required to implement advanced effects in a game.
Are Gaming Performance and Compute Performance two different fields?
In most reviews, compute performance and gaming performance are compared as two faces of a coin, which leads to the belief that the two parameters are separate and that a GPU can excel in one field while doing poorly in the other. But the emergence of APIs like DirectCompute, OpenCL, and CUDA, which can directly access GPU resources such as the stream processor clusters, video RAM, and shared memory much as on a conventional CPU, enables developers to build advanced models such as vector-processor-style processing. As a result, we can have threads, data structures like arrays, structures, and classes, and object-oriented models in GPU programming, just as we do on the CPU, with one exception: the GPU can process many threads at once in parallel, unlike the sequential execution methodology of the CPU.
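To make the contrast concrete, here is a toy Python sketch (my own illustration, not any vendor's API) of the same per-element kernel run CPU-style, one element after another, and GPU-style, as a map of independent "threads"; on real hardware the mapped calls would execute simultaneously across stream processors.

```python
def kernel(x):
    """A per-element 'kernel': the same operation applied to every data item."""
    return x * x + 1

data = [1, 2, 3, 4]

# CPU-style: one thread walks the data sequentially.
sequential = []
for x in data:
    sequential.append(kernel(x))

# GPU-style: every element is an independent thread. map() expresses that
# there is no ordering dependency, so all calls could run at once.
parallel = list(map(kernel, data))

assert sequential == parallel   # same result, different execution model
```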
I will discuss some of the advanced graphics-enhancement techniques: how they have been implemented conventionally, and how GPGPU compute performance can improve them.
Ambient Occlusion or AO:
Definition: Ambient occlusion is a method to approximate how brightly light should shine on any specific part of a surface, based on the light and its environment. It is used to add realism.
Here, how a surface is illuminated is not calculated from a single point light source; instead, it is calculated by studying how the environment and surroundings of that surface interact with light. So a spot surrounded by other objects will be darker even though the light source is the same for all of them.
This technique is a global approach and needs to be applied over the whole image rather than to any specific object.
Z-Buffer or Depth Buffer: In computer graphics, z-buffering, also known as depth buffering, is the management of image depth coordinates in three-dimensional (3-D) graphics, usually done in hardware, sometimes in software. It is one solution to the visibility problem, which is the problem of deciding which elements of a rendered scene are visible, and which are hidden.
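As a minimal illustration of the depth-buffer idea (a toy example of my own, not real driver code), here is how a z-buffer resolves visibility when several fragments land on the same pixel:

```python
# Toy z-buffer: smaller depth means closer to the camera, and closer wins.
WIDTH, HEIGHT = 4, 3
FAR = float("inf")

depth = [[FAR] * WIDTH for _ in range(HEIGHT)]   # the z-buffer
color = [[None] * WIDTH for _ in range(HEIGHT)]  # the frame buffer

def draw_fragment(x, y, z, c):
    """Keep the fragment only if it is nearer than what is already stored."""
    if z < depth[y][x]:
        depth[y][x] = z
        color[y][x] = c

draw_fragment(1, 1, 5.0, "red")    # far object drawn first
draw_fragment(1, 1, 2.0, "blue")   # nearer object overwrites it
draw_fragment(1, 1, 9.0, "green")  # farther object is rejected
```

Whatever the draw order, the pixel ends up showing the nearest surface, which is exactly the visibility problem the definition above describes.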
Conventional Pixel Shader Approach:
The algorithm is implemented as a pixel shader, analyzing the scene depth buffer which is stored in a texture. For every pixel on the screen, the pixel shader samples the depth values around the current pixel and tries to compute the amount of occlusion from each of the sampled points. In its simplest implementation, the occlusion factor depends only on the depth difference between sampled point and current point.
In outline:
- For each pixel present in the image:
- Check the surrounding neighborhood to see whether it forms a concave region (fit a cone: is it concave or convex?); this determines whether the pixel's surroundings can occlude it.
- Results improve if the normal of the point is included in the check.
*i.imgur.com/uFk9g.jpg?1
Given a point P on the surface with normal N, here roughly two-thirds of the hemisphere above P is occluded by other geometry in the scene, while one-third is unoccluded. The average direction of incoming light is denoted by B, and it is somewhat to the right of the normal direction N. Loosely speaking, the average color of incident light at P could be found by averaging the incident light from the cone of unoccluded directions around the B vector.
Now, there is no smart method available to decide how many surrounding pixels must be taken into account to illuminate point P realistically. A brute-force algorithm would need the GPU to perform on the order of 200 texture reads per pixel, which is not possible in real time on current-generation hardware. So the following approximation is applied:
1. A random sample of pixels is taken from the surroundings of the point to be illuminated.
2. For each pixel in the sample, its depth value is read from the texture buffer of the graphics card.
3. Using the algorithm above, the GPU computes an approximate illumination level for the point.
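The three steps above can be sketched in a few lines of Python; the kernel shape, sample count, bias, and occlusion rule here are simplified assumptions of mine, not a real shader.

```python
import random

def ssao_factor(depth_buf, x, y, samples=16, radius=2, bias=0.05):
    """Approximate occlusion of pixel (x, y) from random neighboring depths.

    Each sampled neighbor that sits closer to the camera than the current
    pixel (depth difference beyond a small bias) counts as an occluder.
    """
    h, w = len(depth_buf), len(depth_buf[0])
    center = depth_buf[y][x]
    occluded = 0
    for _ in range(samples):
        sx = min(max(x + random.randint(-radius, radius), 0), w - 1)
        sy = min(max(y + random.randint(-radius, radius), 0), h - 1)
        if center - depth_buf[sy][sx] > bias:   # neighbor is in front of us
            occluded += 1
    return 1.0 - occluded / samples             # 1.0 = fully lit

# A pixel at the bottom of a pit is shadowed by its rim; a flat wall is not.
pit = [[1.0, 1.0, 1.0],
       [1.0, 5.0, 1.0],
       [1.0, 1.0, 1.0]]
flat = [[1.0] * 3 for _ in range(3)]

random.seed(0)
ao_pit = ssao_factor(pit, 1, 1)    # darkened: rim pixels occlude the center
ao_flat = ssao_factor(flat, 1, 1)  # no depth differences: stays fully lit
```

Note that every call to `depth_buf[sy][sx]` stands for one Z-buffer texture read; that is the cost the disadvantages below add up.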
Disadvantages:
1. Huge I/O cost. For each pixel, the GPU must access the Z-buffer or texture unit to get a depth value. At 1920x1080 resolution and 60 FPS, taking 100 neighboring-pixel samples per pixel, the total number of Z-buffer reads is 1920x1080 (pixels per frame) x 100 (sample size per pixel) x 60 (frames per second) = 12,441,600,000, or roughly 1.24 x 10^10 per second, which is huge even for the parallel processing power of a GPU. That number is obviously a rough estimate and can be reduced with various techniques, but it gives an idea of how the cost can affect the gameplay experience.
2. Because the sample of neighboring pixels is taken randomly, the output may not look realistic for the compute power it costs, and it sometimes creates unnecessary shadowing.
3. For parallel processing we must rely on the texture buffer or Z-buffer, which normally holds around 12 texels per SP cluster, while failing to take advantage of the local registers and cache memory of each stream processor.
4. Poor resource sharing. As you have probably realized, two neighboring pixels P1 and P2, situated very close together, will have almost the same surrounding pixels in common. If we could keep the sample values taken for P1 in GPU registers and first check whether they are also neighbors of P2, we could skip the entire sampling pass for P2. But unlike data on a CPU, these values cannot simply be kept in registers, because the sampling picks a random set of pixels and no information about them is retained.
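The I/O figure in point 1 is easy to verify; a few lines confirm the order of magnitude (the 100-sample kernel is the same assumption as in the text):

```python
pixels_per_frame = 1920 * 1080   # resolution: pixels in one frame
samples = 100                    # neighbor depth samples per pixel
fps = 60                         # frames rendered per second

reads_per_second = pixels_per_frame * samples * fps
# 2,073,600 * 100 * 60 = 12,441,600,000, i.e. ~1.24 x 10^10 Z-buffer reads/s
```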
GPU Computing based Approach:
First, we will discuss another algorithm for AO, one which uses ray tracing.
*i.imgur.com/Jytok.png
Local ambient occlusion in image-space:
(a) Rays emerging from a point P, many of which are not occluded (marked red).
(b) Rays constrained within a distance r_far, with distant occluders neglected (blue arrows).
(c) Pixels as viewed by a camera; neighboring pixels are obtained (marked in green).
(d) We de-project these pixels back to world-space and approximate them using spherical occluders (green circles). These are the final approximated occluders around the point that are sought in image-space. Note that the spheres are pushed back a little along the opposite direction of the normal (-n̂) so as to prevent incorrect occlusion on flat surfaces.
*i.imgur.com/iEIzb.gif
In this example we have two sampling points, A and B. At position A only a few rays hit the sphere, so the sphere's influence is small; at position B many rays hit the sphere and the influence is large, resulting in a darker color there.
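The A/B example can be sketched directly: shoot random rays from a point, count how many hit the sphere occluder, and use the hit fraction as the darkening term. This is my own 2D simplification (a circle instead of a sphere, a crude ray march) purely to show the relationship.

```python
import math, random

def occlusion_from_sphere(point, center, radius, rays=500, max_dist=10.0):
    """Fraction of random rays from `point` that hit a circular occluder.

    More hits -> more occlusion -> darker ambient term (position B);
    fewer hits -> brighter (position A).
    """
    hits = 0
    for _ in range(rays):
        theta = random.uniform(0.0, 2.0 * math.pi)   # random 2D direction
        dx, dy = math.cos(theta), math.sin(theta)
        # March the ray outward and test against the occluder.
        for step in range(1, int(max_dist * 10)):
            t = step * 0.1
            px, py = point[0] + dx * t, point[1] + dy * t
            if math.hypot(px - center[0], py - center[1]) <= radius:
                hits += 1
                break
    return hits / rays

random.seed(1)
near = occlusion_from_sphere((0.0, 0.0), (2.0, 0.0), 1.0)  # like B: close, darker
far = occlusion_from_sphere((8.0, 0.0), (2.0, 0.0), 1.0)   # like A: far, brighter
# ambient brightness = 1 - occlusion, so `near` shades darker than `far`
```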
So the algorithm for a compute shader will look something like this:
1. For every pixel in the image, perform ray tracing and identify the neighboring pixel samples required.
2. Check whether the depth values of the selected pixels are already in groupshared registers.
3. If yes, pick the values from there.
4. If no, read the Z-buffer and place the fetched values in the groupshared registers.
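Steps 1-4 can be sketched in Python, with a dictionary standing in for the groupshared memory of one thread group; the cache granularity and read accounting are my own simplification, not real HLSL.

```python
def fetch_depths_for_group(group_pixels, neighbors_of, z_buffer_read):
    """Resolve depth samples for all pixels in a thread group, caching
    Z-buffer reads in 'groupshared' storage so overlapping neighborhoods
    are fetched from the texture unit only once."""
    groupshared = {}     # (x, y) -> depth, shared within the thread group
    zbuffer_reads = 0
    results = {}
    for p in group_pixels:                 # one "thread" per pixel
        depths = []
        for n in neighbors_of(p):          # samples chosen in step 1
            if n not in groupshared:       # step 4: miss, go out to Z-buffer
                groupshared[n] = z_buffer_read(n)
                zbuffer_reads += 1
            depths.append(groupshared[n])  # step 3: hit, served from cache
        results[p] = depths
    return results, zbuffer_reads

# Two adjacent pixels share most of their 3x3 neighborhoods:
def neighbors_of(p):
    x, y = p
    return [(x + dx, y + dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def z_buffer_read(n):
    return 1.0                             # flat depth; the value is irrelevant

_, n_reads = fetch_depths_for_group([(5, 5), (6, 5)], neighbors_of, z_buffer_read)
# 18 naive reads collapse to 12 unique fetches thanks to the shared cache
```

Even in this two-pixel toy, a third of the texture reads disappear; across a full thread group with heavily overlapping kernels, the saving is what the advantages below describe.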
Advantages:
• Using the groupshared memory avoids an incredible amount of over-sampling.
• It can be filled using the Gather instruction, which further reduces the number of TEX operations.
• Each SP of the GPU can perform the operation for a different pixel in parallel, and the data kept in shared memory or in groupshared registers can be accessed by the other SPs.
• Each SP has its own cache memory, which can hold frequently read data; this helps an SP minimize its I/O operations when it moves on to the next pixel after finishing the AO calculation for the current one.
I hope this explains how GPU compute performance actually benefits gaming performance when implementing Ambient Occlusion. Crysis was the first game to implement AO, but it did so with pixel shaders, and we all know how heavy that game was on hardware. Most of the latest games, like BattleForge, Battlefield 3, and StarCraft II, use compute-based AO logic, and they run far more smoothly.
So I guess you see my point: a GPU with better compute power is definitely going to have an advantage in current and future games because of the factors discussed above. That's why AMD cards, with their better compute performance, do better in games like DiRT 3, BattleForge, and Civilization V.
In the next iteration, I'll discuss Depth of Field.
Sayonara till the DOF.....HAHAHAHAHA