Half-precision (FP16) computation is a performance-enhancing GPU technology long exploited in console and mobile devices, but not previously used or widely available in mainstream PC development. With the advent of AMD’s Vega GPU architecture, this technology is now more easily accessible and available for boosting graphics performance in mainstream PC development.
The latest iteration of the GCN architecture allows you to pack 2x FP16 values into each 32-bit VGPR register. This enables you to:
There are also some minor risks:
It is tempting to assume that implementing FP16 is as simple as merely substituting the ‘half’ type for ‘float’. Alas not: this simply doesn’t work on PC. The DirectX® FXC compiler only offers half for compatibility; it maps it onto float. If you compare the bytecode generated, it is identical.
The correct types to use are the standard HLSL types prefixed with min16: min16float, min16int, min16uint. These can be used as a scalar or vector type in the usual fashion.
Your development environment will need specific software to successfully generate FP16 code. First of all, you need Windows 8 or later. Older versions of Windows will simply fail to create shaders if you use min16float. Whilst there is a Platform Update available for Windows 7 which enables FP16 shaders to compile, it simply compiles the code as FP32. In practice you are emulating absent hardware and therefore the resulting code may not be as efficient. It may therefore be worthwhile providing an alternative code path or set of shaders for hardware and operating systems lacking FP16 support.
Secondly, you will need up-to-date versions of the FXC compiler and the driver compiler. The FXC compiler in the Windows 10 SDK will suffice, and Radeon Crimson driver version 17.9.1 or later is required.
Thirdly, it is worth clarifying that FP16 will work on DirectX 11.1 and Shader Model 5 code. DirectX 12 is not required. Simply add a run-time test to query for D3D11_FEATURE_SHADER_MIN_PRECISION_SUPPORT from ID3D11Device::CheckFeatureSupport().
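As a minimal sketch of that run-time test: the query fills a D3D11_FEATURE_DATA_SHADER_MIN_PRECISION_SUPPORT structure, and the D3D11_SHADER_MIN_PRECISION_16_BIT flag (value 0x2) indicates 16-bit support. The helper names HasMin16 and DeviceSupportsFP16PixelShaders are hypothetical, and the flag value is repeated as a constant only so that the bit test compiles without the Windows SDK.

```cpp
#include <cstdint>

// Value of D3D11_SHADER_MIN_PRECISION_16_BIT in d3d11.h, repeated here so the
// helper below also compiles on platforms without the Windows SDK.
constexpr std::uint32_t kMinPrecision16Bit = 0x2;

// Hypothetical helper: true when the reported flags include 16-bit support.
constexpr bool HasMin16(std::uint32_t minPrecisionFlags)
{
    return (minPrecisionFlags & kMinPrecision16Bit) != 0;
}

#ifdef _WIN32
#include <d3d11.h>

// Hypothetical wrapper around the run-time query described in the text.
bool DeviceSupportsFP16PixelShaders(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_SHADER_MIN_PRECISION_SUPPORT support = {};
    const HRESULT hr = device->CheckFeatureSupport(
        D3D11_FEATURE_SHADER_MIN_PRECISION_SUPPORT, &support, sizeof(support));
    // Pixel shaders and all other stages are reported separately; this
    // sketch only inspects the pixel shader flags.
    return SUCCEEDED(hr) && HasMin16(support.PixelShaderMinPrecision);
}
#endif
```

A positive answer only says that the runtime and driver accept minimum-precision types; it is still worth confirming in the ISA that packed FP16 instructions are actually generated.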
And most importantly, compatible hardware is required: an AMD RX Vega or Vega Frontier Edition GPU!
Whilst min16float is perfectly legal HLSL syntax and therefore fine to use as-is, I would caution against using it directly. I find it better to implement pre-processor support to globally include or remove the use of min16float and therefore FP16. There are two reasons for this:
With all these tools in place, what does compiled FP16 code look like? Let’s write a trivial test function:
cbuffer params
{
    min16float4 colour;
};
Texture2D<min16float4> tex;
SamplerState samp;
min16float4 test( in min16float2 uv : TEXCOORD0 ) : SV_Target
{
    return colour * tex.Sample( samp, uv );
}
The first step of verifying FP16 functionality is to look at the FXC .asm output. The driver cannot compile FP16 code unless it is given the correct bytecode from DirectX. Here we see the compiler has introduced a series of {min16f} suffixes:
ps_5_0
dcl_globalFlags refactoringAllowed | enableMinimumPrecision
dcl_constantbuffer CB0[1], immediateIndexed
dcl_sampler s0, mode_default
dcl_resource_texture2d (float,float,float,float) t0
dcl_input_ps linear v0.xy {min16f}
dcl_output o0.xyzw {min16f}
dcl_temps 1
sample_indexable(texture2d)(float,float,float,float) r0.xyzw {min16f}, v0.xyxx {min16f}, t0.xyzw, s0
mul o0.xyzw {min16f}, r0.xyzw {min16f}, cb0[0].xyzw {min16f}
ret
Now we turn to the ISA output. There are typically two major classes of instruction to look for:
v_pk_add/mul/sub/mad_f16
v_mad_mix_f32/v_mad_mixlo_f16/v_mad_mixhi_f16
Instructions such as v_pk_add/mul/sub_f16 perform an ALU operation on two FP16 values at once, halving the instructions needed and your ALU time. This gives one of the primary performance advantages of FP16.
The mix modifiers allow you to freely mix FP16 and FP32 operands in one VOP3 instruction, without requiring an additional conversion instruction. The cost of using these instructions is the lost opportunity to issue a packed instruction. A mix instruction is therefore neither faster nor slower than the equivalent FP32 instruction it replaces.
Note that the specific form of the mix instruction is a multiply-add instruction. The compiler can use this to implement most arithmetic operations with creative use of 0, 1 or -1 constants. However, commonly encountered shader ALU operations, such as min() or max(), cannot be performed using a mix instruction.
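To make the "creative use of constants" concrete, here is a small CPU-side sketch of how addition, subtraction and multiplication can each be written as a single multiply-add. std::fma stands in for the hardware operation purely for illustration; the function names are hypothetical, and nothing here models the FP16/FP32 operand mixing itself.

```cpp
#include <cmath>

// a + b  ==  a * 1 + b
float AddViaMad(float a, float b) { return std::fma(a, 1.0f, b); }

// a - b  ==  b * -1 + a
float SubViaMad(float a, float b) { return std::fma(b, -1.0f, a); }

// a * b  ==  a * b + 0
float MulViaMad(float a, float b) { return std::fma(a, b, 0.0f); }
```

No choice of constants turns a multiply-add into a comparison, which is why min() and max() fall outside what a mix instruction can express.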
Here is the GCN ISA output for the above shader:
shader main
asic(GFX9)
type(PS)
s_mov_b32 m0, s20
s_mov_b64 s[22:23], exec
s_wqm_b64 exec, exec
s_setreg_imm32_b32 hwreg(HW_REG_MODE, 0, 8), 0x000001cc
v_interp_p1ll_f16 v2, v0, attr0.x
v_interp_p1ll_f16 v0, v0, attr0.y
v_interp_p2_f16 v2, v1, attr0.x, v2
v_interp_p2_f16 v2, v1, attr0.y, v0 op_sel:[0,0,0,1]
image_sample v[0:3], v[2:4], s[4:11], s[12:15] dmask:0xf a16 d16
s_buffer_load_dwordx4 s[0:3], s[16:19], 0x00
s_waitcnt lgkmcnt(0)
v_mov_b32 v2, s1
v_cvt_pkrtz_f16_f32 v2, s0, v2
v_mov_b32 v3, s3
v_cvt_pkrtz_f16_f32 v3, s2, v3
s_setreg_imm32_b32 hwreg(HW_REG_MODE, 0, 8), 0x000001c0
s_waitcnt vmcnt(0)
v_pk_mul_f16 v0, v0, v2 op_sel_hi:[1,1]
v_pk_mul_f16 v1, v1, v3 op_sel_hi:[1,1]
v_mov_b32 v2, v0 src0_sel: WORD_0
v_mov_b32 v0, v0 src0_sel: WORD_1
v_mov_b32 v3, v1 src0_sel: WORD_0
v_mov_b32 v1, v1 src0_sel: WORD_1
s_mov_b64 exec, s[22:23]
v_lshl_or_b32 v0, v0, 16, v2
v_lshl_or_b32 v1, v1, 16, v3
exp mrt0, v0, v0, v1, v1 done compr vm
s_endpgm
end
This output illustrates a couple of interesting points. Firstly, the compiler has successfully introduced some v_pk_mul_f16 instructions. Instead of the usual four v_mul_f32 ops required to multiply one float4 by another, we’ve halved that to two v_pk_mul_f16 ops.
Secondly, consider the two v_cvt_pkrtz instructions. Each of these takes two FP32 source values and packs them into two FP16 values in a single 32-bit destination register. The compiler does this to form the min16float4 declared in the cbuffer. It is surprising that despite using the correct type, the compiler has not generated the simple load we may have expected. We will return to this issue later.
AMD offers an extremely powerful software tool known as Radeon GPU Analyzer (RGA). This tool is an interface to the driver compiler which allows you to directly see the resulting code. RGA accepts shader source or intermediates from all main graphics APIs. The user specifies which generation of GCN GPU to target, and the tool can output a number of analyses, including but not limited to ISA output and register usage analysis.
I consider RGA invaluable for FP16 work. We have integrated this tool into our tool chain so that we can obtain ISA output or register analysis immediately after compilation. I iterate on the ISA output until I have satisfactory code, and then test it for performance and correctness. Whilst some GPU capture tools now offer ISA disassembly, this is a far more productive method of working.
It is critical to choose your targets very carefully. Not all code is a suitable candidate for FP16 optimisation. The ideal target:
Data parallelism typically comes in two forms. Packed instructions can easily be used on code employing 2-, 3- or 4-component vectors. Alternatively, strictly scalar code can be made suitable for packed instructions by unrolling the loop manually and working on pairs of data.
A reliable target for FP16 optimisation is the blending of colour and normal maps. These operations are typically heavy on data-parallel ALU operations. What’s more, such data frequently originates from a low-precision texture and therefore fits comfortably within FP16’s limitations. A typical game frame has a plentiful supply of these operations in gbuffer export and post-process shaders, all ripe for optimisation.
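The claim that low-precision texture data fits comfortably within FP16 can be checked with a little arithmetic. The sketch below models FP16 rounding on the CPU by keeping an 11-bit significand (10 stored bits plus the implicit one) and counts how many 8-bit UNORM levels fail to survive the round trip. The function names are hypothetical, and the model deliberately ignores subnormals, infinities and NaN, none of which occur for values of the form k/255.

```cpp
#include <cmath>

// Round x to the nearest value with an 11-bit significand, which is what FP16
// stores for normal numbers. Assumes x is 0 or within FP16's normal range
// (about 6.1e-5 to 65504); subnormals, infinities and NaN are not modelled.
double RoundToHalfPrecision(double x)
{
    if (x == 0.0) return 0.0;
    int exponent = 0;
    // x = mantissa * 2^exponent, with |mantissa| in [0.5, 1)
    const double mantissa = std::frexp(x, &exponent);
    // Keep 11 significant bits (default rounding mode: to nearest, ties to even).
    const double rounded = std::nearbyint(mantissa * 2048.0) / 2048.0;
    return std::ldexp(rounded, exponent);
}

// Count the 8-bit UNORM levels that do not survive a trip through FP16.
int CountLossy8BitLevels()
{
    int lossy = 0;
    for (int level = 0; level <= 255; ++level)
    {
        const double asHalf = RoundToHalfPrecision(level / 255.0);
        if (static_cast<int>(std::lround(asHalf * 255.0)) != level) ++lossy;
    }
    return lossy;
}
```

Every one of the 256 levels maps back to itself, because the worst-case FP16 rounding error is far smaller than half of an 8-bit step. Intermediate results of a long blend chain can still accumulate error, so this only covers the input data.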
BRDFs are an attractive but difficult candidate. The portion of a BRDF that computes specular response is typically very register- and ALU-intensive. This would seem a promising target. However, caution must be exercised. BRDFs typically contain exponent and division operations. There are currently no FP16 instructions for these operations. This means that at best there will be no parallelisation of those operations; at worst it will introduce conversion overhead between FP16 and FP32.
All is not lost. There is a suitable optimization candidate in the typical BRDF equation: the large number of vectors and dot products typically present. Whilst individual dot products are more a data reduction operation than a data parallel operation, many dot products can be performed in parallel using SIMD code. These dot products often feed back into FP32 BRDF code, so care must be taken not to introduce FP16 to FP32 conversion overhead that exceeds the gains made.
Finally, TAA or checker-boarding systems offer strong potential for optimisation alongside surprising risks. These systems perform a great deal of colour processing, and ALU can indeed be the primary bottleneck. UV calculations often consume much of this ALU work. It is tempting to assume these screen-space UVs are well within the limits of FP16. Surprisingly, the combination of small pixel velocities and high resolutions such as 4K can cause artefacts when using FP16. Exercise care when optimising similar code.
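One plausible way to see where such artefacts come from is to compare the gap between adjacent FP16 values with the width of a pixel in UV space. The sketch below is illustrative arithmetic under that assumption, not an analysis taken from any particular TAA implementation; HalfSpacing and HalfStepInPixels are hypothetical names.

```cpp
#include <cmath>

// Gap between adjacent FP16 values around x, for x in FP16's normal range.
// FP16 keeps an 11-bit significand, so the gap is 2^(floor(log2(x)) - 10).
double HalfSpacing(double x)
{
    int exponent = 0;
    (void)std::frexp(x, &exponent);      // x = m * 2^exponent with m in [0.5, 1)
    return std::ldexp(1.0, exponent - 11);
}

// Width of one FP16 step at the given UV, measured in pixels.
double HalfStepInPixels(double uv, int resolution)
{
    return HalfSpacing(uv) * resolution;
}
```

For a UV of 0.75 on a 3840-pixel-wide target, one FP16 step spans 1.875 pixels, so neighbouring pixels cannot be told apart and a sub-pixel velocity added to that UV is rounded away entirely. At 1920 pixels the same step is 0.9375 pixels, which is why the problem tends to surface only at higher resolutions.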
The most efficient way to write FP16 code is to supply it with FP16 constant data. Any use of FP32 constant data will invoke a conversion operation. Constant data typically occurs in two forms: cbuffer values and literals.
In an ideal world, there would be an FP16 version of every cbuffer value available for use. In practice, it is often possible to obtain a performance advantage just using FP32 cbuffer data. It depends on how frequently a constant is used. If a constant is used only once or twice it is no slower to simply use a mix instruction. If a constant is used more widely, or on vectors, it is usually more efficient to provide an FP16 cbuffer value. Clearly, larger types such as vectors or matrices should be supplied as native FP16 data as the conversion overhead would be prohibitive.
The second source of constant data is the use of literal values in the shader. It is tempting to assume that using the h suffix would be sufficient to introduce an FP16 constant. It isn’t. Again, the half type is for backwards compatibility and FXC converts it to an FP32 literal. Using either the h or f suffix will result in a conversion. It is better to use the unadorned literal, such as 0.0, 1.5 and so on. Generally, the compiler is able to automatically encode that literal as FP32 or FP16 as appropriate according to context.
One exception is expanding literals for use in an operation with a vector. Sometimes the compiler is unable to expand the literal to a min16float3 automatically. In this case, you must either manually construct a min16float3, or use syntax such as 1.5.xxx.
Recall the earlier example code snippet. Whilst the compiler emitted the expected v_pk_mul_f16 operations, it didn’t emit the code sequence you might expect to load a min16float4 from memory. It loaded FP32 values and packed them down to an FP16 vector manually. If you were to access a larger type, such as a min16float4x4 matrix, the code sequence would be very sub-optimal. There is an easy solution. If we change the source code to:
cbuffer params
{
    uint2 packedColour;
};
Texture2D<min16float4> tex;
SamplerState samp;
min16float2 UnpackFloat16( uint a )
{
    float2 tmp = f16tof32( uint2( a & 0xFFFF, a >> 16 ) );
    return min16float2( tmp );
}
min16float4 UnpackFloat16( uint2 v )
{
    return min16float4( UnpackFloat16( v.x ), UnpackFloat16( v.y ) );
}
min16float4 test( in min16float2 uv : TEXCOORD0 ) : SV_Target
{
    min16float4 colour = UnpackFloat16( packedColour );
    return colour * tex.Sample( samp, uv );
}
The driver recognises this code sequence, and issues a much more optimal sequence of instructions:
shader main
asic(GFX9)
type(PS)
s_mov_b32 m0, s20
s_mov_b64 s[2:3], exec
s_wqm_b64 exec, exec
s_setreg_imm32_b32 hwreg(HW_REG_MODE, 0, 8), 0x000001cc
v_interp_p1ll_f16 v2, v0, attr0.x
v_interp_p1ll_f16 v0, v0, attr0.y
v_interp_p2_f16 v2, v1, attr0.x, v2
v_interp_p2_f16 v2, v1, attr0.y, v0 op_sel:[0,0,0,1]
image_sample v[0:3], v[2:4], s[4:11], s[12:15] dmask:0xf a16 d16
s_buffer_load_dwordx2 s[0:1], s[16:19], 0x00
s_setreg_imm32_b32 hwreg(HW_REG_MODE, 0, 8), 0x000001c0
s_waitcnt vmcnt(0) & lgkmcnt(0)
v_pk_mul_f16 v0, v0, s0 op_sel_hi:[1,1]
v_pk_mul_f16 v1, v1, s1 op_sel_hi:[1,1]
v_mov_b32 v2, v0 src0_sel: WORD_0
v_mov_b32 v0, v0 src0_sel: WORD_1
v_mov_b32 v3, v1 src0_sel: WORD_0
v_mov_b32 v1, v1 src0_sel: WORD_1
s_mov_b64 exec, s[2:3]
v_lshl_or_b32 v0, v0, 16, v2
v_lshl_or_b32 v1, v1, 16, v3
exp mrt0, v0, v0, v1, v1 done compr vm
s_endpgm
end
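The shader side of this approach needs matching CPU code to fill packedColour. The following is a minimal sketch of one way to do that, assuming the bit layout implied by UnpackFloat16 (first component in the low 16 bits, second in the high 16 bits). FloatToHalf, PackHalf2, PackedColour and PackColour are hypothetical names, and for brevity the conversion flushes values below FP16's normal range to zero rather than producing subnormals.

```cpp
#include <cstdint>
#include <cstring>

// Convert an FP32 value to FP16 bits, rounding to nearest (ties to even).
// Simplification: magnitudes below 2^-14 are flushed to signed zero.
std::uint16_t FloatToHalf(float value)
{
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    const std::uint32_t sign = (bits >> 16) & 0x8000u;
    const std::uint32_t absBits = bits & 0x7FFFFFFFu;

    if (absBits > 0x7F800000u)                      // NaN
        return static_cast<std::uint16_t>(sign | 0x7E00u);
    if (absBits >= 0x47800000u)                     // >= 65536 or infinity
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    if (absBits < 0x38800000u)                      // below 2^-14: flush to zero
        return static_cast<std::uint16_t>(sign);

    // Rebias the exponent (127 -> 15) and round the 13 discarded mantissa bits.
    const std::uint32_t rounded =
        absBits - 0x38000000u + 0x0FFFu + ((absBits >> 13) & 1u);
    return static_cast<std::uint16_t>(sign | (rounded >> 13));
}

// Pack two floats into one 32-bit word: `low` in bits 0-15, `high` in bits
// 16-31, matching UnpackFloat16( uint ) in the shader.
std::uint32_t PackHalf2(float low, float high)
{
    return static_cast<std::uint32_t>(FloatToHalf(low)) |
           (static_cast<std::uint32_t>(FloatToHalf(high)) << 16);
}

// CPU-side mirror of the `uint2 packedColour` cbuffer member.
struct PackedColour
{
    std::uint32_t xy;
    std::uint32_t zw;
};

PackedColour PackColour(float r, float g, float b, float a)
{
    return { PackHalf2(r, g), PackHalf2(b, a) };
}
```

In production code a tested library routine for the FP32 to FP16 conversion would be preferable to a hand-rolled one; the point here is only the packing order that the shader expects.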
Finally, it is useful to embed FP16 constants at the end of the cbuffer rather than mix them alongside FP32 constants. This makes it much easier to strip away FP16 constants for the non-FP16 compatibility path, causing minimal effect on cbuffer size, layout and member alignment for both C++ and shader code.
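A CPU-side mirror of such a cbuffer might look like the sketch below. The structure, its members and the USE_FP16 switch are all hypothetical; the point is only that the FP32 members keep the same offsets whether or not the FP16 tail is compiled in.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical constant buffer mirrored on the CPU. The FP32 members come
// first so their offsets are identical whether or not FP16 is compiled in.
struct alignas(16) Params
{
    float worldViewProj[16];        // offset 0
    float tintFP32[4];              // offset 64
#ifdef USE_FP16                     // hypothetical build switch
    std::uint32_t tintFP16[2];      // offset 80: four packed FP16 values
    std::uint32_t padding[2];       // pad to a whole 16-byte register
#endif
};

static_assert(offsetof(Params, tintFP32) == 64,
              "FP32 layout must not depend on USE_FP16");
static_assert(sizeof(Params) % 16 == 0,
              "constant buffer size must be a multiple of 16 bytes");
```

With the FP16 members grouped at the end, disabling the switch simply shortens the buffer, and neither the C++ code nor the shader has to renumber any existing member.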
It’s worth noting that Shader Model 6.2 supports 16-bit scalar types for all memory operations, meaning that the above issue should eventually go away.
FP16 optimisation typically encounters two main problems:
At present, FP16 is typically introduced to a shader retrospectively to improve its performance. The new FP16 code requires conversion instructions to integrate and coexist with FP32 code. The programmer must take care to ensure these instructions do not equal or exceed the time saved. It is important to keep large blocks of computation as purely FP16 or FP32 in order to limit this overhead. Indeed, some shaders, such as post-process or gbuffer export shaders, can run entirely in FP16.
This leads us to the final point. FP16 code adds a little extra complexity to shader code. This article has outlined issues such as minimising conversion overhead, the special code to unpack FP16 data, and maintaining a non-FP16 code path. Whilst these issues are easily overcome, they may make the code take a little more effort to write and maintain. It is important to remember the reward is very worthwhile.
FP16 is a valuable additional tool in the programmer’s toolbox for obtaining peak shader performance. We have observed gains of around 10% on AMD RX Vega hardware. This is an attractive and lasting return for a moderate investment of engineering effort.
Radeon GPU Analyzer
AMD Radeon RX Vega Instruction Set