Vulkan™’s barrier system is unique in that it not only requires you to specify which resources are transitioning, but also to specify a source and destination pipeline stage. This allows for more fine-grained control of when a transition is executed. However, you can also leave quite a bit of performance on the table if you just use the simplest settings, so today we’re going to look at vkCmdPipelineBarrier in detail.
It is common knowledge that the GPU is a highly pipelined device. Commands come in at the top, and then individual stages like vertex and fragment shading are executed in order. Finally, commands retire at the bottom of the pipeline when execution has finished.
This is exposed in Vulkan through the VK_PIPELINE_STAGE enumeration, which is defined as:
TOP_OF_PIPE_BIT
DRAW_INDIRECT_BIT
VERTEX_INPUT_BIT
VERTEX_SHADER_BIT
TESSELLATION_CONTROL_SHADER_BIT
TESSELLATION_EVALUATION_SHADER_BIT
GEOMETRY_SHADER_BIT
FRAGMENT_SHADER_BIT
EARLY_FRAGMENT_TESTS_BIT
LATE_FRAGMENT_TESTS_BIT
COLOR_ATTACHMENT_OUTPUT_BIT
TRANSFER_BIT
COMPUTE_SHADER_BIT
BOTTOM_OF_PIPE_BIT

Notice that this enumeration is not necessarily in the order a command is executed – some stages can be merged, some stages can be missing, but overall these are the pipeline stages a command will go through.
There are also three pseudo-stages which combine multiple stages or handle special access:
HOST_BIT
ALL_GRAPHICS_BIT
ALL_COMMANDS_BIT

For the sake of this article, the list between TOP_OF_PIPE_BIT and BOTTOM_OF_PIPE_BIT is what we’re going to discuss. So what does source and target mean in the context of a barrier? You can think of it as the “producer” and the “consumer” stage – the source being the producer, and the target stage being the consumer. By specifying the source and target stages, you tell the driver what operations need to finish before the transition can execute, and what must not have started yet.

Let’s look first at the simplest case, which is a barrier which specifies BOTTOM_OF_PIPE_BIT as the source stage and TOP_OF_PIPE_BIT as the target stage (Example 1). The source code for this would be something like:
vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,   // source stage
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,      // destination stage
    /* remaining parameters omitted */);
This transition expresses that every command currently in flight on the GPU needs to finish, then the transition is executed, and no command may start before it finishes transitioning. This barrier will wait for everything to finish and block any work from starting. That’s generally not ideal because it introduces an unnecessary pipeline bubble.

Imagine you have a vertex shader that also stores data via an imageStore, and a compute shader that wants to consume it. In this case you wouldn’t want to wait for a subsequent fragment shader to finish, as this can take a long time to complete. You really want the compute shader to start as soon as the vertex shader is done. The way to express this is to set the source stage (the producer) to VERTEX_SHADER_BIT and the target stage (the consumer) to COMPUTE_SHADER_BIT (Example 2).
vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,    // source stage
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // destination stage
    /* remaining parameters omitted */);
If you write to a render target and read from it in a fragment shader, the stages would be VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT as the source and VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT as the destination – typical for G-Buffer rendering. For shadow maps, the source would be VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT. Another typical example is copying data – you produce the data through a copy, so the source stage would be set to VK_PIPELINE_STAGE_TRANSFER_BIT, and the destination to the stage where you need it. For vertex buffers, this would be for instance VK_PIPELINE_STAGE_VERTEX_INPUT_BIT.
Generally, you should try to maximize the number of “unblocked” stages, that is, produce data early and wait late for it. It’s always safe on the producer side to move towards the bottom of the pipe, as you’ll wait for more and more stages to finish, but it won’t improve performance. Similarly, if you want to be safe on the target side, you move upwards towards the top of pipe – but it prevents more stages from running, so that should be avoided as well.
One final remark: as mentioned previously, the hardware may not have all the stages internally, or may not be able to signal or wait at the specified stage. In those cases, the driver is free to move your source stage towards the bottom of the pipe and the target stage towards the top. This is implementation-specific though, and you should not have to worry about it – your goal should be to set the stages as “tight” as possible and minimize the number of blocked stages.
If you specify that you use early tests for a shadow map, could you use VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT instead?
This would only work if you force fragment shaders and early tests on everything, but you should not use fragment shaders during shadow map rendering if you don’t need to. Without fragment shaders, it’s undefined whether the test happens early or late. Also, there’s no real benefit from going from late to early, so I’d recommend you stick to late in all cases.
Okay thanks :).
Another question: what is the difference between using ALL_COMMANDS instead of BOTTOM_OF_PIPE in the srcStage?
Same question about TOP_OF_PIPE / ALL_COMMANDS in dstStage.
Thanks ! 🙂
You can use ALL_COMMANDS as well – it’s basically the same. The same goes for the target stage. I used BOTTOM/TOP to make it clear this is at the end/beginning of the pipe, but you can use ALL_COMMANDS for both as well.
Hi!
1) Can VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT be used as a source stage? If so, what is VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT needed for? Does the former or the latter guarantee that the hardware cache was flushed and the image memory contains the latest data?
2) Imagine the case:
Step A. Fragment shader writes to color attachment.
Step B. I dispatch compute shader that does something that is not related to color attachment.
Step C. I dispatch compute shader that reads color attachment.
Is this right: if I place barrier(VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, COMPUTE_SHADER_BIT) at the end of Step A, then my first dispatch will wait until writing to the color attachment is finished? Is the only way to avoid blocking Step B to place the barrier at the beginning of Step C?
3) If I place, for example, barrier(VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_VERTEX_SHADER_BIT) but never dispatch compute shaders (or if I place the barrier before any compute shaders are dispatched), will this hang the whole rendering pipeline?
Thanks!
FRAGMENT_SHADER is for side effects of your fragment shader (like load/stores). The outputs written into render targets are only visible once the COLOR_ATTACHMENT_OUTPUT stage is done.
In your case, you place the barrier before the dispatch that reads the color attachment. So you draw, dispatch the unrelated work, barrier, then dispatch the work which consumes the draw.
3) It won’t hang, it will just logically wait for all stages up to the compute shader stage to finish their work (which can be none), and it will prevent anything including the vertex shader and stages after it from running before the other stages have finished. This might very well be a no-op if there’s no work in flight, but it could also be a cache flush for no good reason, so you should avoid this kind of wait.
hi!
Say there are two buffers, A and B. A is host visible, B is device local. I think the following steps are needed:
1. Transfer A from its “initial state” into PIPELINE_STAGE_HOST, ACCESS_HOST_WRITE using a barrier
2. map(A), write data, unmap(A)
3. Transfer A from PIPELINE_STAGE_HOST, ACCESS_HOST_WRITE into PIPELINE_STAGE_TRANSFER, ACCESS_TRANSFER_READ using a barrier
4. Transfer B from its “initial state” into PIPELINE_STAGE_TRANSFER, ACCESS_TRANSFER_WRITE using a barrier
5. vkCopyBuffer
Is this correct? If so, what is the “initial state” – should I simply mark the srcStageMask and srcAccessMask as 0?
If not, what steps are needed?
Thanks!
Just take a look at how the HelloVulkan sample solves it: https://github.com/GPUOpen-LibrariesAndSDKs/HelloVulkan/blob/master/hellovulkan/src/VulkanTexturedQuad.cpp, specifically CreateMeshBuffers. It allocates the memory in the VK_BUFFER_USAGE_TRANSFER_DST_BIT state directly, and then performs a copy. It only barriers afterwards (after the copy is done) to get it into a usable state.

I too have a few questions:
I have 4 buffers:
1 input staging buffer
1 input device local buffer
1 output device local buffer
1 output staging buffer
The idea is that I generate some data on the CPU into the input staging buffer.
I copy the staging buffer to the device local input buffer using a dedicated transfer queue.
I then launch a compute shader that fills the output device local buffer based on the input device local buffer.
Finally I copy the results in the output device local buffer to the output staging buffer, map it and use the data.
My question now is how to handle the dependencies between the queues. A command buffer needs to be submitted to a specific queue, so how do I go about handling ownership?
Should I record 3 command buffers (1 for the copy of the input, 1 for launching the compute, and 1 for the copy of the output) and end the first two with a pipeline barrier that transfers the buffers to the target queues?
Is it possible to create a single command buffer that has access to all the other queues and does all the ownership management itself?
Thanks in advance
You’ll need three separate command buffers, as you can’t submit one command buffer to multiple queues. The command buffers will all perform transitions, and you’ll synchronize them using semaphores (you signal one on submit, which you can wait on in the next submit to prevent that command buffer from starting). Note that if your graphics queue depends on the output of the compute queue, you might just do the compute on the graphics queue unless there’s other work you want to overlap with.
I don’t get why Example 2 says the right pipeline waits on the finished vertex shader stage of the left one in order to start a compute shader after a subsequent fragment shader. My understanding is that pipelines are either graphics or compute, so it is not possible to execute a fragment shader before a compute shader in the same pipeline – or am I wrong?