Fast compaction with mbcnt

Posted on May 20, 2016 (updated October 17, 2016) by Matthaeus Chajdas

atomic, GCN, intrinsic, mbcnt, shader, Vulkan

Compaction is a basic building block of many algorithms – for instance, filtering out invisible triangles as seen in Optimizing the Graphics Pipeline with Compute. The basic way to implement compaction on GPUs relies on atomics: every lane which has an element that must be kept increments an atomic counter and writes its element into the slot it gets back. While this works, it’s inefficient as it requires lots of atomic operations – one per active lane. The general pattern looks similar to the following pseudo-code:


if (threadId == 0) {
    sharedCounter = 0;
}

barrier ();

bool laneActive = predicate (item);

if (laneActive) {
    int outputSlot = atomicAdd (sharedCounter, 1);
    output [outputSlot] = item;
}

This computes an output slot for every item by incrementing a counter stored in Local Data Store (LDS) memory (called Thread Group Shared Memory in Direct3D® terminology). While this method works for any thread group size, it’s inefficient as we may end up with up to 64 atomic operations per wavefront.
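To make the cost concrete, here is a minimal CPU-side sketch in Python that mimics the pattern above and counts the atomics it would issue. The names `compact_with_atomics` and `atomic_ops` are made up for this illustration, and a sequential loop stands in for lanes that would really run in parallel.

```python
def compact_with_atomics(items, predicate):
    """Simulate atomic-counter compaction; returns (kept_items, atomic_ops)."""
    shared_counter = 0            # stands in for the counter in LDS
    atomic_ops = 0                # number of simulated atomicAdd calls
    output = [None] * len(items)
    for item in items:            # one loop iteration per lane
        if predicate(item):
            output_slot = shared_counter  # value atomicAdd would return
            shared_counter += 1
            atomic_ops += 1
            output[output_slot] = item
    return output[:shared_counter], atomic_ops


# One full wavefront of 64 items, keeping every multiple of 3.
kept, atomic_ops = compact_with_atomics(list(range(64)), lambda x: x % 3 == 0)
```

Every kept item costs one atomic here. The simulation keeps the items in input order only because the loop is sequential; on real hardware the order in which lanes obtain their slots is not guaranteed.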

With the newly released shader extensions, we’re providing access to a much better tool to get this job done: GCN provides a special op-code for compaction within a wavefront – mbcnt. mbcnt(v) computes the bit count of the provided value up to the current lane id. That is, for lane 0, it simply returns popcount (v & 0), for lane 1, it’s popcount (v & 1), for lane 2, popcount (v & 3), and so on. If you’re wondering what popcount does – it simply counts the number of set bits. The popcount result gives us the same result as the atomic operation above – a unique output slot within the wavefront we can write to. This leaves us with one more problem – we need to gather the laneActive variable which is per-thread across the whole wavefront.
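The per-lane masks are easy to check on the CPU. The following Python sketch models `mbcnt` as described above; on the GPU the lane id is implicit, so the explicit `lane_id` parameter is an assumption made purely for this illustration.

```python
def popcount(v):
    """Number of set bits in v."""
    return bin(v).count("1")


def mbcnt(v, lane_id):
    """Count the set bits of v that belong to lanes below lane_id."""
    lanes_below = (1 << lane_id) - 1   # lane 0 -> 0, lane 1 -> 1, lane 2 -> 3
    return popcount(v & lanes_below)


v = 0b1011                             # lanes 0, 1 and 3 are set
per_lane = [mbcnt(v, lane) for lane in range(5)]
print(per_lane)                        # [0, 1, 2, 2, 3]
```

Lanes 0, 1 and 3 get the distinct slots 0, 1 and 2. Lanes 2 and 4 also receive a value, but since their own bit is not set they would never write.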

We’ll need a short trip into the GCN execution model to understand how this works. In GCN, there’s a scalar unit and a vector unit. Any kind of comparison which is performed per-lane writes into a scalar register with one bit per lane. We can access this scalar register using ballot – in this case, we’re going to call mbcnt (ballot (laneActive)). The complete shader looks like this:


bool laneActive = predicate (item);
int outputSlot = mbcnt (ballot (laneActive));

if (laneActive) {
    output [outputSlot] = item;
}

Much shorter than before, and also quite a bit faster. Let’s take a closer look at how this works. In the image below, we can see the computation for one thread. The input data is green where items should be kept. The ballot output is a mask with one bit per thread, which is 1 if the thread passed in true and 0 otherwise. During the mbcnt, this ballot mask is AND’ed with a mask that has 1 for all bits below the current thread index, and 0 otherwise. After the AND, mbcnt computes a popcount which yields the correct output slot.

The input data is filtered using `mbcnt`. The currently active thread is highlighted. For this thread, `mbcnt` applied onto the output of `ballot` returns the output slot that should be used during compaction.
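The same computation can be traced for every thread at once with a small Python simulation of a single wavefront. This is only a sketch: `ballot`, `mbcnt` and `compact_wavefront` are stand-ins written for this example, not the actual intrinsics.

```python
def ballot(lane_flags):
    """Pack one boolean per lane into a bit mask (bit i = lane i)."""
    mask = 0
    for lane_id, flag in enumerate(lane_flags):
        if flag:
            mask |= 1 << lane_id
    return mask


def mbcnt(mask, lane_id):
    """Popcount of the mask bits below lane_id."""
    return bin(mask & ((1 << lane_id) - 1)).count("1")


def compact_wavefront(items, predicate):
    """Compact the items of one wavefront without any atomics."""
    lane_active = [predicate(item) for item in items]
    mask = ballot(lane_active)
    output = [None] * bin(mask).count("1")
    for lane_id, item in enumerate(items):   # every lane does this in parallel
        if lane_active[lane_id]:
            output[mbcnt(mask, lane_id)] = item
    return output


kept = compact_wavefront([7, 2, 9, 4, 6, 1, 8, 3], lambda x: x % 2 == 0)
print(kept)                                   # [2, 4, 6, 8]
```

Because each slot depends only on the lanes below the current one, the kept items stay in their original order within the wavefront.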

Notice that this piece of code is only equivalent to the atomic version if your dispatch size is 64, that is, a single wavefront. For other sizes, you’ll need one atomic per wavefront – which cuts down the number of atomics by a factor of up to 64 compared to the naive approach. The pattern you’d use looks like this:


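// Clear the work group counter in LDS.
if (wavefrontThreadId == 0) {
    sharedCounter = 0;
}

barrier ();

bool laneActive = predicate (item);
// Take the ballot before branching: inside the branch below only lane 0
// is active, and some ballot variants only report active invocations.
uint64_t laneMask = ballot (laneActive);
// Slot of this lane relative to the start of its wavefront.
int outputSlot = mbcnt (laneMask);

// One atomic per wavefront reserves a range for all of its active lanes,
// so sharedWavefrontOutputSlot holds one value per wavefront.
if (wavefrontThreadId == 0) {
    sharedWavefrontOutputSlot = atomicAdd (sharedCounter,
        popcount (laneMask));
}
barrier ();

if (laneActive) {
    output [sharedWavefrontOutputSlot + outputSlot] = item;
}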
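A CPU-side sketch of this pattern in Python shows where the saving comes from. It assumes a wavefront size of 64 and processes the wavefronts of one work group one after another; `compact_workgroup` and `atomic_ops` are names invented for this example.

```python
WAVE_SIZE = 64  # lanes per wavefront (assumed, as on GCN)


def ballot(lane_flags):
    """Pack one boolean per lane into a bit mask (bit i = lane i)."""
    return sum(1 << lane_id for lane_id, flag in enumerate(lane_flags) if flag)


def popcount(v):
    return bin(v).count("1")


def mbcnt(mask, lane_id):
    """Popcount of the mask bits below lane_id."""
    return popcount(mask & ((1 << lane_id) - 1))


def compact_workgroup(items, predicate):
    """Simulate one work group; returns (kept_items, atomic_ops)."""
    shared_counter = 0
    atomic_ops = 0
    output = [None] * len(items)
    for start in range(0, len(items), WAVE_SIZE):
        wave = items[start:start + WAVE_SIZE]
        lane_active = [predicate(item) for item in wave]
        mask = ballot(lane_active)
        # Lane 0 reserves a contiguous range for the whole wavefront.
        wave_base = shared_counter
        shared_counter += popcount(mask)
        atomic_ops += 1
        for lane_id, item in enumerate(wave):
            if lane_active[lane_id]:
                output[wave_base + mbcnt(mask, lane_id)] = item
    return output[:shared_counter], atomic_ops


kept, atomic_ops = compact_workgroup(list(range(256)), lambda x: x % 2 == 0)
```

For a 256-thread work group this issues four atomics, one per wavefront, instead of one per kept item. On hardware the wavefronts may reach the atomic in any order, so the ranges of different wavefronts can end up in a different order than in this sequential simulation.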

And finally, if you are going over the data with multiple thread groups, you’ll need a global atomic as well. In this case, the pattern looks as follows (assuming globalCounter has been cleared to zero before the dispatch starts):


if (wavefrontThreadId == 0) {
    sharedCounter = 0;
}

barrier ();

bool laneActive = predicate (item);
// Take the ballot before branching, so it covers the whole wavefront.
uint64_t laneMask = ballot (laneActive);
int outputSlot = mbcnt (laneMask);

// Per-wavefront offset within the work group.
if (wavefrontThreadId == 0) {
    sharedWavefrontOutputSlot = atomicAdd (sharedCounter,
        popcount (laneMask));
}
barrier ();

// sharedSlot is a shared variable as well. readlane is not sufficient, as
// we need to communicate across all invocations of the work group.
if (workgroupThreadId == 0) {
    sharedSlot = atomicAdd (globalCounter, sharedCounter);
}
barrier ();

if (laneActive) {
    output [sharedWavefrontOutputSlot + outputSlot + sharedSlot] = item;
}
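The three levels can be simulated together in Python to check that the offsets add up. This is a sketch under assumed sizes (64 lanes per wavefront, 256 threads per work group); `compact_dispatch` and `global_atomics` are hypothetical names, and work groups run sequentially here.

```python
WAVE_SIZE = 64    # lanes per wavefront (assumed)
GROUP_SIZE = 256  # threads per work group (assumed)


def ballot(lane_flags):
    """Pack one boolean per lane into a bit mask (bit i = lane i)."""
    return sum(1 << lane_id for lane_id, flag in enumerate(lane_flags) if flag)


def popcount(v):
    return bin(v).count("1")


def mbcnt(mask, lane_id):
    """Popcount of the mask bits below lane_id."""
    return popcount(mask & ((1 << lane_id) - 1))


def compact_dispatch(items, predicate):
    """Simulate a whole dispatch; returns (kept_items, global_atomics)."""
    global_counter = 0
    global_atomics = 0
    output = [None] * len(items)
    for group_start in range(0, len(items), GROUP_SIZE):
        group = items[group_start:group_start + GROUP_SIZE]
        shared_counter = 0
        pending = []  # (slot within the work group, item)
        for wave_start in range(0, len(group), WAVE_SIZE):
            wave = group[wave_start:wave_start + WAVE_SIZE]
            lane_active = [predicate(item) for item in wave]
            mask = ballot(lane_active)
            wave_base = shared_counter          # local atomic, one per wavefront
            shared_counter += popcount(mask)
            for lane_id, item in enumerate(wave):
                if lane_active[lane_id]:
                    pending.append((wave_base + mbcnt(mask, lane_id), item))
        shared_slot = global_counter            # global atomic, one per work group
        global_counter += shared_counter
        global_atomics += 1
        for local_slot, item in pending:
            output[shared_slot + local_slot] = item
    return output[:global_counter], global_atomics


kept, global_atomics = compact_dispatch(list(range(1000)), lambda x: x % 5 == 0)
```

With 1000 items and 256 threads per work group, only four global atomics are issued. Each work group first totals its wavefronts locally and then reserves its whole range in one step.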

The “big picture” can be seen below. Depending on the level you’re working on, you should use the right atomics. At the global level, where you need to synchronize between work groups, you have to use global memory atomics. At the work group level, where you need to synchronize between wavefronts, local atomics are sufficient. Finally, at the wavefront level, you should take advantage of the wavefront functions like mbcnt.

Each dispatch level requires a different function. For whole work groups within a dispatch, output memory would be reserved using a global memory atomic. Within a work group, a local memory atomic is sufficient, and within a wavefront, `mbcnt` is the faster function.

If you want to try this out right now – we’ve got you covered and have a sample prepared for you!

Matthäus Chajdas is a developer technology engineer at AMD. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

8 Comments

Nice. Is there any chance to expose this feature to OpenCL, as well?

How about this?

If the SPIR-V OpGroupIAdd opcode was relaxed to support differing return and argument types — specifically, an integer return type and boolean argument — then SPIR-V would be able to optionally *efficiently* express:

popcount( ballot() & lanes_less_than() )
popcount( ballot() & lanes_less_than_or_equal() )

This would then allow OpenCL to expose the following potentially optimal sub_group functions:

int sub_group_scan_inclusive_add(bool pred)
int sub_group_scan_exclusive_add (bool pred)

I added an RFE to SPIR-V here: https://github.com/KhronosGroup/SPIRV-Headers/issues/9

The primitives you need are ballot and get current lane ID, which would allow you to implement everything yourself. The lanes_less_than helps the compiler, though, in understanding that this can be folded into mbcnt. Thanks for the RFE!

I suggested a nicer alternative in the RFE that requires no change to SPIR-V.

You can safely emit popcnt(mbcnt() & ballot(pred)) whenever it’s determined that X in OpGroupIAdd.Subgroup.InclusiveScan(X) is guaranteed to be a 0 or 1.

ExclusiveScan and Reduce lane count implementations could also be implemented this way.

[Regarding the OpenCL question] Not right now and I don’t know what the plans are regarding OpenCL. It’s in SPIR-V and GLSL for now.

This line should be:
output [sharedWavefrontOutputSlot + outputSlot + sharedSlot] = item;
instead of:
output [sharedWavefrontOutputSlot + outputSlot + globalCounter] = item;

Indeed – good catch. Fixing right now …

“The function ballotARB() returns a bitfield containing the result of evaluating the expression in all active invocations in the sub-group”

I don’t know what the AMD version of this extension says (I can find neither documentation nor links anywhere), but the quote above from the ARB version should be a warning for anyone trying this in somewhat more portable code. The fix is simple: store the result of the ballot in a temporary before the branch. The keyword is ‘active’. I assume that refers to non-masked lanes, and this is how it works on NVIDIA hw anyhow: a masked lane just returns zero.
