Stream compaction is, IMO, one of the most important applications of GPU prefix sums.
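For anyone who hasn't seen it, here is a minimal sketch of the idea in plain Python. It is sequential, so it only illustrates the data flow: on a GPU the flag, scan, and scatter steps would each run in parallel. The names `compact`, `keep`, and `scan` are made up for this example and don't come from any library.

```python
from itertools import accumulate

def compact(items, keep):
    """Keep the items for which keep(x) is true, preserving order."""
    # Step 1: one flag per element (1 = keep, 0 = drop).
    flags = [1 if keep(x) else 0 for x in items]

    # Step 2: exclusive prefix sum of the flags.
    # scan[i] is the number of kept elements before index i, which is
    # exactly the output slot for element i. scan[-1] is the total count.
    scan = list(accumulate(flags, initial=0))

    # Step 3: scatter. Every kept element writes to its own slot, so no
    # two writes collide. That is why this step parallelizes cleanly.
    out = [None] * scan[-1]
    for i, x in enumerate(items):
        if flags[i]:
            out[scan[i]] = x
    return out

positives = compact([3, -1, 4, -1, 5, -9, 2, 6], lambda x: x > 0)
print(positives)  # [3, 4, 5, 2, 6]
```

The point is that the prefix sum tells each element where to write without any coordination between threads, which is the hard part of doing this on SIMD hardware.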

Of course, prefer a library over writing your own. DirectX 11 (and 12) has AppendStructuredBuffer (https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-object-appendstructuredbuffer), for instance, which performs this automatically. AVX-512 even has “compress” and “expand” instructions that effectively perform this stream-compaction task.

IMO, this is one of the “new era” methods of creating SIMD data structures. It's not often talked about, but it's all over the place. The wizards / experts obviously know about this, but somehow don’t know how to tell us normal people how it works.