The new warp alignment optimizer

As I mentioned in the previous post, I've been working on writing a better algorithm for optimizing the way that entities are aligned to GPU warps (or as Apple calls them, SIMD-groups). For the sake of conversation, let's assume that each GPU threadgroup is 1024 threads, and those are broken up into 32 warps, each of which has 32 threads. (These happen to be actual numbers from both my NVIDIA chip and my Apple M1.)

Each warp shares an instruction pointer. This is why Apple's name for them makes sense: each SIMD-group is kind of like a 32-data-wide SIMD unit. In practice these are some pretty sophisticated SIMD processors, because they can do instruction masking, allowing each thread to take different branches. But the way this works for something like "if (x) y else z" is that the SIMD unit executes the instructions for BOTH y and z on every thread, but the y instructions are masked out (have no effect) for threads that take the z branch, and the z instructions are masked out for threads that take the y branch. This is not a huge deal if y and z are simple computations, but if each branch has dozens of instructions, you have to wait for both branches to be executed serially, which is slow.

Note that this penalty is only paid if there are actually threads that take both branches. If all the threads take the same branch, no masking is needed and things are fast. This is the key thing: at runtime, putting computations with similar branch flows in the same warp is much faster than mixing computations with divergent branch flows.
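To make the cost of divergence concrete, here is a toy model in Python. It is only a sketch: the instruction counts in BRANCH_COST are made up, and it assumes a warp's cost is simply the sum of the costs of every distinct branch that at least one of its threads takes, ignoring scheduling, memory access, and everything else a real GPU does.

```python
# Toy model (an assumption, not a measurement): a warp executes every branch
# that at least one of its threads takes, one after another, so its cost is
# the sum of the costs of the distinct branches present in the warp.
BRANCH_COST = {"y": 40, "z": 60}  # made-up instruction counts per branch


def warp_cost(branches_taken):
    """Cost of one warp, given the branch each of its threads takes."""
    return sum(BRANCH_COST[b] for b in set(branches_taken))


mixed = ["y", "z"] * 16  # 32 threads, both branches present
all_y = ["y"] * 32       # 32 threads, all on the y branch
all_z = ["z"] * 32       # 32 threads, all on the z branch

print(warp_cost(mixed))  # 100: every thread waits for both branches
print(warp_cost(all_y))  # 40
print(warp_cost(all_z))  # 60
```

Under this model, two mixed warps each cost 100, while the same 64 threads sorted into one all-y warp and one all-z warp cost 40 and 60.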

For Anukari, the most obvious way to do this is to group entities by type. Put sensors in a warp with other sensors, put bodies in a warp with other bodies, etc. In total, including sub-types, Anukari has 11 entity types. This means that for small instruments, we can easily sort each entity type into its own warp, and get huge speedups. This is really the main advantage of porting from OpenCL to CUDA: just like Apple, NVIDIA artificially limits OpenCL to just 8 32-thread warps (256 threads). If you want to use the full 32 32-thread warps (1024 threads), you have to port to the hardware's native language. And we really do want all 32, because with OpenCL's 8 warps, we're much more likely to have to double up two entity types in one warp. Having 32 warps gives us a ton of flexibility.

The Algorithm

So we have 11 entity types going into 32 buckets. This is easy until we consider that an instrument may have hundreds of one entity type, and zero of another, or that instruments might have 33 of one entity type, which doesn't fit into a single bucket, etc. What we have here is an optimization problem. It is related to the bin-packing problem, but it's more complicated than the vanilla bin-packing problem because there are additional constraints, like the fact that we can break groups of entities into sub-groups if needed, and the fact that it needs to run REALLY fast because this happens on the audio thread whenever we need to write entities to buffers.
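A quick bit of arithmetic shows why this isn't trivial. The sketch below counts how many warps an instrument would need if every entity type got its own warps. The entity counts are hypothetical, picked only to illustrate the problem, and warps_needed is a name I made up for this example.

```python
WARP_SIZE = 32   # threads per warp
MAX_WARPS = 32   # warps per threadgroup


def warps_needed(count):
    """Warps required to hold `count` entities of one type (ceiling division)."""
    return -(-count // WARP_SIZE)


# Hypothetical instrument; these counts are invented for illustration.
counts = {"body": 400, "sensor": 520, "lfo": 33, "mic": 3}

per_type = {t: warps_needed(n) for t, n in counts.items()}
total_entities = sum(counts.values())
total_warps = sum(per_type.values())

print(per_type)        # {'body': 13, 'sensor': 17, 'lfo': 2, 'mic': 1}
print(total_entities)  # 956, which fits in 1024 threads
print(total_warps)     # 33, which does not fit in 32 warps
```

So 956 entities fit comfortably in 1024 threads, but giving each type its own warps needs 33 warps. Note the 33 LFOs, which burn a second warp for a single entity.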

I'm extremely happy with the solution I ended up with. First, we simplify the problem:

- Each entity type is grouped together into a contiguous unit, which might internally have padding but will never be separated by another entity type.
- We do not consider reordering entity types: there is a fixed order that we put them in and that's it. This order is hand-chosen so that the most expensive entities are not adjacent to one another, and thus are unlikely to end up merged into the same warp.

A quick definition: a warp's "occupancy" will be the number of distinct types of entities that have been laid out within that warp. So if a warp contains some sensors, and some LFOs, its occupancy would be 2.
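In code, the definition might look like the sketch below. It assumes a warp is represented as a list of 32 slots holding an entity type name, with None for no-op padding, and that padding does not count toward occupancy. Both the representation and that choice are assumptions made for this example.

```python
def occupancy(warp):
    """Number of distinct entity types in a warp; None marks no-op padding."""
    return len({slot for slot in warp if slot is not None})


# 20 sensors, 8 LFOs, and 4 padding slots in one 32-thread warp.
warp = ["sensor"] * 20 + ["lfo"] * 8 + [None] * 4

print(occupancy(warp))  # 2
```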

The algorithm then is as follows:

1. Pretend we have infinite warps, and generate a layout that would be optimal. Basically, assign each entity type enough warps such that the maximum occupancy is 1. (This means that some warps might be right-padded with no-op entities.)
2. If the current layout fits into the actual number of warps the hardware has, we are done.
3. If not, look for any cases where dead space can be removed without increasing any warp's occupancy. (On the first iteration, there won't be any.)
4. Increment the maximum allowable occupancy by 1, and merge together any adjacent warps that, after being merged, will not exceed this occupancy level.
5. Go back to step 2.

That's it! This is a minimax optimizer: it tries to minimize the maximum occupancy of any warp. It does this via the maximum allowable occupancy watermark. It tries all possible merges that would stay within a given occupancy before trying the next higher occupancy.
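The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the real implementation: the function names (ideal_layout, compact, optimize) are made up, a warp is modeled as a list of [entity_type, count] segments with padding implied, dead-space removal and merging are folded into a single compaction pass, a "merge" is allowed to pull only part of the next warp forward, and the occupancy limit always starts at 1.

```python
WARP_SIZE = 32


def ideal_layout(counts):
    """Step 1: give every entity type its own warps (occupancy 1)."""
    warps = []
    for etype, n in counts:  # counts is already in the fixed type order
        while n > 0:
            take = min(n, WARP_SIZE)
            warps.append([[etype, take]])  # padding is implied, not stored
            n -= take
    return warps


def compact(warps, limit):
    """Steps 3 and 4 in one pass: pull entities forward into padding, as long
    as no warp ends up holding more than `limit` distinct entity types."""
    i = 0
    while i + 1 < len(warps):
        cur, nxt = warps[i], warps[i + 1]
        free = WARP_SIZE - sum(n for _, n in cur)
        etype, n = nxt[0]
        # Types stay contiguous, so only cur's last segment can match.
        present = cur[-1][0] == etype
        if free == 0 or (not present and len(cur) >= limit):
            i += 1  # this warp is full, or pulling would exceed the limit
            continue
        take = min(free, n)
        if present:
            cur[-1][1] += take  # removes dead space, occupancy unchanged
        else:
            cur.append([etype, take])  # merge, occupancy goes up by 1
        if take == n:
            nxt.pop(0)
            if not nxt:
                del warps[i + 1]  # next warp was emptied, so drop it
        else:
            nxt[0][1] -= take
    return warps


def optimize(counts, max_warps=32):
    """Raise the occupancy limit until the layout fits into max_warps."""
    counts = [(t, n) for t, n in counts if n > 0]
    warps = ideal_layout(counts)
    limit = 1
    while len(warps) > max_warps:  # step 2
        if limit >= len(counts):
            raise ValueError("too many entities for the available warps")
        limit += 1
        compact(warps, limit)
    return warps, limit


warps, limit = optimize([("body", 40), ("sensor", 40)], max_warps=3)
print(limit)  # 2
print(warps)  # [[['body', 32]], [['body', 8], ['sensor', 24]], [['sensor', 16]]]
```

With 40 bodies and 40 sensors squeezed into 3 warps, the sketch ends up with one warp of occupancy 2 and no padding between the two types. Feeding it 1000 bodies followed by 1 each of 10 other types, with 32 warps available, drives the limit all the way up to 11, because the last warp has to hold every type.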

There are a couple of tricks to make this efficient, but the main one is that at the start of the algorithm, we make a conservative guess as to what the maximum occupancy will have to be. This way, if the solution requires occupancy 11 (say there are 1000 bodies and 1 of each remaining entity type, so the last warp has to contain all 11 types), we don't have to waste time merging things for occupancy 2, 3, 4, ... 11. It turns out that it's quite easy to guess within 1-2 occupancy below the true occupancy most of the time. I wrote a fuzz test for the algorithm, and in 5,000 random entity distributions the worst case is 5 optimizer iterations, and that's rare. Anyway, it's plenty fast enough.

The solutions the optimizer produces are excellent. In cases where there's a perfect solution available, it always gets it, because that's what it tries first. And in typical cases where compromise is needed, it usually finds solutions that are as good as what I could come up with manually.

Results

That's all fine and good, but does it work? Yes. Sadly I don't have a graph to share, because this doesn't help with any of my microbenchmarks. Those are all tiny instruments for which the old optimizer worked fine.

But for running huge complex instruments, it is an ENORMOUS speedup, often as much as 2x faster with the new optimizer. For example, for the large instrument in this demo video, it previously was averaging about 90% of the latency budget, with very frequent buffer overruns (the red clip lines in the GPU meter). That instrument is now completely usable with no overruns at all, averaging maybe 40% of the latency budget. Other benchmark instruments show even better gains, with one that never went below 100% of the latency budget before now at about 40%.

This opens up a TON more possibilities in terms of complex instruments. I think at this point, at least on Windows with NVIDIA hardware, I am completely satisfied with the performance. Apple with Metal is almost there but still needs just a tiny bit more work for me to be satisfied.
