Running iai-callgrind on Apple Silicon
My instruction-count benchmarks for Dynoxide, my embeddable DynamoDB engine in Rust, only ever ran in CI. The wall-clock benchmarks ran happily on my Mac, but the iai-callgrind track (benchmarks/benches/iai_core.rs) just sat there - because it needs Valgrind, and Valgrind doesn't run on Apple Silicon. Not "runs badly", not "needs a flag" - there's no aarch64 Darwin port at all. So on the machine where I write the code, I was guessing at performance until I pushed.
Which was a shame, because iai-callgrind is a lovely way to benchmark Rust code. Instead of wall-clock timing, which jitters with whatever else your machine happens to be doing, it counts CPU instructions under Valgrind's Callgrind. Same input, same count, every time. That makes it brilliant for catching the kind of regression that quietly adds 2% more work - exactly the sort of thing wall-clock noise buries. I wanted that locally, not just as a number CI reported back to me after the fact.
The fix is Docker. But there are a few catches that'll cost you an afternoon if nobody tells you about them first, so here they all are.
Run it in a native arm64 container
Valgrind on aarch64 Linux works fine. So the move is a native linux/arm64 container, not an emulated x86 one. Emulated x86 Valgrind is punishingly slow - you'd be running an instrumentation engine inside a CPU emulator. A native arm64 container runs at close to CI speed.
Start a long-lived container with the repo bind-mounted and an isolated target dir on a named volume. Match the rust: image tag to your project's toolchain.
REPO=/absolute/path/to/your/project
docker run -d --name dx-iai --platform linux/arm64 \
--security-opt seccomp=unconfined \
-v "$REPO":/repo \
-v dx-iai-target:/target \
-e CARGO_TARGET_DIR=/target -e CARGO_TERM_COLOR=never \
rust:1.95-bookworm sleep infinity
The isolated CARGO_TARGET_DIR on a named volume earns its place twice over: it keeps your host target/ clean (it's a different target triple anyway), and it persists iai-callgrind's stored results between runs, which is what makes the before/after comparison work later.
That --security-opt seccomp=unconfined flag, though, is the one that isn't optional. Here's why.
The seccomp / personality() trap
iai-callgrind runs your benchmark binary under setarch -R to switch off ASLR - it needs stable memory addresses to get deterministic counts. setarch does that through the personality() syscall, and Docker's default seccomp profile blocks personality(). Leave the flag off and the run dies with:
setarch: failed to set personality to aarch64: Operation not permitted
That's a maddening error to land on cold, because nothing about it points at Docker's security profile. --security-opt seccomp=unconfined lifts the block, and that single flag is all it needs - Callgrind runs the target on its own synthetic CPU, so there's no ptrace and no extra capability to grant.
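If you'd rather find this out before a long build than after it, a quick preflight can exercise the same syscall directly. This is a minimal sketch, not part of iai-callgrind: preflight is a hypothetical helper name, and it assumes the dx-iai container from above and the setarch binary that ships with the Debian-based rust image.

```shell
# Hypothetical helper: check the container is native arm64 and that
# personality() is allowed, before running any benchmarks.
preflight() {
  local name="${1:-dx-iai}" arch
  # A native container on Apple Silicon reports aarch64; x86_64 means emulation.
  arch="$(docker exec "$name" uname -m)" || return 1
  if [ "$arch" != "aarch64" ]; then
    echo "unexpected architecture: $arch" >&2
    return 1
  fi
  # setarch -R issues the personality() call that the default seccomp profile blocks.
  if ! docker exec "$name" setarch "$arch" -R true; then
    echo "personality() is blocked: recreate the container with seccomp=unconfined" >&2
    return 1
  fi
  echo "ok: $arch, ASLR can be disabled"
}

# Usage:
#   preflight dx-iai
```

If the second check fails, the flag has to go on docker run itself, so the container needs recreating rather than restarting.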
Installing Valgrind and the runner (mind the PATH)
Install Valgrind and the iai-callgrind runner inside the container. The runner version has to match the iai-callgrind crate version pinned in your Cargo.toml.
docker exec dx-iai bash -c '
apt-get update && apt-get install -y valgrind &&
cargo install iai-callgrind-runner --version 0.14.2
'
Note the bash -c, not bash -lc. A login shell sources /etc/profile, which on Debian overwrites PATH with the system default and so drops the rust image's cargo bin directory (/usr/local/cargo/bin) from it. Use a login shell here and cargo comes back "command not found" - which is a genuinely baffling thing to debug when you can see cargo sitting right there in the image.
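To keep the runner and the crate from drifting apart, you can read the version out of the lockfile rather than typing it. A minimal sketch, run on the host: iai_version is a hypothetical helper, and it assumes benchmarks/Cargo.lock already has an entry for the iai-callgrind crate (it prints nothing if not).

```shell
# Hypothetical helper: print the iai-callgrind version recorded in a Cargo.lock.
# The exact match on the name line skips iai-callgrind-runner and friends.
iai_version() {
  awk '
    $0 == "name = \"iai-callgrind\"" { found = 1; next }
    found && /^version = / { gsub(/"/, "", $3); print $3; exit }
  ' "$1"
}

# Usage, with REPO and dx-iai as set up earlier:
#   docker exec dx-iai cargo install iai-callgrind-runner \
#     --version "$(iai_version "$REPO/benchmarks/Cargo.lock")"
```

This uses awk rather than cargo metadata plus jq only because jq isn't in the stock rust image.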
Running the benchmarks
docker exec dx-iai bash -c 'cd /repo/benchmarks && cargo bench --bench iai_core --features iai-callgrind'
Two more traps:
Don't pass --locked. Resolving the iai-callgrind feature wants to touch Cargo.lock, and --locked aborts the whole run before it builds a single thing. The run will modify Cargo.lock as a side effect; if you'd rather not carry that change, reset it afterwards with git checkout -- benchmarks/Cargo.lock.
Don't pipe the run through | tail if you care whether it passed. A pipeline reports the exit status of its last command, so tail's success masks cargo's failure and a failed run cheerfully reports success. Ask me how I know.
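If you do want trimmed output, one way round it is to capture the exit status before trimming. A minimal sketch: run_tail is a hypothetical helper, and the 20-line cut-off is arbitrary. In bash, set -o pipefail before the pipeline is the other common fix.

```shell
# Hypothetical helper: run a command, print the last 20 lines of its output,
# and return the command's own exit status rather than tail's.
run_tail() {
  local log status
  log="$(mktemp)" || return 1
  "$@" >"$log" 2>&1
  status=$?
  tail -n 20 "$log"
  rm -f "$log"
  return "$status"
}

# Usage:
#   run_tail docker exec dx-iai bash -c \
#     'cd /repo/benchmarks && cargo bench --bench iai_core --features iai-callgrind'
```

Nothing is lost by buffering to a file, since tail can't print its last lines until the run has finished anyway.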
Before and after
iai-callgrind stores results per benchmark in the target dir and auto-compares each run against the last. So the workflow is simple: run on your baseline, change the code, run again. The second run prints the delta per metric, as a percentage such as -1.54% alongside a factor such as [-1.01560x]. The only requirement is that both runs share the same target dir, which is exactly why the setup above parks it on a named volume.
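For reading those two figures, here is the arithmetic as I understand it, sketched in awk. The formulas are my assumption about how the output is derived, not taken from iai-callgrind's documentation: the percentage is the change relative to the old count, and the bracketed factor is the larger count over the smaller, negated when the new count is lower. delta is a hypothetical helper name.

```shell
# Hypothetical helper: relative change between two instruction counts (old, new).
# old must be non-zero.
delta() {
  awk -v old="$1" -v new="$2" 'BEGIN {
    pct = (new - old) / old * 100
    factor = (new >= old) ? new / old : -(old / new)
    printf "%+.2f%% [%+.5fx]\n", pct, factor
  }'
}

# Usage:
#   delta 100000 98460   # prints: -1.54% [-1.01564x]
```

A percentage and a factor that disagree in sign would mean the formula above is wrong for your version, so treat the tool's own output as the authority.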
To compare two branches or commits, check the ref out in the host repo between runs. The container sees the change through the bind mount, no restart needed.
git -C "$REPO" checkout <baseline-ref>
docker exec dx-iai bash -c 'cd /repo/benchmarks && cargo bench --bench iai_core --features iai-callgrind' # baseline
git -C "$REPO" checkout <changed-ref>
docker exec dx-iai bash -c 'cd /repo/benchmarks && cargo bench --bench iai_core --features iai-callgrind' # prints the delta
The build cache lives in the target volume, so only the crates you actually changed recompile between runs.
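If you compare refs often, the checkout-then-run pair can be wrapped so a failed checkout never leads to benchmarking the wrong code. A minimal sketch: bench_ref is a hypothetical helper, and it assumes REPO and the dx-iai container from the setup above.

```shell
# Hypothetical helper: check out a ref in the host repo, then run the suite
# in the container. Stops before the benchmark if the checkout fails.
bench_ref() {
  git -C "$REPO" checkout --quiet "$1" || return 1
  docker exec dx-iai bash -c \
    'cd /repo/benchmarks && cargo bench --bench iai_core --features iai-callgrind'
}

# Usage:
#   bench_ref <baseline-ref> && bench_ref <changed-ref>
```

Chaining with && means a failed baseline run stops the comparison instead of printing a delta against stale results.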
The caveat that actually matters
Here's the one to keep in your head, because it's the difference between useful numbers and a wild goose chase: arm64 Callgrind instruction counts are not the same as x86 counts. Different architecture, different instructions, different totals. The numbers from your container won't line up with the x86 figures your CI publishes, and they're not meant to.
That's fine, as long as you use them for what they're good at. iai-callgrind on your Mac is for relative comparison - before versus after, on the same machine and the same arch. It is not for checking against an absolute figure from a different architecture. Treat the container's counts as "did my change make this cheaper or dearer", never as "does this match the published number". Get that backwards and you'll spend time chasing a regression that only exists because you compared arm64 against x86.
Cleanup
docker rm -f dx-iai
docker volume rm dx-iai-target
That's it: a native arm64 container, seccomp unconfined so setarch can disable ASLR, the right shell so cargo stays on PATH, and a clear head about relative versus absolute counts. None of it is hard once you know the traps. It's finding the traps that costs you the afternoon.
If you want to see what's actually being measured, the iai-callgrind suite lives in the Dynoxide repo. And if DynamoDB tooling is your sort of thing, Dynoxide itself might be worth a look.