BuildKit Cache Disk Full During Container Builds
Large container pipelines can compile successfully and still fail in a later runtime stage because BuildKit keeps intermediate snapshots. Treat this as a stage-boundary policy problem: measure builder cache, images, containerd snapshots, and final artifacts before deciding whether prune should be automatic or opt-out.
Separate reusable cache from one-shot intermediate layers.
On disposable builders, prune is usually safe after a successful stage. On persistent builders, preserve cache only when it has a measured reuse benefit and will not block the next stage.
builder cache -> final image -> next stage headroom -> prune policy
Capture peak stage footprint before pruning.
This packet shows whether disk is held by BuildKit cache, containerd snapshots, Docker images, builder-stage artifacts, app/game build output, or the host filesystem.
df -h; docker system df -v; docker builder du; du -sh /var/lib/docker/buildkit
Runbook: Define The Stage Boundary
- Measure disk before the engine/base image stage, after the final image is committed, and before the game/app stage starts.
- Separate final image size from retained intermediate snapshots. A successful build can still leave hundreds of GB in BuildKit.
- For disposable hosts, run docker builder prune -f after a successful stage and before the next disk-heavy stage.
- For persistent hosts, add a cache cap, age policy, and explicit --keep-cache or --no-prune escape hatch only where reuse is valuable.
- Add a free-space admission gate before the next stage. The build should fail early with a readable message while logs can still be written.
- Update disk requirements to peak footprint: source clone + builder artifacts + final image + BuildKit cache + next-stage output.
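The peak-footprint sum and the free-space gate from the list above can be sketched in a few lines of bash. This is a minimal sketch, not a definitive implementation: the function names (peak_footprint_gib, avail_gib, require_free_space) are hypothetical, and the GiB figures in the usage example are placeholder estimates that a team would replace with its own measurements.

```shell
#!/usr/bin/env bash
# Sketch of a free-space admission gate. All sizes are whole GiB.
# Function names are hypothetical; nothing here is a Docker feature.

# Sum the components of peak footprint (any number of integer GiB values).
peak_footprint_gib() {
  local total=0 part
  for part in "$@"; do
    total=$(( total + part ))
  done
  echo "$total"
}

# Available space on the filesystem holding a path, in whole GiB.
# df -Pk prints one data line per filesystem; column 4 is available KiB.
avail_gib() {
  df -Pk "$1" 2>/dev/null | awk 'NR==2 { printf "%d\n", $4 / 1048576 }'
}

# Fail early, with a readable message, if the next stage will not fit.
require_free_space() {
  local path="$1" need_gib="$2" have_gib
  have_gib="$(avail_gib "$path")"
  if [ -z "$have_gib" ]; then
    echo "disk gate: cannot read free space for ${path}" >&2
    return 2
  fi
  if [ "$have_gib" -lt "$need_gib" ]; then
    echo "disk gate: ${path} has ${have_gib} GiB free, next stage needs ${need_gib} GiB" >&2
    return 1
  fi
  echo "disk gate: ${path} has ${have_gib} GiB free (need ${need_gib} GiB)"
}
```

A possible call before the game/app stage, with made-up component sizes for source clone, builder artifacts, final image, BuildKit cache, and next-stage output:

```shell
need="$(peak_footprint_gib 40 120 60 200 80)"
require_free_space /var/lib/docker "$need" || exit 1
```

Running the gate before the stage starts means the failure message is written while the volume still has room for logs.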
Use this when BuildKit cache fills the host between stages.
This keeps the fix framed around stage boundaries and peak disk footprint, not just "add more disk."
I would frame this as a stage-boundary policy: after the engine image is committed, the next stage should not inherit the full intermediate cache unless the builder is intentionally persistent and the cache has measured reuse value.
Evidence I would capture before/after the engine stage:
df -hT / /var/lib/docker /var/lib/containerd
df -i / /var/lib/docker /var/lib/containerd
docker system df -v
docker builder du
du -xh /var/lib/docker/buildkit /var/lib/containerd 2>/dev/null | sort -h | tail -80
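To make before/after comparisons repeatable, the commands above could be wrapped so that each stage boundary produces one timestamped log. The sketch below assumes bash; capture_stage_footprint, the label argument, and the ./disk-evidence directory are hypothetical names chosen for illustration. A missing path or an absent docker CLI is recorded in the log rather than treated as fatal.

```shell
#!/usr/bin/env bash
# Sketch: snapshot disk evidence at a stage boundary into one log file.
# capture_stage_footprint and the output layout are hypothetical.

capture_stage_footprint() {
  local label="$1" outdir="${2:-./disk-evidence}"
  local out cmd
  mkdir -p "$outdir" || return 1
  out="$outdir/$(date -u +%Y%m%dT%H%M%SZ)-${label}.log"
  {
    echo "### stage boundary: ${label}"
    for cmd in \
      "df -hT / /var/lib/docker /var/lib/containerd" \
      "df -i / /var/lib/docker /var/lib/containerd" \
      "docker system df -v" \
      "docker builder du"; do
      echo "--- ${cmd}"
      # Record failures instead of aborting the capture.
      $cmd 2>&1 || echo "(command failed or unavailable)"
    done
    echo "--- du top entries"
    du -xh /var/lib/docker/buildkit /var/lib/containerd 2>/dev/null \
      | sort -h | tail -80 || true
  } > "$out"
  # Print the log path so callers can attach it to the build record.
  echo "$out"
}
```

Calling it once as capture_stage_footprint before-engine and once as capture_stage_footprint after-engine gives two files that can be diffed to see which directory grew.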
For disposable or one-shot builders, make prune the default after a successful stage and add an opt-out like --keep-cache. For persistent builders, add a cache cap/TTL and a free-space gate before the game/app build starts.
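One way to express that policy is a small wrapper that picks the prune command from the builder type and honors an opt-out. This is a sketch under stated assumptions: BUILDER_KIND, KEEP_CACHE, CACHE_TTL, and DRY_RUN are hypothetical environment variables of this script, not Docker options. The only Docker flags used are -f and --filter until=<duration> on docker builder prune.

```shell
#!/usr/bin/env bash
# Sketch of a stage-boundary prune policy.
# BUILDER_KIND, KEEP_CACHE, CACHE_TTL and DRY_RUN are hypothetical
# settings of this script, not Docker options.

# Print the prune command for this builder, or nothing when cache is kept.
prune_command() {
  local kind="${BUILDER_KIND:-disposable}" keep="${KEEP_CACHE:-0}"
  if [ "$keep" = "1" ]; then
    return 0   # explicit opt-out: leave the cache alone
  fi
  case "$kind" in
    disposable)
      # One-shot host: drop unused build cache after the stage succeeds.
      echo "docker builder prune -f"
      ;;
    persistent)
      # Long-lived host: only drop cache entries older than the TTL.
      echo "docker builder prune -f --filter until=${CACHE_TTL:-72h}"
      ;;
    *)
      echo "unknown BUILDER_KIND: ${kind}" >&2
      return 2
      ;;
  esac
}

# Run the policy after a successful stage; DRY_RUN=1 only prints the command.
prune_after_stage() {
  local cmd
  cmd="$(prune_command)" || return $?
  if [ -z "$cmd" ]; then
    echo "prune skipped (KEEP_CACHE=1)"
    return 0
  fi
  echo "+ ${cmd}"
  [ "${DRY_RUN:-0}" = "1" ] || $cmd
}
```

The sketch covers only the TTL half of a cap/TTL policy. A size cap would be an additional prune call, and the flag for it differs between Docker CLI and buildx versions, so it is left out here; docker builder prune --help on the build host shows what is available.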
Turn one BuildKit disk-full failure into a build policy.
The $99 policy is for teams running large container pipelines where intermediate layer cache, source builds, final images, and app/game artifacts compete for one host volume. You get a safe/review/do-not-touch boundary, prune policy, and free-space gates.
Do Not Prune Blindly
- Persistent builder cache before confirming it has no measured reuse value.
- Named Docker volumes or containerd state used by running services.
- Final images or artifacts required by the next pipeline stage.
- Logs that show the first ENOSPC path and stage boundary where the host crossed the limit.