Nix GC EBUSY Runner Disk Full Recovery
When a self-hosted Nix CI runner leaks stale patched bind mounts, nix-collect-garbage can abort on the first EBUSY store path and free nothing. Treat that as a runner policy incident: prove which mounts are stale, preserve active build roots, then make GC resilient before the root filesystem fills again.
Separate stale patched mounts from live builds before lazy-unmounting anything.
The risk is not umount -l itself. The risk is running it without first proving that the mount is stale, older than the canceled build, and not tied to an active GC root or a current builder process.
mount evidence -> live root check -> lazy unmount -> GC retry
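The live root half of that check can be sketched in shell. This is a minimal sketch under stated assumptions, not a complete liveness test: it assumes nix-store --gc --print-roots prints one "root -> store path" line per root, and the function name is_rooted is hypothetical. It does not inspect builder processes, which still need a separate check.

```bash
# is_rooted STORE_PATH
# Reads "root -> target" lines on stdin (assumed shape of
# `nix-store --gc --print-roots`). Succeeds if STORE_PATH equals a rooted
# target or lies underneath one.
is_rooted() {
  # ENVIRON is used instead of awk -v so backslashes in paths stay literal.
  TARGET="$1" awk -F ' -> ' '
    BEGIN { t = ENVIRON["TARGET"] }
    $2 != "" && (t == $2 || index(t, $2 "/") == 1) { found = 1; exit }
    END { exit !found }
  '
}
```

Possible usage, where the store path is a placeholder for the source path of a suspect mount: nix-store --gc --print-roots | is_rooted /nix/store/example-src && echo "still rooted, do not unmount". A match means a live root still holds the path, so the mount should stay in the review pile.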
Capture the mount leak and GC failure before reclaiming space.
This packet gives maintainers enough to distinguish stale FOD patched bind mounts, active builds, GC roots, disk bytes, disk inodes, and the first EBUSY store path that stopped the collection.
df -hT; grep -E 'patched\.' /proc/1/mountinfo; nix-store --gc --print-roots
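One way to keep that evidence is to write each command's output to its own file before any cleanup starts. The sketch below is illustrative only: capture_packet, _capture_run, and the output file names are hypothetical, and it assumes the output directory sits on a filesystem that still has free space, since writing to the full root may fail.

```bash
# _capture_run OUTDIR NAME CMD...
# Saves stdout and stderr of CMD to OUTDIR/NAME.txt and appends the exit
# status to OUTDIR/status.tsv. A failing command never aborts the capture.
_capture_run() {
  local out="$1" name="$2" status=0
  shift 2
  "$@" > "$out/$name.txt" 2>&1 || status=$?
  printf '%s\t%s\n' "$name" "$status" >> "$out/status.tsv"
}

# capture_packet OUTDIR
# Collects disk bytes, disk inodes, patched mounts, and GC roots.
capture_packet() {
  local out="$1"
  mkdir -p "$out" || return 1
  _capture_run "$out" df-bytes df -hT
  _capture_run "$out" df-inodes df -i
  _capture_run "$out" patched-mounts grep -E 'patched\.' /proc/1/mountinfo
  _capture_run "$out" gc-roots nix-store --gc --print-roots
}
```

Recording the exit status matters here: a grep status of 1 means no patched mounts were found, which is itself evidence that the EBUSY has another cause.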
Runbook: Break The GC Abort Loop
- Stop treating the full disk as a generic cache issue. If GC aborts on EBUSY, every normal reclaim path may free zero bytes until the stale mount boundary is resolved.
- Capture mount evidence from /proc/1/mountinfo, not just mount output. Preserve source and target paths for every .patched. mount.
- Check GC roots and active builder processes before unmounting. A lazy unmount is acceptable only after the mount is stale or the build owner has been stopped.
- Run GC again after unmounting stale patched mounts and record bytes/paths freed. The before/after proves whether EBUSY was the blocking condition.
- Add a startup guard for persistent runners: if patched mounts older than a threshold exist, alert and quarantine the runner before new builds consume the remaining root filesystem.
- Add acceptance tests for cancellation: cancel a FOD patching build, restart the daemon, rerun GC, and verify GC skips or reaps stale state instead of returning zero progress.
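The mount capture and startup guard steps above could look roughly like the sketch below. mountinfo carries no mount timestamp, so this version approximates age with a first-seen ledger, which is an assumption of the sketch rather than anything Nix provides: reported ages are lower bounds, counted from the first time the guard observed the mount. The function names, the ledger file, and the "mount_id:target" key are all hypothetical.

```bash
# Print "mount_id source target" for every mountinfo entry whose source
# (field 4) or mount point (field 5) contains "patched.".
list_patched_mounts() {
  awk '$4 ~ /patched\./ || $5 ~ /patched\./ { print $1, $4, $5 }' \
    "${1:-/proc/1/mountinfo}"
}

# guard_patched_mounts MOUNTINFO LEDGER MAX_AGE_SECONDS [NOW_EPOCH]
# Records when each patched mount was first seen and prints those seen for
# longer than MAX_AGE_SECONDS. Returns 1 if any are stale, else 0.
guard_patched_mounts() {
  local mountinfo="$1" ledger="$2" max_age="$3" now="${4:-$(date +%s)}"
  local id src target key first_seen mounts stale=0
  touch "$ledger" || return 2
  mounts=$(list_patched_mounts "$mountinfo")
  while read -r id src target; do
    [ -n "$id" ] || continue
    key="$id:$target"
    # ENVIRON avoids awk -v escape processing of mountinfo's \040 sequences.
    first_seen=$(KEY="$key" awk '$1 == ENVIRON["KEY"] { print $2; exit }' "$ledger")
    if [ -z "$first_seen" ]; then
      first_seen="$now"
      printf '%s %s\n' "$key" "$now" >> "$ledger"
    fi
    if [ $(( now - first_seen )) -gt "$max_age" ]; then
      printf 'STALE %s %s -> %s\n' "$id" "$src" "$target"
      stale=1
    fi
  done <<EOF
$mounts
EOF
  return "$stale"
}
```

A runner start hook could call guard_patched_mounts /proc/1/mountinfo with a ledger path and a threshold of its choosing, then alert and refuse work on a non-zero return. Keeping the ledger off the root filesystem avoids losing it when the disk is already full.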
Use this when GC aborts with EBUSY and disk keeps filling.
This keeps the discussion focused on the operational invariant: stale patched mounts should not make every later GC reclaim zero bytes.
I would treat this as a GC progress failure, not just a disk cleanup problem. If one stale patched bind mount makes nix-collect-garbage abort with EBUSY, root can fill even though most store paths are otherwise reclaimable.
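A progress failure is easier to argue about with numbers. The following sketch compares available space before and after a GC attempt using POSIX df -Pk output. The helper names avail_kib and gc_progress are hypothetical, and the comparison is only approximate because other processes can write to the same filesystem between the two readings.

```bash
# avail_kib PATH
# Prints the available KiB on the filesystem that holds PATH
# (4th column of the second line of `df -Pk`).
avail_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# gc_progress BEFORE_KIB AFTER_KIB
# Prints the delta and returns 1 when the GC attempt freed nothing.
gc_progress() {
  local freed=$(( $2 - $1 ))
  if [ "$freed" -gt 0 ]; then
    printf 'progress: %d KiB freed\n' "$freed"
  else
    printf 'no progress: %d KiB freed\n' "$freed"
    return 1
  fi
}

# Example flow (not executed here):
#   before=$(avail_kib /nix)
#   nix-collect-garbage
#   after=$(avail_kib /nix)
#   gc_progress "$before" "$after" || echo "GC freed nothing; check for EBUSY"
```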
The useful recovery packet is:
df -hT / /nix /tmp
df -i / /nix /tmp
grep -E "patched\\." /proc/1/mountinfo
nix-store --gc --print-roots | head -200
journalctl -u nix-daemon --since "24 hours ago" | grep -E "EBUSY|patched|garbage|ENOSPC|No space" | tail -200
Then lazy-unmount only patched mounts proven stale or tied to a canceled build, rerun GC, and record bytes/paths freed. For recurrence, I would add a daemon/runner-start guard that alerts on old .patched.* mounts before the runner accepts more work.
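The unmount step is the one worth gating. A minimal sketch, assuming the list of mount points on stdin has already been reviewed and proven stale: it only prints what it would do unless APPLY=1 is set. The function name unmount_stale and the APPLY variable are hypothetical, and mount points taken from mountinfo with octal escapes such as \040 would need decoding first.

```bash
# unmount_stale
# Reads mount points from stdin, one per line. Dry run by default:
# set APPLY=1 in the environment to actually run `umount -l`.
unmount_stale() {
  local target
  while IFS= read -r target; do
    [ -n "$target" ] || continue          # skip blank lines
    if [ "${APPLY:-0}" = 1 ]; then
      umount -l -- "$target" && printf 'unmounted %s\n' "$target"
    else
      printf 'would run: umount -l -- %s\n' "$target"
    fi
  done
}
```

Running it once without APPLY and attaching the printed commands to the incident record gives reviewers a chance to object before anything is detached.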
Turn one Nix runner outage into a reusable storage policy.
The $99 policy is for self-hosted CI teams where Nix, Docker, build caches, runner work dirs, or leaked mounts can fill persistent runners. You get a safe/review/do-not-touch boundary, a recurrence guard, and acceptance checks for cancellation/GC behavior.
Do Not Delete First
- Active build outputs, live GC roots, and current builder work directories.
- Nix store paths before confirming whether EBUSY is caused by a stale mount rather than real ownership.
- Runner diagnostic logs that prove the original ENOSPC, EBUSY, cancellation, or daemon restart sequence.
- Docker volumes, cache stores, or workspace state on persistent runners until ownership and rebuildability are clear.