SafeDisk AI

Nix GC EBUSY Runner Disk Full Recovery

When a self-hosted Nix CI runner leaks stale patched bind mounts, nix-collect-garbage can abort on the first EBUSY store path and free nothing. Treat that as a runner policy incident: prove which mounts are stale, preserve active build roots, then make GC resilient before the root filesystem fills again.

Free runner decision card

Separate stale patched mounts from live builds before lazy-unmounting anything.

The risk is not umount -l itself. The risk is running it without first proving that the mount is stale, that it was left behind by a build that has since been canceled, and that it is not tied to an active GC root or a running builder process.

mount evidence -> live root check -> lazy unmount -> GC retry
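
For the live root check in the flow above, one minimal sketch is to save the output of nix-store --gc --print-roots to a file and test whether any root still points at the store path behind a suspect mount. The function name is_live_root and the file name roots.txt are hypothetical, and the sketch assumes each root line has the form "<root> -> <store path>". A path that passes this check can still be held by a running builder, so treat it as one gate, not the whole proof.

```shell
#!/bin/sh
# Sketch only. Assumes roots were saved beforehand with:
#   nix-store --gc --print-roots > roots.txt
# and that each line looks like "<root> -> <store path>".

# is_live_root STORE_PATH ROOTS_FILE
# Exit 0 if some root line points exactly at STORE_PATH, exit 1 otherwise.
is_live_root() {
    store_path=$1
    roots_file=$2
    awk -v target="$store_path" '
        {
            # Everything after the " -> " separator is the rooted store path.
            n = index($0, " -> ")
            if (n > 0 && substr($0, n + 4) == target) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$roots_file"
}
```

Usage, with a hypothetical store path: is_live_root /nix/store/example-path roots.txt && echo "still rooted, do not unmount".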
Read-only evidence

Capture the mount leak and GC failure before reclaiming space.

This packet gives maintainers enough to distinguish stale FOD patched bind mounts, active builds, GC roots, disk bytes, disk inodes, and the first EBUSY store path that stopped the collection.

df -hT; grep -F 'patched.' /proc/1/mountinfo; nix-store --gc --print-roots
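
To keep source and target paths together in the evidence, a small sketch can pull them straight out of mountinfo. In that file, field 4 is the root of the mount within its source filesystem and field 5 is the mount point. The function name list_patched_mounts is hypothetical, and matching on the literal text "patched." is an assumption about how the leaked mounts are named on your runner.

```shell
#!/bin/sh
# Sketch only: read-only listing of patched mounts from a mountinfo file.

# list_patched_mounts [MOUNTINFO_FILE]
# Prints "<source root><TAB><mount point>" for every mount whose source root
# or mount point contains the literal text "patched.".
# Defaults to /proc/1/mountinfo (the init mount namespace).
list_patched_mounts() {
    mountinfo=${1:-/proc/1/mountinfo}
    awk '$4 ~ /patched\./ || $5 ~ /patched\./ { printf "%s\t%s\n", $4, $5 }' "$mountinfo"
}
```

Paths containing spaces appear in mountinfo with octal escapes such as \040, so decode them before passing a path to umount.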

Runbook: Break The GC Abort Loop

  1. Stop treating the full disk as a generic cache issue. If GC aborts on EBUSY, every normal reclaim path may free zero bytes until the stale mount boundary is resolved.
  2. Capture mount evidence from /proc/1/mountinfo, not just mount output. Preserve source and target paths for every .patched. mount.
  3. Check GC roots and active builder processes before unmounting. A lazy unmount is acceptable only after the mount has been proven stale or the build that owns it has been stopped.
  4. Run GC again after unmounting stale patched mounts and record bytes/paths freed. The before/after proves whether EBUSY was the blocking condition.
  5. Add a startup guard for persistent runners: if patched mounts older than a threshold exist, alert and quarantine the runner before new builds consume the remaining root filesystem.
  6. Add acceptance tests for cancellation: cancel a FOD patching build, restart the daemon, rerun GC, and verify GC skips or reaps stale state instead of returning zero progress.
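
The startup guard in step 5 could be sketched as follows. mountinfo carries no mount timestamp, so this sketch uses the modification time reported by stat on the mount point as a rough proxy for age. That proxy is an assumption and may not hold for your patched mounts. The function names stale_mounts and runner_start_guard are hypothetical, and stat -c %Y assumes GNU coreutils or BusyBox on Linux.

```shell
#!/bin/sh
# Sketch only: flag patched mount points that look older than a threshold.
# ASSUMPTION: the mount point's mtime approximates when the mount was set up.

# stale_mounts THRESHOLD_SECONDS [NOW_EPOCH]
# Reads mount points on stdin, prints "<age seconds><TAB><mount point>" for
# each one whose mtime is more than THRESHOLD_SECONDS before NOW_EPOCH.
stale_mounts() {
    threshold=$1
    now=${2:-$(date +%s)}
    while IFS= read -r mp; do
        [ -n "$mp" ] || continue
        # Skip paths that cannot be inspected instead of failing the guard.
        mtime=$(stat -c %Y "$mp" 2>/dev/null) || continue
        age=$((now - mtime))
        if [ "$age" -gt "$threshold" ]; then
            printf '%s\t%s\n' "$age" "$mp"
        fi
    done
}

# runner_start_guard THRESHOLD_SECONDS
# Reads mount points on stdin. Returns 1 and reports on stderr if any look
# stale, so the caller can alert and refuse new work.
runner_start_guard() {
    found=$(stale_mounts "$1")
    if [ -n "$found" ]; then
        printf 'quarantine: stale patched mounts found\n%s\n' "$found" >&2
        return 1
    fi
    return 0
}
```

One way to wire it into runner startup, with a one-hour threshold chosen only for illustration: awk '$5 ~ /patched\./ { print $5 }' /proc/1/mountinfo | runner_start_guard 3600 || exit 1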
Copy-ready issue reply

Use this when GC aborts with EBUSY and disk keeps filling.

This keeps the discussion focused on the operational invariant: stale patched mounts should not make every later GC reclaim zero bytes.

I would treat this as a GC progress failure, not just a disk cleanup problem. If one stale patched bind mount makes nix-collect-garbage abort with EBUSY, root can fill even though most store paths are otherwise reclaimable.

The useful recovery packet is:

df -hT / /nix /tmp
df -i / /nix /tmp
grep -E 'patched\.' /proc/1/mountinfo
nix-store --gc --print-roots | head -200
journalctl -u nix-daemon --since "24 hours ago" | grep -E "EBUSY|patched|garbage|ENOSPC|No space" | tail -200

Then lazy-unmount only patched mounts proven stale or tied to a canceled build, rerun GC, and record bytes/paths freed. For recurrence, I would add a daemon/runner-start guard that alerts on old .patched.* mounts before the runner accepts more work.
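
To record the before/after numbers, a minimal sketch can wrap the reclaim command and compare available space on the store filesystem. The names avail_kib and measure_reclaim are hypothetical. The sketch reports KiB from df -Pk and measures only free space, not the list of paths freed.

```shell
#!/bin/sh
# Sketch only: measure how much space a reclaim command actually frees.

# avail_kib PATH
# Prints the available KiB on the filesystem that holds PATH.
avail_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# measure_reclaim PATH COMMAND [ARGS...]
# Runs COMMAND and prints available space before and after, plus the delta.
# A failing COMMAND is reported on stderr but the numbers are still printed,
# because "freed 0 KiB after an error" is itself the evidence.
measure_reclaim() {
    path=$1
    shift
    before=$(avail_kib "$path")
    "$@" || echo "reclaim command exited non-zero: $?" >&2
    after=$(avail_kib "$path")
    printf 'before_kib=%s after_kib=%s freed_kib=%s\n' \
        "$before" "$after" "$((after - before))"
}
```

For example: measure_reclaim /nix nix-collect-garbage. Run it once before unmounting the stale mounts and once after, and keep both lines with the incident record.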
Paid scope

Turn one Nix runner outage into a reusable storage policy.

The $99 policy is for self-hosted CI teams where Nix, Docker, build caches, runner work dirs, or leaked mounts can fill persistent runners. You get a safe/review/do-not-touch boundary, a recurrence guard, and acceptance checks for cancellation/GC behavior.

No secrets, private store paths, or production logs needed. Redacted command output is enough to start.

Do Not Delete First