SafeDisk AI

Shared GPU Scratch Volume Disk Full

When a shared AI training volume reaches 100%, the right answer is rarely "delete the biggest directory." Student homes, active workdirs, model weights, W&B artifacts, checkpoints, and source builds need an owner-approved cleanup boundary before GPU jobs start losing writes.

Free shared-volume decision card

Separate active experiments from regenerable scratch before deleting anything.

The first policy question is ownership: which paths belong to one user, which are shared infrastructure, which are rebuildable caches, and which are active experiment state that needs explicit approval.

home + workdir + model cache + checkpoint + artifact ownership
Read-only evidence

Capture owner and reclaimability before cleanup.

This packet is designed for shared homes/workdirs such as /senpai-run/home, /senpai-run/workdirs, Hugging Face caches, W&B runs, checkpoints, and source build outputs.

df -h; du by owner; find caches/checkpoints/artifacts
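A first-pass triage of the directories that find turns up could look like the sketch below. The function name classify_path and the three labels are hypothetical, and the mapping from directory name to label is an assumption for illustration, not a rule from the packet. Review-first names are checked before cache names so that a checkpoint nested inside a cache directory is never marked as a cleanup candidate.

```shell
#!/bin/sh
# Hypothetical helper: label a directory path for the owner table.
# The name-to-label mapping is an assumption; adjust it to local policy.
classify_path() {
    case "$1" in
        # Conservative order: anything that may hold experiment state wins.
        */checkpoints|*/checkpoints/*|*/wandb|*/wandb/*|*/outputs|*/outputs/*)
            echo "review-first" ;;
        # Caches and build products that can usually be recreated.
        */.cache|*/.cache/*|*/huggingface|*/huggingface/*|*/build|*/build/*)
            echo "regenerable-candidate" ;;
        *)
            echo "unclassified" ;;
    esac
}

# Example (read-only): label each path read from stdin.
# find /senpai-run -xdev -maxdepth 5 -type d -name ".cache" -print |
#     while IFS= read -r p; do printf '%s %s\n' "$(classify_path "$p")" "$p"; done
```

The label is only a starting point for the owner conversation. A "regenerable-candidate" still needs the owner to confirm it before anything is deleted.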

Runbook: Recover Without Losing Experiment State

  1. Stop or gate new writes before cleanup starts. A volume at 95-100% can lose logs, checkpoints, git writes, and W&B flushes while you are measuring it.
  2. Build an owner table for each large directory: active user, active issue/job, stale candidate, or infrastructure-owned.
  3. Let users self-delete only their own regenerable caches first: package caches, downloaded model copies, compiled build products, and failed run artifacts they can recreate.
  4. Keep active workdirs, current checkpoints, final submissions, and experiment logs review-first until the job owner marks them stale.
  5. Move repeated large writes out of the shared volume: per-pod ephemeral scratch, quotas, cache TTLs, or separate model-cache volumes.
  6. Add a recurrence guard: warn at 85%, block new large builds at 90%, and require owner approval for shared cleanup at 95%.
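The thresholds in step 6 can be sketched as a small shell check. The function names guard_level and usage_pct and the returned labels are hypothetical. The sketch assumes the standard df -P output layout, where the fifth column of the second line is the used percentage, and it only reports a level without enforcing anything.

```shell
#!/bin/sh
# Hypothetical helper: map a used percentage to the guard level from step 6.
guard_level() {
    pct=$1
    if [ "$pct" -ge 95 ]; then
        echo "owner-approval"   # shared cleanup needs owner approval
    elif [ "$pct" -ge 90 ]; then
        echo "block-builds"     # block new large builds
    elif [ "$pct" -ge 85 ]; then
        echo "warn"             # warn users
    else
        echo "ok"
    fi
}

# Read the used percentage of a mount point from POSIX-format df output.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Example: guard_level "$(usage_pct /senpai-run)"
```

Run from cron or a job prologue, the returned level could feed an alert or a scheduler gate. How each level is enforced depends on the cluster and is left out here.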
Copy-ready issue reply

Use this when a shared GPU workspace is full.

This keeps cleanup focused on ownership and rebuildability instead of deleting the biggest active experiment directory.

I would treat this as a shared-volume ownership problem, not a one-user cleanup problem.

Before deleting anything from student homes or workdirs, I would build a table with:

- owner
- path
- size
- active job / issue
- regenerable cache vs active experiment state
- approved cleanup action

Read-only evidence:

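# Filesystem type and free space for the shared volume and root
df -hT /senpai-run /
# Inode usage: a volume can be full on inodes while bytes remain free
df -i /senpai-run /
# Largest directories two levels down, staying on one filesystem
du -xh /senpai-run --max-depth=2 | sort -h | tail -80
# Candidate cache, run, checkpoint, and build directories
find /senpai-run -xdev -maxdepth 5 -type d \( -name ".cache" -o -name "huggingface" -o -name "wandb" -o -name "checkpoints" -o -name "outputs" -o -name "build" \) -print
# Files over 1 GiB with size in bytes and owner, largest last
find /senpai-run -xdev -type f -size +1G -printf "%s %u %p\n" | sort -n | tail -80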

The safe immediate move is a write gate plus cleanup of regenerable caches by the users who own them. The durable fix is per-user or per-pod scratch isolation, or quota and TTL rules, so one active build cannot refill the shared volume.
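To start the owner table from the large-file listing in the reply, the "size owner path" lines can be summed per owner. The function name owner_totals is hypothetical. The sketch assumes GNU find's -printf "%s %u %p\n" format, with the size in bytes in the first field and the owner in the second.

```shell
#!/bin/sh
# Hypothetical helper: sum bytes per owner from "size owner path" lines on
# stdin and print "owner total_bytes", largest total first.
owner_totals() {
    awk '{ bytes[$2] += $1 }
         END { for (u in bytes) printf "%s %.0f\n", u, bytes[u] }' |
        sort -k2,2nr
}

# Example (read-only):
# find /senpai-run -xdev -type f -size +1G -printf "%s %u %p\n" | owner_totals
```

The totals show who to contact first. They do not say which files are safe to remove, so the active job and cache-versus-state columns still need the owner's input.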
Paid scope

Turn one shared-volume outage into a cleanup policy.

The $99 policy is for teams running shared AI/GPU workspaces where one volume holds multiple users, model caches, checkpoints, W&B artifacts, source builds, and active workdirs. You get a safe/review/do-not-touch boundary and a recurrence guard.

No secrets, private datasets, or full logs needed. A redacted path/size table is enough to start.

Do Not Delete First