Shared GPU Scratch Volume Disk Full

When a shared AI training volume reaches 100%, the right answer is rarely "delete the biggest directory." Student homes, active workdirs, model weights, W&B artifacts, checkpoints, and source builds need an owner-approved cleanup boundary before GPU jobs start losing writes.

Runbook: Recover Without Losing Experiment State

Stop or gate new writes before cleanup starts. A volume at 95-100% can lose logs, checkpoints, git writes, and W&B flushes while you are measuring it.
Build an owner table that classifies each large directory as one of: active user, active issue/job, stale candidate, or infrastructure-owned.
Let users self-delete only their own regenerable caches first: package caches, downloaded model copies, compiled build products, and failed run artifacts they can recreate.
Keep active workdirs, current checkpoints, final submissions, and experiment logs confirm-first until the job owner marks them stale.
Move repeated large writes out of the shared volume: per-pod ephemeral scratch, quotas, cache TTLs, or separate model-cache volumes.
Add a recurrence guard: warn at 85%, block new large builds at 90%, and require owner approval for shared cleanup at 95%.
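
The recurrence guard in the last step could be sketched as a small shell helper. This is a minimal sketch, not a definitive implementation: the function names `classify_usage` and `volume_pct` and the action labels are hypothetical, and only the 85/90/95 thresholds come from the runbook above.

```shell
#!/usr/bin/env bash
# Sketch of the 85/90/95 recurrence guard.
# Function names and action labels are illustrative assumptions.

# Map an integer usage percentage to the runbook action.
classify_usage() {
  local pct="$1"
  if [ "$pct" -ge 95 ]; then
    echo "owner-approval"   # shared cleanup requires owner sign-off
  elif [ "$pct" -ge 90 ]; then
    echo "block-builds"     # refuse new large builds
  elif [ "$pct" -ge 85 ]; then
    echo "warn"             # notify users, no enforcement yet
  else
    echo "ok"
  fi
}

# Read the used-space percentage of a mount point from POSIX-format df output.
volume_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}
```

A cron job or job-submission hook might call `classify_usage "$(volume_pct /senpai-run)"` and act on the returned label. How the label is enforced (scheduler hook, admission check, chat alert) is left open here.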

Copy-ready issue reply

Use this when a shared GPU workspace is full.

This keeps cleanup focused on ownership and rebuildability instead of deleting the biggest active experiment directory.

I would treat this as a shared-volume ownership problem, not a one-user cleanup problem.

Before deleting anything from student homes or workdirs, I would build a table with:

- owner
- path
- size
- active job / issue
- regenerable cache vs active experiment state
- approved cleanup action

Read-only evidence:

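# Capacity and filesystem type, then inode usage (a volume can fill on inodes alone)
df -hT /senpai-run /
df -i /senpai-run /
# Largest directories two levels deep, biggest last
du -xh /senpai-run --max-depth=2 | sort -h | tail -80
# Likely cache, artifact, checkpoint, and build directories
find /senpai-run -xdev -maxdepth 5 -type d \( -name ".cache" -o -name "huggingface" -o -name "wandb" -o -name "checkpoints" -o -name "outputs" -o -name "build" \) -print
# Files over 1 GiB with size in bytes and owner, biggest last
find /senpai-run -xdev -type f -size +1G -printf "%s %u %p\n" | sort -n | tail -80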

The safe immediate move is user-owned regenerable cache cleanup plus a write gate. The durable fix is per-user/per-pod scratch isolation or quota/TTL rules so one active build cannot re-fill the shared volume.
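
To seed the owner column of that table, the `size owner path` lines printed by the last `find` command can be summed per user. The helper below is a rough sketch under that assumption. The name `sum_by_owner` is hypothetical, and it only counts the files that the `find` filter emitted, so it is not a full per-user usage figure.

```shell
#!/usr/bin/env bash
# Sum bytes per owner from lines shaped like "<bytes> <user> <path>".
# The function name is an illustrative assumption, not an existing tool.
sum_by_owner() {
  # $1 is the size in bytes and $2 the owner; paths may contain spaces
  # because only the first two fields are read.
  awk '{ total[$2] += $1 }
       END { for (u in total) printf "%s %.0f\n", u, total[u] }' |
    sort -k2,2nr   # largest owner first
}
```

For example, `find /senpai-run -xdev -type f -size +1G -printf "%s %u %p\n" | sum_by_owner` would list each owner with the total bytes held in files over 1 GiB. It stays read-only, so it is safe to run before any cleanup decision.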

Do Not Delete First

Active workdirs, current checkpoints, and final submissions without owner approval.
Shared model weights if jobs still reference them and the source is gated or slow to rehydrate.
W&B/artifact logs before confirming whether they are the only record of a failed experiment.
Other users' homes or experiment outputs, when you are working from an advisor/helper pod that does not have node-wide ownership.
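
One read-only hint for spotting active workdirs is recent write activity. The sketch below lists directories that contain files modified within a given number of minutes. The function name `recently_written_dirs` is hypothetical and it assumes GNU `find`. A recent modification time suggests an active job, but a quiet directory is not proof of staleness, so the owner still decides.

```shell
#!/usr/bin/env bash
# List directories under a root that hold files modified in the last N minutes.
# The function name is an illustrative assumption; requires GNU find for -printf.
recently_written_dirs() {
  local root="$1" minutes="$2"
  # -xdev stays on one filesystem; %h prints each matching file's parent directory.
  find "$root" -xdev -type f -mmin "-$minutes" -printf '%h\n' | sort -u
}
```

Running `recently_written_dirs /senpai-run 60` would show where writes landed in the last hour. Those directories are candidates to mark as confirm-first in the owner table.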
