Docker crash-loop storage incident
Hermes Agent Tirith /tmp Crash-loop Disk Full
A supervisor restart loop can turn a configuration error into a disk incident when each failed process leaves a new sandbox under /tmp. The safe fix is to stop the loop first, classify active versus orphaned dirs, then add a permanent restart boundary.
Get the exact AI storage cleanup step.
Run the read-only check first. If the output still needs judgment, leave your email; we send the $29 Deep Cleanup step only when the agent cache or session state needs review.
Stop The Growth Before Cleaning
If a service is creating tens of MB per restart, deleting old directories while the loop continues only buys minutes. First prove which service is looping, then stop or gate that specific service.
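One way to gate a looping service is to count recent failures and refuse to start once a limit is reached. The sketch below is a minimal illustration, not the agent's or the supervisor's actual mechanism: the function names, the state file, and the limit and window parameters are all hypothetical.

```shell
# Hypothetical restart gate. All names and parameters here are illustrative.

record_failure() {
    # Append the current epoch time to the state file ($1).
    date +%s >> "$1"
}

recent_failures() {
    # Print how many failures in state file $1 happened within the last $2 seconds.
    [ -f "$1" ] || { echo 0; return; }
    awk -v now="$(date +%s)" -v w="$2" \
        'now - $1 <= w { n++ } END { print n + 0 }' "$1"
}

should_start() {
    # Succeed only while recent failures in $1 stay below the limit $2
    # within a window of $3 seconds.
    [ "$(recent_failures "$1" "$3")" -lt "$2" ]
}
```

A run wrapper could call `record_failure` whenever the service exits with an error, and check `should_start "$state" 3 300` before launching. When the check fails, the wrapper exits through whatever non-retrying path the supervisor provides.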
Measure tmp growth, restart rate, and active files.
This gathers evidence without deleting anything. Run it inside the container if the dirs are container-local, or on the host if /tmp is bind-mounted.
tmp="${TMPDIR:-/tmp}"
echo "tmp=$tmp"
df -h "$tmp" 2>/dev/null || true
df -i "$tmp" 2>/dev/null || true
find "$tmp" -maxdepth 1 -type d \( -name 'tirith-*' -o -name 'tirith-install-*' \) -print0 2>/dev/null \
| while IFS= read -r -d '' dir; do du -sh "$dir" 2>/dev/null; done \
| sort -hr | head -40
find "$tmp" -maxdepth 1 -type d \( -name 'tirith-*' -o -name 'tirith-install-*' \) -printf '%TY-%Tm-%Td %TH:%TM %p\n' 2>/dev/null \
| sort | tail -60
ps -eo pid,ppid,etime,stat,cmd | grep -E 's6|gateway|tirith|hermes' | grep -v grep || true
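The listing above shows sizes and timestamps but not a growth rate. A rough rate can be estimated by counting matching directories twice with a pause in between. This is a sketch under the assumption that each failed start leaves exactly one new top-level `tirith-*` directory. The function names are hypothetical.

```shell
count_tirith_dirs() {
    # Count top-level tirith-* directories under the temp directory $1.
    # Names containing newlines would be miscounted; acceptable for a quick check.
    find "$1" -mindepth 1 -maxdepth 1 -type d -name 'tirith-*' 2>/dev/null \
        | wc -l | tr -d ' '
}

sample_growth() {
    # Print how many tirith-* dirs appeared under $1 during $2 seconds.
    before=$(count_tirith_dirs "$1")
    sleep "$2"
    after=$(count_tirith_dirs "$1")
    echo $((after - before))
}
```

For example, `sample_growth "${TMPDIR:-/tmp}" 60` prints the number of directories created in one minute. A steady positive number while the service is supposedly idle points at a restart loop. Like the check above, this only reads the directory listing.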
Cleanup Boundary
- Permanent config failures should not restart forever: token collision, missing platform token, or no-platform state should exit through a non-retrying path.
- Quarantine before delete: move old tirith dirs to a quarantine folder first if you are not certain they are orphaned.
- Keep active dirs: skip dirs with recently modified files or open file handles.
- Make cleanup scoped: match only the known tirith prefix under the resolved temp directory, not arbitrary /tmp contents.
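The "keep active dirs" rule can be sketched as a predicate that looks for recently modified entries anywhere inside a directory and, where `lsof` is installed, for open file handles. The function name `is_active` and its parameters are hypothetical. `-mmin` is supported by GNU and BSD `find` but is not part of POSIX.

```shell
is_active() {
    # Succeed if directory $1 looks in use:
    #   - any entry in its tree (the directory itself included) was modified
    #     within the last $2 minutes, or
    #   - lsof is available and reports an open file under it.
    if [ -n "$(find "$1" -mmin "-$2" -print 2>/dev/null | head -n 1)" ]; then
        return 0
    fi
    # Open-handle check is skipped when lsof is not installed.
    if command -v lsof >/dev/null 2>&1 && lsof +D "$1" >/dev/null 2>&1; then
        return 0
    fi
    return 1
}
```

A cleanup loop could call `is_active "$dir" 60` and skip any directory for which it succeeds. Checking the whole tree matters because a directory's own modification time does not change when a file nested deeper inside it is rewritten.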
Move old tirith dirs after the restart loop is stopped.
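Use this only after the crashing service is stopped or gated. It moves dirs older than one hour instead of deleting them. The age test looks at each directory's own modification time, not at files nested inside it. Because the quarantine folder sits under the same temp directory, the move frees no space by itself; delete the quarantine once you have confirmed nothing in it is needed.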
tmp="${TMPDIR:-/tmp}"
quarantine="$tmp/safedisk-tirith-quarantine-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$quarantine"
find "$tmp" -maxdepth 1 -type d \( -name 'tirith-*' -o -name 'tirith-install-*' \) -mmin +60 -print0 2>/dev/null \
| while IFS= read -r -d '' dir; do
mv "$dir" "$quarantine/"
done
du -sh "$quarantine" 2>/dev/null || true
Regression Test Shape
- Start a profile with a known token collision and assert it reaches a non-retrying terminal state.
- Count tirith dirs before and after several supervisor ticks; the count should not grow after fatal config failure.
- Keep retry behavior for transient network/process failures so this fix does not mask recoverable crashes.
- Emit a small metric for orphaned sandbox bytes so disk pressure is visible before ENOSPC.
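The metric in the last item could be produced by summing disk usage over stale sandbox directories. The sketch below assumes that "orphaned" can be approximated by a top-level directory modification time older than a threshold. The function name and the threshold parameter are hypothetical.

```shell
orphaned_tirith_bytes() {
    # Print the total disk usage, in bytes, of top-level tirith-* dirs under $1
    # whose own mtime is older than $2 minutes. Prints 0 when none match.
    find "$1" -mindepth 1 -maxdepth 1 -type d -name 'tirith-*' -mmin "+$2" \
        -exec du -sk {} + 2>/dev/null \
        | awk '{ total += $1 } END { printf "%.0f\n", total * 1024 }'
}
```

The output of `orphaned_tirith_bytes "${TMPDIR:-/tmp}" 60` can be written to whatever metrics or log channel is already in place. `du -sk` reports 1024-byte units, so the result is an allocation-based estimate, not an exact byte count.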