Docker crash-loop storage incident

Hermes Agent Tirith /tmp Crash-loop Disk Full

A supervisor restart loop can turn a configuration error into a disk incident when each failed process leaves a new sandbox under /tmp. The safe fix is to stop the loop first, classify active versus orphaned dirs, then add a permanent restart boundary.


Stop The Growth Before Cleaning

If a service is creating tens of MB per restart, deleting old directories while the loop continues only buys minutes. First prove which service is looping, then stop or gate that specific service.

Read-only first pass

Measure tmp growth, restart rate, and active files.

This gathers evidence without deleting anything. Run it inside the container if the dirs are container-local, or on the host if /tmp is bind-mounted.

tmp="${TMPDIR:-/tmp}"
echo "tmp=$tmp"
df -h "$tmp" 2>/dev/null || true
df -i "$tmp" 2>/dev/null || true

# Group the -name tests so -type d applies to both patterns, and let find call du directly.
find "$tmp" -maxdepth 1 -type d \( -name 'tirith-*' -o -name 'tirith-install-*' \) -exec du -sh {} + 2>/dev/null \
  | sort -hr | head -40

find "$tmp" -maxdepth 1 -type d \( -name 'tirith-*' -o -name 'tirith-install-*' \) -printf '%TY-%Tm-%Td %TH:%TM %p\n' 2>/dev/null \
  | sort | tail -60

ps -eo pid,ppid,etime,stat,cmd | grep -E 's6|gateway|tirith|hermes' | grep -v grep || true
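
The listing above is a single snapshot, so it does not show how fast the directories are accumulating. A minimal sketch for sampling the count twice is shown here; the function names and the 30-second default interval are illustrative assumptions, not part of any Hermes or tirith tooling.

```shell
# Count tirith sandbox dirs directly under a temp root (read-only).
count_tirith_dirs() {
  find "${1:-/tmp}" -mindepth 1 -maxdepth 1 -type d -name 'tirith-*' 2>/dev/null \
    | wc -l | tr -d ' '
}

# Sample the count twice and print the difference.
# The interval is in seconds; 30 is an arbitrary default.
sample_tirith_growth() {
  root="${1:-/tmp}"
  interval="${2:-30}"
  before=$(count_tirith_dirs "$root")
  sleep "$interval"
  after=$(count_tirith_dirs "$root")
  echo "before=$before after=$after delta=$((after - before)) interval=${interval}s"
}
```

Calling sample_tirith_growth "$tmp" 60 while the loop is running gives a rough dirs-per-minute figure. A delta of zero after the service is stopped or gated suggests the growth has actually halted.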

Cleanup Boundary

Permanent config failures should not restart forever: token collision, missing platform token, or no-platform state should exit through a non-retrying path.
Quarantine before delete: move old tirith dirs to a quarantine folder first if you are not certain they are orphaned.
Keep active dirs: skip dirs with recently modified files or open file handles.
Make cleanup scoped: match only the known tirith prefix under the resolved temp directory, not arbitrary /tmp contents.
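
One possible shape for the non-retrying path, assuming the supervisor is s6 and the service signals a permanent config failure with a dedicated exit code. The code 78 (EX_CONFIG in sysexits.h) is an assumed convention here, and finish_code_for is a hypothetical helper; the real service may report these failures differently.

```shell
# Map the service's exit code to the code a finish script should return.
# 78 is an ASSUMED "permanent config failure" code, not a documented one.
finish_code_for() {
  case "$1" in
    78) echo 125 ;;  # s6-supervise treats a finish exit of 125 as permanent failure
    *)  echo 0 ;;    # anything else: let the supervisor restart as usual
  esac
}

# In an s6 service directory this would be the body of ./finish,
# which receives the service's exit code as its first argument:
#   exit "$(finish_code_for "$1")"
```

Keeping the mapping in one small function makes the boundary explicit: only the codes listed as permanent stop the restart cycle, and every other failure keeps the existing retry behavior.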

Conservative reclaim

Move old tirith dirs after the restart loop is stopped.

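Use this only after the crashing service is stopped or gated. It moves dirs whose own modification time is older than one hour instead of deleting them. Because the quarantine folder sits under the same temp directory, the move frees no space by itself; space is reclaimed only when the quarantine is deleted after review.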

tmp="${TMPDIR:-/tmp}"
quarantine="$tmp/safedisk-tirith-quarantine-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$quarantine"

find "$tmp" -maxdepth 1 -type d \( -name 'tirith-*' -o -name 'tirith-install-*' \) -mmin +60 -print0 2>/dev/null \
  | while IFS= read -r -d '' dir; do
      mv "$dir" "$quarantine/"
    done

du -sh "$quarantine" 2>/dev/null || true
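
The move above filters only on each directory's own modification time. A stricter check in the spirit of the "keep active dirs" rule could also look at files inside the directory and at open file handles. The sketch below assumes GNU find and uses lsof only if it is installed; tirith_dir_is_active is a hypothetical helper name.

```shell
# Return 0 (active) if anything in the dir was modified in the last N minutes
# or if a process has a file open under it; return 1 otherwise.
tirith_dir_is_active() {
  dir="$1"
  minutes="${2:-60}"
  # Any entry modified more recently than the cutoff counts as activity.
  if [ -n "$(find "$dir" -mmin "-$minutes" -print -quit 2>/dev/null)" ]; then
    return 0
  fi
  # lsof +D scans the tree and exits 0 only when it finds an open file.
  if command -v lsof >/dev/null 2>&1 && lsof +D "$dir" >/dev/null 2>&1; then
    return 0
  fi
  return 1
}
```

Inside the quarantine loop, a guard such as tirith_dir_is_active "$dir" 60 && continue placed before the mv would skip any directory that still looks in use. When lsof is unavailable, the check falls back to modification times alone.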

Regression Test Shape

Start a profile with a known token collision and assert it reaches a non-retrying terminal state.
Count tirith dirs before and after several supervisor ticks; the count should not grow after fatal config failure.
Keep retry behavior for transient network/process failures so this fix does not mask recoverable crashes.
Emit a small metric for orphaned sandbox bytes so disk pressure is visible before ENOSPC.
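
The orphaned-bytes metric in the last item could be as small as one line of output per tick. A minimal sketch follows, assuming that a directory whose modification time is older than a cutoff counts as orphaned; the function name and the metric name in the usage note are made up for illustration.

```shell
# Print the total size, in bytes, of tirith dirs under a temp root whose
# own mtime is older than min_age minutes. Prints 0 when nothing matches.
orphaned_tirith_bytes() {
  root="${1:-/tmp}"
  min_age="${2:-60}"
  find "$root" -mindepth 1 -maxdepth 1 -type d -name 'tirith-*' -mmin "+$min_age" \
    -exec du -sk {} + 2>/dev/null \
    | awk '{ total += $1 } END { printf "%.0f\n", total * 1024 }'
}
```

A supervisor tick or cron job could then log something like echo "tirith_orphaned_sandbox_bytes $(orphaned_tirith_bytes "$tmp")" and alert when the value crosses a threshold well below the size of the filesystem. The figure is derived from du in KiB, so it is rounded to 1024-byte units.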
