SafeDisk AI

Auth Fail Loop Session Disk Storm

When a background worker keeps retrying after an auth token is already known-dead, every failed call can write another session file, request dump, state row, or log batch. The fix is not only cleanup: it is a circuit breaker, retention cap, and disk budget around the failure loop.

Do not send tokens, secrets, dumps, or private logs. Error class, path names, sizes, and growth rate are enough to scope the policy.

$99 circuit-breaker policy

Turn a dead-token retry storm into a bounded worker failure.

Use this when auth errors, provider outages, or invalid API tokens cause background jobs to keep writing sessions, request dumps, state rows, or logs without forward progress.

classify auth -> open breaker -> cap dumps -> preserve first error

Read-only evidence

Measure growth rate, backlog multiplier, and dump retention before cleanup.

These checks capture the disk storm without exposing request contents or tokens. Replace paths and service names before pasting into a public issue.

df -h; find . -name 'session_*' | wc -l; du -sh .; journalctl -u worker
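The same evidence can be gathered from a script when growth rate matters more than a single reading. The sketch below is illustrative only: `snapshot` and `growth_rate` are hypothetical helper names, and the `session_*` pattern is taken from the check above. It reads file metadata only and never opens file contents.

```python
from pathlib import Path


def snapshot(directory, pattern="session_*"):
    """Return (file_count, total_bytes) for matching files.

    Only metadata is read (stat); file contents are never opened.
    """
    count = 0
    total = 0
    for path in Path(directory).glob(pattern):
        if path.is_file():
            count += 1
            total += path.stat().st_size
    return count, total


def growth_rate(before, after, seconds):
    """Return (files_per_minute, bytes_per_minute) between two snapshots."""
    minutes = seconds / 60.0
    return ((after[0] - before[0]) / minutes,
            (after[1] - before[1]) / minutes)
```

Take two snapshots a known number of seconds apart and report the two rates alongside the `df -h` reading. That is usually enough to show the multiplier without sharing any file.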

Runbook: Break The Loop Before Deleting The Evidence

  1. Preserve the first auth failure and a small sample of session/dump metadata. The first error explains why the storm started; the repeated files only prove the multiplier.
  2. Classify invalid token, revoked credential, and 401/403 auth errors as non-retryable for the shared client.
  3. Trip the breaker before any expensive worker call, session file, request dump, or state row is created.
  4. Probe slowly while half-open. A dead provider or invalid token should not replay every queued event every schedule tick.
  5. Cap generated files by bytes, count, and age. Session retention should be enforced before writes, not only by a later cron job.
  6. Publish one alert with growth rate, affected path, breaker state, and cleanup-safe evidence; aggregate repeated auth errors after that.
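Steps 1 to 4 of the runbook above can be sketched as a small breaker. This is a minimal illustration under stated assumptions, not a drop-in implementation: the class name, the `probe_interval` default, and the choice to treat exactly 401 and 403 as non-retryable are all hypothetical and should be mapped onto the real shared client.

```python
import time

# Assumption: auth failures surface as HTTP 401/403 at the shared client.
NON_RETRYABLE_STATUS = {401, 403}


class AuthCircuitBreaker:
    """Hypothetical breaker: closed -> open -> half_open -> closed."""

    def __init__(self, probe_interval=300.0, clock=time.monotonic):
        self.probe_interval = probe_interval
        self.clock = clock
        self.state = "closed"
        self.first_error = None  # preserved once, never overwritten
        self.suppressed = 0      # repeats are counted, not written to disk
        self.opened_at = None

    def allow(self):
        """Call before any worker call, session file, dump, or state row."""
        if self.state == "closed":
            return True
        if (self.state == "open"
                and self.clock() - self.opened_at >= self.probe_interval):
            self.state = "half_open"  # let exactly one probe through
            return True
        self.suppressed += 1
        return False

    def record(self, status):
        """Report the outcome of a call that allow() admitted."""
        if status in NON_RETRYABLE_STATUS:
            if self.first_error is None:
                self.first_error = {"status": status, "at": self.clock()}
            self.state = "open"
            self.opened_at = self.clock()
        elif self.state == "half_open":
            self.state = "closed"  # probe succeeded
```

The worker calls `allow()` before touching disk and `record()` after each admitted call. While the breaker is open, a backlog of any size only increments `suppressed`, so the alert can report one first error plus a count. One limitation of this sketch: a probe that never reports back leaves the breaker half-open, so a real version would want a probe timeout.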

Copy-ready issue reply

Use this when auth retries create session or dump files.

This keeps the fix concrete: failure taxonomy, circuit breaker state, half-open probe, retention budgets, and disk evidence.

I would make the disk-safety behavior part of the auth circuit breaker acceptance criteria, not a separate cleanup task.

Checks I would want before closing this:

- 401/token_invalidated/auth-revoked errors are classified as non-retryable at the shared client chokepoint.
- The breaker opens before a worker call can create a new session file, request dump, or state row.
- While open, scheduled backlog processing short-circuits without writing per-event failure artifacts.
- Half-open allows one slow probe, not one probe per queued event.
- Session/request-dump/state retention has max age, max count, and max bytes; it runs before the next write path.
- The alert contains the first auth error, affected paths, file growth rate, and current disk reserve.
- Tests cover backlog × dead-auth, so per-activity retry limits cannot still create an aggregate disk storm.
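The retention check in the list above ("max age, max count, and max bytes; it runs before the next write path") can be sketched as a gate that the writer calls first. Everything named here is an assumption for illustration: `admit_write`, its parameters, and the `session_*` pattern. The sketch expires files past the age limit and then refuses the new write if it would exceed the count or byte cap. It does not delete in-window files to make room, and it assumes the preserved first-error record lives outside this directory.

```python
import time
from pathlib import Path


def admit_write(directory, incoming_bytes, max_count, max_bytes, max_age_s,
                pattern="session_*", now=None):
    """Prune expired files, then report whether one more write fits.

    Returns True if writing incoming_bytes keeps the directory within
    max_count files and max_bytes total; otherwise False.
    """
    now = time.time() if now is None else now
    count = 0
    total = 0
    # Materialize the listing first so deletions cannot disturb iteration.
    for path in list(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        info = path.stat()
        if now - info.st_mtime > max_age_s:
            path.unlink()  # outside the retention window
            continue
        count += 1
        total += info.st_size
    return count + 1 <= max_count and total + incoming_bytes <= max_bytes
```

A writer would call `admit_write(...)` and skip the artifact when it returns False, incrementing a counter instead. Refusing new writes rather than evicting existing files is a design choice made here so that early evidence is not rotated out by later repeats. A backlog × dead-auth test can then assert that the file count never exceeds `max_count`, however many events are queued.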

Do Not Delete First