Auth Failure Loops and Session-File Disk Storms

When a background worker keeps retrying after an auth token is already known-dead, every failed call can write another session file, request dump, state row, or log batch. The fix is not only cleanup: it is a circuit breaker, retention cap, and disk budget around the failure loop.

Runbook: Break The Loop Before Deleting The Evidence

Preserve the first auth failure and a small sample of session/dump metadata. The first error explains why the storm started; the repeated files only prove the multiplier.
Classify invalid-token, revoked-credential, and 401/403 auth errors as non-retryable in the shared client.
Trip the breaker before any expensive worker call, session file, request dump, or state row is created.
Probe slowly while half-open. A dead provider or invalid token should not cause every queued event to be replayed on every schedule tick.
Cap generated files by bytes, count, and age. Session retention should be enforced before writes, not only by a later cron job.
Publish one alert with growth rate, affected path, breaker state, and cleanup-safe evidence; aggregate repeated auth errors after that.
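
The classify, trip, and probe steps above can be sketched as a small breaker. This is a minimal single-threaded illustration under stated assumptions, not a definitive implementation: `AuthDeadError`, `BreakerOpenError`, `AuthCircuitBreaker`, and the 300-second probe interval are hypothetical names and values, and the shared client is assumed to raise `AuthDeadError` for 401/403, invalid-token, and revoked-credential responses.

```python
import time
from enum import Enum
from typing import Callable, Optional


class AuthDeadError(Exception):
    """Hypothetical error the shared client raises for non-retryable auth failures."""


class BreakerOpenError(Exception):
    """Raised when a call is refused before any worker call or disk write happens."""


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class AuthCircuitBreaker:
    def __init__(self, probe_interval_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.probe_interval_s = probe_interval_s  # assumed value; tune per provider
        self.clock = clock
        self.state = State.CLOSED
        self.opened_at: Optional[float] = None
        self.first_error: Optional[Exception] = None  # preserved as evidence

    def call(self, fn: Callable[[], object]) -> object:
        now = self.clock()
        if self.state is State.OPEN:
            if now - self.opened_at < self.probe_interval_s:
                # Short-circuit: fn never runs, so it cannot write artifacts.
                raise BreakerOpenError("auth breaker open; call skipped")
            self.state = State.HALF_OPEN  # this one caller becomes the probe
        try:
            result = fn()
        except AuthDeadError as exc:
            if self.first_error is None:
                self.first_error = exc  # keep the first failure, not the repeats
            self.state = State.OPEN
            self.opened_at = now
            raise
        except Exception:
            if self.state is State.HALF_OPEN:
                # An inconclusive probe re-opens the breaker and waits again.
                self.state = State.OPEN
                self.opened_at = now
            raise
        self.state = State.CLOSED
        return result
```

Because the breaker check runs before `fn`, a worker that creates its session file or request dump inside `fn` writes nothing while the breaker is open. A multi-threaded worker pool would also need a lock around the state transition so that only one caller becomes the probe.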

Copy-ready issue reply

Use this when auth retries create session or dump files.

This keeps the fix concrete: failure taxonomy, circuit breaker state, half-open probe, retention budgets, and disk evidence.

I would make the disk-safety behavior part of the auth circuit breaker acceptance criteria, not a separate cleanup task.

Checks I would want before closing this:

- 401/token_invalidated/auth-revoked errors are classified as non-retryable at the shared client chokepoint.
- The breaker opens before a worker call can create a new session file, request dump, or state row.
- While open, scheduled backlog processing short-circuits without writing per-event failure artifacts.
- Half-open allows one slow probe, not one probe per queued event.
- Session/request-dump/state retention has max age, max count, and max bytes; it is enforced before the next write, not only by a later cleanup job.
- The alert contains the first auth error, affected paths, file growth rate, and current disk reserve.
- Tests cover backlog × dead-auth, proving that per-activity retry limits cannot add up to an aggregate disk storm.
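
The retention check in the list above could look roughly like the following write gate. It is a sketch under assumptions: `admit_write` and `write_session_file` are hypothetical helpers, the artifacts are assumed to live as plain files in one directory, and the gate refuses a new write when the count or byte budget is full rather than evicting older files, so that early evidence is not overwritten by the storm. Evicting oldest-first is the other common choice.

```python
import time
from pathlib import Path
from typing import Optional


def admit_write(directory: Path, incoming_bytes: int, *, max_bytes: int,
                max_count: int, max_age_s: float,
                now: Optional[float] = None) -> bool:
    """Prune expired files, then say whether one more write fits the budget."""
    now = time.time() if now is None else now
    total = 0
    count = 0
    for path in list(directory.iterdir()):
        if not path.is_file():
            continue
        st = path.stat()
        if now - st.st_mtime > max_age_s:
            path.unlink()  # age cap: expired artifacts are removed
            continue
        total += st.st_size
        count += 1
    return count + 1 <= max_count and total + incoming_bytes <= max_bytes


def write_session_file(directory: Path, name: str, payload: bytes,
                       **budget) -> bool:
    """Write only if the budget allows it; return False when the write is skipped."""
    if not admit_write(directory, len(payload), **budget):
        return False
    (directory / name).write_bytes(payload)
    return True
```

The age cap still deletes expired files, so the first-failure evidence should be copied somewhere outside the budgeted directory before it ages out.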

Do Not Delete First

The first auth failure, which shows whether the cause is a revoked token, bad credential, quota limit, or provider outage.
Session and request-dump counts, until growth rate and backlog multiplier have been measured.
State database size and newest rows, until you have established whether every retry writes persistent state.
Scheduler and retry settings, until you have decided whether per-activity retry caps actually bound aggregate writes.
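
One way to capture the counts and growth rate before any cleanup is a metadata-only snapshot, taken twice. This is a sketch under assumptions: `snapshot` and `growth_rate` are hypothetical helpers, and the session or dump files are assumed to sit directly in one directory. It reads file sizes and modification times only, never file contents.

```python
import time
from pathlib import Path
from typing import Optional


def snapshot(directory: Path, sample: int = 5,
             now: Optional[float] = None) -> dict:
    """Record count, total size, and a small sample of file names."""
    rows = []
    for path in directory.iterdir():
        if path.is_file():
            st = path.stat()
            rows.append((st.st_mtime, st.st_size, path.name))
    rows.sort()  # oldest first
    return {
        "taken_at": time.time() if now is None else now,
        "path": str(directory),
        "file_count": len(rows),
        "total_bytes": sum(size for _, size, _ in rows),
        "oldest": [name for _, _, name in rows[:sample]],
        "newest": [name for _, _, name in rows[-sample:]],
    }


def growth_rate(before: dict, after: dict) -> float:
    """Bytes per second written between two snapshots."""
    elapsed = after["taken_at"] - before["taken_at"]
    if elapsed <= 0:
        raise ValueError("snapshots must be taken at increasing times")
    return (after["total_bytes"] - before["total_bytes"]) / elapsed
```

Two snapshots a minute or so apart give the growth rate for the alert, and the `oldest` sample points at the files closest to the first failure.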
