Worker Diagnostic Reports Filling Root Disk

Solver, build, and data workers often write optional readiness reports, coverage JSON, temp decode files, and Docker caches to the same root filesystem. When nothing bounds the growth of those reports, the real failure is ENOSPC (no space left on device), but the user sees a misleading artifact or decode error.

Runbook: Optional Reports Must Not Break Required Work

- Separate required artifacts from optional diagnostics. A readiness report, coverage report, or debug JSON should never consume the emergency reserve needed to decode the actual job artifact.
- Prune before write, not only on a timer. The large-write path should enforce max age, max count, and max bytes before creating the next report.
- Keep a per-job or per-snapshot floor: newest report, latest failure sample, and enough context for debugging, then delete older siblings.
- Check the filesystem that actually stores the report and temp decode files. Root may be full even if the object store, database, or artifact checksum is healthy.
- Preserve the original storage exception in worker diagnostics. Do not collapse ENOSPC into a misleading missing-table, missing-artifact, or decode fallback.
- Add an operational repair path for existing hosts: dry-run prune, staged delete, before/after `df`, and a rollback-free deployment note.
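
The prune-before-write step can be sketched as a small gate that runs just before the next report is created. This is a minimal sketch, not the workers' actual code: the function name `prune_reports`, its parameters, and the flat one-directory-per-job layout are all assumptions made for illustration. It keeps a floor of the newest files plus any explicitly protected names (for example the latest failure sample), and defaults to a dry run.

```python
import time
from pathlib import Path


def prune_reports(report_dir, max_age_s, max_count, max_bytes,
                  keep_newest=1, protect_names=frozenset(),
                  dry_run=True, now=None):
    """Return report files that exceed the retention limits.

    Hypothetical helper: all names and the directory layout are assumptions.
    The newest `keep_newest` files and anything in `protect_names` are
    always kept. With dry_run=False the returned paths are also deleted.
    """
    now = time.time() if now is None else now
    entries = []
    for path in Path(report_dir).iterdir():
        if path.is_file():
            st = path.stat()
            entries.append((st.st_mtime, st.st_size, path))
    entries.sort(key=lambda e: e[0], reverse=True)  # newest first

    doomed, kept_count, kept_bytes = [], 0, 0
    for mtime, size, path in entries:
        in_floor = kept_count < keep_newest or path.name in protect_names
        over_limit = (
            now - mtime > max_age_s              # max age
            or kept_count + 1 > max_count        # max count
            or kept_bytes + size > max_bytes     # max bytes
        )
        if over_limit and not in_floor:
            doomed.append(path)
        else:
            kept_count += 1
            kept_bytes += size

    if not dry_run:
        for path in doomed:
            path.unlink(missing_ok=True)
    return doomed
```

Calling it first with `dry_run=True` and logging the returned paths gives the staged delete described above; the same call with `dry_run=False` then performs it. Because files are walked newest first, older siblings are always the first to go.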

Copy-ready issue reply

Use this when worker diagnostics fill root.

This turns the incident into concrete acceptance checks: retention, preflight, error preservation, and a safe operator cleanup path.

I would make this a pre-write retention gate, not only a background cleanup task.

Acceptance checks I would add:

- Enforce max age, max count, and max bytes for reports/snapshot-coverage before writing the next matrix-readiness report.
- Keep the newest N reports per snapshot/job plus the latest failure sample; prune older siblings first.
- Check both blocks and inodes on the filesystem that stores reports and temporary HDF5 decode files.
- If the emergency reserve would be breached, skip optional report writing and preserve the original worker error.
- Surface ENOSPC/root-disk context in worker_jobs diagnostics instead of falling through to the legacy-table/artifact-missing message.
- Add a dry-run operator cleanup command that prints before/after df and du for reports, temp decode paths, and Docker/build cache.
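
The block and inode preflight, the skip-on-reserve rule, and the error-preservation check can be illustrated together. The sketch below is one possible shape, assuming a Unix host (it relies on `os.statvfs`); the helper names `reserve_ok`, `storage_context`, and `write_optional_report`, and the diagnostics dict they produce, are hypothetical and not taken from any existing worker code.

```python
import errno
import os


def reserve_ok(path, min_free_bytes, min_free_inodes):
    """True if the filesystem holding `path` still has both reserves free."""
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_frsize
    # Some filesystems report zero total inodes because they allocate them
    # dynamically; treat that case as "no inode limit".
    inodes_ok = st.f_files == 0 or st.f_favail >= min_free_inodes
    return free_bytes >= min_free_bytes and inodes_ok


def storage_context(exc, path):
    """Describe an OSError for diagnostics without hiding its errno."""
    return {
        "errno": exc.errno,
        "errno_name": errno.errorcode.get(exc.errno, "UNKNOWN"),
        "is_enospc": exc.errno == errno.ENOSPC,
        "path": str(path),
        "message": str(exc),
    }


def write_optional_report(path, payload, min_free_bytes, min_free_inodes):
    """Write an optional report only if the emergency reserve stays intact.

    Returns (status, diagnostics) where status is "written", "skipped",
    or "failed". A failure never raises, so it cannot mask the worker's
    own error; the original errno is kept in the diagnostics instead.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The report itself needs len(payload) bytes and one inode.
    if not reserve_ok(directory, min_free_bytes + len(payload),
                      min_free_inodes + 1):
        return "skipped", None
    try:
        with open(path, "wb") as fh:
            fh.write(payload)
    except OSError as exc:
        return "failed", storage_context(exc, path)
    return "written", None
```

The preflight is only advisory: another process can fill the disk between the check and the write, which is why the `except OSError` branch still records the errno rather than assuming the check was enough.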

Do Not Delete First

- The newest successful and the newest failing report samples for the incident.
- The temp decode path, until you have recorded whether ENOSPC happened there or in the report path.
- Docker/build caches, until you have measured whether they are the largest reclaimable bucket.
- Diagnostic logs that contain the original ENOSPC, checksum, or artifact decode exception.
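
Measuring before deleting can be as simple as ranking the candidate directories by size. The following is a rough sketch with hypothetical names (`dir_size_bytes`, `rank_buckets`) and caller-supplied paths. It sums apparent file sizes, so hard links are counted more than once and sparse files are overstated; for Docker's own storage, the output of `docker system df` is likely a better source than walking the directory.

```python
import os


def dir_size_bytes(root):
    """Sum apparent file sizes under `root` without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def rank_buckets(buckets):
    """Take {label: path} and return [(label, bytes)], largest first."""
    sizes = [(label, dir_size_bytes(path)) for label, path in buckets.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

An operator would pass the report directory, the temp decode path, and the build cache path as separate labels, record the ranking alongside `df` output, and only then decide which bucket to prune first.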
