Worker Diagnostic Reports Filling Root Disk
Solver, build, and data workers often write optional readiness reports, coverage JSON, temp decode files, and Docker caches onto the same root filesystem. When those reports grow without bound, the real failure is ENOSPC (no space left on device), but the user sees a misleading artifact or decode error.
$99 worker retention policy
Keep optional diagnostics from taking down the worker root disk.
Use this when readiness reports, coverage JSON, temp HDF5 decode files, Docker/build caches, or local job artifacts share `/`, so that a full disk can mask the original storage error.
Retain reports by age, count, and bytes, and enforce those limits before large writes.
Read-only evidence
Measure report growth, worker root headroom, and temp decode paths first.
These checks capture storage pressure without deleting anything. Their output is safe to paste into an issue because it contains only paths, counts, and sizes.
df -h /; du -sh reports/*; find reports -name '*.json' ...
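The same read-only evidence can be gathered from inside a worker. The following is a minimal sketch, assuming a POSIX host and a Python worker; `fs_headroom` and `report_usage` are hypothetical helper names, not part of any existing tool. It reads block and inode headroom with `os.statvfs` and sums report sizes, and it never deletes or modifies a file.

```python
import os
from pathlib import Path


def fs_headroom(path):
    """Return block and inode headroom for the filesystem that holds `path`."""
    st = os.statvfs(path)  # POSIX only
    return {
        "free_bytes": st.f_bavail * st.f_frsize,   # space available to unprivileged users
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_inodes": st.f_favail,                # some filesystems report 0 here
        "total_inodes": st.f_files,
    }


def report_usage(root, pattern="*.json"):
    """Count files matching `pattern` under `root` and sum their sizes."""
    count = 0
    total = 0
    for p in Path(root).rglob(pattern):
        if p.is_file():
            count += 1
            total += p.stat().st_size
    return {"count": count, "bytes": total}
```

Calling `fs_headroom` on both the report directory and the temp decode directory shows whether they sit on the same filesystem pressure, which the root-level `df` alone does not prove.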
Runbook: Optional Reports Must Not Break Required Work
- Separate required artifacts from optional diagnostics. A readiness report, coverage report, or debug JSON should never consume the emergency reserve needed to decode the actual job artifact.
- Prune before write, not only on a timer. The large-write path should enforce max age, max count, and max bytes before creating the next report.
- Keep a per-job or per-snapshot floor: newest report, latest failure sample, and enough context for debugging, then delete older siblings.
- Check the filesystem that actually stores the report and temp decode files. Root may be full even if the object store, database, or artifact checksum is healthy.
- Preserve the original storage exception in worker diagnostics. Do not collapse ENOSPC into a misleading missing-table, missing-artifact, or decode fallback.
- Add an operational repair path for existing hosts: dry-run prune, staged delete, before/after `df`, and a rollback-free deployment note.
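The prune-before-write and floor items in this runbook can be sketched as a planning function plus a dry-run wrapper. This is an illustration under stated assumptions, not a definitive implementation: reports are `*.json` files in one directory, file mtime stands in for report age, and `plan_prune`, `prune`, `keep_newest`, and `protect` are hypothetical names. The `protect` set is where a caller would pass the file name of the latest failure sample.

```python
import time
from pathlib import Path


def plan_prune(report_dir, max_age_s, max_count, max_bytes,
               keep_newest=1, protect=(), now=None):
    """Return report paths to delete, oldest first. Deletes nothing itself."""
    now = time.time() if now is None else now
    files = []
    for p in Path(report_dir).glob("*.json"):
        if p.is_file():
            st = p.stat()
            files.append((st.st_mtime, st.st_size, p))
    files.sort(key=lambda t: t[0], reverse=True)  # newest first

    kept_count = 0
    kept_bytes = 0
    doomed = []
    for i, (mtime, size, path) in enumerate(files):
        # Floor: the newest N files and explicitly protected names always stay.
        protected = i < keep_newest or path.name in protect
        too_old = now - mtime > max_age_s
        over_count = kept_count + 1 > max_count
        over_bytes = kept_bytes + size > max_bytes
        if not protected and (too_old or over_count or over_bytes):
            doomed.append(path)
        else:
            kept_count += 1
            kept_bytes += size
    doomed.reverse()  # oldest first, so a staged delete starts with the oldest
    return doomed


def prune(report_dir, dry_run=True, **limits):
    """Print the plan when dry_run is true; otherwise delete the planned files."""
    doomed = plan_prune(report_dir, **limits)
    for path in doomed:
        if dry_run:
            print(f"would delete {path}")
        else:
            path.unlink(missing_ok=True)
    return doomed
```

A worker would call `prune(..., dry_run=False, ...)` immediately before writing the next report, while an operator repairing an existing host would run the same function with `dry_run=True` first and compare `df` output before and after the real delete.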
Copy-ready issue reply
Use this when worker diagnostics fill the root disk.
This turns the incident into concrete acceptance checks: retention, preflight, error preservation, and a safe operator cleanup path.
I would make this a pre-write retention gate, not only a background cleanup task.
Acceptance checks I would add:
- Enforce max age, max count, and max bytes for reports/snapshot-coverage before writing the next matrix-readiness report.
- Keep the newest N reports per snapshot/job plus the latest failure sample; prune older siblings first.
- Check both blocks and inodes on the filesystem that stores reports and temporary HDF5 decode files.
- If the emergency reserve would be breached, skip optional report writing and preserve the original worker error.
- Surface ENOSPC/root-disk context in worker_jobs diagnostics instead of falling through to the legacy-table/artifact-missing message.
- Add a dry-run operator cleanup command that prints before/after df and du for reports, temp decode paths, and Docker/build cache.
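The reserve check and the error-preservation checks above could look roughly like the sketch below. It assumes a POSIX host, a Python worker, and a plain dict standing in for the worker_jobs diagnostics record; `has_reserve`, `write_optional_report`, and the dict keys are hypothetical names chosen for illustration.

```python
import errno
import os


def has_reserve(path, reserve_bytes, reserve_inodes, pending_bytes=0):
    """True if writing `pending_bytes` leaves the block and inode reserve intact."""
    st = os.statvfs(path)  # POSIX only
    free_bytes = st.f_bavail * st.f_frsize
    # Some filesystems report 0 inodes; pass reserve_inodes=0 on those.
    return (free_bytes - pending_bytes >= reserve_bytes
            and st.f_favail >= reserve_inodes)


def write_optional_report(path, payload, reserve_bytes, reserve_inodes, diagnostics):
    """Write an optional report, or skip it and record why.

    Existing keys in `diagnostics` (such as the original worker error)
    are left untouched. Returns True only if the report was written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    if not has_reserve(directory, reserve_bytes, reserve_inodes, len(payload)):
        diagnostics["report_skipped"] = "emergency reserve would be breached"
        return False
    try:
        with open(path, "wb") as fh:
            fh.write(payload)
    except OSError as exc:
        if exc.errno != errno.ENOSPC:
            raise
        # Record the storage error itself instead of letting a later
        # artifact-missing or decode fallback replace it.
        diagnostics["storage_error"] = {
            "errno": "ENOSPC",
            "path": path,
            "message": str(exc),
        }
        return False
    return True
```

Because the function only adds keys, a job that already failed with its own exception keeps that exception in the diagnostics, and the skipped report shows up as extra context rather than as the headline error.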
Do Not Delete First
- The newest successful and failing report sample for the incident.
- The temp decode path, until you have recorded whether ENOSPC happened there or in the report path.
- Docker/build caches, until you have measured whether they are the largest reclaimable bucket.
- Diagnostic logs that contain the original ENOSPC, checksum, or artifact decode exception.