CI runner storage incident
GitHub Actions Go Race Test Output Disk Full
Large Go race-detector runs can write enough JSON or log output to fill the runner's disk. The result file is then truncated, and the analysis step fails on malformed output instead of surfacing the real test result. Treat it as two incidents: runner headroom and parser resilience.
Capture Evidence Before Re-running
Before adding another retry, capture runner headroom before and after the heavy package group. The goal is to prove whether the race output, Docker cache, workspace, or tool cache is the true peak consumer.
Print disk, inode, workspace, cache, and output-file sizes.
This is read-only. Put it immediately before the Go test step and again in an if: always() step after test output is generated.
- name: Runner storage snapshot
  if: always()
  shell: bash
  run: |
    # Read-only diagnostics: never let a failed probe fail the job.
    set +e
    echo "== df =="
    df -h . "$RUNNER_TEMP" "$RUNNER_TOOL_CACHE" 2>/dev/null || true
    df -i . "$RUNNER_TEMP" "$RUNNER_TOOL_CACHE" 2>/dev/null || true
    echo "== largest local buckets =="
    # /var/lib/docker may be unreadable without sudo; those errors are suppressed.
    du -sh . "$RUNNER_TEMP" "$RUNNER_TOOL_CACHE" ~/.cache ~/go/pkg/mod /var/lib/docker 2>/dev/null | sort -hr | head -40
    echo "== test output files =="
    # -r stops xargs from running a bare `ls -lh` when find matches nothing.
    find . "$RUNNER_TEMP" -type f \( -name '*.json' -o -name '*test*.out' -o -name '*.log' \) -size +50M -print0 2>/dev/null \
      | xargs -0 -r ls -lh 2>/dev/null | sort -k5 -hr | head -40
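The snapshot step prints human-readable sizes, which are awkward to compare between runs. As a minimal sketch, the helpers below record free space in KiB to a small tab-separated file and compute how much disappeared between two labels. The function names `record_headroom` and `headroom_drop`, the file layout, and the labels are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: append one row per existing path to a small log.
# Columns (tab-separated): label, path, available KiB.
record_headroom() {
  local label="$1" out="$2"
  shift 2
  local path avail
  for path in "$@"; do
    [ -e "$path" ] || continue
    # -P forces one line per filesystem, -k reports 1024-byte blocks.
    avail=$(df -Pk "$path" | awk 'NR==2 {print $4}')
    printf '%s\t%s\t%s\n' "$label" "$path" "$avail" >> "$out"
  done
}

# Hypothetical helper: print the KiB of free space lost at PATH between
# the rows labelled FROM and TO. Exits 1 if either row is missing.
headroom_drop() {
  local out="$1" path="$2" from="$3" to="$4"
  awk -F'\t' -v p="$path" -v a="$from" -v b="$to" '
    $2 == p && $1 == a { before = $3 }
    $2 == p && $1 == b { after = $3 }
    END {
      if (before == "" || after == "") exit 1
      print before - after
    }
  ' "$out"
}
```

One possible use: call `record_headroom before "$RUNNER_TEMP/headroom.tsv" . "$RUNNER_TEMP"` just before the Go test step and the same command with `after` in the if: always() step, then read the difference with `headroom_drop "$RUNNER_TEMP/headroom.tsv" . before after`. Each workflow step runs in its own shell, so the functions would need to be defined (or sourced from a script) in every step that uses them.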
Safe Fix Order
- Reserve space before the test: fail early if free space is below the measured peak plus a reserve, instead of producing a truncated JSON file.
- Stream or shard output: pipe JSON through a consumer, gzip chunks, or split by package so one file cannot consume the remaining disk.
- Clean runner-local caches before the job: Docker build cache, unused images, and language caches are candidates only if the workflow does not need them later.
- Make the parser tolerant: if the output is truncated, report that the run was storage-corrupt and keep any completed test records instead of panicking.
- Track peak bytes: store a tiny artifact with before/after df and output-file sizes so regressions are visible.
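The first two items in this list can be sketched roughly as follows. The function names `require_free_kb` and `shard_stream`, the 64M default chunk size, and the output prefix are assumptions made for illustration, and `shard_stream` assumes GNU split (for -C and --filter), which Ubuntu-based runners provide. The required size is something you would measure for your own job, not a value this sketch can supply.

```shell
# Hypothetical helper: succeed only if PATH has at least NEED_KB KiB free.
require_free_kb() {
  local path="$1" need_kb="$2" avail_kb
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  if [ -z "$avail_kb" ] || [ "$avail_kb" -lt "$need_kb" ]; then
    echo "storage preflight: ${avail_kb:-unknown} KiB free at $path, need $need_kb KiB" >&2
    return 1
  fi
  echo "storage preflight: $avail_kb KiB free at $path (need $need_kb KiB)"
}

# Hypothetical helper: read a line-oriented stream on stdin and write
# gzip-compressed chunks named PREFIXaa.gz, PREFIXab.gz, and so on.
# -C keeps whole lines together, so no JSON record is split across chunks.
shard_stream() {
  local prefix="$1" max_bytes="${2:-64M}"
  # split exports $FILE to the filter command; single quotes keep it literal here.
  split -C "$max_bytes" --filter='gzip > "$FILE.gz"' - "$prefix"
}
```

A test step might then run `require_free_kb . "$need_kb"` first and, if it passes, `go test -race -json ./... | shard_stream "$RUNNER_TEMP/race-"`, where `need_kb` is your measured peak plus a reserve. With a pipe like this, the step should set `set -o pipefail` so the exit status of go test is not masked by split.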
Do Not Delete Blindly
- Do not remove the current workspace or test output before the parser has consumed it.
- Do not prune Docker volumes if integration tests rely on named volumes or local databases.
- Do not hide the failure with retries; a retry on another runner makes the cause harder to prove.
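Before anything is deleted or retried, it can help to classify the output file so that a truncated run is reported as storage-corrupt and not as a parser crash. The sketch below is a heuristic, not a JSON parser: it assumes the file holds one JSON object per line, as go test -json produces, and treats a final line without a newline as a partial write. The function name `salvage_json_lines` and the exit code 2 are hypothetical.

```shell
# Hypothetical helper: copy the complete-looking records from IN to OUT and
# report whether the file appears truncated. Returns 0 if intact, 2 if not.
salvage_json_lines() {
  local in="$1" out="$2" total kept
  if [ -n "$(tail -c 1 "$in")" ]; then
    # Last byte is not a newline: the final line is a partial write, drop it.
    sed '$d' "$in"
  else
    cat "$in"
  fi | grep -E '^\{.*\}$' > "$out" || true
  # awk counts a final unterminated line too, unlike wc -l.
  total=$(awk 'END {print NR}' "$in")
  kept=$(awk 'END {print NR}' "$out")
  if [ "$total" -ne "$kept" ]; then
    echo "storage-corrupt: kept $kept of $total lines"
    return 2
  fi
  echo "intact: $kept records"
}
```

The analysis step could run this first, feed the salvaged file to the real parser, and use the non-zero status to mark the run as storage-corrupt. Any line that is not a complete-looking object is counted as dropped, so plain-text lines mixed into the stream would also trigger the storage-corrupt verdict; a stricter check would validate each line with a real JSON parser.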