File Lock Heartbeat Failures, Full Disks, and Stale Locks
When a file-lock heartbeat refresh fails silently because of ENOSPC, inode exhaustion, a permission error, or a missing heartbeat path, the lock holder may still be inside the critical section while another process decides the lock is stale and enters too.
First Response Runbook
A heartbeat failure should not be treated as a successful lock refresh. It should create an explicit lock-health state that downstream stale-lock logic can reason about.
- Log heartbeat refresh failures with the lock path, heartbeat path, operation, and filesystem error.
- Classify ENOSPC, EDQUOT, EIO, EACCES, EPERM, and missing heartbeat paths separately from normal stale timeout.
- When heartbeat refresh fails, decide whether the holder aborts protected work, releases the lock, or marks the lock as unhealthy.
- Do not let a contender steal the lock solely because the mtime is stale; a disk or permission failure on the holder's side could explain the stale heartbeat.
- Require dead-owner evidence, an explicit fencing token, or a recovery lock before allowing a steal.
- Add a two-contender regression test: the holder's heartbeat fails, the contender polls, and the two processes are never inside the critical section at the same time.
Use this checklist when a heartbeat error is currently swallowed.
It keeps the fix focused on preventing concurrent entry, not only printing a warning.
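The logging and classification items in the checklist could be sketched roughly as follows, in Python. This assumes the heartbeat is refreshed by touching a file's mtime. `LockHealth`, `classify_heartbeat_error`, and `refresh_heartbeat` are hypothetical names chosen for illustration, not part of any existing lock library, and the `touch` parameter exists only so a test can inject a failure.

```python
import errno
import logging
import os
from enum import Enum

log = logging.getLogger("lock.heartbeat")


class LockHealth(Enum):
    """Hypothetical lock-health states that stale-lock logic can inspect."""
    HEALTHY = "healthy"
    STORAGE_FAILURE = "storage_failure"        # ENOSPC, EDQUOT, EIO
    PERMISSION_FAILURE = "permission_failure"  # EACCES, EPERM
    HEARTBEAT_MISSING = "heartbeat_missing"    # ENOENT
    UNKNOWN_FAILURE = "unknown_failure"


# EDQUOT is added defensively because not every platform defines it.
_STORAGE_ERRNOS = {errno.ENOSPC, errno.EIO}
if hasattr(errno, "EDQUOT"):
    _STORAGE_ERRNOS.add(errno.EDQUOT)
_PERMISSION_ERRNOS = {errno.EACCES, errno.EPERM}


def classify_heartbeat_error(err: OSError) -> LockHealth:
    """Map a filesystem error to a health state, separate from a plain timeout."""
    if err.errno in _STORAGE_ERRNOS:
        return LockHealth.STORAGE_FAILURE
    if err.errno in _PERMISSION_ERRNOS:
        return LockHealth.PERMISSION_FAILURE
    if err.errno == errno.ENOENT:
        return LockHealth.HEARTBEAT_MISSING
    return LockHealth.UNKNOWN_FAILURE


def refresh_heartbeat(lock_path, heartbeat_path, touch=os.utime) -> LockHealth:
    """Touch the heartbeat file and report a health state instead of hiding errors."""
    try:
        touch(heartbeat_path)  # os.utime(path) sets atime and mtime to "now"
    except OSError as err:
        health = classify_heartbeat_error(err)
        log.warning(
            "heartbeat refresh failed: lock=%s heartbeat=%s op=utime errno=%s error=%s health=%s",
            lock_path,
            heartbeat_path,
            errno.errorcode.get(err.errno, err.errno),
            err.strerror,
            health.value,
        )
        return health
    return LockHealth.HEALTHY
```

The caller decides what a non-healthy result means for the protected work. The sketch only makes sure the failure is logged and classified rather than counted as a successful refresh.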
I would make the heartbeat failure visible, and I would also define what happens to the protected critical section once the heartbeat path becomes unhealthy.
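One possible way to define the holder's behavior is a guard that is checked before every protected write, sketched here in Python. It assumes the holder chooses to abort protected work once any refresh has failed; releasing the lock or marking it non-stealable are the alternatives. `HeldLock`, `LockUnhealthyError`, and `protected_write` are hypothetical names for illustration.

```python
import os


class LockUnhealthyError(RuntimeError):
    """Raised when protected work must stop because a heartbeat refresh failed."""


class HeldLock:
    """Hypothetical holder-side handle; not a class from a real library."""

    def __init__(self, lock_path, heartbeat_path, touch=os.utime):
        self.lock_path = lock_path
        self.heartbeat_path = heartbeat_path
        self._touch = touch      # injectable so tests can simulate ENOSPC etc.
        self.last_error = None   # sticky: set by the first failed refresh

    def refresh(self):
        """Refresh the heartbeat and record a failure instead of swallowing it."""
        try:
            self._touch(self.heartbeat_path)
        except OSError as err:
            self.last_error = err
            return False
        return True

    def ensure_healthy(self):
        """Call before each protected write; raises once a refresh has failed."""
        if self.last_error is not None:
            raise LockUnhealthyError(
                f"heartbeat for {self.lock_path} failed: {self.last_error}"
            ) from self.last_error


def protected_write(lock, out_path, data):
    """Append data only while the lock is still considered healthy."""
    lock.ensure_healthy()
    with open(out_path, "a") as f:
        f.write(data)
```

The error is deliberately sticky in this sketch: a later successful refresh does not clear it, because a contender may already have acted on the stale heartbeat in the meantime. Whether that is the right policy depends on how ownership is resolved in your tool.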
Acceptance checks I would add:
- Inject a utimes(heartbeatPath) failure with ENOSPC, EIO, or EACCES and assert that the logged warning includes the lock path and operation.
- Treat heartbeat-write failure as lock-health degradation, not a silent success.
- The holder should either abort protected work or mark the lock as non-stealable until ownership is resolved.
- The stale-lock detector should require both stale mtime and dead owner/process evidence before stealing.
- Add a two-contender regression test: holder heartbeat fails, second process polls, and concurrent entry never happens.
- Surface lock-dir filesystem and inode status so disk-full and permission failures are distinguishable.
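The stale-lock detector and the regression check from the list above might look like this minimal Python sketch. It assumes a POSIX host where the owner PID is recorded alongside the lock and both processes run on the same machine; `owner_is_dead`, `may_steal`, and the test name are hypothetical. The test simulates the contender inside a single process, which is a simplification of a true two-process test.

```python
import os
import time


def owner_is_dead(pid):
    """POSIX-only probe: signal 0 checks that a process exists without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False


def may_steal(heartbeat_path, owner_pid, stale_after_s, now=None, is_dead=owner_is_dead):
    """Allow a steal only when the heartbeat is stale AND the owner is gone."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(heartbeat_path).st_mtime
    except OSError:
        # A missing or unreadable heartbeat points at a filesystem problem,
        # not at a dead owner, so this sketch refuses to steal.
        return False
    stale = (now - mtime) > stale_after_s
    return stale and is_dead(owner_pid)


def test_contender_never_steals_from_live_holder(tmp_path):
    """Holder's heartbeat stopped refreshing, but the holder is still running."""
    hb = os.path.join(tmp_path, "job.heartbeat")
    open(hb, "w").close()
    an_hour_ago = time.time() - 3600
    os.utime(hb, (an_hour_ago, an_hour_ago))  # simulate refreshes that failed
    holder_pid = os.getpid()                  # this process plays the live holder
    # The contender polls several times and must never be allowed in.
    decisions = [may_steal(hb, holder_pid, stale_after_s=30) for _ in range(5)]
    assert not any(decisions)
```

A PID probe is weak evidence on its own, since PIDs can be reused and the probe says nothing about holders on other hosts. A fencing token or a recovery lock would be the stronger guarantee.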
Evidence To Collect
- Lock path, heartbeat path, owner PID, and stale timeout.
- Filesystem free space and inode status for the directory that stores the heartbeat.
- The exact error returned by heartbeat refresh, not just whether the timer is still running.
- The condition a contender uses before stealing the lock.
- Whether protected writes can continue after heartbeat refresh has failed.
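The free-space and inode evidence can be gathered with a small helper like the one below, a Python sketch that assumes a POSIX system where `os.statvfs` is available. `heartbeat_fs_status` is a hypothetical name.

```python
import os


def heartbeat_fs_status(heartbeat_path):
    """Report free space and free inodes for the directory holding the heartbeat."""
    directory = os.path.dirname(os.path.abspath(heartbeat_path))
    st = os.statvfs(directory)  # POSIX only
    return {
        "directory": directory,
        # f_bavail and f_favail count what an unprivileged process can still use.
        "free_bytes": st.f_bavail * st.f_frsize,
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_inodes": st.f_favail,
        # Some filesystems report 0 here; read that as "not tracked", not "exhausted".
        "total_inodes": st.f_files,
        "writable": os.access(directory, os.W_OK),
    }
```

Attaching this dictionary to the heartbeat-failure log entry makes a full disk, exhausted inodes, and a permission problem distinguishable after the fact.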
Paid Scope
The $29 incident triage reviews one lock or runner failure and returns the safest next diagnostic step. The $29 deep cleanup turns one representative incident into a stale-lock policy, failure taxonomy, and regression checklist for your agent, CLI, or CI tool.