SafeDisk AI

File-Lock Heartbeat Failures on a Full Disk and Stale Locks

When a file-lock heartbeat silently fails because of ENOSPC, inode exhaustion, a permission error, or a missing heartbeat path, the lock holder may still be inside the critical section while another process decides the lock is stale and enters as well.

AI tool lock safety

Make heartbeat failure visible before a stale lock becomes concurrent writes.

Use a small regression checklist to separate lock acquisition, heartbeat refresh, stale detection, and lock stealing. The goal is not more logging alone; it is a safe rule for what happens after heartbeat writes stop working.

df -h "$LOCK_DIR"; df -i "$LOCK_DIR"; test -w "$LOCK_DIR"; stat "$HEARTBEAT_PATH"
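The same checks can be scripted so that a tool records them next to its lock logs. The following is a minimal Python sketch for POSIX systems, not part of any existing tool: `LockDirStatus` and `snapshot_lock_dir` are hypothetical names, and it checks write plus search permission on the directory, which is slightly stricter than `test -w`.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class LockDirStatus:
    free_bytes: int                   # space available to unprivileged writers
    free_inodes: int                  # inodes available to unprivileged users
    dir_writable: bool                # can this process create entries in the lock dir?
    heartbeat_mtime: Optional[float]  # None when the heartbeat path is missing


def snapshot_lock_dir(lock_dir: str, heartbeat_path: str) -> LockDirStatus:
    """Collect roughly what `df -h`, `df -i`, `test -w`, and `stat` would show."""
    vfs = os.statvfs(lock_dir)  # POSIX only
    try:
        mtime = os.stat(heartbeat_path).st_mtime
    except FileNotFoundError:
        mtime = None  # a missing heartbeat is reported, not treated as "stale"
    return LockDirStatus(
        free_bytes=vfs.f_bavail * vfs.f_frsize,
        free_inodes=vfs.f_favail,  # some filesystems report 0 here by design
        dir_writable=os.access(lock_dir, os.W_OK | os.X_OK),
        heartbeat_mtime=mtime,
    )
```

Logging this snapshot when a heartbeat refresh fails makes it easier to tell a full disk from exhausted inodes or a permission change.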

Policy pilot

Turn one lock incident into a testable rule.

$99 for one representative file-lock or agent-workspace incident: failure mode, safe stale-lock rule, and regression checklist.

First Response Runbook

A heartbeat failure should not be treated as a successful lock refresh. It should create an explicit lock-health state that downstream stale-lock logic can reason about.
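As an illustration of an explicit lock-health state, here is a minimal Python sketch for POSIX systems. The names `LockHealth`, `classify_heartbeat_error`, and `refresh_heartbeat` are hypothetical, and the sketch assumes the heartbeat is refreshed by updating the mtime of an existing file.

```python
import errno
import logging
import os
from enum import Enum

log = logging.getLogger("lock.heartbeat")


class LockHealth(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


# Error groups kept separate from a plain stale timeout.
STORAGE_ERRNOS = {errno.ENOSPC, errno.EDQUOT, errno.EIO}
PERMISSION_ERRNOS = {errno.EACCES, errno.EPERM}


def classify_heartbeat_error(err: OSError) -> str:
    """Map a filesystem error to a coarse failure class for logs and policy."""
    if err.errno in STORAGE_ERRNOS:
        return "storage"
    if err.errno in PERMISSION_ERRNOS:
        return "permission"
    if err.errno == errno.ENOENT:
        return "missing-path"
    return "other"


def refresh_heartbeat(lock_path: str, heartbeat_path: str) -> LockHealth:
    """Touch the heartbeat file; a failure degrades lock health instead of being swallowed."""
    try:
        os.utime(heartbeat_path)  # sets atime and mtime to the current time
    except OSError as err:
        log.warning(
            "heartbeat refresh failed: lock=%s heartbeat=%s op=utime class=%s error=%s",
            lock_path, heartbeat_path, classify_heartbeat_error(err), err,
        )
        return LockHealth.UNHEALTHY
    return LockHealth.HEALTHY
```

The caller decides what `UNHEALTHY` means for the protected work, for example aborting it or marking the lock as non-stealable; the sketch only guarantees that the failure is no longer reported as a successful refresh.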

  1. Log heartbeat refresh failures with the lock path, heartbeat path, operation, and filesystem error.
  2. Classify ENOSPC, EDQUOT, EIO, EACCES, EPERM, and missing heartbeat paths (ENOENT) separately from a normal stale timeout.
  3. When heartbeat refresh fails, decide whether the holder aborts protected work, releases the lock, or marks the lock as unhealthy.
  4. Do not let a contender steal the lock solely because the mtime is stale when a disk or permission failure could explain the stale heartbeat.
  5. Require dead-owner evidence, an explicit fencing token, or a recovery lock before allowing a steal.
  6. Add a two-contender regression test: the holder's heartbeat fails, the contender polls, and the two processes are never inside the critical section at the same time.

Copy-ready issue reply

Use this checklist when a heartbeat error is currently swallowed.

It keeps the fix focused on preventing concurrent entry, not only printing a warning.

I would make the heartbeat failure visible, and I would also define what happens to the protected critical section once the heartbeat path becomes unhealthy.

Acceptance checks I would add:

- Inject a utimes(heartbeatPath) failure with ENOSPC/EIO/EACCES and assert that the warning includes the lock path and the operation.
- Treat heartbeat-write failure as lock-health degradation, not a silent success.
- The holder should either abort protected work or mark the lock as non-stealable until ownership is resolved.
- The stale-lock detector should require both stale mtime and dead owner/process evidence before stealing.
- Add a two-contender regression test: holder heartbeat fails, second process polls, and concurrent entry never happens.
- Surface lock-dir filesystem and inode status so disk-full and permission failures are distinguishable.
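The stale-lock rule in these checks could look roughly like the Python sketch below. It assumes a same-host lock whose owner PID is recorded with the lock, and a hypothetical `marked_unhealthy` flag that the holder sets when its heartbeat fails. `LockObservation`, `owner_is_dead`, and `may_steal` are illustrative names. A PID probe is only weak evidence, because PIDs can be reused and the check does not work across hosts, so a fencing token or recovery lock may still be needed.

```python
import os
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class LockObservation:
    heartbeat_mtime: float   # last observed heartbeat mtime (epoch seconds)
    owner_pid: int           # PID recorded by the holder (same host assumed)
    marked_unhealthy: bool   # hypothetical marker set when the heartbeat fails


def owner_is_dead(pid: int) -> bool:
    """Probe the PID with signal 0, which checks existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False


def may_steal(obs: LockObservation, stale_after_s: float,
              now: Optional[float] = None) -> bool:
    """Allow a steal only with a stale heartbeat AND dead-owner evidence."""
    now = time.time() if now is None else now
    if (now - obs.heartbeat_mtime) <= stale_after_s:
        return False  # heartbeat is fresh
    if obs.marked_unhealthy:
        return False  # ownership is unresolved; do not steal on mtime alone
    return owner_is_dead(obs.owner_pid)
```

Passing `now` explicitly keeps the rule deterministic in a regression test. On a full disk the holder may be unable to write an unhealthy marker at all, which is one reason the rule still requires dead-owner evidence.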

Evidence To Collect

Paid Scope

The $29 incident triage reviews one lock or runner failure and returns the safest next diagnostic step. The $99 team pilot turns one representative incident into a stale-lock policy, failure taxonomy, and regression checklist for your agent, CLI, or CI tool.