Recover from stale temp file left by crashed writer #11

Open
mcfnord wants to merge 1 commit into softins:master from mcfnord:stale-lock-recovery

mcfnord commented Apr 15, 2026

Problem

If the PHP process writing new cache data is killed mid-flight (e.g. a connection reset at the PHP-FPM socket level), register_shutdown_function('cleanup') does not run. This leaves a zero-byte .tmp lock file behind.

All subsequent requests for that endpoint then enter the acquisition loop, fail to open the .tmp file exclusively, and retry with 200ms sleeps until the 20-second timeout fires, at which point the script returns a non-JSON die() response. Any caller with a shorter timeout (e.g. 10s) sees a 499/503 instead. The endpoint appears permanently broken until the stale file is manually removed.

Fix

After a failed fopen(..., 'x'), check whether the .tmp file is older than 30 seconds. A normal successful fetch completes in well under 2 seconds, so a 30-second threshold only fires on genuinely abandoned locks. When detected, log the event, delete the file, and continue — the next loop iteration acquires the lock cleanly and performs a fresh fetch.
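
The check described above could look roughly like the following. This is a minimal sketch, not the actual patch: the function name acquireLock, the constant names, and the use of filemtime() to measure the file's age are assumptions for illustration. Only the 200ms sleep, the 30-second threshold, and the 20-second timeout come from the description.

```php
<?php
// Hypothetical helper; names and structure are illustrative, not taken from the patch.
const STALE_AFTER_SECONDS  = 30;     // staleness threshold from the description
const WAIT_TIMEOUT_SECONDS = 20;     // hard timeout from the description
const RETRY_SLEEP_MICROS   = 200000; // 200ms between attempts

/**
 * Try to create $tmpFile exclusively, removing it first if it looks abandoned.
 * Returns an open file handle on success, or false once the timeout expires.
 */
function acquireLock(string $tmpFile)
{
    $deadline = microtime(true) + WAIT_TIMEOUT_SECONDS;

    while (microtime(true) < $deadline) {
        // Mode 'x' fails if the file already exists, which is what makes it a lock.
        $handle = @fopen($tmpFile, 'x');
        if ($handle !== false) {
            return $handle;
        }

        // Drop any cached stat data so the mtime is read fresh on each iteration.
        clearstatcache(true, $tmpFile);
        $mtime = @filemtime($tmpFile);
        if ($mtime !== false && (time() - $mtime) > STALE_AFTER_SECONDS) {
            error_log("Stale lock file detected: $tmpFile");
            @unlink($tmpFile);
            continue; // retry immediately; the next fopen should succeed
        }

        usleep(RETRY_SLEEP_MICROS);
    }

    return false;
}
```

One caveat with this shape: the age check and the unlink are not atomic, so two waiters can both see the stale file, and the slower one may delete a lock the faster one has just created. The sketch treats recovery as best effort and relies on the timeout as a backstop.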

Test plan

  • Confirm fix self-heals an existing stale .tmp file on next request (verified in production: log emitted "Stale lock file detected", fresh data returned in ~1250ms)
  • Confirm normal concurrent requests are unaffected (stale check only runs after fopen fails, and only when file is >30s old)
  • Confirm the 20-second hard timeout remains as a backstop
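
The first two items can be exercised locally without crashing a real writer by backdating a file's modification time. The following is a sketch under assumptions: isStaleLock is a hypothetical stand-in for the check in the patch, and the temp-file prefixes are arbitrary.

```php
<?php
// Hypothetical test helper: reports whether a lock file is older than the threshold.
function isStaleLock(string $tmpFile, int $maxAgeSeconds = 30): bool
{
    clearstatcache(true, $tmpFile);
    $mtime = @filemtime($tmpFile);
    return $mtime !== false && (time() - $mtime) > $maxAgeSeconds;
}

// Simulate a crashed writer: zero-byte file with its mtime set 60 seconds in the past.
$stale = tempnam(sys_get_temp_dir(), 'stale');
touch($stale, time() - 60);

// Simulate an active writer: zero-byte file created just now.
$fresh = tempnam(sys_get_temp_dir(), 'fresh');

var_dump(isStaleLock($stale)); // bool(true)
var_dump(isStaleLock($fresh)); // bool(false)

unlink($stale);
unlink($fresh);
```

Backdating with touch() keeps the test fast, since nothing has to wait out the real 30-second threshold.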

🤖 Generated with Claude Code

If the PHP process writing new cache data crashes (e.g. connection reset
by peer), register_shutdown_function may not run, leaving a zero-byte
.tmp lock file behind. All subsequent requests then loop waiting for a
cache update that never arrives, timing out after 20s (or sooner if the
caller disconnects).

Detect this by checking whether the .tmp file is older than 30 seconds
(well past the ~1.5s a normal fetch takes) and removing it so the next
waiter can take the lock and complete the fetch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>