fix: Eliminate binary image cache lock contention during SDK startup#7766
fix: Eliminate binary image cache lock contention during SDK startup#7766
Conversation
sentrycrashbic_registerAddedCallback held binaryImagesMutex for the entire iteration of all cached images (~700+). Any concurrent binaryImageAdded call — triggered by dlopen on another thread — blocked on the mutex for the full duration, causing fatal app hangs of 1–2+ seconds on cold start. ### Changes C layer (SentryCrashBinaryImageCache.c): - Snapshot tailNode under the mutex (nanoseconds), iterate outside it. Nodes are never freed, so the snapshot pointer remains valid. - Make tailNode _Atomic — eliminates the first mutex round-trip in binaryImageAdded and fixes the formal data race in iterateOverImages (crash handler path). - Make imageAddedCallback / imageRemovedCallback _Atomic with acquire/release semantics — fixes TSan-confirmed data race on callback function pointers. Swift layer (SentryBinaryImageCache.swift): - Move sentrycrashbic_register* calls outside lock.synchronized in start() / stop(). - Downgrade NSRecursiveLock to NSLock (re-entrant lock no longer needed). - Read isDebug under the lock — fixes data race. ### Result Thread B blocked time during registerAddedCallback: 148ms → 0.27ms (551× improvement).
Semver Impact of This PR🟢 Patch (bug fixes) 📋 Changelog PreviewThis is how your changes will appear in the changelog. New Features ✨
Bug Fixes 🐛
Documentation 📚
Internal Changes 🔧Deps
Other
Other
🤖 This preview updates automatically when you update the PR. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7766 +/- ##
=============================================
- Coverage 85.412% 85.133% -0.280%
=============================================
Files 487 487
Lines 29086 29105 +19
Branches 12592 12593 +1
=============================================
- Hits 24843 24778 -65
- Misses 4193 4275 +82
- Partials 50 52 +2
... and 16 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
During SDK startup, hundreds of binary images are loaded and inserted into the cache before any lookup occurs. The previous implementation sorted on every insert — O(n) per insert due to Array.insert(at:), making startup O(n²). This change appends in O(1) and defers sorting to the first lookup, reducing total startup cost from ~180ms to ~5ms for 5000 images.
itaybre
left a comment
There was a problem hiding this comment.
Looks promising, but got some questions: mostly worried about sentrycrashbic_stopCache clearing the cache
| static void | ||
| binaryImageRemoved(const struct mach_header *header, intptr_t slide) | ||
| { | ||
| SentryCrashBinaryImageNode *nextNode = &rootNode; |
There was a problem hiding this comment.
m: Not new, but should we worry about sentrycrashbic_stopCache being called while we iterate this?
There was a problem hiding this comment.
Moved the free to the next start cycle in order to not iterate while being freed. Technically still a possibility with rapid stop/start SDK calls, but this should not be a realistic case. Please comment if you have a different solution in mind or if this is good enough for now.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| SentryCrashBinaryImageNode *next = pending->next; | ||
| free(pending); | ||
| pending = next; | ||
| } |
There was a problem hiding this comment.
Use-after-free when startCache frees nodes during iteration
Medium Severity
The registerAddedCallback iteration now runs outside the mutex, while startCache frees pendingFreeList also outside the mutex. If stopCache defers nodes, then startCache is called on another thread (freeing those nodes), while the iteration is still traversing them, this is a use-after-free. Critically, startCache sets tailNode back to non-NULL, so the iteration's while loop condition (tailNode != NULL) passes again after the restart, causing it to continue following freed node->next pointers. The comment "Defer freeing — readers outside the mutex may still be traversing" indicates awareness of readers, but startCache doesn't respect that invariant.
Additional Locations (1)
| sentrycrashbic_registerAddedCallback(nil) | ||
| sentrycrashbic_registerRemovedCallback(nil) | ||
| lock.synchronized { | ||
| sentrycrashbic_registerAddedCallback(nil) |
There was a problem hiding this comment.
Non-atomic start/stop creates callback-cache state mismatch
Medium Severity
Splitting callback registration out of lock.synchronized in start() and stop() creates race windows. If stop() runs between start()'s lock release and callback registration, the callbacks get registered after stop() already cleared cache to nil. The callbacks remain active but guard self.cache != nil causes all images to be silently dropped, leaving a permanently empty cache. The previous code was atomic—both operations were inside one lock.synchronized block. Since SentrySDK.close() can be called from any thread, this is reachable in production.


📜 Description
The PR aims to do smaller and low risk changes to the binary image cache.
sentrycrashbic_registerAddedCallback held binaryImagesMutex for the entire iteration of all cached images (~700+). Any concurrent binaryImageAdded call — triggered by dlopen on another thread — blocked on the mutex for the full duration, causing fatal app hangs of 1–2+ seconds on cold start.
Also, since we do bulk insertion at startup, insert images with constant time and lazily sort.
Changes
Result
In the added repro, thread B blocked time during registerAddedCallback went from 148ms → 0.27ms.
💡 Motivation and Context
Relates to #4618
💚 How did you test it?
📝 Checklist
You have to check all boxes before merging:
sendDefaultPIIis enabled.