Skip to content

fix: Eliminate binary image cache lock contention during SDK startup#7766

Open
denrase wants to merge 7 commits intomainfrom
fix/binary-image-cache-contention
Open

fix: Eliminate binary image cache lock contention during SDK startup#7766
denrase wants to merge 7 commits intomainfrom
fix/binary-image-cache-contention

Conversation

@denrase
Copy link
Copy Markdown
Collaborator

@denrase denrase commented Mar 31, 2026

📜 Description

The PR aims to do smaller and low risk changes to the binary image cache.

sentrycrashbic_registerAddedCallback held binaryImagesMutex for the entire iteration of all cached images (~700+). Any concurrent binaryImageAdded call — triggered by dlopen on another thread — blocked on the mutex for the full duration, causing fatal app hangs of 1–2+ seconds on cold start.

Also, since we do bulk insertion at startup, insert images with constant time and lazily sort.

Changes

  • SentryCrashBinaryImageCache.c:
    • Snapshot tailNode under the mutex, iterate outside it. Nodes are never freed, so the snapshot pointer should remain valid.
    • Make callbacks _Atomic with acquire/release semantics.
  • Swift layer (SentryBinaryImageCache.swift):
    • Move sentrycrashbic_register* calls outside lock.synchronized in start() / stop().
    • Downgrade NSRecursiveLock to NSLock (re-entrant lock no longer needed).

Result

In the added repro, thread B blocked time during registerAddedCallback went from 148ms → 0.27ms.

💡 Motivation and Context

Relates to #4618

💚 How did you test it?

📝 Checklist

You have to check all boxes before merging:

  • I added tests to verify the changes.
  • No new PII added or SDK only sends newly added PII if sendDefaultPII is enabled.
  • I updated the docs if needed.
  • I updated the wizard if needed.
  • Review from the native team if needed.
  • No breaking change or entry added to the changelog.
  • No breaking change for hybrid SDKs or communicated to hybrid SDKs.

sentrycrashbic_registerAddedCallback held binaryImagesMutex for the entire iteration of all cached images (~700+). Any concurrent binaryImageAdded call — triggered by dlopen on another thread — blocked on the mutex for the full duration, causing fatal app hangs of 1–2+ seconds on cold start.

### Changes

 C layer (SentryCrashBinaryImageCache.c):
 - Snapshot tailNode under the mutex (nanoseconds), iterate outside it. Nodes are never freed, so the snapshot pointer remains valid.
 - Make tailNode _Atomic — eliminates the first mutex round-trip in binaryImageAdded and fixes the formal data race in iterateOverImages (crash handler path).
 - Make imageAddedCallback / imageRemovedCallback _Atomic with acquire/release semantics — fixes TSan-confirmed data race on callback function pointers.

 Swift layer (SentryBinaryImageCache.swift):
 - Move sentrycrashbic_register* calls outside lock.synchronized in start() / stop().
 - Downgrade NSRecursiveLock to NSLock (re-entrant lock no longer needed).
 - Read isDebug under the lock — fixes data race.

 ### Result

 Thread B blocked time during registerAddedCallback: 148ms → 0.27ms (551× improvement).
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

Semver Impact of This PR

🟢 Patch (bug fixes)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


New Features ✨

  • Add per-call attachAllThreads override by itaybre in #7767
  • Add attachAllThreads option by itaybre in #7764

Bug Fixes 🐛

  • Eliminate binary image cache lock contention during SDK startup by denrase in #7766
  • Move SessionTracker file I/O off the main thread by denrase in #7704
  • Copy incoming tags dict to prevent crash by itaybre in #7763
  • Per-instance unmaskView propagates to child views by denrase in #7733

Documentation 📚

  • Add SentryCrash analysis and improvement plan document by itaybre in #7528

Internal Changes 🔧

Deps

  • Bump ruby/setup-ruby from 1.295.0 to 1.298.0 by dependabot in #7751
  • Bump codecov/codecov-action from 5.5.3 to 6.0.0 by dependabot in #7749
  • Bump getsentry/craft from 2.25.0 to 2.25.2 by dependabot in #7748
  • Bump fastlane-plugin-sentry from 2.4.0 to 2.5.0 by dependabot in #7746
  • Bump actions/deploy-pages from 4.0.5 to 5.0.0 by dependabot in #7747
  • Update clang-format version by sentry-mobile-updater in #7755
  • Bump getsentry/craft/.github/workflows/changelog-preview.yml from 2.25.0 to 2.25.2 by dependabot in #7750
  • Bump activesupport from 7.2.3 to 7.2.3.1 by dependabot in #7732
  • Bump fastlane-plugin-sentry from 2.3.0 to 2.4.0 by dependabot in #7724
  • Bump ruby/setup-ruby from 1.292.0 to 1.295.0 by dependabot in #7728
  • Bump getsentry/craft from 2.24.1 to 2.25.0 by dependabot in #7729
  • Bump codecov/codecov-action from 5.5.2 to 5.5.3 by dependabot in #7725
  • Bump actions/create-github-app-token from 2.2.1 to 3.0.0 by dependabot in #7726
  • Bump getsentry/craft/.github/workflows/changelog-preview.yml from 2.24.1 to 2.25.0 by dependabot in #7727
  • Bump json from 2.18.1 to 2.19.2 by dependabot in #7709

Other

  • Update validate-pr workflow by stephanie-anderson in #7769
  • Pin GitHub Actions to full-length commit SHAs by joshuarli in #7730

Other

  • impr: Align app lifecycle breadcrumb states with app context by denrase in #7703

🤖 This preview updates automatically when you update the PR.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.133%. Comparing base (2b71685) to head (4f2cffe).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ces/Swift/Core/Helper/SentryBinaryImageCache.swift 88.571% 4 Missing ⚠️
...entryCrash/Recording/SentryCrashBinaryImageCache.c 97.500% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##              main     #7766       +/-   ##
=============================================
- Coverage   85.412%   85.133%   -0.280%     
=============================================
  Files          487       487               
  Lines        29086     29105       +19     
  Branches     12592     12593        +1     
=============================================
- Hits         24843     24778       -65     
- Misses        4193      4275       +82     
- Partials        50        52        +2     
Files with missing lines Coverage Δ
...entryCrash/Recording/SentryCrashBinaryImageCache.c 96.261% <97.500%> (+0.516%) ⬆️
...ces/Swift/Core/Helper/SentryBinaryImageCache.swift 96.363% <88.571%> (+0.209%) ⬆️

... and 16 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b71685...4f2cffe. Read the comment docs.

denrase added 3 commits March 31, 2026 15:10
During SDK startup, hundreds of binary images are loaded and inserted into the cache before any lookup occurs. The previous implementation sorted on every insert — O(n) per insert due to Array.insert(at:), making startup O(n²). This change appends in O(1) and defers sorting to the first lookup, reducing total startup cost from ~180ms to ~5ms for 5000 images.
@denrase denrase marked this pull request as ready for review March 31, 2026 13:48
Copy link
Copy Markdown
Contributor

@itaybre itaybre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks promising, but got some questions: mostly worried about sentrycrashbic_stopCache clearing the cache

static void
binaryImageRemoved(const struct mach_header *header, intptr_t slide)
{
SentryCrashBinaryImageNode *nextNode = &rootNode;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: Not new, but should we worry about sentrycrashbic_stopCache being called while we iterate this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the free to the next start cycle in order to not iterate while being freed. Technically still a possibility with rapid stop/start SDK calls, but this should not be a realistic case. Please comment if you have a different solution in mind or if this is good enough for now.

@denrase denrase requested a review from itaybre April 3, 2026 09:10
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

SentryCrashBinaryImageNode *next = pending->next;
free(pending);
pending = next;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use-after-free when startCache frees nodes during iteration

Medium Severity

The registerAddedCallback iteration now runs outside the mutex, while startCache frees pendingFreeList also outside the mutex. If stopCache defers nodes, then startCache is called on another thread (freeing those nodes), while the iteration is still traversing them, this is a use-after-free. Critically, startCache sets tailNode back to non-NULL, so the iteration's while loop condition (tailNode != NULL) passes again after the restart, causing it to continue following freed node->next pointers. The comment "Defer freeing — readers outside the mutex may still be traversing" indicates awareness of readers, but startCache doesn't respect that invariant.

Additional Locations (1)
Fix in Cursor Fix in Web

sentrycrashbic_registerAddedCallback(nil)
sentrycrashbic_registerRemovedCallback(nil)
lock.synchronized {
sentrycrashbic_registerAddedCallback(nil)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-atomic start/stop creates callback-cache state mismatch

Medium Severity

Splitting callback registration out of lock.synchronized in start() and stop() creates race windows. If stop() runs between start()'s lock release and callback registration, the callbacks get registered after stop() already cleared cache to nil. The callbacks remain active but guard self.cache != nil causes all images to be silently dropped, leaving a permanently empty cache. The previous code was atomic—both operations were inside one lock.synchronized block. Since SentrySDK.close() can be called from any thread, this is reachable in production.

Additional Locations (1)
Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants