Skip to content

Reduce retained AST nodes during Java analysis#413

Closed
somiljain2006 wants to merge 4 commits into
cbomkit:mainfrom
somiljain2006:Memory-optimization-fix
Closed

Reduce retained AST nodes during Java analysis#413
somiljain2006 wants to merge 4 commits into
cbomkit:mainfrom
somiljain2006:Memory-optimization-fix

Conversation

@somiljain2006
Copy link
Copy Markdown

@somiljain2006 somiljain2006 commented May 14, 2026

Fixes #371

Reduce retained AST nodes during Java analysis to lower heap usage

Previously, the analyzer stored every Sonar AST node in a large List during analysis. On large repositories, this caused unnecessary object retention and increased JVM heap pressure, leading to excessive garbage collection activity and a higher risk of OutOfMemoryError.

This change reduces long-lived retention of AST nodes and processes them in a more memory-efficient manner.

Validation:

  • Successfully analyzed Apache SkyWalking (~2676 source files)
  • Completed Sonar analysis with -Xmx4g
  • No OOM or Full GC observed during analysis
  • Stable GC behavior and successful report upload

This improves scalability and reduces memory overhead when analyzing large enterprise repositories. If this optimization continues to validate well in production-scale repositories, similar memory-retention improvements can later be applied to the Python, C#, and Go aggregators as follow-up work.

@somiljain2006 somiljain2006 requested a review from a team as a code owner May 14, 2026 11:14
@somiljain2006 somiljain2006 force-pushed the Memory-optimization-fix branch from 7268e88 to d1c0e59 Compare May 14, 2026 11:16
@somiljain2006
Copy link
Copy Markdown
Author

@san-zrl Can you please review this pr?

@san-zrl san-zrl assigned san-zrl and n1ckl0sk0rtge and unassigned san-zrl May 28, 2026
Copy link
Copy Markdown
Contributor

@n1ckl0sk0rtge n1ckl0sk0rtge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this — heap pressure on large Java repos is a real pain point and the streaming/consumer direction is a sensible idea. I'd like to ask for a few changes and a stronger validation story before this lands. There is also a competing approach in #417 that targets the same issue at the engine layer (eagerly releasing DetectionStore / AST refs); maintainers will likely need to choose between the two, so it would help to coordinate.

Validation request

The PR validates on Apache SkyWalking (~2 676 files, -Xmx4g), but the issue reporter on #371 needed -Xmx90g on Elasticsearch-scale (~28k files). The optimization here removes the List<INode> of model objects in JavaAggregator, which on 20k findings is roughly ~160 KB — that's noise at 90 GB scale. The real heap dominator is almost certainly the Sonar Tree references held by DetectionStore, which this PR doesn't touch (and which #417 does).

Could you please run the scenario from #371 (or a comparable repo such as elastic/elasticsearch or apache/kafka) and post:

  • Before / after heap comparison at the same -Xmx setting, using JFR (-XX:StartFlightRecording=filename=scan.jfr,settings=profile).
  • Peak heap, Full-GC count/duration, and total scan wall-clock from the JFR recording.
  • The -Xmx value below which the unpatched code OOMs and at which the patched code completes.

Without this, "fixes #371" can't be verified, and it's hard to decide between this PR and #417.

Must fix

  1. Concurrency — static liveConsumer is overwritten on concurrent ScannerManager construction. JavaAggregator.liveConsumer is static volatile. In a multi-module Sonar run, instantiating a second ScannerManager replaces the first's consumer — Java detections then route to the wrong liveOutputFile, silently. Make the consumer per-instance, or fail fast on re-registration.

  2. Concurrency — ScannerManager.getOutputFile() races on nonJavaMerged. Two threads can both observe false, both enter the if-block, both merge non-Java streams into liveOutputFile → duplicated components and inflated occurrence counts in the CBOM. Synchronize the method or use AtomicBoolean.compareAndSet.

  3. Public-API binary break with no deprecation path. IOutputFile.add(List<INode>)add(Stream<INode>) and IOutputFileFactory.createOutputFormat(...) are part of the published output module surface (used by cbomkit-lib, cf. recent integration in 04f4963). Please either (a) add a default add(List<INode>) { add(nodes.stream()); } shim marked @Deprecated(forRemoval = true), or (b) coordinate a major version bump + cbomkit-lib update. As a smaller side-suggestion, Iterable<INode> would be a less fragile parameter type than Stream (single-use, not replayable).

  4. docs/LANGUAGE_SUPPORT.md (around lines 328–335) still references JavaAggregator.getDetectedNodes() in the integration example — that method is removed by this PR, so the documented snippet no longer compiles. Please update the snippet to describe the new consumer/streaming wiring.

  5. No functional-equivalence test for CBOM output. The add(List) recursion in CBOMOutputFile was rewritten into processSingleNode / addChildren. Component dedup (occurrenceCount keyed on location), parent-bom-ref threading, and evidence assembly could shift subtly. Please add a golden-file regression test that runs a fixed fixture through CBOMOutputFile and asserts the CycloneDX output is unchanged from the pre-PR version.

  6. Cross-language asymmetry and dead code. JavaAggregator adopts the consumer pattern and loses getDetectedNodes(). PythonAggregator / GoAggregator / CSharpAggregator gain a streamDetectedNodes() accessor that nothing calls — ScannerManager still does getDetectedNodes().stream() on them. Either roll the consumer pattern out to all four aggregators in this PR (preserving the "all aggregators mirror each other" convention) or drop the unused streamDetectedNodes() methods and keep this PR Java-only.

Should fix

  1. Lock granularity in CBOMOutputFile.accept. The synchronized method holds the lock for the whole recursive processSingleNode traversal. Under Sonar's file-level parallelism, every Java match now serializes through this lock — previously there was one acquisition per scan, now O(findings). Consider ConcurrentHashMap for the components map + AtomicLong for counters, and shrink the synchronized region to just the map mutations.

  2. Synchronization protocol inconsistency. JavaAggregator.addNodes and reset are synchronized but registerConsumer is not. Volatile saves you today but it's a footgun — please synchronize registerConsumer too.

  3. ScannerManager.getOutputFile() is now a stateful query. First call mutates state (lazy non-Java merge + flips the flag); subsequent calls return the same instance. If anything calls it before Java aggregation completes (early stats, etc.) the output stays incomplete forever. At minimum Javadoc the lifecycle contract; better, merge non-Java eagerly in a post-job rather than on the getter's first call.

  4. ScannerManager.getStatistics() semantics may have shifted. Previously the count came from traversing the full aggregated node list (including children). Now Java's count comes from JavaAggregator.getTotalNodeCount() (only top-level addNodes calls). Please verify the numeric output matches the previous behavior on the same input, or document the new semantics.

  5. ScannerManager.java:383 directly instantiates new CBOMOutputFile() instead of going through IOutputFileFactory. The plugin layer should depend on the abstraction. factory.createOutputFormat(Stream.empty()) would be cleaner.

  6. Test gaps in JavaAggregatorTest / ScannerManagerTest:

    • registerConsumer called twice — second call should replace the first (and the first should no longer receive nodes).
    • addNodes called with no consumer registered — totalNodeCount must still increment.
    • getOutputFile() called twice — same instance, and non-Java nodes are NOT merged a second time.
    • Concurrent addNodes vs reset — at least a basic stress test for the synchronized API surface.
  7. No CHANGELOG / migration notes for the breaking API surface. Tied to item 3.

Suggestions

  • Hoisting new JavaScanContext(this.context) out of the loop in JavaBaseDetectionRule.java:97-106 is a nice micro-win — keep that regardless of how the rest of the PR evolves.
  • Consider lifting accept(INode) onto IOutputFile itself, so callers don't have to know about CBOMOutputFile concretely to wire the consumer.
  • Document the no-cycle invariant on the INode tree (the new recursion would stack-overflow on cycles; not a regression, but worth a one-line comment now that the recursion path is more prominent).

Minimal path forward

If you'd like to land this incrementally, my suggestion would be:

  1. Drop the IOutputFile / IOutputFileFactory signature change for now (keep List<INode> on the interface; the consumer can wrap a single-node List.of(node) internally).
  2. Fix the three concurrency bugs (1, 2, 8).
  3. Add the CBOM golden-file equivalence test (5).
  4. Either roll out the consumer pattern to all four aggregators or remove the unused streamDetectedNodes() (6).
  5. Post the Elasticsearch JFR comparison — if the heap drop isn't material at the reported scale, defer to #417 for the engine-layer fix and keep this PR scoped to whatever Java-side wins do show up.

Happy to re-review once the validation data and the concurrency fixes are in.

@n1ckl0sk0rtge n1ckl0sk0rtge added the bug Something isn't working label May 28, 2026
Signed-off-by: somiljain2006 <somil16022006@gmail.com>
Signed-off-by: somiljain2006 <somil16022006@gmail.com>
Signed-off-by: somiljain2006 <somil16022006@gmail.com>
Signed-off-by: somiljain2006 <somil16022006@gmail.com>
@somiljain2006 somiljain2006 force-pushed the Memory-optimization-fix branch from 58eafa5 to 80219db Compare May 29, 2026 16:09
@somiljain2006
Copy link
Copy Markdown
Author

After running additional validation against large repositories (Kafka and Elasticsearch) and reducing the scanner heap size down to -Xmx512m, I was unable to demonstrate a meaningful memory improvement from this change. Both the patched and unpatched versions completed successfully across the tested configurations.

Based on these results, I do not have sufficient evidence to claim that this PR addresses the memory characteristics reported in #371. The benchmark data suggests that the aggregator-side accumulation is not a significant contributor on the workloads I tested.

While the PR still contains the streaming and lifecycle improvements discussed above, I do not currently have a strong enough justification to continue pursuing it. To avoid adding maintenance burden without clear demonstrated benefit, I am closing this PR.

Thank you for the review feedback and for pushing for additional validation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

High heap memory requirements

3 participants