Reduce retained AST nodes during Java analysis#413
Conversation
7268e88 to
d1c0e59
Compare
|
@san-zrl Can you please review this pr? |
n1ckl0sk0rtge
left a comment
There was a problem hiding this comment.
Thanks for tackling this — heap pressure on large Java repos is a real pain point and the streaming/consumer direction is a sensible idea. I'd like to ask for a few changes and a stronger validation story before this lands. There is also a competing approach in #417 that targets the same issue at the engine layer (eagerly releasing DetectionStore / AST refs); maintainers will likely need to choose between the two, so it would help to coordinate.
Validation request
The PR validates on Apache SkyWalking (~2 676 files, -Xmx4g), but the issue reporter on #371 needed -Xmx90g on Elasticsearch-scale (~28k files). The optimization here removes the List<INode> of model objects in JavaAggregator, which on 20k findings is roughly ~160 KB — that's noise at 90 GB scale. The real heap dominator is almost certainly the Sonar Tree references held by DetectionStore, which this PR doesn't touch (and which #417 does).
Could you please run the scenario from #371 (or a comparable repo such as elastic/elasticsearch or apache/kafka) and post:
- Before / after heap comparison at the same
-Xmxsetting, using JFR (-XX:StartFlightRecording=filename=scan.jfr,settings=profile). - Peak heap, Full-GC count/duration, and total scan wall-clock from the JFR recording.
- The
-Xmxvalue below which the unpatched code OOMs and at which the patched code completes.
Without this, "fixes #371" can't be verified, and it's hard to decide between this PR and #417.
Must fix
-
Concurrency — static
liveConsumeris overwritten on concurrentScannerManagerconstruction.JavaAggregator.liveConsumerisstatic volatile. In a multi-module Sonar run, instantiating a secondScannerManagerreplaces the first's consumer — Java detections then route to the wrongliveOutputFile, silently. Make the consumer per-instance, or fail fast on re-registration. -
Concurrency —
ScannerManager.getOutputFile()races onnonJavaMerged. Two threads can both observefalse, both enter the if-block, both merge non-Java streams intoliveOutputFile→ duplicated components and inflated occurrence counts in the CBOM. Synchronize the method or useAtomicBoolean.compareAndSet. -
Public-API binary break with no deprecation path.
IOutputFile.add(List<INode>)→add(Stream<INode>)andIOutputFileFactory.createOutputFormat(...)are part of the publishedoutputmodule surface (used bycbomkit-lib, cf. recent integration in 04f4963). Please either (a) add adefaultadd(List<INode>) { add(nodes.stream()); }shim marked@Deprecated(forRemoval = true), or (b) coordinate a major version bump + cbomkit-lib update. As a smaller side-suggestion,Iterable<INode>would be a less fragile parameter type thanStream(single-use, not replayable). -
docs/LANGUAGE_SUPPORT.md(around lines 328–335) still referencesJavaAggregator.getDetectedNodes()in the integration example — that method is removed by this PR, so the documented snippet no longer compiles. Please update the snippet to describe the new consumer/streaming wiring. -
No functional-equivalence test for CBOM output. The
add(List)recursion inCBOMOutputFilewas rewritten intoprocessSingleNode/addChildren. Component dedup (occurrenceCountkeyed on location), parent-bom-ref threading, and evidence assembly could shift subtly. Please add a golden-file regression test that runs a fixed fixture throughCBOMOutputFileand asserts the CycloneDX output is unchanged from the pre-PR version. -
Cross-language asymmetry and dead code.
JavaAggregatoradopts the consumer pattern and losesgetDetectedNodes().PythonAggregator/GoAggregator/CSharpAggregatorgain astreamDetectedNodes()accessor that nothing calls —ScannerManagerstill doesgetDetectedNodes().stream()on them. Either roll the consumer pattern out to all four aggregators in this PR (preserving the "all aggregators mirror each other" convention) or drop the unusedstreamDetectedNodes()methods and keep this PR Java-only.
Should fix
-
Lock granularity in
CBOMOutputFile.accept. The synchronized method holds the lock for the whole recursiveprocessSingleNodetraversal. Under Sonar's file-level parallelism, every Java match now serializes through this lock — previously there was one acquisition per scan, now O(findings). ConsiderConcurrentHashMapfor the components map +AtomicLongfor counters, and shrink the synchronized region to just the map mutations. -
Synchronization protocol inconsistency.
JavaAggregator.addNodesandresetaresynchronizedbutregisterConsumeris not. Volatile saves you today but it's a footgun — please synchronizeregisterConsumertoo. -
ScannerManager.getOutputFile()is now a stateful query. First call mutates state (lazy non-Java merge + flips the flag); subsequent calls return the same instance. If anything calls it before Java aggregation completes (early stats, etc.) the output stays incomplete forever. At minimum Javadoc the lifecycle contract; better, merge non-Java eagerly in a post-job rather than on the getter's first call. -
ScannerManager.getStatistics()semantics may have shifted. Previously the count came from traversing the full aggregated node list (including children). Now Java's count comes fromJavaAggregator.getTotalNodeCount()(only top-leveladdNodescalls). Please verify the numeric output matches the previous behavior on the same input, or document the new semantics. -
ScannerManager.java:383directly instantiatesnew CBOMOutputFile()instead of going throughIOutputFileFactory. The plugin layer should depend on the abstraction.factory.createOutputFormat(Stream.empty())would be cleaner. -
Test gaps in
JavaAggregatorTest/ScannerManagerTest:registerConsumercalled twice — second call should replace the first (and the first should no longer receive nodes).addNodescalled with no consumer registered —totalNodeCountmust still increment.getOutputFile()called twice — same instance, and non-Java nodes are NOT merged a second time.- Concurrent
addNodesvsreset— at least a basic stress test for the synchronized API surface.
-
No CHANGELOG / migration notes for the breaking API surface. Tied to item 3.
Suggestions
- Hoisting
new JavaScanContext(this.context)out of the loop inJavaBaseDetectionRule.java:97-106is a nice micro-win — keep that regardless of how the rest of the PR evolves. - Consider lifting
accept(INode)ontoIOutputFileitself, so callers don't have to know aboutCBOMOutputFileconcretely to wire the consumer. - Document the no-cycle invariant on the
INodetree (the new recursion would stack-overflow on cycles; not a regression, but worth a one-line comment now that the recursion path is more prominent).
Minimal path forward
If you'd like to land this incrementally, my suggestion would be:
- Drop the
IOutputFile/IOutputFileFactorysignature change for now (keepList<INode>on the interface; the consumer can wrap a single-nodeList.of(node)internally). - Fix the three concurrency bugs (1, 2, 8).
- Add the CBOM golden-file equivalence test (5).
- Either roll out the consumer pattern to all four aggregators or remove the unused
streamDetectedNodes()(6). - Post the Elasticsearch JFR comparison — if the heap drop isn't material at the reported scale, defer to #417 for the engine-layer fix and keep this PR scoped to whatever Java-side wins do show up.
Happy to re-review once the validation data and the concurrency fixes are in.
Signed-off-by: somiljain2006 <somil16022006@gmail.com>
Signed-off-by: somiljain2006 <somil16022006@gmail.com>
Signed-off-by: somiljain2006 <somil16022006@gmail.com>
Signed-off-by: somiljain2006 <somil16022006@gmail.com>
58eafa5 to
80219db
Compare
|
After running additional validation against large repositories (Kafka and Elasticsearch) and reducing the scanner heap size down to -Xmx512m, I was unable to demonstrate a meaningful memory improvement from this change. Both the patched and unpatched versions completed successfully across the tested configurations. Based on these results, I do not have sufficient evidence to claim that this PR addresses the memory characteristics reported in #371. The benchmark data suggests that the aggregator-side accumulation is not a significant contributor on the workloads I tested. While the PR still contains the streaming and lifecycle improvements discussed above, I do not currently have a strong enough justification to continue pursuing it. To avoid adding maintenance burden without clear demonstrated benefit, I am closing this PR. Thank you for the review feedback and for pushing for additional validation. |
Fixes #371
Reduce retained AST nodes during Java analysis to lower heap usage
Previously, the analyzer stored every Sonar AST node in a large List during analysis. On large repositories, this caused unnecessary object retention and increased JVM heap pressure, leading to excessive garbage collection activity and a higher risk of OutOfMemoryError.
This change reduces long-lived retention of AST nodes and processes them in a more memory-efficient manner.
Validation:
This improves scalability and reduces memory overhead when analyzing large enterprise repositories. If this optimization continues to validate well in production-scale repositories, similar memory-retention improvements can later be applied to the Python, C#, and Go aggregators as follow-up work.