Commit aede1a2

Merge pull request #168 from codellm-devkit/feat/neo4j-java-support
feat(java): read-only Neo4j analysis backend (Neo4j parity across Java/Python/TypeScript)
2 parents 541ea2c + e0893bf commit aede1a2

14 files changed

Lines changed: 1372 additions & 46 deletions

CHANGELOG.md

Lines changed: 21 additions & 1 deletion
@@ -63,9 +63,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 where calls to a bare module name that is also imported (e.g. `os`/`re`/`json`) are dropped from
 the emitted call graph. `PythonAnalysis` / `CLDK.analysis(language="python")` accept the same
 optional `neo4j_config`.
-- Bumped `codeanalyzer-python` to `0.2.0` (adds the Neo4j graph emitter).
+- Read-only Neo4j-backed **Java** analysis backend (`cldk.analysis.java.neo4j.JNeo4jBackend`),
+completing Neo4j parity across all three languages. It reconstructs the canonical `JApplication`
+from the graph `codeanalyzer-java` (>= 2.4.0) emits with `--emit neo4j` and answers all 36
+`JavaAnalysisBackend` queries with the in-memory backend's logic. Verified against the daytrader8
+sample (145 classes): everything the graph actually contains reconstructs identically to
+`JCodeanalyzer` (97% of checks). Three projection gaps in the `codeanalyzer-java` 2.4.0 emitter
+(fields collapsing to one node, imports reduced to packages, a truncated call graph) are **fixed
+in 2.4.1** (codeanalyzer-java#156/#157/#158, verified on daytrader — `J_CALLS` went 287 → 1702),
+the version the SDK release now bundles. `JavaAnalysis` / `CLDK.java(...)` accept a
+`Neo4jConnectionConfig` as the `backend=` config to select it.
+- Bumped `codeanalyzer-python` to `0.2.0` (adds the Neo4j graph emitter); the bundled
+`codeanalyzer-java` jar is now `2.4.1` (adds the Neo4j graph emitter + the field/import/call-graph
+projection fixes). The Java analyzer jar is no longer a pip dependency — the SDK release workflow
+downloads the latest `codeanalyzer-java` jar into the bundled `jar/` directory.
 - Optional `neo4j` extra (`pip install cldk[neo4j]`) for the Neo4j Python driver.

+### Fixed
+- **Bundled JDK download for the Java backend.** `ensure_jdk` resolved the Temurin JVM via the
+Adoptium `/assets/version/{release}` endpoint, which now returns 404 for pinned releases (e.g.
+`jdk-21.0.5+11`) — so the first Java analysis on a clean machine failed before it started. It now
+resolves via the `/binary/version/...` endpoint (following the redirect to the GitHub asset) and
+reads the checksum from the asset's `.sha256.txt`.
+
 ## [v1.0.7] - 2026-02-14

 ### Added
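
The **Fixed** entry in this changelog hunk says the checksum is now read from the asset's `.sha256.txt`. A minimal sketch of how such a file could be parsed and checked against a downloaded archive, assuming the common `<hex digest>  <filename>` layout. The helper names `parse_sha256_file` and `verify_sha256` are hypothetical and are not taken from the SDK.

```python
import hashlib
from pathlib import Path


def parse_sha256_file(text: str) -> str:
    """Return the hex digest from a '<digest>  <filename>' checksum line."""
    # The digest is the first whitespace-separated token; normalize its case.
    return text.split()[0].lower()


def verify_sha256(archive: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file through SHA-256 and compare with the expected digest."""
    digest = hashlib.sha256()
    with archive.open("rb") as fh:
        # Read in chunks so a large JDK archive is never held fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

Streaming the hash keeps memory use flat regardless of archive size, which is the usual reason to avoid `read()` on the whole file.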

README.md

Lines changed: 6 additions & 5 deletions
@@ -55,7 +55,7 @@ pip install cldk
 Optional extras:

 ```bash
-pip install "cldk[neo4j]" # read-only Neo4j graph backend (Python / TypeScript)
+pip install "cldk[neo4j]" # read-only Neo4j graph backend (Java / Python / TypeScript)
 ```

 ## Quick Start
@@ -117,11 +117,11 @@ classes = analysis.get_all_classes()

 ## Supported Languages & Backends

-Each language is analyzed by a dedicated `codeanalyzer-*` engine; CLDK normalizes the result into typed models exposed through the same API.
+Each language is analyzed by a dedicated `codeanalyzer-*` engine; CLDK normalizes the result into typed models exposed through the same API. All three also support an optional **read-only Neo4j backend** — pass a `Neo4jConnectionConfig` and the SDK answers the same queries with Cypher over a graph the analyzer populates out of band (`--emit neo4j`).

 | Language | Analysis engine | What it provides |
 | --- | --- | --- |
-| **Java** | [`codeanalyzer-java`](https://github.com/codellm-devkit/codeanalyzer-java) | WALA + JavaParser. Bytecode-level call graphs, type hierarchies, symbol resolution, and method/field declarations. |
+| **Java** | [`codeanalyzer-java`](https://github.com/codellm-devkit/codeanalyzer-java) | WALA + JavaParser. Bytecode-level call graphs, type hierarchies, symbol resolution, CRUD-operation and entry-point detection. Optional read-only **Neo4j** graph backend. |
 | **Python** | [`codeanalyzer-python`](https://github.com/codellm-devkit/codeanalyzer-python) | Jedi with optional CodeQL augmentation. Symbol tables, call graphs, and class/method resolution. Optional read-only **Neo4j** graph backend. |
 | **TypeScript / JavaScript** | [`codeanalyzer-typescript`](https://github.com/codellm-devkit/codeanalyzer-typescript) | ts-morph with Jelly-based call graphs. Symbols, call graph, types, decorators, and call sites. Optional read-only **Neo4j** graph backend. |
@@ -147,13 +147,14 @@ graph TD
 P --> EP[codeanalyzer-python<br/>Jedi · CodeQL]
 T --> ET[codeanalyzer-typescript<br/>ts-morph · Jelly]

-P -. read-only .-> N[(Neo4j)]
+J -. read-only .-> N[(Neo4j)]
+P -. read-only .-> N
 T -. read-only .-> N
 ```

 **Data models** — each language has its own set of Pydantic models under `cldk.models` (`cldk.models.java`, `cldk.models.python`, `cldk.models.typescript`). They give you structured, typed, dot-accessible representations of classes, methods, fields, and statements, with JSON serialization and shared conventions across languages.

-**Analysis backends** — each language has a backend under `cldk.analysis.<language>` that coordinates its engine (see the table above) and maps the result onto the data models. Backends are orchestrated internally; you only call high-level methods such as `get_symbol_table()`, `get_method_body(...)`, and `get_call_graph(...)`, and CLDK handles tool coordination, parsing, and marshalling under the hood.
+**Analysis backends** — each language has a backend under `cldk.analysis.<language>` that coordinates its engine (see the table above) and maps the result onto the data models. The read-only Neo4j backends (`cldk.analysis.<language>.neo4j`) reconstruct the *same* models from a Cypher graph, so they are drop-in interchangeable with the in-process analyzers. Backends are orchestrated internally; you only call high-level methods such as `get_symbol_table()`, `get_method_body(...)`, and `get_call_graph(...)`, and CLDK handles tool coordination, parsing, and marshalling under the hood.

 ## Documentation
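
The "drop-in interchangeable" claim in the README's Analysis backends paragraph amounts to both backend families satisfying one interface. A self-contained sketch of that idea using `typing.Protocol`. `Backend`, `InMemoryBackend`, `GraphBackend`, and `class_count` are hypothetical stand-ins rather than CLDK classes, and the dictionary return type is a simplification of the real models.

```python
from typing import Dict, List, Protocol, Tuple


class Backend(Protocol):
    """Hypothetical stand-in for a language backend interface."""

    def get_all_classes(self) -> Dict[str, List[str]]: ...


class InMemoryBackend:
    """Stand-in for an analyzer-driven backend that holds results in memory."""

    def __init__(self, classes: Dict[str, List[str]]) -> None:
        self._classes = classes

    def get_all_classes(self) -> Dict[str, List[str]]:
        return dict(self._classes)


class GraphBackend:
    """Stand-in for a read-only graph backend; rows mimic (class, method) query results."""

    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        self._rows = rows

    def get_all_classes(self) -> Dict[str, List[str]]:
        # Rebuild the same mapping the in-memory backend exposes.
        classes: Dict[str, List[str]] = {}
        for class_name, method in self._rows:
            classes.setdefault(class_name, []).append(method)
        return classes


def class_count(backend: Backend) -> int:
    """Caller code depends only on the interface, not on the backend family."""
    return len(backend.get_all_classes())
```

Because `class_count` is typed against the protocol, either stand-in can be passed without the caller changing.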

cldk/analysis/commons/backend_config.py

Lines changed: 2 additions & 3 deletions
@@ -99,9 +99,8 @@ class Neo4jConnectionConfig:
     application_name: str | None = None


-# Per-language discriminated unions the facades match on. Java has no Neo4j backend yet, so its
-# only admissible config is the codeanalyzer one.
-JavaBackend = CodeAnalyzerConfig
+# Per-language discriminated unions the facades match on.
+JavaBackend = Union[CodeAnalyzerConfig, Neo4jConnectionConfig]
 PyBackend = Union[PyCodeAnalyzerConfig, Neo4jConnectionConfig]
 TSBackend = Union[CodeAnalyzerConfig, Neo4jConnectionConfig]
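
With `JavaBackend` now a `Union`, a facade can check a config against the alias at runtime. A minimal sketch with stand-in dataclasses: the field names mirror the attributes read in `java_analysis.py` (`uri`, `username`, `password`, `database`, `application_name`, `cache_dir`), while the default values and the `is_admissible` helper are assumptions made for illustration only.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union, get_args


@dataclass(frozen=True)
class CodeAnalyzerConfig:
    """Stand-in: carries only the cache root."""

    cache_dir: Optional[Path] = None


@dataclass(frozen=True)
class Neo4jConnectionConfig:
    """Stand-in: connection settings for a read-only graph backend."""

    uri: str = "bolt://localhost:7687"  # assumed default, not from the SDK
    username: str = "neo4j"             # assumed default
    password: str = "neo4j"             # assumed default
    database: Optional[str] = None
    application_name: Optional[str] = None


JavaBackend = Union[CodeAnalyzerConfig, Neo4jConnectionConfig]


def is_admissible(config: object) -> bool:
    """True when config is one of the members of the JavaBackend union."""
    # get_args unpacks the Union into a tuple of classes usable by isinstance.
    return isinstance(config, get_args(JavaBackend))
```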

cldk/analysis/java/codeanalyzer/_jdk.py

Lines changed: 31 additions & 14 deletions
@@ -36,12 +36,13 @@
 from __future__ import annotations

 import hashlib
-import json
 import logging
 import os
 import platform
 import stat
 import tarfile
+import urllib.error
+import urllib.parse
 import urllib.request
 import zipfile
 from pathlib import Path
@@ -70,20 +71,36 @@ def _os_arch(cls) -> tuple[str, str]:

     @classmethod
     def _resolve_asset(cls) -> tuple[str, str]:
-        """Return ``(download_url, sha256)`` for the pinned JDK binary."""
+        """Return ``(download_url, sha256)`` for the pinned JDK binary.
+
+        Resolves via the Adoptium ``/binary/version`` endpoint, which 307-redirects to the
+        GitHub release asset; the checksum comes from the asset's adjacent ``.sha256.txt``. The
+        older ``/assets/version/{release}`` query endpoint is not used: it returns 404 for pinned
+        releases (e.g. ``jdk-21.0.5+11``), even though the release exists.
+        """
         os_, arch = cls._os_arch()
-        url = (
-            f"{cls._API}/assets/version/{JDK_RELEASE}"
-            f"?os={os_}&architecture={arch}&image_type=jdk"
-            f"&jvm_impl=hotspot&heap_size=normal&vendor=eclipse"
-        )
-        req = urllib.request.Request(url, headers={"User-Agent": "cldk"})
-        with urllib.request.urlopen(req, timeout=30) as resp:
-            data = json.load(resp)
-        if not data:
-            raise RuntimeError(f"No Temurin {JDK_RELEASE} build for {os_}/{arch}")
-        pkg = data[0]["binaries"][0]["package"]
-        return pkg["link"], pkg["checksum"]
+        release = urllib.parse.quote(JDK_RELEASE, safe="")  # encode the '+' in the path
+        binary_url = f"{cls._API}/binary/version/{release}/{os_}/{arch}/jdk/hotspot/normal/eclipse"
+
+        # Capture the redirect target (the GitHub asset URL) without downloading the binary.
+        class _NoRedirect(urllib.request.HTTPRedirectHandler):
+            def redirect_request(self, *args, **kwargs):
+                return None
+
+        opener = urllib.request.build_opener(_NoRedirect)
+        req = urllib.request.Request(binary_url, headers={"User-Agent": "cldk"})
+        try:
+            opener.open(req, timeout=30)
+            raise RuntimeError(f"Expected a redirect to the Temurin {JDK_RELEASE} asset from {binary_url}")
+        except urllib.error.HTTPError as exc:
+            if exc.code not in (301, 302, 303, 307, 308) or not exc.headers.get("Location"):
+                raise RuntimeError(f"No Temurin {JDK_RELEASE} build for {os_}/{arch} (HTTP {exc.code})") from exc
+            asset_url = exc.headers["Location"]
+
+        sha_req = urllib.request.Request(asset_url + ".sha256.txt", headers={"User-Agent": "cldk"})
+        with urllib.request.urlopen(sha_req, timeout=30) as resp:
+            sha = resp.read().decode().split()[0]
+        return asset_url, sha

     @classmethod
     def _java_home(cls, root: Path) -> Path:
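
The path-encoding step in `_resolve_asset` matters because the release name contains a literal `+`, which has to be percent-encoded when it appears in a URL path segment. A small sketch of building the `/binary/version` URL in isolation. The base URL `https://api.adoptium.net/v3` is an assumption (the diff only shows `cls._API`), and `binary_url` is a hypothetical helper.

```python
import urllib.parse

API = "https://api.adoptium.net/v3"  # assumed value of cls._API


def binary_url(release: str, os_: str, arch: str, api: str = API) -> str:
    """Build the redirecting binary endpoint URL for a pinned release."""
    # safe="" forces every reserved character, including '+' and '/', to be encoded.
    encoded = urllib.parse.quote(release, safe="")
    return f"{api}/binary/version/{encoded}/{os_}/{arch}/jdk/hotspot/normal/eclipse"
```

For example, `binary_url("jdk-21.0.5+11", "linux", "x64")` yields a path segment of `jdk-21.0.5%2B11`.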

cldk/analysis/java/java_analysis.py

Lines changed: 28 additions & 16 deletions
@@ -52,12 +52,13 @@

 from tree_sitter import Tree

-from cldk.analysis.commons.backend_config import CodeAnalyzerConfig, JavaBackend, cache_subdir
+from cldk.analysis.commons.backend_config import CodeAnalyzerConfig, JavaBackend, Neo4jConnectionConfig, cache_subdir
 from cldk.analysis.commons.treesitter import TreesitterJava
 from cldk.models.java import JCallable
 from cldk.models.java import JApplication
 from cldk.models.java.models import JCRUDOperation, JComment, JCompilationUnit, JMethodDetail, JType, JField
 from cldk.analysis.java.codeanalyzer import JCodeanalyzer
+from cldk.analysis.java.neo4j import JNeo4jBackend
 from cldk.analysis.java.backend import JavaAnalysisBackend

@@ -149,22 +150,33 @@ def __init__(
         self.eager_analysis = eager_analysis
         self.target_files = target_files
         self.backend_config: JavaBackend = backend if backend is not None else CodeAnalyzerConfig()
-        # Java has a single backend family; the config only carries the cache root. analysis.json
-        # is cached under <cache_dir>/java (None in source_code mode, where the analyzer streams
-        # results over a pipe).
-        cache_path = cache_subdir(self.backend_config.cache_dir, project_dir, "java")
-        if cache_path is not None:
-            cache_path.mkdir(parents=True, exist_ok=True)
         self.treesitter_java: TreesitterJava = TreesitterJava()
-        # Initialize the analysis backend
-        self.backend: JavaAnalysisBackend = JCodeanalyzer(
-            project_dir=self.project_dir,
-            source_code=self.source_code,
-            eager_analysis=self.eager_analysis,
-            analysis_level=self.analysis_level,
-            analysis_json_path=cache_path,
-            target_files=self.target_files,
-        )
+        self.backend: JavaAnalysisBackend
+        if isinstance(self.backend_config, Neo4jConnectionConfig):
+            # Read-only: the graph is populated out of band; the SDK only polls it.
+            cfg = self.backend_config
+            application_name = cfg.application_name or (Path(project_dir).name if project_dir else None)
+            self.backend = JNeo4jBackend(
+                neo4j_uri=cfg.uri,
+                neo4j_username=cfg.username,
+                neo4j_password=cfg.password,
+                neo4j_database=cfg.database,
+                application_name=application_name,
+            )
+        else:
+            # The config only carries the cache root. analysis.json is cached under <cache_dir>/java
+            # (None in source_code mode, where the analyzer streams results over a pipe).
+            cache_path = cache_subdir(self.backend_config.cache_dir, project_dir, "java")
+            if cache_path is not None:
+                cache_path.mkdir(parents=True, exist_ok=True)
+            self.backend = JCodeanalyzer(
+                project_dir=self.project_dir,
+                source_code=self.source_code,
+                eager_analysis=self.eager_analysis,
+                analysis_level=self.analysis_level,
+                analysis_json_path=cache_path,
+                target_files=self.target_files,
+            )

     def get_imports(self) -> List[str]:
         """Return all import statements in the source code.
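
The Neo4j branch of `__init__` falls back to the project directory's name when the config carries no `application_name`. That rule in isolation, as a sketch. `resolve_application_name` is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_application_name(
    configured: Optional[str], project_dir: Union[str, Path, None]
) -> Optional[str]:
    """Prefer the configured name; otherwise use the last component of project_dir."""
    # An empty string counts as "not configured", matching the `or` in the diff.
    if configured:
        return configured
    return Path(project_dir).name if project_dir else None
```

One consequence worth noting: with neither a configured name nor a project directory, the result is `None`, so the caller has to decide how a graph query should behave without an application scope.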
cldk/analysis/java/neo4j/__init__.py

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+################################################################################
+# Copyright IBM Corporation 2026
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+################################################################################
+
+"""Read-only Neo4j-backed Java analysis backend (Cypher queries over the codeanalyzer-java graph)."""
+
+from cldk.analysis.java.neo4j.config import Neo4jConnectionConfig
+from cldk.analysis.java.neo4j.neo4j_backend import JNeo4jBackend
+
+__all__ = ["JNeo4jBackend", "Neo4jConnectionConfig"]

cldk/analysis/java/neo4j/config.py

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+################################################################################
+# Copyright IBM Corporation 2026
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+################################################################################
+
+"""Connection settings for the read-only Neo4j-backed Java analysis backend.
+
+The definition has been hoisted to :mod:`cldk.analysis.commons.backend_config`; it is re-exported
+here for symmetry with the Python and TypeScript backends.
+"""
+
+from __future__ import annotations
+
+from cldk.analysis.commons.backend_config import Neo4jConnectionConfig
+
+__all__ = ["Neo4jConnectionConfig"]
