Commit 21b6a8b

docs: restructure README to match codeanalyzer-typescript; ignore node_modules/.astro
Give the README the same layout as the `cants` (TypeScript) sibling: centered header + badges, intro, table of contents, Features, sectioned Installation (Prerequisites / pip / shell script / build-from-source), Usage (Options + Examples), Output targets (`analysis.json` / Neo4j / schema), Development, License.

Content is unchanged in substance and the auto-generated `canpy --help` block is preserved verbatim (`scripts/update_readme.py --check` passes).

Also add `node_modules/` and `.astro/` to `.gitignore`; they are docs-site build artifacts that should never be committed.
1 parent 6498575 commit 21b6a8b
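
The commit message relies on `scripts/update_readme.py --check` to confirm that the generated help block was not altered. That script is not part of this diff, so the following is only a rough sketch of how such a check could work. It assumes the block sits between the `<!-- BEGIN canpy-help -->` marker visible in the README and a matching `<!-- END canpy-help -->` marker (the end marker is an assumption); `extract_block` and `block_is_current` are hypothetical names, not the script's actual functions.

```python
import re

BEGIN = "<!-- BEGIN canpy-help -->"  # marker visible in the README diff
END = "<!-- END canpy-help -->"      # assumed closing marker, not shown in the diff


def extract_block(readme_text: str) -> str:
    """Return the text between the two markers, stripped of surrounding whitespace."""
    pattern = re.escape(BEGIN) + r"(.*?)" + re.escape(END)
    match = re.search(pattern, readme_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("help markers not found in README")
    return match.group(1).strip()


def block_is_current(readme_text: str, fresh_help: str) -> bool:
    """True when the README block equals freshly generated help text."""
    return extract_block(readme_text) == fresh_help.strip()
```

A check mode built on this idea would compare the README against fresh `canpy --help` output and exit non-zero when `block_is_current` returns `False`.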

2 files changed

Lines changed: 164 additions & 134 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -180,3 +180,7 @@ analysis.json
 
 # UV
 uv.lock
+
+# Node / Astro docs-site build artifacts (never commit these)
+node_modules/
+.astro/

README.md

Lines changed: 160 additions & 134 deletions
@@ -1,84 +1,128 @@
1-
![logo](https://github.com/codellm-devkit/codeanalyzer-python/blob/main/docs/assets/logo.png?raw=true)
1+
<div align="center">
2+
3+
<img src="https://github.com/codellm-devkit/codeanalyzer-python/blob/main/docs/assets/logo.png?raw=true" alt="CodeLLM-DevKit" />
4+
5+
# codeanalyzer-python (`canpy`)
6+
7+
**A Python static-analysis toolkit — the CLDK backend that emits a canonical symbol table and call graph, as `analysis.json` or a Neo4j property graph.**
8+
9+
[![PyPI](https://img.shields.io/pypi/v/codeanalyzer-python?style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/codeanalyzer-python/)
10+
[![Python](https://img.shields.io/pypi/pyversions/codeanalyzer-python?style=for-the-badge&logo=python&logoColor=white)](https://pypi.org/project/codeanalyzer-python/)
11+
[![Release](https://img.shields.io/github/actions/workflow/status/codellm-devkit/codeanalyzer-python/release.yml?style=for-the-badge&label=release&logo=github)](https://github.com/codellm-devkit/codeanalyzer-python/actions/workflows/release.yml)
12+
[![License](https://img.shields.io/badge/License-Apache%202.0-blue?style=for-the-badge)](./LICENSE)
13+
14+
</div>
15+
16+
---
17+
18+
`canpy` is a static analyzer for Python built on [Jedi](https://jedi.readthedocs.io/), with optional
19+
[CodeQL](https://codeql.github.com/)-resolved call edges and
20+
[Tree-sitter](https://tree-sitter.github.io/) parsing. It produces the canonical CodeLLM-DevKit
21+
(CLDK) `analysis.json` — a symbol table plus a call graph — and can project that same analysis into a
22+
**Neo4j property graph**. It is the Python backend behind
23+
[CLDK](https://github.com/codellm-devkit/python-sdk), mirroring its
24+
[TypeScript](https://github.com/codellm-devkit/codeanalyzer-typescript) (`cants`) and
25+
[Java](https://github.com/codellm-devkit/codeanalyzer-java) siblings.
26+
27+
Every run produces a symbol table **and** a call graph. Edges come from Jedi's lexical resolution by
28+
default; `--codeql` resolves additional edges (RPC / third-party / dynamically-dispatched targets)
29+
and merges them with the Jedi-derived edges, also backfilling callees Jedi could not resolve.
30+
31+
## Table of Contents
32+
33+
- [Features](#features)
34+
- [Installation](#installation)
35+
- [Prerequisites](#prerequisites)
36+
- [Install via pip (PyPI)](#install-via-pip-pypi)
37+
- [Install via shell script](#install-via-shell-script)
38+
- [Build from source](#build-from-source)
39+
- [Usage](#usage)
40+
- [Options](#options)
41+
- [Examples](#examples)
42+
- [Output targets](#output-targets)
43+
- [`analysis.json` (default)](#analysisjson-default)
44+
- [Neo4j graph](#neo4j-graph)
45+
- [Schema contract](#schema-contract)
46+
- [Development](#development)
47+
- [License](#license)
48+
49+
## Features
50+
51+
- **Symbol table** — modules, classes, functions, methods, variables, decorators, imports, and
52+
docstrings, with precise source spans.
53+
- **Call graph** — Jedi's lexical resolver by default, with optional **CodeQL**-resolved edges
54+
(`--codeql`) for RPC / third-party / dynamically-dispatched targets, merged with the Jedi edges;
55+
CodeQL also backfills callees Jedi could not resolve.
56+
- **Neo4j output** — project the analysis into a labeled property graph: a self-contained
57+
`graph.cypher` snapshot, or an **incremental** push to a live database over Bolt.
58+
- **Versioned schema** — a machine-readable, version-stamped Neo4j schema contract (`--emit schema`),
59+
checked in as `schema.neo4j.json` and shipped with every release.
60+
- **Incremental cache** — per-file results are cached under `.codeanalyzer`; `--lazy` (default)
61+
reuses them, `--eager` forces a clean rebuild. `--ray` distributes the work across cores.
62+
- **Compact output** — canonical `analysis.json`, or binary `analysis.msgpack` for smaller artifacts.
263

3-
# A Python Static Analysis Toolkit (and Library)
64+
## Installation
465

5-
A comprehensive static analysis tool for Python source code that provides symbol table generation, call graph analysis, and semantic analysis using Jedi, CodeQL, and Tree-sitter — emitted as the canonical `analysis.json`, or projected into a **Neo4j property graph**.
66+
### Prerequisites
667

7-
## Installation
68+
- **Python 3.10 or newer.**
69+
- A C toolchain and the `venv` / development headers — the analyzer builds an isolated virtual
70+
environment per project (via Python's `venv`) so Jedi can resolve types and imports:
871

9-
```bash
10-
pip install codeanalyzer-python
11-
```
72+
```sh
73+
# Ubuntu / Debian
74+
sudo apt install python3-venv python3-dev build-essential
1275

13-
For the optional **live Neo4j push** (`--emit neo4j --neo4j-uri …`), install the `neo4j` extra:
76+
# Fedora / RHEL / CentOS
77+
sudo dnf group install "Development Tools" && sudo dnf install python3-venv python3-devel
1478

15-
```bash
16-
pip install 'codeanalyzer-python[neo4j]'
17-
```
79+
# macOS
80+
xcode-select --install
81+
```
1882

19-
Or install the CLI as an isolated tool with the one-line installer (provisions via uv / pipx / pip):
83+
### Install via pip (PyPI)
2084

21-
```bash
22-
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/codellm-devkit/codeanalyzer-python/releases/latest/download/canpy-installer.sh | sh
85+
```sh
86+
pip install codeanalyzer-python
87+
canpy --help
2388
```
2489

25-
### Prerequisites
26-
27-
- Python 3.12 or higher
90+
For the optional **live Neo4j push** (`--emit neo4j --neo4j-uri …`), install the `neo4j` extra:
2891

29-
#### System Package Requirements
92+
```sh
93+
pip install 'codeanalyzer-python[neo4j]'
94+
```
3095

31-
The tool creates virtual environments internally using Python's built-in `venv` module.
96+
### Install via shell script
3297

33-
**Ubuntu/Debian systems:**
34-
```bash
35-
sudo apt update
36-
sudo apt install python3.12-venv python3-dev build-essential
37-
```
98+
Install the CLI as an isolated tool with the one-line installer (provisions via uv / pipx / pip):
3899

39-
**Fedora/RHEL/CentOS systems:**
40-
```bash
41-
sudo dnf group install "Development Tools"
42-
sudo dnf install python3-pip python3-venv python3-devel
43-
```
44-
or on older versions:
45-
```bash
46-
sudo yum groupinstall "Development Tools"
47-
sudo yum install python3-pip python3-venv python3-devel
100+
```sh
101+
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/codellm-devkit/codeanalyzer-python/releases/latest/download/canpy-installer.sh | sh
48102
```
49103

50-
**macOS systems:**
51-
```bash
52-
# Install Xcode Command Line Tools (for compilation)
53-
xcode-select --install
54-
55-
# If using Homebrew Python (recommended)
56-
brew install python@3.12
104+
### Build from source
57105

58-
# If using pyenv (popular Python version manager)
59-
# First ensure pyenv is properly installed and configured
60-
pyenv install 3.12.0 # or latest 3.12.x version
61-
pyenv global 3.12.0 # or pyenv local 3.12.0 for project-specific
106+
This project uses [uv](https://docs.astral.sh/uv/) for dependency management.
62107

63-
# If using system Python, you may need to install certificates
64-
/Applications/Python\ 3.12/Install\ Certificates.command
108+
```sh
109+
git clone https://github.com/codellm-devkit/codeanalyzer-python
110+
cd codeanalyzer-python
111+
uv sync --all-groups
112+
uv run canpy --help
65113
```
66114

67-
> **Note:** These packages are required as the tool uses Python's built-in `venv` module to create isolated environments for analysis.
68-
69115
## Usage
70116

71-
`canpy` provides a command-line interface for performing static analysis on Python projects.
72-
73-
### Basic Usage
74-
75-
```bash
117+
```sh
76118
canpy --input /path/to/python/project
77119
```
78120

79-
### Command Line Options
121+
With no `--output`, the analysis is printed to stdout as compact JSON; with `--output <dir>` it is
122+
written to `analysis.json` (or `graph.cypher` for `--emit neo4j`, or `analysis.msgpack` with
123+
`--format msgpack`) in that directory.
80124

81-
To view the available options and commands, run `canpy --help`. You should see output similar to the following:
125+
### Options
82126

83127
<!-- BEGIN canpy-help -->
84128

@@ -148,53 +192,44 @@ $ canpy --help
148192

149193
### Examples
150194

151-
1. **Basic analysis with symbol table:**
152-
```bash
153-
canpy --input ./my-python-project
195+
1. **Basic analysis to stdout, or to a file:**
196+
```sh
197+
canpy --input ./my-python-project # compact JSON on stdout
198+
canpy --input ./my-python-project --output ./out # → ./out/analysis.json
154199
```
155200

156-
This will print the symbol table to stdout in JSON format. If you want to save the output, you can use the `--output` option.
157-
158-
```bash
159-
canpy --input ./my-python-project --output /path/to/analysis-results
201+
2. **Binary output (msgpack):**
202+
```sh
203+
canpy --input ./my-python-project --output ./out --format msgpack # → ./out/analysis.msgpack
160204
```
161205

162-
Now, you can find the analysis results in `analysis.json` in the specified directory.
163-
164-
2. **Change output format to msgpack:**
165-
```bash
166-
canpy --input ./my-python-project --output /path/to/analysis-results --format msgpack
167-
```
168-
169-
This will save the analysis results in `analysis.msgpack` in the specified directory.
170-
171-
3. **Analysis with CodeQL enabled:**
172-
```bash
206+
3. **Resolve extra call edges with CodeQL:**
207+
```sh
173208
canpy --input ./my-python-project --codeql
174209
```
175-
Every run produces a symbol table **and** a call graph. By default, edges come from Jedi's lexical analysis. Adding `--codeql` resolves additional edges (including RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges. CodeQL also backfills resolved callees on Jedi-emitted call sites where Jedi couldn't resolve them.
176-
177-
***Note: CodeQL integration is experimental. The CLI is downloaded into `<cache_dir>/codeql/` on first use and reused thereafter.***
178-
179-
4. **Eager analysis with custom cache directory:**
180-
```bash
181-
canpy --input ./my-python-project --eager --cache-dir /path/to/custom-cache
182-
```
183-
This will rebuild the analysis cache at every run and store it in `/path/to/custom-cache/.codeanalyzer`.
210+
By default, edges come from Jedi's lexical analysis. Adding `--codeql` resolves additional edges
211+
(including RPC / third-party / dynamically-dispatched targets) and merges them with the
212+
Jedi-derived edges; CodeQL also backfills resolved callees Jedi could not resolve. CodeQL
213+
integration is experimental; the CLI is downloaded into `<cache_dir>/codeql/` on first use.
184214

185-
5. **Emit a Neo4j snapshot, or push to a live database:**
186-
```bash
215+
4. **Emit a Neo4j snapshot, or push to a live database:**
216+
```sh
187217
canpy --input ./my-python-project --emit neo4j --output ./out # → ./out/graph.cypher
188218
canpy --input ./my-python-project --emit neo4j \
189219
--neo4j-uri bolt://localhost:7687 --neo4j-user neo4j --neo4j-password secret
190220
```
191221

192-
6. **Emit the Neo4j schema contract:**
193-
```bash
194-
canpy --emit schema # print schema.json to stdout (no project needed)
222+
5. **Emit the Neo4j schema contract:**
223+
```sh
224+
canpy --emit schema # print schema.json to stdout (no project needed)
195225
canpy --emit schema --output ./out # → ./out/schema.json
196226
```
197227

228+
6. **Force a clean rebuild with a custom cache directory:**
229+
```sh
230+
canpy --input ./my-python-project --eager --cache-dir /path/to/custom-cache
231+
```
232+
198233
## Output targets
199234

200235
`canpy` builds one analysis in memory and can emit it three ways (`--emit`):
@@ -210,18 +245,32 @@ A `PyApplication` document — the canonical CLDK contract:
210245
}
211246
```
212247

213-
By default this is printed to stdout in JSON; with `--output` it is written to `analysis.json` (or `analysis.msgpack` with `--format msgpack`, a more compact binary format).
248+
By default this is printed to stdout in JSON; with `--output` it is written to `analysis.json` (or
249+
`analysis.msgpack` with `--format msgpack`, a more compact binary format).
214250

215251
### Neo4j graph
216252

217-
`--emit neo4j` projects the same analysis into a labeled property graph. Every node label is `Py`-prefixed and every relationship type is `PY_`-prefixed (e.g. `:PyClass`, `PY_CALLS`) so multiple language analyzers can share one database without label or relationship-type collisions. Declarations are keyed by their signature under a shared `:PySymbol` label; calls, imports, inheritance, decorators, and call sites are relationships:
253+
`--emit neo4j` projects the same analysis into a labeled property graph. Every node label is
254+
`Py`-prefixed and every relationship type is `PY_`-prefixed (e.g. `:PyClass`, `PY_CALLS`) so multiple
255+
language analyzers can share one database without label or relationship-type collisions. Declarations
256+
are keyed by their signature under a shared `:PySymbol` label; calls, imports, inheritance,
257+
decorators, and call sites are relationships:
218258

219-
- **Without `--neo4j-uri`** — writes a self-contained `graph.cypher` (constraints + indexes, a scoped wipe, then batched `MERGE`s). Load it with `cypher-shell < graph.cypher`. Needs no extra dependencies.
220-
- **With `--neo4j-uri`** — pushes to a live Neo4j over Bolt **incrementally**: only modules whose content hash changed are rewritten, and on a full run modules whose source file vanished are pruned. Requires the `neo4j` extra. Every graph carries a `schema_version` on its `:PyApplication` node.
259+
- **Without `--neo4j-uri`** — writes a self-contained `graph.cypher` (constraints + indexes, a scoped
260+
wipe, then batched `MERGE`s). Load it with `cypher-shell < graph.cypher`. Needs no extra
261+
dependencies.
262+
- **With `--neo4j-uri`** — pushes to a live Neo4j over Bolt **incrementally**: only modules whose
263+
content hash changed are rewritten, and on a full run modules whose source file vanished are
264+
pruned. Requires the `neo4j` extra. Every graph carries a `schema_version` on its `:PyApplication`
265+
node.
221266

222-
Call-graph endpoints that aren't present in the symbol table (third-party / framework / RPC targets) are materialized as `:PyExternal` ghost nodes, mirroring the analyzer's own ghost-node behaviour.
267+
Call-graph endpoints that aren't present in the symbol table (third-party / framework / RPC targets)
268+
are materialized as `:PyExternal` ghost nodes, mirroring the analyzer's own ghost-node behaviour.
223269

224-
The connection options also read from the standard Neo4j environment variables — `NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`, `NEO4J_DATABASE` — when the corresponding flag is omitted (an explicit flag wins). Prefer the env var for the password so it doesn't land in shell history or the process list:
270+
The connection options also read from the standard Neo4j environment variables — `NEO4J_URI`,
271+
`NEO4J_USERNAME`, `NEO4J_PASSWORD`, `NEO4J_DATABASE` — when the corresponding flag is omitted (an
272+
explicit flag wins). Prefer the env var for the password so it doesn't land in shell history or the
273+
process list:
225274

226275
```sh
227276
export NEO4J_URI=bolt://localhost:7687
@@ -231,59 +280,36 @@ canpy -i ./my-project --emit neo4j # credentials picked up from the environm
231280

232281
### Schema contract
233282

234-
`--emit schema` writes the machine-readable, version-stamped Neo4j schema (`schema.json`: node labels, relationships, properties, constraints, and indexes). It needs no project and is checked into the repo as `schema.neo4j.json` and bundled in every release as a GitHub Release asset, so a consumer can validate producer/consumer compatibility without invoking the tool. The shape of the contract matches the [`codeanalyzer-typescript`](https://github.com/codellm-devkit/codeanalyzer-typescript) backend.
283+
`--emit schema` writes the machine-readable, version-stamped Neo4j schema (`schema.json`: node labels,
284+
relationships, properties, constraints, and indexes). It needs no project and is checked into the repo
285+
as `schema.neo4j.json` and bundled in every release as a GitHub Release asset, so a consumer can
286+
validate producer/consumer compatibility without invoking the tool. The shape of the contract matches
287+
the [`codeanalyzer-typescript`](https://github.com/codellm-devkit/codeanalyzer-typescript) backend.
235288

236-
A UML of the `analysis.json` schema (the `PyApplication` containment tree) is checked in as [`schema-uml.drawio`](./schema-uml.drawio).
289+
A UML of the `analysis.json` schema (the `PyApplication` containment tree) is checked in as
290+
[`schema-uml.drawio`](./schema-uml.drawio), and the property-graph schema as
291+
[`neo4j-schema.drawio`](./neo4j-schema.drawio).
237292

238293
## Development
239294

240-
This project uses [uv](https://docs.astral.sh/uv/) for dependency management during development.
241-
242-
### Development Setup
243-
244-
1. Install [uv](https://docs.astral.sh/uv/getting-started/installation/)
245-
246-
2. Clone the repository:
247-
```bash
248-
git clone https://github.com/codellm-devkit/codeanalyzer-python
249-
cd codeanalyzer-python
250-
```
251-
252-
3. Install dependencies using uv:
253-
```bash
254-
uv sync --all-groups
255-
```
256-
This will install all dependencies including development and test dependencies.
257-
258-
### Running from Source
259-
260-
```bash
261-
uv run canpy --input /path/to/python/project
262-
uv run canpy --emit schema > schema.neo4j.json # regenerate the checked-in schema contract
263-
```
264-
265-
### Running Tests
295+
This project uses [uv](https://docs.astral.sh/uv/).
266296

267-
```bash
268-
uv run pytest --pspec -s
297+
```sh
298+
uv sync --all-groups
299+
uv run canpy --input /path/to/project # run from source
300+
uv run canpy --emit schema > schema.neo4j.json # regenerate the checked-in schema contract
301+
uv run python scripts/update_readme.py # regenerate the canpy --help block above
302+
uv run pytest # run the test suite
269303
```
270304

271305
The Neo4j schema-conformance test always runs. The Neo4j **bolt** integration test spins up a real
272306
Neo4j via [Testcontainers](https://testcontainers.com/) and is **opt-in** — it needs a container
273307
runtime (Docker or Podman) and is enabled with an environment variable:
274308

275-
```bash
309+
```sh
276310
RUN_CONTAINER_TESTS=1 uv run pytest test/test_neo4j_bolt.py -s
277311
```
278312

279-
### Development Dependencies
280-
281-
The project includes additional dependency groups for development:
313+
## License
282314

283-
- **test**: pytest and related testing tools (plus `neo4j` + `testcontainers` for the opt-in Neo4j test)
284-
- **dev**: development tools like ipdb
285-
286-
Install all groups with:
287-
```bash
288-
uv sync --all-groups
289-
```
315+
Apache 2.0 — see [LICENSE](./LICENSE).
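
The restructured Usage section says that `canpy --input <project>` with no `--output` prints the analysis to stdout as compact JSON. As a minimal sketch of consuming that from Python, the snippet below shells out to the CLI and parses the result. It uses only the `--input` and `--codeql` flags documented above; `build_command` and `analyze` are hypothetical helper names, and the sketch assumes stdout carries nothing but the JSON document (any logging is assumed to go elsewhere).

```python
import json
import subprocess
from pathlib import Path
from typing import Any


def build_command(project: Path, codeql: bool = False) -> list[str]:
    """Assemble the canpy invocation using flags documented in the README."""
    command = ["canpy", "--input", str(project)]
    if codeql:
        command.append("--codeql")  # optional CodeQL-resolved call edges
    return command


def analyze(project: Path, codeql: bool = False) -> Any:
    """Run canpy and parse its stdout as JSON (assumes stdout is JSON only)."""
    completed = subprocess.run(
        build_command(project, codeql),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output to str
    )
    return json.loads(completed.stdout)
```

Calling `analyze(Path("./my-python-project"))` requires `canpy` on `PATH`; the shape of the returned document is whatever the `PyApplication` contract defines, which this sketch does not assume.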
