mps-cli-py: Implemented NumPy flat storage, JAR-direct parsing and fixed caching logic by Prithvi686 · Pull Request #62 · mbeddr/mps-cli

Prithvi686 · 2026-06-14T11:17:14Z

Problem

Parsing a huge repository (around 11M nodes) took 570-600s on every run, cold or warm, because there was no effective cache for JAR models. The parser extracted every relevant JAR to a temp folder on disk just to scan its contents and then deleted it. ModelCache existed, but it keyed on the extracted file's path and mtime. Extracted files always got the current timestamp, so the cache key was different on every run and every JAR model was parsed from scratch regardless of how many times the same project had been parsed before. With around 595 JAR files this meant hundreds of MB of pointless disk I/O on every run.
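The cache-key problem described above can be reproduced in a few lines. This is only a sketch under assumptions: `file_key` and `extracted_key` are hypothetical helper names, not functions from this PR, and the key layout (path, mtime, size) is illustrative.

```python
import os
import tempfile
import zipfile


def file_key(path):
    """Cache key built from a file's absolute path, mtime and size."""
    st = os.stat(path)
    return (os.path.abspath(path), st.st_mtime_ns, st.st_size)


def extracted_key(jar_path, member, out_dir):
    """Old-style key: extract one member, then key on the extracted copy.

    The extracted copy lives in a fresh folder and is written at extraction
    time, so its key has no stable relation to the JAR it came from.
    """
    with zipfile.ZipFile(jar_path) as zf:
        extracted = zf.extract(member, out_dir)
    return file_key(extracted)


def demo():
    """Build a tiny JAR and compare both keying strategies across two 'runs'."""
    work = tempfile.mkdtemp()
    jar = os.path.join(work, "demo.jar")
    with zipfile.ZipFile(jar, "w") as zf:
        zf.writestr("models/a.mps", "<model/>")
    run1 = extracted_key(jar, "models/a.mps", tempfile.mkdtemp())
    run2 = extracted_key(jar, "models/a.mps", tempfile.mkdtemp())
    # Keys of extracted copies differ per run; the JAR's own key does not.
    return run1 != run2, file_key(jar) == file_key(jar)
```

Keying on the JAR's own stat data stays stable for as long as the JAR itself is unchanged, which is the property a warm run depends on.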

Changes

Storage

implemented flat NumPy int32 array storage in SModel, replacing per-node Python object graphs
implemented SNode as a two-field view (model, idx) into SModel arrays
implemented FlatModelPacker with three pure packing functions for strings, properties and references
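A minimal sketch of the storage idea listed above, assuming a layout along the lines of the commit message (one int32 array per node field, strings packed into a bytes blob with an int32 offsets array). `FlatModel` and `NodeView` are hypothetical stand-ins for SModel and SNode, `pack_strings` here is an illustrative version and not the PR's implementation, and the `-1` "no node" sentinel is an assumption.

```python
import numpy as np


def pack_strings(strings):
    """Pack strings into one UTF-8 blob plus an int32 offsets array."""
    encoded = [s.encode("utf-8") for s in strings]
    offsets = np.zeros(len(encoded) + 1, dtype=np.int32)
    if encoded:
        offsets[1:] = np.cumsum([len(b) for b in encoded])
    return b"".join(encoded), offsets


class FlatModel:
    """Hypothetical stand-in for SModel: one int32 array per node field."""

    def __init__(self, parent, first_child, next_sibling, names):
        self.parent = np.asarray(parent, dtype=np.int32)
        self.first_child = np.asarray(first_child, dtype=np.int32)
        self.next_sibling = np.asarray(next_sibling, dtype=np.int32)
        self.name_blob, self.name_offsets = pack_strings(names)

    def name_of(self, idx):
        start = int(self.name_offsets[idx])
        end = int(self.name_offsets[idx + 1])
        return self.name_blob[start:end].decode("utf-8")


class NodeView:
    """Hypothetical stand-in for SNode: only (model, idx), no per-node data."""

    __slots__ = ("model", "idx")

    def __init__(self, model, idx):
        self.model = model
        self.idx = idx

    @property
    def name(self):
        return self.model.name_of(self.idx)

    def children(self):
        # Walk the first_child / next_sibling chain; -1 ends the chain.
        child = int(self.model.first_child[self.idx])
        while child != -1:
            yield NodeView(self.model, child)
            child = int(self.model.next_sibling[child])


# A root node (index 0) with two children (indices 1 and 2).
model = FlatModel(
    parent=[-1, 0, 0],
    first_child=[1, -1, -1],
    next_sibling=[-1, 2, -1],
    names=["root", "a", "b"],
)
child_names = [c.name for c in NodeView(model, 0).children()]
```

Because a view holds only a model reference and an index, creating one is cheap, and the node data itself stays in a handful of contiguous arrays that can be written to and read from a cache file as-is.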

Cache

Implemented npz+pkl cache format per JAR loaded via mmap_mode='r' so the OS maps arrays into virtual memory without reading bytes upfront
Fixed ModelCache never hitting for JAR models by keying ParseCache on the JAR file's own mtime and size instead of extracted file paths
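The two cache points above can be sketched together: the entry name is derived from the JAR's own mtime and size, and the array is loaded memory-mapped. All names here (`entry_path`, `save_parents`, `load_parents`, the file naming scheme) are hypothetical. As far as I know, `np.load` only honors `mmap_mode` for plain `.npy` files, so this sketch writes a single `.npy` file rather than an `.npz` archive, and the pickle side of the cache is left out.

```python
import os

import numpy as np


def entry_path(cache_dir, jar_path):
    """Cache file name derived from the JAR's own mtime and size."""
    st = os.stat(jar_path)
    name = f"{os.path.basename(jar_path)}-{st.st_mtime_ns}-{st.st_size}.parent.npy"
    return os.path.join(cache_dir, name)


def save_parents(cache_dir, jar_path, parents):
    """Store one int32 node array for the given JAR."""
    path = entry_path(cache_dir, jar_path)
    np.save(path, np.asarray(parents, dtype=np.int32))
    return path


def load_parents(cache_dir, jar_path):
    """Return a read-only memory-mapped array, or None on a cache miss."""
    path = entry_path(cache_dir, jar_path)
    if not os.path.exists(path):
        return None  # JAR is new or has changed since the entry was written
    # mmap_mode="r" maps the file instead of reading it into memory upfront.
    return np.load(path, mmap_mode="r")
```

If the JAR is rebuilt, its mtime or size changes, the derived name no longer matches and the lookup falls through to a fresh parse. Stale entries would need separate cleanup, which this sketch does not handle.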

Parsing pipeline

implemented JarScanner to discover jar contents via ZIP central directory without extracting anything to disk
implemented DiskModelLoader with background preread thread to overlap disk and JAR I/O phases
implemented single ProcessPool spanning both JAR parsing phases removing the per-JAR Windows process spawn overhead
moved parse_fpr and parse_mps from MpbBatchParser to DiskModelLoader
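Discovering JAR contents without extraction, as in the first pipeline item above, can be sketched with the standard `zipfile` module, which reads the central directory when the archive is opened. `scan_jar`, `read_member` and the `.mps` suffix filter are assumptions for illustration, not the PR's JarScanner API.

```python
import zipfile


def scan_jar(jar_path, suffixes=(".mps",)):
    """List matching members from the ZIP central directory.

    Nothing is written to disk: ZipFile reads the directory at the end of
    the archive and infolist() returns its entries.
    """
    with zipfile.ZipFile(jar_path) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if not info.is_dir() and info.filename.endswith(tuple(suffixes))
        ]


def read_member(jar_path, member):
    """Read one member's bytes straight into memory for parsing."""
    with zipfile.ZipFile(jar_path) as zf:
        return zf.read(member)
```

A parser can then be fed the in-memory bytes directly, so no temp folder has to be created or deleted per JAR.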

Other

fixed race condition in SLanguageBuilder by adding threading.Lock for concurrent concept registration during warm cache loads
added _concepts_by_name dict to SLanguage for fast concept lookup
replaced print statements with logging
added 22 new tests for FlatModelPacker and ParseCache
added numpy as a required dependency, updated CI workflow to install the package via pip so dependencies are available in CI
renamed mps-cli-py/Readme.md to README.md to match the readme declaration in pyproject.toml. The mismatch already existed but was probably never triggered, because CI did not build the package and ran tests against the source tree. The pip install -e . step added in this PR appears to be the first thing to read pyproject.toml's readme field. The Linux CI runner's filesystem is case-sensitive, so it failed to find README.md, while Windows never noticed the difference.
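The race-condition fix and the name-indexed lookup from the list above can be illustrated with a small registry. This is a sketch only: `ConceptRegistry` is a hypothetical class, and the real SLanguageBuilder and SLanguage may be structured differently; only the `_concepts_by_name` attribute name is taken from the PR description.

```python
import threading


class ConceptRegistry:
    """Hypothetical registry: check-then-insert guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._concepts_by_name = {}

    def get_or_create(self, name):
        # Without the lock, two threads could both see "missing" and each
        # create their own concept object for the same name.
        with self._lock:
            concept = self._concepts_by_name.get(name)
            if concept is None:
                concept = {"name": name}
                self._concepts_by_name[name] = concept
            return concept

    def lookup(self, name):
        """Dict lookup by name instead of scanning a list of concepts."""
        return self._concepts_by_name.get(name)
```

Holding the lock around both the check and the insert is what makes registration safe when several loader threads hit the same concept name at once.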

Performance results

Tested on a huge repository containing around 676 solutions, 2625 models and 11,179,666 nodes.

Before

around 550-600 seconds on every run

After

cold run - around 90-100 seconds
warm run - around 25-30 seconds

…orrected caching logic. Brief implementation details as below:
- Implemented flat NumPy int32 array storage in SModel for all node data (concept, role, parent, first_child, next_sibling) with packed bytes blobs and int32 offset arrays for strings, properties and references
- Implemented SNode as a thin two-field view (model, idx) into SModel arrays replacing per-node Python object storage
- Implemented FlatModelPacker with pack_strings, pack_properties and pack_references as pure functions extracted from SModel
- Implemented ParseCache with npz+pkl format per JAR loaded via mmap_mode='r' replacing pickle-per-JAR format
- Implemented JarScanner to read ZIP central directories without extracting JARs to disk
- Implemented DiskModelLoader with background preread thread to overlap disk and JAR I/O phases
- Moved parse_fpr and parse_mps from MpbBatchParser to DiskModelLoader
- Added threading.Lock to SLanguageBuilder and _concepts_by_name dict to SLanguage for concurrent concept registration
- Fixed ModelCache never hitting for JAR models due to temp file timestamp issue
- Fixed race condition in SLanguageBuilder when called from multiple threads during warm cache loads
- Added new tests for FlatModelPacker and ParseCache round-trips
Cold run: reduced from around 570 seconds to around 80-100 seconds
Warm run: reduced from around 300 seconds to around 25-30 seconds

…ares readme as "README.md" but the file was named Readme.md. Case-insensitive filesystems (for example on Windows) never noticed, since the two names resolve to the same file there. This was an existing mismatch that probably never caused issues before, because CI never built the package and ran tests against the source tree. The previous commit added pip install -e . to the CI workflow, which appears to be the first thing to actually read pyproject.toml's readme field. The Linux CI runner is case-sensitive, so it most likely fails to find README.md.

Prithvi686 changed the title ~~mps-cli-py: Implemented NumPy flat storage, JAR-direct parsing, and cache overhaul~~ mps-cli-py: Implemented NumPy flat storage, JAR-direct parsing and fixed caching logic Jun 14, 2026

Prithvi686 added 2 commits June 14, 2026 17:07

mps-cli-py: updated mps_cli_py_build.yaml workflow to install the pac…

4f7e313

…kage via pip so numpy and other declared dependencies are available during test runs

Prithvi686 requested a review from danielratiu June 14, 2026 12:04

Prithvi686 wants to merge 3 commits into main from feature/E3AARCHAI-23403_improve_parser_performance_in_huge_repositories
