mps-cli-py: Implemented NumPy flat storage, JAR-direct parsing and fixed caching logic#62

Open
Prithvi686 wants to merge 3 commits into main from feature/E3AARCHAI-23403_improve_parser_performance_in_huge_repositories
@Prithvi686 Prithvi686 commented Jun 14, 2026

Collaborator

Problem

Parsing a huge repository (around 11M nodes) took 570-600s on every run, cold or warm, because there was no effective cache for JAR models. The parser extracted every relevant JAR to a temp folder on disk just to scan its contents, then deleted it. ModelCache existed, but it keyed on the extracted file's path and mtime. Since the extracted files always got the current timestamp, the cache key was different on every run, so every JAR model was parsed from scratch regardless of how many times the same project had been parsed before. With around 595 JAR files, this meant hundreds of MB of pointless disk I/O on every run.

Changes

Storage

  • implemented flat NumPy int32 array storage in SModel, replacing per-node Python object graphs
  • implemented SNode as a two-field view (model, idx) into SModel arrays
  • implemented FlatModelPacker with three pure packing functions for strings, properties and references
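
To illustrate the storage idea above, here is a minimal sketch of packing strings into one bytes blob with an int32 offsets array, and of a two-field node view that reads from flat arrays. The names `FlatModel`, `NodeView`, `pack_strings` and `unpack_string` as written here are hypothetical stand-ins for illustration only. They are not the actual SModel, SNode or FlatModelPacker code from this PR, and the real layout likely differs.

```python
import numpy as np


def pack_strings(strings):
    """Pack strings into a single UTF-8 blob plus an int32 offsets array.

    String i occupies blob[offsets[i]:offsets[i + 1]].
    """
    encoded = [s.encode("utf-8") for s in strings]
    offsets = np.zeros(len(encoded) + 1, dtype=np.int32)
    if encoded:
        offsets[1:] = np.cumsum([len(b) for b in encoded])
    return b"".join(encoded), offsets


def unpack_string(blob, offsets, i):
    """Decode string i from the packed blob."""
    return blob[offsets[i]:offsets[i + 1]].decode("utf-8")


class FlatModel:
    """Hypothetical flat model: one array slot per node, no per-node objects."""

    def __init__(self, names, parents):
        self.name_blob, self.name_offsets = pack_strings(names)
        # parent[i] is the index of node i's parent, or -1 for a root.
        self.parent = np.asarray(parents, dtype=np.int32)


class NodeView:
    """Hypothetical two-field view (model, idx) into the flat arrays."""

    __slots__ = ("model", "idx")

    def __init__(self, model, idx):
        self.model = model
        self.idx = idx

    @property
    def name(self):
        m = self.model
        return unpack_string(m.name_blob, m.name_offsets, self.idx)

    @property
    def parent(self):
        p = int(self.model.parent[self.idx])
        return None if p < 0 else NodeView(self.model, p)
```

A view like this is created on demand and carries no data of its own, so the memory cost per node is a few array slots rather than a full Python object.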

Cache

  • Implemented npz+pkl cache format per JAR loaded via mmap_mode='r' so the OS maps arrays into virtual memory without reading bytes upfront
  • Fixed ModelCache never hitting for JAR models by keying ParseCache on the JAR file's own mtime and size instead of extracted file paths
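
The cache-key fix can be sketched roughly as follows. This is an assumption-laden illustration, not the ParseCache code from this PR: the function name `jar_cache_key` and the choice of SHA-256 over a path, mtime and size string are hypothetical. The point is only that the key is derived from the JAR file itself, which is stable between runs, instead of from temp files whose timestamps change on every extraction.

```python
import hashlib
import os


def jar_cache_key(jar_path):
    """Build a cache key from the JAR's own path, mtime and size.

    Hypothetical helper: the key stays the same across runs as long as
    the JAR file on disk is unchanged.
    """
    st = os.stat(jar_path)
    raw = f"{os.path.abspath(jar_path)}|{st.st_mtime_ns}|{st.st_size}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Two calls on an untouched JAR return the same key, so a warm run can find the cached entry. Any change to the file's size or modification time produces a different key and forces a re-parse.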

Parsing pipeline

  • implemented JarScanner to discover jar contents via ZIP central directory without extracting anything to disk
  • implemented DiskModelLoader with background preread thread to overlap disk and JAR I/O phases
  • implemented single ProcessPool spanning both JAR parsing phases removing the per-JAR Windows process spawn overhead
  • moved parse_fpr and parse_mps from MpbBatchParser to DiskModelLoader
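
As a rough sketch of scanning a JAR without extracting it, the standard library's `zipfile` module can list entries from the archive's central directory and read a single member straight into memory. The function names `list_model_entries` and `read_entry` and the default `.mps` suffix filter are assumptions for illustration. The actual JarScanner in this PR may be structured differently.

```python
import zipfile


def list_model_entries(jar_path, suffixes=(".mps",)):
    """List matching entry names using only the ZIP central directory.

    Nothing is written to disk. The suffix filter is an assumed default.
    """
    with zipfile.ZipFile(jar_path) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if info.filename.endswith(tuple(suffixes))
        ]


def read_entry(jar_path, name):
    """Read one archive member into memory as bytes, without extracting."""
    with zipfile.ZipFile(jar_path) as zf:
        return zf.read(name)
```

Opening the archive this way avoids the temp-folder round trip described in the Problem section, since the entry list and the entry bytes both come directly from the JAR file.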

Other

  • fixed race condition in SLanguageBuilder by adding threading.Lock for concurrent concept registration during warm cache loads
  • added _concepts_by_name dict to SLanguage for fast concept lookup
  • replaced print statements with logging
  • added 22 new tests for FlatModelPacker and ParseCache
  • added numpy as a required dependency and updated the CI workflow to install the package via pip so dependencies are available in CI
  • renamed mps-cli-py/Readme.md to README.md to match the readme declaration in pyproject.toml. The mismatch already existed but, I think, was never triggered because CI did not build the package and ran tests against the source tree. The pip install -e . step added in this PR is probably the first thing to read pyproject.toml's readme field, and the Linux CI runner's filesystem is presumably case-sensitive, so it failed to find README.md, while Windows never noticed the difference.
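
For the race-condition fix listed above, the general pattern is to guard the check-then-insert step with a `threading.Lock` so that two threads registering the same concept cannot both create it. The sketch below is a hypothetical `ConceptRegistry` class, not the SLanguageBuilder or SLanguage code from this PR. Only the lock and the by-name dict mirror what the description mentions.

```python
import threading


class ConceptRegistry:
    """Hypothetical registry showing lock-guarded get-or-create by name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._concepts_by_name = {}

    def get_or_create(self, name):
        # The lookup and the insert happen under one lock, so concurrent
        # callers always receive the same object for the same name.
        with self._lock:
            concept = self._concepts_by_name.get(name)
            if concept is None:
                concept = {"name": name}
                self._concepts_by_name[name] = concept
            return concept
```

The dict also gives constant-time lookup by name, which matches the stated purpose of `_concepts_by_name`.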

Performance results

Tested on a huge repository containing around 676 solutions, 2625 models and 11,179,666 nodes.

Before

  • around 550-600 seconds for every run

After

  • cold run - around 90-100 seconds
  • warm run - around 25-30 seconds

…orrected caching logic. Brief implementation details are below:

- Implemented flat NumPy int32 array storage in SModel for all node data (concept, role, parent, first_child, next_sibling) with packed bytes blobs and int32 offset arrays for strings, properties and references
- Implemented SNode as a thin two-field view (model, idx) into SModel arrays replacing per-node Python object storage
- Implemented FlatModelPacker with pack_strings, pack_properties and pack_references as pure functions extracted from SModel
- Implemented ParseCache with npz+pkl format per jar loaded via mmap_mode='r' replacing pickle-per-JAR format
- Implemented JarScanner to read ZIP central directories without extracting JARs to disk
- Implemented DiskModelLoader with background preread thread to overlap disk and jar I/O phases
- Moved parse_fpr and parse_mps from MpbBatchParser to DiskModelLoader
- Added threading.Lock to SLanguageBuilder and _concepts_by_name dict to SLanguage for concurrent concept registration
- Fixed ModelCache never hitting for jar models due to temp file timestamp issue
- Fixed race condition in SLanguageBuilder when called from multiple threads during warm cache loads
- Added new tests for FlatModelPacker and ParseCache round-trips

Cold run: reduced from around 570 seconds to around 80-100 seconds. Warm run: reduced from around 300 seconds to around 25-30 seconds.
@Prithvi686 Prithvi686 changed the title from "mps-cli-py: Implemented NumPy flat storage, JAR-direct parsing, and cache overhaul" to "mps-cli-py: Implemented NumPy flat storage, JAR-direct parsing and fixed caching logic" Jun 14, 2026
…kage via pip so numpy and other declared dependencies are available during test runs
…ares readme as "README.md" but the file was named Readme.md. Case-insensitive filesystems (for example on Windows) never noticed, I guess, since the two names resolve to the same file there.

This was an existing mismatch that, I think, never caused issues before because CI never built the package and ran tests against the source tree. The previous commit added pip install -e . to the CI workflow, and this is probably the first thing to actually read pyproject.toml's readme field. The Linux CI runner is, I think, case-sensitive, so it most likely fails to find README.md.
@Prithvi686 Prithvi686 requested a review from danielratiu June 14, 2026 12:04