mps-cli-py: Implemented NumPy flat storage, JAR-direct parsing and fixed caching logic#62
Open
Prithvi686 wants to merge 3 commits into
…orrected caching logic. Brief implementation details:
- Implemented flat NumPy int32 array storage in SModel for all node data (concept, role, parent, first_child, next_sibling), with packed bytes blobs and int32 offset arrays for strings, properties and references
- Implemented SNode as a thin two-field view (model, idx) into the SModel arrays, replacing per-node Python object storage
- Implemented FlatModelPacker with pack_strings, pack_properties and pack_references as pure functions extracted from SModel
- Implemented ParseCache with an npz+pkl format per JAR, loaded via mmap_mode='r', replacing the pickle-per-JAR format
- Implemented JarScanner to read ZIP central directories without extracting JARs to disk
- Implemented DiskModelLoader with a background preread thread to overlap the disk and JAR I/O phases
- Moved parse_fpr and parse_mps from MpbBatchParser to DiskModelLoader
- Added threading.Lock to SLanguageBuilder and a _concepts_by_name dict to SLanguage for concurrent concept registration
- Fixed ModelCache never hitting for JAR models due to a temp file timestamp issue
- Fixed a race condition in SLanguageBuilder when called from multiple threads during warm cache loads
- Added new tests for FlatModelPacker and ParseCache round-trips
Cold run: reduced from around 570 seconds to around 80-100 seconds. Warm run: reduced from around 300 seconds to around 25-30 seconds.
…kage via pip so numpy and other declared dependencies are available during test runs
…ares readme as "README.md" but the file was named Readme.md. Case-insensitive filesystems (for example on Windows) never noticed, I guess, since the two names resolve to the same file there. This was an existing mismatch that, I think, never caused any issues before because CI never built the package and only ran tests against the source tree. The previous commit added pip install -e . to the CI workflow, and I think this is the first thing to actually read pyproject.toml's readme field. The Linux CI runner is case-sensitive, so it most likely fails to find README.md.
Problem
Parsing a huge repository (around 11M nodes) took 570-600s on every run, cold or warm, because there was no effective cache for JAR models. The parser extracted every relevant JAR to a temp folder on disk just to scan its contents, then deleted it. ModelCache existed, but it keyed on the extracted file's path and mtime. The extracted files always got the current timestamp, so the cache key was different on every run and every JAR model was parsed from scratch, regardless of how many times the same project had been parsed before. With around 595 JAR files, this meant hundreds of MB of pointless disk I/O on every run.
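The cache-key problem described above can be reproduced in a few lines. This is a minimal sketch, not the code from this PR: `extracted_file_key` and `jar_entry_key` are hypothetical helpers, and the choice of key fields (JAR path, mtime, size, entry name) is an assumption about what a stable key could look like.

```python
import os
import tempfile
import zipfile


def extracted_file_key(extracted_path):
    # Fragile: the temp file is recreated on each run, so its path and
    # mtime describe the extraction, not the JAR content.
    return (extracted_path, os.stat(extracted_path).st_mtime_ns)


def jar_entry_key(jar_path, entry_name):
    # Stable: depends only on the JAR file itself and the entry inside it.
    st = os.stat(jar_path)
    return (os.path.abspath(jar_path), st.st_mtime_ns, st.st_size, entry_name)


fragile_keys, stable_keys = [], []
with tempfile.TemporaryDirectory() as workdir:
    # Build a tiny stand-in JAR with one model entry.
    jar_path = os.path.join(workdir, "demo.jar")
    with zipfile.ZipFile(jar_path, "w") as zf:
        zf.writestr("models/a.mps", "<model/>")

    # Simulate two separate runs that each extract to a fresh temp folder.
    for _ in range(2):
        with tempfile.TemporaryDirectory() as tmp:
            with zipfile.ZipFile(jar_path) as zf:
                extracted = zf.extract("models/a.mps", tmp)
            fragile_keys.append(extracted_file_key(extracted))
        stable_keys.append(jar_entry_key(jar_path, "models/a.mps"))
```

The key built from the extracted file changes between the two simulated runs, while the key built from the JAR stays the same as long as the JAR is untouched.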
Changes
Storage
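To illustrate the flat storage idea from the commit message (int32 arrays for tree links, a bytes blob plus an offsets array for strings, and a node as a two-field view), here is a rough sketch. The names `FlatModelSketch`, `NodeView`, `pack_strings_sketch` and the `-1` sentinel are hypothetical and are not taken from SModel, SNode or FlatModelPacker.

```python
import numpy as np

NO_NODE = -1  # assumed sentinel meaning "no parent / child / sibling"


def pack_strings_sketch(strings):
    """Pack strings into one UTF-8 blob plus an int32 offsets array."""
    encoded = [s.encode("utf-8") for s in strings]
    lengths = np.fromiter((len(b) for b in encoded), dtype=np.int32,
                          count=len(encoded))
    offsets = np.zeros(len(encoded) + 1, dtype=np.int32)
    offsets[1:] = np.cumsum(lengths)
    return b"".join(encoded), offsets


def get_string(blob, offsets, i):
    """Slice string i back out of the blob."""
    return blob[offsets[i]:offsets[i + 1]].decode("utf-8")


class FlatModelSketch:
    """All per-node data lives in parallel int32 arrays indexed by node id."""

    def __init__(self, concept, parent, first_child, next_sibling):
        self.concept = np.asarray(concept, dtype=np.int32)
        self.parent = np.asarray(parent, dtype=np.int32)
        self.first_child = np.asarray(first_child, dtype=np.int32)
        self.next_sibling = np.asarray(next_sibling, dtype=np.int32)


class NodeView:
    """Thin (model, idx) view; holds no node data of its own."""

    __slots__ = ("model", "idx")

    def __init__(self, model, idx):
        self.model = model
        self.idx = idx

    @property
    def parent(self):
        p = int(self.model.parent[self.idx])
        return None if p == NO_NODE else NodeView(self.model, p)

    def children(self):
        # Walk the first_child / next_sibling chain.
        c = int(self.model.first_child[self.idx])
        while c != NO_NODE:
            yield NodeView(self.model, c)
            c = int(self.model.next_sibling[c])


# Tree: node 0 has children 1 and 2; node 2 has child 3.
blob, offsets = pack_strings_sketch(["Root", "Statement", "Expression"])
model = FlatModelSketch(
    concept=[0, 1, 1, 2],
    parent=[NO_NODE, 0, 0, 2],
    first_child=[1, NO_NODE, 3, NO_NODE],
    next_sibling=[NO_NODE, 2, NO_NODE, NO_NODE],
)
child_ids = [n.idx for n in NodeView(model, 0).children()]
leaf_concept = get_string(blob, offsets, int(model.concept[3]))
```

With this layout a node costs a few int32 slots instead of a full Python object, and the arrays can be written to and read from disk as plain NumPy data.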
Cache
Parsing pipeline
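Reading JAR contents without extracting them to disk can be done with the standard `zipfile` module, which reads the ZIP central directory when an archive is opened. The sketch below is only an illustration of that approach and is not the JarScanner from this PR; the helper names and the `.mps` suffix filter are assumptions.

```python
import io
import zipfile


def list_model_entries(jar, suffixes=(".mps",)):
    """List entry names from the central directory; nothing is extracted."""
    with zipfile.ZipFile(jar) as zf:
        return [info.filename for info in zf.infolist()
                if info.filename.endswith(suffixes)]


def read_entry(jar, name):
    """Read one entry straight into memory as bytes."""
    with zipfile.ZipFile(jar) as zf:
        return zf.read(name)


# Build a small in-memory archive standing in for a real JAR.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("models/a.mps", "<model/>")
    zf.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")

entries = list_model_entries(buf)
payload = read_entry(buf, entries[0])
```

Because the bytes never touch a temp folder, there is no extracted file whose timestamp could leak into a cache key.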
Other
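The lock-guarded concept registration mentioned in the commit message could look roughly like the following. This is a hedged sketch with a hypothetical `ConceptRegistry` class; it borrows only the idea of a `threading.Lock` around a name-keyed dict and does not reproduce SLanguageBuilder or SLanguage.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ConceptRegistry:
    """Hypothetical registry: one concept object per name, even under threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._concepts_by_name = {}

    def get_or_create(self, name):
        # The lookup and the insert happen under one lock, so two threads
        # cannot both create a concept for the same name.
        with self._lock:
            concept = self._concepts_by_name.get(name)
            if concept is None:
                concept = {"name": name}
                self._concepts_by_name[name] = concept
            return concept


registry = ConceptRegistry()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(registry.get_or_create, ["Statement"] * 100))
```

Every caller gets the same object back for a given name, which is the property that matters when several cache-loading threads register concepts at once.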
Parse performance results
Tested on a huge repository containing around 676 solutions, 2625 models and 11,179,666 nodes.
Before
After