Rewrite JS parser and Segmenter: by tmihalac · Pull Request #255 · RHEcosystemAppEng/vulnerability-analysis

tmihalac · 2026-06-17T15:01:39Z

Replaced esprima-based JavaScript segmenter with tree-sitter for reliable parsing of modern JS syntax (optional chaining, nullish coalescing, top-level await)
- Fixed JS function name extraction: keyword filtering, position-aware matching, redundant pattern removal, generator/TypeScript/anonymous-export support
- Added build-artifact filtering (should_skip) that excludes app-level dist/, build/static/, .min.js while preserving node_modules/*/dist/ as legitimate third-party source
- Added empty-name guards in CCA BFS to prevent documents with unextractable function names from entering call-chain analysis
- Fixed _get_function_calls regex to detect calls through optional chaining (obj?.method())

Fixes https://redhat.atlassian.net/browse/APPENG-5453

- Replaced esprima-based JavaScript segmenter with tree-sitter for reliable parsing of modern JS syntax (optional chaining, nullish coalescing, top-level await) - Fixed JS function name extraction: keyword filtering, position-aware matching, redundant pattern removal, generator/TypeScript/anonymous-export support - Added build-artifact filtering (should_skip) that excludes app-level dist/, build/static/, .min.js while preserving node_modules/*/dist/ as legitimate third-party source - Added empty-name guards in CCA BFS to prevent documents with unextractable function names from entering call-chain analysis - Fixed _get_function_calls regex to detect calls through optional chaining (obj?.method()) Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

vbelouso · 2026-06-17T15:02:03Z

✅ Snyk checks have passed. No issues have been found so far.

Status	Scan Engine	Critical	High	Medium	Low	Total (0)
✅	Code Security	0	0	0	0	0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

tmihalac · 2026-06-18T12:33:26Z

/test-heavy

tmihalac · 2026-06-18T18:06:27Z

/test-heavy

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

…S parser - Replace 3 catastrophic regex patterns (P4/P9/P11) in get_function_name with safe arrow_header check + simple name extraction, eliminating 12-38s hangs - Add arrow_header extension for destructured params like ({a,b}) => {body} where first '{' is in params, with guard to skip non-arrow functions - Use negative lookahead =(?!>) in P9 to distinguish assignment from arrow - Fix should_skip to only match top-level exclude dirs (dist/, build/) not nested ones (node_modules/*/dist/) - Add thread-safe double-checked locking for tree-sitter language init - Refactor class/method extraction to handle anonymous classes, generators, getters/setters, and export patterns - Fix is_function to use class\b word boundary instead of class\s+[\w$]+ - Strip optional chaining '?' from _get_function_calls results - Guard empty function names in CCA and create_map_of_local_vars with ValueError handling - Add direct export detection (export function/const) in is_function_exported - Add 387 JS parser tests and 139 JS segmenter tests covering backtracking regression, arrow functions, anonymous classes, and should_skip filtering Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

zvigrinberg

@tmihalac Looks very good in overall, performance wise, and supporting better a modern JS ESM Syntax.

Please see some minor comments.

zvigrinberg · 2026-06-22T05:01:12Z

+_JS_KEYWORDS = frozenset({
+    'if', 'for', 'while', 'switch', 'catch', 'with', 'do', 'else',
+    'return', 'throw', 'new', 'delete', 'typeof', 'void',
+    'try', 'finally', 'function', 'class', 'extends',
+    'break', 'continue', 'var', 'let', 'const', 'yield', 'await',
+    'debugger', 'instanceof',
+})
+


@tmihalac Not sure if it's critical, but anyway, You're missing here the following reserved keywords in js:

case
default
export
false
true
import
in
null
super
this

These keywords (default, case, export,
import, in, super, this) are deliberately excluded from _JS_KEYWORDS because
they're valid method names in JS (e.g., commander.default(), queryBuilder.in(),
serializer.export()).

@tmihalac these are not valid function names in javascript, only as methods under class.
But because it's cascading to it after checking functions with regex so i would say it is OK....
But it's also applies for some of the keywords in _JS_KEYWORDS set

Welcome to Node.js v22.22.2. Type ".help" for more information. > class Test { ... constructor(arg1, arg2) { ... this.arg1 = arg1; ... this.arg2 = arg2; ... } ... ... export() { ... return `This is a ${this.arg1} ${this.arg2}.`; ... } ... ... delete() { ... return "delete can be defined as method name" ; ... } ... ... with() { ... return "with can be defined as method name"; ... } ... ... } undefined > > let test = new Test(1,2) undefined > test.with() 'with can be defined as method name' > test.delete() 'delete can be defined as method name' >

zvigrinberg · 2026-06-22T05:18:39Z

+            if inside_node_modules:
+                continue
+            if i == 0:
+                if i + 1 < len(parts) - 1:


@tmihalac It's factual that i==0 in that case, so why not:

Suggested change

if i + 1 < len(parts) - 1:

if 1 < len(parts) - 1:

Or equivalently:

Suggested change

if i + 1 < len(parts) - 1:

if 2 < len(parts) :

zvigrinberg · 2026-06-22T05:32:06Z

@@ -119,7 +153,10 @@ def is_function(self, function: Document) -> bool:

        content = function.page_content.strip()

-        if re.match(r'^\s*(?:export\s+(?:default\s+)?)?class\s+\w+', content):
+        if re.match(r'^\s*(?:export\s+(?:default\s+)?)?class\b', content):
+            return False
+
+        if re.match(r'^\s*(?:export\s+)?(?:declare\s+)?(?:const\s+)?(?:interface|enum)\s+', content):
            return False

        return True
@@ -130,31 +167,34 @@ def supported_files_extensions(self) -> list[str]:
    def _get_function_calls(self, caller_function: Document, callee_function_name: str, code_documents: dict[str, Document] = None) -> list[str]:
        """
        Extract all function calls matching the given callee_function_name from caller_function.
-        
+
        Detects both:
        - Direct function calls: functionName() or obj.functionName()
        - Callback references: setTimeout(functionName, ...) or arr.forEach(functionName)
-        
+
        Also handles aliased imports in JavaScript:
        - ES6: import { originalName as alias } from 'module'
        - CommonJS: const { originalName: alias } = require('module')
-        
+
        Args:
            caller_function: Document containing the caller function code
            callee_function_name: Name of the function to find calls to
            code_documents: Optional dict mapping source paths to full file Documents (for alias resolution)
-            
+
        Returns:
            List of function call patterns found (e.g., ['func', 'obj.func'])
        """
+        if not callee_function_name:
+            return []
+
        content = caller_function.page_content
        content = '\n'.join([line if not self.is_comment_line(line) else ''for line in content.splitlines()])

-        direct_call_pattern = rf'((?:[\w.()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})\s*\('
-        direct_calls = [matching.group(1) for matching in re.finditer(direct_call_pattern, content, re.MULTILINE)]
+        direct_call_pattern = rf'((?:[\w.?()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})\s*\('
+        direct_calls = [matching.group(1).replace('?', '') for matching in re.finditer(direct_call_pattern, content, re.MULTILINE)]

-        callback_pattern = rf'(?<=[\(,])\s*((?:[\w.()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})(?!\s*\()'
-        callback_calls = [matching.group(1).strip() for matching in re.finditer(callback_pattern, content, re.MULTILINE)]
+        callback_pattern = rf'(?<=[\(,])\s*((?:[\w.?()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})(?!\s*\()'
+        callback_calls = [matching.group(1).strip().replace('?', '') for matching in re.finditer(callback_pattern, content, re.MULTILINE)]


@tmihalac I Would add comments to each one of the regex(s) , what exactly it handles, so it will be clear exactly what is the target of each one of the regex(s), as if one will add a new regex to handle an edge-case, it won't break anything or won't have side effects ( the precedence of the regexs here are crucial), because one would have a better clarity about all the regex(s).
another useful objective for this comments about regex(s) is to order the current ones with numbers ( 1, 2, 3....), so future additions would see if their target edge cases are not targeted yet they will be placed underneath in a new regex ( that will be documented and numbered as well), or will fix an existing related regex with something small it misses.

zvigrinberg · 2026-06-22T05:34:18Z

+            search_from = body_start
+            while search_from < len(content):
+                arrow_pos = content.find('=>', search_from)
+                if arrow_pos == -1 or arrow_pos > body_start + 500:


@tmihalac This is a reasonable assumption covering 99.999% of the cases :)

zvigrinberg · 2026-06-22T05:38:18Z

+        segmenter_cls = self.LANGUAGE_SEGMENTERS.get(language)
+        if segmenter_cls is not None and hasattr(segmenter_cls, "should_skip") and isinstance(blob.source, str) and segmenter_cls.should_skip(blob.source):
+            return
+


@tmihalac It does the job..
Wouldn't it be cleaner If we'll add this should_skip attribute to all segmenters with constant default false?

The base class is langchain's

@tmihalac You're right that for langchain' segmenters you don't have control over it, but in ours we have ( subclasses)...
So anyway we need that one.

tmihalac requested a review from zvigrinberg June 17, 2026 15:01

tmihalac requested a review from TamarW0 June 18, 2026 08:46

tmihalac added 2 commits June 21, 2026 10:36

Remove parser_threshold parameter from ExtendedLanguageParser creation

b247db4

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

zvigrinberg requested changes Jun 22, 2026

View reviewed changes

Conversation

tmihalac commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vbelouso commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Snyk checks have passed. No issues have been found so far.

Uh oh!

tmihalac commented Jun 18, 2026

Uh oh!

tmihalac commented Jun 18, 2026

Uh oh!

zvigrinberg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zvigrinberg Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tmihalac commented Jun 17, 2026 •

edited

Loading

vbelouso commented Jun 17, 2026 •

edited

Loading

zvigrinberg Jun 22, 2026 •

edited

Loading