Skip to content

Rewrite JS parser and Segmenter:#255

Open
tmihalac wants to merge 3 commits into
RHEcosystemAppEng:mainfrom
tmihalac:rewrite-js-segmenter-and-parsers
Open

Rewrite JS parser and Segmenter:#255
tmihalac wants to merge 3 commits into
RHEcosystemAppEng:mainfrom
tmihalac:rewrite-js-segmenter-and-parsers

Conversation

@tmihalac

@tmihalac tmihalac commented Jun 17, 2026

Copy link
Copy Markdown
  • Replaced esprima-based JavaScript segmenter with tree-sitter for reliable parsing of modern JS syntax (optional chaining, nullish coalescing, top-level await)
    • Fixed JS function name extraction: keyword filtering, position-aware matching, redundant pattern removal, generator/TypeScript/anonymous-export support
    • Added build-artifact filtering (should_skip) that excludes app-level dist/, build/static/, .min.js while preserving node_modules/*/dist/ as legitimate third-party source
    • Added empty-name guards in CCA BFS to prevent documents with unextractable function names from entering call-chain analysis
    • Fixed _get_function_calls regex to detect calls through optional chaining (obj?.method())

Fixes https://redhat.atlassian.net/browse/APPENG-5453

- Replaced esprima-based JavaScript segmenter with tree-sitter for reliable
  parsing of modern JS syntax (optional chaining, nullish coalescing, top-level
  await)
  - Fixed JS function name extraction: keyword filtering, position-aware matching,
  redundant pattern removal, generator/TypeScript/anonymous-export support
  - Added build-artifact filtering (should_skip) that excludes app-level dist/,
  build/static/, .min.js while preserving node_modules/*/dist/ as legitimate
  third-party source
  - Added empty-name guards in CCA BFS to prevent documents with unextractable
  function names from entering call-chain analysis
  - Fixed _get_function_calls regex to detect calls through optional chaining
  (obj?.method())

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac tmihalac requested a review from zvigrinberg June 17, 2026 15:01
@vbelouso

vbelouso commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Code Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@tmihalac tmihalac requested a review from TamarW0 June 18, 2026 08:46
@tmihalac

Copy link
Copy Markdown
Author

/test-heavy

1 similar comment
@tmihalac

Copy link
Copy Markdown
Author

/test-heavy

tmihalac added 2 commits June 21, 2026 10:36
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
…S parser

  - Replace 3 catastrophic regex patterns (P4/P9/P11) in get_function_name with
    safe arrow_header check + simple name extraction, eliminating 12-38s hangs
  - Add arrow_header extension for destructured params like ({a,b}) => {body}
    where first '{' is in params, with guard to skip non-arrow functions
  - Use negative lookahead =(?!>) in P9 to distinguish assignment from arrow
  - Fix should_skip to only match top-level exclude dirs (dist/, build/) not
    nested ones (node_modules/*/dist/)
  - Add thread-safe double-checked locking for tree-sitter language init
  - Refactor class/method extraction to handle anonymous classes, generators,
    getters/setters, and export patterns
  - Fix is_function to use class\b word boundary instead of class\s+[\w$]+
  - Strip optional chaining '?' from _get_function_calls results
  - Guard empty function names in CCA and create_map_of_local_vars with
    ValueError handling
  - Add direct export detection (export function/const) in is_function_exported
  - Add 387 JS parser tests and 139 JS segmenter tests covering backtracking
    regression, arrow functions, anonymous classes, and should_skip filtering

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

@zvigrinberg zvigrinberg left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Looks very good in overall, performance wise, and supporting better a modern JS ESM Syntax.

Please see some minor comments.

Comment on lines +26 to +33
_JS_KEYWORDS = frozenset({
'if', 'for', 'while', 'switch', 'catch', 'with', 'do', 'else',
'return', 'throw', 'new', 'delete', 'typeof', 'void',
'try', 'finally', 'function', 'class', 'extends',
'break', 'continue', 'var', 'let', 'const', 'yield', 'await',
'debugger', 'instanceof',
})

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Not sure if it's critical, but anyway, You're missing here the following reserved keywords in js:

case
default
export
false
true
import
in
null
super
this

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These keywords (default, case, export,
import, in, super, this) are deliberately excluded from _JS_KEYWORDS because
they're valid method names in JS (e.g., commander.default(), queryBuilder.in(),
serializer.export()).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac these are not valid function names in javascript, only as methods under class.
But because it's cascading to it after checking functions with regex so i would say it is OK....
But it's also applies for some of the keywords in _JS_KEYWORDS set

Welcome to Node.js v22.22.2.
Type ".help" for more information.
> class Test {
...   constructor(arg1, arg2) {
...     this.arg1 = arg1;
...     this.arg2 = arg2;
...   }
... 
...   export() {
...     return `This is a ${this.arg1} ${this.arg2}.`;
...   }
... 
...   delete() {
...     return "delete can be defined as method name" ;
...   }
...   
...   with() {
...     return "with can be defined as method name";
...   }
...   
... }
undefined
> 
> let test = new Test(1,2)
undefined
> test.with()
'with can be defined as method name'
> test.delete()
'delete can be defined as method name'
> 

if inside_node_modules:
continue
if i == 0:
if i + 1 < len(parts) - 1:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac It's factual that i==0 in that case, so why not:

Suggested change
if i + 1 < len(parts) - 1:
if 1 < len(parts) - 1:

Or equivalently:

Suggested change
if i + 1 < len(parts) - 1:
if 2 < len(parts) :

Comment on lines 44 to +197
@@ -119,7 +153,10 @@ def is_function(self, function: Document) -> bool:

content = function.page_content.strip()

if re.match(r'^\s*(?:export\s+(?:default\s+)?)?class\s+\w+', content):
if re.match(r'^\s*(?:export\s+(?:default\s+)?)?class\b', content):
return False

if re.match(r'^\s*(?:export\s+)?(?:declare\s+)?(?:const\s+)?(?:interface|enum)\s+', content):
return False

return True
@@ -130,31 +167,34 @@ def supported_files_extensions(self) -> list[str]:
def _get_function_calls(self, caller_function: Document, callee_function_name: str, code_documents: dict[str, Document] = None) -> list[str]:
"""
Extract all function calls matching the given callee_function_name from caller_function.

Detects both:
- Direct function calls: functionName() or obj.functionName()
- Callback references: setTimeout(functionName, ...) or arr.forEach(functionName)

Also handles aliased imports in JavaScript:
- ES6: import { originalName as alias } from 'module'
- CommonJS: const { originalName: alias } = require('module')

Args:
caller_function: Document containing the caller function code
callee_function_name: Name of the function to find calls to
code_documents: Optional dict mapping source paths to full file Documents (for alias resolution)

Returns:
List of function call patterns found (e.g., ['func', 'obj.func'])
"""
if not callee_function_name:
return []

content = caller_function.page_content
content = '\n'.join([line if not self.is_comment_line(line) else ''for line in content.splitlines()])

direct_call_pattern = rf'((?:[\w.()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})\s*\('
direct_calls = [matching.group(1) for matching in re.finditer(direct_call_pattern, content, re.MULTILINE)]
direct_call_pattern = rf'((?:[\w.?()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})\s*\('
direct_calls = [matching.group(1).replace('?', '') for matching in re.finditer(direct_call_pattern, content, re.MULTILINE)]

callback_pattern = rf'(?<=[\(,])\s*((?:[\w.()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})(?!\s*\()'
callback_calls = [matching.group(1).strip() for matching in re.finditer(callback_pattern, content, re.MULTILINE)]
callback_pattern = rf'(?<=[\(,])\s*((?:[\w.?()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})(?!\s*\()'
callback_calls = [matching.group(1).strip().replace('?', '') for matching in re.finditer(callback_pattern, content, re.MULTILINE)]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac I Would add comments to each one of the regex(s) , what exactly it handles, so it will be clear exactly what is the target of each one of the regex(s), as if one will add a new regex to handle an edge-case, it won't break anything or won't have side effects ( the precedence of the regexs here are crucial), because one would have a better clarity about all the regex(s).
another useful objective for this comments about regex(s) is to order the current ones with numbers ( 1, 2, 3....), so future additions would see if their target edge cases are not targeted yet they will be placed underneath in a new regex ( that will be documented and numbered as well), or will fix an existing related regex with something small it misses.

search_from = body_start
while search_from < len(content):
arrow_pos = content.find('=>', search_from)
if arrow_pos == -1 or arrow_pos > body_start + 500:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac This is a reasonable assumption covering 99.999% of the cases :)

Comment on lines +150 to +153
segmenter_cls = self.LANGUAGE_SEGMENTERS.get(language)
if segmenter_cls is not None and hasattr(segmenter_cls, "should_skip") and isinstance(blob.source, str) and segmenter_cls.should_skip(blob.source):
return

@zvigrinberg zvigrinberg Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac It does the job..
Wouldn't it be cleaner If we'll add this should_skip attribute to all segmenters with constant default false?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base class is langchain's

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac You're right that for langchain' segmenters you don't have control over it, but in ours we have ( subclasses)...
So anyway we need that one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants