Rewrite JS parser and Segmenter:#255
Conversation
- Replaced esprima-based JavaScript segmenter with tree-sitter for reliable parsing of modern JS syntax (optional chaining, nullish coalescing, top-level await) - Fixed JS function name extraction: keyword filtering, position-aware matching, redundant pattern removal, generator/TypeScript/anonymous-export support - Added build-artifact filtering (should_skip) that excludes app-level dist/, build/static/, .min.js while preserving node_modules/*/dist/ as legitimate third-party source - Added empty-name guards in CCA BFS to prevent documents with unextractable function names from entering call-chain analysis - Fixed _get_function_calls regex to detect calls through optional chaining (obj?.method()) Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
|
/test-heavy |
1 similar comment
|
/test-heavy |
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
…S parser
- Replace 3 catastrophic regex patterns (P4/P9/P11) in get_function_name with
safe arrow_header check + simple name extraction, eliminating 12-38s hangs
- Add arrow_header extension for destructured params like ({a,b}) => {body}
where first '{' is in params, with guard to skip non-arrow functions
- Use negative lookahead =(?!>) in P9 to distinguish assignment from arrow
- Fix should_skip to only match top-level exclude dirs (dist/, build/) not
nested ones (node_modules/*/dist/)
- Add thread-safe double-checked locking for tree-sitter language init
- Refactor class/method extraction to handle anonymous classes, generators,
getters/setters, and export patterns
- Fix is_function to use class\b word boundary instead of class\s+[\w$]+
- Strip optional chaining '?' from _get_function_calls results
- Guard empty function names in CCA and create_map_of_local_vars with
ValueError handling
- Add direct export detection (export function/const) in is_function_exported
- Add 387 JS parser tests and 139 JS segmenter tests covering backtracking
regression, arrow functions, anonymous classes, and should_skip filtering
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
zvigrinberg
left a comment
There was a problem hiding this comment.
@tmihalac Looks very good in overall, performance wise, and supporting better a modern JS ESM Syntax.
Please see some minor comments.
| _JS_KEYWORDS = frozenset({ | ||
| 'if', 'for', 'while', 'switch', 'catch', 'with', 'do', 'else', | ||
| 'return', 'throw', 'new', 'delete', 'typeof', 'void', | ||
| 'try', 'finally', 'function', 'class', 'extends', | ||
| 'break', 'continue', 'var', 'let', 'const', 'yield', 'await', | ||
| 'debugger', 'instanceof', | ||
| }) | ||
|
|
There was a problem hiding this comment.
@tmihalac Not sure if it's critical, but anyway, You're missing here the following reserved keywords in js:
case
default
export
false
true
import
in
null
super
this
There was a problem hiding this comment.
These keywords (default, case, export,
import, in, super, this) are deliberately excluded from _JS_KEYWORDS because
they're valid method names in JS (e.g., commander.default(), queryBuilder.in(),
serializer.export()).
There was a problem hiding this comment.
@tmihalac these are not valid function names in javascript, only as methods under class.
But because it's cascading to it after checking functions with regex so i would say it is OK....
But it's also applies for some of the keywords in _JS_KEYWORDS set
Welcome to Node.js v22.22.2.
Type ".help" for more information.
> class Test {
... constructor(arg1, arg2) {
... this.arg1 = arg1;
... this.arg2 = arg2;
... }
...
... export() {
... return `This is a ${this.arg1} ${this.arg2}.`;
... }
...
... delete() {
... return "delete can be defined as method name" ;
... }
...
... with() {
... return "with can be defined as method name";
... }
...
... }
undefined
>
> let test = new Test(1,2)
undefined
> test.with()
'with can be defined as method name'
> test.delete()
'delete can be defined as method name'
> | if inside_node_modules: | ||
| continue | ||
| if i == 0: | ||
| if i + 1 < len(parts) - 1: |
There was a problem hiding this comment.
@tmihalac It's factual that i==0 in that case, so why not:
| if i + 1 < len(parts) - 1: | |
| if 1 < len(parts) - 1: |
Or equivalently:
| if i + 1 < len(parts) - 1: | |
| if 2 < len(parts) : |
| @@ -119,7 +153,10 @@ def is_function(self, function: Document) -> bool: | |||
|
|
|||
| content = function.page_content.strip() | |||
|
|
|||
| if re.match(r'^\s*(?:export\s+(?:default\s+)?)?class\s+\w+', content): | |||
| if re.match(r'^\s*(?:export\s+(?:default\s+)?)?class\b', content): | |||
| return False | |||
|
|
|||
| if re.match(r'^\s*(?:export\s+)?(?:declare\s+)?(?:const\s+)?(?:interface|enum)\s+', content): | |||
| return False | |||
|
|
|||
| return True | |||
| @@ -130,31 +167,34 @@ def supported_files_extensions(self) -> list[str]: | |||
| def _get_function_calls(self, caller_function: Document, callee_function_name: str, code_documents: dict[str, Document] = None) -> list[str]: | |||
| """ | |||
| Extract all function calls matching the given callee_function_name from caller_function. | |||
|
|
|||
| Detects both: | |||
| - Direct function calls: functionName() or obj.functionName() | |||
| - Callback references: setTimeout(functionName, ...) or arr.forEach(functionName) | |||
|
|
|||
| Also handles aliased imports in JavaScript: | |||
| - ES6: import { originalName as alias } from 'module' | |||
| - CommonJS: const { originalName: alias } = require('module') | |||
|
|
|||
| Args: | |||
| caller_function: Document containing the caller function code | |||
| callee_function_name: Name of the function to find calls to | |||
| code_documents: Optional dict mapping source paths to full file Documents (for alias resolution) | |||
|
|
|||
| Returns: | |||
| List of function call patterns found (e.g., ['func', 'obj.func']) | |||
| """ | |||
| if not callee_function_name: | |||
| return [] | |||
|
|
|||
| content = caller_function.page_content | |||
| content = '\n'.join([line if not self.is_comment_line(line) else ''for line in content.splitlines()]) | |||
|
|
|||
| direct_call_pattern = rf'((?:[\w.()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})\s*\(' | |||
| direct_calls = [matching.group(1) for matching in re.finditer(direct_call_pattern, content, re.MULTILINE)] | |||
| direct_call_pattern = rf'((?:[\w.?()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})\s*\(' | |||
| direct_calls = [matching.group(1).replace('?', '') for matching in re.finditer(direct_call_pattern, content, re.MULTILINE)] | |||
|
|
|||
| callback_pattern = rf'(?<=[\(,])\s*((?:[\w.()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})(?!\s*\()' | |||
| callback_calls = [matching.group(1).strip() for matching in re.finditer(callback_pattern, content, re.MULTILINE)] | |||
| callback_pattern = rf'(?<=[\(,])\s*((?:[\w.?()]+\.)?(?<![a-zA-Z0-9_]){re.escape(callee_function_name)})(?!\s*\()' | |||
| callback_calls = [matching.group(1).strip().replace('?', '') for matching in re.finditer(callback_pattern, content, re.MULTILINE)] | |||
There was a problem hiding this comment.
@tmihalac I Would add comments to each one of the regex(s) , what exactly it handles, so it will be clear exactly what is the target of each one of the regex(s), as if one will add a new regex to handle an edge-case, it won't break anything or won't have side effects ( the precedence of the regexs here are crucial), because one would have a better clarity about all the regex(s).
another useful objective for this comments about regex(s) is to order the current ones with numbers ( 1, 2, 3....), so future additions would see if their target edge cases are not targeted yet they will be placed underneath in a new regex ( that will be documented and numbered as well), or will fix an existing related regex with something small it misses.
| search_from = body_start | ||
| while search_from < len(content): | ||
| arrow_pos = content.find('=>', search_from) | ||
| if arrow_pos == -1 or arrow_pos > body_start + 500: |
There was a problem hiding this comment.
@tmihalac This is a reasonable assumption covering 99.999% of the cases :)
| segmenter_cls = self.LANGUAGE_SEGMENTERS.get(language) | ||
| if segmenter_cls is not None and hasattr(segmenter_cls, "should_skip") and isinstance(blob.source, str) and segmenter_cls.should_skip(blob.source): | ||
| return | ||
|
|
There was a problem hiding this comment.
@tmihalac It does the job..
Wouldn't it be cleaner If we'll add this should_skip attribute to all segmenters with constant default false?
There was a problem hiding this comment.
@tmihalac You're right that for langchain' segmenters you don't have control over it, but in ours we have ( subclasses)...
So anyway we need that one.
Fixes https://redhat.atlassian.net/browse/APPENG-5453