Skip to content

Add "loop" feature to instructions in static extractors.#3075

Open
larchchen wants to merge 1 commit into
mandiant:masterfrom
larchchen:inside-loop-characteristics
Open

Add "loop" feature to instructions in static extractors.#3075
larchchen wants to merge 1 commit into
mandiant:masterfrom
larchchen:inside-loop-characteristics

Conversation

@larchchen
Copy link
Copy Markdown
Contributor

@larchchen larchchen commented May 15, 2026

Detecting API happening in a loop is an effective approach for exploits leveraging racing conditions.

A classic example is DirtyCow (CVE-2016-5195)
By detecting madvise calls inside a loop with MADV_DONTNEED argument.

  scopes:
    static: basic block
  features:
    - and:
      - api: madvise
      - number: 4 # Constant for MADV_DONTNEED
      - characteristic: inside loop

A more recent example CVE-2024-50066 can be covered by

  - and:
    - api: madvise
    - number: 25 # MADV_COLLAPSE
    - characteristic: inside loop

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed
  • This submission includes AI-generated code and I have provided details in the description.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

@github-actions github-actions Bot dismissed their stale review May 15, 2026 14:53

CHANGELOG updated or no update needed, thanks! 😄

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the 'inside loop' characteristic for basic blocks in the BinExport2 extractor. It adds a new utility module, capa/features/extractors/loops.py, which uses networkx to identify vertices within cycles. Feedback identifies a structural issue in the new loops module where code precedes the license header and lacks necessary imports. Additionally, suggestions were made to move a local import to the top level for PEP 8 compliance and to simplify edge collection logic using a list comprehension.

Comment thread capa/features/extractors/loops.py Outdated
Comment thread capa/features/extractors/binexport2/__init__.py Outdated
Comment thread capa/features/extractors/binexport2/__init__.py Outdated
@github-actions github-actions Bot dismissed stale reviews from themself May 15, 2026 14:57

CHANGELOG updated or no update needed, thanks! 😄

@larchchen larchchen force-pushed the inside-loop-characteristics branch 2 times, most recently from 66d9ff9 to 9b38843 Compare May 15, 2026 15:17
@mike-hunhoff
Copy link
Copy Markdown
Collaborator

Thank you for iterating on this! After reviewing the implementation and thinking about how to best align this with capa's architecture and performance constraints, I suggest a unified path forward that simplifies the rule authoring experience and minimizes performance overhead.

Here is the suggested implementation strategy:

1. Reuse the existing loop characteristic

Instead of introducing a new name like inside loop at the basic block scope, let's just yield the existing Characteristic("loop") at the instruction scope for instructions contained within a cycle.

  • Why: This avoids inflating the rule vocabulary and maintains perfect backward compatibility. By extracting it at the narrowest scope (instruction), it will automatically bubble up to basic block and function scopes properly.

2. Perform Graph Analysis Once Per Function (Performance)

To avoid the heavy performance penalty of running networkx graph analysis repeatedly, I suggest computing the loop vertices once at the function level and sharing them with the instruction extractor:

  • In the function extractor (where we already build edges for the current has_loop check), use your SCC logic to extract the set of basic block addresses that participate in cycles.
  • Store this set in the transient FunctionHandle.ctx dictionary (e.g., fh.ctx["cyclic_blocks"]).
  • In the instruction extractor, simply check if the instruction's parent basic block address is in that set. This is a fast $O(1)$ lookup with no heavy object creation at the instruction level, and the memory is freed as soon as capa moves to the next function.

3. Cross-Backend Consistency

Since capa rules strive to be backend-agnostic, this pattern should be applied across all static backends (Vivisect, Ghidra, IDA, etc.) that already support the function-level loop characteristic.

This approach fits perfectly into capa's style of using transient handle contexts for performance optimization while keeping the rule language intuitive. Let me know what you think!

@larchchen larchchen force-pushed the inside-loop-characteristics branch from 9b38843 to 80384c8 Compare May 18, 2026 09:04
@larchchen larchchen changed the title Add "inside loop" feature to basic blocks in Binexport2 extractor. Add "loop" feature to instructions in static extractors. May 18, 2026
@larchchen larchchen force-pushed the inside-loop-characteristics branch from 80384c8 to a3c4cf1 Compare May 18, 2026 09:07
Detecting API happening in a loop is an effective approach for exploits leveraging racing conditions.

A classic example is DirtyCow (CVE-2016-5195)
By detecting madvise calls inside a loop with MADV_DONTNEED argument.

```
  scopes:
    static: basic block
  features:
    - and:
      - api: madvise
      - number: 4 # Constant for MADV_DONTNEED
      - characteristic: inside loop
```

A more recent example CVE-2024-50066 can be covered by

```
  - and:
    - api: madvise
    - number: 25 # MADV_COLLAPSE
    - characteristic: inside loop
```
@larchchen larchchen force-pushed the inside-loop-characteristics branch from a3c4cf1 to cbc7287 Compare May 18, 2026 09:10
@larchchen
Copy link
Copy Markdown
Contributor Author

Thanks for your comments @mike-hunhoff

I have made changes according to you suggestions, maybe you want to take a look again.

However, I have a few more questions:

  1. If we are firing a "loop" characteristics for every instruction in a loop, wouldn't it resulting into too large feature spaces to match as well?
  2. The instruction feature generation runs before the function feature generation inside static.py, so I need to calculate the cyclic_blocks at the time of FunctionContext __post_init__ instead of at the function iteration.

If it looks good to you, I'll make the similar changes to other extractors.

Copy link
Copy Markdown
Collaborator

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id be curious to see some examples of rules that use this new feature

os: set[str]
arch: set[str]

def __post_init__(self):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we haven't used post_init yet in this codebase so i wonder if you could do this in a way that's more consistent.

perhaps via a constructor or from_foo(...) classmethod.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this could be placed into AnalysisContext (as a dict keyed by vertex)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think put it into AnalysisContext is a good idea for now, as AnalysisContext is more about global data can be repeated used, but this one is only used within a function scope. Depending on the program complexity, the data may not be very small if we keep the all loops information across the entire scan life-span.

Keeping it inside FunctionContext will just let the data in RAM within single function feature generation, and will be GC-ed right after.

Although I feel post_init is kinda standard approach for dataclass, I can also just put the code inside get_functions of BinExport2FeatureExtractor, make it part of the intialiser list.

How about that?

@larchchen
Copy link
Copy Markdown
Contributor Author

id be curious to see some examples of rules that use this new feature

I gave out some snippets of rule examples in the first comment. How does it look to you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants