chore(ai): Add check-code-attribution skill (JAVA-499)#5449
chore(ai): Add check-code-attribution skill (JAVA-499)#54490xadam-brown wants to merge 1 commit into
Conversation
📲 Install BuildsAndroid
|
Adds a check-code-attribution skill that validates license headers + THIRD_PARTY_NOTICES.md entries for code copied or adapted from third parties. Also verifies license compatiblity against Sentry's licensing policy. Focus is limited to the branch diff. Reports any issues found via PR comments (when run on CI) or to the terminal (when run locally). To run it in Claude Code: ``` /check-code-attribution ``` Runs on CI automatically via [Warden](https://warden.sentry.dev/). - Purely advisory / does not block merge. - Generates PR comments with code suggestions for all discovered issues. - Automatically manages removing stale comments as PRs are updated. Current Warden configs: ┌─────────────────┬─────────────────────────────┬───────────────────────────────────────────────────┐ │ Setting │ Value │ Effect │ ├─────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤ │ model │ anthropic/claude-sonnet-4-6 │ Model used for analysis │ ├─────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤ │ maxTurns │ 30 │ Max tool calls per chunk │ ├─────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤ │ skill │ check-code-attribution │ Per-file vendored code attribution check │ ├─────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤ │ failOn │ off │ Do not fail workflow if attribution issues found │ ├─────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤ │ reportOn │ medium │ Show findings at >= medium severity via PR comment│ ├─────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤ │ requestChanges │ false │ Never post REQUEST_CHANGES comments on PRs │ ├─────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤ │ failCheck │ false │ No red X on workflow in GitHub UI if it fails │ ├─────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤ │ triggers │ pull_request + local │ Runs on PR open/sync and local warden invocations │ ├─────────────────┼─────────────────────────────┼───────────────────────────────────────────────────┤ │ reportOnSuccess │ false (default) │ No comment when everything is clean │ └─────────────────┴─────────────────────────────┴───────────────────────────────────────────────────┘ Going forward, we can consider blocking PRs once we've had a chance to vet behavior in the wild.
9ced7c7 to
93b92d7
Compare
| } else { | ||
| const reason = s.expectFinding | ||
| ? 'expected finding (>= medium), got none' | ||
| : `expected no finding (>= medium), got ${count}`; | ||
| console.log(`${RED}FAIL${RESET} ${s.id} (${reason})`); | ||
|
|
||
| failures.push({ |
There was a problem hiding this comment.
Bug: The validateExpected function only validates .java fixture files, ignoring other file types like .md. This can lead to silently passing tests if non-Java fixtures are missing.
Severity: MEDIUM
Suggested Fix
Modify the validateExpected function to check all files in the scenarios directory against the EXPECTED.json manifest, not just .java files. This ensures all test fixtures are accounted for during validation.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location:
.claude/skills/check-code-attribution/validation-tests/assert-scenarios.mjs#L299-L305
Potential issue: The validation logic in the `validateExpected` function is incomplete.
It filters for and validates only `.java` files found on disk against the
`EXPECTED.json` manifest. Consequently, other required test fixture files, such as
`THIRD_PARTY_NOTICES.mismatch-snippet.md`, are not validated. If such a file were
accidentally deleted or renamed, the validation step would pass silently. The test would
only fail later when a shell script attempts to access the non-existent file, making the
root cause harder to identify.
Did we get this right? 👍 / 👎 to inform future reviews.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 93b92d7. Configure here.
|
|
||
| ## Example — HeaderCompleteAndNoticePresent (Apache 2.0) | ||
|
|
||
| **Source:** https://github.com/example/complete-with-notices<br> |
There was a problem hiding this comment.
Catalog source URL mismatches scenario header URLs
Medium Severity
The header-complete-and-notice-present scenario expects no findings, but the catalog Source URL (https://github.com/example/complete-with-notices) doesn't match either URL in the Java file header (https://github.com/example, https://github.com/example/something). Since source URL is a required field and SKILL.md flags header-vs-NOTICES inconsistencies on required fields, the AI can intermittently report a finding here, making this negative test case flaky.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 93b92d7. Configure here.
| @@ -0,0 +1,21 @@ | |||
| /* | |||
There was a problem hiding this comment.
Most of this PR's diff is test fixtures for the automated tests (runnable via check-code-attribution-tests.sh).
The interesting (ie, non-test) parts live in SKILL.md and warden.toml.
runningcode
left a comment
There was a problem hiding this comment.
Thanks for the well thought out approach as well as all the tests. I think we don't have so much precedent for testing skills. Other repos use an LLM to judge that skills work correctly. I think that is a bit overkill here so I think your approach with the bash script is a good compromise. To be honest, I wasn't expecting any tests here.
I've approved this PR. You can consider all my comments to be nits and I'm happy to discuss them.
I wonder if we or someone has best practices on creating skills. In my head I imagined that skills should only describe "what" needs to be done and letting the LLM figure out the "how" rather than explaining the "how" in the skill. I've left some comments to that effect.
| - **Escape the period** after the number (`1\.` not `1.`) so markdown does not collapse entries into a tight list. | ||
| - Leave an empty line between each numbered finding. | ||
|
|
||
| ## Validation (maintainers) |
There was a problem hiding this comment.
Should validation of the skill itself be a part of the skill? I feel like this could pollute the context window.
|
|
||
| ### Warden CLI (optional local parity check) | ||
|
|
||
| Warden does **not** use Cursor auth. Before running Warden locally, configure a provider (same model family as `warden.toml`, or override with `-m`): |
There was a problem hiding this comment.
why do we need these instructions in this skill?
| For all other files, perform these checks **before** deciding whether to proceed: | ||
|
|
||
| 1. **Read the file header** — use the Read tool to read the first 50 lines of the file. Look for vendored-code signals: `Copyright`, `Licensed under`, `SPDX-License-Identifier`, or vendoring language ("adapted from", "backported from", "based on", "copied from", "derived from", "inspired by", "ported from", "translated from", "vendored"). | ||
| 2. **Check THIRD_PARTY_NOTICES.md** — use Grep to search `THIRD_PARTY_NOTICES.md` for the file name without extension (e.g., search for `ANRWatchDog` when reviewing `ANRWatchDog.java`). A match means this is a known vendored file. **Renames:** if the diff is a rename (`similarity index` / `rename from` in the diff, or a delete of one path and add of another with the same content), also Grep for the **old** basename and read **Scope** sections in matching entries — NOTICES may still reference the previous class or path name. |
There was a problem hiding this comment.
nit: I don't think we should be presciptive about what tools to use
| **Mandatory on every run (do not skip):** | ||
|
|
||
| 1. `Read` the first 50 lines of the changed file. | ||
| 2. `Grep` `THIRD_PARTY_NOTICES.md` for the class name (filename without extension, e.g. `ANRWatchDog` for `ANRWatchDog.java`). On renames, also grep the old basename and read Scope sections (see Quick triage). |
There was a problem hiding this comment.
nit: i don't think we should be prescriptive about what tools to use. (like git)
| allowed-tools: Bash Read Grep Glob | ||
| --- | ||
|
|
||
| **Maintainers:** Only edit files in `.claude/skills/check-code-attribution` (the committed file) and run `npx @sentry/dotagents sync` from the command line to automatically update the matching files in `.agents/skills/check-code-attribution`. |
There was a problem hiding this comment.
I think this line should be in the README.md not SKILL.md otherwise we are polluting the context window.
|
|
||
| You are reviewing changed files for third-party code attribution compliance in **sentry-java**, an MIT-licensed repository. | ||
|
|
||
| ## Local runs — discover changed files first |
There was a problem hiding this comment.
Do we really need the local use case? I feel like it adds a lot of tokens.
I would ask the same thing about the warden CLI. What happens if you don't have these sections?
| # globally but are tuned for check-code-attribution. Attribution checks need the full | ||
| # file header and a NOTICES cross-check — not isolated diff hunks. | ||
| [[defaults.chunking.filePatterns]] | ||
| pattern = "**/*.api" |
There was a problem hiding this comment.
Why don't we want warden looking at .api?
| # Tighten to failOn = "medium" / requestChanges = true once the false-positive baseline is established. | ||
| failOn = "off" | ||
| reportOn = "medium" | ||
| ignorePaths = [ |
There was a problem hiding this comment.
should we ignore the build directory?
| ] | ||
|
|
||
| [[skills.triggers]] | ||
| type = "pull_request" |
There was a problem hiding this comment.
do these change the defaults?
| version = 1 | ||
|
|
||
| [defaults] | ||
| model = "anthropic/claude-sonnet-4-6" |
There was a problem hiding this comment.
is this a change from the default?


📜 Description
Adds a
check-code-attributionskill that validates license headers andTHIRD_PARTY_NOTICES.mdentries for code copied or adapted from third parties. Also verifies license compatibility against Sentry's Licensing Policy.The skill focuses on the branch diff only. It's a pure-LLM approach, in contrast to the part-deterministic, part-LLM approach we decided against from #5401.
Reports findings via PR comments when run on CI, or to the terminal when run locally.
Local
To run it from Claude Code:
CI
Warden configuration ensures the skill runs automatically on all PRs:
💡 Motivation and Context
Third-party code attribution is a legal and compliance requirement. Currently, attribution correctness is only caught during manual code review. This skill automates detection of vendored code in branch diffs and can help us flag missing or incomplete attributions before a PR is merged.
Background: Click to expand
Sentry SDKs and third-party code
3 possible ways third-party code enters Sentry’s SDKs (including sentry-java):
1. Plain vanilla dependencies
2. Shaded code
3. Vendored code
All third-party code must be properly attributed, and licenses must be compatible with Sentry’s licensing policies.
Plain deps + shaded code: We run an
enforce-license-complianceGitHub workflow that applies a FOSSA check to all plain vanilla dependencies and our few shaded dependencies, which ensures their licenses are properly attributed and are compatible with Sentry’s licensing policies.Vendored code: Relies on a manual process where developers add attributions to files containing vendored code + include a corresponding entry is included in the THIRD_PARTY_NOTICES.md file that ships with the SDK. Developers are also responsible for ensuring license compatibility.
The criteria for what counts as a proper attribution of vendored code lives in the AGENTS.md file under the heading “Third-Party Code Attribution”.
Goal of this PR: Create a skill that helps us properly attribute vendored code
Types of vendored code:
The skill introduced in this PR protects (1) from regression and identifies instances of (2). (Addressing (3) is out of scope – and is obviously non-trivial.)
Skill does not mandate that license headers exactly match the template from
AGENTS.md(link) so long as all template fields are present.That^^ lets us maintain our current, diverse header formats and remain relatively unopinionated going forward.
Example output
Local runs
PR comments
💚 How did you test it?
[1] Automated validation tests (
check-code-attribution-tests.sh) with scenario files covering:Note: the tests are not run on CI / are only run manually atm (see the
check-code-attribution-tests.shscript).[2] Ran the skill on branches with known attribution issues to verify correct detection and reporting.
Manual tests + output: Click to expand
Note the skill's output format has changed since these tests were run, but the behavior remains the same.
Diff 1: Remove entire license header
Output 1
Required attribution field(s) removed:
Diff 2: Modify existing license header, but retain all required fields
Output 2
Vendored code detected (Apache Commons Collections) – verify that
THIRD_PARTY_NOTICES.mdreflects your updates.Diff 3: Modify existing license header by removing one or more required fields
Output 3
Required attribution field(s) removed:
Copyright (C) 2016 Matej Tymeswas removed from the license header. Please restore it.Diff 4: Leave existing license header unchanged, but make an inconsistent modification to THIRD_PARTY_NOTICES.md entry
Output 4
Entry metadata inconsistent with source file headers:
THIRD_PARTY_NOTICES.md, but the source files (QueueFile.java,FileObjectQueue.java,ObjectQueue.java) all still say "Copyright (C) 2010Square, Inc."
Diff 5: Leave existing license header unchanged, but remove THIRD_PARTY_NOTICES.md entry
Output 5
Source file(s) still reference this library:
-
io.sentry.android.core.SentryShakeDetectorstill contains attribution header for Square's Seismic. Either restore theTHIRD_PARTY_NOTICES.mdentry or remove the vendored code.Diff 6: Add newly-vendored code with valid license header and THIRD_PARTY_NOTICES.md entry
Output 6
Vendored code detected (Dropwizard Metrics SlidingWindowReservoir) – verify that
THIRD_PARTY_NOTICES.mdreflects your updates.Diff 7: Add newly-vendored code with valid license header but no THIRD_PARTY_NOTICES.md entry
Output 7
Vendored code detected (Caffeine Cache) — attribution header is complete.
THIRD_PARTY_NOTICES.md. An entry needs to be added.Diff 8: Add newly-vendored code with an invalid license header and existing THIRD_PARTY_NOTICES.md entry
Output 8
Vendored code detected (Resilience4j RateLimiter) — missing required fields:
Copyright 2019 Robert Winkler and Bohdan StorozhukandLicensed under the Apache License, Version 2.0(per the existing
THIRD_PARTY_NOTICES.mdentry).Diff 9: Add newly-vendored code with an invalid license header and no THIRD_PARTY_NOTICES.md entry
Output 9
Vendored code detected (Guava RateLimiter) — missing required fields:
THIRD_PARTY_NOTICES.md. An entry needs to be added for Guava RateLimiter.Diff 10: Add newly-vendored code with an invalid license header, no THIRD_PARTY_NOTICES.md entry, and a new license type
Output 10
Vendored code detected (compact-writer) — missing required fields:
THIRD_PARTY_NOTICES.md. An entry needs to be added.THIRD_PARTY_NOTICES.md. Please verify it is compatible with Sentry's licensing policies:https://open.sentry.io/licensing/.
Diff 11: False positive
Output 11 (numbered as “1.” because it’s the first entry in the False Positives section)
[3] Local Warden runs.
Running Warden locally: Click to expand
Added to my
.*profile:[4] Pushed up diffs with attribution violations in a draft PR to vet the UX (see #5444).
📝 Checklist
sendDefaultPIIis enabled.🔮 Next steps
failOn/failCheckonce we've vetted behavior in the wild#skip-changelog