fix: prevent processing 3rd party maintainers [CM-1097] #4035
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Pull request overview
This PR aims to reduce false maintainer extraction by adding heuristics to detect and skip third-party/vendored files (both in backend filtering and in the LLM extraction prompt), and expands governance filename/stem matching to include CONTRIBUTING files.
Changes:
- Added `_is_third_party_path()` with directory-name, versioned-directory, and “deep path without governance keywords” rejection rules, and applied it in `analyze_and_build_result()`.
- Expanded governance matching by adding `contributing.md` to known paths and `contributing` to governance stems.
- Updated the LLM extraction prompt to require an upfront third-party path check with rules and examples.
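For orientation, the three rejection rules summarized above could be sketched roughly as follows. This is a minimal illustration and not the PR's code: the constant names follow those mentioned in the PR description, but the directory set is abbreviated, the regular expression and the depth comparison are assumptions, and the check is written as a free function rather than a class method.

```python
import re
from pathlib import PurePosixPath

# Abbreviated, illustrative subset; the set in the PR is longer.
THIRD_PARTY_DIR_EXACT = {"vendor", "node_modules", "third_party", "bundled", "deps"}
GOVERNANCE_PATH_KEYWORDS = {
    "maintainer", "codeowner", "owner", "contributor", "governance",
    "committer", "reviewer", "approver", "emeritus",
}
# Assumed pattern: X.Y or X.Y.Z anywhere inside a directory name.
_VERSION_DIR_RE = re.compile(r"\d+\.\d+(\.\d+)?")
# Assumed value: paths with more than 3 segments count as "deep".
MAX_NON_GOVERNANCE_DEPTH = 3


def is_third_party_path(path: str) -> bool:
    """Return True if the path looks vendored under any of the three rules."""
    lowered = path.lower()
    parts = PurePosixPath(lowered).parts
    dirs = parts[:-1]  # every segment except the filename

    # Rule 1: a directory is a known vendor/dependency directory name.
    if any(d in THIRD_PARTY_DIR_EXACT or d.endswith(".dist-info") for d in dirs):
        return True
    # Rule 2: a directory name carries a version number.
    if any(_VERSION_DIR_RE.search(d) for d in dirs):
        return True
    # Rule 3: deep path with no governance keyword anywhere in it.
    has_keyword = any(k in lowered for k in GOVERNANCE_PATH_KEYWORDS)
    return len(parts) > MAX_NON_GOVERNANCE_DEPTH and not has_keyword
```

Under these assumptions, a root-level `MAINTAINERS.md` passes, while `vendor/google.golang.org/grpc/MAINTAINERS.md` is rejected by Rule 1 before the other rules are evaluated.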
| ".github/codeowners", | ||
| "security-insights.md", | ||
| "readme.md", | ||
| "contributing.md", | ||
| } |
`contributing.md`/`contributing` are added to the governance match sets, but `_ripgrep_search()` later excludes basenames in `EXCLUDED_FILENAMES` (which includes `contributing.md` and `contributing`). As a result, CONTRIBUTING files still won’t be discovered as candidates, so this change won’t improve coverage as intended. Consider either removing CONTRIBUTING from `EXCLUDED_FILENAMES` or not adding it to `KNOWN_PATHS`/`GOVERNANCE_STEMS` (and update the PR description accordingly).
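A contradiction like this can be caught with a small consistency check between the match sets and the exclusion set. The sketch below is hypothetical: the set contents are abbreviated stand-ins based on the names in this comment (the `maintainers` and `codeowners` stems are illustrative), and `find_conflicts` is an invented helper, not something in the codebase.

```python
from pathlib import PurePosixPath

# Illustrative stand-ins; the real sets in the repository are longer.
KNOWN_PATHS = {".github/codeowners", "security-insights.md", "readme.md", "contributing.md"}
GOVERNANCE_STEMS = {"maintainers", "codeowners", "contributing"}
EXCLUDED_FILENAMES = {"contributing.md", "contributing"}


def find_conflicts(known_paths, stems, excluded):
    """Return names that are both matched as governance and excluded from search."""
    known_basenames = {PurePosixPath(p).name for p in known_paths}
    return sorted((known_basenames | set(stems)) & set(excluded))


conflicts = find_conflicts(KNOWN_PATHS, GOVERNANCE_STEMS, EXCLUDED_FILENAMES)
print(conflicts)  # ['contributing', 'contributing.md']
```

Run as a unit test, a non-empty result would flag the change before merge, whichever of the two suggested resolutions is chosen.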
| GOVERNANCE_PATH_KEYWORDS = { | ||
| "maintainer", | ||
| "codeowner", | ||
| "owner", | ||
| "contributor", | ||
| "governance", | ||
| "committer", | ||
| "reviewer", | ||
| "approver", | ||
| "emeritus", | ||
| } |
`GOVERNANCE_PATH_KEYWORDS` is used to exempt deep paths from the third-party filter, but it doesn’t include `contributing` even though this PR adds `contributing` to governance stems/known paths. This can cause deep CONTRIBUTING-related governance paths (e.g., `docs/contributing/...`) to be treated as third-party under Rule 3. Consider adding `contributing` (and any other governance stems you expect in paths) to `GOVERNANCE_PATH_KEYWORDS` to keep the heuristics consistent.
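The effect can be reproduced in isolation with a sketch of Rule 3. The helper below is illustrative only: it assumes substring matching on the lowercased path and a depth limit of 3 segments, and the example path is hypothetical.

```python
GOVERNANCE_PATH_KEYWORDS = {
    "maintainer", "codeowner", "owner", "contributor", "governance",
    "committer", "reviewer", "approver", "emeritus",
}
MAX_NON_GOVERNANCE_DEPTH = 3  # assumed: more than 3 segments counts as "deep"


def rejected_by_rule_3(path: str, keywords: set) -> bool:
    """Rule 3 only: deep path with no governance keyword (substring match assumed)."""
    lowered = path.lower()
    depth = len([segment for segment in lowered.split("/") if segment])
    return depth > MAX_NON_GOVERNANCE_DEPTH and not any(k in lowered for k in keywords)


path = "docs/contributing/guides/README.md"  # hypothetical deep governance path

# "contributing" does not contain "contributor", so no keyword matches today.
before = rejected_by_rule_3(path, GOVERNANCE_PATH_KEYWORDS)
# With the suggested keyword added, the same path is exempt from Rule 3.
after = rejected_by_rule_3(path, GOVERNANCE_PATH_KEYWORDS | {"contributing"})
```

Here `before` is `True` and `after` is `False`, which is the inconsistency this comment describes.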
| - **Third-Party Check (MANDATORY — evaluate FIRST)**: Examine the **full file path** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these three rules match: | ||
| **Rule 1 — Vendor/dependency directory**: reject if any directory in the path is one of: | ||
| `vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`. | ||
| **Rule 2 — Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages. | ||
| **Rule 3 — Deep path without governance keyword**: reject if the path has more than 3 segments (e.g. `a/b/c/file`) AND does not contain any governance keyword (maintainer, codeowner, owner, contributor, governance, committer, reviewer, approver, emeritus). Deep files without governance keywords are typically unrelated to project governance. | ||
| **Examples of paths that MUST be rejected:** | ||
| - `vendor/google.golang.org/grpc/MAINTAINERS.md` (Rule 1: vendor) | ||
| - `node_modules/tunnel/README.md` (Rule 1: node_modules) | ||
| - `bundled/taskflow-3.10.0/README.md` (Rule 1: bundled + Rule 2: version) | ||
| - `gui-editors/gui-editor-apex/src/main/webapp/dist/js/jquery-ui-1.12.1/AUTHORS.txt` (Rule 2: version) | ||
| - `src/java.desktop/share/native/libsplashscreen/libpng/README` (Rule 3: deep, no governance keyword) | ||
| - `web/static-dist/libs/bootstrap/README.md` (Rule 3: deep, no governance keyword) | ||
| - `css/bootstrap/README.md` (Rule 3: depth=3 is fine, but this is actually a third-party asset — use your judgment) | ||
The prompt’s vendor/third-party rules are not aligned with the backend `_is_third_party_path()` implementation:
- Rule 1’s directory list is missing entries that the backend rejects (e.g. `externallibs`, `bower_components_external`).
- The prompt includes `Pods`/`Godeps` with capital letters, while the backend lowercases paths before matching.
- The `css/bootstrap/README.md` example says to “use your judgment” even though it doesn’t match any of the three mandatory reject rules, which conflicts with “MANDATORY — evaluate FIRST” + “MUST return … immediately”.

To avoid model/backend divergence, make the prompt list/casing match the code exactly and remove (or rework) the “use your judgment” example so it doesn’t contradict the stated rules.
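One way to keep the two in step is to render the prompt's Rule 1 text from the same constant the backend matches against, so the list and its casing cannot drift. The sketch below is an assumption about how that could look: the set is an abbreviated stand-in and `render_rule_1` is a hypothetical helper, not part of the PR.

```python
# Abbreviated stand-in for the backend constant; entries are lowercase
# because the backend is described as lowercasing paths before matching.
THIRD_PARTY_DIR_EXACT = frozenset({
    "vendor", "node_modules", "pods", "godeps",
    "externallibs", "bower_components_external",
})


def render_rule_1(dir_names) -> str:
    """Build the Rule 1 prompt text directly from the backend's directory set."""
    listed = ", ".join(f"`{name}`" for name in sorted(dir_names))
    return (
        "**Rule 1 (vendor/dependency directory)**: lowercase the path, then "
        f"reject if any directory is one of: {listed}, or ends with `.dist-info`."
    )


rule_1_text = render_rule_1(THIRD_PARTY_DIR_EXACT)
```

The prompt template would then interpolate `rule_1_text` instead of a hand-maintained list, leaving only the examples to review by hand.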
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d2a588f.
| ".github/codeowners", | ||
| "security-insights.md", | ||
| "readme.md", | ||
| "contributing.md", |
`contributing.md` contradicts existing exclusion rules
Medium Severity
Adding `contributing.md` to `KNOWN_PATHS` and `contributing` to `GOVERNANCE_STEMS` directly contradicts `EXCLUDED_FILENAMES`, which still contains both entries. The `_ripgrep_search` method filters out any file whose basename is in `EXCLUDED_FILENAMES`, so `contributing.md` can never be discovered as a candidate. Additionally, `KNOWN_PATHS` is passed as example governance files to the LLM file-detection prompt, which simultaneously says "CONTRIBUTING.md must ALWAYS be ignored." This creates a self-contradictory LLM prompt and makes the `KNOWN_PATHS` addition effectively dead code. It also disables section extraction for `contributing.md` if it reaches analysis via a saved file path, sending the full (mostly irrelevant) contributing guide to the LLM.


This pull request enhances the accuracy and reliability of maintainer extraction by introducing robust detection and exclusion of third-party and vendored files. The changes add new logic and rules to identify and skip files that are likely to be bundled dependencies or unrelated to project governance, both in the backend logic and the extraction prompt. This helps prevent false maintainer detections from third-party code and clarifies the extraction process.
Third-party and vendor file detection and exclusion:
- Added a set of directory names (`THIRD_PARTY_DIR_EXACT`) and a regular expression (`_VERSION_DIR_RE`) to identify third-party, vendored, or versioned directories in file paths. Also introduced a maximum path depth (`MAX_NON_GOVERNANCE_DEPTH`) for files without governance keywords to further filter out likely third-party files.
- Added the `_is_third_party_path` class method to encapsulate the logic for detecting third-party or vendor file paths using the above rules.
- Integrated the `_is_third_party_path` check into the `analyze_and_build_result` method, so that files matching any exclusion rule are skipped and logged, raising a `NO_MAINTAINER_FOUND` error.

Extraction prompt improvements:
- Updated `get_extraction_prompt` to explicitly instruct the model to check for third-party/vendor file paths first, with detailed rules and examples for exclusion. This ensures the extraction process aligns with the backend logic and avoids extracting maintainers from irrelevant files.

Governance and scoring keyword updates:
- Added `contributing.md` to the list of recognized governance filenames and `contributing` to the set of governance stems, improving coverage of legitimate maintainer files.

Dependency update:
- Imported the `re` module to support regular expression matching for versioned directories.

Note
Medium Risk
Introduces new heuristics that change which files are analyzed, which could inadvertently skip legitimate governance files in deeply nested or unusually named paths.
Overview
Prevents maintainer extraction from vendored/third-party files by adding path-based heuristics (known dependency directory names, versioned directories, and deep paths lacking governance keywords) and skipping those candidates early in `analyze_and_build_result` with a `NO_MAINTAINER_FOUND` outcome.
Updates the LLM extraction prompt to apply the same third-party/path rejection rules first (with explicit rules/examples) and expands filename/stem matching to include `contributing`/`contributing.md` as potential governance indicators.
Reviewed by Cursor Bugbot for commit d2a588f. Bugbot is set up for automated code reviews on this repo.