fix: prevent processing 3rd party maintainers [CM-1097]#4035

Open
mbani01 wants to merge 4 commits into main from fix/maintainers_from_thrid_party

@mbani01 mbani01 commented Apr 17, 2026

This pull request reduces false maintainer detections by identifying and excluding third-party and vendored files. New rules flag files that are likely bundled dependencies or unrelated to project governance, and they are applied both in the backend logic and in the extraction prompt.

Third-party and vendor file detection and exclusion:

  • Added a comprehensive set of directory names (THIRD_PARTY_DIR_EXACT) and a regular expression (_VERSION_DIR_RE) to identify third-party, vendored, or versioned directories in file paths. Also introduced a maximum path depth (MAX_NON_GOVERNANCE_DEPTH) for files without governance keywords to further filter out likely third-party files.
  • Implemented the _is_third_party_path class method to encapsulate the logic for detecting third-party or vendor file paths using the above rules.
  • Integrated the _is_third_party_path check into the analyze_and_build_result method, so that files matching any exclusion rule are skipped and logged, raising a NO_MAINTAINER_FOUND error.
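
The diff itself is not reproduced on this page, so the following is only a minimal sketch of how the three rules could fit together, not the PR's actual implementation. The set contents are abbreviated, the regular expression is an assumed pattern for `X.Y` / `X.Y.Z`, the depth limit of 3 is inferred from the prompt wording ("more than 3 segments"), and keyword matching is assumed to be a substring check on the lowercased path.

```python
import re
from pathlib import PurePosixPath

# Abbreviated, hypothetical values; the real sets in the PR are larger.
THIRD_PARTY_DIR_EXACT = {"vendor", "node_modules", "third_party", "bundled", "deps"}
GOVERNANCE_PATH_KEYWORDS = {"maintainer", "codeowner", "owner", "governance"}
# Assumed pattern: a directory name containing X.Y (which also covers X.Y.Z).
_VERSION_DIR_RE = re.compile(r"\d+\.\d+")
MAX_NON_GOVERNANCE_DEPTH = 3  # assumption based on "more than 3 segments"


def is_third_party_path(path: str) -> bool:
    """Return True if the path matches any of the three exclusion rules."""
    lowered = path.lower()
    parts = PurePosixPath(lowered).parts
    dirs = parts[:-1]  # every segment except the filename

    # Rule 1: a known vendor/dependency directory, or a *.dist-info directory.
    if any(d in THIRD_PARTY_DIR_EXACT or d.endswith(".dist-info") for d in dirs):
        return True

    # Rule 2: a directory name that embeds a version number.
    if any(_VERSION_DIR_RE.search(d) for d in dirs):
        return True

    # Rule 3: a deep path with no governance keyword anywhere in it.
    if len(parts) > MAX_NON_GOVERNANCE_DEPTH:
        if not any(keyword in lowered for keyword in GOVERNANCE_PATH_KEYWORDS):
            return True

    return False
```

Under these assumptions, `vendor/google.golang.org/grpc/MAINTAINERS.md` is rejected by Rule 1, while a root-level `MAINTAINERS.md` and a deep path such as `docs/governance/team/MAINTAINERS.md` are both kept.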

Extraction prompt improvements:

  • Updated the maintainer extraction prompt in get_extraction_prompt to explicitly instruct the model to check for third-party/vendor file paths first, with detailed rules and examples for exclusion. This ensures the extraction process aligns with the backend logic and avoids extracting maintainers from irrelevant files.

Governance and scoring keyword updates:

  • Added contributing.md to the list of recognized governance filenames and contributing to the set of governance stems, improving coverage of legitimate maintainer files.

Dependency update:

  • Imported the re module to support regular expression matching for versioned directories.

Note

Medium Risk
Introduces new heuristics that change which files are analyzed, which could inadvertently skip legitimate governance files in deeply nested or unusually named paths.

Overview
Prevents maintainer extraction from vendored/third-party files by adding path-based heuristics (known dependency directory names, versioned directories, and deep paths lacking governance keywords) and skipping those candidates early in analyze_and_build_result with a NO_MAINTAINER_FOUND outcome.

Updates the LLM extraction prompt to apply the same third-party/path rejection rules first (with explicit rules/examples) and expands filename/stem matching to include contributing/contributing.md as potential governance indicators.

Reviewed by Cursor Bugbot for commit d2a588f. Bugbot is set up for automated code reviews on this repo.

mbani01 added 4 commits April 17, 2026 17:42
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 requested a review from joanagmaia April 17, 2026 16:48
@mbani01 mbani01 self-assigned this Apr 17, 2026
Copilot AI review requested due to automatic review settings April 17, 2026 16:48
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Copilot AI left a comment

Pull request overview

This PR aims to reduce false maintainer extraction by adding heuristics to detect and skip third-party/vendored files (both in backend filtering and in the LLM extraction prompt), and expands governance filename/stem matching to include CONTRIBUTING files.

Changes:

  • Added _is_third_party_path() with directory-name, versioned-directory, and “deep path without governance keywords” rejection rules, and applied it in analyze_and_build_result().
  • Expanded governance matching by adding contributing.md to known paths and contributing to governance stems.
  • Updated the LLM extraction prompt to require an upfront third-party path check with rules and examples.

Comment on lines 70 to 74
".github/codeowners",
"security-insights.md",
"readme.md",
"contributing.md",
}

Copilot AI Apr 17, 2026

contributing.md/contributing are added to the governance match sets, but _ripgrep_search() later excludes basenames in EXCLUDED_FILENAMES (which includes contributing.md and contributing). As a result, CONTRIBUTING files still won’t be discovered as candidates, so this change won’t improve coverage as intended. Consider either removing CONTRIBUTING from EXCLUDED_FILENAMES or not adding it to KNOWN_PATHS/GOVERNANCE_STEMS (and update the PR description accordingly).
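
The conflict described in this comment can be caught mechanically. The sketch below is illustrative only: the three sets are abbreviated, hypothetical copies of the constants named above (`changelog.md` and the `maintainers`/`codeowners` stems are placeholders, not taken from the PR), and `find_conflicts` is a made-up helper name.

```python
from pathlib import PurePosixPath

# Abbreviated, hypothetical copies of the constants named in the review comment.
KNOWN_PATHS = {".github/codeowners", "security-insights.md", "readme.md", "contributing.md"}
GOVERNANCE_STEMS = {"maintainers", "codeowners", "contributing"}
EXCLUDED_FILENAMES = {"contributing.md", "contributing", "changelog.md"}


def find_conflicts() -> set:
    """Return names that are advertised as governance files but also excluded."""
    known_basenames = {PurePosixPath(p).name for p in KNOWN_PATHS}
    return (known_basenames | GOVERNANCE_STEMS) & EXCLUDED_FILENAMES


# A non-empty result means the search step would filter out files that
# the match sets claim to recognize.
conflicts = find_conflicts()
```

With these sample values, `conflicts` contains both `contributing.md` and `contributing`. A unit test asserting that the intersection is empty would flag this kind of drift before merge.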

Comment on lines +180 to +190
GOVERNANCE_PATH_KEYWORDS = {
"maintainer",
"codeowner",
"owner",
"contributor",
"governance",
"committer",
"reviewer",
"approver",
"emeritus",
}

Copilot AI Apr 17, 2026

GOVERNANCE_PATH_KEYWORDS is used to exempt deep paths from the third-party filter, but it doesn’t include contributing even though this PR adds contributing to governance stems/known paths. This can cause deep CONTRIBUTING-related governance paths (e.g., docs/contributing/...) to be treated as third-party under Rule 3. Consider adding contributing (and any other governance stems you expect in paths) to GOVERNANCE_PATH_KEYWORDS to keep the heuristics consistent.
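
To illustrate the gap, here is a small sketch of Rule 3 in isolation. It assumes the keyword check is a substring match on the lowercased path and that the depth limit is 3; neither detail is confirmed by the diff shown here, and `rejected_by_depth_rule` is a hypothetical helper. Note that `contributor` is not a substring of `contributing`, so the existing keyword does not cover it.

```python
MAX_NON_GOVERNANCE_DEPTH = 3  # assumption based on "more than 3 segments"

# Keyword set as quoted in the diff context above.
KEYWORDS_IN_PR = {
    "maintainer", "codeowner", "owner", "contributor", "governance",
    "committer", "reviewer", "approver", "emeritus",
}
# The set with the suggested addition.
KEYWORDS_SUGGESTED = KEYWORDS_IN_PR | {"contributing"}


def rejected_by_depth_rule(path: str, keywords: set) -> bool:
    """Rule 3 only: deep path that contains none of the given keywords."""
    lowered = path.lower()
    depth = len(lowered.split("/"))
    has_keyword = any(keyword in lowered for keyword in keywords)
    return depth > MAX_NON_GOVERNANCE_DEPTH and not has_keyword


path = "docs/contributing/process/CONTRIBUTING.md"
before = rejected_by_depth_rule(path, KEYWORDS_IN_PR)      # rejected as third-party
after = rejected_by_depth_rule(path, KEYWORDS_SUGGESTED)   # exempt from Rule 3
```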

Comment on lines +375 to +392
- **Third-Party Check (MANDATORY — evaluate FIRST)**: Examine the **full file path** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these three rules match:

**Rule 1 — Vendor/dependency directory**: reject if any directory in the path is one of:
`vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`.

**Rule 2 — Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages.

**Rule 3 — Deep path without governance keyword**: reject if the path has more than 3 segments (e.g. `a/b/c/file`) AND does not contain any governance keyword (maintainer, codeowner, owner, contributor, governance, committer, reviewer, approver, emeritus). Deep files without governance keywords are typically unrelated to project governance.

**Examples of paths that MUST be rejected:**
- `vendor/google.golang.org/grpc/MAINTAINERS.md` (Rule 1: vendor)
- `node_modules/tunnel/README.md` (Rule 1: node_modules)
- `bundled/taskflow-3.10.0/README.md` (Rule 1: bundled + Rule 2: version)
- `gui-editors/gui-editor-apex/src/main/webapp/dist/js/jquery-ui-1.12.1/AUTHORS.txt` (Rule 2: version)
- `src/java.desktop/share/native/libsplashscreen/libpng/README` (Rule 3: deep, no governance keyword)
- `web/static-dist/libs/bootstrap/README.md` (Rule 3: deep, no governance keyword)
- `css/bootstrap/README.md` (Rule 3: depth=3 is fine, but this is actually a third-party asset — use your judgment)

Copilot AI Apr 17, 2026

The prompt’s vendor/third-party rules are not aligned with the backend _is_third_party_path() implementation:

  • Rule 1’s directory list is missing entries that the backend rejects (e.g. externallibs, bower_components_external).
  • The prompt includes Pods/Godeps with capital letters, while the backend lowercases paths before matching.
  • The css/bootstrap/README.md example says to “use your judgment” even though it doesn’t match any of the three mandatory reject rules, which conflicts with “MANDATORY — evaluate FIRST” + “MUST return … immediately”.
    To avoid model/backend divergence, make the prompt list/casing match the code exactly and remove (or rework) the “use your judgment” example so it doesn’t contradict the stated rules.
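
One way to keep the two in sync is to render the prompt's Rule 1 list from the same constant the backend uses, rather than maintaining it by hand. The following is a rough sketch under that assumption; `render_rule_1` is a hypothetical helper, the set is an abbreviated sample, and the rule wording is paraphrased rather than copied from the PR.

```python
# Abbreviated, hypothetical copy of the backend set (already lowercase,
# since the backend lowercases paths before matching).
THIRD_PARTY_DIR_EXACT = {"vendor", "node_modules", "pods", "godeps", "externallibs"}


def render_rule_1(dirs: set) -> str:
    """Build the Rule 1 prompt text from the backend's directory set."""
    names = ", ".join(f"`{d}`" for d in sorted(dirs))
    return (
        "**Rule 1 (vendor/dependency directory)**: reject if any lowercased "
        f"directory in the path is one of: {names}, or ends with `.dist-info`."
    )


rule_1_text = render_rule_1(THIRD_PARTY_DIR_EXACT)
```

Because the text is derived from the set, an entry added to the backend (such as `externallibs`) appears in the prompt automatically, and the casing always matches what the code compares against.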

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit d2a588f.

".github/codeowners",
"security-insights.md",
"readme.md",
"contributing.md",

contributing.md contradicts existing exclusion rules

Medium Severity

Adding contributing.md to KNOWN_PATHS and contributing to GOVERNANCE_STEMS directly contradicts EXCLUDED_FILENAMES, which still contains both entries. The _ripgrep_search method filters out any file whose basename is in EXCLUDED_FILENAMES, so contributing.md can never be discovered as a candidate. Additionally, KNOWN_PATHS is passed as example governance files to the LLM file-detection prompt, which simultaneously says "CONTRIBUTING.md must ALWAYS be ignored." This creates a self-contradictory LLM prompt and makes the KNOWN_PATHS addition effectively dead code. It also disables section extraction for contributing.md if it reaches analysis via a saved file path, sending the full (mostly irrelevant) contributing guide to the LLM.


