fix(submissions): block cross-category website duplicates #2134

Merged
JSONbored merged 3 commits into main from codex/fix-website-url-duplicate-detection on Jun 13, 2026

Conversation

@JSONbored

@JSONbored JSONbored commented Jun 13, 2026

Owner

Motivation

  • A recent change narrowed the set of URL fields used for strict cross-category duplicate detection and inadvertently omitted websiteUrl/website_url. As a result, submissions with colliding canonical website URLs could escape deterministic strict-duplicate closure.
  • The intent is to restore deterministic blocking for submissions that share the same canonical website URL across non-collection categories while preserving related/collection behaviors.

Description

  • Added websiteUrl and website_url to CROSS_CATEGORY_STRICT_URL_FIELDS in apps/submission-gate/src/duplicates.ts so canonical website signals are included in cross-category strict duplicate checks.
  • Added a regression test it("treats same canonical website across different categories as a strict duplicate") to tests/submission-gate-worker.test.ts that validates a tools/mcp pair with the same normalized websiteUrl is treated as a strict duplicate and appears in related matches.
  • No other behavioral changes were made to URL_FIELDS, strictDuplicateUrls filtering, or multi-entry catalog logic.
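
As a rough illustration of the change described above, the sketch below shows how a field allow-list such as CROSS_CATEGORY_STRICT_URL_FIELDS might decide which URL fields feed strict cross-category checks. Only the constant name and the websiteUrl/website_url entries come from this PR. The repoUrl/repo_url entries, the Entry type, the docsUrl field, and the strictUrlsOf helper are assumptions made for illustration and are not taken from duplicates.ts.

```typescript
// Hypothetical entry shape; the real submission type is not shown in this PR.
type Entry = { category: string; fields: Record<string, string | undefined> };

// Only websiteUrl and website_url are confirmed by this PR.
// repoUrl and repo_url are placeholder assumptions standing in for
// whatever other fields the real constant already contained.
const CROSS_CATEGORY_STRICT_URL_FIELDS: ReadonlySet<string> = new Set([
  "repoUrl",
  "repo_url",
  "websiteUrl",
  "website_url",
]);

// Collect the values of every allow-listed URL field present on an entry.
function strictUrlsOf(entry: Entry): string[] {
  const urls: string[] = [];
  for (const [name, value] of Object.entries(entry.fields)) {
    if (value && CROSS_CATEGORY_STRICT_URL_FIELDS.has(name)) {
      urls.push(value);
    }
  }
  return urls;
}

const tool: Entry = {
  category: "tools",
  fields: {
    websiteUrl: "https://example.com/app",
    docsUrl: "https://example.com/docs", // assumed field, not in the allow-list
  },
};

// Only the website URL is collected, because docsUrl is not allow-listed here.
const toolStrictUrls = strictUrlsOf(tool);
```

Under these assumptions, leaving websiteUrl out of the set would make strictUrlsOf return an empty list for this entry, which is the kind of gap the PR describes closing.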

Testing

  • Ran the focused test subset with pnpm exec vitest run tests/submission-gate-worker.test.ts -t "canonical website|canonical project|shared official docs|neutral duplicate submissions" --reporter=dot, which reported the test file passed and the selected tests succeeded (1 file passed, 4 tests passed, 92 skipped).
  • Ran git diff --check to confirm the diff introduces no whitespace errors or leftover conflict markers; it succeeded.
  • The modified files are apps/submission-gate/src/duplicates.ts and tests/submission-gate-worker.test.ts; the added test exercises the fixed behavior.

Codex Task

Summary by CodeRabbit

  • New Features

    • Enhanced duplicate detection by expanding website URL comparison to work across different submission categories, preventing duplicate content.
  • Tests

    • Added test cases validating cross-category duplicate detection accuracy using website URL matching.

@dosubot dosubot Bot added the size:XS label (This PR changes 0-9 lines, ignoring generated files.) Jun 13, 2026
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 41eefb55-4f12-415c-bdb6-7ae2283f27a4

📥 Commits

Reviewing files that changed from the base of the PR and between c89748d and 3a646ea.

📒 Files selected for processing (2)
  • apps/submission-gate/src/duplicates.ts
  • tests/submission-gate-worker.test.ts

📝 Walkthrough

The PR extends strict duplicate detection to include the websiteUrl and website_url fields, enabling cross-category duplicate matching on canonical website URLs, and validates this functionality with a new test case covering tools and mcp categories.

Changes

Website URL Strict Duplicate Detection

| Layer / File(s) | Summary |
| --- | --- |
| Extend strict URL fields to include website URL (apps/submission-gate/src/duplicates.ts) | CROSS_CATEGORY_STRICT_URL_FIELDS is extended to include websiteUrl and website_url, allowing these URL field names to participate in cross-category strict duplicate detection. |
| Cross-category website URL duplicate detection test (tests/submission-gate-worker.test.ts) | New test case asserts that entries sharing the same canonical websiteUrl across different categories (tools and mcp) are classified as strict duplicates, with both strict-match detection and related-content matching confirming canonical-website-based reasoning. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • JSONbored/awesome-claude#2133: Introduces strictDuplicateUrls intersection logic and related/strict duplicate test coverage that this PR builds upon.
  • JSONbored/awesome-claude#981: Modifies duplicate detection logic around canonical URLs and category-based strict vs related matching decisions.
  • JSONbored/awesome-claude#1069: Updates strict duplicate handling and extends tests to validate canonical URL-based duplicate classification.

Suggested labels

size:M

Poem

🐰 A rabbit hops through fields of URLs so bright,
Adding websiteUrl to strict detection's might—
Across categories they dance, both tools and mcp,
Canonical matches found with perfect harmony! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description covers motivation, implementation details, and testing for a platform/code PR. However, it lacks explicit checklist items from the template, particularly the platform/code PR section checklist, making it incomplete against the required template structure. | Complete the required checklist items from the template, particularly the platform/code PR section (changed routes/components/endpoints/tools, screenshots/no visual impact, linked issue or no-issue rationale), and ensure all validation checkboxes are explicitly marked. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main fix: blocking cross-category website URL duplicates, which aligns directly with the core change adding websiteUrl/website_url to strict duplicate detection. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@superagent-security


Superagent didn't find any vulnerabilities or security issues in this PR.

@reviewwed

reviewwed Bot commented Jun 13, 2026


🔍 Maintainer advisory review

Reviewed 2 changed file(s) — two independent AI reviewers. This is an advisory review — it does not merge or close the PR.

Suggested action: 🛠️ Request changes. (reviewers split: request changes / merge) Address the suggestions below before merging.

Reviewer A · gpt-oss-120b — recommends 🛠️ request changes
The change correctly expands the duplicate‑detection logic to include websiteUrl/website_url fields, matching the existing pattern for other URL fields. The added test validates that a canonicalized website URL is treated as a strict duplicate across categories, which aligns with the PR title. No new external network calls, command execution, or secret handling are introduced, so security impact is minimal. The modification follows existing code conventions and includes appropriate test coverage. However, the duplicate‑detection implementation relies on URL canonicalization; we need to confirm that the canonicalization routine (used for other URL fields) also processes websiteUrl correctly (e.g., stripping query strings). Without that verification, there is a risk of false‑positive or false‑negative duplicate detection, potentially affecting the submission pipeline.

Suggestions

  • Verify that the canonicalization function applied to other URL fields also normalizes websiteUrl (e.g., removes query parameters, lower‑cases host). Add a comment or unit test if needed.
  • Consider adding a regression test ensuring that submissions with distinct but similar websiteUrls (different paths) are not incorrectly flagged as duplicates.
  • Document the new field handling in the duplicate‑detection module README or comments for future maintainers.
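
The first two suggestions could be checked with something like the sketch below. It is a minimal stand-in, not the project's canonicalization routine: canonicalizeUrl is a hypothetical helper whose rules (drop query string and fragment, trim trailing slashes, rely on the URL parser to lower-case the host) are assumptions, and the example.com URLs are invented for the example.

```typescript
// Hypothetical canonicalizer. The real routine in the submission gate may
// apply different or additional rules.
function canonicalizeUrl(raw: string): string {
  // The WHATWG URL parser lower-cases the host for http(s) URLs.
  const url = new URL(raw);
  // Rebuild from protocol, host, and path only, which drops the query
  // string and fragment; trailing slashes on the path are trimmed.
  const path = url.pathname.replace(/\/+$/, "");
  return `${url.protocol}//${url.host}${path}`;
}

// Same site and path, differing only in host case, trailing slash, and query.
const withQuery = canonicalizeUrl("https://Example.com/product/?ref=launch");
const plain = canonicalizeUrl("https://example.com/product");

// Same host but a different path: should stay distinct.
const otherPath = canonicalizeUrl("https://example.com/other");

const sameWebsite = withQuery === plain;
const differentPathsCollide = plain === otherPath;
```

With these rules, sameWebsite is true and differentPathsCollide is false, which mirrors the positive and negative cases the reviewer asks for. Whether paths should be significant at all is a design decision for the maintainers; this sketch simply assumes they are.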

Worth double-checking

  • If canonicalization does not strip query strings for websiteUrl, the new test may pass locally but fail in production, leading to missed duplicates or false closures.
  • Increasing the strict duplicate criteria may cause legitimate cross‑category entries to be rejected, potentially increasing manual review load.

Reviewer B · nemotron-3-120b-a12b — recommends ✅ merge
The change adds 'websiteUrl' and 'website_url' to CROSS_CATEGORY_STRICT_URL_FIELDS, ensuring same canonical website URLs across categories are flagged as strict duplicates. This aligns with the goal of preventing duplicate entries by source backing. The test case validates the behavior with query parameter differences and canonicalization. No security risks are introduced as it only extends existing URL field checks without new external calls or tainted flows. The change is consistent with existing patterns and adds appropriate test coverage.

No blocking issues spotted.
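
For readers without the test file at hand, the behavior Reviewer B describes can be sketched roughly as follows. Everything here is illustrative: the Submission type, normalizeWebsite, classify, the "collections" category name, and the rule that collection matches are only "related" are assumptions inferred from the PR description, not code from the repository.

```typescript
type Submission = { slug: string; category: string; websiteUrl: string };
type MatchKind = "strict" | "related";

// Assumed normalization: ignore query string, fragment, and trailing slash.
function normalizeWebsite(raw: string): string {
  const url = new URL(raw);
  return `${url.protocol}//${url.host}${url.pathname.replace(/\/+$/, "")}`;
}

// Hypothetical rule: a shared canonical website is a strict duplicate unless
// one side is a collection, in which case it is only reported as related.
function classify(a: Submission, b: Submission): MatchKind | null {
  if (normalizeWebsite(a.websiteUrl) !== normalizeWebsite(b.websiteUrl)) {
    return null; // different websites: no match on this signal
  }
  const involvesCollection =
    a.category === "collections" || b.category === "collections";
  return involvesCollection ? "related" : "strict";
}

// Invented fixtures: the same website submitted under different categories.
const toolEntry: Submission = {
  slug: "acme-cli",
  category: "tools",
  websiteUrl: "https://acme.example/?ref=a",
};
const mcpEntry: Submission = {
  slug: "acme-mcp",
  category: "mcp",
  websiteUrl: "https://acme.example",
};
const collectionEntry: Submission = {
  slug: "acme-pack",
  category: "collections",
  websiteUrl: "https://acme.example/",
};

const crossCategory = classify(toolEntry, mcpEntry);
const withCollection = classify(toolEntry, collectionEntry);
```

Under these assumptions the tools/mcp pair is classified as "strict" despite the differing query parameter, while the pair involving a collection is only "related".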

@gittensory

gittensory Bot commented Jun 13, 2026


Note

Gittensory Gate skipped

PR closed before full evaluation. No late first comment was created.

| Signal | Result | Evidence | Action |
| --- | --- | --- | --- |
| Gate result | ⚠️ Skipped | #2134 is no longer open. | No action. |

Checked by Gittensory, a quiet PR intelligence layer for OSS maintainers.

@gittensory gittensory Bot added the gittensory:reviewed label (Gittensor contributor context) Jun 13, 2026
@JSONbored JSONbored self-assigned this Jun 13, 2026
@JSONbored JSONbored merged commit 20aad33 into main Jun 13, 2026
26 checks passed
@JSONbored JSONbored deleted the codex/fix-website-url-duplicate-detection branch June 13, 2026 22:25

Labels

  • gittensory:reviewed (Gittensor contributor context)
  • size:XS (This PR changes 0-9 lines, ignoring generated files.)
