
feat: One Dataset, One Result – Smart Deduplication in Knowledge Space Search ✔ #99

Open
Areeba-Tahir-18 wants to merge 1 commit into INCF:main from Areeba-Tahir-18:newFeature

Areeba-Tahir-18 commented Mar 26, 2026

Summary

This PR solves issue #68 and introduces a robust deduplication mechanism for the Knowledge Space search tool to ensure cleaner, more accurate search results.

Problem #68

Search results were showing duplicate datasets due to:

  • Aggregation from multiple datasources
  • Metadata variations (titles, descriptions, authors, capitalization)
  • Different URLs pointing to the same resource

Solution

This PR implements:

  1. Canonical dataset identity – uses datasource_id + dataset_id to uniquely identify datasets.
  2. URL normalization – removes query params and fragments to match identical datasets with different URLs.
  3. Title normalization – lowercases titles and removes punctuation and extra spaces.
  4. Fuzzy title matching – detects highly similar titles (threshold 0.93) to remove duplicates.
  5. Title reordering – treats titles that contain the same words in a different order as the same dataset.
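The five steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes each result is a dict with `datasource_id`, `dataset_id`, `title`, and `url` keys (the last two field names are guesses), uses the standard library's `difflib.SequenceMatcher` as a stand-in for whichever fuzzy matcher the PR actually uses, and handles reordering by sorting title tokens, which is only one possible approach. All function names and sample records are hypothetical.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit

FUZZY_THRESHOLD = 0.93  # threshold quoted in the PR description


def normalize_url(url: str) -> str:
    """Drop the query string and fragment so URL variants compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse extra whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def title_key(title: str) -> str:
    """Order-insensitive form of a title (assumed approach: sorted tokens)."""
    return " ".join(sorted(normalize_title(title).split()))


def is_similar(a: str, b: str, threshold: float = FUZZY_THRESHOLD) -> bool:
    """Fuzzy comparison of two titles on their order-insensitive keys."""
    return SequenceMatcher(None, title_key(a), title_key(b)).ratio() >= threshold


def deduplicate(results):
    """Keep the first occurrence of each dataset, dropping later duplicates."""
    seen_ids, seen_urls, kept = set(), set(), []
    for item in results:
        ident = (item.get("datasource_id"), item.get("dataset_id"))
        url = normalize_url(item["url"]) if item.get("url") else None
        title = item.get("title", "")
        if None not in ident and ident in seen_ids:
            continue  # same canonical identity
        if url and url in seen_urls:
            continue  # same resource behind a different URL variant
        if title and any(is_similar(title, k["title"]) for k in kept if k.get("title")):
            continue  # near-identical or reordered title
        if None not in ident:
            seen_ids.add(ident)
        if url:
            seen_urls.add(url)
        kept.append(item)
    return kept


# Hypothetical records: the first four describe the same dataset.
sample = [
    {"datasource_id": "dandi", "dataset_id": "A1",
     "title": "Anesthesia EEG Dataset", "url": "https://example.org/d/A1"},
    {"datasource_id": "dandi", "dataset_id": "A1",
     "title": "ANESTHESIA EEG DATASET", "url": "https://example.org/d/A1-copy"},
    {"datasource_id": "dandi", "dataset_id": "A1-mirror",
     "title": "Anesthesia recordings", "url": "https://example.org/d/A1?view=full#files"},
    {"datasource_id": "other", "dataset_id": "X9",
     "title": "EEG Dataset: Anesthesia", "url": "https://example.org/x/9"},
    {"datasource_id": "dandi", "dataset_id": "B2",
     "title": "Mouse Visual Cortex Recordings", "url": "https://example.org/d/B2"},
]
unique = deduplicate(sample)
print([r["title"] for r in unique])
# → ['Anesthesia EEG Dataset', 'Mouse Visual Cortex Recordings']
```

The checks run from cheapest and most reliable (exact identity) to most permissive (fuzzy title), so the pairwise title comparison only runs for results that the exact checks could not settle. Comparing every candidate against every kept result is quadratic, which is likely acceptable for a single page of search results but would need an index for large result sets.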

Real-World Impact

  • Faster searches – Users find the dataset they need quickly without seeing duplicates.
  • Clear results – Each dataset appears only once, making results easy to understand.
  • Consistent data – Datasets from different sources are shown cleanly in one place.
  • Better user experience – Result lists are no longer cluttered with repeated entries.
  • Reduced redundancy in dataset listings – Users are not confused by seeing the same dataset more than once.
  • Better data reliability – Users can trust that each result is a distinct dataset.

Example
Previously, “Anesthesia EEG Dataset” appeared 3 times from DANDI. Now, only a single clean entry is returned.

Areeba-Tahir-18 commented Mar 26, 2026

Merging this PR will close issue #68.

@visakhmr and @QuantumByte-01, I have updated my PR and removed the large files, as requested by @QuantumByte-01. Kindly review the PR when you have time; I would appreciate your feedback. Thanks!

Areeba-Tahir-18 commented

Just a point that came to my mind:

When the system shows duplicate datasets, users cannot trust the search results.
At first, even I was confused by this, but then I found it interesting to work on. I have interacted with the agent many times, but a new user who sees duplicated results will be confused as well and might say: "For one query it gives us the same dataset twice, so I guess we should not trust its search results blindly; they are not reliable."
So I spent days verifying the issue and implementing the solution, and now the search results are clean and reliable for end users.
