feat : One Dataset, One Result – Smart Deduplication in Knowledge Space Search ✔ by Areeba-Tahir-18 · Pull Request #99 · INCF/knowledge-space-agent

Areeba-Tahir-18 · 2026-03-26T02:40:25Z

Summary

This PR solves issue #68 and introduces a robust deduplication mechanism for the Knowledge Space search tool to ensure cleaner, more accurate search results.

Problem #68

Search results were showing duplicate datasets due to:

Aggregation from multiple datasources
Metadata variations (titles, descriptions, authors, capitalization)
Different URLs pointing to the same resource

Solution

This PR implements:

Canonical dataset identity – uses datasource_id + dataset_id to uniquely identify datasets.
URL normalization – removes query params and fragments to match identical datasets with different URLs.
Title normalization – lowercasing, removing punctuation, extra spaces.
Fuzzy title matching – detects highly similar titles (threshold 0.93) to remove duplicates.
Title reordering – treats dataset titles that contain the same words in a different order as the same dataset.
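The normalization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`canonical_key`, `normalize_url`, `normalize_title`, `token_sorted_title`) and the record layout (a dict with `datasource_id` and `dataset_id` keys) are assumptions made for the example. Lowercasing the host and trimming a trailing slash go slightly beyond what the description states.

```python
import re
import string
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit


def canonical_key(record: dict) -> Optional[Tuple[str, str]]:
    """Return (datasource_id, dataset_id) when both are present, else None."""
    source = record.get("datasource_id")
    dataset = record.get("dataset_id")
    if source and dataset:
        return (str(source).strip(), str(dataset).strip())
    return None


def normalize_url(url: str) -> str:
    """Drop the query string and fragment so URL variants compare equal.

    Assumption: the scheme and host are also lowercased and a trailing
    slash is trimmed; the PR description only mentions params and fragments.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


# Matches any single ASCII punctuation character.
_PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse extra spaces."""
    cleaned = _PUNCT.sub(" ", title.lower())
    return " ".join(cleaned.split())


def token_sorted_title(title: str) -> str:
    """Sort the words of the normalized title so word order is ignored."""
    return " ".join(sorted(normalize_title(title).split()))
```

With these helpers, "EEG Dataset, Anesthesia" and "Anesthesia EEG Dataset" produce the same token-sorted title, which is one simple way to handle reordered titles.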

Impact of the Feature in Real-World Use Cases

Faster searches – Users find the dataset they need quickly without seeing duplicates.
Clear results – Each dataset appears only once, making results easy to understand.
Consistent data – Datasets from different sources are shown cleanly in one place.
Better user experience – A cleaner result list is easier to use.
Reduced redundancy in dataset listings – Users are no longer confused by seeing duplicates.
Better data reliability – Users can rely on the results that are returned.

Example
Previously, “Anesthesia EEG Dataset” appeared 3 times from DANDI. Now, only a single clean entry is returned.
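To illustrate how a case like this could collapse to a single entry, here is a hedged sketch of a deduplication pass. It is not the implementation in this PR. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure with the 0.93 threshold mentioned above; the function names, the record fields, and the sample IDs are hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.93  # threshold stated in the PR description


def _norm(title):
    """Lowercase, turn punctuation into spaces, collapse extra spaces."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in title.lower())
    return " ".join(kept.split())


def titles_match(a, b, threshold=SIMILARITY_THRESHOLD):
    """True if two titles are equal, reordered, or highly similar."""
    na, nb = _norm(a), _norm(b)
    if not na or not nb:
        return False  # never merge records on an empty title
    if na == nb:
        return True
    if sorted(na.split()) == sorted(nb.split()):
        return True  # same words in a different order
    return SequenceMatcher(None, na, nb).ratio() >= threshold


def deduplicate(records):
    """Keep the first record seen for each dataset, drop later duplicates."""
    seen_keys = set()
    kept = []
    for rec in records:
        key = (rec.get("datasource_id"), rec.get("dataset_id"))
        has_key = all(key)
        if has_key and key in seen_keys:
            continue  # same canonical identity as an earlier record
        if any(titles_match(rec.get("title", ""), k.get("title", "")) for k in kept):
            continue  # title duplicates an earlier record
        if has_key:
            seen_keys.add(key)
        kept.append(rec)
    return kept


# Hypothetical sample data; the IDs are made up for the example.
example_records = [
    {"datasource_id": "dandi", "dataset_id": "A1", "title": "Anesthesia EEG Dataset"},
    {"datasource_id": "dandi", "dataset_id": "A1", "title": "Anesthesia EEG Dataset (copy)"},
    {"datasource_id": "dandi", "dataset_id": None, "title": "EEG Dataset, Anesthesia"},
    {"datasource_id": "dandi", "dataset_id": "B2", "title": "Mouse Visual Cortex Recordings"},
]
unique_records = deduplicate(example_records)
print([r["title"] for r in unique_records])
```

Comparing every record against all kept records is quadratic in the number of results, which is acceptable for a page of search hits but would need an index or blocking step for large result sets.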

Areeba-Tahir-18 · 2026-03-26T02:42:48Z

Merging this PR will close issue #68.

@visakhmr and @QuantumByte-01, I have updated my PR and removed the large files that were included in it, as requested by @QuantumByte-01. Kindly review the PR when you have time. I would appreciate your feedback. Thanks!

Areeba-Tahir-18 · 2026-03-26T04:15:17Z

Just a point that came to my mind:

When the system shows duplicate datasets, users cannot trust the search results.
At first, even I was confused by this, but then I found it interesting to work on. I have interacted with the agent many times,
but a new user who sees duplicated results will be confused as well and may say something like: "For one query it gives us the same dataset twice, so I guess we should not trust its search results blindly; they are not reliable."
So I spent days verifying the issue and implementing the solution, and now the search results are clean and reliable for end users.

just deduplication logic

e93dfab

Areeba-Tahir-18 wants to merge 1 commit into INCF:main from Areeba-Tahir-18:newFeature
