❗BUG❗ : Duplicate Datasets Appearing in Search Results of agent #68
Open
Description
Problem
While using the KnowledgeSpace UI search for some time, I noticed that the same dataset can appear multiple times in the results, in some cases twice for a single query. I looked into why this happens and came up with several possible causes:
- Multiple sources: The same dataset may exist in more than one datasource.
- Metadata variations: Enrichment adds metadata or URLs that are slightly different.
- Title differences: The same dataset might have small variations in its title across sources.
- Different IDs: id or dataset identifiers can differ even for the same logical dataset.
I then decided to address this problem; here is my proposed approach.
Proposed Solution:
Use a multi-layer deduplication strategy:
- Canonical Identity: Combine datasource_id + dataset_id to identify duplicates.
- URL Normalization: Remove query parameters to unify different links.
- Fuzzy Title Matching: Detect minor variations in titles and merge duplicates.
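The three layers above could be sketched roughly as follows. This is only an illustration, not the agent's actual code: it assumes each search result is a dict with `datasource_id`, `dataset_id`, `title`, and `url` keys (`title` and `url` are assumed field names), uses the standard-library `difflib.SequenceMatcher` as a stand-in for whatever fuzzy matcher is eventually chosen, and picks a similarity threshold of 0.9 arbitrarily. It keeps the first occurrence of each dataset rather than merging metadata from the duplicates.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit


def canonical_key(record):
    """Layer 1: identity built from datasource_id + dataset_id (None if either is missing)."""
    datasource_id = record.get("datasource_id")
    dataset_id = record.get("dataset_id")
    if datasource_id and dataset_id:
        return f"{datasource_id}:{dataset_id}"
    return None


def normalize_url(url):
    """Layer 2: drop query string and fragment, lowercase scheme/host, trim trailing slash."""
    if not url:
        return None
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def normalize_title(title):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join((title or "").lower().split())


def titles_match(a, b, threshold=0.9):
    """Layer 3: fuzzy comparison of normalized titles (threshold is an assumption)."""
    a, b = normalize_title(a), normalize_title(b)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def deduplicate(results, threshold=0.9):
    """Return results with duplicates removed, keeping the first occurrence of each."""
    kept = []
    seen_keys = set()
    seen_urls = set()
    for record in results:
        key = canonical_key(record)
        url = normalize_url(record.get("url"))
        if key is not None and key in seen_keys:
            continue  # same datasource_id + dataset_id
        if url is not None and url in seen_urls:
            continue  # same link once query parameters are removed
        if any(titles_match(record.get("title"), k.get("title"), threshold) for k in kept):
            continue  # title is a near-match of one already kept
        kept.append(record)
        if key is not None:
            seen_keys.add(key)
        if url is not None:
            seen_urls.add(url)
    return kept
```

The layers run from cheapest and most precise (exact key) to most permissive (fuzzy title), so the pairwise title comparison is only reached when the exact checks find nothing. Fuzzy title matching on its own can wrongly collapse distinct datasets with near-identical names, so the threshold would need tuning against real search results.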
Expected Impact on Knowledge-space-agent Search
- Each dataset appears only once in the UI search results.
- Improves user experience, making results cleaner and easier to navigate.
- Reduces confusion caused by repeated entries.
- Provides a foundation for future improvements, like semantic deduplication or hybrid search ranking.