"Sanitize Data" tab and application embedded in WebUI#2681

Open
johnshearing wants to merge 99 commits intoHKUDS:mainfrom
johnshearing:main
@johnshearing

image

See the linked video, which shows how to use the Sanitize Data application.


Description

The Sanitize Data tab is for cleaning up the dirty data that results when the AI doesn't understand the source material it is tasked with indexing.

  • The merging app seen above now does the following:
    • Show all information about selected entities and their relations side by side with other selected entities, so you can compare them and decide which of the operations below are needed to clean up the data.
    • Merge entities
    • Add new entities
    • Add new entity types
    • Add new entity relationships
    • Edit entity name and properties
    • Edit entity relationships
    • Delete entities
    • Delete entity relationships
    • Show all entities in the index associated with a particular entity type
    • Show all entities in the index that have no relations to other entities (orphans)
    • The substring filter finds merge candidates like "Jack" and "Dr. Jack Kruse" which don't sort next to each other alphabetically.
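
The substring filter described in the last bullet can be illustrated with a minimal Python sketch. This is not the code used in the Sanitize Data tab; the function name `substring_candidates` and the sample entity names are assumptions made for illustration only.

```python
def substring_candidates(names, query):
    """Return every name that contains `query`, ignoring case, in input order."""
    needle = query.casefold()
    return [name for name in names if needle in name.casefold()]


# Hypothetical entity names; an alphabetical sort would not place the
# "Jack" variants next to each other.
names = ["Dr. Jack Kruse", "Jack", "Jacobian", "Mitochondria", "jack kruse"]

matches = substring_candidates(names, "jack")
print(matches)
```

A case-insensitive containment test surfaces "Jack" and "Dr. Jack Kruse" together even though one starts with "D" and the other with "J", which is the situation the bullet describes.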

Conclusion

Thanks to all contributors for making such a wonderful library to build upon.

@danielaskdd danielaskdd added the enhancement New feature or request label Mar 6, 2026
@Konsilion

Hello,

I was wondering if this feature will be added soon?

Thanks again for LightRAG

@johnshearing
Author

johnshearing commented Apr 19, 2026

Thank you for the interest @Konsilion.
I am sure the core developers have a lot on their plate right now.
In the meantime, if you need Sanitize Data functionality, you can install it from the following:
https://github.com/johnshearing/LightRAG/blob/main/jrs/_notes/setup-dev.sh

The same document shows how to stay in sync with the main LightRAG repository.

New features have been added since I made the video linked above, and the workflow has been improved, but its use is very intuitive. I am sure you will pick up on it right away.

Currently I have the LightRAG server with the Sanitize Data tab running at the following URL:
http://174.167.39.112:9621
It is pointing at a recent book written by Charles Hoskinson about Zero Knowledge Proof systems.
The server will be running for the next few days, so you can try it out without installing it.

I built the Sanitize Data app to find and merge duplicates like the following.

  1. AIR (3 entities → 1, including case variant)
     • AIR, AIR (Algebraic Intermediate Representation), Air
  2. PLONK (2 entities → 1, case variant)
     • PLONK, Plonk
  3. Person name duplicates
     • Srinath Setty / srinath setty / Setty / Setty, Srinath
     • Ioanna Tzialla / ioanna tzialla / Tzialla
     • Kothapalli, Abhiram / Kothapalli / abhiram kothapalli
  4. zkVM case variants (3 → 1)
     • zkVM, ZkVM, ZKVM
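
Duplicates like the ones listed above can be grouped by a normalized key. The sketch below is only an illustration, not how the Sanitize Data app or the MCP server detects them. The helpers `normalize` and `duplicate_groups` are hypothetical, and the normalization rules (case folding, dropping a trailing parenthetical, reordering "Last, First") are assumptions chosen to match the examples.

```python
import re
from collections import defaultdict


def normalize(name):
    """Build a comparison key for an entity name (illustrative rules only)."""
    # Drop a trailing parenthetical such as "(Algebraic Intermediate Representation)".
    base = re.sub(r"\s*\([^)]*\)\s*$", "", name).strip()
    # Turn "Last, First" into "First Last".
    if base.count(",") == 1:
        last, first = (part.strip() for part in base.split(","))
        base = f"{first} {last}"
    # Fold case and collapse runs of whitespace.
    return " ".join(base.casefold().split())


def duplicate_groups(names):
    """Map each normalized key to its variants, keeping keys with 2+ variants."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


names = [
    "AIR", "AIR (Algebraic Intermediate Representation)", "Air",
    "PLONK", "Plonk",
    "Srinath Setty", "srinath setty", "Setty", "Setty, Srinath",
    "zkVM", "ZkVM", "ZKVM",
]

groups = duplicate_groups(names)
for key, variants in groups.items():
    print(key, "->", variants)
```

Note that a bare surname such as "Setty" does not share a key with "Srinath Setty" under these rules, so it is not grouped automatically. Cases like that still need a substring search and a manual look before merging.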

Now I use the Claude Code MCP server linked below to quickly find the duplicates. Then I inspect them and perform the merge with the Sanitize Data app.
https://github.com/lalitsuryan/lightragmcp

:)

3 participants