"Sanitize Data" tab and application embedded in WebUI by johnshearing · Pull Request #2681 · HKUDS/LightRAG

johnshearing · 2026-02-05T19:48:18Z

"Sanitize Data" tab and application embedded in WebUI

See the video linked here, which shows how to use the Sanitize Data application.

Description

This tool cleans up the dirty data that results when the AI misunderstands the source material it is tasked with indexing.

The merging app seen above now easily does the following:
- Show all information about selected entities and their relations side by side, so you can compare them and decide which of the operations below are needed to clean up the data.
- Merge entities
- Add new entities
- Add new entity types
- Add new entity relationships
- Edit entity name and properties
- Edit entity relationships
- Delete entities
- Delete entity relationships
- Show all entities in the index associated with a particular entity type
- Show all entities in the index that have no relations to other entities (orphans)
- The substring filter finds merge candidates like "Jack" and "Dr. Jack Kruse" which don't sort next to each other alphabetically.
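The substring filter in the last bullet can be illustrated with a minimal Python sketch. This is not the code in SanitizeData.tsx; the function name `substring_filter` and the case-insensitive matching are assumptions made for illustration, and the sample names are taken from this pull request.

```python
def substring_filter(entity_names, query):
    """Return the names that contain `query`, ignoring case (assumed behavior)."""
    needle = query.casefold()
    return [name for name in entity_names if needle in name.casefold()]


# Sorted alphabetically, "Dr. Jack Kruse" and "Jack" are separated by other names.
names = sorted(["Jack", "Srinath Setty", "Dr. Jack Kruse", "PLONK", "Ioanna Tzialla"])

matches = substring_filter(names, "jack")
print(matches)  # ['Dr. Jack Kruse', 'Jack']
```

Filtering on a shared substring brings both spellings into one short list, where they can be selected and compared before merging.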

Conclusion

Thanks to all contributors for making such a wonderful library to build upon.


Konsilion · 2026-04-19T07:12:39Z

Hello,

I was wondering if this feature will be added soon?

Thanks again for LightRAG

johnshearing · 2026-04-19T11:03:39Z

Thank you for the interest @Konsilion.
I am sure the core developers have a lot on their plate right now.
In the meantime, if you need Sanitize Data functionality, you can install it from the following:
https://github.com/johnshearing/LightRAG/blob/main/jrs/_notes/setup-dev.sh

The same document shows how to stay in sync with the main LightRAG repository.

New features have been added since I made the video linked above, and the workflow has been improved, but its use is very intuitive. I am sure you will pick up on it right away.

Currently I have the LightRAG server with the Sanitize Data tab running at the following URL:
http://174.167.39.112:9621
It is pointing at a recent book written by Charles Hoskinson about zero-knowledge proof systems.
The server will be running for the next few days, so you can try it out without installing it.

I built the Sanitize Data app to find and merge duplicates like the following.

AIR (3 entities → 1, including case variant)

AIR, AIR (Algebraic Intermediate Representation), Air

PLONK (2 entities → 1, case variant)

PLONK, Plonk

Person name duplicates

Srinath Setty / srinath setty / Setty / Setty, Srinath
Ioanna Tzialla / ioanna tzialla / Tzialla
Kothapalli, Abhiram / Kothapalli / abhiram kothapalli

zkVM case variants (3 → 1)

zkVM, ZkVM, ZKVM
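As a rough illustration of how duplicates like these could be grouped automatically, here is a minimal Python sketch. It is not part of the Sanitize Data app or of LightRAG; the helper names `normalize` and `group_variants` are hypothetical, and the only normalization assumed is case folding plus reordering "Last, First" names.

```python
from collections import defaultdict


def normalize(name):
    """Build a comparison key: reorder 'Last, First' and ignore case."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return " ".join(name.split()).casefold()


def group_variants(names):
    """Group names that share a normalized key; keep groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = [
    "zkVM", "ZkVM", "ZKVM",
    "PLONK", "Plonk",
    "Srinath Setty", "srinath setty", "Setty, Srinath", "Setty",
    "AIR", "Air", "AIR (Algebraic Intermediate Representation)",
]

for group in group_variants(names):
    print(group)
```

This key catches case variants and reversed names, but not "Setty" alone or "AIR (Algebraic Intermediate Representation)". Candidates like those still need a substring search and a manual check before merging.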

Now I use the Claude Code MCP server linked below to quickly find the duplicates, and then I inspect them and perform the merge with the Sanitize Data app.
https://github.com/lalitsuryan/lightragmcp

:)

johnshearing added 30 commits December 26, 2025 23:15

Setup: Dependencies and JRS scripts for RAGAnywhere

c7e1647

Fixed merge conflicts in .gitignore and uv.lock

299f76a

Add README for jrs folder

44a218a

Reorganize docs into jrs/_notes and update setup.sh

e1e99d1

Checkpoint: Save my work before syncing

560bed0

Checkpoint: Save my work before syncing

6cb3c55

Fix: handled 2x vector count mismatch in EmbeddingFunc

f992e80

Add archive directory to tracking

450258c

split the example index/query script in two

784db41

added image query script

efbf7eb

image query loops for each query mode

cb5a697

looping for multiple text query modes

4b569e9

deleted the readme file

e336e9f

testing multimodal queries

20aaa2c

auto refresh WebUI after merge operation

3cb34d9

auto update WebUI after merging entities

61e76ed

updated documentation

e663e9d

Merge remote-tracking branch 'upstream/main'

90c3e38

final update before starting work on WebUI

66e7f7d

Start of building Data Sanitation Utiltiy into WebUI

6d378a5

Merge upstream/main and fix conflicts

404fd15

Final sync: Integrated upstream changes & updated lock files

22b5fc7

Sanitize Data screen

decb887

Arrange controls on Sanitiz Data screen

4d0708f

Updates to SanitizeData.tsx

614c5ca

Arranging controls on SanitizeData.tsx

7938036

Final arrangment of controls on SanitizeData.tsx

279c636

SanitizeData.tsx: fetch entities and filter results

033c819

coded page controls for SanitizeData.tsx

e294696

spinner applied to SanitizeData.tsx for page navigation

f235a8e

johnshearing added 16 commits February 6, 2026 19:58

Tightened related entity filter: SanitizeData.tsx

e668e05

Removed place holders: SanitizeData.tsx

890d762

Changed two controls from disabled to read only: SanitizeData.tsx

982a709

Fixed stale cache when editing relationships: SanitizeData.tsx

d6055cb

Fixed stale cache when editing entities: SanitizeData.tsx

b2bd996

Click anywhere on row to select checkbox: SanitizeData.tsx

e35be46

Focus on filter when starting: SanitizeData.tsx

0e7d521

Adjusted tab order for better UX: SanitizeData.tsx

de7a7ff

Tab order and other improvements to UX: SanitizeData.tsx

41952c3

Fixed hotkey issue: SanitizeData.tsx

cec6080

Fixed issues with Select Entity Type dialog: SanitizeData.tsx

ff7336e

Improved UX for selecting entity types: SanitizeData.tsx

988fafd

Workflow, Tab orders, Hotkeys, UX: SanitizeData.tsx

a3a4e04

Much simpler UI: SanitizeData.tsx

d662fcf

Simplified User Interface: SanitizeData.tsx

38a2685

Merge remote-tracking branch 'upstream/main'

02aeeda

Sync with upstream repository

johnshearing mentioned this pull request Feb 14, 2026

[Question]: Can the <SEP> delimiter be removed from all entity and relationship descriptions? #2694

Open

2 tasks

johnshearing added 7 commits February 15, 2026 15:15

WebUI: Remove chunks from context when Chunk Top K set to 0

1828f4d

Now can add relationships in batches: SanitizeData.tsx

63edf24

Fixed edge-case save failure: SanitizeData.tsx

9750239

Esc key for batch processing window: SanitizeData.tsx

9367ddb

Added progress spinners to batch processes: SanitizeData.tsx

ac0f744

Added progress spinners to batch processes: SanitizeData.tsx

56e9178

changed from local host to generic access: SanitizeData.tsx

1fc008a

danielaskdd added the enhancement label Mar 6, 2026

johnshearing added 3 commits March 18, 2026 13:54

Updated to Pull Request HKUDS#2731

db7a884

Checkpoint: Save WebUI modifications and built assets

5e51df7

Merge remote-tracking branch 'upstream/main'

8801f68

johnshearing wants to merge 99 commits into HKUDS:main from johnshearing:main
