
fix(merge_entities): handle long descriptions, embedder failures, and storage wrappers#2916

Open
69blade69 wants to merge 2 commits intoHKUDS:mainfrom
69blade69:fix/merge-entities-embedder-truncate

Conversation


@69blade69 69blade69 commented Apr 8, 2026

Summary

This PR addresses three closely related failure modes in `_merge_entities_impl` that all surface as the same user-visible bug: a merge of a graph hub entity (one with many neighbours) proceeds partway, then aborts, leaving the graph in a half-migrated state.

Update: the original PR only had the description-truncate fix from the title. After deploying it I hit two more failure modes while merging hubs at scale and pushed an additional commit (8f0aea7) addressing them. Title and description rewritten to match the wider scope.

1. Truncate relation embed content (extends the original fix)

The original fix in this PR truncated `description` only for the entity VDB upsert at the end of `_merge_entities_impl`. But the loop that re-embeds each migrated relation right above (around line 1426) builds a `content` string from the relation's own description, which can also exceed the embedder's context window — relations that have been merged before may have descriptions accumulated from prior merges.

When merging Иисин (a hub with ~270 neighbours) the merge would crash mid-loop on a single relation whose accumulated description was too long:
```
ollama._types.ResponseError: the input length exceeds the context length (status code: 400)
File ".../lightrag/utils_graph.py", line 1441, in _merge_entities_impl
await relationships_vdb.upsert(relation_data_for_vdb)
```

This commit applies the same truncation (1500 chars for the description, 200 for keywords) to the embedded `content` for each relation. The full description and full keywords are still stored in the VDB row metadata; only the string fed to the embedder is shortened.
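A minimal sketch of that truncation, under stated assumptions: the helper and constant names (`_truncate`, `MAX_DESC_CHARS`, `MAX_KEYWORDS_CHARS`, `build_relation_embed_content`) and the exact `content` layout are illustrative, not the identifiers actually used in `lightrag/utils_graph.py`.

```python
# Illustrative limits: ~1500 chars is roughly ~500 tokens, comfortably
# under the 2048-token default context of nomic-embed-text.
MAX_DESC_CHARS = 1500
MAX_KEYWORDS_CHARS = 200

def _truncate(text: str, limit: int) -> str:
    # The ellipsis keeps a visible cue that more text exists.
    return text if len(text) <= limit else text[:limit] + "..."

def build_relation_embed_content(src: str, tgt: str,
                                 keywords: str, description: str) -> str:
    # Only the string fed to the embedder is shortened; the full
    # description/keywords still go into the VDB row metadata.
    return (f"{_truncate(keywords, MAX_KEYWORDS_CHARS)}\t{src}\n{tgt}\n"
            f"{_truncate(description, MAX_DESC_CHARS)}")
```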

2. Wrap entity and relation embed calls in try/except

Even with the truncation, embedders can fail for other transient reasons (a network blip, model context drift, an unusually pathological character sequence). Without the try/except, one failed embedding aborts the whole merge, leaving the graph in a partially migrated state where the target has been updated but the source entities have not been deleted (because we never reach the delete loop further down).

Wrap the entity VDB upsert and the per-relation VDB upsert in try/except: we log a warning and keep going. Relation chunk tracking is already updated separately above the upsert, so the graph stays consistent even if a single VDB row is missed; the next merge attempt or a subsequent reprocess will fix it.
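A sketch of that wrapping, assuming an illustrative helper name (`safe_upsert` is not the actual code in the PR, which inlines the try/except):

```python
import logging

logger = logging.getLogger(__name__)

async def safe_upsert(vdb, row: dict, label: str) -> bool:
    """Attempt a VDB upsert; on failure, warn and let the merge continue."""
    try:
        await vdb.upsert(row)
        return True
    except Exception as exc:
        # One bad embedding should not abort the whole merge; a later
        # merge attempt or reprocess can backfill the missed row.
        logger.warning(
            "Entity Merge: skipping vdb upsert for %s due to %s: %s",
            label, type(exc).__name__, exc,
        )
        return False
```

The trade-off stated above applies: a skipped upsert leaves that row out of the vector index until a subsequent reprocess backfills it.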

3. Coerce raw node_data through dict() before _merge_attributes

This is the most subtle of the three. Some graph backends (notably PostgreSQL + Apache AGE through asyncpg) return a wrapper object from `get_node` whose `.get()` method returns `None` for every key — the attribute access protocol is implemented but `dict.get` semantics are not. When we feed that wrapper straight into `_merge_attributes` below:

```python
values = [data.get(key) for data in data_list if data.get(key)]
if not values:
    continue  # ← key is dropped!
```

…every value comes back falsy, the loop drops every key, and `merged_entity_data` ends up as an empty dict. Then on the line `await chunk_entity_relation_graph.upsert_node(target_entity, merged_entity_data)` we overwrite the existing target's description with an empty one.

We hit this in production: after a merge attempt aborted on a downstream embedder failure (#2 above), a hub entity was left with `description=""` in the graph store while the VDB row still had the old 2.9 KB description. We had to write a one-shot recovery script that read the VDB content back and called `edit_entity` to restore the graph node. The fix here prevents the regression entirely: coerce the wrapper into a plain dict before `_merge_attributes` sees it.

Verification

Tested against LightRAG v1.4.13 with PGGraphStorage (Apache AGE 1.5.0) and Ollama nomic-embed-text. Smoke merges before and after this commit:

  • Vault Refs ← Vault Ref (small, 2 sources): worked before, still works.
  • OpenClaw ← Openclaw (medium, 53 relations): worked before, still works.
  • Иисин ← Iisin/main (hub, 271 relations): previously aborted with the input length exceeds the context length, now succeeds in ~20 minutes (slow only because of the embedder rebuild for that many edges), with no errors in the server log.

After the merge, Иисин graph node still has the full description and all the relations are correctly migrated.

Notes

  • One-file change. Two commits in this PR — happy to squash if maintainers prefer.
  • All three sub-fixes are in _merge_entities_impl. They are tightly coupled (the embedder failures and the silent description nuking are two halves of the same incident) and easier to review together than as three separate PRs.
  • No public API or schema changes.
  • Independent of #2915 (fix(utils_graph): coerce storage values to plain dict to fix `vars()` TypeError on all write endpoints) — those are completely different code paths and either PR can be merged first.

… window

When merging entities the default `concatenate` strategy joins the
descriptions of every source entity into one long string. For graph
"hub" entities (those that appear in many documents) the result
routinely exceeds the embedder's context window. With Ollama's
`nomic-embed-text` (default 2048-token context) this surfaces as

  ollama._types.ResponseError: the input length exceeds the context length
                              (status code: 400)

and the merge aborts before the VDB upsert.

The fix is to truncate the description that is fed into `content` for
the embedding call. The full, untruncated description is still stored
in `merged_entity_data["description"]` and therefore in the graph node
properties — only the `content` string used for embedding is shortened.

1500 characters ≈ ~500 tokens, which leaves comfortable headroom
for the entity name on top and stays well below the 2048-token
default. The ellipsis suffix preserves the cue that more text exists
for any human inspecting `content`.

This change is minimal and surgical: only the embedded `content` is
affected, not what gets persisted into graph storage or VDB metadata.
69blade69 added a commit to 69blade69/LightRAG that referenced this pull request Apr 8, 2026
…ars() TypeError

When the graph storage backend is Apache AGE on PostgreSQL, `POST /graph/entity/create`, `POST /graph/entities/merge`, `POST /graph/entity/edit`, and the relation equivalents return HTTP 500 even though the underlying database operation succeeds. The merge/create/edit actually completes — the entity is in the graph, the VDB row is in place, the relations are migrated — but the client cannot tell the operation went through, which makes the API look broken.

## Repro

1. LightRAG with `LIGHTRAG_GRAPH_STORAGE=PGGraphStorage` (Apache AGE on PostgreSQL).
2. `curl -X POST http://127.0.0.1:9621/graph/entity/create -H "X-API-Key: $KEY" -H 'Content-Type: application/json' -d '{"entity_name":"TestE","entity_data":{"entity_type":"Test","description":"smoke"}}'`
3. Server returns HTTP 500 "Internal Server Error".
4. Verify `GET /graph/entity/exists?name=TestE` → `true`. The create **did happen** — only the response crashed.

Server log:
```
INFO: Entity Create: 'TestE' successfully created
INFO: 127.0.0.1:41662 - "POST /graph/entity/create HTTP/1.1" 500
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File ".../fastapi/encoders.py", line 328, in jsonable_encoder
    data = dict(obj)
TypeError: cannot convert dictionary update sequence element #0 to a sequence
During handling of the above exception, another exception occurred:
  File ".../fastapi/encoders.py", line 333, in jsonable_encoder
    data = vars(obj)
TypeError: vars() argument must have __dict__ attribute
ValueError: [TypeError('cannot convert dictionary update sequence element #0 to a sequence'), TypeError('vars() argument must have __dict__ attribute')]
```

The same crash hits `/graph/entities/merge`, `/graph/entity/edit`, `/graph/relation/create`, `/graph/relation/edit` — basically every write endpoint that returns `get_entity_info(...)` / `get_relation_info(...)`.

## Root cause

`get_entity_info` and `get_relation_info` (in `lightrag/utils_graph.py`) embed the raw `node_data` / `edge_data` returned by the storage backend's `get_node` / `get_edge` straight into the response under `graph_data`. For Apache AGE through PostgreSQL via asyncpg, that value is a wrapper object whose properties can't be cleanly serialized — neither `dict(obj)` nor `vars(obj)` work, and `jsonable_encoder` raises.
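The failure can be reproduced without FastAPI, since `jsonable_encoder`'s fallback path just tries `dict(obj)` and then `vars(obj)` (see the traceback above). A minimal stand-in for the backend wrapper (the class name and shape are illustrative, not the real AGE result type):

```python
class AgeWrapperStandIn:
    # Stand-in for the asyncpg/AGE result object: it holds properties
    # internally, but exposes no __dict__ (because of __slots__) and is
    # not an iterable of key/value pairs, so both encoder fallbacks
    # raise TypeError.
    __slots__ = ("_props",)

    def __init__(self, props):
        self._props = props

w = AgeWrapperStandIn({"description": "smoke"})

for fallback in (dict, vars):
    try:
        fallback(w)
    except TypeError as exc:
        print(f"{fallback.__name__}(obj) failed: {exc}")
```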

Every write function in `utils_graph.py` ends with:

```python
return await get_entity_info(
    chunk_entity_relation_graph,
    entities_vdb,
    entity_name,
    include_vector_data=True,
)
```

…which means **all 5 write endpoints** propagate the crash:

- `acreate_entity` → `/graph/entity/create`
- `acreate_relation` → `/graph/relation/create`
- `aedit_entity` → `/graph/entity/edit`
- `aedit_relation` → `/graph/relation/edit`
- `_merge_entities_impl` → `/graph/entities/merge`

We hit this in production while trying to clean up duplicate entities at scale and spent significant time chasing it before realizing **every write was actually succeeding**.

## Fix

Add `_coerce_to_plain_dict(obj)` helper that recursively converts a value into a plain JSON-serializable dict, falling back to `str(...)` for unknown leaf values and to an empty dict on failure. `get_entity_info` and `get_relation_info` now coerce both `graph_data` and `vector_data` through it before returning.

This is a defensive fix at the seam between storage backends and the response layer. It does not touch storage code, does not touch the write functions, and does not change the public response schema — it just guarantees the response is always JSON-encodable. Backends that already return plain dicts (NetworkX, JSON, Neo4j with current asyncpg paths) are unaffected because `isinstance(obj, dict)` short-circuits the helper.

## Verification

Tested against LightRAG v1.4.13 with PGGraphStorage (Apache AGE 1.5.0). After the patch:

```
POST /graph/entity/create  → HTTP 200, full response with graph_data + vector_data
POST /graph/entities/merge → HTTP 200, full response with graph_data + vector_data
POST /graph/entity/edit    → HTTP 200
DELETE /documents/delete_entity → HTTP 200 (was already working — schema stays the same)
```

Smoke merge of OpenClaw ← Openclaw (53 relations transferred): HTTP 200 in ~5 minutes (slow only because of the embedder rebuild for an entity with many neighbours), no errors in the server log, source removed from both VDB and AGE graph, target preserved.

## Notes

- One-file change. Net +69/-12 lines.
- A separate PR (HKUDS#2916) addresses the orthogonal Ollama context-length crash on `merge_entities`. They are independent and can be reviewed separately.
- The cost of the helper is very small (a single `isinstance(obj, dict)` short-circuit when the storage already returns dicts), and it never raises — losing custom properties is strictly better than HTTP 500 from a successful write.
- Happy to narrow the helper to only `get_entity_info` / `get_relation_info` if maintainers prefer to keep it scoped, or to push it into the `PGGraphStorage.get_node` path instead. This PR went with the helper at the boundary because it covers all backends in one place.
@69blade69 69blade69 changed the title fix(merge_entities): truncate description for embedder to fit context window fix(merge_entities): handle long descriptions, embedder failures, and storage wrappers Apr 8, 2026
@69blade69 (Author)

Update — three fixes in one PR

After deploying the original description truncate from this PR I hit two more failure modes while merging hub entities at scale (entities with hundreds of neighbours). They all surface as the same user-visible bug — a merge of a graph hub aborts mid-flight and leaves the graph half-migrated — so I added them as a follow-up commit (8f0aea7) and rewrote the title and description to match the wider scope.

TL;DR of the new fixes:

  1. The same description truncation, but applied to relation embed content too (not just the entity's), because the relation upsert loop was also crashing on hub merges.
  2. try/except around entity and relation VDB upserts, so a single bad embedding does not abort the whole merge.
  3. Coerce the value returned by `get_node` into a plain dict before `_merge_attributes` sees it. PostgreSQL+AGE returns a wrapper whose `.get()` always returns `None`, so `_merge_attributes` silently dropped every key, and `upsert_node` then overwrote the existing target's description with an empty one. We had to write a recovery script for a real entity in production after this happened. Worth fixing.

Verification covers small/medium/hub merges against PostgreSQL+AGE+Ollama. Hub merge (Иисин ← Iisin/main, 271 relations transferred) previously aborted with the embedder error from the original description and now succeeds with no errors. Two commits in the PR — happy to squash on request.

@danielaskdd (Collaborator)

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f0aea7891


Comment thread: `lightrag/utils_graph.py`, lines +1541 to +1545

```python
except Exception as _ent_upsert_exc:
    logger.warning(
        f"Entity Merge: skipping entity vdb upsert for "
        f"`{target_entity}` due to {type(_ent_upsert_exc).__name__}: "
        f"{_ent_upsert_exc}"
```

P1: Fail merge when target entity vector upsert is skipped

Catching and suppressing entities_vdb.upsert(...) errors here can silently drop all entity vectors for the merge result: this function still proceeds to step 10 and deletes each source entity's vector row, so a transient embedding/VDB failure leaves neither source nor target entity searchable in the vector index while reporting a successful merge. This should remain a hard failure (or be retried/compensated) unless source-vector deletion is deferred until target upsert is confirmed.


Comment thread: `lightrag/utils_graph.py`, lines +1491 to +1495

```python
except Exception as _rel_upsert_exc:
    logger.warning(
        f"Entity Merge: skipping vdb upsert for "
        f"`{normalized_src}`~`{normalized_tgt}` due to "
        f"{type(_rel_upsert_exc).__name__}: {_rel_upsert_exc}"
```

P2: Do not swallow relation upsert failures after VDB delete

This except turns relation embedding failures into silent data loss: the function deletes old relation vectors earlier (relationships_vdb.delete(...)) and then continues when a replacement upsert fails, so affected relations are left without vector entries even though the merge completes successfully. In environments with transient embedder/network errors, this permanently degrades relation retrieval until a manual reindex/backfill runs.

