fix(merge_entities): handle long descriptions, embedder failures, and storage wrappers #2916

69blade69 wants to merge 2 commits into HKUDS:main
Conversation
… window
When merging entities the default `concatenate` strategy joins the
descriptions of every source entity into one long string. For graph
"hub" entities (those that appear in many documents) the result
routinely exceeds the embedder's context window. With Ollama's
`nomic-embed-text` (default 2048-token context) this surfaces as
```
ollama._types.ResponseError: the input length exceeds the context length
(status code: 400)
```
and the merge aborts before the VDB upsert.
The fix is to truncate the description that is fed into `content` for
the embedding call. The full, untruncated description is still stored
in `merged_entity_data["description"]` and therefore in the graph node
properties — only the `content` string used for embedding is shortened.
1500 characters ≈ 500 tokens, which leaves comfortable headroom
for the entity name on top and stays well below the 2048-token default. The
ellipsis suffix preserves the cue that more text exists for any human
inspecting `content`.
This change is minimal and surgical: only the embedded `content` is
affected, not what gets persisted into graph storage or VDB metadata.
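As a sketch of the idea (the 1500-character limit matches the PR; the function and variable names here are illustrative assumptions, not the PR's actual code):

```python
# Illustrative sketch of the truncation described above. The 1500-char limit
# comes from the PR; the names are assumptions.
MAX_EMBED_DESC_CHARS = 1500

def build_embed_content(entity_name: str, description: str,
                        limit: int = MAX_EMBED_DESC_CHARS) -> str:
    """Build the string fed to the embedder. The full description is still
    persisted in the graph node and VDB metadata; only this string is cut."""
    if len(description) > limit:
        description = description[:limit] + "..."  # ellipsis cues truncation
    return f"{entity_name}\n{description}"
```

A hub entity with a 50 KB accumulated description then embeds a ~1.5 KB string instead of failing the whole merge.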
…ars() TypeError

`POST /graph/entity/create`, `POST /graph/entities/merge`, `POST /graph/entity/edit`, and the relation equivalents return HTTP 500 when the graph storage backend is Apache AGE on PostgreSQL, even when the underlying database operation succeeds. The merge/create/edit actually completes — the entity is in the graph, the VDB row is in place, the relations are migrated — but the client cannot tell the operation went through, which makes the API look broken.

## Repro

1. LightRAG with `LIGHTRAG_GRAPH_STORAGE=PGGraphStorage` (Apache AGE on PostgreSQL).
2. Run:
   ```
   curl -X POST http://127.0.0.1:9621/graph/entity/create \
     -H "X-API-Key: $KEY" -H 'Content-Type: application/json' \
     -d '{"entity_name":"TestE","entity_data":{"entity_type":"Test","description":"smoke"}}'
   ```
3. Server returns HTTP 500 "Internal Server Error".
4. Verify `GET /graph/entity/exists?name=TestE` → `true`. The create **did happen** — only the response crashed.

Server log:

```
INFO: Entity Create: 'TestE' successfully created
INFO: 127.0.0.1:41662 - "POST /graph/entity/create HTTP/1.1" 500
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File ".../fastapi/encoders.py", line 328, in jsonable_encoder
    data = dict(obj)
TypeError: cannot convert dictionary update sequence element #0 to a sequence

During handling of the above exception, another exception occurred:

  File ".../fastapi/encoders.py", line 333, in jsonable_encoder
    data = vars(obj)
TypeError: vars() argument must have __dict__ attribute

ValueError: [TypeError('cannot convert dictionary update sequence element #0 to a sequence'), TypeError('vars() argument must have __dict__ attribute')]
```

The same crash hits `/graph/entities/merge`, `/graph/entity/edit`, `/graph/relation/create`, `/graph/relation/edit` — basically every write endpoint that returns `get_entity_info(...)` / `get_relation_info(...)`.
## Root cause

`get_entity_info` and `get_relation_info` (in `lightrag/utils_graph.py`) embed the raw `node_data` / `edge_data` returned by the storage backend's `get_node` / `get_edge` straight into the response under `graph_data`. For Apache AGE through PostgreSQL via asyncpg, that value is a wrapper object whose properties can't be cleanly serialized — neither `dict(obj)` nor `vars(obj)` works, and `jsonable_encoder` raises.

Every write function in `utils_graph.py` ends with:

```python
return await get_entity_info(
    chunk_entity_relation_graph,
    entities_vdb,
    entity_name,
    include_vector_data=True,
)
```

…which means **all 5 write endpoints** propagate the crash:

- `acreate_entity` → `/graph/entity/create`
- `acreate_relation` → `/graph/relation/create`
- `aedit_entity` → `/graph/entity/edit`
- `aedit_relation` → `/graph/relation/edit`
- `_merge_entities_impl` → `/graph/entities/merge`

We hit this in production while trying to clean up duplicate entities at scale and spent significant time chasing it before realizing **every write was actually succeeding**.

## Fix

Add a `_coerce_to_plain_dict(obj)` helper that recursively converts a value into a plain JSON-serializable dict, falling back to `str(...)` for unknown leaf values and to an empty dict on failure. `get_entity_info` and `get_relation_info` now coerce both `graph_data` and `vector_data` through it before returning.

This is a defensive fix at the seam between storage backends and the response layer. It does not touch storage code, does not touch the write functions, and does not change the public response schema — it just guarantees the response is always JSON-encodable. Backends that already return plain dicts (NetworkX, JSON, Neo4j with current asyncpg paths) are unaffected because `isinstance(obj, dict)` short-circuits the helper.

## Verification

Tested against LightRAG v1.4.13 with PGGraphStorage (Apache AGE 1.5.0).
After the patch:

```
POST /graph/entity/create       → HTTP 200, full response with graph_data + vector_data
POST /graph/entities/merge      → HTTP 200, full response with graph_data + vector_data
POST /graph/entity/edit         → HTTP 200
DELETE /documents/delete_entity → HTTP 200 (was already working — schema stays the same)
```

Smoke merge of OpenClaw ← Openclaw (53 relations transferred): HTTP 200 in ~5 minutes (slow only because of the embedder rebuild for an entity with many neighbours), no errors in the server log, source removed from both VDB and AGE graph, target preserved.

## Notes

- One-file change. Net +69/−12 lines.
- A separate PR (HKUDS#2916) addresses the orthogonal Ollama context-length crash on `merge_entities`. They are independent and can be reviewed separately.
- The cost of the helper is very small (a single `isinstance(obj, dict)` short-circuit when the storage already returns dicts), and it never raises — losing custom properties is strictly better than an HTTP 500 from a successful write.
- Happy to narrow the helper to only `get_entity_info` / `get_relation_info` if maintainers prefer to keep it scoped, or to push it into the `PGGraphStorage.get_node` path instead. This PR went with the helper at the boundary because it covers all backends in one place.
…calls + coerce node_data
After deploying the original `description` truncate from this PR I hit
two more failure modes while merging hub entities (entities that appear
in many documents) at scale. This commit addresses all three together
because they all surface as the same user-visible failure: a merge of
a graph hub goes through partway, then aborts, leaving the graph in
a half-migrated state.
## 1. Truncate relation embed content (extends the original fix)
The original fix truncated `description` only for the **entity** vdb
upsert at the end of `_merge_entities_impl`. But the loop that re-embeds
each migrated relation right above (around line 1426) builds a `content`
string from the relation's own description, which can also exceed the
embedder's context window — relations that have been merged before may
have descriptions accumulated from prior merges.
When merging Иисин (a hub with ~270 neighbours) the merge would crash
mid-loop on a single relation whose accumulated description was too long.
Stack trace from the server log:
```
ollama._types.ResponseError: the input length exceeds the context length (status code: 400)
ERROR: Error in ollama_embed: the input length exceeds the context length
ERROR: Error merging entities: the input length exceeds the context length
File ".../lightrag/utils_graph.py", line 1441, in _merge_entities_impl
await relationships_vdb.upsert(relation_data_for_vdb)
```
Line 1441 = the inner relation upsert. This commit applies the same
truncate (1500 chars for description, 200 for keywords) to the embedded
`content` for each relation. The full description and full keywords
are still stored in the VDB row metadata; only the string fed to the
embedder is shortened.
## 2. Wrap entity and relation embed calls in try/except
Even with the truncate, embedders can fail for other transient reasons
(network blip, model context drift, an unusually pathological character
sequence). Without the try/except, **one** failed embedding aborts the
**whole** merge, leaving the graph in a partially-migrated state where
the target has been updated but the source entities have not been
deleted (because we never reach the delete loop further down).
Wrap the entity vdb upsert and the per-relation vdb upsert in
try/except. We log a warning and keep going. Relation chunk tracking
is already updated separately above the upsert, so the graph stays
consistent even if a single VDB row is missed; the next merge attempt
or a subsequent reprocess will fix it.
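The non-fatal upsert pattern can be sketched as a small wrapper (the helper name and logger are assumptions; the PR inlines the try/except instead of factoring it out):

```python
import logging

logger = logging.getLogger("lightrag.merge")

async def safe_vdb_upsert(vdb, payload, label: str) -> bool:
    """Attempt a vector-DB upsert; on failure, log a warning and let the
    merge continue instead of aborting mid-migration."""
    try:
        await vdb.upsert(payload)
        return True
    except Exception as exc:
        logger.warning(
            "Entity Merge: skipping vdb upsert for %s due to %s: %s",
            label, type(exc).__name__, exc,
        )
        return False
```

One transient embedder error then costs a single VDB row (recoverable on the next merge or reprocess) rather than a half-migrated graph.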
## 3. Coerce raw node_data through dict() before _merge_attributes
This is the most subtle of the three. Some graph backends (notably
PostgreSQL + Apache AGE through asyncpg) return a wrapper object from
`get_node` whose `.get()` method returns `None` for every key — the
attribute access protocol is implemented but `dict.get` semantics
are not. When we feed that wrapper straight into `_merge_attributes`
below:
```python
values = [data.get(key) for data in data_list if data.get(key)]
if not values:
continue # ← key is dropped!
```
…every value comes back falsy, the loop drops every key, and
`merged_entity_data` ends up as an empty dict. Then on the line
`await chunk_entity_relation_graph.upsert_node(target_entity, merged_entity_data)`
we **overwrite** the existing target's description with empty.
We hit this in production: after a merge attempt aborted on a
downstream embedder failure (#2 above), a hub entity was left with
`description=""` in the graph store while the VDB row still had the
old 2.9KB description. We had to write a one-shot recovery script
that read the VDB content back and called `edit_entity` to restore
the graph node. The fix here prevents the regression entirely:
coerce the wrapper into a plain dict before `_merge_attributes`
sees it.
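The failure mode can be reproduced with a toy wrapper standing in for the AGE result object (all names here are hypothetical, and the merge loop is a simplified version of the one quoted above):

```python
class AgeLikeWrapper:
    """Toy stand-in for a backend node object: dict(obj) works because it
    iterates as key/value pairs, but .get() does not follow dict semantics."""
    def __init__(self, props):
        self._props = props
    def __iter__(self):
        return iter(self._props.items())
    def get(self, key, default=None):
        return None                     # broken: every lookup comes back None

def merge_attributes(data_list):
    # Simplified version of the loop quoted above: keep non-falsy values only
    merged = {}
    all_keys = set()
    for data in data_list:
        all_keys.update(dict(data).keys())
    for key in sorted(all_keys):
        values = [data.get(key) for data in data_list if data.get(key)]
        if not values:
            continue                    # key silently dropped
        merged[key] = values[0]
    return merged

raw = AgeLikeWrapper({"description": "hub entity", "entity_type": "Org"})
assert merge_attributes([raw]) == {}                       # everything dropped
assert merge_attributes([dict(raw)])["description"] == "hub entity"  # the fix
```

Coercing each `node_data` through `dict(...)` before the merge loop restores normal `.get()` semantics and prevents the empty-dict overwrite.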
## Verification
Tested against LightRAG v1.4.13 with PGGraphStorage (Apache AGE 1.5.0)
and Ollama nomic-embed-text. Smoke merges before and after this commit:
- `Vault Refs ← Vault Ref` (small, 2 sources): worked before, still works.
- `OpenClaw ← Openclaw` (medium, 53 relations): worked before, still works.
- `Иисин ← Iisin/main` (hub, 271 relations): **previously aborted** with
`the input length exceeds the context length`, **now succeeds** in
~20 minutes (slow only because of the embedder rebuild for that many
edges), with no errors in the server log.
After the merge, `Иисин` graph node still has the full description and
all the relations are correctly migrated.
## Notes
- One-file change.
- All three sub-fixes are in `_merge_entities_impl`. They are tightly
coupled (the embedder failures and the silent description nuking
are two halves of the same incident) and easier to review together
than as three separate PRs.
- No public API or schema changes.
Update — three fixes in one PR. TL;DR of the new fixes:
Verification covers small/medium/hub merges against PostgreSQL + AGE + Ollama. The hub merge (Иисин ← Iisin/main, 271 relations transferred) previously aborted with the embedder error from the original description and now succeeds with no errors. Two commits in the PR — happy to squash on request.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8f0aea7891
```python
except Exception as _ent_upsert_exc:
    logger.warning(
        f"Entity Merge: skipping entity vdb upsert for "
        f"`{target_entity}` due to {type(_ent_upsert_exc).__name__}: "
        f"{_ent_upsert_exc}"
```
**Fail merge when target entity vector upsert is skipped**
Catching and suppressing entities_vdb.upsert(...) errors here can silently drop all entity vectors for the merge result: this function still proceeds to step 10 and deletes each source entity's vector row, so a transient embedding/VDB failure leaves neither source nor target entity searchable in the vector index while reporting a successful merge. This should remain a hard failure (or be retried/compensated) unless source-vector deletion is deferred until target upsert is confirmed.
```python
except Exception as _rel_upsert_exc:
    logger.warning(
        f"Entity Merge: skipping vdb upsert for "
        f"`{normalized_src}`~`{normalized_tgt}` due to "
        f"{type(_rel_upsert_exc).__name__}: {_rel_upsert_exc}"
```
**Do not swallow relation upsert failures after VDB delete**
This except turns relation embedding failures into silent data loss: the function deletes old relation vectors earlier (relationships_vdb.delete(...)) and then continues when a replacement upsert fails, so affected relations are left without vector entries even though the merge completes successfully. In environments with transient embedder/network errors, this permanently degrades relation retrieval until a manual reindex/backfill runs.
## Summary

This PR addresses three closely related failure modes in `_merge_entities_impl` that all surface as the same user-visible bug: a merge of a graph hub entity (one with many neighbours) goes through partway, then aborts, leaving the graph in a half-migrated state.

## 1. Truncate relation embed content (extends the original fix)

The original fix in this PR truncated `description` only for the entity vdb upsert at the end of `_merge_entities_impl`. But the loop that re-embeds each migrated relation right above (around line 1426) builds a `content` string from the relation's own description, which can also exceed the embedder's context window — relations that have been merged before may have descriptions accumulated from prior merges.

When merging Иисин (a hub with ~270 neighbours) the merge would crash mid-loop on a single relation whose accumulated description was too long:

```
ollama._types.ResponseError: the input length exceeds the context length (status code: 400)
File ".../lightrag/utils_graph.py", line 1441, in _merge_entities_impl
await relationships_vdb.upsert(relation_data_for_vdb)
```

This commit applies the same truncate (1500 chars for description, 200 for keywords) to the embedded `content` for each relation. The full description and full keywords are still stored in the VDB row metadata; only the string fed to the embedder is shortened.

## 2. Wrap entity and relation embed calls in try/except

Even with the truncate, embedders can fail for other transient reasons (network blip, model context drift, an unusually pathological character sequence). Without the try/except, one failed embedding aborts the whole merge, leaving the graph in a partially-migrated state where the target has been updated but the source entities have not been deleted (because we never reach the delete loop further down).

Wrap the entity vdb upsert and the per-relation vdb upsert in try/except. We log a warning and keep going. Relation chunk tracking is already updated separately above the upsert, so the graph stays consistent even if a single VDB row is missed; the next merge attempt or a subsequent reprocess will fix it.

## 3. Coerce raw node_data through dict() before _merge_attributes

This is the most subtle of the three. Some graph backends (notably PostgreSQL + Apache AGE through asyncpg) return a wrapper object from `get_node` whose `.get()` method returns `None` for every key — the attribute access protocol is implemented but `dict.get` semantics are not. When we feed that wrapper straight into `_merge_attributes` below:

```python
values = [data.get(key) for data in data_list if data.get(key)]
if not values:
    continue  # ← key is dropped!
```

…every value comes back falsy, the loop drops every key, and `merged_entity_data` ends up as an empty dict. Then on the line `await chunk_entity_relation_graph.upsert_node(target_entity, merged_entity_data)` we overwrite the existing target's description with empty.

We hit this in production: after a merge attempt aborted on a downstream embedder failure (#2 above), a hub entity was left with `description=""` in the graph store while the VDB row still had the old 2.9KB description. We had to write a one-shot recovery script that read the VDB content back and called `edit_entity` to restore the graph node. The fix here prevents the regression entirely: coerce the wrapper into a plain dict before `_merge_attributes` sees it.

## Verification

Tested against LightRAG v1.4.13 with PGGraphStorage (Apache AGE 1.5.0) and Ollama nomic-embed-text. Smoke merges before and after this commit:

- `Vault Refs ← Vault Ref` (small, 2 sources): worked before, still works.
- `OpenClaw ← Openclaw` (medium, 53 relations): worked before, still works.
- `Иисин ← Iisin/main` (hub, 271 relations): previously aborted with `the input length exceeds the context length`, now succeeds in ~20 minutes (slow only because of the embedder rebuild for that many edges), with no errors in the server log.

After the merge, the `Иисин` graph node still has the full description and all the relations are correctly migrated.

## Notes
- All three sub-fixes are in `_merge_entities_impl`. They are tightly coupled (the embedder failures and the silent description nuking are two halves of the same incident) and easier to review together than as three separate PRs.
- A separate PR addresses the `vars()` TypeError on storage wrappers — those are completely different code paths and either PR can be merged first.