
WIP: Merge Dev to Main #2846

Draft
danielaskdd wants to merge 163 commits into main from dev

Conversation

@danielaskdd
Collaborator

Dummy PR: Merge Dev to Main (Never try to merge this PR)

MrGidea and others added 30 commits March 8, 2026 15:59
…ter-based text

- Add EntityExtractionResult Pydantic model for structured JSON output
- Add JSON-mode prompt templates for entity/relationship extraction
- Add _process_json_extraction_result() JSON parser in extraction pipeline
- Add entity_extraction_use_json config option, default True
- Add extraction_max_tokens config to prevent output truncation
- OpenAI: use response_format json_object with auto-fallback retry
- Ollama/Gemini: use native JSON mode for entity extraction
- Other providers: pop entity_extraction kwarg for compatibility
- Cache rebuild auto-detects JSON vs delimiter format
- Skip relationships with empty descriptions to prevent merge errors
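The JSON parsing step described above can be sketched roughly as follows. This is an illustrative, stdlib-only sketch: the PR uses a Pydantic `EntityExtractionResult` model, and the function and field names here are assumptions, not LightRAG's actual definitions.

```python
# Illustrative sketch of parsing a JSON-mode extraction result; the real
# pipeline validates against a Pydantic model instead of plain dicts.
import json
from dataclasses import dataclass, field


@dataclass
class EntityExtractionResult:
    entities: list = field(default_factory=list)
    relationships: list = field(default_factory=list)


def process_json_extraction_result(raw: str) -> EntityExtractionResult:
    """Parse the LLM's JSON payload, skipping relationships with empty descriptions."""
    data = json.loads(raw)
    relationships = [
        r for r in data.get("relationships", [])
        if r.get("description", "").strip()  # empty descriptions break merging
    ]
    return EntityExtractionResult(
        entities=data.get("entities", []),
        relationships=relationships,
    )
```

Filtering empty-description relationships at parse time is what prevents the merge errors mentioned in the last bullet.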
Bring the RAG-Anything parsing/analyze flow into LightRAG's document pipeline, and let the extract, keyword, query, and VLM roles run with independent model settings. This keeps structured extraction, DOCX interchange ingestion, and relation-merge hardening upstreamable without including the entity-disambiguation experiment.

Made-with: Cursor
Document the companion parser-side changes made in RAG-Anything so reviewers can understand how heading and table normalization align with this LightRAG multimodal pipeline PR without including external repository code here.

Made-with: Cursor
Add safe example settings for the new structured extraction, parser integration, staged pipeline, and role-specific LLM/VLM routing options, keeping env.example ready for `cp env.example .env` deployment without exposing private models or secrets.

Made-with: Cursor
Align the combined structured extraction and multimodal pipeline branch with the repository pre-commit rules so the upstream PR can pass formatting and lint checks cleanly.

Made-with: Cursor
Sync env.example with the upstream interactive setup wizard template while keeping the multimodal parsing, staged pipeline, JSON extraction, and role-specific LLM settings exposed for configuration.

Made-with: Cursor
Rebase env.example onto the latest upstream interactive setup template while preserving the multimodal parsing, staged pipeline, JSON extraction, and role-specific model configuration added by this PR.

Made-with: Cursor
Keep LIGHTRAG_RUNTIME_TARGET in the latest upstream wizard template form so env.example remains a configuration template rather than activating a concrete runtime value in place.

Made-with: Cursor
Refresh env.example against the latest interactive setup template updates, including runtime metadata, Mongo/OpenSearch storage guidance, and device-related comments, while preserving this PR's structured extraction, multimodal parsing, staged pipeline, and role-specific model settings.

Made-with: Cursor
Add the OpenSearch client to offline storage dependencies and make the offline test workflow install storage backends so mocked OpenSearch storage tests can be collected in CI.

Made-with: Cursor
Reorder requirements-offline-storage.txt to match the repository requirements formatter so pre-commit passes cleanly in CI.

Made-with: Cursor
Add runtime role-specific LLM reconfiguration, provider option overrides per role, isolated role kwargs for Ollama query paths, and expanded offline coverage for role isolation, VLM, rollback behavior, and Ollama role kwargs.

Made-with: Cursor
- Defer .docx upload parsing to three-stage pipeline via pending_parse
  format instead of synchronous API-layer extraction
- Fill positions field in interchange JSONL with para_id from docx
- Extract embedded images from .docx to *.blocks.assets directory
- Add extract_docx_images() for binary image extraction via r:embed
- Extend _extract_drawing_info() to return relationship ID
- Add FULL_DOCS_FORMAT_PENDING_PARSE constant and parse_native branch
- Mark engine_capabilities with "i" when images are present

Made-with: Cursor
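Since a .docx file is a zip container, the image-extraction step above can be sketched with nothing but the standard library. The function name mirrors `extract_docx_images` from the commit; everything else (signature, relationship-ID handling via `r:embed` is omitted) is an assumption.

```python
# Stdlib-only sketch: copy every embedded binary under word/media/ out of a
# .docx archive into the *.blocks.assets-style directory.
import zipfile
from pathlib import Path


def extract_docx_images(docx_path: str, assets_dir: str) -> list:
    """Extract word/media/* binaries into assets_dir; return written paths."""
    out = Path(assets_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if name.startswith("word/media/"):
                target = out / Path(name).name  # flatten into assets dir
                target.write_bytes(zf.read(name))
                written.append(str(target))
    return written
```

The real change additionally resolves each image's relationship ID so captions and positions can be linked back to the paragraph (`para_id`) that references it.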
Cherry-pick runtime_validation.py from upstream/main so the
test_runtime_target_validation tests can be collected during CI merge preview.

Made-with: Cursor
Cherry-pick opensearch_impl.py and update kg/__init__.py from
upstream/main so test_opensearch_storage tests can be collected
during CI merge preview.

Made-with: Cursor
Cherry-pick all files that exist in upstream/main but were missing from
this branch. CI merge preview collects tests from both branches, so
missing source modules cause ImportError during test collection.

Includes: Makefile, docker-compose-full, InteractiveSetup docs,
OpenSearch examples, setup wizard scripts/templates, and 4 upstream
test files (zhipu, opensearch, runtime_target, interactive_setup).

Note: test_interactive_setup_outputs has 4 pre-existing failures in
upstream/main itself (port mapping assertion format mismatch).

Made-with: Cursor
The port mapping assertions expected variable-format strings
(${HOST:-0.0.0.0}:${PORT:-9621}:9621) but setup.sh produces
concrete values (127.0.0.1:8080:9621). Update assertions to
match the actual behavior.

Made-with: Cursor
Keep test assertions identical to upstream/main to avoid merge conflicts.
The 4 port-mapping assertion failures are a known upstream issue (setup.sh
outputs concrete values but tests expect variable-format strings).

Made-with: Cursor
Pull the latest setup.sh from upstream/main which now produces
variable-format port mappings matching the test assertions.

Made-with: Cursor
- add `ready_for_review` to PR trigger types for linting and tests workflows
- add conditional check to skip jobs when PR is in draft state
danielaskdd and others added 30 commits April 18, 2026 13:16
…n-limits

feat(extraction): configurable per-response entity/relation limits
Decouple keyword extraction from the provider-specific
`GPTKeywordExtractionFormat` Pydantic schema and consolidate tolerant
JSON parsing in `operate._parse_keywords_payload`. Providers now just
return raw strings; the caller handles parsing, model-object unwrap,
and empty fallbacks uniformly.

- Standardize `response_format={"type": "json_object"}` where providers
  support JSON mode (OpenAI / Azure / Ollama); drop the flag where they
  do not (Bedrock / Lollms / Lmdeploy).
- Add `_normalize_ollama_response_format` to translate the OpenAI-style
  payload into Ollama's native `format` field, with `format` taking
  precedence when explicitly set.
- Remove redundant `kwargs.pop("keyword_extraction", keyword_extraction)`
  patterns — named parameters never leak into `**kwargs`, so those lines
  were no-ops.
- Tighten the keywords-extraction prompt template: require exact JSON
  shape, explicit boundaries, and an empty-list response for edge cases.
- Add tolerant parsing tests plus driver-level tests for Bedrock,
  Ollama, Lollms, and Lmdeploy keyword-extraction plumbing.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
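The caller-side tolerant parsing described above might look roughly like this. A hedged sketch: the real `operate._parse_keywords_payload` may differ in shape, and the `content` attribute unwrap and key names are assumptions.

```python
# Sketch of provider-agnostic keyword parsing: providers return raw strings
# (or model objects); the caller unwraps, extracts JSON, and falls back to
# empty lists on any failure.
import json
import re


def parse_keywords_payload(response) -> tuple:
    """Return (high_level, low_level) keyword lists, tolerating noisy output."""
    # Unwrap provider model objects that expose text as an attribute.
    if hasattr(response, "content"):
        response = response.content
    if not isinstance(response, str):
        return [], []
    # Tolerate code fences and surrounding prose around the JSON object.
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return [], []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return [], []  # empty fallback keeps the query path alive
    return (
        data.get("high_level_keywords", []),
        data.get("low_level_keywords", []),
    )
```

Centralizing this in one function is what lets the providers drop their per-driver `GPTKeywordExtractionFormat` handling.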
…r handling

- add boolean return value to _parse_keywords_payload to indicate parsing success
- implement strict JSON parsing first with fallback to json_repair with warnings
- enhance error logging with truncated response content for debugging
- fix cache handling to properly distinguish between valid empty results and parsing failures
- add comprehensive tests for JSON repair warnings and empty cache scenarios
- delete deprecated keyword extraction format model
- clean up unused imports and reduce codebase complexity
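The success-flag pattern from the first two bullets can be sketched as below. This is an assumption-laden sketch: the PR uses the `json_repair` library as its fallback, whereas here a trivial fence-stripping retry stands in for it to stay stdlib-only.

```python
# Strict JSON first, then a repair fallback with a warning, plus a boolean so
# callers can distinguish a valid empty result from a parse failure.
import json
import logging

logger = logging.getLogger(__name__)


def parse_with_flag(raw: str) -> tuple:
    """Return (payload, ok); ok=False means the result must not be cached."""
    try:
        return json.loads(raw), True  # strict parse succeeded
    except json.JSONDecodeError:
        pass
    try:
        # Stand-in repair step (json_repair in the PR): strip fences, retry.
        repaired = raw.strip().removeprefix("```json").removesuffix("```")
        data = json.loads(repaired)
        logger.warning("JSON repaired; response head: %.80s", raw)
        return data, True
    except json.JSONDecodeError:
        logger.error("Unparseable response head: %.80s", raw)
        return {}, False
```

Caching only when `ok` is true is what distinguishes "the model legitimately returned empty keywords" from "parsing failed, retry next time".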
♻️ refactor(llm): unify keyword extraction across providers
Promote `response_format` to the canonical structured-output parameter
across all LLM providers. Demote `entity_extraction` / `keyword_extraction`
booleans to deprecated shims that map to `{"type": "json_object"}` and
emit a single `DeprecationWarning` at the driver layer.

- Providers translate `response_format` to their native API: OpenAI
  passes through, Ollama -> `format`, Gemini ->
  `response_mime_type` / `response_schema`, Zhipu forwards to client.
- Zhipu: drop the JSON-prompt injection path; task prompt ownership
  returns to the caller.
- Providers without a JSON mode (lollms, lmdeploy, anthropic, hf,
  bedrock, llama_index) safely strip `response_format`.
- Server wrappers become pure forwarders; deprecation shims live only
  at the driver layer to avoid duplicate warnings.
- Simplify OpenAI dispatch to `create()` only — the project never
  passes Pydantic / JSON Schema, so `parse()` buys no extra value and
  risks compatibility on OpenAI-alike providers. Truncated responses
  are returned raw for upstream tolerant JSON parsing.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
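The driver-layer deprecation shim might look roughly like this. The helper name and kwargs-based plumbing are assumptions; the mapping itself (legacy boolean to `{"type": "json_object"}` with a single `DeprecationWarning`) follows the commit message.

```python
# Map deprecated boolean flags to the canonical response_format parameter,
# warning once per call; an explicit response_format always wins.
import warnings


def normalize_structured_output_kwargs(kwargs: dict) -> dict:
    for legacy in ("keyword_extraction", "entity_extraction"):
        if kwargs.pop(legacy, False):
            warnings.warn(
                f"'{legacy}' is deprecated; pass "
                "response_format={'type': 'json_object'} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            kwargs.setdefault("response_format", {"type": "json_object"})
    return kwargs
```

Note that a falsy legacy flag is still popped, so it never leaks into provider-level `**kwargs`, matching the "removal list" behavior described later in this PR.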
- add docstrings explaining response_format handling to anthropic, bedrock, gemini, hf, llama_index, lmdeploy, lollms, ollama, openai, and zhipu modules
- document which adapters support OpenAI-style JSON mode vs compatibility shims
- clarify deprecated keyword_extraction and entity_extraction behavior
- update type hints in llama_index_impl from LlamaIndexSettings to Any for flexibility
- fix minor formatting inconsistencies in operate.py and test files
- replace deprecated keyword_extraction parameter with response_format in bedrock test
- add pytest.warns context manager to zhipu test for deprecated parameter usage
…cache partitioning

- update Gemini LLM to use response_json_schema instead of deprecated response_schema
- enhance LLM cache to include response_format in hash computation for proper partitioning
- update OpenAI LLM documentation to clarify json_schema passthrough behavior
- add unit tests for Gemini schema mapping and cache partitioning logic
- add `_normalize_gemini_response_schema` helper to unwrap OpenAI-style json_schema wrappers
- support Pydantic model classes via `model_json_schema()` method
- simplify `_build_generation_config` by delegating schema normalization
- add test coverage for OpenAI json_schema wrapper and Pydantic model inputs
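The cache-partitioning idea can be sketched as follows. The hashing scheme here is an assumption, not LightRAG's actual cache implementation; the point is only that `response_format` participates in the key, so identical prompts with different output formats never share a cache entry.

```python
# Include response_format in the cache hash so JSON-mode and plain-text
# responses to the same prompt are partitioned into separate entries.
import hashlib
import json


def cache_key(prompt: str, model: str, response_format=None) -> str:
    parts = {
        "prompt": prompt,
        "model": model,
        # Serialize deterministically; None partitions plain-text responses.
        "response_format": json.dumps(response_format, sort_keys=True),
    }
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()
```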
…ders

- add explicit validation to reject typed/Pydantic response_format in gemini, openai, and cache wrapper
- update ollama to support dict-form json_schema response_format unwrapping
- remove deprecated model_json_schema() handling from gemini
- update docstrings to clarify supported response_format types
- add comprehensive tests for rejection behavior and json_schema support
…eters

- map deprecated `keyword_extraction` and `entity_extraction` booleans to `response_format={"type": "json_object"}` when no explicit format is supplied
- force disable COT when structured output is used to prevent reasoning_content from corrupting JSON payload
- update kwargs filtering to include `entity_extraction` in removal list
- add deprecation warnings with migration guidance for legacy parameter usage

✅ test(zhipu): add coverage for entity extraction and COT interaction

- verify `entity_extraction=True` maps to json_object response format and triggers deprecation warning
- verify explicit `response_format` disables COT automatically to protect JSON output
- disable enable_cot when response_format is specified in gemini and openai modules
- ensures structured JSON output is not polluted with reasoning content
- add tests to verify COT is disabled for streaming structured output in both gemini and openai
♻️ refactor(llm): unify structured output control via response_format
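The COT-vs-structured-output guard reduces to a small predicate. This is a sketch under assumed plumbing (the real code sits inside each provider's kwargs handling); the rule itself comes from the commits above.

```python
# Force-disable chain-of-thought whenever structured output is requested, so
# reasoning_content cannot be interleaved into the JSON payload.
def resolve_enable_cot(kwargs: dict, enable_cot: bool = True) -> bool:
    if kwargs.get("response_format") is not None:
        return False  # structured output: COT off unconditionally
    return enable_cot
```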
…ation

- add `_normalize_gemini_base_url` helper to treat google's default service roots as sdk defaults
- change default host from hardcoded url to "DEFAULT_GEMINI_ENDPOINT" sentinel value
- update client initialization to skip http_options for default endpoints
- add test coverage for default endpoint behavior and custom base url preservation

✅ test(gemini): add unit tests for gemini endpoint configuration

- verify default host returns sentinel value when env var is unset
- confirm default service roots are not treated as custom base urls
- ensure custom proxy urls are preserved in http_options
- add missing api_key parameter to test_gemini_streaming_structured_output_disables_cot test case
refactor(gemini): improve default endpoint handling and sdk integration
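The sentinel-endpoint normalization can be sketched like this. The constant names mirror the commit messages, but the normalization logic and the service-root list are assumptions.

```python
# Treat the sentinel and Google's own service roots as "use the SDK default";
# only genuinely custom URLs (e.g. proxies) reach http_options.
from typing import Optional

DEFAULT_GEMINI_ENDPOINT = "DEFAULT_GEMINI_ENDPOINT"  # sentinel, not a URL

_GOOGLE_SERVICE_ROOTS = {
    "https://generativelanguage.googleapis.com",
    "https://generativelanguage.googleapis.com/",
}


def normalize_gemini_base_url(base_url: Optional[str]) -> Optional[str]:
    """Return None for defaults so client setup skips http_options entirely."""
    if not base_url or base_url == DEFAULT_GEMINI_ENDPOINT:
        return None
    if base_url in _GOOGLE_SERVICE_ROOTS:
        return None
    return base_url  # custom proxy URLs are preserved
```

The same sentinel pattern is reused for Bedrock (`DEFAULT_BEDROCK_ENDPOINT`) and in the setup wizard's env.example defaults, so a template never bakes in a concrete endpoint.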
- extract _bedrock_client_kwargs helper to deduplicate client configuration logic
- remove deprecated sys.version_info check for AsyncIterator (requires python 3.9+)
- remove unused base_url fallback from kwargs in bedrock_complete_if_cache
- add endpoint_url documentation for DEFAULT_BEDROCK_ENDPOINT sentinel behavior

✅ test(bedrock): add coverage for endpoint url edge cases

- add test for custom host via LLM_BINDING_HOST env var
- add test for DEFAULT_BEDROCK_ENDPOINT sentinel using sdk default
- add test for empty endpoint_url using sdk default
refactor(bedrock): support default and custom endpoints
- add explanation for DEFAULT_GEMINI_ENDPOINT in Google Gemini AI Studio section
- remove redundant comment from Vertex AI section
- update Gemini embedding section with DEFAULT_GEMINI_ENDPOINT explanation
…l constants

- change gemini default from hardcoded URL to DEFAULT_GEMINI_ENDPOINT
- change aws_bedrock default from hardcoded URL to DEFAULT_BEDROCK_ENDPOINT
- add test cases for sentinel default values in llm and embedding config collection
- add test cases to verify explicit host values are preserved on reruns
…tinel-endpoint

refactor(setup): use sentinel endpoints for Gemini and Bedrock defaults
- standardize switch case indentation to 2 spaces in DocumentManager.tsx
- standardize switch case indentation to 2 spaces in documentStatusFilters.ts
- improve code consistency with project formatting conventions
perf(postgres): use binary parameter for vector similarity queries