HKUDS · johnshearing · Dec 27, 2025
diff --git a/.gitignore b/.gitignore
@@ -78,9 +78,30 @@ download_models_hf.py
 memory-bank
 .claude/CLAUDE.md
 .claude/
-
-# Claude Code
 CLAUDE.md
 
-# Google Jules
-.jules/
+
+# My changes to the gitignore file
+# 1. Ignore everything in the repository
+/*
+
+# 2. Allow the 'jrs' folder itself
+!/jrs/
+
+# 3. Ignore everything inside 'jrs' (including the 'work' folder)
+/jrs/*
+
+# 4. Allow specific files directly inside 'jrs'
+!/jrs/*.*
+
+# 5. Allow the _notes folder specifically
+!/jrs/_notes/
+
+# 6. Allow everything inside the _notes folder
+!/jrs/_notes/**
+
+# 7. Allow the archive folder specifically
+!/jrs/archive/
+
+# 8. Allow everything inside the archive folder
+!/jrs/archive/**
diff --git a/jrs/_0_interesting_queries.md b/jrs/_0_interesting_queries.md
@@ -0,0 +1,144 @@
+## Ways To Query The LightRAG Server
+
+## Query 1
+
+### LightRAG server will produce links which will bring you to the exact spot in a video where the query is addressed.
+
+The magic which makes this possible is in the way the data was prepared and indexed.
+[This library](https://github.com/johnshearing/scrape_yt_mk_transcripts) was used to prepare the data for indexing.
+<br>
+How the query is written also matters. The following is an example.
+
+**Paste the following query into the LightRAG server prompt box.
+Try Naive, Local, Global, Hybrid, and Mix mode for each query**
+
+
+[The source text is a transcript from the source video. The URL for the source video is found in the metadata associated with the source text. Time stamps in the source text appear between brackets at the beginning of each line as shown in the following sample text: <sample text begins here>[6.56 > 11.4] (Roger) the different years that in those that had COVID there were higher levels of oxidative stress.<sample text ends here> In the sample text, the values given between the brackets are given in seconds and decimal fractions of a second. So the first value in the timestamp sample is six seconds and 56 hundredths of a second. Only use the integer part of the value and only return the first value in the pair. So from the sample text, only use the number 6 to represent six seconds into the video when generating your answer. To say it another way, only use the whole number part of the first value given between the brackets when forming your answer. Please provide URLs in your responses for source video that incorporate the timestamps found in the source text so that the videos can be viewed at the moment which is relevant to the query response. The following is an example of how the URLs should be structured if the source text is from Ln3WszTq0uA.txt and the moment of interest has a time stamp of [6.56 > 11.4]: https://www.youtube.com/watch?v=Ln3WszTq0uA&t=6s. To build the URL always start with the URL you find in the metadata which is associated with the source text. In this case the URL associated with the source text is "https://www.youtube.com/watch?v=Ln3WszTq0uA". Append the characters &t= to the URL. So now the URL will look as follows: "https://www.youtube.com/watch?v=Ln3WszTq0uA&t=". Next, append the number 6 to the URL if the timestamp is [6.56 > 11.4] because we are only interested in the whole number part of the first value in the timestamp. So now the URL will look as follows: "https://www.youtube.com/watch?v=Ln3WszTq0uA&t=6". Finally, append the letter s to the URL to denote seconds. So the final resulting URL will look as follows: "https://www.youtube.com/watch?v=Ln3WszTq0uA&t=6s"] Please search for all instances where infrared light is discussed with regard to studies done in Brazil.
+
+**The following was the response produced**
+
+Infrared Light Studies in Brazil
+In the context of infrared light and its potential benefits for COVID-19, there is a notable study mentioned that was conducted in Brazil. This study involved the use of a jacket for patients with mild to moderate COVID-19 who were hospitalized but not on ventilators. The findings indicated that the use of near-infrared light therapy significantly improved several pulmonary functions and reduced hospitalization time by four days. Specifically, there was a 32% improvement in maximum inspiratory pressure and a 23% improvement in maximum expiratory pressure.
+
+The researchers highlighted the systemic effects of photobiomodulation, which suggests that near-infrared light could have broader implications for treating conditions like COVID-19. However, the study's main limitation was the small cohort size of only 30 patients, suggesting that further research with larger populations is necessary to strengthen the conclusions.
+
+For more information on this topic, you can view the relevant section of the video discussing the Brazilian study at this timestamp: https://www.youtube.com/watch?v=Hy8G5TKxiLY&t=2102s.
+
+### Summary
+The discussion around topical sunscreen highlights a critical view of its use in preventing skin cancer, suggesting that blocking UV light with sunscreen may lead to adverse health outcomes, including an increased risk of melanoma.
+
+---
+#### The following explains the effect of the query formatting in the above example
+
+  * Query methods available are naive, local, global, hybrid, mix, and bypass.
+    * naive only searches the vector database, but it searches for meaning rather than exact words, so its answers can go beyond the literal text it finds.
+    * local uses information found in the knowledge graph nodes.
+    * global uses information found in the knowledge graph relations.
+    * hybrid combines the local and global methods.
+    * mix combines hybrid with naive.
+    * bypass skips retrieval and sends the query straight to the LLM.
+    * The results vary significantly with the method used, and there is no telling in advance which will give the best result, so it is good to try them all.
+    * The python script _2_index_query_??.py tries all the methods, so it might be wise to run the query in the script first to see which method produces the best response, and then run it again in the LightRAG server with that method to get the best formatting.
+
+* [Time stamps in the...]
+  * Specify special instructions or formatting instructions in brackets before the query. This greatly improves the quality of the response.
+
+
+## Query 2
+
+### Metadata: LightRAG server will list data about the source material and produce links to that source material.
+
+The magic which makes this possible is in the way the data was prepared and indexed.
+[This library](https://github.com/johnshearing/scrape_yt_mk_transcripts) was used to prepare the data for indexing.
+<br>
+How the query is written also matters. The following is an example.
+
+**Paste the following query into the LightRAG server prompt box.**
+
+/naive Please list the metadata for the source material.
+
+**The following was the response produced**
+
+### Metadata for the Source Material
+- **Video Title:** Dr Jack Kruse: Light, Melanin, POMC in Human Evolution & Disease
+- **Video URL:** [Watch here](https://www.youtube.com/watch?v=Ln3WszTq0uA)
+- **Video Platform:** YouTube
+- **Video Channel:** Max Gulhane MD
+- **Posted Date:** April 2, 2023
+- **Language:** English
+
+### References
+- [DC] Ln3WszTq0uA_metadata.json
+
+## Query 3
+
+### LightRAG server will produce diagrams.
+
+How the query is written matters.
+The query mode is specified first, for example: /mix
+Special instructions about the output go in between brackets []
+Finally the query is written.
+The following is an example.
+
+**Paste the following query into the LightRAG server prompt box.**
+
+>/mix [Use mermaid format for diagrams] Please draw a character relationship diagram for Age of Mammals
+
+**The following was the response produced**
+
+```mermaid
+graph TD;
+    A[Age of Mammals] -->|Describes| B[Max]
+    A -->|Emergence after| C[The Last Extinction Event]
+    A -->|Links to| D[Darwin's Theory of Evolution]
+    A -->|Impacted by| E[Asteroid Event]
+    A -->|Related to| F[Human Evolution]
+    A -->|Significant for| G[Dinosaurs]
+```
+
+### Explanation:
+- The diagram illustrates the relationships surrounding the "Age of Mammals," showing its connections to key entities such as Max, the Last Extinction Event, Darwin's Theory of Evolution, the Asteroid Event, Human Evolution, and Dinosaurs. Each relationship highlights how these entities are interconnected in the context of evolutionary history.
+
+## Query 4
+
+### LightRAG can find orphans in the knowledge graph.
+
+This only works in a local or mix mode query.
+Currently this only works for lists of up to 23 items when using the LightRAG server.
+When the same query is used in a python query script, many more list items can be included.
+I am trying to discover what is making that difference.
+The query type is specified first: /local
+Special instructions about the output go in between brackets []
+Finally the query is written.
+
+The following is an example.
+**Paste the following query into the LightRAG server prompt box.**
+
+>/local [Only provide the name of the entity in the response. Nothing else is required.] Please examine the Entity ID for all Entities in the following python list. Then please return only the Entities with a Rank of 0. [ "2023-04-02T06:06:17Z", "2023-04-22T23:01:27Z", "Alabama", "Albury", "Biological Compartments", "Cellular Health", "Circadian Biology", "Circadian Health", "Circadian Rhythm", "Cold Thermogenesis Protocol", "Cytochrome C Oxidase", "Delta Airlines", "Deuterium Depleted Water", "Dr Jack Kruse: Light, Melanin, POMC in Human Evolution & Disease", "Dr Jack Kruse: WATER, non-native EMFs & mitochondrial basis of disease | Regenerative Health Podcast", "Dr. Anthony Chafee", "Dr. Jack Kruse", "Dr. Max Gulhane", "Eureka Moment", "Farm Tour", "Health Optimization"]
+
+**The following was the response produced**
+
+>The only Entity from your provided list with a Rank of 0 is:
+>
+>Albury
+>This entity corresponds to the location in New South Wales where Dr. Max Gulhane is based.
+
+## Query 5
+
+### LightRAG can find Type, Description, Rank, and File Path in the knowledge graph. In the previous example we searched from Entity to Entity ID to Rank in order to find orphans. In the following example we go through Entity to Entity ID to Type in order to return categories.
+
+This only works in a mix mode query.
+The query mode is specified first: /mix
+Special instructions about the output go in between brackets []
+Finally the query is written.
+
+The following is an example.
+**Paste the following query into the LightRAG server prompt box.**
+
+>/mix [Only provide the name of the entity in the response. Nothing else is required.] Please examine the Entity ID for all Entities in the following python list. Then please return only the Entities with a Type of Person. [ "2023-04-02T06:06:17Z", "2023-04-22T23:01:27Z", "Alabama", "Albury", "Biological Compartments", "Cellular Health", "Circadian Biology", "Circadian Health", "Circadian Rhythm", "Cold Thermogenesis Protocol", "Cytochrome C Oxidase", "Delta Airlines", "Deuterium Depleted Water", "Dr Jack Kruse: Light, Melanin, POMC in Human Evolution & Disease", "Dr Jack Kruse: WATER, non-native EMFs & mitochondrial basis of disease | Regenerative Health Podcast", "Dr. Anthony Chafee", "Dr. Jack Kruse", "Dr. Max Gulhane", "Eureka Moment", "Farm Tour", "Health Optimization"]
+
+**The following was the response produced**
+
+>Dr. Anthony Chafee
+Dr. Jack Kruse
+Dr. Max Gulhane
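The query conventions in the notes above (a `/mode` prefix, bracketed instructions, then the question, plus the timestamp-to-URL rule from Query 1) can be sketched in Python. This is a minimal illustration, not part of the PR: the helper names `timestamped_url` and `build_query` are hypothetical, and the regular expression assumes transcript lines always begin with a `[start > end]` pair given in seconds.

```python
import re

# Matches a leading transcript timestamp such as "[6.56 > 11.4]" and
# captures only the whole-number part of the first value.
TIMESTAMP_RE = re.compile(r"^\[(\d+)(?:\.\d+)?\s*>\s*\d+(?:\.\d+)?\]")


def timestamped_url(base_url: str, transcript_line: str) -> str:
    """Append &t=<whole seconds>s, taken from the line's first timestamp value."""
    match = TIMESTAMP_RE.match(transcript_line.strip())
    if match is None:
        raise ValueError(f"no timestamp found in: {transcript_line!r}")
    whole_seconds = int(match.group(1))
    return f"{base_url}&t={whole_seconds}s"


def build_query(mode: str, instructions: str, question: str) -> str:
    """Assemble a prompt as '/mode [instructions] question' (hypothetical helper)."""
    return f"/{mode} [{instructions}] {question}"


sample_line = (
    "[6.56 > 11.4] (Roger) the different years that in those that had COVID "
    "there were higher levels of oxidative stress."
)
print(timestamped_url("https://www.youtube.com/watch?v=Ln3WszTq0uA", sample_line))
print(
    build_query(
        "mix",
        "Use mermaid format for diagrams",
        "Please draw a character relationship diagram for Age of Mammals",
    )
)
```

The first call prints the same URL the Query 1 prompt describes (`...&t=6s`); the second reproduces the Query 3 prompt. In the notes the URL is built by the LLM from the prompt instructions, so a helper like this would only be useful for checking the links it returns.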
diff --git a/jrs/_1_custom_index_01.py b/jrs/_1_custom_index_01.py
@@ -0,0 +1,170 @@
+import os
+import asyncio
+import logging
+import logging.config
+import json
+import numpy as np
+from lightrag import LightRAG
+from lightrag.llm.openai import gpt_4o_mini_complete
+from lightrag.kg.shared_storage import initialize_pipeline_status
+from lightrag.utils import logger, set_verbose_debug, EmbeddingFunc
+from llama_index.embeddings.openai import OpenAIEmbedding
+
+# import textract
+
+# Configuration
+WORKING_DIR = "/home/js/LightRAG/jrs/work/seheult/_seheult_work_dir"
+EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-large")
+EMBEDDING_DIM = int(os.getenv("EMBEDDING_DIM", 3072))
+API_KEY = os.getenv("EMBEDDING_BINDING_API_KEY")
+MAX_TOKEN_SIZE = int(os.getenv("MAX_TOKEN_SIZE", 8192))
+
+# Files to be indexed
+files_2b_indexed = [
+    "/home/js/LightRAG/jrs/work/seheult/seheult_metadata/_bNySyEobfY_metadata.json",
+    "/home/js/LightRAG/jrs/work/seheult/seheult_metadata/0m1Qekrfs7w_metadata.json",
+]
+
+
+def configure_logging():
+    """Configure logging with console and rotating file handlers."""
+    for logger_name in ["uvicorn", "uvicorn.access", "uvicorn.error", "lightrag"]:
+        logger_instance = logging.getLogger(logger_name)
+        logger_instance.handlers = []
+        logger_instance.filters = []
+    log_dir = os.getenv("LOG_DIR", os.getcwd())
+    log_file_path = os.path.abspath(os.path.join(log_dir, "lightrag_index.log"))
+    print(f"\nLightRAG index log file: {log_file_path}\n")
+    os.makedirs(log_dir, exist_ok=True)  # create the log directory itself, not its parent
+    log_max_bytes = int(os.getenv("LOG_MAX_BYTES", 10485760))
+    log_backup_count = int(os.getenv("LOG_BACKUP_COUNT", 5))
+    logging.config.dictConfig(
+        {
+            "version": 1,
+            "disable_existing_loggers": False,
+            "formatters": {
+                "default": {"format": "%(levelname)s: %(message)s"},
+                "detailed": {
+                    "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+                },
+            },
+            "handlers": {
+                "console": {
+                    "formatter": "default",
+                    "class": "logging.StreamHandler",
+                    "stream": "ext://sys.stderr",
+                },
+                "file": {
+                    "formatter": "detailed",
+                    "class": "logging.handlers.RotatingFileHandler",
+                    "filename": log_file_path,
+                    "maxBytes": log_max_bytes,
+                    "backupCount": log_backup_count,
+                    "encoding": "utf-8",
+                },
+            },
+            "loggers": {
+                "lightrag": {
+                    "handlers": ["console", "file"],
+                    "level": "INFO",
+                    "propagate": False,
+                },
+            },
+        }
+    )
+    logger.setLevel(logging.INFO)
+    set_verbose_debug(os.getenv("VERBOSE_DEBUG", "true").lower() == "true")
+
+
+# Create the working directory if it does not exist yet
+os.makedirs(WORKING_DIR, exist_ok=True)
+
+
+async def initialize_rag():
+    """Initialize LightRAG with custom embedding function."""
+    print("Initializing LightRAG for indexing...")
+
+    # Initialize embedding model
+    embed_model = OpenAIEmbedding(
+        model=EMBEDDING_MODEL, api_key=API_KEY, dimensions=EMBEDDING_DIM
+    )
+
+    # Define async embedding function
+    async def async_embedding_func(texts):
+        # llama-index returns a list; we convert it to a numpy array for LightRAG
+        embeddings = await embed_model.aget_text_embedding_batch(texts)
+        return np.array(embeddings)
+
+    # Define embedding function
+    embedding_func = EmbeddingFunc(
+        embedding_dim=EMBEDDING_DIM,
+        max_token_size=MAX_TOKEN_SIZE,
+        func=async_embedding_func,
+    )
+
+    # Initialize LightRAG
+    rag = LightRAG(
+        working_dir=WORKING_DIR,
+        embedding_func=embedding_func,
+        llm_model_func=gpt_4o_mini_complete,
+    )
+
+    await rag.initialize_storages()
+    await initialize_pipeline_status()
+    return rag
+
+
+async def main():
+    """Main function to index documents."""
+    rag = None
+    try:
+        if not os.getenv("OPENAI_API_KEY") and not API_KEY:
+            raise ValueError(
+                "OPENAI_API_KEY or EMBEDDING_BINDING_API_KEY environment variable not set"
+            )
+        rag = await initialize_rag()
+
+        # Check which files are already indexed
+        indexed_files = set()
+        doc_status_file = os.path.join(WORKING_DIR, "kv_store_doc_status.json")
+        if os.path.exists(doc_status_file):
+            with open(doc_status_file, "r") as f:
+                docs = json.load(f)
+                indexed_files = {
+                    doc["file_path"]
+                    for doc in docs.values()
+                    if doc.get("status") == "processed" and "file_path" in doc
+                }
+            print(f"Already indexed files: {indexed_files}")
+
+        # Index new documents
+        for doc_path in files_2b_indexed:
+            if doc_path in indexed_files:
+                print(f"Skipping already indexed file: {doc_path}")
+                continue
+            print(f"Checking document at: {doc_path}")
+            if not os.path.exists(doc_path):
+                print(f"Document file not found at: {doc_path}, skipping...")
+                continue
+            print(f"Indexing document: {doc_path}...")
+
+            with open(doc_path, "r") as f:
+                docs = json.load(f)
+
+            await rag.ainsert_custom_kg(docs, full_doc_id=os.path.basename(doc_path))
+            print(f"Indexed {doc_path}")
+    except Exception as e:
+        print(f"An error occurred: {e}")
+        import traceback
+
+        traceback.print_exc()
+    finally:
+        if rag:
+            print("Finalizing storages...")
+            await rag.finalize_storages()
+
+
+if __name__ == "__main__":
+    configure_logging()
+    asyncio.run(main())
+    print("\nIndexing Done!")
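The skip-already-indexed step in `main()` can be isolated as a pure function, which makes it easy to check without touching LightRAG or the filesystem. This is a sketch under assumptions: the helper names are hypothetical, and the shape of the status data (a dict of records with `status` and `file_path` keys) is inferred only from how the script reads `kv_store_doc_status.json`; the sample records are invented for illustration.

```python
def processed_file_paths(doc_status: dict) -> set:
    """Collect file_path values of records whose status is 'processed'."""
    return {
        doc["file_path"]
        for doc in doc_status.values()
        if doc.get("status") == "processed" and "file_path" in doc
    }


def files_still_to_index(candidates: list, doc_status: dict) -> list:
    """Return candidates not yet processed, preserving their original order."""
    done = processed_file_paths(doc_status)
    return [path for path in candidates if path not in done]


# Hypothetical status records, shaped the way the script reads them.
example_status = {
    "doc-1": {"status": "processed", "file_path": "a_metadata.json"},
    "doc-2": {"status": "failed", "file_path": "b_metadata.json"},
    "doc-3": {"status": "processed"},  # no file_path, so it is ignored
}
print(
    files_still_to_index(
        ["a_metadata.json", "b_metadata.json", "c_metadata.json"], example_status
    )
)
```

Note that the comparison is an exact string match. The script compares full paths from `files_2b_indexed` but passes `os.path.basename(doc_path)` as `full_doc_id`, so it is worth verifying which form ends up in the stored `file_path` before relying on the skip.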