Commit 74af02f

RecoDemo and claude committed
Add persistent disk cache, bump to 0.5.0
Save index to .codebase-index-cache.pkl after every build. On startup, load from cache if git ref matches (instant) or incrementally update if ≤20 files changed. Eliminates cold-start penalty on server restarts, context compaction, and new sessions.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
1 parent 64fac2d commit 74af02f
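
The startup behaviour described in the commit message can be sketched as a small pure function. This is an illustration only, not the code from `server.py`: the name `choose_startup_action`, its parameters, and the string labels it returns are hypothetical. Only the three outcomes and the ≤20 threshold come from the commit.

```python
from typing import Optional


def choose_startup_action(
    cached_ref: Optional[str],
    head_ref: str,
    changed_files: int,
    max_changes: int = 20,  # threshold stated in the commit message
) -> str:
    """Pick how to initialise the index on startup (hypothetical helper)."""
    if not cached_ref:
        # No usable cache: file missing, version mismatch, or no stored git ref.
        return "full_rebuild"
    if cached_ref == head_ref:
        # Cached ref equals HEAD: reuse the cached index unchanged.
        return "use_cache"
    if 0 < changed_files <= max_changes:
        # Small changeset: load the cache, then update it incrementally.
        return "incremental_update"
    # Large changeset, or refs differ with no file changes reported.
    return "full_rebuild"


if __name__ == "__main__":
    print(choose_startup_action("abc123", "abc123", 0))
    print(choose_startup_action("abc123", "def456", 7))
    print(choose_startup_action("abc123", "def456", 250))
```

As in the diff, a differing ref with an empty changeset falls through to a full rebuild in this sketch, because the incremental branch requires at least one changed file.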

4 files changed

Lines changed: 144 additions & 4 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -52,6 +52,9 @@ htmlcov/
 # ── Claude Code local state ───────────────────
 .claude/
 
+# ── Codebase index cache ─────────────────────
+.codebase-index-cache.pkl
+
 # ── Misc ───────────────────────────────────────
 *.log
 *.bak

README.md

Lines changed: 48 additions & 1 deletion
@@ -16,6 +16,8 @@ Indexes codebases by parsing source files into structural metadata -- functions,
 
 **Automatic incremental re-indexing:** In git repositories, the index stays up to date automatically. Before every query, the server checks `git diff` and `git status` (~1-2ms). If files changed, only those files are re-parsed and the dependency graph is rebuilt. No need to manually call `reindex` after edits, branch switches, or pulls.
 
+**Persistent disk cache:** The index is saved to a pickle cache file (`.codebase-index-cache.pkl`) after every build. On subsequent server starts, the cache is loaded and validated against the current git HEAD — if the ref matches, startup is instant. If a small number of files changed (≤20), the cached index is loaded and incrementally updated instead of rebuilt from scratch. This eliminates the cold-start penalty when restarting Claude Code sessions, restarting the MCP server, or resuming work after context compaction.
+
 ## Language Support
 
 | Language | Method | Extracts |
@@ -55,6 +57,16 @@ PROJECT_ROOT=/path/to/project python -m mcp_codebase_index.server
 
 `PROJECT_ROOT` specifies which directory to index. Defaults to the current working directory.
 
+### Persistent Cache
+
+In git repositories, the server automatically caches the index to `.codebase-index-cache.pkl` in the project root. On startup:
+
+1. **Cache hit (exact match):** If the cached git ref matches the current HEAD, the index loads instantly from disk — no parsing, no file walking.
+2. **Cache hit (small changeset):** If ≤20 files changed since the cached ref, the cached index is loaded and incrementally updated on the first query.
+3. **Cache miss:** If the changeset is large or no cache exists, a full rebuild runs and saves a new cache.
+
+Add `.codebase-index-cache.pkl` to your `.gitignore` — it's a local-only build artifact.
+
 ### Configuring with OpenClaw
 
 Install the package on the machine where OpenClaw is running:
@@ -99,7 +111,7 @@ openclaw mcp list
 
 All 18 tools will be available to your agent.
 
-**Performance note:** The server automatically detects file changes via `git diff` before every query (~1-2ms) and incrementally re-indexes only what changed. However, OpenClaw's default MCP integration via mcporter spawns a fresh server process per tool call, which discards the in-memory index and forces a full rebuild each time (~1-2s for small projects, longer for large ones). This is a mcporter process lifecycle limitation, not a server limitation. For persistent connections, use the [openclaw-mcp-adapter](https://github.com/androidStern-personal/openclaw-mcp-adapter) plugin, which connects once at startup and keeps the server running:
+**Performance note:** The server automatically detects file changes via `git diff` before every query (~1-2ms) and incrementally re-indexes only what changed. However, OpenClaw's default MCP integration via mcporter spawns a fresh server process per tool call, which discards the in-memory index and forces a full rebuild each time (~1-2s for small projects, longer for large ones). With persistent caching, these cold starts are now significantly faster — the server loads from the disk cache instead of re-parsing the entire codebase. For persistent connections (avoiding even the cache load overhead), use the [openclaw-mcp-adapter](https://github.com/androidStern-personal/openclaw-mcp-adapter) plugin, which connects once at startup and keeps the server running:
 
 ```bash
 pip install openclaw-mcp-adapter
@@ -138,6 +150,39 @@ Or using the Python module directly (useful if installed in a virtualenv):
 }
 ```
 
+#### Reinforcing Tool Usage with Hooks
+
+Claude Code tends to default to built-in Glob/Grep/Read tools even when codebase-index is available. In addition to CLAUDE.md instructions (see below), you can add hooks that fire on every prompt to reinforce the behavior. Add this to `.claude/settings.local.json`:
+
+```json
+{
+  "hooks": {
+    "SessionStart": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "echo 'CRITICAL REMINDER: Use codebase-index MCP tools FIRST for ALL code navigation (find_symbol, get_function_source, search_codebase, get_dependencies, etc). Only fall back to Glob/Grep/Read for non-code files.'"
+          }
+        ]
+      }
+    ],
+    "UserPromptSubmit": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "echo 'Use codebase-index MCP tools first for code navigation.'"
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+Hook stdout is injected as context Claude sees before responding. `SessionStart` fires on startup, resume, and context compaction. `UserPromptSubmit` fires on every turn.
+
 ### Important: Make the AI Actually Use Indexed Tools
 
 By default, AI assistants will ignore the indexed tools and fall back to reading entire files with Glob/Grep/Read. Soft language like "prefer" gets rationalized away. Add this to your project's `CLAUDE.md` (or equivalent instructions file) with **mandatory** language:
@@ -193,6 +238,8 @@ Tested across four real-world projects on an M-series MacBook Pro, from a small
 | Django | 3,714 | 707,493 | 29,995 | 7,371 | 36.2s | 126 MB |
 | **CPython** | **2,464** | **1,115,334** | **59,620** | **9,037** | **55.9s** | **197 MB** |
 
+With persistent caching, subsequent startups bypass the full build entirely. Cache load time is negligible compared to parsing — a cache hit on CPython restores the full index in under a second instead of 56s.
+
 ### Query Response Size vs Total Source
 
 Querying CPython — 41 million characters of source code:
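
The README text in this diff says the server inspects `git diff` and `git status` to find changed files, and `server.py` reads `modified`, `added`, `deleted`, and `is_empty` from the resulting changeset. The sketch below shows one way such an object could be filled from the output of `git diff --name-status <ref>`. It is an assumption-laden illustration: the `ChangeSet` class, the `parse_name_status` function, and the choice of `--name-status` are hypothetical, since the project's real `get_changed_files` is not part of this commit.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeSet:
    """Hypothetical mirror of the changeset object read in server.py."""

    modified: list = field(default_factory=list)
    added: list = field(default_factory=list)
    deleted: list = field(default_factory=list)

    @property
    def is_empty(self) -> bool:
        return not (self.modified or self.added or self.deleted)

    @property
    def total(self) -> int:
        return len(self.modified) + len(self.added) + len(self.deleted)


def parse_name_status(output: str) -> ChangeSet:
    """Parse text in the tab-separated format of `git diff --name-status`."""
    changes = ChangeSet()
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status = parts[0]
        if status.startswith("R") and len(parts) >= 3:
            # Rename: the old path goes away and the new path appears.
            changes.deleted.append(parts[1])
            changes.added.append(parts[2])
        elif status.startswith("A"):
            changes.added.append(parts[1])
        elif status.startswith("D"):
            changes.deleted.append(parts[1])
        else:
            # M (modified), T (type change), and anything else.
            changes.modified.append(parts[1])
    return changes


if __name__ == "__main__":
    sample = "M\tsrc/a.py\nA\tsrc/b.py\nD\told.py\nR100\tx.py\ty.py\n"
    result = parse_name_status(sample)
    print(result.total, result.is_empty)
```

Counting a rename as one deletion plus one addition is a design choice of this sketch; it makes a rename weigh twice against the 20-file threshold.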

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "mcp-codebase-index"
-version = "0.4.6"
+version = "0.5.0"
 description = "Structural codebase indexer with MCP server for AI-assisted development"
 requires-python = ">=3.11"
 readme = "README.md"

src/mcp_codebase_index/server.py

Lines changed: 92 additions & 2 deletions
@@ -32,6 +32,7 @@
 import json
 import os
 import sys
+import pickle
 import time
 import traceback
 

@@ -55,6 +56,10 @@
 _query_fns: dict | None = None
 _is_git: bool = False
 
+# Persistent cache
+_CACHE_FILENAME = ".codebase-index-cache.pkl"
+_CACHE_VERSION = 1  # Bump when ProjectIndex schema changes
+
 # Session usage stats
 _session_start: float = time.time()
 _tool_call_counts: dict[str, int] = {}
@@ -156,33 +161,116 @@ def _format_duration(seconds: float) -> str:
     return f"{hours}h {mins}m"
 
 
+def _cache_path(project_root: str) -> str:
+    """Return the path to the pickle cache file for this project."""
+    return os.path.join(project_root, _CACHE_FILENAME)
+
+
+def _save_cache(index: "ProjectIndex") -> None:
+    """Persist the project index to a pickle cache file."""
+    try:
+        root = index.root_path
+        path = _cache_path(root)
+        payload = {"version": _CACHE_VERSION, "index": index}
+        with open(path, "wb") as f:
+            pickle.dump(payload, f, protocol=pickle.HIGHEST_PROTOCOL)
+        print(f"[mcp-codebase-index] Cache saved → {path}", file=sys.stderr)
+    except Exception as exc:
+        print(f"[mcp-codebase-index] Cache save failed: {exc}", file=sys.stderr)
+
+
+def _load_cache(project_root: str) -> "ProjectIndex | None":
+    """Load a cached project index if it exists and is compatible."""
+    path = _cache_path(project_root)
+    if not os.path.exists(path):
+        return None
+    try:
+        with open(path, "rb") as f:
+            payload = pickle.load(f)
+        if not isinstance(payload, dict) or payload.get("version") != _CACHE_VERSION:
+            print("[mcp-codebase-index] Cache version mismatch, ignoring", file=sys.stderr)
+            return None
+        index = payload["index"]
+        from mcp_codebase_index.models import ProjectIndex as PI
+        if not isinstance(index, PI):
+            return None
+        return index
+    except Exception as exc:
+        print(f"[mcp-codebase-index] Cache load failed: {exc}", file=sys.stderr)
+        return None
+
+
 def _ensure_index() -> None:
     """Build the project index on first use (lazy initialization).
 
+    Tries to load from a pickle cache first. If the cache is valid and
+    the git ref matches (or the changeset is small enough for incremental
+    update), skips a full rebuild.
+
     This is called on the first tool call rather than at startup so that
     the MCP server can complete its initialization handshake immediately.
     Without this, large projects would cause Claude Code to timeout waiting
     for the server to become ready.
     """
+    global _project_root, _indexer, _query_fns, _is_git
+
     if _indexer is not None:
         return
+
+    _project_root = os.environ.get("PROJECT_ROOT", os.getcwd())
+    _is_git = is_git_repo(_project_root)
+
+    cached_index = _load_cache(_project_root)
+    if cached_index is not None and _is_git and cached_index.last_indexed_git_ref:
+        current_head = get_head_commit(_project_root)
+        if current_head == cached_index.last_indexed_git_ref:
+            # Exact match — use cache directly
+            print("[mcp-codebase-index] Cache hit (git ref matches)", file=sys.stderr)
+            _indexer = ProjectIndexer(_project_root)
+            _indexer._project_index = cached_index
+            _query_fns = create_project_query_functions(cached_index)
+            return
+
+        # Check if changeset is small enough for incremental update on cache
+        changeset = get_changed_files(_project_root, cached_index.last_indexed_git_ref)
+        total_changes = len(changeset.modified) + len(changeset.added) + len(changeset.deleted)
+        if not changeset.is_empty and total_changes <= 20:
+            print(
+                f"[mcp-codebase-index] Cache hit with {total_changes} changed files, "
+                f"applying incremental update",
+                file=sys.stderr,
+            )
+            _indexer = ProjectIndexer(_project_root)
+            _indexer._project_index = cached_index
+            _query_fns = create_project_query_functions(cached_index)
+            # _maybe_incremental_update will handle the rest on first tool call
+            return
+
+        print(
+            f"[mcp-codebase-index] Cache stale ({total_changes} changes), full rebuild",
+            file=sys.stderr,
+        )
+
     _build_index()
 
 
 def _build_index() -> None:
     """Build (or rebuild) the project index and query functions."""
     global _project_root, _indexer, _query_fns, _is_git
 
-    _project_root = os.environ.get("PROJECT_ROOT", os.getcwd())
+    if not _project_root:
+        _project_root = os.environ.get("PROJECT_ROOT", os.getcwd())
     print(f"[mcp-codebase-index] Indexing project: {_project_root}", file=sys.stderr)
 
     _indexer = ProjectIndexer(_project_root)
     index = _indexer.index()
     _query_fns = create_project_query_functions(index)
 
-    _is_git = is_git_repo(_project_root)
+    if not _is_git:
+        _is_git = is_git_repo(_project_root)
     if _is_git:
         index.last_indexed_git_ref = get_head_commit(_project_root)
+        _save_cache(index)
 
     print(
         f"[mcp-codebase-index] Indexed {index.total_files} files, "
@@ -256,6 +344,8 @@ def _maybe_incremental_update() -> None:
         file=sys.stderr,
     )
 
+    _save_cache(idx)
+
 
 # ---------------------------------------------------------------------------
 # Tool definitions
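
The versioned-payload pattern used by `_save_cache` and `_load_cache` can be tried in isolation. The following is a minimal sketch under stated assumptions, not the project's code: a plain `dict` stands in for `ProjectIndex`, and the names `save_cache`, `load_cache`, and `demo` are hypothetical. It shows that a missing file and a version mismatch both yield `None`, while a matching version returns the stored object.

```python
import os
import pickle
import tempfile

CACHE_VERSION = 1  # plays the role of _CACHE_VERSION in the diff


def save_cache(index: dict, path: str, version: int = CACHE_VERSION) -> None:
    """Write the index wrapped in a payload that records the schema version."""
    payload = {"version": version, "index": index}
    with open(path, "wb") as f:
        pickle.dump(payload, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_cache(path: str, expected_version: int = CACHE_VERSION):
    """Return the cached index, or None if it is missing, unreadable, or stale."""
    if not os.path.exists(path):
        return None
    try:
        with open(path, "rb") as f:
            payload = pickle.load(f)
    except Exception:
        return None
    if not isinstance(payload, dict) or payload.get("version") != expected_version:
        return None
    index = payload.get("index")
    # A dict stands in for ProjectIndex here (assumption for this sketch).
    return index if isinstance(index, dict) else None


def demo() -> tuple:
    """Exercise the three cases: no file, matching version, bumped version."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, ".codebase-index-cache.pkl")
        before = load_cache(path)
        save_cache({"root_path": tmp, "files": ["a.py"]}, path)
        hit = load_cache(path)
        stale = load_cache(path, expected_version=CACHE_VERSION + 1)
    return before, hit, stale


if __name__ == "__main__":
    print(demo())
```

Storing the version next to the object lets a schema change invalidate old cache files by bumping a single constant, which is the role the diff's comment assigns to `_CACHE_VERSION`.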
