Commit 96c7164

Refactor dependencies in uv.lock: remove libclang and protobuf, add msgpack; update pathspec version
1 parent f1b56e5 commit 96c7164

123 files changed

+3284
-25463
lines changed

Lines changed: 345 additions & 0 deletions
@@ -0,0 +1,345 @@
1+
# LLM-Optimized Index Replacement Plan
2+
3+
## Current Architecture Analysis
4+
5+
### Actual Implementation Process
6+
1. **Project Initialization**: LLM calls `set_project_path()` to establish project root
7+
2. **File Watcher Activation**: Automatic file monitoring starts with debounced re-indexing
8+
3. **Codebase Traversal**: System scans all files using extension whitelist (SUPPORTED_EXTENSIONS)
9+
4. **Language-Specific Processing**: Different strategies for each language's unique characteristics
10+
5. **Dual Storage**: Index stored in temporary path + in-memory for fast access
11+
6. **Query Tools**: LLMs call analysis tools that use the built index
12+
13+
### SCIP-Based System Issues
14+
- **Complex Protocol**: SCIP protobuf format designed for IDEs, not LLM consumption
15+
- **Over-Engineering**: Multi-layer abstraction (strategies/factories) creates complexity
16+
- **Token Inefficiency**: Verbose SCIP format wastes LLM context tokens
17+
- **Parsing Overhead**: Complex symbol ID generation and validation
18+
- **Cross-Document Complexity**: Relationship building adds minimal LLM value
19+
20+
### Current Flow Analysis
21+
```
set_project_path() → File Watcher Activation → Codebase Traversal (Extension Whitelist) →
Language-Specific Strategies → SCIP Builder → Index Storage (Temp + Memory) →
Query Tools Access Index
```
26+
27+
### Reusable Components
28+
- **Extension Whitelist**: SUPPORTED_EXTENSIONS constant defining indexable file types
29+
- **File Watcher Service**: Robust debounced file monitoring with auto re-indexing
30+
- **Language Strategy System**: Multi-language support with unique characteristics per language
31+
- **Dual Storage Pattern**: Temporary file storage + in-memory caching for performance
32+
- **Service Architecture**: Clean 3-layer pattern (MCP → Services → Tools)
33+
- **Tree-sitter Parsing**: High-quality AST parsing for supported languages
34+
35+
## Replacement Architecture
36+
37+
### Core Principle
38+
Take a clean-slate approach: delete all SCIP components and build a simple, LLM-optimized JSON indexing system from scratch. Preserve the three-layer architecture by replacing only the tool layer.
39+
40+
### New Index Format Design
41+
42+
#### Design Rationale
43+
The index should optimize for **LLM query patterns** rather than IDE features:
44+
45+
1. **Function Tracing Focus**: LLMs primarily need to understand "what calls what"
46+
2. **Fast Lookups**: Hash-based access for instant symbol resolution
47+
3. **Minimal Redundancy**: Avoid duplicate data that wastes tokens
48+
4. **Query-Friendly Structure**: Organize data how LLMs will actually access it
49+
5. **Incremental Updates**: Support efficient file-by-file rebuilds
50+
51+
#### Multi-Language Index Format
52+
```json
{
  "metadata": {
    "project_path": "/absolute/path/to/project",
    "indexed_files": 275,
    "index_version": "1.0.0",
    "timestamp": "2025-01-15T10:30:00Z",
    "languages": ["python", "javascript", "java", "objective-c"]
  },

  "symbols": {
    "src/main.py::process_data": {
      "type": "function",
      "file": "src/main.py",
      "line": 42,
      "signature": "def process_data(items: List[str]) -> None:",
      "called_by": ["src/main.py::main"]
    },
    "src/main.py::MyClass": {
      "type": "class",
      "file": "src/main.py",
      "line": 10
    },
    "src/main.py::MyClass.process": {
      "type": "method",
      "file": "src/main.py",
      "line": 20,
      "signature": "def process(self, data: str) -> bool:",
      "called_by": ["src/main.py::process_data"]
    },
    "src/MyClass.java::com.example.MyClass": {
      "type": "class",
      "file": "src/MyClass.java",
      "line": 5,
      "package": "com.example"
    },
    "src/MyClass.java::com.example.MyClass.process": {
      "type": "method",
      "file": "src/MyClass.java",
      "line": 10,
      "signature": "public void process(String data)",
      "called_by": ["src/Main.java::com.example.Main.main"]
    },
    "src/main.js::regularFunction": {
      "type": "function",
      "file": "src/main.js",
      "line": 5,
      "signature": "function regularFunction(data)",
      "called_by": ["src/main.js::main"]
    },
    "src/main.js::MyClass.method": {
      "type": "method",
      "file": "src/main.js",
      "line": 15,
      "signature": "method(data)",
      "called_by": ["src/main.js::regularFunction"]
    }
  },

  "files": {
    "src/main.py": {
      "language": "python",
      "line_count": 150,
      "symbols": {
        "functions": ["process_data", "helper"],
        "classes": ["MyClass"]
      },
      "imports": ["os", "json", "typing"]
    },
    "src/MyClass.java": {
      "language": "java",
      "line_count": 80,
      "symbols": {
        "classes": ["MyClass"]
      },
      "package": "com.example",
      "imports": ["java.util.List", "java.io.File"]
    },
    "src/main.js": {
      "language": "javascript",
      "line_count": 120,
      "symbols": {
        "functions": ["regularFunction", "helperFunction"],
        "classes": ["MyClass"]
      },
      "imports": ["fs", "path"],
      "exports": ["regularFunction", "MyClass"]
    }
  }
}
```
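
As a rough illustration of how a query tool might read this format, the sketch below answers "who calls X?" with plain dict lookups. The helper names `direct_callers` and `transitive_callers` are hypothetical and not part of the plan; the data is a trimmed copy of the example index above.

```python
from collections import deque

# Trimmed copy of the example index above (only two symbols kept).
INDEX = {
    "symbols": {
        "src/main.py::process_data": {
            "type": "function", "file": "src/main.py", "line": 42,
            "called_by": ["src/main.py::main"],
        },
        "src/main.py::MyClass.process": {
            "type": "method", "file": "src/main.py", "line": 20,
            "called_by": ["src/main.py::process_data"],
        },
    }
}


def direct_callers(index, symbol_id):
    """Return the symbols that call symbol_id directly (hypothetical helper)."""
    entry = index["symbols"].get(symbol_id, {})
    return list(entry.get("called_by", []))


def transitive_callers(index, symbol_id):
    """Walk called_by links breadth-first, nearest callers first."""
    seen = []
    queue = deque(direct_callers(index, symbol_id))
    while queue:
        caller = queue.popleft()
        if caller in seen or caller == symbol_id:
            continue  # already visited, or a recursive call back to the start
        seen.append(caller)
        queue.extend(direct_callers(index, caller))
    return seen


callers = transitive_callers(INDEX, "src/main.py::MyClass.process")
```

Callers that are referenced but have no entry of their own (such as `src/main.py::main` in the trimmed data) are still reported; they simply contribute no further callers.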
143+
144+
#### Key Design Decisions
145+
146+
**1. Universal Qualified Symbol Names**
147+
- Use `"file::symbol"` for standalone symbols, `"file::scope.symbol"` for nested
148+
- **Why**: Eliminates name collisions across all languages, consistent naming
149+
- **LLM Benefit**: Unambiguous symbol identification with clear hierarchy
150+
151+
**2. Multi-Language Consistency**
152+
- Same symbol format for Python classes, Java packages, JavaScript exports
153+
- **Why**: Single query pattern works across all languages
154+
- **LLM Benefit**: Learn once, query any language the same way
155+
156+
**3. Called-By Only Relationships**
157+
- Track only `called_by` arrays, not `calls`
158+
- **Why**: Simpler implementation, linear build performance, focuses on usage
159+
- **LLM Benefit**: Direct answers to "where is function X used?" queries
160+
161+
**4. Language-Specific Fields**
162+
- Java: `package` field, JavaScript: `exports` array, etc.
163+
- **Why**: Preserve important language semantics without complexity
164+
- **LLM Benefit**: Access language-specific information when needed
165+
166+
**5. Simplified File Structure**
167+
- Organized `symbols` object with arrays by type (functions, classes)
168+
- **Why**: Fast file-level queries, clear organization
169+
- **LLM Benefit**: Immediate file overview showing what symbols exist
170+
171+
**6. Scope Resolution Strategy**
172+
- Python: `MyClass.method`, Java: `com.example.MyClass.method`
173+
- **Why**: Follows each language's natural naming conventions and includes the necessary context
174+
- **LLM Benefit**: Symbol names match how developers think about code
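
The naming rules in decisions 1 and 6 can be sketched as a pair of small helpers. This is only an illustration: `qualified_name` and `split_qualified_name` are hypothetical names, and normalizing backslashes to forward slashes is an assumption made here so that keys look the same on every platform.

```python
def qualified_name(file_path, symbol, scope=None):
    """Build 'file::symbol' or 'file::scope.symbol' (hypothetical helper)."""
    normalized = file_path.replace("\\", "/")  # assumption: keys use forward slashes
    local = f"{scope}.{symbol}" if scope else symbol
    return f"{normalized}::{local}"


def split_qualified_name(name):
    """Split a qualified name back into (file_path, local_name)."""
    file_path, _, local = name.partition("::")
    return file_path, local


python_function = qualified_name("src/main.py", "process_data")
python_method = qualified_name("src/main.py", "process", scope="MyClass")
java_method = qualified_name("src/MyClass.java", "process", scope="com.example.MyClass")
```

For Java, the scope carries the package plus the class name, which reproduces keys such as `src/MyClass.java::com.example.MyClass.process` from the example index.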
175+
176+
### Simplified Flow
177+
```
set_project_path() → File Watcher Activation → Extension Whitelist Traversal →
Language-Specific Simple Parsers → JSON Index Update → Dual Storage (Temp + Memory) →
Query Tools Access Optimized Index
```
182+
183+
## Implementation Plan
184+
185+
### Phase 1: Clean Slate - Remove SCIP System
186+
- **Delete all SCIP tools**: Remove `src/code_index_mcp/scip/` directory completely
187+
- **Remove protobuf dependencies**: Clean up `scip_pb2.py` and related imports
188+
- **Strip SCIP from services**: Remove SCIP references from business logic layers
189+
- **Clean constants**: Remove `SCIP_INDEX_FILE` and related SCIP constants
190+
- **Update dependencies**: Remove protobuf from `pyproject.toml`
191+
192+
### Phase 2: Tool Layer Replacement
193+
- **Keep three-layer architecture**: Only modify the tool layer, preserve services/MCP layers
194+
- **New simple index format**: Implement lightweight JSON-based indexing tools
195+
- **Language parsers**: Create simple parsers in tool layer (Python `ast`, simplified tree-sitter)
196+
- **Storage tools**: Implement dual storage tools (temp + memory) for new format
197+
- **Query tools**: Build fast lookup tools for the new index structure
198+
199+
### Phase 3: Service Layer Integration
200+
- **Minimal service changes**: Services delegate to new tools instead of SCIP tools
201+
- **Preserve business logic**: Keep existing service workflows and validation
202+
- **Maintain interfaces**: Services still expose same functionality to MCP layer
203+
- **File watcher integration**: Connect file watcher to new index rebuild tools
204+
205+
### Phase 4: MCP Layer Compatibility
206+
- **Zero MCP changes**: Existing `@mcp.tool` functions unchanged
207+
- **Same interfaces**: Tools return data in expected formats
208+
- **Backward compatibility**: Existing LLM workflows continue working
209+
- **Performance gains**: Faster responses with same functionality
210+
211+
### Phase 5: Build from Scratch Mentality
212+
- **New index design**: Simple, LLM-optimized format built fresh
213+
- **Clean codebase**: Remove all SCIP complexity and start simple
214+
- **Fresh dependencies**: Only essential libraries (no protobuf, simplified tree-sitter)
215+
- **Focused scope**: Build only what's needed for LLM use cases
216+
217+
## Technical Specifications
218+
219+
### Index Storage
220+
- **Dual Storage**: Temporary path (`%TEMP%/code_indexer/<hash>/`) + in-memory caching
221+
- **Format**: JSON data model, with msgpack binary serialization for performance
222+
- **Location**: Follow existing pattern (discoverable via constants.py)
223+
- **Extension Filtering**: Use existing SUPPORTED_EXTENSIONS whitelist
224+
- **Size**: ~10-50KB for typical projects vs ~1-5MB SCIP
225+
- **Access**: Direct dict lookups vs protobuf traversal
226+
- **File Watcher Integration**: Automatic updates when files change
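
A minimal sketch of the dual-storage pattern follows, under several assumptions: the class name `IndexStore` is hypothetical, the per-project directory is derived from a truncated SHA-256 of the project path (the plan only says `<hash>`), and the standard-library `json` module stands in for msgpack so the example has no third-party dependency. With msgpack installed, `msgpack.packb` and `msgpack.unpackb` could replace the two `json` calls, writing and reading bytes instead of text.

```python
import hashlib
import json
import tempfile
from pathlib import Path

SETTINGS_DIR = "code_indexer"  # value taken from constants.py in this commit
INDEX_FILE = "index.json"      # value taken from constants.py in this commit


class IndexStore:
    """Keep the index on disk under the temp directory and cached in memory."""

    def __init__(self, project_path, base_dir=None):
        base = Path(base_dir) if base_dir else Path(tempfile.gettempdir())
        # Assumption: the <hash> path component is a digest of the project path.
        digest = hashlib.sha256(str(project_path).encode("utf-8")).hexdigest()[:16]
        self.index_path = base / SETTINGS_DIR / digest / INDEX_FILE
        self._cache = None

    def save(self, index):
        """Write the index to disk and refresh the in-memory copy."""
        self.index_path.parent.mkdir(parents=True, exist_ok=True)
        self.index_path.write_text(json.dumps(index), encoding="utf-8")
        self._cache = index

    def load(self):
        """Return the cached index, reading it from disk on first access."""
        if self._cache is None and self.index_path.exists():
            self._cache = json.loads(self.index_path.read_text(encoding="utf-8"))
        return self._cache
```

Queries would then go through `load()`, which touches the disk at most once per process; `load()` returns `None` when no index has been built for the project yet.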
227+
228+
### Language Support
229+
- **Python**: Built-in `ast` module for optimal performance and accuracy
230+
- **JavaScript/TypeScript**: Existing tree-sitter parsers (proven reliability)
231+
- **Other Languages**: Reuse existing tree-sitter implementations
232+
- **Simplify**: Remove SCIP-specific symbol generation overhead
233+
- **Focus**: Extract symbols and `called_by` relationships only
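
For the Python case, symbol and `called_by` extraction with the built-in `ast` module might look like the sketch below. It is deliberately simplified and rests on assumptions that should be read as such: only top-level functions, top-level classes, and their methods are indexed, and calls are matched by bare name (so `worker.process(...)` is attributed to every indexed callable named `process`). The function name `index_python_source` is hypothetical.

```python
import ast


def index_python_source(file_path, source):
    """Return {qualified_name: entry} for one Python file (simplified sketch)."""
    tree = ast.parse(source)
    symbols = {}
    callables = {}  # qualified name -> AST node, reused in the call pass

    def add(node, kind, scope=None):
        local = f"{scope}.{node.name}" if scope else node.name
        name = f"{file_path}::{local}"
        symbols[name] = {"type": kind, "file": file_path, "line": node.lineno}
        if kind != "class":
            symbols[name]["called_by"] = []
            callables[name] = node

    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:
        if isinstance(node, func_types):
            add(node, "function")
        elif isinstance(node, ast.ClassDef):
            add(node, "class")
            for item in node.body:
                if isinstance(item, func_types):
                    add(item, "method", scope=node.name)

    # Assumption: match calls by bare name only; no type or import resolution.
    by_short_name = {}
    for name, node in callables.items():
        by_short_name.setdefault(node.name, []).append(name)

    for caller, node in callables.items():
        for call in ast.walk(node):
            if not isinstance(call, ast.Call):
                continue
            func = call.func
            short = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            for callee in by_short_name.get(short, []):
                if caller not in symbols[callee]["called_by"]:
                    symbols[callee]["called_by"].append(caller)
    return symbols


SOURCE = '''
class MyClass:
    def process(self, data):
        return bool(data)

def process_data(items):
    worker = MyClass()
    for item in items:
        worker.process(item)

def main():
    process_data(["a", "b"])
'''

symbols = index_python_source("src/main.py", SOURCE)
```

Each function body is walked once, which keeps the build roughly linear in source size; the price is the imprecise name matching noted above.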
234+
235+
### Query Performance
236+
- **Target**: <100ms for any query operation
237+
- **Method**: Hash-based lookups vs linear SCIP traversal
238+
- **Caching**: In-memory symbol registry for instant access
239+
240+
### File Watching
241+
- **Keep**: Existing watchdog-based file monitoring
242+
- **Optimize**: Batch incremental updates vs full rebuilds
243+
- **Debounce**: Maintain 4-6 second debounce for change batching
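
The debounce-and-batch behavior can be sketched independently of the watcher library. The class below is a hypothetical illustration, not the existing file watcher service: it collects changed paths and releases them as one batch only after a quiet period. Timestamps are passed in explicitly so the logic is easy to test; a real caller would presumably pass `time.monotonic()`.

```python
class ChangeBatcher:
    """Collect changed paths and release them after a quiet period."""

    def __init__(self, debounce_seconds=5.0):
        self.debounce_seconds = debounce_seconds
        self._pending = set()
        self._last_event = None

    def record(self, path, now):
        """Note a changed file; every new event restarts the quiet period."""
        self._pending.add(path)
        self._last_event = now

    def take_batch(self, now):
        """Return the sorted pending paths once the quiet period has elapsed, else []."""
        if not self._pending or now - self._last_event < self.debounce_seconds:
            return []
        batch = sorted(self._pending)
        self._pending.clear()
        return batch


batcher = ChangeBatcher(debounce_seconds=5.0)
batcher.record("src/main.py", now=0.0)
batcher.record("src/util.py", now=2.0)
batcher.record("src/main.py", now=2.5)   # duplicate path is collapsed
too_early = batcher.take_batch(now=4.0)  # only 1.5 s since the last event
ready = batcher.take_batch(now=8.0)      # 5.5 s of quiet
```

The returned batch is what an incremental update would re-index, instead of rebuilding the whole project for each individual event.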
244+
245+
## Migration Strategy
246+
247+
### Backward Compatibility
248+
- **Zero breaking changes**: Same MCP tool interfaces and return formats
249+
- **Preserve workflows**: File watcher, project setup, and query patterns unchanged
250+
- **Service contracts**: Business logic layer contracts remain stable
251+
- **LLM experience**: Existing LLM usage patterns continue working
252+
253+
### Rollback Plan
254+
- **Git branch strategy**: Preserve SCIP implementation in separate branch
255+
- **Incremental deployment**: Can revert individual components if needed
256+
- **Performance monitoring**: Compare old vs new system metrics
257+
- **Fallback mechanism**: Quick switch back to SCIP if issues arise
258+
259+
### Testing Strategy
260+
- Compare output accuracy between SCIP and simple index
261+
- Benchmark query performance improvements
262+
- Validate function tracing completeness
263+
- Test incremental update correctness
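
One possible way to test incremental-update correctness is to compare the incrementally maintained index with a fresh full rebuild of the same project. The helpers below are hypothetical; they assume that `metadata` (which contains a timestamp) and the ordering of `called_by` entries are allowed to differ between the two.

```python
def normalized(index):
    """Strip fields that may legitimately differ between two builds."""
    symbols = {}
    for name, entry in index.get("symbols", {}).items():
        entry = dict(entry)  # copy so the caller's index is not modified
        if "called_by" in entry:
            entry["called_by"] = sorted(entry["called_by"])
        symbols[name] = entry
    return {"symbols": symbols, "files": index.get("files", {})}


def indexes_equivalent(incremental, full_rebuild):
    """True when both indexes describe the same symbols and files."""
    return normalized(incremental) == normalized(full_rebuild)


incremental = {
    "metadata": {"timestamp": "2025-01-15T10:30:00Z"},
    "symbols": {"a.py::f": {"type": "function", "file": "a.py", "line": 1,
                            "called_by": ["a.py::h", "a.py::g"]}},
    "files": {"a.py": {"language": "python"}},
}
full_rebuild = {
    "metadata": {"timestamp": "2025-01-15T10:31:00Z"},
    "symbols": {"a.py::f": {"type": "function", "file": "a.py", "line": 1,
                            "called_by": ["a.py::g", "a.py::h"]}},
    "files": {"a.py": {"language": "python"}},
}
same = indexes_equivalent(incremental, full_rebuild)
```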
264+
265+
## Expected Benefits
266+
267+
### Performance Improvements
268+
- **Index Build**: 5-10x faster (no protobuf, no complex call analysis)
269+
- **Query Speed**: 10-100x faster (direct hash lookups)
270+
- **Memory Usage**: 80% reduction (simple JSON vs protobuf)
271+
- **Build Complexity**: Linear O(n) vs complex relationship resolution
272+
273+
### Maintenance Benefits
274+
- **Code Complexity**: 70% reduction (remove entire SCIP system)
275+
- **Dependencies**: Remove protobuf, simplify tree-sitter usage
276+
- **Debugging**: Human-readable JSON vs binary protobuf
277+
- **Call Analysis**: Simple `called_by` tracking vs complex call graph building
278+
279+
### LLM Integration Benefits
280+
- **Fast Responses**: Sub-100ms query times for any symbol lookup
281+
- **Token Efficiency**: Qualified names eliminate ambiguity
282+
- **Simple Format**: Direct JSON access patterns
283+
- **Focused Data**: Only essential information for code understanding
284+
285+
## Risk Mitigation
286+
287+
### Functionality Loss
288+
- **Risk**: Missing advanced SCIP features
289+
- **Mitigation**: Focus on core LLM use cases (function tracing)
290+
- **Validation**: Compare query completeness with existing system
291+
292+
### Performance Regression
293+
- **Risk**: New implementation slower than expected
294+
- **Mitigation**: Benchmark against SCIP at each phase
295+
- **Fallback**: Maintain SCIP implementation as backup
296+
297+
### Migration Complexity
298+
- **Risk**: Difficult transition from SCIP
299+
- **Mitigation**: Phased rollout with feature flags
300+
- **Safety**: Comprehensive testing before production use
301+
302+
## Success Metrics
303+
304+
### Performance Targets
305+
- Index build time: <5 seconds for 1000 files
306+
- Query response time: <100ms for any operation
307+
- Memory usage: <50MB for typical projects
308+
- Token efficiency: 90% reduction in LLM context usage
309+
310+
### Quality Targets
311+
- Function detection accuracy: >95% vs SCIP
312+
- Call chain completeness: >90% vs SCIP
313+
- Incremental update correctness: 100%
314+
- File watcher reliability: Zero missed changes
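
The two "vs SCIP" targets could be measured as recall against a baseline. The sketch below assumes the SCIP results have already been converted into the same `symbols` shape as the new index; that conversion is not described in the plan, and the helper names are hypothetical.

```python
def recall(reference, candidate):
    """Fraction of reference items that also appear in candidate."""
    reference = set(reference)
    if not reference:
        return 1.0  # nothing to find, so nothing was missed
    return len(reference & set(candidate)) / len(reference)


def call_edges(index):
    """Flatten called_by lists into a set of (callee, caller) pairs."""
    return {
        (callee, caller)
        for callee, entry in index["symbols"].items()
        for caller in entry.get("called_by", [])
    }


# Toy data standing in for a SCIP-derived baseline and the new index.
baseline = {"symbols": {
    "a.py::f": {"called_by": ["a.py::g", "a.py::h"]},
    "a.py::g": {"called_by": ["a.py::h"]},
    "a.py::h": {"called_by": []},
    "a.py::k": {"called_by": ["a.py::h"]},
}}
new_index = {"symbols": {
    "a.py::f": {"called_by": ["a.py::g"]},
    "a.py::g": {"called_by": ["a.py::h"]},
    "a.py::h": {"called_by": []},
}}

function_detection = recall(baseline["symbols"], new_index["symbols"])
call_chain_completeness = recall(call_edges(baseline), call_edges(new_index))
```

On this toy data the new index finds three of four symbols and two of four call edges, so it would miss both targets.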
315+
316+
## Implementation Timeline
317+
318+
### Week 1-2: Foundation
319+
- Core index structure and storage
320+
- Basic JSON schema implementation
321+
- Simple parser extraction from existing code
322+
323+
### Week 3-4: Language Integration
324+
- Tree-sitter parser simplification
325+
- Multi-language symbol extraction
326+
- Function call relationship building
327+
328+
### Week 5-6: MCP Tools
329+
- LLM-optimized tool implementation
330+
- Performance optimization
331+
- Query response formatting
332+
333+
### Week 7-8: Integration and Testing
334+
- File watcher integration
335+
- Comprehensive testing
336+
- Migration tooling
337+
338+
### Week 9-10: Production Deployment
339+
- Feature flag rollout
340+
- Performance monitoring
341+
- SCIP deprecation planning
342+
343+
## Conclusion
344+
345+
This replacement plan transforms code-index-mcp from a complex SCIP-based system into a lean, LLM-optimized indexing solution. By focusing on the core use cases of function tracing and rapid codebase understanding, it targets significant performance improvements while keeping the functionality that LLM workflows depend on. The simplified architecture should also reduce the maintenance burden and enable faster iteration on LLM-specific features.

pyproject.toml

Lines changed: 1 addition & 2 deletions
@@ -15,14 +15,13 @@ authors = [
 dependencies = [
     "mcp>=0.3.0",
     "watchdog>=3.0.0",
-    "protobuf>=4.21.0",
     "tree-sitter>=0.20.0",
     "tree-sitter-javascript>=0.20.0",
     "tree-sitter-typescript>=0.20.0",
     "tree-sitter-java>=0.20.0",
     "tree-sitter-zig>=0.20.0",
     "pathspec>=0.12.1",
-    "libclang>=16.0.0",
+    "msgpack>=1.0.0",
 ]
 
 [project.urls]

src/code_index_mcp/constants.py

Lines changed: 1 addition & 4 deletions
@@ -5,10 +5,7 @@
 # Directory and file names
 SETTINGS_DIR = "code_indexer"
 CONFIG_FILE = "config.json"
-SCIP_INDEX_FILE = "index.scip" # SCIP protobuf binary file
-# Legacy files
-INDEX_FILE = "index.json" # Legacy JSON index file (to be removed)
-# CACHE_FILE removed - no longer needed with new indexing system
+INDEX_FILE = "index.json" # JSON index file
 
 # Supported file extensions for code analysis
 # This is the authoritative list used by both old and new indexing systems
