johnhuang316
diff --git a/‎LLM_OPTIMIZED_INDEX_REPLACEMENT_PLAN.md‎
Lines changed: 345 additions & 0 deletions b/‎LLM_OPTIMIZED_INDEX_REPLACEMENT_PLAN.md‎
Lines changed: 345 additions & 0 deletions
diff --git a/‎pyproject.toml‎
Lines changed: 1 addition & 2 deletions b/‎pyproject.toml‎
Lines changed: 1 addition & 2 deletions
diff --git a/‎src/code_index_mcp/constants.py‎
Lines changed: 1 addition & 4 deletions b/‎src/code_index_mcp/constants.py‎
Lines changed: 1 addition & 4 deletions
@@ -0,0 +1,345 @@
+# LLM-Optimized Index Replacement Plan
+
+## Current Architecture Analysis
+
+### Actual Implementation Process
+1. **Project Initialization**: LLM calls `set_project_path()` to establish project root
+2. **File Watcher Activation**: Automatic file monitoring starts with debounced re-indexing
+3. **Codebase Traversal**: System scans all files using extension whitelist (SUPPORTED_EXTENSIONS)
+4. **Language-Specific Processing**: Different strategies for each language's unique characteristics
+5. **Dual Storage**: Index stored in temporary path + in-memory for fast access
+6. **Query Tools**: LLMs call analysis tools that use the built index
+
+### SCIP-Based System Issues
+- **Complex Protocol**: SCIP protobuf format designed for IDEs, not LLM consumption
+- **Over-Engineering**: Multi-layer abstraction (strategies/factories) creates complexity
+- **Token Inefficiency**: Verbose SCIP format wastes LLM context tokens
+- **Parsing Overhead**: Complex symbol ID generation and validation
+- **Cross-Document Complexity**: Relationship building adds minimal LLM value
+
+### Current Flow Analysis
+```
+set_project_path() → File Watcher Activation → Codebase Traversal (Extension Whitelist) → 
+Language-Specific Strategies → SCIP Builder → Index Storage (Temp + Memory) → 
+Query Tools Access Index
+```
+
+### Reusable Components
+- **Extension Whitelist**: SUPPORTED_EXTENSIONS constant defining indexable file types
+- **File Watcher Service**: Robust debounced file monitoring with auto re-indexing
+- **Language Strategy System**: Multi-language support with unique characteristics per language
+- **Dual Storage Pattern**: Temporary file storage + in-memory caching for performance
+- **Service Architecture**: Clean 3-layer pattern (MCP → Services → Tools)
+- **Tree-sitter Parsing**: High-quality AST parsing for supported languages
+
+## Replacement Architecture
+
+### Core Principle
+Clean slate approach: Delete all SCIP components and build simple, LLM-optimized JSON indexing system from scratch. Preserve three-layer architecture by only replacing the tool layer.
+
+### New Index Format Design
+
+#### Design Rationale
+The index should optimize for **LLM query patterns** rather than IDE features:
+
+1. **Function Tracing Focus**: LLMs primarily need to understand "what calls what"
+2. **Fast Lookups**: Hash-based access for instant symbol resolution
+3. **Minimal Redundancy**: Avoid duplicate data that wastes tokens
+4. **Query-Friendly Structure**: Organize data how LLMs will actually access it
+5. **Incremental Updates**: Support efficient file-by-file rebuilds
+
+#### Multi-Language Index Format
+```json
+{
+  "metadata": {
+    "project_path": "/absolute/path/to/project",
+    "indexed_files": 275,
+    "index_version": "1.0.0",
+    "timestamp": "2025-01-15T10:30:00Z",
+    "languages": ["python", "javascript", "java", "objective-c"]
+  },
+  
+  "symbols": {
+    "src/main.py::process_data": {
+      "type": "function",
+      "file": "src/main.py",
+      "line": 42,
+      "signature": "def process_data(items: List[str]) -> None:",
+      "called_by": ["src/main.py::main"]
+    },
+    "src/main.py::MyClass": {
+      "type": "class", 
+      "file": "src/main.py",
+      "line": 10
+    },
+    "src/main.py::MyClass.process": {
+      "type": "method",
+      "file": "src/main.py",
+      "line": 20,
+      "signature": "def process(self, data: str) -> bool:",
+      "called_by": ["src/main.py::process_data"]
+    },
+    "src/MyClass.java::com.example.MyClass": {
+      "type": "class",
+      "file": "src/MyClass.java", 
+      "line": 5,
+      "package": "com.example"
+    },
+    "src/MyClass.java::com.example.MyClass.process": {
+      "type": "method",
+      "file": "src/MyClass.java",
+      "line": 10,
+      "signature": "public void process(String data)",
+      "called_by": ["src/Main.java::com.example.Main.main"]
+    },
+    "src/main.js::regularFunction": {
+      "type": "function",
+      "file": "src/main.js",
+      "line": 5,
+      "signature": "function regularFunction(data)",
+      "called_by": ["src/main.js::main"]
+    },
+    "src/main.js::MyClass.method": {
+      "type": "method",
+      "file": "src/main.js", 
+      "line": 15,
+      "signature": "method(data)",
+      "called_by": ["src/main.js::regularFunction"]
+    }
+  },
+  
+  "files": {
+    "src/main.py": {
+      "language": "python",
+      "line_count": 150,
+      "symbols": {
+        "functions": ["process_data", "helper"],
+        "classes": ["MyClass"]
+      },
+      "imports": ["os", "json", "typing"]
+    },
+    "src/MyClass.java": {
+      "language": "java", 
+      "line_count": 80,
+      "symbols": {
+        "classes": ["MyClass"]
+      },
+      "package": "com.example",
+      "imports": ["java.util.List", "java.io.File"]
+    },
+    "src/main.js": {
+      "language": "javascript",
+      "line_count": 120,
+      "symbols": {
+        "functions": ["regularFunction", "helperFunction"],
+        "classes": ["MyClass"]
+      },
+      "imports": ["fs", "path"],
+      "exports": ["regularFunction", "MyClass"]
+    }
+  }
+}
+```
+
+#### Key Design Decisions
+
+**1. Universal Qualified Symbol Names**
+- Use `"file::symbol"` for standalone symbols, `"file::scope.symbol"` for nested
+- **Why**: Eliminates name collisions across all languages, consistent naming
+- **LLM Benefit**: Unambiguous symbol identification with clear hierarchy
+
+**2. Multi-Language Consistency**
+- Same symbol format for Python classes, Java packages, JavaScript exports
+- **Why**: Single query pattern works across all languages
+- **LLM Benefit**: Learn once, query any language the same way
+
+**3. Called-By Only Relationships**
+- Track only `called_by` arrays, not `calls`
+- **Why**: Simpler implementation, linear build performance, focuses on usage
+- **LLM Benefit**: Direct answers to "where is function X used?" queries
+
+**4. Language-Specific Fields**
+- Java: `package` field, JavaScript: `exports` array, etc.
+- **Why**: Preserve important language semantics without complexity
+- **LLM Benefit**: Access language-specific information when needed
+
+**5. Simplified File Structure**
+- Organized `symbols` object with arrays by type (functions, classes)
+- **Why**: Fast file-level queries, clear organization
+- **LLM Benefit**: Immediate file overview showing what symbols exist
+
+**6. Scope Resolution Strategy**
+- Python: `MyClass.method`, Java: `com.example.MyClass.method`
+- **Why**: Natural language patterns, includes necessary context
+- **LLM Benefit**: Symbol names match how developers think about code
+
+### Simplified Flow
+```
+set_project_path() → File Watcher Activation → Extension Whitelist Traversal → 
+Language-Specific Simple Parsers → JSON Index Update → Dual Storage (Temp + Memory) →
+Query Tools Access Optimized Index
+```
+
+## Implementation Plan
+
+### Phase 1: Clean Slate - Remove SCIP System
+- **Delete all SCIP tools**: Remove `src/code_index_mcp/scip/` directory completely
+- **Remove protobuf dependencies**: Clean up `scip_pb2.py` and related imports
+- **Strip SCIP from services**: Remove SCIP references from business logic layers
+- **Clean constants**: Remove `SCIP_INDEX_FILE` and related SCIP constants
+- **Update dependencies**: Remove protobuf from `pyproject.toml`
+
+### Phase 2: Tool Layer Replacement
+- **Keep three-layer architecture**: Only modify the tool layer, preserve services/MCP layers
+- **New simple index format**: Implement lightweight JSON-based indexing tools
+- **Language parsers**: Create simple parsers in tool layer (Python `ast`, simplified tree-sitter)
+- **Storage tools**: Implement dual storage tools (temp + memory) for new format
+- **Query tools**: Build fast lookup tools for the new index structure
+
+### Phase 3: Service Layer Integration
+- **Minimal service changes**: Services delegate to new tools instead of SCIP tools
+- **Preserve business logic**: Keep existing service workflows and validation
+- **Maintain interfaces**: Services still expose same functionality to MCP layer
+- **File watcher integration**: Connect file watcher to new index rebuild tools
+
+### Phase 4: MCP Layer Compatibility
+- **Zero MCP changes**: Existing `@mcp.tool` functions unchanged
+- **Same interfaces**: Tools return data in expected formats
+- **Backward compatibility**: Existing LLM workflows continue working
+- **Performance gains**: Faster responses with same functionality
+
+### Phase 5: Build from Scratch Mentality
+- **New index design**: Simple, LLM-optimized format built fresh
+- **Clean codebase**: Remove all SCIP complexity and start simple
+- **Fresh dependencies**: Only essential libraries (no protobuf, simplified tree-sitter)
+- **Focused scope**: Build only what's needed for LLM use cases
+
+## Technical Specifications
+
+### Index Storage
+- **Dual Storage**: Temporary path (`%TEMP%/code_indexer/<hash>/`) + in-memory caching
+- **Format**: JSON with msgpack binary serialization for performance
+- **Location**: Follow existing pattern (discoverable via constants.py)
+- **Extension Filtering**: Use existing SUPPORTED_EXTENSIONS whitelist
+- **Size**: ~10-50KB for typical projects vs ~1-5MB SCIP
+- **Access**: Direct dict lookups vs protobuf traversal
+- **File Watcher Integration**: Automatic updates when files change
+
+### Language Support
+- **Python**: Built-in `ast` module for optimal performance and accuracy
+- **JavaScript/TypeScript**: Existing tree-sitter parsers (proven reliability)
+- **Other Languages**: Reuse existing tree-sitter implementations
+- **Simplify**: Remove SCIP-specific symbol generation overhead
+- **Focus**: Extract symbols and `called_by` relationships only
+
+### Query Performance
+- **Target**: <100ms for any query operation
+- **Method**: Hash-based lookups vs linear SCIP traversal
+- **Caching**: In-memory symbol registry for instant access
+
+### File Watching
+- **Keep**: Existing watchdog-based file monitoring
+- **Optimize**: Batch incremental updates vs full rebuilds
+- **Debounce**: Maintain 4-6 second debounce for change batching
+
+## Migration Strategy
+
+### Backward Compatibility
+- **Zero breaking changes**: Same MCP tool interfaces and return formats
+- **Preserve workflows**: File watcher, project setup, and query patterns unchanged
+- **Service contracts**: Business logic layer contracts remain stable
+- **LLM experience**: Existing LLM usage patterns continue working
+
+### Rollback Plan
+- **Git branch strategy**: Preserve SCIP implementation in separate branch
+- **Incremental deployment**: Can revert individual components if needed
+- **Performance monitoring**: Compare old vs new system metrics
+- **Fallback mechanism**: Quick switch back to SCIP if issues arise
+
+### Testing Strategy
+- Compare output accuracy between SCIP and simple index
+- Benchmark query performance improvements
+- Validate function tracing completeness
+- Test incremental update correctness
+
+## Expected Benefits
+
+### Performance Improvements
+- **Index Build**: 5-10x faster (no protobuf, no complex call analysis)
+- **Query Speed**: 10-100x faster (direct hash lookups)
+- **Memory Usage**: 80% reduction (simple JSON vs protobuf)
+- **Build Complexity**: Linear O(n) vs complex relationship resolution
+
+### Maintenance Benefits
+- **Code Complexity**: 70% reduction (remove entire SCIP system)
+- **Dependencies**: Remove protobuf, simplify tree-sitter usage
+- **Debugging**: Human-readable JSON vs binary protobuf
+- **Call Analysis**: Simple `called_by` tracking vs complex call graph building
+
+### LLM Integration Benefits
+- **Fast Responses**: Sub-100ms query times for any symbol lookup
+- **Token Efficiency**: Qualified names eliminate ambiguity
+- **Simple Format**: Direct JSON access patterns
+- **Focused Data**: Only essential information for code understanding
+
+## Risk Mitigation
+
+### Functionality Loss
+- **Risk**: Missing advanced SCIP features
+- **Mitigation**: Focus on core LLM use cases (function tracing)
+- **Validation**: Compare query completeness with existing system
+
+### Performance Regression
+- **Risk**: New implementation slower than expected
+- **Mitigation**: Benchmark against SCIP at each phase
+- **Fallback**: Maintain SCIP implementation as backup
+
+### Migration Complexity
+- **Risk**: Difficult transition from SCIP
+- **Mitigation**: Phased rollout with feature flags
+- **Safety**: Comprehensive testing before production use
+
+## Success Metrics
+
+### Performance Targets
+- Index build time: <5 seconds for 1000 files
+- Query response time: <100ms for any operation
+- Memory usage: <50MB for typical projects
+- Token efficiency: 90% reduction in LLM context usage
+
+### Quality Targets
+- Function detection accuracy: >95% vs SCIP
+- Call chain completeness: >90% vs SCIP
+- Incremental update correctness: 100%
+- File watcher reliability: Zero missed changes
+
+## Implementation Timeline
+
+### Week 1-2: Foundation
+- Core index structure and storage
+- Basic JSON schema implementation
+- Simple parser extraction from existing code
+
+### Week 3-4: Language Integration
+- Tree-sitter parser simplification
+- Multi-language symbol extraction
+- Function call relationship building
+
+### Week 5-6: MCP Tools
+- LLM-optimized tool implementation
+- Performance optimization
+- Query response formatting
+
+### Week 7-8: Integration and Testing
+- File watcher integration
+- Comprehensive testing
+- Migration tooling
+
+### Week 9-10: Production Deployment
+- Feature flag rollout
+- Performance monitoring
+- SCIP deprecation planning
+
+## Conclusion
+
+This replacement plan transforms the code-index-mcp from a complex SCIP-based system into a lean, LLM-optimized indexing solution. By focusing on the core use case of function tracing and rapid codebase understanding, we achieve significant performance improvements while maintaining all essential functionality. The simplified architecture reduces maintenance burden and enables faster iteration on LLM-specific features.
@@ -15,14 +15,13 @@ authors = [
 dependencies = [
     "mcp>=0.3.0",
     "watchdog>=3.0.0",
-    "protobuf>=4.21.0",
     "tree-sitter>=0.20.0",
     "tree-sitter-javascript>=0.20.0",
     "tree-sitter-typescript>=0.20.0",
     "tree-sitter-java>=0.20.0",
     "tree-sitter-zig>=0.20.0",
     "pathspec>=0.12.1",
-    "libclang>=16.0.0",
+    "msgpack>=1.0.0",
 ]
 
 [project.urls]
 
@@ -5,10 +5,7 @@
 # Directory and file names
 SETTINGS_DIR = "code_indexer"
 CONFIG_FILE = "config.json"
-SCIP_INDEX_FILE = "index.scip"  # SCIP protobuf binary file
-# Legacy files
-INDEX_FILE = "index.json"  # Legacy JSON index file (to be removed)
-# CACHE_FILE removed - no longer needed with new indexing system
+INDEX_FILE = "index.json"  # JSON index file
 
 # Supported file extensions for code analysis
 # This is the authoritative list used by both old and new indexing systems