A lightweight, local RAG (Retrieval-Augmented Generation) system for indexing and searching your documents. Built with FastAPI, Docling, and ChromaDB for high-performance semantic search across your local files.
- Lightweight & Fast: Optimized for performance with millions of document chunks
- Beautiful Web Interface: Modern, responsive UI for easy document management and search
- Auto File Watching: Automatically indexes new/modified files in watched folders
- Semantic Search: Uses advanced embeddings for intelligent document retrieval
- Real-time Stats: Monitor your document index and search performance
- File Browser: Dropbox-like interface for browsing and selecting files/folders
- Smart Indexing: Avoids re-indexing unchanged files using content hashing
- Progress Tracking: Real-time indexing progress with detailed status updates
- Persistent Configuration: Automatically saves and restores watched folders
- Configurable: Easy configuration via environment variables
- OAuth Support: Integration with Microsoft OneDrive/SharePoint (via .tokens.json)
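To illustrate what the semantic search feature does conceptually, here is a minimal sketch that ranks chunks by cosine similarity between embedding vectors. The three-dimensional vectors and chunk identifiers are made up for illustration; in the actual system the embeddings come from a sentence-transformer model and the ranking is delegated to ChromaDB, whose internals may differ from this toy version.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunk_vecs: dict, limit: int = 20) -> list:
    """Return (chunk_id, score) pairs sorted by descending similarity."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy, hand-written vectors (hypothetical; real embeddings have hundreds of dimensions)
chunks = {
    "notes.md#0": [0.9, 0.1, 0.0],
    "report.pdf#3": [0.1, 0.9, 0.2],
    "slides.pptx#1": [0.7, 0.3, 0.1],
}
results = rank_chunks([1.0, 0.0, 0.0], chunks, limit=2)
```

Chunks whose vectors point in nearly the same direction as the query vector score close to 1 and appear first in the result list.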
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Frontend        │    │ Backend API      │    │ Vector Store    │
│ (HTML/JS)       │◄──►│ (FastAPI)        │◄──►│ (Chroma)        │
│                 │    │                  │    │                 │
│ • File selector │    │ • File watcher   │    │ • Embeddings    │
│ • Search UI     │    │ • Doc processing │    │ • Metadata      │
│ • Results view  │    │ • Embedding gen  │    │ • Fast search   │
│ • Progress view │    │ • Hash checking  │    │ • Deduplication │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                       ┌──────────────────┐
                       │ File System      │
                       │ Watcher          │
                       │ (watchdog)       │
                       └──────────────────┘
- PDF documents
- Microsoft Word (.docx)
- Text files (.txt, .md)
- HTML files
- PowerPoint (.pptx)
- Excel (.xlsx)
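A file watcher typically needs to decide quickly whether a path is worth handing to the document processor. The sketch below shows one way to do that with an extension allow-list built from the formats listed above. The function name `is_supported` and the exact extension set (in particular `.html`/`.htm` for "HTML files") are assumptions for illustration, not the project's actual code.

```python
from pathlib import Path

# Extensions derived from the supported-formats list.
# Mapping "HTML files" to .html and .htm is an assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".html", ".htm", ".pptx", ".xlsx"}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the allow-list (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported("Report.PDF")` is true, while `is_supported("data.csv")` is false because CSV is not in the list.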
- Python 3.12+
- uv package manager
- Clone the repository:
  git clone <your-repo-url>
  cd syftbox-rag
- Install dependencies:
  uv pip install -r requirements.txt
- Run the application:
  ./run.sh
- Open your browser and go to:
  http://localhost:9000
- Go to the "Manage Files" tab
- Use the file browser to navigate to your desired folder
- Select folders or individual files using checkboxes
- Click "Add Selected" to start indexing
- Monitor progress in real-time with the indexing status indicator
- Navigate through your file system like Dropbox
- Select multiple files and folders with checkboxes
- View file sizes and modification dates
- Quick folder expansion/collapse
- Home directory quick access
- Go to the "Search Documents" tab
- Enter your search query in natural language
- Adjust the results limit if needed (default: 20)
- Click "Search" or press Enter
- View results with similarity scores and metadata

Example queries:
- "machine learning algorithms"
- "project timeline and deadlines"
- "financial reports Q3"
- "meeting notes from last week"
You can customize the system behavior using environment variables:
export VECTOR_DB_PATH="./my_vector_db"     # Vector database location
export EMBEDDING_MODEL="all-MiniLM-L6-v2"  # Sentence transformer model
export CHUNK_SIZE="500"                    # Document chunk size (characters)
export CHUNK_OVERLAP="50"                  # Overlap between chunks
export MIN_CHUNK_SIZE="50"                 # Minimum chunk size to index
export HOST="0.0.0.0"                      # Server host
export PORT="8080"                         # Server port
export PROCESSING_DELAY="1.0"              # File processing delay (seconds)
export DEFAULT_SEARCH_LIMIT="20"           # Default search results
export MAX_SEARCH_LIMIT="100"              # Maximum search results

# Start with the launcher script
./run.sh

# Or start the server manually
uv run uvicorn backend.main:app --host 127.0.0.1 --port 9000

# Or start in development mode with auto-reload
uv run uvicorn backend.main:app --host 127.0.0.1 --port 9000 --reload

# Clean up stale processes and files
./cleanup.sh

syftbox-rag/
├── backend/
│   ├── __init__.py
│   ├── main.py                 # FastAPI server with API endpoints
│   ├── config.py               # Configuration settings
│   ├── document_processor.py   # Docling integration for parsing
│   ├── embeddings.py           # Sentence transformer embeddings
│   ├── file_watcher.py         # File system monitoring & processing
│   └── vector_store.py         # ChromaDB interface
├── frontend/
│   ├── index.html              # Main interface with tabs
│   ├── app.js                  # Frontend logic & file browser
│   └── style.css               # Modern responsive styling
├── data/                       # Application data (logs, PID files)
├── vector_db/                  # ChromaDB storage (auto-created)
├── .tokens.json                # OAuth tokens (optional)
├── requirements.txt            # Python dependencies
├── run.sh                      # Application launcher script
├── cleanup.sh                  # Application cleanup script
└── README.md                   # This file
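As a rough idea of how a module such as backend/config.py could read the environment variables described in the configuration section, here is a minimal sketch. The `Settings` class, the `load_settings` function, and the fallback values are assumptions: the README states 500/50 for chunking, 20 for the default search limit, and all-MiniLM-L6-v2 as the default model, while the other defaults are simply taken from the example values and may not match the real code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    vector_db_path: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    min_chunk_size: int
    default_search_limit: int
    max_search_limit: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    settings = Settings(
        vector_db_path=env.get("VECTOR_DB_PATH", "./vector_db"),  # assumed default
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "500")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "50")),
        min_chunk_size=int(env.get("MIN_CHUNK_SIZE", "50")),  # assumed default
        default_search_limit=int(env.get("DEFAULT_SEARCH_LIMIT", "20")),
        max_search_limit=int(env.get("MAX_SEARCH_LIMIT", "100")),  # assumed default
    )
    # An overlap as large as the chunk itself would never advance through the text
    if not 0 <= settings.chunk_overlap < settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    return settings
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in tests, for example `load_settings({"CHUNK_SIZE": "800"})`.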
The system provides a comprehensive REST API:
- GET / - Main web interface
- POST /api/add-folder - Add folder to watch list
- GET /api/watched-folders - Get watched folders
- DELETE /api/watched-folders/{path} - Remove watched folder
- POST /api/search - Search documents
- GET /api/stats - Get database statistics
- GET /api/indexing-status - Get real-time indexing progress
- POST /api/file-structure - Browse file system
- GET /api/file-structure/home - Get home directory structure
- POST /api/folder-selection - Batch add/remove files and folders
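The endpoints above can be called from any HTTP client. The sketch below builds a POST request to /api/search using only the Python standard library. The request body shape (a JSON object with `query` and `limit` fields) is an assumption, since the README does not document the schema; check the FastAPI-generated docs of your running instance for the real field names.

```python
import json
import urllib.request


def build_search_request(base_url: str, query: str, limit: int = 20) -> urllib.request.Request:
    """Build a POST request for /api/search (body fields are assumed, not documented)."""
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url: str, query: str, limit: int = 20):
    """Send the request and decode the JSON response; needs a running server."""
    request = build_search_request(base_url, query, limit)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the application running locally, a call such as `search("http://localhost:9000", "meeting notes from last week")` would return whatever JSON the server produces for that query.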
- File hashing prevents re-indexing unchanged documents
- Chunked processing handles large files efficiently
- Background processing doesn't block the UI
- Error recovery handles corrupted or inaccessible files
- Operation queue with detailed progress information
- File-level progress with chunk counting
- Size estimation and processing speed metrics
- Activity logs for debugging and monitoring
- Watched folders automatically restored on restart
- Database integrity maintained across sessions
- Configuration persistence via environment variables
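The hash-based skip of unchanged documents mentioned above can be sketched as follows. This is an illustrative version under stated assumptions: SHA-256 is chosen here because the README does not name the hash algorithm, and `file_sha256`, `needs_indexing`, and the in-memory `known_hashes` dictionary are hypothetical stand-ins for whatever the project actually stores alongside its index.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str, block_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size blocks so large files are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_indexing(path: str, known_hashes: dict) -> bool:
    """Return True (and record the new hash) if the content changed since last seen."""
    current = file_sha256(path)
    key = str(Path(path).resolve())
    if known_hashes.get(key) == current:
        return False  # Same content as before: skip re-indexing
    known_hashes[key] = current
    return True
```

Hashing content rather than comparing modification times means a file that is merely touched or re-saved without changes is not processed again.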
uv add package-name
# Add test files to test the system
uv run python -m pytest tests/
uv run black backend/
uv run isort backend/
- Vector Database: ChromaDB provides excellent performance for millions of document chunks
- Embedding Model: The default all-MiniLM-L6-v2 model balances speed and accuracy
- Chunking Strategy: 500-character chunks with 50-character overlap work well for most documents
- File Watching: Files are processed asynchronously to avoid blocking the UI
- Search Speed: Typical search times are under 1 second for large document collections
- Smart Caching: File hashing prevents unnecessary re-processing
- Memory Management: Efficient streaming processing for large documents
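To make the chunking strategy above concrete, here is a minimal sketch of fixed-size character chunking with overlap, using the documented values of 500 and 50 plus a minimum chunk size of 50. The function `chunk_text` is hypothetical; the real pipeline may split on sentence or layout boundaries via Docling rather than raw character offsets.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50,
               min_chunk_size: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # Each window starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_size:
            chunks.append(piece)  # Fragments below the minimum are dropped
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text
    return chunks
```

With the defaults, a 1000-character document yields three chunks starting at offsets 0, 450, and 900, so the last 50 characters of each chunk reappear at the start of the next one.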
- Port already in use:
  export SYFTBOX_ASSIGNED_PORT="8080" # Use a different port
  ./run.sh
- Permission errors when adding folders:
  - Ensure the folder path exists and is readable
  - Check file permissions on the target directory
- Slow indexing:
  - Reduce CHUNK_SIZE for faster processing
  - Increase PROCESSING_DELAY to reduce system load
  - Monitor progress in the indexing status panel
- Out of memory:
  - Use a smaller embedding model
  - Process fewer files at once
  - Increase system memory
- Files not being indexed:
  - Check if file format is supported
  - Verify file permissions and accessibility
  - Monitor activity logs for error messages
- Application won't start:
  # Clean up any stale processes and files
  ./cleanup.sh
  # Then try starting again
  ./run.sh
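For the "Port already in use" case, it can help to confirm which ports are actually occupied before picking a new one. The following sketch uses only the standard `socket` module; the helper names `is_port_in_use` and `find_free_port` are illustrative and not part of the project.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception
        return probe.connect_ex((host, port)) == 0


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the operating system for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # Port 0 lets the OS pick any free port
        return sock.getsockname()[1]
```

If `is_port_in_use(9000)` returns true, the value from `find_free_port()` can be exported as the port before running the launcher script. Another process could still claim that port in the meantime, so treat the result as a hint rather than a reservation.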
The application logs are stored in ./data/app.log. For more detailed logging:
export LOG_LEVEL="DEBUG"
./run.sh
You can also check the application status:
# Check if application is running
ps aux | grep uvicorn
# View recent logs
tail -f ./data/app.log
If using Microsoft OneDrive/SharePoint integration:
- Place your OAuth tokens in .tokens.json
- Ensure proper permissions for Files.Read, Files.ReadWrite, etc.
- Monitor token expiration and refresh as needed
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Docling for document processing
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- FastAPI for the web framework
- Watchdog for file system monitoring
- Add support for more file formats (CSV, JSON, XML)
- Implement document preview functionality
- Add user authentication and multi-user support
- Create Docker containerization
- Add automated testing suite
- Implement document versioning
- Add advanced search filters and faceting