Code indexing and embedding generation for RAG systems
Bee2Bee Indexer indexes GitHub repositories and generates embeddings for Retrieval-Augmented Generation (RAG) systems. It parses code in multiple programming languages, chunks it at function, class, or file level, and generates dual embeddings (NLP + code-specific) ready for vector databases.
- Multi-language support: Python, JavaScript, TypeScript, Rust, Go, Java, C, C++
- AST-based parsing: Uses tree-sitter for accurate syntax understanding
- Dual embeddings: Generates both NLP and code-specific embeddings
- Flexible chunking: Function-level, class-level, or file-level strategies
- Multiple providers: Local embeddings (free) or OpenAI (paid)
- n8n integration: Custom community node for workflow automation
- Standalone CLI: Use without n8n for custom integrations
- Production-ready: Battle-tested on large codebases
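The language list above suggests a simple first step when walking a cloned repository: decide which files are worth parsing at all. The following is a minimal sketch of that idea. The extension table and the `language_for` and `group_by_language` helpers are illustrative assumptions, not part of the Bee2Bee API.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Optional

# Hypothetical extension table covering the languages listed above.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".rs": "rust",
    ".go": "go",
    ".java": "java",
    ".c": "c",
    ".h": "c",
    ".cpp": "cpp",
    ".cc": "cpp",
    ".hpp": "cpp",
}


def language_for(path: str) -> Optional[str]:
    """Return the language name for a file path, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


def group_by_language(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group supported file paths by language, skipping everything else."""
    groups: Dict[str, List[str]] = {}
    for path in paths:
        language = language_for(path)
        if language is not None:
            groups.setdefault(language, []).append(path)
    return groups
```

Filtering by extension before parsing keeps unsupported files (documentation, images, lock files) out of the chunking and embedding stages.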
# Clone the repository
git clone https://github.com/oshankhz/bee2bee-indexer.git
cd bee2bee-indexer
# Install Python dependencies
pip install -e .
# Or using poetry
poetry install

import asyncio

from bee2bee_indexer import GitHubClient, TreeSitterParser, FunctionChunker, DualEmbedder


async def main():
    # Initialize components
    github_client = GitHubClient(token="your_github_token")
    parser = TreeSitterParser()
    chunker = FunctionChunker()
    embedder = DualEmbedder(provider="local")

    # Download and parse repository
    repo_path = await github_client.download_repo("facebook", "react", "main")
    files = repo_path.glob("**/*.js")

    # Process files
    for file in files:
        source = file.read_text()  # read once, reuse for parsing and chunking
        tree = parser.parse(source, ".js")
        chunks = chunker.extract_chunks(tree, source, "facebook/react", str(file))
        embeddings = embedder.embed_batch([chunk.dict() for chunk in chunks])

        # Store in your vector database
        # your_vector_db.insert(chunks, embeddings)


# download_repo is awaited, so the example has to run inside an event loop
asyncio.run(main())

# Create config
cat > config.json << EOF
{
  "owner": "facebook",
  "repo": "react",
  "branch": "main",
  "githubToken": "your_token",
  "embeddingProvider": "local",
  "outputFormat": "chunks_embeddings"
}
EOF
# Run indexer
python cli.py < config.json > output.json

Install as a custom community node in n8n:
- Go to Settings → Community Nodes
- Click Install
- Enter: `@oshankhz/n8n-nodes-bee2bee-indexer`

[Schedule] → [Bee2Bee Indexer] → [Pinecone] → [Email Notification]
# Required
GITHUB_TOKEN=your_github_personal_access_token
# Optional
EMBEDDING_PROVIDER=local # or "openai"
OPENAI_API_KEY=your_openai_key # required if provider=openai
EMBEDDING_MODEL=text-embedding-3-small # for OpenAI
MAX_WORKERS=4
CHUNK_MAX_SIZE=2000

| Format | Description | Use Case |
|---|---|---|
| `full` | Metadata + Chunks + Embeddings | Complete indexing |
| `chunks_embeddings` | Chunks with embeddings | Vector DB insertion |
| `chunks` | Code chunks only | Custom embedding |
| `metadata` | Statistics only | Repository analysis |

| Strategy | Description | Best For |
|---|---|---|
| `function` | One chunk per function/method | Fine-grained search |
| `class` | One chunk per class | OOP codebases |
| `file` | One chunk per file | High-level search |
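The environment variables, output formats, and chunking strategies above can be checked before a run starts, so that a missing token or a misspelled format fails early. The following is a minimal sketch under stated assumptions: the `IndexerConfig` dataclass and `load_config` helper are hypothetical, the defaults simply mirror the example values shown above, and only the variable names and allowed values come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Allowed values taken from the tables and comments above.
OUTPUT_FORMATS = {"full", "chunks_embeddings", "chunks", "metadata"}
CHUNK_STRATEGIES = {"function", "class", "file"}
PROVIDERS = {"local", "openai"}


@dataclass(frozen=True)
class IndexerConfig:
    github_token: str
    embedding_provider: str
    openai_api_key: Optional[str]
    max_workers: int
    chunk_max_size: int
    output_format: str
    chunk_strategy: str


def load_config(env: Optional[Mapping[str, str]] = None,
                output_format: str = "chunks_embeddings",
                chunk_strategy: str = "function") -> IndexerConfig:
    """Build a validated config from environment-style key/value pairs."""
    if env is None:
        env = os.environ
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise ValueError("GITHUB_TOKEN is required")
    # Assumed default: the example value shown in this README.
    provider = env.get("EMBEDDING_PROVIDER", "local")
    if provider not in PROVIDERS:
        raise ValueError(f"unknown embedding provider: {provider!r}")
    api_key = env.get("OPENAI_API_KEY")
    if provider == "openai" and not api_key:
        raise ValueError("OPENAI_API_KEY is required when provider is openai")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format!r}")
    if chunk_strategy not in CHUNK_STRATEGIES:
        raise ValueError(f"unknown chunk strategy: {chunk_strategy!r}")
    return IndexerConfig(
        github_token=token,
        embedding_provider=provider,
        openai_api_key=api_key,
        max_workers=int(env.get("MAX_WORKERS", "4")),
        chunk_max_size=int(env.get("CHUNK_MAX_SIZE", "2000")),
        output_format=output_format,
        chunk_strategy=chunk_strategy,
    )
```

Passing the environment as a plain mapping keeps the helper easy to test: `load_config({"GITHUB_TOKEN": "t"})` works without touching the real process environment.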
┌────────────────────────────────────────────────────────────┐
│                      Bee2Bee Indexer                       │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   GitHub     │───▶│ Tree-sitter  │───▶│   Chunker    │  │
│  │   Client     │    │   Parser     │    │  (Function)  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│         │                   │                   │          │
│         ▼                   ▼                   ▼          │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Downloads   │    │  AST Nodes   │    │ Code Chunks  │  │
│  │  Repository  │    │  Extracted   │    │  + Metadata  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                                                 │          │
│                                                 ▼          │
│                                          ┌──────────────┐  │
│                                          │    Dual      │  │
│                                          │  Embeddings  │  │
│                                          └──────────────┘  │
│                                                 │          │
│                                                 ▼          │
│                                          ┌──────────────┐  │
│                                          │  Vector DB   │  │
│                                          │ (Your choice)│  │
│                                          └──────────────┘  │
└────────────────────────────────────────────────────────────┘
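The parse and chunk stages in the diagram can be illustrated without installing tree-sitter. The sketch below swaps in Python's standard-library `ast` module as a stand-in parser and emits one chunk per function or method. The `Chunk` fields and the `function_chunks` helper are hypothetical and do not reflect the actual `FunctionChunker` output.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    name: str
    start_line: int
    end_line: int
    code: str


def function_chunks(source: str) -> List[Chunk]:
    """Return one chunk per function or method found in Python source."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(Chunk(
                name=node.name,
                start_line=node.lineno,
                end_line=node.end_lineno,
                # Source text of the definition, starting at the "def" keyword
                code=ast.get_source_segment(source, node),
            ))
    # ast.walk is breadth-first, so sort to get chunks in file order
    return sorted(chunks, key=lambda c: c.start_line)


SAMPLE = '''\
def add(a, b):
    return a + b


class Greeter:
    def greet(self, name):
        return f"Hello, {name}"
'''

for chunk in function_chunks(SAMPLE):
    print(chunk.name, chunk.start_line, chunk.end_line)
```

Each chunk carries its name and line range as metadata, which is the kind of information a vector database record needs to point a search result back to its place in the repository.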
# Clone repository
git clone https://github.com/oshankhz/bee2bee-indexer.git
cd bee2bee-indexer
# Create virtual environment
python -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
# Install in editable mode
pip install -e ".[dev]"
# Run tests
pytest tests/

cd n8n-node
# Install dependencies
npm install
# Build
npm run build
# Test locally
npm link

Contributions are welcome! Please read our Contributing Guide for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'feat: add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- tree-sitter for parsing
- sentence-transformers for local embeddings
- n8n for workflow automation
- OpenAI for embedding APIs
This project is licensed under the MIT License - see the LICENSE file for details.
Bee2Bee Indexer is part of the Bee2Bee ecosystem, building tools to help developers work smarter with AI.
- Website: https://bee2bee.ai
- GitHub: https://github.com/oshankhz
- Documentation: https://docs.bee2bee.ai
Made with ❤️ by the Bee2Bee Team