OshanKHZ/bee2bee-indexer

🐝 Bee2Bee Indexer

Code indexing and embedding generation for RAG systems

License: MIT · Python 3.10+ · n8n Community Node

Bee2Bee Indexer indexes GitHub repositories and generates embeddings for Retrieval-Augmented Generation (RAG) systems. It parses code in multiple programming languages with tree-sitter, splits it into function-level, class-level, or file-level chunks, and generates dual embeddings (NLP and code-specific) ready for insertion into a vector database.

✨ Features

  • 🌍 Multi-language support: Python, JavaScript, TypeScript, Rust, Go, Java, C, C++
  • 🌳 AST-based parsing: Uses tree-sitter for accurate syntax understanding
  • 🧠 Dual embeddings: Generates both NLP and code-specific embeddings
  • ⚡ Flexible chunking: Function-level, class-level, or file-level strategies
  • Multiple providers: Local embeddings (free) or OpenAI (paid)
  • 🎯 n8n integration: Custom community node for workflow automation
  • 📦 Standalone CLI: Use without n8n for custom integrations
  • 🚀 Production-ready: Battle-tested on large codebases

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/oshankhz/bee2bee-indexer.git
cd bee2bee-indexer

# Install Python dependencies
pip install -e .

# Or using poetry
poetry install

Basic Usage (Python)

import asyncio

from bee2bee_indexer import GitHubClient, TreeSitterParser, FunctionChunker, DualEmbedder


async def main():
    # Initialize components
    github_client = GitHubClient(token="your_github_token")
    parser = TreeSitterParser()
    chunker = FunctionChunker()
    embedder = DualEmbedder(provider="local")

    # Download the repository (download_repo is awaited, so it must run inside a coroutine)
    repo_path = await github_client.download_repo("facebook", "react", "main")

    # Process every JavaScript file in the checkout
    for file in repo_path.glob("**/*.js"):
        source = file.read_text()  # read once, reuse for parsing and chunking
        tree = parser.parse(source, ".js")
        chunks = chunker.extract_chunks(tree, source, "facebook/react", str(file))
        embeddings = embedder.embed_batch([chunk.dict() for chunk in chunks])

        # Store in your vector database
        # your_vector_db.insert(chunks, embeddings)


asyncio.run(main())
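The `your_vector_db.insert(...)` placeholder above can be pictured with a toy in-memory store. This is only a sketch under stated assumptions: it assumes each chunk ends up with two plain lists of floats (one NLP vector, one code vector), which the README does not specify, and `InMemoryDualStore`, `cosine`, and `nlp_weight` are hypothetical names, not part of Bee2Bee. A real deployment would use an actual vector database.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class InMemoryDualStore:
    """Toy stand-in for a vector database holding two vectors per chunk."""
    rows: list = field(default_factory=list)

    def insert(self, chunk_id, nlp_vec, code_vec):
        # Assumption: the embedder yields one NLP and one code vector per chunk.
        self.rows.append((chunk_id, list(nlp_vec), list(code_vec)))

    def search(self, nlp_query, code_query, k=3, nlp_weight=0.5):
        """Rank chunks by a weighted sum of the two cosine similarities."""
        scored = [
            (nlp_weight * cosine(nlp_query, n) + (1 - nlp_weight) * cosine(code_query, c), cid)
            for cid, n, c in self.rows
        ]
        scored.sort(reverse=True)  # highest combined score first
        return [(cid, score) for score, cid in scored[:k]]


store = InMemoryDualStore()
store.insert("parse_file", [1.0, 0.0], [1.0, 0.0])
store.insert("render_page", [0.0, 1.0], [0.0, 1.0])
hits = store.search([1.0, 0.0], [1.0, 0.0], k=1)
```

Blending the two scores with a single weight is one simple way to use both embeddings at query time; querying each index separately and merging the results is an equally reasonable choice.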

CLI Usage

# Create config
cat > config.json << EOF
{
  "owner": "facebook",
  "repo": "react",
  "branch": "main",
  "githubToken": "your_token",
  "embeddingProvider": "local",
  "outputFormat": "chunks_embeddings"
}
EOF

# Run indexer
python cli.py < config.json > output.json
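The same invocation can be driven from Python, which may be convenient when the config is generated per repository. The sketch below uses only the config keys shown above. `build_config` and `run_indexer` are hypothetical helper names, and parsing stdout as JSON is an assumption based on the output being redirected to `output.json`.

```python
import json
import subprocess
import sys

# Values of "outputFormat" documented in this README.
OUTPUT_FORMATS = {"full", "chunks_embeddings", "chunks", "metadata"}


def build_config(owner, repo, token, branch="main",
                 provider="local", output_format="chunks_embeddings"):
    """Return the CLI config as a JSON string, rejecting unknown output formats."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown outputFormat: {output_format!r}")
    return json.dumps({
        "owner": owner,
        "repo": repo,
        "branch": branch,
        "githubToken": token,
        "embeddingProvider": provider,
        "outputFormat": output_format,
    }, indent=2)


def run_indexer(config_json, cli_path="cli.py"):
    """Pipe the config to the CLI on stdin and parse stdout (assumed to be JSON)."""
    result = subprocess.run(
        [sys.executable, cli_path],
        input=config_json,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


config_json = build_config("facebook", "react", "your_token")
# output = run_indexer(config_json)  # requires cli.py in the working directory
```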

🎨 n8n Integration

Install as a custom community node in n8n:

  • Go to Settings → Community Nodes
  • Click Install
  • Enter: @oshankhz/n8n-nodes-bee2bee-indexer

Example Workflow

[Schedule] → [Bee2Bee Indexer] → [Pinecone] → [Email Notification]

📖 Documentation

🛠️ Configuration

Environment Variables

# Required
GITHUB_TOKEN=your_github_personal_access_token

# Optional
EMBEDDING_PROVIDER=local  # or "openai"
OPENAI_API_KEY=your_openai_key  # required if provider=openai
EMBEDDING_MODEL=text-embedding-3-small  # for OpenAI
MAX_WORKERS=4
CHUNK_MAX_SIZE=2000
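As an illustration of how these variables relate to each other, here is a minimal loader sketch. It is not Bee2Bee's own configuration code: `Settings` and `load_settings` are hypothetical names, and treating the example values above (`local`, `text-embedding-3-small`, `4`, `2000`) as defaults is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    github_token: str
    embedding_provider: str
    openai_api_key: Optional[str]
    embedding_model: str
    max_workers: int
    chunk_max_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables from a mapping (os.environ by default)."""
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise ValueError("GITHUB_TOKEN is required")

    provider = env.get("EMBEDDING_PROVIDER", "local")
    if provider not in ("local", "openai"):
        raise ValueError(f"unsupported EMBEDDING_PROVIDER: {provider!r}")

    api_key = env.get("OPENAI_API_KEY")
    if provider == "openai" and not api_key:
        raise ValueError("OPENAI_API_KEY is required when EMBEDDING_PROVIDER=openai")

    return Settings(
        github_token=token,
        embedding_provider=provider,
        openai_api_key=api_key,
        # Defaults below mirror the example values and are assumptions.
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        max_workers=int(env.get("MAX_WORKERS", "4")),
        chunk_max_size=int(env.get("CHUNK_MAX_SIZE", "2000")),
    )


settings = load_settings({"GITHUB_TOKEN": "your_github_personal_access_token"})
```

Passing the environment as a mapping keeps the loader easy to test without touching the real process environment.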

Output Formats

Format             Description                     Use Case
full               Metadata + Chunks + Embeddings  Complete indexing
chunks_embeddings  Chunks with embeddings          Vector DB insertion
chunks             Code chunks only                Custom embedding
metadata           Statistics only                 Repository analysis

Chunk Strategies

Strategy  Description                    Best For
function  One chunk per function/method  Fine-grained search
class     One chunk per class            OOP codebases
file      One chunk per file             High-level search
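To make the three strategies concrete, the sketch below chunks Python source with the standard-library `ast` module. This is a swapped-in technique for illustration only: Bee2Bee uses tree-sitter and supports several languages, and `Chunk` and `chunk_python_source` are hypothetical names rather than the project's API.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record; not Bee2Bee's actual chunk model."""
    name: str
    kind: str  # "function", "class", or "file"
    start_line: int
    end_line: int
    text: str


def chunk_python_source(source: str, strategy: str = "function") -> list[Chunk]:
    """Split Python source into chunks according to the chosen strategy."""
    lines = source.splitlines()
    if strategy == "file":
        return [Chunk("<file>", "file", 1, len(lines), source)]

    if strategy == "function":
        wanted = (ast.FunctionDef, ast.AsyncFunctionDef)
    elif strategy == "class":
        wanted = (ast.ClassDef,)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, wanted):
            # lineno/end_lineno are 1-based and inclusive
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append(Chunk(node.name, strategy, node.lineno, node.end_lineno, text))
    # ast.walk does not visit nodes in source order, so sort by position
    return sorted(chunks, key=lambda c: c.start_line)


SAMPLE = "class A:\n    def m(self):\n        return 1\n\ndef f():\n    return 2\n"
function_chunks = chunk_python_source(SAMPLE, "function")
class_chunks = chunk_python_source(SAMPLE, "class")
file_chunks = chunk_python_source(SAMPLE, "file")
```

With the `function` strategy, methods are returned as their own chunks alongside top-level functions, which matches the "one chunk per function/method" description in the table.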

🏗️ Architecture

┌────────────────────────────────────────────────────────────┐
│                      Bee2Bee Indexer                       │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   GitHub     │───▶│ Tree-sitter  │───▶│   Chunker    │  │
│  │   Client     │    │   Parser     │    │  (Function)  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│         │                   │                   │          │
│         ▼                   ▼                   ▼          │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Downloads   │    │  AST Nodes   │    │ Code Chunks  │  │
│  │  Repository  │    │  Extracted   │    │  + Metadata  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                                                 │          │
│                                                 ▼          │
│                                          ┌──────────────┐  │
│                                          │    Dual      │  │
│                                          │  Embeddings  │  │
│                                          └──────────────┘  │
│                                                 │          │
│                                                 ▼          │
│                                          ┌──────────────┐  │
│                                          │ Vector DB    │  │
│                                          │ (Your choice)│  │
│                                          └──────────────┘  │
└────────────────────────────────────────────────────────────┘

🔧 Development

Setup Development Environment

# Clone repository
git clone https://github.com/oshankhz/bee2bee-indexer.git
cd bee2bee-indexer

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in editable mode
pip install -e ".[dev]"

# Run tests
pytest tests/

Building the n8n Node

cd n8n-node

# Install dependencies
npm install

# Build
npm run build

# Test locally
npm link

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'feat: add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

🙏 Acknowledgments

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🐝 About Bee2Bee

Bee2Bee Indexer is part of the Bee2Bee ecosystem, building tools to help developers work smarter with AI.


Made with ❤️ by the Bee2Bee Team
