🐝 Bee2Bee Indexer

Code indexing and embedding generation for RAG systems

Bee2Bee Indexer indexes GitHub repositories and generates embeddings for Retrieval-Augmented Generation (RAG) systems. It parses code in multiple programming languages with tree-sitter, chunks it at function, class, or file level, and generates dual embeddings (NLP and code-specific) ready for insertion into a vector database.

✨ Features

🌍 Multi-language support: Python, JavaScript, TypeScript, Rust, Go, Java, C, C++
🌳 AST-based parsing: Uses tree-sitter for accurate syntax understanding
🧠 Dual embeddings: Generates both NLP and code-specific embeddings
⚡ Flexible chunking: Function-level, class-level, or file-level strategies
🔐 Multiple providers: Local embeddings (free) or OpenAI (paid)
🎯 n8n integration: Custom community node for workflow automation
📦 Standalone CLI: Use without n8n for custom integrations
🚀 Production-ready: Battle-tested on large codebases

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/oshankhz/bee2bee-indexer.git
cd bee2bee-indexer

# Install Python dependencies
pip install -e .

# Or using poetry
poetry install

Basic Usage (Python)

import asyncio

from bee2bee_indexer import GitHubClient, TreeSitterParser, FunctionChunker, DualEmbedder


async def main():
    # Initialize components
    github_client = GitHubClient(token="your_github_token")
    parser = TreeSitterParser()
    chunker = FunctionChunker()
    embedder = DualEmbedder(provider="local")

    # Download the repository (download_repo is awaited, so it needs an async context)
    repo_path = await github_client.download_repo("facebook", "react", "main")

    # Parse, chunk, and embed every JavaScript file
    for file in repo_path.glob("**/*.js"):
        source = file.read_text()  # read once, reuse for parsing and chunking
        tree = parser.parse(source, ".js")
        chunks = chunker.extract_chunks(tree, source, "facebook/react", str(file))
        embeddings = embedder.embed_batch([chunk.dict() for chunk in chunks])

        # Store in your vector database
        # your_vector_db.insert(chunks, embeddings)


asyncio.run(main())
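
The snippet above leaves storage to the vector database of your choice. As a rough illustration of what happens to the embeddings afterwards, the sketch below ranks stored vectors by cosine similarity against a query vector, using only the standard library. `InMemoryIndex`, the chunk IDs, and the flat list-of-floats vector shape are assumptions made for this example; they are not part of bee2bee_indexer, and the shape returned by `embed_batch` may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryIndex:
    """Hypothetical stand-in for a vector database (not part of the project)."""

    def __init__(self):
        self._items = []  # list of (chunk_id, vector) pairs

    def insert(self, chunk_id, vector):
        self._items.append((chunk_id, list(vector)))

    def search(self, query, top_k=3):
        # Score every stored vector, then keep the top_k most similar.
        scored = [(cosine_similarity(query, vec), cid) for cid, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(cid, score) for score, cid in scored[:top_k]]


index = InMemoryIndex()
index.insert("utils.js::debounce", [1.0, 0.0, 0.0])
index.insert("utils.js::throttle", [0.9, 0.1, 0.0])
index.insert("render.js::mount", [0.0, 0.0, 1.0])

results = index.search([1.0, 0.0, 0.0], top_k=2)
print(results)
```

A real vector database replaces the linear scan with an approximate nearest-neighbor index, but the ranking idea is the same.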

CLI Usage

# Create config
cat > config.json << EOF
{
  "owner": "facebook",
  "repo": "react",
  "branch": "main",
  "githubToken": "your_token",
  "embeddingProvider": "local",
  "outputFormat": "chunks_embeddings"
}
EOF

# Run indexer
python cli.py < config.json > output.json
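
If the config is produced by a script, the token can be read from the environment instead of being typed into the file. A minimal sketch using the keys from the example config; `build_config` is a hypothetical helper, not something the project ships.

```python
import json
import os


def build_config(owner, repo, branch="main", provider="local",
                 output_format="chunks_embeddings"):
    """Build the CLI config dict, reading the token from GITHUB_TOKEN."""
    return {
        "owner": owner,
        "repo": repo,
        "branch": branch,
        "githubToken": os.environ.get("GITHUB_TOKEN", ""),
        "embeddingProvider": provider,
        "outputFormat": output_format,
    }


config = build_config("facebook", "react")
config_json = json.dumps(config, indent=2)
print(config_json)
```

Because the JSON is written to stdout, the script's output can be redirected to `config.json` or piped straight into `python cli.py`.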

🎨 n8n Integration

Install as a custom community node in n8n:

1. Go to Settings → Community Nodes
2. Click Install
3. Enter: `@oshankhz/n8n-nodes-bee2bee-indexer`

Example Workflow

[Schedule] → [Bee2Bee Indexer] → [Pinecone] → [Email Notification]

📖 Documentation

🛠️ Configuration

Environment Variables

# Required
GITHUB_TOKEN=your_github_personal_access_token

# Optional
EMBEDDING_PROVIDER=local  # or "openai"
OPENAI_API_KEY=your_openai_key  # required if provider=openai
EMBEDDING_MODEL=text-embedding-3-small  # for OpenAI
MAX_WORKERS=4
CHUNK_MAX_SIZE=2000
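
The rules implied by the comments above (the token is required, and the OpenAI key is required only when the provider is `openai`) can be checked before indexing starts. The sketch below is one way to do that; `Settings` and `load_settings` are hypothetical names, and treating the example values as defaults is an assumption, not documented behavior.

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    github_token: str
    embedding_provider: str = "local"
    openai_api_key: str = ""
    embedding_model: str = "text-embedding-3-small"
    max_workers: int = 4
    chunk_max_size: int = 2000


def load_settings(env=None):
    """Read and validate settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env

    token = env.get("GITHUB_TOKEN")
    if not token:
        raise ValueError("GITHUB_TOKEN is required")

    provider = env.get("EMBEDDING_PROVIDER", "local")
    if provider not in ("local", "openai"):
        raise ValueError(f"unsupported EMBEDDING_PROVIDER: {provider!r}")

    api_key = env.get("OPENAI_API_KEY", "")
    if provider == "openai" and not api_key:
        raise ValueError("OPENAI_API_KEY is required when EMBEDDING_PROVIDER=openai")

    return Settings(
        github_token=token,
        embedding_provider=provider,
        openai_api_key=api_key,
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        max_workers=int(env.get("MAX_WORKERS", "4")),
        chunk_max_size=int(env.get("CHUNK_MAX_SIZE", "2000")),
    )


# Example with an explicit mapping instead of the real environment.
settings = load_settings({"GITHUB_TOKEN": "example-token", "MAX_WORKERS": "8"})
print(settings.embedding_provider, settings.max_workers)
```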

Output Formats

| Format | Description | Use Case |
|--------|-------------|----------|
| `full` | Metadata + Chunks + Embeddings | Complete indexing |
| `chunks_embeddings` | Chunks with embeddings | Vector DB insertion |
| `chunks` | Code chunks only | Custom embedding |
| `metadata` | Statistics only | Repository analysis |
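
One way to read the formats above is as a selection of sections from a complete result. The sketch below expresses that mapping; the top-level keys `metadata`, `chunks`, and `embeddings` and the sample values are assumptions for illustration, since the actual output schema is not documented here.

```python
# Which sections each output format includes, per the descriptions above.
FORMAT_SECTIONS = {
    "full": ("metadata", "chunks", "embeddings"),
    "chunks_embeddings": ("chunks", "embeddings"),
    "chunks": ("chunks",),
    "metadata": ("metadata",),
}


def select_output(result, output_format):
    """Return only the sections of `result` that the format includes."""
    try:
        sections = FORMAT_SECTIONS[output_format]
    except KeyError:
        raise ValueError(f"unknown output format: {output_format!r}") from None
    return {name: result[name] for name in sections}


# Hypothetical complete result with made-up values.
result = {
    "metadata": {"files": 2, "chunks": 3},
    "chunks": ["chunk-a", "chunk-b", "chunk-c"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}

for name in FORMAT_SECTIONS:
    print(name, sorted(select_output(result, name)))
```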

Chunk Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| `function` | One chunk per function/method | Fine-grained search |
| `class` | One chunk per class | OOP codebases |
| `file` | One chunk per file | High-level search |
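
To make the three strategies concrete, the sketch below applies them to a small Python sample. It uses Python's built-in `ast` module as a stand-in for tree-sitter, so it only handles Python source and is not the project's implementation; `chunk_source` and the `(name, text)` chunk shape are hypothetical.

```python
import ast


def chunk_source(source, strategy="function"):
    """Return (name, text) chunks for Python source using the given strategy."""
    if strategy == "file":
        return [("<file>", source)]
    if strategy == "function":
        wanted = (ast.FunctionDef, ast.AsyncFunctionDef)
    elif strategy == "class":
        wanted = (ast.ClassDef,)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    chunks = []
    # ast.walk visits every node, so methods nested in classes are included.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, wanted):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


SAMPLE = '''\
class Greeter:
    def greet(self, name):
        return f"Hello, {name}"


def main():
    print(Greeter().greet("world"))
'''

function_chunks = chunk_source(SAMPLE, "function")
class_chunks = chunk_source(SAMPLE, "class")
file_chunks = chunk_source(SAMPLE, "file")

print(sorted(name for name, _ in function_chunks))
print([name for name, _ in class_chunks])
print(len(file_chunks))
```

The function strategy yields the most chunks and the file strategy the fewest, which is the trade-off between fine-grained and high-level search described in the table.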

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Bee2Bee Indexer                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │   GitHub     │───▶│ Tree-sitter  │───▶│   Chunker    │   │
│  │   Client     │    │   Parser     │    │  (Function)  │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                    │                    │         │
│         ▼                    ▼                    ▼         │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │  Downloads   │    │  AST Nodes   │    │ Code Chunks  │   │
│  │  Repository  │    │  Extracted   │    │  + Metadata  │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│                                                   │         │
│                                                   ▼         │
│                                          ┌──────────────┐   │
│                                          │    Dual      │   │
│                                          │  Embeddings  │   │
│                                          └──────────────┘   │
│                                                   │         │
│                                                   ▼         │
│                                          ┌──────────────┐   │
│                                          │ Vector DB    │   │
│                                          │ (Your choice)│   │
│                                          └──────────────┘   │
└─────────────────────────────────────────────────────────────┘
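
The dual-embedding stage in the diagram attaches two vectors to each chunk. The sketch below shows that record shape with a toy hashing embedder in place of real models. Everything here is an assumption for illustration: `toy_embed`, `DualEmbedding`, and `embed_dual` are hypothetical names, and the choice to embed a natural-language description for the NLP vector and the raw source for the code vector is a guess at the design, not confirmed by the project.

```python
import hashlib
import math
from dataclasses import dataclass


def toy_embed(text, dim=8):
    """Toy hashing embedder: NOT a real model, just deterministic and unit-length."""
    vec = [0.0] * dim
    for token in text.split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0  # bucket each token by its hash
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class DualEmbedding:
    chunk_id: str
    nlp: list   # vector for the natural-language view of the chunk
    code: list  # vector for the source-code view of the chunk


def embed_dual(chunk_id, description, source,
               nlp_embed=toy_embed, code_embed=toy_embed):
    """Embed one chunk twice, once per view, with pluggable embedders."""
    return DualEmbedding(chunk_id, nlp_embed(description), code_embed(source))


record = embed_dual(
    "utils.py::add",
    description="add: return the sum of two numbers",
    source="def add(a, b):\n    return a + b",
)
print(len(record.nlp), len(record.code))
```

Keeping the two embedders as parameters means a local model and a hosted one can be swapped without changing the record shape that reaches the vector database.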

🔧 Development

Setup Development Environment

# Clone repository
git clone https://github.com/oshankhz/bee2bee-indexer.git
cd bee2bee-indexer

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in editable mode
pip install -e ".[dev]"

# Run tests
pytest tests/

Building the n8n Node

cd n8n-node

# Install dependencies
npm install

# Build
npm run build

# Test locally
npm link

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details.

Development Workflow

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

🙏 Acknowledgments

tree-sitter for parsing
sentence-transformers for local embeddings
n8n for workflow automation
OpenAI for embedding APIs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🐝 About Bee2Bee

Bee2Bee Indexer is part of the Bee2Bee ecosystem, building tools to help developers work smarter with AI.

Made with ❤️ by the Bee2Bee Team
