Thank you for your interest in contributing to ContextGem! This document provides guidelines and instructions for contributing to the project.
Please read our Code of Conduct to understand the expectations for all interactions within the project.
Before your contribution can be accepted, you must sign our Contributor Agreement. This is a legal document that grants us the necessary rights to use your contribution. The agreement is based on the Oracle Contributor Agreement and this requirement follows OpenSSF Best Practices for silver-level criteria (section "Project oversight").
To sign the agreement:
- Read the Contributor Agreement carefully
- Create a copy of the agreement under `.github/contributors/[your-github-username].md`
- Fill in all the requested information and include it in your first pull request
This repository is AI agent-friendly and includes configuration files to help AI coding assistants understand the codebase:
- AGENTS.md - Project overview, architecture patterns, coding conventions, and workflow guidelines for AI assistants (agents.md standard)
- CLAUDE.md - Configuration for Claude Code
When using AI assistants (Claude Code, Cursor, etc.) to contribute:
- Review AI-generated code - Always verify changes follow project patterns and pass tests
- Handle VCR cassettes yourself - AI assistants should not run tests that call LLM APIs without existing cassettes
- Manage git operations yourself - Review and commit changes manually rather than letting AI handle git
💡 Tip: AI assistants work best when given specific, focused tasks. Break large contributions into smaller pieces for better results.
- 🍴 Fork and clone the repository:

  - First, fork the repository by clicking the "Fork" button on the GitHub project page

  - Then clone your fork to your local machine:

    ```bash
    git clone https://github.com/YOUR-GITHUB-USERNAME/contextgem.git
    cd contextgem
    ```

  - Add the original repository as an upstream remote:

    ```bash
    git remote add upstream https://github.com/shcherbak-ai/contextgem.git
    ```
- ⚙️ Set up the development environment:

  ```bash
  # Install uv if you don't have it
  pip install uv

  # Install dependencies and pre-commit hooks
  uv run fab setup

  # Or manually:
  # uv sync --all-groups --upgrade
  # uv run pre-commit install
  # uv run pre-commit install --hook-type commit-msg
  ```
  The project includes a `fabfile.py` with common development tasks:

  ```bash
  uv run fab --list          # List all available commands
  uv run fab setup           # Set up dev environment (deps + hooks)
  uv run fab sync            # Sync dependencies with upgrades
  uv run fab lint            # Run pre-commit checks on all files
  uv run fab docs            # Build documentation
  uv run fab docs-live       # Start live documentation server
  uv run fab readme          # Regenerate README.md from template
  uv run fab install-hooks   # Install pre-commit hooks
  ```

Below is a high-level overview of the codebase layout and where to make different types of contributions:
```text
contextgem/
│
├── contextgem/
│   │
│   ├── internal/              # 🔧 Core implementation (start here for new features)
│   │   ├── base/              #   - Core abstractions & business logic
│   │   │   ├── concepts.py    #   - Internal concept implementations
│   │   │   ├── aspects.py     #   - Internal aspect implementations
│   │   │   ├── documents.py   #   - Internal document processing
│   │   │   ├── llms.py        #   - Internal LLM functionality
│   │   │   └── ...            #   - More internal implementations
│   │   ├── prompts/           #   - LLM prompt templates
│   │   ├── typings/           #   - Type definitions
│   │   └── ...                #   - More internal modules
│   │
│   └── public/                # 🎯 User-facing API (thin facades exposing internals)
│       ├── concepts.py        #   - Public concept facades
│       ├── aspects.py         #   - Public aspect facades
│       ├── documents.py       #   - Public document facades
│       ├── pipelines.py       #   - Public pipeline facades
│       ├── llms.py            #   - Public LLM facades
│       └── ...                #   - More public modules
│
├── tests/
│   ├── cassettes/             # 📼 VCR recordings (auto-generated)
│   ├── test_all.py            # ✅ Add your tests here
│   ├── utils.py               # 🛠️ Test utilities & dummy env vars
│   └── ...                    # 📁 Test data files
│
├── docs/
│   ├── source/                # 📚 Documentation source files
│   └── ...                    # 📋 Build configs & outputs
│
├── dev/
│   ├── usage_examples/        # 📝 Code examples for docs
│   ├── notebooks/             # 📓 Notebooks (auto-generated)
│   ├── readme.template.md     # ✏️ Edit this, not README.md
│   └── ...                    # 🛠️ Development scripts
│
├── pyproject.toml             # ⚙️ Dependencies & project config
└── README.md                  # 🤖 Auto-generated (don't edit)
```
🎯 Quick Start for Your Contribution:

- Adding new functionality? → Implement in `contextgem/internal/` (core logic). Then expose via a thin public facade in `contextgem/public/` using the registry.
- Writing tests? → Add to `tests/test_all.py::TestAll`
- Updating docs? → Edit files in `docs/source/` or `dev/`
- Fixing README? → Edit `dev/readme.template.md`

💡 Note: Implement functionality in `internal/` (base classes, validation, serialization, typing). Use `public/` to expose thin, documented facades that inherit from internal classes and are registered with the `@_expose_in_registry` decorator to ensure deserialization and instance creation utils return public types. Do not import public classes in internal modules; use the registry for type resolution and publicization.
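To make the pattern concrete, here is a minimal, hypothetical sketch of an internal class and its public facade. The module and class names are illustrative only; the fixed points from this guide are the internal/public split and the `@_expose_in_registry` decorator (whose import path below is an assumption - check the codebase for the actual location):

```python
# contextgem/internal/base/items.py (hypothetical internal module)
class _Item:
    """Core implementation: validation, serialization, business logic."""

    def __init__(self, name: str):
        self.name = name


# contextgem/public/items.py (hypothetical public module)
from contextgem.internal.base.items import _Item
from contextgem.internal.registry import _expose_in_registry  # assumed import path


@_expose_in_registry
class Item(_Item):
    """Thin, documented public facade inheriting the internal class."""
```

Note that the public module imports from internal, never the other way around; internal code resolves public types through the registry.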
- 🌿 Create a new branch:

  For example:

  ```bash
  git checkout -b feature/your-feature-name
  ```

  When creating a branch, use one of the following prefixes that matches your change type:

  - `bugfix/` - For bug fixes (e.g., `bugfix/fix-llm-timeout`)
  - `feature/` - For new features (e.g., `feature/add-new-concept-type`)
  - `breaking/` - For breaking changes (e.g., `breaking/concepts-api-v2`)
  - `docs/` - For documentation updates (e.g., `docs/update-aspects-guide`)
  - `perf/` - For performance improvements (e.g., `perf/optimize-prompts`)
  - `refactor/` - For code cleanup or refactoring (e.g., `refactor/simplify-error-handling`)
  General guidelines:

  - Use hyphens (-) between words, not underscores or spaces
  - Be specific but concise about what the branch does
  - Include issue numbers when applicable (e.g., `bugfix/issue-42`)
  - Keep branch names lowercase
- 📝 Make your changes following our code style guidelines.
We use several tools to maintain code quality:
- Ruff: For code formatting and linting
- Pyright: For static type checking
- Bandit: For Python security vulnerability scanning
- Deptry: For dependency health checks (unused, missing, transitive dependencies)
- Interrogate: For docstring coverage checking
- Pre-commit hooks: To automatically check and format code before commits
The pre-commit hooks will automatically check and format your code when you commit. There are two scenarios to be aware of:
If the hooks modify any files during commit (such as Ruff formatting):
- Review the changes made
- Add the modified files to the staging area
- Commit again
If security issues are detected (Bandit):
- Review the security findings in the terminal output
- Fix the identified security issues in your code
- Add the fixed files to the staging area
- Commit again
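For the first scenario, the fix-up cycle typically looks like this (a sketch of standard git usage; `uv run cz commit --retry` is covered in the commit step below):

```bash
# The hooks reformatted some files, so the commit was aborted:
git add -u                  # re-stage the files the hooks modified
uv run cz commit --retry    # commit again with the same message
```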
- 🧪 Run tests to ensure your changes do not break existing functionality:

  ```bash
  uv run pytest
  ```

  Note: We use pytest-recording to record and replay LLM API interactions. Your changes may require re-recording VCR cassettes for the tests. See the VCR Cassette Management section below for details.
- 💾 Commit your changes using Conventional Commits format:

  We use Conventional Commits format for our commit messages. Instead of using regular `git commit`, please use commitizen:

  ```bash
  uv run cz commit
  ```

  This will guide you through an interactive prompt to create a properly formatted commit message with:

  - Type of change (feat, fix, docs, style, refactor, etc.)
  - Optional scope (e.g., api, cli, docs)
  - Short description
  - Optional longer description and breaking change notes

  Example of resulting commit message:

  ```text
  docs(readme): update installation instructions
  ```

  Note: If pre-commit hooks fail or modify files during `cz commit`, you can retry with the same message:

  ```bash
  uv run cz commit --retry
  ```
- 🔄 Update your fork with the latest changes from the `dev` branch:

  ```bash
  git fetch upstream
  git checkout dev
  git merge upstream/dev
  git push origin dev
  ```
- 📤 Push your changes to your fork:

  ```bash
  git push origin feature/your-feature-name
  ```
- 🎯 Create a pull request from your branch to the `dev` branch. We use the `dev` branch for integration and testing before merging to `main` to keep the main branch stable for releases.

- 📋 Fill out the pull request template with all required information.

- ✍️ Sign the Contributor Agreement by including your filled-in `.github/contributors/[your-github-username].md` file (required in the first pull request).

- ⏳ Wait for review. Maintainers will review your PR and may request changes.

- 🔧 Address review comments if requested.
When submitting issues or feature requests, please use our GitHub issue templates:
- Check if a similar issue already exists in the Issues section.

- If not, create a new issue using the appropriate template:
- Bug Report: For reporting bugs or unexpected behavior
- Feature Request: For suggesting new features or enhancements
- Documentation Improvement: For suggesting improvements to our documentation
Each template will guide you through providing all the necessary information for your specific request.
By submitting issues or feature requests to this project, you acknowledge that these suggestions may be implemented by the project maintainers without attribution or compensation.
Currently, all tests are located in a single file: `tests/test_all.py`, within the `TestAll` class. When adding new tests, place them in this file following the existing patterns.

Note: We plan to refactor tests into multiple files for better maintainability in the future, but for now all tests should be added to `tests/test_all.py`.
- Write tests for new features or bug fixes
- Make sure all tests pass before submitting a PR
- Maintain code coverage above 80%
- Check code coverage by running:

  ```bash
  uv run pytest --cov=contextgem
  ```
We use pytest-recording to record and replay HTTP interactions with LLM APIs (both cloud-based and local). This allows tests that call LLM APIs to run without making actual API calls after the initial recording.
Note: Tests that do not call LLM APIs do not require or use VCR cassettes. The cassette system only applies to tests that interact with LLM APIs.
VCR cassettes provide the most reliable testing approach for ContextGem because:
- Real API Testing: Testing with actual LLM APIs ensures our functionality works as expected with real responses, edge cases, and API behaviors
- Scalability: With a significant number of LLM API tests, hardcoding requests/responses would be impractical and unmaintainable
- Reproducibility: Once recorded, tests run consistently without variability in LLM responses
- No Setup Friction: Contributors can run tests without API keys or local LLM installations
Local LLMs (Ollama, LM Studio, etc.) also use HTTP APIs (typically on localhost) and their interactions are recorded in cassettes too.
The test suite automatically uses dummy environment variables with pre-recorded cassettes when no .env file is present, so most contributors won't need to set up real API keys or local LLM servers.
To determine whether you need to record new or re-record existing cassettes, run the tests first:

```bash
uv run pytest
```

Based on the test results and your changes, you'll fall into one of these four scenarios:
When this applies:
- New tests that do not call LLM APIs
- Code changes that don't modify internal prompts or LLM parameters
- Changes are compatible with existing pre-recorded API calls (confirmed by passing tests)
What to do:

- Nothing! Tests that call LLM APIs should pass by replaying from existing cassettes with automatically-set dummy environment variables
- No need to create a `.env` file or set up API keys
When this applies:
- New test methods that call LLM APIs (cloud-based or local)
- Adding tests for new functionality that requires LLM interaction
What to do:
- Create a `.env` file locally (ignored by git) with the API keys for the LLM services your new tests will use:

  ```bash
  # Only include the variables for LLM APIs your tests actually call

  # For OpenAI API tests
  CONTEXTGEM_OPENAI_API_KEY=your_openai_api_key

  # For Azure OpenAI tests
  CONTEXTGEM_AZURE_OPENAI_API_KEY=your_azure_openai_api_key
  CONTEXTGEM_AZURE_OPENAI_API_BASE=your_azure_openai_base
  CONTEXTGEM_AZURE_OPENAI_API_VERSION=your_azure_openai_api_version

  # For debugging output
  CONTEXTGEM_LOGGER_LEVEL=DEBUG
  ```

- For new LLM providers, create environment variables prefixed with `CONTEXTGEM_`:

  ```bash
  CONTEXTGEM_GOOGLE_AI_STUDIO_API_KEY=your_google_api_key
  ```

- Update dummy variables in `tests/utils.py` by adding your new environment variables to the `default_env_vars` dictionary in `set_dummy_env_variables_for_testing_from_cassettes()`, mapped to a dummy value (e.g. "DUMMY"); see the sketch after this list

- Add the VCR decorator to your new test methods that call LLM APIs (cloud or local):

  ```python
  @pytest.mark.vcr
  def test_your_new_llm_feature(self):
      # Your test code that calls LLM APIs (cloud or local)
      ...
  ```

  ⚠️ Important: Without the `@pytest.mark.vcr` decorator, no cassette will be recorded!

- Run your new tests - new cassettes will be created automatically

- Verify redaction - check that sensitive data is properly redacted in the new cassette files

- Test with dummy variables - delete your `.env` file and run tests again to confirm LLM API tests pass by replaying from cassettes with dummy variables
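As a rough sketch of the `tests/utils.py` update (the function and dictionary names come from this guide; the surrounding structure of the file is an assumption):

```python
# tests/utils.py (simplified sketch; the actual file will differ)
import os


def set_dummy_env_variables_for_testing_from_cassettes() -> None:
    """Set dummy env vars so tests can replay cassettes without real keys."""
    default_env_vars = {
        "CONTEXTGEM_OPENAI_API_KEY": "DUMMY",
        "CONTEXTGEM_AZURE_OPENAI_API_KEY": "DUMMY",
        "CONTEXTGEM_AZURE_OPENAI_API_BASE": "DUMMY",
        "CONTEXTGEM_AZURE_OPENAI_API_VERSION": "DUMMY",
        # Add your new variable here, e.g.:
        "CONTEXTGEM_GOOGLE_AI_STUDIO_API_KEY": "DUMMY",
    }
    for key, value in default_env_vars.items():
        os.environ.setdefault(key, value)
```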
When this applies:
- Tests fail because your changes are incompatible with specific existing cassettes
- Only certain test cases are affected
What to do:

- Identify failing cassettes from test output

- Delete specific cassette files from `tests/cassettes/` that need re-recording

- Create a `.env` file if needed (same as Scenario 2)

- Run the affected tests to re-record only the necessary cassettes:

  ```bash
  uv run pytest tests/test_all.py::TestAll::test_specific_method
  ```
When this applies:
- You modified internal prompts (direct changes or code that renders prompts differently)
- You changed default LLM API parameters
- Multiple LLM-related tests fail due to your changes
What to do:

- Delete all cassette files:

  ```bash
  # On Unix/Linux/Mac
  rm tests/cassettes/*.yaml

  # On Windows
  del tests\cassettes\*.yaml
  ```

- Create a `.env` file with your API keys (same as Scenario 2)

- Run all tests to re-record everything:

  ```bash
  uv run pytest
  ```

  ⚠️ Important: This will use significant API quota and may incur substantial costs!
Automatically Redacted Variables:

- `CONTEXTGEM_OPENAI_API_KEY`
- `CONTEXTGEM_AZURE_OPENAI_API_KEY`
- `CONTEXTGEM_AZURE_OPENAI_API_BASE`
- `CONTEXTGEM_AZURE_OPENAI_API_VERSION`

Adding New Variables:

- Use the `CONTEXTGEM_` prefix for new API keys
- Verify redaction in your cassette files
- Update redaction logic in `tests/utils.py` if needed
- Add dummy values to `set_dummy_env_variables_for_testing_from_cassettes()`
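For orientation, redaction with pytest-recording is typically configured through its standard `vcr_config` fixture, roughly as below. This is a generic sketch, not ContextGem's actual redaction logic, which lives in `tests/utils.py` and may work differently:

```python
import pytest


@pytest.fixture(scope="module")
def vcr_config():
    # Scrub sensitive values before requests/responses are written to cassettes
    return {
        "filter_headers": [("authorization", "REDACTED")],
        "filter_query_parameters": [("api_key", "REDACTED")],
    }
```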
For local LLM testing, install the relevant tools (such as Ollama or LM Studio) and download the models identified under the `ollama` and `lm_studio` prefixes in `tests/test_all.py`.

⚠️ Important: Your system needs appropriate GPU capacity to run such local LLMs.
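For example, with Ollama a model download looks like this (the model name below is a placeholder; use the ones actually referenced in `tests/test_all.py`):

```bash
# Pull a local model so tests with the ollama prefix can run against it
ollama pull llama3.1:8b
```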
💰 Cost Warning: Recording cassettes for test methods that use live (non-local) LLM APIs consumes your API quota and will incur charges. Scenario 4 (re-recording all cassettes) can be particularly expensive.

🔒 Security: Environment variables such as API keys are automatically stripped from cassettes, but always verify new cassette content.

🧪 Testing: After recording, delete your `.env` file and run tests again to ensure LLM API tests pass by replaying from cassettes with dummy variables.
The test suite uses `tethered` to enforce network egress control at the socket level during VCR-marked tests:

- Replay mode (cassette exists): blocks all outbound connections except HuggingFace (for SaT model downloads not captured by VCR)
- Recording mode (no cassette): allows only approved endpoints (LLM APIs, HuggingFace for model downloads, genai-prices for cost data) and localhost for local LLMs

If you add tests that connect to new endpoints, update the `_TETHERED_RECORDING_ALLOW` list in `tests/conftest.py`.
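A hypothetical sketch of that update (the list name and file come from this guide; the entry format is an assumption and may differ in the actual file, e.g. hostnames vs. URL patterns):

```python
# tests/conftest.py (sketch; check the real file for the exact entry format)
_TETHERED_RECORDING_ALLOW = [
    "api.openai.com",                      # existing LLM API endpoint
    "huggingface.co",                      # SaT model downloads
    # Add your new endpoint here, e.g.:
    "generativelanguage.googleapis.com",   # hypothetical new provider
]
```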
Run all tests:

```bash
uv run pytest
```

Run specific tests:

```bash
# Run a specific test method
uv run pytest tests/test_all.py::TestAll::test_extract_all
```

🔍 Optional Memory Profiling: For performance testing, you can enable memory profiling to analyze memory usage during test execution:

```bash
uv run pytest --mem-profile
```

Note: Memory profiling adds significant overhead and tests will run much slower when profiling is enabled. Memory profiling helps ensure that ContextGem objects don't consume excessive memory and validates memory usage against defined reasonable limits.
Warnings generated during tests are often expected and by design. Many warnings are intentionally triggered to test error handling, edge cases, and warning systems. Common expected warnings include:
- LLM extraction errors and retries (testing error handling)
- Missing LLM roles (testing validation logic)
- Concurrency optimization warnings (testing performance comparisons)
- Deprecation warnings from dependencies
Key Point: If tests PASS with warnings, this should not prevent you from submitting your PR. The test suite is designed to handle and expect these warnings as part of normal operation.
The log output will show detailed information about test execution.
- Update documentation for any changed functionality
- Document new features
- Use clear, concise language
Use the fab commands from the project root:

```bash
# Live rebuild with auto-refresh on file changes
uv run fab docs-live
```

This starts a development server on http://localhost:9000 with:

- Automatic rebuilds when files change
- Browser auto-refresh
- Pretty URLs without `.html` extensions

```bash
# Build with verbose output, ignore cache, and treat warnings as errors
uv run fab docs
```

With Live Development:

Open http://localhost:9000 in your browser.

With Static Builds:

After building, open `docs/build/dirhtml/index.html` in your web browser.
You can access the live documentation at: https://contextgem.dev
Note: Documentation is automatically deployed when maintainers merge changes from `dev` to `main`. As a contributor, your documentation changes will be visible on the live site after your PR is merged and subsequently deployed by maintainers.

- `source/` - Contains the source `.rst` files
- `source/_static/` - Static assets like images
- `source/conf.py` - Sphinx configuration
- `build/` - Generated documentation (not committed to version control)
The project's README.md is generated from a template, `dev/readme.template.md`, because it embeds code fragments that live in separate modules and are covered by tests. Do not modify README.md directly, as your changes will be overwritten by a pre-commit hook.

Instead:

- Edit the template file at `dev/readme.template.md`
- The pre-commit hook will automatically update README.md using the template

If you need to test the README generation manually:

```bash
uv run fab readme
```

We're here to help! Whether you're stuck on something technical, have questions about the contribution process, or want to suggest improvements to this guide, don't hesitate to reach out.
- Technical Issues: Setup problems, test failures, or development environment issues
- Contribution Process: Questions about pull requests, branching, or code review
- Feature Ideas: Discussion about new features or improvements
- Documentation: Clarifications about this contributing guide or suggesting improvements
🐛 GitHub Issues (preferred for technical questions):
- Open a new issue using our issue templates
📧 Direct Contact:
- 📧 Email: sergii@shcherbak.ai
- 💼 LinkedIn: Sergii Shcherbak
- 🐦 X: @seshch
Found something unclear in this contributing guide? Missing information that would have helped you? Please:
- Open an issue with the `documentation` label
- Suggest specific improvements or additions
- Share your contributor experience to help us improve the process
Thank you for contributing to ContextGem! 🙏
Your contributions help make ContextGem better for everyone. We appreciate your time and effort!