Claude Code users: Interactive skills guide you through these steps with checklists and validation:
- `/cocosearch:cocosearch-add-language` — language handlers, symbol extraction, context expansion
- `/cocosearch:cocosearch-add-grammar` — grammar handlers for domain-specific formats
Every indexed file is chunked by CocoIndex's `SplitRecursively`. The chunking strategy depends on what the `language` parameter resolves to:
| Tier | How it works | Languages |
|---|---|---|
| Tree-sitter (CocoIndex built-in) | `SplitRecursively` uses Tree-sitter internally to split at syntax boundaries (function/class edges) | Python, JS, TS, Go, Rust, Java, C, C++, C#, Ruby, PHP, and ~10 more in CocoIndex's built-in list |
| Custom handler regex | `SplitRecursively` receives a `CustomLanguageSpec` with hierarchical regex separators | HCL, Go Template, Dockerfile, Bash (language handlers) + GitHub Actions, GitLab CI, Docker Compose, Helm Template, Helm Values (grammar handlers) |
| Plain-text fallback | Splits on blank lines, newlines, whitespace | Everything not matched by either tier above |
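The recursive idea behind the custom-regex tier can be sketched in a few lines. This is an illustrative stand-in, not CocoIndex's actual algorithm (which also handles chunk overlap and size targets): try the coarsest separator first, and only recurse into pieces that are still too large.

```python
import re

def split_hierarchical(text: str, separators: list[str], max_len: int = 200) -> list[str]:
    """Split text with the coarsest separator first; recurse into
    oversized pieces with the next-finer separator. Pieces that are
    still too large after the last separator are kept as-is."""
    if len(text) <= max_len or not separators:
        return [text]
    head, *rest = separators
    chunks: list[str] = []
    for piece in re.split(head, text):
        if len(piece) > max_len:
            chunks.extend(split_hierarchical(piece, rest, max_len))
        elif piece.strip():
            chunks.append(piece)
    return chunks

# Coarsest to finest: blank lines, then single newlines, then whitespace
chunks = split_hierarchical(
    "para one\n\npara two\nline three",
    [r"\n\s*\n", r"\n", r"\s+"],
    max_len=12,
)
```

A handler's `SEPARATOR_SPEC` supplies exactly such a coarsest-to-finest separator list for its language.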
CocoSearch has five independent systems for language support:
- Language Handlers (`src/cocosearch/handlers/`) — custom chunking and metadata extraction for languages not in CocoIndex's built-in Tree-sitter list. Matched by file extension.
- Grammar Handlers (`src/cocosearch/handlers/grammars/`) — domain-specific chunking for files that share a base language but have distinct structure (e.g., GitHub Actions is a grammar of YAML). Matched by file path + content patterns.
- Symbol Extraction (`src/cocosearch/indexer/symbols.py`) — tree-sitter query-based extraction of functions, classes, methods, and other symbols for `--symbol-type`/`--symbol-name` filtering.
- Context Expansion (`src/cocosearch/search/context_expander.py`) — tree-sitter-based smart expansion to enclosing function/class boundaries for search results. Currently supports Python, JavaScript, TypeScript, Go, Rust.
- Dependency Extraction (`src/cocosearch/deps/extractors/`) — pluggable extractors that parse import/reference patterns to build a file-level dependency graph. Enables `deps tree`, `deps impact`, and the MCP tools `get_file_dependencies`/`get_file_impact`.
These systems are independent. A language can have:
- A handler only (e.g., Dockerfile, Bash)
- Symbol extraction only (e.g., Java, C, Ruby)
- Both (e.g., HCL/Terraform)
- A grammar handler (e.g., GitHub Actions, GitLab CI, Docker Compose)
- Context expansion (e.g., Python, Go, Rust)
- A dependency extractor (e.g., Python, JS/TS, Go, Docker Compose, GitHub Actions, Terraform, Helm)
Before adding a language, check whether `tree-sitter-language-pack` already supports it:

```shell
uv run python -c "from tree_sitter_language_pack import SupportedLanguage; print(sorted(SupportedLanguage.__args__))"
```

This prints all language names accepted by `get_parser()` and `get_language()`. If the language is listed, you can write tree-sitter queries for symbol extraction (Path B). If not, you'll need a handler (Path A) for custom chunking.
You can also verify that a specific language works:

```shell
uv run python -c "from tree_sitter_language_pack import get_parser; p = get_parser('hcl'); print(p)"
```

CocoIndex's `SplitRecursively` has built-in chunking support for ~28 languages (C, C++, C#, CSS, Go, HTML, Java, JavaScript, Python, Rust, TypeScript, etc.). These are listed in the CocoIndex docs and mapped in `LANGUAGE_EXTENSIONS` in `src/cocosearch/search/query.py`.

Languages in this list get language-aware chunking automatically — no handler needed. The `custom_languages` parameter on `SplitRecursively` (see `indexer/flow.py`) extends this with custom handlers (HCL, Dockerfile, Bash). If a `language` string doesn't match any built-in or custom language, CocoIndex falls back to plain-text splitting.
To decide which path to take:

- Language in CocoIndex's built-in list → chunking works out of the box; add symbol extraction (Path B) if desired
- Language not in CocoIndex's list but in `tree-sitter-language-pack` → add a handler for chunking (Path A) and optionally symbol extraction (Path B)
- Language in neither → add a handler (Path A) only
Use this when the language is not in CocoIndex's built-in list and needs custom chunking logic (config formats, DevOps tools, etc.).
- Copy the template:

  ```shell
  cp src/cocosearch/handlers/_template.py src/cocosearch/handlers/<language>.py
  ```

- Implement the handler class:

  - Set `EXTENSIONS` to the file extensions (with leading dot)
  - Define `SEPARATOR_SPEC` with `CustomLanguageSpec` — hierarchical regex separators from coarsest to finest
  - Implement `extract_metadata()` returning `block_type`, `hierarchy`, and `language_id`

- Include patterns are auto-derived — `IndexingConfig` automatically collects file extensions from handler `EXTENSIONS` and grammar `PATH_PATTERNS`. No manual `config.py` edit needed. For non-extension patterns (like `Dockerfile`), add an `INCLUDE_PATTERNS` class var to the handler.

- Important constraints:

  - Separators must use standard regex only — no lookaheads/lookbehinds (CocoIndex uses Rust regex)
  - The handler is autodiscovered at import time; no registration code needed

- Add the language_id to `_SKIP_PARSE_EXTENSIONS` in `src/cocosearch/indexer/parse_tracking.py` if the language has no tree-sitter grammar. This prevents false `no_grammar` reports in parse tracking stats. (Languages with tree-sitter support don't need this.)

- Add tests:

  ```shell
  # Create test file
  touch tests/unit/handlers/test_<language>.py

  # Run tests
  uv run pytest tests/unit/handlers/test_<language>.py -v
  ```

- Update `cli.py` `languages_command` — add a display name to the `display_names` dict in `languages_command` if the default `.title()` casing isn't right (e.g., `"hcl": "HCL"`). Extensions are derived from the handler's `EXTENSIONS` automatically.
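A minimal handler along these lines might look as follows. The `IniHandler` name and the `.ini` example are hypothetical, and `CustomLanguageSpec` is a stand-in dataclass with assumed `language_name`/`separators_regex` fields so the sketch runs standalone; the real spec comes from CocoIndex and the real skeleton lives in `_template.py`.

```python
from dataclasses import dataclass, field

# Stand-in for CocoIndex's CustomLanguageSpec (assumed field names),
# so this sketch runs without CocoIndex installed.
@dataclass
class CustomLanguageSpec:
    language_name: str
    separators_regex: list[str] = field(default_factory=list)

class IniHandler:
    """Hypothetical handler for .ini files."""
    EXTENSIONS = [".ini"]
    SEPARATOR_SPEC = CustomLanguageSpec(
        language_name="ini",
        # Coarsest to finest: section headers, then blank lines, then newlines
        separators_regex=[r"\n\[[^\]]+\]", r"\n\s*\n", r"\n"],
    )

    def extract_metadata(self, chunk_text: str) -> dict:
        """Return the block_type / hierarchy / language_id metadata
        the indexer expects for each chunk."""
        first = chunk_text.lstrip().splitlines()[0] if chunk_text.strip() else ""
        if first.startswith("[") and first.endswith("]"):
            section = first[1:-1]
            return {"block_type": "section", "hierarchy": [section], "language_id": "ini"}
        return {"block_type": "body", "hierarchy": [], "language_id": "ini"}

meta = IniHandler().extract_metadata("[server]\nport = 8080\n")
```

Note the separators use plain regex only, per the Rust-regex constraint above.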
| File | Action |
|---|---|
| `src/cocosearch/handlers/<language>.py` | Create — handler class |
| `tests/unit/handlers/test_<language>.py` | Create — handler tests |
| `src/cocosearch/indexer/parse_tracking.py` | Modify — add language_id to `_SKIP_PARSE_EXTENSIONS` (only if no tree-sitter grammar) |
| `src/cocosearch/cli.py` | Modify — `display_names` in `languages_command` (only if `.title()` casing is wrong) |
See `handlers/README.md` for the full handler protocol, separator design, and testing checklist.
Use this for languages already supported by Tree-sitter where you want `--symbol-type` and `--symbol-name` filtering.
- Create a tree-sitter query file:

  ```shell
  touch src/cocosearch/indexer/queries/<language>.scm
  ```

  Write S-expression patterns matching the language's AST. Use `@definition.<type>` captures for symbol types and `@name` for symbol names.

  Example (Python):

  ```scheme
  (function_definition name: (identifier) @name) @definition.function
  (class_definition name: (identifier) @name) @definition.class
  ```

- Add extension mappings to `LANGUAGE_MAP` in `src/cocosearch/indexer/symbols.py`:

  ```python
  LANGUAGE_MAP = {
      # ...existing...
      "ext": "language_name",
  }
  ```

- Add the language to `SYMBOL_AWARE_LANGUAGES` in `src/cocosearch/search/query.py`:

  ```python
  SYMBOL_AWARE_LANGUAGES = {"python", "javascript", ..., "new_language"}
  ```

- Update `_map_symbol_type` in `symbols.py` if the language introduces new AST node types that need mapping (e.g., `"block" -> "class"` for HCL).

- Update `_build_qualified_name` in `symbols.py` if the language needs special qualified name logic (e.g., Go receiver methods, HCL block labels).

- Add tests:

  ```shell
  # Create tests/unit/indexer/symbols/test_<language>.py
  uv run pytest tests/unit/indexer/symbols/test_<language>.py -v
  ```
| File | Action |
|---|---|
| `src/cocosearch/indexer/queries/<language>.scm` | Create — tree-sitter query |
| `src/cocosearch/indexer/symbols.py` | Modify — `LANGUAGE_MAP`, possibly `_map_symbol_type` and `_build_qualified_name` |
| `src/cocosearch/search/query.py` | Modify — `SYMBOL_AWARE_LANGUAGES` |
| `tests/unit/indexer/symbols/test_<language>.py` | Create — symbol extraction tests |
Query files are resolved with priority: project-level (`.cocosearch/queries/`) > user-level (`~/.cocosearch/queries/`) > built-in (`src/cocosearch/indexer/queries/`). Users can override built-in queries without modifying the package.
HCL/Terraform is a worked example of a language with both systems.
- `EXTENSIONS = [".tf", ".hcl", ".tfvars"]`
- `SEPARATOR_SPEC` with regex separators for HCL blocks
- `extract_metadata()` recognizing 12 block keywords (resource, data, variable, output, locals, module, provider, terraform, import, moved, removed, check)
- Query files: `src/cocosearch/indexer/queries/hcl.scm` and `terraform.scm` (identical AST)
- `LANGUAGE_MAP` entries: `"tf" -> "terraform"`, `"hcl" -> "hcl"`, `"tfvars" -> "hcl"`
- `_map_symbol_type`: `"block" -> "class"` mapping added
- `_build_qualified_name`: HCL-specific logic to build names from block type + labels (e.g., `resource.aws_s3_bucket.data`)
- `SYMBOL_AWARE_LANGUAGES` in `search/query.py` includes `"hcl"`
- `cli.py` `languages_command` shows HCL with a checkmark and all three extensions
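The two hooks the HCL port customizes can be sketched as follows. These are simplified, hypothetical stand-ins for `_map_symbol_type` and `_build_qualified_name`; the real functions in `symbols.py` may differ in signature and coverage.

```python
# Per-language overrides for AST node types that don't follow the
# common *_definition / *_declaration naming.
_NODE_TYPE_OVERRIDES = {
    "hcl": {"block": "class"},  # HCL blocks surface as classes for --symbol-type
}

def map_symbol_type(language: str, node_type: str) -> str:
    """Map an AST node type to a standard symbol type, stripping the
    common _definition/_declaration suffixes."""
    override = _NODE_TYPE_OVERRIDES.get(language, {}).get(node_type)
    if override:
        return override
    return node_type.removesuffix("_definition").removesuffix("_declaration")

def build_hcl_qualified_name(block_type: str, labels: list[str]) -> str:
    """resource "aws_s3_bucket" "data" -> resource.aws_s3_bucket.data"""
    return ".".join([block_type, *labels])

name = build_hcl_qualified_name("resource", ["aws_s3_bucket", "data"])
```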
Use this when multiple domain syntaxes share the same file extension and you want structured chunking and metadata for a specific schema. For example, GitHub Actions, GitLab CI, and Docker Compose are all YAML files, but each has distinct structure.
Language vs Grammar:
- A language is matched by file extension (1:1 mapping, e.g., `.tf` -> HCL)
- A grammar is matched by file path + content patterns (e.g., `.github/workflows/*.yml` with `on:` + `jobs:` -> GitHub Actions)
Priority: Grammar match > Language match > `TextHandler` fallback.

`extract_language()` in `indexer/embedder.py` checks grammar handlers first. If a grammar matches, it returns the grammar name (e.g., `"github-actions"`) instead of the file extension. This grammar name flows through the pipeline:

- `SplitRecursively` uses the grammar's `CustomLanguageSpec` for chunking
- `extract_chunk_metadata` dispatches to the grammar handler for metadata
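That dispatch order can be sketched as below. The two inline grammar specs are hypothetical stand-ins; in CocoSearch the path patterns and content markers live in the grammar handler classes and their `matches()` methods.

```python
from fnmatch import fnmatch

# Hypothetical inline grammar specs: path globs plus content markers
# that must all appear in the file.
GRAMMARS = {
    "github-actions": {
        "path_patterns": [".github/workflows/*.yml", ".github/workflows/*.yaml"],
        "markers": ["on:", "jobs:"],
    },
    "docker-compose": {
        "path_patterns": ["docker-compose*.yml", "compose*.yml"],
        "markers": ["services:"],
    },
}
EXTENSION_LANGUAGES = {".tf": "hcl", ".py": "python", ".yml": "yaml"}

def extract_language(path: str, content: str) -> str:
    # 1. Grammar match: file path glob + content markers
    for name, spec in GRAMMARS.items():
        if any(fnmatch(path, pat) for pat in spec["path_patterns"]) and \
           all(marker in content for marker in spec["markers"]):
            return name
    # 2. Language match by file extension
    for ext, lang in EXTENSION_LANGUAGES.items():
        if path.endswith(ext):
            return lang
    # 3. Plain-text fallback
    return "text"

lang = extract_language(".github/workflows/ci.yml", "on: push\njobs:\n  build: {}\n")
```

A workflow file thus resolves to `"github-actions"` even though its extension alone would resolve to `"yaml"`.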
- Copy the template:

  ```shell
  cp src/cocosearch/handlers/grammars/_template.py src/cocosearch/handlers/grammars/<grammar>.py
  ```

- Implement the grammar handler class:

  For YAML-based grammars, inherit from `YamlGrammarBase` (in `handlers/grammars/_base.py`), which provides shared comment stripping, path matching, and the fallback metadata chain. You only need to implement:

  - `GRAMMAR_NAME` — unique identifier (lowercase, hyphenated, e.g., `"github-actions"`)
  - `PATH_PATTERNS` — glob patterns matching the file paths
  - `SEPARATOR_SPEC` — `CustomLanguageSpec` with hierarchical separators (or `None` for the default)
  - `_has_content_markers(content)` — content validation for `matches()`
  - `_extract_grammar_metadata(stripped, text)` — grammar-specific metadata extraction (return a dict, or `None` for fallback)

  For non-YAML grammars (e.g., HCL-based Terraform), implement the full `GrammarHandler` protocol directly.

- Important constraints:

  - Separators must use standard regex only — no lookaheads/lookbehinds (CocoIndex uses Rust regex)
  - The grammar is autodiscovered at import time; no registration code needed
  - Include patterns are auto-derived from `PATH_PATTERNS` — no manual `config.py` edit needed

- Add tests:

  ```shell
  touch tests/unit/handlers/grammars/test_<grammar>.py
  uv run pytest tests/unit/handlers/grammars/test_<grammar>.py -v
  ```
| File | Action |
|---|---|
| `src/cocosearch/handlers/grammars/<grammar>.py` | Create — grammar handler class |
| `tests/unit/handlers/grammars/test_<grammar>.py` | Create — grammar handler tests |
| Grammar | Base Language | Path Patterns | Content Markers |
|---|---|---|---|
| argocd | yaml | `*.yaml`, `*.yml` | `apiVersion:` + `argoproj.io/` + `kind:` |
| docker-compose | yaml | `docker-compose*.yml`, `compose*.yml` | `services:` |
| github-actions | yaml | `.github/workflows/*.yml` | `on:` + `jobs:` |
| gitlab-ci | yaml | `.gitlab-ci.yml` | `stages:` or (`script:` + `image:`/`stage:`) |
| helm-chart | yaml | `**/Chart.yaml`, `**/Chart.yml` | `apiVersion:` + `name:` |
| helm-template | gotmpl | `templates/*.yaml`, `templates/*.tpl` | `apiVersion:` or `{{` |
| helm-values | yaml | `values*.yaml` in chart dirs | `## @section` or YAML with comments |
| kubernetes | yaml | `*.yaml`, `*.yml` | `apiVersion:` + `kind:` (excludes Helm/ArgoCD) |
| terraform | hcl | `**/*.tf`, `**/*.tfvars` | HCL resource/data/module blocks |
Use this when you want `smart_context=True` to expand search results to enclosing function/class boundaries for a language. Currently supported: Python, JavaScript, TypeScript, Go, Rust, Scala, HCL/Terraform.
- Add node types to `DEFINITION_NODE_TYPES` in `src/cocosearch/search/context_expander.py`:

  ```python
  DEFINITION_NODE_TYPES: dict[str, set[str]] = {
      # ...existing...
      "new_language": {"function_declaration", "class_declaration"},
  }
  ```

  Use the tree-sitter node types for function/class definitions in that language. You can explore them with:

  ```shell
  uv run python -c "
  from tree_sitter_language_pack import get_parser
  p = get_parser('new_language')
  # Parse a sample file and inspect node types
  "
  ```

- Add extension mappings to `EXTENSION_TO_LANGUAGE` in the same file:

  ```python
  EXTENSION_TO_LANGUAGE: dict[str, str] = {
      # ...existing...
      ".ext": "new_language",
  }
  ```

- `CONTEXT_EXPANSION_LANGUAGES` updates automatically — it's derived from `DEFINITION_NODE_TYPES.keys()`.
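The expansion itself amounts to walking up the syntax tree from the matched chunk until a definition node is found. A sketch with a stand-in `Node` class in place of real tree-sitter nodes (which expose the same `type`/`parent`/byte-range shape):

```python
from dataclasses import dataclass

# Stand-in mimicking the slice of the tree-sitter node interface
# this walk needs; the real code operates on actual tree-sitter nodes.
@dataclass
class Node:
    type: str
    start_byte: int
    end_byte: int
    parent: "Node | None" = None

DEFINITION_NODE_TYPES = {"python": {"function_definition", "class_definition"}}

def expand_to_definition(node: Node, language: str) -> tuple[int, int]:
    """Walk up to the nearest enclosing function/class definition;
    fall back to the node's own range if none is found."""
    definition_types = DEFINITION_NODE_TYPES.get(language, set())
    current: Node | None = node
    while current is not None:
        if current.type in definition_types:
            return (current.start_byte, current.end_byte)
        current = current.parent
    return (node.start_byte, node.end_byte)

module = Node("module", 0, 500)
func = Node("function_definition", 100, 300, parent=module)
stmt = Node("expression_statement", 150, 170, parent=func)
span = expand_to_definition(stmt, "python")
```

Registering a language in `DEFINITION_NODE_TYPES` is what gives this walk something to stop at.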
| File | Action |
|---|---|
| `src/cocosearch/search/context_expander.py` | Modify — `DEFINITION_NODE_TYPES` and `EXTENSION_TO_LANGUAGE` |
Use this when the language has import/require/reference patterns that can be extracted for dependency analysis. This enables `deps tree`, `deps impact`, and the `get_file_dependencies`/`get_file_impact` MCP tools.

Already implemented: Python (`py`), JavaScript/TypeScript (`js`, `jsx`, `ts`, `tsx`, `mjs`, `cjs`, `mts`, `cts`), Go (`go`), Docker Compose (`docker-compose`), GitHub Actions (`github-actions`), Terraform (`terraform`), Helm (`helm-template`, `helm-values`).
- Programming languages with import statements (e.g., `import`, `require`, `use`, `include`) → edge type `DepType.IMPORT`
- Infrastructure formats with reference patterns (e.g., image refs, action `uses:`, module sources) → edge type `DepType.REFERENCE` with `metadata.kind` for specifics
- Grammar handlers that set a `language_id` (their `GRAMMAR_NAME`) — the extractor matches on `LANGUAGES = {"grammar-name"}`
- Create the extractor:

  ```shell
  touch src/cocosearch/deps/extractors/<language>.py
  ```

  Implement the `DependencyExtractor` protocol:

  - Set `LANGUAGES` — the set of language IDs this extractor handles (must match the `language_id` from the handler/grammar or the file extension)
  - Implement `extract(file_path: str, content: str) -> list[DependencyEdge]`
  - For tree-sitter languages, use `get_parser("<language>")` and walk the AST to find import nodes
  - For YAML-based grammars, parse with `yaml.safe_load` and extract reference patterns
  - For HCL/template formats, use regex

- Choose edge types:

  - `DepType.IMPORT` — code-level imports (`import X`, `require('X')`, `use X`)
  - `DepType.REFERENCE` — grammar-level refs (image refs, action uses, module sources). Set `metadata.kind` to describe the reference type (e.g., `"image"`, `"action"`, `"module_source"`)
  - `DepType.CALL` — direct symbol calls (rare, not used by current extractors)

- Populate edge metadata:

  ```python
  DependencyEdge(
      source_file=file_path,
      target_file=None,  # Resolved later by the module resolver
      target_module="<raw-import-string>",
      dep_type=DepType.IMPORT,
      metadata={"module": "<raw-string>", "line": 5},
  )
  ```

- Add a module resolver (if needed) to `src/cocosearch/deps/resolver.py`:

  - Implement the `ModuleResolver` protocol: `build_index(indexed_files)` and `resolve(edge, module_index)`
  - Register it in the `_RESOLVERS` dict mapping language IDs to the resolver instance
  - Already implemented: `PythonResolver`, `JavaScriptResolver`, `GoResolver`, `TerraformResolver`

- Add tests:

  ```shell
  touch tests/unit/deps/extractors/test_<language>.py
  uv run pytest tests/unit/deps/extractors/test_<language>.py -v
  ```

  Cover: each import/reference pattern, edge metadata, empty files, syntax variations.

  If a resolver was added:

  ```shell
  uv run pytest tests/unit/deps/test_resolver.py -v
  ```

The extractor is autodiscovered at import time — any `deps/extractors/*.py` file (not prefixed with `_`) implementing the protocol is auto-registered. No registration code needed.
| File | Action |
|---|---|
| `src/cocosearch/deps/extractors/<language>.py` | Create — dependency extractor |
| `src/cocosearch/deps/resolver.py` | Modify — add resolver (only if import resolution needed) |
| `tests/unit/deps/extractors/test_<language>.py` | Create — extractor tests |
| `tests/unit/deps/test_resolver.py` | Modify — resolver tests (only if resolver added) |
| Extractor | Language IDs | Edge Type | Parsing Method |
|---|---|---|---|
| python.py | `py` | IMPORT | Tree-sitter (`import_statement`, `import_from_statement`) |
| javascript.py | `js`, `jsx`, `ts`, `tsx`, `mjs`, `cjs`, `mts`, `cts` | IMPORT | Tree-sitter (`import_statement`, `call_expression[require]`) |
| go.py | `go` | IMPORT | Tree-sitter (`import_declaration`, `import_spec`) |
| docker_compose.py | `docker-compose` | REFERENCE | YAML (`image:`, `depends_on:`, `extends:`) |
| github_actions.py | `github-actions` | REFERENCE | YAML (`uses:` action/workflow refs) |
| terraform.py | `terraform` | REFERENCE | Regex (HCL `module { source = "..." }`) |
| helm.py | `helm-template`, `helm-values` | REFERENCE | Regex + YAML (includes, images, subcharts) |
When adding a new language handler, verify all registrations are complete:
- Handler (if applicable): `handlers/<language>.py` created, extensions registered via autodiscovery
- include_patterns: auto-derived from handler `EXTENSIONS` — verify with `IndexingConfig().include_patterns`
- _SKIP_PARSE_EXTENSIONS: language_id added to `_SKIP_PARSE_EXTENSIONS` in `indexer/parse_tracking.py` (only if no tree-sitter grammar — prevents false `no_grammar` reports)
- LANGUAGE_MAP (if symbol extraction): all file extensions mapped to the tree-sitter language name
- Query file (if symbol extraction): `indexer/queries/<language>.scm` created
- SYMBOL_AWARE_LANGUAGES: language added to the set in `search/query.py`
- _map_symbol_type: any new AST node types mapped to standard types
- _build_qualified_name: special qualified name logic added if needed
- DEFINITION_NODE_TYPES (if context expansion): node types added in `search/context_expander.py`
- EXTENSION_TO_LANGUAGE (if context expansion): extension mappings added in `search/context_expander.py`
- cli.py languages_command: display name override added if needed (extensions are derived from the handler)
- Dependency extractor: `deps/extractors/<language>.py` created (if the language has imports — Path F)
- Module resolver: added to `deps/resolver.py` (if import resolution needed — Path F)
- Tests: handler tests and/or symbol extraction tests added
- README.md: Supported Languages section updated (count, table, lists)
- README.md: language badge added to the badges section at the top
When adding a new grammar handler:
- Grammar handler: `handlers/grammars/<grammar>.py` created — inherit `YamlGrammarBase` for YAML grammars
- Tests: `tests/unit/handlers/grammars/test_<grammar>.py` created
- Dependency extractor: `deps/extractors/<grammar>.py` created (if the grammar has reference patterns)
- Extractor tests: `tests/unit/deps/extractors/test_<grammar>.py` created (if extractor added)
- README.md: Supported Grammars section updated
- README.md: grammar badge added to the badges section at the top
When adding a dependency extractor (Path F):
- Extractor: `src/cocosearch/deps/extractors/<language>.py` created (autodiscovered)
- `LANGUAGES` set matches the language IDs from the handler/grammar or file extension
- Module resolver: added to `src/cocosearch/deps/resolver.py` (if import resolution needed)
- Resolver registered in the `_RESOLVERS` dict (if added)
- Tests: `tests/unit/deps/extractors/test_<language>.py` created
- Resolver tests: added to `tests/unit/deps/test_resolver.py` (if resolver added)
- CLAUDE.md: extractor count and resolver list updated
- `handlers/README.md` — handler protocol, separator design, testing
- `handlers/grammars/_template.py` — grammar handler template
- `CLAUDE.md` — quick handler steps, architecture overview