Adding Language Support

Claude Code users: Interactive skills guide you through these steps with checklists and validation:

/cocosearch:cocosearch-add-language — language handlers, symbol extraction, context expansion

/cocosearch:cocosearch-add-grammar — grammar handlers for domain-specific formats

Chunking Tiers

Every indexed file is chunked by CocoIndex's SplitRecursively. The chunking strategy depends on what the language parameter resolves to:

Tier	How it works	Languages
Tree-sitter (CocoIndex built-in)	`SplitRecursively` uses Tree-sitter internally to split at syntax boundaries (function/class edges)	Python, JS, TS, Go, Rust, Java, C, C++, C#, Ruby, PHP, and ~10 more in CocoIndex's built-in list
Custom handler regex	`SplitRecursively` receives a `CustomLanguageSpec` with hierarchical regex separators	HCL, Go Template, Dockerfile, Bash (language handlers) + GitHub Actions, GitLab CI, Docker Compose, Helm Template, Helm Values (grammar handlers)
Plain-text fallback	Splits on blank lines, newlines, whitespace	Everything not matched by either tier above
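
The tier selection can be pictured as a lookup in priority order. The following is a minimal sketch, not CocoIndex code: the sets `BUILTIN` and `CUSTOM` and the function `chunking_tier` are hypothetical, and the language strings are abbreviated stand-ins for the names in the table. The real resolution happens inside `SplitRecursively`.

```python
# Illustrative only: abbreviated stand-ins, not data taken from CocoIndex.
BUILTIN = {"python", "javascript", "typescript", "go", "rust", "java"}
CUSTOM = {"hcl", "dockerfile", "bash", "github-actions"}


def chunking_tier(language: str) -> str:
    """Return the name of the tier a language string would resolve to."""
    name = language.lower()
    if name in BUILTIN:
        return "tree-sitter"
    if name in CUSTOM:
        return "custom-regex"
    # Anything unmatched falls through to plain-text splitting.
    return "plain-text"


if __name__ == "__main__":
    for lang in ("python", "hcl", "some-unknown-format"):
        print(lang, "->", chunking_tier(lang))
```

The two sets are disjoint here, so the order of the first two checks does not matter in this sketch.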

Systems Overview

CocoSearch has five independent systems for language support:

Language Handlers (src/cocosearch/handlers/) — custom chunking and metadata extraction for languages not in CocoIndex's built-in Tree-sitter list. Matched by file extension.
Grammar Handlers (src/cocosearch/handlers/grammars/) — domain-specific chunking for files that share a base language but have distinct structure (e.g., GitHub Actions is a grammar of YAML). Matched by file path + content patterns.
Symbol Extraction (src/cocosearch/indexer/symbols.py) — tree-sitter query-based extraction of functions, classes, methods, and other symbols for --symbol-type / --symbol-name filtering.
Context Expansion (src/cocosearch/search/context_expander.py): tree-sitter-based smart expansion of search results to the enclosing function/class boundaries. The currently supported languages are listed under Path E.
Dependency Extraction (src/cocosearch/deps/extractors/) — pluggable extractors that parse import/reference patterns to build a file-level dependency graph. Enables deps tree, deps impact, and MCP tools get_file_dependencies/get_file_impact.

These systems are independent. A language can have:

A handler only (e.g., Dockerfile, Bash)
Symbol extraction only (e.g., Java, C, Ruby)
Both a handler and symbol extraction (e.g., HCL/Terraform)
A grammar handler (e.g., GitHub Actions, GitLab CI, Docker Compose)
Context expansion (e.g., Python, Go, Rust)
A dependency extractor (e.g., Python, JS/TS, Go, Docker Compose, GitHub Actions, Terraform, Helm)

Checking Tree-sitter's Built-in Language List

Before adding a language, check if tree-sitter-language-pack already supports it:

uv run python -c "from tree_sitter_language_pack import SupportedLanguage; print(sorted(SupportedLanguage.__args__))"

This prints all language names accepted by get_parser() and get_language(). If the language is listed, you can write tree-sitter queries for symbol extraction (Path B). If not, tree-sitter-based symbol extraction is unavailable, and the only option is a handler (Path A) for custom chunking.

You can also verify a specific language works:

uv run python -c "from tree_sitter_language_pack import get_parser; p = get_parser('hcl'); print(p)"

Checking CocoIndex's Built-in Language List

CocoIndex's SplitRecursively has built-in chunking support for ~28 languages (C, C++, C#, CSS, Go, HTML, Java, JavaScript, Python, Rust, TypeScript, etc.). These are listed in the CocoIndex docs and mapped in LANGUAGE_EXTENSIONS in src/cocosearch/search/query.py.

Languages in this list get language-aware chunking automatically, with no handler needed. The custom_languages parameter on SplitRecursively (see indexer/flow.py) extends this with custom handlers (e.g., HCL, Dockerfile, Bash). If a language string doesn't match any built-in or custom language, CocoIndex falls back to plain-text splitting.

So to decide which path to take:

Language in CocoIndex's built-in list → chunking works out of the box, add symbol extraction (Path B) if desired
Language not in CocoIndex's list but in tree-sitter-language-pack → add a handler for chunking (Path A) and optionally symbol extraction (Path B)
Language in neither → add a handler (Path A) only
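
The three rules above amount to a small decision table. As a rough sketch (the function `paths_for` is hypothetical and exists only for illustration), the two list checks can be turned into the paths named in this guide. The case of a language in CocoIndex's list but without a tree-sitter grammar is an inference from Path B's requirement, not something the rules state.

```python
def paths_for(in_cocoindex: bool, in_tree_sitter_pack: bool) -> list[str]:
    """Map the two list checks to the paths described in this guide."""
    paths = []
    if not in_cocoindex:
        # No built-in chunking, so a handler is needed.
        paths.append("A")
    if in_tree_sitter_pack:
        # Symbol extraction relies on a tree-sitter grammar; it stays optional.
        paths.append("B (optional)")
    return paths


if __name__ == "__main__":
    print(paths_for(True, True))
    print(paths_for(False, True))
    print(paths_for(False, False))
```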

Path A: Adding a Language Handler (Chunking + Metadata)

Use this when the language is not in CocoIndex's built-in list and needs custom chunking logic (config formats, DevOps tools, etc.).

Steps

Copy the template:

cp src/cocosearch/handlers/_template.py src/cocosearch/handlers/<language>.py

Implement the handler class:
- Set EXTENSIONS to the file extensions (with leading dot)
- Define SEPARATOR_SPEC with CustomLanguageSpec — hierarchical regex separators from coarsest to finest
- Implement extract_metadata() returning block_type, hierarchy, and language_id
Include patterns are auto-derived — IndexingConfig automatically collects file extensions from handler EXTENSIONS and grammar PATH_PATTERNS. No manual config.py edit needed. For non-extension patterns (like Dockerfile), add an INCLUDE_PATTERNS class var to the handler.
Important constraints:
- Separators must use standard regex only — no lookaheads/lookbehinds (CocoIndex uses Rust regex)
- The handler is autodiscovered at import time; no registration code needed
Add language_id to _SKIP_PARSE_EXTENSIONS in src/cocosearch/indexer/parse_tracking.py if the language has no tree-sitter grammar. This prevents false no_grammar reports in parse tracking stats. (Languages with tree-sitter support don't need this.)

Add tests:

# Create test file
touch tests/unit/handlers/test_<language>.py

# Run tests
uv run pytest tests/unit/handlers/test_<language>.py -v

Update cli.py languages_command if needed: add an entry to the display_names dict when the default .title() casing isn't right (e.g., "hcl": "HCL"). Extensions are derived from the handler's EXTENSIONS automatically.
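
To make the handler steps concrete, here is a hedged sketch of what a handler for an INI-like format might look like. Everything in it is an assumption for illustration: `IniHandler`, the separator patterns, the `extract_metadata` signature, and the `include_patterns` helper are hypothetical, and the `CustomLanguageSpec` dataclass is a local stand-in so the example runs without CocoIndex installed. Check `_template.py` and handlers/README.md for the actual protocol.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CustomLanguageSpec:
    """Local stand-in for CocoIndex's spec; field names are an assumption."""
    language_name: str
    separators_regex: list[str]
    aliases: list[str] = field(default_factory=list)


class IniHandler:
    """Hypothetical handler for an INI-like format."""

    EXTENSIONS = [".ini"]             # leading dot, as the steps require
    INCLUDE_PATTERNS = ["setup.cfg"]  # for file names an extension cannot express

    # Coarsest to finest; plain regex only (no lookaheads or lookbehinds).
    SEPARATOR_SPEC = CustomLanguageSpec(
        language_name="ini",
        separators_regex=[r"\n\[", r"\n\n+", r"\n"],
    )

    _SECTION = re.compile(r"^\s*\[([^\]]+)\]", re.MULTILINE)

    def extract_metadata(self, text: str) -> dict[str, str]:
        """Describe a chunk by the first section header it contains."""
        match = self._SECTION.search(text)
        if match is None:
            return {"block_type": "", "hierarchy": "", "language_id": "ini"}
        name = match.group(1).strip()
        return {
            "block_type": "section",
            "hierarchy": f"section:{name}",
            "language_id": "ini",
        }


def include_patterns(handler: type) -> list[str]:
    """Imitate the auto-derivation: one glob per extension plus explicit patterns."""
    derived = [f"*{ext}" for ext in handler.EXTENSIONS]
    return derived + list(getattr(handler, "INCLUDE_PATTERNS", []))
```

A handler test could then assert on the returned dictionary for a few representative chunks, including one with no section header.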

Files to Create/Modify

File	Action
`src/cocosearch/handlers/<language>.py`	Create — handler class
`tests/unit/handlers/test_<language>.py`	Create — handler tests
`src/cocosearch/indexer/parse_tracking.py`	Modify — add language_id to `_SKIP_PARSE_EXTENSIONS` (only if no tree-sitter grammar)
`src/cocosearch/cli.py`	Modify — `display_names` in `languages_command` (only if `.title()` casing is wrong)

See handlers/README.md for the full handler protocol, separator design, and testing checklist.

Path B: Adding Symbol Extraction (Tree-sitter Queries)

Use this for languages already supported by Tree-sitter where you want --symbol-type and --symbol-name filtering.

Steps

Create a tree-sitter query file:

touch src/cocosearch/indexer/queries/<language>.scm

Write S-expression patterns matching the language's AST. Use @definition.<type> captures for symbol types and @name for symbol names.

Example (Python):

(function_definition name: (identifier) @name) @definition.function
(class_definition name: (identifier) @name) @definition.class

Add extension mappings to LANGUAGE_MAP in src/cocosearch/indexer/symbols.py:

LANGUAGE_MAP = {
    # ...existing...
    "ext": "language_name",
}

Add the language to SYMBOL_AWARE_LANGUAGES in src/cocosearch/search/query.py:

SYMBOL_AWARE_LANGUAGES = {"python", "javascript", ..., "new_language"}

Update _map_symbol_type in symbols.py if the language introduces new AST node types that need mapping (e.g., "block" -> "class" for HCL).
Update _build_qualified_name in symbols.py if the language needs special qualified name logic (e.g., Go receiver methods, HCL block labels).

Add tests:

# Create tests/unit/indexer/symbols/test_<language>.py
uv run pytest tests/unit/indexer/symbols/test_<language>.py -v
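
The two optional hooks in the steps above can be pictured with a simplified sketch. The functions below are hypothetical and do not mirror the actual signatures of _map_symbol_type or _build_qualified_name in symbols.py; they only illustrate the idea of normalizing a raw type and joining name components.

```python
# Hypothetical override table; the real mapping lives in symbols.py.
_TYPE_OVERRIDES = {"block": "class"}


def map_symbol_type(capture_name: str) -> str:
    """Turn a capture such as 'definition.function' into a symbol type."""
    raw = capture_name.removeprefix("definition.")
    return _TYPE_OVERRIDES.get(raw, raw)


def build_qualified_name(parts: list[str]) -> str:
    """Join name components, e.g. an HCL block type followed by its labels."""
    return ".".join(part for part in parts if part)


if __name__ == "__main__":
    print(map_symbol_type("definition.block"))
    print(build_qualified_name(["resource", "aws_s3_bucket", "data"]))
```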

Files to Create/Modify

File	Action
`src/cocosearch/indexer/queries/<language>.scm`	Create — tree-sitter query
`src/cocosearch/indexer/symbols.py`	Modify — `LANGUAGE_MAP`, possibly `_map_symbol_type` and `_build_qualified_name`
`src/cocosearch/search/query.py`	Modify — `SYMBOL_AWARE_LANGUAGES`
`tests/unit/indexer/symbols/test_<language>.py`	Create — symbol extraction tests

Query file resolution

Query files are resolved with priority: project-level (.cocosearch/queries/) > user-level (~/.cocosearch/queries/) > built-in (src/cocosearch/indexer/queries/). Users can override built-in queries without modifying the package.
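
A first-match-wins search over an ordered list of directories is one way to picture this priority. The sketch below assumes that approach; the function names `resolve_query_file` and `default_search_dirs` are hypothetical and are not taken from the CocoSearch code.

```python
from pathlib import Path
from typing import Optional


def resolve_query_file(language: str, search_dirs: list[Path]) -> Optional[Path]:
    """Return the first existing '<language>.scm', honoring directory order."""
    for directory in search_dirs:
        candidate = directory / f"{language}.scm"
        if candidate.is_file():
            return candidate
    return None


def default_search_dirs(project_root: Path, builtin_dir: Path) -> list[Path]:
    """Highest priority first: project, then user, then built-in."""
    return [
        project_root / ".cocosearch" / "queries",
        Path.home() / ".cocosearch" / "queries",
        builtin_dir,
    ]
```

With this ordering, dropping an `hcl.scm` into a project's `.cocosearch/queries/` directory shadows the built-in file without touching the package.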

Path C: Both Handler + Symbol Extraction (HCL Example)

HCL/Terraform is a worked example of a language with both systems.

Handler (`src/cocosearch/handlers/hcl.py`)

EXTENSIONS = [".tf", ".hcl", ".tfvars"]
SEPARATOR_SPEC with regex separators for HCL blocks
extract_metadata() recognizing 12 block keywords (resource, data, variable, output, locals, module, provider, terraform, import, moved, removed, check)

Symbol extraction

Query files: src/cocosearch/indexer/queries/hcl.scm and terraform.scm (identical AST)
LANGUAGE_MAP entries: "tf" -> "terraform", "hcl" -> "hcl", "tfvars" -> "hcl"
_map_symbol_type: "block" -> "class" mapping added
_build_qualified_name: HCL-specific logic to build names from block type + labels (e.g., resource.aws_s3_bucket.data)

Registration

SYMBOL_AWARE_LANGUAGES in search/query.py includes "hcl"
cli.py languages_command shows HCL with checkmark and all three extensions

Path D: Adding a Grammar Handler (Domain-Specific Schema)

Use this when multiple domain syntaxes share the same file extension and you want structured chunking and metadata for a specific schema. For example, GitHub Actions, GitLab CI, and Docker Compose are all YAML files, but each has distinct structure.

Language vs Grammar:

A language is matched by file extension (1:1 mapping, e.g., .tf -> HCL)
A grammar is matched by file path + content patterns (e.g., .github/workflows/*.yml with on: + jobs: -> GitHub Actions)

Priority: Grammar match > Language match > TextHandler fallback.

How it works

extract_language() in indexer/embedder.py checks grammar handlers first. If a grammar matches, it returns the grammar name (e.g., "github-actions") instead of the file extension. This grammar name flows through the pipeline:

SplitRecursively uses the grammar's CustomLanguageSpec for chunking
extract_chunk_metadata dispatches to the grammar handler for metadata
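
The grammar-first dispatch can be sketched as follows. This is an assumption-laden illustration, not the code in indexer/embedder.py: the `GRAMMARS` table and the function `extract_language_sketch` are hypothetical, and the matching is reduced to one glob plus substring markers.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical grammar table: name -> (path glob, required content markers).
GRAMMARS = {
    "github-actions": (".github/workflows/*.yml", ("on:", "jobs:")),
    "docker-compose": ("docker-compose.yml", ("services:",)),
}


def extract_language_sketch(path: str, content: str) -> str:
    """Return a grammar name if one matches, else the bare file extension."""
    for name, (glob, markers) in GRAMMARS.items():
        if fnmatch(path, glob) and all(marker in content for marker in markers):
            return name
    # No grammar matched: fall back to extension-based language matching.
    return PurePosixPath(path).suffix.lstrip(".")


if __name__ == "__main__":
    workflow = "on: push\njobs:\n  build: {}\n"
    print(extract_language_sketch(".github/workflows/ci.yml", workflow))
    print(extract_language_sketch("main.tf", ""))
```

Requiring both a path match and content markers keeps an unrelated YAML file in a matching directory from being treated as a workflow.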

Steps

Copy the template:

cp src/cocosearch/handlers/grammars/_template.py src/cocosearch/handlers/grammars/<grammar>.py

Implement the grammar handler class:

For YAML-based grammars, inherit from YamlGrammarBase (in handlers/grammars/_base.py), which provides shared comment stripping, path matching, and a fallback metadata chain. You only need to implement:
- GRAMMAR_NAME — unique identifier (lowercase, hyphenated, e.g., "github-actions")
- PATH_PATTERNS — glob patterns matching the file paths
- SEPARATOR_SPEC — CustomLanguageSpec with hierarchical separators (or None for default)
- _has_content_markers(content) — content validation for matches()
- _extract_grammar_metadata(stripped, text) — grammar-specific metadata extraction (return dict or None for fallback)
For non-YAML grammars (e.g., HCL-based Terraform), implement the full GrammarHandler protocol directly.
Important constraints:
- Separators must use standard regex only — no lookaheads/lookbehinds (CocoIndex uses Rust regex)
- The grammar is autodiscovered at import time; no registration code needed
- Include patterns are auto-derived from PATH_PATTERNS — no manual config.py edit needed

Add tests:

touch tests/unit/handlers/grammars/test_<grammar>.py
uv run pytest tests/unit/handlers/grammars/test_<grammar>.py -v
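
As an illustration of the hooks listed in the steps above, here is a sketch of a grammar for Taskfile-style YAML. It is hypothetical throughout: `YamlGrammarBaseStandIn` is a local substitute for YamlGrammarBase (the real base does more, and its `matches` signature may differ), `TaskfileGrammar` is not an existing grammar, and the metadata keys are assumed to match those used by language handlers.

```python
from fnmatch import fnmatch
from typing import Optional


class YamlGrammarBaseStandIn:
    """Minimal stand-in for YamlGrammarBase so the sketch runs on its own."""

    GRAMMAR_NAME = ""
    PATH_PATTERNS: list[str] = []
    SEPARATOR_SPEC = None  # None means "use the default separators"

    def matches(self, path: str, content: str) -> bool:
        path_ok = any(fnmatch(path, pattern) for pattern in self.PATH_PATTERNS)
        return path_ok and self._has_content_markers(content)

    def _has_content_markers(self, content: str) -> bool:
        raise NotImplementedError

    def _extract_grammar_metadata(self, stripped: str, text: str) -> Optional[dict]:
        raise NotImplementedError


class TaskfileGrammar(YamlGrammarBaseStandIn):
    """Hypothetical grammar for Taskfile.yml-style files."""

    GRAMMAR_NAME = "taskfile"
    PATH_PATTERNS = ["Taskfile.yml", "*/Taskfile.yml"]

    def _has_content_markers(self, content: str) -> bool:
        return "version:" in content and "tasks:" in content

    def _extract_grammar_metadata(self, stripped: str, text: str) -> Optional[dict]:
        first_key = stripped.lstrip().split(":", 1)[0]
        if first_key in {"version", "tasks", "vars", "includes"}:
            return {
                "block_type": first_key,
                "hierarchy": first_key,
                "language_id": self.GRAMMAR_NAME,
            }
        return None  # defer to the base class's fallback chain
```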

Files to Create/Modify

File	Action
`src/cocosearch/handlers/grammars/<grammar>.py`	Create — grammar handler class
`tests/unit/handlers/grammars/test_<grammar>.py`	Create — grammar handler tests

Existing grammar handlers

Grammar	Base Language	Path Patterns	Content Markers
`argocd`	yaml	`.yaml`, `.yml`	`apiVersion:` + `argoproj.io/` + `kind:`
`docker-compose`	yaml	`docker-compose.yml`, `compose.yml`	`services:`
`github-actions`	yaml	`.github/workflows/*.yml`	`on:` + `jobs:`
`gitlab-ci`	yaml	`.gitlab-ci.yml`	`stages:` or (`script:` + `image:`/`stage:`)
`helm-chart`	yaml	`/Chart.yaml`, `/Chart.yml`	`apiVersion:` + `name:`
`helm-template`	gotmpl	`templates/.yaml`, `templates/.tpl`	`apiVersion:` or `{{`
`helm-values`	yaml	`values*.yaml` in chart dirs	`## @section` or YAML with comments
`kubernetes`	yaml	`.yaml`, `.yml`	`apiVersion:` + `kind:` (excludes Helm/ArgoCD)
`terraform`	hcl	`*/.tf`, `*/.tfvars`	HCL resource/data/module blocks

Path E: Adding Context Expansion (Smart Boundaries)

Use this when you want smart_context=True to expand search results to enclosing function/class boundaries for a language. Currently supported: Python, JavaScript, TypeScript, Go, Rust, Scala, HCL/Terraform.

Steps

Add node types to DEFINITION_NODE_TYPES in src/cocosearch/search/context_expander.py:

DEFINITION_NODE_TYPES: dict[str, set[str]] = {
    # ...existing...
    "new_language": {"function_declaration", "class_declaration"},
}

Use the tree-sitter node types for function/class definitions in that language. You can explore them with:

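uv run python -c "
from tree_sitter_language_pack import get_parser
p = get_parser('new_language')
# Parse a sample file (sample.ext is a placeholder path) and print each node type, indented by depth
tree = p.parse(open('sample.ext', 'rb').read())
def walk(n, d=0):
    print('  ' * d + n.type)
    for c in n.children: walk(c, d + 1)
walk(tree.root_node)
"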

Add extension mappings to EXTENSION_TO_LANGUAGE in the same file:

EXTENSION_TO_LANGUAGE: dict[str, str] = {
    # ...existing...
    ".ext": "new_language",
}

CONTEXT_EXPANSION_LANGUAGES updates automatically — it's derived from DEFINITION_NODE_TYPES.keys().
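
The role of DEFINITION_NODE_TYPES can be illustrated with a small parent-climbing routine. This is a sketch under assumptions: `Node` is a stand-in dataclass rather than a real tree-sitter node, and `enclosing_definition` is a hypothetical helper, not the function used in context_expander.py. The Python node types are the ones that appear in the query example under Path B.

```python
from dataclasses import dataclass
from typing import Optional

DEFINITION_NODE_TYPES: dict[str, set[str]] = {
    "python": {"function_definition", "class_definition"},
}
# Derived rather than maintained by hand, as described above.
CONTEXT_EXPANSION_LANGUAGES = set(DEFINITION_NODE_TYPES.keys())


@dataclass
class Node:
    """Stand-in for a tree-sitter node: a type, a line range, and a parent."""
    type: str
    start_line: int
    end_line: int
    parent: Optional["Node"] = None


def enclosing_definition(node: Node, language: str) -> Optional[Node]:
    """Climb parents until a definition node for this language is found."""
    wanted = DEFINITION_NODE_TYPES.get(language, set())
    current: Optional[Node] = node
    while current is not None:
        if current.type in wanted:
            return current
        current = current.parent
    return None
```

A search hit inside a function body would then expand to that function's `start_line` and `end_line`; a language with no entry simply yields no expansion.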

Files to Modify

File	Action
`src/cocosearch/search/context_expander.py`	Modify — `DEFINITION_NODE_TYPES` and `EXTENSION_TO_LANGUAGE`

Path F: Adding a Dependency Extractor

Use this when the language has import/require/reference patterns that can be extracted for dependency analysis. This enables deps tree, deps impact, and the get_file_dependencies/get_file_impact MCP tools.

Already implemented: Python (py), JavaScript/TypeScript (js, jsx, ts, tsx, mjs, cjs, mts, cts), Go (go), Docker Compose (docker-compose), GitHub Actions (github-actions), Terraform (terraform), Helm (helm-template, helm-values).

When to add an extractor

Programming languages with import statements (e.g., import, require, use, include) → edge type DepType.IMPORT
Infrastructure formats with reference patterns (e.g., image refs, action uses:, module sources) → edge type DepType.REFERENCE with metadata.kind for specifics
Grammar handlers that set a language_id (their GRAMMAR_NAME) — the extractor matches on LANGUAGES = {"grammar-name"}

Steps

Create the extractor:
```
touch src/cocosearch/deps/extractors/<language>.py
```
Implement the DependencyExtractor protocol:
- Set LANGUAGES — set of language IDs this extractor handles (must match the language_id from the handler/grammar or file extension)
- Implement extract(file_path: str, content: str) -> list[DependencyEdge]
- For tree-sitter languages, use get_parser("<language>") and walk the AST to find import nodes
- For YAML-based grammars, parse with yaml.safe_load and extract reference patterns
- For HCL/template formats, use regex
Choose edge types:
- DepType.IMPORT — code-level imports (import X, require('X'), use X)
- DepType.REFERENCE — grammar-level refs (image refs, action uses, module sources). Set metadata.kind to describe the reference type (e.g., "image", "action", "module_source")
- DepType.CALL — direct symbol calls (rare, not used by current extractors)

Populate edge metadata:

DependencyEdge(
    source_file=file_path,
    target_file=None,           # Resolved later by module resolver
    target_module="<raw-import-string>",
    dep_type=DepType.IMPORT,
    metadata={"module": "<raw-string>", "line": 5},
)

Add a module resolver (if needed) to src/cocosearch/deps/resolver.py:
- Implement ModuleResolver protocol: build_index(indexed_files) and resolve(edge, module_index)
- Register in _RESOLVERS dict mapping language IDs to the resolver instance
- Already implemented: PythonResolver, JavaScriptResolver, GoResolver, TerraformResolver

Add tests:

touch tests/unit/deps/extractors/test_<language>.py
uv run pytest tests/unit/deps/extractors/test_<language>.py -v

Cover: each import/reference pattern, edge metadata, empty files, syntax variations.

If a resolver was added:

uv run pytest tests/unit/deps/test_resolver.py -v

The extractor is autodiscovered at import time — any deps/extractors/*.py file (not prefixed with _) implementing the protocol is auto-registered. No registration code needed.
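
Putting the extractor steps together, a regex-based extractor might look roughly like this. All names are assumptions for illustration: `DepType` and `DependencyEdge` are local stand-ins that copy the field names shown in the steps (the enum values are invented), and `IncludeExtractor` with its `conf` language ID is a hypothetical example, not an existing extractor.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class DepType(Enum):
    """Stand-in for the project's DepType; the values are placeholders."""
    IMPORT = "import"
    REFERENCE = "reference"
    CALL = "call"


@dataclass
class DependencyEdge:
    """Stand-in carrying the fields shown in the steps above."""
    source_file: str
    target_file: Optional[str]
    target_module: str
    dep_type: DepType
    metadata: dict = field(default_factory=dict)


class IncludeExtractor:
    """Hypothetical extractor for lines such as: include "common/base.conf"."""

    LANGUAGES = {"conf"}
    _INCLUDE = re.compile(r'^\s*include\s+"([^"]+)"')

    def extract(self, file_path: str, content: str) -> list[DependencyEdge]:
        edges = []
        for lineno, line in enumerate(content.splitlines(), start=1):
            match = self._INCLUDE.match(line)
            if match:
                edges.append(DependencyEdge(
                    source_file=file_path,
                    target_file=None,  # left for a resolver to fill in
                    target_module=match.group(1),
                    dep_type=DepType.REFERENCE,
                    metadata={"kind": "include", "line": lineno},
                ))
        return edges
```

Recording the line number in the metadata keeps each edge traceable to the statement that produced it.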

Files to Create/Modify

File	Action
`src/cocosearch/deps/extractors/<language>.py`	Create — dependency extractor
`src/cocosearch/deps/resolver.py`	Modify — add resolver (only if import resolution needed)
`tests/unit/deps/extractors/test_<language>.py`	Create — extractor tests
`tests/unit/deps/test_resolver.py`	Modify — resolver tests (only if resolver added)
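
For the optional resolver, the two protocol methods named in the steps could be sketched as below. This is a hypothetical example: `DottedModuleResolver` is not one of the existing resolvers, the return types are assumptions, and the edge is any object with a `target_module` attribute.

```python
from pathlib import PurePosixPath
from typing import Optional


class DottedModuleResolver:
    """Hypothetical resolver: maps 'pkg.mod' to 'pkg/mod.py' when indexed."""

    def build_index(self, indexed_files: list[str]) -> dict[str, str]:
        """Index every .py file under its dotted module name."""
        index: dict[str, str] = {}
        for path in indexed_files:
            parsed = PurePosixPath(path)
            if parsed.suffix == ".py":
                index[".".join(parsed.with_suffix("").parts)] = path
        return index

    def resolve(self, edge, module_index: dict[str, str]) -> Optional[str]:
        """Return the indexed file for the edge's target, or None if unknown."""
        return module_index.get(edge.target_module)
```

Building the index once and resolving each edge by lookup avoids rescanning the file list for every import.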

Existing extractors

Extractor	Language IDs	Edge Type	Parsing Method
`python.py`	`py`	IMPORT	Tree-sitter (`import_statement`, `import_from_statement`)
`javascript.py`	`js`, `jsx`, `ts`, `tsx`, `mjs`, `cjs`, `mts`, `cts`	IMPORT	Tree-sitter (`import_statement`, `call_expression[require]`)
`go.py`	`go`	IMPORT	Tree-sitter (`import_declaration`, `import_spec`)
`docker_compose.py`	`docker-compose`	REFERENCE	YAML (`image:`, `depends_on:`, `extends:`)
`github_actions.py`	`github-actions`	REFERENCE	YAML (`uses:` action/workflow refs)
`terraform.py`	`terraform`	REFERENCE	Regex (HCL `module { source = "..." }`)
`helm.py`	`helm-template`, `helm-values`	REFERENCE	Regex + YAML (includes, images, subcharts)

Registration Checklist

When adding a new language handler, verify all registrations are complete (the Files to Create/Modify table under Path A lists the touch points):

When adding a new grammar handler:

Grammar handler: handlers/grammars/<grammar>.py created — inherit YamlGrammarBase for YAML grammars
Tests: tests/unit/handlers/grammars/test_<grammar>.py created
Dependency extractor: deps/extractors/<grammar>.py created (if grammar has reference patterns)
Extractor tests: tests/unit/deps/extractors/test_<grammar>.py created (if extractor added)
README.md: Supported Grammars section updated
README.md: Grammar badge added to the badges section at the top

When adding a dependency extractor (Path F):

Extractor: src/cocosearch/deps/extractors/<language>.py created (autodiscovered)
LANGUAGES set matches language IDs from handler/grammar or file extension
Module resolver added to src/cocosearch/deps/resolver.py (if import resolution needed)
Resolver registered in _RESOLVERS dict (if added)
Tests: tests/unit/deps/extractors/test_<language>.py created
Resolver tests: added to tests/unit/deps/test_resolver.py (if resolver added)
CLAUDE.md: extractor count and resolver list updated

Reference

handlers/README.md — handler protocol, separator design, testing
handlers/grammars/_template.py — grammar handler template
CLAUDE.md — quick handler steps, architecture overview
