Conversation
Fixed critical bug where tests were loading the user's actual data directory instead of isolated test fixtures, causing 30+ second test runs when large indexes existed.

Changes:
- Fixed incorrect env var ISCC_SEARCH_INDEX_LOCATION → ISCC_SEARCH_INDEX_URI in the test_remote.py fixture
- Added session-scoped isolate_tests_from_user_data fixture to conftest.py that automatically overrides the default index URI for all tests
- Tests now use temporary directories and never touch user data

Performance impact:
- test_remote.py reduced from 34s → 1.2s (28x faster)
- Total test suite improved from 50-54s → 22s (60% faster)
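A minimal sketch of how such a conftest.py fixture could look, assuming it only needs to point ISCC_SEARCH_INDEX_URI at a temporary directory (the real fixture may do more, e.g. instantiate the settings object):

```python
# Hypothetical sketch of the session-scoped isolation fixture in conftest.py.
# The fixture name and env var come from the commit message; the usearch:// scheme
# for the temporary URI is an assumption.
import os
import pytest

@pytest.fixture(scope="session", autouse=True)
def isolate_tests_from_user_data(tmp_path_factory):
    """Point every test at a throwaway index location instead of user data."""
    index_dir = tmp_path_factory.mktemp("iscc_search_index")
    old = os.environ.get("ISCC_SEARCH_INDEX_URI")
    os.environ["ISCC_SEARCH_INDEX_URI"] = f"usearch://{index_dir}"
    yield
    if old is None:
        os.environ.pop("ISCC_SEARCH_INDEX_URI", None)
    else:
        os.environ["ISCC_SEARCH_INDEX_URI"] = old
```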
Add GitHub Actions workflow to build and publish Docker images to GitHub Container Registry on pushes to develop branch. Images are tagged with 'develop' and commit SHA.
…ependency

- Move iscc-sct>=0.1.3 from optional 'semantic' group to main dependencies
- Add psycopg[binary]>=3.2.12 to dev dependencies
- Remove now-empty 'semantic' optional dependency group
Add test job to docker-publish workflow that must pass before building and pushing Docker image to prevent publishing broken builds.
Configure production logging to include timestamps in both uvicorn access logs and application loguru messages for better observability.

Changes:
- Add loguru configuration to server/__init__.py with timestamp format
- Create log_config.json for uvicorn with custom timestamp formatters
- Update Dockerfile to use --log-config for uvicorn
- Disable colors in logs for clean Docker output

Log format: YYYY-MM-DD HH:mm:ss.SSS | LEVEL | module:function:line - message

This resolves the issue where production logs showed only minimal HTTP access logs without timestamps or application-level messages.
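For reference, the loguru side of this could look roughly like the sketch below (the stderr sink and colorize setting follow the commit; everything else is illustrative):

```python
# Sketch of the loguru setup; the format string mirrors the documented
# "YYYY-MM-DD HH:mm:ss.SSS | LEVEL | module:function:line - message" layout.
import sys
from loguru import logger

logger.remove()  # drop the default colored handler
logger.add(
    sys.stderr,
    colorize=False,  # clean output for Docker log collectors
    format="{time:YYYY-MM-DD HH:mm:ss.SSS} | {level} | {module}:{function}:{line} - {message}",
)
```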
Add model caching and pre-download steps to prevent parallel test failures caused by concurrent model downloads. Model download race conditions were corrupting the ONNX file, causing INVALID_PROTOBUF errors in GitHub Actions tests.

CI changes:
- Cache ~/.local/share/iscc-sct between GitHub Actions runs
- Pre-download model before running parallel tests (pytest -n auto)

Docker changes:
- Pre-download model during Docker build in builder stage
- Copy model from builder to runtime stage
- Eliminates model download on container startup (~200-300 MB)

Fixes test failures where multiple pytest workers simultaneously attempted to download the model to the same location.

Related upstream issues filed on iscc/iscc-sct:
- #18: Add file locking during model download
- #19: Add configurable model storage path
- #20: Add CLI command to pre-download models
…l confidence weighting

Add UsearchSimprintIndex that uses pure usearch with multi=True for asset-level similarity search. Implements multi-query aggregation with configurable confidence weighting to emphasize high-quality matches while maintaining coverage.

Key features:
- Multi-query aggregation combining coverage and quality metrics
- Exponential confidence weighting (configurable per query)
- Soft-boundary matching with match_threshold filtering
- No metadata storage (asset-level only, no chunk tracking)
- Context manager support with automatic save/restore

Also rename simprint test files to follow indexes naming convention for consistency.
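A hedged sketch of what exponential confidence weighting means here; the function name and the exact aggregation are illustrative, not the shipped implementation:

```python
# Illustrative only: per-query scores are raised to an exponent so strong matches
# dominate the aggregate quality while weak matches still contribute coverage.
def weighted_quality(scores: list[float], confidence_exponent: float = 2.0) -> float:
    """Aggregate per-query match scores into a single quality value in [0, 1]."""
    weights = [s**confidence_exponent for s in scores]
    weight_sum = sum(weights)
    if weight_sum == 0.0:  # guard against empty or all-zero score sets
        return 0.0
    return sum(w * s for w, s in zip(weights, scores)) / weight_sum
```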
Add connectivity and expansion_add parameters to UsearchSimprintIndex to allow tuning build-time performance vs search quality trade-offs. Benchmarks show connectivity=8, expansion_add=8 gives 20x faster builds with 25% smaller indexes at acceptable recall levels. Defaults remain at usearch's standard values (16/128) for compatibility.
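For orientation, passing these parameters through to a usearch index looks roughly like the following; the parameter names are usearch's own, while ndim, metric, and dtype values here are illustrative:

```python
# Sketch of tuning HNSW build parameters on a usearch index.
from usearch.index import Index

index = Index(
    ndim=64,          # bits per simprint (illustrative)
    metric="hamming",
    dtype="b1",       # packed binary vectors
    connectivity=8,   # smaller graph degree: faster builds, smaller index
    expansion_add=8,  # shallower insertion search: ~20x faster bulk indexing
)
```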
Add safety checks to prevent division by zero when calculating quality scores in exponential confidence weighting. Returns 0.0 when weight_sum is zero, which can occur with empty score sets.

Affected components:
- UsearchSimprintIndex quality calculation
- UsearchIndex total_score calculation
Add lancedb>=0.25.3 for out-of-core simprint index implementation. Enables disk-based ANN search for datasets larger than RAM.
Implement LancedbSimprintIndex for out-of-core simprint search using LanceDB's columnar storage and HNSW indexing.

Key features:
- Variable-length simprint support with configurable ndim
- Full chunk metadata (offset, size) stored per simprint
- Exponential confidence weighting for soft-boundary matching
- Append semantics for high-performance batch ingestion
- Disk-based storage for datasets larger than RAM

Architecture:
- PyArrow schema: iscc_id_body + simprint + offset + size + vector
- Metadata storage: ndim and realm_id in table metadata
- Native Hamming distance search via LanceDB IVF/HNSW

Test coverage includes: lifecycle, variable-length handling, batch operations, confidence weighting, and deduplication behavior.
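A rough sketch of the PyArrow schema named above; the commit only lists the column names, so the field types, vector width, and metadata values here are assumptions:

```python
# Hypothetical schema layout for the LanceDB simprint table.
import pyarrow as pa

simprint_schema = pa.schema(
    [
        pa.field("iscc_id_body", pa.binary(8)),       # 8-byte ISCC-ID body
        pa.field("simprint", pa.binary()),            # variable-length simprint
        pa.field("offset", pa.uint32()),              # chunk offset in the asset
        pa.field("size", pa.uint32()),                # chunk size
        pa.field("vector", pa.list_(pa.uint8(), 8)),  # fixed-size vector for ANN search (width illustrative)
    ],
    metadata={"ndim": "64", "realm_id": "0"},         # table-level metadata (illustrative values)
)
```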
Implement SimprintMultiIndex for backend-agnostic multi-type simprint
indexing with transparent type routing and result aggregation.
Features:
- Pluggable backends: LMDB (hard-boundary) and LanceDB (soft-boundary)
- Automatic type routing to dedicated sub-indexes
- Realm ID extraction and propagation to sub-indexes
- Parallel search across simprint types with result aggregation
- Stateless coordinator with backend-specific file naming
Architecture:
- Root directory contains: SIMPRINT_{type}{backend_ext} files
- Sub-indexes store realm_id and ndim in backend metadata
- Works with 8-byte ISCC-ID bodies, reconstructs 10-byte IDs
- Thread-pool parallelization for multi-type searches
Test coverage includes: backend registration, type routing, realm ID
handling, parallel search, and error handling.
Replace custom match_threshold kwarg with standard threshold parameter for consistency with other index implementations. The threshold now filters individual simprint matches (noise rejection) rather than final scores.

- Remove DEFAULT_MATCH_THRESHOLD class constant
- Use threshold parameter directly in search_raw()
- Update tests to use threshold instead of match_threshold
- Clarify in docstring: threshold filters individual matches, not final scores
- Final results limited by limit parameter only (no score threshold)
…undary scoring

**Core scoring changes:**
- Replace hard coverage*quality product with configurable coverage influence
- Add DEFAULT_COVERAGE_WEIGHT parameter (default 0.2) to control coverage influence
- Formula: coverage^w × quality where w=0 ignores coverage entirely
- Track best score per query per asset to naturally bound coverage to [0,1]
- Recalculate scores in detailed mode based on actual chunk matches

**Performance optimizations:**
- Update HNSW build parameters: connectivity=8, expansion_add=16 for faster bulk indexing
- Set expansion_search=512 for better recall during queries
- Switch to SimprintMultiIndex with usearch backend (from lmdb)

**Configuration updates:**
- Update default threshold from 0.0 to 0.75 for practical noise filtering
- Support per-query coverage_weight and confidence_exponent overrides

**Infrastructure:**
- Add iscc_search/core module scaffold for future sharded index implementation
- Add psycopg dependency for PostgreSQL backend support
- Add .gemini/ directory to gitignore

**Tests:**
- Update all tests to reflect new scoring behavior and backend
- Add coverage for configurable coverage_weight parameter
- Add test for realm_id_int property extraction
- Fix remote test to use FastAPI directly (avoid loading production index)
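The coverage^w × quality formula above as a minimal sketch, assuming coverage and quality are already normalized to [0, 1]:

```python
# Final score combination from the commit message; only the constant name is taken
# from the commit, the function wrapper is illustrative.
DEFAULT_COVERAGE_WEIGHT = 0.2

def asset_score(coverage: float, quality: float, coverage_weight: float = DEFAULT_COVERAGE_WEIGHT) -> float:
    """coverage^w * quality: w=0 ignores coverage entirely, w=1 is the old hard product."""
    return (coverage**coverage_weight) * quality
```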
The os.environ approach was ineffective because search_settings is already instantiated by the session-scoped isolate_tests_from_user_data fixture, causing the test to run against UsearchIndex instead of MemoryIndex.
…exports

Replace ~320 lines of duplicated code with thin re-export modules from iscc-usearch. Remove direct numba dependency (now transitive via iscc-usearch). Pin iscc-usearch>=0.2.1. Update tests for stricter TypeError validation and upstream error messages.
usearch-iscc now raises ValueError for count=0 instead of segfaulting.
Restores 100% test coverage by removing the empty placeholder class that was the sole uncovered line in the codebase.
Modernize Status and Modality enums from (str, Enum) to StrEnum. Replace bare list default with default_factory on chunk_matches.
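A small sketch of the two patterns; pydantic is assumed for the model base, and the enum members and model fields shown here are placeholders:

```python
# Illustrative only: StrEnum replaces the (str, Enum) mixin, and default_factory
# replaces a shared mutable list default.
from enum import StrEnum
from pydantic import BaseModel, Field

class Status(StrEnum):
    ACTIVE = "active"    # member names are hypothetical
    DELETED = "deleted"

class AssetMatch(BaseModel):
    status: Status = Status.ACTIVE
    chunk_matches: list = Field(default_factory=list)  # avoid shared mutable default
```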
- Delete iscc_search/metrics.py (dead re-export, no internal consumers)
- Delete tests/test_metrics.py (tested obsolete numba-compiled NPHD metric)
- Delete scripts/benchmark_nphd.py (directly imported numba)
- Pin iscc-usearch>=0.5.0 and usearch-iscc>=2.24 (native MetricKind.NPHD)
- Convert bytes to np.uint8 arrays at NphdIndex call sites (pad_vectors in v0.5.0 no longer accepts raw bytes in per-element loop)
- Remove iscc_search.metrics reference from docs/modules.md
Switch ISCC-UNIT similarity indexes from single-file NphdIndex to
directory-based ShardedNphdIndex for bounded RAM and auto-sharding.
Adapted 6 breaking call sites: _get_or_create_nphd_index (path at
construction), _load_nphd_indexes (metadata-driven directory loading),
_search_similarity_unit (numpy array instead of list wrapper),
flush/close (save() without path arg), _rebuild_nphd_index (clean
stale directory before fresh construction).
Persistence format changed from {unit_type}.usearch files to
{unit_type}/ directories containing shard files and bloom filter.
Add dual-write simprint path to LMDB alongside existing SimprintMultiIndex. Simprints are stored in per-type dupsort databases within the main index.lmdb, providing atomic persistence and enabling exact (hard-boundary) search.

Key changes:
- lmdb_ops.py: pure functions for chunk pointer pack/unpack, IDF calculation, document frequency counting, delete_asset_simprints, and exact search
- UsearchIndex.add_assets: dual-writes simprints to LMDB with delete-before-rewrite on updates (idempotent for changed/added simprints)
- UsearchIndex.search_assets: exact=True flag routes to LMDB exact search
- max_dbs uses max(user_value, 32) instead of hardcoded override
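A hypothetical layout for the packed chunk pointer stored in the dupsort values; the field widths and ordering here are assumptions, the real layout lives in lmdb_ops.py:

```python
# Illustrative pack/unpack helpers for a chunk pointer of iscc_id_body + offset + size.
import struct

_CHUNK_PTR = struct.Struct(">8sII")  # 8-byte ISCC-ID body, uint32 offset, uint32 size (big-endian)

def pack_chunk_ptr(iscc_id_body: bytes, offset: int, size: int) -> bytes:
    return _CHUNK_PTR.pack(iscc_id_body, offset, size)

def unpack_chunk_ptr(raw: bytes) -> tuple[bytes, int, int]:
    return _CHUNK_PTR.unpack(raw)
```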
Derived ShardedIndex128 indexes for approximate simprint similarity search:
- Composite 128-bit keys (iscc_id_body + offset + size) for chunk tracking
- IDF-weighted scoring with smooth log(1 + N/(1+df)) formula
- 20x oversampling for asset diversity in HNSW results
- doc_freq_fn callback for LMDB-based document frequency lookup
- Rebuild from LMDB source of truth on sync mismatch

Bug fixes:
- Remove stale derived vectors when asset updates delete all simprints
- Use matched (stored) simprint for IDF lookup instead of query simprint
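The smoothed IDF weight from the list above, written out as a small helper (N is the total number of indexed simprints of a type, df the document frequency of a given simprint; the function wrapper is illustrative):

```python
import math

def idf_weight(n_total: int, doc_freq: int) -> float:
    """log(1 + N / (1 + df)): rare simprints weigh more, common ones less."""
    return math.log(1 + n_total / (1 + doc_freq))
```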
This commit consolidates the simprint architecture by removing legacy backend implementations and the coordinator layer, keeping only the new unified LMDB + ShardedIndex128 system.

**Deleted files:**
- Simprint backends: `lmdb_core.py`, `lancedb_core.py`, `lmdb_multi.py`, `multi.py`
- Legacy usearch variant: `_legacy_usearch_core.py`
- Protocol definitions: `protocols/simprint_core.py`, `protocols/simprint_multi.py`
- Test files for deleted backends: 10 test files removed

**Changes to UsearchIndex (index.py):**
- Remove `SimprintMultiIndex` integration and instance variable `self._simprint_index`
- Delete methods: `_load_simprint_index()`, `_asset_to_simprint_entry()`
- Remove old backend write path from `add_assets()` (SimprintMultiIndex.add_raw_multi)
- Simplify `has_simprint_index` check to use only derived indexes
- Remove old backend close logic from `close()`

**Changes to models.py:**
- Remove unused struct classes: `SimprintRaw`, `SimprintEntryRaw`, `SimprintEntryMulti`
- Keep only: `MatchedChunkRaw`, `SimprintMatchRaw`, `TypeMatchResult`, `SimprintMatchMulti`

**Test updates:**
- Rewritten `test_indexes_usearch_simprint_m1.py`: 6 new tests for new system
- Updated `test_indexes_simprint_models.py`: added tests for TypeMatchResult
- Both exact (LMDB) and approximate (ShardedIndex128) search paths verified

**Test results:**
- All 1189 tests pass (253 legacy tests removed)
- 100% code coverage maintained
- Simprint tests: 72 pass (14 exact + 41 approx + 6 m1 + 11 m2)

The new architecture has a cleaner separation:
- LMDB: authoritative storage for both units and simprints
- ShardedNphdIndex: derived similarity search for units
- ShardedIndex128: derived approximate search for simprints

No more coordinator layer or multiple backend implementations.
Align configuration naming with iscc-sct project conventions:
- Rename settings.py → options.py
- Rename SearchSettings → SearchOptions, search_settings → search_opts
- Add load_dotenv() and env_file=".env" support
- Drop case_sensitive=True (use pydantic-settings default)
- Add host, port, workers fields to SearchOptions
- Add env var names to field descriptions
- Add python-dotenv dependency
- Update all imports across production and test code
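A sketch of what the renamed options module could look like; the env prefix and default values are assumptions inferred from the ISCC_SEARCH_* variables mentioned elsewhere in this history:

```python
# Illustrative pydantic-settings layout for options.py.
from pydantic_settings import BaseSettings, SettingsConfigDict

class SearchOptions(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="ISCC_SEARCH_", env_file=".env")

    index_uri: str = "memory://"  # ISCC_SEARCH_INDEX_URI
    host: str = "127.0.0.1"       # ISCC_SEARCH_HOST
    port: int = 8000              # ISCC_SEARCH_PORT
    workers: int = 1              # ISCC_SEARCH_WORKERS

search_opts = SearchOptions()
```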
Upgrade iscc-usearch to 0.6.1 which fixes ShardedIndex128 AxisError when searching with batch_size=1 and both view+active shards populated (iscc-usearch#22). This crashed text search for short inputs producing only one simprint chunk. Also show API error details in playground instead of generic HTTP status text (e.g. "axis 1 is out of bounds..." instead of "Bad Request"). Add regression test exercising the single-query + multi-shard scenario.
py-lmdb 2.2.0 invalidates all cached named-db handles when env.set_mapsize() is called. UsearchIndex caches simprint db handles in self._sp_data_dbs and self._sp_assets_dbs, so MapFullError retries (and any subsequent writes/reads) would fail with "Database handle belongs to another environment". Clear and repopulate the caches via _load_sp_databases() after the resize.
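A sketch of the resize-and-retry path as a method body; the attribute and method names (_sp_data_dbs, _load_sp_databases) come from the commit, everything else is illustrative:

```python
import lmdb

def put_simprint(self, simprint_type: str, key: bytes, value: bytes) -> None:
    """Write one simprint entry, growing the map and refreshing db handles if needed."""
    while True:
        try:
            with self.env.begin(write=True, db=self._sp_data_dbs[simprint_type]) as txn:
                txn.put(key, value)
            return
        except lmdb.MapFullError:
            self.env.set_mapsize(self.env.info()["map_size"] * 2)  # grow the map
            self._load_sp_databases()  # re-open handles invalidated by set_mapsize() in py-lmdb 2.2.0
```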
Drop the pre-UsearchIndex legacy code paths — none of them are imported by the production usearch:// backend and all consumers were themselves cruft:
- store.py (IsccStore), instance.py (InstanceIndex), iscc.py (IsccIndex)
- lookup.py (IsccLookupIndex), unit.py (UnitIndex)
- nphd.py (thin re-export of iscc_usearch.NphdIndex)
- simprint/ package (SimprintMiniIndex, SimprintIndex — superseded by iscc_search/indexes/simprint/)
- matching tests (test_store, test_instance, test_iscc, test_lookup, test_unit*, test_nphd*, test_simprint_mini)
- scripts/benchmark_instance.py, scripts/benchmark_metric.py
- NphdIndex/UnitIndex/InstanceIndex re-exports from iscc_search/__init__.py

Also drop lancedb and psycopg[binary] from pyproject.toml — neither is imported anywhere in the codebase — and remove the now-nonexistent iscc_search/simprint/* from coverage omit.
The iscc_search/indexes/postgres/ directory held only an empty __init__.py — no implementation, no routing in options.get_index(). Drop the directory along with the "planned" postgres mentions in the field description, get_index() docstring, and error messages. Delete the vacuous test_search_options_postgres_uri test and switch test_get_index_unsupported_uri to use redis:// so the test no longer implies postgres is a planned future backend.
Reflect the post-cleanup state: CLI + FastAPI server with protocol-based backends (memory/lmdb/usearch), NPHD metric now provided by the external iscc-usearch package, and the two-config split (options.py vs config.py).
Reject --workers > 1 when ISCC_SEARCH_INDEX_URI uses the usearch:// backend since concurrent writers corrupt .usearch files. Remove the misleading --workers 4 example from the docstring.
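A framework-agnostic sketch of the guard; the actual CLI wiring and error text differ:

```python
def validate_workers(workers: int, index_uri: str) -> None:
    """Refuse multi-worker serving on the single-writer usearch backend."""
    if workers > 1 and index_uri.startswith("usearch://"):
        raise SystemExit(
            "--workers > 1 is not supported with the usearch:// backend: "
            "concurrent writers corrupt .usearch files"
        )
```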
A simprint index missing from memory but present in LMDB previously triggered a full rebuild inside the search request. At production scale (hundreds of millions of vectors per type) this blocks an HTTP request for hours. Log a warning and skip the affected simprint type instead; an explicit out-of-band rebuild must be used to restore results.
/healthz is a pure liveness probe that returns 200 as long as the process can respond. /readyz is a readiness probe that returns 200 only when the index is initialized and list_indexes() succeeds, or 503 with a structured reason otherwise. Orchestrators should point their liveness probes at /healthz and their readiness probes at /readyz instead of the root endpoint.
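A minimal FastAPI sketch of the two probes; the module-level `index` handle and its wiring are placeholders for however the server actually exposes the configured backend:

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
index = None  # set at startup to the configured backend (hypothetical wiring)

@app.get("/healthz")
def healthz():
    # Pure liveness: if this handler runs, the process can respond.
    return {"status": "ok"}

@app.get("/readyz")
def readyz():
    # Readiness: 200 only when the index is initialized and list_indexes() succeeds.
    try:
        if index is None:
            raise RuntimeError("index not initialized")
        index.list_indexes()
        return {"status": "ready"}
    except Exception as exc:
        return JSONResponse(status_code=503, content={"status": "not ready", "reason": str(exc)})
```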
Add sentry-sdk[fastapi] and initialize it at server import time when ISCC_SEARCH_SENTRY_DSN is set. When the DSN is unset, init_sentry is a no-op so local and test environments are unaffected. The performance trace sampling rate defaults to 5% and is tunable via ISCC_SEARCH_SENTRY_TRACES_SAMPLE_RATE.
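The conditional initialization described above, roughly; the function signature and how the DSN and sample rate are read from settings are assumptions:

```python
import sentry_sdk

def init_sentry(dsn: str | None, traces_sample_rate: float = 0.05) -> None:
    """Initialize Sentry only when a DSN is configured; otherwise a no-op."""
    if not dsn:
        return  # local and test environments stay unaffected
    sentry_sdk.init(dsn=dsn, traces_sample_rate=traces_sample_rate)
```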
Add a sizing profiles table for Sandbox, Validation, Launch, and Growth deployments covering RAM, disk, shard sizes, flush interval, and HNSW search depth. Document the /healthz and /readyz probe endpoints, bump the recommended stop_grace_period to 120s, note the CLI-enforced multi-worker guard, and add FLUSH_INTERVAL to .env.example with a production-grade recommendation.
All actions run on Node.js 20, which GitHub will force-replace on June 2, 2026. Add FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 to all 3 workflows to silence the deprecation warnings and ensure CI keeps working after the cutover.
Existing docs moved to cauldron/legacy-docs (untracked) for reference. Fresh documentation will follow.
Set up complete documentation infrastructure (zensical.toml, ISCC brand CSS/JS, copilot widget, copy-as-markdown, SEO meta tags, LLM-friendly output) and authored 13 pages following the Divio documentation framework:
- Homepage with badges, quick start, and navigation grid
- Tutorial: getting started (Python/CLI/REST tabs)
- How-to: index backends, CLI, REST API, deployment
- Explanation: ISCC primer, architecture, similarity search
- Reference: API (mkdocstrings), configuration, for-coding-agents
- Development: contributing guide

Also updates CI deploy workflow from mkdocs to zensical, adds CNAME for search.iscc.codes, fixes license to Apache-2.0 in pyproject.toml and README, and adds BETA status warnings.
Add notes to homepage, README, and architecture page explaining that iscc-usearch is a patched fork of the usearch vector search library used internally, not an alternative to iscc-search. Ref: iscc/iscc-usearch#24
Add mdformat with mdformat-mkdocs[recommended] to pre-commit config for consistent markdown formatting on commit. Remove legacy mkdocs, mkdocs-material, and individual mdformat plugin dev dependencies (markdown formatting now runs exclusively via pre-commit). Remove the poe format-markdown task to match the iscc-usearch approach.
Update CI matrix to 3.11-3.14, bump requires-python to >=3.11, and remove click, tqdm, pyyaml, and simsimd which had no imports in the source tree. Also fix README CI badge label.
Bump actions/checkout v4→v6, actions/cache v4→v5, actions/setup-python v5→v6, astral-sh/setup-uv v2→v8, docker/build-push-action v6→v7, docker/login-action v3→v4, docker/setup-buildx-action v3→v4, docker/metadata-action v5→v6, codecov/codecov-action v4→v6, actions/upload-artifact v4→v6, and actions/download-artifact v4→v7. Remove the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 workaround since all actions now run on Node.js 24 natively.
setup-uv v8 stopped publishing major/minor pointer tags for supply chain security. Use the full immutable tag instead.
- Rename scratch/ to cauldron/ in CLAUDE.md package layout
- Gitignore CLAUDE.local.md for local-only instructions
- Note default usearch index path in .env.example
Workflows collapse to two files with clear triggers:
- ci.yml: PR + push to develop/main — tests (Linux matrix 3.11-3.14 + Windows/macOS smoke), OpenAPI build, wheel build+smoke, Docker build+smoke (healthz/readyz), gated publish to ghcr:develop on push to develop.
- release.yml: release: published — full test matrix, wheel build with tag-version verification, wheel smoke on all three OSes, PyPI publish, Docker push with vX.Y.Z / X.Y / X / latest tags (latest=auto for stable only), docs deploy.

Build system switches to hatch-vcs: version derives from git tags, removing the sed hack. FastAPI app.version now tracks the package version instead of a hardcoded literal. Dockerfile installs the pre-built wheel from dist/, eliminating source files from the image. .dockerignore flips to an allowlist so dev artifacts (cauldron/, .claude/, scratch/, tests/, docs/, .git/) cannot leak into the image.
- `iscc-search datasets` lists ISCC datasets on the HF Hub (defaults to the `iscc` org) with a rich table or `--json` output.
- `iscc-search hub REPO_ID` streams a dataset's parquet files via huggingface_hub + pyarrow and indexes each row as an IsccEntry, preserving non-binary columns as opaque metadata.
- Auto-registers a local index derived from the dataset name when one isn't configured yet; monotonic microsecond stepping keeps `gen_iscc_id` collision-free in tight loops.
- Docs updated: CLI how-to, getting-started tutorial, for-coding-agents.
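A condensed sketch of the ingestion loop behind `hub`; only the huggingface_hub and pyarrow calls are real APIs, the function name and the row handling are placeholders:

```python
import pyarrow.parquet as pq
from huggingface_hub import HfApi, hf_hub_download

def iter_dataset_rows(repo_id: str):
    """Yield one dict per row across all parquet files of a HF Hub dataset."""
    files = HfApi().list_repo_files(repo_id, repo_type="dataset")
    for name in (f for f in files if f.endswith(".parquet")):
        path = hf_hub_download(repo_id, name, repo_type="dataset")
        for batch in pq.ParquetFile(path).iter_batches():
            yield from batch.to_pylist()  # each row would be indexed as an IsccEntry
```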
…ests

pytest-xdist runs workers in parallel; concurrent first-time downloads of the iscc-sct ONNX model produced Protobuf parse errors on four of seven matrix jobs. Pre-downloading the model serially before running tests (and caching it across runs via actions/cache) matches the pattern the previous docker-publish.yml used.
The 100ms upper-bound on test_timer_measures_actual_time failed on a busy GitHub-hosted macOS runner where a 50ms sleep measured 200ms wall clock. The assertion's job is to catch gross unit/impl bugs, not to benchmark sleep precision, so 1s of headroom is plenty.
Default the CLI logger to INFO, demote per-batch add_assets logs to DEBUG, and disable huggingface_hub's tqdm bars (they render poorly in non-TTY consoles where our own rich spinners already provide feedback).
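A small sketch of the quieter CLI defaults; the loguru handler setup is illustrative, while disable_progress_bars is huggingface_hub's documented switch for its tqdm bars:

```python
import sys
from huggingface_hub.utils import disable_progress_bars
from loguru import logger

logger.remove()
logger.add(sys.stderr, level="INFO")  # per-batch add_assets logs now emit at DEBUG and stay hidden
disable_progress_bars()               # suppress hf tqdm bars in non-TTY consoles
```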
Summary
Merge `develop` into `main` to cut the first public release `v0.1.0`. This brings 81 commits of work into `main`, covering:

- … (`IsccIndexProtocol`) with three backends: `memory://`, `lmdb:///`, `usearch:///`
- `hub` / `datasets` HuggingFace ingestion
- `/healthz` + `/readyz` probes, optional Sentry, modular OpenAPI 3.0 spec
- … `develop`, ready for stable tagging

See CHANGELOG.md for the full feature list.

Test plan

- `develop` CI green on the latest SHA (06ea467)
- `develop → main` merge result is green
- Tag `v0.1.0` to trigger `release.yml` (PyPI + ghcr + docs)