
Release v0.1.0 #3

Merged
titusz merged 81 commits into main from develop
Apr 16, 2026
Conversation


@titusz titusz commented Apr 16, 2026

Summary

Merge develop into main to cut the first public release v0.1.0.

This brings 81 commits of work into main, covering:

  • Protocol-based architecture (IsccIndexProtocol) with three backends: memory://, lmdb:///, usearch:///
  • ISCC-SIMPRINT (granular feature) indexing with exact + approximate search paths
  • Typer-based CLI with git-style multi-index management plus hub / datasets HuggingFace ingestion
  • FastAPI REST server with auth, /healthz + /readyz probes, optional Sentry, modular OpenAPI 3.0 spec
  • Docker images published from develop, ready for stable tagging
  • Zensical-based documentation site
  • Cross-platform support (Linux/macOS/Windows) and Python 3.11–3.14
  • 100% test coverage

See CHANGELOG.md for the full feature list.

Test plan

  • develop CI green on the latest SHA (06ea467)
  • CI on the develop → main merge result is green
  • After merge, publish GitHub Release v0.1.0 to trigger release.yml (PyPI + ghcr + docs)

Fixed a critical bug where tests were loading the user's actual data
directory instead of using isolated test fixtures, causing 30+ second
test runs when large indexes existed.

Changes:
- Fixed incorrect env var ISCC_SEARCH_INDEX_LOCATION → ISCC_SEARCH_INDEX_URI
  in test_remote.py fixture
- Added session-scoped isolate_tests_from_user_data fixture to conftest.py
  that automatically overrides default index URI for all tests
- Tests now use temporary directories and never touch user data

Performance impact: test_remote.py reduced from 34s → 1.2s (28x faster)
Total test suite improved from 50-54s → 22s (60% faster)
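The isolation idea can be sketched with the standard library alone (the real fixture in conftest.py is a session-scoped autouse pytest fixture; `ISCC_SEARCH_INDEX_URI` is the env var named above, the helper name is hypothetical):

```python
import os
import tempfile

def isolated_index_uri() -> str:
    """Build a throwaway index URI so tests never touch user data."""
    tmp_dir = tempfile.mkdtemp(prefix="iscc-search-test-")
    return "usearch://" + tmp_dir

# Must run before any code reads the settings, so every test sees the
# override instead of the user's default index location.
os.environ["ISCC_SEARCH_INDEX_URI"] = isolated_index_uri()
```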
Add GitHub Actions workflow to build and publish Docker images to GitHub Container Registry on pushes to develop branch. Images are tagged with 'develop' and commit SHA.
…ependency

- Move iscc-sct>=0.1.3 from optional 'semantic' group to main dependencies
- Add psycopg[binary]>=3.2.12 to dev dependencies
- Remove now-empty 'semantic' optional dependency group
Add test job to docker-publish workflow that must pass before building
and pushing Docker image to prevent publishing broken builds.
Configure production logging to include timestamps in both uvicorn
access logs and application loguru messages for better observability.

Changes:
- Add loguru configuration to server/__init__.py with timestamp format
- Create log_config.json for uvicorn with custom timestamp formatters
- Update Dockerfile to use --log-config for uvicorn
- Disable colors in logs for clean Docker output

Log format: YYYY-MM-DD HH:mm:ss.SSS | LEVEL | module:function:line - message

This resolves the issue where production logs showed only minimal
HTTP access logs without timestamps or application-level messages.
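The same timestamped format can be reproduced with stdlib logging (the project uses loguru plus a uvicorn log config; this equivalent is illustrative only):

```python
import logging

# Mirrors: YYYY-MM-DD HH:mm:ss.SSS | LEVEL | module:function:line - message
formatter = logging.Formatter(
    fmt="%(asctime)s.%(msecs)03d | %(levelname)s | "
        "%(module)s:%(funcName)s:%(lineno)d - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger = logging.getLogger("iscc_search")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```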
Add model caching and pre-download steps to prevent parallel test
failures caused by concurrent model downloads. Model download race
conditions were corrupting the ONNX file, causing INVALID_PROTOBUF
errors in GitHub Actions tests.

CI Changes:
- Cache ~/.local/share/iscc-sct between GitHub Actions runs
- Pre-download model before running parallel tests (pytest -n auto)

Docker Changes:
- Pre-download model during Docker build in builder stage
- Copy model from builder to runtime stage
- Eliminates model download on container startup (~200-300 MB)

Fixes test failures where multiple pytest workers simultaneously
attempted to download the model to the same location.

Related upstream issues filed on iscc/iscc-sct:
- #18: Add file locking during model download
- #19: Add configurable model storage path
- #20: Add CLI command to pre-download models
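The file-locking fix proposed upstream (#18) amounts to a lock-file plus atomic-rename pattern; a minimal sketch, with hypothetical names:

```python
import os
import time

def download_once(target: str, fetch) -> str:
    """Serialize concurrent downloads so parallel workers never corrupt
    the model file. `fetch` returns the file bytes (hypothetical)."""
    lock = target + ".lock"
    while True:
        try:
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break  # we hold the lock
        except FileExistsError:
            if os.path.exists(target):
                return target  # another worker already finished
            time.sleep(0.1)
    try:
        if not os.path.exists(target):
            part = target + ".part"
            with open(part, "wb") as f:
                f.write(fetch())
            os.replace(part, target)  # atomic: readers never see partial data
        return target
    finally:
        os.close(fd)
        os.remove(lock)
```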
…l confidence weighting

Add UsearchSimprintIndex that uses pure usearch with multi=True for asset-level similarity search. Implements multi-query aggregation with configurable confidence weighting to emphasize high-quality matches while maintaining coverage.

Key features:
- Multi-query aggregation combining coverage and quality metrics
- Exponential confidence weighting (configurable per query)
- Soft-boundary matching with match_threshold filtering
- No metadata storage (asset-level only, no chunk tracking)
- Context manager support with automatic save/restore

Also rename simprint test files to follow indexes naming convention for consistency.
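The aggregation can be sketched as follows (hypothetical function and parameter names; the real logic lives in UsearchSimprintIndex):

```python
def aggregate(per_query_best, n_queries, confidence_exponent=2.0):
    """Combine per-query best scores into one asset-level score.

    Coverage rewards assets matched by many queries; quality is a
    weighted mean that emphasizes high-confidence matches by raising
    each score to `confidence_exponent` when weighting.
    """
    if not per_query_best:
        return 0.0
    coverage = len(per_query_best) / n_queries
    weights = [s ** confidence_exponent for s in per_query_best]
    weight_sum = sum(weights)
    if weight_sum == 0.0:
        return 0.0  # guard: an all-zero score set would divide by zero
    quality = sum(w * s for w, s in zip(weights, per_query_best)) / weight_sum
    return coverage * quality
```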
Add connectivity and expansion_add parameters to UsearchSimprintIndex
to allow tuning build-time performance vs search quality trade-offs.
Benchmarks show connectivity=8, expansion_add=8 gives 20x faster builds
with 25% smaller indexes at acceptable recall levels.

Defaults remain at usearch's standard values (16/128) for compatibility.
Add safety checks to prevent division by zero when calculating quality
scores in exponential confidence weighting. Returns 0.0 when weight_sum
is zero, which can occur with empty score sets.

Affected components:
- UsearchSimprintIndex quality calculation
- UsearchIndex total_score calculation
Add lancedb>=0.25.3 for out-of-core simprint index implementation.
Enables disk-based ANN search for datasets larger than RAM.
Implement LancedbSimprintIndex for out-of-core simprint search using
LanceDB's columnar storage and HNSW indexing.

Key features:
- Variable-length simprint support with configurable ndim
- Full chunk metadata (offset, size) stored per simprint
- Exponential confidence weighting for soft-boundary matching
- Append semantics for high-performance batch ingestion
- Disk-based storage for datasets larger than RAM

Architecture:
- PyArrow schema: iscc_id_body + simprint + offset + size + vector
- Metadata storage: ndim and realm_id in table metadata
- Native Hamming distance search via LanceDB IVF/HNSW

Test coverage includes: lifecycle, variable-length handling, batch
operations, confidence weighting, and deduplication behavior.
Implement SimprintMultiIndex for backend-agnostic multi-type simprint
indexing with transparent type routing and result aggregation.

Features:
- Pluggable backends: LMDB (hard-boundary) and LanceDB (soft-boundary)
- Automatic type routing to dedicated sub-indexes
- Realm ID extraction and propagation to sub-indexes
- Parallel search across simprint types with result aggregation
- Stateless coordinator with backend-specific file naming

Architecture:
- Root directory contains: SIMPRINT_{type}{backend_ext} files
- Sub-indexes store realm_id and ndim in backend metadata
- Works with 8-byte ISCC-ID bodies, reconstructs 10-byte IDs
- Thread-pool parallelization for multi-type searches

Test coverage includes: backend registration, type routing, realm ID
handling, parallel search, and error handling.
Replace custom match_threshold kwarg with standard threshold parameter for
consistency with other index implementations. The threshold now filters
individual simprint matches (noise rejection) rather than final scores.

- Remove DEFAULT_MATCH_THRESHOLD class constant
- Use threshold parameter directly in search_raw()
- Update tests to use threshold instead of match_threshold
- Clarify in docstring: threshold filters individual matches, not final scores
- Final results limited by limit parameter only (no score threshold)
…undary scoring

**Core scoring changes:**
- Replace hard coverage*quality product with configurable coverage influence
- Add DEFAULT_COVERAGE_WEIGHT parameter (default 0.2) to control coverage influence
- Formula: coverage^w × quality where w=0 ignores coverage entirely
- Track best score per query per asset to naturally bound coverage to [0,1]
- Recalculate scores in detailed mode based on actual chunk matches

**Performance optimizations:**
- Update HNSW build parameters: connectivity=8, expansion_add=16 for faster bulk indexing
- Set expansion_search=512 for better recall during queries
- Switch to SimprintMultiIndex with usearch backend (from lmdb)

**Configuration updates:**
- Update default threshold from 0.0 to 0.75 for practical noise filtering
- Support per-query coverage_weight and confidence_exponent overrides

**Infrastructure:**
- Add iscc_search/core module scaffold for future sharded index implementation
- Add psycopg dependency for PostgreSQL backend support
- Add .gemini/ directory to gitignore

**Tests:**
- Update all tests to reflect new scoring behavior and backend
- Add coverage for configurable coverage_weight parameter
- Add test for realm_id_int property extraction
- Fix remote test to use FastAPI directly (avoid loading production index)
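The core formula above can be written out directly (hypothetical helper; the default comes from the description):

```python
DEFAULT_COVERAGE_WEIGHT = 0.2

def final_score(coverage, quality, coverage_weight=DEFAULT_COVERAGE_WEIGHT):
    """Soft-boundary score: coverage**w * quality.

    w=0 ignores coverage entirely; w=1 recovers the old hard
    coverage*quality product. Coverage stays in [0, 1] because only the
    best score per query per asset is tracked.
    """
    return (coverage ** coverage_weight) * quality
```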
The os.environ approach was ineffective because search_settings is
already instantiated by the session-scoped isolate_tests_from_user_data
fixture, causing the test to run against UsearchIndex instead of
MemoryIndex.
…exports

Replace ~320 lines of duplicated code with thin re-export modules from
iscc-usearch. Remove direct numba dependency (now transitive via
iscc-usearch). Pin iscc-usearch>=0.2.1. Update tests for stricter
TypeError validation and upstream error messages.
usearch-iscc now raises ValueError for count=0 instead of segfaulting.
Restores 100% test coverage by removing the empty placeholder class
that was the sole uncovered line in the codebase.
Modernize Status and Modality enums from (str, Enum) to StrEnum.
Replace bare list default with default_factory on chunk_matches.
- Delete iscc_search/metrics.py (dead re-export, no internal consumers)
- Delete tests/test_metrics.py (tested obsolete numba-compiled NPHD metric)
- Delete scripts/benchmark_nphd.py (directly imported numba)
- Pin iscc-usearch>=0.5.0 and usearch-iscc>=2.24 (native MetricKind.NPHD)
- Convert bytes to np.uint8 arrays at NphdIndex call sites (pad_vectors
  in v0.5.0 no longer accepts raw bytes in per-element loop)
- Remove iscc_search.metrics reference from docs/modules.md
Switch ISCC-UNIT similarity indexes from single-file NphdIndex to
directory-based ShardedNphdIndex for bounded RAM and auto-sharding.

Adapted 6 breaking call sites: _get_or_create_nphd_index (path at
construction), _load_nphd_indexes (metadata-driven directory loading),
_search_similarity_unit (numpy array instead of list wrapper),
flush/close (save() without path arg), _rebuild_nphd_index (clean
stale directory before fresh construction).

Persistence format changed from {unit_type}.usearch files to
{unit_type}/ directories containing shard files and bloom filter.
Add dual-write simprint path to LMDB alongside existing SimprintMultiIndex.
Simprints are stored in per-type dupsort databases within the main index.lmdb,
providing atomic persistence and enabling exact (hard-boundary) search.

Key changes:
- lmdb_ops.py: pure functions for chunk pointer pack/unpack, IDF calculation,
  document frequency counting, delete_asset_simprints, and exact search
- UsearchIndex.add_assets: dual-writes simprints to LMDB with delete-before-
  rewrite on updates (idempotent for changed/added simprints)
- UsearchIndex.search_assets: exact=True flag routes to LMDB exact search
- max_dbs uses max(user_value, 32) instead of hardcoded override
Derived ShardedIndex128 indexes for approximate simprint similarity search:
- Composite 128-bit keys (iscc_id_body + offset + size) for chunk tracking
- IDF-weighted scoring with smooth log(1 + N/(1+df)) formula
- 20x oversampling for asset diversity in HNSW results
- doc_freq_fn callback for LMDB-based document frequency lookup
- Rebuild from LMDB source of truth on sync mismatch

Bug fixes:
- Remove stale derived vectors when asset updates delete all simprints
- Use matched (stored) simprint for IDF lookup instead of query simprint
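The composite key and the IDF weighting can be sketched as follows (the byte layout is an assumption for illustration; the formula matches the description above):

```python
import math
import struct

def pack_chunk_key(iscc_id_body: bytes, offset: int, size: int) -> bytes:
    """128-bit composite key: 8-byte ISCC-ID body + 4-byte offset +
    4-byte size. Exact field widths/order here are hypothetical."""
    assert len(iscc_id_body) == 8
    return iscc_id_body + struct.pack(">II", offset, size)

def idf_weight(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF: log(1 + N / (1 + df)). The +1 terms keep the value
    finite for df=0 and non-negative even when df >= N."""
    return math.log(1 + n_docs / (1 + doc_freq))
```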
This commit consolidates the simprint architecture by removing legacy backend
implementations and the coordinator layer, keeping only the new unified LMDB +
ShardedIndex128 system.

**Deleted files:**
- Simprint backends: `lmdb_core.py`, `lancedb_core.py`, `lmdb_multi.py`, `multi.py`
- Legacy usearch variant: `_legacy_usearch_core.py`
- Protocol definitions: `protocols/simprint_core.py`, `protocols/simprint_multi.py`
- Test files for deleted backends: 10 test files removed

**Changes to UsearchIndex (index.py):**
- Remove `SimprintMultiIndex` integration and instance variable `self._simprint_index`
- Delete methods: `_load_simprint_index()`, `_asset_to_simprint_entry()`
- Remove old backend write path from `add_assets()` (SimprintMultiIndex.add_raw_multi)
- Simplify `has_simprint_index` check to use only derived indexes
- Remove old backend close logic from `close()`

**Changes to models.py:**
- Remove unused struct classes: `SimprintRaw`, `SimprintEntryRaw`, `SimprintEntryMulti`
- Keep only: `MatchedChunkRaw`, `SimprintMatchRaw`, `TypeMatchResult`, `SimprintMatchMulti`

**Test updates:**
- Rewritten `test_indexes_usearch_simprint_m1.py`: 6 new tests for new system
- Updated `test_indexes_simprint_models.py`: added tests for TypeMatchResult
- Both exact (LMDB) and approximate (ShardedIndex128) search paths verified

**Test results:**
- All 1189 tests pass (253 legacy tests removed)
- 100% code coverage maintained
- Simprint tests: 72 pass (14 exact + 41 approx + 6 m1 + 11 m2)

The new architecture has a cleaner separation:
- LMDB: authoritative storage for both units and simprints
- ShardedNphdIndex: derived similarity search for units
- ShardedIndex128: derived approximate search for simprints
No more coordinator layer or multiple backend implementations.
Align configuration naming with iscc-sct project conventions:
- Rename settings.py → options.py
- Rename SearchSettings → SearchOptions, search_settings → search_opts
- Add load_dotenv() and env_file=".env" support
- Drop case_sensitive=True (use pydantic-settings default)
- Add host, port, workers fields to SearchOptions
- Add env var names to field descriptions
- Add python-dotenv dependency
- Update all imports across production and test code
titusz added 29 commits March 10, 2026 18:32
Upgrade iscc-usearch to 0.6.1 which fixes ShardedIndex128 AxisError
when searching with batch_size=1 and both view+active shards populated
(iscc-usearch#22). This crashed text search for short inputs producing
only one simprint chunk.

Also show API error details in playground instead of generic HTTP status
text (e.g. "axis 1 is out of bounds..." instead of "Bad Request").

Add regression test exercising the single-query + multi-shard scenario.
py-lmdb 2.2.0 invalidates all cached named-db handles when
env.set_mapsize() is called. UsearchIndex caches simprint db handles in
self._sp_data_dbs and self._sp_assets_dbs, so MapFullError retries (and
any subsequent writes/reads) would fail with "Database handle belongs to
another environment". Clear and repopulate the caches via
_load_sp_databases() after the resize.
Drop the pre-UsearchIndex legacy code paths — none of them are imported
by the production usearch:// backend and all consumers were themselves
cruft:

- store.py (IsccStore), instance.py (InstanceIndex), iscc.py (IsccIndex)
- lookup.py (IsccLookupIndex), unit.py (UnitIndex)
- nphd.py (thin re-export of iscc_usearch.NphdIndex)
- simprint/ package (SimprintMiniIndex, SimprintIndex — superseded by
  iscc_search/indexes/simprint/)
- matching tests (test_store, test_instance, test_iscc, test_lookup,
  test_unit*, test_nphd*, test_simprint_mini)
- scripts/benchmark_instance.py, scripts/benchmark_metric.py
- NphdIndex/UnitIndex/InstanceIndex re-exports from iscc_search/__init__.py

Also drop lancedb and psycopg[binary] from pyproject.toml — neither is
imported anywhere in the codebase — and remove the now-nonexistent
iscc_search/simprint/* from coverage omit.
The iscc_search/indexes/postgres/ directory held only an empty
__init__.py — no implementation, no routing in options.get_index().
Drop the directory along with the "planned" postgres mentions in the
field description, get_index() docstring, and error messages. Delete
the vacuous test_search_options_postgres_uri test and switch
test_get_index_unsupported_uri to use redis:// so the test no longer
implies postgres is a planned future backend.
Reflect the post-cleanup state: CLI + FastAPI server with protocol-based
backends (memory/lmdb/usearch), NPHD metric now provided by the external
iscc-usearch package, and the two-config split (options.py vs config.py).
Reject --workers > 1 when ISCC_SEARCH_INDEX_URI uses the usearch://
backend since concurrent writers corrupt .usearch files. Remove the
misleading --workers 4 example from the docstring.
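The guard amounts to a simple pre-flight check (sketch; function name and message are illustrative):

```python
def check_worker_count(index_uri: str, workers: int) -> None:
    """Refuse multi-worker serving for the usearch:// backend, where
    concurrent writers can corrupt .usearch files."""
    if workers > 1 and index_uri.startswith("usearch://"):
        raise ValueError(
            "--workers > 1 is not supported with the usearch:// backend"
        )
```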
A simprint index missing from memory but present in LMDB previously
triggered a full rebuild inside the search request. At production scale
(hundreds of millions of vectors per type) this blocks an HTTP request
for hours. Log a warning and skip the affected simprint type instead;
an explicit out-of-band rebuild must be used to restore results.
/healthz is a pure liveness probe that returns 200 as long as the
process can respond. /readyz is a readiness probe that returns 200
only when the index is initialized and list_indexes() succeeds, or
503 with a structured reason otherwise. Orchestrators should point
their liveness probes at /healthz and their readiness probes at
/readyz instead of the root endpoint.
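The probe semantics can be sketched independently of FastAPI (hypothetical helper; the real endpoints are FastAPI routes):

```python
def readyz(index) -> tuple[int, dict]:
    """Readiness: 200 only when the index answers list_indexes();
    otherwise 503 with a structured reason. Liveness (/healthz) would
    return 200 unconditionally as long as the process can respond."""
    if index is None:
        return 503, {"status": "unready", "reason": "index not initialized"}
    try:
        index.list_indexes()
    except Exception as exc:
        return 503, {"status": "unready", "reason": str(exc)}
    return 200, {"status": "ready"}
```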
Add sentry-sdk[fastapi] and initialize it at server import time when
ISCC_SEARCH_SENTRY_DSN is set. When the DSN is unset, init_sentry is a
no-op so local and test environments are unaffected. The performance
trace sampling rate defaults to 5% and is tunable via
ISCC_SEARCH_SENTRY_TRACES_SAMPLE_RATE.
Add a sizing profiles table for Sandbox, Validation, Launch, and Growth
deployments covering RAM, disk, shard sizes, flush interval, and HNSW
search depth. Document the /healthz and /readyz probe endpoints, bump
the recommended stop_grace_period to 120s, note the CLI-enforced
multi-worker guard, and add FLUSH_INTERVAL to .env.example with a
production-grade recommendation.
All actions are on Node.js 20, which gets force-replaced June 2nd 2026.
Add FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 to all 3 workflows to silence
the deprecation warnings and ensure CI keeps working after the cutover.
Existing docs moved to cauldron/legacy-docs (untracked) for reference.
Fresh documentation will follow.
Set up complete documentation infrastructure (zensical.toml, ISCC brand
CSS/JS, copilot widget, copy-as-markdown, SEO meta tags, LLM-friendly
output) and authored 13 pages following the Divio documentation framework:

- Homepage with badges, quick start, and navigation grid
- Tutorial: getting started (Python/CLI/REST tabs)
- How-to: index backends, CLI, REST API, deployment
- Explanation: ISCC primer, architecture, similarity search
- Reference: API (mkdocstrings), configuration, for-coding-agents
- Development: contributing guide

Also updates CI deploy workflow from mkdocs to zensical, adds CNAME for
search.iscc.codes, fixes license to Apache-2.0 in pyproject.toml and
README, and adds BETA status warnings.
Add notes to homepage, README, and architecture page explaining that
iscc-usearch is a patched fork of the usearch vector search library
used internally, not an alternative to iscc-search.

Ref: iscc/iscc-usearch#24
Add mdformat with mdformat-mkdocs[recommended] to pre-commit config
for consistent markdown formatting on commit. Remove legacy mkdocs,
mkdocs-material, and individual mdformat plugin dev dependencies
(markdown formatting now runs exclusively via pre-commit). Remove the
poe format-markdown task to match the iscc-usearch approach.
Update CI matrix to 3.11-3.14, bump requires-python to >=3.11, and
remove click, tqdm, pyyaml, and simsimd which had no imports in the
source tree. Also fix README CI badge label.
Bump actions/checkout v4→v6, actions/cache v4→v5,
actions/setup-python v5→v6, astral-sh/setup-uv v2→v8,
docker/build-push-action v6→v7, docker/login-action v3→v4,
docker/setup-buildx-action v3→v4, docker/metadata-action v5→v6,
codecov/codecov-action v4→v6, actions/upload-artifact v4→v6, and
actions/download-artifact v4→v7. Remove the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24
workaround since all actions now run on Node.js 24 natively.
setup-uv v8 stopped publishing major/minor pointer tags for supply
chain security. Use the full immutable tag instead.
- Rename scratch/ to cauldron/ in CLAUDE.md package layout
- Gitignore CLAUDE.local.md for local-only instructions
- Note default usearch index path in .env.example
Workflows collapse to two files with clear triggers:
- ci.yml: PR + push to develop/main — tests (Linux matrix 3.11-3.14 +
  Windows/macOS smoke), OpenAPI build, wheel build+smoke, Docker
  build+smoke (healthz/readyz), gated publish to ghcr:develop on push
  to develop.
- release.yml: release: published — full test matrix, wheel build with
  tag-version verification, wheel smoke on all three OSes, PyPI publish,
  Docker push with vX.Y.Z / X.Y / X / latest tags (latest=auto for
  stable only), docs deploy.

Build system switches to hatch-vcs: version derives from git tags,
removing the sed hack. FastAPI app.version now tracks the package
version instead of a hardcoded literal.

Dockerfile installs the pre-built wheel from dist/, eliminating source
files from the image. .dockerignore flips to an allowlist so dev
artifacts (cauldron/, .claude/, scratch/, tests/, docs/, .git/) cannot
leak into the image.
- `iscc-search datasets` lists ISCC datasets on the HF Hub (defaults to the
  `iscc` org) with a rich table or `--json` output.
- `iscc-search hub REPO_ID` streams a dataset's parquet files via
  huggingface_hub + pyarrow and indexes each row as an IsccEntry,
  preserving non-binary columns as opaque metadata.
- Auto-registers a local index derived from the dataset name when one
  isn't configured yet; monotonic microsecond stepping keeps
  `gen_iscc_id` collision-free in tight loops.
- Docs updated: CLI how-to, getting-started tutorial, for-coding-agents.
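The collision-avoidance trick can be sketched like this (hypothetical helper; `gen_iscc_id` is the function named above):

```python
import time

_last_us = 0

def next_timestamp_us() -> int:
    """Monotonic microsecond stepping: if the clock has not advanced
    since the last call (tight ingestion loop), step forward by one
    microsecond so every generated ID gets a unique timestamp."""
    global _last_us
    now_us = time.time_ns() // 1_000
    if now_us <= _last_us:
        now_us = _last_us + 1
    _last_us = now_us
    return now_us
```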
…ests

pytest-xdist runs workers in parallel; concurrent first-time downloads of
the iscc-sct ONNX model produced Protobuf parse errors on four of seven
matrix jobs. Pre-downloading the model serially before running tests
(and caching it across runs via actions/cache) matches the pattern the
previous docker-publish.yml used.
The 100ms upper-bound on test_timer_measures_actual_time failed on a
busy GitHub-hosted macOS runner where a 50ms sleep measured 200ms wall
clock. The assertion's job is to catch gross unit/impl bugs, not to
benchmark sleep precision, so 1s of headroom is plenty.
Default the CLI logger to INFO, demote per-batch add_assets logs to DEBUG,
and disable huggingface_hub's tqdm bars (they render poorly in non-TTY
consoles where our own rich spinners already provide feedback).
@titusz titusz merged commit 284490d into main Apr 16, 2026
20 checks passed
