
perf: Add additional sharding benchmarks #3712

Open
mkitti wants to merge 4 commits into zarr-developers:main from mkitti:mkitti-morton-order-shard-indexing-benchmarks

Conversation


@mkitti mkitti commented Feb 16, 2026

Summary

Added benchmarks for monitoring Morton order computation in sharded arrays. These benchmarks help assess the impact of Morton order optimizations in the context of I/O operations.

Benchmarks Added

  • test_sharded_morton_indexing - Sharded array indexing with power-of-2 chunks per shard
  • test_sharded_morton_indexing_large - Large shard with 32^3 = 32,768 chunks
  • test_sharded_morton_single_chunk - Reading a single chunk from a large shard
  • test_morton_order_iter - Direct benchmark of morton_order_iter (no I/O)
  • test_sharded_morton_write_single_chunk - Writing a single chunk to a large shard (best end-to-end test)
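
To make the benchmarked operation concrete, here is a minimal pure-Python sketch of Morton (Z-order) iteration over a power-of-2 hypercube of chunks. This is an illustrative reimplementation, not zarr's actual `morton_order_iter`; the bit-interleaving order and function signatures are assumptions.

```python
def decode_morton_scalar(z: int, ndim: int, nbits: int) -> tuple[int, ...]:
    """Decode one Morton (Z-order) code into per-axis coordinates by
    de-interleaving its bits (illustrative scalar version)."""
    coords = [0] * ndim
    for bit in range(nbits):
        for axis in range(ndim):
            coords[axis] |= ((z >> (bit * ndim + axis)) & 1) << bit
    return tuple(coords)


def morton_order_iter(shape: tuple[int, ...]):
    """Yield chunk coordinates of `shape` in Morton order.
    Sketch only: assumes every extent is the same power of two."""
    ndim = len(shape)
    nbits = max(shape).bit_length() - 1
    total = 1
    for s in shape:
        total *= s
    for z in range(total):
        yield decode_morton_scalar(z, ndim, nbits)
```

For a (32, 32, 32) shard this yields all 32,768 chunk coordinates, one scalar decode per chunk, which is exactly the per-chunk cost the profiles below attribute to `decode_morton (scalar)`.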

Benchmark Results

Single Chunk Write (Best End-to-End Test)

Writing a single 1x1x1 chunk to a shard with 32^3 = 32,768 chunks:

| Branch                 | Mean Time | Improvement         |
|------------------------|-----------|---------------------|
| Main (no optimization) | 425 ms    | -                   |
| Optimized (PR #3708)   | 261 ms    | 164 ms (39% faster) |

Morton Order Computation (Micro-benchmark)

Direct morton_order_iter benchmark without I/O:

| Shape        | Main Branch | Optimized | Speedup |
|--------------|-------------|-----------|---------|
| (8, 8, 8)    | 2.73 ms     | 0.85 ms   | 3.2x    |
| (16, 16, 16) | 25.53 ms    | 6.31 ms   | 4.0x    |
| (32, 32, 32) | 229.25 ms   | 51.31 ms  | 4.5x    |
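
Timings like these can be gathered with a small `timeit` harness. The harness below is a generic sketch; it does not import zarr (the import path of the real `morton_order_iter` is not shown in this PR), and taking the minimum over repeats is a conventional choice because it is least affected by scheduler noise.

```python
import timeit


def bench(func, *args, repeat: int = 5, number: int = 1) -> float:
    """Return the best-of-`repeat` wall time (seconds) for one call
    of func(*args), averaging over `number` inner calls."""
    return min(timeit.repeat(lambda: func(*args),
                             number=number, repeat=repeat)) / number


# Hypothetical usage against zarr's iterator (import path assumed):
#   t = bench(lambda s: list(morton_order_iter(s)), (32, 32, 32))
#   print(f"{t * 1e3:.2f} ms")
```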

Profiling Analysis

Profile of single chunk write benchmark showing where time is spent:

Main Branch (977ms total)

| Function               | Time   | Calls  | % of Total |
|------------------------|--------|--------|------------|
| decode_morton (scalar) | 289 ms | 32,768 | 30%        |
| get_chunk_slice        | 104 ms | 32,768 | 11%        |
| _localize_chunk        | 103 ms | 32,768 | 11%        |
| _morton_order          | 99 ms  | 1      | 10%        |
| Generator expressions  | 94 ms  | 262k   | 10%        |
| all() / len()          | 87 ms  | 263k   | 9%         |

Optimized Branch (456ms total)

| Function                 | Time   | Calls  | % of Total |
|--------------------------|--------|--------|------------|
| get_chunk_slice          | 110 ms | 32,768 | 24%        |
| _localize_chunk          | 105 ms | 32,768 | 23%        |
| _morton_order            | 66 ms  | 1      | 14%        |
| Generator expressions    | 38 ms  | 131k   | 8%         |
| decode_morton_vectorized | 9 ms   | 1      | 2%         |

Key Optimization Wins

  1. Vectorized decoding: Eliminates 32,768 scalar decode_morton calls (289ms → 9ms)
  2. Reduced bounds checking: Hypercube optimization eliminates all() checks for in-bounds coordinates
  3. Fewer function calls: 1.1M calls reduced to 299k calls
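
The "vectorized decoding" win can be sketched with NumPy: decode every Morton code in the shard at once with one bitwise operation per (bit, axis) pair, instead of one Python call per code. This is an illustrative sketch, not zarr's actual `decode_morton_vectorized`; the name and bit ordering here are assumptions.

```python
import numpy as np


def decode_morton_vectorized(ndim: int, nbits: int) -> np.ndarray:
    """De-interleave all 2**(ndim*nbits) Morton codes in one pass.
    Returns an array of shape (n_codes, ndim): row z holds the
    per-axis coordinates of Morton code z."""
    codes = np.arange(1 << (ndim * nbits), dtype=np.uint64)
    coords = np.zeros((codes.size, ndim), dtype=np.uint64)
    for bit in range(nbits):
        for axis in range(ndim):
            src = np.uint64(bit * ndim + axis)
            coords[:, axis] |= ((codes >> src) & np.uint64(1)) << np.uint64(bit)
    return coords
```

Because the hot loop runs `ndim * nbits` times (15 iterations for a 32^3 shard) rather than once per code (32,768 times), the per-code Python overhead disappears, matching the 289 ms to 9 ms drop in the profile above.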

Remaining Optimization Opportunity

get_chunk_slice and _localize_chunk are called 32,768 times even when writing a single chunk due to line 508 in sharding.py:

shard_dict = {k: shard_reader.get(k) for k in morton_order_iter(chunks_per_shard)}

This builds a dict of ALL chunks before writing. Optimizing this read-modify-write pattern could save an additional ~215ms.
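
One hypothetical shape for that optimization: copy only the chunks the shard already holds and overwrite the target chunk, rather than probing every Morton-ordered key. The sketch below uses a plain dict standing in for the shard reader; names and the reader interface are assumptions, not zarr's API.

```python
def update_single_chunk(shard_reader: dict, chunk_key: tuple,
                        new_bytes: bytes) -> dict:
    """Hypothetical read-modify-write: build the new shard dict from
    the chunks already present (O(chunks present)), then replace the
    single chunk being written, instead of calling get() for all
    32,768 Morton-ordered keys."""
    shard_dict = dict(shard_reader)  # existing chunks only
    shard_dict[chunk_key] = new_bytes
    return shard_dict
```

Whether this is safe depends on how the real shard reader materializes chunk slices, so it is an optimization opportunity to verify rather than a drop-in change.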

Checklist

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 16, 2026
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Feb 16, 2026

mkitti commented Feb 17, 2026

If we wanted to minimize this pull request, I would reduce it to just "test_sharded_morton_write_single_chunk".


mkitti commented Feb 17, 2026

@d-v-b, please merge this or add the benchmark label.
