feat: optimize property access with cached_property decorator#61
rhoadesScholar merged 12 commits into main from
Conversation
rhoadesScholar
commented
Feb 25, 2026
- Replaced @property with @cached_property in multiple classes (CellMapDatasetWriter, CellMapDataSplit, CellMapImage, CellMapMultiDataset) to improve performance by caching results of expensive computations.
- Removed redundant try-except blocks for attribute access, simplifying the code and enhancing readability.
- Updated related methods to ensure consistent behavior with the new caching mechanism.
- Removed test for zero-byte cache limit as it was deemed unnecessary.
Pull request overview
This PR optimizes repeated access to expensive, derived attributes in CellMap-Data’s dataset/image abstractions by switching from manual @property-with-_cached patterns to functools.cached_property, and updates cache-bounding behavior in the DataLoader accordingly.
Changes:
- Replaced several computed @property implementations with @cached_property across dataset/image/split/multidataset writer components.
- Updated TensorStore cache bounding and context-application logic in CellMapDataLoader, including new edge-case handling and warnings.
- Removed the test that asserted a 0-byte TensorStore cache limit disables caching.
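The refactor the overview describes can be illustrated with a minimal before/after sketch. The class and attribute names below are made up for illustration; the real classes are listed in the table that follows:

```python
from functools import cached_property


class ManualCache:
    """Old pattern: @property backed by a hand-rolled cache attribute."""

    @property
    def size(self) -> int:
        try:
            return self._size
        except AttributeError:
            self._size = self._compute_size()  # expensive computation
            return self._size

    def _compute_size(self) -> int:
        return 42


class WithCachedProperty:
    """New pattern: functools.cached_property stores the result in __dict__."""

    @cached_property
    def size(self) -> int:
        return self._compute_size()  # computed once, then served from __dict__

    def _compute_size(self) -> int:
        return 42
```

Note that cached_property stores the computed value in the instance `__dict__` under the property's own name, so `del obj.size` (or `obj.__dict__.pop("size", None)`) is how a cache entry is invalidated; this detail matters for the cache-reset logic discussed in the review comments below.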
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/test_dataloader.py | Removes the zero-byte cache-limit test related to TensorStore cache bounding. |
| src/cellmap_data/multidataset.py | Converts derived weight/count/index properties to @cached_property and updates empty() cache initialization. |
| src/cellmap_data/image.py | Converts multiple expensive image metadata/data accessors to @cached_property. |
| src/cellmap_data/datasplit.py | Converts combined-dataset accessors to @cached_property and updates cache invalidation/reset logic. |
| src/cellmap_data/dataset_writer.py | Converts writer-derived geometry/index helpers to @cached_property. |
| src/cellmap_data/dataset.py | Converts dataset-derived geometry/count/index helpers to @cached_property. |
| src/cellmap_data/dataloader.py | Updates TensorStore context application to align with cached_property caching and adds cache-splitting edge-case warning + refresh() warning. |
    self.classes,
    self.input_arrays,
    self.target_arrays,
UserWarning("Validation datasets not loaded.") creates a warning object but does not emit it, so this is effectively a no-op. If this is meant to notify callers, use warnings.warn(...) (or logger.warning(...)) instead.
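The distinction the reviewer is drawing can be demonstrated in a few lines: constructing a `UserWarning` object does nothing unless it is passed through `warnings.warn`, which routes it to the warnings machinery:

```python
import warnings


def silent() -> None:
    # Creates a warning object but never emits it -- effectively a no-op.
    UserWarning("Validation datasets not loaded.")


def emitting() -> None:
    # Actually emits the warning so callers (and filters) can see it.
    warnings.warn("Validation datasets not loaded.", UserWarning, stacklevel=2)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silent()
    assert len(caught) == 0  # nothing was emitted
    emitting()
    assert len(caught) == 1 and caught[0].category is UserWarning
```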
src/cellmap_data/multidataset.py
Outdated
    UserWarning(
        f"Unable to get validation indices for dataset {dataset}\n skipping"
In the AttributeError handler, UserWarning(...) is instantiated but never emitted, so missing validation_indices will be silently ignored. Use warnings.warn(...) or logger.warning(...) so callers understand that validation indices were skipped for that dataset.
Suggested change:

    - UserWarning(
    -     f"Unable to get validation indices for dataset {dataset}\n skipping"
    + logger.warning(
    +     "Unable to get validation indices for dataset %r; skipping this "
    +     "dataset when building validation_indices.",
    +     dataset,
    per_worker_bytes = tensorstore_cache_bytes // effective_workers
    if per_worker_bytes == 0 and tensorstore_cache_bytes > 0:
        per_worker_bytes = 1
        logger.warning(
            "tensorstore_cache_bytes=%d with num_workers=%d results in "
            "per-worker cache limit of 0 bytes, which TensorStore treats as "
            "unlimited. Setting per-worker cache limit to 1 byte to enforce "
            "a meaningful bound. To avoid this warning, set tensorstore_cache_bytes "
            "to at least %d bytes for num_workers=%d.",
            tensorstore_cache_bytes,
            num_workers,
            effective_workers,
            effective_workers,
        )
When tensorstore_cache_bytes < num_workers, forcing per_worker_bytes=1 means the aggregate cache budget across workers can exceed the requested tensorstore_cache_bytes (e.g., total=1 with 4 workers becomes 4 bytes). Either adjust the wording/docs (since the budget is no longer truly “shared across all workers”), or change the splitting strategy to keep the effective total bounded (e.g., cap the number of workers that get a cache_pool limit, or explicitly document the minimum per-worker limit behavior).
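One bounded splitting strategy along the lines the comment suggests can be sketched as follows. The function name is hypothetical; the idea is to give a cache pool to only as many workers as the budget allows, so the aggregate never exceeds the requested total:

```python
def split_cache_budget(total_bytes: int, num_workers: int) -> list[int]:
    """Distribute total_bytes across workers without exceeding the total.

    When the budget is smaller than the worker count, workers beyond the
    budget get 0 bytes (no cache pool) instead of a clamped 1-byte minimum,
    keeping sum(result) <= total_bytes.
    """
    workers = max(num_workers, 1)
    per_worker = total_bytes // workers
    if per_worker > 0:
        return [per_worker] * workers
    # Budget smaller than worker count: 1 byte each for the first
    # total_bytes workers, nothing for the rest.
    return [1] * total_bytes + [0] * (workers - total_bytes)
```

With this scheme, `tensorstore_cache_bytes=1` and 4 workers yields `[1, 0, 0, 0]` rather than an aggregate of 4 bytes; the trade-off is that most workers then run with an unbounded (TensorStore-default) or absent cache pool, which is exactly the documentation question the reviewer raises.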
    def _apply_context_to_image(image: "CellMapImage", context) -> None:
        """Set the TensorStore context on a single CellMapImage, warning if already opened."""
    -   if "_array" in getattr(image, "__dict__", {}):
    +   if "array" in getattr(image, "__dict__", {}):
            logger.warning(
The module-level docstring above still refers to the TensorStore array being cached under "_array", but the implementation now checks for the cached_property name "array". Update the docstring/comments accordingly to avoid misleading future changes/debugging.
    if per_worker_bytes == 0 and tensorstore_cache_bytes > 0:
        per_worker_bytes = 1
        logger.warning(
            "tensorstore_cache_bytes=%d with num_workers=%d results in "
            "per-worker cache limit of 0 bytes, which TensorStore treats as "
            "unlimited. Setting per-worker cache limit to 1 byte to enforce "
            "a meaningful bound. To avoid this warning, set tensorstore_cache_bytes "
            "to at least %d bytes for num_workers=%d.",
            tensorstore_cache_bytes,
            num_workers,
            effective_workers,
            effective_workers,
        )
New behavior for the per_worker_bytes == 0 case (warning + clamping to 1 byte) is introduced here but there is no test asserting the warning and resulting per-image context when tensorstore_cache_bytes < num_workers. Consider adding a unit test in TestTensorStoreCacheBounding to lock in this edge-case behavior.
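A test of the kind suggested might look like the sketch below. Since the real splitting code lives inside `CellMapDataLoader`, the sketch replicates the clamping logic in a standalone helper (`compute_per_worker_bytes` is a hypothetical name) purely to show the shape of the assertion on the emitted warning:

```python
import logging

logger = logging.getLogger("cellmap_data.dataloader")


def compute_per_worker_bytes(total_bytes: int, num_workers: int) -> int:
    """Illustrative replica of the clamping logic under test."""
    workers = max(num_workers, 1)
    per_worker = total_bytes // workers
    if per_worker == 0 and total_bytes > 0:
        per_worker = 1
        logger.warning(
            "tensorstore_cache_bytes=%d with num_workers=%d clamped to "
            "1 byte per worker.",
            total_bytes,
            num_workers,
        )
    return per_worker


def test_clamps_and_warns_when_budget_smaller_than_workers():
    records = []
    handler = logging.Handler()
    handler.emit = records.append  # capture records instead of formatting them
    logger.addHandler(handler)
    logger.setLevel(logging.WARNING)
    try:
        assert compute_per_worker_bytes(1, 4) == 1
    finally:
        logger.removeHandler(handler)
    assert any("clamped" in rec.getMessage() for rec in records)
```

In a pytest suite the capture would more idiomatically use the `caplog` fixture; the manual handler here just keeps the sketch self-contained.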
    @cached_property
    def sampling_box_shape(self) -> dict[str, int]:
        """Returns the shape of the sampling box of the dataset in voxels of the largest voxel size requested."""
    -   try:
    -       return self._sampling_box_shape
    -   except AttributeError:
    -       self._sampling_box_shape = self._get_box_shape(self.sampling_box)
    -       if self.pad:
    -           for c, size in self._sampling_box_shape.items():
    -               if size <= 0:
    -                   logger.debug(
    -                       "Sampling box for axis %s has size %d <= 0. "
    -                       "Setting to 1 and padding.",
    -                       c,
    -                       size,
    -                   )
    -                   self._sampling_box_shape[c] = 1
    -       return self._sampling_box_shape
    +   shape = self._get_box_shape(self.sampling_box)
    +   if self.pad:
    +       for c, size in shape.items():
    +           if size <= 0:
    +               logger.debug(
    +                   "Sampling box for axis %s has size %d <= 0. "
    +                   "Setting to 1 and padding.",
    +                   c,
    +                   size,
    +               )
    +               shape[c] = 1
    +   return shape
sampling_box_shape is now a @cached_property, so setting instance._sampling_box_shape = ... (as done in CellMapDataset.empty()) will no longer affect the returned value. This can break empty-dataset behavior (e.g., len() or sampling) because the cached_property will recompute from sampling_box instead of using the pre-set shape. Update CellMapDataset.empty() (or any other callers) to pre-populate sampling_box_shape in __dict__ (or clear/pop it appropriately) rather than writing _sampling_box_shape.
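The `__dict__` pre-population the reviewer describes can be sketched like this (class and attribute contents are illustrative, not the real CellMapDataset internals):

```python
from functools import cached_property


class Dataset:
    def __init__(self, bounds: dict[str, int]):
        self._bounds = bounds

    @cached_property
    def sampling_box_shape(self) -> dict[str, int]:
        # Stand-in for the expensive derivation from sampling_box.
        return dict(self._bounds)

    @classmethod
    def empty(cls) -> "Dataset":
        ds = cls(bounds={})
        # Writing ds._sampling_box_shape would now be silently ignored by the
        # cached_property; seed the cache via __dict__ under the property's
        # own name instead.
        ds.__dict__["sampling_box_shape"] = {"x": 0, "y": 0, "z": 0}
        return ds
```

Because cached_property looks up the instance `__dict__` entry named after the property before computing, seeding that key makes the pre-set shape win; conversely, `ds.__dict__.pop("sampling_box_shape", None)` forces a recompute on next access.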
src/cellmap_data/dataset.py
Outdated
    -   size = np.prod([stop - start for start, stop in self.bounding_box.items()])
    -   self._size = int(size)
    -   return self._size
    +   return int(np.prod([stop - start for start, stop in self.bounding_box.items()]))
size computes stop - start over self.bounding_box.items(), but .items() yields (axis, [start, stop]). This will raise a TypeError (subtracting list/str) and should instead subtract the numeric bounds (e.g., iterate over .values() or unpack (axis, (start, stop))).
Suggested change:

    -   return int(np.prod([stop - start for start, stop in self.bounding_box.items()]))
    +   return int(np.prod([stop - start for start, stop in self.bounding_box.values()]))
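The failure mode is easy to reproduce standalone: `.items()` yields `(axis, bounds)` pairs, so the unpacked `start` is the axis name rather than a number. The sketch uses `math.prod` in place of `np.prod` to stay stdlib-only; the behavior is the same:

```python
from math import prod  # np.prod in the original; math.prod keeps this stdlib-only

bounding_box = {"x": [0, 10], "y": [0, 4]}

# .items() yields ("x", [0, 10]): start unpacks to the axis name, not a number.
try:
    prod([stop - start for start, stop in bounding_box.items()])
    raise AssertionError("expected a TypeError")
except TypeError:
    pass  # list - str is undefined, as the review comment predicts

# .values() yields [0, 10]: unpacking gives the numeric bounds.
size = int(prod(stop - start for start, stop in bounding_box.values()))  # 10 * 4 = 40
```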
    array_future = ts.open(
        spec, read=True, write=False, context=self.context
    )
    try:
        array = array_future.result()
    except ValueError as e:
        Warning(e)
        UserWarning("Falling back to zarr3 driver")
        spec["driver"] = "zarr3"
        array_future = ts.open(
            spec, read=True, write=False, context=self.context
        )
        array = array_future.result()
    data = xt._TensorStoreAdapter(array)
    self._array = xarray.DataArray(data=data, coords=self.full_coords)
    return self._array
    array = array_future.result()
Warning(e) and UserWarning(...) are being constructed here but not emitted/logged, so failures to open with the default driver will be silently ignored. Use logger.warning(...) and/or warnings.warn(...) (and consider including the exception details) so users can actually see the fallback behavior.
Codecov Report

❌ Patch coverage is not fully reported. Additional details and impacted files:

    @@ Coverage Diff @@
    ##             main      #61      +/-   ##
    ==========================================
    + Coverage   70.41%   70.48%   +0.07%
    ==========================================
      Files          28       28
      Lines        2633     2450     -183
    ==========================================
    - Hits         1854     1727     -127
    + Misses        779      723      -56

☔ View full report in Codecov by Sentry.
@copilot open a new pull request to apply changes based on the comments in this thread |
@rhoadesScholar I've opened a new pull request, #62, to work on those changes. Once the pull request is ready, I'll request review from you. |
Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com>
…m_workers Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com>
…e warning emission
Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com>
fix: address PR review feedback - warning emission, cached_property compatibility, and documentation