-
Notifications
You must be signed in to change notification settings - Fork 49
Sorcha/dev/zarr3 compaction #1450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
|
Integration test fails, need to investigate this further! |
What is the performance impact for running evaluation with using zipstore? |
tjhunter
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small style comments. Otherwise, it looks ready to try out on a larger scale.
When testing into the logic, I moved the creation of the zarr store straight into inference. Otherwise, I am personally confused when we write during inference and when we just calculate validation losses.
@grassesi : do you know why we don't write data when calling validation during training?
| """Convert raw dask arrays into chunked dask-aware xarray dataset.""" | ||
| chunks = (chunk_nsamples, *self.data.shape[1:]) | ||
|
|
||
| chunks = (chunk_nsamples, *(max(x // 4, 1) for x in self.data.shape[1:])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this formula coming from? Someone else should be able to update it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extracted out into a scale factor parameter now for easier updating
| self.data_root: zarr.Group | None = None | ||
| self._store: LocalStore | None = None | ||
| # determine type using extension | ||
| ext = self._store_path.suffix[1:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are repeating yourself
| if self.type == "zip": | ||
| if self.create: | ||
| _logger.info("Creating zipstore") | ||
| self._store = ZipStore(self._store_path, mode="a") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
append is more permissive, but there is a cost in opening and closing the store. I would rather have the store opened for the full duration of the write rather than for each chunk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mode = "w" works with the new develop code, before I had some issues with overwriting so was using append - thanks for fixing this :)
We don't write during validation since it's slows down the validation, generates additional files/storage, and is rarely used. There's a config parameter that controls this: WeatherGenerator/config/default_config.yml Line 337 in 1776b0a
|
…rcha/dev/zarr3_compaction
Hi Christian, sorry I missed this before the holidays - the latest tests are all in the hedge doc: https://gitlab.jsc.fz-juelich.de/hedgedoc/3X10vtdVQumQZy4qpBiHWg - there seems to be an small imporevemnt when using the ZipStore to evaluate |
grassesi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, but I would like some structural changes: If possible trainer.Trainer should not be concerned with managing the ZarrIO context. In addition to the changes I requested below I would also like for the ZarrIO.enter method to use polymorphism instead of switching behavior based on self.type. I would also like that number of times the literals indicating file extension/store_type (there is a one-to-one mapping) is reduced (eg. "zip" now occurs in 7 different places all over the code, similarly with "zarr"/"local"). I have implemented these two points in #1553 , feel free to just merge it if you like it.
| if SHARDING_ENABLED and chunks != "auto": | ||
| shards = (SHARD_N_SAMPLES, *((SCALE_FACTOR + 1) * x for x in chunks[1:])) | ||
| group.create_array(name, data=array, chunks=chunks, shards=shards) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice if you can encapsulate this conditional:
shards = _get_shards(chunks)
group.create_array(name, data=array, chunks=chunks, shards = shards)
def _get_shards(tuple[int]) -> dict[str, Any] | None:
...| ) | ||
| stream_dict = reader.get_stream(stream) | ||
| if not stream_dict: | ||
| return run_id, stream, {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change required for the new zarr zip store? I am also surprised that this is needed. From the code I would have thought that reader.get_stream always returns at least an empty dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was inherited from an older merge w/develop. I'll remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this file from your PR, since you did not make any real changes to pyproject.toml
| zarr_store=args.zarr_store, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment on utils/cli.py. I think it should be a proper option instead (with a default value in default_config.yaml etc.)
| elapsed = timeit.default_timer() - start_time | ||
| _logger.debug(f"writing array: {name} took {elapsed:.2f}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think we want that timing logic in develop.
| def as_xarray(self, chunk_nsamples=CHUNK_N_SAMPLES) -> xr.DataArray: | ||
| def as_xarray( | ||
| self, chunk_nsamples=CHUNK_N_SAMPLES, shard_nsamples=SHARD_N_SAMPLES | ||
| ) -> xr.DataArray: | ||
| """Convert raw dask arrays into chunked dask-aware xarray dataset.""" | ||
| chunks = (chunk_nsamples, *self.data.shape[1:]) | ||
|
|
||
| chunks = (chunk_nsamples, *(max(x // SCALE_FACTOR, 1) for x in self.data.shape[1:])) | ||
| if SHARDING_ENABLED: | ||
| shards = (shard_nsamples, *((SCALE_FACTOR + 1) * x for x in chunks[1:])) | ||
| _logger.info(f"sharding enabled with shards: {shards} and chunks: {chunks}") | ||
| else: | ||
| shards = None | ||
| _logger.info(f"sharding disabled, using chunks: {chunks}") | ||
| # maybe do dask conversion earlier? => usefull for parallel writing? | ||
| data = da.from_zarr(self.data, chunks=chunks) # dont call compute to lazy load | ||
| data = da.from_zarr( | ||
| self.data, chunks=chunks, shards=shards | ||
| ) # dont call compute to lazy load |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good :)
| # warnings.warn(f"Zarr2 conflict: {last_msg}", | ||
| # DeprecationWarning | ||
| # ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove if not important
| with warnings.catch_warnings(record=True) as caught: | ||
| self._store = LocalStore(self._store_path) | ||
| self.data_root = zarr.group(store=self._store) | ||
| # Raise DeprecationWarning only if a ZarrUserWarning was raised |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the cause of this warnings? Maybe add a short comment why we want to ignore it here.
* make zarrio subclasses * store string literals for output storage in enum.
Description
Draft PR to showcase how to enable sharding (experimentation underway to optimise performance)
Issue Number
DRAFT
Closes #1384
Checklist before asking for review
./scripts/actions.sh lint./scripts/actions.sh unit-test./scripts/actions.sh integration-testlaunch-slurm.py --time 60