
Conversation

@enssow enssow commented Dec 11, 2025

Description

Draft PR to showcase how to enable sharding (experimentation underway to optimise performance)

Issue Number

DRAFT
Closes #1384

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions github-actions bot added the infra (Issues related to infrastructure) and performance (Work related to performance improvements) labels Dec 11, 2025
@enssow enssow marked this pull request as draft December 11, 2025 14:22
@enssow enssow mentioned this pull request Dec 11, 2025

enssow commented Dec 19, 2025

  • Specify the type of store to create during inference with uv run --offline inference --from_run_id zs581tqh --samples 1 --streams_output ERA5 --zarr-store local or uv run --offline inference --from_run_id zs581tqh --samples 1 --streams_output ERA5 --zarr-store zip
  • The code to run evaluation stays the same, with no extra config sections required (the type of store is detected from the extension), e.g. uv run evaluate --config ../test_evaluate_zip.yml (I tested the latest code with runs v6l27pog (zip) and bvu0y897 (local))
  • Backwards compatible: opening an older store emits ZarrUserWarning: Both zarr.json (Zarr format 3) and .zgroup (Zarr format 2) metadata objects exist at file:///p/scratch/weatherai/shared_work/results/v8pqzh4y/validation_chkpt00000_rank0000.zarr. Zarr format 3 will be used. (raised from /p/project1/weatherai/owens1/WeatherGenerator/.venv/lib/python3.12/site-packages/zarr/core/group.py:568), but it does plot (tested with run v8pqzh4y)
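The extension-based detection mentioned in the second bullet could be sketched roughly like this (a minimal sketch; the `StoreType` enum and `detect_store_type` helper are illustrative names, not the PR's actual code):

```python
from enum import Enum
from pathlib import Path

class StoreType(Enum):
    """Hypothetical one-to-one mapping between file extension and store type."""
    LOCAL = "zarr"
    ZIP = "zip"

def detect_store_type(path: Path) -> StoreType:
    """Infer the store type from the path's extension, e.g. '.zarr' -> LOCAL."""
    ext = path.suffix.lstrip(".")
    for store_type in StoreType:
        if store_type.value == ext:
            return store_type
    raise ValueError(f"unsupported store extension: {ext!r}")

# e.g. detect_store_type(Path("results/run.zarr")) is StoreType.LOCAL
```

The actual store object (e.g. zarr's LocalStore or ZipStore) would then be constructed from the detected type, keeping the extension literals in a single place.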


enssow commented Dec 19, 2025

Integration test fails; I need to investigate this further!


clessig commented Dec 20, 2025

> Latest version can read and write with the zipstore format but needs tidying up (produces lots of print statements to narrow down the issues when trying to run the evaluation from a .yml file)

What is the performance impact of running evaluation with the zipstore?


@tjhunter tjhunter left a comment


Some small style comments. Otherwise, it looks ready to try out on a larger scale.

While testing the logic, I moved the creation of the zarr store directly into inference. Otherwise, I am personally confused about when we write during inference and when we just calculate validation losses.

@grassesi : do you know why we don't write data when calling validation during training?

"""Convert raw dask arrays into chunked dask-aware xarray dataset."""
chunks = (chunk_nsamples, *self.data.shape[1:])

chunks = (chunk_nsamples, *(max(x // 4, 1) for x in self.data.shape[1:]))
Collaborator

where is this formula coming from? Someone else should be able to update it later.

Contributor Author

extracted out into a scale factor parameter now for easier updating

self.data_root: zarr.Group | None = None
self._store: LocalStore | None = None
# determine type using extension
ext = self._store_path.suffix[1:]
Collaborator

you are repeating yourself

if self.type == "zip":
    if self.create:
        _logger.info("Creating zipstore")
        self._store = ZipStore(self._store_path, mode="a")
Collaborator

Append is more permissive, but there is a cost in opening and closing the store. I would rather have the store opened for the full duration of the write than for each chunk.

Contributor Author

mode = "w" works with the new develop code, before I had some issues with overwriting so was using append - thanks for fixing this :)


clessig commented Dec 29, 2025

> Some small style comments. Otherwise, it looks ready to try out on a larger scale.
>
> While testing the logic, I moved the creation of the zarr store directly into inference. Otherwise, I am personally confused about when we write during inference and when we just calculate validation losses.
>
> @grassesi : do you know why we don't write data when calling validation during training?

We don't write during validation since it slows down the validation, generates additional files/storage, and is rarely used. There's a config parameter that controls this:

log_validation: 0

(in the process of being renamed).
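As a config fragment this could look like the following sketch (only `log_validation: 0` appears in the thread; the surrounding section name and the semantics of nonzero values are assumptions):

```yaml
# run configuration (section name illustrative)
training:
  log_validation: 0  # 0 disables writing validation output during training
```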

@github-actions github-actions bot added the initiative (Large piece of work covering multiple sprints) label Dec 31, 2025

enssow commented Dec 31, 2025

> Latest version can read and write with the zipstore format but needs tidying up (produces lots of print statements to narrow down the issues when trying to run the evaluation from a .yml file)
>
> What is the performance impact of running evaluation with the zipstore?

Hi Christian, sorry I missed this before the holidays - the latest tests are all in the hedgedoc: https://gitlab.jsc.fz-juelich.de/hedgedoc/3X10vtdVQumQZy4qpBiHWg - there seems to be a small improvement when using the ZipStore to evaluate.

@enssow enssow marked this pull request as ready for review January 2, 2026 15:22

@grassesi grassesi left a comment


Great work, but I would like some structural changes: if possible, trainer.Trainer should not be concerned with managing the ZarrIO context. In addition to the changes I requested below, I would also like the ZarrIO.enter method to use polymorphism instead of switching behavior based on self.type. I would also like the number of occurrences of the literals indicating file extension/store type (there is a one-to-one mapping) to be reduced (e.g. "zip" now occurs in 7 different places all over the code, and similarly with "zarr"/"local"). I have implemented these two points in #1553, feel free to just merge it if you like it.
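The polymorphism being requested could be sketched like this (class and method names are illustrative, not the actual #1553 implementation; the string returns stand in for constructing zarr's LocalStore/ZipStore):

```python
from abc import ABC, abstractmethod
from pathlib import Path

class ZarrIO(ABC):
    """Each store flavour becomes a subclass, so __enter__ no longer
    needs to branch on self.type."""

    def __init__(self, store_path: Path) -> None:
        self.store_path = store_path

    @abstractmethod
    def open_store(self):
        """Create/open the underlying zarr store."""

    @classmethod
    def from_path(cls, store_path: Path) -> "ZarrIO":
        # the single place where the extension -> class mapping lives
        subclasses = {".zarr": LocalZarrIO, ".zip": ZipZarrIO}
        return subclasses[store_path.suffix](store_path)

class LocalZarrIO(ZarrIO):
    def open_store(self):
        return f"LocalStore({self.store_path})"  # stand-in for zarr.storage.LocalStore

class ZipZarrIO(ZarrIO):
    def open_store(self):
        return f"ZipStore({self.store_path})"  # stand-in for zarr.storage.ZipStore

# e.g. ZarrIO.from_path(Path("out.zip")) is a ZipZarrIO instance
```

Centralising the mapping in `from_path` also addresses the repeated "zip"/"zarr" literals.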

Comment on lines +470 to +472
if SHARDING_ENABLED and chunks != "auto":
shards = (SHARD_N_SAMPLES, *((SCALE_FACTOR + 1) * x for x in chunks[1:]))
group.create_array(name, data=array, chunks=chunks, shards=shards)
Contributor

would be nice if you can encapsulate this conditional:

shards = _get_shards(chunks)
group.create_array(name, data=array, chunks=chunks, shards=shards)

def _get_shards(chunks: tuple[int, ...]) -> tuple[int, ...] | None:
    ...

Comment on lines 226 to +229
)
stream_dict = reader.get_stream(stream)
if not stream_dict:
return run_id, stream, {}
Contributor

Is this change required for the new zarr zip store? I am also surprised that this is needed. From the code I would have thought that reader.get_stream always returns at least an empty dict.

Contributor Author

It was inherited from an older merge with develop. I'll remove it.

Contributor

Can you remove this file from your PR, since you did not make any real changes to pyproject.toml?

Comment on lines +51 to 52
zarr_store=args.zarr_store,
)
Contributor

See my comment on utils/cli.py. I think it should be a proper option instead (with a default value in default_config.yaml etc.)

Comment on lines +477 to +478
elapsed = timeit.default_timer() - start_time
_logger.debug(f"writing array: {name} took {elapsed:.2f}")
Contributor

I don't think we want that timing logic in develop.

Comment on lines -253 to +282
-def as_xarray(self, chunk_nsamples=CHUNK_N_SAMPLES) -> xr.DataArray:
+def as_xarray(
+    self, chunk_nsamples=CHUNK_N_SAMPLES, shard_nsamples=SHARD_N_SAMPLES
+) -> xr.DataArray:
     """Convert raw dask arrays into chunked dask-aware xarray dataset."""
-    chunks = (chunk_nsamples, *self.data.shape[1:])
-
+    chunks = (chunk_nsamples, *(max(x // SCALE_FACTOR, 1) for x in self.data.shape[1:]))
+    if SHARDING_ENABLED:
+        shards = (shard_nsamples, *((SCALE_FACTOR + 1) * x for x in chunks[1:]))
+        _logger.info(f"sharding enabled with shards: {shards} and chunks: {chunks}")
+    else:
+        shards = None
+        _logger.info(f"sharding disabled, using chunks: {chunks}")
     # maybe do dask conversion earlier? => usefull for parallel writing?
-    data = da.from_zarr(self.data, chunks=chunks)  # dont call compute to lazy load
+    data = da.from_zarr(
+        self.data, chunks=chunks, shards=shards
+    )  # dont call compute to lazy load
Contributor

looks good :)

Comment on lines 378 to 380
# warnings.warn(f"Zarr2 conflict: {last_msg}",
# DeprecationWarning
# )
Contributor

please remove if not important

Comment on lines 365 to 368
with warnings.catch_warnings(record=True) as caught:
self._store = LocalStore(self._store_path)
self.data_root = zarr.group(store=self._store)
# Raise DeprecationWarning only if a ZarrUserWarning was raised
Contributor

What is the cause of these warnings? Maybe add a short comment on why we want to ignore them here.
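The pattern under discussion, with the explanatory comment the reviewer asks for, could look like this (a self-contained sketch: `ZarrUserWarning` is stood in by a local class here, whereas in the PR it comes from zarr, and the function name is illustrative):

```python
import warnings

class ZarrUserWarning(UserWarning):
    """Stand-in for zarr's ZarrUserWarning."""

def open_legacy_store():
    # Pre-existing stores carry both Zarr v2 (.zgroup) and v3 (zarr.json)
    # metadata, which makes zarr emit a ZarrUserWarning; format 3 wins, so
    # the warning is benign and we only record it instead of surfacing it.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # the real code opens the store here; this warn() simulates zarr
        warnings.warn("Both zarr.json and .zgroup exist", ZarrUserWarning)
    return [w for w in caught if issubclass(w.category, ZarrUserWarning)]

# e.g. open_legacy_store() returns a list with exactly one recorded warning
```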

@github-project-automation github-project-automation bot moved this to In Progress in WeatherGen-dev Jan 6, 2026
* make zarrio subclasses

* store string literals for output storage in enum.


Successfully merging this pull request may close these issues.

zarr3 compaction
