Improve group_average performance with Dask (and flox) #763
Replies: 1 comment
Hi @will-s-hart, thanks a ton for digging deeper into this issue. We're definitely trying to find ways to improve performance, so I appreciate you finding a bottleneck in the […].

This makes sense to me. I wonder if the data variable […]. Your solution does look promising. If you're interested in opening an issue and trying out a PR, I'm all for it :)
Hello! 👋
Ideally, with flox installed, `temporal.group_average` should be able to operate lazily on Dask-backed datasets. However, that doesn't quite seem to be the case at the moment: with fairly large Dask-backed datasets, `group_average` has been somewhat slow (e.g., ~10 seconds for 6 GB of data) and appears to load either the full input dataset or a similarly sized intermediate array into memory at some point, even though the output itself is Dask-backed.
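For reference, a plain xarray grouped mean on Dask-backed data illustrates the lazy behavior one would expect from `group_average` (a minimal sketch with synthetic data; the variable name and shapes are illustrative, not from the report):

```python
import pandas as pd
import xarray as xr
import dask.array as da

# Synthetic Dask-backed data variable standing in for a large dataset.
time = pd.date_range("2000-01-01", periods=24, freq="MS")
tas = xr.DataArray(
    da.random.random((24, 10, 10), chunks=(6, 10, 10)),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="tas",
)

# A grouped mean (accelerated by flox when it is installed) should only
# build a task graph here -- no data is loaded or computed yet.
monthly = tas.groupby("time.month").mean()
print(type(monthly.data))  # still a Dask array
```

The complaint is that `group_average` does not preserve this laziness end to end, even though its final output is Dask-backed.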
After a bit of digging, the issue seems to be here:
https://github.com/xCDAT/xcdat/blob/main/xcdat/temporal.py#L1561-L1562
Specifically, the line

```python
xr.where(dv.copy().isnull(), 0.0, weights)
```

seems to cause significant memory pressure when `dv` is Dask-backed but `weights` is not. I experimented with the below, which seems promising:
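The poster's snippet did not survive extraction, so purely as an assumption (not the actual patch), here is a sketch of the kind of change described: chunking the eagerly computed `weights` to match `dv` before the `xr.where` call, so the masking is done blockwise and stays lazy. All names and shapes are hypothetical stand-ins for the xCDAT internals:

```python
import numpy as np
import xarray as xr
import dask.array as da

# Hypothetical stand-ins: `dv` is a Dask-backed data variable and
# `weights` is an eagerly computed NumPy-backed DataArray (the real
# arrays are much larger).
dv = xr.DataArray(
    da.random.random((24, 50, 50), chunks=(6, 50, 50)),
    dims=("time", "lat", "lon"),
)
weights = xr.DataArray(np.ones(24), dims=("time",))

# Problematic pattern from temporal.py: mixing the NumPy `weights` into
# an operation on the Dask `dv`.
# masked = xr.where(dv.copy().isnull(), 0.0, weights)

# Possible mitigation (an assumption, not necessarily the poster's exact
# change): align the weights' chunking with dv so the result is built
# blockwise and no full-size intermediate is materialized.
lazy_weights = weights.chunk({"time": 6})
masked = xr.where(dv.isnull(), 0.0, lazy_weights)

print(type(masked.data))  # Dask array; nothing has been computed yet
```

Whether this matches the change the poster actually tried is unknown; the point is only that keeping all operands chunked lets the masking stay lazy.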
In my case, this reduced the runtime to under 0.5 seconds and avoided the memory issue. That said, I'm not deeply familiar with the xCDAT codebase, so I'm unsure whether this approach is robust or generalizable.
Thought I'd ask for opinions here first, but happy to open an Issue if this seems worth pursuing. Thanks!