Replies: 7 comments 12 replies
-
|
These things are new to me but it's interesting to see the thought process behind the decisions for implementing time in the model.
Agree this sounds reasonable. |
-
I agree that this is where we want to go in the long term, and I also agree that it is not mission critical for the minimum viable product; getting all the features in place is, I believe, more urgent and totally doable with the current time index setup.
Given that everything runs with monthly means at the moment, I don't think this is a reason to be embarrassed ;-)
I think we could validate the correct number of time steps just like the correct number of grid cells without having exact dates. Users could theoretically make up data, for example to do paleo experiments or just hypothetical climate scenarios -> then they have to make up dates.
This hydrology section is to make it flexible for running the full VE at different timescales which again is not something that we will do in the near future. I would rather focus on improving the rainfall generator than reading in daily rainfall data at this stage.
I think we could get around this by introducing a time axis which could contain the daily values, or were you thinking of having lots of nans in the other variables?
Again, the hydrology daily input is not mission critical and for the validation the daily output would be a nice to have but could be worked around. Not that I don't want the time awareness, but this comes with all sorts of other problems... I also think it would only make sense to have the daily inputs if the plants run at the same daily/sub-daily timescales - as long as this connection is not made there is no need to be overly detailed with the input, it might even give a false sense of certainty.
I have no preference what to use, I am not so familiar with either at the moment. I think we might want to have a think about the general time steps we allow in the model. Are we going to make this a monthly model, to allow for the animals to make sense, and have some components always on a shorter timescale? Or are we giving the user the option to mix and match timescales based on what they want to focus on?
In summary,
|
-
|
Trying to summarise a more wide-ranging IRL conversation about timing. @vgro, @sallymatson and @jacobcook1995 feel free to edit this message! We started from @vgro's question:
The consensus was that we do still want to allow for general flexibility, with the main motivation being that we want to be able to assess the calibration of the overall model at different intervals. We're not really so interested in the limits currently formally coded in the individual models, but in how the model works with steps in the ~14-60 day range. The coded model-specific update interval ranges are still useful information, but we probably want users to focus on a roughly month-long time step, whilst allowing them to explore how changing that slightly affects predictions. So - given that we want to be able to vary the update interval - what is the way forward?
|
-
|
I was going to comment on #1176, but it seems that this discussion is at the root of the problem, so I will do so here. Correct me if I'm wrong, but my impression is that a calendar is a somewhat arbitrary human concept. Neither animals, plants nor the weather care much about calendars, just about repetitive cycles, and these are split into equal-length intervals. So, for the calculations to be "correct", I think they should use a time schema that preserves these cycles. Models will be simpler this way, and any option that goes in this direction will be preferable. Now, humans do use calendars - some do, at least :P - and therefore input data is likely to be based on calendar-aware information and, to be useful, output data should also be provided in a calendar-aware format most of the time. From my point of view, what is needed here is an adaptor in the input and output layers that transforms between calendar-aware data and equal-interval data, and vice versa, in a way that works for users, making their life a bit easier when compiling the inputs and analysing the outputs, but that also works well for the models. This might be going in a different direction than what you had in mind during the discussion, but I feel this is what is lacking at the moment. |
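To make the adaptor idea a bit more concrete, here is a minimal sketch using only the standard library. The function names `interval_bounds` and `interval_index` are hypothetical and not part of the VE codebase; the sketch assumes fixed-length intervals counted in whole days from a start date.

```python
from datetime import date, timedelta


def interval_bounds(
    start: date, interval_days: int, n_intervals: int
) -> list[tuple[date, date]]:
    """Return (first_day, last_day) calendar dates for each equal-length interval."""
    bounds = []
    for index in range(n_intervals):
        first = start + timedelta(days=index * interval_days)
        last = first + timedelta(days=interval_days - 1)
        bounds.append((first, last))
    return bounds


def interval_index(start: date, interval_days: int, when: date) -> int:
    """Return the index of the equal-length interval containing a calendar date."""
    return (when - start).days // interval_days


# Output layer: label three 30 day model steps with calendar dates.
bounds = interval_bounds(date(2000, 1, 1), interval_days=30, n_intervals=3)

# Input layer: find which model step a dated observation belongs to.
step = interval_index(date(2000, 1, 1), 30, date(2000, 2, 15))
```

The models would only ever see the integer index, while the input and output layers use the two functions to translate to and from dates.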
-
|
@jacobcook1995 @TaranRallings @vgro @sallymatson I was just thinking about @jacobcook1995's point about output variation due to interval length. I was out running in the rain along a canal, thinking about precipitation, and suddenly wondered: don't we already have a problem here, because monthly input datasets using current data (like CRU etc. - not climate forecasts or paleo models) are going to have variations in totals simply because month lengths vary? And if the data preparation team are summing values from daily inputs to monthly totals (without having to do mad calendar corrections - which was @dalonsoa's point about an interface) then we have the same problem. How many of our input variables are "totals over the update interval"? I'm not saying that we suddenly switch to implementing this, but I wonder if our outputs are actually any less calendar biased now than they would be if we went calendar aware. |
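A small standard-library sketch of the size of that effect. The helper `daily_rate` and the 310 mm total are made up for illustration; the point is only that an identical monthly total implies a different daily rate depending on the month it falls in.

```python
import calendar


def daily_rate(monthly_total: float, year: int, month: int) -> float:
    """Convert a total over a calendar month into a mean daily rate."""
    days_in_month = calendar.monthrange(year, month)[1]
    return monthly_total / days_in_month


# Days in each month of a non-leap year.
month_lengths = [calendar.monthrange(2001, month)[1] for month in range(1, 13)]

# The same (made up) monthly precipitation total in January and February.
same_total = 310.0
january_rate = daily_rate(same_total, 2001, 1)
february_rate = daily_rate(same_total, 2001, 2)

# Relative difference in implied daily rate between the two months.
relative_difference = (february_rate - january_rate) / january_rate
```

With 31 and 28 day months the implied daily rates differ by roughly 10%, which is the calendar bias already baked into any "total over the update interval" input.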
-
|
Linking in an issue from the data science team that illustrates why what we currently have is a problem: https://github.com/ImperialCollegeLondon/ve_data_science/issues/161 |
-
|
Following this discussion closely. This may be a bit off topic, but I would like to add: as a user, I found it a bit counterintuitive that in the exported output file the "time" column (obtained from the timestamp) is in POSIX date format when the model uses a fixed 30 day interval. Personally, I think having "Day since 0", or both "Day since 0" and "POSIX", makes more sense. If I were to use VE as a researcher, I would be looking for daily rate outputs to design my studies or interpret my results. Was there a reason why we do not use days as the timestamp interval for input and output data? |
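For what it's worth, a "days since 0" column can be derived from POSIX values in a couple of lines. This is only a sketch: it assumes the exported "time" column holds POSIX seconds (which may not match the actual export format), and `days_since_start` is a hypothetical helper, not a VE function.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def days_since_start(posix_times: list[float]) -> list[float]:
    """Convert POSIX seconds into days elapsed since the first entry."""
    origin = posix_times[0]
    return [(value - origin) / SECONDS_PER_DAY for value in posix_times]


# Made up example: three outputs at a fixed 30 day interval.
start_posix = datetime(2013, 1, 1, tzinfo=timezone.utc).timestamp()
posix_times = [start_posix + step * 30 * SECONDS_PER_DAY for step in range(3)]

days = days_since_start(posix_times)
```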
-
@vgro and I had a chat recently about time axis dates, inputs and outputs that raised the issue of calendar aware dates in the VE (see here). We've been ducking this forever, but I think we're at a stage where it is finally blocking things we want to do.
Please could you have a read and leave any thoughts. I don't think this would be a difficult change - and I think it probably is on the critical path - but I want input before we turn it into an actual issue for action.
Important
Executive summary
It wouldn't be hard to pass actual calendar aware periods and update intervals to models - although the changes needed in models that already subdivide intervals internally might not be trivial. The current sticking point is that the models are not currently written to definitively return outputs that calibrate to the interval length. And fixing that is a longer issue.
At the moment, we're using `pint` to configure the run length and update interval. `pint` units for years and months are "average" durations, so are floats.

At the moment, we're simply ducking this, and updates iterate over the expected number of intervals just by index `[0, 1, ..., N]`.

That's a problem (with reasons this might not be urgent) because:
It's a horrible, clearly low quality state of affairs that we should be embarrassed by.
That @davidorme is embarrassed by.
We can't define the time coordinates of the time update axis properly so can't validate the time axis across inputs.
Well, OK, but we've survived so far without validation - as long as the number of intervals along the time axis in inputs is fine it works. Models know how long an interval is, and that is the main thing.
We have already had to implement hacks to work around this issue:
virtual_ecosystem/virtual_ecosystem/models/hydrology/hydrology_model.py
Lines 443 to 452 in 0f4f9ed
Sure, but those hacks work! Does a recurrent hack warning in every log matter?
It makes it impossible to link inputs or outputs at different temporal frequencies. If we have a central update frequency of - say - monthly values but a model wants to use daily inputs, then we have to have calendar-aware update periods to be able to tie update index [0] to '2000-01' in the monthly inputs and '2000-01-01/2000-01-31' in the daily inputs. I think the only way we could duck this would be to insist on 360 day years (12 x 30) - some climate models do this - but it's going to be a hard blocker for most users.
We already have a current issue about outputting data that needs date aware coordinates: Write out daily hydrology values for calibration validation #1176
This is hard to duck - it depends on whether daily inputs are on the path to the short-term minimum viable product. The hydrology model at least wants daily inputs and deciduous plants are going to need to know what the date is.
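As a rough illustration of the linking problem in the list above, here is a standard-library sketch that ties a monthly update index to both a monthly label and the matching range of daily dates. The helper `month_period` is hypothetical, and the average month length assumes `pint`'s default definitions (a 365.25 day year and a month of one twelfth of that).

```python
import calendar
from datetime import date

# Average month length implied by a 365.25 day year split into 12 (assumption).
AVERAGE_MONTH_DAYS = 365.25 / 12


def month_period(start_year: int, start_month: int, index: int) -> tuple[str, date, date]:
    """Return (monthly label, first day, last day) for monthly update `index`."""
    months_elapsed = (start_month - 1) + index
    year = start_year + months_elapsed // 12
    month = months_elapsed % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return f"{year:04d}-{month:02d}", date(year, month, 1), date(year, month, last_day)


# Update index 0 maps to '2000-01' and the daily range 2000-01-01 to 2000-01-31.
label, first_day, last_day = month_period(2000, 1, 0)

# Actual lengths of the first three updates versus the average month.
actual_days = [
    (month_period(2000, 1, index)[2] - month_period(2000, 1, index)[1]).days + 1
    for index in range(3)
]
```

None of the actual month lengths equals the average, which is why an index-only time axis cannot be matched to daily inputs without calendar information.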
What do we care about
There are many Python packages providing "calendar aware" dates:

`datetime`-like packages: `datetime` itself and improvements (Arrow, Pendulum, whenever)
`numpy` with `np.datetime64` (and then `xarray` and `pandas`, which more or less wrap `numpy`)

These differ in how fancy they get about things like leap seconds, daylight saving, timezones etc. I think we largely don't care about any of those: VE will run in small, single locations on at least yearly scales where we don't want to track daylight saving times. So, something like `pandas` provides calendar aware arrays of date ranges without us having to reinvent any wheels.

Possible solution
This `pandas` code allows us to generate a sequence of calendar aware periods from reasonably user friendly inputs. Honestly, it would be easier to configure `end_time` rather than `run_length`, but from where we are this works:

Explanations
This is using `pandas.Period`, not `pandas.Timestamp`. That's partly because the data do represent periods, not point values, but also because the timestamp implementation is edging back towards the fancier end of datetime handling, with all sorts of business days, timezones etc. malarkey.

The `pandas.Period` implementation is a lot simpler - each item is explicitly a set number of days, weeks, months or years (I think it would handle faster frequencies fine, but we aren't going there anytime soon). The possible letter codes we'd want to use are therefore `nD`, `nM` and `nY` for days, months and years (see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#period-aliases for details) and those seem pretty user friendly.

We might have wanted weeks, but `pandas` anchors weeks to a calendar week with a particular start day (e.g. Monday) rather than just 7 days. We can just use `7D` instead.

Limitation
Months and years automatically align to calendar months and years. So if you use `2020-01-15` as a start date with an update interval of `1M`, the period you get back is `2020-01-01` to `2020-01-31 23:59:59.99999`. It's the same for years: `2019-06-23` plus `1Y` could be a couple of different things because of the leap year.

This is reasonable behaviour - it is clear what a month from `2020-01-01` is. If you add a month onto `2020-01-15` then the number of days you should add is ambiguous. Same with years - because of leap years. The restriction for users would be: if you want to use monthly or annual steps, then the start date needs to be the start of a calendar month or year, not some weird mid-period start.

That seems like a fringe use case anyway. I don't doubt we could work around it, but we'd have to come up with arbitrary rules for resolving those ambiguities. We might circle back to this later, but something like this fixes this issue for now.
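For anyone who wants to poke at the behaviour described above, here is a minimal sketch using `pandas.period_range` and `pandas.Period`. The variable names (`start_time`, `update_interval`, `n_updates`) are illustrative only and are not the actual VE configuration keys, and passing a number of updates directly is an assumption made to keep the example short.

```python
import pandas as pd

# Illustrative settings only: not the actual VE configuration keys.
start_time = "2020-01-01"
update_interval = "1M"
n_updates = 3

# A sequence of calendar aware monthly periods, one per model update.
periods = pd.period_range(start=start_time, periods=n_updates, freq=update_interval)

for index, period in enumerate(periods):
    # Each period knows its own calendar start and end.
    print(index, period, period.start_time.date(), period.end_time.date())

# A mid-month start date is aligned to the containing calendar month.
mid_month = pd.Period("2020-01-15", freq="M")
print(mid_month, mid_month.start_time.date(), mid_month.end_time.date())
```

The last two lines show the limitation: the `2020-01-15` start is silently mapped onto the whole of January 2020.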
So... thoughts?