201: create a SSOT doc for Cirrus payloads #203
ping-e84 wants to merge 5 commits into stac-utils:main from
Conversation
jkeifer
left a comment
This is so great, I love it. I have some strong feelings about some of the details, so I've left some pretty nitpicky comments. I did call out a few things that I think are a little inaccurate and would like to see tweaked a bit. Let me know if you have any questions about how to address any of my feedback, or think I'm wrong about any of it (because I very well could be).
docs/source/payload.rst
Outdated
    2. Define the workflow:

       - How should the workflow be structured? (*e.g.* single STAC task? series
         of STAC tasks?)
This feels misleading to me, like the payload alone defines the workflow and the tasks executed therein. And there's a part of me deep down that wishes this were in fact true, that we could define a workflow in a payload ad hoc. But for various reasons that idea falls apart, and isn't even attractive at scale, such that we end up with workflow definitions to define the workflows, and payloads merely contain a pointer to a definition (the workflow name) that should be executed.
per in-person convo, changed the wording of this section to be less about defining and more about configuring the workflow.
docs/source/payload.rst
Outdated
    - What global parameters should be applied to all tasks in the workflow?
    - How should each task be configured? (*i.e.*, what parameters are required
      for each task?)
These are the same thing, in my opinion, so I'd try to use the same language for both, if it is even worth calling them out separately. That is, maybe this is just one item, "Input argument specification for task parameters (via both task-specific and globally accessible argument lists)." Or something. That's not great either, but maybe it can spark something???
per in-person convo, changed wording to combine the ideas of task and workflow parameters.
docs/source/payload.rst
Outdated
    - What global parameters should be applied to all tasks in the workflow?
    - How should each task be configured? (*i.e.*, what parameters are required
      for each task?)
    3. How should Collections be assigned to the output STAC Item(s)?
I'm being very nitpicky, but it is both which collections and how they are assigned.
per in-person convo, changed wording per suggestion.
docs/source/payload.rst
Outdated
    - How should each task be configured? (*i.e.*, what parameters are required
      for each task?)
    3. How should Collections be assigned to the output STAC Item(s)?
    4. How and where should Items and Item Assets be uploaded?
Now I'm going to be overly nitpicky in the other direction: what does the payload specify about how? How, to me, indicates via what mechanism... maybe it's perfectly valid though, referring to arguments to the upload mechanism; I guess I could accept the reasoning that they change the behavior and thus the how?
I'm really getting nitpicky here. Feel free to ignore this one entirely.
per in-person convo, changed to only indicate that the where concept is key here.
docs/source/payload.rst
Outdated
    * - features
      - [Item]
      - An array of STAC Items
Zero or more STAC items...and this is technically not required, per our current validation. Except it is required by the GeoJSON spec for FeatureCollections, so we probably should make it required here too if we are going to stick with this payload format much longer...but that's a separate problem, and the genie is likely out of that bottle. For now let's just not call attention to the fact that it is optional and pretend it is required.
But the zero or more STAC items piece is important. I'd honestly like to see a section discussing that case, where you don't have valid STAC items or Features. Say for non-STACified data coming in, what do you have? Often just a reference to a file somewhere, right? So it is perfectly valid to have an empty features array, but what do you do with that reference?
per in-person convo, made the suggested change -- also added a payload examples section at the end of the doc to demo this particular case.
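To make that case concrete, here is a minimal sketch of such a payload (the task name and parameter shown are hypothetical illustrations, not from the doc or the stac-task API):

```python
# Sketch of the "no STAC items yet" case: the features array is empty
# because nothing has been STACified, so the reference to the raw data
# travels as a task parameter instead. "stacify" and "source_href" are
# illustrative names only.
payload = {
    "type": "FeatureCollection",
    "features": [],  # valid: zero or more STAC Items
    "process": [
        {
            "tasks": {
                "stacify": {"source_href": "s3://bucket/raw/scene.tif"}
            }
        }
    ],
}
```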
docs/source/payload.rst
Outdated
    ProcessDefinition Object
    ========================

    The ``process`` block (array) must include a *single* ``ProcessDefinition``
- The ``process`` block (array) must include a *single* ``ProcessDefinition``
+ The ``process`` array must include a *single* ``ProcessDefinition`` at the beginning. [more about chaining and that constraint...]
But maybe start with what one ProcessDefinition is, then explain the process array?
Per in-person convo, changed wording to reflect the larger conversation about the process block...
docs/source/payload.rst
Outdated
      - ``Map<string, Map<string, Any>>``
      - Dictionary of collection-specific configuration options

    .. _workflow-options:
I personally would want to explain these after introducing tasks in the next section. I think the tasks description is more illustrative of the concepts here, and I also think it's generally preferable for people to use task options over workflow options (explicit is better than implicit).
I think if you introduce the task options first, then you can also show an example here.
Per in-person convo, switched the order of the tasks and workflow_options sub-sections to improve flow.
docs/source/payload.rst
Outdated
    --------------------

    An ``UploadOptions`` object includes parameters that define how Items and Item
    Assets should be uploaded to the cloud (note that presently only AWS S3 is
Maybe this is correct, but is only AWS S3 supported? Or is any S3-compatible object store potentially supported? Recent boto3 versions allow changing the AWS endpoint via an env var (AWS_ENDPOINT_URL). And if you just want to target S3 and leave other AWS services with their default endpoints that's possible using AWS_ENDPOINT_URL_S3.
Of course we need a better story here, like the ability to have per-bucket endpoints. But I'm not certain we are strictly limited to AWS S3 here.
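For context, overriding the endpoint is a one-liner in boto3; a minimal sketch (the endpoint URL is a placeholder):

```python
import os

import boto3

# Recent botocore versions honor AWS_ENDPOINT_URL / AWS_ENDPOINT_URL_S3,
# so an S3-compatible store can be targeted without code changes:
os.environ["AWS_ENDPOINT_URL_S3"] = "http://localhost:9000"  # placeholder

# Or set it explicitly when constructing the client:
s3 = boto3.client("s3", endpoint_url="http://localhost:9000")
```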
This comment is old, but I think it still applies: #20 (comment)
For upload, it looks like, yes, one could upload to non-S3, but stac-task doesn't make it easy:

- the code still relies on boto3utils.s3() as the global client
- URL generation assumes S3 URIs
Because of the use of stac-asset, downloads can use whatever fsspec can do (so multiple cloud providers) but you have to use stac_asset.Config which is not locally documented in any way and is, frankly, a bit opaque for a non-expert user.
For both downloading and uploading, there would be a lot of customization involved for non-S3, so, yes, non-S3 is possible, but I think to say "supported" is a stretch to be honest.
I removed the specific "only AWS S3 is supported" language and added another info box with a little more high-level detail...
docs/source/payload.rst
Outdated
    complete) - consider this when uploading Items to S3 or otherwise writing them
    *prior* to returning them. Also note that ``collection_matchers`` are mutually
It is effectively required to assign collections prior to upload, no? I feel like the language here could be stronger. And that maybe we should have some gate in place to prevent people from doing an upload without doing the collection assignment? Or perhaps I'm getting too close to #186 with that latter idea.
Per in-person convo...
> It is effectively required to assign collections prior to upload, no?
Yes, it is required; however, stac-task only assigns collections automatically after the process method completes. Given that, the suggestion here is to assign collections somewhere in the process method prior to uploading (even though, unfortunately, collections might then be assigned twice).
I added another info box to clarify this and hopefully provide some useful guidance to users...
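That guidance amounts to something like this minimal sketch (`upload` is a stand-in for whatever upload helper the task actually uses, not a stac-task API):

```python
from typing import Any


def upload(item: dict[str, Any]) -> None:
    """Stand-in for the task's real upload helper."""


def process(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Assign the collection *before* uploading; stac-task will also assign
    # collections automatically after process() returns, so the assignment
    # may happen twice, but the uploaded copy is correct either way.
    for item in items:
        item["collection"] = "my-collection"  # illustrative value
        upload(item)
    return items
```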
docs/source/payload.rst
Outdated
    @@ -0,0 +1,435 @@
    ======================
    Cirrus Process payload
This isn't actually a cirrus payload...we should keep this doc generic, i.e., non-cirrus-specific. Once we have this doc we can revise the cirrus documentation to reference this doc and extend it with the cirrus-specific bits.
per in-person convo, changed title to STAC Task Payload.
jkeifer
left a comment
Happy to chat again if you have questions. I'm still being quite nitpicky, so if we just need to go ahead and merge this I am happy to approve as-is and we can iterate on it sometime in the future. I feel the current state could still be improved, but it is really good, and profoundly better than anything we have, so I'd be happy to take it as-is for the win.
    ProcessDefinition Object
    ========================

    The ``process`` block (array) should include a *single* ``ProcessDefinition``
I would use stronger language: "should" typically is a recommendation, versus "must" where something is strictly required.
- The ``process`` block (array) should include a *single* ``ProcessDefinition``
+ The ``process`` block (array) must include a *single* ``ProcessDefinition``
    objects, ``stac-task`` only supports a single ``ProcessDefinition`` and will
    only read the first one in the ``process`` array. **\***
I still find this language a bit misleading. stac-task supports multiple process definitions in the sense that it will not error if there are additional ones.
What do you think about something like this?
- objects, ``stac-task`` only supports a single ``ProcessDefinition`` and will
- only read the first one in the ``process`` array. **\***
+ objects, ``stac-task`` reads only the first ``ProcessDefinition`` in the
+ ``process`` array. **\***
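In other words, the behavior is roughly this (a sketch of what is described, not the actual stac-task code):

```python
from typing import Any


def get_process_definition(payload: dict[str, Any]) -> dict[str, Any]:
    # Extra entries in the "process" array are ignored rather than
    # rejected; only the first one configures the task.
    return payload["process"][0]
```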
    Here is an example of a ``workflow_options`` dictionary with a single global
    parameter:

    .. code-block:: json

       {
         "workflow_options": {
           "global_param": "global_value"
         }
       }
I was thinking maybe an example of how this works with task parameters would be valuable. Consider:
- Here is an example of a ``workflow_options`` dictionary with a single global
- parameter:
-
- .. code-block:: json
-
-    {
-      "workflow_options": {
-        "global_param": "global_value"
-      }
-    }
+ Here is an example of a ``workflow_options`` dictionary with a single global
+ parameter, and how ``workflow_options`` interacts with task-specific parameter
+ values:
+
+ .. code-block:: json
+
+    {
+      "workflow_options": {
+        "param1": "global_value"
+      },
+      "tasks": {
+        "task-a": {
+          "param1": "value1"
+        },
+        "task-c": {
+          "param2": "value2"
+        }
+      }
+    }
+
+ In this case, ``task-a`` would get one keyword argument ``param1='value1'``,
+ where ``task-c`` would get two: ``param1='global_value'`` and
+ ``param2='value2'``. If another task ``task-b`` also runs, it would get one
+ keyword argument ``param1='global_value'``.
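The merge semantics in that example come down to task-specific values overriding the globals; roughly this (a sketch, not stac-task's actual implementation):

```python
from typing import Any


def task_kwargs(process: dict[str, Any], task_name: str) -> dict[str, Any]:
    # Task-specific parameters take precedence over workflow_options.
    merged = dict(process.get("workflow_options", {}))
    merged.update(process.get("tasks", {}).get(task_name, {}))
    return merged


process = {
    "workflow_options": {"param1": "global_value"},
    "tasks": {"task-a": {"param1": "value1"}, "task-c": {"param2": "value2"}},
}
assert task_kwargs(process, "task-a") == {"param1": "value1"}
assert task_kwargs(process, "task-c") == {
    "param1": "global_value",
    "param2": "value2",
}
assert task_kwargs(process, "task-b") == {"param1": "global_value"}
```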
    In the example above, a task named ``task-a`` would have the ``param1=value1``
    key-value pair passed as a keyword, while ``task-c`` would have
    ``param2=value2`` passed in. If there were a ``task-b`` to be run, it would not be
    passed any keywords.
I would put quotes around the values to indicate they are strings:
- In the example above, a task named ``task-a`` would have the ``param1=value1``
- key-value pair passed as a keyword, while ``task-c`` would have
- ``param2=value2`` passed in. If there were a ``task-b`` to be run, it would not be
- passed any keywords.
+ In the example above, a task named ``task-a`` would have the ``param1='value1'``
+ key-value pair passed as a keyword, while ``task-c`` would have
+ ``param2='value2'`` passed in. If there were a ``task-b`` to be run, it would not be
+ passed any keywords.
| "features": [], | ||
| "process": [ | ||
| { | ||
| "workflow": "image-ingest", |
Workflow is an orchestration thing, not stac-task. I don't think this belongs here.
    ],
    "process": [
      {
        "workflow": "cog-generation",
Again, workflow is not part of the stac-task payload model, and is currently only added by cirrus (and swoop if you can count it).
    ],
    "process": [
      {
        "description": "NDVI COG Workflow",
Where is description modeled? Is anything modeling this? I suppose I could go look myself, but I am lazy...
Description feels super weird to me though. What are we describing? The payload? The process definition? It can't be the workflow, because the payload doesn't define the workflow; it merely contains a pointer to it in systems like cirrus where we have a workflow concept. stac-task does not.
      ]
    }

    The workflow described by this payload configures the `asset-fetcher` task to
I take issue with framing this as describing a workflow. What is running this? Not stac-task. stac-task only supports single tasks. That multiple tasks can be present in the payload is a function of the utility of stac-task in workflow situations like cirrus.
I think you could reframe this as something like "multiple stac-tasks are often composed together into workflows by STAC Workflow orchestration frameworks such as cirrus." My main concern here is to make it clear that a stac-task is concerned with a single task, itself, and something above stac-task is required to define and execute a workflow.
    2. Configure the STAC workflow:

       - What is the proposed workflow (*e.g.* single task? series of tasks?)
I am still suspect of this explanation, because the payload is not currently used to define a workflow anywhere. I don't think it reasonably can be. The payload is merely the input to a single task. Yes, it may contain extra configuration to be passed along to successive tasks, but that's outside the scope of stac-task. Granted, you invoke "STAC workflows" here, which is conceptually a thing meant to convey the possibility of having multiple stac-tasks composed together into a workflow, but even then the payload does not define what tasks are executed. A payload may include task parameters for tasks that will never run. A payload may not include task parameters for tasks that do not require any.
In other words, one cannot look at a payload and have any conception of what tasks will or will not be executed by a STAC workflow.