201: create a SSOT doc for Cirrus payloads #203

Open

ping-e84 wants to merge 5 commits into stac-utils:main from ping-e84:docs/201/payload-definition

Conversation

ping-e84 (Collaborator) commented Jan 7, 2026

Related Issue(s):

Proposed Changes:

  1. Add a doc on payload structure (intended to be a single source of truth across Cirrus, stac-task, and other docs)
  2. Change payload references elsewhere in the stac-task documentation to point to the payload doc

PR Checklist:

  • I have added my changes to the CHANGELOG or a CHANGELOG entry is not required.

Use of AI:

  • Initialized basic document structure and syntax using Copilot; the documentation is otherwise human-written and edited

jkeifer (Member) left a comment

This is so great, I love it. I have some strong feelings about some of the details, so I've left some pretty nitpicky comments. I did call out a few things that I think are a little inaccurate and would like to see tweaked a bit. Let me know if you have any questions about how to address any of my feedback, or if you think I'm wrong about any of it (because I very well could be).

Comment on lines 15 to 18
2. Define the workflow:

- How should the workflow be structured? (*e.g.* single STAC task? series
of STAC tasks?)
jkeifer (Member):

This feels misleading to me, like the payload alone defines the workflow and the tasks executed therein. And there's a part of me, deep down, that wishes this were in fact true, that we could define a workflow in a payload ad hoc. But for various reasons that idea falls apart, and isn't even attractive at scale, such that we end up with workflow definitions to define the workflows, and payloads merely contain a pointer to a definition (the workflow name) that should be executed.
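
For illustration, a minimal sketch of a payload carrying only such a pointer; the workflow name here is hypothetical:

    {
      "type": "FeatureCollection",
      "features": [],
      "process": [
        {
          "workflow": "my-workflow"
        }
      ]
    }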

ping-e84 (Collaborator, Author):

Per in-person convo, changed the wording of this section to be less about defining and more about configuring the workflow.

Comment on lines 19 to 21
- What global parameters should be applied to all tasks in the workflow?
- How should each task be configured? (*i.e.*, what parameters are required
for each task)?
jkeifer (Member):

These are the same thing, in my opinion, so I'd try to use the same language for both, if it is even worth calling them out separately. That is, maybe this is just one item, "Input argument specification for task parameters (via both task-specific and globally accessible argument lists)," or something like that. That's not great either, but maybe it can spark something?

ping-e84 (Collaborator, Author):

Per in-person convo, changed wording to combine the ideas of task and workflow parameters.

- What global parameters should be applied to all tasks in the workflow?
- How should each task be configured? (*i.e.*, what parameters are required
for each task)?
3. How should Collections be assigned to the output STAC Item(s)?
jkeifer (Member):

I'm being very nitpicky, but it is both which collections and how they are assigned.

ping-e84 (Collaborator, Author):

Per in-person convo, changed wording per suggestion.

- How should each task be configured? (*i.e.*, what parameters are required
for each task)?
3. How should Collections be assigned to the output STAC Item(s)?
4. How and where should Items and Item Assets be uploaded?
jkeifer (Member):

Now I'm going to be overly nitpicky in the other direction: what does the payload specify about how? "How", to me, indicates via what mechanism... though maybe it's perfectly valid if it refers to arguments to the upload mechanism; I guess I could accept the reasoning that they change the behavior and thus the how.

I'm really getting nitpicky here. Feel free to ignore this one entirely.

ping-e84 (Collaborator, Author):

Per in-person convo, changed to indicate only that the where concept is key here.

Comment on lines 43 to 45
* - features
- [Item]
- An array of STAC Items
jkeifer (Member):

Zero or more STAC Items... and this is technically not required, per our current validation. Except it is required by the GeoJSON spec for FeatureCollections, so we probably should make it required here too if we are going to stick with this payload format much longer... but that's a separate problem, and the genie is likely out of that bottle. For now, let's just not call attention to the fact that it is optional and pretend it is required.

But the "zero or more STAC Items" piece is important. I'd honestly like to see a section discussing the case where you don't have valid STAC Items or Features. Say, for non-STACified data coming in, what do you have? Often just a reference to a file somewhere, right? So it is perfectly valid to have an empty features array, but what do you do with that reference?
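
For illustration, a sketch of a payload for that case: an empty features array, with the incoming non-STACified data referenced via a task parameter rather than a STAC Item. The task name, parameter, and URL are all hypothetical:

    {
      "type": "FeatureCollection",
      "features": [],
      "process": [
        {
          "tasks": {
            "stacify": {
              "source": "s3://example-bucket/scene-123/data.tif"
            }
          }
        }
      ]
    }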

ping-e84 (Collaborator, Author):

Per in-person convo, made the suggested change; also added a payload examples section at the end of the doc to demo this particular case.

ProcessDefinition Object
========================

The ``process`` block (array) must include a *single* ``ProcessDefinition``
jkeifer (Member):

Suggested change
The ``process`` block (array) must include a *single* ``ProcessDefinition``
The ``process`` array must include a *single* ``ProcessDefinition`` at the beginning. [more about chaining and that constraint...]

But maybe start with what one ProcessDefinition is, then explain the process array?

ping-e84 (Collaborator, Author):

Per in-person convo, changed wording to reflect the larger conversation about the process block...

- ``Map<string, Map<string, Any>>``
- Dictionary of collection-specific configuration options

.. _workflow-options:
jkeifer (Member):

I personally would want to explain these after introducing tasks in the next section. I think the tasks description is more illustrative of the concepts here, and I also think it's generally preferable for people to use task options over workflow options (explicit is better than implicit).

jkeifer (Member):

I think if you introduce the task options first, then you can also show an example here.

ping-e84 (Collaborator, Author):

Per in-person convo, switched the order of the tasks and workflow_options sub-sections to improve flow.

--------------------

An ``UploadOptions`` object includes parameters that define how Items and Item
Assets should be uploaded to the cloud (note that presently only AWS S3 is
jkeifer (Member):

Maybe this is correct, but is only AWS S3 supported? Or is any S3-compatible object store potentially supported? Recent boto3 versions allow changing the AWS endpoint via an env var (AWS_ENDPOINT_URL). And if you just want to target S3 and leave other AWS services with their default endpoints, that's possible using AWS_ENDPOINT_URL_S3.

Of course we need a better story here, like the ability to have per-bucket endpoints. But I'm not certain we are strictly limited to AWS S3 here.
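
For illustration, a minimal sketch of that approach, assuming a boto3/botocore version recent enough to honor the service-specific endpoint env var; the endpoint URL is hypothetical:

    import os
    import boto3

    # Route only S3 traffic to an S3-compatible, non-AWS endpoint;
    # other AWS services keep their default endpoints.
    os.environ["AWS_ENDPOINT_URL_S3"] = "https://objects.example.com"

    # The env var must be set before the client is created.
    s3 = boto3.client("s3")
    print(s3.meta.endpoint_url)  # -> https://objects.example.com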

ping-e84 (Collaborator, Author):

This comment is old, but I think it still applies: #20 (comment)
For upload, it looks like, yes, one could upload to non-S3, but stac-task doesn't make it easy:

  • the code still relies on boto3utils.s3() as the global client
  • URL generation assumes S3 URIs

Because of the use of stac-asset, downloads can use whatever fsspec can do (so multiple cloud providers), but you have to use stac_asset.Config, which is not locally documented in any way and is, frankly, a bit opaque for a non-expert user.

For both downloading and uploading, there would be a lot of customization involved for non-S3. So, yes, non-S3 is possible, but I think to say "supported" is a stretch, to be honest.

I removed the specific "only AWS S3 is supported" language and added another info box with a little more high-level detail...

Comment on lines 307 to 308
complete) - consider this when uploading Items to S3 or otherwise writing them
*prior* to returning them. Also note that ``collection_matchers`` are mutually
jkeifer (Member):

It is effectively required to assign collections prior to upload, no? I feel like the language here could be stronger. And that maybe we should have some gate in place to prevent people from doing an upload without doing the collection assignment? Or perhaps I'm getting too close to #186 with that latter idea.

ping-e84 (Collaborator, Author):

Per in-person convo...

> It is effectively required to assign collections prior to upload, no?

Yes, it is required; however, stac-task only assigns collections on an automated basis after the process method completes. So, given that, the suggestion here is to assign collections somewhere in the process method prior to uploading (even though, unfortunately, collections might then be assigned twice).

I added another info box to clarify this and hopefully provide some useful guidance to users...
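
For illustration, a minimal sketch of that guidance; the task name and collection id are hypothetical, and the item accessor and upload helper are assumed from stac-task's Task API rather than prescribed here:

    from stactask import Task

    class MyTask(Task):  # hypothetical task
        name = "my-task"

        def process(self, **kwargs) -> list[dict]:
            out = []
            for item in self.items:  # assumed accessor: Items parsed from the payload
                # Assign the collection *before* uploading: stac-task's automatic
                # collection assignment runs only after process() returns, so an
                # Item uploaded here would otherwise lack its collection.
                item.collection_id = "my-collection"
                item = self.upload_item_assets_to_s3(item)  # assumed upload helper
                out.append(item.to_dict())
            return out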

@@ -0,0 +1,435 @@
======================
Cirrus Process payload
jkeifer (Member):

This isn't actually a cirrus payload... we should keep this doc generic, i.e., non-cirrus-specific. Once we have this doc, we can revise the cirrus documentation to reference it and extend it with the cirrus-specific bits.

ping-e84 (Collaborator, Author):

Per in-person convo, changed the title to STAC Task Payload.

ping-e84 requested a review from jkeifer, January 12, 2026 22:24
jkeifer (Member) left a comment

Happy to chat again if you have questions. I'm still being quite nitpicky, so if we need to just go ahead and merge this, I am happy to approve as-is and we can iterate on it sometime in the future. The current state, I feel, could still be improved, but it is really good, and profoundly better than anything we have, so I'd be happy to take it as-is for the win.

ProcessDefinition Object
========================

The ``process`` block (array) should include a *single* ``ProcessDefinition``
jkeifer (Member):

I would use stronger language: "should" typically indicates a recommendation, versus "must" where something is strictly required.

Suggested change
The ``process`` block (array) should include a *single* ``ProcessDefinition``
The ``process`` block (array) must include a *single* ``ProcessDefinition``

Comment on lines +79 to +80
objects, ``stac-task`` only supports a single ``ProcessDefinition`` and will
only read the first one in the ``process`` array. **\***
jkeifer (Member):

I still find this language a bit misleading. stac-task supports multiple process definitions in the sense that it will not error if there are additional ones.

What do you think about something like this?

Suggested change
objects, ``stac-task`` only supports a single ``ProcessDefinition`` and will
only read the first one in the ``process`` array. **\***
objects, ``stac-task`` reads only the first ``ProcessDefinition`` in the
``process`` array. **\***
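
For illustration, a chained process array in which stac-task would read only the first ProcessDefinition; the task names are hypothetical, and the second definition would be consumed by whatever runs a subsequent task:

    {
      "process": [
        {
          "tasks": {
            "task-a": {}
          }
        },
        {
          "tasks": {
            "task-b": {}
          }
        }
      ]
    }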

Comment on lines +171 to +180
Here is an example of a ``workflow_options`` dictionary with a single global
parameter:

.. code-block:: json

   {
     "workflow_options": {
       "global_param": "global_value"
     }
   }
jkeifer (Member):

I was thinking maybe an example of how this works with task parameters would be valuable. Consider:

Suggested change

Here is an example of a ``workflow_options`` dictionary with a single global
parameter:

.. code-block:: json

   {
     "workflow_options": {
       "global_param": "global_value"
     }
   }

Here is an example of a ``workflow_options`` dictionary with a single global
parameter, and how ``workflow_options`` interacts with task-specific parameter
values:

.. code-block:: json

   {
     "workflow_options": {
       "param1": "global_value"
     },
     "tasks": {
       "task-a": {
         "param1": "value1"
       },
       "task-c": {
         "param2": "value2"
       }
     }
   }

In this case, ``task-a`` would get one keyword argument ``param1='value1'``,
where ``task-c`` would get two: ``param1='global_value'`` and
``param2='value2'``. If another task ``task-b`` also runs, it would get one
keyword argument ``param1='global_value'``.

Comment on lines +155 to +158
In the example above, a task named ``task-a`` would have the ``param1=value1``
key-value pair passed as a keyword, while ``task-c`` would have
``param2=value2`` passed in. If there were a ``task-b`` to be run, it would not be
passed any keywords.
jkeifer (Member):

I would put quotes around the values to indicate they are strings:

Suggested change
In the example above, a task named ``task-a`` would have the ``param1=value1``
key-value pair passed as a keyword, while ``task-c`` would have
``param2=value2`` passed in. If there were a ``task-b`` to be run, it would not be
passed any keywords.
In the example above, a task named ``task-a`` would have the ``param1='value1'``
key-value pair passed as a keyword, while ``task-c`` would have
``param2='value2'`` passed in. If there were a ``task-b`` to be run, it would not be
passed any keywords.

"features": [],
"process": [
{
"workflow": "image-ingest",
jkeifer (Member):

Workflow is an orchestration thing, not a stac-task thing. I don't think this belongs here.

],
"process": [
{
"workflow": "cog-generation",
jkeifer (Member):

Again, workflow is not part of the stac-task payload model, and is currently only added by cirrus (and swoop, if that counts).

],
"process": [
{
"description": "NDVI COG Workflow",
jkeifer (Member):

Where is description modeled? Is anything modeling this? I suppose I could go look myself, but I am lazy...

Description feels super weird to me, though. What are we describing? The payload? The process definition? It can't be the workflow, because the payload doesn't define the workflow; it merely contains a pointer to it in systems like cirrus, where we have a workflow concept. stac-task does not.

]
}

The workflow described by this payload configures the `asset-fetcher` task to
jkeifer (Member):

I take issue with framing this as describing a workflow. What is running this? Not stac-task. stac-task only supports single tasks; that multiple tasks can be present in the payload is a function of the utility of stac-task in workflow situations like cirrus.

I think you could reframe this as something like "multiple stac-tasks are often composed together into workflows by STAC workflow orchestration frameworks such as cirrus." My main concern here is to make it clear that a stac-task is concerned with a single task, itself, and that something above stac-task is required to define and execute a workflow.

Comment on lines +15 to +17
2. Configure the STAC workflow:

- What is the proposed workflow (*e.g.* single task? series of tasks?)
jkeifer (Member):

I am still suspect of this explanation, because the payload is not currently used to define a workflow anywhere. I don't think it reasonably can be. The payload is merely the input to a single task. Yes, it may contain extra configuration to be passed along to successive tasks, but that's outside the scope of stac-task. Granted, you invoke "STAC workflows" here, which is conceptually a thing meant to convey the possibility of having multiple stac-tasks composed together into a workflow, but even then the payload does not define what tasks are executed. A payload may include task parameters for tasks that will never run. A payload may not include task parameters for tasks that do not require any.

In other words, one cannot look at a payload and have any conception of what tasks will or will not be executed by a STAC workflow.
