
Improve metadata UX: add metadata preview and bundled download support#191

Merged
phargogh merged 54 commits into feature/serverless-file-access from
feature/166-metadata-viz
Feb 27, 2026

Conversation


@claire-simpson claire-simpson commented Dec 10, 2025

This PR adds support for metadata discovery and bundled downloads across both CKAN resources and zip-expanded “sources”, with special handling for shapefiles and capacity to handle large files.

Metadata preview UI

  • Adds an inline metadata preview icon next to resources and sources.
  • Uses Font Awesome icons (instead of a pixelated image for the download button)
  • Displays YAML metadata in a Bootstrap modal without navigating away.
  • Hides standalone YAML resources when they are attached to a data file.

Bundled downloads

  • Introduces new blueprint routes to bundle:
    • A resource + its attached YAML
    • Shapefiles with all required sidecar files (.shp, .dbf, .shx, .prj, etc.)
    • Zip-expanded sources (including folder downloads) + metadata. When a source represents a directory (via gate.url --> .zip), the zip is included directly in the tar bundle alongside metadata (no unzip/repack). Note that support for downloading folders like ET0 in the 3Ps Colombia package will hopefully be temporary; a future PR will allow us to automatically tar all files enclosed in a folder rather than requiring us to upload an adjacent zip for each downloadable folder.
    • Falls back to direct download for single non-shapefile resources with no metadata (so we don't have to unnecessarily archive data)

Streaming tar creation for large files

  • Bundles are streamed using tarfile in pipe mode to avoid loading large files into memory (this caused problems when trying to download global rasters or large data packages like Colombia's)
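
As background for the pipe-mode approach described above, here is a minimal self-contained sketch (not the extension's actual code) of how tarfile's `"w|"` mode streams an archive to any write-only sink without seeking; `ChunkWriter` and `build_streamed_tar` are illustrative names:

```python
import io
import tarfile

class ChunkWriter:
    """File-like sink that collects chunks as tarfile writes them.

    In the real blueprint the chunks would be yielded into a streaming
    HTTP response instead of accumulated in a list.
    """
    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)

def build_streamed_tar(members):
    """members: iterable of (name, payload_bytes) pairs (hypothetical helper)."""
    sink = ChunkWriter()
    # Pipe mode "w|" writes strictly sequentially and never seeks,
    # so the sink only needs a write() method.
    with tarfile.open(fileobj=sink, mode="w|") as tar:
        for name, payload in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)  # size must be known before writing
            tar.addfile(info, io.BytesIO(payload))
    return b"".join(sink.chunks)

archive = build_streamed_tar(
    [("data.tif", b"raster bytes"), ("data.tif.yml", b"yaml: 1")])
```

Because the archive is produced chunk by chunk, memory use stays bounded by the chunk size rather than by the total archive size.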

Resource layout

  • Refactors resource rows to separate the clickable filename from clickable action icons
  • Ensures metadata icons remain clickable even for non-downloadable resources

Other important notes:

  • YAML metadata is treated as an attachment, not a standalone download. This PR does not add functionality for previewing standalone YAMLs, CSVs, or any file other than sidecar YAML metadata.
  • To do (mentioned above): Determine how to bundle all files within a folder that is downloadable
  • To do: Long-running bundle-tar downloads could tie up uWSGI workers, which risks exhausting all CKAN workers and blocking normal page requests (e.g., the SRTM tar download can take 50+ minutes). Possible solutions include offloading tar/streaming to new infrastructure (e.g., extending the gcsproxy VM pool and autoscaling it) or using Cloud Run (which imposes a 60-minute cap, which is a bit problematic).

Fixes #166, #154 (partly)

[Screenshot: 2025-12-19 at 3:03:51 PM]

…ownload; don't show GMM yamls in resource list #166
@claire-simpson claire-simpson changed the title Improve metadata UX Improve metadata UX: add metadata preview and bundled download support Dec 19, 2025
@claire-simpson claire-simpson marked this pull request as ready for review December 19, 2025 23:09
Member

@phargogh phargogh left a comment


Thanks @claire-simpson! It is really neat to see the built-in streaming features that are included in python's stdlib tarfile implementation, and nice job combining tarfile with OS-level file descriptors to get this to work! And so nice to see the UI improvements, too ... it looks really nice.

I had a range of comments and suggestions, and I think there might be a few places where we can offload work from CKAN to other services so we aren't proxying more than we need to. I definitely think we should move the actual tarring to a separate service, and ideally avoid writing anything to disk, which I think should be possible by just reading in data and writing it right out to a response. Anyways, let's talk more about this and thanks so much for all your work!

name_lower = name.lower()

# Only attach metadata to the main file for shapefiles
if name_lower.endswith(".shx") or name_lower.endswith(".dbf") or name_lower.endswith(".prj") or name_lower.endswith(".cpg"):
Member


This can be nicely abbreviated if you'd like:

Suggested change
if name_lower.endswith(".shx") or name_lower.endswith(".dbf") or name_lower.endswith(".prj") or name_lower.endswith(".cpg"):
if name_lower.endswith((".shx", ".dbf", ".prj", ".cpg")):

Author


Oh neat, didn't know endswith could be used like that!

Comment on lines +29 to +38
for r in resources:
n = r.get("name") or os.path.basename(r.get("url", "")) or ""
if _is_yaml_name(n):
yaml_by_name[n.lower()] = r

attached = {}
for r in resources:
# Determine the data filename we will match against
name = r.get("name") or os.path.basename(r.get("url", "")) or ""
name_lower = name.lower()
Member


Maybe these two loops could be consolidated? It doesn't look to me like resources is modified in the first loop over the list, and also the name is duplicated, so we could further save some time by just consolidating things.
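
For illustration, a consolidated version might look like the sketch below (the helper name and the `<file>.yml` sidecar-naming convention are assumptions here, not the extension's actual code). A strict single loop is awkward because a YAML can be listed after the data file it describes, so the sketch partitions in one pass and then matches against the complete YAML index, computing each name only once:

```python
import os

def _is_yaml_name(name):
    # Hypothetical stand-in for the extension's helper.
    return name.lower().endswith((".yml", ".yaml"))

def attach_metadata(resources):
    """Map each data resource name to its sidecar YAML resource, if any."""
    yaml_by_name = {}
    data_resources = []
    for r in resources:
        # Name is computed exactly once per resource.
        name = r.get("name") or os.path.basename(r.get("url", "")) or ""
        if _is_yaml_name(name):
            yaml_by_name[name.lower()] = r
        else:
            data_resources.append((name, r))

    attached = {}
    for name, r in data_resources:
        # Assumed sidecar convention: "<data filename>.yml".
        meta = yaml_by_name.get(name.lower() + ".yml")
        if meta is not None:
            attached[name] = meta
    return attached
```
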

- Shapefiles: only the `.shp` gets metadata (`vector.shp.yml`).
- We ignore standalone YAMLs *without* a corresponding data file.
"""
resources = pkg_dict.get("resources", []) or []
Member


Sorry, I'm new to or in this context, but wouldn't pkg_dict.get("resources", []) return an empty list if resources is not in pkg_dict? What does the or [] add in this case?

Author


This is to safeguard in case pkg_dict['resources'] exists but is None, to ensure that resources gets set to an empty list that could be iterated over so an error isn't thrown later if trying to iterate over None. I don't know if pkg_dict['resources'] would ever be None, but added this to be safe...

Member


Yeah, I'm not familiar enough with CKAN to confidently say whether 'resources' would ever be none, but it makes sense to guard against the possibility. Could you add a comment about this to help future us remember why the or [] is used here?

Author


Actually it looks like the next lines would return {} if resources is None, so maybe or [] isn't necessary!
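
For the record, the behavior under discussion comes down to when `dict.get`'s default applies: only when the key is missing, never when the key exists with a `None` value.

```python
pkg_dict = {"resources": None}

# Key exists, so the default is NOT used -- .get returns None.
print(pkg_dict.get("resources", []))        # None

# The `or []` is what actually catches the None value.
print(pkg_dict.get("resources", []) or [])  # []

# Only a missing key triggers the default.
print({}.get("resources", []))              # []
```
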

{% endfor %}
{% endblock %}
{% set name_l = (resource.name or '')|lower %}
{% set is_yaml = name_l.endswith('.yml') or name_l.endswith('.yaml') %}
Member


You could shorten this one too if you wanted, but not as important as the other one with like 5 ors chained together!

For what it's worth, a single .endswith(some_tuple_of_extensions) appears to be nominally faster (about twice as fast) than chaining ors:

>>> import timeit
>>> timeit.timeit('"foo.shp".endswith((".shx", ".dbf", ".prj", ".cpg", ".shp"))')
0.07806391688063741
>>> timeit.timeit('path.endswith(".shx") or path.endswith(".dbf") or path.endswith(".prj") or path.endswith(".cpg") or path.endswith(".shp")', setup='path="foo.shp"')
0.1667984169907868

Comment on lines +88 to +91
if (name_lower.endswith('.shx') or
name_lower.endswith('.dbf') or
name_lower.endswith('.prj') or
name_lower.endswith('.cpg')):
Member


Since we have a similar set of checks elsewhere, might it be worth creating an "is_shapefile()" function?



Also noticed that SHAPEFILE_PART_EXTS defined in blueprint.py includes more potential shapefile part extensions than this list -- might we want to include the others as well?

Author


Sure, can definitely add a helper with all of the extensions mentioned in blueprint.py (except .shp!)
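
A sketch of such a helper (names are illustrative; the extension list follows the SHAPEFILE_PART_EXTS contents described in this thread, minus `.shp` itself):

```python
# Sidecar extensions that accompany a .shp but should not carry
# their own metadata attachment. Mirrors SHAPEFILE_PART_EXTS from
# blueprint.py as described above, with ".shp" excluded.
SHAPEFILE_SIDECAR_EXTS = (".dbf", ".shx", ".prj", ".cpg", ".qix",
                          ".sbn", ".sbx", ".shp.xml")

def is_shapefile_sidecar(name: str) -> bool:
    """True if the filename is a shapefile sidecar (case-insensitive)."""
    return name.lower().endswith(SHAPEFILE_SIDECAR_EXTS)
```

A single helper keeps the blueprint and the template helpers from drifting out of sync as the extension list grows.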

if not url:
abort(400, "Missing url")
try:
data = _download_bytes(url)
Member


Suppose url represents some super big file. Given the current implementation of _download_bytes(), wouldn't this be both a) occupying a WSGI worker and b) reading the whole file into memory?

Couldn't we just redirect the client to the right file instead of serving the data through our instance here? I'm not super familiar with redirects, but it seems like maybe that would simplify things a bit and offload some of the responsibility onto other services, wouldn't it?

Author


Yes I think we could redirect and that would be a great improvement! I also don't think this direct download will get used all that often because this is just the fallback for files that have no metadata/aren't shapefiles
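
A sketch of the decision logic for that fallback (illustrative names, not the extension's actual code). In the Flask view, the "redirect" branch would return `flask.redirect(resource_url)` so the client fetches the bytes straight from storage instead of proxying them through a CKAN worker:

```python
def choose_download_strategy(resource_url, has_metadata, is_shapefile):
    """Decide whether a resource needs bundling or can be redirected.

    Single resources with no sidecar metadata and no shapefile parts
    have nothing to bundle, so a plain HTTP redirect suffices.
    """
    if not has_metadata and not is_shapefile:
        # In the view: return flask.redirect(resource_url, code=302)
        return ("redirect", resource_url)
    return ("bundle", resource_url)
```
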

abort(502)

filename = download_name or _filename_from_url(url) or "download"
return send_file(BytesIO(data), as_attachment=True, download_name=filename, mimetype=mimetype)
Member


Building on what I commented on _download_bytes above, wouldn't send_file proxy the file through ckan? Wouldn't flask.redirect be helpful here?

return send_file(BytesIO(data), as_attachment=True, download_name=filename, mimetype=mimetype)


def _stream_tar_response(out_name: str, build_tar_fn):
Member


Probably not critical for this PR, but when we're moving this to a separate service, I think it'll be important to implement range requests to handle the ability to resume downloads that fail partway through for whatever reason.

Author


That makes sense. I'm not sure it's possible with how I've implemented it here, which is definitely a downside
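
For context on the range-request idea: a client resumes by sending a `Range: bytes=START-END` header, and the server answers 206 Partial Content with a matching Content-Range. A minimal parsing sketch (illustrative only, and as noted above this helps a tar endpoint only if the exact same bytes can be regenerated deterministically on the second request):

```python
import re

def parse_byte_range(header, total_size):
    """Return an inclusive (start, end) pair, or None for a full-file request.

    Handles "bytes=START-END", the open-ended "bytes=START-", and the
    suffix form "bytes=-N" (the last N bytes).
    """
    if not header:
        return None
    m = re.fullmatch(r"bytes=(\d*)-(\d*)", header.strip())
    if not m or (not m.group(1) and not m.group(2)):
        return None
    if m.group(1):
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else total_size - 1
    else:
        # Suffix form: the last N bytes of the file.
        start = total_size - int(m.group(2))
        end = total_size - 1
    return (max(start, 0), min(end, total_size - 1))
```
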

last_err = None
for attempt in range(1, MAX_RETRIES + 1):
with tempfile.NamedTemporaryFile(mode="w+b", delete=True) as tmp:
r = _stream_get(url)
Member


Do we need to re-get the stream here? It looks like this is a redeclaration from above.

Author


I have rearranged the logic (including ditching the retries, because I think it's overkill and unlikely to work), slightly simplified that function, and added a comment about why we need to re-get the stream (not 1000% sure it was necessary before, but pretty sure it is at least needed now in case the streaming failed mid-transfer)

# Otherwise spool to disk (and verify/optionally retry)
last_err = None
for attempt in range(1, MAX_RETRIES + 1):
with tempfile.NamedTemporaryFile(mode="w+b", delete=True) as tmp:
Member


If I'm reading this correctly, won't this write the whole file to disk? So if we try to tar a 100GB raster with its geometamaker yaml file we'll need 100GB+ of disk space available per request?

Author


Yes, I have now changed the logic to always stream directly if Content-Length is present/parse-able
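
A sketch of that decision (illustrative names, not the extension's actual code): tarfile requires `TarInfo.size` before a member can be written, so streaming straight through is only possible when the remote size is declared up front, and spooling to disk remains the fallback:

```python
def remote_size(headers):
    """Return the declared Content-Length in bytes, or None if absent/unparseable."""
    value = headers.get("Content-Length")
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def pick_transfer_mode(headers):
    """Stream directly when the size is known; otherwise spool to a temp file."""
    size = remote_size(headers)
    return ("stream", size) if size is not None else ("spool", None)
```
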


@megannissel megannissel left a comment


Thanks, Claire! Absolutely love the UI aspect for the metadata preview!

James had more insightful comments than I could provide about the specifics of the tarring/streaming aspect of this PR. I'd love to have a conversation at some point about how that works, at a bit of a higher level; I've never had to implement anything like this before.

I also definitely see what you mean about how some files being CKAN Resources and others just being in the Sources list makes this more complicated / adds redundancy. Handling shapefiles and their various constituent parts raises a number of questions. I'd previously been thinking that it makes sense to continue storing the .zip that contains the parts as the top-level Resource, but after looking at this PR I'm re-thinking that assumption. Probably something we ought to discuss in greater detail at some point!

n = (res.get("name") or "").strip()
if n:
return n
url = res.get("url", "") or ""


Is the or "" doing anything here, given the default of "" for .get?

Author


This just prevents url from being set to None in the case that res["url"] = None! Looking at the following code, though, it's probably unnecessary, as urlparse can operate on None or '' and will just return either empty bytes or an empty string


def _filename_from_resource(res: dict) -> str:
# Prefer CKAN resource.name, fallback to URL basename
n = (res.get("name") or "").strip()


I think this could be simplified a bit -- res.get("name", "").strip()

meta_res_url = meta_res.get("url") if meta_res else None

# Decide what data files to include
if name.lower().endswith(".shp"):


Not directly related to this PR, but wanted to put a pin in this while I'm thinking about it -- currently, I don't think we ever have a top-level Resource ending with .shp because even packages that are a single shapefile layer (e.g. NOAA Shorelines) have the .zip as the top-level resource alongside the GMM YAML. If we do want to move towards "everything is a Resource," I suppose that would also mean resources for each shapefile part.

Author


Yeah I believe you're right, no shp resources! It seems like a good idea to still plan for this as a possibility just in case, like you mentioned, we make everything a resource


name_lower = name.lower()

# Skip if this IS a YAML file
if name_lower.endswith('.yml') or name_lower.endswith('.yaml'):


Replace with _is_yaml_name() check?

return {
"natcap_hello": natcap_hello,
"natcap_find_attached_metadata_map": natcap_find_attached_metadata_map,
"natcap_find_source_metadata_map": natcap_find_source_metadata_map,


Side note: Love that you've defined these functions in helpers.py -- I've been thinking we ought to move a number of the helpers we've previously defined in plugin.py into helpers for better organization!

target: /srv/app/src
# Need to rebuild ckan to reflect changes to python files, ex:
# - action: rebuild
# path: ./src/ckanext-natcap/ckanext/natcap/blueprint.py


🎉

Member

@phargogh phargogh left a comment


Thanks, Claire! This is looking very good. I just had a couple minor comments for your consideration, really very small. Thinking ahead to merging the PR, would you like this to be merged into master or into a different branch? I remember I created feature/serverless-file-access for the gcsproxy work but wanted to see if it still made sense to merge this over there or what the latest thinking was. Thanks!

Comment on lines +536 to +545
# If the URL is a zip (folder download), ensure the downloaded filename ends with .zip
if url_fname.lower().endswith(".zip") and not src_base.lower().endswith(".zip"):
filename = f"{src_base}.zip"
mimetype = "application/zip"
else:
# Prefer the URL filename if source_name is extensionless
filename = src_base
if "." not in src_base and url_fname:
filename = url_fname
mimetype = "application/octet-stream"
Member


I just remembered that python's stdlib has a mimetypes package, in case that has some functionality that is useful for this block: https://docs.python.org/3/library/mimetypes.html

Author


mimetypes does have some good functionality I wasn't aware of! In this block we're not really trying to guess the filetype; this is the case where there is a folder we'd like to download and a zip of the folder exists, so we download that instead (i.e., the source would be a text/directory or something, but we're manually changing it to be treated as a zip). In the future I'd like to change this so we don't have to have a zip of the folder, and can instead just automatically add everything in the folder to a tar.
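
For reference, the stdlib call James mentioned is a one-liner; it guesses from the filename's extension, so it would only help in branches where a real extension is available:

```python
import mimetypes

# guess_type returns a (type, encoding) pair based on the file extension;
# extensionless names (like a bare folder-derived source name) yield (None, None).
mime, encoding = mimetypes.guess_type("ET0.zip")
print(mime)  # application/zip
```
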

Comment on lines +16 to +17
shapefile_extensions = (".dbf", ".shx", ".prj", ".cpg", ".qix", ".sbn",
".sbx", ".shp.xml")
Member


Isn't this list duplicated in the blueprint file? If they are the same, would it make sense to just import the list and access it instead of redefining it?

Author


The key difference is that this list doesn't contain .shp, but I could import the list in blueprint.py and remove .shp

@claire-simpson claire-simpson changed the base branch from master to feature/serverless-file-access February 26, 2026 00:30
Member

@phargogh phargogh left a comment


Looks good, thanks @claire-simpson !

@phargogh phargogh merged commit 412d709 into feature/serverless-file-access Feb 27, 2026
@claire-simpson claire-simpson deleted the feature/166-metadata-viz branch March 2, 2026 17:59

Successfully merging this pull request may close these issues.

Attach metadata to data files instead of listing YAMLs separately
