Improve metadata UX: add metadata preview and bundled download support #191
Conversation
phargogh
left a comment
Thanks @claire-simpson! It is really neat to see the built-in streaming features included in Python's stdlib `tarfile` implementation, and nice job combining `tarfile` with OS-level file descriptors to get this to work! And so nice to see the UI improvements, too; it looks really nice.
I had a range of comments and suggestions, and I think there might be a few places where we can offload work from CKAN to other services so we aren't proxying more than we need to. I definitely think we should move the actual tarring to a separate service, and ideally avoid writing anything to disk, which I think should be possible by just reading data in and writing it right back out to a response. Anyway, let's talk more about this, and thanks so much for all your work!
```python
name_lower = name.lower()

# Only attach metadata to the main file for shapefiles
if name_lower.endswith(".shx") or name_lower.endswith(".dbf") or name_lower.endswith(".prj") or name_lower.endswith(".cpg"):
```
This can be nicely abbreviated if you'd like:
```diff
-if name_lower.endswith(".shx") or name_lower.endswith(".dbf") or name_lower.endswith(".prj") or name_lower.endswith(".cpg"):
+if name_lower.endswith((".shx", ".dbf", ".prj", ".cpg")):
```
Oh neat, didn't know endswith could be used like that!
```python
    for r in resources:
        n = r.get("name") or os.path.basename(r.get("url", "")) or ""
        if _is_yaml_name(n):
            yaml_by_name[n.lower()] = r

    attached = {}
    for r in resources:
        # Determine the data filename we will match against
        name = r.get("name") or os.path.basename(r.get("url", "")) or ""
        name_lower = name.lower()
```
Maybe these two loops could be consolidated? It doesn't look to me like `resources` is modified in the first loop over the list, and the name computation is also duplicated, so we could save some time by consolidating things.
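A possible shape for that consolidation, sketched with hypothetical helper names (the real function and variable names in the extension may differ): derive each resource's name exactly once, cache it alongside the resource, and build the YAML map from the cached pairs.

```python
import os


def _resource_name(r: dict) -> str:
    # Hypothetical helper: derive the display name once per resource.
    return r.get("name") or os.path.basename(r.get("url", "")) or ""


def build_yaml_map(resources):
    # One pass caches each derived name alongside its resource, so the
    # metadata-matching pass can reuse it instead of recomputing.
    named = [(_resource_name(r), r) for r in resources]
    yaml_by_name = {
        name.lower(): r
        for name, r in named
        if name.lower().endswith((".yml", ".yaml"))
    }
    return named, yaml_by_name
```

The second (matching) loop would then iterate over `named` and read the cached name rather than recomputing `r.get("name") or os.path.basename(...)`.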
```python
    - Shapefiles: only the `.shp` gets metadata (`vector.shp.yml`).
    - We ignore standalone YAMLs *without* a corresponding data file.
    """
    resources = pkg_dict.get("resources", []) or []
```
Sorry, I'm new to `or` in this context, but wouldn't `pkg_dict.get("resources", [])` return an empty list if `resources` is not in `pkg_dict`? What does the `or []` add in this case?
This is to safeguard in case pkg_dict['resources'] exists but is None, to ensure that resources gets set to an empty list that could be iterated over so an error isn't thrown later if trying to iterate over None. I don't know if pkg_dict['resources'] would ever be None, but added this to be safe...
Yeah, I'm not familiar enough with CKAN to confidently say whether `'resources'` would ever be `None`, but it makes sense to guard against the possibility. Could you add a comment about this to help future us remember why the `or []` is used here?
Actually it looks like the next lines would return `{}` if `resources` is `None`, so maybe the `or []` isn't necessary!
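For reference, a quick demonstration of when `.get`'s default applies versus when the `or []` guard kicks in:

```python
pkg_dict = {"resources": None}  # key present but explicitly None

# .get's default only applies when the key is missing entirely:
assert pkg_dict.get("resources", []) is None

# The `or []` handles the present-but-None (or otherwise falsy) case,
# guaranteeing something iterable:
resources = pkg_dict.get("resources", []) or []
assert resources == []

# With the key absent, .get's default alone is enough:
assert {}.get("resources", []) == []
```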
```jinja
{% endfor %}
{% endblock %}
{% set name_l = (resource.name or '')|lower %}
{% set is_yaml = name_l.endswith('.yml') or name_l.endswith('.yaml') %}
```
You could shorten this one too if you wanted, but not as important as the other one with like 5 ors chained together!
For what it's worth, a single `.endswith(some_tuple_of_extensions)` appears to be nominally faster (about twice as fast) than chaining `or`s:

```python
>>> import timeit
>>> timeit.timeit('"foo.shp".endswith((".shx", ".dbf", ".prj", ".cpg", ".shp"))')
0.07806391688063741
>>> timeit.timeit('path.endswith(".shx") or path.endswith(".dbf") or path.endswith(".prj") or path.endswith(".cpg") or path.endswith(".shp")', setup='path="foo.shp"')
0.1667984169907868
```

```python
    if (name_lower.endswith('.shx') or
            name_lower.endswith('.dbf') or
            name_lower.endswith('.prj') or
            name_lower.endswith('.cpg')):
```
Since we have a similar set of checks elsewhere, might it be worth creating an `is_shapefile()` function?
Also noticed that `SHAPEFILE_PART_EXTS` defined in `blueprint.py` includes more potential shapefile part extensions than this list -- might we want to include the others as well?
Sure, can definitely add a helper with all of the extensions mentioned in `blueprint.py` (except `.shp`!)
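A sketch of what that helper might look like; the extension tuple here is an assumption meant to mirror `SHAPEFILE_PART_EXTS` from `blueprint.py`, and in the real code it should be imported rather than redefined:

```python
# Assumed to match SHAPEFILE_PART_EXTS in blueprint.py (everything except .shp)
SHAPEFILE_PART_EXTS = (".shx", ".dbf", ".prj", ".cpg", ".qix", ".sbn",
                       ".sbx", ".shp.xml")


def is_shapefile_part(name: str) -> bool:
    """True for sidecar files of a shapefile (not the .shp itself)."""
    return name.lower().endswith(SHAPEFILE_PART_EXTS)
```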
```python
    if not url:
        abort(400, "Missing url")
    try:
        data = _download_bytes(url)
```
Suppose `url` represents some super big file. Given the current implementation of `_download_bytes()`, wouldn't this be both a) occupying a WSGI worker and b) reading the whole file into memory?
Couldn't we just redirect the client to the right file instead of serving the data through our instance here? I'm not super familiar with redirects, but it seems like maybe that would simplify things a bit and offload some of the responsibility onto other services, wouldn't it?
Yes I think we could redirect and that would be a great improvement! I also don't think this direct download will get used all that often because this is just the fallback for files that have no metadata/aren't shapefiles
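A minimal sketch of the redirect approach, assuming a Flask view function (the function name and signature here are illustrative, not the extension's actual view):

```python
from flask import abort, redirect


def download(url: str):
    # Instead of pulling the bytes through CKAN via _download_bytes(),
    # send the client straight to the source URL and let the object
    # store serve the file itself.
    if not url:
        abort(400, "Missing url")
    return redirect(url, code=302)
```

A 302 keeps the mapping re-resolvable per request, which matters if the underlying resource URLs (e.g. signed object-store URLs) can change.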
```python
        abort(502)

    filename = download_name or _filename_from_url(url) or "download"
    return send_file(BytesIO(data), as_attachment=True, download_name=filename, mimetype=mimetype)
```
Building on what I commented on `_download_bytes` above, wouldn't `send_file` proxy the file through CKAN? Wouldn't `flask.redirect` be helpful here?
```python
    return send_file(BytesIO(data), as_attachment=True, download_name=filename, mimetype=mimetype)


def _stream_tar_response(out_name: str, build_tar_fn):
```
Probably not critical for this PR, but when we're moving this to a separate service, I think it'll be important to implement range requests to handle the ability to resume downloads that fail partway through for whatever reason.
That makes sense. I'm not sure it's possible with how I've implemented it here, which is definitely a downside
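For future reference, a minimal sketch of the single-range parsing a separate tar service would need to support resumable downloads (the helper name is hypothetical; suffix ranges like `bytes=-500` and multi-part ranges are deliberately not handled in this sketch):

```python
def parse_range(header: str, total: int):
    """Parse a 'bytes=start-end' Range header into (start, end), else None."""
    unit, _, spec = header.partition("=")
    if unit != "bytes" or "," in spec:
        return None  # non-byte units and multi-part ranges not handled here
    start_s, _, end_s = spec.partition("-")
    if not start_s:
        return None  # suffix ranges (bytes=-N) not handled in this sketch
    start = int(start_s)
    end = int(end_s) if end_s else total - 1
    if start > end or end >= total:
        return None  # unsatisfiable range; server should answer 416
    return start, end
```

On a satisfiable range, the server would reply `206 Partial Content` with a `Content-Range: bytes start-end/total` header; the catch for this PR is that a tar generated on the fly has no stable `total` (or stable bytes) to resume against, which is part of why moving the tarring to a dedicated service makes sense.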
```python
    last_err = None
    for attempt in range(1, MAX_RETRIES + 1):
        with tempfile.NamedTemporaryFile(mode="w+b", delete=True) as tmp:
            r = _stream_get(url)
```
Do we need to re-get the stream here? It looks like this is redeclaration from above.
I have rearranged the logic (including ditching the retries, because I think it's overkill and unlikely to work) and slightly simplified that function. I also added a comment about why we need to re-get the stream; not 1000% sure it was necessary before, but pretty sure it is at least needed now in case the streaming failed mid-transfer.
```python
    # Otherwise spool to disk (and verify/optionally retry)
    last_err = None
    for attempt in range(1, MAX_RETRIES + 1):
        with tempfile.NamedTemporaryFile(mode="w+b", delete=True) as tmp:
```
If I'm reading this correctly, won't this write the whole file to disk? So if we try to tar a 100GB raster with its geometamaker yaml file we'll need 100GB+ of disk space available per request?
Yes, I have now changed the logic to always stream directly if `Content-Length` is present and parseable
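For what it's worth, this is exactly the case the stdlib supports well: `tarfile` can write a member straight from a non-seekable stream as long as the size is known up front (the tar header records the member size before its bytes), which is why `Content-Length` matters. A small self-contained sketch, with an in-memory buffer standing in for the HTTP response stream:

```python
import io
import tarfile


def stream_member_into_tar(tar, name, fileobj, size):
    """Add a streamed file to an open tar without spooling to disk.

    Requires the exact byte count up front (e.g. from Content-Length),
    because the tar header is written before the member's bytes.
    """
    info = tarfile.TarInfo(name=name)
    info.size = size
    tar.addfile(info, fileobj)  # reads exactly `size` bytes from fileobj


payload = b"x" * 1024
sink = io.BytesIO()  # stand-in for the outgoing WSGI response
with tarfile.open(fileobj=sink, mode="w|") as tar:  # "w|" = non-seekable stream
    stream_member_into_tar(tar, "raster.tif", io.BytesIO(payload), len(payload))
```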
megannissel
left a comment
Thanks, Claire! Absolutely love the UI aspect for the metadata preview!
James had more insightful comments than I could provide about the specifics of the tarring/streaming aspect of this PR. I'd love to have a conversation at some point about how that works, at a bit of a higher level; I've never had to implement anything like this before.
I also definitely see what you mean about how some files being CKAN Resources and others just being in the Sources list makes this more complicated / adds redundancy. Handling shapefiles and their various constituent parts raises a number of questions. I'd previously been thinking that it makes sense to continue storing the .zip that contains the parts as the top-level Resource, but after looking at this PR I'm re-thinking that assumption. Probably something we ought to discuss in greater detail at some point!
```python
    n = (res.get("name") or "").strip()
    if n:
        return n
    url = res.get("url", "") or ""
```
Is the `or ""` doing anything here, given the default of `""` for `.get`?
This just prevents `url` from being set to `None` in the case that `res["url"]` is `None`! Looking at the following code though, it's probably unnecessary, as `urlparse` can operate on `None` or `''` and will just return either empty bytes or an empty string
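A quick check of that claim with the stdlib:

```python
from urllib.parse import urlparse

# '' parses to empty string components:
assert urlparse("").path == ""

# None is tolerated too: it is coerced through urlparse's bytes path,
# so the components come back as empty bytes (or an empty string):
empty = urlparse(None)
assert empty.path in (b"", "")
```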
```python
def _filename_from_resource(res: dict) -> str:
    # Prefer CKAN resource.name, fallback to URL basename
    n = (res.get("name") or "").strip()
```
I think this could be simplified a bit -- `res.get("name", "").strip()` (though note the `or ""` form also guards against an explicit `None` value for `name`, which `.get`'s default would not).
```python
    meta_res_url = meta_res.get("url") if meta_res else None

    # Decide what data files to include
    if name.lower().endswith(".shp"):
```
Not directly related to this PR, but wanted to put a pin in this while I'm thinking about it -- currently, I don't think we ever have a top-level Resource ending with .shp because even packages that are a single shapefile layer (e.g. NOAA Shorelines) have the .zip as the top-level resource alongside the GMM YAML. If we do want to move towards "everything is a Resource," I suppose that would also mean resources for each shapefile part.
Yeah I believe you're right, no shp resources! It seems like a good idea to still plan for this as a possibility just in case, like you mentioned, we make everything a resource
```python
    if (name_lower.endswith('.shx') or
            name_lower.endswith('.dbf') or
            name_lower.endswith('.prj') or
            name_lower.endswith('.cpg')):
```
Also noticed that `SHAPEFILE_PART_EXTS` defined in `blueprint.py` includes more potential shapefile part extensions than this list -- might we want to include the others as well?
```python
    name_lower = name.lower()

    # Skip if this IS a YAML file
    if name_lower.endswith('.yml') or name_lower.endswith('.yaml'):
```
Replace with an `_is_yaml_name()` check?
```python
    return {
        "natcap_hello": natcap_hello,
        "natcap_find_attached_metadata_map": natcap_find_attached_metadata_map,
        "natcap_find_source_metadata_map": natcap_find_source_metadata_map,
```
Side note: Love that you've defined these functions in `helpers.py` -- I've been thinking we ought to move a number of the helpers we've previously defined in `plugin.py` into `helpers.py` for better organization!
```yaml
target: /srv/app/src
# Need to rebuild ckan to reflect changes to python files, ex:
# - action: rebuild
#   path: ./src/ckanext-natcap/ckanext/natcap/blueprint.py
```
…nd call to download, not open, txt etc; get CKAN_PORT from env; always stream if possible instead of spooling for large files; other small fixes #166
phargogh
left a comment
Thanks, Claire! This is looking very good. I just had a couple of minor comments for your consideration, really very small. Thinking ahead to merging the PR, would you like this merged into `master` or into a different branch? I remember I created `feature/serverless-file-access` for the gcsproxy work, but wanted to check whether it still makes sense to merge this over there, or what the latest thinking is. Thanks!
```python
    # If the URL is a zip (folder download), ensure the downloaded filename ends with .zip
    if url_fname.lower().endswith(".zip") and not src_base.lower().endswith(".zip"):
        filename = f"{src_base}.zip"
        mimetype = "application/zip"
    else:
        # Prefer the URL filename if source_name is extensionless
        filename = src_base
        if "." not in src_base and url_fname:
            filename = url_fname
        mimetype = "application/octet-stream"
```
I just remembered that Python's stdlib has a `mimetypes` package, in case that has some functionality that is useful for this block: https://docs.python.org/3/library/mimetypes.html
`mimetypes` does have some good functionality I wasn't aware of! In this block we're not really trying to guess the filetype; this is the case where there is a folder we'd like to download and a zip of the folder exists, so we download that instead (i.e., the source would be a text/directory or something, but we're manually changing it to be treated as a zip). In the future I'd like to change this so we don't have to have a zip of the folder, and instead just automatically add everything in the folder to a tar.
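For reference, here's roughly what `mimetypes` offers as a fallback for the hardcoded `application/octet-stream` (a sketch, not a suggested change to this block, since as noted above this branch deliberately overrides the type):

```python
import mimetypes

# guess_type maps a filename to a MIME type by its extension, returning
# (type, encoding); type is None when the extension is unknown.
mtype, _encoding = mimetypes.guess_type("results.zip")
fallback = mtype or "application/octet-stream"
```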
```python
shapefile_extensions = (".dbf", ".shx", ".prj", ".cpg", ".qix", ".sbn",
                        ".sbx", ".shp.xml")
```
Isn't this list duplicated in the blueprint file? If they are the same, would it make sense to just import the list and access it instead of redefining it?
The key difference is that this list doesn't contain `.shp`, but I could import the list in `blueprint.py` and remove `.shp`
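A sketch of that deduplication idea (the constant names and the module split are illustrative):

```python
# Full list, .shp included, kept in one place; in the real code this
# constant is assumed to live in blueprint.py.
SHAPEFILE_EXTS = (".shp", ".dbf", ".shx", ".prj", ".cpg", ".qix", ".sbn",
                  ".sbx", ".shp.xml")

# Callers that only care about sidecar parts derive their tuple from the
# shared constant instead of redefining it:
SHAPEFILE_PART_EXTS = tuple(ext for ext in SHAPEFILE_EXTS if ext != ".shp")
```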
phargogh
left a comment
Looks good, thanks @claire-simpson !
This PR adds support for metadata discovery and bundled downloads across both CKAN resources and zip-expanded “sources”, with special handling for shapefiles and the capacity to handle large files.
Metadata preview UI
Bundled downloads
`ET0` in 3Ps Colombia package will hopefully be temporary and a future PR will allow us to automatically tar all files enclosed in a folder rather than requiring us to upload an adjacent `zip` for each downloadable folder)
Streaming tar creation for large files
Resource layout
Other important notes:
Fixes #166, #154 (partly)