Improve metadata UX: add metadata preview and bundled download support #191
Conversation
phargogh
left a comment
Thanks @claire-simpson! It is really neat to see the built-in streaming features included in Python's stdlib `tarfile` implementation, and nice job combining `tarfile` with OS-level file descriptors to get this to work! And so nice to see the UI improvements, too; it looks really nice.
I had a range of comments and suggestions, and I think there might be a few places where we can offload work from CKAN to other services so we aren't proxying more than we need to. I definitely think we should move the actual tarring to a separate service, and ideally avoid writing anything to disk, which I think should be possible by just reading data in and writing it right back out to a response. Anyway, let's talk more about this, and thanks so much for all your work!
```python
name_lower = name.lower()

# Only attach metadata to the main file for shapefiles
if name_lower.endswith(".shx") or name_lower.endswith(".dbf") or name_lower.endswith(".prj") or name_lower.endswith(".cpg"):
```
This can be nicely abbreviated if you'd like:
```diff
-if name_lower.endswith(".shx") or name_lower.endswith(".dbf") or name_lower.endswith(".prj") or name_lower.endswith(".cpg"):
+if name_lower.endswith((".shx", ".dbf", ".prj", ".cpg")):
```
Oh neat, didn't know endswith could be used like that!
```python
    for r in resources:
        n = r.get("name") or os.path.basename(r.get("url", "")) or ""
        if _is_yaml_name(n):
            yaml_by_name[n.lower()] = r

    attached = {}
    for r in resources:
        # Determine the data filename we will match against
        name = r.get("name") or os.path.basename(r.get("url", "")) or ""
        name_lower = name.lower()
```
Maybe these two loops could be consolidated? It doesn't look to me like `resources` is modified in the first loop over the list, and the name computation is also duplicated, so we could save some time by consolidating things.
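A possible shape for that consolidation, sketched with hypothetical helper names (the real function and variable names in the extension may differ): derive each resource's name exactly once, cache it alongside the resource, and build the YAML map from the cached pairs.

```python
import os


def _resource_name(r: dict) -> str:
    # Hypothetical helper: derive the display name once per resource.
    return r.get("name") or os.path.basename(r.get("url", "")) or ""


def build_yaml_map(resources):
    # One pass caches each derived name alongside its resource, so the
    # metadata-matching pass can reuse it instead of recomputing.
    named = [(_resource_name(r), r) for r in resources]
    yaml_by_name = {
        name.lower(): r
        for name, r in named
        if name.lower().endswith((".yml", ".yaml"))
    }
    return named, yaml_by_name
```

The second (matching) loop would then iterate over `named` and read the cached name rather than recomputing `r.get("name") or os.path.basename(...)`.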
```python
    - Shapefiles: only the `.shp` gets metadata (`vector.shp.yml`).
    - We ignore standalone YAMLs *without* a corresponding data file.
    """
    resources = pkg_dict.get("resources", []) or []
```
Sorry, I'm new to `or` in this context, but wouldn't `pkg_dict.get("resources", [])` return an empty list if `resources` is not in `pkg_dict`? What does the `or []` add in this case?
This is to safeguard in case pkg_dict['resources'] exists but is None, to ensure that resources gets set to an empty list that could be iterated over so an error isn't thrown later if trying to iterate over None. I don't know if pkg_dict['resources'] would ever be None, but added this to be safe...
Yeah, I'm not familiar enough with CKAN to confidently say whether `'resources'` would ever be `None`, but it makes sense to guard against the possibility. Could you add a comment about this to help future us remember why the `or []` is used here?
Actually it looks like the next lines would return `{}` if `resources` is `None`, so maybe the `or []` isn't necessary!
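For reference, a quick demonstration of when `.get`'s default applies versus when the `or []` guard kicks in:

```python
pkg_dict = {"resources": None}  # key present but explicitly None

# .get's default only applies when the key is missing entirely:
assert pkg_dict.get("resources", []) is None

# The `or []` handles the present-but-None (or otherwise falsy) case,
# guaranteeing something iterable:
resources = pkg_dict.get("resources", []) or []
assert resources == []

# With the key absent, .get's default alone is enough:
assert {}.get("resources", []) == []
```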
```jinja
{% endfor %}
{% endblock %}
{% set name_l = (resource.name or '')|lower %}
{% set is_yaml = name_l.endswith('.yml') or name_l.endswith('.yaml') %}
```
You could shorten this one too if you wanted, but not as important as the other one with like 5 ors chained together!
For what it's worth, a single `.endswith(some_tuple_of_extensions)` appears to be nominally faster (about twice as fast) than chaining `or`s:

```python
>>> import timeit
>>> timeit.timeit('"foo.shp".endswith((".shx", ".dbf", ".prj", ".cpg", ".shp"))')
0.07806391688063741
>>> timeit.timeit('path.endswith(".shx") or path.endswith(".dbf") or path.endswith(".prj") or path.endswith(".cpg") or path.endswith(".shp")', setup='path="foo.shp"')
0.1667984169907868
```

```python
    if (name_lower.endswith('.shx') or
            name_lower.endswith('.dbf') or
            name_lower.endswith('.prj') or
            name_lower.endswith('.cpg')):
```
Since we have a similar set of checks elsewhere, might it be worth creating an `is_shapefile()` function?
Also noticed that `SHAPEFILE_PART_EXTS` defined in `blueprint.py` includes more potential shapefile part extensions than this list -- might we want to include the others as well?
Sure, can definitely add a helper with all of the extensions mentioned in `blueprint.py` (except `.shp`!)
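A sketch of what that helper might look like; the extension tuple here is an assumption meant to mirror `SHAPEFILE_PART_EXTS` from `blueprint.py`, and in the real code it should be imported rather than redefined:

```python
# Assumed to match SHAPEFILE_PART_EXTS in blueprint.py (everything except .shp)
SHAPEFILE_PART_EXTS = (".shx", ".dbf", ".prj", ".cpg", ".qix", ".sbn",
                       ".sbx", ".shp.xml")


def is_shapefile_part(name: str) -> bool:
    """True for sidecar files of a shapefile (not the .shp itself)."""
    return name.lower().endswith(SHAPEFILE_PART_EXTS)
```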
```python
    if not url:
        abort(400, "Missing url")
    try:
        data = _download_bytes(url)
```
Suppose `url` represents some super big file. Given the current implementation of `_download_bytes()`, wouldn't this be both a) occupying a WSGI worker and b) reading the whole file into memory?
Couldn't we just redirect the client to the right file instead of serving the data through our instance here? I'm not super familiar with redirects, but it seems like maybe that would simplify things a bit and offload some of the responsibility onto other services, wouldn't it?
Yes I think we could redirect and that would be a great improvement! I also don't think this direct download will get used all that often because this is just the fallback for files that have no metadata/aren't shapefiles
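A minimal sketch of the redirect approach, assuming a Flask view function (the function name and signature here are illustrative, not the extension's actual view):

```python
from flask import abort, redirect


def download(url: str):
    # Instead of pulling the bytes through CKAN via _download_bytes(),
    # send the client straight to the source URL and let the object
    # store serve the file itself.
    if not url:
        abort(400, "Missing url")
    return redirect(url, code=302)
```

A 302 keeps the mapping re-resolvable per request, which matters if the underlying resource URLs (e.g. signed object-store URLs) can change.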
```python
        abort(502)

    filename = download_name or _filename_from_url(url) or "download"
    return send_file(BytesIO(data), as_attachment=True, download_name=filename, mimetype=mimetype)
```
Building on what I commented on `_download_bytes` above, wouldn't `send_file` proxy the file through CKAN? Wouldn't `flask.redirect` be helpful here?
```python
    return send_file(BytesIO(data), as_attachment=True, download_name=filename, mimetype=mimetype)


def _stream_tar_response(out_name: str, build_tar_fn):
```
Probably not critical for this PR, but when we're moving this to a separate service, I think it'll be important to implement range requests to handle the ability to resume downloads that fail partway through for whatever reason.
That makes sense. I'm not sure it's possible with how I've implemented it here, which is definitely a downside
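For future reference, a minimal sketch of the single-range parsing a separate tar service would need to support resumable downloads (the helper name is hypothetical; suffix ranges like `bytes=-500` and multi-part ranges are deliberately not handled in this sketch):

```python
def parse_range(header: str, total: int):
    """Parse a 'bytes=start-end' Range header into (start, end), else None."""
    unit, _, spec = header.partition("=")
    if unit != "bytes" or "," in spec:
        return None  # non-byte units and multi-part ranges not handled here
    start_s, _, end_s = spec.partition("-")
    if not start_s:
        return None  # suffix ranges (bytes=-N) not handled in this sketch
    start = int(start_s)
    end = int(end_s) if end_s else total - 1
    if start > end or end >= total:
        return None  # unsatisfiable range; server should answer 416
    return start, end
```

On a satisfiable range, the server would reply `206 Partial Content` with a `Content-Range: bytes start-end/total` header; the catch for this PR is that a tar generated on the fly has no stable `total` (or stable bytes) to resume against, which is part of why moving the tarring to a dedicated service makes sense.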
```python
    last_err = None
    for attempt in range(1, MAX_RETRIES + 1):
        with tempfile.NamedTemporaryFile(mode="w+b", delete=True) as tmp:
            r = _stream_get(url)
```
Do we need to re-get the stream here? It looks like this is redeclaration from above.
I have rearranged the logic (including ditching the retries, because I think it's overkill and unlikely to work) and slightly simplified that function. I also added a comment about why we need to re-get the stream; not 1000% sure it was necessary before, but pretty sure it is at least needed now in case the streaming failed mid-transfer.
```python
    # Otherwise spool to disk (and verify/optionally retry)
    last_err = None
    for attempt in range(1, MAX_RETRIES + 1):
        with tempfile.NamedTemporaryFile(mode="w+b", delete=True) as tmp:
```
If I'm reading this correctly, won't this write the whole file to disk? So if we try to tar a 100GB raster with its geometamaker yaml file we'll need 100GB+ of disk space available per request?
Yes, I have now changed the logic to always stream directly if `Content-Length` is present and parseable
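For what it's worth, this is exactly the case the stdlib supports well: `tarfile` can write a member straight from a non-seekable stream as long as the size is known up front (the tar header records the member size before its bytes), which is why `Content-Length` matters. A small self-contained sketch, with an in-memory buffer standing in for the HTTP response stream:

```python
import io
import tarfile


def stream_member_into_tar(tar, name, fileobj, size):
    """Add a streamed file to an open tar without spooling to disk.

    Requires the exact byte count up front (e.g. from Content-Length),
    because the tar header is written before the member's bytes.
    """
    info = tarfile.TarInfo(name=name)
    info.size = size
    tar.addfile(info, fileobj)  # reads exactly `size` bytes from fileobj


payload = b"x" * 1024
sink = io.BytesIO()  # stand-in for the outgoing WSGI response
with tarfile.open(fileobj=sink, mode="w|") as tar:  # "w|" = non-seekable stream
    stream_member_into_tar(tar, "raster.tif", io.BytesIO(payload), len(payload))
```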
megannissel
left a comment
Thanks, Claire! Absolutely love the UI aspect for the metadata preview!
James had more insightful comments than I could provide about the specifics of the tarring/streaming aspect of this PR. I'd love to have a conversation at some point about how that works, at a bit of a higher level; I've never had to implement anything like this before.
I also definitely see what you mean about how some files being CKAN Resources and others just being in the Sources list makes this more complicated / adds redundancy. Handling shapefiles and their various constituent parts raises a number of questions. I'd previously been thinking that it makes sense to continue storing the .zip that contains the parts as the top-level Resource, but after looking at this PR I'm re-thinking that assumption. Probably something we ought to discuss in greater detail at some point!
```python
    n = (res.get("name") or "").strip()
    if n:
        return n
    url = res.get("url", "") or ""
```
Is the `or ""` doing anything here, given the default of `""` for `.get`?
This just prevents `url` from being set to `None` in the case that `res["url"]` is `None`! Looking at the following code though, it's probably unnecessary, as `urlparse` can operate on `None` or `''` and will just return either empty bytes or an empty string
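A quick check of that claim with the stdlib:

```python
from urllib.parse import urlparse

# '' parses to empty string components:
assert urlparse("").path == ""

# None is tolerated too: it is coerced through urlparse's bytes path,
# so the components come back as empty bytes (or an empty string):
empty = urlparse(None)
assert empty.path in (b"", "")
```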
```python
def _filename_from_resource(res: dict) -> str:
    # Prefer CKAN resource.name, fallback to URL basename
    n = (res.get("name") or "").strip()
```
I think this could be simplified a bit -- `res.get("name", "").strip()` (though note the `or ""` form also guards against an explicit `None` value for `name`, which `.get`'s default would not).
```python
    meta_res_url = meta_res.get("url") if meta_res else None

    # Decide what data files to include
    if name.lower().endswith(".shp"):
```
Not directly related to this PR, but wanted to put a pin in this while I'm thinking about it -- currently, I don't think we ever have a top-level Resource ending with .shp because even packages that are a single shapefile layer (e.g. NOAA Shorelines) have the .zip as the top-level resource alongside the GMM YAML. If we do want to move towards "everything is a Resource," I suppose that would also mean resources for each shapefile part.
Yeah I believe you're right, no shp resources! It seems like a good idea to still plan for this as a possibility just in case, like you mentioned, we make everything a resource
```python
    if (name_lower.endswith('.shx') or
            name_lower.endswith('.dbf') or
            name_lower.endswith('.prj') or
            name_lower.endswith('.cpg')):
```
Also noticed that `SHAPEFILE_PART_EXTS` defined in `blueprint.py` includes more potential shapefile part extensions than this list -- might we want to include the others as well?
```python
    name_lower = name.lower()

    # Skip if this IS a YAML file
    if name_lower.endswith('.yml') or name_lower.endswith('.yaml'):
```
Replace with an `_is_yaml_name()` check?
```python
    return {
        "natcap_hello": natcap_hello,
        "natcap_find_attached_metadata_map": natcap_find_attached_metadata_map,
        "natcap_find_source_metadata_map": natcap_find_source_metadata_map,
```
Side note: Love that you've defined these functions in `helpers.py` -- I've been thinking we ought to move a number of the helpers we've previously defined in `plugin.py` into `helpers.py` for better organization!
```yaml
target: /srv/app/src
# Need to rebuild ckan to reflect changes to python files, ex:
# - action: rebuild
#   path: ./src/ckanext-natcap/ckanext/natcap/blueprint.py
```
…nd call to download, not open, txt etc; get CKAN_PORT from env; always stream if possible instead of spooling for large files; other small fixes #166
phargogh
left a comment
Thanks, Claire! This is looking very good. I just had a couple of minor comments for your consideration, really very small. Thinking ahead to merging the PR, would you like this merged into `master` or into a different branch? I remember I created `feature/serverless-file-access` for the gcsproxy work, but wanted to check whether it still makes sense to merge this over there, or what the latest thinking is. Thanks!
```python
    # If the URL is a zip (folder download), ensure the downloaded filename ends with .zip
    if url_fname.lower().endswith(".zip") and not src_base.lower().endswith(".zip"):
        filename = f"{src_base}.zip"
        mimetype = "application/zip"
    else:
        # Prefer the URL filename if source_name is extensionless
        filename = src_base
        if "." not in src_base and url_fname:
            filename = url_fname
        mimetype = "application/octet-stream"
```
I just remembered that Python's stdlib has a `mimetypes` package, in case that has some functionality that is useful for this block: https://docs.python.org/3/library/mimetypes.html
`mimetypes` does have some good functionality I wasn't aware of! In this block we're not really trying to guess the filetype; this is the case where there is a folder we'd like to download and a zip of the folder exists, so we download that instead (i.e., the source would be a text/directory or something, but we're manually changing it to be treated as a zip). In the future I'd like to change this so we don't have to have a zip of the folder, and instead just automatically add everything in the folder to a tar.
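For reference, here's roughly what `mimetypes` offers as a fallback for the hardcoded `application/octet-stream` (a sketch, not a suggested change to this block, since as noted above this branch deliberately overrides the type):

```python
import mimetypes

# guess_type maps a filename to a MIME type by its extension, returning
# (type, encoding); type is None when the extension is unknown.
mtype, _encoding = mimetypes.guess_type("results.zip")
fallback = mtype or "application/octet-stream"
```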
```python
shapefile_extensions = (".dbf", ".shx", ".prj", ".cpg", ".qix", ".sbn",
                        ".sbx", ".shp.xml")
```
Isn't this list duplicated in the blueprint file? If they are the same, would it make sense to just import the list and access it instead of redefining it?
The key difference is that this list doesn't contain `.shp`, but I could import the list in `blueprint.py` and remove `.shp`
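A sketch of that deduplication idea (the constant names and the module split are illustrative):

```python
# Full list, .shp included, kept in one place; in the real code this
# constant is assumed to live in blueprint.py.
SHAPEFILE_EXTS = (".shp", ".dbf", ".shx", ".prj", ".cpg", ".qix", ".sbn",
                  ".sbx", ".shp.xml")

# Callers that only care about sidecar parts derive their tuple from the
# shared constant instead of redefining it:
SHAPEFILE_PART_EXTS = tuple(ext for ext in SHAPEFILE_EXTS if ext != ".shp")
```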
phargogh
left a comment
Looks good, thanks @claire-simpson !
This PR adds support for metadata discovery and bundled downloads across both CKAN resources and zip-expanded “sources”, with special handling for shapefiles and the capacity to handle large files.
Metadata preview UI
Bundled downloads
`ET0` in 3Ps Colombia package will hopefully be temporary and a future PR will allow us to automatically tar all files enclosed in a folder rather than requiring us to upload an adjacent `zip` for each downloadable folder)
Streaming tar creation for large files
Resource layout
Other important notes:
Fixes #166, #154 (partly)