Fix S3 glob XML parsing to ignore non-Contents nodes (e.g. Marker) #350

Open
cbardasano wants to merge 1 commit into connormanning:master from cbardasano:fix-xml-parser

@cbardasano

Summary

This PR fixes a bug in S3::glob where the XML parsing loop incorrectly assumes that all sibling nodes following the first <Contents> node are also <Contents> nodes.

The Issue

With S3-compatible storage providers (DigitalOcean Spaces in my case), the ListBucketResult XML response may contain additional nodes, such as <Marker>, after the last <Contents> node.

The current implementation advances with conNode = conNode->next_sibling(), passing no name argument, so the loop also visits these non-Contents nodes. Since <Marker> (and probably other such nodes) does not contain a <Key> child, the function throws an ArbiterError("Missing Key...").

Example of crashing XML response:

<ListBucketResult ...>
    ...
    <Contents>
        <Key>path/to/file.laz</Key>
        ...
    </Contents>
    <Marker></Marker>
</ListBucketResult>

The Fix

I updated the loop to explicitly request the next sibling with the name "Contents":

// Before
for ( ; conNode; conNode = conNode->next_sibling())

// After
for ( ; conNode; conNode = conNode->next_sibling("Contents"))

This ensures that RapidXML skips any sibling nodes that are not <Contents>, preventing the crash on valid S3 responses that include metadata like markers at the end of the list.
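
To make the difference between the two loops concrete, here is a minimal, self-contained sketch that does not use RapidXML itself. The Node struct, its key field, and the collectKeys helper are hypothetical stand-ins invented for illustration; only the shape of next_sibling (an optional name filter) is modeled on RapidXML, and std::runtime_error stands in for ArbiterError.

```cpp
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an XML element in a sibling list.
struct Node
{
    const char* name;  // Element name, e.g. "Contents" or "Marker".
    const char* key;   // Text of the <Key> child, or nullptr if there is none.
    Node* next;        // Immediate next sibling, or nullptr at the end.

    // Modeled on RapidXML's next_sibling(name): with no name, return the
    // immediate sibling; with a name, return the next sibling so named.
    Node* next_sibling(const char* wanted = nullptr) const
    {
        for (Node* n = next; n; n = n->next)
        {
            if (!wanted || std::strcmp(n->name, wanted) == 0) return n;
        }
        return nullptr;
    }
};

// Collects keys starting from the first <Contents> node. Any visited node
// without a key raises an error, mirroring the failure described above.
std::vector<std::string> collectKeys(Node* conNode, bool filterByName)
{
    std::vector<std::string> keys;
    for ( ; conNode; conNode = filterByName
            ? conNode->next_sibling("Contents")
            : conNode->next_sibling())
    {
        if (!conNode->key) throw std::runtime_error("Missing Key");
        keys.push_back(conNode->key);
    }
    return keys;
}
```

With a sibling list of two Contents nodes followed by a Marker node, collectKeys(first, true) returns both keys, while collectKeys(first, false) reaches the Marker node and throws.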
