Stream metadata.config decoding in chunks and increase max size to 1024KB #181

Merged
ericmj merged 4 commits into main from ericmj/streaming-metadata-decoder on May 7, 2026

@ericmj ericmj commented May 2, 2026

No description provided.

ericmj added 2 commits May 2, 2026 15:14
Decode metadata.config by feeding the binary through safe_erl_term:tokens/2
in 4KB slices, parsing each dot-terminated form as it completes, instead of
materializing the whole file as a char list before tokenizing. Peak transient
decode memory drops from roughly 17x the binary size to binary + ~64KB.

This makes it safe to bump MAX_METADATA_SIZE from 128KB to 1MB.
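
To put those figures side by side, here is a small back-of-the-envelope calculation in Python. It only restates the numbers from the commit message (roughly 17x the binary size before, binary + ~64KB after); the helper names are made up for illustration and 1KB is assumed to be 1024 bytes.

```python
KB = 1024  # assumption: binary kilobytes


def peak_old(size_bytes: int) -> int:
    """Approximate peak with the whole file as a char list: ~17x the binary."""
    return 17 * size_bytes


def peak_new(size_bytes: int) -> int:
    """Approximate peak with chunked decoding: the binary plus ~64KB."""
    return size_bytes + 64 * KB


for label, size in (("128KB", 128 * KB), ("1MB", 1024 * KB)):
    old_kb = peak_old(size) // KB
    new_kb = peak_new(size) // KB
    print(f"{label}: old ~{old_kb}KB, new ~{new_kb}KB")
```

Under these estimates, a 1MB file decoded in chunks peaks at about 1088KB, which is below the roughly 2176KB the old path needed for a 128KB file.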

Switches the leex dot rule to end_token so tokens/2 returns one form per
call. Atom safety (list_to_existing_atom) and the restricted grammar are
preserved. UTF-8 is decoded incrementally with a latin1 fallback restart on
invalid bytes.
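
The chunked approach described above can be illustrated with a simplified Python sketch. This is not the Erlang implementation and does not use leex or `safe_erl_term`; it is a hypothetical scanner that only knows about double-quoted strings and treats a `.` followed by whitespace (or end of input) as the end of a form. Comments and `$.` character literals are deliberately ignored.

```python
from typing import Iterable, Iterator

CHUNK_SIZE = 4096  # 4KB slices, as in the commit message


def chunks(data: str, size: int = CHUNK_SIZE) -> Iterator[str]:
    """Split the input into fixed-size slices."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def forms(pieces: Iterable[str]) -> Iterator[str]:
    """Yield each dot-terminated form as soon as it is complete."""
    buf = []             # characters of the form currently being read
    in_string = False    # inside a double-quoted string
    escaped = False      # previous character was a backslash in a string
    pending_dot = False  # saw '.', need the next character to classify it
    for piece in pieces:
        for ch in piece:
            if pending_dot:
                pending_dot = False
                if ch.isspace():
                    # '.' followed by whitespace terminates the form
                    yield "".join(buf).strip()
                    buf = []
                    continue
                # otherwise the dot was part of a token such as 1.5
            buf.append(ch)
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == ".":
                pending_dot = True
    if pending_dot:
        # a dot at the very end of the input also terminates a form
        yield "".join(buf).strip()
    elif "".join(buf).strip():
        raise ValueError("incomplete trailing form")
```

Because the scanner state survives across slices, the forms produced do not depend on where the chunk boundaries fall, and only the form in progress is buffered at any time.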
- Fix infinite loop on trailing incomplete UTF-8 bytes: when no input
  bytes remain but the UTF-8 buffer is non-empty, fall back to latin1
  rather than spin re-decoding the same buffer.
- Drop overly broad try/catch around erl_parse:parse_term/1; the
  function only returns {ok, _} | {error, _} and shouldn't throw.
- Add a comment in safe_erl_term.xrl explaining why the dot rule emits
  end_token instead of token.
- Cover the new chunked path in decode_metadata_test: large
  multi-chunk input, UTF-8 char straddling a chunk boundary, latin1
  fallback for embedded invalid UTF-8 bytes, the trailing-incomplete-
  UTF-8 regression case, and 5000 terms across many chunks.
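
The UTF-8 cases listed above (a character straddling a chunk boundary, invalid bytes, and trailing incomplete bytes) can be reproduced with Python's standard incremental decoder. This is only an illustration of the behaviour, not the Erlang code from this PR; the function name is hypothetical, and restarting the whole input as latin1 is an assumption about what "fallback restart" means.

```python
import codecs


def decode(data: bytes, chunk_size: int = 4096) -> str:
    """Decode UTF-8 chunk by chunk; on invalid input, re-decode as latin1."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    out = []
    try:
        for start in range(0, len(data), chunk_size):
            # A partial multi-byte sequence at the end of a chunk is kept
            # inside the decoder until the next chunk completes it.
            out.append(decoder.decode(data[start:start + chunk_size]))
        # final=True turns leftover partial bytes into an error, so the
        # decoder cannot wait forever for input that will never arrive.
        out.append(decoder.decode(b"", final=True))
    except UnicodeDecodeError:
        # Assumption: restart from the beginning with latin1, which maps
        # every byte to a code point and therefore cannot fail.
        return data.decode("latin-1")
    return "".join(out)


print(decode(b"a\xc3\xa9b", chunk_size=2))  # 'é' split across two chunks
print(decode(b"a\xffb"))                    # invalid byte, latin1 fallback
print(decode(b"ab\xc3"))                    # trailing incomplete sequence
```

The third call is the shape of the regression case: the input ends while the decoder still holds a partial sequence, and the explicit final step forces a decision instead of re-reading the same buffer.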

jeregrine left a comment

Hitting this when we bump to latest Erlang. ❤️

dch commented May 6, 2026

On OTP28.5 + Elixir 1.19.5 we're hitting this limit already:

-define(MAX_METADATA_SIZE, 1024 * 1024).

I'll let you know what library & size we hit.

mitchellhenke commented

https://hex.pm/packages/phosphor_icons seems to be a package whose metadata file exceeds the current limit and breaks. Its hex_metadata.config is about 399KB because the file list includes thousands of SVGs.

ericmj commented May 7, 2026

I checked all existing packages, and none hits the raised 1024KB limit. Some outliers are close, but I think the fix should be on the package side in that case.

ericmj added 2 commits May 7, 2026 11:18
Adds a `metadata_fields` config option (default `all`) that opts callers
into reading only specific top-level keys from metadata.config. When a
list is given, the decoder switches to a token-by-token streaming
parser: forms whose key is not in the list are discarded with only a
bracket-depth counter held in state, so peak memory stays bounded by
the chunk size plus one token regardless of how large unwanted forms
get.
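
The skipping behaviour described above can be sketched in Python over a pre-tokenized stream. This is an illustration only, not the parser added in this PR: the token shapes, the assumption that each top-level form looks like `{Key, Value}.` with a binary key emitted as a single token, and the function name are all hypothetical.

```python
from typing import Iterable, Iterator, List, Set

OPEN = {"{", "["}
CLOSE = {"}", "]"}


def select_forms(tokens: Iterable[str], wanted: Set[str]) -> Iterator[List[str]]:
    """Yield the tokens of each wanted top-level form and discard the rest."""
    wanted_tokens = {f'<<"{name}">>' for name in wanted}
    depth = 0     # bracket depth inside the current form
    position = 0  # index of the current token within the form
    kept = None   # buffered tokens of a wanted form, None while skipping
    for tok in tokens:
        if tok == "." and depth == 0:
            # End of a top-level form: emit it only if it was being kept.
            if kept is not None:
                yield kept
            position, kept = 0, None
            continue
        if position == 1 and tok in wanted_tokens:
            kept = ["{", tok]  # the key matched, start buffering
        elif kept is not None:
            kept.append(tok)
        # While skipping, only these two counters are updated.
        if tok in OPEN:
            depth += 1
        elif tok in CLOSE:
            depth -= 1
        position += 1


stream = [
    "{", '<<"name">>', ",", '<<"foo">>', "}", ".",
    "{", '<<"files">>', ",", "[", '<<"a.svg">>', ",", '<<"b.svg">>', "]", "}", ".",
    "{", '<<"version">>', ",", '<<"1.0.0">>', "}", ".",
]
for form in select_forms(stream, {"name", "version"}):
    print(" ".join(form))
```

In this sketch the large `files` form is consumed token by token without being stored, so the state held while skipping is two integers regardless of how long the form is.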
ericmj merged commit 79bc0f7 into main May 7, 2026
10 checks passed
ericmj deleted the ericmj/streaming-metadata-decoder branch May 7, 2026 22:45

4 participants