[Enhancement] Implementing dictification for ARRAY_AGG #2

Open
farhad-celo wants to merge 1 commit into main from dictification_array_agg



@farhad-celo farhad-celo commented Feb 2, 2026

Why I'm doing:

The Low Cardinality Optimization (LCO) currently does not support ARRAY_AGG. When ARRAY_AGG operates on low-cardinality string columns that have global dictionaries, the aggregation processes full VARCHAR values instead of compact integer-encoded dictionary IDs. This means ARRAY_AGG misses the performance benefits (reduced memory, faster comparisons) of dictification.

What I'm doing:

Extends the Low Cardinality Optimization to support ARRAY_AGG with dict-encoded columns. When the value column (first argument) of ARRAY_AGG has a global dictionary, the optimization:

  • Encodes the value column and any dict-encodable ORDER BY columns as integer IDs throughout all aggregation stages.
  • Computes the intermediate struct type with dict-encoded array fields (e.g. struct<ARRAY<INT>, ARRAY<INT>> instead of struct<ARRAY<VARCHAR>, ARRAY<VARCHAR>>).
  • Handles multi-stage aggregation correctly: the merge stage uses the struct field mapping to determine independently which fields are dict-encoded.
  • Supports downstream operations on the dict-encoded result (e.g. UPPER(ARRAY_AGG(...)[1]) applies DictDecode at the projection level).
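
The flow described in the bullets above can be illustrated with a toy model. This is only a sketch, not the StarRocks implementation: the class `ArrayAggDictSketch`, the `State` holder, and all method names and sample values are hypothetical. It shows the partial and merge stages working purely on integer IDs, with decoding deferred until a string is needed. It does not model ORDER BY sorting or the struct field mapping itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of two-stage ARRAY_AGG over dictionary IDs; not StarRocks code.
class ArrayAggDictSketch {
    // Hypothetical intermediate state, standing in for struct<ARRAY<INT>, ARRAY<INT>>:
    // the first field holds value IDs, the second holds ORDER BY key IDs.
    static final class State {
        final List<Integer> valueIds;
        final List<Integer> orderIds;

        State(List<Integer> valueIds, List<Integer> orderIds) {
            this.valueIds = valueIds;
            this.orderIds = orderIds;
        }
    }

    // Partial stage: collect IDs from one chunk of rows, each row being {valueId, orderId}.
    static State partial(int[][] rows) {
        List<Integer> values = new ArrayList<>();
        List<Integer> orders = new ArrayList<>();
        for (int[] row : rows) {
            values.add(row[0]);
            orders.add(row[1]);
        }
        return new State(values, orders);
    }

    // Merge stage: concatenate two intermediate states field by field, still on IDs.
    static State merge(State a, State b) {
        List<Integer> values = new ArrayList<>(a.valueIds);
        values.addAll(b.valueIds);
        List<Integer> orders = new ArrayList<>(a.orderIds);
        orders.addAll(b.orderIds);
        return new State(values, orders);
    }

    // Late decode: map IDs back to strings only where a string is required.
    static List<String> decode(List<Integer> ids, Map<Integer, String> dict) {
        List<String> out = new ArrayList<>();
        for (int id : ids) {
            out.add(dict.get(id));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> dict = Map.of(1, "berlin", 2, "munich");
        State merged = merge(
                partial(new int[][] {{1, 2}, {2, 1}}),
                partial(new int[][] {{1, 1}}));
        System.out.println(merged.valueIds);
        // Rough analogue of UPPER(ARRAY_AGG(...)[1]): decode first, then apply the string function.
        System.out.println(decode(merged.valueIds, dict).get(0).toUpperCase(Locale.ROOT));
    }
}
```

The point of the sketch is that no stage before the final `decode` call touches a string, which is where the memory and comparison savings would come from.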

Key changes:

  • DecodeContext.java: Refactors AggregateRewriter to handle functions with different intermediate types.
  • DecodeCollector.java: Adds ARRAY_AGG to the supported aggregate functions, gates on the value column being a string type with a global dict, and builds the struct field mapping for intermediate types.
  • SessionVariable.java: Adds the array_agg_low_cardinality_optimize session variable (default: true), gated behind the existing array and struct LCO flags.
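
One plausible reading of the gating is a simple conjunction of flags. The sketch below is not the real `SessionVariable` class. The field names `arrayLco` and `structLco` are hypothetical stand-ins for the existing array and struct LCO flags, and only `array_agg_low_cardinality_optimize` and its default of `true` come from this PR.

```java
// Toy model of the flag gating; not the real SessionVariable class.
class LcoGateSketch {
    // Hypothetical stand-ins for the existing array and struct LCO flags.
    boolean arrayLco = true;
    boolean structLco = true;
    // Mirrors the new array_agg_low_cardinality_optimize variable (default: true).
    boolean arrayAggLco = true;

    // Assumption: ARRAY_AGG dictification applies only when all three flags are on.
    boolean isArrayAggLcoEnabled() {
        return arrayLco && structLco && arrayAggLco;
    }

    public static void main(String[] args) {
        LcoGateSketch vars = new LcoGateSketch();
        System.out.println(vars.isArrayAggLcoEnabled());
        vars.structLco = false; // turning off a prerequisite flag disables the new path
        System.out.println(vars.isArrayAggLcoEnabled());
    }
}
```

Under this reading, disabling either prerequisite flag turns off ARRAY_AGG dictification even when the new variable is left at its default.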

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4

@farhad-celo farhad-celo force-pushed the dictification_array_agg branch 7 times, most recently from 7c6e28f to d3f43a8 Compare February 2, 2026 23:32
@farhad-celo farhad-celo changed the title Implementing dictification for ARRAY_AGG [Enhancement] Implementing dictification for ARRAY_AGG Feb 2, 2026
@farhad-celo farhad-celo force-pushed the dictification_array_agg branch 2 times, most recently from 2c8eb77 to 503999d Compare February 3, 2026 03:17
Signed-off-by: Farhad Shahmohammadi <f.shahmohammadi@celonis.com>
@farhad-celo farhad-celo force-pushed the dictification_array_agg branch from 503999d to 1fa8ac3 Compare February 3, 2026 03:39
