Skip to content

[Feature](agg) add agg function entropy#60833

Open
wrlcke wants to merge 2 commits intoapache:masterfrom
wrlcke:functions/entropy
Open

[Feature](agg) add agg function entropy#60833
wrlcke wants to merge 2 commits intoapache:masterfrom
wrlcke:functions/entropy

Conversation

@wrlcke
Copy link
Contributor

@wrlcke wrlcke commented Feb 25, 2026

What problem does this PR solve?

add new aggregate function entropy

Issue Number: #48203

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Copy link
Contributor

Thearas commented Feb 25, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds the entropy aggregate function from DuckDB to Apache Doris. The function calculates Shannon Entropy using a frequency map and computing the empirical distribution function, with entropy measured in bits (base-2 logarithm).

Changes:

  • Added backend (C++) implementation for entropy calculation using hash maps for frequency tracking
  • Added frontend (Java) function definition and registration in Nereids planner
  • Added comprehensive regression tests and unit tests covering various data types and edge cases

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
be/src/vec/aggregate_functions/aggregate_function_entropy.h Core implementation of entropy calculation logic with support for numeric, string, and generic data types
be/src/vec/aggregate_functions/aggregate_function_entropy.cpp Factory registration for the entropy aggregate function with type dispatching
be/src/vec/aggregate_functions/aggregate_function_simple_factory.cpp Registered entropy function in the aggregate function factory
be/test/vec/aggregate_functions/agg_entropy_test.cpp Unit tests covering numeric, string, generic, nullable, and empty input cases
be/test/vec/aggregate_functions/agg_function_test.h Fixed empty block handling in deserialization tests
be/test/testutil/column_helper.h Enhanced helper to support creating blocks with different column types
fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/agg/Entropy.java Frontend function definition extending NullableAggregateFunction
fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/AggregateFunctionVisitor.java Added visitor method for entropy function
fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinAggregateFunctions.java Registered entropy in builtin aggregate functions catalog
regression-test/suites/query_p0/aggregate/aggregate_function_entropy.groovy Comprehensive regression tests covering all data types, NULL handling, window functions, and edge cases
regression-test/data/query_p0/aggregate/aggregate_function_entropy.out Expected output for regression tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +111 to +112
// entropy of empty set = 0
execute(block, ColumnHelper::create_column_with_name<DataTypeFloat64>({0.0}));
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a discrepancy in the expected behavior for empty input. The unit test expects entropy of an empty set to be 0.0 (line 112), but the regression test expects NULL (line 114 of the .out file). This inconsistency needs to be resolved. Based on the regression test and the fact that the function inherits from NullableAggregateFunction, the correct behavior should likely be NULL for empty input, which means the unit test expectation at line 112 is incorrect.

Suggested change
// entropy of empty set = 0
execute(block, ColumnHelper::create_column_with_name<DataTypeFloat64>({0.0}));
// entropy of empty set = NULL
execute(block, ColumnHelper::create_nullable_column_with_name<DataTypeFloat64>({0.0}, {1}));

Copilot uses AI. Check for mistakes.
@zclllyybb
Copy link
Contributor

/review

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary: [Feature](agg) add agg function entropy

This PR adds a new aggregate function entropy() that computes Shannon Entropy (in bits, base-2 logarithm) over value distributions. The implementation spans BE (C++ aggregate function with three data paths: numeric, string-hash, generic-hash), FE (Nereids function registration with varargs support), unit tests, and regression tests.

Overall: The implementation is well-structured and follows existing codebase conventions (modeled after uniqExact/multi_distinct_count). The serialization, null handling, empty-input handling, and compile_check usage are all correct. I found a few issues to address:


Issues Found

  1. [Medium] Missing TYPE_IPV4/TYPE_IPV6 in numeric type list (aggregate_function_entropy.cpp:37-41): IPV4 and IPV6 types fall through to the generic path (arena serialization + hash), but they are simple fixed-size numeric types with proper PrimitiveTypeTraits. Other comparable functions like topn and approx_count_distinct include them in their numeric type lists. They should be added for efficiency.

  2. [Low] Unused isSkew parameter in Entropy.java (Entropy.java:50-52): The private constructor accepts and passes isSkew to the superclass, but withIsSkew() is never overridden. The base AggregateFunction.withIsSkew() throws a RuntimeException. If the optimizer ever tries to set skew on an entropy aggregation, it will crash. Since entropy doesn't need skew optimization, consider removing the isSkew parameter (matching Avg/Variance/Sem pattern) or overriding withIsSkew().


Critical Checkpoint Conclusions

  • Goal and correctness: The goal is to add Shannon Entropy as an aggregate function. The implementation correctly computes -sum(p * log2(p)) over the empirical distribution. Tests cover numeric, string, generic (multi-column), nullable, empty, group-by, and window function scenarios. Pass.
  • Modification scope: Clean and focused — only adds the new function and minimal test infrastructure improvements. Pass.
  • Concurrency: No new concurrency concerns. The aggregate function follows the standard single-threaded-per-state model with merge for parallel execution. Pass.
  • Lifecycle management: No special lifecycle issues. Arena memory usage in the generic path is consistent with the established pattern (freed at operator close). Pass.
  • Configuration items: None added. Pass (N/A).
  • Incompatible changes: No format/symbol compatibility issues. New function only. Pass.
  • Parallel code paths: The three data paths (numeric, string, generic) are correctly dispatched. Missing IPV4/IPV6 in numeric path noted above. Minor issue.
  • Special conditional checks: The argument_types.size() == 1 branch for optimized single-column paths is clear and well-structured. Pass.
  • Test coverage: Comprehensive regression tests covering all data types, NULL handling, empty input, group-by, window functions, and multi-column usage. BE unit tests cover numeric, string, generic, nullable, and empty cases. FE unit tests are absent but this is consistent with existing aggregate function conventions. Pass.
  • Observability: N/A for a simple aggregate function. Pass.
  • Transaction/persistence: N/A. Pass.
  • FE-BE variable passing: N/A — uses existing varargs aggregate function infrastructure. Pass.
  • Performance: Hash-based counting with XXH128 for string/generic paths is the same approach used by uniqExact. The Arena memory waste in the generic path (serialized data accumulated but only hashes retained) is a known limitation shared by other aggregate functions. No new performance concerns beyond the IPV4/IPV6 optimization opportunity noted. Pass with minor note.
  • Other issues: The compile_check_begin.h/compile_check_end.h pairing is correct. No narrowing conversion risks. Serialization format follows the established write_var_uint/write_binary convention. Pass.

TYPE_DECIMAL32, TYPE_DECIMAL64, TYPE_DECIMAL128I, TYPE_DECIMAL256, TYPE_DECIMALV2,
TYPE_FLOAT, TYPE_DOUBLE, TYPE_DATE, TYPE_DATETIME, TYPE_DATEV2, TYPE_DATETIMEV2,
TYPE_TIME, TYPE_TIMEV2, TYPE_TIMESTAMPTZ>::
create<AggregateFunctionEntropy, AggregateFunctionEntropySingleNumericData>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] TYPE_IPV4 and TYPE_IPV6 are missing from this numeric type list. Currently they fall through to the generic path (AggregateFunctionEntropyGenericData), which works but is suboptimal — it serializes values into the Arena and hashes them, while the numeric path uses values directly as hash map keys.

Both types have proper PrimitiveTypeTraits defined (IPv4 is uint32-like, IPv6 is uint128-like), and other comparable aggregate functions (topn, approx_count_distinct) include them in their numeric type lists.

Suggested fix:

creator_with_type_list<
        TYPE_BOOLEAN, TYPE_TINYINT, TYPE_SMALLINT, TYPE_INT, TYPE_BIGINT, TYPE_LARGEINT,
        TYPE_DECIMAL32, TYPE_DECIMAL64, TYPE_DECIMAL128I, TYPE_DECIMAL256, TYPE_DECIMALV2,
        TYPE_FLOAT, TYPE_DOUBLE, TYPE_DATE, TYPE_DATETIME, TYPE_DATEV2, TYPE_DATETIMEV2,
        TYPE_TIME, TYPE_TIMEV2, TYPE_TIMESTAMPTZ, TYPE_IPV4, TYPE_IPV6>

}

private Entropy(boolean distinct, boolean alwaysNullable, boolean isSkew, List<Expression> expressions) {
super("entropy", distinct, alwaysNullable, isSkew, expressions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Low] The isSkew parameter is accepted here and passed to the superclass, but Entropy does not override withIsSkew(). The base class AggregateFunction.withIsSkew() throws RuntimeException("current expression has not impl the withIsSkew method"). If the optimizer ever attempts to set skew on an entropy aggregation, this would cause a runtime crash.

Since there's no semantic reason for entropy to support skew optimization, consider removing the isSkew parameter and using the simpler super constructor (matching the pattern used by Avg, Variance, Sem):

private Entropy(boolean distinct, boolean alwaysNullable, List<Expression> expressions) {
    super("entropy", distinct, alwaysNullable, expressions);
}

Or alternatively, override withIsSkew() to handle it properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants