[Feature](agg) add agg function entropy#60833
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
5182bf0 to
33d8b77
Compare
There was a problem hiding this comment.
Pull request overview
This pull request adds the entropy aggregate function from DuckDB to Apache Doris. The function calculates Shannon Entropy using a frequency map and computing the empirical distribution function, with entropy measured in bits (base-2 logarithm).
Changes:
- Added backend (C++) implementation for entropy calculation using hash maps for frequency tracking
- Added frontend (Java) function definition and registration in Nereids planner
- Added comprehensive regression tests and unit tests covering various data types and edge cases
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| be/src/vec/aggregate_functions/aggregate_function_entropy.h | Core implementation of entropy calculation logic with support for numeric, string, and generic data types |
| be/src/vec/aggregate_functions/aggregate_function_entropy.cpp | Factory registration for the entropy aggregate function with type dispatching |
| be/src/vec/aggregate_functions/aggregate_function_simple_factory.cpp | Registered entropy function in the aggregate function factory |
| be/test/vec/aggregate_functions/agg_entropy_test.cpp | Unit tests covering numeric, string, generic, nullable, and empty input cases |
| be/test/vec/aggregate_functions/agg_function_test.h | Fixed empty block handling in deserialization tests |
| be/test/testutil/column_helper.h | Enhanced helper to support creating blocks with different column types |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/agg/Entropy.java | Frontend function definition extending NullableAggregateFunction |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/AggregateFunctionVisitor.java | Added visitor method for entropy function |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinAggregateFunctions.java | Registered entropy in builtin aggregate functions catalog |
| regression-test/suites/query_p0/aggregate/aggregate_function_entropy.groovy | Comprehensive regression tests covering all data types, NULL handling, window functions, and edge cases |
| regression-test/data/query_p0/aggregate/aggregate_function_entropy.out | Expected output for regression tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // entropy of empty set = 0 | ||
| execute(block, ColumnHelper::create_column_with_name<DataTypeFloat64>({0.0})); |
There was a problem hiding this comment.
There is a discrepancy in the expected behavior for empty input. The unit test expects entropy of an empty set to be 0.0 (line 112), but the regression test expects NULL (line 114 of the .out file). This inconsistency needs to be resolved. Based on the regression test and the fact that the function inherits from NullableAggregateFunction, the correct behavior should likely be NULL for empty input, which means the unit test expectation at line 112 is incorrect.
| // entropy of empty set = 0 | |
| execute(block, ColumnHelper::create_column_with_name<DataTypeFloat64>({0.0})); | |
| // entropy of empty set = NULL | |
| execute(block, ColumnHelper::create_nullable_column_with_name<DataTypeFloat64>({0.0}, {1})); |
|
/review |
There was a problem hiding this comment.
Code Review Summary: [Feature](agg) add agg function entropy
This PR adds a new aggregate function entropy() that computes Shannon Entropy (in bits, base-2 logarithm) over value distributions. The implementation spans BE (C++ aggregate function with three data paths: numeric, string-hash, generic-hash), FE (Nereids function registration with varargs support), unit tests, and regression tests.
Overall: The implementation is well-structured and follows existing codebase conventions (modeled after uniqExact/multi_distinct_count). The serialization, null handling, empty-input handling, and compile_check usage are all correct. I found a few issues to address:
Issues Found
-
[Medium] Missing
TYPE_IPV4/TYPE_IPV6in numeric type list (aggregate_function_entropy.cpp:37-41): IPV4 and IPV6 types fall through to the generic path (arena serialization + hash), but they are simple fixed-size numeric types with properPrimitiveTypeTraits. Other comparable functions liketopnandapprox_count_distinctinclude them in their numeric type lists. They should be added for efficiency. -
[Low] Unused
isSkewparameter inEntropy.java(Entropy.java:50-52): The private constructor accepts and passesisSkewto the superclass, butwithIsSkew()is never overridden. The baseAggregateFunction.withIsSkew()throws aRuntimeException. If the optimizer ever tries to set skew on an entropy aggregation, it will crash. Since entropy doesn't need skew optimization, consider removing theisSkewparameter (matchingAvg/Variance/Sempattern) or overridingwithIsSkew().
Critical Checkpoint Conclusions
- Goal and correctness: The goal is to add Shannon Entropy as an aggregate function. The implementation correctly computes
-sum(p * log2(p))over the empirical distribution. Tests cover numeric, string, generic (multi-column), nullable, empty, group-by, and window function scenarios. Pass. - Modification scope: Clean and focused — only adds the new function and minimal test infrastructure improvements. Pass.
- Concurrency: No new concurrency concerns. The aggregate function follows the standard single-threaded-per-state model with merge for parallel execution. Pass.
- Lifecycle management: No special lifecycle issues. Arena memory usage in the generic path is consistent with the established pattern (freed at operator close). Pass.
- Configuration items: None added. Pass (N/A).
- Incompatible changes: No format/symbol compatibility issues. New function only. Pass.
- Parallel code paths: The three data paths (numeric, string, generic) are correctly dispatched. Missing IPV4/IPV6 in numeric path noted above. Minor issue.
- Special conditional checks: The
argument_types.size() == 1branch for optimized single-column paths is clear and well-structured. Pass. - Test coverage: Comprehensive regression tests covering all data types, NULL handling, empty input, group-by, window functions, and multi-column usage. BE unit tests cover numeric, string, generic, nullable, and empty cases. FE unit tests are absent but this is consistent with existing aggregate function conventions. Pass.
- Observability: N/A for a simple aggregate function. Pass.
- Transaction/persistence: N/A. Pass.
- FE-BE variable passing: N/A — uses existing varargs aggregate function infrastructure. Pass.
- Performance: Hash-based counting with XXH128 for string/generic paths is the same approach used by
uniqExact. The Arena memory waste in the generic path (serialized data accumulated but only hashes retained) is a known limitation shared by other aggregate functions. No new performance concerns beyond the IPV4/IPV6 optimization opportunity noted. Pass with minor note. - Other issues: The
compile_check_begin.h/compile_check_end.hpairing is correct. No narrowing conversion risks. Serialization format follows the establishedwrite_var_uint/write_binaryconvention. Pass.
| TYPE_DECIMAL32, TYPE_DECIMAL64, TYPE_DECIMAL128I, TYPE_DECIMAL256, TYPE_DECIMALV2, | ||
| TYPE_FLOAT, TYPE_DOUBLE, TYPE_DATE, TYPE_DATETIME, TYPE_DATEV2, TYPE_DATETIMEV2, | ||
| TYPE_TIME, TYPE_TIMEV2, TYPE_TIMESTAMPTZ>:: | ||
| create<AggregateFunctionEntropy, AggregateFunctionEntropySingleNumericData>( |
There was a problem hiding this comment.
[Medium] TYPE_IPV4 and TYPE_IPV6 are missing from this numeric type list. Currently they fall through to the generic path (AggregateFunctionEntropyGenericData), which works but is suboptimal — it serializes values into the Arena and hashes them, while the numeric path uses values directly as hash map keys.
Both types have proper PrimitiveTypeTraits defined (IPv4 is uint32-like, IPv6 is uint128-like), and other comparable aggregate functions (topn, approx_count_distinct) include them in their numeric type lists.
Suggested fix:
creator_with_type_list<
TYPE_BOOLEAN, TYPE_TINYINT, TYPE_SMALLINT, TYPE_INT, TYPE_BIGINT, TYPE_LARGEINT,
TYPE_DECIMAL32, TYPE_DECIMAL64, TYPE_DECIMAL128I, TYPE_DECIMAL256, TYPE_DECIMALV2,
TYPE_FLOAT, TYPE_DOUBLE, TYPE_DATE, TYPE_DATETIME, TYPE_DATEV2, TYPE_DATETIMEV2,
TYPE_TIME, TYPE_TIMEV2, TYPE_TIMESTAMPTZ, TYPE_IPV4, TYPE_IPV6>| } | ||
|
|
||
| private Entropy(boolean distinct, boolean alwaysNullable, boolean isSkew, List<Expression> expressions) { | ||
| super("entropy", distinct, alwaysNullable, isSkew, expressions); |
There was a problem hiding this comment.
[Low] The isSkew parameter is accepted here and passed to the superclass, but Entropy does not override withIsSkew(). The base class AggregateFunction.withIsSkew() throws RuntimeException("current expression has not impl the withIsSkew method"). If the optimizer ever attempts to set skew on an entropy aggregation, this would cause a runtime crash.
Since there's no semantic reason for entropy to support skew optimization, consider removing the isSkew parameter and using the simpler super constructor (matching the pattern used by Avg, Variance, Sem):
private Entropy(boolean distinct, boolean alwaysNullable, List<Expression> expressions) {
super("entropy", distinct, alwaysNullable, expressions);
}Or alternatively, override withIsSkew() to handle it properly.
What problem does this PR solve?
add new aggregate function entropy
Issue Number: #48203
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)