Ele 4942 dimension anomalies visualization#847
Conversation
Walkthrough

Replaces the generic anomaly query with a dimension-specific macro using a computed

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant IT as integration_test
    participant Caller as test_macro_invocation
    participant DimMacro as get_anomaly_query_for_dimension_anomalies
    participant ReadScores as get_read_anomaly_scores_query
    participant DB as Warehouse
    IT->>Caller: invoke test that flattens model
    Caller->>DimMacro: pass flattened_test
    activate DimMacro
    DimMacro->>ReadScores: build/read scores query (flattened_test)
    ReadScores->>DB: execute read_scores_query
    DB-->>ReadScores: metric rows
    ReadScores-->>DimMacro: rows annotated with is_anomalous
    DimMacro->>DimMacro: select distinct anomalous dimension_value(s)
    DimMacro-->>Caller: return anomalous_dimension_rows_sql / result set
    Caller-->>IT: store/analyze anomaly results
    deactivate DimMacro
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested reviewers
👋 @NoyaArie
Actionable comments posted: 2
🧹 Nitpick comments (3)
macros/edr/tests/test_utils/get_anomaly_query.sql (3)
12-15: Variable name is misleading (`non_anomalies_query` actually returns anomalies only).

`non_anomalies_query` appends `where is_anomalous = true`, so it's an anomalies-only query. Consider renaming for clarity.
Apply this diff for readability:
```diff
- {% set non_anomalies_query -%}
+ {% set anomalies_only_query -%}
    {{ anomaly_scores_query }}
    where is_anomalous = true
  {%- endset -%}
@@
- {% if anomalies_count > 0 %}
-   {{- return(anomaly_scores_query) -}}
- {% else %}
-   {{- return(non_anomalies_query) -}}
- {% endif %}
+ {% if anomalies_count > 0 %}
+   {{- return(anomaly_scores_query) -}}
+ {% else %}
+   {{- return(anomalies_only_query) -}}
+ {% endif %}
```
17-24: Cast the count to int to avoid type/coercion surprises in Jinja.

`agate_to_dicts` may return Decimal/str depending on adapter. An explicit cast keeps the comparison robust.
Apply this diff:
```diff
- {% set anomalies_count_result = elementary.agate_to_dicts(elementary.run_query(anomalies_count_query)) %}
- {% set anomalies_count = anomalies_count_result[0]['anomalies_count'] %}
+ {% set anomalies_count_result = elementary.agate_to_dicts(elementary.run_query(anomalies_count_query)) %}
+ {% set anomalies_count = (anomalies_count_result[0]['anomalies_count'] | int) %}
```
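To illustrate the coercion issue the cast avoids, here is a minimal Python sketch (not the macro itself; the sample values are made up). Adapters can hand the count back as `Decimal`, `int`, or `str`, and comparing a string with a number is unreliable; normalizing to `int` first makes the comparison safe:

```python
from decimal import Decimal

# agate_to_dicts may yield the count in different types depending on
# the adapter. In plain Python, "3" > 0 raises TypeError outright;
# casting to int first keeps the comparison numeric in every case.
for raw in (Decimal("3"), 3, "3"):
    anomalies_count = int(raw)  # normalize before comparing
    assert anomalies_count > 0
```

The same reasoning applies to Jinja's `| int` filter: it turns an adapter-dependent value into a plain integer before the `>` check.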
91-101: No-op formatting changes look fine; keep an eye on duplicate columns from `*, value/average/min/max`.

Selecting aliased columns and then `*` intentionally duplicates columns. If any downstream expects unique column names, consider ordering `*` first or omitting superseded originals. No change required if consumers are tolerant.

I can refactor final_results to explicitly project the final column set if you prefer stricter schemas.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- macros/edr/tests/test_dimension_anomalies.sql (1 hunks)
- macros/edr/tests/test_utils/get_anomaly_query.sql (6 hunks)
🔇 Additional comments (3)
macros/edr/tests/test_dimension_anomalies.sql (1)
74-74: No legacy get_anomaly_query calls remain

I've confirmed every `get_anomaly_query` invocation now uses `flattened_test`; none pass `flatten_model` any longer.
• macros/edr/tests/test_dimension_anomalies.sql
• macros/edr/tests/test_all_columns_anomalies.sql
• macros/edr/tests/test_table_anomalies.sql
• macros/edr/tests/test_column_anomalies.sql

Approving these changes.
macros/edr/tests/test_utils/get_anomaly_query.sql (2)
2-5: Wrapper select for downstream filtering looks fine.

Building anomaly_scores_query as a selectable wrapper enables clean outer WHEREs without altering the inner query. No issues here.
20-24: Behavioral check: returning full series when anomalies exist is intentional?

If anomalies_count > 0 you return the unfiltered series (anomalous + non-anomalous rows). That seems aligned with "dimension anomalies visualization" (to render context around anomalies). Please confirm downstream consumers (e.g., store_anomaly_test_results, UI) expect and efficiently handle non-anomalous rows; otherwise, result volume may spike.
If needed, I can help gate the full-series return behind a configurable flag (e.g., return_full_series_on_anomaly=true) to keep backward compatibility.
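If such a flag were added, the gating logic could look roughly like this Python sketch; `return_full_series_on_anomaly` and the query names are hypothetical illustrations, not part of the macro:

```python
def pick_query(anomalies_count, full_series_query, anomalies_only_query,
               return_full_series_on_anomaly=True):
    """Hypothetical gate mirroring the suggested backward-compat flag:
    return the full series only when anomalies exist AND the flag is on."""
    if anomalies_count > 0 and return_full_series_on_anomaly:
        return full_series_query
    return anomalies_only_query

assert pick_query(3, "full", "only") == "full"
assert pick_query(3, "full", "only", return_full_series_on_anomaly=False) == "only"
assert pick_query(0, "full", "only") == "only"
```

With the flag defaulting to false, existing consumers would keep receiving anomalies-only results until they opt in.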
Actionable comments posted: 0
🧹 Nitpick comments (3)
integration_tests/tests/test_dimension_anomalies.py (3)
104-109: Prefer direct set equality and a set comprehension

This is cleaner and asserts the exact expected set in one check.
```diff
- # Only dimension values that are anomalous are stored in the test points
- dimension_values = set([x["dimension_value"] for x in anomaly_test_points])
- assert len(dimension_values) == 2
- assert "Superman" in dimension_values
- assert "Spiderman" in dimension_values
+ # Only dimension values that are anomalous are stored in the test points
+ dimension_values = {x["dimension_value"] for x in anomaly_test_points}
+ assert dimension_values == {"Superman", "Spiderman"}
```
113-120: Avoid magic number 13; derive expected points per dimension

Hard-coding 13 is brittle. Derive the expected count from the same date generator to make the test resilient to window changes.
```diff
- assert len(superman_anomaly_test_points) == 13
+ expected_points_per_group = len(list(generate_dates(base_date=utc_today - timedelta(1))))
+ assert len(superman_anomaly_test_points) == expected_points_per_group
  assert any(x["is_anomalous"] for x in superman_anomaly_test_points)
@@
- assert len(spiderman_anomaly_test_points) == 13
+ assert len(spiderman_anomaly_test_points) == expected_points_per_group
  assert any(x["is_anomalous"] for x in spiderman_anomaly_test_points)
```

Verification tip: If the anomaly points are intended to exclude the test date, replace the derivation with either:
- expected_points_per_group = len(list(generate_dates(base_date=utc_today - timedelta(1)))) - 1
- or expected_points_per_group = len(training_dates)
Please confirm which window the query returns in your adapters.
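A runnable sketch of the suggested derivation, with a hypothetical `generate_dates` standing in for the repo's test helper (the implementation and the `days_back=13` default are assumptions for illustration, not the actual helper):

```python
from datetime import date, timedelta

def generate_dates(base_date, days_back=13):
    """Hypothetical stand-in for the test helper: one daily bucket per day,
    walking backwards from base_date."""
    return [base_date - timedelta(days=i) for i in range(days_back)]

utc_today = date.today()
expected_points_per_group = len(generate_dates(base_date=utc_today - timedelta(1)))

# The assertion now tracks the date window instead of a magic number,
# so shrinking or widening the window updates the expectation automatically.
assert expected_points_per_group == 13
```

If the window config changes (say, `days_back=20`), the expected count moves with it and the test stays green without edits.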
101-103: Update the stale comment to match the new behavior

The comment says "we should only get 1 row with the problematic value," but the assertions below expect full time-series points per dimension when anomalies exist. Please update the comment to avoid confusion.
📜 Review details
📒 Files selected for processing (1)
- integration_tests/tests/test_dimension_anomalies.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: test (latest_official, snowflake) / test
- GitHub Check: test (latest_official, bigquery) / test
- GitHub Check: test (latest_official, redshift) / test
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_official, postgres) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (latest_official, dremio) / test
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (latest_official, trino) / test
- GitHub Check: test (1.8.0, postgres) / test
Actionable comments posted: 0
♻️ Duplicate comments (1)
macros/edr/tests/test_utils/get_anomaly_query.sql (1)
9-21: Fix cross-dimension leakage and avoid double-scanning the base anomaly query.

- Current IN subquery keys only on dimension_value. Identical values across different dimensions will bleed into results.
- `elementary.get_read_anomaly_scores_query` is expanded twice; this is expensive and can hinder optimizers.
- Add null guards for dimension/dimension_value and match on the pair using EXISTS, wrapping the heavy query once in a CTE.

Apply this diff:

```diff
-{%- macro get_anomaly_query_for_dimension_anomalies(flattened_test=none) -%}
-  {%- set dimension_values_query -%}
-    select distinct dimension_value from ({{ elementary.get_read_anomaly_scores_query(flattened_test) }}) results
-    where is_anomalous = true
-  {%- endset -%}
-
-  {% set dimension_anomalies_query -%}
-    select * from ({{ elementary.get_read_anomaly_scores_query(flattened_test) }}) results
-    where dimension_value in ({{ dimension_values_query }})
-  {%- endset -%}
-
-  {{- return(dimension_anomalies_query) -}}
-{%- endmacro -%}
+{%- macro get_anomaly_query_for_dimension_anomalies(flattened_test=none) -%}
+  {%- set dimension_anomalies_query -%}
+    with base as (
+      {{ elementary.get_read_anomaly_scores_query(flattened_test) }}
+    ),
+    anomalous_pairs as (
+      select distinct
+        dimension,
+        dimension_value
+      from base as results
+      where results.dimension is not null
+        and results.dimension_value is not null
+        and results.is_anomalous = true
+    )
+    select *
+    from base as results
+    where exists (
+      select 1
+      from anomalous_pairs p
+      where p.dimension = results.dimension
+        and p.dimension_value = results.dimension_value
+    )
+  {%- endset -%}
+
+  {{- return(dimension_anomalies_query) -}}
+{%- endmacro -%}
```

Note: qualifying columns through the alias (results.is_anomalous, etc.) improves portability on stricter adapters.
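The leakage the comment describes is easy to reproduce with plain Python data; this is a sketch of the semantics, not the macro, with made-up rows. Filtering on `dimension_value` alone pulls in a row from a different dimension that happens to share the same value, while matching on the (dimension, dimension_value) pair does not:

```python
rows = [
    {"dimension": "country", "dimension_value": "US", "is_anomalous": True},
    {"dimension": "region",  "dimension_value": "US", "is_anomalous": False},  # same value, other dimension
    {"dimension": "country", "dimension_value": "FR", "is_anomalous": False},
]

# Value-only filter (mirrors the IN subquery): region/US leaks in
# because it shares the value "US" with the anomalous country/US row.
anomalous_values = {r["dimension_value"] for r in rows if r["is_anomalous"]}
leaky = [r for r in rows if r["dimension_value"] in anomalous_values]
assert len(leaky) == 2

# Pair filter (mirrors the EXISTS on (dimension, dimension_value)):
# only rows whose full pair is anomalous survive.
anomalous_pairs = {(r["dimension"], r["dimension_value"])
                   for r in rows if r["is_anomalous"]}
tight = [r for r in rows
         if (r["dimension"], r["dimension_value"]) in anomalous_pairs]
assert len(tight) == 1
```

The CTE version in the diff additionally scans the expensive base query once instead of twice, which the Python analogy does not need to model.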
🧹 Nitpick comments (2)
macros/edr/tests/test_utils/get_anomaly_query.sql (2)
82-97: Partition LAG windows by execution to prevent cross-run contamination.

The LAG() windows for min_value and max_value currently partition by table/column/metric/dimension/bucket_seasonality. If the anomaly_scores table holds multiple runs, the "previous" value can jump across test executions. Include test_execution_id (or test_unique_id) in the partition key to ensure expectations are computed within a single run.

Apply this diff:

```diff
-{{ elementary.lag('metric_value') }} over (partition by full_table_name, column_name, metric_name, dimension, dimension_value, bucket_seasonality order by bucket_end)
+{{ elementary.lag('metric_value') }} over (partition by test_execution_id, full_table_name, column_name, metric_name, dimension, dimension_value, bucket_seasonality order by bucket_end)

-{{ elementary.lag('min_metric_value') }} over (partition by full_table_name, column_name, metric_name, dimension, dimension_value, bucket_seasonality order by bucket_end)
+{{ elementary.lag('min_metric_value') }} over (partition by test_execution_id, full_table_name, column_name, metric_name, dimension, dimension_value, bucket_seasonality order by bucket_end)

-{{ elementary.lag('metric_value') }} over (partition by full_table_name, column_name, metric_name, dimension, dimension_value, bucket_seasonality order by bucket_end)
+{{ elementary.lag('metric_value') }} over (partition by test_execution_id, full_table_name, column_name, metric_name, dimension, dimension_value, bucket_seasonality order by bucket_end)

-{{ elementary.lag('max_metric_value') }} over (partition by full_table_name, column_name, metric_name, dimension, dimension_value, bucket_seasonality order by bucket_end)
+{{ elementary.lag('max_metric_value') }} over (partition by test_execution_id, full_table_name, column_name, metric_name, dimension, dimension_value, bucket_seasonality order by bucket_end)
```

If the intent is to stitch across executions, prefer test_unique_id so you don't mix different tests that coincidentally share identifiers.
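The contamination can be reproduced outside SQL. This Python sketch simulates LAG semantics over made-up rows from two test runs; partitioning without the execution id lets run_2's first bucket pick up run_1's value as its "previous":

```python
from itertools import groupby

rows = [
    # (execution_id, bucket, metric_value): two runs over the same buckets
    ("run_1", 1, 10), ("run_1", 2, 12),
    ("run_2", 1, 50), ("run_2", 2, 55),
]

def lag_over(rows, part_key):
    """Simulate LAG(metric_value) OVER (PARTITION BY part_key ORDER BY bucket).
    Rows are ordered by (bucket, run) inside each partition for determinism."""
    out = {}
    for _, group in groupby(sorted(rows, key=part_key), key=part_key):
        prev = None
        for r in sorted(group, key=lambda r: (r[1], r[0])):
            out[r] = prev
            prev = r[2]
    return out

# Without execution_id in the partition key, everything lands in one
# window: run_2's bucket 1 sees run_1's bucket-1 value (10) as "previous".
no_exec = lag_over(rows, part_key=lambda r: "")
assert no_exec[("run_2", 1, 50)] == 10  # cross-run contamination

# Partitioning by execution_id keeps each run's window self-contained.
by_exec = lag_over(rows, part_key=lambda r: r[0])
assert by_exec[("run_2", 1, 50)] is None  # run_2 starts fresh
```

The exact "previous" value a real warehouse picks depends on its tie-breaking within `order by bucket_end`, which is precisely why mixing runs in one partition is unsafe.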
159-168: Zero check semantics — optional robustness note.

Current `(metric_value = 0 and …)` is fine for integer/decimal metrics. If some adapters round floats unpredictably, consider an optional tolerance-based helper in the future (e.g., abs(metric_value) < epsilon). Not a blocker.
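The tolerance-based variant the note floats would look roughly like this in Python; the helper name and the epsilon value are assumptions for illustration:

```python
def is_zero(metric_value, epsilon=1e-9):
    """Tolerance-based zero check for float metrics carrying rounding noise."""
    return abs(metric_value) < epsilon

assert is_zero(0.0)
assert is_zero(1e-12)              # rounding noise still counts as zero
assert not is_zero(0.5)
assert (0.1 + 0.2) - 0.3 != 0      # exact equality is fragile for floats...
assert is_zero((0.1 + 0.2) - 0.3)  # ...while the tolerance check is robust
```

The same idea maps to SQL as `abs(metric_value) < {{ epsilon }}` in place of `metric_value = 0`, kept behind a config so integer metrics retain exact comparison.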
📜 Review details
📒 Files selected for processing (2)
- macros/edr/materializations/test/test.sql (4 hunks)
- macros/edr/tests/test_utils/get_anomaly_query.sql (5 hunks)
✅ Files skipped from review due to trivial changes (1)
- macros/edr/materializations/test/test.sql
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-10T11:29:44.525Z
Learnt from: GuyEshdat
PR: elementary-data/dbt-data-reliability#838
File: macros/utils/sql_utils/escape_reserved_keywords.sql:29-31
Timestamp: 2025-08-10T11:29:44.525Z
Learning: In the Elementary dbt-data-reliability codebase, the `escape_keywords` macro in `macros/utils/sql_utils/escape_reserved_keywords.sql` is specifically designed to escape SQL reserved keywords for different database adapters. Reserved keywords are predefined language tokens (like 'filter', 'sql', 'timestamp', 'value', 'one', 'min', 'max', 'sum', 'count' for Dremio) that don't contain special characters like double quotes. Therefore, the `dremio__escape_keywords` macro correctly wraps these keywords in double quotes without needing to handle embedded quotes, as reserved keywords by definition are simple identifiers.
Applied to files:
macros/edr/tests/test_utils/get_anomaly_query.sql
📚 Learning: 2025-08-10T11:29:06.982Z
Learnt from: GuyEshdat
PR: elementary-data/dbt-data-reliability#838
File: macros/utils/table_operations/delete_and_insert.sql:138-146
Timestamp: 2025-08-10T11:29:06.982Z
Learning: In the Elementary dbt-data-reliability codebase, the team prefers to add the `escape_reserved_keywords` macro only in places where they've actually encountered errors with reserved keywords, rather than preemptively adding it everywhere. They rely on E2E tests to catch future issues if reserved keywords cause problems in untested code paths. This pragmatic approach avoids unnecessary defensive programming and keeps the codebase simpler.
Applied to files:
macros/edr/tests/test_utils/get_anomaly_query.sql
📚 Learning: 2025-08-10T11:28:43.632Z
Learnt from: GuyEshdat
PR: elementary-data/dbt-data-reliability#838
File: macros/utils/table_operations/insert_rows.sql:156-0
Timestamp: 2025-08-10T11:28:43.632Z
Learning: In the Elementary dbt-data-reliability codebase, the `dremio__escape_special_chars` macro in `macros/utils/table_operations/insert_rows.sql` receives strings where single quotes are already escaped as `\'` from upstream processing. The macro correctly converts these to Dremio's SQL escaping format by replacing `\'` with `''`. This is different from other database implementations that handle bare single quotes directly.
Applied to files:
macros/edr/tests/test_utils/get_anomaly_query.sql
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: test (latest_official, postgres) / test
- GitHub Check: test (latest_official, redshift) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (latest_official, trino) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (latest_official, snowflake) / test
- GitHub Check: test (latest_official, dremio) / test
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_official, bigquery) / test
- GitHub Check: test (1.8.0, postgres) / test
🔇 Additional comments (3)
macros/edr/tests/test_utils/get_anomaly_query.sql (3)
77-81: Good call using the reserved-keyword escape for value.

Alias uses {{ elementary.escape_reserved_keywords('value') }}, which aligns with the team's practice of applying escaping only where we've hit issues, and covers adapters where VALUE is special.

138-156: Whitespace-only change — no functional impact.

No action needed.

170-184: Anomaly score condition composition looks correct.

The fail_on_zero short-circuit plus directional/percent conditions is coherent, and the backfill gating is applied outside via is_anomalous. LGTM.
```diff
 {% set flattened_test = elementary.flatten_test(context["model"]) %}
-{% set anomalous_rows_sql = elementary.get_anomaly_query(flatten_model) %}
+{% set anomalous_rows_sql = elementary.get_anomaly_query_for_dimension_anomalies(flattened_test) %}
```
Separate into two different parameters: one for storing results and one for the test SQL itself.
Actionable comments posted: 0
♻️ Duplicate comments (1)
macros/edr/tests/test_dimension_anomalies.sql (1)
78-78: Inline SQL output is inconsistent with stored query

You're storing results from the dimension-specific query but printing the generic `get_anomaly_query`, which can confuse debugging and test artifacts. Print the same dimension-specific SQL you store.

```diff
-{{ elementary.get_anomaly_query(flattened_test) }}
+{{ anomalous_dimension_rows_sql }}
```

Alternative (if you prefer not to rely on the variable):

```diff
-{{ elementary.get_anomaly_query(flattened_test) }}
+{{ elementary.get_anomaly_query_for_dimension_anomalies(flattened_test=flattened_test) }}
```
🧹 Nitpick comments (1)
macros/edr/tests/test_dimension_anomalies.sql (1)
74-74: Pass `flattened_test` as a named argument for macro stability

Future-proof the call against signature changes and improve readability.

```diff
-{% set anomalous_dimension_rows_sql = elementary.get_anomaly_query_for_dimension_anomalies(flattened_test) %}
+{% set anomalous_dimension_rows_sql = elementary.get_anomaly_query_for_dimension_anomalies(flattened_test=flattened_test) %}
```
📜 Review details
📒 Files selected for processing (1)
- macros/edr/tests/test_dimension_anomalies.sql (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: test (latest_official, trino) / test
- GitHub Check: test (latest_official, postgres) / test
- GitHub Check: test (1.8.0, postgres) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_official, snowflake) / test
- GitHub Check: test (latest_official, dremio) / test
- GitHub Check: test (latest_official, bigquery) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_official, redshift) / test
🔇 Additional comments (2)
macros/edr/tests/test_dimension_anomalies.sql (2)
76-76: Good: storing results uses the dimension-specific query

Using `anomalous_dimension_rows_sql` with `store_anomaly_test_results` aligns the storage path with the new dimension anomalies flow.
43-43: No-op formatting change

Whitespace-only; no action needed.
Summary by CodeRabbit

- New Features
- Bug Fixes
- Refactor
- Tests